Google BigQuery: The Definitive Guide. Data Warehousing, Analytics, and Machine Learning at Scale - Helion

ebook

Autor: Valliappa Lakshmanan, Jordan Tigani
ISBN: 978-14-920-4441-3
stron: 498, Format: ebook
Data wydania: 2019-10-23
Księgarnia: Helion

Cena książki: 186,15 zł (poprzednio: 216,45 zł)
Oszczędzasz: 14% (-30,30 zł)

Osoby, które kupiły tę książkę, wybierały także »

Work with petabyte-scale datasets while building a collaborative, agile workplace in the process. This practical book is the canonical reference to Google BigQuery, the query engine that lets you conduct interactive analysis of large datasets. BigQuery enables enterprises to efficiently store, query, ingest, and learn from their data in a convenient framework. With this book, you’ll examine how to analyze data at scale to derive insights from large datasets efficiently.

Valliappa Lakshmanan, tech lead for Google Cloud Platform, and Jordan Tigani, engineering director for the BigQuery team, provide best practices for modern data warehousing within an autoscaled, serverless public cloud. Whether you want to explore parts of BigQuery you’re not familiar with or prefer to focus on specific tasks, this reference is indispensable.

Osoby które kupowały "Google BigQuery: The Definitive Guide. Data Warehousing, Analytics, and Machine Learning at Scale", wybierały także:

Jak zhakowa 125,00 zł, (10,00 zł -92%)
Biologika Sukcesji Pokoleniowej. Sezon 3. Konflikty na terytorium 126,36 zł, (13,90 zł -89%)
Windows Media Center. Domowe centrum rozrywki 66,67 zł, (8,00 zł -88%)
Podręcznik startupu. Budowa wielkiej firmy krok po kroku 92,67 zł, (13,90 zł -85%)
Ruby on Rails. Ćwiczenia 18,75 zł, (3,00 zł -84%)

Spis treści

Google BigQuery: The Definitive Guide. Data Warehousing, Analytics, and Machine Learning at Scale eBook -- spis treści

Preface
- Who Is This Book For?
- Conventions Used in This Book
- Using Code Examples
- OReilly Online Learning
- How to Contact Us
- Acknowledgments
1. What Is Google BigQuery?
- Data Processing Architectures
  - Relational Database Management System
  - MapReduce Framework
  - BigQuery: A Serverless, Distributed SQL Engine
- Working with BigQuery
  - Deriving Insights Across Datasets
  - ETL, EL, and ELT
  - Powerful Analytics
  - Simplicity of Management
- How BigQuery Came About
- What Makes BigQuery Possible?
  - Separation of Compute and Storage
  - Storage and Networking Infrastructure
  - Managed Storage
  - Integration with Google Cloud Platform
  - Security and Compliance
- Summary
2. Query Essentials
- Simple Queries
  - Retrieving Rows by Using SELECT
  - Aliasing Column Names with AS
  - Filtering with WHERE
  - SELECT *, EXCEPT, REPLACE
  - Subqueries with WITH
  - Sorting with ORDER BY
- Aggregates
  - Computing Aggregates by Using GROUP BY
  - Counting Records by Using COUNT
  - Filtering Grouped Items by Using HAVING
  - Finding Unique Values by Using DISTINCT
- A Brief Primer on Arrays and Structs
  - Creating Arrays by Using ARRAY_AGG
  - Array of STRUCT
  - TUPLE
  - Working with Arrays
  - UNNEST an Array
- Joining Tables
  - The JOIN Explained
  - INNER JOIN
  - CROSS JOIN
  - OUTER JOIN
- Saving and Sharing
  - Query History and Caching
  - Saved Queries
  - Views Versus Shared Queries
- Summary
3. Data Types, Functions, and Operators
- Numeric Types and Functions
  - Mathematical Functions
  - Standard-Compliant Floating-Point Division
  - SAFE Functions
  - Comparisons
  - Precise Decimal Calculations with NUMERIC
- Working with BOOL
  - Logical Operations
  - Conditional Expressions
  - Cleaner NULL-Handling with COALESCE
  - Casting and Coercion
  - Using COUNTIF to Avoid Casting Booleans
- String Functions
  - Internationalization
  - Printing and Parsing
  - String Manipulation Functions
  - Transformation Functions
  - Regular Expressions
  - Summary of String Functions
- Working with TIMESTAMP
  - Parsing and Formatting Timestamps
  - Extracting Calendar Parts
  - Arithmetic with Timestamps
  - Date, Time, and DateTime
- Working with GIS Functions
- Summary
4. Loading Data into BigQuery
- The Basics
  - Loading from a Local Source
  - Specifying a Schema
  - Copying into a New Table
  - Data Management (DDL and DML)
  - Loading Data Efficiently
    - Impact of compression and staging via Google Cloud Storage
    - Price and quota
- Federated Queries and External Data Sources
  - How to Use Federated Queries
    - Wildcards
    - Temporary table
    - Loading and querying Parquet and ORC
    - Loading and querying Hive partitions
  - When to Use Federated Queries and External Data Sources
    - Exploratory work using federated queries
    - ELT in SQL for experimentation
    - External query in Cloud SQL
  - Interactive Exploration and Querying of Data in Google Sheets
    - Loading Google Sheets data into BigQuery
    - Populating a Google Sheets spreadsheet with data from BigQuery
    - Exploring BigQuery tables using Sheets
    - Exploring BigQuery tables as a data sheet in Google Sheets
    - Joining Sheets data with a large dataset in BigQuery
  - SQL Queries on Data in Cloud Bigtable
    - NoSQL Queries based on a row-key prefix
    - Ad hoc SQL queries on Cloud Bigtable data
    - Improving performance
- Transfers and Exports
  - Data Transfer Service
    - Data locality
    - Setting up destination table
    - Create a transfer job
    - Scheduled queries
    - Cross-region dataset copy
  - Exporting Stackdriver Logs
  - Using Cloud Dataflow to Read/Write from BigQuery
    - Using a Dataflow template to load directly from MySQL
    - Writing a Dataflow job
    - Using the Streaming API directly
- Moving On-Premises Data
  - Data Migration Methods
- Summary
5. Developing with BigQuery
- Developing Programmatically
  - Accessing BigQuery via the REST API
    - Dataset manipulation
    - Table manipulation
    - Querying
    - Limitations
  - Google Cloud Client Library
    - Dataset manipulation
      - Dataset information
      - Creating a dataset
      - Deleting a dataset
      - Modifying attributes of a dataset
    - Table management
      - Obtaining table properties
      - Deleting a table
      - Creating an empty table
      - Updating a tables schema
      - Inserting rows into a table
      - Creating an empty table with schema
      - Loading a pandas DataFrame
      - Loading from a URI
      - Loading from a local file
      - Copying a table
      - Extracting data from a table
      - Browsing the rows of a table
    - Querying
      - Dry run
      - Executing the query
      - Creating a pandas DataFrame
      - Parameterized queries
- Accessing BigQuery from Data Science Tools
  - Notebooks on Google Cloud Platform
    - Jupyter Magics
    - Running a parameterized query
    - Saving query results to pandas
  - Working with BigQuery, pandas, and Jupyter
  - Working with BigQuery from R
  - Cloud Dataflow
  - JDBC/ODBC drivers
  - Incorporating BigQuery Data into Google Slides (in G Suite)
- Bash Scripting with BigQuery
  - Creating Datasets and Tables
    - Checking whether a dataset exists
    - Creating a dataset in a different project
    - Creating a table
    - Complex schema
    - Copying datasets
    - Loading and inserting data
    - Extracting data
  - Executing Queries
    - Previewing data
    - Creating views
  - BigQuery Objects
    - Showing details
    - Updating
- Summary
6. Architecture of BigQuery
- High-Level Architecture
  - Life of a Query Request
    - Step 1: HTTP POST
    - Step 2: Routing
    - Step 3: Job Server
    - Step 4: Query engine
    - Step 5: Returning the query results
  - BigQuery Upgrades
- Query Engine (Dremel)
  - Dremel Architecture
    - Query Master
    - Scheduler
    - Worker Shard
    - Shuffle
  - Query Execution
    - Scan-filter-count query
      - Stage 0
      - Poststage 0
      - Stage 1
    - Scan-filter-aggregate query
      - Stage 0
      - Stage 1
      - Stage 2
    - Scan-filter-aggregate query with high cardinality
      - Stage 0
      - Distributed sort
    - Broadcast JOIN query
    - Hash join query
- Storage
  - Storage Data
    - Physical storage: Colossus
    - Storage format: Capacitor
  - Metadata
    - Storage sets
    - Time travel
    - Storage optimization
    - Partitioning
    - Clustering
      - Reclustering
      - Performance optimizations with clustered tables
    - DML
    - Meta-File
- Summary
7. Optimizing Performance and Cost
- Principles of Performance
  - Key Drivers of Performance
  - Controlling Cost
    - Estimating per-query cost
    - Finding the most expensive queries
- Measuring and Troubleshooting
  - Measuring Query Speed Using REST API
  - Measuring Query Speed Using BigQuery Workload Tester
  - Troubleshooting Workloads Using Stackdriver
  - Reading Query Plan Information
    - Obtaining query plan information from the job details
    - Visualizing the query plan information
- Increasing Query Speed
  - Minimizing I/O
    - Be purposeful in SELECT
    - Reducing data being read
    - Reducing the number of expensive computations
  - Caching the Results of Previous Queries
    - Caching intermediate results
    - Accelerating queries with BI Engine
  - Performing Efficient Joins
    - Denormalization
    - Avoiding self-joins of large tables
    - Reducing the data being joined
    - Using a window function instead of self-join
    - Joining with precomputed values
    - JOIN versus denormalization
  - Avoiding Overwhelming a Worker
    - Limiting large sorts
    - Data skew
    - Optimizing user-defined functions
  - Using Approximate Aggregation Functions
    - Approximate count
    - Approximate top
    - HLL functions
- Optimizing How Data Is Stored and Accessed
  - Minimizing Network Overhead
    - Compressed, partial responses
    - Batching multiple requests
    - Bulk reads using BigQuery Storage API
  - Choosing an Efficient Storage Format
    - Internal versus external data sources
    - Setting up life cycle management on staging buckets
    - Storing data as arrays of structs
    - Storing data as geography types
  - Partitioning Tables to Reduce Scan Size
    - Antipattern: Table suffixes and wildcards
    - Partitioned tables
  - Clustering Tables Based on High-Cardinality Keys
    - Clustering by the partitioning column
    - Reclustering
    - Side benefits of clustering
- Time-Insensitive Use Cases
  - Batch Queries
  - File Loads
- Summary
  - Checklist
8. Advanced Queries
- Reusable Queries
  - Parameterized Queries
    - Named parameters
    - Named timestamp parameters
    - Positional parameters
    - Array and struct parameters
  - SQL User-Defined Functions
    - Persistent UDFs
    - Public UDFs
  - Reusing Parts of Queries
    - Correlated subquery
    - WITH clause
    - Defining constants
- Advanced SQL
  - Working with Arrays
    - Using arrays to preserve ordering
    - Using arrays to store repeated fields
    - Using arrays for generating data
    - Array functions
  - Window Functions
    - Aggregate analytic functions
    - Navigation functions
    - Numbering functions
  - Table Metadata
    - Building queries dynamically
    - Labels and tags
    - Time travel
  - Data Definition Language and Data Manipulation Language
    - DDL
      - Options list
      - Empty table
      - Changing options
    - DML
      - Insert SELECT
      - Insert VALUES
      - Insert VALUES with subquery SELECT
      - Deleting rows
      - Updating row values
      - MERGE statement
- Beyond SQL
  - JavaScript UDFs
  - Scripting
    - A sequence of statements
    - Temporary tables
    - Anatomy of a simple script
    - Looping
    - Stored procedures
    - Parameters to stored procedures
- Advanced Functions
  - BigQuery Geographic Information Systems
    - Geographic types
    - Creating Polygons
    - GIS predicate functions
    - GIS Measures
    - Geometry transformations and aggregations
  - Useful Statistical Functions
    - Statistics
    - Quantiles
    - Correlation
  - Hash Algorithms
    - Fingerprint function
    - MD5 and SHA
    - UUID
    - Random number generator
- Summary
9. Machine Learning in BigQuery
- What Is Machine Learning?
  - Formulating a Machine Learning Problem
  - Types of Machine Learning Problems
    - Regression
    - Classification
    - Recommender
    - Clustering
    - Unstructured data
    - Summary of model types
- Building a Regression Model
  - Choose the Label
  - Exploring the Dataset to Find Features
    - Impact of station
    - Day of week
    - Number of bicycles
  - Creating a Training Dataset
  - Training and Evaluating the Model
    - Evaluating the model
    - Combining days of the week
    - Bucketizing the hour of day
  - Predicting with the Model
    - The need for TRANSFORM
    - Generating batch predictions
  - Examining Model Weights
  - More-Complex Regression Models
    - Deep Neural Networks
    - Gradient-boosted trees
    - Human insights and auxiliary data
- Building a Classification Model
  - Training
  - Evaluation
  - Prediction
  - Choosing the Threshold
- Customizing BigQuery ML
  - Controlling Data Split
  - Balancing Classes
  - Regularization
- k-Means Clustering
  - Whats Being Clustered?
  - Clustering Bicycle Stations
  - Carrying Out Clustering
  - Understanding the Clusters
  - Data-Driven Decisions
- Recommender Systems
  - The MovieLens Dataset
  - Matrix Factorization
  - Making Recommendations
    - Filtering out previously rated movies
    - Customer targeting
    - Batch predictions for all users and movies
  - Incorporating User and Movie Information
    - Obtaining user and product factors
    - Creating input features
    - Training hybrid recommendation model
- Custom Machine Learning Models on GCP
  - Hyperparameter Tuning
    - Hyperparameter tuning using scripting
    - Hyperparameter tuning in Python
    - Hyperparameter tuning using AI Platform
  - AutoML
  - Support for TensorFlow
    - TensorFlows BigQueryReader
    - Using pandas
    - Apache Beam/Cloud Dataflow
    - Exporting to TensorFlow
    - Predicting with TensorFlow models
- Summary
10. Administering and Securing BigQuery
- Infrastructure Security
- Identity and Access Management
  - Identity
  - Role
    - Predefined roles
    - Primitive roles
    - Custom roles
  - Resource
- Administering BigQuery
  - Job Management
  - Authorizing Users
  - Restoring Deleted Records and Tables
  - Continuous Integration/Continuous Deployment
    - Invoking BigQuery from a Cloud Function
    - Putting table, view, and function creation under version control
  - Cost/Billing Exports
    - Costs by month by product
    - Visualizing the billing report
    - Labels
  - Dashboards, Monitoring, and Audit Logging
    - Cloud Security Command Center
    - Stackdriver monitoring and audit logging
- Availability, Disaster Recovery, and Encryption
  - Zones, Regions, and Multiregions
  - BigQuery and Failure Handling
    - Disk failures
    - Machine failures
    - Zonal failures
    - Regional failures
  - Durability, Backups, and Disaster Recovery
  - Privacy and Encryption
    - Access transparency
    - Virtual Private Cloud Service Controls
    - Customer-Managed Encryption Keys
- Regulatory Compliance
  - Data Locality
  - Restricting Access to Subsets of Data
    - Authorized views
    - Dynamic filtering based on user
  - Removing All Transactions Related to a Single Individual
    - DML
    - Crypto-shredding
  - Data Loss Prevention
  - CMEK
  - Data Exfiltration Protection
- Summary
Index