Data Science on the Google Cloud Platform. 2nd Edition - Helion
ISBN: 9781098118914
Pages: 462, Format: ebook
Publication date: 2022-03-29
Bookstore: Helion
Book price: 228.65 zł (previously: 265.87 zł)
You save: 14% (-37.22 zł)
Learn how easy it is to apply sophisticated statistical and machine learning methods to real-world problems when you build on Google Cloud Platform (GCP). This hands-on guide shows data engineers and data scientists how to implement an end-to-end data pipeline with cloud-native tools on GCP.
Throughout this updated second edition, you'll work through a sample business decision by employing a variety of data science approaches. Follow along by building a data pipeline in your own project on GCP, and discover how to solve data science problems in a transformative and more collaborative way.
You'll learn how to:
- Employ best practices in building highly scalable data and ML pipelines on Google Cloud
- Automate and schedule data ingest using Cloud Run
- Create and populate a dashboard in Data Studio
- Build a real-time analytics pipeline using Pub/Sub, Dataflow, and BigQuery (see the first sketch after this list)
- Conduct interactive data exploration with BigQuery
- Create a Bayesian model with Spark on Cloud Dataproc
- Forecast time series and do anomaly detection with BigQuery ML (see the second sketch after this list)
- Aggregate within time windows with Dataflow
- Train explainable machine learning models with Vertex AI
- Operationalize ML with Vertex AI Pipelines
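The bullets above name the tools but not the shape of the code, so two illustrative sketches follow. First, for the Pub/Sub, Dataflow, and BigQuery bullet: a minimal Apache Beam streaming pipeline that reads flight events from a Pub/Sub topic, averages departure delay per airport over sliding time windows, and appends the results to a BigQuery table. This is not the book's code; the topic, table, message fields, and window sizes are assumptions made for illustration.

    import json

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions
    from apache_beam.transforms import window

    # Streaming mode is required for Pub/Sub reads and windowed aggregation.
    opts = PipelineOptions(streaming=True)

    with beam.Pipeline(options=opts) as p:
        (p
         # Hypothetical topic; messages are assumed to be JSON flight events.
         | "Read" >> beam.io.ReadFromPubSub(topic="projects/PROJECT/topics/flights")
         | "Parse" >> beam.Map(json.loads)
         | "KeyByAirport" >> beam.Map(lambda e: (e["airport"], float(e["dep_delay"])))
         # One-hour windows, recomputed every five minutes.
         | "Window" >> beam.WindowInto(window.SlidingWindows(size=3600, period=300))
         | "AvgDelay" >> beam.combiners.Mean.PerKey()
         | "ToRow" >> beam.Map(lambda kv: {"airport": kv[0], "avg_dep_delay": kv[1]})
         # Hypothetical destination table.
         | "Write" >> beam.io.WriteToBigQuery(
             "PROJECT:dsongcp.streaming_delays",
             schema="airport:STRING,avg_dep_delay:FLOAT",
             write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))

Second, for the BigQuery ML bullet: train an ARIMA_PLUS time-series model on daily average arrival delay, then flag anomalous days with ML.DETECT_ANOMALIES. Again a sketch under stated assumptions, not the book's code; the dataset, table, and column names are hypothetical.

    from google.cloud import bigquery

    client = bigquery.Client()

    # Train a forecasting model on one value per day (hypothetical table/columns).
    client.query("""
        CREATE OR REPLACE MODEL dsongcp.delay_forecast
        OPTIONS (model_type = 'ARIMA_PLUS',
                 time_series_timestamp_col = 'flight_date',
                 time_series_data_col = 'avg_arr_delay') AS
        SELECT FL_DATE AS flight_date, AVG(ARR_DELAY) AS avg_arr_delay
        FROM dsongcp.flights
        GROUP BY flight_date
    """).result()

    # Report days whose observed delay falls outside the model's 95% band.
    for row in client.query("""
        SELECT * FROM ML.DETECT_ANOMALIES(
            MODEL dsongcp.delay_forecast,
            STRUCT(0.95 AS anomaly_prob_threshold))
        WHERE is_anomaly
    """).result():
        print(row.flight_date, row.avg_arr_delay)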
Customers who bought "Data Science on the Google Cloud Platform. 2nd Edition" also chose:
- Windows Media Center. Domowe centrum rozrywki 66,67 zł (now 8,00 zł, -88%)
- Ruby on Rails. Ćwiczenia 18,75 zł (now 3,00 zł, -84%)
- Przywództwo w świecie VUCA. Jak być skutecznym liderem w niepewnym środowisku 58,64 zł (now 12,90 zł, -78%)
- Scrum. O zwinnym zarządzaniu projektami. Wydanie II rozszerzone 58,64 zł (now 12,90 zł, -78%)
- Od hierarchii do turkusu, czyli jak zarządzać w XXI wieku 58,64 zł (now 12,90 zł, -78%)
Table of Contents
- Preface
- Who This Book Is For
- Conventions Used in This Book
- Using Code Examples
- O'Reilly Online Learning
- How to Contact Us
- Acknowledgments
- 1. Making Better Decisions Based on Data
- Many Similar Decisions
- The Role of Data Scientists
- Scrappy Environment
- Full Stack Cloud Data Scientists
- Collaboration
- Best Practices
- Simple to Complex Solutions
- Cloud Computing
- Serverless
- A Probabilistic Decision
- Probabilistic Approach
- Probability Density Function
- Cumulative Distribution Function
- Choices Made
- Choosing Cloud
- Not a Reference Book
- Getting Started with the Code
- Agile Architecture for Data Science on Google Cloud
- What Is Agile Architecture?
- No-Code, Low-Code
- Use Managed Services
- Summary
- Suggested Resources
- 2. Ingesting Data into the Cloud
- Airline On-Time Performance Data
- Knowability
- Causality
- Training-Serving Skew
- Downloading Data
- Hub-and-Spoke Architecture
- Dataset Fields
- Separation of Compute and Storage
- Scaling Up
- Scaling Out with Sharded Data
- Scaling Out with Data-in-Place
- Ingesting Data
- Reverse Engineering a Web Form
- Dataset Download
- Exploration and Cleanup
- Uploading Data to Google Cloud Storage
- Loading Data into Google BigQuery
- Advantages of a Serverless Columnar Database
- Staging on Cloud Storage
- Access Control
- Ingesting CSV Files
- Partitioning
- Scheduling Monthly Downloads
- Ingesting in Python
- Cloud Run
- Securing Cloud Run
- Deploying and Invoking Cloud Run
- Scheduling Cloud Run
- Summary
- Code Break
- Suggested Resources
- 3. Creating Compelling Dashboards
- Explain Your Model with Dashboards
- Why Build a Dashboard First?
- Accuracy, Honesty, and Good Design
- Loading Data into Cloud SQL
- Create a Google Cloud SQL Instance
- Create Table of Data
- Interacting with the Database
- Querying Using BigQuery
- Schema Exploration
- Using Preview
- Using Table Explorer
- Creating BigQuery View
- Building Our First Model
- Contingency Table
- Threshold Optimization
- Building a Dashboard
- Getting Started with Data Studio
- Creating Charts
- Adding End-User Controls
- Showing Proportions with a Pie Chart
- Explaining a Contingency Table
- Modern Business Intelligence
- Digitization
- Natural Language Queries
- Connected Sheets
- Summary
- Suggested Resources
- 4. Streaming Data: Publication and Ingest with Pub/Sub and Dataflow
- Designing the Event Feed
- Transformations Needed
- Architecture
- Getting Airport Information
- Sharing Data
- Sharing a Cloud Storage dataset
- Sharing a BigQuery dataset
- Dataplex and Analytics Hub
- Time Correction
- Apache Beam/Cloud Dataflow
- Parsing Airports Data
- Adding Time Zone Information
- Converting Times to UTC
- Correcting Dates
- Creating Events
- Reading and Writing to the Cloud
- Running the Pipeline in the Cloud
- Publishing an Event Stream to Cloud Pub/Sub
- Speed-Up Factor
- Get Records to Publish
- How Many Topics?
- Iterating Through Records
- Building a Batch of Events
- Publishing a Batch of Events
- Real-Time Stream Processing
- Streaming in Dataflow
- Windowing a Pipeline
- Streaming Aggregation
- Using Event Timestamps
- Executing the Stream Processing
- Analyzing Streaming Data in BigQuery
- Real-Time Dashboard
- Summary
- Suggested Resources
- 5. Interactive Data Exploration with Vertex AI Workbench
- Exploratory Data Analysis
- Exploration with SQL
- Reading a Query Explanation
- Exploratory Data Analysis in Vertex AI Workbench
- Jupyter Notebooks
- Creating a Notebook
- Jupyter Commands
- Installing Packages
- Jupyter Magic for Google Cloud
- Exploring Arrival Delays
- Basic Statistics
- Plotting Distributions
- Quality Control
- Oddball values
- Outlier removal: Big data is different
- Filtering data on occurrence frequency
- Arrival Delay Conditioned on Departure Delay
- Distribution of arrival delays
- Applying a probabilistic decision threshold
- Empirical probability distribution function
- The answer is...
- Evaluating the Model
- Random Shuffling
- Splitting by Date
- Training and Testing
- Summary
- Suggested Resources
- 6. Bayesian Classifier with Apache Spark on Cloud Dataproc
- MapReduce and the Hadoop Ecosystem
- How MapReduce Works
- Apache Hadoop
- Google Cloud Dataproc
- Need for Higher-Level Tools
- Jobs, Not Clusters
- Preinstalling Software
- Quantization Using Spark SQL
- JupyterLab on Cloud Dataproc
- Independence Check Using BigQuery
- Spark SQL in JupyterLab
- Histogram Equalization
- Bayesian Classification
- Bayes in Each Bin
- Evaluating the Model
- Dynamically Resizing Clusters
- Comparing to Single Threshold Model
- Orchestration
- Submitting a Spark Job
- Workflow Template
- Cloud Composer
- Autoscaling
- Serverless Spark
- Summary
- Suggested Resources
- 7. Logistic Regression Using Spark ML
- Logistic Regression
- How Logistic Regression Works
- Spark ML Library
- Getting Started with Spark Machine Learning
- Spark Logistic Regression
- Creating a Training Dataset
- Dealing with corner cases
- Creating training examples
- Training the Model
- Predicting Using the Model
- Evaluating a Model
- Feature Engineering
- Experimental Framework
- Choosing a metric
- Creating the held-out dataset
- Feature Selection
- Creating a large cluster
- Increasing quota
- Autoscale up and down
- Removing features
- Feature Transformations
- Scaling
- Clipping
- Feature Creation
- Categorical Variables
- Repeatable, Real Time
- Summary
- Suggested Resources
- 8. Machine Learning with BigQuery ML
- Logistic Regression
- Presplit Data
- Interrogating the Model
- Evaluating the Model
- Scale and Simplicity
- Nonlinear Machine Learning
- XGBoost
- Hyperparameter Tuning
- Vertex AI AutoML Tables
- Time Window Features
- Taxi-Out Time
- Compounding Delays
- Causality
- Time Features
- Departure Hour
- Transform Clause
- Categorical Variable
- Feature Cross
- Summary
- Suggested Resources
- 9. Machine Learning with TensorFlow in Vertex AI
- Toward More Complex Models
- Preparing BigQuery Data for TensorFlow
- Reading Data into TensorFlow
- Training and Evaluation in Keras
- Model Function
- Features
- Inputs
- Training the Keras Model
- Saving and Exporting
- Deep Neural Network
- Wide-and-Deep Model in Keras
- Representing Air Traffic Corridors
- Bucketing
- Feature Crossing
- Wide-and-Deep Classifier
- Deploying a Trained TensorFlow Model to Vertex AI
- Concepts
- Uploading Model
- Creating Endpoint
- Deploying Model to Endpoint
- Invoking the Deployed Model
- Summary
- Suggested Resources
- 10. Getting Ready for MLOps with Vertex AI
- Developing and Deploying Using Python
- Writing model.py
- Writing the Training Pipeline
- Predefined Split
- AutoML
- Hyperparameter Tuning
- Parameterize Model
- Shorten Training Run
- Metrics During Training
- Hyperparameter Tuning Pipeline
- Best Trial to Completion
- Explaining the Model
- Configuring Explanations Metadata
- Creating and Deploying Model
- Obtaining Explanations
- Summary
- Suggested Resources
- 11. Time-Windowed Features for Real-Time Machine Learning
- Time Averages
- Apache Beam and Cloud Dataflow
- Why Apache Beam?
- Why Dataflow?
- Starting points
- Reading and Writing
- Reading from BigQuery
- Local JSON input
- Filtering
- Time Windowing
- Assigning a timestamp
- Sliding windows
- Computing moving average
- Removing duplicates
- Machine Learning Training
- Machine Learning Dataset
- Label
- Data split
- Distance bug
- Monitoring and verification
- Training the Model
- Changes from Chapter 10
- AutoML model
- Custom model
- Streaming Predictions
- Reuse Transforms
- Input and Output
- Invoking Model
- Reusing Endpoint
- Shared handle
- Per-worker instance
- Batching Predictions
- Streaming Pipeline
- Writing to BigQuery
- Executing Streaming Pipeline
- Late and Out-of-Order Records
- Uniformly distributed delay
- Exponential distribution
- Normal distribution
- Watermarks and triggers
- Possible Streaming Sinks
- Choosing a sink
- Cloud Bigtable
- Designing tables
- Designing the row key
- Streaming into Cloud Bigtable
- Querying from Cloud Bigtable
- Summary
- Suggested Resources
- 12. The Full Dataset
- Four Years of Data
- Creating Dataset
- Dataset split
- Shuffling data
- Need for continuous training
- More powerful machines
- Training Model
- Evaluation
- RMSE
- Confusion matrix
- Impact of threshold
- Impact of a feature
- Analyzing errors
- Categorical features
- Summary
- Suggested Resources
- Conclusion
- A. Considerations for Sensitive Data Within Machine Learning Datasets
- Handling Sensitive Information
- Sensitive Data in Columns
- Sensitive Data in Natural Language Datasets
- Sensitive Data in Free-Form Unstructured Data
- Sensitive Data in a Combination of Fields
- Sensitive Data in Unstructured Content
- Protecting Sensitive Data
- Removing Sensitive Data
- Masking Sensitive Data
- Coarsening Sensitive Data
- Establishing a Governance Policy
- Index