Dataproc Cookbook - Helion

ISBN: 9781098157661
Pages: 438, Format: ebook
Publication date: 2025-06-03
Bookstore: Helion
Book price: 228.65 zł (previously: 265.87 zł)
You save: 14% (-37.22 zł)
Want to build big data solutions in Google Cloud? Dataproc Cookbook is your hands-on guide to mastering Dataproc and the essential GCP fundamentals (networking, security, monitoring, and cost optimization) that apply across Google Cloud services. Learn practical skills that not only fast-track your Dataproc expertise but also help you succeed with a wide range of GCP technologies.
Written by data experts Narasimha Sadineni and Anu Venkataraman, this cookbook tackles real-world use cases like serverless Spark jobs, Kubernetes-native deployments, and cost-optimized data lake workflows. You'll learn how to create ephemeral and persistent Dataproc clusters, run secure data science workloads, implement monitoring solutions, and plan effective migration and optimization strategies.
- Create Dataproc clusters on Compute Engine and Kubernetes Engine
- Run data science workloads on Dataproc
- Execute Spark jobs on Dataproc Serverless
- Optimize Dataproc clusters to be cost effective and performant
- Monitor Spark jobs in various ways
- Orchestrate various workloads and activities
- Use different methods for migrating data and workloads from existing Hadoop clusters to Dataproc
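
To give a flavor of the first chapter (which covers creating clusters from the web UI, gcloud, the REST API, Terraform, and Python), here is a minimal, illustrative sketch of cluster creation with the google-cloud-dataproc Python client library. It is not taken from the book; the project ID, region, cluster name, and machine types below are placeholder assumptions.

# Minimal sketch: create a Dataproc cluster on Compute Engine using the
# google-cloud-dataproc client (pip install google-cloud-dataproc).
# PROJECT_ID, REGION, cluster name, and machine types are placeholders.
from google.cloud import dataproc_v1

PROJECT_ID = "my-project"   # placeholder
REGION = "us-central1"      # placeholder

# The client must point at the regional Dataproc endpoint.
client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{REGION}-dataproc.googleapis.com:443"}
)

cluster = {
    "project_id": PROJECT_ID,
    "cluster_name": "demo-cluster",
    "config": {
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-2"},
        "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-2"},
    },
}

# create_cluster returns a long-running operation; result() blocks until the
# cluster is provisioned and returns the Cluster resource.
operation = client.create_cluster(
    request={"project_id": PROJECT_ID, "region": REGION, "cluster": cluster}
)
print(f"Cluster created: {operation.result().cluster_name}")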
Customers who bought "Dataproc Cookbook" also chose:
- Cisco CCNA 200-301. Kurs video. Podstawy sieci komputerowych i konfiguracji. Część 1: 747.50 zł (29.90 zł, -96%)
- Cisco CCNP Enterprise 350-401 ENCOR. Kurs video. Sieci przedsi: 427.14 zł (29.90 zł, -93%)
- Jak zhakowa: 125.00 zł (10.00 zł, -92%)
- Windows Media Center. Domowe centrum rozrywki: 66.67 zł (8.00 zł, -88%)
- Deep Web bez tajemnic. Kurs video. Pozyskiwanie ukrytych danych: 186.88 zł (29.90 zł, -84%)
Dataproc Cookbook eBook: Table of Contents
- Preface
- Who Should Read This Book
- Why We Wrote This Book
- Navigating This Book
- Conventions Used in This Book
- Using Code Examples
- O'Reilly Online Learning
- How to Contact Us
- Acknowledgments
- 1. Creating a Dataproc Cluster
- Installing Google Cloud CLI
- Problem
- Solution
- Discussion
- Granting Identity and Access Management Privileges to a User
- Problem
- Solution
- Discussion
- Configuring a Network and Firewall Rules
- Problem
- Solution
- Discussion
- See Also
- Creating a Dataproc Cluster from a Web UI
- Problem
- Solution
- Discussion
- Creating a Dataproc Cluster Using Gcloud
- Problem
- Solution
- Discussion
- Creating a Dataproc Cluster Using API Endpoints
- Problem
- Solution
- Discussion
- Creating a Dataproc Cluster Using Terraform
- Problem
- Solution
- Discussion
- Creating a Cluster Using Python
- Problem
- Solution
- Discussion
- Duplicating a Dataproc Cluster
- Problem
- Solution
- Discussion
- 2. Running Hive, Spark, and Sqoop Workloads
- Adding Required Privileges for Jobs
- Problem
- Solution
- Discussion
- Project-level roles
- Service-level roles
- Custom roles
- Staging and temporary buckets
- See Also
- Generating 1 TB of Data Using a MapReduce Job
- Problem
- Solution
- Discussion
- Submitting the job
- Monitoring the job
- Running a Hive Job to Show Records from an Employee Table
- Problem
- Solution
- Discussion
- Metastore options
- Hive query execution process
- Step 1: Stage the HiveQL script in the GCS bucket
- Step 2: Submit the Hive job to the Dataproc cluster via the console
- See Also
- Converting XML Data to Parquet Using Scala Spark on Dataproc
- Problem
- Solution
- Discussion
- See Also
- Converting XML Data to Parquet Using PySpark on Dataproc
- Problem
- Solution
- Discussion
- Staging the PySpark code
- Submitting the PySpark job
- See Also
- Submitting a SparkR Job
- Problem
- Solution
- Discussion
- Staging the SparkR code
- Submitting the SparkR job
- Monitoring the SparkR job
- Migrating Data from Cloud SQL to Hive Using Sqoop Job
- Problem
- Solution
- Discussion
- Setting up a MySQL source database (optional)
- Creating a sample table and data in MySQL
- Triggering the Sqoop data transfer job
- Choosing Deployment Modes When Submitting a Spark Job to Dataproc
- Problem
- Solution
- Client mode
- Cluster mode
- Discussion
- See Also
- 3. Advanced Dataproc Cluster Configuration
- Creating an Autoscaling Policy
- Problem
- Solution
- Discussion
- YARN configuration
- Worker configuration
- Attaching an Autoscaling Policy to a Dataproc Cluster
- Problem
- Solution
- Discussion
- Optimizing Cluster Costs with a Mixed On-Demand and Spot Instance Autoscaling Policy
- Problem
- Solution
- Discussion
- Adding Local SSDs to Dataproc Worker Nodes
- Problem
- Solution
- Discussion
- Creating a Cluster with a Custom Image
- Problem
- Solution
- Discussion
- Building a Cluster with Custom Machine Types
- Problem
- Solution
- Discussion
- Bootstrapping Dataproc Clusters with Initialization Scripts
- Problem
- Solution
- Discussion
- Scheduling Automatic Deletion of Unused Clusters
- Problem
- Solution
- Discussion
- Overriding Hadoop Configurations
- Problem
- Solution
- Discussion
- 4. Serverless Spark and Ephemeral Dataproc Clusters
- Running on Dataproc: Serverless Versus Ephemeral Clusters
- Problem
- Solution
- Discussion
- Dataproc Serverless
- Ephemeral Dataproc clusters
- Running a Sequence of Jobs on an Ephemeral Cluster
- Problem
- Solution
- Discussion
- Executing a Spark Batch Job to Convert XML Data to Parquet on Dataproc Serverless
- Problem
- Solution
- Discussion
- See Also
- Running a Serverless Job Using the Premium Tier Configuration
- Problem
- Solution
- Discussion
- Giving a Unique Custom Name to a Dataproc Serverless Spark Job
- Problem
- Solution
- Discussion
- See Also
- Cloning a Dataproc Serverless Spark Job
- Problem
- Solution
- Discussion
- Running a Serverless Job on Spark RAPIDS Accelerator
- Problem
- Solution
- Discussion
- See Also
- Configuring a Spark History Server
- Problem
- Solution
- Discussion
- See Also
- Writing Spark Events to the Spark History Server from Dataproc Serverless
- Problem
- Solution
- Discussion
- Monitoring Serverless Spark Jobs
- Problem
- Solution
- Discussion
- See Also
- Calculating the Price of a Serverless Batch
- Problem
- Solution
- Discussion
- See Also
- 5. Dataproc on Google Kubernetes Engine
- Creating a Kubernetes Cluster
- Problem
- Solution
- Discussion
- Application requirements
- Infrastructure
- Node pool strategy
- Network
- Kubernetes distribution
- Security
- Prerequisites
- Creating a Dataproc Cluster on a GKE Cluster
- Problem
- Solution
- Discussion
- Running Spark Jobs on a Dataproc GKE Cluster
- Problem
- Solution
- Discussion
- Customizing Node Pools
- Problem
- Solution
- Discussion
- Autoscaling in a GKE Cluster
- Problem
- Solution
- Discussion
- Achieving Zonal High Availability for Dataproc Jobs
- Problem
- Solution
- Discussion
- 6. Dataproc Metastore
- Creating a Dataproc Metastore Service Instance
- Problem
- Solution
- Discussion
- See Also
- Attaching a DPMS Instance to One or More Clusters
- Problem
- Solution
- Discussion
- Creating Tables and Verifying Metadata in DPMS
- Problem
- Solution
- Discussion
- Installing an External Hive Metastore
- Problem
- Solution
- Discussion
- Attaching an External Apache Hive Metastore to the Cluster
- Problem
- Solution
- Discussion
- Searching for Metadata in a Dataplex Data Catalog
- Problem
- Solution
- Discussion
- Automating the Backup of a DPMS Instance
- Problem
- Solution
- Discussion
- Prerequisites
- Creating a Cloud Function
- Creating a Pub/Sub topic
- Creating a Cloud Scheduler job
- 7. Connecting from Dataproc to GCP Services
- Reading from GCS and Writing to a BigQuery Table
- Problem
- Solution
- Discussion
- Reading from a Cloud SQL Table
- Problem
- Solution
- Discussion
- Writing to GCS in Delta Format
- Problem
- Solution
- Discussion
- Integrating a Dataproc-Managed Delta Lake with BigLake
- Problem
- Solution
- Discussion
- Syncing a Delta Table on Dataproc
- Creating a BigLake connection
- Locating connection information
- Granting GCS permissions for BigLake access
- Creating the BigLake table
- Connecting to GCP Services Using Dataproc Templates
- Problem
- Solution
- Discussion
- Spark Job Running on Dataproc Reading from GCS and Writing to Bigtable
- Problem
- Solution
- Discussion
- See Also
- 8. Configuring Logging in Dataproc
- Understanding Different Types of Logs in Dataproc
- Problem
- Solution
- Discussion
- Understanding Cloud Logging
- Problem
- Solution
- Discussion
- Log source
- Log Router
- Log sink
- Logging buckets
- Viewing Logs in Cloud Logging
- Problem
- Solution
- Discussion
- Accessing logs from the web UI
- Viewing logs using gcloud command
- Searching for logs using the REST API
- Routing Dataproc Logs to Cloud Logging
- Problem
- Solution
- Discussion
- Attaching Custom Labels to Logging
- Problem
- Solution
- Discussion
- Optimizing Cloud Logging Costs
- Problem
- Solution
- Discussion
- Sinking Logs to BigQuery
- Problem
- Solution
- Discussion
- 9. Setting Up Monitoring and Dashboards
- Monitoring Cluster Status
- Problem
- Solution
- Discussion
- Comprehensive service monitoring
- Application-centric monitoring
- Exploring Predefined Metrics Charts
- Problem
- Solution
- Discussion
- Creating Charts Using Metrics Explorer
- Problem
- Solution
- Discussion
- Creating Dashboards Using Metrics Explorer
- Problem
- Solution
- Discussion
- Setting Up Alerts
- Problem
- Solution
- Discussion
- Migrating Dashboards from One Project to Another
- Problem
- Solution
- Discussion
- Creating Custom Log-Based Metrics
- Problem
- Solution
- Discussion
- Creating a log-based metric
- Viewing metrics data from Metrics Explorer
- 10. Dataproc Security
- Managing Identities in Dataproc Clusters
- Problem
- Solution
- Discussion
- Service account approach
- Personal cluster authentication
- Secure multitenancy approach
- Securing Your Perimeter Using VPC Service Controls
- Problem
- Solution
- Discussion
- Creating a new VPC service perimeter
- Allowing outside users access to resources in a secured perimeter
- Authenticating Using Kerberos
- Problem
- Solution
- Discussion
- Installing Ranger
- Problem
- Solution
- Discussion
- Creating an encrypt/decrypt key
- Encrypting a password using the key
- Copying the password to the GCS location
- Securing Cluster Resources Using Ranger
- Problem
- Solution
- Discussion
- Creating users in Ranger
- Creating Ranger policies to secure access to folders
- Creating a Ranger policy
- Managing Credentials in the Google Cloud Environment
- Problem
- Solution
- Discussion
- Creating a secret
- Accessing a secret in a Spark job (PySpark) and using it to connect to a SQL server
- Hadoop credentials approach
- Enforcing Restrictions Across All Clusters
- Problem
- Solution
- Discussion
- Enforcing specific machine types in a Dataproc cluster
- Enforcing the cost_center label for Dataproc clusters
- Testing the organization policies
- Tokenizing Sensitive Data
- Problem
- Solution
- Discussion
- Creating a deidentified template in DLP
- Creating a KMS key
- Creating a deidentification template
- Tokenizing first name and last name using a Spark job
- 11. Performance Tuning and Cost Optimization
- Sizing a Dataproc Cluster
- Problem
- Solution
- Discussion
- Scenario 1: Configuring a cluster for batch requirements
- Scenario 2: Designing a streaming application
- Choosing the Right Disks for Big Data Workloads on Dataproc
- Problem
- Solution
- Discussion
- Benchmarking Clusters with Performance Tuning
- Problem
- Solution
- Discussion
- Creating a cluster to run benchmarking jobs
- Running a TeraGen job to generate 500 GB of data
- Running TeraSort to sort 500 GB of data generated by TeraGen
- Capturing benchmarking results
- Navigating the Spark UI
- Problem
- Solution
- Discussion
- Optimizing Spark Jobs
- Problem
- Solution
- Discussion
- Installing Sparklens for Profiling Spark Applications
- Problem
- Solution
- Discussion
- Driver and executor wall-clock times
- Critical path
- See Also
- Identifying Spark Job Errors and Bottlenecks
- Problem
- Solution
- Discussion
- Setting up test data
- Replicating driver OOM
- Replicating executor OOM
- Replicating spill to memory (spill to disk)
- Understanding the YARN UI
- Problem
- Solution
- Discussion
- Cluster information path
- Applications information path
- Scheduler home screen
- Calculating the Cost of a Dataproc Cluster
- Problem
- Solution
- Discussion
- Optimizing Cost in Dataproc Clusters
- Problem
- Solution
- Discussion
- Choosing the right architecture for your Dataproc cluster
- Tuning the cluster configuration
- Oversubscribing the cluster
- Optimizing logging
- 12. Orchestrating Dataproc Workloads
- Understanding the Prerequisites for Installing Cloud Composer
- Problem
- Solution
- Discussion
- Configuring IAM permissions
- Planning the network and subnet IP ranges
- Deploying a Cloud Composer Environment
- Problem
- Solution
- Discussion
- Scheduling a Job in Composer
- Problem
- Solution
- Discussion
- Parameterizing Variables
- Problem
- Solution
- Discussion
- Scaling Up a Cloud Composer Environment
- Problem
- Solution
- Discussion
- Infrastructure
- DAG-level optimizations
- Running Spark Jobs Using Vertex AI Machine Learning Pipelines
- Problem
- Solution
- Discussion
- Scheduling a Dataproc Job in an Event-Driven Manner Using a Cloud Function
- Problem
- Solution
- Discussion
- Using Dataproc Workflow Templates
- Problem
- Solution
- Discussion
- 13. Using Spark Notebooks on Dataproc
- Deciding Which Notebook Environments to Choose
- Problem
- Solution
- Discussion
- Configuring Notebooks on a Dataproc Cluster
- Problem
- Solution
- Discussion
- Running Spark Scala and PySpark Notebooks on Dataproc
- Problem
- Solution
- Discussion
- Running the first Spark Scala application
- Running the first PySpark application
- Managing Libraries and Configs
- Problem
- Solution
- Discussion
- Creating Dataproc-Enabled Vertex AI Workbench Instances
- Problem
- Solution
- Discussion
- Running Spark notebooks
- Integrating with Git
- Mounting a GCS bucket
- Executing Notebooks Using Spark Serverless Sessions
- Problem
- Solution
- Discussion
- 14. Migrating from On-Premises and Public Cloud Services to GCP
- Planning Migration
- Problem
- Solution
- Discussion
- Compute and cluster planning
- Hive migration
- Spark migration
- Additional migration considerations
- Architectural decisions
- IAM policy redesign
- Reliability and region selection
- Cost optimization
- Testing and validation
- Automation
- Upskilling
- See Also
- Data Migration Strategies
- Problem
- Solution
- Discussion
- Migrating Data with STS
- Problem
- Solution
- Discussion
- Preparing the source
- Preparing the target
- Creating a storage transfer job
- Triggering a transfer job manually
- Scheduling a transfer job
- Accessing AWS S3 Data Using BigLake Tables
- Problem
- Solution
- Discussion
- Migrating Metadata
- Problem
- Solution
- Discussion
- Choosing a target service in GCP
- Choosing the migration strategy
- Migrating Applications to Google Cloud
- Problem
- Solution
- Discussion
- Index





