
Dataproc Cookbook
ebook
Authors: Narasimha Sadineni, Anuyogam Venkataraman
ISBN: 9781098157661
Pages: 438, Format: ebook
Publication date: 2025-06-03
Bookstore: Helion

Book price: 228.65 zł (previously 265.87 zł)
You save: 14% (-37.22 zł)

Add Dataproc Cookbook to cart

Want to build big data solutions in Google Cloud? Dataproc Cookbook is your hands-on guide to mastering Dataproc and the essential GCP fundamentals (networking, security, monitoring, and cost optimization) that apply across Google Cloud services. Learn practical skills that not only fast-track your Dataproc expertise but also help you succeed with a wide range of GCP technologies.

Written by data experts Narasimha Sadineni and Anu Venkataraman, this cookbook tackles real-world use cases like serverless Spark jobs, Kubernetes-native deployments, and cost-optimized data lake workflows. You'll learn how to create ephemeral and persistent Dataproc clusters, run secure data science workloads, implement monitoring solutions, and plan effective migration and optimization strategies.

  • Create Dataproc clusters on Compute Engine and Kubernetes Engine
  • Run data science workloads on Dataproc
  • Execute Spark jobs on Dataproc Serverless
  • Optimize Dataproc clusters to be cost effective and performant
  • Monitor Spark jobs in various ways
  • Orchestrate various workloads and activities
  • Use different methods for migrating data and workloads from existing Hadoop clusters to Dataproc
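
As a taste of the recipes inside (compare the "Creating a Cluster Using Python" recipe in Chapter 1), here is a minimal sketch of creating a Dataproc cluster with the google-cloud-dataproc Python client. The project ID, region, machine types, and cluster name are placeholder assumptions, not values taken from the book:

  # Minimal sketch: create a Dataproc cluster with the Python client.
  # Requires: pip install google-cloud-dataproc
  from google.cloud import dataproc_v1

  project_id = "my-project"  # placeholder GCP project ID
  region = "us-central1"     # placeholder Dataproc region

  # The client must point at the regional Dataproc endpoint.
  client = dataproc_v1.ClusterControllerClient(
      client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
  )

  cluster = {
      "project_id": project_id,
      "cluster_name": "example-cluster",
      "config": {
          "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-2"},
          "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-2"},
      },
  }

  # create_cluster returns a long-running operation; result() blocks until
  # the cluster is ready (or raises on failure).
  operation = client.create_cluster(
      request={"project_id": project_id, "region": region, "cluster": cluster}
  )
  print(f"Cluster created: {operation.result().cluster_name}")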

Customers who bought "Dataproc Cookbook" also chose:

  • Cisco CCNA 200-301. Kurs video. Podstawy sieci komputerowych i konfiguracji. Część 1
  • Cisco CCNP Enterprise 350-401 ENCOR. Kurs video. Sieci przedsi
  • Jak zhakowa
  • Windows Media Center. Domowe centrum rozrywki
  • Deep Web bez tajemnic. Kurs video. Pozyskiwanie ukrytych danych

Table of contents

  • Preface
    • Who Should Read This Book
    • Why We Wrote This Book
    • Navigating This Book
    • Conventions Used in This Book
    • Using Code Examples
    • O'Reilly Online Learning
    • How to Contact Us
    • Acknowledgments
  • 1. Creating a Dataproc Cluster
    • Installing Google Cloud CLI
      • Problem
      • Solution
      • Discussion
    • Granting Identity and Access Management Privileges to a User
      • Problem
      • Solution
      • Discussion
    • Configuring a Network and Firewall Rules
      • Problem
      • Solution
      • Discussion
      • See Also
    • Creating a Dataproc Cluster from a Web UI
      • Problem
      • Solution
      • Discussion
    • Creating a Dataproc Cluster Using gcloud
      • Problem
      • Solution
      • Discussion
    • Creating a Dataproc Cluster Using API Endpoints
      • Problem
      • Solution
      • Discussion
    • Creating a Dataproc Cluster Using Terraform
      • Problem
      • Solution
      • Discussion
    • Creating a Cluster Using Python
      • Problem
      • Solution
      • Discussion
    • Duplicating a Dataproc Cluster
      • Problem
      • Solution
      • Discussion
  • 2. Running Hive, Spark, and Sqoop Workloads
    • Adding Required Privileges for Jobs
      • Problem
      • Solution
      • Discussion
        • Project-level roles
        • Service-level roles
        • Custom roles
        • Staging and temporary buckets
      • See Also
    • Generating 1 TB of Data Using a MapReduce Job
      • Problem
      • Solution
      • Discussion
        • Submitting the job
        • Monitoring the job
    • Running a Hive Job to Show Records from an Employee Table
      • Problem
      • Solution
      • Discussion
        • Metastore options
        • Hive query execution process
          • Step 1: Stage the HiveQL script in the GCS bucket
          • Step 2: Submit the Hive job to the Dataproc cluster via the console
      • See Also
    • Converting XML Data to Parquet Using Scala Spark on Dataproc
      • Problem
      • Solution
      • Discussion
      • See Also
    • Converting XML Data to Parquet Using PySpark on Dataproc
      • Problem
      • Solution
      • Discussion
        • Staging the PySpark code
        • Submitting the PySpark job
      • See Also
    • Submitting a SparkR Job
      • Problem
      • Solution
      • Discussion
        • Staging the SparkR code
        • Submitting the SparkR job
        • Monitoring the SparkR job
    • Migrating Data from Cloud SQL to Hive Using a Sqoop Job
      • Problem
      • Solution
      • Discussion
        • Setting up a MySQL source database (optional)
        • Creating a sample table and data in MySQL
        • Triggering the Sqoop data transfer job
    • Choosing Deployment Modes When Submitting a Spark Job to Dataproc
      • Problem
      • Solution
        • Client mode
        • Cluster mode
      • Discussion
      • See Also
  • 3. Advanced Dataproc Cluster Configuration
    • Creating an Autoscaling Policy
      • Problem
      • Solution
      • Discussion
        • YARN configuration
        • Worker configuration
    • Attaching an Autoscaling Policy to a Dataproc Cluster
      • Problem
      • Solution
      • Discussion
    • Optimizing Cluster Costs with a Mixed On-Demand and Spot Instance Autoscaling Policy
      • Problem
      • Solution
      • Discussion
    • Adding Local SSDs to Dataproc Worker Nodes
      • Problem
      • Solution
      • Discussion
    • Creating a Cluster with a Custom Image
      • Problem
      • Solution
      • Discussion
    • Building a Cluster with Custom Machine Types
      • Problem
      • Solution
      • Discussion
    • Bootstrapping Dataproc Clusters with Initialization Scripts
      • Problem
      • Solution
      • Discussion
    • Scheduling Automatic Deletion of Unused Clusters
      • Problem
      • Solution
      • Discussion
    • Overriding Hadoop Configurations
      • Problem
      • Solution
      • Discussion
  • 4. Serverless Spark and Ephemeral Dataproc Clusters
    • Running on Dataproc: Serverless Versus Ephemeral Clusters
      • Problem
      • Solution
      • Discussion
        • Dataproc Serverless
        • Ephemeral Dataproc clusters
    • Running a Sequence of Jobs on an Ephemeral Cluster
      • Problem
      • Solution
      • Discussion
    • Executing a Spark Batch Job to Convert XML Data to Parquet on Dataproc Serverless
      • Problem
      • Solution
      • Discussion
      • See Also
    • Running a Serverless Job Using the Premium Tier Configuration
      • Problem
      • Solution
      • Discussion
    • Giving a Unique Custom Name to a Dataproc Serverless Spark Job
      • Problem
      • Solution
      • Discussion
      • See Also
    • Cloning a Dataproc Serverless Spark Job
      • Problem
      • Solution
      • Discussion
    • Running a Serverless Job on Spark RAPIDS Accelerator
      • Problem
      • Solution
      • Discussion
      • See Also
    • Configuring a Spark History Server
      • Problem
      • Solution
      • Discussion
      • See Also
    • Writing Spark Events to the Spark History Server from Dataproc Serverless
      • Problem
      • Solution
      • Discussion
    • Monitoring Serverless Spark Jobs
      • Problem
      • Solution
      • Discussion
      • See Also
    • Calculating the Price of a Serverless Batch
      • Problem
      • Solution
      • Discussion
      • See Also
  • 5. Dataproc on Google Kubernetes Engine
    • Creating a Kubernetes Cluster
      • Problem
      • Solution
      • Discussion
        • Application requirements
        • Infrastructure
        • Node pool strategy
        • Network
        • Kubernetes distribution
        • Security
        • Prerequisites
    • Creating a Dataproc Cluster on a GKE Cluster
      • Problem
      • Solution
      • Discussion
    • Running Spark Jobs on a Dataproc GKE Cluster
      • Problem
      • Solution
      • Discussion
    • Customizing Node Pools
      • Problem
      • Solution
      • Discussion
    • Autoscaling in a GKE Cluster
      • Problem
      • Solution
      • Discussion
    • Achieving Zonal High Availability for Dataproc Jobs
      • Problem
      • Solution
      • Discussion
  • 6. Dataproc Metastore
    • Creating a Dataproc Metastore Service Instance
      • Problem
      • Solution
      • Discussion
      • See Also
    • Attaching a DPMS Instance to One or More Clusters
      • Problem
      • Solution
      • Discussion
    • Creating Tables and Verifying Metadata in DPMS
      • Problem
      • Solution
      • Discussion
    • Installing an External Hive Metastore
      • Problem
      • Solution
      • Discussion
    • Attaching an External Apache Hive Metastore to the Cluster
      • Problem
      • Solution
      • Discussion
    • Searching for Metadata in a Dataplex Data Catalog
      • Problem
      • Solution
      • Discussion
    • Automating the Backup of a DPMS Instance
      • Problem
      • Solution
      • Discussion
        • Prerequisites
        • Creating a Cloud Function
        • Creating a Pub/Sub topic
        • Creating a Cloud Scheduler job
  • 7. Connecting from Dataproc to GCP Services
    • Reading from GCS and Writing to a BigQuery Table
      • Problem
      • Solution
      • Discussion
    • Reading from a Cloud SQL Table
      • Problem
      • Solution
      • Discussion
    • Writing to GCS in Delta Format
      • Problem
      • Solution
      • Discussion
    • Integrating a Dataproc-Managed Delta Lake with BigLake
      • Problem
      • Solution
      • Discussion
        • Syncing a Delta Table on Dataproc
        • Creating a BigLake connection
        • Locating connection information
        • Granting GCS permissions for BigLake access
        • Creating the BigLake table
    • Connecting to GCP Services Using Dataproc Templates
      • Problem
      • Solution
      • Discussion
    • Spark Job Running on Dataproc Reading from GCS and Writing to Bigtable
      • Problem
      • Solution
      • Discussion
      • See Also
  • 8. Configuring Logging in Dataproc
    • Understanding Different Types of Logs in Dataproc
      • Problem
      • Solution
      • Discussion
    • Understanding Cloud Logging
      • Problem
      • Solution
      • Discussion
        • Log source
        • Log Router
        • Log sink
        • Logging buckets
    • Viewing Logs in Cloud Logging
      • Problem
      • Solution
      • Discussion
        • Accessing logs from the web UI
        • Viewing logs using gcloud command
        • Searching for logs using the REST API
    • Routing Dataproc Logs to Cloud Logging
      • Problem
      • Solution
      • Discussion
    • Attaching Custom Labels to Logging
      • Problem
      • Solution
      • Discussion
    • Optimizing Cloud Logging Costs
      • Problem
      • Solution
      • Discussion
    • Sinking Logs to BigQuery
      • Problem
      • Solution
      • Discussion
  • 9. Setting Up Monitoring and Dashboards
    • Monitoring Cluster Status
      • Problem
      • Solution
      • Discussion
        • Comprehensive service monitoring
        • Application-centric monitoring
    • Exploring Predefined Metrics Charts
      • Problem
      • Solution
      • Discussion
    • Creating Charts Using Metrics Explorer
      • Problem
      • Solution
      • Discussion
    • Creating Dashboards Using Metrics Explorer
      • Problem
      • Solution
      • Discussion
    • Setting Up Alerts
      • Problem
      • Solution
      • Discussion
    • Migrating Dashboards from One Project to Another
      • Problem
      • Solution
      • Discussion
    • Creating Custom Log-Based Metrics
      • Problem
      • Solution
      • Discussion
        • Creating a log-based metric
        • Viewing metrics data from Metrics Explorer
  • 10. Dataproc Security
    • Managing Identities in Dataproc Clusters
      • Problem
      • Solution
      • Discussion
        • Service account approach
        • Personal cluster authentication
        • Secure multitenancy approach
    • Securing Your Perimeter Using VPC Service Controls
      • Problem
      • Solution
      • Discussion
        • Creating a new VPC service perimeter
        • Allowing outside users access to resources in a secured perimeter
    • Authenticating Using Kerberos
      • Problem
      • Solution
      • Discussion
    • Installing Ranger
      • Problem
      • Solution
      • Discussion
        • Creating an encrypt/decrypt key
        • Encrypting a password using the key
        • Copying the password to the GCS location
    • Securing Cluster Resources Using Ranger
      • Problem
      • Solution
      • Discussion
        • Creating users in Ranger
        • Creating Ranger policies to secure access to folders
        • Creating a Ranger policy
    • Managing Credentials in the Google Cloud Environment
      • Problem
      • Solution
      • Discussion
        • Creating a secret
        • Accessing a secret in a Spark job (PySpark) and using it to connect to a SQL server
        • Hadoop credentials approach
    • Enforcing Restrictions Across All Clusters
      • Problem
      • Solution
      • Discussion
        • Enforcing specific machine types in a Dataproc cluster
        • Enforcing the cost_center label for Dataproc clusters
        • Testing the organization policies
    • Tokenizing Sensitive Data
      • Problem
      • Solution
      • Discussion
        • Creating a deidentified template in DLP
        • Creating a KMS key
        • Creating a deidentification template
        • Tokenizing first name and last name using a Spark job
  • 11. Performance Tuning and Cost Optimization
    • Sizing a Dataproc Cluster
      • Problem
      • Solution
      • Discussion
        • Scenario 1: Configuring a cluster for batch requirements
        • Scenario 2: Designing a streaming application
    • Choosing the Right Disks for Big Data Workloads on Dataproc
      • Problem
      • Solution
      • Discussion
    • Benchmarking Clusters with Performance Tuning
      • Problem
      • Solution
      • Discussion
        • Creating a cluster to run benchmarking jobs
        • Running a TeraGen job to generate 500 GB of data
        • Running TeraSort to sort 500 GB of data generated by TeraGen
        • Capturing benchmarking results
    • Navigating the Spark UI
      • Problem
      • Solution
      • Discussion
    • Optimizing Spark Jobs
      • Problem
      • Solution
      • Discussion
    • Installing Sparklens for Profiling Spark Applications
      • Problem
      • Solution
      • Discussion
        • Driver and executor wall-clock times
        • Critical path
      • See Also
    • Identifying Spark Job Errors and Bottlenecks
      • Problem
      • Solution
      • Discussion
        • Setting up test data
        • Replicating driver OOM
        • Replicating executor OOM
        • Replicating spill to memory (spill to disk)
    • Understanding the YARN UI
      • Problem
      • Solution
      • Discussion
        • Cluster information path
        • Applications information path
        • Scheduler home screen
    • Calculating the Cost of a Dataproc Cluster
      • Problem
      • Solution
      • Discussion
    • Optimizing Cost in Dataproc Clusters
      • Problem
      • Solution
      • Discussion
        • Choosing the right architecture for your Dataproc cluster
        • Tuning the cluster configuration
        • Oversubscribing the cluster
        • Optimizing logging
  • 12. Orchestrating Dataproc Workloads
    • Understanding the Prerequisites for Installing Cloud Composer
      • Problem
      • Solution
      • Discussion
        • Configuring IAM permissions
        • Planning the network and subnet IP ranges
    • Deploying a Cloud Composer Environment
      • Problem
      • Solution
      • Discussion
    • Scheduling a Job in Composer
      • Problem
      • Solution
      • Discussion
    • Parameterizing Variables
      • Problem
      • Solution
      • Discussion
    • Scaling Up a Cloud Composer Environment
      • Problem
      • Solution
      • Discussion
        • Infrastructure
        • DAG-level optimizations
    • Running Spark Jobs Using Vertex AI Machine Learning Pipelines
      • Problem
      • Solution
      • Discussion
    • Scheduling a Dataproc Job in an Event-Driven Manner Using a Cloud Function
      • Problem
      • Solution
      • Discussion
    • Using Dataproc Workflow Templates
      • Problem
      • Solution
      • Discussion
  • 13. Using Spark Notebooks on Dataproc
    • Deciding Which Notebook Environments to Choose
      • Problem
      • Solution
      • Discussion
    • Configuring Notebooks on a Dataproc Cluster
      • Problem
      • Solution
      • Discussion
    • Running Spark Scala and PySpark Notebooks on Dataproc
      • Problem
      • Solution
      • Discussion
        • Running the first Spark Scala application
        • Running the first PySpark application
    • Managing Libraries and Configs
      • Problem
      • Solution
      • Discussion
    • Creating Dataproc-Enabled Vertex AI Workbench Instances
      • Problem
      • Solution
      • Discussion
        • Running Spark notebooks
        • Integrating with Git
        • Mounting a GCS bucket
    • Executing Notebooks Using Spark Serverless Sessions
      • Problem
      • Solution
      • Discussion
  • 14. Migrating from On-Premises and Public Cloud Services to GCP
    • Planning Migration
      • Problem
      • Solution
      • Discussion
        • Compute and cluster planning
        • Hive migration
        • Spark migration
        • Additional migration considerations
          • Architectural decisions
          • IAM policy redesign
          • Reliability and region selection
          • Cost optimization
          • Testing and validation
          • Automation
          • Upskilling
      • See Also
    • Data Migration Strategies
      • Problem
      • Solution
      • Discussion
    • Migrating Data with STS
      • Problem
      • Solution
      • Discussion
        • Preparing the source
        • Preparing the target
        • Creating a storage transfer job
        • Triggering a transfer job manually
        • Scheduling a transfer job
    • Accessing AWS S3 Data Using BigLake Tables
      • Problem
      • Solution
      • Discussion
    • Migrating Metadata
      • Problem
      • Solution
      • Discussion
        • Choosing a target service in GCP
        • Choosing the migration strategy
    • Migrating Applications to Google Cloud
      • Problem
      • Solution
      • Discussion
  • Index
