Apache Hudi: The Definitive Guide. Building Robust, Open, and High-Performing Data Lakehouses - Helion

ISBN: 9781098173791
Pages: 290, Format: ebook
Publication date: 2025-10-24
Bookstore: Helion
Price: 29.90 zł (previously: 230.00 zł)
You save: 87% (-200.10 zł)
Overcome challenges in building transactional guarantees on rapidly changing data by using Apache Hudi. With this practical guide, data engineers, data architects, and software architects will discover how to seamlessly build an interoperable lakehouse from disparate data sources and deliver faster insights using your query engine of choice.
Authors Shiyan Xu, Prashant Wason, Bhavani Sudha Saktheeswaran, and Rebecca Bilbro provide practical examples and insights to help you unlock the full potential of data lakehouses for different levels of analytics, from batch to interactive to streaming. You'll also learn how to evaluate storage choices and leverage built-in automated table optimizations to build, maintain, and operate production data applications.
This book helps you:
- Understand the need for transactional data lakehouses and the challenges associated with building them
- Explore data ecosystem support provided by Apache Hudi for popular data sources and query engines
- Perform different write and read operations on Apache Hudi tables and effectively use them for various use cases, including batch and stream applications (see the sketch after this list)
- Apply different storage techniques and considerations such as indexing and clustering to maximize your lakehouse performance
- Build end-to-end incremental data pipelines using Apache Hudi for faster ingestion and fresher analytics
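To give a concrete feel for the write and read operations listed above, here is a minimal PySpark sketch of upserting a few records into a Hudi table and reading the result back. This is an illustrative example, not code from the book: the table name, storage path, and columns (uuid, rider, fare, ts) are hypothetical, and it assumes a Spark session with the Apache Hudi Spark bundle on the classpath.

```python
from pyspark.sql import SparkSession

# Assumes the Hudi Spark bundle is available to Spark
# (e.g., via --packages org.apache.hudi:hudi-spark3-bundle_2.12:<version>).
spark = (
    SparkSession.builder
    .appName("hudi-upsert-sketch")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .config("spark.sql.extensions",
            "org.apache.spark.sql.hudi.HoodieSparkSessionExtension")
    .getOrCreate()
)

# Illustrative records: uuid is the record key, ts orders duplicates on merge.
df = spark.createDataFrame(
    [("trip-001", "rider-A", 19.10, 1700000000),
     ("trip-002", "rider-B", 27.70, 1700000050)],
    ["uuid", "rider", "fare", "ts"],
)

hudi_options = {
    "hoodie.table.name": "trips",                        # hypothetical table name
    "hoodie.datasource.write.recordkey.field": "uuid",   # record key field
    "hoodie.datasource.write.precombine.field": "ts",    # precombine/ordering field
    "hoodie.datasource.write.operation": "upsert",       # insert-or-update semantics
}

# Overwrite mode initializes the table on the first write;
# later batches would typically use append mode.
(df.write.format("hudi")
   .options(**hudi_options)
   .mode("overwrite")
   .save("/tmp/hudi/trips"))

# Snapshot read of the table's latest state.
spark.read.format("hudi").load("/tmp/hudi/trips").show()
```

Chapters 2 through 4 walk through these write and read paths in depth, including merge-on-read tables, incremental queries, and time travel.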
Table of Contents
- 1. What Is Apache Hudi?
- The Evolution of Data Management Architectures
- The Rise of Data Lakehouses
- Uber's Transactional Data Lake Problem
- What Is Hudi?
- The Hudi Stack
- Native Table Format
- Pluggable Table Format
- Storage Engine
- Indexes
- Lake cache (currently in development)
- Concurrency control
- Table services
- Programming API
- Writers
- Readers
- User Access
- SQL
- Code
- Shared Platform Components
- Hudi in the Real World
- Summary
- 2. Getting Started with Hudi
- Basic Operations
- Create the Table
- Initial table layout
- Record key fields
- Partition fields
- Insert, Update, Delete, and Fetch Records
- COW table layout after writes
- Timeline, actions, and instants
- Action timestamps
- Action types
- Action states
- Choose a Table Type
- Create a Merge-on-Read Table
- MOR Tables Layout After Writes
- Base files and log files
- File groups and file slices
- File slices in COW tables
- File slices in MOR tables
- File slicing
- Copy-on-Write Versus Merge-on-Read
- COW tables update process
- MOR tables update process
- The trade-offs
- Advanced Usage
- Create Table As Select
- Merge Source Data into the Table
- Update and Delete Using Nonrecord Key Fields
- Time Travel Query
- Incremental Query
- Summary
- 3. Writing to Hudi
- Breaking Down the Write Flow
- Start Commit
- Prepare Records
- Merging duplicate records
- Indexing
- Partition Records
- Write to Storage
- Commit Changes
- Summarize the upsert Flow
- Exploring Write Operations
- Define Table Properties
- Use INSERT INTO
- Insert versus bulk insert
- Small-file handling for insert and upsert operations
- Sort modes in bulk_insert
- Execute as upsert
- Perform Partial Merge with MERGE INTO
- Perform Deletion
- Delete partitions efficiently
- Overwrite Partition or Table
- Highlighting Noteworthy Features
- Key Generators
- Merge Modes
- Schema Evolution on Write
- Bootstrapping
- Summary
- 4. Reading from Hudi
- Integrating with Query Engines
- Query Lifecycle
- Data Catalog
- Hudi Integration
- Get pruned file slices
- Read file slices efficiently
- Exploring Query Types
- Snapshot Query
- Time Travel Query
- Incremental Query: Latest-State Mode
- Record-level change tracking
- The parameters
- Incremental Query: The Change Data Capture Mode
- Highlighting Noteworthy Features
- Streaming Read
- Schema Evolution on Read
- Read Using Rust or Python
- Summary
- 5. Achieving Efficiency with Indexing
- Overview of the Indexes in Hudi
- Index Acceleration for Writes
- General-Purpose Multimodal Indexing
- Index storage with the metadata table
- The record index
- Writer-Side Indexes
- The bucket index
- The simple index
- The bloom index
- Comparison of Writer Indexing Choices
- Index Acceleration for Reads
- Data Skipping
- The files index
- The column_stats and partition_stats indexes
- The pruning process
- Equality Matching
- The record index
- The secondary index
- Indexing on Expressions
- The expression index with column_stats
- The expression index with bloom_filter
- Build the Right Indexes
- Summary
- 6. Maintaining and Optimizing Hudi Tables
- Table Service Overview
- Deployment Mode: Inline
- Deployment Mode: Async Execution
- Deployment Mode: Standalone
- Choosing a Suitable Mode
- Compaction
- Schedule Compaction
- Execute Compaction
- Clustering
- Schedule Clustering
- Execute Clustering
- Layout Optimization Strategies
- Linear sorting
- Space-filling curves
- Clustering Versus Compaction
- Cleaning
- Schedule Cleaning
- Execute Cleaning
- Indexing
- Summary
- 7. Concurrency Control in Hudi
- Why Concurrency Control Is Harder in Data Lakehouses
- Concurrency Control Techniques
- Multiwriter Scenarios
- Why Multiwriters Are Necessary
- Multiwriter Scenarios for OCC
- Backfilling data
- Deleting older data
- Post-processing data using clustering services
- Scaling ingestion/ETL
- Multiwriter Scenarios for NBCC and MVCC
- Scenarios with overlapping data modifications
- High-contention workloads
- The Simple Default: Single Writer with Table Services
- Single writer with inline table services
- Single writer with async table services
- How Hudi Handles Concurrency Control
- The Foundations of Hudi's Concurrency Control
- Snapshot isolation
- OCC
- MVCC
- NBCC
- The Three-Step Commit Process
- Conflict Detection and Resolution
- Locking Mechanisms
- Challenges in Multiwriter Systems
- Using Multiwriter Support in Hudi
- Enabling Multiwriter Support
- Configuring the Locking Mechanism
- ZooKeeper-based locking
- Hive Metastore-based locking
- DynamoDB-based locking
- Storage-based locking
- Multiwriters Using Hudi Streamer
- Multiwriters Using Spark Data Source Writer
- Single Writer and Multiple Table Services
- Disabling Multiwriter Support
- Tips and Best Practices
- Implement Partitioning and File Grouping
- Enable Early Conflict Detection
- Optimize Locking Mechanisms
- Use Async Table Services
- Reduce Write Conflicts and Wasted Resources
- Prevent Data Duplication When Using Multiple Writers
- Summary
- 8. Building a Lakehouse Using Hudi Streamer
- Alcubierre's Data Silo Woes
- Data Quality Assurance and Deduplication
- Heterogeneous Data and Schema Evolution
- Data Management, Localization, and Consistency
- Problem Recap
- Lakehouse Architecture to the Rescue
- What Is Hudi Streamer?
- Getting Started with Hudi Streamer
- Ingesting Data from S3
- Ingesting Data from Kafka
- Handling schema evolution
- Normalizing timestamps
- Ingesting Data from RDBMS
- Hudi Streamer in Action
- Preparing the Upstream Source
- Creating the first batch of data
- Setting up the Kafka stack
- Starting the Debezium connector tasks
- Setting Up Hudi Streamer
- Configuring the source
- Configuring the Hudi writer
- Working with data catalogs
- Configuring the data catalog sync
- Triggering CDC
- Unlocking the Power of Analytics
- Verifying the data using SQL
- Visualizing the data using dashboards
- Exploring the Hudi Streamer Options
- General Options
- Source Options
- Operational Options
- Operation modes
- Minimum sync interval
- Graceful termination
- Other operational options
- Summary
- 9. Running Hudi in Production
- Operating with Ease
- Getting to Know the CLI
- Understanding the setup
- Checking table info
- Inspecting commits
- Inspecting file slices and statistics
- Managing table services
- Performing Table Operations
- Using savepoint and restore
- Understanding savepoints
- Using the restore process
- Using savepoint and restore via the Hudi CLI
- Repairing data with deduplication
- Changing table types
- Changing from COW to MOR
- Changing from MOR to COW
- Upgrading and downgrading table versions
- Understanding table versions
- Upgrading a table version
- Downgrading a table version
- Integrating into the Platform
- Triggering Post-Commit Callbacks
- HTTP endpoints
- Kafka endpoints
- Pulsar endpoints
- Wiring Up Monitoring Systems
- Enabling metrics in Hudi
- Available metrics
- Integration examples
- Prometheus and Grafana
- Datadog
- AWS CloudWatch
- Building custom metrics dashboards
- Best practices for monitoring
- Syncing with Catalogs
- Catalog synchronization
- Metadata versioning
- Supported catalog integrations
- Example: Hive Metastore sync with HMS mode
- Using multiple catalog syncs
- Performance Tuning
- Fundamental Tuning Principles
- Table type selection
- File sizing
- Partitioning
- Write Performance Tuning
- Tuning parallelism
- Tuning indexes
- Bulk insert optimizations
- Read Performance Tuning
- Data skipping with the metadata table
- Query types for MOR tables
- Table Services Tuning
- Compaction
- Clustering
- Cleaning
- Summary
- 10. Building an End-to-End Lakehouse Solution
- Architecture Overview
- RetailMax Corp: A Real-World Lakehouse Scenario
- Architecture: Implementing Medallion with Hudi
- Configuring RetailMax's Hudi Tables
- Primary Keys (hoodie.datasource.write.recordkey.field)
- Precombine Keys (hoodie.datasource.write.precombine.field)
- Partitioning (hoodie.datasource.write.partitionpath.field)
- Table Types (COW Versus MOR)
- Bronze Layer: Ingesting Upstream Data
- Setting Up Upstream Data Sources
- Streaming Mutable, Transactional Data with Debezium, Flink, and Hudi
- Capturing CDC from PostgreSQL
- Processing CDC events with Flink
- Writing to Bronze Hudi tables
- Handling schema evolution
- Ingesting Application Event Streams with Hudi Kafka Connect Sink
- Using the Hudi Sink Connector for Kafka Connect
- Transaction coordination and performance
- Silver Layer: Creating Derived Datasets
- Goals of the Silver Layer for RetailMax
- Stream-Based Transformations with Hudi Streamer
- Batch and Incremental Transformations with Spark SQL
- Maintaining Data Quality and Consistency in the Silver Layer
- Gold Layer: Querying the Lakehouse for Insights
- Interactive Analytics with Trino
- Batch Analytics and Reporting with Spark SQL
- Advanced Querying: Time Travel and Point-in-Time Analysis
- Business Layer: AI-Driven Insights for RetailMax
- Preparing Data for AI/Machine Learning in the Gold Layer
- Building a Knowledge Base for LLM-Powered Applications with Ray and Hudi
- Operationalizing and Optimizing the Hudi Lakehouse
- Concurrency Control and Multiwriter Scenarios
- Monitoring the Lakehouse
- Disaster Recovery and Data Resilience
- Performance Benchmarks and Considerations
- Summary





