Delta Lake: The Definitive Guide
ISBN: 9781098151904
Pages: 382, Format: ebook
Publication date: 2024-10-30
Bookstore: Helion
Book price: 237,15 zł (previously: 285,72 zł)
You save: 17% (-48,57 zł)
Ready to simplify the process of building data lakehouses and data pipelines at scale? In this practical guide, learn how Delta Lake is helping data engineers, data scientists, and data analysts overcome key data reliability challenges with modern data engineering and management techniques.
Authors Denny Lee, Tristen Wentling, Scott Haines, and Prashanth Babu (with contributions from Delta Lake maintainer R. Tyler Croy) share expert insights on all things Delta Lake--including how to run batch and streaming jobs concurrently and accelerate the usability of your data. You'll also uncover how ACID transactions bring reliability to data lakehouses at scale.
This book helps you:
- Understand key data reliability challenges and how Delta Lake solves them
- Explain the critical role of Delta transaction logs as a single source of truth
- Learn the Delta Lake ecosystem with technologies like Apache Flink, Kafka, and Trino
- Architect data lakehouses with the medallion architecture
- Optimize Delta Lake performance with features like deletion vectors and liquid clustering
Customers who bought "Delta Lake: The Definitive Guide" also chose:
- Biologika Sukcesji Pokoleniowej. Sezon 2 - list price 117,27 zł, now 12,90 zł (-89%)
- Biologika Sukcesji Pokoleniowej. Sezon I - list price 117,27 zł, now 12,90 zł (-89%)
- Windows Media Center. Domowe centrum rozrywki - list price 66,67 zł, now 8,00 zł (-88%)
- Podręcznik startupu. Budowa wielkiej firmy krok po kroku - list price 92,14 zł, now 12,90 zł (-86%)
- Ruby on Rails. Ćwiczenia - list price 18,75 zł, now 3,00 zł (-84%)
Delta Lake: The Definitive Guide eBook -- Table of Contents
- Foreword by Michael Armbrust
- Foreword by Dominique Brezinski
- Preface
- Who This Book Is For
- How This Book Is Organized
- Conventions Used in This Book
- Using Code Examples
- O'Reilly Online Learning
- How to Contact Us
- Acknowledgments
- Denny
- Tristen
- Scott
- Prashanth
- 1. Introduction to the Delta Lake Lakehouse Format
- The Genesis of Delta Lake
- Data Warehousing, Data Lakes, and Data Lakehouses
- Data warehousing
- Data lakes
- Lakehouses (or data lakehouses)
- Project Tahoe to Delta Lake: The Early Years
- What Is Delta Lake?
- Common Use Cases
- Key Features
- Anatomy of a Delta Lake Table
- Delta Transaction Protocol
- Understanding the Delta Lake Transaction Log at the File Level
- The Single Source of Truth
- The Relationship Between Metadata and Data
- Multiversion Concurrency Control (MVCC) File and Data Observations
- Observing the Interaction Between the Metadata and Data
- Table Features
- Delta Kernel
- Delta UniForm
- Conclusion
- 2. Installing Delta Lake
- Delta Lake Docker Image
- Delta Lake for Python
- PySpark Shell
- JupyterLab Notebook
- Scala Shell
- Delta Rust API
- ROAPI
- Native Delta Lake Libraries
- Multiple Bindings Available
- Installing the Delta Lake Python Package
- Apache Spark with Delta Lake
- Setting Up Delta Lake with Apache Spark
- Prerequisite: Set Up Java
- Setting Up an Interactive Shell
- Spark SQL shell
- PySpark shell
- Spark Scala shell
- PySpark Declarative API
- Databricks Community Edition
- Create a Cluster with Databricks Runtime
- Importing Notebooks
- Attaching Notebooks
- Conclusion
- 3. Essential Delta Lake Operations
- Create
- Creating a Delta Lake Table
- Loading Data into a Delta Lake Table
- INSERT INTO
- Append
- CREATE TABLE AS SELECT
- The Transaction Log
- Read
- Querying Data from a Delta Lake Table
- Reading with Time Travel
- Update
- Delete
- Deleting Data from a Delta Lake Table
- Overwriting Data in a Delta Lake Table
- The replace method
- Overwrite mode
- INSERT OVERWRITE
- Merge
- Other Useful Actions
- Parquet Conversions
- Regular Parquet conversion
- Iceberg conversion
- Delta Lake Metadata and History
- Conclusion
- 4. Diving into the Delta Lake Ecosystem
- Connectors
- Apache Flink
- Flink DataStream Connector
- Installing the Connector
- DeltaSource API
- Bounded mode
- Builder options
- Generating the bounded source
- Continuous mode
- Builder options
- Generating the continuous source
- Table schema discovery
- Using the DeltaSource
- Bounded mode
- DeltaSink API
- Builder options
- Exactly-once guarantees
- End-to-End Example
- Kafka Delta Ingest
- Install Rust
- Build the Project
- Set up your local environment
- Build the connector
- Run the Ingestion Flow
- Trino
- Getting Started
- Connector requirements
- Running the Hive Metastore
- Configuring and Using the Trino Connector
- Using Show Catalogs
- Creating a Schema
- Show Schemas
- Working with Tables
- Data types
- CREATE TABLE options
- Creating tables
- Listing tables
- Inspecting tables
- Using INSERT
- Querying Delta tables
- Updating rows
- Creating tables with selection
- Table Operations
- Vacuum
- Table optimization
- Metadata tables
- Table history
- Change Data Feed
- Viewing table properties
- Modifying table properties
- Deleting tables
- Conclusion
- 5. Maintaining Your Delta Lake
- Using Delta Lake Table Properties
- Delta Lake Table Properties Reference
- Create an Empty Table with Properties
- Populate the Table
- Evolve the Table Schema
- Add or Modify Table Properties
- Remove Table Properties
- Delta Lake Table Optimization
- The Problem with Big Tables and Small Files
- Creating the small file problem
- Using OPTIMIZE to Fix the Small File Problem
- OPTIMIZE
- Z-Order Optimize
- Table Tuning and Management
- Partitioning Your Tables
- Table partitioning rules
- Choose the right partition column
- Defining Partitions on Table Creation
- Migrating from a Nonpartitioned to a Partitioned Table
- Partition metadata management
- Viewing partition metadata
- Repairing, Restoring, and Replacing Table Data
- Recovering and Replacing Tables
- Deleting Data and Removing Partitions
- The Life Cycle of a Delta Lake Table
- Restoring Your Table
- Cleaning Up
- Vacuum
- Dropping tables
- Removing all traces of a Delta Lake Table
- Conclusion
- 6. Building Native Applications with Delta Lake
- Getting Started
- Python
- Reading large datasets
- Partitions
- File statistics
- Writing data
- Merging/updating
- Going beyond Pandas
- RecordBatch
- Table
- DataSet
- Rust
- Reading large data
- Writing data
- Merging/updating
- Building a Lambda
- Python
- Rust
- Concurrent writes on AWS S3
- S3DynamoDBLogStore
- DynamoDB lock
- Concurrency with S3-compatible stores
- Python
- What's Next
- 7. Streaming In and Out of Your Delta Lake
- Streaming and Delta Lake
- Streaming Versus Batch Processing
- Streaming terminology
- Source
- Sink
- Checkpoint
- Watermark
- Apache Flink
- Apache Spark
- Delta-rs
- Delta as Source
- Delta as Sink
- Delta Streaming Options
- Limit the Input Rate
- Ignore Updates or Deletes
- The ignoreDeletes setting
- The ignoreChanges setting
- Example
- Initial Processing Position
- Initial Snapshot with withEventTimeOrder
- Advanced Usage with Apache Spark
- Idempotent Stream Writes
- Idempotent writes
- Merge
- Delta Lake Performance Metrics
- Metrics
- Custom metrics
- Auto Loader and Delta Live Tables
- Auto Loader
- Delta Live Tables
- Change Data Feed
- Using Change Data Feed
- Enabling the change data feed
- Reading the changes feed
- Specifying boundaries for batch processes
- Specifying boundaries for streaming processes
- Schema
- Conclusion
- 8. Advanced Features
- Generated Columns, Keys, and IDs
- Comments and Constraints
- Comments
- Delta Table Constraints
- Deletion Vectors
- Merge-on-Read
- Stepping Through Deletion Vectors
- Conclusion
- 9. Architecting Your Lakehouse
- The Lakehouse Architecture
- What Is a Lakehouse?
- Learning from Data Warehouses
- Learning from Data Lakes
- The Dual-Tier Data Architecture
- Lakehouse Architecture
- Foundations with Delta Lake
- Open Source on Open Standards in an Open Ecosystem
- Open file format
- Self-describing table metadata
- Open table specification
- Delta Universal Format (UniForm)
- Transaction Support
- Serializable writes
- Snapshot isolation for reads
- Support for incremental processing
- Support for time travel
- Schema Enforcement and Governance
- Schema-on-write
- Schema-on-read
- Separation between storage and compute
- Support for transactional streaming
- Unified access for analytical and ML workloads
- The Delta Sharing Protocol
- The Medallion Architecture
- Exploring the Bronze Layer
- Minimal transformations and augmentation
- Exploring the Silver Layer
- Used for cleaning and filtering data
- Establishes a layer for augmenting data
- Enable data quality checks and balances
- Exploring the Gold Layer
- Establishes high trust and high consistency
- Streaming Medallion Architecture
- Conclusion
- 10. Performance Tuning: Optimizing Your Data Pipelines with Delta Lake
- Performance Objectives
- Maximizing Read Performance
- Point queries
- Range queries
- Aggregations
- Maximizing Write Performance
- Trade-offs
- Conflict avoidance
- Performance Considerations
- Partitioning
- Structure
- Pitfalls
- File sizes
- Table Utilities
- OPTIMIZE
- Z-Ordering
- Optimization automation in Spark
- Autocompaction
- Optimized writes
- Vacuum
- Databricks autotuning
- Table Statistics
- How statistics help
- File statistics
- Partition pruning and data skipping
- Z-Ordering revisited
- Lead by example
- Cluster By
- Explanation
- Example
- Bloom Filter Index
- A deeper look
- Configuration
- Conclusion
- 11. Successful Design Patterns
- Slashing Compute Costs
- High-Speed Solutions
- Smart Device Integration
- Comcast's smart remote
- Earlier attempts
- Delta Lake reduces the complexity
- Efficient Streaming Ingestion
- Streaming Ingestion
- The Inception of Delta Rust
- The Evolution of Ingestion
- Coordinating Complex Systems
- Combining Operational Data Stores at DoorDash
- Change Data Capture
- Delta and Flink in Harmony
- Conclusion
- 12. Foundations of Lakehouse Governance and Security
- Lakehouse Governance
- The Emergence of Data Governance
- Data Products and Their Relationship to Data Assets
- Data Products in the Lakehouse
- Maintaining High Trust
- Data Assets and Access
- The Data Asset Model
- Unifying Governance Between Data Warehouses and Lakes
- Permissions Management
- Filesystem Permissions
- Cloud Object Store Access Controls
- Identity and Access Management
- Identity
- Authentication
- Authorization
- Access management
- Data Security
- Role-based access controls
- Establishing roles around personas
- Data classification
- Data assets and policy-as-code
- Create an S3 bucket
- Create an S3 Access Grants instance
- Create the trust policy
- Create the S3 data access policy
- Applying policies at the role level
- Read
- ReadWrite
- Admin
- Limitations of RBAC
- Fine-Grained Access Controls for the Lakehouse
- Conclusion
- 13. Metadata Management, Data Flow, and Lineage
- Metadata Management
- What Is Metadata Management?
- Data Catalogs
- Data Reliability, Stewards, and Permissions Management
- Why the Metastore Matters
- Unity Catalog
- Data Flow and Lineage
- Data Lineage
- Data application or workflow lineage
- Use case: Automating data lineage using OpenLineage
- Getting started with OpenLineage
- Data Sharing
- Automating Data Life Cycles
- Using table properties to manage data life cycles
- Add the retention policy to the Delta table
- Audit Logging
- Monitoring and Alerting
- General compliance monitoring
- Data quality and pipeline degradations
- What Is Data Discovery?
- Conclusion
- 14. Data Sharing with the Delta Sharing Protocol
- The Basics of Delta Sharing
- Data Providers
- Data Recipients
- Delta Sharing Server
- Using the REST APIs
- Anatomy of the REST URI
- List Shares
- Get Share
- List Schemas in Share
- List tables in schema
- List All Tables in Share
- Delta Sharing Clients
- Delta Sharing with Apache Spark
- PySpark client
- Spark Scala client
- Spark SQL client
- Stream Processing with Delta Shares
- Delta Sharing Community Connectors
- Conclusion
- Index