Delta Lake: Up and Running - Helion

Authors: Bennie Haelen, Dan Davis
ISBN: 9781098139681
Pages: 266, Format: ebook
Publication date: 2023-10-16
Bookstore: Helion

Book price: 186.15 zł (previously: 216.45 zł)
You save: 14% (-30.30 zł)

With the surge in big data and AI, organizations can rapidly create data products. However, the effectiveness of their analytics and machine learning models depends on the data's quality. Delta Lake's open source format offers a robust lakehouse framework over platforms like Amazon S3, ADLS, and GCS.

This practical book shows data engineers, data scientists, and data analysts how to get Delta Lake and its features up and running. The ultimate goal of building data pipelines and applications is to gain insights from data. You'll understand how your storage solution choice determines the robustness and performance of the data pipeline, from raw data to insights.

You'll learn how to:

- Use modern data management and data engineering techniques
- Understand how ACID transactions bring reliability to data lakes at scale
- Run streaming and batch jobs against your data lake concurrently
- Execute update, delete, and merge commands against your data lake
- Use time travel to roll back and examine previous data versions
- Build a streaming data quality pipeline following the medallion architecture

Delta Lake: Up and Running eBook: Table of Contents

Preface
- How to Contact Us
- Conventions Used in This Book
- Using Code Examples
- O'Reilly Online Learning
- Acknowledgment
1. The Evolution of Data Architectures
- A Brief History of Relational Databases
- Data Warehouses
  - Data Warehouse Architecture
  - Dimensional Modeling
- Data Warehouse Benefits and Challenges
- Introducing Data Lakes
- Data Lakehouse
  - Data Lakehouse Benefits
  - Implementing a Lakehouse
- Delta Lake
- The Medallion Architecture
- The Delta Ecosystem
  - Delta Lake Storage
  - Delta Sharing
  - Delta Connectors
- Conclusion
2. Getting Started with Delta Lake
- Getting a Standard Spark Image
- Using Delta Lake with PySpark
- Running Delta Lake in the Spark Scala Shell
- Running Delta Lake on Databricks
- Creating and Running a Spark Program: helloDeltaLake
- The Delta Lake Format
  - Parquet Files
    - Advantages of Parquet files
    - Writing a Parquet file
  - Writing a Delta Table
- The Delta Lake Transaction Log
  - How the Transaction Log Implements Atomicity
  - Breaking Down Transactions into Atomic Commits
  - The Transaction Log at the File Level
    - Write multiple writes to the same file
    - Reading the latest version of a Delta table
    - Failure scenario with a write operation
    - Update scenario
  - Scaling Massive Metadata
    - Checkpoint file example
    - Displaying the checkpoint file
- Conclusion
3. Basic Operations on Delta Tables
- Creating a Delta Table
  - Creating a Delta Table with SQL DDL
  - The DESCRIBE Statement
  - Creating Delta Tables with the DataFrameWriter API
    - Creating a managed table
    - Creating an unmanaged table
  - Creating a Delta Table with the DeltaTableBuilder API
  - Generated Columns
- Reading a Delta Table
  - Reading a Delta Table with SQL
  - Reading a Table with PySpark
- Writing to a Delta Table
  - Cleaning Out the YellowTaxis Table
  - Inserting Data with SQL INSERT
  - Appending a DataFrame to a Table
- Using the OverWrite Mode When Writing to a Delta Table
  - Inserting Data with the SQL COPY INTO Command
  - Partitions
    - Partitioning by a single column
    - Partitioning by multiple columns
    - Checking if a partition exists
    - Selectively updating Delta partitions with replaceWhere
- User-Defined Metadata
  - Using SparkSession to Set Custom Metadata
  - Using the DataFrameWriter to Set Custom Metadata
- Conclusion
4. Table Deletes, Updates, and Merges
- Deleting Data from a Delta Table
  - Table Creation and DESCRIBE HISTORY
  - Performing the DELETE Operation
  - DELETE Performance Tuning Tips
- Updating Data in a Table
  - Use Case Description
  - Updating Data in a Table
  - UPDATE Performance Tuning Tips
- Upsert Data Using the MERGE Operation
  - Use Case Description
  - The MERGE Dataset
  - The MERGE Statement
    - Modifying unmatched rows using MERGE
  - Analyzing the MERGE operation with DESCRIBE HISTORY
  - Inner Workings of the MERGE Operation
- Conclusion
5. Performance Tuning
- Data Skipping
- Partitioning
  - Partitioning Warnings and Considerations
- Compact Files
  - Compaction
  - OPTIMIZE
    - OPTIMIZE considerations
- ZORDER BY
  - ZORDER BY Considerations
- Liquid Clustering
  - Enabling Liquid Clustering
  - Operations on Clustered Columns
    - Changing clustered columns
    - Viewing clustered columns
    - Removing clustered columns
  - Liquid Clustering Warnings and Considerations
- Conclusion
6. Using Time Travel
- Delta Lake Time Travel
  - Restoring a Table
  - Restoring via Timestamp
  - Time Travel Under the Hood
  - RESTORE Considerations and Warnings
  - Querying an Older Version of a Table
- Data Retention
  - Data File Retention
  - Log File Retention
  - Setting File Retention Duration Example
  - Data Archiving
- VACUUM
  - VACUUM Syntax and Examples
  - How Often Should You Run VACUUM and Other Maintenance Tasks?
  - VACUUM Warnings and Considerations
- Changing Data Feed
  - Enabling the CDF
  - Viewing the CDF
  - CDF Warnings and Considerations
- Conclusion
7. Schema Handling
- Schema Validation
  - Viewing the Schema in the Transaction Log Entries
  - Schema on Write
  - Schema Enforcement Example
    - Matching schema
    - Schema with an additional column
- Schema Evolution
  - Adding a Column
  - Missing Data Column in Source DataFrame
  - Changing a Column Data Type
  - Adding a NullType Column
- Explicit Schema Updates
  - Adding a Column to a Table
  - Adding Comments to a Column
  - Changing Column Ordering
  - Delta Lake Column Mapping
  - Renaming a Column
  - Replacing the Table Columns
  - Dropping a Column
  - The REORG TABLE Command
  - Changing Column Data Type or Name
- Conclusion
8. Operations on Streaming Data
- Streaming Overview
  - Spark Structured Streaming
  - Delta Lake and Structured Streaming
- Streaming Examples
  - Hello Streaming World
    - Creating the streaming query
    - The query process log
    - The checkpoint file
  - AvailableNow Streaming
  - Updating the Source Records
    - The StreamingQuery class
    - Reprocessing all or part of the source records
  - Reading a Stream from the Change Data Feed
- Conclusion
9. Delta Sharing
- Conventional Methods of Data Sharing
  - Legacy and Homegrown Solutions
  - Proprietary Vendor Solutions
  - Cloud Object Storage
- Open Source Delta Sharing
  - Delta Sharing Goals
- Delta Sharing Under the Hood
  - Data Providers and Recipients
  - Benefits of the Design
- The delta-sharing Repository
  - Step 1: Installing the Python Connector
  - Step 2: Installing the Profile File
  - Step 3: Reading a Shared Table
- Conclusion
10. Building a Lakehouse on Delta Lake
- Storage Layer
  - What Is a Data Lake?
  - Types of Data
  - Key Benefits of a Cloud Data Lake
- Data Management
- SQL Analytics
  - SQL Analytics via Spark SQL
  - SQL Analytics via Other Delta Lake Integrations
- Data for Data Science and Machine Learning
  - Challenges with Traditional Machine Learning
  - Delta Lake Features That Support Machine Learning
  - Putting It All Together
- Medallion Architecture
  - The Bronze Layer (Raw Data)
  - The Silver Layer
  - The Gold Layer
  - The Complete Lakehouse
- Conclusion
Index