Apache Iceberg: The Definitive Guide - Helion
ISBN: 9781098148584
Pages: 344, Format: ebook
Publication date: 2024-05-02
Bookstore: Helion
Book price: 29,90 zł (previously: 249,17 zł)
You save: 88% (-219,27 zł)
Traditional data architecture patterns are severely limited. To use these patterns, you have to ETL data into each tool—a cost-prohibitive process for making warehouse features available to all of your data. The lack of flexibility with these patterns requires you to lock into a set of priority tools and formats, which creates data silos and data drift. This practical book shows you a better way.
Apache Iceberg provides the capabilities, performance, scalability, and savings that fulfill the promise of an open data lakehouse. By following the lessons in this book, you'll be able to achieve interactive, batch, machine learning, and streaming analytics with this high-performance open source format. Authors Tomer Shiran, Jason Hughes, and Alex Merced from Dremio show you how to get started with Iceberg.
With this book, you'll learn:
- The architecture of Apache Iceberg tables
- What happens under the hood when you perform operations on Iceberg tables
- How to further optimize Iceberg tables for maximum performance
- How to use Iceberg with popular data engines such as Apache Spark, Apache Flink, and Dremio
Discover why Apache Iceberg is a foundational technology for implementing an open data lakehouse.
Table of contents
Apache Iceberg: The Definitive Guide eBook -- table of contents
- Foreword by Gerrit Kazmaier
- Foreword by Raghu Ramakrishnan
- Foreword by Rick Sears
- Preface
- About This Book
- Why We Wrote This Book
- What You Will Find Inside
- How to Use This Book
- Feedback and Questions
- Conventions Used in This Book
- Using Code Examples
- O'Reilly Online Learning
- How to Contact Us
- Acknowledgments
- I. Fundamentals of Apache Iceberg
- 1. Introduction to Apache Iceberg
- How Did We Get Here? A Brief History
- Foundational Components of a System Designed for OLAP Workloads
- Storage
- File format
- Table format
- Storage engine
- Catalog
- Compute engine
- Bringing It All Together
- The Data Warehouse
- A Brief History
- Pros and Cons of a Data Warehouse
- The Data Lake
- A Brief History
- Pros and Cons of a Data Lake
- Should I Run Analytics on a Data Lake or a Data Warehouse?
- The Data Lakehouse
- What Is a Table Format?
- Hive: The Original Table Format
- Modern Data Lake Table Formats
- What Is Apache Iceberg?
- How Apache Iceberg Came to Be
- The Apache Iceberg Architecture
- Key Features of Apache Iceberg
- ACID transactions
- Partition evolution
- Hidden partitioning
- Row-level table operations
- Time travel
- Version rollback
- Schema evolution
- Conclusion
- 2. The Architecture of Apache Iceberg
- The Data Layer
- Datafiles
- Delete Files
- Positional delete files
- Equality delete files
- The Metadata Layer
- Manifest Files
- Manifest Lists
- Metadata Files
- Puffin Files
- The Catalog
- Conclusion
- 3. Lifecycle of Write and Read Queries
- Writing Queries in Apache Iceberg
- Create the Table
- Send the query to the engine
- Write the metadata file
- Update the catalog file to commit changes
- Insert the Query
- Send the query to the engine
- Check the catalog
- Write the datafiles and metadata files
- Update the catalog file to commit changes
- Merge Query
- Send the query to the engine
- Check the catalog
- Write datafiles and metadata files
- Update the catalog file to commit changes
- Reading Queries in Apache Iceberg
- The SELECT Query
- Send the query to the engine
- Check the catalog
- Get information from the metadata file
- Get information from the manifest list
- Get information from the manifest file
- The Time-Travel Query
- Send the query to the engine
- Check the catalog
- Get information from the metadata file
- Get information from the manifest list
- Get information from the manifest file
- Conclusion
- 4. Optimizing the Performance of Iceberg Tables
- Compaction
- Hands-on with Compaction
- Compaction Strategies
- Automating Compaction
- Sorting
- Z-order
- Partitioning
- Hidden Partitioning
- Partition Evolution
- Other Partitioning Considerations
- Copy-on-Write Versus Merge-on-Read
- Copy-on-Write
- Merge-on-Read
- Configuring COW and MOR
- Other Considerations
- Metrics Collection
- Rewriting Manifests
- Optimizing Storage
- Write Distribution Mode
- Object Storage Considerations
- Datafile Bloom Filters
- Conclusion
- 5. Iceberg Catalogs
- Requirements of an Iceberg Catalog
- Catalog Comparison
- The Hadoop Catalog
- Pros and cons of the Hadoop catalog
- Hadoop catalog use cases
- Configuring Spark to use the Hadoop catalog
- The Hive Catalog
- Pros and cons of the Hive catalog
- Hive catalog use cases
- Configuring Spark to use the Hive catalog
- The AWS Glue Catalog
- Pros and cons of the AWS Glue catalog
- AWS Glue catalog use cases
- Configuring Spark to use the AWS Glue catalog
- The Nessie Catalog
- Pros and cons of the Nessie catalog
- Nessie catalog use cases
- Configuring Spark to use the Project Nessie catalog
- The REST Catalog
- Pros and cons of the REST catalog
- REST catalog use cases
- Configuring Spark to use the REST catalog
- The JDBC Catalog
- Pros and cons of the JDBC catalog
- JDBC catalog use cases
- Configuring Spark to use the JDBC catalog
- Other Catalogs
- Catalog Migration
- Using the Apache Iceberg Catalog Migration CLI
- Using an Engine
- register_table()
- snapshot()
- Conclusion
- II. Hands-on with Apache Iceberg
- 6. Apache Spark
- Configuration
- Configuring Apache Iceberg and Spark
- Configuring via the CLI
- Configuring via Python code (PySpark)
- Configuring the Catalogs
- Using org.apache.iceberg.spark.SparkCatalog
- Using org.apache.iceberg.spark.SparkSessionCatalog
- Using a custom catalog
- Starting Spark with All the Configurations (AWS Glue Example)
- Data Definition Language Operations
- CREATE TABLE
- Create a table with partitions
- Use the CREATE TABLE...AS SELECT statement
- ALTER TABLE
- Rename a table
- Set table properties
- Add a column
- Rename a column
- Modify a column
- Drop a column
- Alter a Table with Iceberg's Spark SQL Extensions
- Add/drop/replace a partition
- Set the write order
- Set the write distribution
- Set/drop identifier fields
- DROP TABLE
- Reading Data
- The Select All Query
- The Filter Rows Query
- Aggregation Queries
- Count the records
- Find the average
- Sum the values
- Find the maximum
- Using Window Functions
- Writing Data
- INSERT INTO
- MERGE INTO
- INSERT OVERWRITE
- Static overwrite
- Dynamic overwrite
- DELETE FROM
- UPDATE
- Iceberg Table Maintenance Procedures
- Expire Snapshots
- Rewrite Datafiles
- Rewrite Manifests
- Remove Orphan Files
- Conclusion
- 7. Dremio's SQL Query Engine
- Configuration
- Data Definition Language Operations
- CREATE TABLE
- CREATE TABLE...AS SELECT
- CREATE TABLE with partitioning and sorting
- CREATE TABLE with row access and column masking
- ALTER TABLE
- ADD COLUMNS
- MODIFY COLUMN
- ALTER COLUMN
- DROP COLUMN
- DROP TABLE
- Reading Data
- Using the SELECT Query
- Filtering Rows
- Using Aggregated Queries
- Count records
- Find the average
- Sum the value
- Find the maximum
- Using Window Functions
- Writing Data
- INSERT INTO
- COPY INTO
- MERGE INTO
- DELETE
- UPDATE
- Iceberg Table Maintenance
- Expire Snapshots
- Rewrite Datafiles
- Rewrite Manifests
- Conclusion
- 8. AWS Glue
- Configuration
- Creating a Glue Database
- Configuring the Glue ETL Job
- Configure the data source
- Basic properties
- Advanced properties
- Create a Table Using the Glue Data Catalog
- Read the Table
- Insert the Data
- Conclusion
- 9. Apache Flink
- Configuration
- Prerequisites
- Start the Flink Cluster and Flink SQL Client
- Data Definition Language Operations
- CREATE CATALOG
- The Hadoop catalog
- The Hive catalog
- Custom catalogs
- CREATE DATABASE
- CREATE TABLE
- CREATE TABLE...PARTITIONED BY
- CREATE TABLE...LIKE
- ALTER TABLE
- DROP TABLE
- Reading Data
- Flink SQL Batch Read
- Flink SQL Streaming Read
- Metadata Table
- History
- Metadata logs
- Snapshots
- Writing Data
- INSERT INTO
- INSERT OVERWRITE
- UPSERT
- Flink DataFrame and Table API with Apache Iceberg Tables
- Prerequisites
- Configuring the Flink Job
- Starting the Cluster and Building the Package
- Running the Job
- Conclusion
- III. Apache Iceberg in Practice
- 10. Apache Iceberg in Production
- Apache Iceberg Metadata Tables
- The history Metadata Table
- The metadata_log_entries Metadata Table
- The snapshots Metadata Table
- The files Metadata Table
- The manifests Metadata Table
- The partitions Metadata Table
- The all_data_files Metadata Table
- The all_manifests Metadata Table
- The refs Metadata Table
- The entries Metadata Table
- Using the Metadata Tables in Conjunction
- Get data on all the files added in a snapshot
- Get a detailed overview of the lifecycle of a particular datafile
- Track the evolution of the table by partition across snapshots
- Monitor files associated with a particular branch
- Find file differences between two branches of a table
- Find the growth in storage by the latest snapshot of each branch
- Isolation of Changes with Branches
- Table Branching and Tagging
- Table branching
- Table tagging
- Catalog Branching and Tagging
- Catalog branching
- Catalog tagging
- Multitable Transactions
- Rolling Back Changes
- Rolling Back at the Table Level
- rollback_to_snapshot
- rollback_to_timestamp
- set_current_snapshot
- cherrypick_snapshot
- Rolling Back at the Catalog Level
- Conclusion
- 11. Streaming with Apache Iceberg
- Streaming with Spark
- Streaming into Iceberg with Spark
- Streaming from Iceberg with Spark
- Streaming with Flink
- Streaming into Iceberg with Flink
- Flink for stream reading
- Flink for stream writing
- Example of Streaming into Iceberg with Flink
- Streaming with Kafka Connect
- The Iceberg Kafka Sink
- Configuring the Apache Iceberg Kafka sink
- Setting up Kafka Connect with Apache Iceberg
- Streaming with AWS
- Conclusion
- 12. Governance and Security
- Securing Datafiles
- Securing Files: Best Practices
- Hadoop Distributed File System
- Access control lists
- Encryption
- Permissions
- Amazon Simple Storage Service
- Encryption
- SSE-S3 (SSE with S3-managed keys)
- SSE-KMS (SSE with the AWS Key Management Service)
- SSE-C (SSE with customer-provided keys)
- Bucket policies
- Identity and Access Management
- Object ACLs
- Azure Data Lake Storage
- ADLS encryption
- Role-Based Access Control
- ACLs
- Google Cloud Storage
- Encryption at rest and in transit
- Identity and access management
- Bucket policies
- Object ACLs
- Securing and Governing at the Semantic Layer
- Semantic Layer Best Practices
- Dremio
- Data lineage of virtual datasets
- Built-in wiki for documentation
- Role-, column-, and row-based access rules
- Role-based access control
- Column-based access control
- Row-based column access
- Trino
- Securing and Governing at the Catalog Level
- Nessie
- Tabular
- AWS Glue and Lake Formation
- Define data categories with tags
- Define data access policies with TBAC
- Apply policies to datasets
- Monitor and audit access
- Review and revise policies as needed
- Leverage integration with other AWS services
- Additional Security and Governance Considerations
- Conclusion
- 13. Migrating to Apache Iceberg
- Migration Considerations
- Three-Step In-Place Migration Plan
- Four-Phase Shadow Migration Plan
- Migrating Hive Tables to Apache Iceberg
- The Snapshot Procedure
- The Migrate Procedure
- Migrating Delta Lake to Apache Iceberg
- Migrating Apache Hudi to Apache Iceberg
- Migrating Individual Files to Apache Iceberg
- Using the add_files Procedure
- Migrating from Delta Lake or Apache Hudi Without Preserving History
- Migrating from Anywhere by Rewriting Data
- Migrating Data to a New Iceberg Table
- Migrating Data into an Existing Iceberg Table
- The COPY INTO command
- The INSERT INTO SELECT command
- Conclusion
- 14. Real-World Use Cases of Apache Iceberg
- Ensuring High-Quality Data with Write-Audit-Publish in Apache Iceberg
- WAP Using Iceberg's Branching Feature
- Create a branch
- Write the data
- Audit the data
- NULL values
- Duplicate records
- Date consistency
- Applying fixes
- Publish the changes
- Running BI Workloads on the Data Lake
- Land the Raw Data into the Data Lake
- Curate Virtual Data Marts/Data Products
- Create a Reflection to Accelerate Our Dashboard
- Connect Our View to Our BI Tool
- Benefits of Running BI Workloads on the Data Lake
- Implementing Change Data Capture with Apache Iceberg
- Create Apache Iceberg Tables
- Apply Updates from Operational Systems
- Create the Change Log View to Capture Changes
- Merge Changed Data in the Aggregated Table
- Conclusion
- Index