Practical Lakehouse Architecture - Helion

ebook

Author: Gaurav Ashok Thalpati
ISBN: 9781098152970
Pages: 286, Format: ebook
Publication date: 2024-07-24
Bookstore: Helion

Book price: 183.08 zł (previously: 247.41 zł)
You save: 26% (-64.33 zł)

Tags: Data analysis

This concise yet comprehensive guide explains how to adopt a data lakehouse architecture to implement modern data platforms. It reviews the design considerations, challenges, and best practices for implementing a lakehouse and provides key insights into the ways that using a lakehouse can impact your data platform, from managing structured and unstructured data and supporting BI and AI/ML use cases to enabling more rigorous data governance and security measures.

Practical Lakehouse Architecture shows you how to:

- Understand key lakehouse concepts and features like transaction support, time travel, and schema evolution
- Understand the differences between traditional and lakehouse data architectures
- Differentiate between various file formats and table formats
- Design lakehouse architecture layers for storage, compute, metadata management, and data consumption
- Implement data governance and data security within the platform
- Evaluate technologies and decide on the best technology stack to implement the lakehouse for your use case
- Make critical design decisions and address practical challenges to build a future-ready data platform
- Start your lakehouse implementation journey and migrate data from existing systems to the lakehouse

Table of Contents

Preface
- Who Should Read This Book?
- Why I Wrote This Book
- Navigating This Book
- O'Reilly Online Learning
- Conventions Used in This Book
- How to Contact Us
- Acknowledgments
1. Introduction to Lakehouse Architecture
- Understanding Data Architecture
  - What Is Data Architecture?
  - How Does Data Architecture Help Build a Data Platform?
    - Defining core components
    - Defining component interdependencies and data flow
    - Defining guiding principles
    - Defining the technology stack
    - Aligning with overall vision and data strategy
  - Core Components of a Data Platform
    - Source systems
      - Internal and external source systems
      - Batch, near real-time, and streaming systems
      - Structured, semi-structured, and unstructured data
    - Data ingestion
      - Batch ingestion
      - Near real-time
      - Streaming
    - Data storage
      - General storage
      - Purpose-built storage
    - Data processing and transformations
      - Data validation and cleansing
      - Data transformation
      - Data curation and serving
    - Data consumption and delivery
      - BI workloads
      - Ad hoc/Interactive analysis
      - Downstream applications and APIs
      - AI and ML workloads
    - Common services
      - Metadata management
      - Data governance and data security
      - Data operations
- Why Do We Need a New Data Architecture?
- Lakehouse Architecture: A New Pattern
  - The Lakehouse: Best of Both Worlds
    - How does a lakehouse get data lake features?
    - How does a lakehouse get data warehouse features?
  - Understanding Lakehouse Architecture
    - Storage layer
      - Cloud storage
      - Open file formats
      - Open table formats
    - Compute layer
      - Open-source engines
      - Commercial engines
  - Lakehouse Architecture Characteristics
    - Single storage tier with no dedicated warehouse
    - Warehouse-like performance on the data lake
    - Decoupled architecture with separate storage and compute scaling
    - Open architecture
    - Support for different data types
    - Support for diverse workloads
  - Lakehouse Architecture Benefits
    - Simplified architecture
    - Support for unstructured data and ML use cases
    - No vendor lock-ins
    - Data sharing
    - Scalable and cost efficient
    - No data swamps
    - Schema enforcement and evolution
      - Schema enforcement
      - Schema evolution
    - Unified platform for ETL/ELT, BI, AI/ML, and real-time workloads
      - ETL/ELT workloads
      - BI workloads
      - AI/ML workloads
      - Real-time workloads
    - Time travel
      - Retrieve older data based on version
      - Retrieve older data based on timestamp
- Key Takeaways
- References
2. Traditional Architectures and Modern Data Platforms
- Traditional Architectures: Data Lakes and Data Warehouses
  - Data Warehouse Fundamentals
    - Benefits and advantages
    - Limitations and challenges
  - Data Lake Fundamentals
    - Benefits and advantages
    - Limitations and challenges
- Modern Data Platforms
  - Finding Answers in the Cloud
  - Standalone Approach
    - Benefits
    - Limitations
  - Combined Approach
    - Benefits
    - Limitations
  - Expectations of Modern Data Platforms
- Comparison: Data Warehouse, Data Lake, Lakehouse
  - Capabilities and Limitations
    - Standalone cloud data warehouse
    - Standalone cloud data lake
    - Combined architecture
    - Lakehouse architecture
  - Implementation Activities
    - Standalone cloud data warehouse
    - Standalone cloud data lake
    - Combined architecture
    - Lakehouse architecture
  - Administration and Management
    - Standalone cloud data warehouse
    - Standalone cloud data lake
    - Combined architecture
    - Lakehouse architecture
  - Business Outcomes
    - Standalone cloud data warehouse
    - Standalone cloud data lake
    - Combined architecture
    - Lakehouse architecture
- Lakehouse Architecture: The Default Choice for Future Data Platforms?
- Key Takeaways
- References
3. Storage: The Heart of the Lakehouse
- Lakehouse Storage: Key Concepts
  - Row Versus Columnar Storage
  - Storage-based Performance Optimization
- Lakehouse Storage Components
  - Cloud Object Storage
    - Storage characteristics
  - File Formats
    - Parquet
      - File layout
      - Key features
    - ORC
      - File layout
      - Key features
    - Avro
      - File layout
      - Key features
    - Similarities, differences, and use cases
  - Table Formats
    - Hive
    - Iceberg
      - Table layout
      - Key features
    - Hudi
      - Table layout
      - Key features
    - Linux Foundation's Delta Lake
      - Table layout
      - Key features
    - Similarities, differences, and use cases
- Key Design Considerations
  - Ecosystem Support
  - Community Support
  - Supported File Formats
  - Supported Compute Engines
  - Supported Features
  - Commercial Product Support
  - Current and Future Versions
  - Performance Benchmarking
  - Comparisons
  - Sharing Features
- Key Takeaways
- References
4. Data Catalogs
- Understanding Metadata
  - Technical Metadata
  - Business Metadata
- How Metastores and Data Catalogs Work Together
- Features of a Data Catalog
  - Search, Explore, and Discover Data
  - Data Classification
  - Data Governance and Security
  - Data Lineage
- Unified Data Catalog
  - Challenges of Siloed Metadata Management
  - What Is a Unified Data Catalog?
  - Benefits of a Unified Data Catalog
- Implementing a Data Catalog: Key Design Considerations and Options
  - Using Hive metastore
  - Using AWS Services
  - Using Azure Services
  - Using GCP Services
  - Using Databricks
- Key Takeaways
- References
5. Compute Engines for Lakehouse Architectures
- Data Computation Benefits of Lakehouse Architecture
  - Independent Scaling
  - Cross-region, Cross-account Access
  - Unified Batch and Real-Time Processing
  - Enhanced BI Performance
  - Freedom to Choose Different Engine Types
  - Cross-zone Analysis
- Compute Engine Options for Lakehouse Platforms
  - Open Source Tools
    - Tools for data engineering
      - Spark
      - Flink
    - Tools for data consumption
      - Presto and Trino
  - Cloud Services
    - AWS
      - AWS Glue
      - Amazon EMR
      - Amazon Athena
      - Other AWS services
    - Azure
      - Azure Data Factory (ADF)
      - Azure HDInsight
      - Azure Synapse Analytics
    - GCP
      - Dataproc
      - BigQuery
  - Third-Party Platforms
    - Databricks
    - Snowflake
- Key Design Considerations
  - Open Table Format Support
  - Supported Version and Features
  - Ecosystem Support
  - Persona-Based Preferences
  - Managed Open Source Versus Cloud Native Versus Third-Party Products
  - Data Consumption Workloads
    - BI workloads
    - AI/ML workloads
- Key Takeaways
- References
6. Data (and AI) Governance and Security in Lakehouse Architecture
- What Is Data Governance and Data Security?
- Benefits of Data Governance and Data Security
- Unified Governance and Security in Lakehouse Architecture
- Governance and Security Processes in Lakehouse Architecture
  - Metadata Management
  - Compliance and Regulations
  - Data and ML Model Quality
  - Lineage Across Data and AI Assets
    - Understanding data flow
    - Performing impact analysis
    - Identifying unused objects
    - Tracking sensitive data
  - Data and AI Asset Sharing
  - Data Ownership
  - Auditing and Monitoring
  - Access Management
  - Data Protection
    - Data at rest
    - Data in transit
  - Handling Sensitive Data
    - Identify sensitive data
    - Anonymize sensitive data
- What's Your Role?
- Key Takeaways
- References
7. The Big Picture: Designing and Implementing a Lakehouse Platform
- Pre-design Activities
  - Understanding Platform Requirements
  - Studying Existing System
  - Understanding the Organization's Vision and Data Strategy
  - Conducting Workshops and Interviews
- Choosing the Right Architecture
- Establishing Guiding Principles
  - Data Ecosystem
  - Scalability and Performance
  - Cost Control and Optimization
  - Platform Operations
  - Governance and Security
- Design Considerations and Implementation Best Practices
  - Architecture Blueprint
  - Data Ingestion
    - Data ingestion considerations
      - Ingestion frequency
      - Source system types
      - Identify incremental data (change data capture)
      - Sensitive data
    - Technology choices
    - Best practices
  - Data Storage
    - Storage zones considerations
      - Raw zone
      - Cleansed zone
      - Curated zone
      - Semantic zone
    - Data modeling considerations
      - Entity relationship (ER) modeling
      - Data Vault modeling
      - Dimensional modeling
    - Best practices
  - Data Processing
    - Data processing considerations
      - Open table format conversion
      - Schema and data quality validations
      - Data integration
      - Data transformations and enrichment
    - Best practices
  - Data Consumption and Delivery
    - Workload considerations
    - Best practices
  - Common Services
    - Metadata management
    - Governance and security
    - Platform operations
      - DataOps
      - MLOps
    - Best practices
- Design References
  - Step-by-Step Design Guide
  - Design Questionnaire
- Key Takeaways
- References
8. Lakehouse in the Real World
- Delivering a Real-World Lakehouse
- Estimation and Planning Phase
  - Estimation
  - Planning
- Analysis and Design Phase
  - Analyzing the Existing System
  - Data Modeling
  - Finalizing the Tech Stack
- Implementation and Test Phase
  - Historical Data Migration
  - Data Reconciliation and Testing
  - Reverse Engineering
  - Data Quality and Handling Sensitive Data
- Support and Maintenance Phase
  - Auditing and Tracking
  - Disaster Recovery Strategy
  - Decommissioning the Old System
- Delivery References
  - Project Deliverables
  - Reference Architectures
    - Cloud native implementation
    - Third-party platform implementation
- Key Takeaways
- References
9. Lakehouse of the Future
- Warehouse to Lakehouse: What's Next?
  - Data Mesh
  - HTAP
  - Zero ETL
- Interoperability and New Formats
  - Universal Format (UniForm)
  - Apache XTable
  - Upcoming File and Table Formats
- Managed Platforms for Public and Private Clouds
  - Microsoft Fabric and Other Platforms
  - Managed Lakehouse for Private Cloud Platform
- AI in a Lakehouse
- Key Takeaways
- Book Conclusion
- References
Index