Practical Lakehouse Architecture - Helion
Author: Gaurav Ashok Thalpati
ISBN: 9781098152970
Pages: 286, Format: ebook
Publication date: 2024-07-24
Bookstore: Helion
Book price: 183,08 zł (previously: 247,41 zł)
You save: 26% (-64,33 zł)
Tags: Data analysis
This concise yet comprehensive guide explains how to adopt a data lakehouse architecture to implement modern data platforms. It reviews the design considerations, challenges, and best practices for implementing a lakehouse and provides key insights into the ways that using a lakehouse can impact your data platform, from managing structured and unstructured data and supporting BI and AI/ML use cases to enabling more rigorous data governance and security measures.
Practical Lakehouse Architecture shows you how to:
- Understand key lakehouse concepts and features like transaction support, time travel, and schema evolution
- Understand the differences between traditional and lakehouse data architectures
- Differentiate between various file formats and table formats
- Design lakehouse architecture layers for storage, compute, metadata management, and data consumption
- Implement data governance and data security within the platform
- Evaluate technologies and decide on the best technology stack to implement the lakehouse for your use case
- Make critical design decisions and address practical challenges to build a future-ready data platform
- Start your lakehouse implementation journey and migrate data from existing systems to the lakehouse
Table of contents
- Preface
- Who Should Read This Book?
- Why I Wrote This Book
- Navigating This Book
- O'Reilly Online Learning
- Conventions Used in This Book
- How to Contact Us
- Acknowledgments
- 1. Introduction to Lakehouse Architecture
- Understanding Data Architecture
- What Is Data Architecture?
- How Does Data Architecture Help Build a Data Platform?
- Defining core components
- Defining component interdependencies and data flow
- Defining guiding principles
- Defining the technology stack
- Aligning with overall vision and data strategy
- Core Components of a Data Platform
- Source systems
- Internal and external source systems
- Batch, near real-time, and streaming systems
- Structured, semi-structured, and unstructured data
- Data ingestion
- Batch ingestion
- Near real-time
- Streaming
- Data storage
- General storage
- Purpose-built storage
- Data processing and transformations
- Data validation and cleansing
- Data transformation
- Data curation and serving
- Data consumption and delivery
- BI workloads
- Ad hoc/Interactive analysis
- Downstream applications and APIs
- AI and ML workloads
- Common services
- Metadata management
- Data governance and data security
- Data operations
- Why Do We Need a New Data Architecture?
- Lakehouse Architecture: A New Pattern
- The Lakehouse: Best of Both Worlds
- How does a lakehouse get data lake features?
- How does a lakehouse get data warehouse features?
- Understanding Lakehouse Architecture
- Storage layer
- Cloud storage
- Open file formats
- Open table formats
- Compute layer
- Open-source engines
- Commercial engines
- Lakehouse Architecture Characteristics
- Single storage tier with no dedicated warehouse
- Warehouse-like performance on the data lake
- Decoupled architecture with separate storage and compute scaling
- Open architecture
- Support for different data types
- Support for diverse workloads
- Lakehouse Architecture Benefits
- Simplified architecture
- Support for unstructured data and ML use cases
- No vendor lock-ins
- Data sharing
- Scalable and cost efficient
- No data swamps
- Schema enforcement and evolution
- Schema enforcement
- Schema evolution
- Unified platform for ETL/ELT, BI, AI/ML, and real-time workloads
- ETL/ELT workloads
- BI workloads
- AI/ML workloads
- Real-time workloads
- Time travel
- Retrieve older data based on version
- Retrieve older data based on timestamp
- Key Takeaways
- References
- 2. Traditional Architectures and Modern Data Platforms
- Traditional Architectures: Data Lakes and Data Warehouses
- Data Warehouse Fundamentals
- Benefits and advantages
- Limitations and challenges
- Data Lake Fundamentals
- Benefits and advantages
- Limitations and challenges
- Modern Data Platforms
- Finding Answers in the Cloud
- Standalone Approach
- Benefits
- Limitations
- Combined Approach
- Benefits
- Limitations
- Expectations of Modern Data Platforms
- Comparison: Data Warehouse, Data Lake, Lakehouse
- Capabilities and Limitations
- Standalone cloud data warehouse
- Standalone cloud data lake
- Combined architecture
- Lakehouse architecture
- Implementation Activities
- Standalone cloud data warehouse
- Standalone cloud data lake
- Combined architecture
- Lakehouse architecture
- Administration and Management
- Standalone cloud data warehouse
- Standalone cloud data lake
- Combined architecture
- Lakehouse architecture
- Business Outcomes
- Standalone cloud data warehouse
- Standalone cloud data lake
- Combined architecture
- Lakehouse architecture
- Lakehouse Architecture: The Default Choice for Future Data Platforms?
- Key Takeaways
- References
- 3. Storage: The Heart of the Lakehouse
- Lakehouse Storage: Key Concepts
- Row Versus Columnar Storage
- Storage-based Performance Optimization
- Lakehouse Storage Components
- Cloud Object Storage
- Storage characteristics
- File Formats
- Parquet
- File layout
- Key features
- ORC
- File layout
- Key features
- Avro
- File layout
- Key features
- Similarities, differences, and use cases
- Table Formats
- Hive
- Iceberg
- Table layout
- Key features
- Hudi
- Table layout
- Key features
- Linux Foundation's Delta Lake
- Table layout
- Key features
- Similarities, differences, and use cases
- Key Design Considerations
- Ecosystem Support
- Community Support
- Supported File Formats
- Supported Compute Engines
- Supported Features
- Commercial Product Support
- Current and Future Versions
- Performance Benchmarking
- Comparisons
- Sharing Features
- Key Takeaways
- References
- 4. Data Catalogs
- Understanding Metadata
- Technical Metadata
- Business Metadata
- How Metastores and Data Catalogs Work Together
- Features of a Data Catalog
- Search, Explore, and Discover Data
- Data Classification
- Data Governance and Security
- Data Lineage
- Unified Data Catalog
- Challenges of Siloed Metadata Management
- What Is a Unified Data Catalog?
- Benefits of a Unified Data Catalog
- Implementing a Data Catalog: Key Design Considerations and Options
- Using Hive metastore
- Using AWS Services
- Using Azure Services
- Using GCP Services
- Using Databricks
- Key Takeaways
- References
- 5. Compute Engines for Lakehouse Architectures
- Data Computation Benefits of Lakehouse Architecture
- Independent Scaling
- Cross-region, Cross-account Access
- Unified Batch and Real-Time Processing
- Enhanced BI Performance
- Freedom to Choose Different Engine Types
- Cross-zone Analysis
- Compute Engine Options for Lakehouse Platforms
- Open Source Tools
- Tools for data engineering
- Spark
- Flink
- Tools for data consumption
- Presto and Trino
- Cloud Services
- AWS
- AWS Glue
- Amazon EMR
- Amazon Athena
- Other AWS services
- Azure
- Azure Data Factory (ADF)
- Azure HDInsight
- Azure Synapse Analytics
- GCP
- Dataproc
- BigQuery
- Third-Party Platforms
- Databricks
- Snowflake
- Key Design Considerations
- Open Table Format Support
- Supported Version and Features
- Ecosystem Support
- Persona-Based Preferences
- Managed Open Source Versus Cloud Native Versus Third-Party Products
- Data Consumption Workloads
- BI workloads
- AI/ML workloads
- Key Takeaways
- References
- 6. Data (and AI) Governance and Security in Lakehouse Architecture
- What Is Data Governance and Data Security?
- Benefits of Data Governance and Data Security
- Unified Governance and Security in Lakehouse Architecture
- Governance and Security Processes in Lakehouse Architecture
- Metadata Management
- Compliance and Regulations
- Data and ML Model Quality
- Lineage Across Data and AI assets
- Understanding data flow
- Performing impact analysis
- Identifying unused objects
- Tracking sensitive data
- Data and AI Asset Sharing
- Data Ownership
- Auditing and Monitoring
- Access Management
- Data Protection
- Data at rest
- Data in transit
- Handling Sensitive Data
- Identify sensitive data
- Anonymize sensitive data
- What's Your Role?
- Key Takeaways
- References
- 7. The Big Picture: Designing and Implementing a Lakehouse Platform
- Pre-design Activities
- Understanding Platform Requirements
- Studying Existing System
- Understanding the Organization's Vision and Data Strategy
- Conducting Workshops and Interviews
- Choosing the Right Architecture
- Establishing Guiding Principles
- Data Ecosystem
- Scalability and Performance
- Cost Control and Optimization
- Platform Operations
- Governance and Security
- Design Considerations and Implementation Best Practices
- Architecture Blueprint
- Data Ingestion
- Data ingestion considerations
- Ingestion frequency
- Source system types
- Identify incremental data (change data capture)
- Sensitive data
- Technology choices
- Best practices
- Data Storage
- Storage zones considerations
- Raw zone
- Cleansed zone
- Curated zone
- Semantic zone
- Data modeling considerations
- Entity relationship (ER) modeling
- Data Vault modeling
- Dimensional modeling
- Best practices
- Data Processing
- Data processing considerations
- Open table format conversion
- Schema and data quality validations
- Data integration
- Data transformations and enrichment
- Best practices
- Data Consumption and Delivery
- Workload considerations
- Best practices
- Common Services
- Metadata management
- Governance and security
- Platform operations
- DataOps
- MLOps
- Best practices
- Design References
- Step-by-Step Design Guide
- Design Questionnaire
- Key Takeaways
- References
- 8. Lakehouse in the Real World
- Delivering a Real-World Lakehouse
- Estimation and Planning Phase
- Estimation
- Planning
- Analysis and Design Phase
- Analyzing the Existing System
- Data Modeling
- Finalizing the Tech Stack
- Implementation and Test Phase
- Historical Data Migration
- Data Reconciliation and Testing
- Reverse Engineering
- Data Quality and Handling Sensitive Data
- Support and Maintenance Phase
- Auditing and Tracking
- Disaster Recovery Strategy
- Decommissioning the Old System
- Delivery References
- Project Deliverables
- Reference Architectures
- Cloud native implementation
- Third-party platform implementation
- Key Takeaways
- References
- 9. Lakehouse of the Future
- Warehouse to Lakehouse: What's Next?
- Data Mesh
- HTAP
- Zero ETL
- Interoperability and New Formats
- Universal Format (UniForm)
- Apache XTable
- Upcoming File and Table Formats
- Managed Platforms for Public and Private Clouds
- Microsoft Fabric and Other Platforms
- Managed Lakehouse for Private Cloud Platform
- AI in a Lakehouse
- Key Takeaways
- Book Conclusion
- References
- Index