The Cloud Data Lake - Helion

ISBN: 9781098116545
Pages: 246, Format: ebook
Publication date: 2022-12-12
Bookstore: Helion
Price: 203.15 zł (previously: 236.22 zł)
You save: 14% (-33.07 zł)
More organizations than ever understand the importance of data lake architectures for deriving value from their data. Building a robust, scalable, and performant data lake remains a complex proposition, however, with a buffet of tools and options that need to work together to provide a seamless end-to-end pipeline from data to insights.
This book provides a concise yet comprehensive overview of the setup, management, and governance of a cloud data lake. Author Rukmani Gopalan, a product management leader and data enthusiast, guides data architects and engineers through the major aspects of working with a cloud data lake, from design considerations and best practices to data format optimizations, performance optimization, cost management, and governance.
- Learn the benefits of a cloud-based big data strategy for your organization
- Get guidance and best practices for designing performant and scalable data lakes
- Examine architecture and design choices, and data governance principles and strategies
- Build a data strategy that scales as your organizational and business needs increase
- Implement a scalable data lake in the cloud
- Use cloud-based advanced analytics to gain more value from your data
Table of Contents
- Preface
- Why I Wrote This Book
- Who Should Read This Book?
- Introducing Klodars Corporation
- Navigating the Book
- Conventions Used in This Book
- O'Reilly Online Learning
- How to Contact Us
- Acknowledgments
- 1. Big Data: Beyond the Buzz
- What Is Big Data?
- Elastic Data Infrastructure: The Challenge
- Cloud Computing Fundamentals
- Cloud Computing Terminology
- Value Proposition of the Cloud
- Cloud Data Lake Architecture
- Limitations of On-Premises Data Warehouse Solutions
- What Is a Cloud Data Lake Architecture?
- Benefits of a Cloud Data Lake Architecture
- Defining Your Cloud Data Lake Journey
- Summary
- 2. Big Data Architectures on the Cloud
- Why Klodars Corporation Moves to the Cloud
- Fundamentals of Cloud Data Lake Architectures
- A Word on Variety of Data
- Cloud Data Lake Storage
- Big Data Analytics Engines
- MapReduce
- Apache Hadoop
- Apache Spark
- Real-time stream processing pipelines
- Cloud Data Warehouses
- Modern Data Warehouse Architecture
- Reference Architecture
- Sample Use Case for a Modern Data Warehouse Architecture
- Benefits and Challenges of Modern Data Warehouse Architecture
- Data Lakehouse Architecture
- Reference Architecture for the Data Lakehouse
- Data formats
- Metadata
- Compute engines
- Sample Use Case for Data Lakehouse Architecture
- Benefits and Challenges of the Data Lakehouse Architecture
- Data Warehouses and Unstructured Data
- Data Mesh
- Reference Architecture
- Sample Use Case for a Data Mesh Architecture
- Challenges and Benefits of a Data Mesh Architecture
- What Is the Right Architecture for Me?
- Know Your Customers
- Know Your Business Drivers
- Consider Your Growth and Future Scenarios
- Design Considerations
- Hybrid Approaches
- Summary
- 3. Design Considerations for Your Data Lake
- Setting Up the Cloud Data Lake Infrastructure
- Identify Your Goals
- How Klodars Corporation defined the data lake goals
- Plan Your Architecture and Deliverables
- How Klodars Corporation planned their architecture and deliverables
- Implement the Cloud Data Lake
- Release and Operationalize
- Organizing Data in Your Data Lake
- A Day in the Life of Data
- Data Lake Zones
- Organization Mechanisms
- Introduction to Data Governance
- Actors Involved in Data Governance
- Data Classification
- Metadata Management, Data Catalog, and Data Sharing
- Data Access Management
- Data Quality and Observability
- Data Governance at Klodars Corporation
- Data Governance Wrap-Up
- Manage Data Lake Costs
- Demystifying Data Lake Costs on the Cloud
- Data Lake Cost Strategy
- Data Lake Environments and Associated Costs
- Cost strategy based on data
- Transactions and impact on costs
- Summary
- 4. Scalable Data Lakes
- A Sneak Peek into Scalability
- What Is Scalability?
- Scale in Our Day-to-Day Life
- Scalability in Data Lake Architectures
- Internals of Data Lake Processing Systems
- Data Copy Internals
- Components of a data copy solution
- Understanding resource utilization of a data copy job
- ELT/ETL Processing Internals
- Components of an Apache Spark application
- Understanding resource utilization of a Spark job
- A Note on Other Interactive Queries
- Considerations for Scalable Data Lake Solutions
- Pick the Right Cloud Offerings
- Hybrid and multicloud solutions
- IaaS versus PaaS versus SaaS solutions
- Cloud offerings for Klodars Corporation
- Plan for Peak Capacity
- Data Formats and Job Profile
- Summary
- 5. Optimizing Cloud Data Lake Architectures for Performance
- Basics of Measuring Performance
- Goals and Metrics for Performance
- Measuring Performance
- Optimizing for Faster Performance
- Cloud Data Lake Performance
- SLAs, SLOs, and SLIs
- Example: How Klodars Corporation Managed Its SLAs, SLOs, and SLIs
- Drivers of Performance
- Performance Drivers for a Copy Job
- Performance Drivers for a Spark Job
- Optimization Principles and Techniques for Performance Tuning
- Data Formats
- Exploring Apache Parquet
- Other popular data formats
- How Klodars Corporation picked their data formats
- Data Organization and Partitioning
- Optimal data organization strategy for Klodars Corporation
- Choosing the Right Configurations on Apache Spark
- Minimize Overheads with Data Transfer
- Premium Offerings and Performance
- The Case of Bigger Virtual Machines
- The Case of Flash Storage
- Summary
- 6. Deep Dive on Data Formats
- Why Do We Need These Open Data Formats?
- Why Do We Need to Store Tabular Data?
- Why Is It a Problem to Store Tabular Data in a Cloud Data Lake Storage?
- Delta Lake
- Why Was Delta Lake Founded?
- Eliminate data silos across business analysts, data scientists, and data engineers
- Provide a unified data and computational system for batch and real-time streaming data
- Support bulk updates or changes to existing data
- Handle errors due to schema changes and incorrect data
- How Does Delta Lake Work?
- When Do You Use Delta Lake?
- Apache Iceberg
- Why Was Apache Iceberg Founded?
- How Does Apache Iceberg Work?
- When Do You Use Apache Iceberg?
- Apache Hudi
- Why Was Apache Hudi Founded?
- How Does Apache Hudi Work?
- Copy-on-write tables
- Merge-on-read tables
- When Do You Use Apache Hudi?
- Summary
- 7. Decision Framework for Your Architecture
- Cloud Data Lake Assessment
- Cloud Data Lake Assessment Questionnaire
- Analysis for Your Cloud Data Lake Assessment
- Starting from Scratch
- Migrating an Existing Data Lake or Data Warehouse to the Cloud
- Improving an Existing Cloud Data Lake
- Phase 1 of Decision Framework: Assess
- Understand Customer Requirements
- Understand Opportunities for Improvement
- Know Your Business Drivers
- Complete the Assess Phase by Prioritizing the Requirements
- Phase 2 of Decision Framework: Define
- Finalize the Design Choices for the Cloud Data Lake
- Picking your architecture
- Picking your cloud provider
- Decision points for data lake migrations
- Plan Your Cloud Data Lake Project Deliverables
- Phase 3 of Decision Framework: Implement
- Phase 4 of Decision Framework: Operationalize
- Summary
- 8. Six Lessons for a Data Informed Future
- Lesson 1: Focus on the How and When, Not the If and Why, When It Comes to Cloud Data Lakes
- Lesson 2: With Great Power Comes Great Responsibility: Data Is No Exception
- Lesson 3: Customers Lead Technology, Not the Other Way Around
- Lesson 4: Change Is Inevitable, so Be Prepared
- Lesson 5: Build Empathy and Prioritize Ruthlessly
- Lesson 6: Big Impact Does Not Happen Overnight
- Summary
- A. Cloud Data Lake Decision Framework Template
- Phase 1: Assess Framework
- Phase 2: Define Framework
- Planning the Cloud Data Lake Deliverables
- Phase 3: Implement Framework
- Index