Fundamentals of Data Engineering - Helion

Authors: Joe Reis, Matt Housley
ISBN: 9781098108250
Pages: 450, Format: ebook
Publication date: 2022-06-22
Bookstore: Helion

Price: 245,65 zł (previously: 285,64 zł)
You save: 14% (-39,99 zł)

Data engineering has grown rapidly in the past decade, leaving many software engineers, data scientists, and analysts looking for a comprehensive view of this practice. With this practical book, you'll learn how to plan and build systems to serve the needs of your organization and customers by evaluating the best technologies available through the framework of the data engineering lifecycle.

Authors Joe Reis and Matt Housley walk you through the data engineering lifecycle and show you how to stitch together a variety of cloud technologies to serve the needs of downstream data consumers. You'll understand how to apply the concepts of data generation, ingestion, orchestration, transformation, storage, and governance that are critical in any data environment regardless of the underlying technology.

This book will help you:

Get a concise overview of the entire data engineering landscape
Assess data engineering problems using an end-to-end framework of best practices
Cut through marketing hype when choosing data technologies, architecture, and processes
Use the data engineering lifecycle to design and build a robust architecture
Incorporate data governance and security across the data engineering lifecycle

Table of Contents

Fundamentals of Data Engineering eBook -- table of contents

Preface
- What This Book Isn't
- What This Book Is About
- Who Should Read This Book
- Prerequisites
- What You'll Learn and How It Will Improve Your Abilities
- Navigating This Book
- Conventions Used in This Book
- How to Contact Us
- Acknowledgments
I. Foundation and Building Blocks
1. Data Engineering Described
- What Is Data Engineering?
  - Data Engineering Defined
  - The Data Engineering Lifecycle
  - Evolution of the Data Engineer
    - The early days: 1980 to 2000, from data warehousing to the web
    - The early 2000s: The birth of contemporary data engineering
    - The 2000s and 2010s: Big data engineering
    - The 2020s: Engineering for the data lifecycle
  - Data Engineering and Data Science
- Data Engineering Skills and Activities
  - Data Maturity and the Data Engineer
    - Stage 1: Starting with data
    - Stage 2: Scaling with data
    - Stage 3: Leading with data
  - The Background and Skills of a Data Engineer
  - Business Responsibilities
  - Technical Responsibilities
  - The Continuum of Data Engineering Roles, from A to B
- Data Engineers Inside an Organization
  - Internal-Facing Versus External-Facing Data Engineers
  - Data Engineers and Other Technical Roles
    - Upstream stakeholders
      - Data architects
      - Software engineers
      - DevOps engineers and site-reliability engineers
    - Downstream stakeholders
      - Data scientists
      - Data analysts
      - Machine learning engineers and AI researchers
  - Data Engineers and Business Leadership
    - Data in the C-suite
      - Chief executive officer
      - Chief information officer
      - Chief technology officer
      - Chief data officer
      - Chief analytics officer
      - Chief algorithms officer
    - Data engineers and project managers
    - Data engineers and product managers
    - Data engineers and other management roles
- Conclusion
- Additional Resources
2. The Data Engineering Lifecycle
- What Is the Data Engineering Lifecycle?
  - The Data Lifecycle Versus the Data Engineering Lifecycle
  - Generation: Source Systems
    - Evaluating source systems: Key engineering considerations
  - Storage
    - Evaluating storage systems: Key engineering considerations
    - Understanding data access frequency
    - Selecting a storage system
  - Ingestion
    - Key engineering considerations for the ingestion phase
    - Batch versus streaming
    - Key considerations for batch versus stream ingestion
    - Push versus pull
  - Transformation
    - Key considerations for the transformation phase
  - Serving Data
    - Analytics
      - Business intelligence
      - Operational analytics
      - Embedded analytics
    - Machine learning
    - Reverse ETL
- Major Undercurrents Across the Data Engineering Lifecycle
  - Security
  - Data Management
    - Data governance
      - Discoverability
      - Metadata
      - Data accountability
      - Data quality
    - Data modeling and design
    - Data lineage
    - Data integration and interoperability
    - Data lifecycle management
    - Ethics and privacy
  - DataOps
    - Automation
    - Observability and monitoring
    - Incident response
    - DataOps summary
  - Data Architecture
  - Orchestration
  - Software Engineering
    - Core data processing code
    - Development of open source frameworks
    - Streaming
    - Infrastructure as code
    - Pipelines as code
    - General-purpose problem solving
- Conclusion
- Additional Resources
3. Designing Good Data Architecture
- What Is Data Architecture?
  - Enterprise Architecture Defined
    - TOGAF's definition
    - Gartner's definition
    - EABOK's definition
    - Our definition
  - Data Architecture Defined
    - TOGAF's definition
    - DAMA's definition
    - Our definition
  - Good Data Architecture
- Principles of Good Data Architecture
  - Principle 1: Choose Common Components Wisely
  - Principle 2: Plan for Failure
  - Principle 3: Architect for Scalability
  - Principle 4: Architecture Is Leadership
  - Principle 5: Always Be Architecting
  - Principle 6: Build Loosely Coupled Systems
  - Principle 7: Make Reversible Decisions
  - Principle 8: Prioritize Security
    - Hardened-perimeter and zero-trust security models
    - The shared responsibility model
    - Data engineers as security engineers
  - Principle 9: Embrace FinOps
- Major Architecture Concepts
  - Domains and Services
  - Distributed Systems, Scalability, and Designing for Failure
  - Tight Versus Loose Coupling: Tiers, Monoliths, and Microservices
    - Architecture tiers
      - Single tier
      - Multitier
    - Monoliths
    - Microservices
    - Considerations for data architecture
  - User Access: Single Versus Multitenant
  - Event-Driven Architecture
  - Brownfield Versus Greenfield Projects
    - Brownfield projects
    - Greenfield projects
- Examples and Types of Data Architecture
  - Data Warehouse
    - The cloud data warehouse
    - Data marts
  - Data Lake
  - Convergence, Next-Generation Data Lakes, and the Data Platform
  - Modern Data Stack
  - Lambda Architecture
  - Kappa Architecture
  - The Dataflow Model and Unified Batch and Streaming
  - Architecture for IoT
    - Devices
    - Interfacing with devices
      - IoT gateway
      - Ingestion
      - Storage
      - Serving
    - Scratching the surface of the IoT
  - Data Mesh
  - Other Data Architecture Examples
- Who's Involved with Designing a Data Architecture?
- Conclusion
- Additional Resources
4. Choosing Technologies Across the Data Engineering Lifecycle
- Team Size and Capabilities
- Speed to Market
- Interoperability
- Cost Optimization and Business Value
  - Total Cost of Ownership
  - Total Opportunity Cost of Ownership
  - FinOps
- Today Versus the Future: Immutable Versus Transitory Technologies
  - Our Advice
- Location
  - On Premises
  - Cloud
  - Hybrid Cloud
  - Multicloud
  - Decentralized: Blockchain and the Edge
  - Our Advice
    - Choose technologies for the present, but look toward the future
  - Cloud Repatriation Arguments
    - You are not Dropbox, nor are you Cloudflare
- Build Versus Buy
  - Open Source Software
    - Community-managed OSS
    - Commercial OSS
  - Proprietary Walled Gardens
    - Independent offerings
    - Cloud platform proprietary service offerings
  - Our Advice
- Monolith Versus Modular
  - Monolith
  - Modularity
  - The Distributed Monolith Pattern
  - Our Advice
- Serverless Versus Servers
  - Serverless
  - Containers
  - How to Evaluate Server Versus Serverless
  - Our Advice
- Optimization, Performance, and the Benchmark Wars
  - Big Data...for the 1990s
  - Nonsensical Cost Comparisons
  - Asymmetric Optimization
  - Caveat Emptor
- Undercurrents and Their Impacts on Choosing Technologies
  - Data Management
  - DataOps
  - Data Architecture
  - Orchestration Example: Airflow
  - Software Engineering
- Conclusion
- Additional Resources
II. The Data Engineering Lifecycle in Depth
5. Data Generation in Source Systems
- Sources of Data: How Is Data Created?
- Source Systems: Main Ideas
  - Files and Unstructured Data
  - APIs
  - Application Databases (OLTP Systems)
    - ACID
    - Atomic transactions
    - OLTP and analytics
  - Online Analytical Processing System
  - Change Data Capture
  - Logs
    - Log encoding
    - Log resolution
    - Log latency: Batch or real time
  - Database Logs
  - CRUD
  - Insert-Only
  - Messages and Streams
  - Types of Time
- Source System Practical Details
  - Databases
    - Major considerations for understanding database technologies
    - Relational databases
    - Nonrelational databases: NoSQL
      - Key-value stores
      - Document stores
      - Wide-column
      - Graph databases
      - Search
      - Time series
  - APIs
    - REST
    - GraphQL
    - Webhooks
    - RPC and gRPC
  - Data Sharing
  - Third-Party Data Sources
  - Message Queues and Event-Streaming Platforms
    - Message queues
      - Message ordering and delivery
      - Delivery frequency
      - Scalability
    - Event-streaming platforms
      - Topics
      - Stream partitions
      - Fault tolerance and resilience
- Whom You'll Work With
- Undercurrents and Their Impact on Source Systems
  - Security
  - Data Management
  - DataOps
  - Data Architecture
  - Orchestration
  - Software Engineering
- Conclusion
- Additional Resources
6. Storage
- Raw Ingredients of Data Storage
  - Magnetic Disk Drive
  - Solid-State Drive
  - Random Access Memory
  - Networking and CPU
  - Serialization
  - Compression
  - Caching
- Data Storage Systems
  - Single Machine Versus Distributed Storage
  - Eventual Versus Strong Consistency
  - File Storage
    - Local disk storage
    - Network-attached storage
    - Cloud filesystem services
  - Block Storage
    - Block storage applications
    - RAID
    - Storage area network
    - Cloud virtualized block storage
    - Local instance volumes
  - Object Storage
    - Object stores for data engineering applications
    - Object lookup
    - Object consistency and versioning
    - Storage classes and tiers
    - Object store-backed filesystems
  - Cache and Memory-Based Storage Systems
    - Example: Memcached and lightweight object caching
    - Example: Redis, memory caching with optional persistence
  - The Hadoop Distributed File System
    - Hadoop is dead. Long live Hadoop!
  - Streaming Storage
  - Indexes, Partitioning, and Clustering
    - The evolution from rows to columns
    - From indexes to partitions and clustering
    - Example: Snowflake micro-partitioning
- Data Engineering Storage Abstractions
  - The Data Warehouse
  - The Data Lake
  - The Data Lakehouse
  - Data Platforms
  - Stream-to-Batch Storage Architecture
- Big Ideas and Trends in Storage
  - Data Catalog
    - Catalog application integration
    - Automated scanning
    - Data portal and social layer
    - Data catalog use cases
  - Data Sharing
  - Schema
  - Separation of Compute from Storage
    - Colocation of compute and storage
    - Separation of compute and storage
      - Ephemerality and scalability
      - Data durability and availability
    - Hybrid separation and colocation
      - Example: AWS EMR with S3 and HDFS
      - Example: Apache Spark
      - Example: Apache Druid
      - Example: Hybrid object storage
    - Zero-copy cloning
  - Data Storage Lifecycle and Data Retention
    - Hot, warm, and cold data
      - Hot data
      - Warm data
      - Cold data
      - Storage tier considerations
    - Data retention
      - Value
      - Time
      - Compliance
      - Cost
  - Single-Tenant Versus Multitenant Storage
- Whom You'll Work With
- Undercurrents
  - Security
  - Data Management
    - Data catalogs and metadata management
    - Data versioning in object storage
    - Privacy
  - DataOps
    - Systems monitoring
    - Observing and monitoring data
  - Data Architecture
  - Orchestration
  - Software Engineering
- Conclusion
- Additional Resources
7. Ingestion
- What Is Data Ingestion?
- Key Engineering Considerations for the Ingestion Phase
  - Bounded Versus Unbounded Data
  - Frequency
  - Synchronous Versus Asynchronous Ingestion
  - Serialization and Deserialization
  - Throughput and Scalability
  - Reliability and Durability
  - Payload
    - Kind
    - Shape
    - Size
    - Schema and data types
      - Detecting and handling upstream and downstream schema changes
      - Schema registries
    - Metadata
  - Push Versus Pull Versus Poll Patterns
- Batch Ingestion Considerations
  - Snapshot or Differential Extraction
  - File-Based Export and Ingestion
  - ETL Versus ELT
  - Inserts, Updates, and Batch Size
  - Data Migration
- Message and Stream Ingestion Considerations
  - Schema Evolution
  - Late-Arriving Data
  - Ordering and Multiple Delivery
  - Replay
  - Time to Live
  - Message Size
  - Error Handling and Dead-Letter Queues
  - Consumer Pull and Push
  - Location
- Ways to Ingest Data
  - Direct Database Connection
  - Change Data Capture
    - Batch-oriented CDC
    - Continuous CDC
    - CDC and database replication
    - CDC considerations
  - APIs
  - Message Queues and Event-Streaming Platforms
  - Managed Data Connectors
  - Moving Data with Object Storage
  - EDI
  - Databases and File Export
  - Practical Issues with Common File Formats
  - Shell
  - SSH
  - SFTP and SCP
  - Webhooks
  - Web Interface
  - Web Scraping
  - Transfer Appliances for Data Migration
  - Data Sharing
- Whom You'll Work With
  - Upstream Stakeholders
  - Downstream Stakeholders
- Undercurrents
  - Security
  - Data Management
    - Schema changes
    - Data ethics, privacy, and compliance
  - DataOps
    - Data-quality tests
  - Orchestration
  - Software Engineering
- Conclusion
- Additional Resources
8. Queries, Modeling, and Transformation
- Queries
  - What Is a Query?
    - Data definition language
    - Data manipulation language
    - Data control language
    - Transaction control language
  - The Life of a Query
  - The Query Optimizer
  - Improving Query Performance
    - Optimize your join strategy and schema
    - Use the explain plan and understand your query's performance
    - Avoid full table scans
    - Know how your database handles commits
    - Vacuum dead records
    - Leverage cached query results
  - Queries on Streaming Data
    - Basic query patterns on streams
      - The fast-follower approach
      - The Kappa architecture
    - Windows, triggers, emitted statistics, and late-arriving data
      - Session window
      - Fixed-time windows
      - Sliding windows
      - Watermarks
    - Combining streams with other data
      - Conventional table joins
      - Enrichment
      - Stream-to-stream joining
- Data Modeling
  - What Is a Data Model?
  - Conceptual, Logical, and Physical Data Models
  - Normalization
  - Techniques for Modeling Batch Analytical Data
    - Inmon
    - Kimball
      - Fact tables
      - Dimension tables
      - Star schema
    - Data Vault
      - Hubs
      - Links
      - Satellites
    - Wide denormalized tables
  - Modeling Streaming Data
- Transformations
  - Batch Transformations
    - Distributed joins
      - Broadcast join
      - Shuffle hash join
    - ETL, ELT, and data pipelines
    - SQL and code-based transformation tools
      - SQL is declarative...but it can still build complex data workflows
      - Example: When to avoid SQL for batch transformations in Spark
      - Example: Optimizing Spark and other processing frameworks
    - Update patterns
      - Truncate and reload
      - Insert only
      - Delete
      - Upsert/merge
    - Schema updates
    - Data wrangling
    - Example: Data transformation in Spark
    - Business logic and derived data
    - MapReduce
    - After MapReduce
  - Materialized Views, Federation, and Query Virtualization
    - Views
    - Materialized views
    - Composable materialized views
    - Federated queries
    - Data virtualization
  - Streaming Transformations and Processing
    - Basics
    - Transformations and queries are a continuum
    - Streaming DAGs
    - Micro-batch versus true streaming
- Whom You'll Work With
  - Upstream Stakeholders
  - Downstream Stakeholders
- Undercurrents
  - Security
  - Data Management
  - DataOps
  - Data Architecture
  - Orchestration
  - Software Engineering
- Conclusion
- Additional Resources
9. Serving Data for Analytics, Machine Learning, and Reverse ETL
- General Considerations for Serving Data
  - Trust
  - What's the Use Case, and Who's the User?
  - Data Products
  - Self-Service or Not?
  - Data Definitions and Logic
  - Data Mesh
- Analytics
  - Business Analytics
  - Operational Analytics
  - Embedded Analytics
- Machine Learning
- What a Data Engineer Should Know About ML
- Ways to Serve Data for Analytics and ML
  - File Exchange
  - Databases
  - Streaming Systems
  - Query Federation
  - Data Sharing
  - Semantic and Metrics Layers
  - Serving Data in Notebooks
- Reverse ETL
- Whom You'll Work With
- Undercurrents
  - Security
  - Data Management
  - DataOps
  - Data Architecture
  - Orchestration
  - Software Engineering
- Conclusion
- Additional Resources
III. Security, Privacy, and the Future of Data Engineering
10. Security and Privacy
- People
  - The Power of Negative Thinking
  - Always Be Paranoid
- Processes
  - Security Theater Versus Security Habit
  - Active Security
  - The Principle of Least Privilege
  - Shared Responsibility in the Cloud
  - Always Back Up Your Data
  - An Example Security Policy
- Technology
  - Patch and Update Systems
  - Encryption
    - Encryption at rest
    - Encryption over the wire
  - Logging, Monitoring, and Alerting
  - Network Access
  - Security for Low-Level Data Engineering
    - Internal security research
- Conclusion
- Additional Resources
11. The Future of Data Engineering
- The Data Engineering Lifecycle Isn't Going Away
- The Decline of Complexity and the Rise of Easy-to-Use Data Tools
- The Cloud-Scale Data OS and Improved Interoperability
- Enterprisey Data Engineering
- Titles and Responsibilities Will Morph...
- Moving Beyond the Modern Data Stack, Toward the Live Data Stack
  - The Live Data Stack
  - Streaming Pipelines and Real-Time Analytical Databases
  - The Fusion of Data with Applications
  - The Tight Feedback Between Applications and ML
  - Dark Matter Data and the Rise of...Spreadsheets?!
- Conclusion
A. Serialization and Compression Technical Details
- Serialization Formats
  - Row-Based Serialization
    - CSV: The nonstandard standard
    - XML
    - JSON and JSONL
    - Avro
  - Columnar Serialization
    - Parquet
    - ORC
    - Apache Arrow or in-memory serialization
  - Hybrid Serialization
    - Hudi
    - Iceberg
- Database Storage Engines
- Compression: gzip, bzip2, Snappy, Etc.
B. Cloud Networking
- Cloud Network Topology
  - Data Egress Charges
  - Availability Zones
  - Regions
  - GCP-Specific Networking and Multiregional Redundancy
  - Direct Network Connections to the Clouds
- CDNs
- The Future of Data Egress Fees
Index