Data Engineering Design Patterns - Helion

ISBN: 9781098165789
Pages: 374, Format: ebook
Publication date: 2024-05-09
Bookstore: Helion
Price: 228,65 zł (previously: 265,87 zł)
You save: 14% (-37,22 zł)
Data projects are an intrinsic part of an organization’s technical ecosystem, but data engineers in many companies continue to work on problems that others have already solved. This hands-on guide shows you how to provide valuable data by focusing on various aspects of data engineering, including data ingestion, data quality, idempotency, and more.
Author Bartosz Konieczny guides you through the process of building reliable end-to-end data engineering projects, from data ingestion to data observability, focusing on data engineering design patterns that solve common business problems in a secure and storage-optimized manner. Each pattern includes a user-facing description of the problem, solutions, and consequences that place the pattern into the context of real-life scenarios.
Throughout this journey, you’ll use open source data tools and public cloud services to apply each pattern. You'll learn:
- Challenges data engineers face and their impact on data systems
- How these challenges relate to data system components
- Useful applications of data engineering patterns
- How to identify and fix issues with your current data components
- Technology-agnostic solutions to new and existing data projects, with open source implementation examples
Bartosz Konieczny is a freelance data engineer who's been coding since 2010. He's held various senior hands-on positions that allowed him to work on many data engineering problems in batch and stream processing.
Table of contents
Data Engineering Design Patterns. Recipes for Solving the Most Common Data Engineering Problems eBook -- table of contents
- Preface
- Conventions Used in This Book
- The Structure of This Book
- How to Use This Book
- What Should I Know Prior to Reading This Book?
- Glossary and Code Examples
- O'Reilly Online Learning
- How to Contact Us
- Acknowledgments
- 1. Introducing Data Engineering Design Patterns
- What Are Design Patterns?
- Yet More Design Patterns?
- Common Data Engineering Patterns
- Case Study Used in This Book
- Summary
- 2. Data Ingestion Design Patterns
- Full Load
- Pattern: Full Loader
- Problem
- Solution
- Consequences
- Data volume
- Data consistency
- Examples
- Incremental Load
- Pattern: Incremental Loader
- Problem
- Solution
- Consequences
- Hard deletes
- Backfilling
- Examples
- Pattern: Change Data Capture
- Problem
- Solution
- Consequences
- Complexity
- Data scope
- Payload
- Data semantics
- Examples
- Replication
- Pattern: Passthrough Replicator
- Problem
- Solution
- Consequences
- Keep it simple
- Security and isolation
- PII data
- Latency
- Metadata
- Examples
- Pattern: Transformation Replicator
- Problem
- Solution
- Consequences
- Transformation risk for text file formats
- Desynchronization
- Examples
- Data Compaction
- Pattern: Compactor
- Problem
- Solution
- Consequences
- Cost versus performance trade-offs
- Consistency
- Cleaning
- Example
- Data Readiness
- Pattern: Readiness Marker
- Problem
- Solution
- Consequences
- Lack of enforcement
- Reliability for late data
- Examples
- Event Driven
- Pattern: External Trigger
- Problem
- Solution
- Consequences
- Push versus pull
- Execution context
- Error management
- Examples
- Summary
- 3. Error Management Design Patterns
- Unprocessable Records
- Pattern: Dead-Letter
- Problem
- Solution
- Consequences
- Snowball backfilling effect
- Dead-lettered records identification
- Ordering and consistency
- Error-safe functions
- Error or failure?
- Examples
- Duplicated Records
- Pattern: Windowed Deduplicator
- Problem
- Solution
- Consequences
- Space versus time trade-off
- Idempotent producer
- Examples
- Late Data
- Pattern: Late Data Detector
- Problem
- Solution
- Consequences
- Late data capture
- MIN strategy, stuck-in-the-past situations, and stateful jobs
- MAX strategy and event skew
- Examples
- Pattern: Static Late Data Integrator
- Problem
- Solution
- Consequences
- Snowball backfilling effect
- Overlapping executions and backfilling
- Pipeline trigger
- Waste of resources
- Time requirement
- Examples
- Pattern: Dynamic Late Data Integrator
- Problem
- Solution
- Consequences
- Concurrency
- Stateful pipelines and very late data
- Scheduling complexity
- Examples
- Filtering
- Pattern: Filter Interceptor
- Problem
- Solution
- Consequences
- Runtime impact
- Declarative languages
- Streaming
- Examples
- Fault Tolerance
- Pattern: Checkpointer
- Problem
- Solution
- Consequences
- Delivery guarantee versus latency trade-off
- Exactly-once feeling
- Examples
- Summary
- 4. Idempotency Design Patterns
- Overwriting
- Pattern: Fast Metadata Cleaner
- Problem
- Solution
- Consequences
- Granularity and backfilling boundary
- Metadata limits
- Data exposition layer
- Schema evolution
- Examples
- Pattern: Data Overwrite
- Problem
- Solution
- Consequences
- Data overhead
- Vacuum need
- Examples
- Updates
- Pattern: Merger
- Problem
- Solution
- Consequences
- Uniqueness
- I/O
- Incremental datasets with backfilling
- Examples
- Pattern: Stateful Merger
- Problem
- Solution
- Consequences
- Versioned data stores
- Vacuum operations
- Metadata operations
- Examples
- Database
- Pattern: Keyed Idempotency
- Problem
- Solution
- Consequences
- Database dependent
- Mutable data source
- Examples
- Pattern: Transactional Writer
- Problem
- Solution
- Consequences
- Commit step
- Distributed processing
- Idempotency scope
- Examples
- Immutable Dataset
- Pattern: Proxy
- Problem
- Solution
- Consequences
- Database support
- Immutability configuration
- Examples
- Summary
- 5. Data Value Design Patterns
- Data Enrichment
- Pattern: Static Joiner
- Problem
- Solution
- Consequences
- Late data and consistency
- Idempotency
- Examples
- Pattern: Dynamic Joiner
- Problem
- Solution
- Consequences
- Space versus exactness trade-off
- Late data
- Examples
- Data Decoration
- Pattern: Wrapper
- Problem
- Solution
- Consequences
- Domain split
- Size
- Examples
- Pattern: Metadata Decorator
- Problem
- Solution
- Consequences
- Implementation
- Data
- Examples
- Data Aggregation
- Pattern: Distributed Aggregator
- Problem
- Solution
- Consequences
- Additional network exchange
- Data skew
- Scaling
- Examples
- Pattern: Local Aggregator
- Problem
- Solution
- Consequences
- Scaling
- Grouping keys
- Examples
- Sessionization
- Pattern: Incremental Sessionizer
- Problem
- Solution
- Consequences
- Inactivity period
- Data freshness
- Late data, event time partitions, and backfilling
- Examples
- Pattern: Stateful Sessionizer
- Problem
- Solution
- Consequences
- At-least-once processing
- Scaling
- Inactivity period length
- Inactivity period time
- Examples
- Data Ordering
- Pattern: Bin Pack Orderer
- Problem
- Solution
- Consequences
- Retries
- Complexity
- Examples
- Pattern: FIFO Orderer
- Problem
- Solution
- Consequences
- I/O overhead and latency
- FIFO is not exactly once
- Examples
- Summary
- 6. Data Flow Design Patterns
- Sequence
- Pattern: Local Sequencer
- Problem
- Solution
- Consequences
- Boundaries
- Examples
- Pattern: Isolated Sequencer
- Problem
- Solution
- Consequences
- Scheduling
- Communication
- Examples
- Fan-In
- Pattern: Aligned Fan-In
- Problem
- Solution
- Consequences
- Infrastructure spikes
- Scheduling skew
- Scheduling overhead
- Complexity
- Examples
- Pattern: Unaligned Fan-In
- Problem
- Solution
- Consequences
- Readability
- Partial data
- Examples
- Fan-Out
- Pattern: Parallel Split
- Problem
- Solution
- Consequences
- Blocked execution
- Hardware
- Examples
- Pattern: Exclusive Choice
- Problem
- Solution
- Consequences
- Complexity factory
- Hidden logic
- Heavy conditions
- Examples
- Orchestration
- Pattern: Single Runner
- Problem
- Solution
- Consequences
- Backfilling
- Latency
- Examples
- Pattern: Concurrent Runner
- Problem
- Solution
- Consequences
- Resource starvation
- Shared state
- Examples
- Summary
- 7. Data Security Design Patterns
- Data Removal
- Pattern: Vertical Partitioner
- Problem
- Solution
- Consequences
- Query performance
- Querying complexity
- Complexity in a polyglot world
- Raw data
- Examples
- Pattern: In-Place Overwriter
- Problem
- Solution
- Consequences
- I/O overhead
- Cost
- Examples
- Access Control
- Pattern: Fine-Grained Accessor for Tables
- Problem
- Solution
- Consequences
- Row-level security limits
- Data type
- Query overhead
- Examples
- Pattern: Fine-Grained Accessor for Resources
- Problem
- Solution
- Consequences
- Security by the book trade-off
- Complexity
- Quotas
- Examples
- Data Protection
- Pattern: Encryptor
- Problem
- Solution
- Consequences
- Encryption/decryption overhead
- Data loss risk
- Protocol updates
- Examples
- Pattern: Anonymizer
- Problem
- Solution
- Consequences
- Information loss
- Examples
- Pattern: Pseudo-Anonymizer
- Problem
- Solution
- Consequences
- False sense of security
- Information loss
- Examples
- Connectivity
- Pattern: Secrets Pointer
- Problem
- Solution
- Consequences
- Cache invalidation and streaming jobs
- Logs
- A secret remains secret
- Examples
- Pattern: Secretless Connector
- Problem
- Solution
- Consequences
- Workless impression
- Rotation
- Examples
- Summary
- 8. Data Storage Design Patterns
- Partitioning
- Pattern: Horizontal Partitioner
- Problem
- Solution
- Consequences
- Granularity and metadata overhead
- Skew
- Mutability
- Examples
- Pattern: Vertical Partitioner
- Problem
- Solution
- Consequences
- Domain split
- Querying
- Data producer
- Examples
- Records Organization
- Pattern: Bucket
- Problem
- Solution
- Consequences
- Mutability
- Bucket size
- Examples
- Pattern: Sorter
- Problem
- Solution
- Consequences
- Unsorted segments
- Composite sort keys
- Mutability
- Examples
- Read Performance Optimization
- Pattern: Metadata Enhancer
- Problem
- Solution
- Consequences
- Overhead
- Out-of-date statistics
- Examples
- Pattern: Dataset Materializer
- Problem
- Solution
- Consequences
- Refresh cost
- Data access
- Data storage overhead
- Examples
- Pattern: Manifest
- Problem
- Solution
- Consequences
- Complexity
- Size
- Examples
- Data Representation
- Pattern: Normalizer
- Problem
- Solution
- Consequences
- Query cost
- Archival
- Examples
- Pattern: Denormalizer
- Problem
- Solution
- Consequences
- Costly updates
- Storage
- One big antipattern
- Examples
- Summary
- 9. Data Quality Design Patterns
- Quality Enforcement
- Pattern: Audit-Write-Audit-Publish
- Problem
- Solution
- Consequences
- Compute cost
- Rules coverage
- Streaming latency
- An issue may not be an issue
- Examples
- Pattern: Constraints Enforcer
- Problem
- Solution
- Consequences
- All-or-nothing semantics
- Data producer shift
- Constraints coverage
- Examples
- Schema Consistency
- Pattern: Schema Compatibility Enforcer
- Problem
- Solution
- Consequences
- Interaction overhead
- Schema evolution
- Examples
- Pattern: Schema Migrator
- Problem
- Solution
- Consequences
- Size impact
- Impossible removal
- Examples
- Quality Observation
- Pattern: Offline Observer
- Problem
- Solution
- Consequences
- Time accuracy
- Compute resources
- Examples
- Pattern: Online Observer
- Problem
- Solution
- Consequences
- Extra delays
- Parallel splits
- Examples
- Summary
- 10. Data Observability Design Patterns
- Data Detectors
- Pattern: Flow Interruption Detector
- Problem
- Solution
- Consequences
- Threshold
- Metadata
- False positives for storage
- Examples
- Pattern: Skew Detector
- Problem
- Solution
- Consequences
- Seasonality
- Communication
- Fatality loop
- Examples
- Time Detectors
- Pattern: Lag Detector
- Problem
- Solution
- Consequences
- Data skew
- Examples
- Pattern: SLA Misses Detector
- Problem
- Solution
- Consequences
- Late data and event time
- Examples
- Data Lineage
- Pattern: Dataset Tracker
- Problem
- Solution
- Consequences
- Vendor lock
- Custom work
- Examples
- Pattern: Fine-Grained Tracker
- Problem
- Solution
- Consequences
- Custom code
- Row-level visualization
- Evolution management
- Examples
- Summary
- Afterword
- A. Summary of Patterns
- Data Ingestion Design Patterns
- Error Management Design Patterns
- Idempotency Design Patterns
- Data Value Design Patterns
- Data Flow Design Patterns
- Data Security Design Patterns
- Data Storage Design Patterns
- Data Quality Design Patterns
- Data Observability Design Patterns
- Index