Cost-Effective Data Pipelines - Helion
ISBN: 9781492098607
Pages: 288, Format: ebook
Publication date: 2023-07-13
Bookstore: Helion
Price: 203,15 zł (previously: 236,22 zł)
You save: 14% (-33,07 zł)
The low cost of getting started with cloud services can easily evolve into a significant expense down the road. That's challenging for teams developing data pipelines, particularly when rapid changes in technology and workload require a constant cycle of redesign. How do you deliver scalable, highly available products while keeping costs in check?
With this practical guide, author Sev Leonard provides a holistic approach to designing scalable data pipelines in the cloud. Intermediate data engineers, software developers, and architects will learn how to navigate cost/performance trade-offs and how to choose and configure compute and storage. You'll also pick up best practices for code development, testing, and monitoring.
By focusing on the entire design process, you'll be able to deliver cost-effective, high-quality products. This book helps you:
- Reduce cloud spend with lower cost cloud service offerings and smart design strategies
- Minimize waste without sacrificing performance by rightsizing compute resources
- Drive pipeline evolution, head off performance issues, and quickly debug with effective monitoring
- Set up development and test environments that minimize cloud service dependencies
- Create data pipeline code bases that are testable and extensible, fostering rapid development and evolution
- Improve data quality and pipeline operation through validation and testing
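As a flavor of the validation-and-testing techniques the book covers (see Chapter 4, "Validating with schemas"), here is a minimal, hypothetical sketch in plain Python — not code from the book, with an invented schema and field names used purely for illustration:

```python
# Hypothetical sketch of schema-based record validation, in the spirit of
# Chapter 4's "Data Validation" section (not code from the book).

# Illustrative schema: field name -> expected Python type.
EXPECTED_SCHEMA = {
    "user_id": int,
    "event": str,
    "amount": float,
}

def validate_record(record: dict) -> list:
    """Return a list of validation errors; an empty list means valid."""
    errors = []
    for field, expected_type in EXPECTED_SCHEMA.items():
        if field not in record:
            errors.append("missing field: " + field)
        elif not isinstance(record[field], expected_type):
            errors.append("bad type for " + field + ": "
                          + type(record[field]).__name__)
    return errors

good = {"user_id": 1, "event": "purchase", "amount": 9.99}
bad = {"user_id": "1", "event": "purchase"}

print(validate_record(good))  # no errors
print(validate_record(bad))   # type error and missing field reported
```

Running invalid records through a check like this before they enter a pipeline stage is one way to catch data-quality problems early rather than debugging them downstream.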
Table of Contents
- Preface
- Who This Book Is For
- What You Will Learn
- What This Book Is Not
- Running Example
- Conventions Used in This Book
- Using Code Examples
- O'Reilly Online Learning
- How to Contact Us
- Acknowledgments
- 1. Designing Compute for Data Pipelines
- Understanding Availability of Cloud Compute
- Outages
- Capacity Limits
- Account Limits
- Infrastructure
- Leveraging Different Purchasing Options in Pipeline Design
- On Demand
- Spot/Interruptible
- Contractual Discounts
- Contractual Discounts in the Real World: A Cautionary Tale
- Requirements Gathering for Compute Design
- Business Requirements
- Architectural Requirements
- Requirements-Gathering Example: HoD Batch Ingest
- Data
- Performance
- Purchasing options
- Benchmarking
- Instance Family Identification
- Cluster Sizing
- Monitoring
- Cluster resource utilization
- Data processing engine introspection
- Benchmarking Example
- Undersized
- Oversized
- Right-Sized
- Summary
- Recommended Readings
- 2. Responding to Changes in Demand by Scaling Compute
- Identifying Scaling Opportunities
- Variation in Data Pipelines
- Scaling Metrics
- Pipeline Scaling Example
- Designing for Scaling
- Implementing Scaling Plans
- Scaling Mechanics
- Common Autoscaling Pitfalls
- Scale-out threshold is too high
- Flapping
- Over-scaling
- Autoscaling Example
- Summary
- Recommended Readings
- 3. Data Organization in the Cloud
- Cloud Storage Costs
- Storage at Rest
- Egress
- Data Access
- Cloud Storage Organization
- Storage Bucket Strategies
- Lifecycle Configurations
- File Structure Design
- File Formats
- Partitioning
- Compaction
- Summary
- Recommended Readings
- 4. Economical Pipeline Fundamentals
- Idempotency
- Preventing Data Duplication
- Tolerating Data Duplication
- Checkpointing
- Automatic Retries
- Retry Considerations
- Retry Levels in Data Pipelines
- Data Validation
- Validating Data Characteristics
- Schemas
- Creating schemas
- Validating with schemas
- Keeping schemas up to date
- Summary
- 5. Setting Up Effective Development Environments
- Environments
- Software Environments
- Data Environments
- Data Pipeline Environments
- Environment Planning
- Design
- Costs
- Environment uptime
- Local Development
- Containers
- Container lifecycle
- Container composition
- Running local code against production dependencies
- Using environment variables
- Sharing configurations
- Consolidating common settings
- Resource Dependency Reduction
- Resource Cleanup
- Containers
- Summary
- 6. Software Development Strategies
- Managing Different Coding Environments
- Example: A Multimodal Pipeline
- Notebooks
- Web UIs
- Example: How Code Becomes Difficult to Change
- Modular Design
- Single Responsibility
- Dependency Inversion
- Supporting multicloud
- Plugging in other data sinks
- Testing
- Modular Design with DataFrames
- Configurable Design
- Summary
- Recommended Readings
- 7. Unit Testing
- The Role of Unit Testing in Data Pipelines
- Unit Testing Overview
- Example: Identifying Unit Testing Needs
- Pipeline Areas to Unit-Test
- Data Logic
- Connections
- Observability
- Data Modification Processes
- Cloud Components
- Working with Dependencies
- Interfaces
- Data
- Example: Unit Testing Plan
- Identifying Components to Test
- Identifying Dependencies
- Summary
- 8. Mocks
- Considerations for Replacing Dependencies
- Placement
- Dependency Stability
- Complexity Versus Criticality
- Mocking Generic Interfaces
- Responses
- Requests
- Connectivity
- Mocking Cloud Services
- Building Your Own Mocks
- Mocking with Moto
- Testing with Databases
- Test Database Example
- Working with Test Databases
- Summary
- Further Exploration
- More Moto Mocks
- Mock Placement
- 9. Data for Testing
- Working with Live Data
- Benefits
- Challenges
- Working with Synthetic Data
- Benefits
- Challenges
- Is Synthetic Data the Right Approach?
- Manual Data Generation
- Automated Data Generation
- Synthetic Data Libraries
- Customizing generated data
- Distributing cases in test data
- Schema-Driven Generation
- Mapping data generation to schemas
- Example: catching schema change impacts with CI tests
- Property-Based Testing
- Summary
- 10. Logging
- Logging Costs
- Impact of Scale
- Impact of Cloud Storage Elasticity
- Reducing Logging Costs
- Effective Logging
- Summary
- 11. Finding Your Way with Monitoring
- Costs of Inadequate Monitoring
- Getting Lost in the Woods
- Navigation to the Rescue
- Job metrics
- Autoscaling events
- Job runtime alerting
- Error metrics
- System Monitoring
- Data Volume
- Throughput
- Consumer Lag
- Worker Utilization
- Resource Monitoring
- Understanding the Bounds
- Understanding Reliability Impacts
- Pipeline Performance
- Pipeline Stage Duration
- Profiling
- Errors to Watch Out For
- Ingestion success and failure
- Stage failures
- Validation failures
- Communication failures
- Stage timeouts
- Query Monitoring
- Minimizing Monitoring Costs
- Summary
- Recommended Readings
- 12. Essential Takeaways
- An Ounce of Prevention Is Worth a Pound of Cure
- Rein In Compute Spend
- Organize Your Resources
- Design for Interruption
- Build In Data Quality
- Change Is the Only Constant
- Design for Change
- Monitor for Change
- Parting Thoughts
- A. Preparing a Cloud Budget
- It's All About the Details
- Historical Data
- Estimating for New Projects
- Changes That Impact Costs
- Data landscape
- Load
- Infrastructure
- Creating a Budget
- Budget Summary
- Changes Between Previous and Next Budget Periods
- Cost Breakdown
- Communicating the Budget
- Summary
- Index