Cost-Effective Data Pipelines - Helion
ISBN: 9781492098607
Pages: 288, Format: ebook
Publication date: 2023-07-13
Bookstore: Helion
Price: 203,15 zł (previously: 236,22 zł)
You save: 14% (-33,07 zł)
The low cost of getting started with cloud services can easily evolve into a significant expense down the road. That's challenging for teams developing data pipelines, particularly when rapid changes in technology and workload require a constant cycle of redesign. How do you deliver scalable, highly available products while keeping costs in check?
With this practical guide, author Sev Leonard provides a holistic approach to designing scalable data pipelines in the cloud. Intermediate data engineers, software developers, and architects will learn how to navigate cost/performance trade-offs and how to choose and configure compute and storage. You'll also pick up best practices for code development, testing, and monitoring.
By focusing on the entire design process, you'll be able to deliver cost-effective, high-quality products. This book helps you:
- Reduce cloud spend with lower cost cloud service offerings and smart design strategies
- Minimize waste without sacrificing performance by rightsizing compute resources
- Drive pipeline evolution, head off performance issues, and quickly debug with effective monitoring
- Set up development and test environments that minimize cloud service dependencies
- Create data pipeline code bases that are testable and extensible, fostering rapid development and evolution
- Improve data quality and pipeline operation through validation and testing
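As a flavor of the validation-and-testing techniques the book covers (see Chapter 4, "Validating with schemas"), here is a minimal, hypothetical sketch in plain Python — not code from the book, with an invented schema and field names used purely for illustration:

```python
# Hypothetical sketch of schema-based record validation, in the spirit of
# Chapter 4's "Data Validation" section (not code from the book).

# Illustrative schema: field name -> expected Python type.
EXPECTED_SCHEMA = {
    "user_id": int,
    "event": str,
    "amount": float,
}

def validate_record(record: dict) -> list:
    """Return a list of validation errors; an empty list means valid."""
    errors = []
    for field, expected_type in EXPECTED_SCHEMA.items():
        if field not in record:
            errors.append("missing field: " + field)
        elif not isinstance(record[field], expected_type):
            errors.append("bad type for " + field + ": "
                          + type(record[field]).__name__)
    return errors

good = {"user_id": 1, "event": "purchase", "amount": 9.99}
bad = {"user_id": "1", "event": "purchase"}

print(validate_record(good))  # no errors
print(validate_record(bad))   # type error and missing field reported
```

Running invalid records through a check like this before they enter a pipeline stage is one way to catch data-quality problems early rather than debugging them downstream.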
Table of Contents
- Preface
- Who This Book Is For
- What You Will Learn
- What This Book Is Not
- Running Example
- Conventions Used in This Book
- Using Code Examples
- O'Reilly Online Learning
- How to Contact Us
- Acknowledgments
- 1. Designing Compute for Data Pipelines
- Understanding Availability of Cloud Compute
- Outages
- Capacity Limits
- Account Limits
- Infrastructure
- Leveraging Different Purchasing Options in Pipeline Design
- On Demand
- Spot/Interruptible
- Contractual Discounts
- Contractual Discounts in the Real World: A Cautionary Tale
- Requirements Gathering for Compute Design
- Business Requirements
- Architectural Requirements
- Requirements-Gathering Example: HoD Batch Ingest
- Data
- Performance
- Purchasing options
- Benchmarking
- Instance Family Identification
- Cluster Sizing
- Monitoring
- Cluster resource utilization
- Data processing engine introspection
- Benchmarking Example
- Undersized
- Oversized
- Right-Sized
- Summary
- Recommended Readings
- 2. Responding to Changes in Demand by Scaling Compute
- Identifying Scaling Opportunities
- Variation in Data Pipelines
- Scaling Metrics
- Pipeline Scaling Example
- Designing for Scaling
- Implementing Scaling Plans
- Scaling Mechanics
- Common Autoscaling Pitfalls
- Scale-out threshold is too high
- Flapping
- Over-scaling
- Autoscaling Example
- Summary
- Recommended Readings
- 3. Data Organization in the Cloud
- Cloud Storage Costs
- Storage at Rest
- Egress
- Data Access
- Cloud Storage Organization
- Storage Bucket Strategies
- Lifecycle Configurations
- File Structure Design
- File Formats
- Partitioning
- Compaction
- Summary
- Recommended Readings
- 4. Economical Pipeline Fundamentals
- Idempotency
- Preventing Data Duplication
- Tolerating Data Duplication
- Checkpointing
- Automatic Retries
- Retry Considerations
- Retry Levels in Data Pipelines
- Data Validation
- Validating Data Characteristics
- Schemas
- Creating schemas
- Validating with schemas
- Keeping schemas up to date
- Summary
- 5. Setting Up Effective Development Environments
- Environments
- Software Environments
- Data Environments
- Data Pipeline Environments
- Environment Planning
- Design
- Costs
- Environment uptime
- Local Development
- Containers
- Container lifecycle
- Container composition
- Running local code against production dependencies
- Using environment variables
- Sharing configurations
- Consolidating common settings
- Resource Dependency Reduction
- Resource Cleanup
- Containers
- Summary
- 6. Software Development Strategies
- Managing Different Coding Environments
- Example: A Multimodal Pipeline
- Notebooks
- Web UIs
- Example: How Code Becomes Difficult to Change
- Modular Design
- Single Responsibility
- Dependency Inversion
- Supporting multicloud
- Plugging in other data sinks
- Testing
- Modular Design with DataFrames
- Configurable Design
- Summary
- Recommended Readings
- 7. Unit Testing
- The Role of Unit Testing in Data Pipelines
- Unit Testing Overview
- Example: Identifying Unit Testing Needs
- Pipeline Areas to Unit-Test
- Data Logic
- Connections
- Observability
- Data Modification Processes
- Cloud Components
- Working with Dependencies
- Interfaces
- Data
- Example: Unit Testing Plan
- Identifying Components to Test
- Identifying Dependencies
- Summary
- 8. Mocks
- Considerations for Replacing Dependencies
- Placement
- Dependency Stability
- Complexity Versus Criticality
- Mocking Generic Interfaces
- Responses
- Requests
- Connectivity
- Mocking Cloud Services
- Building Your Own Mocks
- Mocking with Moto
- Testing with Databases
- Test Database Example
- Working with Test Databases
- Summary
- Further Exploration
- More Moto Mocks
- Mock Placement
- 9. Data for Testing
- Working with Live Data
- Benefits
- Challenges
- Working with Synthetic Data
- Benefits
- Challenges
- Is Synthetic Data the Right Approach?
- Manual Data Generation
- Automated Data Generation
- Synthetic Data Libraries
- Customizing generated data
- Distributing cases in test data
- Schema-Driven Generation
- Mapping data generation to schemas
- Example: catching schema change impacts with CI tests
- Property-Based Testing
- Summary
- 10. Logging
- Logging Costs
- Impact of Scale
- Impact of Cloud Storage Elasticity
- Reducing Logging Costs
- Effective Logging
- Summary
- 11. Finding Your Way with Monitoring
- Costs of Inadequate Monitoring
- Getting Lost in the Woods
- Navigation to the Rescue
- Job metrics
- Autoscaling events
- Job runtime alerting
- Error metrics
- System Monitoring
- Data Volume
- Throughput
- Consumer Lag
- Worker Utilization
- Resource Monitoring
- Understanding the Bounds
- Understanding Reliability Impacts
- Pipeline Performance
- Pipeline Stage Duration
- Profiling
- Errors to Watch Out For
- Ingestion success and failure
- Stage failures
- Validation failures
- Communication failures
- Stage timeouts
- Query Monitoring
- Minimizing Monitoring Costs
- Summary
- Recommended Readings
- 12. Essential Takeaways
- An Ounce of Prevention Is Worth a Pound of Cure
- Rein In Compute Spend
- Organize Your Resources
- Design for Interruption
- Build In Data Quality
- Change Is the Only Constant
- Design for Change
- Monitor for Change
- Parting Thoughts
- A. Preparing a Cloud Budget
- It's All About the Details
- Historical Data
- Estimating for New Projects
- Changes That Impact Costs
- Data landscape
- Load
- Infrastructure
- Creating a Budget
- Budget Summary
- Changes Between Previous and Next Budget Periods
- Cost Breakdown
- Communicating the Budget
- Summary
- Index