Implementing Service Level Objectives - Helion

Author: Alex Hidalgo
ISBN: 978-14-920-7676-6
Pages: 404, Format: ebook
Publication date: 2020-08-05
Bookstore: Helion

Price: 186.15 zł (previously: 216.45 zł)
You save: 14% (-30.30 zł)

Although service-level objectives (SLOs) continue to grow in importance, there’s a distinct lack of information about how to implement them. Practical advice that does exist usually assumes that your team already has the infrastructure, tooling, and culture in place. In this book, recognized SLO expert Alex Hidalgo explains how to build an SLO culture from the ground up.

Ideal as a primer and daily reference for anyone creating both the culture and tooling necessary for SLO-based approaches to reliability, this guide provides detailed analysis of advanced SLO and service-level indicator (SLI) techniques. Armed with mathematical models and statistical knowledge to help you get the most out of an SLO-based approach, you’ll learn how to build systems capable of measuring meaningful SLIs with buy-in across all departments of your organization.

- Define SLIs that meaningfully measure the reliability of a service from a user’s perspective
- Choose appropriate SLO targets, including how to perform statistical and probabilistic analysis
- Use error budgets to help your team have better discussions and make better data-driven decisions
- Build supportive tooling and resources required for an SLO-based approach
- Use SLO data to present meaningful reports to leadership and your users

Table of Contents

Foreword
Preface
- You Don’t Have to Be Perfect
- How to Read This Book
- Conventions Used in This Book
- O’Reilly Online Learning
- How to Contact Us
- Acknowledgments
I. SLO Development
1. The Reliability Stack
- Service Truths
- The Reliability Stack
  - Service Level Indicators
  - Service Level Objectives
  - Error Budgets
- What Is a Service?
  - Example Services
    - Web services
    - Request and response APIs
    - Data processing pipelines
    - Batch jobs
    - Databases and storage systems
    - Compute platforms
    - Hardware and the network
- Things to Keep in Mind
  - SLOs Are Just Data
  - SLOs Are a Process, Not a Project
  - Iterate Over Everything
  - The World Will Change
  - It’s All About Humans
- Summary
2. How to Think About Reliability
- Reliability Engineering
- Past Performance and Your Users
  - Implied Agreements
  - Making Agreements
  - A Worked Example of Reliability
- How Reliable Should You Be?
  - 100% Isn’t Necessary
  - Reliability Is Expensive
  - How to Think About Reliability
- Summary
3. Developing Meaningful Service Level Indicators
- What Meaningful SLIs Provide
  - Happier Users
  - Happier Engineers
  - A Happier Business
- Caring About Many Things
  - A Request and Response Service
  - Measuring Many Things by Measuring Only a Few
  - A Written Example
- Something More Complex
  - Measuring Complex Service User Reliability
  - Another Written Example
  - Business Alignment and SLIs
- Summary
4. Choosing Good Service Level Objectives
- Reliability Targets
  - User Happiness
  - The Problem of Being Too Reliable
  - The Problem with the Number Nine
  - The Problem with Too Many SLOs
- Service Dependencies and Components
  - Service Dependencies
    - Hard dependencies
    - Soft dependencies
    - Turning hard dependencies into soft dependencies
    - Dependency math
  - Service Components
    - Multiple-team component services
    - Single-team component services
- Reliability for Things You Don’t Own
  - Open Source or Hosted Services
  - Measuring Hardware
    - But I am big enough!
    - Beyond just hardware
- Choosing Targets
  - Past Performance
  - Basic Statistics
    - The five Ms
    - Ranges
    - Percentiles
  - Metric Attributes
    - Resolution
    - Quantity
    - Quality
  - Percentile Thresholds
  - What to Do Without a History
- Summary
5. How to Use Error Budgets
- Error Budgets in Practice
  - To Release New Features or Not?
  - Project Focus
  - Examining Risk Factors
  - Experimentation and Chaos Engineering
  - Load and Stress Tests
  - Blackhole Exercises
  - Purposely Burning Budget
  - Error Budgets for Humans
- Error Budget Measurement
  - Establishing Error Budgets
    - Events-based error budget math
    - Time-based error budget math
    - Rolling versus calendar-bound windows
    - Excluding time
    - Choosing a time window
  - Decision Making
  - Error Budget Policies
    - Owners and stakeholders
    - Error budget burn policies
    - Error budget exceeded policies
    - Justification and revisit schedule
- Summary
II. SLO Implementation
6. Getting Buy-In
- Engineering Is More than Code
- Key Stakeholders
  - Engineering
  - Product
  - Operations
  - QA
  - Legal
  - Executive Leadership
- Making It So
  - Order of Operation
  - Common Objections and How to Overcome Them
    - Engineering
    - Operations
    - Product
    - Leadership
    - Legal
    - QA
  - Your First Error Budget Policy (and Your First Critical Test)
    - No new features (feature freeze)
    - Your first test
- Lessons Learned the Hard Way
- Summary
7. Measuring SLIs and SLOs
- Design Goals
  - Flexible Targets
  - Testable Targets
  - Freshness
  - Cost
  - Reliability
  - Organizational Constraints
- Common Machinery
  - Centralized Time Series Statistics (Metrics)
    - TSDBs: The basics
    - Multidimensional analysis
    - Statistical distribution support
    - TSDBs and our design goals
  - Structured Event Databases (Logging)
    - Aggregate analysis
    - Structured event databases and our design goals
- Common Cases
  - Latency-Sensitive Request Processing
  - Low-Lag, High-Throughput Batch Processing
  - Mobile and Web Clients
- The General Case
- Other Considerations
  - Integration with Distributed Tracing
  - SLI and SLO Discoverability
- Summary
8. SLO Monitoring and Alerting
- Motivation: What Is SLO Alerting, and Why Should You Do It?
  - The Shortcomings of Simple Threshold Alerting
    - Thresholds don’t stay relevant
    - Poor proxies for user experience
    - Context loss in static thresholds
    - Unclear correlation between threshold and behavior and nonrange alerting
    - Alert fatigue and fog of war
    - Picking an SLO number is something a human should do
    - Complexity and failure in distributed systems
  - A Better Way
- How to Do SLO Alerting
  - Choosing a Target
  - Error Budgets and Response Time
  - Error Budget Burn Rate
  - Rolling Windows
  - Putting It Together
  - Troubleshooting with SLO Alerting
  - Corner Cases
  - SLO Alerting in a Brownfield Setup
    - Show the human impact of the current situation
    - Review the existing outage footprint
    - Run the old and new in parallel
- Parting Recommendations
- Summary
9. Probability and Statistics for SLIs and SLOs
- On Probability
  - SLI Example: Availability
    - Sample spaces
    - Coin interlude
    - Dewclaw in a data center
    - Dewclaw in two data centers
    - Independence
  - SLI Example: Low QPS
    - Expected value
    - Median
    - We break our SLO a lot, actually
    - What can you do?
- On Statistics
  - Maximum Likelihood Estimation
  - Maximum a Posteriori
    - Bayes’ theorem
    - The relationship between MLE and MAP
    - Using MAP
  - Bayesian Inference
    - The highest density interval
  - SLI Example: Queueing Latency
    - Modeling events with the Poisson distribution
    - Variance, percentiles, and the cumulative distribution function
  - Batch Latency
    - Queueing systems
    - The exponential distribution
    - Decreasing latency
    - Adding capacity
- SLI Example: Durability
- Further Reading
- Summary
10. Architecting for Reliability
- Example System: Image-Serving Service
  - Architectural Considerations: Hardware
  - Architectural Considerations: Monolith or Microservices
  - Architectural Considerations: Anticipating Failure Modes
  - Architectural Considerations: Three Types of Requests
    - Synchronous requests
    - Asynchronous requests
    - Batch requests
  - Systems and Building Blocks
  - Quantitative Analysis of Systems
  - Instrumentation! The System Also Needs Instrumentation!
- Architectural Considerations: Hardware, Revisited
- SLOs as a Result of System SLIs
- The Importance of Identifying and Understanding Dependencies
- Summary
11. Data Reliability
- Data Services
  - Designing Data Applications
- Users of Data Services
- Setting Measurable Data Objectives
  - Data and Data Application Reliability
  - Data Properties
    - Freshness
    - Completeness
    - Consistency
    - Accuracy
    - Validity
    - Integrity
    - Durability
  - Data Application Properties
    - Security
    - Availability
    - Scalability
    - Performance
    - Resilience
    - Robustness
- System Design Concerns
  - Data Application Failures
  - Other Qualities
- Data Lineage
- Summary
12. A Worked Example
- Dogs Deserve Clothes
  - How a Service Grows
  - The Design of a Service
- SLIs and SLOs as User Journeys
  - Customers: Finding and Browsing Products
    - SLO: Front page loads and latency
    - SLO: Search results
  - Other Services as Users: Buying Products
    - SLO: Checkout success
  - Internal Users
    - SLO: Business data analysis
    - SLO: Internal wiki
  - Platforms as Services
    - SLO: Container platform
- Summary
III. SLO Culture
13. Building an SLO Culture
- A Culture of No SLOs
- Strategies for Shifting Culture
- Path to a Culture of SLOs
  - Getting Buy-in
  - Prioritizing SLO Work
    - Do it yourself
    - Assign it
  - Implementing Your SLO
    - Start with a document
    - What is important to measure?
  - What Will Your SLIs Be?
  - What Will Your SLOs Be?
  - Using Your SLO
    - Alerting
    - Exhausting your error budget
    - Using surplus error budget
  - Iterating on Your SLO
  - Determining When Your SLOs Are Good Enough
  - Advocating for Others to Use SLOs
- Summary
14. SLO Evolution
- SLO Genesis
  - The First Pass
  - Listening to Users
  - Periodic Revisits
- Usage Changes
  - Increased Utilization Changes
  - Decreased Utilization Changes
  - Functional Utilization Changes
- Dependency Changes
  - Service Dependency Changes
  - Platform Changes
  - Dependency Introduction or Retirement
- Failure-Induced Changes
- User Expectation and Requirement Changes
  - User Expectation Changes
    - Running too well
    - Market changes
  - User Requirement Changes
- Tooling Changes
  - Measurement Changes
  - Calculation Changes
- Intuition-Based Changes
- Setting Aspirational SLOs
- Identifying Incorrect SLOs
  - Listening to Users (Redux)
  - Paying Attention to Failures
- How to Change SLOs
  - Revisit Schedules
- Summary
15. Discoverable and Understandable SLOs
- Understandability
  - SLO Definition Documents
    - Ownership
    - Approvers
    - Definition status
    - Service overview
    - SLO definitions and status
    - Rationale
    - Revisit schedule
    - Error budget policy
    - External links
  - Phraseology
- Discoverability
  - Document Repositories
  - Discoverability Tooling
  - SLO Reports
  - Dashboards
- Summary
16. SLO Advocacy
- Crawl
  - Do Your Research
  - Prepare Your Sales Pitch
    - What do your engineers care about?
    - What do your company executives and business partners care about?
  - Create Your Supporting Artifacts
    - Documentation
    - Training
    - Collaboration-based training
  - Run Your First Training and Workshop
  - Implement an SLO Pilot with a Single Service
  - Spread Your Message
  - Learn How to Handle Challenges
- Walk
  - Work with Early Adopters to Implement SLOs for More Services
  - Celebrate Achievements and Build Confidence
  - Create a Library of Case Studies
  - Scale Your Training Program by Adding More Trainers
  - Scale Your Communications
- Run
  - Share Your Library of SLO Case Studies
  - Create a Community of SLO Experts
  - Continuously Improve
- Summary
17. Reliability Reporting
- Basic Reporting
  - Counting Incidents
  - Severity Levels
  - The Problem with Mean Time to X
    - Incidents are unique
    - Means aren’t always meaningful
  - SLOs for Basic Reporting
    - A worked reporting example
- Advanced Reporting
  - SLO Status
  - Error Budget Status
- Summary
A. SLO Definition Template
- SLO Definition: Service Name
- Service Overview
- SLIs and SLOs
- Rationale
- Revisit Schedule
- Error Budget Policy
- External Links
B. Proofs for Chapter 9
- Theorem 1
  - Proof
- Theorem 2
  - Proof
- Theorem 3
  - Proof
- Theorem 4
  - Proof
- Theorem 5
  - Proof
- Theorem 6
  - Proof
- Theorem 7
  - Proof
Index