The Site Reliability Workbook. Practical Ways to Implement SRE - Helion

ebook

Authors: Betsy Beyer, Niall Richard Murphy, David K. Rensin
ISBN: 978-14-920-2945-8
Pages: 512, Format: ebook
Publication date: 2018-07-25
Bookstore: Helion

Price: 152,15 zł (previously: 176,92 zł)
You save: 14% (-24,77 zł)

In 2016, Google’s Site Reliability Engineering book ignited an industry discussion on what it means to run production services today—and why reliability considerations are fundamental to service design. Now, Google engineers who worked on that bestseller introduce The Site Reliability Workbook, a hands-on companion that uses concrete examples to show you how to put SRE principles and practices to work in your environment.

This new workbook not only combines practical examples from Google’s experiences, but also provides case studies from Google’s Cloud Platform customers who underwent this journey. Evernote, The Home Depot, The New York Times, and other companies outline hard-won experiences of what worked for them and what didn’t.

Dive into this workbook and learn how to flesh out your own SRE practice, no matter what size your company is.

You’ll learn:

- How to run reliable services in environments you don’t completely control, such as the cloud
- Practical applications of how to create, monitor, and run your services via Service Level Objectives
- How to convert existing ops teams to SRE, including how to dig out of operational overload
- Methods for starting SRE from either greenfield or brownfield

Table of Contents

The Site Reliability Workbook. Practical Ways to Implement SRE eBook: table of contents

Foreword I
Foreword II
Preface
- Conventions Used in This Book
- Using Code Examples
- O’Reilly Safari
- How to Contact Us
- Acknowledgments
1. How SRE Relates to DevOps
- Background on DevOps
  - No More Silos
  - Accidents Are Normal
  - Change Should Be Gradual
  - Tooling and Culture Are Interrelated
  - Measurement Is Crucial
- Background on SRE
  - Operations Is a Software Problem
  - Manage by Service Level Objectives (SLOs)
  - Work to Minimize Toil
  - Automate This Year’s Job Away
  - Move Fast by Reducing the Cost of Failure
  - Share Ownership with Developers
  - Use the Same Tooling, Regardless of Function or Job Title
- Compare and Contrast
- Organizational Context and Fostering Successful Adoption
  - Narrow, Rigid Incentives Narrow Your Success
  - It’s Better to Fix It Yourself; Don’t Blame Someone Else
  - Consider Reliability Work as a Specialized Role
  - “When” Can Substitute for “Whether”
  - Strive for Parity of Esteem: Career and Financial
- Conclusion
I. Foundations
2. Implementing SLOs
- Why SREs Need SLOs
- Getting Started
  - Reliability Targets and Error Budgets
  - What to Measure: Using SLIs
    - Types of components
- A Worked Example
  - Moving from SLI Specification to SLI Implementation
    - API and HTTP server availability and latency
    - Pipeline freshness, coverage, and correctness
  - Measuring the SLIs
    - Load balancer metrics
    - Calculating the SLIs
  - Using the SLIs to Calculate Starter SLOs
- Choosing an Appropriate Time Window
- Getting Stakeholder Agreement
  - Establishing an Error Budget Policy
  - Documenting the SLO and Error Budget Policy
  - Dashboards and Reports
- Continuous Improvement of SLO Targets
  - Improving the Quality of Your SLO
- Decision Making Using SLOs and Error Budgets
- Advanced Topics
  - Modeling User Journeys
  - Grading Interaction Importance
  - Modeling Dependencies
  - Experimenting with Relaxing Your SLOs
- Conclusion
3. SLO Engineering Case Studies
- Evernote’s SLO Story
  - Why Did Evernote Adopt the SRE Model?
  - Introduction of SLOs: A Journey in Progress
  - Breaking Down the SLO Wall Between Customer and Cloud Provider
  - Current State
- The Home Depot’s SLO Story
  - The SLO Culture Project
  - Our First Set of SLOs
    - Availability and latency for API calls
    - Infrastructure utilization
    - Traffic volume
    - Latency
    - Errors
    - Tickets
    - VALET
  - Evangelizing SLOs
  - Automating VALET Data Collection
    - TPS Reports
    - VALET service
    - VALET Dashboard
  - The Proliferation of SLOs
  - Applying VALET to Batch Applications
  - Using VALET in Testing
  - Future Aspirations
  - Summary
- Conclusion
4. Monitoring
- Desirable Features of a Monitoring Strategy
  - Speed
  - Calculations
  - Interfaces
  - Alerts
- Sources of Monitoring Data
  - Examples
    - Move information from logs to metrics
      - Problem
      - Proposed solution
      - Outcome
    - Improve both logs and metrics
      - Problem
      - Proposed solution
      - Outcome
    - Keep logs as the data source
      - Problem
      - Proposed solution
      - Outcome
- Managing Your Monitoring System
  - Treat Your Configuration as Code
  - Encourage Consistency
  - Prefer Loose Coupling
- Metrics with Purpose
  - Intended Changes
  - Dependencies
  - Saturation
  - Status of Served Traffic
  - Implementing Purposeful Metrics
- Testing Alerting Logic
- Conclusion
5. Alerting on SLOs
- Alerting Considerations
- Ways to Alert on Significant Events
  - 1: Target Error Rate SLO Threshold
  - 2: Increased Alert Window
  - 3: Incrementing Alert Duration
  - 4: Alert on Burn Rate
  - 5: Multiple Burn Rate Alerts
  - 6: Multiwindow, Multi-Burn-Rate Alerts
- Low-Traffic Services and Error Budget Alerting
  - Generating Artificial Traffic
  - Combining Services
  - Making Service and Infrastructure Changes
  - Lowering the SLO or Increasing the Window
- Extreme Availability Goals
- Alerting at Scale
- Conclusion
6. Eliminating Toil
- What Is Toil?
- Measuring Toil
- Toil Taxonomy
  - Business Processes
  - Production Interrupts
  - Release Shepherding
  - Migrations
  - Cost Engineering and Capacity Planning
  - Troubleshooting for Opaque Architectures
- Toil Management Strategies
  - Identify and Measure Toil
  - Engineer Toil Out of the System
  - Reject the Toil
  - Use SLOs to Reduce Toil
  - Start with Human-Backed Interfaces
  - Provide Self-Service Methods
  - Get Support from Management and Colleagues
  - Promote Toil Reduction as a Feature
  - Start Small and Then Improve
  - Increase Uniformity
  - Assess Risk Within Automation
  - Automate Toil Response
  - Use Open Source and Third-Party Tools
  - Use Feedback to Improve
- Case Studies
- Case Study 1: Reducing Toil in the Datacenter with Automation
  - Background
  - Problem Statement
  - What We Decided to Do
  - Design First Effort: Saturn Line-Card Repair
  - Implementation
  - Design Second Effort: Saturn Line-Card Repair Versus Jupiter Line-Card Repair
  - Implementation
  - Lessons Learned
    - UIs should not introduce overhead or complexity
    - Don’t rely on human expertise
    - Design reusable components
    - Don’t overthink the problem
    - Sometimes imperfect automation is good enough
    - Repair automation is not fire and forget
    - Build in risk assessment and defense in depth
    - Get a failure budget and manager support
    - Think holistically
- Case Study 2: Decommissioning Filer-Backed Home Directories
  - Background
  - Problem Statement
  - What We Decided to Do
  - Design and Implementation
  - Key Components
    - Moonwalk
    - Moira Portal
    - Archiving and migration automation
  - Lessons Learned
    - Challenge assumptions and retire expensive business processes
    - Build self-service interfaces
    - Start with human-backed interfaces
    - Melt snowflakes
    - Employ organizational nudges
- Conclusion
7. Simplicity
- Measuring Complexity
- Simplicity Is End-to-End, and SREs Are Good for That
  - Case Study 1: End-to-End API Simplicity
    - Background
    - Lessons learned
  - Case Study 2: Project Lifecycle Complexity
    - Background
    - What we decided to do
    - Lessons learned
- Regaining Simplicity
  - Case Study 3: Simplification of the Display Ads Spiderweb
    - Background
    - What we decided to do
    - Lessons learned
  - Case Study 4: Running Hundreds of Microservices on a Shared Platform
    - Background
    - What we decided to do
    - Design
    - Outcomes
    - Lessons learned
  - Case Study 5: pDNS No Longer Depends on Itself
    - Background
    - Problem statement
    - What we decided to do
    - Lessons learned
- Conclusion
II. Practices
8. On-Call
- Recap of “Being On-Call” Chapter of First SRE Book
- Example On-Call Setups Within Google and Outside Google
  - Google: Forming a New Team
    - Initial scenario
    - Training roadmap
    - Afterword
  - Evernote: Finding Our Feet in the Cloud
    - Moving our on-prem infrastructure to the cloud
    - Adjusting our on-call policies and processes
    - Restructuring our monitoring and metrics
    - Tracking our performance over time
    - Engaging with CRE
    - Sustaining a self-perpetuating cycle
- Practical Implementation Details
  - Anatomy of Pager Load
    - Scenario: A team in overload
    - Pager load inputs
      - Preexisting bugs
      - New bugs
      - Identification delay
      - Mitigation delay
      - Alerting
      - Rigor of follow-up
      - Data quality
      - Vigilance
  - On-Call Flexibility
    - Scenario: A change in personal circumstances
      - Automate on-call scheduling
      - Plan for short-term swaps
      - Plan for long-term breaks
      - Plan for part-time work schedules
  - On-Call Team Dynamics
    - Scenario: A culture of “survive the week”
      - Proposal one: Empower your ops engineers
      - Proposal two: Improve team relations
- Conclusion
9. Incident Response
- Incident Management at Google
  - Incident Command System
  - Main Roles in Incident Response
- Case Studies
  - Case Study 1: Software Bug - The Lights Are On but No One’s (Google) Home
    - Context
    - Incident
    - Review
  - Case Study 2: Service Fault - Cache Me If You Can
    - Context
    - Incident
    - Review
      - What went well?
      - What could have been handled better?
  - Case Study 3: Power Outage - Lightning Never Strikes Twice... Until It Does
    - Context
    - Incident
    - Review
  - Case Study 4: Incident Response at PagerDuty
    - Major incident response at PagerDuty
    - Tools used for incident response
- Putting Best Practices into Practice
  - Incident Response Training
  - Prepare Beforehand
    - Decide on a communication channel
    - Keep your audience informed
    - Prepare a list of contacts
    - Establish criteria for an incident
  - Drills
- Conclusion
10. Postmortem Culture: Learning from Failure
- Case Study
- Bad Postmortem
  - Why Is This Postmortem Bad?
    - Missing context
    - Key details omitted
    - Key action item characteristics missing
    - Counterproductive finger pointing
    - Animated language
    - Missing ownership
    - Limited audience
    - Delayed publication
- Good Postmortem
  - Why Is This Postmortem Better?
    - Clarity
    - Concrete action items
    - Blamelessness
    - Depth
    - Promptness
    - Conciseness
- Organizational Incentives
  - Model and Enforce Blameless Behavior
    - Use blameless language
    - Include all incident participants in postmortem authoring
    - Gather feedback
  - Reward Postmortem Outcomes
    - Reward action item closeout
    - Reward positive organizational change
    - Highlight improved reliability
    - Hold up postmortem owners as leaders
    - Gamification
  - Share Postmortems Openly
    - Share announcements across the organization
    - Conduct cross-team reviews
    - Hold training exercises
    - Report incidents and outages weekly
  - Respond to Postmortem Culture Failures
    - Avoiding association
    - Failing to reinforce the culture
    - Lacking time to write postmortems
    - Repeating incidents
- Tools and Templates
  - Postmortem Templates
    - Google’s template
    - Other industry templates
  - Postmortem Tooling
    - Postmortem creation
    - Postmortem checklist
    - Postmortem storage
    - Postmortem follow-up
    - Postmortem analysis
    - Other industry tools
- Conclusion
11. Managing Load
- Google Cloud Load Balancing
  - Anycast
    - Stabilized anycast
  - Maglev
  - Global Software Load Balancer
  - Google Front End
  - GCLB: Low Latency
  - GCLB: High Availability
  - Case Study 1: Pokémon GO on GCLB
    - Migrating to GCLB
    - Resolving the issue
    - Future-proofing
- Autoscaling
  - Handling Unhealthy Machines
  - Working with Stateful Systems
  - Configuring Conservatively
  - Setting Constraints
  - Including Kill Switches and Manual Overrides
  - Avoiding Overloading Backends
  - Avoiding Traffic Imbalance
- Combining Strategies to Manage Load
  - Case Study 2: When Load Shedding Attacks
    - What was happening?
    - What went wrong?
    - Lessons learned
- Conclusion
12. Introducing Non-Abstract Large System Design
- What Is NALSD?
- Why Non-Abstract?
- AdWords Example
  - Design Process
  - Initial Requirements
  - One Machine
    - Calculations
    - Evaluation
  - Distributed System
    - MapReduce
      - Evaluation
    - LogJoiner
      - Calculations
    - Sharded LogJoiner
      - Evaluation
    - Multidatacenter
      - Calculations
      - Evaluation
- Conclusion
13. Data Processing Pipelines
- Pipeline Applications
  - Event Processing/Data Transformation to Order or Structure Data
  - Data Analytics
  - Machine Learning
- Pipeline Best Practices
  - Define and Measure Service Level Objectives
    - Data freshness
    - Data correctness
    - Data isolation/load balancing
    - End-to-end measurement
  - Plan for Dependency Failure
  - Create and Maintain Pipeline Documentation
    - System diagrams
    - Process documentation
    - Playbook entries
  - Map Your Development Lifecycle
    - Prototyping
    - Testing with a 1% dry run
    - Staging
    - Canarying
    - Performing a partial deployment
    - Deploying to production
  - Reduce Hotspotting and Workload Patterns
  - Implement Autoscaling and Resource Planning
  - Adhere to Access Control and Security Policies
  - Plan Escalation Paths
- Pipeline Requirements and Design
  - What Features Do You Need?
  - Idempotent and Two-Phase Mutations
  - Checkpointing
  - Code Patterns
    - Reusing code
    - Using the microservice approach to creating pipelines
  - Pipeline Production Readiness
    - Pipeline maturity matrix
- Pipeline Failures: Prevention and Response
  - Potential Failure Modes
    - Delayed data
    - Corrupt data
  - Potential Causes
    - Pipeline dependencies
    - Pipeline application or configuration
    - Unexpected resource growth
    - Region-level outage
- Case Study: Spotify
  - Event Delivery
  - Event Delivery System Design and Architecture
    - Data collection
    - Extract Transform Load
    - Data delivery
  - Event Delivery System Operation
    - Timeliness
    - Skewness
    - Completeness
  - Customer Integration and Support
    - Documentation
    - System monitoring
    - Capacity planning
    - Development process
    - Incident handling
  - Summary
- Conclusion
14. Configuration Design and Best Practices
- What Is Configuration?
  - Configuration and Reliability
  - Separating Philosophy and Mechanics
- Configuration Philosophy
  - Configuration Asks Users Questions
  - Questions Should Be Close to User Goals
  - Mandatory and Optional Questions
  - Escaping Simplicity
- Mechanics of Configuration
  - Separate Configuration and Resulting Data
  - Importance of Tooling
    - Semantic validation
    - Configuration syntax
  - Ownership and Change Tracking
  - Safe Configuration Change Application
- Conclusion
15. Configuration Specifics
- Configuration-Induced Toil
- Reducing Configuration-Induced Toil
- Critical Properties and Pitfalls of Configuration Systems
  - Pitfall 1: Failing to Recognize Configuration as a Programming Language Problem
  - Pitfall 2: Designing Accidental or Ad Hoc Language Features
  - Pitfall 3: Building Too Much Domain-Specific Optimization
  - Pitfall 4: Interleaving Configuration Evaluation with Side Effects
  - Pitfall 5: Using an Existing General-Purpose Scripting Language Like Python, Ruby, or Lua
- Integrating a Configuration Language
  - Generating Config in Specific Formats
  - Driving Multiple Applications
- Integrating an Existing Application: Kubernetes
  - What Kubernetes Provides
  - Example Kubernetes Config
  - Integrating the Configuration Language
- Integrating Custom Applications (In-House Software)
- Effectively Operating a Configuration System
  - Versioning
  - Source Control
  - Tooling
  - Testing
- When to Evaluate Configuration
  - Very Early: Checking in the JSON
    - Pros
    - Cons
  - Middle of the Road: Evaluate at Build Time
    - Pros
    - Cons
  - Late: Evaluate at Runtime
    - Pros
    - Cons
- Guarding Against Abusive Configuration
- Conclusion
16. Canarying Releases
- Release Engineering Principles
- Balancing Release Velocity and Reliability
- What Is Canarying?
- Release Engineering and Canarying
  - Requirements of a Canary Process
  - Our Example Setup
- A Roll Forward Deployment Versus a Simple Canary Deployment
- Canary Implementation
  - Minimizing Risk to SLOs and the Error Budget
  - Choosing a Canary Population and Duration
- Selecting and Evaluating Metrics
  - Metrics Should Indicate Problems
  - Metrics Should Be Representative and Attributable
  - Before/After Evaluation Is Risky
  - Use a Gradual Canary for Better Metric Selection
- Dependencies and Isolation
- Canarying in Noninteractive Systems
- Requirements on Monitoring Data
- Related Concepts
  - Blue/Green Deployment
  - Artificial Load Generation
  - Traffic Teeing
- Conclusion
III. Processes
17. Identifying and Recovering from Overload
- From Load to Overload
- Case Study 1: Work Overload When Half a Team Leaves
  - Background
  - Problem Statement
  - What We Decided to Do
  - Implementation
  - Lessons Learned
- Case Study 2: Perceived Overload After Organizational and Workload Changes
  - Background
  - Problem Statement
  - What We Decided to Do
  - Implementation
    - Short-term actions
    - Mid-term actions
    - Long-term actions
  - Effects
  - Lessons Learned
- Strategies for Mitigating Overload
  - Recognizing the Symptoms of Overload
  - Reducing Overload and Restoring Team Health
    - Identify and alleviate psychosocial stressors
    - Prioritize and triage within one quarter
    - Protect yourself in the future
- Conclusion
18. SRE Engagement Model
- The Service Lifecycle
  - Phase 1: Architecture and Design
  - Phase 2: Active Development
  - Phase 3: Limited Availability
  - Phase 4: General Availability
  - Phase 5: Deprecation
  - Phase 6: Abandoned
  - Phase 7: Unsupported
- Setting Up the Relationship
  - Communicating Business and Production Priorities
  - Identifying Risks
  - Aligning Goals
  - Setting Ground Rules
  - Planning and Executing
- Sustaining an Effective Ongoing Relationship
  - Investing Time in Working Better Together
  - Maintaining an Open Line of Communication
  - Performing Regular Service Reviews
  - Reassessing When Ground Rules Start to Slip
  - Adjusting Priorities According to Your SLOs and Error Budget
  - Handling Mistakes Appropriately
    - Sleep on it
    - Meet in person (or as close to it as possible) to resolve issues
    - Be positive
    - Understand differences in communication
- Scaling SRE to Larger Environments
  - Supporting Multiple Services with a Single SRE Team
  - Structuring a Multiple SRE Team Environment
  - Adapting SRE Team Structures to Changing Circumstances
  - Running Cohesive Distributed SRE Teams
- Ending the Relationship
  - Case Study 1: Ares
  - Case Study 2: Data Analysis Pipeline
    - The pivot
    - Communication breakdown
    - Decommission
- Conclusion
19. SRE: Reaching Beyond Your Walls
- Truths We Hold to Be Self-Evident
  - Reliability Is the Most Important Feature
  - Your Users, Not Your Monitoring, Decide Your Reliability
  - If You Run a Platform, Then Reliability Is a Partnership
  - Everything Important Eventually Becomes a Platform
  - When Your Customers Have a Hard Time, You Have to Slow Down
  - You Will Need to Practice SRE with Your Customers
- How to: SRE with Your Customers
  - Step 1: SLOs and SLIs Are How You Speak
  - Step 2: Audit the Monitoring and Build Shared Dashboards
  - Step 3: Measure and Renegotiate
  - Step 4: Design Reviews and Risk Analysis
  - Step 5: Practice, Practice, Practice
  - Be Thoughtful and Disciplined
- Conclusion
20. SRE Team Lifecycles
- SRE Practices Without SREs
- Starting an SRE Role
  - Finding Your First SRE
  - Placing Your First SRE
  - Bootstrapping Your First SRE
  - Distributed SREs
- Your First SRE Team
  - Forming
    - Creating a new team as part of a major project
    - Assembling a horizontal SRE team
    - Converting a team in place
  - Storming
    - Risks and mitigations
      - New team as part of a major project
      - Horizontal SRE team
      - A team converted in place
  - Norming
  - Performing
    - Partnering on architecture
    - Self-regulating workload
- Making More SRE Teams
  - Service Complexity
    - Where to split
    - Pitfalls
  - SRE Rollout
  - Geographical Splits
    - Placement: How many time zones apart should the teams be?
    - People and projects: Seeding the team
    - Parity: Distributing Work Between Offices and Avoiding a Night Shift
    - Placement: What about having three shifts?
    - Timing: Should both halves of the team start at the same time?
    - Finance: Travel budget
    - Leadership: Joint ownership of a service
- Suggested Practices for Running Many Teams
  - Mission Control
  - SRE Exchange
  - Training
  - Horizontal Projects
  - SRE Mobility
  - Travel
  - Launch Coordination Engineering Teams
  - Production Excellence
  - SRE Funding and Hiring
- Conclusion
21. Organizational Change Management in SRE
- SRE Embraces Change
- Introduction to Change Management
  - Lewin’s Three-Stage Model
  - McKinsey’s 7-S Model
  - Kotter’s Eight-Step Process for Leading Change
  - The Prosci ADKAR Model
  - Emotion-Based Models
  - The Deming Cycle
  - How These Theories Apply to SRE
- Case Study 1: Scaling Waze - From Ad Hoc to Planned Change
  - Background
  - The Messaging Queue: Replacing a System While Maintaining Reliability
  - The Next Cycle of Change: Improving the Deployment Process
  - Lessons Learned
- Case Study 2: Common Tooling Adoption in SRE
  - Background
  - Problem Statement
  - What We Decided to Do
  - Design
  - Implementation: Monitoring
  - Lessons Learned
- Conclusion
Conclusion
- Onward
- The Future Belongs to the Past
- SRE + <Insert Other Discipline>
- Trickles, Streams, and Floods
- SRE Belongs to All of Us
- On Gratitude
A. Example SLO Document
- Service Overview
- SLIs and SLOs
- Rationale
- Error Budget
- Clarifications and Caveats
B. Example Error Budget Policy
- Service Overview
- Goals
- Non-Goals
- SLO Miss Policy
- Outage Policy
- Escalation Policy
- Background
C. Results of Postmortem Analysis
Index