Seeking SRE. Conversations About Running Production Systems at Scale - Helion

ebook

Author: David N. Blank-Edelman
ISBN: 978-14-919-7881-8
Pages: 596, Format: ebook
Publication date: 2018-08-21
Bookstore: Helion

Book price: 143.65 zł (previously: 167.03 zł)
You save: 14% (-23.38 zł)

Organizations big and small have started to realize just how crucial system and application reliability is to their business. They’ve also learned just how difficult it is to maintain that reliability while iterating at the speed demanded by the marketplace. Site Reliability Engineering (SRE) is a proven approach to this challenge.

SRE is a large and rich topic to discuss. Google led the way with Site Reliability Engineering, the wildly successful O’Reilly book that described Google’s creation of the discipline and the implementation that’s allowed them to operate at a planetary scale. Inspired by that earlier work, this book explores a very different part of the SRE space. The more than two dozen chapters in Seeking SRE bring you into some of the important conversations going on in the SRE world right now.

Listen as engineers and other leaders in the field discuss:

Different ways of implementing SRE and SRE principles in a wide variety of settings
How SRE relates to other approaches such as DevOps
Specialties on the cutting edge that will soon be commonplace in SRE
Best practices and technologies that make practicing SRE easier
The important but rarely explored human side of SRE

David N. Blank-Edelman is the book’s curator and editor.

Table of Contents

Introduction
- And So It Begins...
- Origin Story
- Voices
- Forward in All Directions!
- Acknowledgments
I. SRE Implementation
1. Context Versus Control in SRE
2. Interviewing Site Reliability Engineers
- Interviewing 101
  - Who Is Involved
  - Industry Versus University
  - Biases
  - The Funnel
- SRE Funnels
  - Phone Screens
    - Conducting a phone screen
  - The Onsite Interview
    - Coding and system questions
    - Deep dives and architecture questions
    - Cultural interviews
  - Take-Home Questions
  - Advice for Hiring Managers
    - Selling candidates
    - Walking away
- Final Thoughts on Interviewing SREs
- Further Reading
3. So, You Want to Build an SRE Team?
- Choose SRE for the Right Reasons
- Orienting to a Data-Driven Approach
- Commitment to SRE
- Making a Decision About SRE
4. Using Incident Metrics to Improve SRE at Scale
- The Virtuous Cycle to the Rescue: If You Don’t Measure It
- Metrics Review: If a Metric Falls in the Forest
- Surrogate Metrics
- Repair Debt
- Virtual Repair Debt: Exorcising the Ghost in the Machine
- Real-Time Dashboards: The Bread and Butter of SRE
- Learnings: TL;DR
- Further Reading
5. Working with Third Parties Shouldn’t Suck
- Build, Buy, or Adopt?
  - Establish Importance
  - Identify Stakeholders
  - Make a Decision
  - Acknowledge Reality
    - Is this a core competency?
    - Integration timeline?
    - Project Operating Expense and Abandonment Expense
- Third Parties as First-Class Citizens
  - When They’re Down, You’re Down
    - Direct impact
    - Indirect impact
  - Running the Black Box Like a Service
  - Service-Level Indicators, Service-Level Objectives, and SLAs
    - SLIs on black boxes
      - Polling API informs SLIs
      - Real-time data informs SLIs
      - Synthetic monitoring informs SLIs
      - RUM informs SLIs
    - SLOs
      - Negotiating SLAs with vendors
  - Playbook: From Staging to Production
    - Testing and staging
    - Monitoring
      - Uses for synthetic monitoring
      - Uses for RUM
    - Tooling
    - Automation
    - Logging
    - Disaster planning
    - Communication
    - Decommissioning
- Closing Thoughts
6. How to Apply SRE Principles Without Dedicated SRE Teams
- SREs to the Rescue! (and How They Failed)
  - A Matter of Scale in Terms of Headcount
  - The Embedded SRE
- You Build It, You Run It
  - The Deployment Platform
  - Closing the Loop: Take Your Own Pager
  - Introducing Production Engineering
- Some Implementation Details
  - Developers’ Productivity and Health Versus the Pager
  - Resolving Cross-Team Reliability Issues by Using Postmortems
  - Uniform Infrastructure and Tooling Versus Autonomy and Innovation
  - Getting Buy-In
- Conclusion
- Further Reading
7. SRE Without SRE: The Spotify Case Study
- Tabula Rasa: 2006-2007
  - Prelude
  - Key Learnings
- Beta and Release: 2008-2009
  - Prelude
  - Bringing Scalability and Reliability to the Forefront
  - Key Learnings
- The Curse of Success: 2010
  - Prelude
  - A New Ownership Model
    - The dev owner role
    - The ops owner role
  - Formalizing Core Services
  - Blessed Deployment Time Slots
  - On-Call and Alerting
    - Not completely pain-free
  - Spawning Off Internal Office Support
  - Addressing the Remaining Top Concerns
    - Long lead times
    - Unintentional specialization and misalignment
    - Interruptions
    - Introducing the goalie role
  - Creating Detectives
  - Key Learnings
- Pets and Cattle, and Agile: 2011
  - Prelude
  - Forming Bad Habits
  - Breaking Those Bad Habits
  - Key Learnings
- A System That Didn’t Scale: 2012
  - Prelude
  - Manual Work Hits a Cliff
  - Key Learnings
- Introducing Ops-in-Squads: 2013-2015
  - Prelude
    - Lightening the manual load
  - Building on Trust
  - Driving the Paradigm Shift
  - Key Learnings
- Autonomy Versus Consistency: 2015-2017
  - Prelude
  - Benefits
  - Trade-Offs
  - Key Learnings
- The Future: Speed at Scale, Safely
8. Introducing SRE in Large Enterprises
- Background
- Introducing SRE
  - Defining Current State
    - Start by defining the roles and responsibilities of traditional functions in the organization to understand the landscape
    - Prepare the business case: personalize and evaluate the cost of having engineering resources responsible for reliability
    - Prepare the business case: calculate cost of similar resources doing duplicate work
    - To establish a roadmap for what products SRE will be responsible for, survey the current infrastructure landscape
  - Identifying and Educating Stakeholders
    - Start having conversations with leaders and champions in the organization
    - Defining SRE
  - Presenting the Business Case
  - Implementing the SRE Team
    - Setting goals and defining metrics of success
    - Growing the team: insource or outsource?
    - Insourcing experienced talent: rotating engineering team members
    - SRE throughout the development cycle
    - Defining the role of supporting divisions
  - Lessons Learned
  - Sample Implementation Roadmap
- Closing Thoughts
- Further Reading
9. From SysAdmin to SRE in 8,963 Words
- Clarifying Terminology
  - Service-Level Indicator
  - SLA
  - Service-Level Objective
- Establishing SLAs for Internal Components
- Understanding External Dependencies
- Nontechnical Solutions
- Tracking Availability Level
- Dealing with Corner Cases
- Conclusion
10. Clearing the Way for SRE in the Enterprise
- Toil, the Enemy of SRE
- Toil in the Enterprise
- Silos, Queues, and Tickets
  - Silos Get in the Way
  - Ticket-Driven Request Queues Are Expensive
- Take Action Now
- Start by Leaning on Lean
- Get Rid of as Many Handoffs as Possible
- Replace Remaining Handoffs with Self-Service
  - Self-Service Is More Than a Button
  - Self-Service Helps SREs in Multiple Ways
  - Operations as a Service
- Error Budgets, Toil Limits, and Other Tools for Empowering Humans
  - Error Budgets
  - Toil Limits
  - Leverage Existing Enthusiasm for DevOps
  - Unify Backlogs and Protect Capacity
  - Psychological Safety and Human Factors
- Join the Movement
11. SRE Patterns Loved by DevOps People Everywhere
- Pattern 1: Birth of Automated Testing at Google
- Pattern 2: Launch and Handoff Readiness Review at Google
- Pattern 3: Create a Shared Source Code Repository
- Conclusion
- Further Reading and Source Material
12. DevOps and SRE: Voices from the Community
- Background
- Method
- Results
- Replies
13. Production Engineering at Facebook
II. Near Edge SRE
14. In the Beginning, There Was Chaos
- The Problem with Systems
- Economic Pillars of Complexity
- Beginning Chaos
- Navigating Complexity for Safety
- Chaos Goes Big
- Formalization
- Advanced Principles
- Frequently Asked Questions
- Conclusion
15. The Intersection of Reliability and Privacy
- The Intersection of Reliability and Privacy
- The General Landscape of Privacy Engineering
- Privacy and SRE: Common Approaches
  - Reducing Toil
    - Automation
    - Default behavior for shared architectures
    - Frameworks
  - Efficient and Deliberate Problem Solving
    - Solve challenges once
    - Find and address root causes
  - Relationship Management
  - Early Intervention and Education Through Evangelism
- Nuances, Differences, and Trade-Offs
- Conclusion
- Further Reading
16. Database Reliability Engineering
- Guiding Principles of the Database Reliability Engineer
  - Protect the Data
  - Self-Service for Scale
  - Databases Are Not Special
- A Culture of Database Reliability Engineering
- Recoverability
  - Considerations for Recovery
  - Anatomy of a Recovery Strategy
  - Building Block 1: Detection
    - User error
    - Application errors
    - Infrastructure services
    - Operating system and hardware errors
  - Building Block 2: Diverse Storage
    - Online, high-performance storage
    - Online, low-performance storage
    - Offline storage
    - Object storage
  - Building Block 3: A Varied Toolbox
    - Full physical backups
    - Incremental physical backups
    - Full and incremental logical backups
    - Object stores
  - Building Block 4: Testing
  - Championing Recovery Reliability
- Continuous Delivery: From Development to Production
  - Education and Collaboration
    - Architecture
    - Data model
    - Best practices and standards
    - Tools
- Collaboration
- Deployment
  - Migrations and Versioning
  - Impact Analysis
  - Migration Patterns
    - Migration testing
    - Rollback testing
  - Championing CD
- Making the Case for DBRE
- Further Reading
17. Engineering for Data Durability
- Replication Is Table Stakes
  - Backups
    - Restoration
    - Freshness
  - Replication
    - Estimating durability
- Real-World Durability
  - Isolation
    - Physical isolation
    - Logical isolation
    - Operational isolation
- Protection
  - Testing
  - Safeguards
  - Recovery
- Verification
  - The Power of Zero
  - Verification Coverage
    - Disk Scrubber
    - Index Scanner
    - Storage Watcher
  - Watching the Watchers
- Automation
  - Window of Vulnerability
  - Operator Fatigue
  - Reliability
- Conclusion
18. Introduction to Machine Learning for SRE
- Why Use Machine Learning for SRE?
- Why and How Should My Company Be Engaging in This?
  - Some SRE Problems Machine Learning Can Help Solve
- The Awakening of Applied AI
- What Is Machine Learning?
  - What Do We Mean by Learning?
  - From Chess to Go: How Deep Can We Dive?
  - Why Now? What Changed for Us?
- What Are Neural Networks?
  - Neurons and Neural Networks
  - How and When Should We Apply Neural Networks?
  - What Kinds of Data Can We Use?
- Practical Machine Learning
  - Popular Libraries for Neural Networks
  - Practical Machine Learning Examples
    - Installing Python, IPython, and Jupyter Notebook
    - Decision trees
    - A neural network from scratch
    - Using TensorFlow and TensorBoard
    - Time series: server requests waiting
- Success Stories
- Further Reading
  - My GitHub Repository
  - Recommended Books
III. SRE Best Practices and Technologies
19. Do Docs Better: Integrating Documentation into the Engineering Workflow
- Defining Quality: What Do Good Docs Look Like?
  - Functional Requirements for SRE Documentation
    - Service overviews
    - Playbooks
    - Postmortems
    - Policies
    - SLAs
    - Defining success metrics
- Integrating Docs into the Engineering Workflow
  - The Google Experience: g3doc and EngPlay
  - What We Learned
    - Where possible, documentation should live in source control, alongside its associated code
    - Pick the simplest markup language that supports your needs
    - Integrations are key to adoption
- Doing Docs Better: Best Practices
  - Create Templates for Each Documentation Type
  - Better > Best: Set Realistic Standards for Quality
  - Require Docs as Part of Code Review
  - Ruthlessly Prune Your Docs
  - Recognize and Reward Documentation
- Communicating the Value of Documentation
- Further Reading
20. Active Teaching and Learning
- Active Learning
  - Active Learning Example: Wheel of Misfortune
  - Active Learning Example: Incident Manager (a Card Game)
  - Active Learning Example: SRE Classroom
- The Costs of Failing to Learn
- Learning Habits of Effective SRE Teams
  - Production Meetings
  - Postmortems
- A Call to Action: Ditch the Boring Slides
21. The Art and Science of the Service-Level Objective
- Why Set Goals?
- Availability
  - Time Quanta
  - Transactions
  - Transactions over Time Quanta
- On Evaluating SLOs
- Histograms
- Where Percentiles Fall Down (and Histograms Step Up)
- Parting Thought: Looking at SLOs Upside Down
- Further Reading
22. SRE as a Success Culture
- Where Did SRE Come From?
- Key Values for SRE
  - Keeping the Site Up
    - Isolated failure domains
    - Redundant systems
    - Graduated degradation
  - Empowering Teams to Do the Right Thing
  - Approaching Operations as an Engineering Problem
  - Achieving Business Success Through Promises (Service Levels)
    - Progression in Service-Level Execution
- Critical Enabling Functions of SRE
  - Monitoring, Metrics, and KPIs
  - Incident Management and Emergency Response
  - Capacity Planning and Demand Forecasting
  - Performance Analysis and Optimization
  - Provisioning, Change Management, and Velocity
- Phases of SRE Execution
  - Phase 1: Firefighting/Reactive
  - Phase 2: Gatekeepers
  - Phase 3: Advocates/Partners
  - Phase 4: Catalytic
  - Complications of Differing Phases
- Focus on the Details of Success
- Further Reading
23. SRE Antipatterns
- Antipattern 1: Site Reliability Operations
- Antipattern 2: Humans Staring at Screens
- Antipattern 3: Mob Incident Response
- Antipattern 4: Root Cause = Human Error
- Antipattern 5: Passing the Pager
- Antipattern 6: Magic Smoke Jumping!
- Antipattern 7: Alert Reliability Engineering
- Antipattern 8: Hiring a Dog-Walker to Tend Your Pets
- Antipattern 9: Speed-Bump Engineering
- Antipattern 10: Design Chokepoints
- Antipattern 11: Too Much Stick, Not Enough Carrot
- Antipattern 12: Postponing Production
- Antipattern 13: Optimizing Failure Avoidance Rather Than Recovery Time (MTTF > MTTR)
- Antipattern 14: Dependency Hell
- Antipattern 15: Ungainly Governance
- Antipattern 16: Ill-Considered SLOh-Ohs
- Antipattern 17: Tossing Your API Over the Firewall
- Antipattern 18: Fixing the Ops Team
- So, That’s It, Then?
24. Immutable Infrastructure and SRE
- Scalability, Reliability, and Performance
- Failure Recovery
- Simpler Operations
- Faster Startup Times
- Known State
- Continuous Integration/Continuous Deployment with Confidence
- Security
- Multiregion Operations
- Release Engineering
- Building the Base Image
- Deploying Applications
- Disadvantages
- Conclusion
25. Scriptable Load Balancers
- Scriptable Load Balancers: The New Kid on the Block
  - Why Scriptable Load Balancers?
- Making the Difficult Easy
  - Shard-Aware Routing
    - Routing requests with DNS
    - Routing queries in the application
    - Routing requests in the application
    - Routing requests with a scriptable load balancer
  - Harnessing Potential
  - Case Study: Intermission
- Service-Level Middleware
  - Middleware to the Rescue
  - APIs of Service-Level Middleware
  - Case Study: WAF/Bot Mitigation
- Avoiding Disaster
  - Getting Clever with State
  - Case Study: Checkout Queue
- Looking to the Future and Further Reading
26. The Service Mesh: Wrangler of Your Microservices?
- Ready to Get Rid of the Monolith?
- Current State of Microservice Networking
- Service Mesh to the Rescue
  - The Benefits of a Sidecar Proxy
  - Eventually Consistent Service Discovery
  - Observability and Alarming
  - Sidecar Performance Implications
  - Thin Libraries and Context Propagation
  - Configuration Management (Control Plane Versus Data Plane)
- The Service Mesh in Practice
  - The Origin and Development of Envoy at Lyft
  - Operating Envoy at Lyft
    - Operational learnings
    - Development learnings
    - Technical learnings
- The Future of the Service Mesh
- Further Reading
IV. The Human Side of SRE
27. Psychological Safety in SRE
- The Primary Indicator of a Successful Team
  - How to Build Psychological Safety into Your Own Team
    - Make respect part of your team’s culture
    - Make space for people to take chances
    - Make it obvious when your team is doing well
    - Make your communication clear and your expectations explicit
    - Make your team feel safe
    - Why are operations teams more likely to feel unsafe than other engineering teams?
      - We love interrupts and the torrents of information
      - On-call and operations
      - Cognitive overload
      - Imaginary expectations
      - Operations teams are bad at estimating their level of psychological safety
- Further Reading
28. SRE Cognitive Work
- Introduction
- What Do SRE People Do?
- Why Should We Care About Practitioner Cognition?
  - Critical Decisions Made Under Uncertainty and Time Pressure Cannot Be Scripted
  - Human Performance in Modern Complex Systems: The Main Themes
- Observations on SRE Cognitive Work Around Incidents
  - Every Incident Could Have Been Worse
  - Sacrifice Decisions Take Place Under Uncertainty
  - Repairs to Functional Systems
  - Special Knowledge About Complex Systems
  - Managing the Costs of Coordination
    - Classification schemes
    - Formal role assignments
  - SREs Are Cognitive Agents Working in a Joint Cognitive System
- The Calibration Problem
  - Mental Models
  - Incidents Trigger Individual Recalibration
  - Incidents Are Opportunities for Collective Recalibration
- What Are the Implications of All This?
  - Incidents Will Continue
  - Incidents Will Impose Costs
  - Incident Patterns Will Change
  - Incidents Point to Specific Calibration Problems and Locations
- What Should Happen Next?
  - Build a Corpus of Cases
  - Focus on Making Automation a Team Player in SRE Work
  - Address the Calibration Problem
- What Can You Do?
- Conclusion
- References
29. Beyond Burnout
- Defining Mental Disorders
- Mental Disorders Are Missing from the Diversity Conversation
- Sanity Isn’t a Business Requirement
- Thoughts and Prayers Aren’t Scalable
- Full-Stack Inclusivity
  - Application
  - Interviewing
  - Compensation
  - Benefits
  - Onboarding
  - Working Conditions
  - Job Duties
  - Training
  - Promotion
  - Leaving
- Inclusivity for Anyone Helps Everyone
- Mental Disorder Resources
30. Against On-Call: A Polemic
- The Rationale for On-Call
  - First, Do No Harm
  - Parallels with SRE
  - Differences with SRE
  - Underlying Assumptions Driving On-Call for Engineers
  - On-Call Is Emergency Medicine Instead of Ward Medicine
  - Counterarguments
- The Cost to Humans of Doing On-Call
  - We don’t need another hero
- Actual Solutions
  - Training
  - Prioritization
    - Accommodations
    - Compensation
    - Flexible schedules
    - Recovery
    - Exclusion backlash
  - Improving On-the-Job Performance
    - Cognitive hacks
- We Need a Fundamental Change in Approach
  - Strong-Anti-On-Call
  - Weak-Anti-On-Call
  - A Union of the Two
- Conclusion
31. Elegy for Complex Systems
- The Computer and Human Systems Cannot Be Separated
- Decoherence and Cascading Failure
- Always in a State of Partial Failure
- Novelty Priority Inversion
- Nobody Anticipates the Overhead of Coordination
- Your healthcare.gov Is Out There
  - To Get Involved
- Further Reading
32. Intersections Between Operations and Social Activism
- Before, During, After
  - Creating the Perfect Plan
  - Principles of Organizing
    - Principles 1 and 2 (interfaces and incident command)
    - Principles 3 and 4 (blameless retrospectives and psychological safety)
  - Managing Crisis: Responding When Things Break Down
    - Handling chaos: contrast in responses during the July 8 KKK rally
    - Preparing for the worst: handling terror at Unite the Right
    - The corollary to trust is forgiveness
  - Writing Our Own History: Making Sense of What Went Down
    - Charlottesville in review: assigning and avoiding blame
    - Beyond culpability: building capacity instead of assigning blame
- The Long Tail: Turning Action into Change
  - Activism and Change Within a Company
- Conclusion
33. Conclusion
Index