Effective Monitoring and Alerting. For Web Operations - Helion

ebook

Author: Slawek Ligus
ISBN: 978-14-493-3348-5
Pages: 166, Format: ebook
Publication date: 2012-11-26
Bookstore: Helion

Book price: 59.42 zł (previously: 69.09 zł)
You save: 14% (-9.67 zł)


With this practical book, you’ll discover how to catch complications in your distributed system before they develop into costly problems. Based on his extensive experience in systems ops at large technology companies, author Slawek Ligus describes an effective data-driven approach for monitoring and alerting that enables you to maintain high availability and deliver a high quality of service.

Learn methods for measuring state changes and data flow in your system, and set up alerts to help you recover quickly from problems when they do arise. If you’re a system operator waging the daily battle to provide the best performance at the lowest cost, this book is for you.

- Monitor every component of your application stack, from the network to user experience
- Learn how to draw the right conclusions from the metrics you obtain
- Develop a robust alerting system that can identify problematic anomalies without raising false alarms
- Address system failures by their impact on resource utilization and user experience
- Plan an alerting configuration that scales with your expanding network
- Learn how to choose appropriate maintenance times automatically
- Develop a work environment that fosters flexibility and adaptability

Table of Contents

Effective Monitoring and Alerting
SPECIAL OFFER: Upgrade this ebook with O'Reilly
Preface
- Who Should Read This Book
- Conventions Used in This Book
- Using Code Examples
- Safari Books Online
- How to Contact Us
- Acknowledgements
1. Introduction
- Monitoring, Alerting, and What They Can Do for You
  - Early Problem Detection
    - Availability
    - Performance
  - Decision Making
    - Baselining
    - Predictions
  - Automation
    - Admission Control
    - Autonomic Computing
- Monitoring and Alerting in a Nutshell
  - Metrics and Timeseries
  - Alarms, Alerts, and Monitors
  - Monitoring System
  - The Process of Alerting
  - Issue Tracking
    - Tickets and queues
- The Challenges
- Important Terms
2. Monitoring
- The Building Blocks
  - Data Collection
  - Coverage
    - Resources
      - Network
      - Computational resources
    - Solution stack
      - Operating system
      - Middleware
      - Application
    - User experience
  - Metrics
    - Summary statistics
      - Frequency distribution and percentiles
      - Rate of change
    - Time granularity
    - Metric aggregation
  - Example: Inputs, Metrics, and Timeseries
  - Understanding Metrics
    - Type of unit
    - Data Collection Mode
    - Data Source
    - Number of Inputs per Data Point
    - Type of Quantity
  - Timeseries Patterns
- Drawing Conclusions from Timeseries Plots
  - Interpretation of Anomalies
    - Flow
    - Stock
    - Availability
    - Throughput
    - Applications of quantities
  - Frequently Encountered Anomalies
    - Flattening Effect
    - Warm-Up Effect
    - Regular Anomalies
    - Spikes During Troughs
  - Determining Causality
  - Capturing the Daily Cycle, Trends, and Seasonal Changes
3. Alerting
- The Challenge
- Prerequisites
  - Monitoring and Alerting Platform
  - Audit Trail
  - Issue Tracking
- Understanding Failure and Its Impact
  - Establishing Significance
  - Identifying Causes
- Anatomy of an Alarm
  - Boolean Function
    - Metric Monitor
      - Upper Limit
      - Lower Limit
      - Outside Range
      - Data Points Not Recorded
    - Time Evaluation
    - Another Alarm as Input Source
  - Suppression
  - Aggregation
- Case Study: A Data Pipeline
- Types of Alerts
- Setting Up Alarms
  - Identifying Impact
  - Establishing Severity
  - Picking the Right Timeseries
  - Configuring Monitors
    - Coming Up with a Threshold
      - Static thresholds
      - Data-driven thresholds
    - Breach and Clear Delay
  - Setting Up Alarms
  - Testing Alerting Configurations
- Alerting Suggestions
4. At Scale
- Implications of Scale
- Composition of Large-Scale Systems
- Commonalities of Large-Scale Alerting Configurations
- Monitoring Coverage
  - Reflecting Dimensions in Metrics
- Managing Large Alerting Configurations
  - Addressing the Problems
    - Organize alarms and monitors in a namespace
    - Calculate threshold values from metric data
    - Periodically refresh and clean up the configuration
  - Suggested Solution
    - Refresh intervals
      - Running the engine
      - Naming
      - Alarm creation and threshold calculation
      - Cleanup procedures
      - Writing Modules
      - Suppression
      - Extra Features
  - Result
5. Monitoring in System Automation
- Choosing Appropriate Maintenance Times Automatically
- Controlling the Rate of Upgrade
- Recovery-Oriented Admission Control
- Automated Deployment and Rollback
6. The Work Environment
- Keeping an Audit Trail
- Working with Tickets
  - Root Cause Analysis
    - The Five Whys
      - Extracting Categories
- Dealing with Anomalies
- Learning from Outages
- Using Checklists
- Creating Dashboards
- Service-Level Agreements
- Preventing the Ironies of Automation
- Culture
7. Measuring Success
- The Feedback Loop
  - Root Cause Classification
    - A Short Story of a Long Classifier List
  - Timing
- Ticket Reporting
  - Frequency of Incidence
  - Incidence Times
  - Time to Respond and Time to Resolution
- Measuring Detectability
  - False Positives and False Negatives
  - Precision and Recall
  - The F-Measure
- Transition to Automated Alarms
- Maintenance Overhead
- How (Not) to Measure
8. The Principles
- Get in the Habit of Measuring
- Draw Conclusions Reliably
- Monitor Extensively
- Alarm Selectively
- Work Smart, Not Hard
  - Learn from the Experience of Others
  - Have a Tactic
  - Run a Bank of Cases
  - Enjoy the Process
A. Setting Up OpenTSDB
- The Software
  - Architecture
  - Getting OpenTSDB
- First Steps
  - Starting TSD
  - Pushing Data
  - Input Tagging
    - Tag Wildcards
  - Temporal Aggregation
  - Summary Statistics
  - Rate of Change
- Gathering Data System-Wide
  - Running tcollector
  - Writing a Custom Collector
- Timeseries Plots
  - Plotting Tips
- Get Involved
About the Author
SPECIAL OFFER: Upgrade this ebook with O'Reilly
Copyright