Designing Data-Intensive Applications. The Big Ideas Behind Reliable, Scalable, and Maintainable Systems. 2nd Edition - Helion

ebook

Authors: Martin Kleppmann, Chris Riccomini
ISBN: 9781098119027
Pages: 672, Format: ebook
Publication date: 2026-02-17
Bookstore: Helion

Book price: 203,15 zł (previously: 236,22 zł)
You save: 14% (-33,07 zł)

Tags: Data analysis

Data is at the center of many challenges in system design today. Difficult issues such as scalability, consistency, reliability, efficiency, and maintainability need to be resolved. In addition, there's an overwhelming variety of systems, including relational databases, NoSQL datastores, data warehouses, and data lakes. There are cloud services, on-premises services, and embedded databases. What are the right choices for your application? How do you make sense of all these buzzwords?

In this second edition, authors Martin Kleppmann and Chris Riccomini build on the foundation laid in the acclaimed first edition, integrating new technologies and emerging trends. You'll be guided through the maze of decisions and trade-offs involved in building a modern data system, learn how to choose the right tools for your needs, and understand the fundamentals of distributed systems.

- Peer under the hood of the systems you already use, and learn to use them more effectively
- Make informed decisions by identifying the strengths and weaknesses of different tools
- Learn how major cloud services are designed for scalability, fault tolerance, and consistency
- Understand the core principles upon which modern databases are built

Table of Contents

Preface
- Who Should Read This Book?
- What's New in the Second Edition?
- References and Further Reading
- Conventions Used in This Book
- O'Reilly Online Learning
- How to Contact Us
- Acknowledgments
1. Trade-Offs in Data Systems Architecture
- Operational Versus Analytical Systems
  - Characterizing Transaction Processing and Analytics
  - Data Warehousing
    - From data warehouse to data lake
    - Beyond the data lake
  - Systems of Record and Derived Data
- Cloud Versus Self-Hosting
  - Pros and Cons of Cloud Services
  - Cloud Native System Architecture
    - Layering of cloud services
    - Separation of storage and compute
  - Operations in the Cloud Era
- Distributed Versus Single-Node Systems
  - Problems with Distributed Systems
  - Microservices and Serverless
  - Cloud Computing Versus Supercomputing
- Data Systems, Law, and Society
- Summary
2. Defining Nonfunctional Requirements
- Case Study: Social Network Home Timelines
  - Representing Users, Posts, and Follows
  - Materializing and Updating Timelines
- Describing Performance
  - Latency and Response Time
  - Average, Median, and Percentiles
  - Use of Response Time Metrics
- Reliability and Fault Tolerance
  - Fault Tolerance
  - Hardware and Software Faults
    - Tolerating hardware faults through redundancy
    - Software faults
  - Humans and Reliability
- Scalability
  - Understanding Load
  - Shared-Memory, Shared-Disk, and Shared-Nothing Architectures
  - Principles for Scalability
- Maintainability
  - Operability: Making Life Easy for Operations
  - Simplicity: Managing Complexity
  - Evolvability: Making Change Easy
- Summary
3. Data Models and Query Languages
- Relational Versus Document Models
  - The Object-Relational Mismatch
    - Object-relational mapping
    - The document data model for one-to-many relationships
  - Normalization, Denormalization, and Joins
    - Trade-offs of normalization
    - Denormalization in the social networking case study
  - Many-to-One and Many-to-Many Relationships
  - Stars and Snowflakes: Schemas for Analytics
  - When to Use Which Model
    - Schema flexibility in the document model
    - Data locality for reads and writes
    - Query languages for documents
    - Convergence of document and relational databases
- Graph-Like Data Models
  - Property Graphs
  - The Cypher Query Language
  - Graph Queries in SQL
  - Triple Stores and SPARQL
    - The RDF data model
    - The SPARQL query language
  - Datalog: Recursive Relational Queries
  - GraphQL
- Event Sourcing and CQRS
- DataFrames, Matrices, and Arrays
- Summary
4. Storage and Retrieval
- Storage and Indexing for OLTP
  - Log-Structured Storage
    - The SSTable file format
    - Constructing and merging SSTables
    - Bloom filters
    - Compaction strategies
  - B-Trees
    - Making B-trees reliable
    - Using B-tree variants
  - Comparing B-Trees and LSM-Trees
    - Read performance
    - Sequential versus random writes
    - Write amplification
    - Disk space usage
  - Multicolumn and Secondary Indexes
  - Storing Values Within the Index
  - Keeping Everything in Memory
- Data Storage for Analytics
  - Cloud Data Warehouses
  - Column-Oriented Storage
    - Column compression
    - Sort order in column storage
    - Writing to column-oriented storage
  - Query Execution: Compilation and Vectorization
  - Materialized Views and Data Cubes
- Multidimensional and Full-Text Indexes
  - Full-Text Search
  - Vector Embeddings
- Summary
5. Encoding and Evolution
- Formats for Encoding Data
  - Language-Specific Formats
  - JSON, XML, and Binary Variants
    - JSON Schema
    - Binary encodings
  - Protocol Buffers
    - Field tags and schema evolution
  - Avro
    - The writer's schema and the reader's schema
    - Schema evolution rules
    - But what is the writer's schema?
    - Dynamically generated schemas
  - The Merits of Schemas
- Modes of Dataflow
  - Dataflow Through Databases
    - Different values written at different times
    - Archival storage
  - Dataflow Through Services: REST and RPC
    - Web services
    - The problems with remote procedure calls
    - Load balancers, service discovery, and service meshes
    - Data encoding and evolution for RPC
  - Durable Execution and Workflows
  - Event-Driven Architectures
    - Message brokers
    - Distributed actor frameworks
- Summary
6. Replication
- Single-Leader Replication
  - Synchronous Versus Asynchronous Replication
  - Setting Up New Followers
  - Handling Node Outages
    - Follower failure: Catch-up recovery
    - Leader failure: Failover
  - Implementation of Replication Logs
    - Statement-based replication
    - Write-ahead log shipping
    - Logical (row-based) log replication
  - Problems with Replication Lag
    - Reading your own writes
    - Monotonic reads
    - Consistent prefix reads
  - Solutions for Replication Lag
- Multi-Leader Replication
  - Geographically Distributed Operation
    - Multi-leader replication topologies
    - Problems with different topologies
  - Sync Engines and Local-First Software
    - Real-time collaboration, offline-first, and local-first apps
    - Pros and cons of sync engines
  - Dealing with Conflicting Writes
    - Conflict avoidance
    - Last write wins (discarding concurrent writes)
    - Manual conflict resolution
    - Automatic conflict resolution
    - Conflict-free replicated datatypes and operational transformation
    - Types of conflict
- Leaderless Replication
  - Writing to the Database When a Node Is Down
    - Catching up on missed writes
    - Using quorums for reading and writing
    - Understanding the limitations of quorum consistency
    - Monitoring staleness
  - Single-Leader Versus Leaderless Replication Performance
  - Multi-Region Operation
  - Detecting Concurrent Writes
    - The happens-before relation and concurrency
    - Capturing the happens-before relationship
    - Version vectors
- Summary
7. Sharding
- Pros and Cons of Sharding
- Sharding for Multitenancy
- Sharding of Key-Value Data
  - Sharding by Key Range
    - Rebalancing key-range sharded data
  - Sharding by Hash of Key
    - Hash modulo number of nodes
    - Fixed number of shards
    - Sharding by hash range
    - Consistent hashing
  - Skewed Workloads and Relieving Hot Spots
  - Operations: Automatic Versus Manual Rebalancing
- Request Routing
- Sharding and Secondary Indexes
  - Local Secondary Indexes
  - Global Secondary Indexes
- Summary
8. Transactions
- What Exactly Is a Transaction?
  - The Meaning of ACID
    - Atomicity
    - Consistency
    - Isolation
    - Durability
  - Single-Object and Multi-Object Operations
    - Single-object writes
    - The need for multi-object transactions
    - Handling errors and aborts
- Weak Isolation Levels
  - Read Committed
    - No dirty reads
    - No dirty writes
    - Implementing read-committed
  - Snapshot Isolation and Repeatable Read
    - Multiversion concurrency control
    - Visibility rules for observing a consistent snapshot
    - Indexes and snapshot isolation
    - Snapshot isolation, repeatable read, and naming confusion
  - Preventing Lost Updates
    - Atomic write operations
    - Explicit locking
    - Automatically detecting lost updates
    - Conditional writes (compare-and-set)
    - Conflict resolution and replication
  - Write Skew and Phantoms
    - Characterizing write skew
    - More examples of write skew
    - Phantoms causing write skew
    - Materializing conflicts
- Serializability
  - Actual Serial Execution
    - Encapsulating transactions in stored procedures
    - Pros and cons of stored procedures
    - Sharding
    - Summary of serial execution
  - Two-Phase Locking
    - Implementation of 2PL
    - Performance of 2PL
    - Predicate locks
    - Index-range locks
  - Serializable Snapshot Isolation
    - Pessimistic versus optimistic concurrency control
    - Decisions based on an outdated premise
    - Detection of stale MVCC reads
    - Detection of writes that affect prior reads
    - Performance of serializable snapshot isolation
- Distributed Transactions
  - Two-Phase Commit
    - A system of promises
    - Coordinator failure
    - Three-phase commit
  - Distributed Transactions Across Different Systems
    - Exactly-once message processing
    - XA transactions
    - Holding locks while in doubt
    - Recovering from coordinator failure
    - Problems with XA transactions
  - Database-Internal Distributed Transactions
  - Exactly-Once Message Processing Revisited
- Summary
9. The Trouble with Distributed Systems
- Faults and Partial Failures
- Unreliable Networks
  - The Limitations of TCP
  - Network Faults in Practice
  - Fault Detection
  - Timeouts and Unbounded Delays
    - Network congestion and queueing
    - Variability of network delays
  - Synchronous Versus Asynchronous Networks
    - Can we not simply make network delays predictable?
    - Combining circuit switching and packet switching
- Unreliable Clocks
  - Monotonic Versus Time-of-Day Clocks
    - Time-of-day clocks
    - Monotonic clocks
  - Clock Synchronization and Accuracy
  - Relying on Synchronized Clocks
    - Timestamps for ordering events
    - Clock readings with a confidence interval
    - Synchronized clocks for global snapshots
  - Process Pauses
    - Providing response time guarantees
    - Limiting the impact of garbage collection
- Knowledge, Truth, and Lies
  - The Majority Rules
  - Distributed Locks and Leases
    - Fencing off zombies and delayed requests
    - Fencing with multiple replicas
  - Byzantine Faults
    - Uses of Byzantine fault tolerance
    - Weak forms of lying
  - System Model and Reality
    - Defining the correctness of an algorithm
    - Distinguishing between safety and liveness
    - Mapping system models to the real world
  - Formal Methods and Randomized Testing
    - Model checking and specification languages
    - Fault injection
    - Deterministic simulation testing
- Summary
10. Consistency and Consensus
- Linearizability
  - What Makes a System Linearizable?
  - Relying on Linearizability
    - Locking and leader election
    - Constraints and uniqueness guarantees
    - Cross-channel timing dependencies
  - Implementing Linearizable Systems
  - The Cost of Linearizability
    - The CAP theorem
    - Linearizability and network delays
- ID Generators and Logical Clocks
  - Logical Clocks
    - Lamport timestamps
    - Hybrid logical clocks
    - Lamport/hybrid logical clocks versus vector clocks
  - Linearizable ID Generators
    - Implementing a linearizable ID generator
    - Enforcing constraints using logical clocks
- Consensus
  - The Many Faces of Consensus
    - Single-value consensus
    - Compare-and-set as consensus
    - Shared logs as consensus
    - Fetch-and-add as consensus
    - Atomic commitment as consensus
  - Consensus in Practice
    - Using shared logs
    - From single-leader replication to consensus
    - Subtleties of consensus
    - Pros and cons of consensus
  - Coordination Services
    - Allocating work to nodes
    - Service discovery
- Summary
11. Batch Processing
- Batch Processing with Unix Tools
  - Simple Log Analysis
  - Chain of Commands Versus Custom Program
  - Sorting Versus In-Memory Aggregation
- Batch Processing in Distributed Systems
  - Distributed Filesystems
  - Object Stores
  - Distributed Job Orchestration
    - Resource allocation
    - Scheduling workflows
    - Handling faults
- Batch Processing Models
  - MapReduce
  - Dataflow Engines
  - Shuffling Data
  - Joins and Grouping
  - Query Languages
  - DataFrames
- Batch Use Cases
  - Extract-Transform-Load
  - Analytics
  - Machine Learning
  - Serving Derived Data
- Summary
12. Stream Processing
- Transmitting Event Streams
  - Messaging Systems
    - Direct messaging from producers to consumers
    - Message brokers
    - Message brokers compared to databases
    - Multiple consumers
    - Acknowledgments and redelivery
  - Log-Based Message Brokers
    - Using logs for message storage
    - Logs compared to traditional messaging
    - Consumer offsets
    - Disk space usage
    - When consumers cannot keep up with producers
    - Replaying old messages
- Databases and Streams
  - Keeping Systems in Sync
  - Change Data Capture
    - Implementing CDC
    - Initial snapshot
    - Log compaction
    - API support for change streams
    - CDC versus event sourcing
  - State, Streams, and Immutability
    - Advantages of immutable events
    - Deriving several views from the same event log
    - Concurrency control
    - Limitations of immutability
- Processing Streams
  - Uses of Stream Processing
    - Complex event processing
    - Stream analytics
    - Maintaining materialized views
    - Search on streams
    - Event-driven architectures and RPC
  - Reasoning About Time
    - Event time versus processing time
    - Handling straggler events
    - Whose clock are you using, anyway?
    - Types of windows
  - Stream Joins
    - Stream-stream join (window join)
    - Stream-table join (stream enrichment)
    - Table-table join (materialized view maintenance)
    - Time dependence of joins
  - Fault Tolerance
    - Microbatching and checkpointing
    - Atomic commit revisited
    - Idempotence
    - Rebuilding state after a failure
- Summary
13. A Philosophy of Streaming Systems
- Data Integration
  - Combining Specialized Tools by Deriving Data
    - Reasoning about dataflows
    - Derived data versus distributed transactions
    - The limits of total ordering
    - Ordering events to capture causality
  - Batch and Stream Processing
    - Maintaining derived state
    - Reprocessing data for application evolution
    - Unifying batch and stream processing
- Unbundling Databases
  - Composing Data Storage Technologies
    - Creating an index
    - The meta-database of everything
    - Making unbundling work
    - Unbundled versus integrated systems
  - Designing Applications Around Dataflow
    - Application code as a derivation function
    - Separation of application code and state
    - Dataflow: Interplay between state changes and application code
    - Stream processors and services
  - Observing Derived State
    - Materialized views and caching
    - Stateful, offline-capable clients
    - Pushing state changes to clients
    - End-to-end event streams
    - Reads are events too
    - Multishard data processing
- Aiming for Correctness
  - The End-to-End Argument for Databases
    - Exactly-once execution of an operation
    - Duplicate suppression
    - Uniquely identifying requests
    - The end-to-end argument
    - Applying end-to-end thinking in data systems
  - Enforcing Constraints
    - Uniqueness constraints require consensus
    - Uniqueness in log-based messaging
    - Multishard request processing
  - Timeliness and Integrity
    - Correctness of dataflow systems
    - Loosely interpreted constraints
    - Coordination-avoiding data systems
  - Trust, but Verify
    - Maintaining integrity in the face of software bugs
    - Don't just blindly trust what they promise
    - Designing for auditability
    - The end-to-end argument again
    - Tools for auditable data systems
- Summary
14. Doing the Right Thing
- Predictive Analytics
  - Bias and Discrimination
  - Responsibility and Accountability
  - Feedback Loops
- Privacy and Tracking
  - Surveillance
  - Consent and Freedom of Choice
  - Privacy and Use of Data
  - Data as Assets and Power
  - Remembering the Industrial Revolution
  - Legislation and Self-Regulation
- Summary
Glossary
Index