Designing Data-Intensive Applications. The Big Ideas Behind Reliable, Scalable, and Maintainable Systems - Helion
ISBN: 9781491903100
Pages: 616, Format: ebook
Publication date: 2017-03-16
Bookstore: Helion
Price: 194,65 zł (previously: 226,34 zł)
You save: 14% (-31,69 zł)
Data is at the center of many challenges in system design today. Difficult issues need to be figured out, such as scalability, consistency, reliability, efficiency, and maintainability. In addition, we have an overwhelming variety of tools, including relational databases, NoSQL datastores, stream or batch processors, and message brokers. What are the right choices for your application? How do you make sense of all these buzzwords?
In this practical and comprehensive guide, author Martin Kleppmann helps you navigate this diverse landscape by examining the pros and cons of various technologies for processing and storing data. Software keeps changing, but the fundamental principles remain the same. With this book, software engineers and architects will learn how to apply those ideas in practice, and how to make full use of data in modern applications.
- Peer under the hood of the systems you already use, and learn how to use and operate them more effectively
- Make informed decisions by identifying the strengths and weaknesses of different tools
- Navigate the trade-offs around consistency, scalability, fault tolerance, and complexity
- Understand the distributed systems research upon which modern databases are built
- Peek behind the scenes of major online services, and learn from their architectures
Table of contents
- Preface
- Who Should Read This Book?
- Scope of This Book
- Outline of This Book
- References and Further Reading
- O'Reilly Safari
- How to Contact Us
- Acknowledgments
- I. Foundations of Data Systems
- 1. Reliable, Scalable, and Maintainable Applications
- Thinking About Data Systems
- Reliability
- Hardware Faults
- Software Errors
- Human Errors
- How Important Is Reliability?
- Scalability
- Describing Load
- Describing Performance
- Approaches for Coping with Load
- Maintainability
- Operability: Making Life Easy for Operations
- Simplicity: Managing Complexity
- Evolvability: Making Change Easy
- Summary
- 2. Data Models and Query Languages
- Relational Model Versus Document Model
- The Birth of NoSQL
- The Object-Relational Mismatch
- Many-to-One and Many-to-Many Relationships
- Are Document Databases Repeating History?
- The network model
- The relational model
- Comparison to document databases
- Relational Versus Document Databases Today
- Which data model leads to simpler application code?
- Schema flexibility in the document model
- Data locality for queries
- Convergence of document and relational databases
- Query Languages for Data
- Declarative Queries on the Web
- MapReduce Querying
- Graph-Like Data Models
- Property Graphs
- The Cypher Query Language
- Graph Queries in SQL
- Triple-Stores and SPARQL
- The semantic web
- The RDF data model
- The SPARQL query language
- The Foundation: Datalog
- Summary
- 3. Storage and Retrieval
- Data Structures That Power Your Database
- Hash Indexes
- SSTables and LSM-Trees
- Constructing and maintaining SSTables
- Making an LSM-tree out of SSTables
- Performance optimizations
- B-Trees
- Making B-trees reliable
- B-tree optimizations
- Comparing B-Trees and LSM-Trees
- Advantages of LSM-trees
- Downsides of LSM-trees
- Other Indexing Structures
- Storing values within the index
- Multi-column indexes
- Full-text search and fuzzy indexes
- Keeping everything in memory
- Transaction Processing or Analytics?
- Data Warehousing
- The divergence between OLTP databases and data warehouses
- Stars and Snowflakes: Schemas for Analytics
- Column-Oriented Storage
- Column Compression
- Memory bandwidth and vectorized processing
- Sort Order in Column Storage
- Several different sort orders
- Writing to Column-Oriented Storage
- Aggregation: Data Cubes and Materialized Views
- Summary
- 4. Encoding and Evolution
- Formats for Encoding Data
- Language-Specific Formats
- JSON, XML, and Binary Variants
- Binary encoding
- Thrift and Protocol Buffers
- Field tags and schema evolution
- Datatypes and schema evolution
- Avro
- The writer's schema and the reader's schema
- Schema evolution rules
- But what is the writer's schema?
- Dynamically generated schemas
- Code generation and dynamically typed languages
- The Merits of Schemas
- Modes of Dataflow
- Dataflow Through Databases
- Different values written at different times
- Archival storage
- Dataflow Through Services: REST and RPC
- Web services
- The problems with remote procedure calls (RPCs)
- Current directions for RPC
- Data encoding and evolution for RPC
- Message-Passing Dataflow
- Message brokers
- Distributed actor frameworks
- Summary
- II. Distributed Data
- 5. Replication
- Leaders and Followers
- Synchronous Versus Asynchronous Replication
- Setting Up New Followers
- Handling Node Outages
- Follower failure: Catch-up recovery
- Leader failure: Failover
- Implementation of Replication Logs
- Statement-based replication
- Write-ahead log (WAL) shipping
- Logical (row-based) log replication
- Trigger-based replication
- Problems with Replication Lag
- Reading Your Own Writes
- Monotonic Reads
- Consistent Prefix Reads
- Solutions for Replication Lag
- Multi-Leader Replication
- Use Cases for Multi-Leader Replication
- Multi-datacenter operation
- Clients with offline operation
- Collaborative editing
- Handling Write Conflicts
- Synchronous versus asynchronous conflict detection
- Conflict avoidance
- Converging toward a consistent state
- Custom conflict resolution logic
- What is a conflict?
- Multi-Leader Replication Topologies
- Leaderless Replication
- Writing to the Database When a Node Is Down
- Read repair and anti-entropy
- Quorums for reading and writing
- Limitations of Quorum Consistency
- Monitoring staleness
- Sloppy Quorums and Hinted Handoff
- Multi-datacenter operation
- Detecting Concurrent Writes
- Last write wins (discarding concurrent writes)
- The happens-before relationship and concurrency
- Capturing the happens-before relationship
- Merging concurrently written values
- Version vectors
- Summary
- 6. Partitioning
- Partitioning and Replication
- Partitioning of Key-Value Data
- Partitioning by Key Range
- Partitioning by Hash of Key
- Skewed Workloads and Relieving Hot Spots
- Partitioning and Secondary Indexes
- Partitioning Secondary Indexes by Document
- Partitioning Secondary Indexes by Term
- Rebalancing Partitions
- Strategies for Rebalancing
- How not to do it: hash mod N
- Fixed number of partitions
- Dynamic partitioning
- Partitioning proportionally to nodes
- Operations: Automatic or Manual Rebalancing
- Request Routing
- Parallel Query Execution
- Summary
- 7. Transactions
- The Slippery Concept of a Transaction
- The Meaning of ACID
- Atomicity
- Consistency
- Isolation
- Durability
- Single-Object and Multi-Object Operations
- Single-object writes
- The need for multi-object transactions
- Handling errors and aborts
- Weak Isolation Levels
- Read Committed
- No dirty reads
- No dirty writes
- Implementing read committed
- Snapshot Isolation and Repeatable Read
- Implementing snapshot isolation
- Visibility rules for observing a consistent snapshot
- Indexes and snapshot isolation
- Repeatable read and naming confusion
- Preventing Lost Updates
- Atomic write operations
- Explicit locking
- Automatically detecting lost updates
- Compare-and-set
- Conflict resolution and replication
- Write Skew and Phantoms
- Characterizing write skew
- More examples of write skew
- Phantoms causing write skew
- Materializing conflicts
- Serializability
- Actual Serial Execution
- Encapsulating transactions in stored procedures
- Pros and cons of stored procedures
- Partitioning
- Summary of serial execution
- Two-Phase Locking (2PL)
- Implementation of two-phase locking
- Performance of two-phase locking
- Predicate locks
- Index-range locks
- Serializable Snapshot Isolation (SSI)
- Pessimistic versus optimistic concurrency control
- Decisions based on an outdated premise
- Detecting stale MVCC reads
- Detecting writes that affect prior reads
- Performance of serializable snapshot isolation
- Summary
- 8. The Trouble with Distributed Systems
- Faults and Partial Failures
- Cloud Computing and Supercomputing
- Unreliable Networks
- Network Faults in Practice
- Detecting Faults
- Timeouts and Unbounded Delays
- Network congestion and queueing
- Synchronous Versus Asynchronous Networks
- Can we not simply make network delays predictable?
- Unreliable Clocks
- Monotonic Versus Time-of-Day Clocks
- Time-of-day clocks
- Monotonic clocks
- Clock Synchronization and Accuracy
- Relying on Synchronized Clocks
- Timestamps for ordering events
- Clock readings have a confidence interval
- Synchronized clocks for global snapshots
- Process Pauses
- Response time guarantees
- Limiting the impact of garbage collection
- Knowledge, Truth, and Lies
- The Truth Is Defined by the Majority
- The leader and the lock
- Fencing tokens
- Byzantine Faults
- Weak forms of lying
- System Model and Reality
- Correctness of an algorithm
- Safety and liveness
- Mapping system models to the real world
- Summary
- 9. Consistency and Consensus
- Consistency Guarantees
- Linearizability
- What Makes a System Linearizable?
- Relying on Linearizability
- Locking and leader election
- Constraints and uniqueness guarantees
- Cross-channel timing dependencies
- Implementing Linearizable Systems
- Linearizability and quorums
- The Cost of Linearizability
- The CAP theorem
- Linearizability and network delays
- Ordering Guarantees
- Ordering and Causality
- The causal order is not a total order
- Linearizability is stronger than causal consistency
- Capturing causal dependencies
- Sequence Number Ordering
- Noncausal sequence number generators
- Lamport timestamps
- Timestamp ordering is not sufficient
- Total Order Broadcast
- Using total order broadcast
- Implementing linearizable storage using total order broadcast
- Implementing total order broadcast using linearizable storage
- Distributed Transactions and Consensus
- Atomic Commit and Two-Phase Commit (2PC)
- From single-node to distributed atomic commit
- Introduction to two-phase commit
- A system of promises
- Coordinator failure
- Three-phase commit
- Distributed Transactions in Practice
- Exactly-once message processing
- XA transactions
- Holding locks while in doubt
- Recovering from coordinator failure
- Limitations of distributed transactions
- Fault-Tolerant Consensus
- Consensus algorithms and total order broadcast
- Single-leader replication and consensus
- Epoch numbering and quorums
- Limitations of consensus
- Membership and Coordination Services
- Allocating work to nodes
- Service discovery
- Membership services
- Summary
- III. Derived Data
- 10. Batch Processing
- Batch Processing with Unix Tools
- Simple Log Analysis
- Chain of commands versus custom program
- Sorting versus in-memory aggregation
- The Unix Philosophy
- A uniform interface
- Separation of logic and wiring
- Transparency and experimentation
- MapReduce and Distributed Filesystems
- MapReduce Job Execution
- Distributed execution of MapReduce
- MapReduce workflows
- Reduce-Side Joins and Grouping
- Example: analysis of user activity events
- Sort-merge joins
- Bringing related data together in the same place
- GROUP BY
- Handling skew
- Map-Side Joins
- Broadcast hash joins
- Partitioned hash joins
- Map-side merge joins
- MapReduce workflows with map-side joins
- The Output of Batch Workflows
- Building search indexes
- Key-value stores as batch process output
- Philosophy of batch process outputs
- Comparing Hadoop to Distributed Databases
- Diversity of storage
- Diversity of processing models
- Designing for frequent faults
- Beyond MapReduce
- Materialization of Intermediate State
- Dataflow engines
- Fault tolerance
- Discussion of materialization
- Graphs and Iterative Processing
- The Pregel processing model
- Fault tolerance
- Parallel execution
- High-Level APIs and Languages
- The move toward declarative query languages
- Specialization for different domains
- Summary
- 11. Stream Processing
- Transmitting Event Streams
- Messaging Systems
- Direct messaging from producers to consumers
- Message brokers
- Message brokers compared to databases
- Multiple consumers
- Acknowledgments and redelivery
- Partitioned Logs
- Using logs for message storage
- Logs compared to traditional messaging
- Consumer offsets
- Disk space usage
- When consumers cannot keep up with producers
- Replaying old messages
- Databases and Streams
- Keeping Systems in Sync
- Change Data Capture
- Implementing change data capture
- Initial snapshot
- Log compaction
- API support for change streams
- Event Sourcing
- Deriving current state from the event log
- Commands and events
- State, Streams, and Immutability
- Advantages of immutable events
- Deriving several views from the same event log
- Concurrency control
- Limitations of immutability
- Processing Streams
- Uses of Stream Processing
- Complex event processing
- Stream analytics
- Maintaining materialized views
- Search on streams
- Message passing and RPC
- Reasoning About Time
- Event time versus processing time
- Knowing when you're ready
- Whose clock are you using, anyway?
- Types of windows
- Stream Joins
- Stream-stream join (window join)
- Stream-table join (stream enrichment)
- Table-table join (materialized view maintenance)
- Time-dependence of joins
- Fault Tolerance
- Microbatching and checkpointing
- Atomic commit revisited
- Idempotence
- Rebuilding state after a failure
- Summary
- 12. The Future of Data Systems
- Data Integration
- Combining Specialized Tools by Deriving Data
- Reasoning about dataflows
- Derived data versus distributed transactions
- The limits of total ordering
- Ordering events to capture causality
- Batch and Stream Processing
- Maintaining derived state
- Reprocessing data for application evolution
- The lambda architecture
- Unifying batch and stream processing
- Unbundling Databases
- Composing Data Storage Technologies
- Creating an index
- The meta-database of everything
- Making unbundling work
- Unbundled versus integrated systems
- What's missing?
- Designing Applications Around Dataflow
- Application code as a derivation function
- Separation of application code and state
- Dataflow: Interplay between state changes and application code
- Stream processors and services
- Observing Derived State
- Materialized views and caching
- Stateful, offline-capable clients
- Pushing state changes to clients
- End-to-end event streams
- Reads are events too
- Multi-partition data processing
- Aiming for Correctness
- The End-to-End Argument for Databases
- Exactly-once execution of an operation
- Duplicate suppression
- Operation identifiers
- The end-to-end argument
- Applying end-to-end thinking in data systems
- Enforcing Constraints
- Uniqueness constraints require consensus
- Uniqueness in log-based messaging
- Multi-partition request processing
- Timeliness and Integrity
- Correctness of dataflow systems
- Loosely interpreted constraints
- Coordination-avoiding data systems
- Trust, but Verify
- Maintaining integrity in the face of software bugs
- Don't just blindly trust what they promise
- A culture of verification
- Designing for auditability
- The end-to-end argument again
- Tools for auditable data systems
- Doing the Right Thing
- Predictive Analytics
- Bias and discrimination
- Responsibility and accountability
- Feedback loops
- Privacy and Tracking
- Surveillance
- Consent and freedom of choice
- Privacy and use of data
- Data as assets and power
- Remembering the Industrial Revolution
- Legislation and self-regulation
- Summary
- Glossary
- Index