Stream Processing with Apache Spark. Mastering Structured Streaming and Spark Streaming - Helion

ebook

Autor: Gerard Maas, Francois Garillot
ISBN: 978-14-919-4419-6
stron: 452, Format: ebook
Data wydania: 2019-06-05
Księgarnia: Helion

Cena książki: 203,15 zł (poprzednio: 236,22 zł)
Oszczędzasz: 14% (-33,07 zł)

Osoby, które kupiły tę książkę, wybierały także »

Tagi: Apache

Before you can build analytics tools to gain quick insights, you first need to know how to process data in real time. With this practical guide, developers familiar with Apache Spark will learn how to put this in-memory framework to use for streaming data. You’ll discover how Spark enables you to write streaming jobs in almost the same way you write batch jobs.

Authors Gerard Maas and François Garillot help you explore the theoretical underpinnings of Apache Spark. This comprehensive guide features two sections that compare and contrast the streaming APIs Spark now supports: the original Spark Streaming library and the newer Structured Streaming API.

Learn fundamental stream processing concepts and examine different streaming architectures
Explore Structured Streaming through practical examples; learn different aspects of stream processing in detail
Create and operate streaming jobs and applications with Spark Streaming; integrate Spark Streaming with other Spark APIs
Learn advanced Spark Streaming techniques, including approximation algorithms and machine learning algorithms
Compare Apache Spark to other stream processing projects, including Apache Storm, Apache Flink, and Apache Kafka Streams

Osoby które kupowały "Stream Processing with Apache Spark. Mastering Structured Streaming and Spark Streaming", wybierały także:

Apache 2. Leksykon kieszonkowy 24,90 zł, (12,45 zł -50%)
Apache Kafka. Kurs video. Przetwarzanie danych w czasie rzeczywistym 89,00 zł, (62,30 zł -30%)
Apache. Receptury. Wydanie II 49,00 zł, (36,75 zł -25%)
Apache Cookbook. Solutions and Examples for Apache Administration. 2nd Edition 118,78 zł, (92,65 zł -22%)
MongoDB for Jobseekers 84,60 zł, (71,91 zł -15%)

Spis treści

Stream Processing with Apache Spark. Mastering Structured Streaming and Spark Streaming eBook -- spis treści

Foreword
Preface
- Who Should Read This Book?
- Installing Spark
- Learning Scala
- The Way Ahead
- Bibliography
- Conventions Used in This Book
- Using Code Examples
- OReilly Online Learning
- How to Contact Us
- Acknowledgments
  - From Gerard
  - From François
I. Fundamentals of Stream Processing with Apache Spark
1. Introducing Stream Processing
- What Is Stream Processing?
  - Batch Versus Stream Processing
  - The Notion of Time in Stream Processing
  - The Factor of Uncertainty
- Some Examples of Stream Processing
- Scaling Up Data Processing
  - MapReduce
  - The Lesson Learned: Scalability and Fault Tolerance
- Distributed Stream Processing
  - Stateful Stream Processing in a Distributed System
- Introducing Apache Spark
  - The First Wave: Functional APIs
  - The Second Wave: SQL
  - A Unified Engine
  - Spark Components
  - Spark Streaming
  - Structured Streaming
- Where Next?
2. Stream-Processing Model
- Sources and Sinks
- Immutable Streams Defined from One Another
- Transformations and Aggregations
- Window Aggregations
  - Tumbling Windows
  - Sliding Windows
- Stateless and Stateful Processing
- Stateful Streams
- An Example: Local Stateful Computation in Scala
  - A Stateless Definition of the Fibonacci Sequence as a Stream Transformation
- Stateless or Stateful Streaming
- The Effect of Time
  - Computing on Timestamped Events
  - Timestamps as the Provider of the Notion of Time
  - Event Time Versus Processing Time
  - Computing with a Watermark
- Summary
3. Streaming Architectures
- Components of a Data Platform
- Architectural Models
- The Use of a Batch-Processing Component in a Streaming Application
- Referential Streaming Architectures
  - The Lambda Architecture
  - The Kappa Architecture
- Streaming Versus Batch Algorithms
  - Streaming Algorithms Are Sometimes Completely Different in Nature
  - Streaming Algorithms Cant Be Guaranteed to Measure Well Against Batch Algorithms
- Summary
4. Apache Spark as a Stream-Processing Engine
- The Tale of Two APIs
- Sparks Memory Usage
  - Failure Recovery
  - Lazy Evaluation
  - Cache Hints
- Understanding Latency
- Throughput-Oriented Processing
- Sparks Polyglot API
- Fast Implementation of Data Analysis
- To Learn More About Spark
- Summary
5. Sparks Distributed Processing Model
- Running Apache Spark with a Cluster Manager
  - Examples of Cluster Managers
- Sparks Own Cluster Manager
- Understanding Resilience and Fault Tolerance in a Distributed System
  - Fault Recovery
  - Cluster Manager Support for Fault Tolerance
- Data Delivery Semantics
- Microbatching and One-Element-at-a-Time
  - Microbatching: An Application of Bulk-Synchronous Processing
  - One-Record-at-a-Time Processing
  - Microbatching Versus One-at-a-Time: The Trade-Offs
- Bringing Microbatch and One-Record-at-a-Time Closer Together
- Dynamic Batch Interval
- Structured Streaming Processing Model
  - The Disappearance of the Batch Interval
6. Sparks Resilience Model
- Resilient Distributed Datasets in Spark
- Spark Components
- Sparks Fault-Tolerance Guarantees
  - Task Failure Recovery
  - Stage Failure Recovery
  - Driver Failure Recovery
    - Cluster-mode deployment
    - Checkpointing
- Summary
A. References for Part I
II. Structured Streaming
7. Introducing Structured Streaming
- First Steps with Structured Streaming
- Batch Analytics
- Streaming Analytics
  - Connecting to a Stream
  - Preparing the Data in the Stream
  - Operations on Streaming Dataset
  - Creating a Query
  - Start the Stream Processing
  - Exploring the Data
- Summary
8. The Structured Streaming Programming Model
- Initializing Spark
- Sources: Acquiring Streaming Data
  - Available Sources
- Transforming Streaming Data
  - Streaming API Restrictions on the DataFrame API
    - Understanding the limitations
    - Operations on aggregated streams
    - Stream deduplication
    - Workarounds
- Sinks: Output the Resulting Data
  - format
  - outputMode
    - Understanding the append semantic
  - queryName
  - option
  - options
  - trigger
  - start()
- Summary
9. Structured Streaming in Action
- Consuming a Streaming Source
- Application Logic
- Writing to a Streaming Sink
- Summary
10. Structured Streaming Sources
- Understanding Sources
  - Reliable Sources Must Be Replayable
  - Sources Must Provide a Schema
    - Defining schemas
- Available Sources
- The File Source
  - Specifying a File Format
  - Common Options
  - Common Text Parsing Options (CSV, JSON)
    - Handing parsing errors
    - Schema inference
    - Date and time formats
  - JSON File Source Format
    - JSON parsing options
  - CSV File Source Format
    - CSV parsing options
  - Parquet File Source Format
    - Schema definition
  - Text File Source Format
    - Text ingestion options
    - text and textFile
- The Kafka Source
  - Setting Up a Kafka Source
  - Selecting a Topic Subscription Method
  - Configuring Kafka Source Options
    - Kafka source-specific options
  - Kafka Consumer Options
    - Banned configuration options
- The Socket Source
  - Configuration
  - Operations
- The Rate Source
  - Options
11. Structured Streaming Sinks
- Understanding Sinks
- Available Sinks
  - Reliable Sinks
  - Sinks for Experimentation
  - The Sink API
  - Exploring Sinks in Detail
- The File Sink
  - Using Triggers with the File Sink
  - Common Configuration Options Across All Supported File Formats
  - Common Time and Date Formatting (CSV, JSON)
  - The CSV Format of the File Sink
    - Options
  - The JSON File Sink Format
    - Options
  - The Parquet File Sink Format
  - The Text File Sink Format
    - Options
- The Kafka Sink
  - Understanding the Kafka Publish Model
  - Using the Kafka Sink
    - Choosing an encoding
- The Memory Sink
  - Output Modes
- The Console Sink
  - Options
  - Output Modes
- The Foreach Sink
  - The ForeachWriter Interface
  - TCP Writer Sink: A Practical ForeachWriter Example
  - The Moral of this Example
  - Troubleshooting ForeachWriter Serialization Issues
12. Event TimeBased Stream Processing
- Understanding Event Time in Structured Streaming
- Using Event Time
- Processing Time
- Watermarks
- Time-Based Window Aggregations
  - Defining Time-Based Windows
  - Understanding How Intervals Are Computed
  - Using Composite Aggregation Keys
  - Tumbling and Sliding Windows
    - Tumbling windows
    - Sliding windows
    - Interval offset
- Record Deduplication
- Summary
13. Advanced Stateful Operations
- Example: Car Fleet Management
- Understanding Group with State Operations
  - Internal State Flow
- Using MapGroupsWithState
- Using FlatMapGroupsWithState
  - Output Modes
  - Managing State Over Time
    - When a timeout actually times out
- Summary
14. Monitoring Structured Streaming Applications
- The Spark Metrics Subsystem
  - Structured Streaming Metrics
- The StreamingQuery Instance
  - Getting Metrics with StreamingQueryProgress
- The StreamingQueryListener Interface
  - Implementing a StreamingQueryListener
15. Experimental Areas: Continuous Processing and Machine Learning
- Continuous Processing
  - Understanding Continuous Processing
    - Microbatch in Structured Streaming
    - Introducing continuous processing: A low-latency streaming mode
  - Using Continuous Processing
  - Limitations
- Machine Learning
  - Learning Versus Exploiting
  - Applying a Machine Learning Model to a Stream
  - Example: Estimating Room Occupancy by Using Ambient Sensors
    - The challenge of model serving
    - Model serving in Structured Streaming
  - Online Training
B. References for Part II
III. Spark Streaming
16. Introducing Spark Streaming
- The DStream Abstraction
  - DStreams as a Programming Model
  - DStreams as an Execution Model
- The Structure of a Spark Streaming Application
  - Creating the Spark Streaming Context
  - Defining a DStream
  - Defining Output Operations
  - Starting the Spark Streaming Context
  - Stopping the Streaming Process
- Summary
17. The Spark Streaming Programming Model
- RDDs as the Underlying Abstraction for DStreams
- Understanding DStream Transformations
- Element-Centric DStream Transformations
- RDD-Centric DStream Transformations
- Counting
- Structure-Changing Transformations
- Summary
18. The Spark Streaming Execution Model
- The Bulk-Synchronous Architecture
- The Receiver Model
  - The Receiver API
  - How Receivers Work
  - The Receivers Data Flow
  - The Internal Data Resilience
  - Receiver Parallelism
  - Balancing Resources: Receivers Versus Processing Cores
  - Achieving Zero Data Loss with the Write-Ahead Log
    - Enabling the WAL
- The Receiverless or Direct Model
- Summary
19. Spark Streaming Sources
- Types of Sources
  - Basic Sources
  - Receiver-Based Sources
  - Direct Sources
- Commonly Used Sources
- The File Source
  - How It Works
- The Queue Source
  - How It Works
  - Using a Queue Source for Unit Testing
  - A Simpler Alternative to the Queue Source: The ConstantInputDStream
    - How it works
    - ConstantInputDStream as a random data generator
- The Socket Source
  - How It Works
- The Kafka Source
  - Using the Kafka Source
  - How It Works
- Where to Find More Sources
20. Spark Streaming Sinks
- Output Operations
- Built-In Output Operations
  - print
  - saveAsxyz
  - foreachRDD
- Using foreachRDD as a Programmable Sink
- Third-Party Output Operations
21. Time-Based Stream Processing
- Window Aggregations
- Tumbling Windows
  - Window Length Versus Batch Interval
- Sliding Windows
  - Sliding Windows Versus Batch Interval
  - Sliding Windows Versus Tumbling Windows
- Using Windows Versus Longer Batch Intervals
- Window Reductions
  - reduceByWindow
  - reduceByKeyAndWindow
  - countByWindow
  - countByValueAndWindow
- Invertible Window Aggregations
- Slicing Streams
- Summary
22. Arbitrary Stateful Streaming Computation
- Statefulness at the Scale of a Stream
- updateStateByKey
- Limitation of updateStateByKey
  - Performance
  - Memory Usage
- Introducing Stateful Computation with mapwithState
- Using mapWithState
- Event-Time Stream Computation Using mapWithState
23. Working with Spark SQL
- Spark SQL
- Accessing Spark SQL Functions from Spark Streaming
  - Example: Writing Streaming Data to Parquet
    - Saving DataFrames
- Dealing with Data at Rest
  - Using Join to Enrich the Input Stream
- Join Optimizations
- Updating Reference Datasets in a Streaming Application
  - Enhancing Our Example with a Reference Dataset
    - Loading the reference data from a Parquet file
    - Setting up the refreshing mechanism
    - Runtime implications
- Summary
24. Checkpointing
- Understanding the Use of Checkpoints
- Checkpointing DStreams
- Recovery from a Checkpoint
  - Limitations
- The Cost of Checkpointing
- Checkpoint Tuning
25. Monitoring Spark Streaming
- The Streaming UI
- Understanding Job Performance Using the Streaming UI
  - Input Rate Chart
  - Scheduling Delay Chart
  - Processing Time Chart
  - Total Delay Chart
  - Batch Details
- The Monitoring REST API
  - Using the Monitoring REST API
  - Information Exposed by the Monitoring REST API
- The Metrics Subsystem
- The Internal Event Bus
  - Interacting with the Event Bus
    - The StreamingListener interface
    - Batch events
    - Output operation events
    - StreamingListener registration
- Summary
26. Performance Tuning
- The Performance Balance of Spark Streaming
  - The Relationship Between Batch Interval and Processing Delay
  - The Last Moments of a Failing Job
  - Going Deeper: Scheduling Delay and Processing Delay
  - Checkpoint Influence in Processing Time
- External Factors that Influence the Jobs Performance
- How to Improve Performance?
- Tweaking the Batch Interval
- Limiting the Data Ingress with Fixed-Rate Throttling
- Backpressure
- Dynamic Throttling
  - Tuning the Backpressure PID
  - Custom Rate Estimator
  - A Note on Alternative Dynamic Handling Strategies
- Caching
- Speculative Execution
C. References for Part III
IV. Advanced Spark Streaming Techniques
27. Streaming Approximation and Sampling Algorithms
- Exactness, Real Time, and Big Data
  - Exactness
  - Real-Time Processing
  - Big Data
- The Exactness, Real-Time, and Big Data triangle
  - Big Data and Real Time
- Approximation Algorithms
- Hashing and Sketching: An Introduction
- Counting Distinct Elements: HyperLogLog
  - Role-Playing Exercise: If We Were a System Administrator
  - Practical HyperLogLog in Spark
- Counting Element Frequency: Count Min Sketches
  - Introducing Bloom Filters
  - Bloom Filters with Spark
  - Computing Frequencies with a Count-Min Sketch
- Ranks and Quantiles: T-Digest
  - T-Digest in Spark
- Reducing the Number of Elements: Sampling
  - Random Sampling
  - Stratified Sampling
28. Real-Time Machine Learning
- Streaming Classification with Naive Bayes
  - streamDM Introduction
  - Naive Bayes in Practice
  - Training a Movie Review Classifier
- Introducing Decision Trees
- Hoeffding Trees
  - Hoeffding Trees in Spark, in Practice
- Streaming Clustering with Online K-Means
  - K-Means Clustering
  - Online Data and K-Means
  - The Problem of Decaying Clusters
  - Streaming K-Means with Spark Streaming
D. References for Part IV
V. Beyond Apache Spark
29. Other Distributed Real-Time Stream Processing Systems
- Apache Storm
  - Processing Model
  - The Storm Topology
  - The Storm Cluster
  - Compared to Spark
- Apache Flink
  - A Streaming-First Framework
  - Compared to Spark
- Kafka Streams
  - Kafka Streams Programming Model
  - Compared to Spark
- In the Cloud
  - Amazon Kinesis on AWS
  - Microsoft Azure Stream Analytics
  - Apache Beam/Google Cloud Dataflow
30. Looking Ahead
- Stay Plugged In
  - Seek Help on Stack Overflow
  - Start Discussions on the Mailing Lists
  - Attend Conferences
- Attend Meetups
  - Read Books
- Contributing to the Apache Spark Project
E. References for Part V
Index