

Spark: The Definitive Guide. Big Data Processing Made Simple
ebook
Authors: Bill Chambers, Matei Zaharia
ISBN: 978-14-919-1229-4
Pages: 608, Format: ebook
Publication date: 2018-02-08
Bookstore: Helion

Price: 186.15 zł (previously: 216.45 zł)
You save: 14% (-30.30 zł)


Learn how to use, deploy, and maintain Apache Spark with this comprehensive guide, written by the creators of the open-source cluster-computing framework. With an emphasis on improvements and new features in Spark 2.0, authors Bill Chambers and Matei Zaharia break down Spark topics into distinct sections, each with unique goals.

You’ll explore the basic operations and common functions of Spark’s structured APIs, as well as Structured Streaming, a new high-level API for building end-to-end streaming applications. Developers and system administrators will learn the fundamentals of monitoring, tuning, and debugging Spark, and explore machine learning techniques and scenarios for employing MLlib, Spark’s scalable machine-learning library.

  • Get a gentle overview of big data and Spark
  • Learn about DataFrames, SQL, and Datasets—Spark’s core APIs—through worked examples
  • Dive into Spark’s low-level APIs, RDDs, and execution of SQL and DataFrames
  • Understand how Spark runs on a cluster
  • Debug, monitor, and tune Spark clusters and applications
  • Learn the power of Structured Streaming, Spark’s stream-processing engine
  • Learn how you can apply MLlib to a variety of problems, including classification and recommendation


Customers who bought "Spark: The Definitive Guide. Big Data Processing Made Simple" also chose:

  • Windows Media Center. Domowe centrum rozrywki
  • Ruby on Rails. Ćwiczenia
  • DevOps w praktyce. Kurs video. Jenkins, Ansible, Terraform i Docker
  • Przywództwo w świecie VUCA. Jak być skutecznym liderem w niepewnym środowisku
  • Scrum. O zwinnym zarządzaniu projektami. Wydanie II rozszerzone


Table of contents


  • Preface
    • About the Authors
    • Who This Book Is For
    • Conventions Used in This Book
    • Using Code Examples
    • O'Reilly Safari
    • How to Contact Us
    • Acknowledgments
  • I. Gentle Overview of Big Data and Spark
  • 1. What Is Apache Spark?
    • Apache Spark's Philosophy
    • Context: The Big Data Problem
    • History of Spark
    • The Present and Future of Spark
    • Running Spark
      • Downloading Spark Locally
        • Downloading Spark for a Hadoop cluster
        • Building Spark from source
      • Launching Spark's Interactive Consoles
        • Launching the Python console
        • Launching the Scala console
        • Launching the SQL console
      • Running Spark in the Cloud
      • Data Used in This Book
  • 2. A Gentle Introduction to Spark
    • Spark's Basic Architecture
      • Spark Applications
    • Spark's Language APIs
    • Spark's APIs
    • Starting Spark
    • The SparkSession
    • DataFrames
      • Partitions
    • Transformations
      • Lazy Evaluation
    • Actions
    • Spark UI
    • An End-to-End Example
      • DataFrames and SQL
    • Conclusion
  • 3. A Tour of Spark's Toolset
    • Running Production Applications
    • Datasets: Type-Safe Structured APIs
    • Structured Streaming
    • Machine Learning and Advanced Analytics
    • Lower-Level APIs
    • SparkR
    • Spark's Ecosystem and Packages
    • Conclusion
  • II. Structured APIs: DataFrames, SQL, and Datasets
  • 4. Structured API Overview
    • DataFrames and Datasets
    • Schemas
    • Overview of Structured Spark Types
      • DataFrames Versus Datasets
      • Columns
      • Rows
      • Spark Types
    • Overview of Structured API Execution
      • Logical Planning
      • Physical Planning
      • Execution
    • Conclusion
  • 5. Basic Structured Operations
    • Schemas
    • Columns and Expressions
      • Columns
        • Explicit column references
      • Expressions
        • Columns as expressions
        • Accessing a DataFrame's columns
    • Records and Rows
      • Creating Rows
    • DataFrame Transformations
      • Creating DataFrames
      • select and selectExpr
      • Converting to Spark Types (Literals)
      • Adding Columns
      • Renaming Columns
      • Reserved Characters and Keywords
      • Case Sensitivity
      • Removing Columns
      • Changing a Column's Type (cast)
      • Filtering Rows
      • Getting Unique Rows
      • Random Samples
      • Random Splits
      • Concatenating and Appending Rows (Union)
      • Sorting Rows
      • Limit
      • Repartition and Coalesce
      • Collecting Rows to the Driver
    • Conclusion
  • 6. Working with Different Types of Data
    • Where to Look for APIs
    • Converting to Spark Types
    • Working with Booleans
    • Working with Numbers
    • Working with Strings
      • Regular Expressions
    • Working with Dates and Timestamps
    • Working with Nulls in Data
      • Coalesce
      • ifnull, nullIf, nvl, and nvl2
      • drop
      • fill
      • replace
    • Ordering
    • Working with Complex Types
      • Structs
      • Arrays
      • split
      • Array Length
      • array_contains
      • explode
      • Maps
    • Working with JSON
    • User-Defined Functions
    • Conclusion
  • 7. Aggregations
    • Aggregation Functions
      • count
      • countDistinct
      • approx_count_distinct
      • first and last
      • min and max
      • sum
      • sumDistinct
      • avg
      • Variance and Standard Deviation
      • skewness and kurtosis
      • Covariance and Correlation
      • Aggregating to Complex Types
    • Grouping
      • Grouping with Expressions
      • Grouping with Maps
    • Window Functions
    • Grouping Sets
      • Rollups
      • Cube
      • Grouping Metadata
      • Pivot
    • User-Defined Aggregation Functions
    • Conclusion
  • 8. Joins
    • Join Expressions
    • Join Types
    • Inner Joins
    • Outer Joins
    • Left Outer Joins
    • Right Outer Joins
    • Left Semi Joins
    • Left Anti Joins
    • Natural Joins
    • Cross (Cartesian) Joins
    • Challenges When Using Joins
      • Joins on Complex Types
      • Handling Duplicate Column Names
        • Approach 1: Different join expression
        • Approach 2: Dropping the column after the join
        • Approach 3: Renaming a column before the join
    • How Spark Performs Joins
      • Communication Strategies
        • Big table-to-big table
        • Big table-to-small table
        • Little table-to-little table
    • Conclusion
  • 9. Data Sources
    • The Structure of the Data Sources API
      • Read API Structure
      • Basics of Reading Data
        • Read modes
      • Write API Structure
      • Basics of Writing Data
        • Save modes
    • CSV Files
      • CSV Options
      • Reading CSV Files
      • Writing CSV Files
    • JSON Files
      • JSON Options
      • Reading JSON Files
      • Writing JSON Files
    • Parquet Files
      • Reading Parquet Files
        • Parquet options
      • Writing Parquet Files
    • ORC Files
      • Reading ORC Files
      • Writing ORC Files
    • SQL Databases
      • Reading from SQL Databases
      • Query Pushdown
        • Reading from databases in parallel
        • Partitioning based on a sliding window
      • Writing to SQL Databases
    • Text Files
      • Reading Text Files
      • Writing Text Files
    • Advanced I/O Concepts
      • Splittable File Types and Compression
      • Reading Data in Parallel
      • Writing Data in Parallel
        • Partitioning
        • Bucketing
      • Writing Complex Types
      • Managing File Size
    • Conclusion
  • 10. Spark SQL
    • What Is SQL?
    • Big Data and SQL: Apache Hive
    • Big Data and SQL: Spark SQL
      • Spark's Relationship to Hive
        • The Hive metastore
    • How to Run Spark SQL Queries
      • Spark SQL CLI
      • Spark's Programmatic SQL Interface
      • SparkSQL Thrift JDBC/ODBC Server
    • Catalog
    • Tables
      • Spark-Managed Tables
      • Creating Tables
      • Creating External Tables
      • Inserting into Tables
      • Describing Table Metadata
      • Refreshing Table Metadata
      • Dropping Tables
        • Dropping unmanaged tables
      • Caching Tables
    • Views
      • Creating Views
      • Dropping Views
    • Databases
      • Creating Databases
      • Setting the Database
      • Dropping Databases
    • Select Statements
      • case...when...then Statements
    • Advanced Topics
      • Complex Types
        • Structs
        • Lists
      • Functions
        • User-defined functions
      • Subqueries
        • Uncorrelated predicate subqueries
        • Correlated predicate subqueries
        • Uncorrelated scalar queries
    • Miscellaneous Features
      • Configurations
      • Setting Configuration Values in SQL
    • Conclusion
  • 11. Datasets
    • When to Use Datasets
    • Creating Datasets
      • In Java: Encoders
      • In Scala: Case Classes
    • Actions
    • Transformations
      • Filtering
      • Mapping
    • Joins
    • Grouping and Aggregations
    • Conclusion
  • III. Low-Level APIs
  • 12. Resilient Distributed Datasets (RDDs)
    • What Are the Low-Level APIs?
      • When to Use the Low-Level APIs?
      • How to Use the Low-Level APIs?
    • About RDDs
      • Types of RDDs
      • When to Use RDDs?
      • Datasets and RDDs of Case Classes
    • Creating RDDs
      • Interoperating Between DataFrames, Datasets, and RDDs
      • From a Local Collection
      • From Data Sources
    • Manipulating RDDs
    • Transformations
      • distinct
      • filter
      • map
        • flatMap
      • sort
      • Random Splits
    • Actions
      • reduce
      • count
        • countApprox
        • countApproxDistinct
        • countByValue
        • countByValueApprox
      • first
      • max and min
      • take
    • Saving Files
      • saveAsTextFile
      • SequenceFiles
      • Hadoop Files
    • Caching
    • Checkpointing
    • Pipe RDDs to System Commands
      • mapPartitions
      • foreachPartition
      • glom
    • Conclusion
  • 13. Advanced RDDs
    • Key-Value Basics (Key-Value RDDs)
      • keyBy
      • Mapping over Values
      • Extracting Keys and Values
      • lookup
      • sampleByKey
    • Aggregations
      • countByKey
      • Understanding Aggregation Implementations
        • groupByKey
        • reduceByKey
      • Other Aggregation Methods
        • aggregate
        • aggregateByKey
        • combineByKey
        • foldByKey
    • CoGroups
    • Joins
      • Inner Join
      • zips
    • Controlling Partitions
      • coalesce
      • repartition
      • repartitionAndSortWithinPartitions
      • Custom Partitioning
    • Custom Serialization
    • Conclusion
  • 14. Distributed Shared Variables
    • Broadcast Variables
    • Accumulators
      • Basic Example
      • Custom Accumulators
    • Conclusion
  • IV. Production Applications
  • 15. How Spark Runs on a Cluster
    • The Architecture of a Spark Application
      • Execution Modes
        • Cluster mode
        • Client mode
        • Local mode
    • The Life Cycle of a Spark Application (Outside Spark)
      • Client Request
      • Launch
      • Execution
      • Completion
    • The Life Cycle of a Spark Application (Inside Spark)
      • The SparkSession
        • The SparkContext
      • Logical Instructions
        • Logical instructions to physical execution
      • A Spark Job
      • Stages
      • Tasks
    • Execution Details
      • Pipelining
      • Shuffle Persistence
    • Conclusion
  • 16. Developing Spark Applications
    • Writing Spark Applications
      • A Simple Scala-Based App
        • Running the application
      • Writing Python Applications
        • Running the application
      • Writing Java Applications
        • Running the application
    • Testing Spark Applications
      • Strategic Principles
        • Input data resilience
        • Business logic resilience and evolution
        • Resilience in output and atomicity
      • Tactical Takeaways
        • Managing SparkSessions
        • Which Spark API to Use?
      • Connecting to Unit Testing Frameworks
      • Connecting to Data Sources
    • The Development Process
    • Launching Applications
      • Application Launch Examples
    • Configuring Applications
      • The SparkConf
      • Application Properties
      • Runtime Properties
      • Execution Properties
      • Configuring Memory Management
      • Configuring Shuffle Behavior
      • Environmental Variables
      • Job Scheduling Within an Application
    • Conclusion
  • 17. Deploying Spark
    • Where to Deploy Your Cluster to Run Spark Applications
      • On-Premises Cluster Deployments
      • Spark in the Cloud
    • Cluster Managers
      • Standalone Mode
        • Starting a standalone cluster
        • Cluster launch scripts
        • Standalone cluster configurations
        • Submitting applications
      • Spark on YARN
        • Submitting applications
      • Configuring Spark on YARN Applications
        • Hadoop configurations
        • Application properties for YARN
      • Spark on Mesos
        • Submitting applications
        • Configuring Mesos
      • Secure Deployment Configurations
      • Cluster Networking Configurations
      • Application Scheduling
        • Dynamic allocation
    • Miscellaneous Considerations
    • Conclusion
  • 18. Monitoring and Debugging
    • The Monitoring Landscape
    • What to Monitor
      • Driver and Executor Processes
      • Queries, Jobs, Stages, and Tasks
    • Spark Logs
    • The Spark UI
      • Other Spark UI tabs
      • Configuring the Spark user interface
      • Spark REST API
      • Spark UI History Server
    • Debugging and Spark First Aid
      • Spark Jobs Not Starting
        • Signs and symptoms
        • Potential treatments
      • Errors Before Execution
        • Signs and symptoms
        • Potential treatments
      • Errors During Execution
        • Signs and symptoms
        • Potential treatments
      • Slow Tasks or Stragglers
        • Signs and symptoms
        • Potential treatments
      • Slow Aggregations
        • Signs and symptoms
        • Potential treatments
      • Slow Joins
        • Signs and symptoms
        • Potential treatments
      • Slow Reads and Writes
        • Signs and symptoms
        • Potential treatments
      • Driver OutOfMemoryError or Driver Unresponsive
        • Signs and symptoms
        • Potential treatments
      • Executor OutOfMemoryError or Executor Unresponsive
        • Signs and symptoms
        • Potential treatments
      • Unexpected Nulls in Results
        • Signs and symptoms
        • Potential treatments
      • No Space Left on Disk Errors
        • Signs and symptoms
        • Potential treatments
      • Serialization Errors
        • Signs and symptoms
        • Potential treatments
    • Conclusion
  • 19. Performance Tuning
    • Indirect Performance Enhancements
      • Design Choices
        • Scala versus Java versus Python versus R
        • DataFrames versus SQL versus Datasets versus RDDs
      • Object Serialization in RDDs
      • Cluster Configurations
        • Cluster/application sizing and sharing
        • Dynamic allocation
      • Scheduling
      • Data at Rest
        • File-based long-term data storage
        • Splittable file types and compression
        • Table partitioning
        • Bucketing
        • The number of files
        • Data locality
        • Statistics collection
      • Shuffle Configurations
      • Memory Pressure and Garbage Collection
        • Measuring the impact of garbage collection
        • Garbage collection tuning
    • Direct Performance Enhancements
      • Parallelism
      • Improved Filtering
      • Repartitioning and Coalescing
        • Custom partitioning
      • User-Defined Functions (UDFs)
      • Temporary Data Storage (Caching)
      • Joins
      • Aggregations
      • Broadcast Variables
    • Conclusion
  • V. Streaming
  • 20. Stream Processing Fundamentals
    • What Is Stream Processing?
      • Stream Processing Use Cases
        • Notifications and alerting
        • Real-time reporting
        • Incremental ETL
        • Update data to serve in real time
        • Real-time decision making
        • Online machine learning
      • Advantages of Stream Processing
      • Challenges of Stream Processing
    • Stream Processing Design Points
      • Record-at-a-Time Versus Declarative APIs
      • Event Time Versus Processing Time
      • Continuous Versus Micro-Batch Execution
    • Spark's Streaming APIs
      • The DStream API
      • Structured Streaming
    • Conclusion
  • 21. Structured Streaming Basics
    • Structured Streaming Basics
    • Core Concepts
      • Transformations and Actions
      • Input Sources
      • Sinks
      • Output Modes
      • Triggers
      • Event-Time Processing
        • Event-time data
        • Watermarks
    • Structured Streaming in Action
    • Transformations on Streams
      • Selections and Filtering
      • Aggregations
      • Joins
    • Input and Output
      • Where Data Is Read and Written (Sources and Sinks)
        • File source and sink
        • Kafka source and sink
      • Reading from the Kafka Source
      • Writing to the Kafka Sink
        • Foreach sink
        • Sources and sinks for testing
      • How Data Is Output (Output Modes)
        • Append mode
        • Complete mode
        • Update mode
        • When can you use each mode?
      • When Data Is Output (Triggers)
        • Processing time trigger
        • Once trigger
    • Streaming Dataset API
    • Conclusion
  • 22. Event-Time and Stateful Processing
    • Event Time
    • Stateful Processing
    • Arbitrary Stateful Processing
    • Event-Time Basics
    • Windows on Event Time
      • Tumbling Windows
        • Sliding windows
      • Handling Late Data with Watermarks
    • Dropping Duplicates in a Stream
    • Arbitrary Stateful Processing
      • Time-Outs
      • Output Modes
      • mapGroupsWithState
      • flatMapGroupsWithState
    • Conclusion
  • 23. Structured Streaming in Production
    • Fault Tolerance and Checkpointing
    • Updating Your Application
      • Updating Your Streaming Application Code
      • Updating Your Spark Version
      • Sizing and Rescaling Your Application
    • Metrics and Monitoring
      • Query Status
      • Recent Progress
        • Input rate and processing rate
        • Batch duration
      • Spark UI
    • Alerting
    • Advanced Monitoring with the Streaming Listener
    • Conclusion
  • VI. Advanced Analytics and Machine Learning
  • 24. Advanced Analytics and Machine Learning Overview
    • A Short Primer on Advanced Analytics
      • Supervised Learning
        • Classification
        • Regression
      • Recommendation
      • Unsupervised Learning
      • Graph Analytics
      • The Advanced Analytics Process
        • Data collection
        • Data cleaning
        • Feature engineering
        • Training models
        • Model tuning and evaluation
        • Leveraging the model and/or insights
    • Spark's Advanced Analytics Toolkit
      • What Is MLlib?
        • When and why should you use MLlib (versus scikit-learn, TensorFlow, or foo package)
    • High-Level MLlib Concepts
      • Low-level data types
    • MLlib in Action
      • Feature Engineering with Transformers
      • Estimators
      • Pipelining Our Workflow
      • Training and Evaluation
      • Persisting and Applying Models
    • Deployment Patterns
    • Conclusion
  • 25. Preprocessing and Feature Engineering
    • Formatting Models According to Your Use Case
    • Transformers
    • Estimators for Preprocessing
      • Transformer Properties
    • High-Level Transformers
      • RFormula
      • SQL Transformers
      • VectorAssembler
    • Working with Continuous Features
      • Bucketing
        • Advanced bucketing techniques
      • Scaling and Normalization
      • StandardScaler
        • MinMaxScaler
        • MaxAbsScaler
        • ElementwiseProduct
        • Normalizer
    • Working with Categorical Features
      • StringIndexer
      • Converting Indexed Values Back to Text
      • Indexing in Vectors
      • One-Hot Encoding
    • Text Data Transformers
      • Tokenizing Text
      • Removing Common Words
      • Creating Word Combinations
      • Converting Words into Numerical Representations
        • Term frequency-inverse document frequency
      • Word2Vec
    • Feature Manipulation
      • PCA
      • Interaction
      • Polynomial Expansion
    • Feature Selection
      • ChiSqSelector
    • Advanced Topics
      • Persisting Transformers
    • Writing a Custom Transformer
    • Conclusion
  • 26. Classification
    • Use Cases
    • Types of Classification
      • Binary Classification
      • Multiclass Classification
      • Multilabel Classification
    • Classification Models in MLlib
      • Model Scalability
    • Logistic Regression
      • Model Hyperparameters
      • Training Parameters
      • Prediction Parameters
      • Example
      • Model Summary
    • Decision Trees
      • Model Hyperparameters
      • Training Parameters
      • Prediction Parameters
    • Random Forest and Gradient-Boosted Trees
      • Model Hyperparameters
        • Random forest only
        • Gradient-boosted trees (GBT) only
      • Training Parameters
      • Prediction Parameters
    • Naive Bayes
      • Model Hyperparameters
      • Training Parameters
      • Prediction Parameters
    • Evaluators for Classification and Automating Model Tuning
    • Detailed Evaluation Metrics
    • One-vs-Rest Classifier
    • Multilayer Perceptron
    • Conclusion
  • 27. Regression
    • Use Cases
    • Regression Models in MLlib
      • Model Scalability
    • Linear Regression
      • Model Hyperparameters
      • Training Parameters
      • Example
      • Training Summary
    • Generalized Linear Regression
      • Model Hyperparameters
      • Training Parameters
      • Prediction Parameters
      • Example
      • Training Summary
    • Decision Trees
      • Model Hyperparameters
      • Training Parameters
      • Example
    • Random Forests and Gradient-Boosted Trees
      • Model Hyperparameters
      • Training Parameters
      • Example
    • Advanced Methods
      • Survival Regression (Accelerated Failure Time)
      • Isotonic Regression
    • Evaluators and Automating Model Tuning
    • Metrics
    • Conclusion
  • 28. Recommendation
    • Use Cases
    • Collaborative Filtering with Alternating Least Squares
      • Model Hyperparameters
      • Training Parameters
      • Prediction Parameters
      • Example
    • Evaluators for Recommendation
    • Metrics
      • Regression Metrics
      • Ranking Metrics
    • Frequent Pattern Mining
    • Conclusion
  • 29. Unsupervised Learning
    • Use Cases
    • Model Scalability
    • k-means
      • Model Hyperparameters
      • Training Parameters
      • Example
      • k-means Metrics Summary
    • Bisecting k-means
      • Model Hyperparameters
      • Training Parameters
      • Example
      • Bisecting k-means Summary
    • Gaussian Mixture Models
      • Model Hyperparameters
      • Training Parameters
      • Example
      • Gaussian Mixture Model Summary
    • Latent Dirichlet Allocation
      • Model Hyperparameters
      • Training Parameters
      • Prediction Parameters
      • Example
    • Conclusion
  • 30. Graph Analytics
    • Building a Graph
    • Querying the Graph
      • Subgraphs
    • Motif Finding
    • Graph Algorithms
      • PageRank
      • In-Degree and Out-Degree Metrics
      • Breadth-First Search
      • Connected Components
      • Strongly Connected Components
      • Advanced Tasks
    • Conclusion
  • 31. Deep Learning
    • What Is Deep Learning?
    • Ways of Using Deep Learning in Spark
    • Deep Learning Libraries
      • MLlib Neural Network Support
      • TensorFrames
      • BigDL
      • TensorFlowOnSpark
      • DeepLearning4J
      • Deep Learning Pipelines
    • A Simple Example with Deep Learning Pipelines
      • Setup
      • Images and DataFrames
      • Transfer Learning
        • Applying deep learning models at scale
      • Applying Popular Models
        • Applying custom Keras models
        • Applying TensorFlow models
        • Deploying models as SQL functions
    • Conclusion
  • VII. Ecosystem
  • 32. Language Specifics: Python (PySpark) and R (SparkR and sparklyr)
    • PySpark
      • Fundamental PySpark Differences
      • Pandas Integration
    • R on Spark
      • SparkR
        • Pros and cons of using SparkR instead of other languages
        • Setup
        • Key Concepts
        • Function masking
        • SparkR functions only apply to SparkDataFrames
        • Data manipulation
        • Data sources
        • Machine learning
        • User-defined functions
      • sparklyr
        • Key concepts
        • No DataFrames
        • Data manipulation
        • Executing SQL
        • Data sources
        • Machine learning
    • Conclusion
  • 33. Ecosystem and Community
    • Spark Packages
      • An Abridged List of Popular Packages
      • Using Spark Packages
        • In Scala
        • In Python
        • At runtime
      • External Packages
    • Community
      • Spark Summit
      • Local Meetups
    • Conclusion
  • Index

