Scaling Machine Learning with Spark - Helion
ISBN: 9781098106775
Pages: 294, Format: ebook
Publication date: 2023-03-07
Bookstore: Helion
Book price: 29,90 zł (previously: 299,00 zł)
You save: 90% (-269,10 zł)
Learn how to build end-to-end scalable machine learning solutions with Apache Spark. With this practical guide, author Adi Polak introduces data and ML practitioners to creative solutions that supersede today's traditional methods. You'll learn a more holistic approach that takes you beyond specific requirements and organizational goals--allowing data and ML practitioners to collaborate and understand each other better.
Scaling Machine Learning with Spark examines several technologies for building end-to-end distributed ML workflows based on the Apache Spark ecosystem with Spark MLlib, MLflow, TensorFlow, and PyTorch. If you're a data scientist who works with machine learning, this book shows you when and why to use each technology.
You will:
- Explore machine learning, including distributed computing concepts and terminology
- Manage the ML lifecycle with MLflow
- Ingest data and perform basic preprocessing with Spark
- Explore feature engineering, and use Spark to extract features
- Train a model with MLlib and build a pipeline to reproduce it (a minimal sketch of this workflow follows this list)
- Build a data system to combine the power of Spark with deep learning
- Get a step-by-step example of working with distributed TensorFlow
- Use PyTorch to scale machine learning, and explore its internal architecture
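To give a flavor of the workflow the book walks through, here is a minimal, hypothetical sketch (not taken from the book) that ingests data with Spark, trains a model in an MLlib pipeline, and records the run with MLflow. The file path, column names, and parameter values are placeholders.

```python
# Minimal sketch, assuming a Parquet dataset with columns
# "category", "f1", "f2", "f3" -- all names here are hypothetical.
import mlflow
import mlflow.spark
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("scaling-ml-sketch").getOrCreate()

# Ingest data with Spark (hypothetical path).
df = spark.read.parquet("/data/training.parquet")

# Basic preprocessing and feature extraction as pipeline stages.
indexer = StringIndexer(inputCol="category", outputCol="label")
assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
lr = LogisticRegression(maxIter=10)
pipeline = Pipeline(stages=[indexer, assembler, lr])

# Track the experiment with MLflow so the run can be reproduced later.
with mlflow.start_run():
    mlflow.log_param("maxIter", 10)
    model = pipeline.fit(df)
    mlflow.spark.log_model(model, "model")
```

The fitted pipeline can later be reloaded and applied to new data, which is the reproducibility idea the MLlib and MLflow chapters develop in depth.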
Customers who bought "Scaling Machine Learning with Spark" also chose:
- Cisco CCNA 200-301. Kurs video. Administrowanie bezpieczeństwem sieci. Część 3 -- 665,00 zł (now 39,90 zł, -94%)
- Cisco CCNA 200-301. Kurs video. Administrowanie urządzeniami Cisco. Część 2 -- 665,00 zł (now 39,90 zł, -94%)
- Cisco CCNA 200-301. Kurs video. Podstawy sieci komputerowych i konfiguracji. Część 1 -- 665,00 zł (now 39,90 zł, -94%)
- Impact of P2P and Free Distribution on Book Sales -- 427,14 zł (now 29,90 zł, -93%)
- Cisco CCNP Enterprise 350-401 ENCOR. Kurs video. Programowanie i automatyzacja sieci -- 443,33 zł (now 39,90 zł, -91%)
Scaling Machine Learning with Spark eBook -- Table of Contents
- Preface
- Who Should Read This Book?
- Do You Need Distributed Machine Learning?
- Navigating This Book
- What Is Not Covered
- The Environment and Tools
- The Tools
- The Datasets
- Conventions Used in This Book
- Using Code Examples
- O'Reilly Online Learning
- How to Contact Us
- Acknowledgments
- 1. Distributed Machine Learning Terminology and Concepts
- The Stages of the Machine Learning Workflow
- Tools and Technologies in the Machine Learning Pipeline
- Distributed Computing Models
- General-Purpose Models
- MapReduce
- MPI
- Barrier
- Shared memory
- Dedicated Distributed Computing Models
- Introduction to Distributed Systems Architecture
- Centralized Versus Decentralized Systems
- Interaction Models
- Client/server
- Peer-to-peer
- Geo-distributed
- Communication in a Distributed Setting
- Asynchronous
- Synchronous
- Introduction to Ensemble Methods
- High Versus Low Bias
- Types of Ensemble Methods
- Distributed Training Topologies
- Centralized ensemble learning
- Decentralized decision trees
- Centralized, distributed training with parameter servers
- Centralized, distributed training in a P2P topology
- The Challenges of Distributed Machine Learning Systems
- Performance
- Data parallelism versus model parallelism
- Combining data parallelism and model parallelism
- Deep learning
- Resource Management
- Fault Tolerance
- Privacy
- Portability
- Setting Up Your Local Environment
- Chapters 2-6 Tutorials Environment
- Chapters 7-10 Tutorials Environment
- Summary
- 2. Introduction to Spark and PySpark
- Apache Spark Architecture
- Intro to PySpark
- Apache Spark Basics
- Software Architecture
- Creating a custom schema
- Key Spark data abstractions and APIs
- DataFrames are immutable
- PySpark and Functional Programming
- Executing PySpark Code
- pandas DataFrames Versus Spark DataFrames
- Scikit-Learn Versus MLlib
- Summary
- 3. Managing the Machine Learning Experiment Lifecycle with MLflow
- Machine Learning Lifecycle Management Requirements
- What Is MLflow?
- Software Components of the MLflow Platform
- Users of the MLflow Platform
- MLflow Components
- MLflow Tracking
- Using MLflow Tracking to record runs
- Logging your dataset path and version
- MLflow Projects
- MLflow Models
- MLflow Model Registry
- Registering models
- Transitioning between model stages
- Using MLflow at Scale
- Summary
- 4. Data Ingestion, Preprocessing, and Descriptive Statistics
- Data Ingestion with Spark
- Working with Images
- Image format
- Binary format
- Working with Tabular Data
- Preprocessing Data
- Preprocessing Versus Processing
- Why Preprocess the Data?
- Data Structures
- MLlib Data Types
- Preprocessing with MLlib Transformers
- Working with text data
- From nominal categorical features to indices
- Structuring continuous numerical data
- Additional transformers
- Preprocessing Image Data
- Extracting labels
- Transforming labels to indices
- Extracting image size
- Save the Data and Avoid the Small Files Problem
- Avoiding small files
- Image compression and Parquet
- Descriptive Statistics: Getting a Feel for the Data
- Calculating Statistics
- Descriptive Statistics with Spark Summarizer
- Data Skewness
- Correlation
- Pearson correlation
- Spearman correlation
- Summary
- 5. Feature Engineering
- Features and Their Impact on Models
- MLlib Featurization Tools
- Extractors
- Selectors
- Example: Word2Vec
- The Image Featurization Process
- Understanding Image Manipulation
- Grayscale
- Defining image boundaries using image gradients
- Extracting Features with Spark APIs
- pyspark.sql.functions: pandas_udf and Python type hints
- pyspark.sql.GroupedData: applyInPandas and mapInPandas
- The Text Featurization Process
- Bag-of-Words
- TF-IDF
- N-Gram
- Additional Techniques
- Enriching the Dataset
- Summary
- 6. Training Models with Spark MLlib
- Algorithms
- Supervised Machine Learning
- Classification
- MLlib classification algorithms
- Implementing multilabel classification support
- What about imbalanced class labels?
- Regression
- Recommendation systems
- ALS for collaborative filtering
- Unsupervised Machine Learning
- Frequent Pattern Mining
- Clustering
- Evaluating
- Supervised Evaluators
- Unsupervised Evaluators
- Hyperparameters and Tuning Experiments
- Building a Parameter Grid
- Splitting the Data into Training and Test Sets
- Cross-Validation: A Better Way to Test Your Models
- Machine Learning Pipelines
- Constructing a Pipeline
- How Does Splitting Work with the Pipeline API?
- Persistence
- Summary
- 7. Bridging Spark and Deep Learning Frameworks
- The Two Clusters Approach
- Implementing a Dedicated Data Access Layer
- Features of a DAL
- Selecting a DAL
- What Is Petastorm?
- SparkDatasetConverter
- Petastorm as a Parquet Store
- Project Hydrogen
- Barrier Execution Mode
- Accelerator-Aware Scheduling
- A Brief Introduction to the Horovod Estimator API
- Summary
- 8. TensorFlow Distributed Machine Learning Approach
- A Quick Overview of TensorFlow
- What Is a Neural Network?
- TensorFlow Cluster Process Roles and Responsibilities
- Loading Parquet Data into a TensorFlow Dataset
- An Inside Look at TensorFlow's Distributed Machine Learning Strategies
- ParameterServerStrategy
- CentralStorageStrategy: One Machine, Multiple Processors
- MirroredStrategy: One Machine, Multiple Processors, Local Copy
- MultiWorkerMirroredStrategy: Multiple Machines, Synchronous
- TPUStrategy
- What Things Change When You Switch Strategies?
- Training APIs
- Keras API
- MobileNetV2 transfer learning case study
- Training the Keras MobileNetV2 algorithm from scratch
- Custom Training Loop
- Estimator API
- Putting It All Together
- Troubleshooting
- Summary
- 9. PyTorch Distributed Machine Learning Approach
- A Quick Overview of PyTorch Basics
- Computation Graph
- PyTorch Mechanics and Concepts
- PyTorch Distributed Strategies for Training Models
- Introduction to PyTorch's Distributed Approach
- Distributed Data-Parallel Training
- RPC-Based Distributed Training
- Remote execution
- Remote references
- Using RRefs to orchestrate distributed algorithms
- Identifying objects by reference
- Distributed autograd
- The distributed optimizer
- Communication Topologies in PyTorch (c10d)
- Collective communication in PyTorch
- Peer-to-peer communication in PyTorch
- What Can We Do with PyTorch's Low-Level APIs?
- Loading Data with PyTorch and Petastorm
- Troubleshooting Guidance for Working with Petastorm and Distributed PyTorch
- The Enigma of Mismatched Data Types
- The Mystery of Straggling Workers
- How Does PyTorch Differ from TensorFlow?
- Summary
- 10. Deployment Patterns for Machine Learning Models
- Deployment Patterns
- Pattern 1: Batch Prediction
- Pattern 2: Model-in-Service
- Pattern 3: Model-as-a-Service
- Determining Which Pattern to Use
- Production Software Requirements
- Monitoring Machine Learning Models in Production
- Data Drift
- Model Drift, Concept Drift
- Distributional Domain Shift (the Long Tail)
- What Metrics Should I Monitor in Production?
- How Do I Measure Changes Using My Monitoring System?
- Define a reference
- Measure the reference against fresh metrics values
- Algorithms to use for measuring
- What It Looks Like in Production
- The Production Feedback Loop
- Deploying with MLlib
- Production Machine Learning Pipelines with Structured Streaming
- Deploying with MLflow
- Defining an MLflow Wrapper
- Deploying the Model as a Microservice
- Loading the Model as a Spark UDF
- How to Develop Your System Iteratively
- Summary
- Index