Scaling Machine Learning with Spark - Helion
ISBN: 9781098106775
Pages: 294, Format: ebook
Publication date: 2023-03-07
Bookstore: Helion
Book price: 29,90 zł (previously: 299,00 zł)
You save: 90% (-269,10 zł)
Learn how to build end-to-end scalable machine learning solutions with Apache Spark. With this practical guide, author Adi Polak introduces data and ML practitioners to creative solutions that supersede today's traditional methods. You'll learn a more holistic approach that takes you beyond specific requirements and organizational goals--allowing data and ML practitioners to collaborate and understand each other better.
Scaling Machine Learning with Spark examines several technologies for building end-to-end distributed ML workflows based on the Apache Spark ecosystem with Spark MLlib, MLflow, TensorFlow, and PyTorch. If you're a data scientist who works with machine learning, this book shows you when and why to use each technology.
You will:
- Explore machine learning, including distributed computing concepts and terminology
- Manage the ML lifecycle with MLflow
- Ingest data and perform basic preprocessing with Spark
- Explore feature engineering, and use Spark to extract features
- Train a model with MLlib and build a pipeline to reproduce it (a minimal sketch of this workflow follows this list)
- Build a data system to combine the power of Spark with deep learning
- Get a step-by-step example of working with distributed TensorFlow
- Use PyTorch to scale machine learning, and explore its internal architecture
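To give a flavor of the workflow the book walks through, here is a minimal, hypothetical sketch (not taken from the book) that ingests data with Spark, trains a model in an MLlib pipeline, and records the run with MLflow. The file path, column names, and parameter values are placeholders.

```python
# Minimal sketch, assuming a Parquet dataset with columns
# "category", "f1", "f2", "f3" -- all names here are hypothetical.
import mlflow
import mlflow.spark
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("scaling-ml-sketch").getOrCreate()

# Ingest data with Spark (hypothetical path).
df = spark.read.parquet("/data/training.parquet")

# Basic preprocessing and feature extraction as pipeline stages.
indexer = StringIndexer(inputCol="category", outputCol="label")
assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
lr = LogisticRegression(maxIter=10)
pipeline = Pipeline(stages=[indexer, assembler, lr])

# Track the experiment with MLflow so the run can be reproduced later.
with mlflow.start_run():
    mlflow.log_param("maxIter", 10)
    model = pipeline.fit(df)
    mlflow.spark.log_model(model, "model")
```

The fitted pipeline can later be reloaded and applied to new data, which is the reproducibility idea the MLlib and MLflow chapters develop in depth.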
Customers who bought "Scaling Machine Learning with Spark" also chose:
- Cisco CCNA 200-301. Kurs video. Administrowanie bezpieczeństwem sieci. Część 3 -- 665,00 zł (now 39,90 zł, -94%)
- Cisco CCNA 200-301. Kurs video. Administrowanie urządzeniami Cisco. Część 2 -- 665,00 zł (now 39,90 zł, -94%)
- Cisco CCNA 200-301. Kurs video. Podstawy sieci komputerowych i konfiguracji. Część 1 -- 665,00 zł (now 39,90 zł, -94%)
- Impact of P2P and Free Distribution on Book Sales -- 427,14 zł (now 29,90 zł, -93%)
- Cisco CCNP Enterprise 350-401 ENCOR. Kurs video. Programowanie i automatyzacja sieci -- 443,33 zł (now 39,90 zł, -91%)
Scaling Machine Learning with Spark eBook -- Table of Contents
- Preface
- Who Should Read This Book?
- Do You Need Distributed Machine Learning?
- Navigating This Book
- What Is Not Covered
- The Environment and Tools
- The Tools
- The Datasets
- Conventions Used in This Book
- Using Code Examples
- O'Reilly Online Learning
- How to Contact Us
- Acknowledgments
- 1. Distributed Machine Learning Terminology and Concepts
- The Stages of the Machine Learning Workflow
- Tools and Technologies in the Machine Learning Pipeline
- Distributed Computing Models
- General-Purpose Models
- MapReduce
- MPI
- Barrier
- Shared memory
- Dedicated Distributed Computing Models
- Introduction to Distributed Systems Architecture
- Centralized Versus Decentralized Systems
- Interaction Models
- Client/server
- Peer-to-peer
- Geo-distributed
- Communication in a Distributed Setting
- Asynchronous
- Synchronous
- Introduction to Ensemble Methods
- High Versus Low Bias
- Types of Ensemble Methods
- Distributed Training Topologies
- Centralized ensemble learning
- Decentralized decision trees
- Centralized, distributed training with parameter servers
- Centralized, distributed training in a P2P topology
- The Challenges of Distributed Machine Learning Systems
- Performance
- Data parallelism versus model parallelism
- Combining data parallelism and model parallelism
- Deep learning
- Resource Management
- Fault Tolerance
- Privacy
- Portability
- Setting Up Your Local Environment
- Chapters 2-6 Tutorials Environment
- Chapters 7-10 Tutorials Environment
- Summary
- 2. Introduction to Spark and PySpark
- Apache Spark Architecture
- Intro to PySpark
- Apache Spark Basics
- Software Architecture
- Creating a custom schema
- Key Spark data abstractions and APIs
- DataFrames are immutable
- PySpark and Functional Programming
- Executing PySpark Code
- pandas DataFrames Versus Spark DataFrames
- Scikit-Learn Versus MLlib
- Summary
- 3. Managing the Machine Learning Experiment Lifecycle with MLflow
- Machine Learning Lifecycle Management Requirements
- What Is MLflow?
- Software Components of the MLflow Platform
- Users of the MLflow Platform
- MLflow Components
- MLflow Tracking
- Using MLflow Tracking to record runs
- Logging your dataset path and version
- MLflow Projects
- MLflow Models
- MLflow Model Registry
- Registering models
- Transitioning between model stages
- Using MLflow at Scale
- Summary
- 4. Data Ingestion, Preprocessing, and Descriptive Statistics
- Data Ingestion with Spark
- Working with Images
- Image format
- Binary format
- Working with Tabular Data
- Preprocessing Data
- Preprocessing Versus Processing
- Why Preprocess the Data?
- Data Structures
- MLlib Data Types
- Preprocessing with MLlib Transformers
- Working with text data
- From nominal categorical features to indices
- Structuring continuous numerical data
- Additional transformers
- Preprocessing Image Data
- Extracting labels
- Transforming labels to indices
- Extracting image size
- Save the Data and Avoid the Small Files Problem
- Avoiding small files
- Image compression and Parquet
- Descriptive Statistics: Getting a Feel for the Data
- Calculating Statistics
- Descriptive Statistics with Spark Summarizer
- Data Skewness
- Correlation
- Pearson correlation
- Spearman correlation
- Summary
- 5. Feature Engineering
- Features and Their Impact on Models
- MLlib Featurization Tools
- Extractors
- Selectors
- Example: Word2Vec
- The Image Featurization Process
- Understanding Image Manipulation
- Grayscale
- Defining image boundaries using image gradients
- Extracting Features with Spark APIs
- pyspark.sql.functions: pandas_udf and Python type hints
- pyspark.sql.GroupedData: applyInPandas and mapInPandas
- The Text Featurization Process
- Bag-of-Words
- TF-IDF
- N-Gram
- Additional Techniques
- Enriching the Dataset
- Summary
- 6. Training Models with Spark MLlib
- Algorithms
- Supervised Machine Learning
- Classification
- MLlib classification algorithms
- Implementing multilabel classification support
- What about imbalanced class labels?
- Regression
- Recommendation systems
- ALS for collaborative filtering
- Unsupervised Machine Learning
- Frequent Pattern Mining
- Clustering
- Evaluating
- Supervised Evaluators
- Unsupervised Evaluators
- Hyperparameters and Tuning Experiments
- Building a Parameter Grid
- Splitting the Data into Training and Test Sets
- Cross-Validation: A Better Way to Test Your Models
- Machine Learning Pipelines
- Constructing a Pipeline
- How Does Splitting Work with the Pipeline API?
- Persistence
- Summary
- 7. Bridging Spark and Deep Learning Frameworks
- The Two Clusters Approach
- Implementing a Dedicated Data Access Layer
- Features of a DAL
- Selecting a DAL
- What Is Petastorm?
- SparkDatasetConverter
- Petastorm as a Parquet Store
- Project Hydrogen
- Barrier Execution Mode
- Accelerator-Aware Scheduling
- A Brief Introduction to the Horovod Estimator API
- Summary
- 8. TensorFlow Distributed Machine Learning Approach
- A Quick Overview of TensorFlow
- What Is a Neural Network?
- TensorFlow Cluster Process Roles and Responsibilities
- Loading Parquet Data into a TensorFlow Dataset
- An Inside Look at TensorFlow's Distributed Machine Learning Strategies
- ParameterServerStrategy
- CentralStorageStrategy: One Machine, Multiple Processors
- MirroredStrategy: One Machine, Multiple Processors, Local Copy
- MultiWorkerMirroredStrategy: Multiple Machines, Synchronous
- TPUStrategy
- What Things Change When You Switch Strategies?
- Training APIs
- Keras API
- MobileNetV2 transfer learning case study
- Training the Keras MobileNetV2 algorithm from scratch
- Custom Training Loop
- Estimator API
- Putting It All Together
- Troubleshooting
- Summary
- 9. PyTorch Distributed Machine Learning Approach
- A Quick Overview of PyTorch Basics
- Computation Graph
- PyTorch Mechanics and Concepts
- PyTorch Distributed Strategies for Training Models
- Introduction to PyTorch's Distributed Approach
- Distributed Data-Parallel Training
- RPC-Based Distributed Training
- Remote execution
- Remote references
- Using RRefs to orchestrate distributed algorithms
- Identifying objects by reference
- Distributed autograd
- The distributed optimizer
- Communication Topologies in PyTorch (c10d)
- Collective communication in PyTorch
- Peer-to-peer communication in PyTorch
- What Can We Do with PyTorch's Low-Level APIs?
- Loading Data with PyTorch and Petastorm
- Troubleshooting Guidance for Working with Petastorm and Distributed PyTorch
- The Enigma of Mismatched Data Types
- The Mystery of Straggling Workers
- How Does PyTorch Differ from TensorFlow?
- Summary
- 10. Deployment Patterns for Machine Learning Models
- Deployment Patterns
- Pattern 1: Batch Prediction
- Pattern 2: Model-in-Service
- Pattern 3: Model-as-a-Service
- Determining Which Pattern to Use
- Production Software Requirements
- Monitoring Machine Learning Models in Production
- Data Drift
- Model Drift, Concept Drift
- Distributional Domain Shift (the Long Tail)
- What Metrics Should I Monitor in Production?
- How Do I Measure Changes Using My Monitoring System?
- Define a reference
- Measure the reference against fresh metrics values
- Algorithms to use for measuring
- What It Looks Like in Production
- The Production Feedback Loop
- Deploying with MLlib
- Production Machine Learning Pipelines with Structured Streaming
- Deploying with MLflow
- Defining an MLflow Wrapper
- Deploying the Model as a Microservice
- Loading the Model as a Spark UDF
- How to Develop Your System Iteratively
- Summary
- Index