
Scaling Machine Learning with Spark
ebook
Author: Adi Polak
ISBN: 9781098106775
Pages: 294, Format: ebook
Publication date: 2023-03-07
Bookstore: Helion

Book price: 29.90 zł (previously: 299.00 zł)
You save: 90% (-269.10 zł)


Learn how to build end-to-end scalable machine learning solutions with Apache Spark. With this practical guide, author Adi Polak introduces data and ML practitioners to creative solutions that supersede today's traditional methods. You'll learn a more holistic approach that takes you beyond specific requirements and organizational goals, allowing data and ML practitioners to collaborate and understand each other better.

Scaling Machine Learning with Spark examines several technologies for building end-to-end distributed ML workflows based on the Apache Spark ecosystem with Spark MLlib, MLflow, TensorFlow, and PyTorch. If you're a data scientist who works with machine learning, this book shows you when and why to use each technology.

You will:

  • Explore machine learning, including distributed computing concepts and terminology
  • Manage the ML lifecycle with MLflow
  • Ingest data and perform basic preprocessing with Spark
  • Explore feature engineering, and use Spark to extract features
  • Train a model with MLlib and build a pipeline to reproduce it
  • Build a data system to combine the power of Spark with deep learning
  • Get a step-by-step example of working with distributed TensorFlow
  • Use PyTorch to scale machine learning, and explore its internal architecture


Table of Contents

  • Preface
    • Who Should Read This Book?
    • Do You Need Distributed Machine Learning?
    • Navigating This Book
    • What Is Not Covered
    • The Environment and Tools
      • The Tools
      • The Datasets
    • Conventions Used in This Book
    • Using Code Examples
    • O'Reilly Online Learning
    • How to Contact Us
    • Acknowledgments
  • 1. Distributed Machine Learning Terminology and Concepts
    • The Stages of the Machine Learning Workflow
    • Tools and Technologies in the Machine Learning Pipeline
    • Distributed Computing Models
      • General-Purpose Models
        • MapReduce
        • MPI
        • Barrier
        • Shared memory
      • Dedicated Distributed Computing Models
    • Introduction to Distributed Systems Architecture
      • Centralized Versus Decentralized Systems
      • Interaction Models
        • Client/server
        • Peer-to-peer
        • Geo-distributed
      • Communication in a Distributed Setting
        • Asynchronous
        • Synchronous
    • Introduction to Ensemble Methods
      • High Versus Low Bias
      • Types of Ensemble Methods
      • Distributed Training Topologies
        • Centralized ensemble learning
        • Decentralized decision trees
        • Centralized, distributed training with parameter servers
        • Centralized, distributed training in a P2P topology
    • The Challenges of Distributed Machine Learning Systems
      • Performance
        • Data parallelism versus model parallelism
        • Combining data parallelism and model parallelism
        • Deep learning
      • Resource Management
      • Fault Tolerance
      • Privacy
      • Portability
    • Setting Up Your Local Environment
      • Chapters 2–6 Tutorials Environment
      • Chapters 7–10 Tutorials Environment
    • Summary
  • 2. Introduction to Spark and PySpark
    • Apache Spark Architecture
    • Intro to PySpark
    • Apache Spark Basics
      • Software Architecture
        • Creating a custom schema
        • Key Spark data abstractions and APIs
        • DataFrames are immutable
      • PySpark and Functional Programming
      • Executing PySpark Code
    • pandas DataFrames Versus Spark DataFrames
    • Scikit-Learn Versus MLlib
    • Summary
  • 3. Managing the Machine Learning Experiment Lifecycle with MLflow
    • Machine Learning Lifecycle Management Requirements
    • What Is MLflow?
      • Software Components of the MLflow Platform
      • Users of the MLflow Platform
    • MLflow Components
      • MLflow Tracking
        • Using MLflow Tracking to record runs
        • Logging your dataset path and version
      • MLflow Projects
      • MLflow Models
      • MLflow Model Registry
        • Registering models
        • Transitioning between model stages
    • Using MLflow at Scale
    • Summary
  • 4. Data Ingestion, Preprocessing, and Descriptive Statistics
    • Data Ingestion with Spark
      • Working with Images
        • Image format
        • Binary format
      • Working with Tabular Data
    • Preprocessing Data
      • Preprocessing Versus Processing
      • Why Preprocess the Data?
      • Data Structures
      • MLlib Data Types
      • Preprocessing with MLlib Transformers
        • Working with text data
        • From nominal categorical features to indices
        • Structuring continuous numerical data
        • Additional transformers
      • Preprocessing Image Data
        • Extracting labels
        • Transforming labels to indices
        • Extracting image size
      • Save the Data and Avoid the Small Files Problem
        • Avoiding small files
        • Image compression and Parquet
    • Descriptive Statistics: Getting a Feel for the Data
      • Calculating Statistics
      • Descriptive Statistics with Spark Summarizer
      • Data Skewness
      • Correlation
        • Pearson correlation
        • Spearman correlation
    • Summary
  • 5. Feature Engineering
    • Features and Their Impact on Models
    • MLlib Featurization Tools
      • Extractors
      • Selectors
      • Example: Word2Vec
    • The Image Featurization Process
      • Understanding Image Manipulation
        • Grayscale
        • Defining image boundaries using image gradients
      • Extracting Features with Spark APIs
        • pyspark.sql.functions: pandas_udf and Python type hints
        • pyspark.sql.GroupedData: applyInPandas and mapInPandas
    • The Text Featurization Process
      • Bag-of-Words
      • TF-IDF
      • N-Gram
      • Additional Techniques
    • Enriching the Dataset
    • Summary
  • 6. Training Models with Spark MLlib
    • Algorithms
    • Supervised Machine Learning
      • Classification
        • MLlib classification algorithms
        • Implementing multilabel classification support
        • What about imbalanced class labels?
      • Regression
        • Recommendation systems
        • ALS for collaborative filtering
    • Unsupervised Machine Learning
      • Frequent Pattern Mining
      • Clustering
    • Evaluating
      • Supervised Evaluators
      • Unsupervised Evaluators
    • Hyperparameters and Tuning Experiments
      • Building a Parameter Grid
      • Splitting the Data into Training and Test Sets
      • Cross-Validation: A Better Way to Test Your Models
    • Machine Learning Pipelines
      • Constructing a Pipeline
      • How Does Splitting Work with the Pipeline API?
    • Persistence
    • Summary
  • 7. Bridging Spark and Deep Learning Frameworks
    • The Two Clusters Approach
    • Implementing a Dedicated Data Access Layer
      • Features of a DAL
      • Selecting a DAL
    • What Is Petastorm?
      • SparkDatasetConverter
      • Petastorm as a Parquet Store
    • Project Hydrogen
      • Barrier Execution Mode
      • Accelerator-Aware Scheduling
    • A Brief Introduction to the Horovod Estimator API
    • Summary
  • 8. TensorFlow Distributed Machine Learning Approach
    • A Quick Overview of TensorFlow
      • What Is a Neural Network?
      • TensorFlow Cluster Process Roles and Responsibilities
    • Loading Parquet Data into a TensorFlow Dataset
    • An Inside Look at TensorFlow's Distributed Machine Learning Strategies
      • ParameterServerStrategy
      • CentralStorageStrategy: One Machine, Multiple Processors
      • MirroredStrategy: One Machine, Multiple Processors, Local Copy
      • MultiWorkerMirroredStrategy: Multiple Machines, Synchronous
      • TPUStrategy
      • What Things Change When You Switch Strategies?
    • Training APIs
      • Keras API
        • MobileNetV2 transfer learning case study
        • Training the Keras MobileNetV2 algorithm from scratch
      • Custom Training Loop
      • Estimator API
    • Putting It All Together
    • Troubleshooting
    • Summary
  • 9. PyTorch Distributed Machine Learning Approach
    • A Quick Overview of PyTorch Basics
      • Computation Graph
      • PyTorch Mechanics and Concepts
    • PyTorch Distributed Strategies for Training Models
      • Introduction to PyTorch's Distributed Approach
      • Distributed Data-Parallel Training
      • RPC-Based Distributed Training
        • Remote execution
        • Remote references
          • Using RRefs to orchestrate distributed algorithms
          • Identifying objects by reference
        • Distributed autograd
        • The distributed optimizer
      • Communication Topologies in PyTorch (c10d)
        • Collective communication in PyTorch
        • Peer-to-peer communication in PyTorch
      • What Can We Do with PyTorch's Low-Level APIs?
    • Loading Data with PyTorch and Petastorm
    • Troubleshooting Guidance for Working with Petastorm and Distributed PyTorch
      • The Enigma of Mismatched Data Types
      • The Mystery of Straggling Workers
    • How Does PyTorch Differ from TensorFlow?
    • Summary
  • 10. Deployment Patterns for Machine Learning Models
    • Deployment Patterns
      • Pattern 1: Batch Prediction
      • Pattern 2: Model-in-Service
      • Pattern 3: Model-as-a-Service
      • Determining Which Pattern to Use
      • Production Software Requirements
    • Monitoring Machine Learning Models in Production
      • Data Drift
      • Model Drift, Concept Drift
      • Distributional Domain Shift (the Long Tail)
      • What Metrics Should I Monitor in Production?
      • How Do I Measure Changes Using My Monitoring System?
        • Define a reference
        • Measure the reference against fresh metrics values
        • Algorithms to use for measuring
      • What It Looks Like in Production
    • The Production Feedback Loop
    • Deploying with MLlib
      • Production Machine Learning Pipelines with Structured Streaming
    • Deploying with MLflow
      • Defining an MLflow Wrapper
      • Deploying the Model as a Microservice
      • Loading the Model as a Spark UDF
    • How to Develop Your System Iteratively
    • Summary
  • Index
