Deep Learning at Scale
ebook
Author: Suneeta Mall
ISBN: 9781098145248
Pages: 448, Format: ebook
Publication date: 2024-06-18
Bookstore: Helion

Book price: 29.90 zł (previously: 299.00 zł)
You save: 90% (-269.10 zł)

Bringing a deep learning project into production at scale is challenging. To scale your project successfully, you need a foundational understanding of full stack deep learning: the knowledge that lies at the intersection of hardware, software, data, and algorithms.

This book illustrates the complex concepts of full stack deep learning and reinforces them through hands-on exercises, arming you with the tools and techniques to scale your project. A scaling effort is only beneficial when it is both effective and efficient, and this guide explains the intricate concepts and techniques that will help you achieve both.

You'll gain a thorough understanding of:

  • How data flows through a deep learning network and the role computation graphs play in building your model
  • How accelerated computing speeds up your training and how best you can utilize the resources at your disposal
  • How to train your model using distributed training paradigms, i.e., data, model, and pipeline parallelism (a minimal data-parallel sketch follows this list)
  • How to leverage the PyTorch ecosystem in conjunction with NVIDIA libraries and Triton to scale your model training
  • How to debug, monitor, and investigate the bottlenecks that slow down your model training
  • How to expedite the training lifecycle and streamline your feedback loop to iterate on model development
  • Data tricks and techniques, and how to apply them to scale your model training
  • How to select the right tools and techniques for your deep-learning project
  • Options for managing the compute infrastructure when running at scale
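
To make the distributed training item above concrete, here is a minimal, self-contained sketch of the data-parallel paradigm using PyTorch's DistributedDataParallel. The toy linear model, random data, and hyperparameters are illustrative assumptions for this page, not code taken from the book.

    # launch with: torchrun --nproc_per_node=2 ddp_sketch.py
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP
    from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

    def main() -> None:
        # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each worker process.
        dist.init_process_group(backend="gloo")  # use "nccl" on GPU nodes
        rank = dist.get_rank()

        # Toy regression data; in practice each rank would read its shard of a real dataset.
        features = torch.randn(1024, 16)
        targets = features.sum(dim=1, keepdim=True)
        dataset = TensorDataset(features, targets)

        # DistributedSampler gives every rank a disjoint slice of the data.
        sampler = DistributedSampler(dataset)
        loader = DataLoader(dataset, batch_size=32, sampler=sampler)

        model = DDP(torch.nn.Linear(16, 1))  # gradients are all-reduced across ranks
        optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
        loss_fn = torch.nn.MSELoss()

        for epoch in range(3):
            sampler.set_epoch(epoch)  # reshuffle the shards each epoch
            for x, y in loader:
                optimizer.zero_grad()
                loss = loss_fn(model(x), y)
                loss.backward()  # DDP synchronizes gradients during backward
                optimizer.step()
            if rank == 0:
                print(f"epoch {epoch}: loss {loss.item():.4f}")

        dist.destroy_process_group()

    if __name__ == "__main__":
        main()

Each process trains on its own shard of the data while DDP averages gradients across processes after every backward pass, which is the essence of the data parallelism covered in Part II; model and pipeline parallelism instead split the network itself across devices.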

Table of Contents

  • Preface
    • Why Scaling Matters
    • Who This Book Is For
    • How This Book Is Organized
      • Introduction
      • Part I: Foundational Concepts of Deep Learning
      • Part II: Distributed Training
      • Part III: Extreme Scaling
    • What You Need to Use This Book
    • Setting Up Your Environment for Hands-on Exercises
    • Using Code Examples
    • Conventions Used in This Book
    • O'Reilly Online Learning
    • How to Contact Us
    • Acknowledgments
  • 1. What Nature and History Have Taught Us About Scale
    • The Philosophy of Scaling
      • The General Law of Scaling
      • History of Scaling Law
    • Scalable Systems
      • Nature as a Scalable System
      • Our Visual System: A Biological Inspiration
    • Artificial Intelligence: The Evolution of Learnable Systems
      • It Takes Four to Tango
        • The hardware
        • The data
        • The software
        • The (deep learning) algorithms
      • Evolving Deep Learning Trends
        • General evolution of deep learning
        • Evolution in specialized domains
          • Math and compute
          • Protein folding
          • Simulated world
    • Scale in the Context of Deep Learning
      • Six Development Considerations
        • Well-defined problem
        • Domain knowledge (a.k.a. the constraints)
        • Ground truth
        • Model development
        • Deployment
        • Feedback
      • Scaling Considerations
        • Questions to ask before scaling
        • Characteristics of scalable systems
          • Reliability
          • Availability
          • Adaptability
          • Performance
        • Considerations of scalable systems
          • Avoiding single points of failure
          • Designing for high availability
          • Scaling paradigms
          • Coordination and communication
          • Caching and intermittent storage
          • Process state
          • Graceful recovery and checkpointing
          • Maintainability and observability
        • Scaling effectively
    • Summary
  • I. Foundational Concepts of Deep Learning
  • 2. Deep Learning
    • The Role of Data in Deep Learning
    • Data Flow in Deep Learning
    • Hands-On Exercise #1: Implementing Minimalistic Deep Learning
      • Developing the Model
        • Model input data and pipeline
        • Model
        • Training loop
          • Loss
          • Metrics
      • The Embedded/Latent Space
      • A Word of Caution
      • The Learning Rate and Loss Landscape
      • Scaling Consideration
      • Profiling
    • Hands-On Exercise #2: Getting Complex with PyTorch
      • Model Input Data and Pipeline
      • Model
      • Auxiliary Utilities
        • Callbacks
        • Loggers
        • Profilers
      • Putting It All Together
    • Computation Graphs
    • Inference
    • Summary
  • 3. The Computational Side of Deep Learning
    • The Higgs Boson of the Digital World
      • Floating-Point Numbers: The Faux Continuous Numbers
        • Floating-point encoding
        • Floating-point standards
      • Units of Data Measurement
      • Data Storage Formats: The Trade-off of Latency and Throughput
    • Computer Architecture
      • The Birth of the Electromechanical Engine
      • Memory and Persistence
        • Virtual memory
        • Input/output
        • Memory and Moore's law
      • Computation and Memory Combined
    • The Scaling Laws of Electronics
    • Scaling Out Computation with Parallelization
      • Threads Versus Processes: The Unit of Parallelization
        • Simultaneous multithreading
        • Scenario walkthrough: A web crawler to curate a links dataset
      • Hardware-Optimized Libraries for Acceleration
      • Parallel Computer Architectures: Flynn's and Duncan's Taxonomies
    • Accelerated Computing
      • Popular Accelerated Devices for Deep Learning
        • Graphics processing units (GPUs)
          • GPU microarchitecture
      • CUDA
        • NVIDIA's dominance: The competition landscape
        • Application-specific integrated circuits (ASICs)
          • Tensor Processing Units (TPUs)
          • Intelligence Processing Units (IPUs)
        • Field programmable gate arrays (FPGAs)
        • Wafer Scale Engines (WSEs)
      • Accelerator Benchmarking
    • Summary
  • 4. Putting It All Together: Efficient Deep Learning
    • Hands-On Exercise #1: GPT-2
      • Exercise Objectives
      • Model Architecture
        • Key contributors to scale
          • Transformer attention block
          • Unsupervised training
          • Zero-shot learning
          • Parameter scale
      • Implementation
        • model.py
        • dataset.py
        • app.py
      • Running the Example
      • Experiment Tracking
      • Measuring to Understand the Limitations and Scale Out
        • Running on a CPU
        • Running on a GPU
      • Transitioning from Language to Vision
    • Hands-On Exercise #2: Vision Model with Convolution
      • Model Architecture
        • Key contributors to scale in the scene parsing exercise
          • Scaling with convolutions
          • Scaling with EfficientNet
        • Implementation
      • Running the Example
      • Observations
    • Graph Compilation Using PyTorch 2.0
      • New Components of PyTorch 2.0
      • Graph Execution in PyTorch 2.0
        • Graph acquisition
        • Graph lowering
        • Graph compilation
    • Modeling Techniques to Scale Training on a Single Device
      • Graph Compilation
      • Reduced- and Mixed-Precision Training
        • Mixed precision
        • The effect of precision on gradients
          • Gradient scaling
          • Gradient clipping
          • 8-bit optimizers and quantization
        • A mixed-precision algorithm
      • Memory Tricks for Efficiency
        • Memory layout
        • Feature compression
        • Meta and fake tensors
      • Optimizer Efficiencies
        • Stochastic gradient descent (SGD)
        • Gradient accumulation
        • Gradient checkpointing
        • Patch Gradient Descent
        • Learning rate and weight decay
      • Model Input Pipeline Tricks
      • Writing Custom Kernels in PyTorch 2.0 with Triton
    • Summary
  • II. Distributed Training
  • 5. Distributed Systems and Communications
    • Distributed Systems
      • The Eight Fallacies of Distributed Computing
      • The Consistency, Availability, and Partition Tolerance (CAP) Theorem
      • The Scaling Law of Distributed Systems
      • Types of Distributed Systems
        • Centralized
        • Decentralized
    • Communication in Distributed Systems
      • Communication Paradigm
      • Communication Patterns
        • Basic communication patterns
        • Collective communication patterns
      • Communication Technologies
        • RPC
      • MPI
        • NCCL
        • Communication technology summary
      • Communication Initialization: Rendezvous
      • Hands-On Exercise
    • Scaling Compute Capacity
      • Infrastructure Setup Options
        • Private cloud (on-premise/DIY data centers)
        • Public cloud
        • Hybrid cloud
        • Multicloud
        • Federation
      • Provisioning of Accelerated Devices
      • Workload Management
        • Slurm
        • Kubernetes
        • Ray
          • Distributed memory layer
          • Asynchronous model
        • Amazon SageMaker
        • Google Vertex AI
    • Deep Learning Infrastructure Review
      • Overview of Leading Deep Learning Clusters
      • Similarities Between Today's Most Powerful Systems
    • Summary
  • 6. Theoretical Foundations of Distributed Deep Learning
    • Distributed Deep Learning
      • Centralized DDL
        • Parameter server configurations
        • Subtypes of centralized DDL
          • Synchronous centralized DDL
          • Asynchronous centralized DDL
      • Decentralized DDL
        • Limiting divergence
        • Subtypes of decentralized DDL
          • Synchronous decentralized DDL
          • Asynchronous decentralized DDL
    • Dimensions of Scaling Distributed Deep Learning
      • Partitioning Dimensions of Distributed Deep Learning
      • Types of Distributed Deep Learning Techniques
        • Ensembling
        • Data parallelism
        • Model parallelism
        • Pipeline parallelism
        • Tensor parallelism
        • Hybrid parallelism
        • Federation/collaborative learning
      • Choosing a Scaling Technique
    • Measuring Scale
      • End-to-End Metrics and Benchmarks
        • Time to convergence
        • Cost to train
        • Multilevel benchmarks
      • Measuring Incrementally in a Reproducible Environment
    • Summary
  • 7. Data Parallelism
    • Data Partitioning
      • Implications of Data Sampling Strategies
      • Working with Remote Datasets
    • Introduction to Data Parallel Techniques
      • Hands-On Exercise #1: Centralized Parameter Server Using RPC
        • Setup
        • Observations
          • Inspecting involved processes
          • Inspecting connections
          • Communication patterns
        • Discussion
      • Hands-On Exercise #2: Centralized Gradient-Partitioned Joint Worker/Server Distributed Training
        • Setup
        • Observations
          • Communication patterns
        • Discussion
      • Hands-On Exercise #3: Decentralized Asynchronous Distributed Training
        • Setup
        • Observations
          • Communication patterns
        • Discussion
    • Centralized Synchronous Data Parallel Strategies
      • Data Parallel (DP)
      • Distributed Data Parallel (DDP)
        • Devil in the details
        • Distributed Data Parallel 2 (DDP2)
      • Zero Redundancy Optimizer-Powered Data Parallelism (ZeRO-DP)
      • Fault-Tolerant Training
      • Hands-On Exercise #4: Scene Parsing with DDP
        • Setup
        • Observations
          • Baseline
          • Multi-GPU training
          • Multinode
          • Mixed-precision training
      • Hands-On Exercise #5: Distributed Sharded DDP (ZeRO)
        • Setup
          • Runtime configuration
        • Observations
        • Discussion
    • Building Efficient Pipelines
      • Dataset Format
      • Local Versus Remote
      • Staging
      • Threads Versus Processes: Scaling Your Pipelines
      • Memory Tricks
      • Data Augmentations: CPU Versus GPU
      • JIT Acceleration
      • Hands-On Exercise #6: Pipeline Efficiency with FFCV
        • Setup
          • Runtime configuration
        • Observations
    • Summary
  • 8. Scaling Beyond Data Parallelism: Model, Pipeline, Tensor, and Hybrid Parallelism
    • Questions to Ask Before Scaling Vertically
    • Theoretical Foundations of Vertical Scaling
      • Revisiting the Dimensions of Scaling
        • Implementing tensor parallelism
        • Implementing model parallelism
        • Choosing a scaling dimension
      • Operators Perspective of Parallelism Dimensions
      • Data Flow and Communications in Vertical Scaling
        • Tensor parallelism
        • Model parallelism
        • Pipeline parallelism: An evolution of model parallelism
          • GPipe
          • PipeDream
        • Hybrid parallelism
          • 2D hybrid parallelism
          • 3D hybrid parallelism
    • Basic Building Blocks for Scaling Beyond DP
      • PyTorch Primitives for Vertical Scaling
        • Device mesh: Mapping model architecture to physical devices
        • Distributed tensors: Tensors with sharding and replication
          • Sharding and replication examples
          • Partial tensors
        • Logical tensors: Representation without materialization
          • Meta tensors
          • Fake tensors
      • Working with Larger Models
      • Distributed Checkpointing: Saving the Partitioned Model
    • Summary
  • 9. Gaining Practical Expertise with Scaling Across All Dimensions
    • Hands-On Exercises: Model, Tensor, Pipeline, and Hybrid Parallelism
      • The Dataset
      • Hands-On Exercise #1: Baseline DeepFM
        • Training
        • Observations
      • Hands-On Exercise #2: Model Parallel DeepFM
        • Implementation details
        • Observations
      • Hands-On Exercise #3: Pipeline Parallel DeepFM
        • Implementation details
        • Observations
      • Hands-On Exercise #4: Pipeline Parallel DeepFM with RPC
        • Implementation details
        • Observations
      • Hands-On Exercise #5: Tensor Parallel DeepFM
        • Implementation details
        • Observations
      • Hands-On Exercise #6: Hybrid Parallel DeepFM
        • Implementation details
        • Observations
    • Tools and Libraries for Vertical Scaling
      • OneFlow
      • FairScale
      • DeepSpeed
      • FSDP
      • Overview and Comparison
      • Hands-On Exercise #7: Automatic Vertical Scaling with DeepSpeed
      • Observations
    • Summary
  • III. Extreme Scaling
  • 10. Data-Centric Scaling
    • The Seven Vs of Data Through a Deep Learning Lens
    • The Scaling Law of Data
    • Data Quality
      • Validity
      • Variety
        • Handling too much variety
          • Heuristic-based pruning
          • Algorithmic outlier pruning
          • Hands-on exercise #1: Outlier detection
          • Scaling outlier detection
        • Handling too-low variety
          • Data augmentation
          • Advanced data augmentation
          • Automated augmentation
          • Synthetic data generation
        • Handling imbalance
          • Sampling
          • Hands-on exercise #2: Handling imbalance in a multilabel dataset
          • Loss tricks
      • Veracity
        • Reasons for error in labels
        • Approaches to labeling
        • Techniques to increase veracity/decrease noise
          • Using heuristics to identify noise
          • Using inter-label information, such as ontology
          • Continuous feedback
          • Handling disagreements from multiple annotators
          • Identifying noisy samples by loss gradients
          • Hands-on exercise #3: Loss tricks to find noisy samples
          • Using confident learning
          • Summary of veracity tactics
      • Value and Volume
        • Core principles driving value
        • Volume reduction via compression and pruning
        • Volume reduction via dimensionality reduction
        • Volume reduction via approximation
        • Volume reduction via distillation
        • Value via regularization
    • The Data Engine and Continual Learning
      • Volatility
      • Velocity
    • Summary
  • 11. Scaling Experiments: Effective Planning and Management
    • Model Development Is Iterative
    • Planning for Experiments and Execution
      • Simplify the Complex
      • Fast Iteration for Fast Feedback
      • Decoupled Iterations
      • Feasibility Testing
      • Developing and Scaling a Minimal Viable Solution
      • Setting Up for Iterative Execution
    • Techniques to Scale Your Experiments
      • Accelerating Model Convergence
        • Using transfer learning
          • Retraining
          • Fine tuning
          • Pretraining
        • Knowledge distillation
      • Accelerating Learning Via Optimization and Automation
        • Hyperparameter optimization
        • AutoML
          • Neural architecture search
          • Model validation
        • Simulating optimization behavior with Daydream
      • Accelerating Learning by Increasing Expertise
        • Continuous learning
        • Learning to learn via meta-learning
        • Curriculum learning
        • Mixture of experts
      • Learning with Scarce Supervision
        • Self-supervised learning
        • Contrastive learning
    • Hands-On Exercises
      • Hands-On Exercise #1: Transfer Learning
      • Hands-On Exercise #2: Hyperparameter Optimization
      • Hands-On Exercise #3: Knowledge Distillation
      • Hands-On Exercise #4: Mixture of Experts
        • Mock MoE
        • DeepSpeed-MoE
      • Hands-On Exercise #5: Contrastive Learning
      • Hands-On Exercise #6: Meta-Learning
    • Summary
  • 12. Efficient Fine-Tuning of Large Models
    • Review of Fine-Tuning Techniques
      • Standard Fine Tuning
      • Meta-Learning (Zero-/Few-Shot Learning)
      • Adapter-Based Fine Tuning
      • Low-Rank Tuning
    • LoRA: Parameter-Efficient Fine Tuning
    • Quantized LoRA (QLoRA)
    • Hands-on Exercise: QLoRA-Based Fine Tuning
      • Implementation Details
      • Inference
      • Exercise Summary
    • Summary
  • 13. Foundation Models
    • What Are Foundation Models?
    • The Evolution of Foundation Models
    • Challenges Involved in Developing Foundation Models
      • Measurement Complexity
      • Deployment Challenges
      • Propagation of Defects to All Downstream Models
      • Legal and Ethical Considerations
      • Ensuring Consistency and Coherency
    • Multimodal Large Language Models
      • Projection
      • Gated Cross-Attention
      • Query-Based Encoding
      • Further Exploration
    • Summary
  • Index
