Generative AI on Kubernetes. Operationalizing Large Language Models

ISBN: 9781098171889
Pages: 406, Format: ebook
Publication date: 2026-02-27
Bookstore: Helion
Book price: 169.14 zł (previously: 198.99 zł)
You save: 15% (-29.85 zł)
Generative AI is revolutionizing industries, and Kubernetes has fast become the backbone for deploying and managing these resource-intensive workloads. This book serves as a practical, hands-on guide for MLOps engineers, software developers, Kubernetes administrators, and AI professionals ready to combine AI innovation with the power of cloud native infrastructure. Authors Roland Huß and Daniele Zonca provide a clear road map for training, fine-tuning, deploying, and scaling GenAI models on Kubernetes, addressing challenges like resource optimization, automation, and security along the way.
With actionable insights and real-world examples, readers will learn to navigate the opportunities and complexities of managing GenAI applications in production environments. Whether you're experimenting with large-scale language models or facing the nuances of AI deployment at scale, you'll uncover the expertise you need to operationalize this exciting technology effectively.
- Learn how to deploy LLMs more efficiently with optimized inference runtimes
- Get hands-on with GPU scheduling, including hardware detection and multinode scaling
- Monitor and understand LLM-specific metrics like Time to First Token and token throughput
- Know when to fine-tune a model or when retrieval augmentation is the better choice
- Discover how to evaluate models with standardized benchmarks before committing GPU resources
- Learn to run agentic applications with secure tool integration, identity management, and persistent state
Customers who bought "Generative AI on Kubernetes. Operationalizing Large Language Models" also chose:
- Jak zarabia… 166.25 zł (39.90 zł, -76%)
- Lider w… 89.00 zł (44.50 zł, -50%)
- FAIK. Sztuczna inteligencja w s… 59.90 zł (29.95 zł, -50%)
- Praktyczne zastosowania generatywnej AI i ChatGPT. Wykorzystaj potencja… 87.00 zł (43.50 zł, -50%)
- AI dla tw… 79.00 zł (39.50 zł, -50%)
Table of Contents
- Preface
- Why We Wrote This Book
- Kubernetes
- Generative AI
- How This Book Is Structured
- Who This Book Is For
- What You Will Learn
- Conventions Used in This Book
- O'Reilly Online Learning
- How to Contact Us
- Acknowledgments
- Introduction
- Challenges of Running Generative AI at Scale
- Kubernetes for AI Workloads
- Understanding LLM Fundamentals
- How LLMs Process Text
- Tokenization and Embeddings
- Tokenizer implementation
- Embeddings
- The Two Phases of Inference
- Prefill
- Decode
- Overview
- Inference
- Production Readiness
- Tuning
- AI-Driven Applications
- I. Inference
- 1. Deploying Models
- It Works on My Machine
- Model Server
- vLLM
- Hugging Face Text Generation Inference
- Other Model Servers
- llama.cpp
- NVIDIA NIM
- SGLang
- Deploying Models to Kubernetes Manually
- Model Server Controller
- KServe
- From InferenceService to LLMInferenceService
- Ray Serve and KubeRay
- Lessons Learned
- 2. Model Data
- Model Data Storage Formats
- Weight-Only Formats
- Self-Contained Formats
- ONNX
- Safetensors
- GGUF and GGML
- Current State and Gaps
- Model Registry
- Hugging Face Model Hub
- MLflow Model Registry
- Kubeflow Model Registry
- OCI Registry
- Accessing Model Data in Kubernetes
- Shared Storage with PersistentVolumes
- OCI Image for Storing Model Data
- Modelcars
- OCI Image Volume Mounts
- Lessons Learned
- II. Production Readiness
- 3. Kubernetes and GPUs
- GPU Discovery
- Node Feature Discovery
- GPU Feature Discovery
- Kubernetes GPU Device Plug-Ins
- GPU Workload Scheduling
- Label-Based Scheduling
- nodeSelector
- Node affinity
- Taints and tolerations
- Resource-Based Scheduling
- Dynamic Resource Allocation
- NVIDIA GPU Operator
- Operator Configuration with ClusterPolicy
- Sub-GPU Allocation
- Time slicing
- Multi-Instance GPU
- Multi-GPU Inference
- Data Parallelism
- Model Parallelism
- Tensor parallelism
- Pipeline parallelism
- Hybrid parallelism
- Single-Node Versus Multinode Inference
- GPU Resource Optimizations
- Lessons Learned
- 4. Running in Production
- Model and Runtime Tuning
- Language Model Evaluation
- Language Model Compression
- Model Performance Benchmark
- vLLM Runtime Parameters Tuning
- Autoscaling
- Optimize vLLM Startup Time
- LLM-Aware Routing
- From API Gateway to AI Gateway
- Token-based rate limiting and user management
- Evolution of AI gateway capabilities
- Gateway API Inference Extension
- Disaggregated Serving
- Lessons Learned
- 5. Model Observability
- Observability Stack and Configuration
- Logs
- Metrics
- Tracing
- Model Server Metrics
- Time To First Token
- Time Per Output Token or Inter-Token Latency
- Throughput
- Latency
- Request Queue Metrics
- GPU Usage Monitoring
- Quality Metrics
- Responsible AI
- Explainability
- Fairness
- Model Safety: Hallucination and Guardrails
- Understanding and Detecting Hallucinations
- Runtime Guardrails
- NVIDIA NeMo Guardrails
- FMS Guardrails Orchestrator
- Guardrails AI
- Llama Stack and moderation APIs
- Lessons Learned
- III. Tuning
- 6. Model Customization
- Introduction to LLM Creation
- Prompt and Context Engineering
- When to Use Model Customization
- Tuning a Model
- Fine-Tuning
- Parameter-Efficient Fine-Tuning
- Low-Rank Adaptation
- Running Tuning Jobs on Kubernetes
- Kubeflow Trainer
- Other Frameworks
- DeepSpeed
- Ray
- Unsloth
- Lessons Learned
- 7. Job Scheduling Optimization
- Kubernetes Scheduler Optimization
- Core Kubernetes Scheduler
- Resource Bin Packing Strategy
- Dynamic Scheduling with Descheduler
- Gang Scheduling
- PyTorch Rendezvous and Gang Scheduling
- Comparing Gang Scheduling Solutions
- Coscheduling plug-in (PodGroup CRD)
- Kueue
- NVIDIA KAI Scheduler
- Volcano
- Making the right choice
- Topology-Aware Scheduling
- Comparing Topology-Aware Scheduling Solutions
- Coscheduling plug-in (PodGroup CRD)
- Kueue
- NVIDIA KAI Scheduler
- Volcano
- Making the right choice
- Quota Management and Multitenancy: GPU as a Service
- Comparing Quota Management and Multitenancy Solutions
- Kueue
- NVIDIA KAI Scheduler
- Volcano
- Making the right choice
- Network Optimization for Distributed Training
- Comparing Network Technologies for GPU Communication
- NVLink and AMD Infinity Fabric
- NVSwitch
- InfiniBand
- RoCE
- Standard Ethernet
- GPUDirect RDMA
- Making the right choice
- Using Secondary Network Interfaces in Kubernetes
- Bridging HPC and Kubernetes: Slurm and Slinky
- Storage for Training
- Training Job Security
- Security Guidelines for Ray
- Security Guidelines for PyTorch
- Observability of Training Jobs
- Metrics Collection for Distributed Training
- Logging Across Distributed Workers
- Tracing Distributed Training Operations
- Lessons Learned
- IV. AI-Driven Apps
- 8. AI-Driven Applications
- Architectural Patterns
- Kubernetes Workload Types
- Chat Applications
- Backend AI Services
- Scheduled batch jobs
- Continuous control loops
- Multistep tool automation
- Retrieval-Augmented Generation
- RAG Components
- Document Ingestion
- User Query Processing
- RAG on Kubernetes
- Agentic Workflows
- Agentic Frameworks and Runtimes
- OpenAI's Responses API
- Agents on Kubernetes
- Multiagent Systems
- Ambient Agents
- Lessons Learned
- 9. Running Agentic Applications in Production
- The Model Context Protocol
- MCP Security
- Agent Impersonation (Token Passthrough)
- Service Account Delegation
- ServiceAccounts as workload identity
- Server identity versus agent identity
- ServiceAccount usage
- Making authenticated requests
- Authentication via token validation
- Authorization with SubjectAccessReview
- External validation via OIDC/JWT
- Delegated Identity via OAuth2 Token Exchange
- Mutual TLS with SPIFFE/SPIRE (Zero-Trust)
- How SPIFFE works for MCP
- Deploying SPIRE on Kubernetes
- Using SPIFFE
- Choosing the right security pattern
- Agent-to-Agent Protocol
- A2A complements MCP
- A2A in a Nutshell
- Running A2A on Kubernetes
- Agent State Management
- State Storage Patterns
- Choosing Between Key-Value Stores and Databases
- Checkpointing for Long-Running Agents
- Lessons Learned
- Afterword
- What We Covered
- Final Words
- Index