Generative AI on Kubernetes. Operationalizing Large Language Models

ISBN: 9781098171889
Pages: 406, Format: ebook
Publication date: 2026-02-27
Bookstore: Helion
Book price: 169.14 zł (previously: 198.99 zł)
You save: 15% (-29.85 zł)
Generative AI is revolutionizing industries, and Kubernetes has fast become the backbone for deploying and managing these resource-intensive workloads. This book serves as a practical, hands-on guide for MLOps engineers, software developers, Kubernetes administrators, and AI professionals ready to combine AI innovation with the power of cloud native infrastructure. Authors Roland Huß and Daniele Zonca provide a clear road map for training, fine-tuning, deploying, and scaling GenAI models on Kubernetes, addressing challenges like resource optimization, automation, and security along the way.
With actionable insights and real-world examples, readers will learn to navigate the opportunities and complexities of managing GenAI applications in production environments. Whether you're experimenting with large-scale language models or facing the nuances of AI deployment at scale, you'll uncover the expertise you need to operationalize this exciting technology effectively.
- Learn how to deploy LLMs more efficiently with optimized inference runtimes
- Get hands-on with GPU scheduling, including hardware detection and multinode scaling
- Monitor and understand LLM-specific metrics like Time to First Token and token throughput
- Know when to fine-tune a model or when retrieval augmentation is the better choice
- Discover how to evaluate models with standardized benchmarks before committing GPU resources
- Learn to run agentic applications with secure tool integration, identity management, and persistent state
Customers who bought "Generative AI on Kubernetes. Operationalizing Large Language Models" also chose:
- Jak zarabia… 166.25 zł (39.90 zł, -76%)
- Lider w… 89.00 zł (44.50 zł, -50%)
- FAIK. Sztuczna inteligencja w s… 59.90 zł (29.95 zł, -50%)
- Praktyczne zastosowania generatywnej AI i ChatGPT. Wykorzystaj potencja… 87.00 zł (43.50 zł, -50%)
- AI dla tw… 79.00 zł (39.50 zł, -50%)
Table of Contents
- Preface
- Why We Wrote This Book
- Kubernetes
- Generative AI
- How This Book Is Structured
- Who This Book Is For
- What You Will Learn
- Conventions Used in This Book
- O'Reilly Online Learning
- How to Contact Us
- Acknowledgments
- Introduction
- Challenges of Running Generative AI at Scale
- Kubernetes for AI Workloads
- Understanding LLM Fundamentals
- How LLMs Process Text
- Tokenization and Embeddings
- Tokenizer implementation
- Embeddings
- The Two Phases of Inference
- Prefill
- Decode
- Overview
- Inference
- Production Readiness
- Tuning
- AI-Driven Applications
- I. Inference
- 1. Deploying Models
- It Works on My Machine
- Model Server
- vLLM
- Hugging Face Text Generation Inference
- Other Model Servers
- llama.cpp
- NVIDIA NIM
- SGLang
- Deploying Models to Kubernetes Manually
- Model Server Controller
- KServe
- From InferenceService to LLMInferenceService
- Ray Serve and KubeRay
- Lessons Learned
- 2. Model Data
- Model Data Storage Formats
- Weight-Only Formats
- Self-Contained Formats
- ONNX
- Safetensors
- GGUF and GGML
- Current State and Gaps
- Model Registry
- Hugging Face Model Hub
- MLflow Model Registry
- Kubeflow Model Registry
- OCI Registry
- Accessing Model Data in Kubernetes
- Shared Storage with PersistentVolumes
- OCI Image for Storing Model Data
- Modelcars
- OCI Image Volume Mounts
- Lessons Learned
- II. Production Readiness
- 3. Kubernetes and GPUs
- GPU Discovery
- Node Feature Discovery
- GPU Feature Discovery
- Kubernetes GPU Device Plug-Ins
- GPU Workload Scheduling
- Label-Based Scheduling
- nodeSelector
- Node affinity
- Taints and tolerations
- Resource-Based Scheduling
- Dynamic Resource Allocation
- NVIDIA GPU Operator
- Operator Configuration with ClusterPolicy
- Sub-GPU Allocation
- Time slicing
- Multi-Instance GPU
- Multi-GPU Inference
- Data Parallelism
- Model Parallelism
- Tensor parallelism
- Pipeline parallelism
- Hybrid parallelism
- Single-Node Versus Multinode Inference
- GPU Resource Optimizations
- Lessons Learned
- 4. Running in Production
- Model and Runtime Tuning
- Language Model Evaluation
- Language Model Compression
- Model Performance Benchmark
- vLLM Runtime Parameters Tuning
- Autoscaling
- Optimize vLLM Startup Time
- LLM-Aware Routing
- From API Gateway to AI Gateway
- Token-based rate limiting and user management
- Evolution of AI gateway capabilities
- Gateway API Inference Extension
- Disaggregated Serving
- Lessons Learned
- 5. Model Observability
- Observability Stack and Configuration
- Logs
- Metrics
- Tracing
- Model Server Metrics
- Time To First Token
- Time Per Output Token or Inter-Token Latency
- Throughput
- Latency
- Request Queue Metrics
- GPU Usage Monitoring
- Quality Metrics
- Responsible AI
- Explainability
- Fairness
- Model Safety: Hallucination and Guardrails
- Understanding and Detecting Hallucinations
- Runtime Guardrails
- NVIDIA NeMo Guardrails
- FMS Guardrails Orchestrator
- Guardrails AI
- Llama Stack and moderation APIs
- Lessons Learned
- III. Tuning
- 6. Model Customization
- Introduction to LLM Creation
- Prompt and Context Engineering
- When to Use Model Customization
- Tuning a Model
- Fine-Tuning
- Parameter-Efficient Fine-Tuning
- Low-Rank Adaptation
- Running Tuning Jobs on Kubernetes
- Kubeflow Trainer
- Other Frameworks
- DeepSpeed
- Ray
- Unsloth
- Lessons Learned
- 7. Job Scheduling Optimization
- Kubernetes Scheduler Optimization
- Core Kubernetes Scheduler
- Resource Bin Packing Strategy
- Dynamic Scheduling with Descheduler
- Gang Scheduling
- PyTorch Rendezvous and Gang Scheduling
- Comparing Gang Scheduling Solutions
- Coscheduling plug-in (PodGroup CRD)
- Kueue
- NVIDIA KAI Scheduler
- Volcano
- Making the right choice
- Topology-Aware Scheduling
- Comparing Topology-Aware Scheduling Solutions
- Coscheduling plug-in (PodGroup CRD)
- Kueue
- NVIDIA KAI Scheduler
- Volcano
- Making the right choice
- Quota Management and Multitenancy: GPU as a Service
- Comparing Quota Management and Multitenancy Solutions
- Kueue
- NVIDIA KAI Scheduler
- Volcano
- Making the right choice
- Network Optimization for Distributed Training
- Comparing Network Technologies for GPU Communication
- NVLink and AMD Infinity Fabric
- NVSwitch
- InfiniBand
- RoCE
- Standard Ethernet
- GPUDirect RDMA
- Making the right choice
- Using Secondary Network Interfaces in Kubernetes
- Bridging HPC and Kubernetes: Slurm and Slinky
- Storage for Training
- Training Job Security
- Security Guidelines for Ray
- Security Guidelines for PyTorch
- Observability of Training Jobs
- Metrics Collection for Distributed Training
- Logging Across Distributed Workers
- Tracing Distributed Training Operations
- Lessons Learned
- IV. AI-Driven Apps
- 8. AI-Driven Applications
- Architectural Patterns
- Kubernetes Workload Types
- Chat Applications
- Backend AI Services
- Scheduled batch jobs
- Continuous control loops
- Multistep tool automation
- Retrieval-Augmented Generation
- RAG Components
- Document Ingestion
- User Query Processing
- RAG on Kubernetes
- Agentic Workflows
- Agentic Frameworks and Runtimes
- OpenAI's Responses API
- Agents on Kubernetes
- Multiagent Systems
- Ambient Agents
- Lessons Learned
- 9. Running Agentic Applications in Production
- The Model Context Protocol
- MCP Security
- Agent Impersonation (Token Passthrough)
- Service Account Delegation
- ServiceAccounts as workload identity
- Server identity versus agent identity
- ServiceAccount usage
- Making authenticated requests
- Authentication via token validation
- Authorization with SubjectAccessReview
- External validation via OIDC/JWT
- Delegated Identity via OAuth2 Token Exchange
- Mutual TLS with SPIFFE/SPIRE (Zero-Trust)
- How SPIFFE works for MCP
- Deploying SPIRE on Kubernetes
- Using SPIFFE
- Choosing the right security pattern
- Agent-to-Agent Protocol
- A2A complements MCP
- A2A in a Nutshell
- Running A2A on Kubernetes
- Agent State Management
- State Storage Patterns
- Choosing Between Key-Value Stores and Databases
- Checkpointing for Long-Running Agents
- Lessons Learned
- Afterword
- What We Covered
- Final Words
- Index