Reinforcement Learning - Helion
ISBN: 9781492072348
Pages: 408, Format: ebook
Publication date: 2020-11-06
Bookstore: Helion
Book price: 211,65 zł (previously: 246,10 zł)
You save: 14% (-34,45 zł)
Reinforcement learning (RL) will deliver one of the biggest breakthroughs in AI over the next decade, enabling algorithms to learn from their environment to achieve arbitrary goals. This exciting development avoids constraints found in traditional machine learning (ML) algorithms. This practical book shows data science and AI professionals how to learn by reinforcement and enable a machine to learn by itself.
Author Phil Winder of Winder Research covers everything from basic building blocks to state-of-the-art practices. You'll explore the current state of RL, focus on industrial applications, learn numerous algorithms, and benefit from dedicated chapters on deploying RL solutions to production. This is no cookbook; it doesn't shy away from math and expects familiarity with ML.
- Learn what RL is and how the algorithms help solve problems
- Become grounded in RL fundamentals including Markov decision processes, dynamic programming, and temporal difference learning
- Dive deep into a range of value and policy gradient methods
- Apply advanced RL solutions such as meta learning, hierarchical learning, multi-agent, and imitation learning
- Understand cutting-edge deep RL algorithms including Rainbow, PPO, TD3, SAC, and more
- Get practical examples through the accompanying website
Customers who bought "Reinforcement Learning" also chose:
- Python na maturze. Kurs video. Algorytmy i podstawy j 135,14 zł, (48,65 zł -64%)
- Algorytmy kryptograficzne. Przewodnik po algorytmach w blockchain, kryptografii kwantowej, protoko 79,00 zł, (39,50 zł -50%)
- Informatyk samouk. Przewodnik po strukturach danych i algorytmach dla pocz 58,98 zł, (29,49 zł -50%)
- My 89,00 zł, (44,50 zł -50%)
- Nauka algorytm 58,98 zł, (29,49 zł -50%)
Table of contents
Reinforcement Learning eBook -- table of contents
- Preface
- Objective
- Who Should Read This Book?
- Guiding Principles and Style
- Prerequisites
- Scope and Outline
- Supplementary Materials
- Conventions Used in This Book
- Acronyms
- Mathematical Notation
- Fair Use Policy
- O'Reilly Online Learning
- How to Contact Us
- Acknowledgments
- 1. Why Reinforcement Learning?
- Why Now?
- Machine Learning
- Reinforcement Learning
- When Should You Use RL?
- RL Applications
- Taxonomy of RL Approaches
- Model-Free or Model-Based
- How Agents Use and Update Their Strategy
- Discrete or Continuous Actions
- Optimization Methods
- Policy Evaluation and Improvement
- Fundamental Concepts in Reinforcement Learning
- The First RL Algorithm
- Value estimation
- Prediction error
- Weight update rule
- Is RL the Same as ML?
- Reward and Feedback
- Delayed rewards
- Hindsight
- Reinforcement Learning as a Discipline
- Summary
- Further Reading
- 2. Markov Decision Processes, Dynamic Programming, and Monte Carlo Methods
- Multi-Arm Bandit Testing
- Reward Engineering
- Policy Evaluation: The Value Function
- Policy Improvement: Choosing the Best Action
- Simulating the Environment
- Running the Experiment
- Improving the ε-greedy Algorithm
- Markov Decision Processes
- Inventory Control
- Transition table
- Transition graph
- Transition matrix
- Inventory Control Simulation
- Policies and Value Functions
- Discounted Rewards
- Predicting Rewards with the State-Value Function
- Simulation using the state-value function
- Predicting Rewards with the Action-Value Function
- Optimal Policies
- Monte Carlo Policy Generation
- Value Iteration with Dynamic Programming
- Implementing Value Iteration
- Results of Value Iteration
- Summary
- Further Reading
- 3. Temporal-Difference Learning, Q-Learning, and n-Step Algorithms
- Formulation of Temporal-Difference Learning
- Q-Learning
- SARSA
- Q-Learning Versus SARSA
- Case Study: Automatically Scaling Application Containers to Reduce Cost
- Industrial Example: Real-Time Bidding in Advertising
- Defining the MDP
- Results of the Real-Time Bidding Environments
- Further Improvements
- Extensions to Q-Learning
- Double Q-Learning
- Delayed Q-Learning
- Comparing Standard, Double, and Delayed Q-learning
- Opposition Learning
- n-Step Algorithms
- n-Step Algorithms on Grid Environments
- Eligibility Traces
- Extensions to Eligibility Traces
- Watkins's Q(λ)
- Fuzzy Wipes in Watkins's Q(λ)
- Speedy Q-Learning
- Accumulating Versus Replacing Eligibility Traces
- Summary
- Further Reading
- 4. Deep Q-Networks
- Deep Learning Architectures
- Fundamentals
- Common Neural Network Architectures
- Deep Learning Frameworks
- Deep Reinforcement Learning
- Deep Q-Learning
- Experience Replay
- Q-Network Clones
- Neural Network Architecture
- Implementing DQN
- Example: DQN on the CartPole Environment
- Why train online?
- Which is better? DQN versus Q-learning
- Case Study: Reducing Energy Usage in Buildings
- Rainbow DQN
- Distributional RL
- Prioritized Experience Replay
- Noisy Nets
- Dueling Networks
- Example: Rainbow DQN on Atari Games
- Results
- Discussion
- Other DQN Improvements
- Improving Exploration
- Improving Rewards
- Learning from Offline Data
- Summary
- Further Reading
- 5. Policy Gradient Methods
- Benefits of Learning a Policy Directly
- How to Calculate the Gradient of a Policy
- Policy Gradient Theorem
- Policy Functions
- Linear Policies
- Logistic policy
- Softmax policy
- Arbitrary Policies
- Basic Implementations
- Monte Carlo (REINFORCE)
- Example: REINFORCE on the CartPole environment
- REINFORCE with Baseline
- Example: REINFORCE with baseline on the CartPole environment
- Gradient Variance Reduction
- n-Step Actor-Critic and Advantage Actor-Critic (A2C)
- Example: n-step actor-critic on the CartPole environment
- State-value learning decay rates versus policy decay rates
- Eligibility Traces Actor-Critic
- Example: Eligibility trace actor-critic on the CartPole environment
- A Comparison of Basic Policy Gradient Algorithms
- Industrial Example: Automatically Purchasing Products for Customers
- The Environment: Gym-Shopping-Cart
- Expectations
- Results from the Shopping Cart Environment
- Summary
- Further Reading
- 6. Beyond Policy Gradients
- Off-Policy Algorithms
- Importance Sampling
- Behavior and Target Policies
- Off-Policy Q-Learning
- Gradient Temporal-Difference Learning
- Greedy-GQ
- Off-Policy Actor-Critics
- Deterministic Policy Gradients
- Deep Deterministic Policy Gradients
- DDPG derivation
- DDPG implementation
- Twin Delayed DDPG
- Delayed policy updates (DPU)
- Clipped double Q-learning (CDQ)
- Target policy smoothing (TPS)
- TD3 implementation
- Case Study: Recommendations Using Reviews
- Improvements to DPG
- Trust Region Methods
- Kullback-Leibler Divergence
- KL divergence experiments
- Natural Policy Gradients and Trust Region Policy Optimization
- Proximal Policy Optimization
- PPO's clipped objective
- PPO's value function and exploration objectives
- Example: Using Servos for a Real-Life Reacher
- Experiment Setup
- RL Algorithm Implementation
- Increasing the Complexity of the Algorithm
- Hyperparameter Tuning in a Simulation
- Resulting Policies
- Other Policy Gradient Algorithms
- Retrace(λ)
- Actor-Critic with Experience Replay (ACER)
- Actor-Critic Using Kronecker-Factored Trust Regions (ACKTR)
- Emphatic Methods
- Extensions to Policy Gradient Algorithms
- Quantile Regression in Policy Gradient Algorithms
- Summary
- Which Algorithm Should I Use?
- A Note on Asynchronous Methods
- Further Reading
- 7. Learning All Possible Policies with Entropy Methods
- What Is Entropy?
- Maximum Entropy Reinforcement Learning
- Soft Actor-Critic
- SAC Implementation Details and Discrete Action Spaces
- Automatically Adjusting Temperature
- Case Study: Automated Traffic Management to Reduce Queuing
- Extensions to Maximum Entropy Methods
- Other Measures of Entropy (and Ensembles)
- Optimistic Exploration Using the Upper Bound of Double Q-Learning
- Tinkering with Experience Replay
- Soft Policy Gradient
- Soft Q-Learning (and Derivatives)
- Path Consistency Learning
- Performance Comparison: SAC Versus PPO
- How Does Entropy Encourage Exploration?
- How Does the Temperature Parameter Alter Exploration?
- Industrial Example: Learning to Drive with a Remote Control Car
- Description of the Problem
- Minimizing Training Time
- Dramatic Actions
- Hyperparameter Search
- Final Policy
- Further Improvements
- Summary
- Equivalence Between Policy Gradients and Soft Q-Learning
- What Does This Mean For the Future?
- What Does This Mean Now?
- 8. Improving How an Agent Learns
- Rethinking the MDP
- Partially Observable Markov Decision Process
- Predicting the belief state
- Case Study: Using POMDPs in Autonomous Vehicles
- Contextual Markov Decision Processes
- MDPs with Changing Actions
- Regularized MDPs
- Hierarchical Reinforcement Learning
- Naive HRL
- High-Low Hierarchies with Intrinsic Rewards (HIRO)
- Learning Skills and Unsupervised RL
- Using Skills in HRL
- HRL Conclusions
- Multi-Agent Reinforcement Learning
- MARL Frameworks
- Centralized or Decentralized
- Single-Agent Algorithms
- Case Study: Using Single-Agent Decentralized Learning in UAVs
- Centralized Learning, Decentralized Execution
- Decentralized Learning
- Other Combinations
- Challenges of MARL
- MARL Conclusions
- Expert Guidance
- Behavior Cloning
- Imitation RL
- Inverse RL
- Curriculum Learning
- Other Paradigms
- Meta-Learning
- Transfer Learning
- Summary
- Further Reading
- 9. Practical Reinforcement Learning
- The RL Project Life Cycle
- Life Cycle Definition
- Data science life cycle
- Reinforcement learning life cycle
- Problem Definition: What Is an RL Project?
- RL Problems Are Sequential
- RL Problems Are Strategic
- Low-Level RL Indicators
- An entity
- An environment
- A state
- An action
- Quantify success or failure
- Types of Learning
- Online learning
- Offline or batch learning
- Concurrent learning
- Reset-free learning
- RL Engineering and Refinement
- Process
- Environment Engineering
- Implementation
- Simulation
- Interacting with real life
- State Engineering or State Representation Learning
- Learning forward models
- Constraints
- Transformation (dimensionality reduction, autoencoders, and world models)
- Policy Engineering
- Discrete states
- Continuous states
- Converting to discrete states
- Mixed state spaces
- Mapping Policies to Action Spaces
- Binary actions
- Continuous actions
- Hybrid action spaces
- When to perform actions
- Massive action spaces
- Exploration
- Is intrinsic motivation exploration?
- Visitation counts (sampling)
- Information gain (surprise)
- State prediction (curiosity or self-reflection)
- Curious challenges
- Random embeddings (random distillation networks)
- Distance to novelty (episodic curiosity)
- Exploration conclusions
- Reward Engineering
- Reward engineering guidelines
- Reward shaping
- Common rewards
- Reward conclusions
- Summary
- Further Reading
- 10. Operational Reinforcement Learning
- Implementation
- Frameworks
- RL frameworks
- Other frameworks
- Scaling RL
- Distributed training (Gorila)
- Single-machine training (A3C, PAAC)
- Distributed replay (Ape-X)
- Synchronous distribution (DD-PPO)
- Improving utilization (IMPALA, SEED)
- Scaling conclusions
- Evaluation
- Policy performance measures
- Statistical policy comparisons
- Algorithm performance measures
- Problem-specific performance measures
- Explainability
- Evaluation conclusions
- Deployment
- Goals
- Goals during different phases of development
- Best practices
- Hierarchy of needs
- Architecture
- Ancillary Tooling
- Build versus buy
- Monitoring
- Logging and tracing
- Continuous integration and continuous delivery
- Experiment tracking
- Hyperparameter tuning
- Deploying multiple agents
- Deploying policies
- Safety, Security, and Ethics
- Safe RL
- Secure RL
- Ethical RL
- Summary
- Further Reading
- 11. Conclusions and the Future
- Tips and Tricks
- Framing the Problem
- Your Data
- Training
- Evaluation
- Deployment
- Debugging
- ${ALGORITHM_NAME} Can't Solve ${ENVIRONMENT}!
- Monitoring for Debugging
- The Future of Reinforcement Learning
- RL Market Opportunities
- Future RL and Research Directions
- Research in industry
- Research in academia
- Ethical standards
- Concluding Remarks
- Next Steps
- Now It's Your Turn
- Further Reading
- A. The Gradient of a Logistic Policy for Two Actions
- B. The Gradient of a Softmax Policy
- Glossary
- Acronyms and Common Terms
- Symbols and Notation
- Index