# Accelerating Online Reinforcement Learning with Offline Datasets

@article{Nair2020AcceleratingOR,
  title   = {Accelerating Online Reinforcement Learning with Offline Datasets},
  author  = {Ashvin Nair and Murtaza Dalal and Abhishek Gupta and Sergey Levine},
  journal = {ArXiv},
  year    = {2020},
  volume  = {abs/2006.09359}
}

Reinforcement learning provides an appealing formalism for learning control policies from experience. However, the classic active formulation of reinforcement learning necessitates a lengthy active exploration process for each behavior, making it difficult to apply in real-world settings. If we can instead allow reinforcement learning to effectively use previously collected data to aid the online learning process, where the data could be expert demonstrations or more generally any prior…
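The paper's core idea is an actor-critic update in which the policy is fit to dataset actions reweighted by their exponentiated advantage, so the same objective works for offline pre-training and online fine-tuning. A minimal sketch of that weighted-likelihood loss (function name, clipping value, and the temperature `lam` are illustrative choices, not the paper's exact hyperparameters):

```python
import numpy as np

def advantage_weighted_loss(log_probs, advantages, lam=1.0):
    """Negative log-likelihood of dataset actions, weighted by
    exp(advantage / lam); actions the critic deems better than
    average are imitated more strongly."""
    weights = np.exp(advantages / lam)
    weights = np.minimum(weights, 20.0)  # clip weights for numerical stability
    return -np.mean(weights * log_probs)
```

With all advantages zero the weights collapse to 1 and the loss reduces to plain behavior cloning, which is what makes the objective usable on the offline dataset before any online interaction.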


#### 68 Citations

A Workflow for Offline Model-Free Robotic Reinforcement Learning

- Computer Science
- ArXiv
- 2021

This paper develops a practical workflow for using offline RL analogous to the relatively well-understood workflows for supervised learning problems, and devises a set of metrics and conditions that can be tracked over the course of offline training to inform the practitioner about how the algorithm and model architecture should be adjusted to improve final performance.

Offline Meta-Reinforcement Learning with Online Self-Supervision

- Computer Science
- ArXiv
- 2021

A hybrid offline meta-RL algorithm is proposed, which uses offline data with rewards to meta-train an adaptive policy, and then collects additional unsupervised online data, without any ground truth reward labels, to bridge this distribution shift problem.

Offline-to-Online Reinforcement Learning via Balanced Replay and Pessimistic Q-Ensemble

- Computer Science
- ArXiv
- 2021

This paper proposes a balanced replay scheme that prioritizes samples encountered online while also encouraging the use of near-on-policy samples from the offline dataset, and leverages multiple Q-functions trained pessimistically offline, thereby preventing overoptimism concerning unfamiliar actions at novel states during the initial training phase.

The Difficulty of Passive Learning in Deep Reinforcement Learning

- Computer Science
- ArXiv
- 2021

This work proposes the “tandem learning” experimental paradigm, and identifies function approximation in conjunction with fixed data distributions as the strongest factors, thereby extending but also challenging hypotheses stated in past work.

Addressing Distribution Shift in Online Reinforcement Learning with Offline Datasets

- Computer Science
- 2021

A simple yet effective framework that incorporates a balanced replay scheme and an ensemble distillation scheme that improves the policy using the Q-ensemble during fine-tuning, which allows the policy updates to be more robust to error in each individual Q-function.

Offline Reinforcement Learning with Implicit Q-Learning

- Computer Science
- ArXiv
- 2021

This work proposes a new offline RL method that never needs to evaluate actions outside of the dataset, but still enables the learned policy to improve substantially over the best behavior in the data through generalization, called implicit Q-learning (IQL).
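IQL avoids querying out-of-dataset actions by fitting the value function with expectile regression, which approximates a maximum over dataset actions without ever evaluating new ones. A minimal sketch of the asymmetric expectile loss (the function name is illustrative; `tau` close to 1 pushes the estimate toward an optimistic upper expectile, `tau = 0.5` recovers ordinary mean-squared error up to a factor of 2):

```python
import numpy as np

def expectile_loss(diff, tau):
    """Asymmetric squared loss on residuals diff = target - prediction.
    Positive residuals (underestimates) are weighted by tau,
    negative residuals (overestimates) by 1 - tau."""
    weight = np.where(diff > 0, tau, 1.0 - tau)
    return np.mean(weight * diff ** 2)
```

Because the loss only ever sees residuals computed on dataset actions, the learned value never depends on Q-values of actions outside the data's support.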

Uncertainty Weighted Actor-Critic for Offline Reinforcement Learning

- Computer Science
- ICML
- 2021

Uncertainty Weighted Actor-Critic (UWAC), an algorithm that detects OOD state-action pairs and accordingly down-weights their contribution in the training objectives, is proposed; UWAC is observed to substantially improve model stability during training.

Offline Reinforcement Learning with Value-based Episodic Memory

- Computer Science
- ArXiv
- 2021

This paper adopts a different framework, which learns the V-function instead of the Q-function to naturally keep the learning procedure within the support of an offline dataset, and proposes Expectile V-Learning (EVL), which smoothly interpolates between optimal value learning and behavior cloning.

Offline Inverse Reinforcement Learning

- Computer Science
- ArXiv
- 2021

The objective of offline RL is to learn optimal policies when a fixed exploratory demonstration dataset is available and sampling additional observations is impossible (typically if this operation…

Uncertainty Weighted Offline Reinforcement Learning

- Offline Reinforcement Learning Workshop at Neural Information Processing Systems
- 2020

Offline Reinforcement Learning promises to learn effective policies from previously-collected, static datasets without the need for exploration. However, existing Q-learning and actor-critic based…

#### References

Showing 1–10 of 54 references

Way Off-Policy Batch Deep Reinforcement Learning of Implicit Human Preferences in Dialog

- Computer Science, Mathematics
- ArXiv
- 2019

This work develops a novel class of off-policy batch RL algorithms, able to effectively learn offline, without exploring, from a fixed batch of human interaction data, using models pre-trained on data as a strong prior, and uses KL-control to penalize divergence from this prior during RL training.
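The KL-control idea in the entry above amounts to subtracting a divergence penalty from the environment reward so that RL updates stay close to the pre-trained prior. A minimal per-sample sketch (function name and the coefficient `alpha` are illustrative; the per-sample term `log_pi - log_prior` is the standard Monte Carlo estimate of the KL divergence):

```python
def kl_regularized_reward(reward, log_pi, log_prior, alpha=0.1):
    """Shape the reward with a KL-control penalty: the further the
    learned policy's log-probability drifts above the prior's on the
    taken action, the larger the subtraction."""
    return reward - alpha * (log_pi - log_prior)
```

When the policy matches the prior the penalty vanishes, so optimizing this shaped reward trades off task reward against staying within the prior's support.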

Off-Policy Deep Reinforcement Learning without Exploration

- Computer Science, Mathematics
- ICML
- 2019

This paper introduces a novel class of off-policy algorithms, batch-constrained reinforcement learning, which restricts the action space in order to force the agent towards behaving close to on-policy with respect to a subset of the given data.

Behavior Regularized Offline Reinforcement Learning

- Computer Science, Mathematics
- ArXiv
- 2019

A general framework, behavior regularized actor critic (BRAC), is introduced to empirically evaluate recently proposed methods as well as a number of simple baselines across a variety of offline continuous control tasks.

Reinforcement Learning from Imperfect Demonstrations

- Computer Science, Mathematics
- ICLR
- 2018

This work proposes a unified reinforcement learning algorithm, Normalized Actor-Critic (NAC), that effectively normalizes the Q-function, reducing the Q-values of actions unseen in the demonstration data and making NAC robust to suboptimal demonstration data.

Exponentially Weighted Imitation Learning for Batched Historical Data

- Computer Science
- NeurIPS
- 2018

A monotonic advantage reweighted imitation learning strategy that is applicable to problems with complex nonlinear function approximation, works well with hybrid (discrete and continuous) action spaces, and can be used to learn from data generated by an unknown policy.

Overcoming Exploration in Reinforcement Learning with Demonstrations

- Computer Science, Mathematics
- 2018 IEEE International Conference on Robotics and Automation (ICRA)
- 2018

This work uses demonstrations to overcome the exploration problem and successfully learn to perform long-horizon, multi-step robotics tasks with continuous control such as stacking blocks with a robot arm.

Relay Policy Learning: Solving Long-Horizon Tasks via Imitation and Reinforcement Learning

- Computer Science, Mathematics
- CoRL
- 2019

This work simplifies the long-horizon policy learning problem by using a novel data-relabeling algorithm for learning goal-conditioned hierarchical policies, where the low-level only acts for a fixed number of steps, regardless of the goal achieved.

Batch Reinforcement Learning

- Computer Science
- Reinforcement Learning
- 2012

This chapter introduces the basic principles and the theory behind batch reinforcement learning and the most important algorithms, discusses ongoing research within this field, and briefly surveys real-world applications of batch reinforcement learning.

Stabilizing Off-Policy Q-Learning via Bootstrapping Error Reduction

- Computer Science, Mathematics
- NeurIPS
- 2019

A practical algorithm, bootstrapping error accumulation reduction (BEAR), is proposed and it is demonstrated that BEAR is able to learn robustly from different off-policy distributions, including random and suboptimal demonstrations, on a range of continuous control tasks.

Leveraging Demonstrations for Deep Reinforcement Learning on Robotics Problems with Sparse Rewards

- Computer Science
- ArXiv
- 2017

A general and model-free approach for reinforcement learning on real robots with sparse rewards, built upon the Deep Deterministic Policy Gradient (DDPG) algorithm to use demonstrations; it outperforms DDPG and does not require engineered rewards.