Sample-Efficient Deep Reinforcement Learning for Control, Exploration and Safety


One major challenge of reinforcement learning is to explore an environment efficiently in order to learn optimal policies through trial and error. To achieve this, the agent must learn effectively from past experiences, enabling it to form an accurate picture of the benefit of certain actions over others. Beyond that, an obvious but central issue is that what is not known must be explored, and the need to explore safely adds another layer of difficulty to the problem. These are the main issues addressed in this Ph.D. thesis. By deconstructing the actor-critic framework and developing alternative formulations of the underlying optimization problem based on the notion of variance, we explore how deep reinforcement learning algorithms can more effectively solve continuous control problems, hard-exploration environments, and risk-sensitive tasks. The first part of the thesis focuses on the critic component of the actor-critic framework, also referred to as the value function, and on how agents can learn to act more efficiently in continuous control domains through distinct uses of the variance of the value function estimates. The second part of the thesis is concerned with the actor component of the actor-critic framework, also referred to as the policy. We propose adding a third element to the optimization problem that agents solve by introducing an adversary. The adversary is of the same nature as the RL agent but is trained to suggest actions that either mimic the actor or counteract the constraints of the problem. It is represented by an averaged policy distribution from which the actor must differentiate its behavior by maximizing its divergence from it, ultimately encouraging the actor to explore more thoroughly in tasks where efficient exploration is a bottleneck, or to act more safely.
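The divergence-maximizing objective described above can be sketched in a hedged form. Assuming a trade-off coefficient \(\beta\) and writing \(\bar{\pi}\) for the averaged adversary policy (both symbols are illustrative notation, not taken from the thesis itself), the actor's regularized objective might read:

```latex
J(\theta) \;=\;
\mathbb{E}_{\pi_\theta}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, r_t\right]
\;+\;
\beta \,\mathbb{E}_{s}\!\left[
  D_{\mathrm{KL}}\!\big(\pi_\theta(\cdot \mid s)\,\big\|\,\bar{\pi}(\cdot \mid s)\big)
\right]
```

Here the first term is the usual expected discounted return, and the second rewards the actor for diverging from the averaged policy \(\bar{\pi}\). Whether this divergence pressure yields broader exploration or safer behavior depends on which behavior the adversary is trained to represent, as described above.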

PhD Thesis