Zahra Niazi

DRL tutorial

Hey there! Welcome to this Deep Reinforcement Learning (DRL) tutorial!

During my graduate course on neural networks and deep learning at Ferdowsi University of Mashhad, under the guidance of Dr. Rouhani, I was tasked with making a presentation about deep reinforcement learning. It was a fascinating challenge to connect what we learned in class with the real-world applications of deep learning in reinforcement scenarios.

The repo containing all the files can be found here.

The slides for this tutorial can be found here!

And the gitbook is available here!

Table of contents

The Reinforcement Learning Framework

Two main approaches for solving RL problems

Value-Based Functions

Value-Based Learning Strategies

Deep Q-Learning

Policy-Based Learning Strategies

Actor-Critic Methods

Offline Reinforcement Learning

The Big Picture

The idea behind Reinforcement Learning is that an agent (an AI) will learn from the environment by interacting with it (through trial and error) and receiving rewards (negative or positive) as feedback for performing actions.

Formal definitions

"Reinforcement learning provides a mathematical formalism for learning-based decision making."

Reinforcement learning is an approach for learning decision making and control from experience

Reinforcement learning is a framework for solving control tasks (also called decision problems) by building agents that learn from the environment by interacting with it through trial and error and receiving rewards (positive or negative) as unique feedback.

RL Process


The RL Process: a loop of state, action, reward and next state

0 (1).png

State, Action

The agent's goal is to maximize its cumulative reward, called the expected return.

The reward hypothesis


The reward hypothesis: the central idea of Reinforcement Learning

Why is the goal of the agent to maximize the expected return?

Because RL is based on the reward hypothesis, which is that all goals can be described as the maximization of the expected return (expected cumulative reward).

That's why in Reinforcement Learning, to have the best behavior, we aim to learn to take actions that maximize the expected cumulative reward.

Markov Property

The Markov Property implies that our agent needs only the current state to decide what action to take, and not the history of all the states and actions it took before.

State-Observation Space

Observations/States are the information our agent gets from the environment. In the case of a video game, it can be a frame (a screenshot). In the case of a trading agent, it can be the value of a certain stock, etc.

There is a differentiation to make between observation and state, however:

State s:

is a complete description of the state of the world (there is no hidden information), obtained in a fully observed environment.

2 (1).jpeg

In a chess game, we have access to the whole board information, so we receive a state from the environment. In other words, the environment is fully observed.

Observation o:

is a partial description of the state, obtained in a partially observed environment.

3 (1).jpeg

In Super Mario Bros, we only see the part of the level close to the player, so we receive an observation: we are in a partially observed environment.

Action Space

The Action space is the set of all possible actions in an environment.

The actions can come from a discrete or continuous space:

Discrete space

The number of possible actions is finite.

In Super Mario Bros, we have a finite set of actions since we have only 5 possible actions: 4 directions and jump.

Continuous space

The number of possible actions is infinite.

A Self-Driving Car agent has an infinite number of possible actions since it can turn left 20°, 21.1°, 21.2°, honk, turn right 20°…

Rewards and discounting

The reward is fundamental in RL because it's the only feedback for the agent. Thanks to it, our agent knows if the action taken was good or not.

The cumulative reward at each time step t can be written as:

The cumulative reward equals the sum of all rewards in the sequence.

which is equivalent to:

The cumulative reward

\[R(\tau) = r_{t+1} + r_{t+2} + r_{t+3} + \dots \quad \text{(that is, } r_{t+k+1} \text{ for } k = 0, 1, 2, \dots\text{)}\]

However, in reality, we can't just add them like that. The rewards that come sooner (at the beginning of the game) are more likely to happen since they are more predictable than the long-term future reward.


Let's say your agent is this tiny mouse that can move one tile each time step, and your opponent is the cat (that can move too). The mouse's goal is to eat the maximum amount of cheese before being eaten by the cat.

The cumulative reward

To discount the rewards, we proceed like this:

  1. We define a discount rate called gamma. It must be between 0 and 1. Most of the time between 0.95 and 0.99.
    1. The larger the gamma, the smaller the discount. This means our agent cares more about the long-term reward.
    2. On the other hand, the smaller the gamma, the bigger the discount. This means our agent cares more about the short term reward (the nearest cheese).
  2. Then, each reward will be discounted by gamma to the exponent of the time step. As the time step increases, the cat gets closer to us, so the future reward is less and less likely to happen.

Our discounted expected cumulative reward is:

The cumulative reward
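To make the discounting concrete, here is a minimal Python sketch of this weighted sum; the reward values and the gamma used below are made-up numbers, not taken from the tutorial.

```python
def discounted_return(rewards, gamma=0.95):
    # G_t = sum over k of gamma^k * r_{t+k+1}
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# Hypothetical rewards for eating three pieces of cheese on consecutive steps.
print(discounted_return([1, 1, 1], gamma=0.95))  # 1 + 0.95 + 0.9025 = 2.8525
```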

Types of Tasks

A task is an instance of a Reinforcement Learning problem. We can have two types of tasks: episodic and continuing

Episodic task

In this case, we have a starting point and an ending point (a terminal state). This creates an episode: a list of States, Actions, Rewards, and new States.

The cumulative reward

For instance, think about Super Mario Bros: an episode begins at the launch of a new Mario level and ends when you're killed or you reach the end of the level.

Continuing task

These are tasks that continue forever (no terminal state). In this case, the agent must learn how to choose the best actions and simultaneously interact with the environment.

The cumulative reward

For instance, an agent that does automated stock trading. For this task, there is no starting point or terminal state. The agent keeps running until we decide to stop it.

The Exploration-Exploitation tradeoff

Remember, the goal of our RL agent is to maximize the expected cumulative reward. However, we can fall into a common trap.

Let's take an example:

The cumulative reward

In this game, our mouse can have an infinite amount of small cheese (+1 each). But at the top of the maze, there is a gigantic sum of cheese (+1000).

If we only focus on exploitation, our agent will never reach the gigantic sum of cheese. Instead, it will only exploit the nearest source of rewards, even if this source is small (exploitation).

The cumulative reward

But if our agent does a little bit of exploration, it can discover the big reward (the pile of big cheese).

The cumulative reward

This is what we call the exploration/exploitation trade-off. We need to balance how much we explore the environment and how much we exploit what we know about the environment.

Therefore, we must define a rule that helps to handle this trade-off.

Policy-Based Methods

Now that we learned the RL framework, how do we solve the RL problem? In other words, how do we build an RL agent that can select the actions that maximize its expected cumulative reward?

The policy π: the agent's brain

The Policy π is the brain of our Agent: it's the function that tells us what action to take given the state we are in. So it defines the agent's behavior at a given time.

Think of policy as the brain of our agent, the function that will tell us the action to take given a state

This Policy is the function we want to learn. Our goal is to find the optimal policy π*, the policy that maximizes the expected return when the agent acts according to it. We find this π* through training.

There are two approaches to train our agent to find this optimal policy π*:


In Policy Based methods, we learn a policy function directly.

As we can see here, the policy (deterministic) directly indicates the action to take for each step

This function will define a mapping from each state to the best corresponding action. Alternatively, it could define a probability distribution over the set of possible actions at that state. We have two types of policies:

Deterministic:

a policy at a given state will always return the same action

action = policy(state)

Stochastic:

outputs a probability distribution over actions.

![policy(actions | state) = probability distribution over the set of actions given the current state](../assets/17.jpeg)

policy(actions | state) = probability distribution over the set of actions given the current state

Given an initial state, our stochastic policy will output probability distributions over the possible actions at that state.

image (21).png
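To make the two policy types concrete, here is a minimal Python sketch; the action names, Q-values and probabilities are illustrative assumptions, not values from the tutorial.

```python
import numpy as np

# Hypothetical action values and action probabilities for a single state.
q_values = {"left": 0.1, "right": 0.7, "jump": 0.2}
action_probs = {"left": 0.2, "right": 0.5, "jump": 0.3}

def deterministic_policy(values):
    # Always returns the same action for a given state: the highest-valued one.
    return max(values, key=values.get)

def stochastic_policy(probs, rng=np.random.default_rng(0)):
    # Samples an action from a probability distribution over actions.
    return rng.choice(list(probs), p=list(probs.values()))

print(deterministic_policy(q_values))   # 'right', every time
print(stochastic_policy(action_probs))  # usually 'right', but not always
```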

Value-Based Methods

In value-based methods, instead of learning a policy function, we learn a value function that maps a state to the expected value of being at that state.

19.jpeg

The value of a state is the expected discounted return the agent can get if it starts in that state, and then acts according to our policy.


But what does it mean to act according to our policy? After all, we don't have a policy in value-based methods since we train a value function and not a policy.

Here we see that our value function defines values for each possible state.

Thanks to our value function, at each step our policy will select the state with the biggest value defined by the value function: -7, then -6, then -5 (and so on) to attain the goal.

Since the policy is not trained/learned, we need to specify its behavior. For instance, if we want a policy that, given the value function, will take actions that always lead to the biggest reward, we'll create a Greedy Policy.

In this case, you don't train the policy: your policy is just a simple pre-specified function (for instance, the Greedy Policy) that uses the values given by the value function to select its actions.

In value-based training, finding an optimal value function (denoted Q* or V*) leads to having an optimal policy.

Finding an optimal value function leads to having an optimal policy.

20 (1).jpeg

In fact, most of the time, in value-based methods, you'll use an Epsilon-Greedy Policy that handles the exploration/exploitation trade-off.
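To make this concrete, here is a minimal sketch of such a pre-specified Greedy Policy that simply reads a learned value table; the states, actions and values (echoing the -7, -6, -5 example above) are hypothetical.

```python
# Hypothetical action-value table for two states of the maze.
q_table = {
    "s0": {"right": -7.0, "down": -9.0},
    "s1": {"right": -6.0, "down": -8.0},
}

def greedy_policy(q_table, state):
    # The policy is not trained: it just picks the action with the highest value.
    return max(q_table[state], key=q_table[state].get)

print(greedy_policy(q_table, "s0"))  # 'right'
```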

State Value Function


We write the state value function under a policy π like this:

image (1).png

For each state, the state-value function outputs the expected return if the agent starts at that state and then follows the policy forever afterward (for all future timesteps, if you prefer).

If we take the state with value -7: it's the expected return starting at that state and taking actions according to our policy (greedy policy), so right, right, right, down, down, right, right.

Action Value Function


The value of taking action a in state s under a policy π is:

image (5).png

In the action-value function, for each state and action pair, the action-value function outputs the expected return if the agent starts in that state, takes that action, and then follows the policy forever after.

We see that the difference is: for the state-value function, we calculate the value of a state; for the action-value function, we calculate the value of the state-action pair, hence the value of taking that action at that state.

The Bellman Equation

Instead of calculating the expected return for each state or each state-action pair, we can use the Bellman equation.

The Bellman equation is a recursive equation that works like this:

image (22).png

Instead of starting for each state from the beginning and calculating the return, we can consider the value of any state as:

The immediate reward \(R_{t+1}\) + the discounted value of the state that follows (\(\gamma \cdot V(S_{t+1})\)).

To recap, the idea of the Bellman equation is that instead of calculating each value as the sum of the expected return, which is a long process, we calculate the value as the sum of immediate reward + the discounted value of the state that follows.
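Written out for the state-value function under a policy \(\pi\), this recursion is:

\[V_{\pi}(s) = \mathbb{E}_{\pi}\left[R_{t+1} + \gamma V_{\pi}(S_{t+1}) \mid S_t = s\right]\]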

Monte Carlo vs Temporal Difference Learning

The idea behind Reinforcement Learning is that given the experience and the received reward, the agent will update its value function or policy.

Monte Carlo and Temporal Difference Learning are two different strategies on how to train our value function or our policy function. Both of them use experience to solve the RL problem.

Monte Carlo: learning at the end of the episode

Monte Carlo approach:

Monte Carlo waits until the end of the episode, calculates \(G_t\) (the return) and uses it as a target for updating \(V(S_t)\).

So it requires a complete episode of interaction before updating our value function.
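Expressed as an update rule (with \(\alpha\) the learning rate), the Monte Carlo update is:

\[V(S_t) \leftarrow V(S_t) + \alpha \left[G_t - V(S_t)\right]\]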

image (20).png

If we take an example:

image (10).png

At the end of the episode, we have a list of States, Actions, Rewards, and new States. The agent sums the total rewards \(G_t\), updates \(V(S_t)\) based on this return, and then starts a new game with this new knowledge.

By running more and more episodes, the agent will learn to play better and better.


For instance, if we train a state-value function using Monte Carlo:

image (15).png

Temporal Difference Learning: learning at each step

Temporal Difference, on the other hand, waits for only one interaction (one step) \(S_{t+1}\) to form a TD target and updates \(V(S_t)\) using \(R_{t+1}\) and \(\gamma \cdot V(S_{t+1})\).

The idea with TD is to update \(V(S_t)\) at each step.

But because we didn't experience an entire episode, we don't have \(G_t\) (the expected return). Instead, we estimate \(G_t\) by adding \(R_{t+1}\) and the discounted value of the next state.

This is called bootstrapping. It's called this because TD bases its update in part on an existing estimate \(V(S_{t+1})\) and not a complete sample \(G_t\).
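Expressed as an update rule (again with learning rate \(\alpha\)), the one-step TD update is:

\[V(S_t) \leftarrow V(S_t) + \alpha \left[R_{t+1} + \gamma V(S_{t+1}) - V(S_t)\right]\]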

TD approach:

TD-3.png

This method is called TD(0) or one-step TD (update the value function after any individual step).

At the end of one step of interaction, we have a tuple (State, Action, Reward, and Next State), and we immediately update \(V(S_t)\).

Now we continue to interact with this environment with our updated value function. By running more and more steps, the agent will learn to play better and better.


If we take an example:

TD-2.png

We can now update \(V(S_0)\):

So we just updated our value function for State 0.

Now we continue to interact with this environment with our updated value function.

Summary

Summary.png

Off-policy vs On-policy


On-Policy Reinforcement Learning:

In on-policy RL, the agent learns from the experiences it gathers while following its current policy. This means that the data used for training the agent comes from the same policy it is trying to improve. It updates its policy based on its current actions and their consequences.

Key characteristics of on-policy RL:

Off-Policy Reinforcement Learning:

In off-policy RL, the agent learns from experiences generated by a different (usually older) policy, often referred to as the "behavior policy." It uses these experiences to learn and improve a different policy called the "target policy." This decoupling of data collection and policy improvement allows for greater flexibility.

Key characteristics of off-policy RL:

Q-Learning


What is Q-Learning?

Q-Learning is an off-policy value-based method that uses a TD approach to train its action-value function:

Introducing Q-Learning

Q-Learning is the algorithm we use to train our Q-function, an action-value function that determines the value of being at a particular state and taking a specific action at that state.

The Q comes from "the Quality" (the value) of that action at that state.


Letโ€™s recap the difference between value and reward:

Value

The value of a state, or a state-action pair, is the expected cumulative reward our agent gets if it starts at this state (or state-action pair) and then acts according to its policy.

Reward

The reward is the feedback I get from the environment after performing an action at a state.

Internally, our Q-function is encoded by a Q-table, a table where each cell corresponds to a state-action pair value. Think of this Q-table as the memory or cheat sheet of our Q-function.


Letโ€™s go through an example of a maze.

Maze-1.png

The Q-table is initialized. That's why all values are 0. This table contains, for each state and action, the corresponding state-action values.

Maze-2.png

Here we see that the state-action value of the initial state and going up is 0:

Maze-3.png

So: the Q-function uses a Q-table that has the value of each state-action pair. Given a state and action, our Q-function will search inside its Q-table to output the value.

Q-function-2.png


If we recap, Q-Learning is the RL algorithm that trains a Q-function (an action-value function), whose internal memory is a Q-table that contains all the state-action pair values.

In the beginning, our Q-table is useless since it gives arbitrary values for each state-action pair (most of the time, we initialize the Q-table to 0).

As the agent explores the environment and we update the Q-table, it will give us a better and better approximation to the optimal policy.

The Q-Learning Algorithm


Step 01: We initialize the Q-table

We need to initialize the Q-table for each state-action pair. Most of the time, we initialize with values of 0.

image (17).png


Step 02: Choose an action using the epsilon-greedy strategy

The epsilon-greedy strategy is a policy that handles the exploration/exploitation trade-off.

The idea is that, with an initial value of ε = 1.0:

Q-learning-5.png

At the beginning of the training, the probability of doing exploration will be huge since ε is very high, so most of the time, we'll explore. But as the training goes on, and consequently our Q-table gets better and better in its estimations, we progressively reduce the epsilon value since we will need less and less exploration and more exploitation.
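Here is a minimal Python sketch of this strategy; the Q-table layout, the action list and the (linear) decay schedule are assumptions for illustration.

```python
import random

def epsilon_greedy_action(q_table, state, actions, epsilon):
    # With probability epsilon we explore (random action),
    # otherwise we exploit (the action with the highest Q-value).
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: q_table[(state, a)])

def decayed_epsilon(step, eps_start=1.0, eps_min=0.05, decay=0.001):
    # Start fully exploratory and progressively shift toward exploitation.
    return max(eps_min, eps_start - decay * step)

q_table = {("s0", "up"): 0.0, ("s0", "down"): 0.5}   # hypothetical values
print(epsilon_greedy_action(q_table, "s0", ["up", "down"], decayed_epsilon(0)))
```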


Step 03: Perform action \(A_t\), get reward \(R_{t+1}\) and next state \(S_{t+1}\)


Step 04: Update \(Q(S_t, A_t)\)

Remember that in TD Learning, we update our policy or value function (depending on the RL method we choose) after one step of the interaction.

To produce our TD target, we use the immediate reward \(R_{t+1}\) plus the discounted value of the next state, computed by finding the action that maximizes the current Q-function at the next state. (We call this bootstrapping.)

Q-learning-7.png

Therefore, our \(Q(S_t, A_t)\) update formula goes like this:

Q-learning-8.png

This means that to update our \(Q(S_t, A_t)\), we need the TD target.

How do we form the TD target?

  1. We obtain the reward \(R_{t+1}\) after taking the action.
  2. To get the best state-action pair value for the next state, we use a greedy policy to select the next best action. Note that this is not an epsilon-greedy policy: it will always take the action with the highest state-action value.

Then when the update of this Q-value is done, we start in a new state and select our action using an epsilon-greedy policy again.

This is why we say that Q-Learning is an off-policy algorithm.
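Putting steps 2 to 4 together, here is a minimal sketch of the tabular update; the learning rate, discount factor and Q-table layout are illustrative assumptions.

```python
def q_learning_update(q_table, state, action, reward, next_state, actions,
                      alpha=0.1, gamma=0.99):
    # TD target: immediate reward + discounted value of the greedy next action
    # (this greedy selection is what makes Q-Learning off-policy).
    best_next = max(q_table[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next
    td_error = td_target - q_table[(state, action)]
    q_table[(state, action)] += alpha * td_error

# Tiny hypothetical example with two states and two actions.
actions = ["up", "down"]
q_table = {(s, a): 0.0 for s in ("s0", "s1") for a in actions}
q_learning_update(q_table, "s0", "down", reward=1.0, next_state="s1", actions=actions)
print(q_table[("s0", "down")])  # 0.1
```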

From Q-Learning to Deep Q-Learning

We learned that Q-Learning is an algorithm we use to train our Q-Function, an action-value function that determines the value of being at a particular state and taking a specific action at that state.

Internally, our Q-function is encoded by a Q-table, a table where each cell corresponds to a state-action pair value. Think of this Q-table as the memory or cheat sheet of our Q-function.

The problem is that Q-Learning is a tabular method. This becomes a problem if the state and action spaces are not small enough to be represented efficiently by arrays and tables. In other words: it is not scalable.

In large environments, such as Atari games where the state is an image, the state space is gigantic; due to this, creating and updating a Q-table for that environment would not be efficient. In this case, the best idea is to approximate the Q-values using a parametrized Q-function \(Q_{\theta}(s, a)\).

This neural network will approximate, given a state, the different Q-values for each possible action at that state. And thatโ€™s exactly what Deep Q-Learning does.

deep.jpg

The Deep Q-Network (DQN)

This is the architecture of our Deep Q-Learning network:

deep-q-network.jpg

As input, we take a stack of 4 frames passed through the network as a state and output a vector of Q-values for each possible action at that state. Then, like with Q-Learning, we just need to use our epsilon-greedy policy to select which action to take.

When the Neural Network is initialized, the Q-value estimation is terrible. But during training, our Deep Q-Network agent will associate a situation with the appropriate action and learn to play the game well.

Then the stacked frames are processed by three convolutional layers. These layers allow us to capture and exploit spatial relationships in images.

Finally, we have a couple of fully connected layers that output a Q-value for each possible action at that state.

So, Deep Q-Learning uses a neural network to approximate, given a state, the different Q-values for each possible action at that state.
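As a rough sketch of such a network (assuming PyTorch; the layer sizes follow the classic Atari DQN recipe, but they are assumptions here, not the exact architecture used in the slides):

```python
import torch
import torch.nn as nn

class DeepQNetwork(nn.Module):
    def __init__(self, n_actions, in_channels=4):        # 4 stacked grayscale frames
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),        # 7x7 holds for 84x84 inputs
            nn.Linear(512, n_actions),                    # one Q-value per action
        )

    def forward(self, x):
        return self.head(self.conv(x))

q_net = DeepQNetwork(n_actions=6)
print(q_net(torch.zeros(1, 4, 84, 84)).shape)             # torch.Size([1, 6])
```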


Preprocessing the input and temporal limitation:

We need to preprocess the input. It's an essential step since we want to reduce the complexity of our state to reduce the computation time needed for training.

  1. To achieve this, we reduce the state space to 84x84 and grayscale it. This is a big improvement since we reduce our three color channels (RGB) to 1.
  2. We can also crop a part of the screen in some games if it does not contain important information. Then we stack four frames together.

Why do we stack four frames together? We stack frames together because it helps us handle the problem of temporal limitation.
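A minimal sketch of this preprocessing in Python/NumPy follows (the resize step is assumed to have already happened, since the resize backend, e.g. OpenCV or PIL, is an implementation choice):

```python
import numpy as np

def to_grayscale(frame_rgb):
    # Collapse the 3 RGB channels to 1 using standard luminance weights.
    return frame_rgb @ np.array([0.299, 0.587, 0.114])

def stack_frames(frame_buffer, new_frame, stack_size=4):
    # Keep the most recent `stack_size` frames so the network can infer motion.
    if not frame_buffer:                          # at episode start, repeat the frame
        frame_buffer = [new_frame] * stack_size
    else:
        frame_buffer = (frame_buffer + [new_frame])[-stack_size:]
    return frame_buffer, np.stack(frame_buffer, axis=0)

# Hypothetical 84x84 RGB frame (already resized and cropped).
buffer, state = stack_frames([], to_grayscale(np.zeros((84, 84, 3))))
print(state.shape)  # (4, 84, 84)
```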

Let's take an example with the game of Pong. When you see this frame:

image (16).png

Can you tell me where the ball is going?

No, because one frame is not enough to have a sense of motion!

But what if I add three more frames? Here, you can see that the ball is going to the right.

image (18).png

The Deep Q-Learning Algorithm

We learned that Deep Q-Learning uses a deep neural network to approximate the different Q-values for each possible action at a state (value-function estimation).

The difference is that, during the training phase, instead of updating the Q-value of a state-action pair directly as we have done with Q-Learning:

Q-learning-8.png

In Deep Q-Learning, we create a loss function that compares our Q-value prediction and the Q-target, and we use gradient descent to update the weights of our Deep Q-Network to approximate our Q-values better.

The Deep Q-Learning training algorithm has two phases:

  1. Sampling: we perform actions and store the observed experience tuples in a replay memory.
  2. Training: Select a small batch of tuples randomly and learn from this batch using a gradient descent update step.

image (19).png

This is not the only difference compared with Q-Learning.

Deep Q-Learning training might suffer from instability, mainly because of combining a non-linear Q-value function (Neural Network) and bootstrapping (when we update targets with existing estimates and not an actual complete return).

To help us stabilize the training, we implement three different solutions:

  1. Experience Replay to make more efficient use of experiences.
  2. Fixed Q-Target to stabilize the training.
  3. Double Deep Q-Learning, to handle the problem of the overestimation of Q-values.

Experience Replay

Why do we create a replay memory? Experience Replay in Deep Q-Learning has two functions:

Make more efficient use of the experiences during the training.

Usually, in online reinforcement learning, the agent interacts with the environment, gets experiences (state, action, reward, and next state), learns from them (updates the neural network), and discards them. This is not efficient.

Experience replay helps by using the experiences of the training more efficiently. We use a replay buffer that saves experience samples that we can reuse during the training.

This allows the agent to learn from the same experiences multiple times.

Avoid forgetting previous experiences and reduce the correlation between experiences.

The problem we get if we give sequential samples of experiences to our neural network is that it tends to forget the previous experiences as it gets new experiences. For instance, if the agent is in the first level and then in the second, which is different, it can forget how to behave and play in the first level.

The solution is to create a Replay Buffer that stores experience tuples while interacting with the environment and then sample a small batch of tuples. This prevents the network from only learning about what it has done immediately before.

Experience replay also has other benefits. By randomly sampling the experiences, we remove correlation in the observation sequences and prevent action values from oscillating or diverging catastrophically.

In the Deep Q-Learning pseudocode, we initialize a replay memory buffer D with capacity N (N is a hyperparameter that you can define). We then store experiences in the memory and sample a batch of experiences to feed the Deep Q-Network during the training phase.
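A minimal sketch of such a replay buffer in Python (the tuple layout, capacity and batch size are illustrative assumptions):

```python
import random
from collections import deque, namedtuple

Experience = namedtuple("Experience", ["state", "action", "reward", "next_state", "done"])

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.memory = deque(maxlen=capacity)    # oldest experiences drop out first

    def store(self, *experience):
        self.memory.append(Experience(*experience))

    def sample(self, batch_size=32):
        # Uniform random sampling breaks the correlation between consecutive experiences.
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)
```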

Fixed Q-Target

When we want to calculate the TD error (aka the loss), we calculate the difference between the TD target (Q-Target) and the current Q-value (estimation of Q).

But we don't have any idea of the real TD target. We need to estimate it. Using the Bellman equation, we saw that the TD target is just the reward of taking that action at that state plus the discounted highest Q value for the next state.

Q-Target

Q-learning-7.png

Q-Loss

Q-learning-8.png

However, the problem is that we are using the same parameters (weights) for estimating the TD target and the Q-value. Consequently, there is a significant correlation between the TD target and the parameters we are changing.

Therefore, at every step of training, both our Q-values and the target values shift. We're getting closer to our target, but the target is also moving. It's like chasing a moving target!

This can lead to significant oscillation in training.


It's as if you were a cowboy (the Q estimation) trying to catch a cow (the Q-target). Your goal is to get closer (reduce the error).

qtarget-1.png

At each time step, you're trying to approach the cow, which also moves at each time step (because you use the same parameters).

qtarget-2.png

qtarget-3.png

This leads to a bizarre path of chasing (a significant oscillation in training).

qtarget-4.png

Instead, what we see in the pseudo-code is that we:

  1. Use a separate network with fixed parameters for estimating the TD target.
  2. Copy the parameters from our Deep Q-Network every C steps to update the target network.
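A minimal sketch of this idea (assuming PyTorch; the tiny stand-in network and the sync period are illustrative assumptions):

```python
import copy
import torch
import torch.nn as nn

q_network = nn.Linear(4, 2)                 # online network being trained
target_network = copy.deepcopy(q_network)   # frozen copy used to compute TD targets

def maybe_sync_target(step, period=1_000):
    # Every `period` steps, copy the online weights into the target network.
    if step % period == 0:
        target_network.load_state_dict(q_network.state_dict())

# The TD target uses the *target* network, so it no longer moves at every update.
with torch.no_grad():
    next_q = target_network(torch.zeros(1, 4)).max(dim=1).values
    td_target = 1.0 + 0.99 * next_q         # reward + gamma * max_a Q_target(s', a)
```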

Double DQN

Double DQNs, or Double Deep Q-Learning neural networks, were introduced by Hado van Hasselt. This method handles the problem of the overestimation of Q-values.

Q-learning-7.png

When calculating the TD Target, we face a simple problem: how are we sure that the best action for the next state is the action with the highest Q-value?

We know that the accuracy of Q-values depends on what action we tried and what neighboring states we explored.

Consequently, we don't have enough information about the best action to take at the beginning of the training. Therefore, taking the maximum Q-value (which is noisy) as the best action to take can lead to false positives. If non-optimal actions are regularly given a higher Q-value than the optimal best action, the learning will be complicated.

The solution is: when we compute the Q target, we use two networks to decouple the action selection from the target Q-value generation. We:

  1. Use our DQN network to select the best action to take for the next state (the action with the highest Q-value).
  2. Use our Target network to calculate the target Q-value of taking that action at the next state.

Therefore, Double DQN helps us reduce the overestimation of Q-values and, as a consequence, helps us train faster and with more stable learning.
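A minimal sketch of this decoupled target computation (assuming PyTorch; the tiny networks and the dummy batch below are only there to show the shapes):

```python
import torch
import torch.nn as nn

def double_dqn_target(reward, next_state, q_network, target_network, gamma=0.99):
    with torch.no_grad():
        # 1) The online network *selects* the best next action...
        best_action = q_network(next_state).argmax(dim=1, keepdim=True)
        # 2) ...and the target network *evaluates* that action.
        next_q = target_network(next_state).gather(1, best_action).squeeze(1)
    return reward + gamma * next_q

q_net, tgt_net = nn.Linear(4, 2), nn.Linear(4, 2)
print(double_dqn_target(torch.tensor([1.0]), torch.zeros(1, 4), q_net, tgt_net))
```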

Introducing Policy-Gradient Methods


Recap of Policy-Based Methods

In policy-based methods, we directly learn to approximate π* without having to learn a value function.

The idea is to parameterize the policy. For instance, using a neural network \(\pi_{\theta}\), this policy will output a probability distribution over actions (stochastic policy).

stochastic_policy.png

Our objective then is to maximize the performance of the parameterized policy using gradient ascent.

To do that, we control the parameter that will affect the distribution of actions over a state.

policy_based.png

Consequently, thanks to policy-based methods, we can directly optimize our policy \(\pi_{\theta}\) to output a probability distribution over actions \(\pi_{\theta}(a \mid s)\) that leads to the best cumulative return.

To do that, we define an objective function \(J(\theta)\), that is, the expected cumulative reward, and we want to find the value of \(\theta\) that maximizes this objective function.


The difference between policy-based and policy-gradient methods

Policy-gradient methods, which we're going to study in this unit, are a subclass of policy-based methods. In policy-based methods, the optimization is most of the time on-policy since for each update, we only use data (trajectories) collected by our most recent version of \(\pi_{\theta}\).

The difference between these two methods lies in how we optimize the parameter \(\theta\): in policy-gradient methods, we do it directly by performing gradient ascent on the performance of an objective function \(J(\theta)\).

Before diving deeper into how policy-gradient methods work (the objective function, the policy gradient theorem, gradient ascent, etc.), let's study the advantages and disadvantages of policy-based methods.

The Advantages and Disadvantages of Policy-Gradient Methods


Advantages

The simplicity of integration

We can estimate the policy directly without storing additional data (action values).

Policy-gradient methods can learn a stochastic policy

Policy-gradient methods can learn a stochastic policy while value functions can't. This has two consequences:

  1. We don't need to implement an exploration/exploitation trade-off by hand. Since we output a probability distribution over actions, the agent explores the state space without always taking the same trajectory.
  2. We also get rid of the problem of perceptual aliasing. Perceptual aliasing is when two states seem (or are) the same but need different actions.

Let's take an example: we have an intelligent vacuum cleaner whose goal is to suck the dust and avoid killing the hamsters.

image (3).png

Our vacuum cleaner can only perceive where the walls are.

The problem is that the two red (colored) states are aliased states because the agent perceives an upper and lower wall for each.

image (2).png

Under a deterministic policy, the policy will either always move right when in a red state or always move left. Either case will cause our agent to get stuck and never suck the dust.

Under a value-based Reinforcement Learning algorithm, we learn a quasi-deterministic policy ("epsilon-greedy strategy"). Consequently, our agent can spend a lot of time before finding the dust.

On the other hand, an optimal stochastic policy will randomly move left or right in red (colored) states. Consequently, it will not be stuck and will reach the goal state with a high probability.

image (14).png

Policy-gradient methods are more effective in high-dimensional action spaces and continuous action spaces

The problem with Deep Q-Learning is that its predictions assign a score (maximum expected future reward) to each possible action, at each time step, given the current state. But what if we have an infinite number of possible actions?

For instance, with a self-driving car, at each state, you can have a (near) infinite choice of actions (turning the wheel at 15°, 17.2°, 19.4°, honking, etc.). We'll need to output a Q-value for each possible action! And taking the max action of a continuous output is an optimization problem itself!

Instead, with policy-gradient methods, we output a probability distribution over actions.

Policy-gradient methods have better convergence properties

In value-based methods, we use an aggressive operator to change the value function: we take the maximum over Q-estimates. Consequently, the action probabilities may change dramatically for an arbitrarily small change in the estimated action values if that change results in a different action having the maximal value.

For instance, if during the training, the best action was left (with a Q-value of 0.22) and at the training step after it's right (since the Q-value of right becomes 0.23), we dramatically changed the policy, since now the policy will mostly take right instead of left.

On the other hand, in policy-gradient methods, stochastic policy action preferences (probability of taking action) change smoothly over time.

Disadvantages

Naturally, policy-gradient methods also have some disadvantages: they frequently converge to a local maximum instead of the global optimum, they can be slower to train (less sample efficient), and they can have high variance (as we will see with Reinforce).

Policy-Gradient Methods


The Big Picture

The idea is that we have a parameterized stochastic policy. In our case, a neural network outputs a probability distribution over actions. The probability of taking each action is also called the action preference.

Our goal with policy-gradient is to control the probability distribution of actions by tuning the policy such that good actions (that maximize the return) are sampled more frequently in the future. Each time the agent interacts with the environment, we tweak the parameters such that good actions are more likely to be sampled in the future.

But how are we going to optimize the weights using the expected return?

The idea is that we let the agent interact during an episode. If we win the episode, we consider that each action taken was good and should be sampled more often in the future, since it led to the win.

So for each state-action pair, we want to increase \(P(a \mid s)\): the probability of taking that action at that state (or decrease it if we lost).

The Policy-gradient algorithm (simplified) looks like this:

pg_bigpicture.jpg


We have our stochastic policy \(\pi\), which has a parameter \(\theta\). Given a state, this \(\pi\) outputs a probability distribution over actions.

Where \(\pi_{\theta}(a_t \mid s_t)\) is the probability of the agent selecting action \(a_t\) from state \(s_t\), given our policy.

But how do we know if our policy is good? We need to have a way to measure it. To know that, we define a score/objective function called \(J(\theta)\).

The objective function

The objective function gives us the performance of the agent given a trajectory (a state-action sequence, without considering the reward, contrary to an episode), and it outputs the expected cumulative reward.

objective.jpg

Let's give some more details on this formula:

The expected return

The expected return (also called the expected cumulative reward) is the weighted average, where the weights are given by \(P(\tau;\theta)\), of all possible values that the return \(R(\tau)\) can take.

expected_reward.png

\(R(\tau)\)

Return from an arbitrary trajectory. To take this quantity and use it to calculate the expected return, we need to multiply it by the probability of each possible trajectory.

\(P(\tau;\theta)\)

Probability of each possible trajectory \(\tau\) (this probability depends on \(\theta\), since it defines the policy used to select the actions of the trajectory, which has an impact on the states visited).

probability.png

\(J(\theta)\)

The expected return: we calculate it by summing, over all trajectories, the probability of taking that trajectory given \(\theta\), multiplied by the return of that trajectory.

Our objective then is to maximize the expected cumulative reward by finding the \(\theta\) that will output the best action probability distributions:

max_objective.png

The Policy-Gradient Theorem

Policy-gradient is an optimization problem: we want to find the values of \(\theta\) that maximize our objective function \(J(\theta)\), so we need to use gradient ascent. It's the inverse of gradient descent since it gives the direction of the steepest increase of \(J(\theta)\).

Our update step for gradient-ascent is: \(\theta \leftarrow \theta + \alpha * \nabla_\theta J(\theta)\)

We can repeatedly apply this update in the hope that \(\theta\) converges to the value that maximizes \(J(\theta)\).

However, there are two problems with computing the derivative of \(J(\theta)\):

  1. We can't calculate the true gradient of the objective function since it requires calculating the probability of each possible trajectory, which is computationally super expensive. So we want to calculate a gradient estimation with a sample-based estimate (collect some trajectories).
  2. To differentiate this objective function, we need to differentiate the state distribution, called the Markov Decision Process dynamics. This is attached to the environment. It gives us the probability of the environment going into the next state, given the current state and the action taken by the agent. The problem is that we can't differentiate it because we might not know about it.

Fortunately, we're going to use a solution called the Policy Gradient Theorem that will help us reformulate the objective function into a differentiable function that does not involve the differentiation of the state distribution.

So we have: \(\nabla_\theta J(\theta)=\nabla_\theta \sum_\tau P(\tau;\theta)R(\tau)\)

We can rewrite the gradient of the sum as the sum of the gradient: \(= \sum_\tau \nabla_\theta P(\tau;\theta)R(\tau)\)

We then multiply every term in the sum by \(\frac{ P(\tau;\theta)}{ P(\tau;\theta)}\) (which is possible since itโ€™s = 1)

\[= \sum_\tau \frac{ P(\tau;\theta)}{ P(\tau;\theta)} \nabla_\theta P(\tau;\theta)R(\tau) \\ = \sum_\tau P(\tau;\theta) \frac{\nabla_\theta P(\tau;\theta)}{ P(\tau;\theta)} R(\tau)\]

We can then use the derivative log trick (also called the likelihood ratio trick or REINFORCE trick), a simple rule in calculus that implies that: \(\nabla_x \log f(x) = \frac{\nabla_x f(x)}{f(x)}\)

So this is our likelihood policy gradient:

\[\nabla_{\theta} J(\theta) = \sum_{\tau} P(\tau;\theta) \nabla_{\theta} \log P(\tau;\theta) R(\tau)\]

Thanks to this new formula, we can estimate the gradient using trajectory samples (we can approximate the likelihood ratio policy gradient with a sample-based estimate, if you prefer):

\[\nabla_{\theta} J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \nabla_{\theta} \log P(\tau^{(i)};\theta) R(\tau^{(i)})\]

where each \(\tau^{(i)}\) is a sampled trajectory.

But we still have some mathematics work to do there: we need to simplify \(\nabla_{\theta} \log P(\tau \mid \theta)\).

We know that:

\[\nabla_{\theta} \log P(\tau^{(i)};\theta) = \nabla_{\theta} \log \left[\mu(s_0) \prod_{t=0}^{H} P(s_{t+1}^{(i)} \mid s_t^{(i)}, a_t^{(i)}) \pi_{\theta}(a_t^{(i)} \mid s_t^{(i)})\right]\]

Where \(\mu(s_0)\) is the initial state distribution and \(P(s_{t+1}^{(i)} \mid s_t^{(i)}, a_t^{(i)})\) is the state transition dynamics of the MDP.

We know that the log of a product is equal to the sum of the logs:

\[\nabla_{\theta} \log P(\tau^{(i)};\theta) = \nabla_{\theta} \left[\log \mu(s_0) + \sum_{t=0}^{H} \log P(s_{t+1}^{(i)} \mid s_t^{(i)}, a_t^{(i)}) + \sum_{t=0}^{H} \log \pi_{\theta}(a_t^{(i)} \mid s_t^{(i)})\right]\]

We also know that the gradient of the sum is equal to the sum of gradients:

\[\nabla_{\theta} \log P(\tau^{(i)};\theta) = \\ \nabla_{\theta} \log \mu(s_0) + \nabla_{\theta} \sum_{t=0}^{H} \log P(s_{t+1}^{(i)} \mid s_t^{(i)}, a_t^{(i)}) + \nabla_{\theta} \sum_{t=0}^{H} \log \pi_{\theta}(a_t^{(i)} \mid s_t^{(i)})\]

Since neither the initial state distribution nor the state transition dynamics of the MDP are dependent on \(\theta\), the derivative of both terms is 0. So we can remove them:

Since: \(\nabla_{\theta} \sum_{t=0}^{H} \log P(s_{t+1}^{(i)} \mid s_t^{(i)}, a_t^{(i)}) = 0\) and \(\nabla_{\theta} \log \mu(s_0) = 0\)

\[\nabla_{\theta} \log P(\tau^{(i)};\theta) = \sum_{t=0}^{H} \nabla_{\theta} \log \pi_{\theta}(a_t^{(i)} \mid s_t^{(i)})\]

So, the final formula for estimating the policy gradient is:

\[\nabla_{\theta} J(\theta) = \hat{g} = \frac{1}{m} \sum_{i=1}^{m} \sum_{t=0}^{H} \nabla_{\theta} \log \pi_{\theta}(a_t^{(i)} \mid s_t^{(i)}) R(\tau^{(i)})\]
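In code, this estimator is usually implemented as a loss to minimize. Here is a minimal sketch assuming PyTorch, where `policy` is any network mapping states to action logits; the names and tensor shapes are assumptions.

```python
import torch

def reinforce_loss(policy, states, actions, episode_return):
    # log pi_theta(a_t | s_t) for the actions that were actually taken.
    log_probs = torch.log_softmax(policy(states), dim=1)
    taken = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    # Gradient *ascent* on J(theta) == gradient descent on this negative sum.
    return -(taken * episode_return).sum()
```

Calling `backward()` on this loss and taking an optimizer step then performs the gradient-ascent update \(\theta \leftarrow \theta + \alpha \hat{g}\).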

The Reinforce Algorithm

The Reinforce algorithm, also called Monte-Carlo policy-gradient, is a policy-gradient algorithm that uses an estimated return from an entire episode to update the policy parameter \(\theta\):

In a loop:

policy_gradient_one.png

We can interpret this update as follows:

We can also collect multiple episodes (trajectories) to estimate the gradient:

policy_gradient_multiple.png

The Problem of Variance in Reinforce

In Reinforce, we want to increase the probability of actions in a trajectory proportionally to how high the return is.

pg.jpg

This return \(R(\tau)\) is calculated using a Monte-Carlo sampling. We collect a trajectory and calculate the discounted return, and use this score to increase or decrease the probability of every action taken in that trajectory. If the return is good, all actions will be โ€œreinforcedโ€ by increasing their likelihood of being taken.

\(R(\tau) = R_{t+1}+\gamma R_{t+2}+\gamma^2 R_{t+3} + ...\)

The advantage of this method is that it's unbiased. Since we're not estimating the return, we use only the true return we obtain.

Given the stochasticity of the environment (random events during an episode) and the stochasticity of the policy, trajectories can lead to different returns, which leads to high variance. Consequently, the return starting from the same state can vary significantly across episodes.

Collecting a large number of trajectories can mitigate this variance, but it reduces sample efficiency, so we need additional mechanisms to reduce it.

variance.png

Introducing Actor-Critic Methods

In Policy-Based methods, we aim to optimize the policy directly without using a value function. More precisely, Reinforce is part of a subclass of Policy-Based Methods called Policy-Gradient methods. This subclass optimizes the policy directly by estimating the weights of the optimal policy using Gradient Ascent.

We saw that Reinforce worked well. However, because we use Monte-Carlo sampling to estimate return (we use an entire episode to calculate the return), we have significant variance in policy gradient estimation.

Remember that the policy gradient estimation is the direction of the steepest increase in return. In other words, how to update our policy weights so that actions that lead to good returns have a higher probability of being taken. The Monte Carlo variance, which we will further study in this unit, leads to slower training since we need a lot of samples to mitigate it.

So now we'll study Actor-Critic methods, a hybrid architecture combining value-based and policy-based methods that helps to stabilize the training by reducing the variance using an Actor that controls how our agent behaves (a policy-based method) and a Critic that measures how good the taken action is (a value-based method).


Reducing variance with Actor-Critic methods

The solution to reducing the variance of the Reinforce algorithm and training our agent faster and better is to use a combination of Policy-Based and Value-Based methods: the Actor-Critic method.

To understand the Actor-Critic, imagine you're playing a video game. You can play with a friend who will provide you with some feedback. You're the Actor and your friend is the Critic.

ac.png

You don't know how to play at the beginning, so you try some actions randomly. The Critic observes your action and provides feedback.

Learning from this feedback, you'll update your policy and be better at playing that game.

On the other hand, your friend (Critic) will also update their way to provide feedback so it can be better next time.

This is the idea behind Actor-Critic. We learn two function approximations: a policy that controls how our agent acts (the Actor) and a value function that assists the policy update by measuring how good the taken action is (the Critic).

The Actor-Critic Process

Now that we have seen the Actor-Critic's big picture, let's dive deeper to understand how the Actor and Critic improve together during the training. As we saw, with Actor-Critic methods, there are two function approximations (two neural networks): the Actor and the Critic.

Let's see the training process to understand how the Actor and Critic are optimized:

Step 1:

At each timestep, \(t\), we get the current state \(S_t\) from the environment and pass it as input through our Actor and Critic.

Our Policy takes the state and outputs an action \(A_t\).

step1.png

Step 2:

The Critic also takes that action as input and, using \(S_t\) and \(A_t\), computes the value of taking that action at that state: the Q-value.

step2.png

Step 3:

Performing the action \(A_t\) in the environment outputs a new state \(S_{t+1}\) and a reward \(R_{t+1}\).

step3.png

Step 4:

The Actor updates its policy parameters using the Q-value.

step4.png

Step 5:

Thanks to its updated parameters, the Actor produces the next action \(A_{t+1}\) to take, given the new state \(S_{t+1}\).

The Critic then updates its value parameters.

step5.png

Adding Advantage in Actor-Critic (A2C)

We can stabilize learning further by using the Advantage function as Critic instead of the Action value function.

The idea is that the Advantage function calculates the relative advantage of an action compared to the others possible at a state: how much better taking that action at a state is compared to the average value of the state. It subtracts the mean value of the state from the state-action pair value:

image (11).png

In other words, this function calculates the extra reward we get if we take this action at that state compared to the mean reward we get at that state.

The extra reward is whatโ€™s beyond the expected value of that state.

The problem with implementing this advantage function is that it requires two value functions: \(Q(s,a)\) and \(V(s)\). Fortunately, we can use the TD error as a good estimator of the advantage function.

image (13).png
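A minimal sketch of one Advantage Actor-Critic update that uses this TD error as the advantage (assuming PyTorch; the `actor` and `critic` networks, their optimizers and the tensor shapes are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

def a2c_step(actor, critic, actor_opt, critic_opt,
             state, action, reward, next_state, gamma=0.99):
    # TD error, used both as the Critic's learning signal and as the advantage.
    value = critic(state)
    with torch.no_grad():
        td_target = reward + gamma * critic(next_state)
    advantage = td_target - value

    # Critic update: move V(s) toward the TD target.
    critic_loss = advantage.pow(2).mean()
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor update: make the taken action more (or less) likely, weighted by the advantage.
    log_prob = F.log_softmax(actor(state), dim=-1).gather(-1, action)
    actor_loss = -(log_prob * advantage.detach()).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
```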

Offline vs Online Reinforcement Learning

Deep Reinforcement Learning (RL) is a framework to build decision-making agents. These agents aim to learn optimal behavior (policy) by interacting with the environment through trial and error and receiving rewards as unique feedback.

The agent's goal is to maximize its cumulative reward, called the return, because RL is based on the reward hypothesis: all goals can be described as the maximization of the expected cumulative reward.

Deep Reinforcement Learning agents learn with batches of experience. The question is: how do they collect it?

offlinevsonlinerl.gif

In online reinforcement learning, which is what we've learned during this course, the agent gathers data directly: it collects a batch of experience by interacting with the environment. Then, it uses this experience immediately (or via some replay buffer) to learn from it (update its policy).

But this implies that either you train your agent directly in the real world or you have a simulator. If you don't have one, you need to build it, which can be very complex (how do you reflect the complex reality of the real world in an environment?), expensive, and unsafe (if the simulator has flaws that provide a competitive advantage, the agent will exploit them).

On the other hand, in offline reinforcement learning, the agent only uses data collected from other agents or human demonstrations. It does not interact with the environment.

The process is as follows: we generate a dataset using one or more policies and/or human interactions, and then train the agent on this dataset without further interaction with the environment.

Can we develop data-driven RL methods?

On-policy RL updates the policy while interacting with the environment, off-policy RL learns from data collected by a different behavior policy, and offline RL learns from a fixed dataset without interacting with the environment. Each approach has its own advantages and use cases depending on the specific requirements of the RL problem at hand.

image (23).png

What makes Offline Reinforcement Learning Difficult?

Offline reinforcement learning (RL) presents several challenges that make it a difficult problem to tackle. Here are some of the key factors that contribute to the complexity of offline RL:

Distribution mismatch:

Offline RL involves learning from a fixed dataset of pre-collected experiences, which may not fully represent the dynamics and states of the online environment. There can be a significant difference between the distribution of the offline data and the distribution encountered during training or deployment. This distribution mismatch can lead to poor performance or even instability during the learning process.

Overestimation and extrapolation:

RL algorithms, particularly value-based methods like Q-learning, can be prone to overestimating the values of actions when learning from off-policy data. This issue arises when the behavior policy used for data collection explores different regions of the state-action space compared to the target policy. Overestimation can lead to suboptimal policies and hinder the learning process. Extrapolation errors may also occur when the agent needs to make predictions or take actions in states that were not sufficiently covered in the offline data.

Exploration-exploitation trade-off:

Offline RL lacks the ability to gather new data from the environment to explore and discover better policies. The absence of online exploration makes it challenging to strike the right balance between exploration (gaining knowledge) and exploitation (leveraging existing knowledge). The agent must rely solely on the provided offline dataset, potentially limiting its ability to explore and discover optimal actions in unexplored or uncertain regions of the state-action space.

Data quality and biases:

The quality and biases present in the offline dataset can significantly impact the learning process. The dataset may contain noisy, biased, or suboptimal trajectories, which can mislead the RL algorithm and lead to subpar policies. Identifying and mitigating data quality issues, such as removing outliers or correcting biases, is crucial for effective offline RL.

Stability and safety:

Offline RL algorithms need to ensure stability and safety during the learning process. Without the ability to collect new data and explore, there is a risk of overfitting to the limited offline dataset or encountering catastrophic failures in unfamiliar states. Ensuring stable and safe learning from offline data is a critical concern in offline RL.