\[\DeclareMathOperator*\trace{Tr} \DeclareMathOperator*\argmax{argmax} \DeclareMathOperator*\argmin{argmin} \DeclareMathOperator*\prob{\mathbb{P}}\]

Due at 11:59pm on Thursday, May 10. Please submit to Blackboard and follow the general guidelines regarding homework assignments.

REINFORCE

Suppose that the probability that a random variable \(X\) takes on the value \(x\) is given by the parametrized function \(p_\theta(x)\),

\[\prob[X = x] = p_\theta(x)\]

I am following the convention that random variables (e.g. \(X\)) are written in upper case, while values (e.g. \(x\)) are written in lower case.

Given a reward function \(r(X)\), define its expected value by

\[\begin{equation}\label{eq:ExpectedReward} \langle r(X)\rangle = \sum_x p_\theta(x) r(x) \end{equation}\]

where the sum is over all possible values \(x\) of the random variable \(X\). The expected reward is a function of the parameters \(\theta\). How can we maximize this function?

The REINFORCE algorithm consists of the following steps:

  • Draw a sample \(x\) of the random variable \(X\).
  • Compute the eligibility (for reinforcement)
\[\begin{equation}\label{eq:Eligibility} e_\theta(x) = \nabla_\theta\log p_\theta(x) \end{equation}\]
  • Observe the reward \(r(x)\).
  • Update the parameters via
\[\begin{equation}\label{eq:ThetaUpdate} \Delta\theta \propto r(x)e_\theta(x) \end{equation}\]

The basic theorem is that this algorithm is stochastic gradient ascent on the expected reward of Eq. (\ref{eq:ExpectedReward}); a minimal code sketch of the steps appears below. To apply the REINFORCE algorithm:

  • It is sufficient to observe reward from random samples; no explicit knowledge of the reward function is necessary.
  • One must have enough knowledge about \(p_\theta(x)\) to be able to compute the eligibility.
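
As a concrete illustration of these steps, here is a minimal sketch (not taken from the assignment notebooks; the reward vector and all names are illustrative) that runs REINFORCE on a categorical random variable whose distribution is a softmax over the logits \(\theta\):

```python
import numpy as np

# Minimal sketch of the four REINFORCE steps for a categorical random variable X
# whose distribution p_theta is a softmax over the logits theta. The reward vector
# r and all names are illustrative, not part of the assignment code.
rng = np.random.default_rng(0)
n_values = 5
theta = np.zeros(n_values)          # parameters of p_theta
r = rng.uniform(size=n_values)      # fixed rewards; the learner only ever observes r[x]
lr = 0.1                            # learning rate

for step in range(2000):
    p = np.exp(theta - theta.max())
    p /= p.sum()                    # p_theta(x): softmax over theta
    x = rng.choice(n_values, p=p)   # step 1: draw a sample x of X
    eligibility = -p                # step 2: grad_theta log p_theta(x) = onehot(x) - p
    eligibility[x] += 1.0
    reward = r[x]                   # step 3: observe the reward r(x)
    theta += lr * reward * eligibility  # step 4: Delta theta proportional to r(x) * eligibility

print(np.argmax(p), np.argmax(r))   # after training these indices should usually agree
```

Over many samples the expected update equals the gradient of the expected reward, so the distribution concentrates on high-reward values.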

As a special case, suppose that \(X=(S,A)\) is a state-action pair taking on values \(x=(s,a)\). The reward \(r(S,A)\) is a function of both state and action, and the probability distribution of state-action pairs is

\[\prob[X=x] = \prob[S=s, A=a] = \prob[S=s]\prob[A=a|S=s]\]

We interpret the probability \(\prob[S=s]\) as describing the states of the world. The conditional probability \(\prob[A=a|S=s]\) describes a stochastic machine that takes the state \(S\) of the world as input and produces action \(A\) as output. So the eligibility is

\[\nabla_\theta\log\prob[X=x] = \nabla_\theta\log\prob[S=s] + \nabla_\theta\log\prob[A=a|S=s]\]

The first term in the sum vanishes because the parameter \(\theta\) does not affect the world. Therefore we can be ignorant of \(\prob[S=s]\), which is fortunate given that the world is complex. Only the second term in the sum matters, because \(\theta\) only affects the stochastic machine. So the eligibility reduces to

\[\nabla_\theta\log\prob[X=x] = \nabla_\theta\log\prob[A=a|S=s]\]
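
For example, if the stochastic machine computes \(\prob[A=a|S=s]\) with a softmax over logits \(z_k(s;\theta)\), then by the chain rule

\[\nabla_\theta\log\prob[A=a|S=s] = \sum_k \left(\delta_{ak} - \prob[A=k|S=s]\right)\nabla_\theta z_k(s;\theta)\]

so the eligibility is obtained by backpropagating the error \(\delta_{ak} - \prob[A=k|S=s]\) through the network. We return to this point in the REINFORCE-backprop section below.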

We’ve only considered the case in which successive states of the world are drawn independently at random. It turns out that REINFORCE generalizes easily to control of a Markov decision process, in which states of the world have temporal dependencies. It also generalizes to partially observable Markov decision processes, in which the stochastic machine has incomplete information about the state of the world.

REINFORCE-backprop

Suppose that the stochastic machine is a neural net with a softmax output. The parameter vector \(\theta\) then refers to the weights and biases. Note that \(\log\prob[A=a|S=s]\) is the same (up to a minus sign) as the loss function used for neural net training, assuming that the desired output is \(a\) (see the softmax exercise in Problem Set 1). So \(\nabla_\theta\log\prob[A=a|S=s]\) is the same as the backprop update computed as if the desired output were \(a\). The minus sign is harmless because here we are performing stochastic gradient ascent rather than stochastic gradient descent, so the usual (descent) backprop update on the loss already points in the ascent direction for \(\log\prob[A=a|S=s]\). Therefore we can combine REINFORCE and backprop in the following algorithm:

  • Draw a sample \(s\) of the random variable \(S\) (observe the world).
  • Feed \(s\) as input to a neural network with a softmax output.
  • Choose a random action \(a\) by sampling from the softmax probability distribution.
  • Pretend that the chosen action \(a\) is the correct one, and compute the regular backprop updates for the weights and biases.
  • Observe the reward \(r\).
  • Before applying the incremental updates to the weights and biases, multiply the updates by the reward \(r\).
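
The loop above can also be written compactly when the backprop step is delegated to an automatic differentiation library; the sketch below (using PyTorch) is only an illustration under my own choices of network size, optimizer, and names, and is not the assignment's starter code.

```python
import torch
from torch import nn
from torch.distributions import Categorical

# Illustrative sketch only: a stochastic machine (neural net with softmax output)
# trained by REINFORCE-backprop. Layer sizes, optimizer, and the observe_reward
# placeholder are my own assumptions, not part of the assignment.
policy = nn.Sequential(nn.Linear(4, 32), nn.Tanh(), nn.Linear(32, 2))  # action logits
optimizer = torch.optim.SGD(policy.parameters(), lr=1e-2)

def reinforce_backprop_step(state, observe_reward):
    logits = policy(state)              # feed the observed state s into the network
    dist = Categorical(logits=logits)   # softmax probability distribution over actions
    action = dist.sample()              # choose a random action a from the softmax
    reward = observe_reward(action)     # observe the reward r
    # Pretend the chosen action is the correct output: the usual training loss is
    # -log p(a|s), and scaling it by the reward makes backprop produce r * eligibility.
    loss = -reward * dist.log_prob(action)
    optimizer.zero_grad()
    loss.backward()                     # backprop computes the eligibility
    optimizer.step()                    # apply Delta theta proportional to r * eligibility
    return reward
```

Scaling the log-likelihood loss by the observed reward before backprop is exactly the last step of the list above.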

Theoretical Exercises

  1. Prove that Eq. (\ref{eq:ThetaUpdate}) is stochastic gradient ascent on the expected reward of Eq. (\ref{eq:ExpectedReward}). In other words, show that the expectation value of the right hand side of Eq. (\ref{eq:ThetaUpdate}) is equal to \(\nabla_\theta\langle r(X)\rangle\).

  2. Prove that the eligibility of Eq. (\ref{eq:Eligibility}) has zero mean. This means that the eligibility is like a “noise” that is intrinsic to the system. Note that statisticians refer to the eligibility as the score function, which is used to define Fisher information.

  3. Prove that the gradient of the expected reward is the covariance of the reward and the eligibility.

  4. Consider a multilayer perceptron with a single hidden layer (two layers of connections) and a final softmax output. Write out the REINFORCE-backprop algorithm in full.

Programming Exercises

Now we will implement variants of REINFORCE in several different contexts.

Throughout this section, we recommend Sergey Levine’s deep reinforcement learning course at Berkeley as a reference that covers these topics in more detail.

Bandit

First, we will experiment with a simple bandit environment.

  • Download bandit.ipynb and implement REINFORCE for a stochastic multi-armed bandit where indicated. When implemented correctly, the policy should converge to selecting the best arm almost every time (a lower learning rate and more iterations would make this every time). Show the output of your policy in your submission.

  • Now change the number of arms from n_arms = 5 to n_arms = 1000, increase the number of iterations to 10,000 or 100,000, and rerun a few times. Does the policy still converge to the best arm? If not, what do you think is happening to prevent this?

Ultimately, bandit problems are not very interesting, and they are hard to solve. To quote Wikipedia’s multi-armed bandit article, courtesy of Alex Irpan’s excellent blog post on the challenges of RL:

Originally considered by Allied scientists in World War II, it proved so intractable that, according to Peter Whittle, the problem was proposed to be dropped over Germany so that German scientists could also waste their time on it.

Therefore, we will next look at the much more tractable problem of solving Markov decision processes via policy gradient methods. For your own satisfaction (no need to turn in), explain the differences between a multi-armed bandit and a Markov decision process, two classes of problems solvable via reinforcement learning.

OpenAI Gym

Next, we will implement the finite time horizon version of the REINFORCE algorithm on a simple control problem with a continuous state space, part of the OpenAI Gym.

The environment we will be exploring is CartPole-v0, in which our RL policy will learn to control a cart that moves along a one-dimensional track and supports an upright pole. Starting from the initial state \(s_0\), the pole gradually falls over unless the cart moves to balance it.

The agent’s job is to keep this from happening for as long as possible (up to some maximum episode length \(T\)) by moving the cart to keep the pole balanced. At every timestep the cart can move either left or right, and the agent receives a reward of 1 for each timestep in which the pole is still upright, and a reward of 0 once it falls.

The episode ends when the pole passes the point of no return for falling down, so the maximum attainable episode reward is \(\max(R_{\text{episode}}) = T\). Episodes end at \(T=200\) timesteps if not terminated earlier by the pole falling.

This task is fundamentally trivial, but makes for an easy way to verify that our policy is working, since training takes less than a minute and will always succeed if everything is set up correctly.

Starter code can be found in reinforce.ipynb. A very simple neural network policy \(\pi_{\theta}\) with one hidden layer has been defined for you, along with code to load and run the Gym environment. Currently, it just samples a random action, which is not a very good policy.
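
For reference, the basic interaction loop with a Gym environment looks roughly like the sketch below. This is independent of the starter notebook; the reset/step signatures follow the classic gym API used by CartPole-v0 and may differ in newer gym/gymnasium releases.

```python
import gym

# Random-rollout sketch of the classic Gym loop (independent of reinforce.ipynb).
env = gym.make("CartPole-v0")
observation = env.reset()      # initial state s_0
episode_reward = 0.0
done = False
while not done:
    action = env.action_space.sample()                  # random action: the policy to replace
    observation, reward, done, info = env.step(action)
    episode_reward += reward                            # +1 for each timestep the pole stays up
print(episode_reward)          # a random policy typically survives only ~20 timesteps
```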

  • You will need to complete the following:
    1. Fill in the compute_eligibility function to compute \(e_\theta(x_i) = \nabla_\theta\log p_\theta(x_i)\) for the \((s_i,a_i)\) pair \(x_i\) at timestep \(i\).
    2. Fill in the propagate_reward function to compute \(R(x_i) = \displaystyle\sum_{t=i}^{T}\gamma^{t-i}r(x_t)\), the discounted cumulative reward over all timesteps \(t \geq i\), by propagating rewards backwards to earlier timesteps (see the sketch after this list).
    3. Fill in the REINFORCE function to update the weights given \(e_{\theta}(x_i)\) and \(R(x_i)\) by computing \(\Delta\theta \propto R(x_i)e_{\theta}(x_i)\).
    4. Fill in the train function to train \(\pi_{\theta}\) using the sampled pairs \((e_\theta(x_i), R(x_i))\) for \(i = 0, 1, \dots, T\). Use the above functions as appropriate to do this.
  • Implement and test the above, and make sure your agent learns to solve CartPole before moving on (the episode reward should be 200 every episode once trained). Show the plot of your reward over time.
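
As referenced above, the recurrence behind propagating rewards backwards can be sketched generically as follows; the notebook's propagate_reward may expect different inputs and outputs, so treat this only as an illustration of \(R_i = r_i + \gamma R_{i+1}\).

```python
import numpy as np

# Generic sketch of a backward pass turning per-timestep rewards r(x_t) into
# discounted returns R(x_i) = sum_{t >= i} gamma^(t-i) r(x_t). The function name
# and signature are illustrative, not necessarily those expected by the notebook.
def discounted_returns(rewards, gamma=0.99):
    returns = np.zeros(len(rewards))
    running = 0.0
    for i in reversed(range(len(rewards))):
        running = rewards[i] + gamma * running  # R_i = r_i + gamma * R_{i+1}
        returns[i] = running
    return returns

print(discounted_returns([1.0, 1.0, 1.0], gamma=0.5))  # [1.75, 1.5, 1.0]
```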

  • Next, you will extend REINFORCE to implement the more popular actor-critic algorithm (a small sketch of the advantage computation follows this list).
    1. Add a value function \(V(s)\) as a branch of the policy network, which tries to predict \(\langle R(s_t)\rangle\), the expected reward we will get when in state \(s_t\).
    2. Modify the observed reward during training by computing an advantage \(A(x_t) = R(x_t) - V(s_{t})\) and using it in place of the accumulated time discounted reward \(R(x_t)\).
    3. Train the value function using the L2 loss \(L = \frac{1}{2}\|V(s_t) - R(x_t)\|^2\). Thus, the value function’s objective is to predict the observed future rewards, which we then subtract as a baseline from the observed returns to reduce the variance of our gradients compared to using \(R(x_t)\) directly.
  • Once you’ve got that set up, test it and show your reward plot, as well as a plot of \(A(x_t)\) over time during an episode after training has converged. The number of epochs taken to solve CartPole with actor-critic should be comparable to or better than the epochs taken with just REINFORCE.
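
For reference, the advantage computation mentioned in this list amounts to the sketch below; the array names are my own and the notebook's structure may differ.

```python
import numpy as np

# Illustrative sketch of the actor-critic advantage and value loss. `values` holds
# the critic's predictions V(s_t) and `returns` the discounted returns R(x_t).
def advantages_and_value_loss(returns, values):
    advantages = returns - values                         # A(x_t) = R(x_t) - V(s_t)
    value_loss = 0.5 * np.mean((values - returns) ** 2)   # mean of L = 1/2 (V(s_t) - R(x_t))^2
    return advantages, value_loss

# The policy update then uses A(x_t) in place of R(x_t):
#   Delta theta  proportional to  A(x_t) * e_theta(x_t)
```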

  • (OPTIONAL EXTRA CREDIT) Now that you’ve implemented everything for CartPole, let’s try a harder environment. Enable the Pong-v0 environment in the code, and train actor-critic on it. You will likely need to adjust the training hyperparameters (learning rate, batch size, maybe switch to a better optimizer than SGD), and will definitely need more episodes to converge. Training will take hours, so start early. Show a plot of your reward over time! Full convergence (beating the bot almost every serve) will take a long time to achieve, and likewise getting reward much better than random at the beginning of training will take a while.

  • (FURTHER INTEREST) We’ve only scratched the surface of RL in this assignment. If you’re interested in learning more, try implementing Q-learning as presented by the classic deep Q-networks paper on the above tasks. The Berkeley course is a good guide here as well.