Problem Set 2

\( \DeclareMathOperator*\argmax{argmax} \)

Due Tuesday, Feb. 25 at 3 pm EST on Gradescope. Please follow the general guidelines regarding homework assignments.

Theoretical exercises

Consider a neural network with the synaptic weight matrix \[ \begin{pmatrix} 0 &1 &1 &1\cr 0 &0 &1 &0\cr 0 &0 &0 &0\cr 0 &1 &1 &0 \end{pmatrix} \] We follow the convention that \(W_{ij}\) denotes the strength of the neuron \(j\rightarrow\) neuron \(i\) connection, and \(0\) strength means there is no connection. Show that this is a feedforward net.
- Show it using a diagram of the network graph.
- Show it by performing a topological sort to the neurons. Your answer should include the sorting permutation, as well as the weight matrix after applying the permutation.
Using LT neurons, construct a multilayer perceptron (MLP) that computes the sum of three binary inputs \(x_{1}\), \(x_{2}\), \(x_{3}\). The two MLP outputs \(y_{1}\) and \(y_{2}\) should represent the two bits of the sum when written as a binary number, i.e., 00, 01, 10, 11.
Backprop for a general feedforward net. In class, we derived the backpropagation algorithm for a multilayer perceptron. Here we will generalize to any feedforward net. Write the forward pass as the iteration of \begin{equation} x_i = f\left(\sum_j W_{ij}x_j + b_i\right) \label{eq:ForwardPass} \end{equation} for \(i=1\) to \(N\), where \(N\) is the number of neurons. Here assume that \(W_{ij}\) is topologically sorted so that it is nonzero only for \(i>j\).
- Write \(\vec{b}\) as a function of \(\vec{x}\). (This is the inverse of the normal mode of network operation, that \(\vec{x}\) is a function of \(\vec{b}\). The inverse function can be interpreted in the following way. Suppose that you have complete control over \(\vec{b}\), and you want to make the activity of the network take on a given value \(\vec{x}\). The output of the \(\vec{x}\to\vec{b}\) function tells you how you should set \(\vec{b}\). The inverse function is mathematically useful for deriving backprop, even if you never use the network in this way. For simplicity assume that the activation function \(f\) is monotone increasing, so that its inverse \(f^{-1}\) is well-defined.)
- Define an online loss function \(e(\vec{x})\). Applying the chain rule to the composition of functions \(\vec{x}\to\vec{b}\to e\) yields \begin{equation} \frac{\partial e}{\partial x_j} = \sum_i \frac{\partial e}{\partial b_i}\frac{\partial b_i}{\partial x_j} \end{equation} Use this to derive the backward pass as the iteration of \begin{equation} \frac{\partial e}{\partial b_j} = f’(f^{-1}(x_j)) \left[\frac{\partial e}{\partial x_j} + \sum_i \frac{\partial e}{\partial b_i} W_{ij}\right] \end{equation} from \(j=N\) to \(1\).
- Suppose that the loss function is \begin{equation} e(\vec{x}) = \frac{1}{2} \sum_{i\in S} (d_i - x_i)^2 \end{equation} where \(S\) is a subset of \(\{1,\ldots,N\}\), and \(d_i\) is the desired output for neuron \(i\). What is \(-\partial e/\partial x_j\)? (One way to answer this is to think about the meaning of the chain rule for \(\vec{x}\to\vec{b}\to e\). You will be able to see how backprop for a multilayer perceptron follows from the above as a special case where \(S\) is the set of neurons in the output layer. Meditate on this for yourself; you only need to submit the calculation for \(-\partial e/\partial x_j\).)

Programming exercises

Classifying MNIST images with a multilayer perceptron. Here is the Colaboratory link: Backprop.ipynb.
- The sample code trains a two-layer perceptron with 10 output neurons. The code displays the first layer weight vectors, the input vector, and a learning curve. Run the code and report the final value of the time-averaged classification error.
- Implement hyperbolic tangent as the activation function. Observe that the output range of the neurons is now different. Run the code to show that it works.
- Implement rectification (ReLU) as the activation function. Run the code to show that it works.
- Implement code to train a three-layer neural network. Don’t use autograd. Use ReLU activation function. Run the code to show that it works.
You may need to fiddle with parameters such as the learning rate parameter eta to get the training to converge. As an optional exercise, modify the last layer by removing the final activation function \(f\) and replacing it by softmax. (Don’t submit this.)
Autograd: We have re-implemented the function that trains a two layer network, this time using PyTorch’s autograd capabilities. There is nothing to turn in for this question. Just examine and run the code to watch the network train. Some things to notice:
- You do not need to specify the derivative of the loss, just the loss function itself.
- The backward pass has been greatly simplified so that now you just call e.backward() to compute the derivatives of e with respect to all Parameters. This populates the attribute .grad of all variables in the computation graph (weights, activations) with the gradient of e with respect to that variable.
- Rather than manually updating weights for gradient descent, you call optim.step() which performs a step of gradient descent for all parameters
- Some additions to the code were required to make this possible:
- The learnable parameters are now torch.nn.Parameters. This tells PyTorch to compute and store their gradients.
- You have to initialize the optimizer by giving it the parameters you want update and the learning rate.
- You have to set the gradients of all variables to zero (by calling optim.zero_grad()) before calling .backward() on every iteration. Pytorch actually adds gradients to the .grad attribute when calling .backward() so we need to zero the gradients.

For a further introduction to PyTorch’s autograd capabilities, check out their website which provides an excellent introduction to autograd in PyTorch.