\( \DeclareMathOperator*{\argmax}{argmax} \)

Due by 11:59PM Thursday night, Feb. 22. Please submit to Blackboard and follow the general guidelines regarding homework assignments.

Theoretical exercises

  1. Consider a neural network with the synaptic weight matrix \[ \begin{pmatrix} 0 &1 &1 &1\cr 0 &0 &1 &0\cr 0 &0 &0 &0\cr 0 &1 &1 &0 \end{pmatrix} \] We follow the convention that \(W_{ij}\) denotes the strength of the \(i\leftarrow j\) connection, and a strength of \(0\) means there is no connection. Show that this is a feedforward net.

    • Show it using a diagram of the network graph.

    • Show it by performing a topological sort. Your answer should include the sorting permutation, as well as the weight matrix after applying the permutation. (A code sketch for checking a candidate permutation appears after this exercise list.)

  2. Using LT neurons, construct a multilayer perceptron (MLP) that computes the sum of three binary inputs \(x_{1}\), \(x_{2}\), \(x_{3}\). The two MLP outputs \(y_{1}\) and \(y_{2}\) should represent the two bits of the sum when written as a binary number, i.e., 00, 01, 10, or 11. (A brute-force checker for such a construction is sketched after this exercise list.)

  3. Backprop for a general feedforward net. In class, we derived the backpropagation algorithm for a multilayer perceptron. Here we will generalize to any feedforward net. Write the forward pass as the iteration of \begin{equation} x_i = f\left(\sum_j W_{ij}x_j + b_i\right) \label{eq:ForwardPass} \end{equation} for \(i=1\) to \(N\), where \(N\) is the number of neurons. Here assume that \(W_{ij}\) is topologically sorted so that it is nonzero only for \(i>j\).

    • Write \(\vec{b}\) as a function of \(\vec{x}\). (This is the inverse of the normal mode of network operation, in which \(\vec{x}\) is a function of \(\vec{b}\). The inverse function can be interpreted in the following way: suppose that you have complete control over \(\vec{b}\), and you want to make the activity of the network take on a given value \(\vec{x}\); the output of the \(\vec{x}\to\vec{b}\) function tells you how to set \(\vec{b}\). The inverse function is mathematically useful for deriving backprop, even if you never use the network in this way. For simplicity, assume that the activation function \(f\) is monotone increasing, so that its inverse \(f^{-1}\) is well-defined.)

    • Define an online loss function \(e(\vec{x})\). Applying the chain rule to the composition of functions \(\vec{x}\to\vec{b}\to e\) yields \begin{equation} \frac{\partial e}{\partial x_j} = \sum_i \frac{\partial e}{\partial b_i}\frac{\partial b_i}{\partial x_j} \end{equation} Use this to derive the backward pass as the iteration of \begin{equation} \frac{\partial e}{\partial b_j} = f'(f^{-1}(x_j)) \left[\frac{\partial e}{\partial x_j} + \sum_i \frac{\partial e}{\partial b_i} W_{ij}\right] \end{equation} from \(j=N\) to \(1\).

    • Suppose that the loss function is \begin{equation} e(\vec{x}) = \frac{1}{2} \sum_{i\in S} (d_i - x_i)^2 \end{equation} where \(S\) is a subset of \(\{1,\ldots,N\}\). What is \(-\partial e/\partial x_j\)? (To answer correctly, you will have to think about the meaning of the chain rule for \(\vec{x}\to\vec{b}\to e\). If your answer is correct, you will be able to see how backprop for a multilayer perceptron follows from the above as a special case where \(S\) is the set of neurons in the output layer; for this you will need the sensitivity lemma \(\partial e/\partial W_{ij} = \left(\partial e/\partial b_i\right)x_j\). Meditate on this for yourself; you don’t need to submit anything.) A numerical sketch of these forward and backward passes appears after this exercise list.
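
For exercise 1, the following is a minimal sketch of one way to check a candidate topological sort: permute the rows and columns of the weight matrix by the same permutation, and verify that the result is strictly lower triangular, i.e., nonzero only for \(i>j\). The matrix is copied from the exercise; the permutation perm is an identity placeholder (which prints False) and should be replaced by your own answer. This is only a check, not a substitute for the diagram or the derivation you are asked to submit.

```python
import numpy as np

# Weight matrix from exercise 1; W[i, j] is the strength of the i <- j connection (0-based indices).
W = np.array([[0, 1, 1, 1],
              [0, 0, 1, 0],
              [0, 0, 0, 0],
              [0, 1, 1, 0]])

def is_topologically_sorted(W, perm):
    """After relabeling the neurons by `perm`, a feedforward net should have
    nonzero weights only for i > j, i.e., a strictly lower-triangular matrix."""
    P = W[np.ix_(perm, perm)]            # apply the same permutation to rows and columns
    return bool(np.all(np.triu(P) == 0)), P

perm = [0, 1, 2, 3]                      # placeholder; replace with your sorting permutation
ok, W_sorted = is_topologically_sorted(W, perm)
print(ok)
print(W_sorted)
```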
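
For exercise 2, here is a minimal brute-force checker. It assumes the convention that an LT (linear threshold) neuron outputs 1 when its weighted input sum reaches its threshold and 0 otherwise, and that \(y_1\) is the high bit and \(y_2\) the low bit of the sum; adjust if your conventions differ. The weights, thresholds, and layer sizes below are placeholders, not a solution: substitute your own construction and confirm that all eight input patterns map to the correct two-bit sum.

```python
import itertools
import numpy as np

def lt_layer(W, theta, x):
    """A layer of LT neurons: output 1 where the weighted input sum reaches the threshold."""
    return (W @ x >= theta).astype(int)

# Placeholder parameters -- all zeros, NOT a solution. Replace with your construction.
# Shapes are assumptions: here 3 hidden LT neurons feeding 2 output LT neurons.
W_hidden = np.zeros((3, 3))
theta_hidden = np.zeros(3)
W_out = np.zeros((2, 3))
theta_out = np.zeros(2)

def mlp(x):
    h = lt_layer(W_hidden, theta_hidden, x)
    return lt_layer(W_out, theta_out, h)       # (y1, y2)

# Brute-force check against the two-bit binary sum of the three inputs.
for bits in itertools.product([0, 1], repeat=3):
    x = np.array(bits)
    s = int(x.sum())
    target = (s >> 1, s & 1)                   # assuming y1 is the high bit, y2 the low bit
    print(bits, tuple(int(v) for v in mlp(x)), "target:", target)
```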
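
For exercise 3, the stated forward and backward passes translate directly into code. The sketch below is not the derivation you are asked to submit; it assumes \(f=\tanh\) (for which \(f'(f^{-1}(x)) = 1 - x^2\)), a random topologically sorted weight matrix, and the quadratic loss on an arbitrary subset \(S\) of neurons, and it applies the sensitivity lemma at the end to obtain the weight gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 6

# Topologically sorted weights: W[i, j] nonzero only for i > j (strictly lower triangular).
W = np.tril(rng.normal(size=(N, N)), k=-1)
b = rng.normal(size=N)

f = np.tanh            # assumed activation; for tanh, f'(f^{-1}(x)) = 1 - x**2

# Forward pass: x_i = f(sum_j W_ij x_j + b_i), for i = 1 to N.
x = np.zeros(N)
for i in range(N):
    x[i] = f(W[i] @ x + b[i])

# Quadratic loss e = 0.5 * sum_{i in S} (d_i - x_i)^2 over a subset S of neurons.
S = [N - 2, N - 1]                 # e.g., treat the last two neurons as outputs
d = np.zeros(N)
d[S] = 1.0
de_dx = np.zeros(N)
de_dx[S] = x[S] - d[S]             # explicit partial derivative of e with respect to x_j

# Backward pass: de/db_j = f'(f^{-1}(x_j)) * [de/dx_j + sum_i de/db_i W_ij], from j = N to 1.
de_db = np.zeros(N)
for j in reversed(range(N)):
    de_db[j] = (1.0 - x[j] ** 2) * (de_dx[j] + de_db @ W[:, j])

# Sensitivity lemma: de/dW_ij = (de/db_i) * x_j.
de_dW = np.outer(de_db, x)
print(de_db)
print(de_dW)
```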

Programming exercises

  1. Classifying MNIST images with a multilayer perceptron. Here is the Colaboratory link: Backprop.ipynb.
    • The sample code trains a two-layer perceptron with 10 output neurons. The code displays the first layer weight vectors, the input vector, and a learning curve. Run the code and report the final value of the time-averaged classification error.

    • Implement hyperbolic tangent as the activation function. Observe that the output range of the neurons is now different. Run the code to show that it works.

    • Implement rectification (ReLU) as the activation function. Run the code to show that it works.

    • Implement code to train a three-layer neural network. Use the ReLU activation function. Run the code to show that it works.

    You may need to fiddle with parameters such as the learning rate eta to get the training to converge. As an optional exercise, modify the last layer by removing the final activation function \(f\) and replacing it with softmax. (Don’t submit this.) A sketch of the tanh and ReLU activations and the optional softmax appears below.
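
Below is a minimal sketch of the activation functions mentioned above, with derivatives written in terms of the stored output (the form a backprop implementation typically needs), plus the optional softmax for the last layer. The function names and interfaces are assumptions for illustration, not the notebook’s actual code.

```python
import numpy as np

# Hyperbolic tangent: output range (-1, 1); derivative in terms of the output y = tanh(u).
def tanh(u):
    return np.tanh(u)

def tanh_prime_from_output(y):
    return 1.0 - y ** 2

# Rectification (ReLU): outputs in [0, inf); derivative in terms of the output y = max(0, u).
def relu(u):
    return np.maximum(0.0, u)

def relu_prime_from_output(y):
    return (y > 0).astype(float)

# Optional: softmax in place of the final activation f (numerically stabilized).
def softmax(u):
    z = np.exp(u - np.max(u))
    return z / np.sum(z)
```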