\( \DeclareMathOperator*\argmax{argmax} \)

Due Tuesday, Feb. 18 at 3 pm EST on Gradescope. Please follow the general guidelines regarding homework assignments.

Theoretical exercises

  1. Let \(x_{1},\dots,x_{N}\) be \(N\) boolean variables, taking on the values 0 or 1.
    • Consider the conjunction \(\bar{x}_{1}\wedge\bar{x}_{2}\cdots\wedge\bar{x}_{n}\wedge x_{n+1}\wedge\cdots\wedge x_{N}\) in which the first \(n\) variables are negated. Express this function as an LT neuron \(H\left(\sum_{i}w_{i}x_{i}-\theta\right)\) with weights \(w_{i}\) and threshold \(\theta\), where \(H\) is the Heaviside step function. Note that conjunction, or AND, is defined by \(x_{1}\wedge\cdots\wedge x_{N}=1\text{ if and only if }x_{i}=1\text{ for all }i\)

    • Do the same for the disjunction \(\bar{x}_{1}\vee\bar{x}_{2}\cdots\vee\bar{x}_{n}\vee x_{n+1}\vee\cdots\vee x_{N}\). Note that disjunction, or OR, is defined by \(x_{1}\vee\cdots\vee x_{N}=1\text{ if and only if }x_{i}=1\text{ for some }i\)

  2. Logistic regression. The delta rule was first introduced in class using the LT neuron, which has binary output. Then we learned that variants of the delta rule can be derived for a model neuron with a smooth, analog activation function \(f\). The form of the learning rule depends on the loss function is used to measure the deviation between desired and actual outputs. If the desired output \(y\) is binary (0 or 1), and the activation function ranges from 0 to 1, then some people like to interpret the output of the model neuron as the probability of a biased coin. These probability-lovers like to use the “log loss”, \[ e\left(\vec{w},\vec{x},y\right)=-y\log f\left(\vec{w}\cdot\vec{x}\right)-\left(1-y\right)\log\left[1-f\left(\vec{w}\cdot\vec{x}\right)\right] \]
    • Derive the version of the delta rule \((\Delta\vec{w}=?)\) that results from online gradient descent using the above “log loss”.
    • Specialize to the case where the activation function is given by the logistic function \[ f(u)=\frac{1}{1+e^{-u}}. \] You will get a simple result.
    • How does this differ from the version of the delta rule derived in class from the squared error loss function?

    You’ve derived an algorithm for logistic regression. Statisticians use logistic regression when outputs are binary, and linear regression when outputs are analog.

  3. Delta rule for a layer of multiple neurons. Suppose that there are \(k\) LT neurons, which all share the same \(n\) inputs. The \(k\) weight vectors of the neurons are the rows of a \(k\times n\) matrix \(W\). The desired output \(\vec{y}\) and bias \(\vec{b}\) are \(k\times 1\) vectors, and the input \(\vec{x}\) is an \(n\times 1\) vector. The goal of supervised learning is to find \(W\) and \(\vec{b}\) such that this approximation holds: \[ \vec{y}\approx H\left(W\vec{x}+\vec{b}\right). \] Here the Heaviside step function \(H\) has been extended to take a vector argument, simply by applying it to each component of the vector. We could write \(k\) separate delta rules, one for each of the LT neurons, or we could alternatively write it as a single matrix-vector equation. Write the delta rule for both weights and biases in matrix-vector format.

  4. Multiclass training with softmax. We could train multiple LT neurons with separate delta rules to classify digits. Alternatively, there are various multiclass training methods. For example, define the softmax function from \(R^{n}\) to \(R^{n}\) by \[ f_{i}(u_{1},\ldots,u_{n})=\frac{e^{u_{i}}}{\sum_{j=1}^{n}e^{u_{j}}} \] If we scale the magnitude of \(\vec{u}\) to infinity, i.e., \(\lambda\vec{u}\) with \(\lambda\to\infty\), then \(f_i\to 1\) for the \(i\) such that \(u_i\) is maximal, and \(f_i\to 0\) for all other \(i\). This is the reason for the name softmax.
    • Show that the gradient of the softmax function is: \[ \frac{\partial f_{i}}{\partial u_{j}}=f_{i}\delta_{ij}-f_{i}f_{j} \] where \(\delta_{ij}\) is the Kronecker delta, \(\delta = \begin{cases}0 &\text{if}~ i\neq j, \\ 1 &\text{if}~ i=j.\end{cases}\)
    • If any input vector \(\vec{x}\) falls into \(1\) of \(n\) classes, we can represent the correct class for any input vector through an \(n\)-dimensional output vector with one-hot or unary encoding: \[ y_{i} = \begin{cases} 1, & i=\text{correct class}\cr 0, & \text{otherwise} \end{cases} \] Suppose we would like to train \(n\) weight vectors \(\vec{w}_{1},\ldots,\vec{w}_{n}\) so that \[ y_{i}\approx f_{i}(\vec{w}_1\cdot\vec{x},\ldots,\vec{w}_n\cdot\vec{x}) \] Derive online gradient descent for the cost function \[ e(\vec{w}_1,\ldots,\vec{w}_n,\vec{x},y_1,\ldots y_n) = - \sum_i y_i\log f_i(\vec{w}_1\cdot\vec{x},\ldots,\vec{w}_n\cdot\vec{x}) \] This can be viewed as a generalization of logistic regression.

Programming exercises

The MNIST dataset was created at AT&T Bell Labs in the 1990s by selecting from and modifying a larger dataset from NIST. It consists of images of handwritten digits, and was heavily used in machine learning research in the 1990s and 2000s. Documentation about the dataset can be found at Yann LeCun’s personal website.

Here are the Colaboratory links for this assignment.

Follow the instructions in Using Colaboratory to get started.

  1. Intro to MNIST
    • Select MNIST-Intro.ipynb. The notebook should open in your browser.

    • Execute the code in the notebook cells to visualize the MNIST images and their statistical properties. (No work to submit here.)

    • Write code to compute the pixelwise standard deviation of the images in each digit class. (You can start from the provided code for the pixelwise mean of the images in each digit class. Use torch.std to compute standard deviation.)

    • For each digit class, visualize the standard deviation as an image. Use the provided montage function.

    • Explain why the results look the way they do. Write your answer in a notebook cell. You can use Markdown to format your answer if you select Cell > Cell Type > Markdown for the relevant cell.

  2. Classifying MNIST images with an LT neuron.
    • Select DeltaRule.ipynb.

    • The sample code trains an LT neuron to be activated by images of “two” while remaining inactive for images of other digits. When you run the code, you should see time-varying images of the weight vector, and the input vector. You will also see a learning curve, defined as the cumulative time average of the classification error versus time. What is the final value of this time-averaged classification error?

    • In class we derived the loss function for the LT neuron delta rule. Modify the code so that it additionally plots another learning curve, the cumulative time average of the loss function versus time.

    • After training a model in machine learning, you’ll want to know whether your results are good or bad. It’s often helpful to compare with the performance of simple “baseline” models. For example, a simple baseline here is the trivial recognition algorithm that always returns an output of 0. What would be the classification error for this algorithm?

    • Compute the classification error in the test set (the last cell of the notebook shows how to set up the test set).

  3. Train 10 model neurons to detect the 10 digit classes by applying the multiclass softmax algorithm that you derived above. The predicted class can be defined as the neuron for which the softmax output is largest. If the prediction disagrees with the true digit class, that constitutes a classification error.

    • Use torch.nn.Softmax to compute the softmax. See below if you are interested in also implementing the softmax function yourself.

    • Your code should plot a learning curve, the cumulative time average of the classification error versus time. What is the final value?

    • Your code should visualize the ten learned weight vectors displayed as images.

    • Add other visualizations that are helpful for monitoring the progress of learning and evaluating the final result.

If you want to implement your own softmax:
Take a look at the function described in Theory problem 4 (please include biases). However, the naive implementation of this can be numerically unstable, returning NaN for relatively large values of \(w\) and \(x\) because of the term \(e^{wx+b}\). To overcome this, first observe that if you subtract a constant from every input element to the softmax function, the output remains unchanged: \[ f_{i}(u_{1},\ldots,u_{n})=\frac{e^{u_{i}}}{\sum_{j=1}^{n}e^{u_{j}}} = \frac{e^{u_{i}-\lambda}}{\sum_{j=1}^{n}e^{u_{j}-\lambda}} \] If you choose \(\lambda = \max_i{u_{i}}\), then the maximum argument possible to \(\exp\{\cdot\}\) is 0. This should prevent NaNs in the softmax.