Problem Set 1
\( \DeclareMathOperator*\argmax{argmax} \renewcommand{\vec}[1]{{\bf #1}} \)
Due Friday, Feb. 9 at 3 pm EST on Gradescope. Please follow the general guidelines regarding homework assignments.
1. Conjunctions and disjunctions of boolean variables and their negations
Let \(x_{1},\ldots,x_{N}\) be \(N\) boolean variables, taking on the values 0 or 1.
- Consider the conjunction \(\bar{x}_{1}\wedge\bar{x}_{2}\wedge\cdots\wedge\bar{x}_{n}\wedge x_{n+1}\wedge\cdots\wedge x_{N}\) in which the first \(n\) variables are negated. Express this function as a single neuron \(H\left(\sum_{i}w_{i}x_{i}-\theta\right)\) with weights \(w_{i}\) and threshold \(\theta\), where \(H\) is the Heaviside step function. Note that conjunction, or AND, is defined by \(x_{1}\wedge\cdots\wedge x_{N}=1\text{ if and only if }x_{i}=1\text{ for all }i\). (A small numerical sketch of this single-neuron form, for the plain two-variable AND, follows this list.)
- Do the same for the disjunction \(\bar{x}_{1}\vee\bar{x}_{2}\vee\cdots\vee\bar{x}_{n}\vee x_{n+1}\vee\cdots\vee x_{N}\). Note that disjunction, or OR, is defined by \(x_{1}\vee\cdots\vee x_{N}=1\text{ if and only if }x_{i}=1\text{ for some }i\).
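To make the single-neuron form concrete, here is a minimal sketch (not the answer to the exercise) checking that the plain two-variable AND \(x_1\wedge x_2\), with no negations, is computed by \(H(x_1+x_2-1.5)\):

```python
import numpy as np

def H(z):
    # Heaviside step function: 1 if z >= 0, else 0
    return int(z >= 0)

# Plain two-variable AND (no negations): weights (1, 1), threshold 1.5
w = np.array([1.0, 1.0])
theta = 1.5

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, H(w @ np.array(x) - theta))   # outputs 1 only for (1, 1)
```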
2. Delta rule for \(k\) neurons independently performing binary classification tasks.
Suppose that there are \(k\) neurons, which all share the same \(n\) inputs. The input \(\vec{x}\) is an \(n\)-dimensional row vector, and the \(k\) weight vectors of the neurons are the rows of a \(k\times n\) matrix \(W\). The desired output \(\vec{y}\) and bias \(\vec{b}\) are \(k\)-dimensional row vectors. The goal of supervised learning is to find \(W\) and \(\vec{b}\) such that this approximation holds:
\begin{equation}
\vec{y}\approx \sigma\left(\vec{x}W^\top+\vec{b}\right).
\end{equation}
The matrix \(W^\top\) is \(n\times k\), and is the transpose of the matrix \(W\). Here we are following the notation conventions of nn.Linear: (1) row vectors, (2) vector times matrix because we are using row vectors, and (3) transpose of \(W\) so that the weight vector of each neuron is also a row vector. The sigmoid function has been extended to take a vector argument, simply by applying it to each component of the vector.
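A minimal sketch of these nn.Linear conventions, just to pin down the shapes (the numbers here are arbitrary):

```python
import torch
import torch.nn as nn

n, k = 5, 3                        # number of inputs and number of neurons
layer = nn.Linear(n, k)            # stores W with shape (k, n) and b with shape (k,)

x = torch.randn(1, n)              # a single input as a row vector
out = torch.sigmoid(layer(x))      # same as torch.sigmoid(x @ layer.weight.T + layer.bias)

print(layer.weight.shape)          # torch.Size([3, 5])
print(layer.bias.shape)            # torch.Size([3])
print(out.shape)                   # torch.Size([1, 3])
```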
No proofs or derivations are needed in this problem, since we already derived the delta rule from gradient descent in class. Just write down the answers to the following.
- Write the delta rule for the update \(\Delta W_{ij}\). Explicitly write out all indices of the matrix \(W\) and the vectors \(\vec{y}\), \(\vec{x}\), \(\vec{b}\).
- Write an equivalent but more compact version using matrices and vectors, but no indices. Hint: you will need to use the outer product.
3. Delta rule for softmax neurons.
In the previous problem, neuron \(i\) was trained to estimate the probability \(p_i\) that the input belongs to the \(i\)th class. The neurons functioned independently, so there was no guarantee that \(\sum_i p_i = 1\). Alternatively, one could define the output of neuron \(i\) to be \begin{equation} f_{i}(\vec{w}_1\cdot\vec{x},\ldots,\vec{w}_k\cdot\vec{x}) \end{equation} where \(f_i\) is the \(i\)th element of the softmax function defined in Problem Set 0. Now the output of neuron \(i\) depends on the weight vectors of all the neurons; their computations are no longer independent. Because of this interaction, their probability estimates sum to one.
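For reference, a minimal numerical sketch of the softmax nonlinearity (see Problem Set 0 for the exact definition used in the course), illustrating that its outputs are positive and sum to one:

```python
import numpy as np

def softmax(a):
    # Subtracting the maximum is a standard numerical-stability trick;
    # it does not change the result.
    e = np.exp(a - np.max(a))
    return e / e.sum()

a = np.array([2.0, -1.0, 0.5])     # stand-ins for the activations w_i . x
p = softmax(a)
print(p, p.sum())                  # components in (0, 1), summing to 1.0
```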
We represent the correct class for any input vector through a \(k\)-dimensional output vector with one-hot or unary encoding: \begin{equation} y_{i} = \begin{cases} 1, & i=\text{correct class}\\ 0, & \text{otherwise} \end{cases} \end{equation}
Derive online gradient descent for the multi-class cross-entropy loss function \begin{equation} l = - \sum_i y_i\log f_i(\vec{w}_1\cdot\vec{x},\ldots,\vec{w}_k\cdot\vec{x}) \end{equation} Your derivation should be short, since you can take the formulas derived in Problem 5 of Problem Set 0 as your starting point. Your derivation should end with the delta rule for softmax. Write one version with indices, and another in matrix-vector format without indices.
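As a numerical sanity check of the loss itself (not of the delta rule you are asked to derive): with a one-hot target, the sum collapses to the negative log of the probability assigned to the correct class.

```python
import numpy as np

f = np.array([0.7, 0.2, 0.1])      # illustrative softmax outputs for k = 3 classes
y = np.array([1, 0, 0])            # one-hot target: class 0 is the correct class

loss = -np.sum(y * np.log(f))
print(loss, -np.log(f[0]))         # identical: only the correct-class term survives
```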
4. Introduction to the MNIST dataset
The MNIST dataset was created at AT&T Bell Labs in the 1990s by selecting from and modifying a larger dataset from NIST. It consists of images of handwritten digits, and was heavily used in machine learning research in the 1990s and 2000s.
- Open the notebook MNIST-Intro.ipynb. It should open in Google Colab in your browser. Follow the instructions in Using Colaboratory to get started.
- Work through the notebook. Execute the code in the cells. To aid your understanding, you can insert your own code cells to inspect variables and so on (see also the data-loading sketch after this list).
- Complete the exercises at the end of the notebook.
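In case you want to explore the data outside the notebook, here is one common way to load MNIST with torchvision (a sketch only; the notebook may load the data differently):

```python
from torchvision import datasets, transforms

# Downloads MNIST to ./data the first time this runs
train_set = datasets.MNIST(root="data", train=True, download=True,
                           transform=transforms.ToTensor())

image, label = train_set[0]
print(len(train_set))              # 60000 training examples
print(image.shape, label)          # torch.Size([1, 28, 28]) and the corresponding digit
```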
5. Binary classification with the delta rule
Here you’ll use the delta rule to train a neuron to be a “two”-detector, i.e., to be activated by images of “two” while remaining inactive for images of other digits.
- Open the notebook DeltaRule.ipynb, and work through it.
- Complete the exercises at the end of the notebook.
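If it helps while working through the notebook, one common way to construct the binary targets for a "two"-detector is sketched below (the variable names are illustrative; the notebook may organize this differently):

```python
import torch

# Suppose `labels` holds the digit labels for a batch of MNIST images.
labels = torch.tensor([2, 7, 2, 0, 9])

# Desired detector output: 1 for images of "two", 0 for all other digits.
targets = (labels == 2).float()
print(targets)                     # tensor([1., 0., 1., 0., 0.])
```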
6. Multi-class classification
MultiClass.ipynb uses the delta rule to train 10 neurons to detect the 10 digit classes in MNIST. In the sample code, the 10 neurons are trained independently. In this problem, you will modify the code to use the delta rule for the softmax nonlinearity.
- Open the notebook and work through it.
- Complete the exercise at the end of the notebook.
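Whichever training rule you use, the predicted class is typically read out by taking the most activated of the 10 neurons. A minimal sketch of this argmax readout and the resulting accuracy (names and numbers are illustrative):

```python
import torch

# Suppose `outputs` holds the 10 neuron activations for a batch of 4 images,
# and `labels` holds the true digit classes.
outputs = torch.rand(4, 10)
labels = torch.tensor([3, 1, 4, 1])

predictions = outputs.argmax(dim=1)              # index of the most activated neuron
accuracy = (predictions == labels).float().mean()
print(predictions, accuracy)
```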