\[\DeclareMathOperator*\trace{Tr} \DeclareMathOperator*\argmax{argmax} \DeclareMathOperator*\argmin{argmin}\]

Due at 3pm on Tuesday April 14th. Please submit to Gradescope and follow the general guidelines regarding homework assignments.

Theoretical exercises

Please submit all theory exercises in one PDF.

  1. In the competitive learning algorithm, a population of \(k\) neurons learns from a series of input vectors. For each input \(\vec{x}\), the \(k\) neurons compete to be closest, \[ \begin{equation}\label{eq:Competition} \forall a=1,\ldots,k, \quad y_a = \begin{cases} 1, \text{ if } a = \argmin_b |\vec{x}-\vec{w}_b|\cr 0, \text{ otherwise} \end{cases}, \end{equation} \] where \(|\cdot|\) is the Euclidean norm. The winning neuron is active and the rest are silent. Then the weight vectors are updated via the Hebb rule and weight decay, \[ \begin{equation}\label{eq:HebbRuleWeightDecay} \forall b=1,\ldots,k, \quad\Delta \vec{w}_b \propto y_b(\vec{x}-\vec{w}_b). \end{equation} \] Note that the weight decay is activity-dependent, happening only for the neuron that is active.

    • Show that Eqs. (\ref{eq:Competition}) and (\ref{eq:HebbRuleWeightDecay}) are equivalent to performing $$ \begin{align*} a &= \argmin_b |\vec{x}-\vec{w}_b|\cr \Delta \vec{w}_a &\propto \vec{x}-\vec{w}_a \end{align*} $$ after each input vector \(\vec{x}\). These equations are simpler than Eqs. (\ref{eq:Competition}) and (\ref{eq:HebbRuleWeightDecay}), but postsynaptic neural activity \(y_a\) does not appear explicitly in them, so the relation to Hebbian plasticity is not manifest.

    • Show that the competitive learning update is gradient descent on $$ e(\vec{w}_1,\ldots,\vec{w}_k,\vec{x}) = \min_{\vec{y}\in U} \frac{1}{2}\sum_b y_b |\vec{x}-\vec{w}_b|^2 $$ where \(U\) is the set of all "one-hot" binary vectors.

    It follows that competitive learning is online gradient descent on $$ E = \sum_{t=1}^T e(\vec{w}_1,\ldots,\vec{w}_k,\vec{x}_t) = \min_Y\frac{1}{2}\sum_{t=1}^T \sum_{a=1}^k Y_{ta} |\vec{x}_t-\vec{w}_a|^2 $$ where the minimum is over all assignment matrices \(Y\). Inside the minimum is the \(k\)-means cost function.
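
    For concreteness, the constraint sets above can be written out explicitly (this is just a restatement of the definitions, not an extra step to prove): \(U\) is the set of one-hot vectors, and an assignment matrix is a binary matrix whose rows are one-hot, $$ U = \Bigl\{\vec{y}\in\{0,1\}^k : \sum_{b=1}^k y_b = 1\Bigr\}, \qquad Y\in\{0,1\}^{T\times k} \text{ with } \sum_{a=1}^k Y_{ta}=1 \text{ for every } t. $$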

  2. Suppose that you apply competitive learning (or \(k\)-means clustering) with \(k=2\) neurons to the 2D input vectors shown in this diagram: [figure: EllipseClusters]. Competitive learning can converge to different steady states, depending on the initial conditions. Identify two very different steady states that are possible. (Each neuron should have at least one input vector assigned to it.) Which one is a global minimum of the cost function, and which is only a local minimum? Justify your answer.

  3. Let \(\vec{x}_1,\ldots,\vec{x}_T\) be a series of input vectors. Suppose that the input vectors have zero mean, $$ 0 = \frac{1}{T}\sum_{t=1}^T\vec{x}_t. $$ Define the covariance matrix $$ C = \frac{1}{T}\sum_{t=1}^T \vec{x}_t\vec{x}_t^\top. $$ (Don't confuse the transpose symbol \(\top\) with the number \(T\) of input vectors.) The diagonal elements of \(C\) are equal to the variances of the input elements, $$ \sigma_i^2 = C_{ii} = \frac{1}{T}\sum_{t=1}^T X_{it}^2, $$ where \(X_{it}\) is the \(i\)th element of the \(t\)th input vector.

    • The trace of a square matrix is defined as the sum of its diagonal elements. Suppose that the matrix product \(AB\) is a square matrix. Prove that \(\trace AB = \trace BA\). Note that \(A\) and \(B\) need not themselves be square; only their product is required to be square.

    • Show that the sum of the eigenvalues of \(C\) is equal to the sum of the variances \(\sum_i\sigma_i^2\).
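
A minimal numerical sanity check of both statements (not a substitute for the proofs; it assumes NumPy is available):

```python
import numpy as np

rng = np.random.default_rng(0)

# Trace identity: A is 3x5 and B is 5x3, so AB (3x3) and BA (5x5) are both square.
A = rng.standard_normal((3, 5))
B = rng.standard_normal((5, 3))
print(np.isclose(np.trace(A @ B), np.trace(B @ A)))   # expect True

# Sum of eigenvalues of a covariance matrix equals the sum of the variances.
X = rng.standard_normal((4, 1000))            # 4-dimensional inputs, T = 1000
X -= X.mean(axis=1, keepdims=True)            # center, so the zero-mean assumption holds
C = X @ X.T / X.shape[1]                      # covariance matrix
print(np.isclose(np.linalg.eigvalsh(C).sum(), np.diag(C).sum()))   # expect True
```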

  4. Convergence of Oja’s Rule in the average velocity approximation. Consider a linear neuron \(y=\vec{w}\cdot\vec{x}\), where \(\vec{w}\) is the weight vector, \(\vec{x}\) is the input vector, and \(y\) is the output of the neuron. Oja’s Rule is \[ \begin{equation}\label{eq:OjasRule} \Delta\vec{w} \propto y\vec{x} - y^2\vec{w} \end{equation} \]
    • Show that the average velocity approximation for Oja’s Rule is \[ \begin{equation}\label{eq:OjasRuleAverageVelocity} \Delta\vec{w} \propto C\vec{w}-\left(\vec{w}^\top C\vec{w}\right)\vec{w} \end{equation} \] where \(C=\langle\vec{x}\vec{x}^\top\rangle\) is the covariance matrix of the input vectors and \(\langle\cdot\rangle\) denotes an average over the input vectors.

    • Show that a steady state of Eq. (\ref{eq:OjasRuleAverageVelocity}) is an eigenvector of \(C\) normalized to have length one.

    It turns out that the principal eigenvector of \(C\) (the one with the largest eigenvalue) is a stable steady state, while the other eigenvectors are unstable steady states, but you’re not required to prove that here.
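
The stability claim can be checked numerically. Here is a minimal sketch (using an assumed random covariance matrix and step size, neither of which is prescribed by the exercise) that iterates the average-velocity update and compares the result to the principal eigenvector:

```python
import numpy as np

rng = np.random.default_rng(1)

# Build a random symmetric positive semidefinite matrix to stand in for C.
M = rng.standard_normal((5, 5))
C = M @ M.T / 5

# Iterate the average-velocity form of Oja's Rule with a small, assumed step size.
w = rng.standard_normal(5)
eta = 0.05
for _ in range(5000):
    w += eta * (C @ w - (w @ C @ w) * w)

# Compare with the principal eigenvector of C (eigenvectors are defined up to sign).
eigvals, eigvecs = np.linalg.eigh(C)          # eigenvalues in ascending order
v = eigvecs[:, -1]                            # eigenvector with the largest eigenvalue
print(abs(w @ v), np.linalg.norm(w))          # both should be close to 1
```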

Programming exercises

  1. Write code that implements the competitive learning algorithm for the MNIST dataset. (A minimal sketch of the update loop appears at the end of this exercise.)

    • Run your code on all the training images with \(k=10\). Display the 10 weight vectors as images. Do they look like the 10 digit classes? Why or why not? Do they look like examples in the dataset? Why or why not?

    • Run your code on just the “two” images with \(k=5\). Display the 5 weight vectors as images. Describe qualitatively what the algorithm has learned.

    There are two real-world situations in which competitive learning might be used instead of \(k\)-means clustering. First, competitive learning can be used to cluster datasets larger than available RAM, since the online algorithm holds only one input vector in RAM at any given time. Second, competitive learning can be useful for tracking clusters that vary slowly with time.
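
As a starting point, here is a minimal sketch of the online update loop. It assumes the training images have already been loaded into a NumPy array X of shape (num_images, 784) with float entries scaled to [0, 1]; the learning rate, number of passes, and initialization are illustrative choices, not prescribed by the assignment:

```python
import numpy as np

def competitive_learning(X, k=10, eta=0.1, epochs=5, seed=0):
    """Online competitive learning: each input moves only its closest weight vector."""
    rng = np.random.default_rng(seed)
    # Initialize the weight vectors at randomly chosen inputs (one common choice).
    w = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(epochs):
        for t in rng.permutation(len(X)):
            x = X[t]
            a = np.argmin(np.linalg.norm(x - w, axis=1))   # winning neuron
            w[a] += eta * (x - w[a])                       # Hebbian update with decay
        eta *= 0.5                                         # optional: anneal the step size
    return w

# Usage (assuming X holds the MNIST training images as flattened float vectors):
# w = competitive_learning(X, k=10)
# Reshape each row of w to 28x28 and display it as an image.
```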

  2. Write code that implements Oja’s Rule for the MNIST dataset. Oja’s Rule can be viewed as an efficient online method of finding the principal component of a series of data vectors. Alternatively, one could compute the covariance matrix of the data, find all its eigenvalues and eigenvectors, and discard all but the principal eigenvector. However, this would be an inefficient use of memory and time if only the top principal component is desired. (A minimal sketch of the online method appears at the end of this exercise.)

    • Apply your Oja’s Rule code to the “two” class of MNIST digits.

    • Compute the covariance matrix of the “two” digits in the MNIST dataset. The definition of the covariance matrix given above assumed that the input vectors have zero mean. Therefore you should “center” the input vectors by computing their mean and then subtracting it from each input vector. You should do the same centering for Oja’s Rule.

    • Find all eigenvalues and eigenvectors of the covariance matrix. Sort the eigenvectors by decreasing eigenvalue and display the top 10 eigenvectors as images. The top eigenvector should look the same as the result of Oja’s Rule, possibly up to an overall sign flip (an eigenvector is only defined up to sign).
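
As a starting point, here is a minimal sketch of Oja’s Rule with centering, followed (in comments) by the eigendecomposition comparison. It again assumes the “two” images are already in a NumPy array X of shape (num_images, 784) with float entries scaled to [0, 1]; the learning rate and number of passes are illustrative choices:

```python
import numpy as np

def oja(X, eta=0.001, epochs=10, seed=0):
    """Online Oja's Rule on centered data; returns an estimate of the principal component."""
    rng = np.random.default_rng(seed)
    Xc = X - X.mean(axis=0)                    # center the inputs
    w = rng.standard_normal(Xc.shape[1])
    w /= np.linalg.norm(w)                     # start from a random unit vector
    for _ in range(epochs):
        for t in rng.permutation(len(Xc)):
            x = Xc[t]
            y = w @ x                          # linear neuron
            w += eta * (y * x - y**2 * w)      # Oja's Rule
    return w

# Comparison with the eigendecomposition of the covariance matrix:
# Xc = X - X.mean(axis=0)
# C = Xc.T @ Xc / len(Xc)
# eigvals, eigvecs = np.linalg.eigh(C)         # eigenvalues in ascending order
# top10 = eigvecs[:, ::-1][:, :10]             # top 10 eigenvectors, by decreasing eigenvalue
# The first column should match oja(X) up to an overall sign flip.
```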