\[\DeclareMathOperator*\trace{Tr} \DeclareMathOperator*\argmax{argmax} \DeclareMathOperator*\argmin{argmin} \DeclareMathOperator*\prob{\mathbb{P}} \renewcommand{\vec}[1]{{\bf #1}}\]

1. Odds and Logits

Please read about odds and logit. The National Safety Council has tabulated the probability of dying from various causes. They use the term “odds of death,” but they are actually tabulating the probability of death.

  1. What are the odds of dying from heart failure? Motor vehicle crash?
  2. What is the corresponding logit in each case?
  3. What are the minimum and maximum possible odds?
  4. What are the minimum and maximum possible logit values?
  5. If the logit is negative, the odds are ______ and the probability is ______.

One of the main points here is that the logit changes little when the odds change a lot, due to the presence of the logarithm.
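To make this concrete, here is a minimal sketch in plain Python of the probability-to-odds and probability-to-logit conversions; the probabilities in the loop are made-up examples, not the National Safety Council's figures.

```python
import math

def odds(p):
    """Odds corresponding to probability p, i.e., p / (1 - p)."""
    return p / (1 - p)

def logit(p):
    """Logit corresponding to probability p, i.e., the log of the odds."""
    return math.log(odds(p))

# Illustrative probabilities only (not the actual tabulated values).
for p in [0.001, 0.01, 0.1, 0.5, 0.9]:
    print(f"p = {p:<6}  odds = {odds(p):10.4f}  logit = {logit(p):8.4f}")
```

Note the contrast in the output: as \(p\) goes from 0.001 to 0.9, the odds grow by nearly four orders of magnitude, while the logit only moves from about \(-6.9\) to about \(2.2\).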

2. The cross entropy loss

Suppose it’s the job of a person or machine to predict 1 of \(n\) discrete alternatives. For example, a weather forecaster might predict “sunny”, “cloudy”, or “rainy”. Suppose that the prediction is probabilistic, i.e., a 20 percent chance of sun, a 50 percent chance of clouds, and a 30 percent chance of rain. How do we grade the performance of the person or machine, once we know the truth? The cross entropy loss is commonly used. Let \(c_i\) be a one-hot vector, i.e., one of its elements is equal to 1 and the rest are zero. In our weather case, this would be \((0, 1, 0)\) if the day turned out to be cloudy. Let \(p_i\) be a probability vector. This would be \((0.2, 0.5, 0.3)\) for the prediction of our hypothetical forecaster. Then we define the cross entropy loss as \begin{equation}\label{eq:CrossEntropy} L = - \sum_{i=1}^n c_i \log p_i \end{equation}
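As a numerical check of Eq. (\ref{eq:CrossEntropy}), here is a minimal Python sketch that scores the forecast above against a day that turned out cloudy; the function name `cross_entropy` is just for illustration.

```python
import math

def cross_entropy(c, p):
    """Cross entropy loss L = -sum_i c_i * log(p_i)."""
    return -sum(ci * math.log(pi) for ci, pi in zip(c, p))

c = (0, 1, 0)        # one-hot truth: the day turned out cloudy
p = (0.2, 0.5, 0.3)  # the forecaster's predicted probabilities
print(cross_entropy(c, p))  # -log(0.5), approximately 0.693
```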

  1. Suppose that \(\vec{p}\) is fixed. What is the smallest that \(L\) can be, and for what \(\vec{c}\) is this achieved? What is the largest that \(L\) can be, and for what \(\vec{c}\) is this achieved?
  2. Suppose that \(\vec{c}\) is fixed. What is the smallest that \(L\) can be, and for what \(\vec{p}\) is this achieved? What is the largest that \(L\) can be, and for what \(\vec{p}\) is this achieved?
  3. For the binary case, what is the relation between loss and logits?

Note that we say “loss” function when small means good and large means bad.

3. The softmax function

The softmax function maps a vector \(\vec{x}\) to a vector \(\vec{p}\), \begin{equation}\label{eq:Softmax} p_i = \frac{\exp(x_i)}{\sum_{j=1}^n \exp(x_j)} \end{equation} If we scale the magnitude of \(\vec{x}\) to infinity, i.e., replace \(\vec{x}\) by \(\lambda\vec{x}\) and let \(\lambda\to\infty\), then \(p_i\to 1\) for the \(i\) such that \(x_i\) is maximal, and \(p_i\to 0\) for all other \(i\). For finite \(\vec{x}\), we get a “softened” version of the maximum function, which is the reason for the name softmax. (Exercise: What happens if there is a tie?)
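Below is a minimal plain-Python sketch of Eq. (\ref{eq:Softmax}); the example vector and the scale factors are arbitrary choices meant only to illustrate the limiting behavior described above.

```python
import math

def softmax(x):
    """Map a list of numbers x to a probability vector p, as in Eq. (Softmax)."""
    exps = [math.exp(xi) for xi in x]
    total = sum(exps)
    return [e / total for e in exps]

x = [1.0, 2.0, 0.0]       # arbitrary example vector
for lam in [1, 5, 50]:     # scale the magnitude of x
    p = softmax([lam * xi for xi in x])
    print(lam, [round(pi, 4) for pi in p])
# As lam grows, p concentrates on the position of the largest element of x.
```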

  1. Verify that the \(p_i\) are nonnegative, and that \(\sum_i p_i = 1\). That means the output of softmax is a probability vector.
  2. Verify that the softmax function preserves ordering, i.e., that the elements of \(\vec{x}\) and \(\vec{p}\) have the same ordering. To do this, show that \(x_1\geq\ldots\geq x_n\) if and only if \(p_1\geq\ldots\geq p_n\), and note that the same argument applies to any other ordering.
  3. Suppose that \(x_i = i\), for \(i=1\) to \(10\). Make a plot of \(p_i\) vs. \(i\), and try to understand its shape.
  4. Characterize the infinite set of vectors \(\vec{x}\) that are mapped to the same vector \(\vec{p}\).
  5. In the binary case \(n=2\), show that you can write \(p_1\) as a logistic function of \(x=x_1-x_2\).

Because the softmax function is a generalization of the logistic function, the elements of the vector \(\vec{x}\) are also called logits. The softmax function is typically used as the last operation in a neural net that assigns an image or other input to 1 of \(n\) discrete classes.

4. Gradient

Gradient descent is a numerical method for finding local minima of functions, and is used for neural network learning. The method involves initializing the variables of the function and then repeatedly taking small steps in the direction of the negative gradient.
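Here is a minimal sketch of such a loop, written with PyTorch's autograd so that the gradient does not have to be derived by hand; the quadratic `energy` function, the initialization, and the step size 0.1 are arbitrary placeholders rather than part of the exercises that follow.

```python
import torch

def energy(w):
    """An arbitrary smooth function to minimize (placeholder example)."""
    return ((w - torch.tensor([1.0, -2.0])) ** 2).sum()

w = torch.zeros(2, requires_grad=True)   # initialize the variables
step_size = 0.1

for t in range(100):
    E = energy(w)
    E.backward()                         # compute the gradient of E at w
    with torch.no_grad():
        w -= step_size * w.grad          # small step along the negative gradient
    w.grad.zero_()                       # clear the gradient for the next step

print(w)   # approaches the minimizer (1, -2)
```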

  1. Suppose that \(E(\vec x) = \sum_i x_i^2\). What is the gradient of \(E\)?
  2. Suppose that \(E(x, y) = 2x^2 + y^2\). What is the gradient of \(E\)? Where is the minimum of \(E\) located? Does minus the gradient point in the direction of the minimum? Would that prevent gradient descent from finding the minimum?

5. Combining the cross entropy loss with softmax

The PyTorch QuickStart Tutorial uses the function torch.nn.CrossEntropyLoss. This is actually the cross entropy loss composed with the softmax function. In other words, it is what you get when you substitute Eq. (\ref{eq:Softmax}) for \(p_i\) in Eq. (\ref{eq:CrossEntropy}).
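The composition can be checked numerically. In the sketch below the logits and the target class are arbitrary examples; the manual computation applies Eq. (\ref{eq:CrossEntropy}) to the softmax of the logits, and should match the built-in loss.

```python
import torch

logits = torch.tensor([[2.0, -1.0, 0.5]])   # arbitrary logits, shape (batch, classes)
target = torch.tensor([0])                  # index of the true class

# Built-in: cross entropy loss composed with softmax, applied to raw logits.
builtin = torch.nn.CrossEntropyLoss()(logits, target)

# Manual: softmax first, then cross entropy against the one-hot truth.
p = torch.softmax(logits, dim=1)
manual = -torch.log(p[0, target[0]])

print(builtin.item(), manual.item())   # the two values agree
```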

  1. Calculate the gradient of \(L\) with respect to \(\vec{p}\), \(\partial L/\partial p_i\), from Eq. (\ref{eq:CrossEntropy}).
  2. From Eq. (\ref{eq:Softmax}), show that the Jacobian matrix of first partial derivatives is \begin{equation} \frac{\partial p_i}{\partial x_j} = p_i\delta_{ij} - p_ip_j \end{equation} Here \(\delta_{ij}\) is the Kronecker delta.
  3. Apply the multivariate chain rule for differentiation, \begin{equation} \frac{\partial L}{\partial x_j} = \sum_i\frac{\partial L}{\partial p_i} \frac{\partial p_i}{\partial x_j} \end{equation} to derive an expression for the gradient of \(L\) with respect to \(\vec{x}\). Hint: the final answer is very simple. Verify that at most one of the elements of the gradient can be negative if \(\vec{c}\) is a one-hot vector. Which element is it?

6. Linear algebra and PyTorch

Learn how to implement linear algebra with PyTorch by going through the tutorials in Section 2.1 Data Manipulation and Section 2.3 Linear Algebra of the online textbook Dive into Deep Learning. Note the link at the top that opens the section as a notebook in Google Colab.
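If you would like a quick preview before working through those sections, the sketch below shows the flavor of the tensor operations covered there; the shapes and values are arbitrary.

```python
import torch

A = torch.arange(6, dtype=torch.float32).reshape(2, 3)   # a 2x3 matrix
x = torch.ones(3)                                         # a length-3 vector

print(A.shape, A.T)               # shape and transpose
print(A + 1, A * 2)               # elementwise operations
print(torch.mv(A, x))             # matrix-vector product
print(torch.mm(A, A.T))           # matrix-matrix product
print(A.sum(), A.sum(dim=0))      # sum over all elements / along the first axis
```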