Problem Set 6: Transformers
Due at 11:59 pm on Friday, Apr. 5th. Please submit to Gradescope. Please follow the general guidelines regarding homework assignments.
- After applying Layer Normalization, what is the Euclidean norm of the activity vector?
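If you would like to check your answer numerically, the following is a minimal sketch, not a reference implementation. It assumes the plain form of layer normalization (subtract the mean, divide by the standard deviation) with no learned scale or offset and no epsilon term. The helper names `layer_norm` and `euclidean_norm` and the test vector are made up for illustration.

```python
import math

def layer_norm(x):
    """Normalize x to zero mean and unit variance.

    Assumption: no learned scale/offset and no epsilon in the denominator.
    """
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n  # population variance
    return [(v - mean) / math.sqrt(var) for v in x]

def euclidean_norm(x):
    """Return the Euclidean (L2) norm of a vector given as a list."""
    return math.sqrt(sum(v * v for v in x))

# Arbitrary example vector; try other lengths and values.
x = [0.5, -1.2, 3.0, 0.7, 2.2, -0.4]
y = layer_norm(x)
print(len(x), euclidean_norm(y))
```

Running this for vectors of several different lengths should suggest how the norm depends on the dimension.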
- Consider Algorithm 4 in Phuong and Hutter, for the case of self-attention. Suppose that \(W_k\) is the identity matrix \(I\) and \(W_q = \omega I\). Assume that all tokens are distinct from each other and have the same Euclidean norm. What is the output of the algorithm in the limit as \(\omega\to\infty\)?
- Suppose that \(W_k = 0\) and \(W_q\) is arbitrary. What is the output of Algorithm 4?
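The two attention questions above can also be explored numerically. The sketch below is a simplified, single-head, unmasked self-attention written from scratch, not a transcription of Algorithm 4. It makes several assumptions: \(W_q\) and \(W_k\) are scalar multiples of the identity (so "arbitrary \(W_q\)" is only sampled, not covered in general), \(W_v = I\), all biases are zero, and scores are divided by \(\sqrt{d}\). The function names and example tokens are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens, q_scale, k_scale):
    """Unmasked self-attention with W_q = q_scale * I, W_k = k_scale * I.

    Assumptions: W_v = I, zero biases, scores scaled by 1/sqrt(d).
    Returns one output vector per input token.
    """
    d = len(tokens[0])
    outputs = []
    for x_t in tokens:  # x_t plays the role of the query token
        scores = [
            q_scale * k_scale * sum(a * b for a, b in zip(x_j, x_t)) / math.sqrt(d)
            for x_j in tokens
        ]
        weights = softmax(scores)
        # Weighted sum of the value vectors (here the tokens themselves).
        outputs.append(
            [sum(w * x_j[i] for w, x_j in zip(weights, tokens)) for i in range(d)]
        )
    return outputs

# Distinct tokens with equal Euclidean norm.
tokens = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

# Increase omega and watch how the outputs change.
for omega in (1.0, 10.0, 100.0):
    print(omega, self_attention(tokens, q_scale=omega, k_scale=1.0))

# W_k = 0 with one particular choice of W_q.
print(self_attention(tokens, q_scale=3.0, k_scale=0.0))
```

Comparing the printed outputs with the input tokens, and with simple statistics of the tokens, is a useful check on a derivation, but the questions ask for an argument rather than an experiment.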
- What is the complexity of Algorithm 5, measured in number of multiply-adds? You can neglect the computations involving bias vectors, and the softmax computation of Eq. (2). Consider both the bidirectional and unidirectional cases. Your answer will contain a sum of terms. Which term dominates for GPT-3? For this last part, you can consult the GPT-3 paper for model hyperparameters, which are also summarized at the end of the minGPT README.