Problem Set 6: Transformers

After applying Layer Normalization, what is the Euclidean norm of the activity vector?
Consider Algorithm 4 in Phuong and Hutter, for the case of self-attention. Suppose that \(W_k\) is the identity matrix \(I\) and \(W_q = \omega I\). Assume that all tokens have the same Euclidean norm, and are distinct from each other. What is the output of the algorithm in the limit as \(\omega\to\infty\)?
Suppose that \(W_k = 0\) and \(W_q\) is arbitrary. What is the output of Algorithm 4?
What is the complexity of Algorithm 5, measured in the number of multiply-adds? You can neglect the computations involving bias vectors and the softmax computation of Eq. (2). Consider both the bidirectional and unidirectional cases. Your answer will contain a sum of terms. Which one dominates for GPT-3?
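For the first three questions, it may help to check a pencil-and-paper answer numerically. The following is a minimal sketch, not the algorithms as stated in Phuong and Hutter: it assumes layer normalization with unit scale and zero offset and no epsilon term, and unmasked single-head self-attention with \(W_v = I\) and all biases set to zero. The helper names (`layer_norm`, `self_attention`, `q_scale`, `k_scale`) are made up for illustration.

```python
import math


def layer_norm(e):
    """Normalize e to zero mean and unit variance (assumed: scale 1, offset 0)."""
    d = len(e)
    m = sum(e) / d
    v = sum((x - m) ** 2 for x in e) / d
    return [(x - m) / math.sqrt(v) for x in e]


def euclidean_norm(e):
    return math.sqrt(sum(x * x for x in e))


def softmax(scores):
    # Subtract the maximum so that math.exp only sees non-positive arguments.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]


def self_attention(tokens, q_scale, k_scale):
    """Unmasked self-attention with W_q = q_scale * I and W_k = k_scale * I.

    Assumed for this sketch: W_v = I, all biases are zero, and scores are
    divided by sqrt(d), where d is the token dimension.
    """
    d = len(tokens[0])
    outputs = []
    for query_token in tokens:
        scores = [
            q_scale * k_scale
            * sum(a * b for a, b in zip(query_token, key_token))
            / math.sqrt(d)
            for key_token in tokens
        ]
        weights = softmax(scores)
        # Each output is a weighted average of the (identity-projected) tokens.
        outputs.append(
            [sum(w * tok[i] for w, tok in zip(weights, tokens)) for i in range(d)]
        )
    return outputs


if __name__ == "__main__":
    # Three distinct tokens, each with Euclidean norm 1.
    tokens = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.6, 0.0, 0.8]]
    print(euclidean_norm(layer_norm([2.0, -1.0, 0.5, 3.0])))
    for omega in (1.0, 10.0, 100.0, 1000.0):
        print(omega, self_attention(tokens, q_scale=omega, k_scale=1.0))
    print(self_attention(tokens, q_scale=5.0, k_scale=0.0))
```

Varying the input length in `layer_norm`, the value of `omega`, and the choice of tokens should suggest the general pattern, which you should then justify analytically.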

For the last question, you can consult the GPT-3 paper for model hyperparameters, which are also summarized at the end of the minGPT README.
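
One way to organize the count for the last question is to tally multiply-adds term by term and then plug in numbers. The sketch below is an assumption-laden tally, not a reference solution: it assumes self-attention (so the primary and context sequences have the same length), equal per-head query/key and value dimensions (`d_head`), an output dimension equal to `d_model`, and, in the unidirectional case, that only unmasked query-key pairs are counted. The function name and dictionary keys are hypothetical. The example values are those of the largest GPT-3 model as reported in the GPT-3 paper.

```python
def mh_attention_multiply_adds(seq_len, d_model, n_heads, d_head, causal=False):
    """Tally multiply-adds for one multi-head self-attention layer.

    Biases and the softmax are ignored, as the question allows.
    """
    # Query, key, and value projections: d_model -> d_head, per head, per token.
    qkv = 3 * n_heads * d_head * d_model * seq_len
    # Query-key pairs: all pairs if bidirectional, only key <= query if causal.
    pairs = seq_len * (seq_len + 1) // 2 if causal else seq_len * seq_len
    # One d_head-dimensional dot product per pair, per head.
    scores = n_heads * pairs * d_head
    # Weighted sum of d_head-dimensional value vectors, per pair, per head.
    value_mixing = n_heads * pairs * d_head
    # Output projection: concatenated heads (n_heads * d_head) -> d_model.
    output = d_model * n_heads * d_head * seq_len
    return {
        "qkv_projections": qkv,
        "scores": scores,
        "value_mixing": value_mixing,
        "output_projection": output,
    }


if __name__ == "__main__":
    # Largest GPT-3 model: context 2048, d_model 12288, 96 heads of size 128.
    for causal in (False, True):
        counts = mh_attention_multiply_adds(2048, 12288, 96, 128, causal=causal)
        print("unidirectional" if causal else "bidirectional")
        for name, value in counts.items():
            print(f"  {name}: {value:,}")
```

Comparing the printed terms shows which one is largest for these hyperparameters; your written answer should express the same comparison symbolically.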