Attention Mechanisms in Transformers

Attention is the mechanism that lets transformers process sequences without the recurrence bottleneck of RNNs. Rather than maintaining hidden state that propagates from left to right, every token in the sequence can directly attend to every other token in a single step. This parallelism is what makes transformers practical to train on modern hardware.

The core operation is scaled dot-product attention. Given a sequence of input vectors, we project each token into three roles: a query (Q), a key (K), and a value (V). Each of these projections is learned. The attention output is computed as:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right) V$$

Intuitively, the dot product $QK^\top$ measures how much each query “matches” each key. The softmax turns these raw scores into a probability distribution, which we use to take a weighted sum of the values. The result for each token is a blend of the values of all tokens, weighted by how relevant their keys were to this token’s query.
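
The formula above can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not a reference implementation: the function names, the shapes, and the random inputs are all assumptions made for the example, and there is no masking or batching.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v).

    Returns the output (n_q, d_v) and the attention weights (n_q, n_k).
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # how well each query matches each key
    weights = softmax(scores, axis=-1)  # each row is a distribution over keys
    return weights @ V, weights         # weighted sum of the values

# Illustrative random inputs (assumed sizes, not from any real model).
rng = np.random.default_rng(0)
n, d_k, d_v = 4, 8, 8
Q = rng.standard_normal((n, d_k))
K = rng.standard_normal((n, d_k))
V = rng.standard_normal((n, d_v))

out, weights = scaled_dot_product_attention(Q, K, V)
print(out.shape, weights.sum(axis=-1))
```

Each row of `weights` sums to 1, so each output row is a convex combination of the rows of `V`.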

The elegance here is that Q, K, and V are all linear projections of the same input — so attention is learning, from data, which aspects of each token are relevant when acting as a query, when being queried, and what information to pass forward. In self-attention, every token can attend to every other token in the same sequence (including itself).

The scaling factor √d_k explained

The division by $\sqrt{d_k}$ (the square root of the key dimension) looks like a minor detail but matters a lot in practice.

Without scaling, when $d_k$ is large, the dot products in $QK^\top$ grow in magnitude. Specifically, take a single query vector $q$ and key vector $k$ (one row each of Q and K). If their components are independent random variables with mean 0 and variance 1, then the dot product $q \cdot k = \sum_{i=1}^{d_k} q_i k_i$ has mean 0 and variance $d_k$, so its standard deviation grows as $\sqrt{d_k}$.
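
A quick simulation can check this variance claim empirically. The sketch below assumes standard normal components (one distribution that satisfies the mean-0, variance-1 assumption); the sample count and the choice of dimensions are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
num_samples = 10_000  # number of (query, key) pairs per dimension; arbitrary

def dot_product_variance(d_k):
    # Draw independent query/key vectors with mean-0, variance-1 components.
    q = rng.standard_normal((num_samples, d_k))
    k = rng.standard_normal((num_samples, d_k))
    dots = np.sum(q * k, axis=1)  # one dot product per pair
    return dots.var()

variances = {d_k: dot_product_variance(d_k) for d_k in (16, 64, 256)}
for d_k, var in variances.items():
    print(f"d_k={d_k:4d}  empirical variance={var:8.2f}  std={np.sqrt(var):6.2f}")
```

The empirical variance should land close to $d_k$ in each case, up to sampling noise.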

Large dot product values push the softmax into its saturated region — where one logit dominates and the gradient through the softmax approaches zero. This is the same saturation problem that makes training deep sigmoid networks hard, just appearing in a different context.

By dividing by $\sqrt{d_k}$, we normalize the dot products back to unit variance regardless of dimension. The softmax input stays in a numerically well-behaved range, gradients flow cleanly, and attention weights don’t degenerate into near-one-hot distributions during early training.
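
To see the saturation effect concretely, the sketch below compares the softmax of unscaled and scaled logits for one random query against a handful of random keys. The sizes are assumptions chosen for illustration. The Jacobian norm is used here as a rough proxy for how much gradient can pass through the softmax.

```python
import numpy as np

def softmax(x):
    x = x - np.max(x)  # stability shift; does not change the result
    e = np.exp(x)
    return e / e.sum()

def softmax_jacobian(p):
    # d softmax_i / d logit_j = p_i * (delta_ij - p_j)
    return np.diag(p) - np.outer(p, p)

rng = np.random.default_rng(0)
d_k, n_keys = 512, 16  # illustrative sizes
q = rng.standard_normal(d_k)
K = rng.standard_normal((n_keys, d_k))

raw = K @ q                  # unscaled logits, std around sqrt(d_k)
scaled = raw / np.sqrt(d_k)  # scaled logits, std around 1

results = {}
for name, logits in (("unscaled", raw), ("scaled", scaled)):
    p = softmax(logits)
    results[name] = (p.max(), np.linalg.norm(softmax_jacobian(p)))
    print(f"{name:9s} max weight={results[name][0]:.3f}  "
          f"jacobian norm={results[name][1]:.4f}")
```

With unscaled logits the largest weight is typically very close to 1 and the Jacobian norm is correspondingly small. With scaling, the weights are more spread out and the Jacobian is usually much larger.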

The choice of $\sqrt{d_k}$ specifically (rather than $d_k$ or some learned parameter) comes from the variance calculation: dividing a random variable by a constant $c$ divides its variance by $c^2$, so dividing the sum by $\sqrt{d_k}$ makes the variance of the result $\frac{d_k}{d_k} = 1$. Under the independence assumptions above, it’s the right normalization constant, not an arbitrary hyperparameter.
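
Spelled out for a single query vector $q$ and key vector $k$, assuming independent components with mean 0 and variance 1, each term of the dot product has unit variance:

$$\operatorname{Var}(q_i k_i) = \mathbb{E}[q_i^2]\,\mathbb{E}[k_i^2] - \left(\mathbb{E}[q_i]\,\mathbb{E}[k_i]\right)^2 = 1 \cdot 1 - 0 = 1$$

Because the terms are independent, their variances add, and dividing by $\sqrt{d_k}$ divides the variance by $d_k$:

$$\operatorname{Var}\left(\sum_{i=1}^{d_k} q_i k_i\right) = d_k, \qquad \operatorname{Var}\left(\frac{1}{\sqrt{d_k}} \sum_{i=1}^{d_k} q_i k_i\right) = \frac{d_k}{d_k} = 1$$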

A useful sanity check during training: log the pre-softmax attention logits. If their typical magnitude is far above 1 early in training (for example, closer to $\sqrt{d_k}$ than to 1), the scaling is probably off.
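
One way this check might look is a small helper that summarizes the logits. The name `attention_logit_stats` and the choice of statistics are hypothetical; in a real training loop you would pass in the actual Q and K activations and send the result to whatever logger you already use.

```python
import numpy as np

def attention_logit_stats(Q, K):
    """Summary statistics of the pre-softmax logits QK^T / sqrt(d_k).

    Hypothetical helper for illustration: Q is (n_q, d_k), K is (n_k, d_k).
    """
    d_k = Q.shape[-1]
    logits = Q @ K.T / np.sqrt(d_k)
    return {
        "mean": float(logits.mean()),
        "std": float(logits.std()),
        "max_abs": float(np.abs(logits).max()),
    }

# Stand-in activations with unit-variance components (an assumption).
rng = np.random.default_rng(0)
Q = rng.standard_normal((128, 64))
K = rng.standard_normal((128, 64))

stats = attention_logit_stats(Q, K)
print(stats)
```

For inputs like these, the standard deviation should sit near 1. A value near $\sqrt{d_k}$ would suggest the scaling was dropped.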

Multi-head attention extends this by running several attention operations in parallel, each with its own projection matrices. Each “head” can learn to attend to a different kind of relationship: one head might focus on syntactic relationships, another on semantic similarity, another on positional proximity. The head outputs are concatenated and passed through a final output projection $W^O$:

$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h) W^O$$

where $\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$.

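The parameter count stays manageable because each head operates on a smaller dimension, typically $d_k = d_v = d_{model} / h$. If the model dimension is $d_{model} = 512$ and there are $h = 8$ heads, each head sees $d_k = d_v = 512 / 8 = 64$ dimensions, so all heads together cost about as much as one attention operation at full dimension.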
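
Here is a rough NumPy sketch of multi-head self-attention using those numbers. It makes several assumptions for illustration: the per-head matrices $W_i^Q$, $W_i^K$, $W_i^V$ are stored as column blocks of single $d_{model} \times d_{model}$ matrices (a common layout, but not the only one), there are no bias terms or masking, and the weights are random rather than learned.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def multi_head_self_attention(X, W_q, W_k, W_v, W_o, num_heads):
    """X: (n, d_model); W_q, W_k, W_v, W_o: (d_model, d_model)."""
    n, d_model = X.shape
    d_head = d_model // num_heads

    def split_heads(M):
        # (n, d_model) -> (num_heads, n, d_head); head i gets column block i.
        return M.reshape(n, num_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split_heads(X @ W_q), split_heads(X @ W_k), split_heads(X @ W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)    # (num_heads, n, n)
    heads = softmax(scores, axis=-1) @ V                   # (num_heads, n, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)  # concatenate heads
    return concat @ W_o                                    # output projection

rng = np.random.default_rng(0)
n, d_model, num_heads = 10, 512, 8
X = rng.standard_normal((n, d_model))
W_q, W_k, W_v, W_o = (
    rng.standard_normal((d_model, d_model)) / np.sqrt(d_model) for _ in range(4)
)

out = multi_head_self_attention(X, W_q, W_k, W_v, W_o, num_heads)
num_params = 4 * d_model * d_model  # Q, K, V, and output projections, no biases
print(out.shape, d_model // num_heads, num_params)
```

Note that the parameter count, `4 * d_model * d_model`, does not depend on the number of heads in this layout: more heads means narrower heads, not more weights.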

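One thing I found counterintuitive: attention is $O(n^2)$ in sequence length, where $n$ is the number of tokens, in both compute and the memory needed for the score matrix. For a 1000-token sequence, that’s $10^6$ query-key pairs per head. This is why long-context transformers are expensive and why so much recent work focuses specifically on making this operation cheaper: sparse and linear attention avoid scoring every pair, while FlashAttention keeps exact attention but avoids materializing the full $n \times n$ matrix in memory.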
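
A little arithmetic makes the quadratic growth tangible. The helpers below are hypothetical and only count the entries of one score matrix. The 4 bytes per score assumes float32, and the sizes ignore everything else a real model stores.

```python
def attention_score_count(n_tokens):
    """Number of query-key pairs in full self-attention over n_tokens."""
    return n_tokens * n_tokens

def score_matrix_bytes(n_tokens, bytes_per_score=4, num_heads=1):
    # bytes_per_score=4 assumes float32 scores; num_heads is per layer.
    return attention_score_count(n_tokens) * bytes_per_score * num_heads

for n in (1_000, 10_000, 100_000):
    pairs = attention_score_count(n)
    mib = score_matrix_bytes(n) / 2**20
    print(f"n={n:>7,}  pairs={pairs:.1e}  one float32 score matrix ~ {mib:,.1f} MiB")
```

Every 10x increase in sequence length costs 100x in pairs, which is why a naive implementation runs out of memory long before it runs out of patience.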