Transformer Architecture Basics
TL;DR: Transformers are the neural network architecture behind every modern LLM. Self-attention lets the model weigh how relevant each word is to every other word — enabling long-range understanding.
Why Transformers Replaced RNNs
Before transformers (introduced in 2017), language models relied on recurrent neural networks (RNNs), which process tokens one by one and lose context over long sequences. Transformers process all tokens in parallel and use attention to relate distant words directly.
Self-Attention: The Key Mechanism
"Attention is All You Need" (2017) introduced self-attention: for each token, the model computes how much it should "attend to" every other token. This creates context-sensitive representations — the word "bank" means something different near "river" vs "money".
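The mechanism above can be sketched in a few lines. This is a minimal, illustrative scaled dot-product self-attention in NumPy — a single head with made-up random weights, not any model's actual implementation:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention (illustrative sketch)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (n, n): token-to-token relevance
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax: rows sum to 1
    return weights @ V                               # context-mixed token representations

rng = np.random.default_rng(0)
n, d = 4, 8                                          # 4 tokens, 8-dim embeddings (toy sizes)
X = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)                                     # each token's output mixes all tokens
```

Each output row is a weighted blend of every token's value vector, which is exactly why "bank" can pick up context from "river" several positions away.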
Scale is the Secret Sauce
GPT-3 has 175B parameters; GPT-4 is widely estimated (though unconfirmed) at ~1.8T parameters in a mixture-of-experts design. More parameters, more data, and more compute yield predictably better performance — these "scaling laws" are why LLMs keep improving with size.
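Scaling laws say loss falls smoothly as a power of parameter count. A hedged sketch of that curve, using the power-law form from Kaplan et al.-style scaling work (the constants here are illustrative, not authoritative fitted values):

```python
def loss(N, N_c=8.8e13, alpha=0.076):
    # L(N) = (N_c / N) ** alpha — power-law form; constants are illustrative
    return (N_c / N) ** alpha

# Loss decreases monotonically as parameter count grows (toy comparison)
for N in (1.3e9, 175e9, 1.8e12):
    print(f"{N:.1e} params -> predicted loss {loss(N):.3f}")
```

The key property is the smooth, predictable decrease: bigger models reliably get lower loss, which is what made it rational to keep scaling up.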
What This Means for AI Tools
Context window limits come from the quadratic complexity of attention: doubling the sequence length roughly quadruples the compute. Model quality differences (GPT-4 vs GPT-3.5) come from architecture improvements and training scale. Smaller models (e.g., GPT-4o-mini) trade some quality for speed and cost.
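The quadratic cost is simple arithmetic: the attention score matrix is n × n, so its cost scales with n². A back-of-the-envelope sketch (a rough operation count that ignores constants and the rest of the network):

```python
def attention_ops(n, d):
    # Rough count: Q @ K^T costs ~n*n*d ops, weights @ V another ~n*n*d.
    # Constants and non-attention layers are deliberately ignored.
    return 2 * n * n * d

d = 128  # illustrative per-head dimension
ratio = attention_ops(1000, d) / attention_ops(500, d)
print(ratio)  # 4.0 — doubling context length quadruples attention compute
```

This n² growth (not exponential, but still steep) is why long context windows are expensive and why sub-quadratic attention variants are an active research area.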