May 2026 Writing ~18 min read

Understanding LLMs: from bag of words to attention

A few months before ChatGPT came out, I was fine-tuning a DistilBERT classifier and feeling quite proud of myself. This is the journey from there to here, explained so your project manager can follow along.

In late 2022 I was working on a text classification model for multilingual field data. DistilBERT was the tool: a distilled version of BERT, smaller and faster, with surprisingly good multilingual performance. The task was sorting open-ended survey responses into categories. It worked. I was pleased. I started building a question-answering system on top of it.

Then in November, OpenAI released ChatGPT. The "GPT" stands for Generative Pre-trained Transformer. Within weeks, the model I'd spent months fine-tuning felt like a quaint antique. Not because it stopped working, but because the paradigm had shifted beneath my feet. What I'd been doing manually (curating training data, selecting architectures, tuning hyperparameters) was now something you could prompt a general model to do in natural language.

This piece traces the conceptual path from the simplest idea of "how a computer understands text" to the transformer architecture that powers GPT, Claude, and every other large language model you're hearing about. I'll build it up one layer at a time, with exercises you can try right here. No maths background required, though we won't shy away from the intuition behind the maths.

Bag of Words: the simplest model of text

The starting point for understanding how machines process language is surprisingly dumb. It's called Bag of Words, and it does exactly what it sounds like: take a sentence, throw all the words into a bag, lose all the order, and just count how often each word appears.

"The cat sat on the mat" becomes: {the: 2, cat: 1, sat: 1, on: 1, mat: 1}. That's it. No grammar, no meaning, no structure. Just counting. And yet, for many practical tasks (spam detection, topic classification, sentiment analysis), this works surprisingly well. If an email contains the words "prince," "Nigeria," "urgent," and "transfer," you don't need to parse the grammar to know what you're looking at.

Exercise 01 · Bag of Words · type a sentence and see it decomposed

Build a bag of words from your own text

Notice how word order is completely lost. "Dog bites man" and "Man bites dog" produce identical bags. That's the fundamental limitation. Also notice how common words (the, on, and) dominate the count. That's where TF-IDF comes in.
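For readers who would rather see it than take my word for it, here is a minimal sketch of the same idea in Python. The function name `bag_of_words` and the very simple word-splitting rule are my own choices for illustration, not a standard API.

```python
import re
from collections import Counter

def bag_of_words(text: str) -> Counter:
    """Lowercase the text, split it into words, and count each word."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)

bag = bag_of_words("The cat sat on the mat")
print(dict(bag))  # {'the': 2, 'cat': 1, 'sat': 1, 'on': 1, 'mat': 1}

# Word order is thrown away, so these two sentences are indistinguishable.
print(bag_of_words("Dog bites man") == bag_of_words("Man bites dog"))  # True
```

The second print is the whole limitation in one line: once the words are in the bag, there is no way to tell who bit whom.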

TF-IDF: not all words are equal

The problem with raw word counts is that the most frequent words are usually the least informative. "The" appears everywhere. It tells you nothing about what a document is about. What you actually want to know is: which words appear in this document that don't appear much in other documents?

That's TF-IDF: Term Frequency times Inverse Document Frequency. The "TF" part counts how often a word appears in a document (like Bag of Words). The "IDF" part penalises words that appear in many documents. "The" gets a low IDF because it's everywhere. "Malnutrition" gets a high IDF if it only appears in a few documents in your collection. Multiply them together and you get a score that highlights distinctive words.

Exercise 02 · TF-IDF · compare two documents

See which words matter most in each document

Words that appear in one document but not the other get the highest scores. Shared words ("the," "children," "commune") get downweighted. This is the core intuition: distinctiveness is more informative than frequency.
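The same calculation can be sketched in a few lines of Python. This is the plainest textbook form (term frequency times the log of inverse document frequency), and the two example documents are made up for illustration. Real libraries usually smooth the formula so that shared words get a small score rather than exactly zero.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

def tf_idf(documents):
    """Return one {word: score} dict per document.

    tf  = count of the word in the document / words in the document
    idf = log(number of documents / documents containing the word)
    """
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    doc_freq = Counter()
    for words in tokenized:
        doc_freq.update(set(words))  # count each word once per document
    scores = []
    for words in tokenized:
        counts = Counter(words)
        total = len(words) or 1  # avoid dividing by zero on empty documents
        scores.append({
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return scores

# Hypothetical example documents.
docs = [
    "The children in the commune show signs of malnutrition",
    "The children in the commune attend the new school",
]
for i, doc_scores in enumerate(tf_idf(docs)):
    distinctive = sorted(w for w, s in doc_scores.items() if s > 0)
    print(i, distinctive)
# 0 ['malnutrition', 'of', 'show', 'signs']
# 1 ['attend', 'new', 'school']
```

With only two documents, any word that appears in both gets an IDF of log(2/2) = 0, which is why "the", "children" and "commune" drop out entirely here.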

TF-IDF-style term weighting powered search engines for decades. When you searched Google in 2003, some variant of this was part of what was happening underneath, alongside link-based signals such as PageRank. It's still used today as a baseline and as a feature in many NLP pipelines. But it still throws away word order and has no concept of meaning. "Bank" the financial institution and "bank" the river side are the same token. The words "not good" get scored the same as "good" plus "not."

Word embeddings: meaning as geometry

The breakthrough that changed NLP came from a deceptively simple idea: what if you represented each word not as a count, but as a position in space? A high-dimensional space where words that mean similar things are close together.

This is the idea behind word embeddings, and the most famous early version is Word2Vec (2013). The insight is that you can learn these positions by looking at context: words that appear in similar contexts should end up near each other in the embedding space. "Cat" and "dog" appear in similar contexts ("the ___ sat on the mat"), so they end up near each other. "Cat" and "quarterly" do not.

The really remarkable thing is that the geometry encodes relationships. The vector from "king" to "queen" is approximately the same as the vector from "man" to "woman." The space isn't just encoding similarity; it's encoding the structure of meaning itself. If you've seen 3Blue1Brown's series on neural networks, this is where the visualisations of high-dimensional spaces start to become genuinely beautiful.
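To make the geometry concrete, here is a toy sketch. The three-dimensional vectors are hand-made and entirely hypothetical (real embeddings have hundreds of dimensions and are learned from text, not written by hand), but the arithmetic is the same one used to demonstrate the king/queen relationship.

```python
import math

# Hand-made 3-d vectors (hypothetical). The axes are, very roughly:
# royalty, femaleness, fruitiness.
EMBEDDINGS = {
    "king":   (0.9, 0.1, 0.0),
    "queen":  (0.9, 0.9, 0.0),
    "man":    (0.1, 0.1, 0.0),
    "woman":  (0.1, 0.9, 0.0),
    "banana": (0.0, 0.1, 0.9),
}

def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(vector, exclude=()):
    """Word whose embedding points most nearly in the same direction."""
    candidates = (w for w in EMBEDDINGS if w not in exclude)
    return max(candidates, key=lambda w: cosine(vector, EMBEDDINGS[w]))

def analogy(a, b, c):
    """Solve 'a is to b as c is to ?' by computing b - a + c."""
    target = tuple(
        vb - va + vc
        for va, vb, vc in zip(EMBEDDINGS[a], EMBEDDINGS[b], EMBEDDINGS[c])
    )
    return nearest(target, exclude={a, b, c})

print(analogy("man", "woman", "king"))  # queen
print(cosine(EMBEDDINGS["king"], EMBEDDINGS["queen"])
      > cosine(EMBEDDINGS["king"], EMBEDDINGS["banana"]))  # True
```

Excluding the three input words from the search is the usual convention for this kind of analogy test; without it, the nearest vector is often one of the inputs.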

Exercise 03 · Word similarity · explore proximity in meaning-space

Which words would be close in embedding space?

This uses a simplified heuristic (not a real embedding model), but the principle holds: words from similar semantic fields score high. In a real model, "doctor" and "nurse" would be closer than "doctor" and "banana." Try different pairs.

The sequence problem: why order matters

Word embeddings went a long way on the meaning problem, but they did nothing for the sequence problem. "The patient was treated by the doctor" and "the doctor was treated by the patient" use exactly the same words with the same embeddings. The meaning is completely different. You need a model that can process sequences, one that knows which word comes before which.

The first major attempt at this was Recurrent Neural Networks (RNNs), and later LSTMs (Long Short-Term Memory networks). These process text one word at a time, left to right, carrying a "hidden state" forward that encodes what they've seen so far. Think of it as reading a sentence while trying to hold the whole thing in working memory.

The problem is that working memory is limited. RNNs forget. By the time they've processed 50 words, the information from the first word has degraded significantly. LSTMs improved this with explicit memory gates, but even they struggled with really long sequences. If you're summarising a 10-page document, the model has effectively forgotten the first page by the time it reaches the last.
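Here is a deliberately tiny sketch of that forgetting. It is a one-unit "RNN" with weights I picked by hand (`w_h` and `w_x` are hypothetical; a real network has vectors, matrices, and learned values). The experiment changes only the first input of a sequence and measures how much the final hidden state moves.

```python
import math

def run_rnn(inputs, w_h=0.5, w_x=1.0):
    """Toy one-unit RNN: h_t = tanh(w_h * h_{t-1} + w_x * x_t)."""
    h = 0.0
    for x in inputs:
        h = math.tanh(w_h * h + w_x * x)
    return h

def first_word_influence(length):
    """Change in the final state when only the first input is flipped."""
    rest = [0.1] * (length - 1)
    return abs(run_rnn([1.0] + rest) - run_rnn([-1.0] + rest))

for n in (1, 5, 20, 50):
    print(n, first_word_influence(n))
```

With these weights the influence of the first input shrinks by at least half at every step, so by step 50 it is indistinguishable from rounding error. Trained networks are less extreme, but the underlying pressure is the same.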

Attention: all you need, apparently

In 2017, a team at Google published a paper called "Attention Is All You Need." The title was bold. The paper was correct. It introduced the Transformer architecture, which is the foundation of every major language model since: BERT, GPT-2, GPT-3, GPT-4, Claude, LLaMA, Mistral, all of them.

The core idea is attention: instead of processing text sequentially (left to right), let every word look at every other word simultaneously, and learn which words to pay attention to for each task. When the model processes the word "it" in "The cat sat on the mat because it was tired," the attention mechanism learns to look back at "cat" (not "mat") to resolve what "it" refers to.

This is not a metaphor. The model literally computes a score for every pair of words in a sentence: how much should word A attend to word B? These scores form a matrix (one row per word, one column per word), and the values tell you where the model is "looking."
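A minimal sketch of that computation, following the scaled dot-product formula from the original paper, softmax(QKᵀ/√d_k)V. The two-dimensional token vectors are hand-picked so that "it" lines up with "cat", and I reuse the same vectors as queries, keys, and values. A real transformer learns separate projections for each of those, so treat this as an illustration of the mechanics only.

```python
import math

def softmax(row):
    """Turn raw scores into positive weights that sum to 1."""
    highest = max(row)
    exps = [math.exp(value - highest) for value in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention over lists of vectors."""
    d_k = len(K[0])
    scores = [
        [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        for q in Q
    ]
    weights = [softmax(row) for row in scores]  # one row per query token
    output = [
        [sum(w * v[j] for w, v in zip(row, V)) for j in range(len(V[0]))]
        for row in weights
    ]
    return weights, output

tokens = ["cat", "mat", "it"]
# Hypothetical hand-made vectors; "it" is placed close to "cat".
X = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.2]]

weights, output = attention(X, X, X)
for token, row in zip(tokens, weights):
    print(token, [round(w, 2) for w in row])
```

Each printed row is one row of the attention matrix: how much that token looks at "cat", "mat", and "it" respectively. In the row for "it", the weight on "cat" is larger than the weight on "mat".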

Exercise 04 · Attention · click cells to explore

A simplified attention matrix

Brighter cells = stronger attention. In a real transformer, "it" would attend strongly to "cat" (coreference resolution). This demo simulates plausible patterns; a real model learns these from billions of examples. Each "head" in multi-head attention looks for a different type of relationship.

Tokenisation: how text actually enters the model

Before any of this can happen, the text needs to be broken into pieces the model can process. You might assume it's split into words, but it isn't. Modern LLMs use subword tokenisation (usually BPE: Byte-Pair Encoding). Common words stay whole ("the" = one token). Uncommon words get split into pieces ("tokenisation" might become "token" + "isation"). Very rare words can get split all the way down to individual characters or bytes.

This matters because the token is the fundamental unit of the model. It doesn't "see" words; it sees tokens. And the tokeniser is trained primarily on English text, which means English gets efficient, compact tokenisation. Zarma? A single Zarma word might become five or six tokens, each meaningless on its own. This is one of the mechanisms through which the language gap (from the data trust essay) manifests technically.
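The mechanism is easier to see in a toy version. The sketch below learns BPE merges from a tiny made-up corpus: start from single characters, then repeatedly glue together the most frequent adjacent pair. The corpus, the function names, and the number of merges are all my own assumptions; production tokenisers work on bytes and on vastly more text, but the loop is the same shape.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)

def learn_bpe(words, num_merges):
    """Learn an ordered list of merges from a list of training words."""
    vocab = Counter(tuple(word) for word in words)  # start from characters
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)  # most frequent pair
        merges.append(best)
        vocab = Counter({merge_pair(s, best): f for s, f in vocab.items()})
    return merges

def tokenize(word, merges):
    """Apply the learned merges, in order, to a new word."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)

# Hypothetical training corpus: only two English words.
corpus = ["the"] * 5 + ["token"] * 3
merges = learn_bpe(corpus, num_merges=6)

print(tokenize("the", merges))     # ['the']
print(tokenize("tokens", merges))  # ['token', 's']
print(tokenize("zarma", merges))   # ['z', 'a', 'r', 'm', 'a']
```

The last line is the language gap in miniature: a string the tokeniser never saw during training falls all the way back to single characters, so it costs five tokens where a familiar word costs one.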

Exercise 05 · Tokenisation · see the token cost of different languages

Compare how text gets tokenised

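In this simplified demo, English gets ~1 token per word, French ~1.2, and Zarma 2-3 or more (real tokenisers land a little higher for English, at roughly 1.3 tokens per word, but the ordering is the same). This token inflation means the model "sees" less context in low-resource languages for the same window size. It also means API costs are higher per word.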
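A quick back-of-the-envelope calculation shows what that inflation means in practice. The tokens-per-word ratios are the rough figures from this section (2.5 for Zarma is my assumed midpoint of the range), and the 8,000-token window is a hypothetical round number, not any particular model's limit.

```python
# Rough tokens-per-word ratios from the text.
# 2.5 for Zarma is an assumed midpoint of the "2-3+" range.
TOKENS_PER_WORD = {"English": 1.0, "French": 1.2, "Zarma": 2.5}

def words_that_fit(window_tokens, tokens_per_word):
    """How many whole words fit in a context window of the given size."""
    return int(window_tokens / tokens_per_word)

def relative_cost(language, baseline="English"):
    """Per-word cost relative to the baseline, if pricing is per token."""
    return TOKENS_PER_WORD[language] / TOKENS_PER_WORD[baseline]

WINDOW = 8000  # hypothetical context window, in tokens
for language, ratio in TOKENS_PER_WORD.items():
    print(language, words_that_fit(WINDOW, ratio), relative_cost(language))
# English 8000 1.0
# French 6666 1.2
# Zarma 3200 2.5
```

Under these assumptions, the same window holds well under half as much Zarma text as English text, and each Zarma word costs two and a half times as much to process.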

From BERT to GPT: the fork in the road

The transformer architecture was the shared foundation, but the field quickly split into two camps based on how the transformer was used.

BERT (2018, Google) is an encoder model. It reads the entire input at once and builds a rich representation of meaning. Good for understanding: classification, question answering, named entity recognition. This is the family of models I was using in 2022 when I was training my DistilBERT classifier on field survey data. DistilBERT was a distilled (smaller, faster) version of BERT that retained most of its multilingual capability. For a classification task on a specific dataset, it was genuinely good.

GPT (2018 onwards, OpenAI) is a decoder model. It reads text left-to-right and predicts the next token. Good for generation: writing text, conversation, code. It's autoregressive, meaning it generates one token at a time, feeding each output back as input for the next step. When you chat with ChatGPT or Claude, every token in the response was predicted one at a time.
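The autoregressive loop itself is simple enough to sketch. In the example below, the "model" is just a table of which word most often follows which, built from one made-up sentence. That is a big simplification I want to flag loudly: a real GPT conditions on the entire preceding context through attention, not only on the last token, and it usually samples rather than always taking the top choice. What carries over is the shape of the loop: predict, append, repeat.

```python
from collections import Counter, defaultdict

def train_bigram(tokens):
    """Count, for each token, which tokens follow it in the training text."""
    follows = defaultdict(Counter)
    for current, nxt in zip(tokens, tokens[1:]):
        follows[current][nxt] += 1
    return follows

def generate(follows, prompt, max_new_tokens):
    """Greedy autoregressive generation: each output becomes the next input."""
    output = list(prompt)
    for _ in range(max_new_tokens):
        candidates = follows.get(output[-1])
        if not candidates:
            break  # nothing ever followed this token in training
        next_token = candidates.most_common(1)[0][0]
        output.append(next_token)
    return output

# Hypothetical one-sentence training corpus.
corpus = "the cat sat on the mat because the cat was tired".split()
model = train_bigram(corpus)

print(generate(model, ["the"], max_new_tokens=5))
# ['the', 'cat', 'sat', 'on', 'the', 'cat']
```

Notice that the output starts looping almost immediately. With one token of context and no randomness, that is all a model like this can do, which is a fair hint at why context length and scale matter so much.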

The twist is that GPT-3 (2020) showed that decoder models, scaled up massively, can also do understanding tasks. They don't need task-specific fine-tuning; they can be prompted in natural language. "Classify this text as positive or negative:" and the model just does it. This is what made my fine-tuned DistilBERT classifier feel obsolete. Not because it was worse at my specific task, but because a general model could do a passable version of the same task with zero task-specific training data.

Scale: the uncomfortable truth

The transformer architecture is elegant. But the thing that actually made LLMs work at the level they do now is not architectural innovation. It's scale. More data, more parameters, more compute. GPT-2 had 1.5 billion parameters. GPT-3 had 175 billion. GPT-4's size has never been officially published; outside estimates put it at over a trillion. Each jump brought capabilities that were weak or absent in the smaller model: reasoning, code generation, multilingual transfer.

There is an active and important debate about whether scale alone is sufficient to reach general intelligence, or whether it produces increasingly impressive but fundamentally limited pattern matching. I don't have a settled view on this. What I do know is that the practical implications for field work are real: a model trained on a trillion tokens of primarily English text has a very different relationship to a Zarma health survey response than it does to an English policy document. The architecture is the same; the data is not.

The DistilBERT postscript

My classification model still works, by the way. It still does what I built it to do: sort survey responses into categories, in multiple languages, quickly and cheaply. It runs on a laptop. It doesn't need an API key. It doesn't hallucinate. For that specific task, it remains a better tool than a general LLM.

The lesson is that the "right" model depends on the task, the context, and the constraints. A fine-tuned BERT variant for a specific classification task is still often the correct choice. An LLM for open-ended generation, summarisation, and reasoning is often the correct choice. The field moved fast, but it didn't move in a straight line. Every layer in this stack (from bag of words to transformers) is still in use somewhere, solving problems it's well-suited for.

Resources for going deeper

If you want to build real intuition for these concepts (not just read about them), these are the resources I'd recommend. Start with the videos, then read the paper once you have the visual foundation.

Neural Networks · Chapters 1–4
3Blue1Brown · Grant Sanderson · 2017 · Playlist, 4 chapters · Foundational

Start here if you've never seen how a neural network actually learns. Chapters 1–4 cover the essentials; the GPT/Transformer episodes build on that foundation beautifully. The gradient-descent visualisation in Chapter 2 is worth the price of admission alone.

3Blue1Brown: the full catalogue
YouTube channel · @3blue1brown

Grant's visual explanations of neural networks, backpropagation, and linear algebra are the gold standard. The way he renders high-dimensional intuition as animation is genuinely unmatched.

Transformer architectures, line by line
Umar Jamil · Channel · @umarjamilai · Technical deep-dive

Umar walks through attention mechanisms and specific model implementations from scratch. Best if you have some coding background and want to understand what's happening inside the model, not just around it.

"Attention Is All You Need"
Vaswani · Shazeer · Parmar · Uszkoreit · Jones · Gomez · Kaiser · Polosukhin · NeurIPS 2017 · arXiv:1706.03762 · Primary source

Attention(Q,K,V) = softmax(QKᵀ/√d_k)V

The original transformer paper. Surprisingly readable. If you've followed this essay, you have enough context to understand what it's doing and why it mattered. Section 3.2 (the attention mechanism) is the key; the rest is implementation detail.

Notes

1. DistilBERT (Sanh et al., 2019) is about 60% the size of BERT-base and retains roughly 97% of its performance on the GLUE language-understanding benchmark. Its multilingual variant (distilbert-base-multilingual-cased) covers 104 languages, including French and Arabic, though performance varies significantly by language resource level.

2. The GPT acronym: Generative Pre-trained Transformer. "Generative" because it generates text. "Pre-trained" because the base model is trained on a large corpus before being fine-tuned or prompted. "Transformer" because it uses the architecture from the 2017 paper.

3. The tokeniser exercise uses a simplified heuristic to illustrate the principle. Real BPE tokenisers (like tiktoken for OpenAI's GPT models, or the SentencePiece tokeniser used by the original LLaMA models) produce different splits. The directional observation (English more efficient than low-resource languages) holds true in the real tokenisers as well.