How LLMs Work: Layman to Hero

Deep Dive · 2025

How Large Language Models Actually Work

A complete guide from “what even is a neural network?” to transformers, RLHF, inference, and scaling laws — no PhD required.

By a curious engineer · ~22 min read · Layman → Hero

What is a neural network? Layman
The hardware stack — where does it run? Layman
Tokens — the alphabet of AI Layman
Embeddings — meaning as coordinates Layman
The Transformer architecture Intermediate
Attention — how the model “thinks” Intermediate
Pre-training on the internet Intermediate
RLHF — teaching it to be helpful Intermediate+
Fine-tuning and adapters Hero
Inference — how generation works Hero
Scaling laws and emergent abilities Hero

In late 2022, a chatbot wrote a sonnet, debugged Python, explained black holes to a 10-year-old, and passed the bar exam — all in the same week. The world lost its mind. Engineers, teachers, lawyers, and your uncle at Thanksgiving all asked the same question: how does it actually do that?

This guide answers that question completely. We start from absolute zero — no math, no jargon — and by the end you’ll understand the full engineering stack: tokens, embeddings, the transformer, attention, training at scale, RLHF, fine-tuning, inference, and the mysterious “emergent abilities” that surprised even the researchers who built these systems.

📍 How to read this. Each section carries a difficulty badge. If you’re already comfortable with one concept, jump ahead. If you’re starting fresh, read top to bottom — each section builds on the last.

Section 01

🌱 Layman

What is a neural network?

Before we talk about LLMs, transformers, or tokens — we need to answer a more basic question that most articles skip. What on earth is a neural network, and why does it matter?

Start with the word “neural.” It comes from neurons — the cells in your brain. Your brain has roughly 86 billion neurons. Each neuron receives electrical signals from others, does a tiny calculation, and either fires a signal forward or stays quiet. The magic isn’t in any single neuron; it’s in the connections between billions of them.

A neural network is a software imitation of this idea. Instead of biological cells, we have mathematical functions called nodes. Instead of biological synapses, we have numbers called weights. And instead of electricity, we have numbers flowing through layers of calculations.

🏭 Analogy

Think of a neural network like an assembly line with many workers. You send raw material (your input — say, a photo or a sentence) into one end. Each worker (layer) looks at what the previous worker passed them, does their small job, and passes it forward. By the time the material exits the last worker, it’s been transformed into a finished product — a classification, a prediction, a generated sentence.

Fig 1. A simple neural network. Each circle is a node; each line is a weight. Numbers flow left → right, getting transformed at each layer until the output emerges.

What are “weights” and why do they matter?

Every connection between nodes has a weight — a number that controls how strongly that connection influences the next node. A weight of 0.9 means “pass nearly everything through.” A weight of 0.01 means “almost ignore this.” A negative weight means “this actually pushes against the output.”
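
To make that concrete, here is a minimal sketch of what a single node does with its weights. The function name `node_output` and the ReLU cutoff at zero are illustrative assumptions (the text above doesn't name an activation function); the weights are the three examples just mentioned.

```python
def node_output(inputs, weights, bias=0.0):
    """One node: multiply each input by its weight, add everything up,
    then apply ReLU (an assumed activation: negative totals become 0)."""
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return max(0.0, total)


# Three incoming signals of equal strength, with the three example weights.
inputs = [1.0, 1.0, 1.0]
weights = [0.9, 0.01, -0.5]  # "pass through", "almost ignore", "push against"

result = node_output(inputs, weights)
print(round(result, 2))  # 0.9 + 0.01 - 0.5 = 0.41
```

A real network stacks many of these nodes into layers, but each one is doing nothing more exotic than this weighted sum.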

A small neural network might have thousands of weights. GPT-4 is rumoured to have on the order of a trillion, though OpenAI has not published the figure. The entire “knowledge” of an LLM (every fact it knows about history, science, language, code) is encoded in those numbers. Training is the process of figuring out what all those numbers should be.

How does a network learn?

It learns by making guesses and getting corrected — billions of times. You show it an input, it produces an output, you compare the output to the correct answer, and you calculate how wrong it was (the loss). Then you use a technique called backpropagation to trace that error back through the network and nudge every weight slightly in the direction that would have produced a better answer. Do this enough times, and the weights settle into values that make the network good at its task.
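
The guess-and-correct loop can be sketched with the smallest possible "network": one weight. This is a toy under stated assumptions, not how a real framework does it: the data, the learning rate, and the name `train_single_weight` are all made up for illustration, and with only one weight the "backpropagation" step collapses to a single derivative.

```python
def train_single_weight(examples, learning_rate=0.05, steps=200):
    """Learn w so that w * x is close to y, by repeated small corrections."""
    w = 0.0  # start with a bad guess
    for _ in range(steps):
        grad = 0.0
        for x, y in examples:
            error = w * x - y      # how wrong was the guess?
            grad += 2 * error * x  # derivative of error**2 with respect to w
        # Nudge w in the direction that reduces the average squared error.
        w -= learning_rate * grad / len(examples)
    return w


# Hidden rule the network has to discover: y = 3 * x
examples = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]
learned_w = train_single_weight(examples)
print(round(learned_w, 3))  # very close to 3.0
```

Nobody told the code that the rule was "multiply by three"; the weight drifted there because that value makes the loss smallest.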

🔑 The key insight. Nobody hand-codes the rules. Nobody writes “if the user asks about Paris, mention the Eiffel Tower.” The network discovers its own rules by adjusting weights across trillions of examples. The rules live in the numbers — implicit, distributed, and often impossible for even the researchers to fully read back out.

So where does “Large Language Model” fit?

An LLM is a neural network specifically designed and trained to understand and generate language. The “large” refers to the scale: billions to trillions of weights. The specific type of neural network architecture used in virtually all modern LLMs is called the Transformer, which we’ll cover in Section 5. But first: where does all this actually run?

Section 02

🌱 Layman

The hardware stack — where does an LLM actually run?

A neural network with a trillion parameters isn’t running on a laptop. Understanding the hardware stack gives you a much clearer picture of why LLMs are expensive, why inference has latency, and why companies like NVIDIA became trillion-dollar businesses overnight.

There are three distinct hardware contexts where neural networks operate, and they have very different characters.

Fig 2. The three-tier LLM hardware stack. Training happens once on massive GPU clusters. Inference runs continuously on server-side GPUs. Users only see the API responses on their devices.

Tier 1 — Training clusters (where learning happens)

Training a frontier model requires thousands of high-end GPUs — typically NVIDIA H100s — running in parallel for months. A single H100 costs around $30,000. OpenAI, Google, and Anthropic operate clusters with tens of thousands of them, linked by ultra-fast interconnects so gradients can flow between chips in microseconds.

Why GPUs and not regular CPUs? Because neural network training is fundamentally matrix multiplication — multiplying huge grids of numbers together. GPUs were originally designed for video games, which also require enormous amounts of parallel matrix math (rendering 3D scenes). That same architecture turns out to be perfect for neural networks.

🎮 Analogy

A CPU is like a brilliant professor who can solve any problem but works on one problem at a time. A GPU is like a lecture hall of 10,000 students who are each slightly less brilliant, but can all work in parallel. For matrix multiplication, the lecture hall wins every time.

~$30K: cost of a single NVIDIA H100 GPU

16,384: H100s used to train Meta’s Llama 3.1 405B

80 GB: memory per H100, the budget the model’s weights must fit into (split across several GPUs if necessary)

Tier 2 — Inference servers (where your prompts are answered)

Once training is complete, the model’s weights are frozen and deployed on inference servers. These are also GPUs, but often slightly older or smaller ones, since serving a single request takes far less compute than a training run. The entire model must fit in GPU memory to run efficiently. A 70-billion parameter model in 16-bit precision (2 bytes per parameter) takes about 140 GB, so it needs at least two 80 GB A100 GPUs just to hold the weights, before doing any computation.
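
The memory arithmetic is simple enough to check yourself. The sketch below counts only the weights (no activations, no KV cache, no framework overhead), so treat the results as a lower bound; the function names are hypothetical.

```python
import math


def weight_memory_gb(num_params, bits_per_param=16):
    """Memory needed just to store the weights, in gigabytes (10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


def min_gpus_for_weights(num_params, gpu_memory_gb, bits_per_param=16):
    """Smallest number of GPUs whose combined memory can hold the weights."""
    return math.ceil(weight_memory_gb(num_params, bits_per_param) / gpu_memory_gb)


memory_16bit = weight_memory_gb(70e9)       # 70B parameters at 2 bytes each
gpus_16bit = min_gpus_for_weights(70e9, 80)  # on 80 GB cards
memory_4bit = weight_memory_gb(70e9, bits_per_param=4)

print(memory_16bit, gpus_16bit, memory_4bit)
```

The last line also shows why lower-precision formats are popular for serving: storing the same 70B weights in 4 bits instead of 16 cuts the footprint to a quarter.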

This is why model size is so closely watched. A model that’s twice as large costs roughly twice as much to serve per token — and when you’re handling millions of requests a day, that compounds fast.

Tier 3 — Your device (just the display)

When you type a message to Claude or ChatGPT, your device sends that text over the internet to Tier 2. The server computes the response, and streams tokens back to you one by one. Your laptop or phone does almost no AI computation — it’s just displaying text. The heavy lifting is entirely server-side.

Exception: some smaller “edge” models like Llama running locally via tools like Ollama can run on consumer laptops — but they’re significantly smaller and less capable than frontier models. The capability gap between “fits on my MacBook” and “requires a $100M data center” is enormous.

💡 Why this matters for understanding LLMs. The neural network is the weights — those trillions of numbers. Hardware is simply the medium that stores and executes them. Training is about finding the right weights. Inference is about using them. Everything else in this article is about how those weights encode intelligence.

Section 03

🌱 Layman

Tokens — the alphabet of AI

Before an LLM can do anything, it needs to convert raw text into something a computer can process. That “something” is called a token.

A token is not a word. It’s more like a syllable or a chunk. The sentence “I love pizza” becomes three tokens: I, love, pizza. But “unbelievable” might split into un, bel, iev, able — four tokens. Common short words are one token; rare long words get chopped up.

🧩 Analogy

Think of tokens like Scrabble tiles. You don’t play the word “extraordinary” as one piece; you lay down individual tiles. The AI’s vocabulary is a fixed set of tiles (roughly 50,000 for older GPT models, about 100,000 for GPT-4), and every sentence in existence can be spelled out using them.

Fig 3. “ChatGPT is surprisingly good!” → 7 tokens. Note how “surprisingly” splits across two tokens — rare or long words get broken up.

~100K: tokens in GPT-4’s vocabulary

~¾: of a word per token on average in English

2T+: tokens estimated to have been used to train GPT-4 (OpenAI has not published the figure)

Why not just use letters? Because then the model would need billions of steps to learn that c-a-t means the same animal as c-a-t-s. Tokens give the model meaningful chunks to work with — richer than letters, cheaper than full words.
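
Here is a toy tokenizer to show the chunking idea. It is a deliberate simplification: real LLM tokenizers learn their vocabulary from data (typically with byte-pair encoding), whereas this sketch uses a tiny hand-written vocabulary and a greedy "longest piece first" rule. The vocabulary and the function name `tokenize` are invented for illustration.

```python
def tokenize(text, vocab):
    """Split text into the longest vocabulary pieces, scanning left to right."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink it one character at a time.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab:
                tokens.append(piece)
                i = j
                break
        else:
            # Nothing matched: fall back to a single character.
            tokens.append(text[i])
            i += 1
    return tokens


# Hypothetical vocabulary; note that spaces are part of the tokens.
vocab = {"I", " love", " pizza", "un", "bel", "iev", "able"}

print(tokenize("I love pizza", vocab))   # ['I', ' love', ' pizza']
print(tokenize("unbelievable", vocab))   # ['un', 'bel', 'iev', 'able']
```

In a real system each piece is then replaced by its integer ID in the vocabulary, and those IDs are what the model actually sees.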

Section 04

🌱 Layman

Embeddings — meaning as coordinates

Once we have tokens, we need to give them meaning. Computers work with numbers, not English. So every token gets converted into a list of numbers — a vector in a high-dimensional space. This is called an embedding.

🗺️ Analogy

Imagine a city map where words are locations. “King” and “Queen” are in the same neighbourhood. “Dog” and “puppy” are neighbours. “Paris” is near “France” in one direction, and near “London” in another. The model learns these coordinates by reading billions of sentences — positions that capture meaning, grammar, and context.

Fig 4. A 2D “slice” of embedding space. Similar meanings cluster together. The famous formula: King − Man + Woman ≈ Queen works because relationships are encoded as directions in this space.

In reality, embeddings have thousands of dimensions: the largest GPT-3 model uses 12,288, and OpenAI has not published the figure for GPT-4. Each dimension captures some aspect of meaning that humans never explicitly defined. The model learned all of it just by predicting text.
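
The King − Man + Woman ≈ Queen trick can be reproduced with made-up numbers. The five 2-D vectors below are hand-picked for illustration (one axis loosely meaning "royalty", the other "gender"); real embeddings are learned, not written by hand, and have thousands of dimensions.

```python
import math


def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, -1.0 means opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings: [royalty, gender]
embeddings = {
    "king":  [0.9, 0.8],
    "queen": [0.9, -0.8],
    "man":   [0.1, 0.8],
    "woman": [0.1, -0.8],
    "pizza": [-0.7, 0.1],
}

# king - man + woman, computed coordinate by coordinate
target = [
    k - m + w
    for k, m, w in zip(embeddings["king"], embeddings["man"], embeddings["woman"])
]

# Find the closest remaining word to the target point.
candidates = [word for word in embeddings if word not in {"king", "man", "woman"}]
nearest = max(candidates, key=lambda word: cosine(embeddings[word], target))
print(nearest)  # queen
```

Subtracting "man" and adding "woman" moves the point along the gender axis while leaving the royalty axis alone, which is exactly what "relationships are encoded as directions" means.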

“The model doesn’t know what ‘king’ means — it knows that ‘king’ appears in similar contexts as ‘ruler’, ‘throne’, and ‘sovereign’, and that’s enough.” — A rough paraphrase of how it actually works

Section 05

⚡ Intermediate

The Transformer architecture

In 2017, a Google paper titled “Attention Is All You Need” introduced the Transformer. It wasn’t the first neural network for text, but it was by far the best — and it’s the foundation of every major LLM today: GPT-4, Claude, Gemini, Llama, all of them.

The transformer takes your token embeddings and processes them through a stack of layers. Each layer has two main components: an attention mechanism and a feed-forward network. Think of layers as passes of reasoning — the deeper the layer, the more abstract the understanding.

Fig 5. The Transformer pipeline. Tokens flow through an embedding layer, then through N identical layers (each with attention + FFN), then a head predicts the next token.

One subtle but crucial detail: the transformer also adds positional encodings to the embeddings. Without these, the model could not tell “dog bit man” from “man bit dog”, because the words are the same and only the order differs. Positional encodings inject sequence information so the model can tell position 1 from position 10 from position 100.
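
One concrete way to build positional encodings is the sinusoidal scheme from the original “Attention Is All You Need” paper, sketched below. This is only one option: many current LLMs use learned or rotary position schemes instead, and the toy embedding for “dog” is invented for the example.

```python
import math


def positional_encoding(position, d_model):
    """Sinusoidal encoding: sine on even dimensions, cosine on odd ones,
    with wavelengths that grow geometrically across the dimensions."""
    encoding = []
    for i in range(0, d_model, 2):
        angle = position / (10000 ** (i / d_model))
        encoding.append(math.sin(angle))
        encoding.append(math.cos(angle))
    return encoding[:d_model]


def add_position(embedding, position):
    """Add the positional encoding to a token embedding, element by element."""
    pe = positional_encoding(position, len(embedding))
    return [e + p for e, p in zip(embedding, pe)]


dog = [0.2, -0.1, 0.4, 0.3]     # hypothetical 4-D embedding for "dog"
dog_first = add_position(dog, 0)  # "dog bit man"
dog_last = add_position(dog, 2)   # "man bit dog"
print(dog_first != dog_last)      # True: same word, different input vector
```

Because the same token now produces a different vector at each position, the layers that follow can tell who bit whom.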

Section 06

⚡ Intermediate

Attention — how the model “thinks”

This is the secret sauce. The attention mechanism lets every token look at every other token in the input and decide: how much should I borrow meaning from you?

Consider the sentence: “The animal didn’t cross the street because it was too tired.” What does “it” refer to? Humans immediately know it’s the animal, not the street. Attention is how the model knows too — the token “it” attends heavily to “animal” and lightly to “street”.

🔦 Analogy

Imagine each word in a sentence has a spotlight. When it is the turn of “it” to update its meaning, it shines its spotlight on every other word. The words the spotlight hits brightest are the ones that most influence how “it” gets understood in this context.

Fig 6. When processing “it”, the attention mechanism assigns high weight to “animal” (0.72), moderate weight to “tired”, and low weight to “street”. This is how context is resolved.

Multi-head attention

Here’s where it gets interesting. The model doesn’t run attention just once; it runs several attention heads in parallel. Each head learns to attend to different types of relationships: one head might focus on syntax, another on coreference (like our “it” → “animal” example), and another on semantic similarity. The largest GPT-3 model has 96 attention heads per layer; the figure for GPT-4 has not been published.

All the heads’ outputs are concatenated and projected back together, giving the model a rich, multi-perspective view of every token in every position.

🔑 The Q, K, V intuition. Each token generates three vectors: a Query (“what am I looking for?”), a Key (“what do I contain?”), and a Value (“what do I give you?”). Attention scores are computed by matching Queries against Keys — like a search engine — then using the scores to blend Values.
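
The Query/Key/Value recipe fits in a few lines of plain Python. This sketch follows the scaled dot-product form from the Transformer paper (dot products divided by the square root of the key size, then a softmax); those two details are not spelled out in the text above, and the tiny vectors for “animal”, “street”, and “it” are invented so the result is easy to read. It covers a single head and leaves out the causal mask that GPT-style models add.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """For each query: score it against every key, then blend the values."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys
        ]
        weights = softmax(scores)
        blended = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(blended)
        all_weights.append(weights)
    return outputs, all_weights


# Hypothetical vectors, in the order: "animal", "street", "it"
keys = [[2.0, 0.0], [0.0, 2.0], [0.5, 0.5]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
queries = [[2.0, 0.0]]  # the query produced by "it"

outputs, weights = attention(queries, keys, values)
print([round(w, 2) for w in weights[0]])  # largest weight lands on "animal"
```

Multi-head attention runs several copies of this with different learned projections and concatenates the results.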

Section 07

⚡ Intermediate

Pre-training on the internet

All that architecture is just plumbing. The magic happens in training — the process that fills those billions of parameters with knowledge.

Pre-training is simple in concept and extraordinary in scale. You feed the model vast amounts of text and ask it to do one thing: predict the next token. Given “The Eiffel Tower is in”, the model must predict “Paris”. Given “To be or not to”, it must predict “be”. Do this trillions of times, and something remarkable happens — to predict the next word well, the model has to understand what it’s reading.

Fig 7. The pre-training loop, repeated trillions of times. Each step nudges the model’s parameters slightly toward better predictions, accumulating into knowledge.
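
What does "how wrong was the prediction" look like as a number? A common choice, assumed here since the text doesn't name the loss, is cross-entropy: the negative log of the probability the model gave to the token that actually came next. The probabilities below are invented.

```python
import math


def next_token_loss(predicted_probs, correct_token):
    """Cross-entropy for one prediction: -log(probability of the right token)."""
    return -math.log(predicted_probs[correct_token])


# Two hypothetical models completing "The Eiffel Tower is in"
confident = {"Paris": 0.90, "London": 0.05, "pizza": 0.05}
unsure = {"Paris": 0.20, "London": 0.40, "pizza": 0.40}

loss_confident = next_token_loss(confident, "Paris")
loss_unsure = next_token_loss(unsure, "Paris")
print(loss_confident < loss_unsure)  # True: better prediction, lower loss
```

Pre-training is the process of pushing this number down, averaged over trillions of positions in the training text.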

~1T: parameters in frontier models (GPT-4 estimated)

$100M+: compute cost for a frontier training run

~3 months: training time on thousands of GPUs

The training data includes books, Wikipedia, code repositories, scientific papers, forums, websites: essentially a large sample of human written knowledge. The model doesn’t store it all verbatim (although it can memorize passages that are repeated often). Instead, it mostly learns to compress the data into patterns, relationships, and rules that let it generate new text that follows similar distributions.

Section 08

⚡ Intermediate+

RLHF — teaching it to be helpful

After pre-training, you have a model that’s incredibly good at completing text — but it might complete “How do I make a bomb” just as readily as “What’s a good pasta recipe”. It has no sense of what makes a response good.

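This is where Reinforcement Learning from Human Feedback (RLHF) comes in. It’s a three-step process that aligns the model with human values and preferences: first, fine-tune the pre-trained model on human-written example responses; second, train a reward model on human rankings of model outputs; third, use reinforcement learning to optimize the model against that reward model.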

Fig 8. The RLHF pipeline. Three phases transform a raw pre-trained model into a helpful, harmless, honest assistant.

The beauty of RLHF is that humans don’t need to specify every rule. Instead of writing “be polite in situation X, be concise in situation Y”, human raters just pick the better of two responses. The reward model infers the patterns. It’s teaching by example, not by rule-writing.
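
The "pick the better of two" signal turns into a training loss quite directly. A common formulation, assumed here, is a pairwise logistic loss: the reward model is penalized whenever it scores the rejected response close to or above the chosen one. The reward numbers below are made up, and `preference_loss` is a hypothetical name.

```python
import math


def preference_loss(reward_chosen, reward_rejected):
    """Pairwise loss: -log(sigmoid(reward_chosen - reward_rejected))."""
    margin = reward_chosen - reward_rejected
    return math.log(1.0 + math.exp(-margin))


# The reward model agrees with the human rater: small loss.
agrees = preference_loss(reward_chosen=2.0, reward_rejected=0.0)

# The reward model prefers the response the human rejected: large loss.
disagrees = preference_loss(reward_chosen=0.0, reward_rejected=2.0)

print(agrees < disagrees)  # True
```

Minimizing this over many human comparisons is what lets the reward model infer the unwritten rules behind the raters' choices.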

⚠️ The alignment problem lurking here. The reward model is trained on human preferences — but humans can have biases, inconsistencies, and blind spots. An imperfectly trained reward model can cause the LLM to “reward hack” — finding responses that score high on the metric but aren’t actually good. This is an active research area.

Section 09

🦸 Hero

Fine-tuning and adapters

Pre-training + RLHF gives you a general-purpose assistant. But what if you want a model that specialises in medical diagnoses, or legal contracts, or customer support for your SaaS product? That’s where fine-tuning comes in.

Fine-tuning means taking a pre-trained model and continuing to train it on a smaller, domain-specific dataset. The weights shift slightly to accommodate the new knowledge, while largely retaining what was learned before. Think of it like a software engineer who already knows Python learning a new framework — they don’t forget Python, they just add a new skill.

The problem: fine-tuning is expensive

Full fine-tuning updates all the billions of parameters, so every training step needs about as much memory and compute as a pre-training step, even though far fewer steps are required. Enter Parameter-Efficient Fine-Tuning (PEFT) methods, especially LoRA (Low-Rank Adaptation).

Fig 9. LoRA freezes the original weight matrix W and adds two small matrices A and B. During inference, the outputs are summed. Training only A and B costs a fraction of full fine-tuning.

LoRA works because of a hypothesis: the changes needed to specialize a model live in a low-dimensional subspace. Instead of modifying a 4096×4096 matrix (about 16.8M parameters), you train two small matrices of rank 16, A (4096×16) and B (16×4096): just 2×4096×16 = 131K parameters, under 1% of the original. At inference time, you can merge them back into the frozen weights as W + AB, so the adapted model runs at the same speed as the original for a tiny training cost.
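
Both claims, the parameter count and the "merge gives the same answer" property, can be checked with a small sketch. It uses tiny integer matrices so the arithmetic is exact, follows this article's W + AB notation, and leaves out the scaling factor that practical LoRA implementations apply to AB. The helper names are hypothetical.

```python
def matmul(X, Y):
    """Plain-Python matrix product of two lists of rows."""
    return [
        [sum(X[i][k] * Y[k][j] for k in range(len(Y))) for j in range(len(Y[0]))]
        for i in range(len(X))
    ]


def matadd(X, Y):
    """Element-wise sum of two matrices of the same shape."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]


def lora_param_count(d_in, d_out, rank):
    """Trainable numbers in A (d_in x rank) plus B (rank x d_out)."""
    return rank * (d_in + d_out)


full_params = 4096 * 4096                        # 16,777,216
lora_params = lora_param_count(4096, 4096, 16)   # 131,072

# Tiny rank-1 example: frozen W (2x2), trainable A (2x1) and B (1x2).
W = [[1, 2], [3, 4]]
A = [[1], [2]]
B = [[3, 4]]
x = [[1, 1]]  # one input row

# During training: frozen path plus adapter path, summed.
separate = matadd(matmul(x, W), matmul(matmul(x, A), B))

# For deployment: fold the adapter into the weights once.
merged = matmul(x, matadd(W, matmul(A, B)))

print(separate == merged, lora_params / full_params)
```

Because the merged matrix has the same shape as W, serving the fine-tuned model costs exactly what serving the original did.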

Section 10

🦸 Hero

Inference — how generation works

When you chat with Claude or ChatGPT, what’s happening behind the scenes? The model doesn’t produce the whole response at once — it generates one token at a time, feeding each token back in as input to generate the next.

Fig 10. Autoregressive generation: each new token is appended to the context and the model runs again. This is why output appears token by token in your browser.
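
The generation loop itself is tiny; all the complexity lives inside the model call. In the sketch below the "model" is a hypothetical lookup table that only looks at the last token, standing in for a real transformer, and `<eos>` is an assumed name for the end-of-sequence token.

```python
def generate(next_token_fn, prompt_tokens, max_new_tokens=10, eos="<eos>"):
    """Autoregressive decoding: predict a token, append it, repeat."""
    context = list(prompt_tokens)
    for _ in range(max_new_tokens):
        token = next_token_fn(context)  # the model sees the whole context
        if token == eos:
            break
        context.append(token)           # the output becomes part of the input
    return context


# Toy stand-in for an LLM: picks the next token from the last one only.
toy_table = {"The": "cat", "cat": "sat", "sat": "down", "down": "<eos>"}


def toy_model(context):
    return toy_table.get(context[-1], "<eos>")


print(generate(toy_model, ["The"]))  # ['The', 'cat', 'sat', 'down']
```

A chat interface streams each appended token to your screen as soon as it exists, which is why responses appear to be typed out.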

Temperature and sampling

At each step the model outputs a probability distribution over all ~100K tokens. How do you pick which one to use? That’s controlled by temperature:

Temperature = 0: always pick the highest-probability token. Deterministic and predictable, but can be repetitive. Good for factual Q&A.
Temperature = 0.7: sample from the distribution after sharpening it slightly toward the likeliest tokens. More creative, some randomness. A common default for chatbots.
Temperature = 1.5+: heavily randomized. Can produce wild creativity or complete nonsense.

Beyond temperature, there’s also top-p (nucleus) sampling — only consider the smallest set of tokens whose combined probability exceeds p (e.g. 0.9). This prevents the model from ever picking extremely unlikely tokens, which helps quality while preserving creativity.
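
Temperature and top-p can be combined in one small sampler. This is a sketch of the idea rather than any particular library's implementation: the model's raw scores ("logits") are divided by the temperature before the softmax, and top-p then trims the tail. The function names and the example numbers are assumptions.

```python
import math
import random


def apply_temperature(logits, temperature):
    """Softmax over logits / temperature. Lower temperature = sharper peak."""
    scaled = [value / temperature for value in logits]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, p):
    """Keep the most likely tokens until their total probability reaches p,
    then renormalize. Returns {token_id: probability}."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}


def sample_token(logits, temperature=0.7, top_p=0.9, rng=random):
    """Pick the next token ID from the model's raw scores."""
    if temperature == 0:
        # Greedy decoding: no randomness at all.
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = apply_temperature(logits, temperature)
    candidates = top_p_filter(probs, top_p)
    ids = list(candidates)
    return rng.choices(ids, weights=[candidates[i] for i in ids], k=1)[0]


logits = [2.0, 1.0, 0.1]  # hypothetical scores for a 3-token vocabulary
print(sample_token(logits, temperature=0))  # 0, the highest-scoring token
```

Passing a seeded `random.Random` instance as `rng` makes the sampled output reproducible, which is handy when debugging.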

KV Cache — the inference speed trick

Running the full transformer on every token in your growing context at every step would be agonizingly slow. The KV cache solves this: the model stores the Key and Value vectors for all previous tokens as it goes. On the next step, it only has to compute the Query, Key, and Value for the new token and attend over the cached entries, not re-compute everything from scratch. This cuts the attention work per generated token from roughly O(n²) to O(n) in the context length n, which is the difference between a fast chatbot and an unusable one.
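
A minimal sketch of the caching idea for a single attention head follows. It assumes the Query, Key, and Value vectors for each new token are handed in directly (in a real model they come from learned projections), and the two counting helpers simply tally attention scores under the assumption that an uncached model re-runs causal attention over the whole context at every step.

```python
import math


def attend(query, keys, values):
    """Attention output for one query over all stored keys and values."""
    d_k = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k) for key in keys
    ]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [
        sum((e / total) * v[j] for e, v in zip(exps, values))
        for j in range(len(values[0]))
    ]


class KVCache:
    """Remembers Keys and Values so earlier tokens are never reprocessed."""

    def __init__(self):
        self.keys = []
        self.values = []

    def step(self, query, key, value):
        # Store only the new token's Key and Value, then attend over everything.
        self.keys.append(key)
        self.values.append(value)
        return attend(query, self.keys, self.values)


def scores_with_cache(n):
    """Step t computes t scores (one query against t cached keys)."""
    return sum(t for t in range(1, n + 1))


def scores_without_cache(n):
    """Step t redoes causal attention for all t positions: t*(t+1)/2 scores."""
    return sum(t * (t + 1) // 2 for t in range(1, n + 1))


cache = KVCache()
cache.step([1.0, 0.0], [1.0, 0.0], [1.0, 2.0])
latest = cache.step([0.0, 1.0], [0.0, 1.0], [3.0, 4.0])

print(scores_with_cache(1000), scores_without_cache(1000))
```

The price of the speed-up is memory: the cache grows with every token in the context, which is one reason long conversations are expensive to serve.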

Section 11

🦸 Hero

Scaling laws and emergent abilities

Perhaps the most surprising discovery in AI research isn’t a new architecture — it’s that making existing architectures bigger and training them on more data reliably produces better models. This regularity is captured in scaling laws.

Fig 11. Model loss drops predictably with compute — a power law relationship. The Chinchilla paper (2022) showed that most pre-2022 models were over-parameterized and under-trained on data.

The Chinchilla insight

In 2022, DeepMind published the “Chinchilla” paper with a counterintuitive finding: GPT-3 (175B parameters, trained on 300B tokens) was severely under-trained. Optimal compute allocation means scaling parameters and data in roughly equal proportions, which works out to around 20 training tokens per parameter. Chinchilla itself, a 70B parameter model trained on 1.4 trillion tokens, beats the 175B model trained on fewer tokens. For a fixed compute budget, more data per parameter beats brute-force parameter count.
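
The Chinchilla comparison is easy to reproduce with back-of-the-envelope arithmetic. The sketch below leans on two widely used approximations that are assumptions here, not exact laws: training compute is about 6 × parameters × tokens (in FLOPs), and the compute-optimal ratio is about 20 tokens per parameter.

```python
import math


def tokens_per_parameter(params, tokens):
    """How much training data each parameter 'sees'."""
    return tokens / params


def training_flops(params, tokens):
    """Rule-of-thumb training compute: about 6 FLOPs per parameter per token."""
    return 6 * params * tokens


def compute_optimal_split(compute_flops, ratio=20):
    """Given a compute budget C, solve C = 6 * N * D with D = ratio * N."""
    params = math.sqrt(compute_flops / (6 * ratio))
    return params, ratio * params


gpt3_ratio = tokens_per_parameter(175e9, 300e9)        # under 2 tokens per parameter
chinchilla_ratio = tokens_per_parameter(70e9, 1.4e12)  # about 20

# What would the rule of thumb have suggested for GPT-3's compute budget?
gpt3_budget = training_flops(175e9, 300e9)
better_params, better_tokens = compute_optimal_split(gpt3_budget)

print(round(gpt3_ratio, 2), round(chinchilla_ratio, 1))
print(f"{better_params:.2e} parameters, {better_tokens:.2e} tokens")
```

Under these assumptions, GPT-3's compute budget would have been better spent on a noticeably smaller model trained on several times more text.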

Emergent abilities

Here’s the uncanny part: as models scale, some capabilities don’t improve gradually — they appear suddenly, at a threshold. A model below 10B parameters can barely do 3-digit arithmetic. Cross some threshold and a 100B model solves it reliably. These emergent abilities include chain-of-thought reasoning, few-shot learning, code generation, and more.

🧠 Why do abilities emerge? The honest answer: nobody fully knows. One theory is that complex tasks require multiple sub-skills, each learned separately, and only when all are present does the combined capability “click”. Another theory: the tasks require the model to represent abstractions that only become stable above certain parameter thresholds. Active research continues.

10×: compute increase every ~2.5 years in frontier models

~67: emergent abilities identified as of 2023

15T+: tokens used to train Llama 3 frontier models

Conclusion

Putting it all together

Let’s zoom all the way back out. When you type a question and an LLM answers you, here is the full pipeline in one breath:

1. An LLM is a neural network: a software system of weighted connections that learns by adjusting numbers across billions of examples. It runs on GPU clusters during training and on inference servers when you talk to it.
2. Your text is split into tokens: subword chunks from a fixed vocabulary.
3. Each token is converted to an embedding: a high-dimensional vector encoding meaning and position.
4. Embeddings pass through N transformer layers, each running multi-head attention (where every token considers every other token) and a feed-forward network.
5. A prediction head outputs a probability distribution over all ~100K tokens. Temperature + sampling pick the next one.
6. That token is appended to the context (cached via KV cache), and steps 4 and 5 repeat until the model produces an end-of-sequence token.
7. All of this was shaped by pre-training on trillions of tokens, refined by RLHF on human preferences, and optionally specialized via fine-tuning with adapters like LoRA.
8. The model’s capabilities follow scaling laws (more compute + data → lower loss → better reasoning), with some abilities appearing suddenly at scale thresholds.

There’s no magic here — just elegant mathematics applied at extraordinary scale. The wonder isn’t that it works. The wonder is that next-token prediction, that simple-sounding game of “guess what comes next”, turned out to be enough.

🌊 Final thought

A single water molecule has no idea it’s part of a wave. No individual weight in a neural network “knows” what Paris is. But a trillion weights, shaped by human language, can discuss it — in any language, from any angle, at any level of depth. That’s the scale at which something new emerges.
