How Large Language Models Actually Work

Deep Dive · 2025

A complete guide from “what even is a neural network?” to transformers, RLHF, inference, and scaling laws — no PhD required.

By a curious engineer · ~22 min read · Layman → Hero

In late 2022, a chatbot wrote a sonnet, debugged Python, explained black holes to a 10-year-old, and passed the bar exam — all in the same week. The world lost its mind. Engineers, teachers, lawyers, and your uncle at Thanksgiving all asked the same question: how does it actually do that?

This guide answers that question completely. We start from absolute zero — no math, no jargon — and by the end you’ll understand the full engineering stack: tokens, embeddings, the transformer, attention, training at scale, RLHF, fine-tuning, inference, and the mysterious “emergent abilities” that surprised even the researchers who built these systems.

📍 How to read this. Each section carries a difficulty badge. If you’re already comfortable with one concept, jump ahead. If you’re starting fresh, read top to bottom — each section builds on the last.

🌱 Layman

What is a neural network?

Before we talk about LLMs, transformers, or tokens — we need to answer a more basic question that most articles skip. What on earth is a neural network, and why does it matter?

Start with the word “neural.” It comes from neurons — the cells in your brain. Your brain has roughly 86 billion neurons. Each neuron receives electrical signals from others, does a tiny calculation, and either fires a signal forward or stays quiet. The magic isn’t in any single neuron; it’s in the connections between billions of them.

A neural network is a software imitation of this idea. Instead of biological cells, we have mathematical functions called nodes. Instead of biological synapses, we have numbers called weights. And instead of electricity, we have numbers flowing through layers of calculations.

🏭 Analogy
Think of a neural network like an assembly line with many workers. You send raw material (your input — say, a photo or a sentence) into one end. Each worker (layer) looks at what the previous worker passed them, does their small job, and passes it forward. By the time the material exits the last worker, it’s been transformed into a finished product — a classification, a prediction, a generated sentence.
[Figure: a simple three-layer neural network, with input nodes on the left (input layer), hidden-layer nodes in the middle, and an output node on the right (output layer), and a weighted connection between every node. Labels shown: x₁, x₂, x₃; “The”, “cat”, “sat”, “on”; p=0.74; w=0.89. Each line is a weight, a number the network learned during training.]

Fig 1. A simple neural network. Each circle is a node; each line is a weight. Numbers flow left → right, getting transformed at each layer until the output emerges.

What are “weights” and why do they matter?

Every connection between nodes has a weight — a number that controls how strongly that connection influences the next node. A weight of 0.9 means “pass nearly everything through.” A weight of 0.01 means “almost ignore this.” A negative weight means “this actually pushes against the output.”

A small neural network might have thousands of weights. GPT-4 is estimated to have around one trillion, although OpenAI has not published the figure. The entire “knowledge” of an LLM (every fact it knows about history, science, language, code) is encoded in those numbers. Training is the process of figuring out what all those numbers should be.

How does a network learn?

It learns by making guesses and getting corrected — billions of times. You show it an input, it produces an output, you compare the output to the correct answer, and you calculate how wrong it was (the loss). Then you use a technique called backpropagation to trace that error back through the network and nudge every weight slightly in the direction that would have produced a better answer. Do this enough times, and the weights settle into values that make the network good at its task.
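The guess, measure, nudge loop can be sketched with a network that has exactly one weight. This is a toy illustration, not how a real framework is written: the data, the learning rate, and the function name `train_single_weight` are all assumptions chosen so the example stays readable.

```python
# Toy example: learn one weight w so that prediction = w * x matches y = 3 * x.
# The dataset, learning rate, and step count are illustrative assumptions.
def train_single_weight(steps=200, learning_rate=0.01):
    data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]  # (input, correct answer) pairs
    w = 0.0  # start from an uninformed weight
    for _ in range(steps):
        for x, y in data:
            prediction = w * x            # forward pass: make a guess
            error = prediction - y        # how wrong was the guess?
            # loss = error**2, so d(loss)/dw = 2 * error * x
            gradient = 2 * error * x
            w -= learning_rate * gradient  # nudge w against the gradient
    return w


learned_w = train_single_weight()
print(round(learned_w, 3))  # 3.0
```

With one weight the gradient can be written by hand. Backpropagation is the bookkeeping that computes the same kind of gradient for every weight in a deep network at once.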

🔑 The key insight. Nobody hand-codes the rules. Nobody writes “if the user asks about Paris, mention the Eiffel Tower.” The network discovers its own rules by adjusting weights across trillions of examples. The rules live in the numbers — implicit, distributed, and often impossible for even the researchers to fully read back out.

So where does “Large Language Model” fit?

An LLM is a neural network specifically designed and trained to understand and generate language. The “large” refers to the scale: billions to, by some estimates, trillions of weights. The specific type of neural network architecture used in essentially all modern LLMs is called the Transformer, which we’ll cover a few sections from now. But first: where does all this actually run?


🌱 Layman

The hardware stack — where does an LLM actually run?

A neural network with a trillion parameters isn’t running on a laptop. Understanding the hardware stack gives you a much clearer picture of why LLMs are expensive, why inference has latency, and why companies like NVIDIA became trillion-dollar businesses overnight.

There are three distinct hardware contexts where neural networks operate, and they have very different characters.

[Figure: the three-tier LLM hardware stack, with data and model weights flowing between tiers.
Tier 1, training: H100 GPUs × 8,192. Purpose: learn the weights. Runs once, for months, at $100M+. Output: trained weights (checkpoint).
Tier 2, inference servers: A100 GPUs × N. Purpose: serve user requests. Runs 24/7, milliseconds per token. Output: API response (tokens).
Tier 3, end-user devices: browser / app, mobile, laptop. Purpose: display results, send prompts.]

Fig 2. The three-tier LLM hardware stack. Training happens once on massive GPU clusters. Inference runs continuously on server-side GPUs. Users only see the API responses on their devices.

Tier 1 — Training clusters (where learning happens)

Training a frontier model requires thousands of high-end GPUs — typically NVIDIA H100s — running in parallel for months. A single H100 costs around $30,000. OpenAI, Google, and Anthropic operate clusters with tens of thousands of them, linked by ultra-fast interconnects so gradients can flow between chips in microseconds.

Why GPUs and not regular CPUs? Because neural network training is fundamentally matrix multiplication — multiplying huge grids of numbers together. GPUs were originally designed for video games, which also require enormous amounts of parallel matrix math (rendering 3D scenes). That same architecture turns out to be perfect for neural networks.

🎮 Analogy
A CPU is like a brilliant professor who can solve any problem but works on one problem at a time. A GPU is like a lecture hall of 10,000 students who are each slightly less brilliant, but can all work in parallel. For matrix multiplication, the lecture hall wins every time.
  • ~$30K: cost of a single NVIDIA H100 GPU
  • 16,384: H100s Meta reported using to train its largest Llama 3 model
  • 80 GB: memory per H100; the model’s weights must fit in the combined memory of the GPUs that run it

Tier 2 — Inference servers (where your prompts are answered)

Once training is complete, the model’s weights are frozen and deployed on inference servers. These are also GPUs, though often slightly older or smaller ones, since answering a request takes far less compute than a training run. The weights must fit in GPU memory for the model to run efficiently. A 70-billion-parameter model in 16-bit precision (2 bytes per parameter) takes about 140 GB, so it needs at least two 80 GB A100 GPUs just to hold the weights, before doing any computation.

This is why model size is so closely watched. A model that’s twice as large costs roughly twice as much to serve per token — and when you’re handling millions of requests a day, that compounds fast.
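The 140 GB figure is plain arithmetic, and it is worth being able to redo it for other model sizes. The sketch below assumes 1 GB = 10⁹ bytes and 80 GB of memory per GPU; the function names are made up for this example, and the count covers only the weights, not activations or the KV cache.

```python
import math


def weight_memory_gb(num_params, bytes_per_param=2):
    """Memory needed just to hold the weights, in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


def gpus_needed(num_params, gpu_memory_gb=80, bytes_per_param=2):
    """Minimum number of GPUs whose combined memory can hold the weights."""
    return math.ceil(weight_memory_gb(num_params, bytes_per_param) / gpu_memory_gb)


print(weight_memory_gb(70e9))  # 140.0
print(gpus_needed(70e9))       # 2
```

Halving the precision to 8 bits (`bytes_per_param=1`) halves the memory, which is one reason quantized models are popular for serving.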

Tier 3 — Your device (just the display)

When you type a message to Claude or ChatGPT, your device sends that text over the internet to Tier 2. The server computes the response, and streams tokens back to you one by one. Your laptop or phone does almost no AI computation — it’s just displaying text. The heavy lifting is entirely server-side.

Exception: some smaller “edge” models like Llama running locally via tools like Ollama can run on consumer laptops — but they’re significantly smaller and less capable than frontier models. The capability gap between “fits on my MacBook” and “requires a $100M data center” is enormous.

💡 Why this matters for understanding LLMs. The neural network is the weights — those trillions of numbers. Hardware is simply the medium that stores and executes them. Training is about finding the right weights. Inference is about using them. Everything else in this article is about how those weights encode intelligence.

🌱 Layman

Tokens — the alphabet of AI

Before an LLM can do anything, it needs to convert raw text into something a computer can process. That “something” is called a token.

A token is not a word. It’s more like a syllable or a chunk. The sentence “I love pizza” becomes three tokens: I, love, pizza. But “unbelievable” might split into un, bel, iev, able — four tokens. Common short words are one token; rare long words get chopped up.

🧩 Analogy
Think of tokens like Scrabble tiles. You don’t play the word “extraordinary” as one piece; you lay down individual tiles. The AI’s vocabulary is a fixed set of tiles (roughly 50,000 for GPT-2 and GPT-3, about 100,000 for GPT-4), and every sentence in existence can be spelled out using them.
[Figure: tokenization of “ChatGPT is surprisingly good!”. The tokenizer splits the sentence into labeled boxes, each with a token ID: Chat (#9953), GPT (#38840), is (#318), surpris (#44673), ingly (#2627), good (#1104), ! (#0). Legend: named entity, common word, sub-word chunk, punctuation.]

Fig 3. “ChatGPT is surprisingly good!” → 7 tokens. Note how “surprisingly” splits across two tokens — rare or long words get broken up.

  • ~100K: tokens in GPT-4’s vocabulary
  • ~¾: of a word per token on average in English
  • 2T+: tokens used to train GPT-4 (an outside estimate; OpenAI has not published the figure)

Why not just use letters? Because then the model would need billions of steps to learn that c-a-t means the same animal as c-a-t-s. Tokens give the model meaningful chunks to work with — richer than letters, cheaper than full words.
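To make the chunking concrete, here is a minimal sketch of a subword tokenizer. It is a deliberate simplification: real GPT tokenizers use byte-pair encoding with a vocabulary learned from data, whereas this toy uses greedy longest-match against a tiny hand-written vocabulary. `TOY_VOCAB` and both function names are hypothetical.

```python
# Hand-written toy vocabulary; real vocabularies are learned from data.
TOY_VOCAB = {"I": 0, "love": 1, "pizza": 2, "un": 3, "bel": 4, "iev": 5, "able": 6}


def tokenize_word(word, vocab=TOY_VOCAB):
    """Split one word into the longest vocabulary pieces, left to right."""
    tokens = []
    start = 0
    while start < len(word):
        for end in range(len(word), start, -1):  # try the longest piece first
            piece = word[start:end]
            if piece in vocab:
                tokens.append(piece)
                start = end
                break
        else:
            raise ValueError(f"no vocabulary piece matches {word[start:]!r}")
    return tokens


def tokenize_text(text, vocab=TOY_VOCAB):
    """Tokenize whitespace-separated words and return (pieces, ids)."""
    pieces = [p for word in text.split() for p in tokenize_word(word, vocab)]
    return pieces, [vocab[p] for p in pieces]


print(tokenize_text("I love pizza"))   # (['I', 'love', 'pizza'], [0, 1, 2])
print(tokenize_word("unbelievable"))   # ['un', 'bel', 'iev', 'able']
```

The model never sees the text pieces, only the integer IDs on the right.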


🌱 Layman

Embeddings — meaning as coordinates

Once we have tokens, we need to give them meaning. Computers work with numbers, not English. So every token gets converted into a list of numbers — a vector in a high-dimensional space. This is called an embedding.

🗺️ Analogy
Imagine a city map where words are locations. “King” and “Queen” are in the same neighbourhood. “Dog” and “puppy” are neighbours. “Paris” is near “France” in one direction, and near “London” in another. The model learns these coordinates by reading billions of sentences — positions that capture meaning, grammar, and context.
[Figure: a 2D projection of a high-dimensional embedding space. Axes: Dimension 1 (power / status) and Dimension 2 (animacy). Clusters: royalty (King, Queen), canines (Dog, Wolf, Puppy), cities (Paris, London). Note: this is a simplified 2D projection; real embeddings have thousands of dimensions.]

Fig 4. A 2D “slice” of embedding space. Similar meanings cluster together. The famous formula: King − Man + Woman ≈ Queen works because relationships are encoded as directions in this space.

In reality, embeddings have thousands of dimensions: GPT-3, for example, uses 12,288 (GPT-4’s size has not been published). Each dimension captures some aspect of meaning that humans never explicitly defined. The model learned all of it just by predicting text.
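The King − Man + Woman ≈ Queen arithmetic can be tried on a toy scale. The vectors below are hand-made 2D coordinates (roughly “royalty” and “femaleness”), not learned embeddings, and `nearest_word` is a hypothetical helper; the point is only to show what “relationships are directions” means.

```python
import math

# Hand-made 2D vectors for illustration only: (royalty, femaleness).
TOY_EMBEDDINGS = {
    "king": (0.9, 0.1),
    "queen": (0.9, 0.9),
    "man": (0.1, 0.1),
    "woman": (0.1, 0.9),
    "apple": (0.0, 0.5),
}


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_word(vector, exclude=()):
    """Return the vocabulary word whose embedding points most nearly the same way."""
    candidates = [w for w in TOY_EMBEDDINGS if w not in exclude]
    return max(candidates, key=lambda w: cosine_similarity(vector, TOY_EMBEDDINGS[w]))


king, man, woman = (TOY_EMBEDDINGS[w] for w in ("king", "man", "woman"))
target = tuple(k - m + w for k, m, w in zip(king, man, woman))
print(nearest_word(target, exclude=("king", "man", "woman")))  # queen
```

Subtracting “man” and adding “woman” moves the point along the second axis only, which is why it lands on “queen” rather than anywhere else.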

“The model doesn’t know what ‘king’ means — it knows that ‘king’ appears in similar contexts as ‘ruler’, ‘throne’, and ‘sovereign’, and that’s enough.” — A rough paraphrase of how it actually works

⚡ Intermediate

The Transformer architecture

In 2017, a Google paper titled “Attention Is All You Need” introduced the Transformer. It wasn’t the first neural network for text, but it was by far the best — and it’s the foundation of every major LLM today: GPT-4, Claude, Gemini, Llama, all of them.

The transformer takes your token embeddings and processes them through a stack of layers. Each layer has two main components: an attention mechanism and a feed-forward network. Think of layers as passes of reasoning — the deeper the layer, the more abstract the understanding.

[Figure: the transformer pipeline as a vertical stack. Input tokens ([“The”, “ cat”, “ sat”, …]) → embedding + position encoding → a stack of transformer layers, Layer 2 through Layer N (deepest), each with multi-head attention and a feed-forward network → prediction head giving a probability over all ~100K tokens. The diagram’s “96 layers” label matches GPT-3’s published depth; GPT-4’s layer count is not public.]

Fig 5. The Transformer pipeline. Tokens flow through an embedding layer, then through N identical layers (each with attention + FFN), then a head predicts the next token.

One subtle but crucial detail: the transformer also adds positional encodings to the embeddings. Without these, the model could not tell “dog bit man” from “man bit dog”: the words are the same, just in a different order. Positional encodings inject sequence information so the model can tell position 1 from position 10 from position 100.
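One well-known way to build positional encodings is the sinusoidal scheme from the original Transformer paper; many current models use other schemes, so treat this as one example rather than what any particular LLM does. The token vectors and the helper `embed_sentence` are made up for illustration.

```python
import math


def sinusoidal_position_encoding(position, dim):
    """PE[pos, 2i] = sin(pos / 10000^(2i/dim)), PE[pos, 2i+1] = cos(same angle)."""
    encoding = []
    for i in range(dim):
        pair_index = i // 2
        angle = position / (10000 ** (2 * pair_index / dim))
        encoding.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return encoding


# Hypothetical 4-dimensional token embeddings.
TOY_TOKEN_VECTORS = {
    "dog": [1.0, 0.0, 0.0, 0.0],
    "bit": [0.0, 1.0, 0.0, 0.0],
    "man": [0.0, 0.0, 1.0, 0.0],
}


def embed_sentence(words, use_positions=True):
    rows = []
    for position, word in enumerate(words):
        vector = list(TOY_TOKEN_VECTORS[word])
        if use_positions:
            pe = sinusoidal_position_encoding(position, len(vector))
            vector = [v + p for v, p in zip(vector, pe)]
        rows.append(vector)
    return rows


first = embed_sentence(["dog", "bit", "man"])
second = embed_sentence(["man", "bit", "dog"])
print(first[0] == second[2])  # False: "dog" at position 0 differs from "dog" at position 2
```

With `use_positions=False` the two “dog” vectors are identical, which is exactly the ambiguity the encodings remove.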


⚡ Intermediate

Attention — how the model “thinks”

This is the secret sauce. The attention mechanism lets every token look at every other token in the input and decide: how much should I borrow meaning from you?

Consider the sentence: “The animal didn’t cross the street because it was too tired.” What does “it” refer to? Humans immediately know it’s the animal, not the street. Attention is how the model knows too — the token “it” attends heavily to “animal” and lightly to “street”.

🔦 Analogy
Imagine each word in a sentence has a spotlight. When it’s “it”‘s turn to update its meaning, it shines its spotlight on every other word. The words the spotlight hits brightest — those are the ones that most influence how “it” gets understood in this context.
[Figure: self-attention from the query token “it” to every other token in “The animal didn’t cross the street because it was tired”. Line thickness = attention weight: “animal” gets the thickest line at 0.72 (high); the other weights shown are 0.14 and 0.08, with “street” getting a thin line.]

Fig 6. When processing “it”, the attention mechanism assigns high weight to “animal” (0.72), moderate weight to “tired”, and low weight to “street”. This is how context is resolved.

Multi-head attention

Here’s where it gets interesting. The model doesn’t run attention just once: it runs many attention heads in parallel. Each head learns to attend to different types of relationships: one head might focus on syntax, another on coreference (like our “it” → “animal” example), and another on semantic similarity. GPT-3, for example, has 96 attention heads per layer.

All the heads’ outputs are concatenated and projected back together, giving the model a rich, multi-perspective view of every token in every position.

🔑 The Q, K, V intuition. Each token generates three vectors: a Query (“what am I looking for?”), a Key (“what do I contain?”), and a Value (“what do I give you?”). Attention scores are computed by matching Queries against Keys — like a search engine — then using the scores to blend Values.
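The Query/Key/Value matching can be sketched for a single query token. This follows the standard scaled dot-product formulation (scores divided by the square root of the vector size, then a softmax); the 2D vectors are hand-picked so that “it” lines up with “animal”, and a real model would learn them and use many more dimensions.

```python
import math


def softmax(scores):
    highest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention for one query over a list of keys/values."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output


# Hand-picked toy vectors (assumptions, not learned values).
tokens = ["animal", "street", "tired"]
keys = [(2.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
values = [(1.0, 0.0), (0.0, 1.0), (0.5, 0.5)]
query_for_it = (2.0, 0.0)

weights, blended = attend(query_for_it, keys, values)
for token, weight in zip(tokens, weights):
    print(f"{token}: {weight:.2f}")
```

Multi-head attention runs this same computation several times with different learned projections and concatenates the results.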

⚡ Intermediate

Pre-training on the internet

All that architecture is just plumbing. The magic happens in training — the process that fills those billions of parameters with knowledge.

Pre-training is simple in concept and extraordinary in scale. You feed the model vast amounts of text and ask it to do one thing: predict the next token. Given “The Eiffel Tower is in”, the model must predict “Paris”. Given “To be or not to”, it must predict “be”. Do this trillions of times, and something remarkable happens — to predict the next word well, the model has to understand what it’s reading.

[Figure: the pre-training loop. ① Feed: input text “The cat sat on”. ② Forward pass: the transformer (billions of params doing matrix math). ③ Predict: next-token probabilities, e.g. “the” (p=0.42), “mat” (p=0.28), … ④ Compare: the loss function checks the prediction against the truth, “the”, giving loss = 0.87. Backpropagation then nudges the weights to reduce the loss.]

Fig 7. The pre-training loop, repeated trillions of times. Each step nudges the model’s parameters slightly toward better predictions, accumulating into knowledge.
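The loss in this loop is usually the cross-entropy: the negative log of the probability the model gave to the token that actually came next. The sketch below reuses the Fig 7 example, where the correct token “the” was given p = 0.42; the third entry is a made-up catch-all so the probabilities sum to 1.

```python
import math


def next_token_loss(predicted_probs, true_token):
    """Cross-entropy for one prediction: -ln(probability of the correct token)."""
    return -math.log(predicted_probs[true_token])


# "the" and "mat" come from the figure; "<other>" is a hypothetical bucket
# standing in for the rest of the vocabulary.
predicted = {"the": 0.42, "mat": 0.28, "<other>": 0.30}

loss = next_token_loss(predicted, "the")
print(round(loss, 2))  # 0.87
```

A perfect prediction (p = 1.0) gives a loss of 0, and the loss grows without bound as the probability of the correct token approaches 0.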

  • ~1T: parameters in frontier models (GPT-4 estimated)
  • $100M+: compute cost for a frontier training run
  • ~3 months: training time on thousands of GPUs

The training data includes books, Wikipedia, code repositories, scientific papers, forums, websites — essentially a large sample of human written knowledge. The model doesn’t memorize it verbatim. Instead, it learns to compress it into patterns, relationships, and rules that let it generate new text that follows similar distributions.


⚡ Intermediate+

RLHF — teaching it to be helpful

After pre-training, you have a model that’s incredibly good at completing text — but it might complete “How do I make a bomb” just as readily as “What’s a good pasta recipe”. It has no sense of what makes a response good.

This is where Reinforcement Learning from Human Feedback (RLHF) comes in. It’s a three-step process that aligns the model with human values and preferences.

[Figure: the three phases of RLHF.
Phase 1, supervised fine-tuning: human labellers write ideal responses, and the model learns to mimic these demonstrations. Result: an SFT model (helpful but inconsistent).
Phase 2, train reward model: humans are shown 2 responses and pick the better one; the reward model (RM) learns to score quality, e.g. “this is an 8.4/10 response”.
Phase 3, RL fine-tuning (PPO): the SFT model generates responses to prompts, the RM scores each one, and PPO updates the weights, so the model learns to generate high-reward responses.]

Fig 8. The RLHF pipeline. Three phases transform a raw pre-trained model into a helpful, harmless, honest assistant.

The beauty of RLHF is that humans don’t need to specify every rule. Instead of writing “be polite in situation X, be concise in situation Y”, human raters just pick the better of two responses. The reward model infers the patterns. It’s teaching by example, not by rule-writing.
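One common way to turn “rater picked A over B” into a training signal is a pairwise loss on the reward model’s two scores. The article doesn’t say which loss is used, so take this as an assumed, widely used formulation: the loss is small when the chosen response outscores the rejected one and large when the ranking is reversed. The scores are invented.

```python
import math


def pairwise_preference_loss(score_chosen, score_rejected):
    """-log(sigmoid(score_chosen - score_rejected)), written in a stable form."""
    margin = score_chosen - score_rejected
    return math.log(1.0 + math.exp(-margin))


# Hypothetical reward-model scores for two responses to the same prompt.
print(pairwise_preference_loss(2.0, 0.0) < pairwise_preference_loss(0.0, 2.0))  # True
```

Training the reward model means adjusting its weights to lower this loss across many human comparisons; no one ever has to write down what “better” means.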

⚠️ The alignment problem lurking here. The reward model is trained on human preferences — but humans can have biases, inconsistencies, and blind spots. An imperfectly trained reward model can cause the LLM to “reward hack” — finding responses that score high on the metric but aren’t actually good. This is an active research area.

🦸 Hero

Fine-tuning and adapters

Pre-training + RLHF gives you a general-purpose assistant. But what if you want a model that specialises in medical diagnoses, or legal contracts, or customer support for your SaaS product? That’s where fine-tuning comes in.

Fine-tuning means taking a pre-trained model and continuing to train it on a smaller, domain-specific dataset. The weights shift slightly to accommodate the new knowledge, while largely retaining what was learned before. Think of it like a software engineer who already knows Python learning a new framework — they don’t forget Python, they just add a new skill.

The problem: fine-tuning is expensive

Full fine-tuning requires updating all the billions of parameters, so each training step needs nearly as much GPU memory and compute as a pre-training step. Enter Parameter-Efficient Fine-Tuning (PEFT) methods, especially LoRA (Low-Rank Adaptation).

[Figure: a LoRA adapter. Input x passes through the frozen pre-trained weights W (7B parameters, locked, no gradient) and, in parallel, through the small trainable adapter matrices B (r × d) and A (d × r), with rank r = 4 to 64. The two outputs are added: h = Wx + ABx. Result: ~0.1% of params trained, 80–90% of full fine-tune quality.]

Fig 9. LoRA freezes the original weight matrix W and adds two small matrices A and B. During inference, the outputs are summed. Training only A and B costs a fraction of full fine-tuning.

LoRA works because of a hypothesis: the changes needed to specialize a model live in a low-dimensional subspace. Instead of modifying a 4096×4096 matrix (16M parameters), you train two small matrices of rank 16 — just 2×4096×16 = 131K parameters. At inference time, you merge them back: W + AB. Same speed, tiny training cost.
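Both claims in that paragraph, the parameter count and the “merge them back” trick, are easy to check numerically. The sketch uses the same notation (A is d × r, B is r × d, h = Wx + ABx) with tiny hand-picked integer matrices so the arithmetic is exact; the helper names are made up for this example.

```python
def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def matmul(left, right):
    inner, cols = len(right), len(right[0])
    return [[sum(row[k] * right[k][j] for k in range(inner)) for j in range(cols)]
            for row in left]


def lora_param_counts(d, r):
    """(parameters in the full d x d matrix, parameters in A and B together)."""
    return d * d, 2 * d * r


print(lora_param_counts(4096, 16))  # (16777216, 131072)

# Tiny example with d = 3, r = 1.
W = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]  # frozen pre-trained weights
A = [[1], [2], [3]]                    # trainable, d x r
B = [[1, 0, -1]]                       # trainable, r x d
x = [1, 2, 3]

# During training: keep W frozen and add the adapter's output.
h_adapter = [w + a for w, a in zip(matvec(W, x), matvec(A, matvec(B, x)))]

# For inference: fold the adapter into a single matrix, W + AB.
AB = matmul(A, B)
W_merged = [[w + ab for w, ab in zip(w_row, ab_row)] for w_row, ab_row in zip(W, AB)]
h_merged = matvec(W_merged, x)

print(h_adapter, h_merged)  # [-1, -2, -3] [-1, -2, -3]
```

The two outputs are identical, so a merged model runs exactly as fast as the original while only A and B ever needed gradients.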


🦸 Hero

Inference — how generation works

When you chat with Claude or ChatGPT, what’s happening behind the scenes? The model doesn’t produce the whole response at once — it generates one token at a time, feeding each token back in as input to generate the next.

[Figure: autoregressive token generation. Step 1: “The capital of France is” → “ Paris” (p=0.87, highest). Step 2 (feed back): “The capital of France is Paris” → “.” (p=0.71). Step 3: “The capital of France is Paris.” → [EOS] → stop.]

Fig 10. Autoregressive generation: each new token is appended to the context and the model runs again. This is why output appears token by token in your browser.
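The feed-it-back loop is short enough to write out. In this sketch the “model” is a hypothetical lookup table (`toy_next_token`) that replays the three steps from Fig 10; a real LLM would run the full transformer at that point instead.

```python
END_OF_SEQUENCE = "[EOS]"


def toy_next_token(context):
    """Stand-in for the model: maps a context string to its most likely next token."""
    table = {
        "The capital of France is": " Paris",
        "The capital of France is Paris": ".",
        "The capital of France is Paris.": END_OF_SEQUENCE,
    }
    return table.get(context, END_OF_SEQUENCE)


def generate(prompt, max_new_tokens=10):
    context = prompt
    for _ in range(max_new_tokens):
        token = toy_next_token(context)  # one full model run per new token
        if token == END_OF_SEQUENCE:
            break
        context += token                 # append, then feed the longer context back in
    return context


print(generate("The capital of France is"))  # The capital of France is Paris.
```

The `max_new_tokens` cap is there because nothing guarantees the model will ever emit the end-of-sequence token on its own.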

Temperature and sampling

At each step the model outputs a probability distribution over all ~100K tokens. How do you pick which one to use? That’s controlled by temperature:

  • Temperature = 0: always pick the highest probability token. Deterministic, predictable, but repetitive. Good for factual Q&A.
  • Temperature = 0.7: sample from the distribution, weighted by probability. More creative, some randomness. The default for most chatbots.
  • Temperature = 1.5+: heavily randomized. Can produce wild creativity or complete nonsense.

Beyond temperature, there’s also top-p (nucleus) sampling — only consider the smallest set of tokens whose combined probability exceeds p (e.g. 0.9). This prevents the model from ever picking extremely unlikely tokens, which helps quality while preserving creativity.
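Temperature and top-p can be combined in a few lines. This is a minimal sketch under common conventions (divide the raw scores by the temperature before the softmax, treat temperature 0 as “take the arg-max”, renormalize after the top-p cut); the function names and the example scores are assumptions, and production samplers differ in details.

```python
import math
import random


def apply_temperature(logits, temperature):
    """Softmax over logits / temperature (temperature must be > 0)."""
    scaled = [value / temperature for value in logits]
    highest = max(scaled)
    exps = [math.exp(s - highest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the most likely tokens until their combined probability reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for index in order:
        kept.append(index)
        cumulative += probs[index]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}  # renormalized nucleus


def sample_token(logits, temperature=0.7, top_p=0.9, rng=random):
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])  # greedy
    nucleus = top_p_filter(apply_temperature(logits, temperature), top_p)
    indices = list(nucleus)
    return rng.choices(indices, weights=[nucleus[i] for i in indices], k=1)[0]


print(sample_token([2.0, 1.0, 0.1], temperature=0))          # 0
print(sorted(top_p_filter([0.5, 0.3, 0.15, 0.05], 0.9)))     # [0, 1, 2]
```

Lower temperatures sharpen the distribution toward the top token, and higher ones flatten it, which is the whole effect behind “predictable” versus “creative”.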

KV Cache — the inference speed trick

Running the full transformer over every token in your growing context at every step would be agonizingly slow. The KV cache solves this: the model stores the Key and Value vectors for all previous tokens, so on the next step it only computes the Query, Key, and Value for the new token and attends over the cached entries instead of recomputing everything from scratch. This cuts the attention work per generated token from roughly O(n²) to O(n), which is the difference between a fast chatbot and an unusable one.
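The saving is easiest to see by counting how many times Keys and Values get computed. The sketch below is only a counting exercise: `toy_project` is a hypothetical stand-in for the real Key/Value projections, and no actual attention is computed.

```python
def toy_project(token_id):
    """Stand-in for the real Key/Value projections of one token."""
    key = (token_id * 0.1, 1.0)
    value = (token_id * 1.0, -1.0)
    return key, value


def projections_without_cache(token_ids):
    """Recompute K and V for the whole context at every generation step."""
    projections = 0
    for step in range(1, len(token_ids) + 1):
        for token_id in token_ids[:step]:
            toy_project(token_id)
            projections += 1
    return projections


def projections_with_cache(token_ids):
    """Compute K and V once per token and keep them for later steps."""
    cached_keys, cached_values = [], []
    projections = 0
    for token_id in token_ids:
        key, value = toy_project(token_id)
        projections += 1
        cached_keys.append(key)
        cached_values.append(value)
    return projections, len(cached_keys)


context = [11, 42, 7, 99, 3]
print(projections_without_cache(context))  # 15
print(projections_with_cache(context))     # (5, 5)
```

The price is memory: the cache grows with every token in the context, which is one reason long conversations are more expensive to serve.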


🦸 Hero

Scaling laws and emergent abilities

Perhaps the most surprising discovery in AI research isn’t a new architecture — it’s that making existing architectures bigger and training them on more data reliably produces better models. This regularity is captured in scaling laws.

[Figure: scaling laws chart. A log-log plot of loss (y-axis, log scale) against compute / parameters / data (x-axis, log scale), running from small to medium to large models with GPT-2, GPT-3, and GPT-4 marked along the scaling curve. Annotations: the data-optimal (Chinchilla) frontier, and the point where few-shot reasoning emerges.]

Fig 11. Model loss drops predictably with compute — a power law relationship. The Chinchilla paper (2022) showed that most pre-2022 models were over-parameterized and under-trained on data.

The Chinchilla insight

In 2022, DeepMind published the “Chinchilla” paper with a counterintuitive finding: GPT-3 (175B parameters, trained on 300B tokens) was severely under-trained. Optimal compute allocation means scaling parameters and data in roughly equal proportions, which works out to about 20 training tokens per parameter. Chinchilla itself, a 70B parameter model trained on 1.4 trillion tokens, beats the 175B model trained on fewer tokens. For a fixed compute budget, more data per parameter beats brute-force parameter count.
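The gap between the two models is clearest as a tokens-per-parameter ratio, computed here from the figures quoted in the paragraph above. The dictionary keys and the function name are labels invented for this example.

```python
def tokens_per_parameter(num_params, num_tokens):
    return num_tokens / num_params


# Figures quoted in the text above.
models = {
    "175B model, 300B tokens": (175e9, 300e9),
    "70B model, 1.4T tokens": (70e9, 1.4e12),
}

for name, (params, tokens) in models.items():
    print(f"{name}: {tokens_per_parameter(params, tokens):.1f} tokens per parameter")
# 175B model, 300B tokens: 1.7 tokens per parameter
# 70B model, 1.4T tokens: 20.0 tokens per parameter
```

By this measure the smaller model saw more than ten times as much data per parameter.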

Emergent abilities

Here’s the uncanny part: as models scale, some capabilities don’t improve gradually — they appear suddenly, at a threshold. A model below 10B parameters can barely do 3-digit arithmetic. Cross some threshold and a 100B model solves it reliably. These emergent abilities include chain-of-thought reasoning, few-shot learning, code generation, and more.

🧠 Why do abilities emerge? The honest answer: nobody fully knows. One theory is that complex tasks require multiple sub-skills, each learned separately, and only when all are present does the combined capability “click”. Another theory: the tasks require the model to represent abstractions that only become stable above certain parameter thresholds. Active research continues.
  • 10×: compute increase every ~2.5 years in frontier models
  • ~67: emergent abilities identified as of 2023
  • 15T+: tokens used to train Llama 3 models

Putting it all together

Let’s zoom all the way back out. When you type a question and an LLM answers you, here is the full pipeline in one breath:

  1. An LLM is a neural network: a software system of weighted connections that learns by adjusting numbers across billions of examples. It runs on GPU clusters during training and on inference servers when you talk to it.
  2. Your text is split into tokens, which are subword chunks from a fixed vocabulary.
  3. Each token is converted to an embedding, a high-dimensional vector encoding meaning and position.
  4. Embeddings pass through N transformer layers, each running multi-head attention (where every token considers every other token) and a feed-forward network.
  5. A prediction head outputs a probability distribution over all ~100K tokens. Temperature + sampling pick the next one.
  6. That token is appended to the context (with earlier Keys and Values held in the KV cache), and steps 3 to 5 repeat until the model produces an end-of-sequence token.
  7. All of this was shaped by pre-training on trillions of tokens, refined by RLHF on human preferences, and optionally specialized via fine-tuning with adapters like LoRA.
  8. The model’s capabilities follow scaling laws (more compute + data → lower loss → better reasoning), with some abilities appearing suddenly at scale thresholds.

There’s no magic here — just elegant mathematics applied at extraordinary scale. The wonder isn’t that it works. The wonder is that next-token prediction, that simple-sounding game of “guess what comes next”, turned out to be enough.

🌊 Final thought
A single water molecule has no idea it’s part of a wave. No individual weight in a neural network “knows” what Paris is. But a trillion weights, shaped by human language, can discuss it — in any language, from any angle, at any level of depth. That’s the scale at which something new emerges.
