How Large Language Models Actually Work
A complete guide from “what even is a neural network?” to transformers, RLHF, inference, and scaling laws — no PhD required.
Contents
- What is a neural network? (Layman)
- The hardware stack: where does it run? (Layman)
- Tokens: the alphabet of AI (Layman)
- Embeddings: meaning as coordinates (Layman)
- The Transformer architecture (Intermediate)
- Attention: how the model “thinks” (Intermediate)
- Pre-training on the internet (Intermediate)
- RLHF: teaching it to be helpful (Intermediate+)
- Fine-tuning and adapters (Hero)
- Inference: how generation works (Hero)
- Scaling laws and emergent abilities (Hero)
In late 2022, a chatbot wrote a sonnet, debugged Python, explained black holes to a 10-year-old, and passed the bar exam — all in the same week. The world lost its mind. Engineers, teachers, lawyers, and your uncle at Thanksgiving all asked the same question: how does it actually do that?
This guide answers that question completely. We start from absolute zero — no math, no jargon — and by the end you’ll understand the full engineering stack: tokens, embeddings, the transformer, attention, training at scale, RLHF, fine-tuning, inference, and the mysterious “emergent abilities” that surprised even the researchers who built these systems.
What is a neural network?
Before we talk about LLMs, transformers, or tokens — we need to answer a more basic question that most articles skip. What on earth is a neural network, and why does it matter?
Start with the word “neural.” It comes from neurons — the cells in your brain. Your brain has roughly 86 billion neurons. Each neuron receives electrical signals from others, does a tiny calculation, and either fires a signal forward or stays quiet. The magic isn’t in any single neuron; it’s in the connections between billions of them.
A neural network is a software imitation of this idea. Instead of biological cells, we have mathematical functions called nodes. Instead of biological synapses, we have numbers called weights. And instead of electricity, we have numbers flowing through layers of calculations.
Fig 1. A simple neural network. Each circle is a node; each line is a weight. Numbers flow left → right, getting transformed at each layer until the output emerges.
What are “weights” and why do they matter?
Every connection between nodes has a weight — a number that controls how strongly that connection influences the next node. A weight of 0.9 means “pass nearly everything through.” A weight of 0.01 means “almost ignore this.” A negative weight means “this actually pushes against the output.”
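To make the idea of weights concrete, here is a minimal sketch of a single node. The function name `neuron` and the ReLU activation are assumptions chosen for illustration; real networks use many such nodes and a variety of activation functions.

```python
def neuron(inputs, weights, bias=0.0):
    """One node: a weighted sum of its inputs, passed through a ReLU activation."""
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return max(0.0, total)  # ReLU (an assumed choice): negative totals become 0


inputs = [1.0, 1.0, 1.0]
# 0.9 passes nearly everything through, 0.01 is almost ignored,
# and -0.5 pushes against the output.
weights = [0.9, 0.01, -0.5]
print(neuron(inputs, weights))
```

Changing any one weight changes how much that input contributes to the result, which is all that “adjusting the weights” means.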
A small neural network might have thousands of weights. GPT-4 has an estimated one trillion. The entire “knowledge” of an LLM — every fact it knows about history, science, language, code — is encoded in those numbers. Training is the process of figuring out what all those numbers should be.
How does a network learn?
It learns by making guesses and getting corrected — billions of times. You show it an input, it produces an output, you compare the output to the correct answer, and you calculate how wrong it was (the loss). Then you use a technique called backpropagation to trace that error back through the network and nudge every weight slightly in the direction that would have produced a better answer. Do this enough times, and the weights settle into values that make the network good at its task.
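The guess-and-correct loop can be sketched with the smallest possible “network”: a single weight `w` that has to learn the rule y = 3x. This is an illustrative toy, not how a real framework is written; with only one weight, backpropagation reduces to one derivative. The function name `train` and the learning rate are assumptions.

```python
def train(examples, lr=0.01, epochs=200):
    """Fit y = w * x by nudging w against the gradient of the squared error."""
    w = 0.0  # start with an uninformed weight
    for _ in range(epochs):
        for x, y in examples:
            guess = w * x
            error = guess - y      # how wrong the guess was
            grad = 2 * error * x   # derivative of error**2 with respect to w
            w -= lr * grad         # nudge w in the direction that reduces the loss
    return w


examples = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]  # the hidden rule is y = 3x
print(round(train(examples), 3))  # 3.0
```

A real LLM runs the same loop, only with billions of weights and a loss computed over text instead of three number pairs.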
So where does “Large Language Model” fit?
An LLM is a neural network specifically designed and trained to understand and generate language. The “large” refers to the scale: hundreds of billions to trillions of weights. The neural network architecture used in virtually all modern LLMs is called the Transformer, which we’ll cover in Section 5. But first, where does all this actually run?
The hardware stack — where does an LLM actually run?
A neural network with a trillion parameters isn’t running on a laptop. Understanding the hardware stack gives you a much clearer picture of why LLMs are expensive, why inference has latency, and why companies like NVIDIA became trillion-dollar businesses overnight.
There are three distinct hardware contexts where neural networks operate, and they have very different characters.
Fig 2. The three-tier LLM hardware stack. Training happens once on massive GPU clusters. Inference runs continuously on server-side GPUs. Users only see the API responses on their devices.
Tier 1 — Training clusters (where learning happens)
Training a frontier model requires thousands of high-end GPUs — typically NVIDIA H100s — running in parallel for months. A single H100 costs around $30,000. OpenAI, Google, and Anthropic operate clusters with tens of thousands of them, linked by ultra-fast interconnects so gradients can flow between chips in microseconds.
Why GPUs and not regular CPUs? Because neural network training is fundamentally matrix multiplication — multiplying huge grids of numbers together. GPUs were originally designed for video games, which also require enormous amounts of parallel matrix math (rendering 3D scenes). That same architecture turns out to be perfect for neural networks.
Tier 2 — Inference servers (where your prompts are answered)
Once training is complete, the model’s weights are frozen and deployed on inference servers. These are also GPUs — but often slightly older or smaller ones, since inference is less compute-intensive than training. The entire model must fit in GPU memory to run efficiently. A 70-billion parameter model in 16-bit precision takes about 140 GB — requiring at least two A100 GPUs just to hold the weights, before doing any computation.
This is why model size is so closely watched. A model that’s twice as large costs roughly twice as much to serve per token — and when you’re handling millions of requests a day, that compounds fast.
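The memory arithmetic above is easy to check. The sketch below assumes 80 GB of memory per GPU (the larger A100 variant) and counts only the weights; the helper names are made up for this example.

```python
import math


def weight_memory_gb(n_params, bits_per_param=16):
    """Memory needed just to hold the weights, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * (bits_per_param / 8) / 1e9


def gpus_needed(n_params, gpu_memory_gb=80, bits_per_param=16):
    """Minimum GPU count to hold the weights alone.

    Ignores activations and other runtime memory, so real deployments need more.
    """
    return math.ceil(weight_memory_gb(n_params, bits_per_param) / gpu_memory_gb)


print(weight_memory_gb(70e9))  # 140.0
print(gpus_needed(70e9))       # 2
```

Halving the precision to 8 bits halves the footprint, which is one reason quantized models are popular for serving.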
Tier 3 — Your device (just the display)
When you type a message to Claude or ChatGPT, your device sends that text over the internet to Tier 2. The server computes the response, and streams tokens back to you one by one. Your laptop or phone does almost no AI computation — it’s just displaying text. The heavy lifting is entirely server-side.
Exception: some smaller “edge” models, such as Llama variants run locally via tools like Ollama, fit on consumer laptops, but they’re significantly smaller and less capable than frontier models. The capability gap between “fits on my MacBook” and “requires a $100M data center” is enormous.
Tokens — the alphabet of AI
Before an LLM can do anything, it needs to convert raw text into something a computer can process. That “something” is called a token.
A token is not a word. It’s more like a syllable or a chunk. The sentence “I love pizza” becomes three tokens: I, love, pizza. But “unbelievable” might split into un, bel, iev, able — four tokens. Common short words are one token; rare long words get chopped up.
Fig 3. “ChatGPT is surprisingly good!” → 7 tokens. Note how “surprisingly” splits across two tokens — rare or long words get broken up.
Why not just use letters? Because then the model would need billions of steps to learn that c-a-t means the same animal as c-a-t-s. Tokens give the model meaningful chunks to work with — richer than letters, cheaper than full words.
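Here is a toy illustration of subword splitting. It uses greedy longest-match against a tiny hand-written vocabulary; production tokenizers learn their vocabulary from data (for example with byte-pair encoding) and will split text differently. `VOCAB` and `tokenize` are hypothetical names for this sketch.

```python
def tokenize(text, vocab):
    """Greedy longest-match split of `text` into pieces found in `vocab`."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate piece first, then shrink it.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab:
                tokens.append(piece)
                i = j
                break
        else:
            # Nothing matched: fall back to a single-character token.
            tokens.append(text[i])
            i += 1
    return tokens


# A made-up vocabulary; note that the leading space is part of the token.
VOCAB = {"I", " love", " pizza", "un", "bel", "iev", "able"}

print(tokenize("I love pizza", VOCAB))  # ['I', ' love', ' pizza']
print(tokenize("unbelievable", VOCAB))  # ['un', 'bel', 'iev', 'able']
```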
Embeddings — meaning as coordinates
Once we have tokens, we need to give them meaning. Computers work with numbers, not English. So every token gets converted into a list of numbers — a vector in a high-dimensional space. This is called an embedding.
Fig 4. A 2D “slice” of embedding space. Similar meanings cluster together. The famous formula: King − Man + Woman ≈ Queen works because relationships are encoded as directions in this space.
In reality, embeddings have thousands of dimensions: the largest GPT-3 model uses 12,288 (OpenAI has not published the figure for GPT-4). Each dimension captures some aspect of meaning that humans never explicitly defined. The model learned all of it just by predicting text.
“The model doesn’t know what ‘king’ means — it knows that ‘king’ appears in similar contexts as ‘ruler’, ‘throne’, and ‘sovereign’, and that’s enough.” — A rough paraphrase of how it actually works
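The King − Man + Woman ≈ Queen arithmetic can be tried on hand-made vectors. The two dimensions below (roughly “royalty” and “gender”) and every number in `EMB` are invented for illustration; learned embeddings have thousands of dimensions with no such tidy labels.

```python
import math


def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, -1.0 means opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy 2D embeddings: [royalty, gender]. Purely hypothetical values.
EMB = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.1, 1.0],
    "woman": [0.1, -1.0],
    "throne": [0.9, 0.0],
}


def analogy(a, b, c):
    """Return the word whose vector is closest to vec(a) - vec(b) + vec(c)."""
    target = [x - y + z for x, y, z in zip(EMB[a], EMB[b], EMB[c])]
    candidates = [w for w in EMB if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, EMB[w]))


print(analogy("king", "man", "woman"))  # queen
```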
The Transformer architecture
In 2017, a Google paper titled “Attention Is All You Need” introduced the Transformer. It wasn’t the first neural network for text, but it was by far the best — and it’s the foundation of every major LLM today: GPT-4, Claude, Gemini, Llama, all of them.
The transformer takes your token embeddings and processes them through a stack of layers. Each layer has two main components: an attention mechanism and a feed-forward network. Think of layers as passes of reasoning — the deeper the layer, the more abstract the understanding.
Fig 5. The Transformer pipeline. Tokens flow through an embedding layer, then through N identical layers (each with attention + FFN), then a head predicts the next token.
One subtle but crucial detail: the transformer also adds positional encodings to the embeddings. Without these, the model could not tell “dog bit man” from “man bit dog”, since the words are the same and only the order differs. Positional encodings inject sequence information so the model can tell position 1 from position 10 from position 100.
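One way to build such encodings is the sinusoidal scheme from the original “Attention Is All You Need” paper, sketched below. Many current models use other schemes (learned or rotary position embeddings), so treat this as one example rather than what any particular LLM does. The function names are assumptions.

```python
import math


def positional_encoding(pos, d_model):
    """Sinusoidal encoding for one position: sin on even indices, cos on odd ones."""
    pe = []
    for i in range(0, d_model, 2):
        angle = pos / (10000 ** (i / d_model))
        pe.append(math.sin(angle))
        pe.append(math.cos(angle))
    return pe[:d_model]  # trim in case d_model is odd


def add_position(embedding, pos):
    """Add the positional encoding to a token embedding, element by element."""
    encoding = positional_encoding(pos, len(embedding))
    return [e + p for e, p in zip(embedding, encoding)]


dog = [0.5, 0.5, 0.5, 0.5]  # hypothetical embedding for "dog"
# The same word at two positions now yields two different input vectors.
print(add_position(dog, 0))
print(add_position(dog, 2))
```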
Attention — how the model “thinks”
This is the secret sauce. The attention mechanism lets every token look at every other token in the input and decide: how much should I borrow meaning from you?
Consider the sentence: “The animal didn’t cross the street because it was too tired.” What does “it” refer to? Humans immediately know it’s the animal, not the street. Attention is how the model knows too — the token “it” attends heavily to “animal” and lightly to “street”.
Fig 6. When processing “it”, the attention mechanism assigns high weight to “animal” (0.72), moderate weight to “tired”, and low weight to “street”. This is how context is resolved.
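The computation behind those weights is usually scaled dot-product attention: score each token by the dot product of a query with its key, turn the scores into weights with softmax, then blend the value vectors. The sketch below uses made-up 2D vectors for three of the words, so the resulting weights will not match the figure’s numbers.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    # Output is the weighted average of the value vectors.
    output = [sum(w * v[i] for w, v in zip(weights, values))
              for i in range(len(values[0]))]
    return weights, output


# Hypothetical vectors, chosen so that "it" lines up best with "animal".
tokens = ["animal", "street", "tired"]
keys = [[2.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
query_it = [2.0, 0.5]

weights, output = attention(query_it, keys, values)
for token, weight in zip(tokens, weights):
    print(token, round(weight, 2))
```

In a real model the query, key, and value vectors are produced from the token embeddings by learned weight matrices, and every token plays the role of the query in turn.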
Multi-head attention
Here’s where it gets interesting. The model doesn’t run attention just once: it runs many attention heads in parallel. Each head learns to attend to different types of relationships: one head might focus on syntax, another on coreference (like our “it” → “animal” example), and another on semantic similarity. The largest GPT-3 model has 96 attention heads per layer; OpenAI has not published the figure for GPT-4.
All the heads’ outputs are concatenated and projected back together, giving the model a rich, multi-perspective view of every token in every position.
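A rough sketch of the multi-head idea: split each vector into equal slices, run attention separately on each slice, and concatenate the results. This assumes the query, key, and value vectors have already been produced by learned projections, and it leaves out the final output projection, so it shows the data flow rather than a full layer.

```python
import math


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Single-head scaled dot-product attention; returns the blended value vector."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    return [sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))]


def multi_head_attention(query, keys, values, n_heads):
    """Run attention independently on each slice of the vectors, then concatenate."""
    d = len(query)
    if d % n_heads != 0:
        raise ValueError("vector size must be divisible by n_heads")
    size = d // n_heads
    output = []
    for h in range(n_heads):
        lo, hi = h * size, (h + 1) * size
        head_out = attend(query[lo:hi],
                          [k[lo:hi] for k in keys],
                          [v[lo:hi] for v in values])
        output.extend(head_out)  # concatenation of the heads' outputs
    return output


# Hypothetical 4D vectors for a two-token context.
query = [2.0, 0.5, 0.0, 1.0]
keys = [[2.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 2.0]]
values = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]
print(multi_head_attention(query, keys, values, n_heads=2))
```

Because each head sees a different slice, the two heads here can weight the same pair of tokens differently, which is the point of having several.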
Pre-training on the internet
All that architecture is just plumbing. The magic happens in training — the process that fills those billions of parameters with knowledge.
Pre-training is simple in concept and extraordinary in scale. You feed the model vast amounts of text and ask it to do one thing: predict the next token. Given “The Eiffel Tower is in”, the model must predict “Paris”. Given “To be or not to”, it must predict “be”. Do this trillions of times, and something remarkable happens — to predict the next word well, the model has to understand what it’s reading.
Fig 7. The pre-training loop, repeated trillions of times. Each step nudges the model’s parameters slightly toward better predictions, accumulating into knowledge.
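How wrong a next-token guess is gets measured with a loss. The usual choice, assumed here, is cross-entropy: the negative log of the probability the model gave to the token that actually came next. The probabilities below are invented.

```python
import math


def next_token_loss(predicted_probs, correct_token):
    """Cross-entropy for one prediction: -log(probability of the right token)."""
    return -math.log(predicted_probs[correct_token])


# Two hypothetical predictions for "The Eiffel Tower is in ..."
confident = {"Paris": 0.90, "London": 0.05, "Rome": 0.05}
unsure = {"Paris": 0.10, "London": 0.45, "Rome": 0.45}

print(round(next_token_loss(confident, "Paris"), 3))  # 0.105
print(round(next_token_loss(unsure, "Paris"), 3))     # 2.303
```

Training pushes this number down, averaged over every position in every document the model sees.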
The training data includes books, Wikipedia, code repositories, scientific papers, forums, websites — essentially a large sample of human written knowledge. The model doesn’t memorize it verbatim. Instead, it learns to compress it into patterns, relationships, and rules that let it generate new text that follows similar distributions.
RLHF — teaching it to be helpful
After pre-training, you have a model that’s incredibly good at completing text — but it might complete “How do I make a bomb” just as readily as “What’s a good pasta recipe”. It has no sense of what makes a response good.
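This is where Reinforcement Learning from Human Feedback (RLHF) comes in. It’s a three-step process that aligns the model with human values and preferences: supervised fine-tuning on human-written example responses, training a reward model on human comparisons of model outputs, and then optimizing the model against that reward model with reinforcement learning.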
Fig 8. The RLHF pipeline. Three phases transform a raw pre-trained model into a helpful, harmless, honest assistant.
The beauty of RLHF is that humans don’t need to specify every rule. Instead of writing “be polite in situation X, be concise in situation Y”, human raters just pick the better of two responses. The reward model infers the patterns. It’s teaching by example, not by rule-writing.
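A common way to turn “pick the better of two responses” into a training signal is a pairwise loss on the reward model’s scores. The sketch below assumes that form, -log(sigmoid(r_chosen − r_rejected)); the scores are invented, and a real reward model is itself a large neural network rather than two numbers.

```python
import math


def preference_loss(reward_chosen, reward_rejected):
    """Pairwise loss: -log(sigmoid(r_chosen - r_rejected)).

    Small when the human-preferred response scores higher, large otherwise.
    """
    diff = reward_chosen - reward_rejected
    return math.log(1.0 + math.exp(-diff))  # equals -log(sigmoid(diff))


# Reward model agrees with the human rater: low loss.
print(round(preference_loss(2.0, 0.0), 3))  # 0.127
# Reward model disagrees with the human rater: high loss.
print(round(preference_loss(0.0, 2.0), 3))  # 2.127
```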
Fine-tuning and adapters
Pre-training + RLHF gives you a general-purpose assistant. But what if you want a model that specialises in medical diagnoses, or legal contracts, or customer support for your SaaS product? That’s where fine-tuning comes in.
Fine-tuning means taking a pre-trained model and continuing to train it on a smaller, domain-specific dataset. The weights shift slightly to accommodate the new knowledge, while largely retaining what was learned before. Think of it like a software engineer who already knows Python learning a new framework — they don’t forget Python, they just add a new skill.
The problem: fine-tuning is expensive
Full fine-tuning requires updating all the billions of parameters, so each training step costs about as much as a pre-training step and needs memory for gradients and optimizer state for every weight. Enter Parameter-Efficient Fine-Tuning (PEFT) methods, especially LoRA (Low-Rank Adaptation).
Fig 9. LoRA freezes the original weight matrix W and adds two small matrices A and B. During inference, the outputs are summed. Training only A and B costs a fraction of full fine-tuning.
LoRA works because of a hypothesis: the changes needed to specialize a model live in a low-dimensional subspace. Instead of modifying a 4096×4096 matrix (16M parameters), you train two small matrices of rank 16 — just 2×4096×16 = 131K parameters. At inference time, you merge them back: W + AB. Same speed, tiny training cost.
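The parameter arithmetic and the W + AB merge can be checked on a tiny example. This sketch uses plain Python lists instead of a tensor library and omits the scaling factor that LoRA implementations typically apply to AB; all names are illustrative.

```python
def lora_param_counts(d, rank):
    """Trainable parameters: full d x d matrix vs. two LoRA factors of rank `rank`."""
    full = d * d
    lora = 2 * d * rank  # A is d x rank, B is rank x d
    return full, lora


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def merge(w, a, b):
    """Fold the trained adapter back into the frozen weights: W + AB."""
    ab = matmul(a, b)
    return [[w[i][j] + ab[i][j] for j in range(len(w[0]))]
            for i in range(len(w))]


print(lora_param_counts(4096, 16))  # (16777216, 131072)

# A 2x2 toy with rank 1: only the 4 numbers in A and B would be trained.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0], [2.0]]
B = [[0.5, 0.0]]
print(merge(W, A, B))  # [[1.5, 0.0], [1.0, 1.0]]
```

After merging, the model has a single matrix of the original shape, which is why inference runs at the same speed as before.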
Inference — how generation works
When you chat with Claude or ChatGPT, what’s happening behind the scenes? The model doesn’t produce the whole response at once — it generates one token at a time, feeding each token back in as input to generate the next.
Fig 10. Autoregressive generation: each new token is appended to the context and the model runs again. This is why output appears token by token in your browser.
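The generation loop itself is short. In the sketch below, `toy_model` is a hypothetical stand-in that looks up the next token from the last one; a real LLM would run the full transformer over the context at that point.

```python
def toy_model(context):
    """Hypothetical stand-in for an LLM: picks the next token from the last one."""
    table = {"The": "cat", "cat": "sat", "sat": "down", "down": "<eos>"}
    return table.get(context[-1], "<eos>")


def generate(prompt_tokens, model, max_new_tokens=20):
    """Autoregressive loop: predict a token, append it, and run the model again."""
    context = list(prompt_tokens)
    for _ in range(max_new_tokens):
        next_token = model(context)
        if next_token == "<eos>":  # end-of-sequence token stops generation
            break
        context.append(next_token)
    return context


print(generate(["The"], toy_model))  # ['The', 'cat', 'sat', 'down']
```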
Temperature and sampling
At each step the model outputs a probability distribution over all ~100K tokens. How do you pick which one to use? That’s controlled by temperature:
- Temperature = 0: always pick the highest probability token. Deterministic, predictable, but repetitive. Good for factual Q&A.
- Temperature = 0.7: sample from the distribution, weighted by probability. More creative, some randomness. The default for most chatbots.
- Temperature = 1.5+: heavily randomized. Can produce wild creativity or complete nonsense.
Beyond temperature, there’s also top-p (nucleus) sampling — only consider the smallest set of tokens whose combined probability exceeds p (e.g. 0.9). This prevents the model from ever picking extremely unlikely tokens, which helps quality while preserving creativity.
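Temperature and top-p can be combined in a few lines. This is a simplified sketch over a three-token vocabulary with invented scores; the function name `sample` and its `rng` parameter are assumptions, and production samplers work on arrays of roughly 100K scores.

```python
import math
import random


def sample(logits, temperature=0.7, top_p=0.9, rng=random):
    """Pick a token from `logits` (token -> raw score) using temperature and top-p."""
    if temperature == 0:
        return max(logits, key=logits.get)  # greedy: always the top token

    # Softmax with temperature: lower values sharpen, higher values flatten.
    top = max(logits.values())
    exps = {t: math.exp((s - top) / temperature) for t, s in logits.items()}
    total = sum(exps.values())
    probs = {t: e / total for t, e in exps.items()}

    # Top-p: keep the smallest set of tokens whose probability reaches top_p.
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    nucleus, cumulative = [], 0.0
    for token, p in ranked:
        nucleus.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break

    tokens, weights = zip(*nucleus)
    return rng.choices(tokens, weights=weights, k=1)[0]


logits = {"Paris": 5.0, "London": 2.0, "banana": -3.0}  # hypothetical scores
print(sample(logits, temperature=0))  # Paris
print(sample(logits, temperature=1.5, top_p=0.9, rng=random.Random(0)))
```

With these scores, “banana” falls outside the nucleus even at temperature 1.5, so it can never be chosen.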
KV Cache — the inference speed trick
Running the full transformer over every token in your growing context at every step would be agonizingly slow. The KV cache solves this: the model stores the Key and Value vectors it has already computed for all previous tokens. On the next step, it only computes the query, key, and value for the new token and attends over the cached entries, rather than recomputing everything from scratch. Per step, that cuts the attention work from roughly O(n²) to O(n) in the context length n, which is the difference between a fast chatbot and an unusable one.
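The saving is easy to see by counting how many key/value projections get computed. In this sketch, `project` is a hypothetical stand-in for a layer’s key and value projections, and the attention step itself is left as a comment.

```python
def project(token_vector):
    """Hypothetical stand-in for one layer's key and value projections."""
    key = [2.0 * x for x in token_vector]
    value = [x + 1.0 for x in token_vector]
    return key, value


def generate_without_cache(token_vectors):
    """Re-project the whole context at every step; return the projection count."""
    projections = 0
    for step in range(1, len(token_vectors) + 1):
        for vec in token_vectors[:step]:
            project(vec)
            projections += 1
    return projections


def generate_with_cache(token_vectors):
    """Project each token once and keep its key/value for all later steps."""
    cached_keys, cached_values = [], []
    projections = 0
    for vec in token_vectors:
        key, value = project(vec)
        cached_keys.append(key)
        cached_values.append(value)
        projections += 1
        # Attention for the new token would read cached_keys / cached_values here.
    return projections, cached_keys, cached_values


context = [[float(i), 1.0] for i in range(100)]
print(generate_without_cache(context))  # 5050
print(generate_with_cache(context)[0])  # 100
```

The price is memory: the cache grows with every token, which is one reason long contexts are expensive to serve.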
Scaling laws and emergent abilities
Perhaps the most surprising discovery in AI research isn’t a new architecture — it’s that making existing architectures bigger and training them on more data reliably produces better models. This regularity is captured in scaling laws.
Fig 11. Model loss drops predictably with compute — a power law relationship. The Chinchilla paper (2022) showed that most pre-2022 models were over-parameterized and under-trained on data.
The Chinchilla insight
In 2022, DeepMind published the “Chinchilla” paper with a counterintuitive finding: GPT-3 (175B parameters, trained on 300B tokens) was severely under-trained. Optimal compute allocation means scaling parameters and data in roughly equal proportions. Chinchilla itself, a 70B parameter model trained on 1.4 trillion tokens, outperformed much larger models trained on fewer tokens, including the 175B GPT-3. For a fixed compute budget, more data per parameter beats brute-force parameter count.
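The gap is clearest as a tokens-per-parameter ratio, computed here from the figures quoted above. The helper name is an assumption for this sketch.

```python
def tokens_per_parameter(n_params, n_tokens):
    """How many training tokens the model saw for each of its parameters."""
    return n_tokens / n_params


gpt3 = tokens_per_parameter(175e9, 300e9)
model_70b = tokens_per_parameter(70e9, 1.4e12)

print(round(gpt3, 1))       # 1.7
print(round(model_70b, 1))  # 20.0
```

The smaller model saw roughly twelve times more data per parameter.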
Emergent abilities
Here’s the uncanny part: as models scale, some capabilities don’t improve gradually — they appear suddenly, at a threshold. A model below 10B parameters can barely do 3-digit arithmetic. Cross some threshold and a 100B model solves it reliably. These emergent abilities include chain-of-thought reasoning, few-shot learning, code generation, and more.
Putting it all together
Let’s zoom all the way back out. When you type a question and an LLM answers you, here is the full pipeline in one breath:
1. An LLM is a neural network: a software system of weighted connections that learns by adjusting numbers across billions of examples. It runs on GPU clusters during training and on inference servers when you talk to it.
2. Your text is split into tokens: subword chunks from a fixed vocabulary.
3. Each token is converted to an embedding: a high-dimensional vector encoding meaning and position.
4. Embeddings pass through N transformer layers, each running multi-head attention (where every token considers every other token) and a feed-forward network.
5. A prediction head outputs a probability distribution over all ~100K tokens. Temperature + sampling pick the next one.
6. That token is appended to the context (cached via KV cache), and step 5 repeats until the model produces an end-of-sequence token.
7. All of this was shaped by pre-training on trillions of tokens, refined by RLHF on human preferences, and optionally specialized via fine-tuning with adapters like LoRA.
8. The model’s capabilities follow scaling laws (more compute + data → lower loss → better reasoning), with some abilities appearing suddenly at scale thresholds.
There’s no magic here — just elegant mathematics applied at extraordinary scale. The wonder isn’t that it works. The wonder is that next-token prediction, that simple-sounding game of “guess what comes next”, turned out to be enough.
