
Architecture Explained: 9 Essential Concepts for AI Engineers

Every LLM you work with today is a Transformer. Understanding the architecture does not mean implementing it. It means knowing why context placement affects your prompts, why RAG needs two separate model calls, and why bigger models know more facts. This post breaks down nine core concepts: self-attention, Q/K/V, attention scores, multi-head attention, the FFN, layer norm, residual connections, positional encoding, and encoder vs decoder. Plain English analogies and visual diagrams throughout. No math, no ML history, no prior knowledge assumed.

The Transformer architecture is the foundation of every LLM you use today: GPT-4, Claude, Gemini, Llama. All of them are Transformers.

When you write a system prompt, chunk a document for RAG, or wonder why your model loses context beyond 128k tokens, the Transformer architecture is what is happening under the hood. You do not need to build one. But understanding how it works changes how you use it.

This post covers nine core concepts with plain English analogies, visual diagrams, and direct callouts for how each one affects your day-to-day work. No math. No history. No comparisons to older models.

1. Self-Attention: Everyone Talks to Everyone at Once

This is the most important idea in the whole architecture. When the model reads a sentence, every word looks at every other word simultaneously. Not one at a time. Not left to right. All at once, in parallel.

Think of a meeting room where every person can pass a sticky note to every other person at the exact same instant. Each note asks: “How relevant are you to what I am trying to say?” Every person reads all the replies and decides who to pay attention to. That is self-attention.

In practice, the word “bank” in “I deposited money at the bank” pays very high attention to “deposited” and “money”, and almost nothing to “I” or “at”. The model figures this out entirely by itself during training. You never told it what “bank” means.

Self-Attention: how “bank” reads the sentence
[Diagram: in "I deposited money at the bank", every token attends to every other simultaneously. "bank" spends roughly 60% of its attention on "money", 30% on "deposited", and effectively ignores "I" and "at".]

What this means in practice: every token you add to a prompt shifts how the model weighs everything else. A vague instruction early in a long prompt can get drowned out by more specific text near the actual query. This is why context placement matters when you are writing prompts.

2. Query, Key, Value: The Search Engine Inside Every Token

The attention mechanism runs every token through three different lenses at the same time. These are called Q (Query), K (Key), and V (Value), and each is produced by a weight matrix the model learns during training.

I find the library analogy the clearest way to think about this. Your Query is what you type into the search bar: “books about financial institutions.” The Key is the index tag on each book’s spine: “Banking, Finance, Rivers.” The model matches your query against all keys to find the best fit. Then it opens the book and reads. That content is the Value.

Q, K, V: Three Lenses on Every Token
[Diagram: one input token, "bank", viewed through three lenses. Q (Query) asks "What am I looking for?", the question. K (Key) asks "What do I advertise?", the label. V (Value) asks "What do I hand over?", the content. Ask question, match labels, receive content.]

Q, K, and V are all learned during training. The model teaches itself which questions to ask, how to label tokens, and what information to extract. You do not define any of this. It figures it out from billions of examples of human text.
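To make the three lenses concrete, here is a minimal NumPy sketch. The matrices W_q, W_k, and W_v are random stand-ins for weights a real model learns during training, and the dimension is a toy size, not any particular model's.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # embedding dimension (toy size; real LLMs use 4096+)

# three learned projection matrices -- random stand-ins for trained weights
W_q, W_k, W_v = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))

token = rng.standard_normal(d)  # embedding for a token like "bank"
q = token @ W_q  # Query: the question this token asks
k = token @ W_k  # Key: the label this token advertises
v = token @ W_v  # Value: the content this token hands over
```

The same input vector yields three different views. During training, gradient descent shapes the three matrices so that useful questions end up matching useful labels.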

3. Attention Score: Deciding Who Gets Heard

Once every token has a Q and every other token has a K, the model computes a score for every pair. It is basically asking: how much should token A pay attention to token B?

Think of it like a compatibility score. Every word scores every other word based on how well their Q and K match. After all scores are computed, a softmax step converts the raw numbers into percentages that sum to 1.0. The result looks like: “spend 60% of your attention on ‘money’, 30% on ‘deposited’, 10% on the rest.”

There is also a scaling step where the score is divided by the square root of the dimension size. Without this, scores can get so large that softmax becomes extremely spiky and essentially ignores everything except the single top match. Scaling keeps the distribution healthy.
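The effect of the scaling step is easy to see in a few lines of NumPy. This is a sketch with random numbers standing in for real Q·K scores; the dimension 4096 is just a typical size.

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d = 4096
# raw Q.K dot products grow with the dimension; simulate that growth
raw = rng.standard_normal(5) * np.sqrt(d)

spiky = softmax(raw)                 # without scaling: winner takes all
healthy = softmax(raw / np.sqrt(d))  # with scaling: a usable spread
```

Without scaling, one token ends up with essentially 100% of the attention; with scaling, the weight is spread across several tokens.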

Attention Score: Four-Step Computation Pipeline
[Diagram: the four-step pipeline. Step 1: Q·K dot product. Step 2: divide by √d to prevent spiky scores. Step 3: softmax, so the weights sum to 1.0. Step 4: weighted blend of V. Result: an attended output where "bank" knows it is in a financial context.]
```python
# Conceptual flow, for each token (runnable NumPy sketch)
import numpy as np

def attend(q_token, K_others, V_others):
    raw_scores = K_others @ q_token              # Q·K dot products
    scaled = raw_scores / np.sqrt(q_token.size)  # divide by sqrt(d)
    weights = np.exp(scaled - scaled.max())
    weights = weights / weights.sum()            # softmax, sums to 1.0
    return weights @ V_others                    # weighted blend of values
```

4. Multi-Head Attention: Eight Experts Reading the Same Sentence

One round of attention can only capture one type of relationship at a time. Multi-head attention solves this by running the entire Q/K/V process H times in parallel. Each head learns a completely different relationship between tokens.

Picture giving one article to eight editors at the same time. Editor 1 looks for grammatical structure. Editor 2 tracks who the subject of each sentence is. Editor 3 watches for negations and conditionals. Editor 4 spots named entities. Each editor works independently and produces their own notes. Then all the notes get combined into one final verdict. That is multi-head attention. Each head has its own separate Q/K/V matrices, learns different patterns, and the outputs are concatenated and projected back into a single vector at the end.

Multi-Head Attention: 4 Heads Running in Parallel
[Diagram: input embeddings feed four heads in parallel, each with its own Q/K/V: grammar structure (subject/verb/object), coreference resolution ("she" → "Mary"), semantic similarity (meaning clusters), and positional patterns (word-order cues). The head outputs are concatenated, merged by a linear projection into one rich vector, then passed to the FFN.]

This is why LLMs handle ambiguous, multi-part instructions better than you might expect. When you write “summarise this document in a professional tone focusing on risks”, separate heads track the task, the style constraint, and the content filter in parallel. They do not compete. Their outputs get combined.
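A minimal sketch of four heads running in parallel, with random stand-in weights. The head "specialisations" in the editor analogy emerge from training; nothing here programs them in.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 6, 32, 4
head_dim = d_model // n_heads

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

x = rng.standard_normal((seq_len, d_model))  # one embedding per token

head_outputs = []
for _ in range(n_heads):
    # every head gets its OWN Q/K/V matrices (random stand-ins here)
    Wq, Wk, Wv = (rng.standard_normal((d_model, head_dim)) for _ in range(3))
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    weights = softmax(Q @ K.T / np.sqrt(head_dim))
    head_outputs.append(weights @ V)

merged = np.concatenate(head_outputs, axis=-1)  # concat all heads
W_o = rng.standard_normal((d_model, d_model))
final = merged @ W_o                            # linear projection back
```

Each head attends in its own lower-dimensional subspace, so four heads cost roughly the same as one full-width head while capturing four different relationship types.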

5. Feed-Forward Network: Where the Model’s Memory Lives

After attention figures out which tokens to focus on, each token passes through a feed-forward network (FFN). It is a small two-layer neural network applied independently to every token position.

Here is the insight most people miss. The attention layer is about relationships: token A relates to token B. The FFN layer is about facts: what does token A, in this context, actually know?

Attention tells the model: “the word ‘Python’ in this sentence is about programming, not snakes” because nearby tokens like “function” and “import” had high attention weight. The FFN then fires a set of neurons that say: “Python means programming language, created by Guido van Rossum, dynamically typed, widely used in AI.” The FFN is the model’s knowledge base. This is why bigger models with more FFN parameters know more facts.
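A toy two-layer FFN makes the shape of this lookup clear. Real models use much larger dimensions and gated activations; here a small d and a plain ReLU stand in, with random weights in place of trained ones.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 512  # model dimension (toy; real LLMs use 4096 or more)

W1 = rng.standard_normal((d, 4 * d)) * 0.02  # layer 1: expand 4x wider
W2 = rng.standard_normal((4 * d, d)) * 0.02  # layer 2: project back to d

def ffn(x):
    hidden = np.maximum(0.0, x @ W1)  # which "knowledge neurons" fire
    return hidden @ W2                # write the retrieved knowledge back

token = rng.standard_normal(d)  # "Python", already contextualised by attention
enriched = ffn(token)           # same shape, knowledge mixed in
```

The 4x expansion is where the parameter count lives: most of a Transformer's weights sit in these two matrices, repeated per layer.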

FFN: Per-Token Knowledge Transformation
[Diagram: the token vector for "Python" (context established by attention: code, import, function; d=4096 dims) passes through the FFN, with layer 1 expanding 4x wider and layer 2 projecting back to d. The enriched vector now carries: language, Guido, dynamically typed, AI/ML. Knowledge added.]

There is research showing you can surgically edit a model’s factual knowledge by updating specific FFN weights without retraining the whole model. This is the basis of model editing and knowledge patching techniques. The facts really are localised there.

6. Layer Normalisation: Keeping the Signal From Exploding

Every token is a long list of numbers called its embedding vector. After each sub-layer (attention or FFN), those numbers can drift badly. Some become very large, others shrink toward zero. Layer normalisation rescales them back to a stable, consistent range before the next sub-layer runs.

Think of mixing eight instruments in a studio. If one guitar suddenly gets 10x louder, it dominates the whole track and everything else becomes inaudible. Layer Norm is the compressor pedal that keeps every signal in a reasonable range so all voices stay audible. Without it, training collapses. Gradients either explode or vanish, and the model never converges.
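The normalisation itself is a two-line operation. This is a minimal sketch: production implementations also learn a per-dimension scale and shift (gamma and beta), omitted here.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # rescale one token's activation vector to mean 0, std 1
    return (x - x.mean()) / (x.std() + eps)

x = np.array([120.0, -0.003, 7.5, -45.0])  # activations all over the place
normed = layer_norm(x)                     # back to a stable range
```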

Layer Norm: Signal Before vs After Normalisation
[Diagram: before Layer Norm, activation values are all over the place; after Layer Norm (mean=0, std=1), they sit in a stable, consistent range.]

Modern LLMs use Pre-Layer Norm, which means normalisation happens before each sub-layer rather than after. This stabilises training at large scale and is one of the reasons 70B+ models train reliably without exploding gradients.

7. Residual Connections: The Shortcut That Makes Deep Networks Work

A Transformer can have 96, 128, or more layers stacked on each other. Without residual connections, training that deep is nearly impossible. The gradient signal degrades as it flows backwards through all those layers and lower layers simply stop learning.

Imagine driving through 96 city intersections, one per layer. Every traffic light costs you momentum and time. A residual connection is a highway bypass that lets the original signal skip directly to later layers, while the normal processing route still runs in parallel. Gradients flow backward through the highway unimpeded, so even layer 80 gets a clear training signal. This is also why early layers can specialise in basic syntax while upper layers build higher-level semantics. The residual stream carries a progressively refined representation all the way through.

Residual Connection: The Skip Highway Pattern
[Diagram: the input x splits in two. One route runs through the sub-layer (attention or FFN); the residual highway carries the original x through unchanged. The two are added and normalised: LN(x + SubLayer(x)), learned output plus original preserved.]
```python
# What happens at EVERY sub-layer, both attention and FFN:
#   x           -- the original input, always preserved (the highway)
#   SubLayer(x) -- what the sub-layer learned to add on top
output = LayerNorm(x + SubLayer(x))
```

8. Positional Encoding: Teaching the Model Where Words Live

There is a subtle but critical problem with attention: it is position-agnostic by default. If you shuffle all the tokens in a sentence randomly, the raw attention computation produces identical scores. It only looks at which tokens match, not where they sit in the sequence. “The cat bit the dog” and “The dog bit the cat” would look identical to the model without positional encoding.

Write every word on a separate flashcard and throw them in a bag. Without positional encoding, the Transformer reaches in, sees all the cards at once, and has no idea which came first, second, or last. Positional encoding is writing a sequence number on each card before they go in. Now the model knows the order and can factor it into every decision it makes.
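The original sinusoidal scheme shows the idea in a few lines. This is a sketch of that classic variant; modern models mostly use RoPE instead, but the principle of stamping each position is the same.

```python
import numpy as np

def sinusoidal_pe(seq_len, d):
    pos = np.arange(seq_len)[:, None]     # 0, 1, 2, ... one per token
    i = np.arange(d // 2)[None, :]
    angles = pos / (10000 ** (2 * i / d))
    pe = np.zeros((seq_len, d))
    pe[:, 0::2] = np.sin(angles)          # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)          # odd dimensions: cosine
    return pe

pe = sinusoidal_pe(6, 64)  # one unique stamp per position
# position_aware = token_embeddings + pe  (added before layer 1)
```

Every position gets a distinct pattern of waves, so the model can tell "first word" from "fifth word" even though attention itself ignores order.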

Positional Encoding: Injecting Order Into Every Token
[Diagram: each token embedding in "The cat sat on the mat" has PE(pos=0) through PE(pos=5) added to it: token embedding + positional signal = position-aware vector. Each token now carries both its meaning and where it sits in the sentence. RoPE (Rotary Positional Encoding) is the modern variant used by Llama, Claude, and GPT-4.]

Context window limits (4k, 8k, 128k, 1M tokens) are directly tied to how far positional encoding can accurately represent positions. Techniques like RoPE and YaRN exist specifically to extend models beyond their original training context length. When a vendor advertises an “extended context” model, this is the underlying mechanism they improved.

9. Encoder vs Decoder: Why BERT, GPT, and T5 Are Fundamentally Different

The original Transformer had two parts: an encoder to read and understand input, and a decoder to generate output word by word. Modern models specialised into using just one of those parts, or both. This is not a minor implementation detail. It determines what the model can and cannot do.

An encoder is like a researcher who reads an entire document before forming an opinion. It sees every word at once (bidirectional). A decoder is like a writer drafting a sentence. It can only see what it has already written, never what comes next (causal, left-to-right). An encoder-decoder is a translator who reads the full source sentence first, then dictates the translation one word at a time.
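Mechanically, the whole difference comes down to the attention mask. A sketch:

```python
import numpy as np

seq_len = 5

# encoder: bidirectional -- every token may attend to every token
encoder_mask = np.ones((seq_len, seq_len))

# decoder: causal -- token i may only attend to tokens 0..i
decoder_mask = np.tril(np.ones((seq_len, seq_len)))

# row 2 of the decoder mask is [1, 1, 1, 0, 0]: the future is blocked
```

The masked-out positions get their attention scores forced to negative infinity before the softmax, so the "writer" genuinely cannot peek ahead.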

Attention Patterns: Encoder vs Decoder vs Encoder-Decoder
[Diagram: three attention patterns side by side. Encoder-only (BERT, RoBERTa, BGE): bidirectional, every token sees all tokens; best for embeddings, classification, RAG retrieval; masked LM training. Decoder-only (GPT-4, Claude, Llama): causal, the future is blocked; best for generation, chat, reasoning, code; next-token prediction. Encoder-decoder (T5, BART, mT5): read all, then write one token at a time, with cross-attention bridging encoder to decoder; seq-to-seq training.]
| Architecture | Examples | Attention | Best For | Trained With |
| --- | --- | --- | --- | --- |
| Encoder-only | BERT, RoBERTa, BGE | Bidirectional, sees all tokens | Embeddings, classification, RAG retrieval | Masked Language Model |
| Decoder-only | GPT-4, Claude, Llama, Gemini | Causal, past tokens only | Generation, chat, reasoning, code | Next token prediction |
| Encoder-Decoder | T5, BART, mT5 | Both plus cross-attention | Translation, summarisation, T2T | Sequence-to-sequence |

If you are building a RAG system, your embedding model (BGE, text-embedding-3) is encoder-only. Your generation model (Claude, GPT-4) is decoder-only. They use completely different internal attention patterns, which is exactly why you need two separate model calls. One reads everything at once to build a representation. The other generates left-to-right, one token at a time.

The Transformer Architecture: Mental Model in One View

Every time an LLM processes your prompt, this pipeline runs once per layer, repeated 32 to 128 or more times:

| Concept | What It Does | Why You Care |
| --- | --- | --- |
| Self-Attention | Every token attends to every other simultaneously | Context placement in prompts matters |
| Q / K / V | Ask, Match, Receive: a learned search inside each token | The model figures out relevance itself |
| Attention Score | Q·K, scale, softmax, weighted blend of V | How importance becomes a number |
| Multi-Head | H heads run in parallel, each learning different relationships | Handles multi-part instructions well |
| FFN | Per-token knowledge lookup after attention | Where facts are stored; bigger model = more facts |
| Layer Norm | Rescales activations to a stable range each layer | Why training does not collapse |
| Residual | Skip connections carry the original signal forward | How 96-layer networks train without exploding |
| Positional Enc. | Injects token order since attention ignores position | Why context window limits exist |
| Enc. vs Dec. | Bidirectional understanding vs causal generation | Why RAG needs two separate model calls |

You do not need to implement any of this. But the next time you are debugging a prompt, designing a RAG pipeline, or choosing between an embedding model and a generation model, you will know exactly which part of the Transformer to look at.

If you want to go deeper, the original Attention Is All You Need paper is surprisingly readable once you have this foundation. The Hugging Face Transformers docs are also a good next step for seeing how these concepts map to real model code. If you are building on top of LLMs, my post on LLM tokenization covers how text becomes tokens before it ever reaches the transformer architecture.

Prabhat Kashyap

Senior Technical Architect @ HCL Tech · working with Leonteq Security AG

10+ years building distributed systems and fintech platforms. I write about the things I actually debug at work — the messy, non-obvious parts that don't make it into official docs.

Engineering deep dives on Scala, Java, Rust, and AI Systems. Written by a senior engineer who builds real fintech systems.

© 2026 prabhat.dev