Tokenization Explained: How Text Is Processed Before It Reaches an LLM
Tokenization is the process that converts raw text into numbers before it reaches a language model. This post breaks down how UTF-8 encoding and Byte Pair Encoding (BPE) work together – with real examples, token split tables, and a look at why tokenization affects API pricing, context limits, and model behaviour.
Prabhat Kashyap
When you type a message into ChatGPT or Claude, you are not handing the model a string of characters. What arrives at the model is a sequence of integers – each one representing a chunk of your original text. The journey from raw characters to those integers is called tokenization, and it involves a surprisingly elegant algorithm.
Let us walk through it step by step, with real examples.
Step 1 – Encoding text as bytes
Every character on your keyboard (and beyond) can be represented as a sequence of bytes using UTF-8 encoding. A byte is a value from 0 to 255 – giving us exactly 256 possible symbols to start with.
Take the word “token”. Here is what its UTF-8 byte values look like:
Character   | t   | o   | k   | e   | n
UTF-8 byte  | 116 | 111 | 107 | 101 | 110
Each character maps to a number between 0 and 255. Simple ASCII characters like English letters are always a single byte. Characters outside basic English – like the accented é in “café” – take two bytes (195 and 169). Emoji can take three or four bytes each. This matters later when we talk about cost.
This byte-level representation is universal – it works for any language, any script, any emoji. But feeding raw bytes directly to a model is inefficient. Common English words such as the, and, and is would all be shredded into individual bytes, forcing the model to reconstruct meaning from tiny fragments. We can do better.
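The byte values above are easy to verify yourself. Here is a quick sketch in Python (any language with proper UTF-8 support would show the same numbers):

```python
# Inspect the raw UTF-8 bytes behind a few strings.
for text in ["token", "café", "🚀"]:
    data = text.encode("utf-8")
    print(f"{text!r}: {list(data)} ({len(data)} bytes)")

# 'token': [116, 111, 107, 101, 110] (5 bytes)
# 'café':  [99, 97, 102, 195, 169]   (5 bytes) -- é is two bytes
# '🚀':    [240, 159, 154, 128]      (4 bytes) -- emoji need 3-4 bytes
```

Note how every byte stays in the 0–255 range regardless of script, which is exactly what makes this representation universal.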
Why not just use characters? A character-level approach keeps the vocabulary small, but sequences become very long. The model must process more positions to understand the same text – which is expensive and makes long-range patterns harder to learn.
Step 2 – Byte Pair Encoding (BPE)
The solution is to find a smarter middle ground – chunks of text that are larger than single bytes but smaller than full words. This is what Byte Pair Encoding (BPE) does.
Originally a data compression algorithm, BPE was adapted for NLP by Sennrich et al. in 2016 and later adopted by OpenAI for the GPT series. The algorithm works like this:
1. Start with your training corpus encoded as bytes. The vocabulary begins at 256 symbols (one per possible byte value).
2. Count every adjacent pair of symbols across the entire corpus and find the most frequently occurring pair.
3. Merge that pair into a single new symbol and replace every occurrence of it in the corpus. For example, if t + h is the most common pair, create a new symbol th.
4. Repeat from step 2. Each iteration the sequences get shorter and the vocabulary grows by one. This runs for thousands of iterations.
5. Stop when you reach your target vocabulary size.
A concrete trace
Suppose our corpus contains just two words repeated many times: low and lower. We start with individual bytes for each character, then watch the merges happen:

Merge 1: l + o → lo     corpus becomes  lo w | lo w e r
Merge 2: lo + w → low   corpus becomes  low | low e r
Merge 3: e + r → er     corpus becomes  low | low er

After just three merge operations, the two-word corpus has gone from eight separate byte symbols down to three tokens: low, low, and er. In a real training run, this process happens across billions of words. Common sequences like ing, tion, and pre – and even full short words like the or is – quickly become single tokens.
Step 3 – What the vocabulary looks like
After enough merges, the vocabulary is rich and varied. Some entries are full common words. Others are subwords, prefixes, or suffixes. Rare words get split into smaller pieces but they are always representable.
Here is how GPT-4’s tokenizer (cl100k_base) actually splits some example words:
Input word    | Tokens              | Token count
tokenization  | token · ization     | 2
unbelievable  | un · believ · able  | 3
café          | caf · é             | 2 (é costs more)
the           | the                 | 1 (extremely common)
🚀            | 🚀                  | 3 (emoji = multiple bytes)
Notice the pattern: common English subwords and short words become single tokens. Rare combinations or non-ASCII characters get split further. The tokenizer is, in effect, a compression algorithm that has learned the statistical structure of language.
The tradeoff – sequence length vs. vocabulary size
At the heart of BPE is a deliberate tension between two forces pulling in opposite directions:
More merges → shorter sequences, larger vocabulary. Fewer tokens per sentence means the model can attend to longer contexts with the same compute budget.
Fewer merges → longer sequences, smaller vocabulary. Each token carries less semantic meaning but the vocabulary is simpler to manage.
The sweet spot – where this tradeoff is most efficient for natural language – turns out to be vocabularies in the range of tens of thousands to a few hundred thousand symbols. Here is how major models compare:
Tokenizer             | Model           | Vocabulary size | Multi-byte character support
r50k_base             | GPT-3           | 50,257          | Limited
cl100k_base           | GPT-3.5 / GPT-4 | 100,277         | Good
o200k_base            | GPT-4o          | ~200,000        | Excellent
SentencePiece (BPE)   | LLaMA 2         | 32,000          | Good
BPE (tiktoken-style)  | LLaMA 3         | 128,256         | Excellent
Why this matters in practice
Tokenization is not just an implementation detail – it directly affects how you build with and pay for LLMs.
Pricing is counted in tokens, not words
APIs charge per token. A rough rule of thumb: 1 token ≈ 0.75 words in English. Code, JSON, and non-English text can be significantly more expensive – a Python snippet with lots of punctuation or a Hindi paragraph may cost 2–3× more tokens per character than plain English prose. When you are optimising prompts for cost, you are really optimising token count.
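The 0.75-words-per-token rule of thumb can be turned into a quick planning estimator. A sketch – the function names and the price are placeholders, and real counts always require the model's own tokenizer and the provider's current pricing page:

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate for English prose using the ~0.75 words/token
    rule of thumb. Code, JSON, and non-English text run significantly higher."""
    words = len(text.split())
    return round(words / 0.75)

def estimate_cost_usd(text: str, usd_per_1k_tokens: float) -> float:
    # usd_per_1k_tokens is a placeholder rate, not a real price.
    return estimate_tokens(text) / 1000 * usd_per_1k_tokens

prompt = "Summarise the quarterly report in three bullet points."
print(estimate_tokens(prompt))   # 8 words -> ~11 tokens
```

For anything where the count actually matters – billing, context budgeting, truncation – run the model's real tokenizer instead of a heuristic.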
The model “sees” tokens, not characters
This is why models sometimes struggle with simple spelling tasks. When asked “how many r’s are in ‘strawberry’?”, the model does not see individual letters – it sees tokens like straw and berry. Counting characters requires looking inside a token, which is not how the model was trained to process information.
Context limits are in tokens
When a model advertises a “128K context window,” that means 128,000 tokens – not words, not characters. A typical novel (~80,000 words, roughly 107,000 tokens at 0.75 words per token) just fits. A large codebase almost certainly does not.
Tokenization is the lens through which every language model sees the world. Understanding it gives you a sharper intuition for why models behave the way they do.
Putting it all together
Here is the full journey from your keyboard to the model:

raw text → UTF-8 bytes → BPE merges → token IDs → embedding vectors → transformer layers
The model never sees your original text. It sees a list of integers. Each integer is looked up in a table to retrieve a high-dimensional vector (the embedding), and that vector is what actually flows through the neural network layers.
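That embedding lookup can be sketched with a toy table. The sizes and values below are illustrative placeholders, not any real model's weights:

```python
import random

random.seed(0)

VOCAB_SIZE, EMBED_DIM = 8, 4   # toy sizes; real models use ~100k ids x thousands of dims

# The embedding table: one learned vector per token id (random stand-ins here).
embedding_table = [[random.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)]
                   for _ in range(VOCAB_SIZE)]

token_ids = [3, 1, 5]                               # hypothetical tokenizer output
vectors = [embedding_table[t] for t in token_ids]   # what the network actually sees
print(len(vectors), len(vectors[0]))                # 3 positions, each a 4-dim vector
```

Everything downstream of this lookup operates on vectors; the original characters are gone for good.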
Tokenization is the bridge between human language and machine arithmetic. It is a deceptively simple idea – compress frequent patterns into single symbols – that turns out to be one of the most consequential design decisions in the entire LLM stack.
See it for yourself
Tiktokenizer is a free browser tool that lets you paste any text and see exactly how it gets split into tokens, with support for multiple models including GPT-4, GPT-4o, and LLaMA. Try pasting a code snippet, a sentence in Hindi, or a run of emoji – the differences in token count are immediately visible and genuinely surprising.
Senior Technical Architect
@ HCL Tech · working with Leonteq Security AG
10+ years building distributed systems and fintech platforms.
I write about the things I actually debug at work — the messy,
non-obvious parts that don't make it into official docs.