Tokens vs Bytes in AI: What LLMs Actually See When You Type

Crazyrouter Team
March 29, 2026

You type "你好 Hello" into GPT-5. That's 8 characters, counting the space. But the model processes it as 2 tokens, and your bill is based on those tokens, not the characters.

Meanwhile, your computer stores that same text as 12 bytes.

So what's the difference between bytes, characters, and tokens? Why does AI use tokens instead of raw bytes? And why does the same sentence cost more in Chinese than in English?

This guide explains the full pipeline — from the bytes on your hard drive to the tokens that AI actually reads.


Start at the Bottom: What Is a Byte?#

A byte is the smallest addressable unit of data your computer stores. One byte = 8 bits = a number from 0 to 255.

When you save text to a file, your computer encodes each character into one or more bytes, most commonly using a standard called UTF-8:

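| Character | UTF-8 Bytes | Byte Count | Hex |
| --- | --- | --- | --- |
| H | 72 | 1 | 48 |
| e | 101 | 1 | 65 |
| 你 | 228, 189, 160 | 3 | e4 bd a0 |
| 好 | 229, 165, 189 | 3 | e5 a5 bd |
| 🚀 | 240, 159, 154, 128 | 4 | f0 9f 9a 80 |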

Key pattern:

  • English letters: 1 byte each
  • Most Chinese/Japanese/Korean characters: 3 bytes each
  • Most emojis: 4 bytes each (multi-part emoji sequences take more)

So the text "你好 Hello" is stored as 12 bytes:

```
你     好     [space]  H   e   l   l   o
e4 bd a0  e5 a5 bd  20  48  65  6c  6c  6f
(3 bytes) (3 bytes) (1) (1) (1) (1) (1) (1) = 12 bytes
```
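To check these byte counts yourself, here is a small sketch that uses only Python's built-in `str.encode`. The helper name `utf8_breakdown` is made up for this illustration.

```python
def utf8_breakdown(text: str) -> list[tuple[str, int, str]]:
    """Return (character, byte count, hex bytes) for each character in text."""
    rows = []
    for ch in text:
        encoded = ch.encode("utf-8")
        rows.append((ch, len(encoded), encoded.hex(" ")))
    return rows


text = "你好 Hello"
for ch, size, hex_bytes in utf8_breakdown(text):
    print(f"{ch!r}: {size} byte(s) -> {hex_bytes}")

print("characters:", len(text))                # 8, counting the space
print("bytes:", len(text.encode("utf-8")))     # 12
```

Note that `len(text)` counts Unicode code points, which matches what a reader sees as "characters" for this string but not for every emoji sequence.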

Bytes are how computers store text. But AI models don't read bytes directly.


The Four Levels: Bytes → Characters → Words → Tokens#

There are four ways to break down a piece of text. Each level groups the data differently:

| Level | "Hello, World" | Count | Description |
| --- | --- | --- | --- |
| Bytes | 48 65 6c 6c 6f 2c 20 57 6f 72 6c 64 | 12 | Raw storage units |
| Characters | H e l l o , ␣ W o r l d | 12 | Human-readable letters |
| Words | Hello, World | 2 | Space-separated units |
| Tokens | Hello , World | 3 | What AI actually processes |

Now the same comparison with Chinese:

| Level | "你好世界" | Count | Description |
| --- | --- | --- | --- |
| Bytes | e4 bd a0 e5 a5 bd e4 b8 96 e7 95 8c | 12 | 3 bytes per character |
| Characters | 你 好 世 界 | 4 | One Han character each |
| Words | 你好 世界 | 2 | Meaning-based segmentation |
| Tokens | 你好 世界 | 2 | Depends on the tokenizer |

[Figure: four-level hierarchy showing how text is represented as bytes, characters, words, and tokens in AI systems]

The critical insight: tokens are not bytes, not characters, and not words. They're a middle ground — sub-word units that balance vocabulary size with sequence length.
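The first three levels can be counted with plain Python, as in the sketch below. The function name `text_levels` is hypothetical, and the word count is a naive whitespace split, which is an assumption that only suits space-delimited languages. Token counts are left out because they require a specific tokenizer.

```python
def text_levels(text: str) -> dict[str, int]:
    """Count bytes, characters, and whitespace-separated words in text."""
    return {
        "bytes": len(text.encode("utf-8")),
        "characters": len(text),        # Unicode code points
        "words": len(text.split()),     # naive: splits on whitespace only
    }


print(text_levels("Hello, World"))
print(text_levels("你好世界"))
```

For "你好世界" the whitespace split reports 1 word rather than the 2 shown in the table, because Chinese has no spaces between words. Meaning-based segmentation needs a dedicated word segmenter.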


Why Not Just Use Bytes or Words?#

If bytes and words are simpler, why did AI invent tokens? Because both extremes have serious problems:

Problem with bytes: sequences are too long#

"Hello" is 5 bytes. A 1,000-word blog post is ~5,000 bytes. A novel is ~500,000 bytes.

An AI model needs to process the relationships between all positions in a sequence. The computational cost grows quadratically with sequence length (O(n²) for attention). Double the sequence length → 4x the compute cost.

Using raw bytes would make sequences 3-4x longer than necessary, making AI models unaffordably slow.
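A quick back-of-the-envelope sketch makes the penalty concrete. It models only the quadratic attention term, which is a simplifying assumption: other parts of a transformer scale roughly linearly with length. The function name is made up for this example.

```python
def attention_cost_multiplier(length_multiplier: float) -> float:
    """Relative attention cost when sequence length is scaled, assuming O(n^2)."""
    return length_multiplier ** 2


for factor in (2, 3, 4):
    cost = attention_cost_multiplier(factor)
    print(f"{factor}x longer sequence -> {cost:.0f}x attention cost")
```

Under this assumption, a sequence that is 3-4x longer costs roughly 9-16x more in the attention layers.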

Problem with words: vocabulary explodes#

English has over 170,000 words in common use. Add technical terms, names, URLs, code, and other languages — you'd need a vocabulary of millions.

A huge vocabulary means:

  • A massive embedding table (memory-hungry)
  • Most words are rare, so the model barely learns them
  • Any word not in the vocabulary is completely unknown (the "OOV" problem)
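To get a feel for the memory cost, the sketch below estimates the size of an embedding table. The hidden dimension of 4096 and 2 bytes per parameter (16-bit weights) are assumptions chosen for illustration, not figures for any specific model.

```python
def embedding_table_gib(vocab_size: int, hidden_dim: int = 4096,
                        bytes_per_param: int = 2) -> float:
    """Approximate embedding table size in GiB: one vector per vocabulary entry."""
    return vocab_size * hidden_dim * bytes_per_param / 2**30


for vocab in (200_000, 2_000_000):
    print(f"{vocab:>9,} entries -> {embedding_table_gib(vocab):.1f} GiB")
```

The table grows linearly with vocabulary size, so a word-level vocabulary of millions would need ten times or more the memory of a typical sub-word vocabulary.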

Tokens: the sweet spot#

Tokens split text into sub-word units — pieces that are larger than bytes but smaller than words:

```
"unbelievable" → ["un", "bel", "ievable"]    (3 tokens)
"tokenization" → ["token", "ization"]         (2 tokens)
"Hello"        → ["Hello"]                    (1 token, common enough to keep whole)
```

This gives you:

  • Short sequences (efficient processing)
  • Small vocabulary (~100K-200K entries, not millions)
  • No unknown words (any text can be broken into known sub-words)
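The "no unknown words" property can be sketched with a toy splitter. This is a greedy longest-match simplification, not how a real BPE tokenizer applies its learned merges, and `TOY_VOCAB` is an invented vocabulary. The key idea is the fallback: when no vocabulary entry matches, a single character is always accepted.

```python
def greedy_subword_split(word: str, vocab: set[str]) -> list[str]:
    """Split word into the longest known pieces, falling back to single characters."""
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest candidate first and shrink until something matches.
        for end in range(len(word), start, -1):
            piece = word[start:end]
            if piece in vocab or end == start + 1:
                pieces.append(piece)
                start = end
                break
    return pieces


TOY_VOCAB = {"un", "bel", "ievable", "token", "ization", "Hello"}

for word in ("unbelievable", "tokenization", "Hello", "xyz"):
    print(word, "->", greedy_subword_split(word, TOY_VOCAB))
```

The made-up word "xyz" is not in the vocabulary, yet it still splits cleanly into single characters instead of becoming an unknown token.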

How BPE Tokenization Actually Works#

Most modern LLMs use a variant of Byte Pair Encoding (BPE) to build their tokenizer. Here's how it works:

Step 1: Start with individual characters#

Take your training text and split everything into single characters:

```
"low" → ["l", "o", "w"]
"lower" → ["l", "o", "w", "e", "r"]
"newest" → ["n", "e", "w", "e", "s", "t"]
```

Step 2: Count the most frequent pair#

Look at all adjacent character pairs and find the most common one:

```
("l", "o") appears 2 times  ← most frequent (tied; take the first)
("o", "w") appears 2 times
("w", "e") appears 2 times
("e", "w") appears 1 time
...
```

Step 3: Merge the most frequent pair#

Replace all occurrences of ("l", "o") with a new token "lo":

```
"low" → ["lo", "w"]
"lower" → ["lo", "w", "e", "r"]
```

Step 4: Repeat thousands of times#

Keep counting and merging until you reach your target vocabulary size (typically 50K-200K tokens).

After many merges, common words like "the", "Hello", and "you" become single tokens, while rare words get split into sub-word pieces.
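The four steps can be sketched in a few lines of Python. This is a minimal character-level toy, not a production tokenizer: real implementations typically start from bytes, pre-split text, and weight pairs by word frequency. Breaking ties by taking the first pair seen is an assumption of this sketch, and the function names are made up.

```python
from collections import Counter


def count_pairs(words: list[list[str]]) -> Counter:
    """Step 2: count every adjacent symbol pair across all words."""
    pairs = Counter()
    for symbols in words:
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += 1
    return pairs


def merge_pair(symbols: list[str], pair: tuple[str, str]) -> list[str]:
    """Step 3: replace each occurrence of pair with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


def train_bpe(corpus: list[str], num_merges: int):
    """Steps 1-4: split into characters, then merge the top pair repeatedly."""
    words = [list(word) for word in corpus]          # Step 1
    merges = []
    for _ in range(num_merges):                      # Step 4
        pairs = count_pairs(words)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)             # ties: first pair seen wins
        merges.append(best)
        words = [merge_pair(symbols, best) for symbols in words]
    return merges, words


merges, words = train_bpe(["low", "lower", "newest"], num_merges=2)
print(merges)   # [('l', 'o'), ('lo', 'w')]
print(words)    # [['low'], ['low', 'e', 'r'], ['n', 'e', 'w', 'e', 's', 't']]
```

After only two merges, "low" is already a single token, while "newest" is still split into characters because none of its pairs have been merged yet.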

[Figure: step-by-step visualization of the BPE (byte pair encoding) tokenization process, showing character splitting, frequency counting, and merging]

See it yourself with tiktoken#

You can inspect exactly how GPT-5 tokenizes any text using OpenAI's tiktoken library:

```python
import tiktoken

# o200k_base is used by GPT-4o and GPT-5
enc = tiktoken.get_encoding("o200k_base")

text = "你好 Hello"
tokens = enc.encode(text)
token_strings = [enc.decode([t]) for t in tokens]

print(f"Text: {text}")
print(f"UTF-8 bytes: {len(text.encode('utf-8'))}")
print(f"Tokens ({len(tokens)}): {token_strings}")
print(f"Token IDs: {tokens}")
```

Tested output:#

```
Text: 你好 Hello
UTF-8 bytes: 12
Tokens (2): ['你好', ' Hello']
Token IDs: [177519, 32949]
```

12 bytes compressed into just 2 tokens. That's the power of BPE — common phrases get their own token, dramatically shortening the sequence.


Different Models, Different Tokenizers#

Each AI model family uses its own tokenizer with a different vocabulary. The same text can produce different token counts:

| Text | cl100k_base (GPT-4) | o200k_base (GPT-4o/5) | UTF-8 Bytes |
| --- | --- | --- | --- |
| Hello, how are you today? | 7 | 7 | 25 |
| Explain quantum computing in simple terms | 7 | 6 | 41 |
| 你好,请用中文解释一下什么是token | 15 | 9 | 47 |
| こんにちは、トークンとは何ですか? | 12 | 10 | 51 |
| Python fibonacci function (5 lines) | 28 | 28 | 92 |

Key takeaway: GPT-5's tokenizer (o200k_base) is noticeably more efficient for Chinese and Japanese. The Chinese sentence above is 15 tokens with cl100k_base (GPT-4) but only 9 with o200k_base (GPT-5). That's 40% fewer tokens, and a proportionally lower cost at the same per-token price, just from a better tokenizer.
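The percentages follow directly from the token counts in the table. A short sketch of the arithmetic, with a made-up helper name:

```python
def token_reduction_pct(old_tokens: int, new_tokens: int) -> float:
    """Percentage drop in token count when moving to a different tokenizer."""
    return (old_tokens - new_tokens) / old_tokens * 100


# (cl100k_base tokens, o200k_base tokens), copied from the table above
table = {
    "English greeting": (7, 7),
    "English instruction": (7, 6),
    "Chinese sentence": (15, 9),
    "Japanese sentence": (12, 10),
    "Python function": (28, 28),
}

for label, (old, new) in table.items():
    print(f"{label}: {token_reduction_pct(old, new):.1f}% fewer tokens")
```

The Chinese sentence shows the largest gain, the Japanese sentence a smaller one, and the English greeting and the code sample do not change at all.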

Claude uses its own tokenizer, and Gemini uses SentencePiece. Token counts will vary across providers — which means the same prompt can cost different amounts depending on which model you use.


How Token Counts Affect Your Bill#

AI APIs charge per token, not per byte or per word. And different languages have very different token efficiencies:

Token efficiency by language (o200k_base)#

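| Language | Text | Tokens | Bytes | Bytes/Token |
| --- | --- | --- | --- | --- |
| English | "Hello, how are you today?" | 7 | 25 | 3.6 |
| Chinese | "你好,今天怎么样?" | 5 | 27 | 5.4 |
| Japanese | "こんにちは" | 1 | 15 | 15.0 |
| Korean | "안녕하세요" | 2 | 15 | 7.5 |
| Code | def fibonacci(n): (5 lines) | 28 | 92 | 3.3 |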

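In this sample, the Japanese greeting has the highest bytes-per-token ratio: こんにちは packs 15 bytes into a single token. That reflects a very common phrase earning its own vocabulary entry, not Japanese text in general. The longer Japanese sentence in the earlier table works out to about 5 bytes per token (51 bytes, 10 tokens).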
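The Bytes/Token column is simply bytes divided by tokens, rounded to one decimal place. A minimal sketch using the figures from the table (the helper name is hypothetical):

```python
def bytes_per_token(byte_count: int, token_count: int) -> float:
    """Average number of UTF-8 bytes each token covers, to one decimal place."""
    return round(byte_count / token_count, 1)


# (UTF-8 bytes, o200k_base tokens), copied from the table above
samples = {
    "English": (25, 7),
    "Chinese": (27, 5),
    "Japanese": (15, 1),
    "Korean": (15, 2),
    "Code": (92, 28),
}

for language, (byte_count, token_count) in samples.items():
    print(f"{language}: {bytes_per_token(byte_count, token_count)} bytes/token")
```

A higher ratio means each token covers more stored data, but it says nothing about how much meaning a token carries.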

Real cost comparison#

Suppose you process 1 million characters of text through GPT-5 ($1.25/M input tokens):

| Language | ~Tokens per 1M chars | Cost |
| --- | --- | --- |
| English | ~330K | $0.41 |
| Chinese | ~500K | $0.63 |
| Mixed (EN+CN) | ~400K | $0.50 |

Chinese text costs roughly 50% more than English for the same character count — because each Chinese character takes more tokens on average.
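The costs in the table are token count times the per-million price. Here is a sketch of that calculation, assuming the $1.25 per million input tokens quoted above. The function name is made up, and `Decimal` is used so that $0.625 rounds up to $0.63 as in the table.

```python
from decimal import Decimal, ROUND_HALF_UP


def input_cost_usd(tokens: int, price_per_million: str) -> Decimal:
    """Cost in USD for a number of input tokens, rounded to whole cents."""
    cost = Decimal(tokens) * Decimal(price_per_million) / Decimal(1_000_000)
    return cost.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


# Approximate token counts per 1M characters, copied from the table above
estimates = {"English": 330_000, "Chinese": 500_000, "Mixed (EN+CN)": 400_000}

for language, tokens in estimates.items():
    print(f"{language}: ${input_cost_usd(tokens, '1.25')}")
```

Swapping in a different price string lets you rerun the same comparison for another model.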

Optimization tip: compare models before committing#

Different models have different tokenizers, different prices, and different capabilities. The cheapest option for your use case might surprise you.

With a unified API gateway, you can test the same prompt across multiple models and compare both quality and cost:

```python
from openai import OpenAI

client = OpenAI(
    api_key="your-crazyrouter-key",
    base_url="https://crazyrouter.com/v1"
)

# Test the same prompt across models
for model in ["gpt-5", "gpt-5-mini", "deepseek-v3.2", "claude-sonnet-4"]:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "Explain tokenization in 2 sentences"}],
        max_tokens=100
    )
    usage = response.usage
    print(f"{model}: {usage.prompt_tokens} in / {usage.completion_tokens} out")
```

With Crazyrouter, one API key gives you access to 627+ models — making it easy to find the most cost-effective model for your specific language and workload.


The Future: Will Byte-Level Models Replace Tokens?#

In late 2024, Meta released research on the Byte Latent Transformer (BLT) — a model architecture that processes raw bytes directly, eliminating the tokenizer entirely.

Why byte-level models are interesting:#

  • Truly language-agnostic — no tokenizer bias toward English
  • No out-of-vocabulary problem — any byte sequence is valid input
  • No tokenizer-model mismatch — the model sees exactly what you typed

Why tokens still dominate in 2026:#

  • Byte sequences are 3-4x longer — much more expensive to process
  • All production models (GPT-5, Claude, Gemini) still use token-based architectures
  • BLT is research-stage — not yet available as a commercial API

The practical reality: tokens look set to remain the standard for at least the next 3-5 years. Understanding how they work is a necessary skill for any developer using AI APIs.


Quick Reference: Tokens vs Bytes vs Characters vs Words#

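| Property | Bytes | Characters | Words | Tokens |
| --- | --- | --- | --- | --- |
| What is it | Raw storage unit | Human-readable symbol | Space-separated text | Sub-word unit for AI |
| Size | Always 1 byte | 1-4 bytes (UTF-8) | Variable | Variable |
| "Hello" | 5 | 5 | 1 | 1 |
| "你好" | 6 | 2 | 1 | 1 |
| "unbelievable" | 12 | 12 | 1 | 3 |
| Used by | Computers (storage) | Humans (reading) | Search engines | AI models |
| AI billing? | No | No | No | Yes |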

Key relationships:#

  • 1 English token ≈ 4 characters ≈ 4 bytes ≈ 0.75 words
  • 1 Chinese token ≈ 1-3 characters ≈ 3-9 bytes ≈ 1-2 words
  • 1,000 tokens ≈ 750 English words ≈ 4,000 characters
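These rules of thumb are enough for a rough budget estimate before you run a real tokenizer. The sketch below applies the English ratios only. The function names are invented, and the results are approximations, not billing figures.

```python
import math


def estimate_english_tokens(char_count: int) -> int:
    """Rough token estimate for English text: about 4 characters per token."""
    return math.ceil(char_count / 4)


def estimate_english_words(token_count: int) -> int:
    """Rough word estimate for English text: about 0.75 words per token."""
    return round(token_count * 0.75)


tokens = estimate_english_tokens(4_000)
print(f"4,000 characters ≈ {tokens} tokens ≈ {estimate_english_words(tokens)} words")
```

For exact counts, run the text through the tokenizer of the model you plan to use.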


FAQ#

Are tokens the same as bytes?#

No. Bytes are the raw storage units of your computer (1 byte = 8 bits). Tokens are the processed units that AI models read. The text "Hello" is 5 bytes but 1 token. The text "你好" is 6 bytes but also 1 token. The relationship between bytes and tokens depends on the language and the specific tokenizer.

Why don't AI models just process raw bytes?#

Efficiency. "Hello" as bytes is a sequence of length 5; as a token it's length 1. For a 1,000-word article, byte-level processing would create sequences 3-4x longer. Since the computational cost of transformer attention is O(n²), this would make AI models dramatically more expensive and slower.

Does the same text use the same number of tokens across all models?#

No. GPT-5 uses the o200k_base tokenizer, GPT-4 uses cl100k_base, and Claude uses its own. A Chinese sentence might be 15 tokens on GPT-4 but only 9 tokens on GPT-5. This means the same prompt can cost different amounts on different models.

Why does Chinese text cost more than English in AI APIs?#

Two reasons. First, Chinese characters require 3 bytes each in UTF-8 (vs 1 byte for English letters). Second, even with modern tokenizers, Chinese characters are less efficiently compressed into tokens. By the estimate above, Chinese text uses about 50% more tokens than English text with the same number of characters.

How can I reduce my token costs?#

Five practical approaches: (1) Write concise prompts — every extra word costs tokens. (2) Use max_tokens to cap output length. (3) Pick the cheapest model that meets your quality bar. (4) Cache repeated queries. (5) Use an API gateway like Crazyrouter to easily compare pricing across 627+ models with a single API key.
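As an illustration of approach (4), here is a minimal in-memory cache sketch. `PromptCache` and `fake_backend` are hypothetical names. In practice the backend would be a function that calls your API client, and the cache would need an eviction or expiry policy.

```python
from typing import Callable


class PromptCache:
    """Remember responses so identical (model, prompt) pairs are paid for once."""

    def __init__(self, backend: Callable[[str, str], str]):
        self._backend = backend
        self._store: dict[tuple[str, str], str] = {}
        self.misses = 0  # number of times the backend was actually called

    def ask(self, model: str, prompt: str) -> str:
        key = (model, prompt)
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._backend(model, prompt)
        return self._store[key]


def fake_backend(model: str, prompt: str) -> str:
    """Stand-in for a real API call; returns a canned string."""
    return f"[{model}] reply to: {prompt}"


cache = PromptCache(fake_backend)
cache.ask("gpt-5", "Explain tokenization in 2 sentences")
cache.ask("gpt-5", "Explain tokenization in 2 sentences")  # served from cache
print("backend calls:", cache.misses)
```

Exact-match caching only helps when prompts repeat verbatim, so it suits FAQ-style workloads better than free-form chat.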



Understanding the difference between bytes and tokens is the first step to controlling your AI costs. For more developer guides and the latest model pricing data, visit the Crazyrouter Blog.
