Encoding 2026-06-07 9 min read

Unicode, Properly Explained

Unicode 1.0 shipped in 1991 with a 16-bit code space — 65,536 characters, 'more than enough for every script in current use,' the manual said. By 1996 it was clear that promise was wrong, and the entire ecosystem has been digging out of the workarounds ever since. Almost every weird Unicode bug you've ever encountered is downstream of that single broken assumption.

Tags: Unicode, UTF-8, UTF-16, Code Points, Normalization, Grapheme Clusters

The promise that broke

The original Unicode spec gave every character a fixed 16-bit code. Java was designed around it. So was Windows NT. So was JavaScript. The whole 1990s wave of "Unicode-aware" languages and operating systems baked in char = 16 bits as a fundamental assumption.

The Unicode 2.0 spec in 1996 admitted 65,536 wasn't enough. The fix was to expand the code space to roughly 1.1 million code points organized into 17 "planes," and to bolt on a backwards-compatibility scheme called UTF-16 that lets the new larger code points be represented as pairs of the old 16-bit units. We'll get to that. Just understand: every UTF-16 weirdness you've hit is the cost of that one decision.

The vocabulary

A few terms that people use loosely and get punished for:

Code point. A number in the Unicode code space, written U+xxxx. U+0041 is "A". U+1F600 is "😀". The character é has the code point U+00E9. Code points are abstract — they don't have a byte representation until you pick an encoding form.
Encoding form. A rule for turning code points into fixed-width code units, which are then written out as bytes. Unicode defines three: UTF-8 (1–4 bytes per code point), UTF-16 (2 or 4 bytes), UTF-32 (always 4).
Plane. A block of 65,536 code points. Plane 0 is the Basic Multilingual Plane (BMP), U+0000 to U+FFFF, the original 16-bit world. Planes 1–16 are the "supplementary planes," and that's where most emoji, the CJK extensions from B onward, and most historic scripts live.
Grapheme cluster. What a human reader thinks of as "one character." Often one code point, but not always. é written as e + combining acute (U+0065 U+0301) is one grapheme cluster, two code points.

Mixing these up — saying "character" when you mean "code point" or "byte" — is responsible for at least half of the bugs in this space.
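To make the distinction concrete, here is a minimal Python sketch. The helper name `describe` is made up for illustration; it reports the code points in a string and how many bytes each encoding form needs for it.

```python
def describe(text: str) -> dict:
    """Report code points and encoded sizes for a string (illustrative helper)."""
    return {
        "code_points": [f"U+{ord(ch):04X}" for ch in text],
        # The "-le" codec variants are used so that no byte order mark is prepended.
        "utf8_bytes": len(text.encode("utf-8")),
        "utf16_bytes": len(text.encode("utf-16-le")),
        "utf32_bytes": len(text.encode("utf-32-le")),
    }


print(describe("A"))           # one code point, 1 / 2 / 4 bytes
print(describe("\u00e9"))      # precomposed é
print(describe("e\u0301"))     # e + combining acute: same grapheme, two code points
print(describe("\U0001F600"))  # 😀, a supplementary-plane code point
```

The two spellings of é are one grapheme cluster each, but they differ in both code point count and byte count.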

Why UTF-8 won the web

UTF-8 (Ken Thompson and Rob Pike, 1992) has properties that read like a wishlist:

ASCII-compatible. Every ASCII byte is itself in UTF-8. A pure-ASCII file is also a valid UTF-8 file, no conversion needed. This single property preserved backwards compatibility with decades of ASCII-based UNIX tooling.
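Self-synchronizing. If you start reading bytes in the middle of a UTF-8 stream, you can find the next code point boundary by walking forward at most 3 bytes, because continuation bytes (10xxxxxx) can never be mistaken for lead bytes. UTF-16 has the same property at the code-unit level, since high and low surrogates occupy disjoint ranges, but only if you already know where the 16-bit units start; UTF-32 doesn't need it.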
No byte-order ambiguity. A UTF-8 byte sequence reads the same on big-endian and little-endian machines. UTF-16 and UTF-32 don't, which is why they have a Byte Order Mark.
Compact for Latin scripts. English text is 1 byte per character in UTF-8; 2 in UTF-16; 4 in UTF-32. For predominantly-CJK text the comparison flips — UTF-8 uses 3 bytes for most CJK characters, UTF-16 uses 2.
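The self-synchronizing property can be sketched in a few lines of Python. The function name `next_boundary` is hypothetical, and the sketch assumes the input is well-formed UTF-8; it just skips continuation bytes, which always have the bit pattern 10xxxxxx.

```python
def next_boundary(data: bytes, pos: int) -> int:
    """Return the index of the first code point boundary at or after pos.

    Assumes data is well-formed UTF-8.
    """
    # Continuation bytes look like 0b10xxxxxx; ASCII and lead bytes never do.
    while pos < len(data) and (data[pos] & 0xC0) == 0x80:
        pos += 1
    return pos


# "a" is 1 byte, "€" is 3 bytes (indices 1-3), "😀" is 4 bytes (indices 4-7).
data = "a\u20ac\U0001F600".encode("utf-8")

print(next_boundary(data, 2))  # lands inside "€", walks forward to the start of "😀"
print(next_boundary(data, 1))  # already on a lead byte, so nothing to skip
```

Because a code point has at most three continuation bytes, the loop never advances more than three positions on valid input.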

UTF-8 isn't ideal for everything. CJK-heavy files are larger in UTF-8 than UTF-16. Random-access by code point index is O(n) because code point boundaries are variable-width. But for the dominant use case — Latin-leaning text moving over networks — UTF-8 won decisively. The web, JSON, and most modern protocols mandate UTF-8 by default.
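The size trade-off is easy to check. The following sketch uses two arbitrary sample strings (not taken from any benchmark) and a hypothetical helper `sizes` to compare the three encoding forms.

```python
def sizes(text: str) -> dict:
    """Byte length of text in each encoding form, without a byte order mark."""
    return {enc: len(text.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}


english = "Unicode"               # 7 ASCII letters
japanese = "\u65e5\u672c\u8a9e"   # 日本語, 3 BMP code points

print(sizes(english))   # UTF-8 is the smallest here
print(sizes(japanese))  # UTF-16 is the smallest here
```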

Surrogate pairs: the 1996 hack

UTF-16 has to encode 1.1M code points using 16-bit units. The fix: reserve U+D800–U+DFFF as "surrogates" — code points that exist only to be paired up. A high surrogate (U+D800–U+DBFF) followed by a low surrogate (U+DC00–U+DFFF) together encode one code point in the supplementary planes.
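The arithmetic behind a surrogate pair is short enough to write out. This is a sketch with made-up function names: subtract 0x10000 to get a 20-bit offset, put the top 10 bits in the high surrogate and the bottom 10 bits in the low surrogate.

```python
from typing import Tuple


def to_surrogates(cp: int) -> Tuple[int, int]:
    """Split a supplementary-plane code point into (high, low) surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = cp - 0x10000              # 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low


def from_surrogates(high: int, low: int) -> int:
    """Recombine a (high, low) surrogate pair into one code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogates(0x1F600)     # 😀
print(f"U+{high:04X} U+{low:04X}")
```

The result can be cross-checked against Python's own UTF-16 encoder, which produces the same two code units for U+1F600.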

This works, but it leaves scars:

Surrogate code points are themselves invalid Unicode characters. A "lone surrogate" — one half of a pair, unpaired — is illegal in well-formed Unicode but technically representable in UTF-16. Many string libraries silently allow them; most network protocols reject them. JSON happens to allow lone surrogates in \uXXXX escapes, which is the origin of half the cross-system Unicode round-trip bugs.
JavaScript and Java strings are sequences of UTF-16 code units, not code points. "😀".length returns 2 in JavaScript because the emoji is a surrogate pair. "😀".charAt(0) returns half of the emoji.
UTF-8 and UTF-32 have no concept of surrogates. They're a UTF-16-only artifact. If you serialize a JS string containing a lone surrogate to UTF-8, the standards say you should error or substitute the replacement character U+FFFD; in practice many tools produce invalid UTF-8 instead.
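Python happens to make the lone-surrogate problem easy to reproduce, because its `json` module accepts an unpaired `\uXXXX` escape and its strict UTF-8 encoder refuses the result. The helper `encode_utf8_lossy` below is a hypothetical name and shows one possible policy (substitute U+FFFD), not the only reasonable one.

```python
import json

# The JSON text contains a single high-surrogate escape with no partner.
lone = json.loads('"\\ud83d"')


def encode_utf8_lossy(text: str) -> bytes:
    """Encode to UTF-8, substituting U+FFFD for any surrogate code point."""
    cleaned = "".join(
        "\ufffd" if 0xD800 <= ord(ch) <= 0xDFFF else ch for ch in text
    )
    return cleaned.encode("utf-8")


try:
    lone.encode("utf-8")               # the strict encoder rejects surrogates
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# "surrogatepass" writes the surrogate out anyway; the bytes are not valid UTF-8.
invalid_bytes = lone.encode("utf-8", "surrogatepass")

print(strict_ok, invalid_bytes, encode_utf8_lossy("a" + lone))
```

The `surrogatepass` output is the kind of invalid UTF-8 that some tools emit in practice; a strict decoder on the receiving side will reject it.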

The BOM debacle

A Byte Order Mark is the code point U+FEFF written at the start of a file to signal endianness. UTF-16 needs it. UTF-32 needs it. UTF-8 doesn't — there's no byte order in 1-byte units.

Microsoft put one at the start of UTF-8 files anyway, originally so Notepad could distinguish UTF-8 from local code pages. The Unicode standard tolerates this but does not recommend it. Many Unix tools and plenty of parsers don't expect the three-byte UTF-8 BOM (EF BB BF) and treat it as part of the file content. This is why a CSV exported from Excel sometimes has a mysterious invisible character at the start of the first column header.

If you're writing tooling: don't emit UTF-8 BOMs. If you're reading them: strip them defensively.
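In Python, defensive stripping is a one-liner, since the standard `utf-8-sig` codec drops a leading BOM if one is present and otherwise behaves like plain UTF-8. The wrapper name `decode_utf8_strip_bom` is only for illustration.

```python
import codecs


def decode_utf8_strip_bom(data: bytes) -> str:
    """Decode UTF-8 bytes, discarding a leading EF BB BF if present."""
    return data.decode("utf-8-sig")


with_bom = codecs.BOM_UTF8 + b"name,age"
without_bom = b"name,age"

print(repr(with_bom.decode("utf-8")))          # plain UTF-8 keeps U+FEFF in the text
print(repr(decode_utf8_strip_bom(with_bom)))   # BOM removed
print(repr(decode_utf8_strip_bom(without_bom)))
```

On the writing side, Python's plain `utf-8` codec never emits a BOM, so encoding with it already follows the rule above.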

Normalization

There are often multiple ways to spell the same human-perceived character.

é as one code point: U+00E9 (Latin small letter e with acute) — "precomposed"
é as two code points: U+0065 + U+0301 (lowercase e + combining acute) — "decomposed"

These are canonically equivalent, meaning they should display identically and should compare equal in any sane string comparison. They aren't byte-equal, so naive string comparison says they're different. This is why a name that has passed through a Mac filesystem (HFS+ stores filenames in decomposed form) sometimes fails to match the same name typed on Windows (where keyboard input is typically precomposed).
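The mismatch is easy to reproduce with Python's standard `unicodedata` module. The comparison helper `same_text` is a hypothetical name; it brings both sides to one canonical form (NFC here) before comparing.

```python
import unicodedata

precomposed = "\u00e9"   # é as a single code point
decomposed = "e\u0301"   # e followed by a combining acute accent


def same_text(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(precomposed == decomposed)           # naive comparison
print(same_text(precomposed, decomposed))  # normalized comparison
```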

The fix is normalization, which rewrites strings into a canonical form before comparison. The four forms:

NFC — Canonical Composition. Combines code points where possible. The most common choice.
NFD — Canonical Decomposition. Splits into base + combiners.
NFKC — Compatibility Composition. NFC plus aggressive replacements (full-width digits become regular digits, ligatures expand).
NFKD — Compatibility Decomposition. The kitchen sink.

Pick NFC for storage and comparison unless you have a specific reason. NFKC is appropriate for search ("ﬃ" should match "ffi") but loses information you can't recover. NFKD and NFD are usually intermediate forms in algorithms, not what you store.
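To see how the four forms differ, the sketch below applies each of them to a few sample strings using Python's `unicodedata.normalize`. The helper name `all_forms` and the choice of samples are illustrative only.

```python
import unicodedata


def all_forms(text: str) -> dict:
    """Return text normalized to each of the four Unicode normalization forms."""
    return {form: unicodedata.normalize(form, text)
            for form in ("NFC", "NFD", "NFKC", "NFKD")}


accented = all_forms("\u00e9")         # é: canonical forms differ only in composition
ligature = all_forms("\ufb03")         # ﬃ ligature: only the K forms expand it
fullwidth = all_forms("\uff11\uff12")  # full-width "12": only the K forms fold it

for name, forms in (("é", accented), ("ﬃ", ligature), ("１２", fullwidth)):
    print(name, {form: [f"U+{ord(c):04X}" for c in s] for form, s in forms.items()})
```

Once the ligature has been folded to "ffi" there is no way to tell it was ever a ligature, which is the information loss the text warns about.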

Grapheme clusters and the length lie

"hello".length is 5 in every language. Reasonable.

"é".length (precomposed) is 1 in JavaScript, but if the same é is typed as decomposed e + ◌́, the length is 2.

"😀".length is 2 in JavaScript (UTF-16 surrogate pair), 1 in Python (which is code-point-indexed), 4 in Go (which exposes byte length). All three are technically correct given each language's definition of "length."

"👨‍👩‍👧‍👦".length is 11 in JavaScript and 7 in Python. The "family" emoji is four people emoji joined by three Zero-Width Joiners (U+200D). Each person is a surrogate pair in UTF-16. The user sees one family.

What the user almost always means by "length" is "number of grapheme clusters," and that's what most languages don't expose by default. JavaScript needs Intl.Segmenter (ES2022). Swift exposes it natively as String.count. Python needs the regex library or grapheme. If your text-truncation logic treats characters as code units or code points, you will eventually mid-split an emoji and produce a tofu box ▯.
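The competing definitions of "length" quoted above can all be reproduced from Python with the standard library alone. This sketch uses a made-up helper, `lengths`; the UTF-16 code unit count corresponds to JavaScript's `.length` and the UTF-8 byte count to Go's `len`. It deliberately does not count grapheme clusters, since that needs a segmentation library.

```python
def lengths(text: str) -> dict:
    """Count a string three ways: code points, UTF-16 code units, UTF-8 bytes."""
    return {
        "code_points": len(text),                                # Python's len
        "utf16_code_units": len(text.encode("utf-16-le")) // 2,  # JavaScript's .length
        "utf8_bytes": len(text.encode("utf-8")),                 # Go's len
    }


grinning = "\U0001F600"
# Man, woman, girl, boy joined by three zero-width joiners (U+200D).
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466"

print(lengths(grinning))
print(lengths(family))
```

None of the three numbers for the family emoji is 1, which is the count a reader would give.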

ZWJ and modifier sequences

Modern emoji aren't single code points. They're sequences:

Skin tone: 👋🏼 is U+1F44B (waving hand) + U+1F3FC (medium-light skin tone modifier).
ZWJ sequences: 👨‍🍳 is U+1F468 (man) + U+200D (ZWJ) + U+1F373 (cooking).
Family: 👨‍👩‍👦 is three people glued with ZWJs.
Flag: 🇯🇵 is two regional indicator symbols, U+1F1EF + U+1F1F5 (J + P).

The renderer is supposed to display these as single glyphs if it has the right font. If it doesn't, you see the components — which is why a flag sometimes renders as two letters in a colored box on systems with older emoji fonts.
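Listing the code points of a sequence is a quick way to see what the renderer is being asked to combine. In the sketch below, `code_points` and `flag` are hypothetical helpers; `flag` assumes a two-letter ASCII country code and maps each letter onto the regional indicator block, which starts at U+1F1E6 for "A".

```python
def code_points(text: str) -> list:
    """List the code points in text as U+XXXX strings."""
    return [f"U+{ord(ch):04X}" for ch in text]


def flag(country_code: str) -> str:
    """Build a flag sequence from a two-letter ASCII country code."""
    base = 0x1F1E6  # REGIONAL INDICATOR SYMBOL LETTER A
    return "".join(chr(base + ord(c) - ord("A")) for c in country_code.upper())


wave = "\U0001F44B\U0001F3FC"        # waving hand + medium-light skin tone
cook = "\U0001F468\u200D\U0001F373"  # man + ZWJ + cooking

print(code_points(wave))
print(code_points(cook))
print(code_points(flag("JP")))
```

Whether the output of `flag` shows up as a flag or as two boxed letters depends entirely on the font, as described above.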

Practical rules

For storage and transport: UTF-8, no BOM.
For comparison: NFC-normalize first. Always. Even within ostensibly homogeneous data.
Don't trust string.length. If you need a user-perceived character count, use a grapheme segmenter.
Don't index into strings by integer in user-facing code. You will eventually mid-split a surrogate pair, a combining sequence, or a ZWJ sequence.
Treat lone surrogates as data corruption unless you have a documented reason to keep them.
If a string round-trips correctly through UTF-8 but breaks on a system that uses UTF-16, suspect normalization or surrogate handling.

Inspect any character

The Unicode tool on this site shows code points, UTF-8 / UTF-16 byte sequences, and grapheme cluster boundaries for any string. Useful for the 'why is my emoji length 7' moments.

Open the Unicode inspector
