URL Encoding, Properly Explained
Tim Berners-Lee added percent-encoding to URLs in 1991 because the only existing precedent for escaping characters in plain text was shell quoting, and that was a mess. The result has gone through six RFC revisions and is still a leading source of subtle bugs — because two slightly different variants are now both 'standard,' depending on whether you're building a URL or submitting an HTML form.
What it actually is
Percent-encoding is a way to put any byte into a URL as a printable ASCII triple: % followed by two hex digits. %20 is byte 0x20, the ASCII space. %E4%BD%A0 is three bytes (0xE4 0xBD 0xA0) which together happen to be UTF-8 for "你". The encoding doesn't know or care; it operates on bytes.
Which means percent-encoding by itself doesn't specify how text becomes bytes. RFC 3986 (the current URI spec, from 2005) says use UTF-8 unless a scheme says otherwise. Most modern URL handlers do this. Older ones sometimes don't, which is why a Latin-1 query parameter that scans fine in IE6 becomes garbled in Chrome — the bytes are encoded by one charset and decoded as another. The percent-encoding round-trips perfectly; the interpretation of the resulting bytes is where the bug lives.
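To make the two steps concrete, here is a minimal sketch that separates "text to bytes" from "bytes to percent-triples". The helpers percentEncodeBytes and tryDecode are hypothetical names invented for this example, not built-ins, and the Latin-1 byte for "é" is written out by hand.

```javascript
// Hypothetical helper: map each byte to "%" plus two uppercase hex digits.
function percentEncodeBytes(bytes) {
  return Array.from(bytes, (b) =>
    "%" + b.toString(16).toUpperCase().padStart(2, "0")
  ).join("");
}

// Text -> bytes is the charset decision. TextEncoder always produces UTF-8.
const utf8 = new TextEncoder();
console.log(percentEncodeBytes(utf8.encode(" ")));      // %20
console.log(percentEncodeBytes(utf8.encode("\u4f60"))); // %E4%BD%A0 (the character 你)

// The same character, U+00E9 (é), as bytes under two different charsets.
const asUtf8 = percentEncodeBytes(utf8.encode("\u00e9"));  // two bytes in UTF-8
const asLatin1 = percentEncodeBytes(Uint8Array.of(0xe9));  // one byte in Latin-1
console.log(asUtf8);   // %C3%A9
console.log(asLatin1); // %E9

// Hypothetical helper: decodeURIComponent assumes UTF-8 and throws a
// URIError when the decoded bytes are not valid UTF-8.
function tryDecode(encoded) {
  try {
    return decodeURIComponent(encoded);
  } catch (err) {
    if (err instanceof URIError) return null;
    throw err;
  }
}

console.log(tryDecode(asUtf8));   // é
console.log(tryDecode(asLatin1)); // null: the bytes are fine, the charset guess is not
```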
Reserved vs unreserved
The RFC 3986 alphabet splits characters into two camps:
Unreserved. Letters, digits, and the four characters -, ., _, ~. These never need encoding. An encoder that emits %41 instead of A is technically wrong: RFC 3986 §2.3 says producers should not percent-encode unreserved characters, and normalizers should decode them when they find them.
Reserved. Characters with structural meaning in a URL. Two further sub-camps:
- gen-delims: :, /, ?, #, [, ], @. These delimit the major URL components.
- sub-delims: !, $, &, ', (, ), *, +, the comma, ;, and =. These have meaning inside specific components.
Whether a reserved character needs encoding depends on where it sits. A ? in a path segment must be encoded; otherwise the parser thinks the query string starts there. Inside a query string, the first ? has already done its job as the delimiter, so later occurrences are accepted as data (RFC 3986 allows both ? and / in the query component). The rules are scope-dependent, which is the part nobody internalizes.
This is why JavaScript ships two functions:
- encodeURI() assumes you're encoding an entire URL, so it leaves reserved characters alone.
- encodeURIComponent() assumes you're encoding a single component (one path segment, one query value), so it encodes almost everything reserved.
You almost always want encodeURIComponent. The only legitimate use of encodeURI is escaping spaces in a URL you've otherwise hand-built, which is rare and usually a sign you should be using a URL builder library instead.
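As a rough illustration of the difference, the sketch below encodes the same value with both functions and then parses the resulting URLs. The host example.com and the sample value are placeholders chosen for this example.

```javascript
// A query value that contains reserved characters as plain data.
const value = "cats & dogs?";

// encodeURI treats its input as a whole URL and leaves & and ? alone.
const viaEncodeURI = encodeURI(value);
// encodeURIComponent treats its input as one component and escapes them.
const viaComponent = encodeURIComponent(value);

console.log(viaEncodeURI); // cats%20&%20dogs?
console.log(viaComponent); // cats%20%26%20dogs%3F

// Parse both results the way a server would (example.com is a placeholder).
const broken = new URL("https://example.com/search?q=" + viaEncodeURI);
const correct = new URL("https://example.com/search?q=" + viaComponent);

// In the broken URL the & splits the value into two parameters.
console.log(JSON.stringify(broken.searchParams.get("q")));  // "cats "
console.log(JSON.stringify(correct.searchParams.get("q"))); // "cats & dogs?"
```

Note that encodeURIComponent still leaves !, ', (, ), and * unescaped, which is usually harmless but worth knowing when comparing output against other encoders.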
The form-data variant
When HTML forms submit, browsers don't use RFC 3986 percent-encoding. They use a variant called application/x-www-form-urlencoded, which HTML form submission relies on and which is now specified by the WHATWG URL Standard. The differences:
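- Spaces become + instead of %20.
- The character set used to convert text to bytes is the form's accept-charset, often UTF-8 but not guaranteed.
- More characters are encoded. Everything except ASCII letters, digits, *, -, ., and _ is escaped, so ~, !, ', (, and ) are escaped even though encodeURIComponent leaves them alone.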
This is why a query string from an HTML form looks like ?q=hello+world and one built by JS using encodeURIComponent looks like ?q=hello%20world. Both are legal, both decode to "hello world" — but only because virtually every server-side parser knows to handle both.
It's also why + in a URL is ambiguous. Inside a query string, it usually means space. Inside a path, it usually doesn't. If you have a literal + you want to preserve in a query value, you must encode it as %2B. This is the source of the perennial "phone numbers in query strings" bug: if your decoder follows form-encoded rules, +1-555-0100 comes back with the plus sign turned into a leading space, which most code then trims to 1-555-0100.
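The two decoding behaviors can be seen side by side with built-ins. This is a small sketch: URLSearchParams follows the form-encoded rules, while decodeURIComponent only undoes percent-escapes. The parameter names q and phone are made up for the example.

```javascript
// A query string as an HTML form might produce it, with an unescaped +
// in the phone value (the bug described above).
const raw = "q=hello+world&phone=+1-555-0100";

// URLSearchParams applies form-encoded rules: + means space.
const form = new URLSearchParams(raw);
console.log(JSON.stringify(form.get("q")));     // "hello world"
console.log(JSON.stringify(form.get("phone"))); // " 1-555-0100"

// decodeURIComponent only handles %XX escapes, so + survives.
console.log(decodeURIComponent("hello+world")); // hello+world

// Encoding the literal + as %2B preserves it under both sets of rules.
const safePhone = encodeURIComponent("+1-555-0100");
console.log(safePhone); // %2B1-555-0100
const roundTrip = new URLSearchParams("phone=" + safePhone).get("phone");
console.log(roundTrip); // +1-555-0100

// Serializing with URLSearchParams produces the form-encoded flavor.
const serialized = new URLSearchParams({ q: "hello world", phone: "+1-555-0100" }).toString();
console.log(serialized); // q=hello+world&phone=%2B1-555-0100
```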
Double encoding
The single most common bug in this space.
You take ?q=hello world, encode it to ?q=hello%20world, then encode the whole URL again somewhere downstream — %20 becomes %2520, because the % itself got percent-encoded as %25. Your server now sees the literal string hello%20world as the query value, including the percent sign, instead of hello world.
This happens whenever someone encodes once before passing to a library that encodes again. Or when a frontend percent-encodes for display, then a backend percent-encodes for redirection. The signature is %25 showing up where it shouldn't. The fix is figuring out which layer should own encoding and removing it from everywhere else — usually the layer closest to the wire wins, and everything upstream should hand it raw strings.
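The failure is easy to reproduce in a few lines. The sketch below simulates two layers that each encode the same value, then shows what a single decode on the receiving side recovers.

```javascript
const original = "hello world";

// Layer 1 encodes the value, as it should.
const once = encodeURIComponent(original);
console.log(once); // hello%20world

// Layer 2 receives the already-encoded string and encodes it again.
const twice = encodeURIComponent(once);
console.log(twice); // hello%2520world

// The server decodes once and gets the escape sequence as literal text.
console.log(decodeURIComponent(twice)); // hello%20world

// Only a second decode recovers the original, which is a workaround, not a fix.
console.log(decodeURIComponent(decodeURIComponent(twice))); // hello world
```

Decoding twice on the server hides the problem only until a value legitimately contains a percent sign, so removing the extra encode is the safer repair.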
Internationalized domain names are different
A URL like https://例.jp/ is not percent-encoded in the host portion. Hosts use Punycode (RFC 3492), which encodes Unicode as xn-- ASCII strings: 例.jp becomes xn--fsq.jp. This is a totally separate mechanism from percent-encoding and applies only to the host. Path and query stay percent-encoded.
If you're trying to "URL-encode" a domain name and it isn't working, that's why: domain names need IDNA processing, not percent-encoding. Conflating the two will silently produce URLs that resolve in your test browser and fail to resolve for someone else.
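The WHATWG URL parser built into browsers and Node.js applies both mechanisms in the right places, which makes the split easy to observe. A short sketch, assuming a runtime with the standard URL class; the path /café is invented for the example.

```javascript
// One URL with non-ASCII text in both the host and the path.
const url = new URL("https://\u4f8b.jp/caf\u00e9"); // https://例.jp/café

// The host goes through IDNA and comes out as Punycode.
console.log(url.hostname); // xn--fsq.jp

// The path is percent-encoded as UTF-8 bytes.
console.log(url.pathname); // /caf%C3%A9

// Percent-encoding the host by hand produces a different string entirely;
// it is not the ASCII form that DNS expects.
const wrongHost = encodeURIComponent("\u4f8b.jp");
console.log(wrongHost); // %E4%BE%8B.jp
```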
Common pitfalls
- Using encodeURI where you needed encodeURIComponent. The result is a URL where reserved characters in your data (an & in a search query, say) are interpreted as URL structure.
- Decoding a form-encoded payload with a strict RFC 3986 decoder. The + characters survive as + instead of becoming spaces.
- Encoding twice. Look for %25 in your inputs.
- Building URLs by concatenating strings instead of using a URL builder or query-string library. The library knows the rules; your + '&q=' + value does not.
- Leaving a # unencoded in a query value and forgetting that the fragment delimiter has higher precedence than the query delimiter. Many parsers strip everything from # onward before parsing the query, so a literal # in a query value must be %23.
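The concatenation and # pitfalls can be shown together. This sketch compares a hand-concatenated URL with one assembled through the standard URL and URLSearchParams classes; example.com and the parameter names are placeholders.

```javascript
const query = "C# & F#";

// Naive concatenation: the raw # starts the fragment and & splits parameters.
const naive = new URL("https://example.com/search?q=" + query);
console.log(naive.searchParams.get("q")); // C
console.log(naive.hash.length > 0);       // true: the rest leaked into the fragment

// Builder approach: searchParams.set escapes the value for us.
const built = new URL("https://example.com/search");
built.searchParams.set("q", query);
built.searchParams.set("page", "2");
console.log(built.toString());
// https://example.com/search?q=C%23+%26+F%23&page=2

// The value survives a round trip through the parser.
console.log(new URL(built.toString()).searchParams.get("q")); // C# & F#
```

The builder emits the form-encoded flavor (spaces as +), which is fine for query strings but is one more reason not to reuse its output for path segments.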
Practical rules
- Use encodeURIComponent for query parameters and path segments. Never encodeURI. Almost never raw concatenation.
- If a + ends up where you didn't expect, your URL is being treated as form-encoded by something downstream.
- Encode once. The layer closest to the wire owns it.
- Hosts use Punycode, not percent-encoding. Two distinct mechanisms.
- Spaces should be %20 in paths. + is acceptable only in query strings, and only because of HTML-form heritage.
- Don't reach for percent-encoding to escape data inside JSON or HTML. Those have their own escape mechanisms; reusing percent-encoding silently bakes character-set assumptions into your data.
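For the path-segment rule in particular, a brief sketch: the file name below is an invented example containing a slash, a space, and a plus, all of which need escaping to stay inside a single segment. The /files/ route on example.com is likewise a placeholder.

```javascript
// A single path segment whose data contains /, a space, and +.
const fileName = "a/b c+d.txt";

// encodeURIComponent escapes all three, so the name stays one segment.
const segment = encodeURIComponent(fileName);
console.log(segment); // a%2Fb%20c%2Bd.txt

// The URL parser keeps the escapes as written; it does not split on %2F.
const fileUrl = new URL("https://example.com/files/" + segment);
console.log(fileUrl.pathname); // /files/a%2Fb%20c%2Bd.txt

// Decoding the last segment on the receiving side recovers the name.
const lastSegment = fileUrl.pathname.split("/").pop();
console.log(decodeURIComponent(lastSegment)); // a/b c+d.txt
```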
Try both flavors in the browser
The URL encoder on this site supports both RFC 3986 and form-encoded variants side by side. Useful when you're trying to figure out why your + got read as a space, or why your %20 didn't.