Format 2026-05-31 10 min read

XML, Properly Explained

XML 1.0 was published as a W3C Recommendation in February 1998. It was designed as a simplified subset of SGML — itself an ISO standard from 1986 — that could be parsed without the full tag-soup tolerance SGML demanded. The 'simplified' framing is funny in retrospect, given the empire of namespaces, schemas, transforms, and query languages XML grew on top of itself. But the thing under all of it is still small, well-defined, and stricter than anything that's tried to replace it.

Tags: XML, XSD, XPath, XSLT, Namespaces, XXE

JSON won the API war so completely that XML now feels like an artifact. It isn't. XML is the substrate underneath several systems you rely on every day — SAML SSO, SOAP banking gateways, government tax e-filing schemas, RSS / Atom feeds, the OOXML format inside .docx/.xlsx/.pptx, the EPUB format inside ebook readers, Android resources, Java configuration in a long tail of enterprise codebases, XHTML where it's still rendered. The parts of the stack where XML is irreplaceable share a property: they need a schema-validated, namespaced, queryable, transformable document model, and JSON simply doesn't ship those features in the format itself.

This post is an attempt to explain XML the way someone who'll have to deal with it twice a year actually needs to think about it.

What it actually is

An XML document is a tree of elements. Each element has a name, optional attributes, optional child elements, and optional text content. Elements are written with start tags, end tags, and self-closing tags. Attributes are name="value" pairs on the start tag. Comments use <!-- ... -->. Text content can include the five predefined entity references (&amp;, &lt;, &gt;, &quot;, &apos;) and numeric character references (&#65; and &#x41; both produce A).

<?xml version="1.0" encoding="UTF-8"?>
<order id="42" status="paid">
  <customer>Alice</customer>
  <items>
    <item sku="ABC-123" qty="2"/>
  </items>
</order>

That's the format. Five rules, roughly:

Every start tag has a matching end tag (or is self-closed).
Tags nest, never overlap.
Attribute values are quoted.
There is exactly one root element.
Reserved characters in text content are escaped.

A document satisfying those rules is well-formed. A well-formed document that also matches a declared schema (DTD, XSD, RELAX NG) is valid. Most XML in production is well-formed but not validated, because schema validation is expensive and most pipelines skip it.
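
The escaping rule is the one hand-rolled generators tend to break, so here is a minimal sketch of building the order document above with Python's standard xml.etree.ElementTree and letting the serializer handle the escaping. The build_order function and the customer name are made up for illustration.

```python
import xml.etree.ElementTree as ET


def build_order(order_id: int, customer: str, sku: str, qty: int) -> str:
    """Build the example order document; the serializer escapes reserved characters."""
    order = ET.Element("order", {"id": str(order_id), "status": "paid"})
    ET.SubElement(order, "customer").text = customer
    items = ET.SubElement(order, "items")
    ET.SubElement(items, "item", {"sku": sku, "qty": str(qty)})
    return ET.tostring(order, encoding="unicode")


# A customer name containing reserved characters (hypothetical data).
xml_text = build_order(42, "Fish & Chips <Ltd>", "ABC-123", 2)

# Parsing the output gives back the original, unescaped text.
round_tripped = ET.fromstring(xml_text).findtext("customer")
```

The raw string never gets pasted into markup directly, which is the point: the ampersand and angle brackets come out as entity references without the caller thinking about it.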

Well-formed vs valid: the most-confused distinction

This matters operationally. Almost every XML parser will reject a document that isn't well-formed (mismatched tags, unescaped &, etc.). Almost no XML parser will, by default, validate against a schema unless you ask it to. People who've been burned by XML usually mean "well-formed XML accepted by my parser turned out to be semantically wrong, because nobody validated it against the schema."
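
A small sketch of that asymmetry, using Python's standard xml.etree.ElementTree (which checks well-formedness but does no schema validation). The is_well_formed helper and the sample strings are illustrative only.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


overlapping_tags = "<a><b></a></b>"          # tags overlap instead of nesting
bare_ampersand = "<name>R&D</name>"          # & must be written as &amp;
nonsense_status = '<order status="bogus"/>'  # well-formed, semantically wrong

results = {
    "overlapping_tags": is_well_formed(overlapping_tags),
    "bare_ampersand": is_well_formed(bare_ampersand),
    "nonsense_status": is_well_formed(nonsense_status),
}
```

The first two are rejected outright. The third is accepted, because nothing short of a schema tells the parser that "bogus" is not a legal status.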

Validation is also where XML's expressive power shines and JSON's poverty shows. An XSD can express:

"This element must have between 1 and N children of type X."
"This attribute must match a regex."
"This number must be between 0 and 100."
"This element is allowed only if a sibling has a specific value." (This one needs XSD 1.1 assertions; XSD 1.0 cannot express it.)
"This subtree must be unique by key."

JSON Schema can express most of this too, but XSD became a W3C Recommendation in 2001, most of a decade before the first JSON Schema drafts, ships with mature tooling in every enterprise language, and has standardized type libraries. If you're shipping data into a legacy banking, tax, or healthcare pipeline, the schema is XSD-shaped because those industries finished their format wars before JSON Schema existed.

Namespaces: the thing nobody intuits on the first try

A real-world XML document combines multiple vocabularies: your business data plus signature elements plus encryption elements plus metadata. To prevent name collisions (your <id> vs another vocabulary's <id>), the W3C published Namespaces in XML as a companion Recommendation in January 1999. It is a separate specification layered on top of XML 1.0, not an amendment to it.

A namespace is a URI bound to a prefix. The URI is just an identifier — it doesn't have to resolve, doesn't get fetched, isn't a URL in any meaningful sense. The prefix is shorthand for the URI within the document.

<order xmlns="http://example.com/order/v1"
       xmlns:sig="http://www.w3.org/2000/09/xmldsig#">
  <customer>Alice</customer>
  <sig:Signature>...</sig:Signature>
</order>

Here <order> and <customer> are in the order namespace (no prefix = default namespace). <sig:Signature> is in the W3C XML Signature namespace. A consumer that knows http://www.w3.org/2000/09/xmldsig# knows exactly what <sig:Signature> means, regardless of which prefix the producer chose.

The trap: prefixes are arbitrary. The same document written with <dsig:Signature> instead of <sig:Signature> is equivalent if the URI binding is the same. People who wrote XPath queries against sig: and then received the same data with dsig: have been confused for hours by this. Always query by namespace URI, not prefix.

Entity expansion and the billion laughs attack

XML supports entities — named macros declared in a DTD that expand inline. The classic abuse:

<!DOCTYPE lolz [
  <!ENTITY lol "lol">
  <!ENTITY lol1 "&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;">
  <!ENTITY lol2 "&lol1;&lol1;&lol1;&lol1;&lol1;&lol1;&lol1;&lol1;&lol1;&lol1;">
  <!-- ... lol3 through lol9 follow the same pattern ... -->
]>
<lolz>&lol9;</lolz>

Each level expands tenfold, so &lol9; expands to a billion lols. A naive parser will allocate gigabytes and OOM. This is the billion laughs attack, known since 2003. Most current parsers cap entity expansion at a fixed budget, but older library versions may not, so check the one you actually ship.
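
The arithmetic is worth seeing once. This sketch assumes the classic form of the attack: a three-byte payload and nine levels that each repeat the previous entity ten times. The function name is illustrative.

```python
def expanded_size(payload: str, fanout: int, levels: int) -> int:
    """Bytes produced when `levels` nested entities each repeat the previous one `fanout` times."""
    return len(payload.encode("utf-8")) * fanout ** levels


copies = 10 ** 9                                   # nine tenfold levels
size_bytes = expanded_size("lol", fanout=10, levels=9)
size_gb = size_bytes / 1_000_000_000
```

A document well under a kilobyte asks the parser for about 3 GB of text, which is why a fixed expansion budget is the usual defense.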

A more dangerous variant is XXE (XML External Entity) injection, where the entity points at an external URL or local file:

<!DOCTYPE x [
  <!ENTITY exfil SYSTEM "file:///etc/passwd">
]>
<x>&exfil;</x>

A parser that resolves external entities will read /etc/passwd and inline it into the parsed document; if the application echoes any of that content back, the attacker has the file. XXE has been used against PayPal, Facebook, banks, and governments, and every few months a new XXE CVE shows up because someone re-enabled an unsafe setting somewhere. Many current parsers disable external-entity resolution by default, but not all do (Java's JAXP factories are a long-standing exception), so it's worth verifying explicitly:

Java: factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true).
Python: use defusedxml instead of xml.etree.
Go: encoding/xml doesn't process external entities at all (good).
C#: XmlReaderSettings.DtdProcessing = DtdProcessing.Prohibit.

If you're parsing untrusted XML and you can't articulate exactly which entity-expansion features are off, assume you're vulnerable.
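
For illustration, here is a standard-library sketch of the strictest policy: refuse any document that carries a DOCTYPE at all, which rules out both entity bombs and external entities. It is not a substitute for defusedxml, and parse_untrusted and DoctypeForbidden are hypothetical names.

```python
import xml.etree.ElementTree as ET
import xml.parsers.expat as expat


class DoctypeForbidden(ValueError):
    """Raised when untrusted input contains a DOCTYPE declaration."""


def parse_untrusted(data: bytes) -> ET.Element:
    """Parse XML only if it declares no DTD."""

    def reject_doctype(name, system_id, public_id, has_internal_subset):
        raise DoctypeForbidden(f"DOCTYPE {name!r} is not allowed")

    # First pass: expat calls the handler when the DOCTYPE starts,
    # before any entity declared inside it could be expanded.
    scanner = expat.ParserCreate()
    scanner.StartDoctypeDeclHandler = reject_doctype
    scanner.Parse(data, True)

    # Second pass: build the tree from input known to be DTD-free.
    return ET.fromstring(data)


xxe_payload = (
    b'<!DOCTYPE x [<!ENTITY exfil SYSTEM "file:///etc/passwd">]>'
    b"<x>&exfil;</x>"
)
```

Parsing twice costs time; the trade is that the policy is explicit and easy to audit. Formats that legitimately need a DTD would need a narrower rule than this one.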

Encoding: declaration, BOM, and the actual bytes

The <?xml version="1.0" encoding="UTF-8"?> line at the top is the XML declaration, and its encoding= attribute tells the parser how to decode the bytes that follow.

What goes wrong:

The declaration says UTF-8 but the file is actually written in Latin-1. The parser will accept ASCII bytes fine and then fail at the first non-ASCII character with a confusing error.
The declaration says UTF-8 and the file starts with a UTF-8 BOM (EF BB BF). Most parsers handle this; some choke. RFC 7303 says BOM is allowed.
The declaration says UTF-16 but the file has no BOM. The parser can't be sure of the byte order; strict parsers refuse, others guess.
No declaration. The default is UTF-8. People assume Latin-1 and write a tool that produces ill-formed bytes.

A surprising fraction of "the parser broke on this XML file" turns out to be encoding mismatch, not malformed structure. When in doubt, hex-dump the first 8 bytes and compare against the declaration.
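
A rough sketch of that check in Python. It only understands ASCII-compatible declarations and three common BOMs, and the function names are made up for illustration.

```python
import codecs
import re

_BOMS = (
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)
_DECL = re.compile(rb"""^<\?xml[^>]*?encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']""")


def detect_bom(data: bytes):
    """Return the encoding implied by a leading byte-order mark, or None."""
    for mark, name in _BOMS:
        if data.startswith(mark):
            return name
    return None


def declared_encoding(data: bytes):
    """Return the encoding named in the XML declaration, lowercased, or None."""
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    match = _DECL.match(data)
    return match.group(1).decode("ascii").lower() if match else None


def check_encoding(data: bytes) -> str:
    """Try to decode the bytes as the declared encoding (UTF-8 if none is declared)."""
    encoding = declared_encoding(data) or "utf-8"
    try:
        data.decode(encoding)
    except UnicodeDecodeError as exc:
        return f"not valid {encoding}: byte 0x{data[exc.start]:02x} at offset {exc.start}"
    except LookupError:
        return f"unknown encoding name: {encoding}"
    return "ok"


# A file that claims UTF-8 but was written as Latin-1 (hypothetical content).
mislabelled = '<?xml version="1.0" encoding="UTF-8"?><a>café</a>'.encode("latin-1")
report = check_encoding(mislabelled)
```

The report points at the first byte that contradicts the declaration, which is usually enough to tell an encoding mismatch from a structural problem.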

CDATA: the escape hatch you mostly don't need

Embedding HTML or code with lots of < and & in XML text content is painful — every angle bracket and ampersand has to be escaped. CDATA sections let you write a literal block:

<script><![CDATA[
  if (x < 10 && y > 0) { return "ok"; }
]]></script>

Inside <![CDATA[ ... ]]>, almost everything is literal — except the closing ]]>, which can't appear (you have to split it across two CDATA sections if it does). CDATA is sometimes treated as "a different kind of string"; it isn't. To the parser, the result is identical to the equivalent escaped text. Don't write business logic that branches on "was this CDATA or not."
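
Both claims can be checked with Python's standard xml.etree.ElementTree. The to_cdata helper below is a hypothetical name; it shows the usual way of splitting a literal ]]> across two sections.

```python
import xml.etree.ElementTree as ET

with_cdata = ET.fromstring(
    '<script><![CDATA[if (x < 10 && y > 0) { return "ok"; }]]></script>'
)
with_escapes = ET.fromstring(
    '<script>if (x &lt; 10 &amp;&amp; y &gt; 0) { return "ok"; }</script>'
)
same_text = with_cdata.text == with_escapes.text


def to_cdata(text: str) -> str:
    """Wrap text in CDATA, splitting any literal ]]> across two sections."""
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"


tricky = "a[b[0]]>c"
recovered = ET.fromstring(f"<v>{to_cdata(tricky)}</v>").text
```

After parsing, the tree holds plain text in both cases; nothing records whether a CDATA section was used to write it.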

XPath, XSLT, XSD: the queryable-data ecosystem

JSON has nothing equivalent to XPath in the format itself. XML has XPath built into the model:

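/order/items/item[@sku='ABC-123']/@qty

That's a query language for navigating an XML tree. This expression selects the qty attribute of the item whose sku is ABC-123 in the order document above (qty is an attribute there, hence the @). XPath 1.0 (1999) is supported by essentially every XML toolkit. XPath 2.0/3.x adds rich type and function libraries; XPath 3.x is required by XSLT 3.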
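
A minimal sketch of the same lookup in Python. The standard xml.etree.ElementTree implements only a subset of XPath: it can filter elements by attribute value but cannot select an attribute node, so the sketch finds the element and then reads qty from it.

```python
import xml.etree.ElementTree as ET

ORDER_XML = """<order id="42" status="paid">
  <customer>Alice</customer>
  <items>
    <item sku="ABC-123" qty="2"/>
  </items>
</order>"""

root = ET.fromstring(ORDER_XML)

# Paths are relative to the root element, so the leading /order is implied.
item = root.find("items/item[@sku='ABC-123']")
qty = item.get("qty") if item is not None else None

# A SKU that is not in the document yields None rather than an error.
missing = root.find("items/item[@sku='NOPE']")
```

Libraries with full XPath support accept the expression as written; the subset here is enough for simple attribute-filtered lookups.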

XSLT is a Turing-complete language for transforming one XML document into another. People who haven't written XSLT think it's regrettable; people who have written XSLT for a living have either retired or learned to enjoy it. It's still the canonical way to render XML data into HTML at scale (a lot of government PDFs are XSLT under the hood).

XSD (XML Schema Definition) is the schema language for validation. Big, sprawling, but the de-facto standard.

These are why XML keeps showing up in places JSON doesn't reach. A SOAP service with WSDL can be statically validated, code-generated into client stubs in a dozen languages, and queried with XPath — all with off-the-shelf tooling. You can do the equivalent for JSON, but you assemble it from five different tools.

Why XML lost (and where it didn't)

JSON beat XML for HTTP APIs because:

JavaScript already had JSON.parse natively; XML required a separate library and a DOM API.
JSON carries less markup per unit of data. {"x":1} vs <x>1</x> sounds petty, but across millions of messages the overhead adds up.
JSON has no schema, no namespaces, no transformations, no validation — and for most CRUD APIs, you don't need them.
The XML ecosystem accumulated cruft (SOAP, WSDL, WS-*, XLink, XPointer, the XHTML 2.0 misadventure) that made "use XML for your API" mean "use this twelve-headed standards stack for your API."

Where XML won: anywhere a document needs to be schema-validated, namespaced, signed, encrypted, transformed, and archived for decades. Banking. Healthcare. Government. Legal e-filing. SAML. SOAP. The OOXML inside Office documents. EPUB. SVG (yes — SVG is XML). RSS/Atom. The Maven ecosystem. Anything with a 30-year lifetime where the consumers and producers might be a generation apart.

Common pitfalls

< or & in text. Must be &lt; and &amp;. A surprising amount of broken XML is generated by printf instead of a real serializer.
Mixing CDATA and entity references thinking they're different. They aren't, to the parser.
Trusting xmlns prefixes. Always resolve to URI before comparing.
Forgetting the XML declaration's encoding and hand-rolling files in Latin-1.
Parsing XXE-vulnerable input with a default parser and a default config. Modern defaults are usually safe, but check.
Comparing two XML documents by string equality. Whitespace inside element content is often (but not always) significant. Attribute ordering inside a tag is never significant. <a x="1" y="2"/> and <a y="2" x="1"/> are equivalent. Use a canonical form (xmllint --c14n) before diffing.
Mutating a document via string concatenation rather than DOM/SAX. You'll get the escaping wrong and produce malformed XML.
Treating XML as JSON by ignoring attributes. <item sku="X">2</item> has both an attribute and text content; a JSON-style serializer that maps to {item: 2} loses the SKU.
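
For the string-equality pitfall, a sketch of comparing canonical forms from Python instead of shelling out to xmllint. It assumes Python 3.8 or later, where xml.etree.ElementTree.canonicalize exists; same_xml is a made-up helper name.

```python
import xml.etree.ElementTree as ET


def same_xml(left: str, right: str, ignore_whitespace: bool = False) -> bool:
    """Compare two documents by their canonical (C14N 2.0) serializations."""
    return (
        ET.canonicalize(left, strip_text=ignore_whitespace)
        == ET.canonicalize(right, strip_text=ignore_whitespace)
    )


# Attribute order never matters.
attrs_equal = same_xml('<a x="1" y="2"/>', '<a y="2" x="1"/>')

# Indentation is text content, so it matters unless you opt out.
compact = "<a><b/></a>"
indented = "<a>\n  <b/>\n</a>"
strict_equal = same_xml(compact, indented)
loose_equal = same_xml(compact, indented, ignore_whitespace=True)
```

Whether to strip whitespace is a per-format decision: fine for data-style documents, wrong for anything with mixed content where spacing carries meaning.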

When to use XML vs JSON vs something else

XML when: you need schema validation as part of the format, you're integrating with a SOAP/SAML/government/banking system that already speaks XML, you need namespaces because you're combining multiple vocabularies, you need XSLT to render the data into a presentation format.

JSON when: it's an HTTP API, the consumer is a web frontend, you don't need namespaces or schema validation in the format itself, you want the smallest possible parser everywhere.

Protobuf / Avro / Cap'n Proto when: you need wire-level efficiency, evolution-safe schemas, and your producer and consumer are both under your control.

TOML / YAML when: it's a config file written by humans.

Markdown / plain text when: it's prose.

XML's reputation as "obsolete" is wrong. It's specialized. The jobs that need what XML uniquely provides aren't going away, and JSON isn't going to grow a schema language with namespace support that catches up to XSD. The right move when you meet XML in 2026 isn't to wish it were JSON; it's to use a real parser, validate against a schema, and keep external entity resolution off.

Format and validate XML locally

The XML tool on this site formats and validates XML in the browser using a real parser, with attribute preservation and indentation control. Useful when an upstream system hands you a single-line XML blob and you need to read it before pasting it back. Nothing leaves your browser.
