</>
Back to Blog
Dev 2026-06-03 8 min read

Regular Expressions, Properly Explained

Stephen Cole Kleene formalized regular expressions in 1951 as a way to describe a class of formal languages. By the time grep shipped in 1973, programmers had already started extending the math with features that aren't formally regular at all. Modern regex engines are descendants of that compromise — and the line between 'regex' and 'informal pattern language' has never been formally redrawn.

Regular ExpressionsRegexPCREReDoSBacktracking

What's actually regular

Kleene's regular languages can be described with three operations: concatenation, alternation, and the Kleene star. That's it. A pattern using only those is provably equivalent to a finite state machine, which means it can be matched against any string in time proportional to the string's length. No backtracking. No surprises.

The features that aren't part of regular languages, but are part of every modern regex flavor:

  • Backreferences. (a)\1 matches "the same thing twice." Not regular. Provably non-regular: the language {ww : w ∈ Σ*} is not regular but is matched by (.+)\1.
  • Lookarounds. (?=foo) and friends. Equivalent in expressive power to regular languages in theory, but combined with backreferences they push complexity into realms regular automata can't handle.
  • Recursive patterns. PCRE's (?R) and named recursion. Definitely not regular.

The reason this matters: regular languages have well-known algorithms that match in O(n). Backreferences and recursion need backtracking, and backtracking can be exponential. Almost every regex performance footgun traces back to this divide.

NFA vs DFA

There are two algorithmic strategies for matching regex.

DFA (Deterministic Finite Automaton). Compile the pattern into a state machine. Walk the input one character at a time, advancing through states. Always linear in input length. Cannot do backreferences or lookbehind.

NFA with backtracking. Try alternatives in order, back up when one fails. Exponential worst case. Can do backreferences, lookarounds, recursion, and most "modern" features.

Real-world engines split:

  • PCRE, Perl, Python, Ruby, Java, .NET, JavaScript — backtracking NFA. Feature-rich, vulnerable to ReDoS.
  • Go's regexp, Rust's regex, RE2, grep -E — DFA / hybrid. Linear-time guarantee. No backreferences. No lookbehind in some versions.

If you're writing regex against untrusted input — user-submitted patterns, network protocols, log lines — RE2 / Go regexp is the right tool. If you're writing patterns over data you control, the backtracking flavors give you more expressive power.

Catastrophic backtracking

Run this against a long input of A's followed by a B:

^(a+)+$

Against aaaaaaaaaaaaaaaaaab, an NFA engine tries every way to split the A's between the inner and outer plus, and each failure means more backtracking. The number of attempts grows exponentially with input length. On a long-enough input the regex appears to hang.

The vulnerability has a name: ReDoS (regex denial of service). It's been used to take down production services running for years. Stack Overflow had a famous outage in 2016 from a regex like ^[\s‌]+|[\s‌]+$ against pathological input.

The shape to watch for is "nested quantifiers on overlapping alternatives":

  • (a+)+ — same character matched two ways
  • (a|a)+ — alternation where both branches accept the same input
  • (.*)* — overlapping greedy matches

A safer rewrite of (a+)+ is a+. A safer rewrite of (a|b)+a is to use a possessive quantifier (?>a|b)+a or atomic group, or to use a DFA-based engine that doesn't backtrack.

If you must accept user-submitted regex, run them under a timeout and a length cap. Don't hand them to PCRE without bounds.

Flavor differences

The most common surprises when porting patterns:

  • . matching newlines. Default off in most flavors; flag-controlled (s in PCRE, re.DOTALL in Python, no equivalent in JavaScript until ES2018's s flag).
  • Anchors and multiline. ^ and $ match start/end of string by default; with m they match start/end of line. Whether \n counts as a line terminator vs \r\n vs varies.
  • Lookbehind. PCRE/Python require fixed-length lookbehind. ECMAScript and recent PCRE2 allow variable-length. Go's regexp doesn't have it at all.
  • Named captures. PCRE: (?<name>...). Python: (?P<name>...). JavaScript: (?<name>...) since ES2018. Different syntax, same idea.
  • Unicode property escapes. \p{L} for "letter" works in PCRE, Python (with regex module), and ES2018+, with subtly different definitions of categories.
  • Word boundary \b. Almost universal, but the definition of "word character" varies. ASCII-only by default in most engines; \u-flagged engines extend it to Unicode.

If a pattern works in PCRE and not in JavaScript, the cause is usually one of those five.

Greedy vs lazy vs possessive

Quantifiers default to greedy: match as much as possible, back off if necessary. <.*> against <a><b> matches the whole string, then backtracks until the trailing > matches.

Add a ? to make them lazy: match as little as possible. <.*?> against <a><b> matches <a>, then <b> on the next call.

Add a + to make them possessive (PCRE, Java, recent ECMAScript): match greedily, then refuse to give back. <.*+> against <a><b> matches <a><b> and then fails to find a >, with no backtracking attempted. Possessive quantifiers eliminate backtracking and are the right tool when you know greediness is correct and you don't want catastrophic backtracking risk.

The pattern most people reach for when "match HTML tags" turns out to be <[^>]*> — character classes are usually faster and clearer than .*?, and they don't backtrack at all.

Common pitfalls

  • Using regex to parse HTML, JSON, XML, or CSV. They're all not regular languages. Use a real parser.
  • Forgetting that \d matches Devanagari digits in Unicode mode. Use [0-9] if you actually mean Arabic digits.
  • Anchoring with ^ and $ when you meant \A and \z. With m mode the meanings split apart.
  • Capturing when you don't need to. (?:...) is the non-capturing form. Saves memory and clarifies intent.
  • Embedding regex literals in strings without escaping the language's own escape rules. "\\d+" is a string of \d+, but "\d+" is also \d+ in some languages and a parse error in others.
  • Treating [a-z] as case-insensitive. It isn't. Use the i flag.

When regex isn't the answer

A few categories where it usually shouldn't be:

  • Validating email. RFC 5322 email is too complex for regex. Use a parser library, or accept any string with @ and a . after it and let the verification email do the work.
  • Validating credit cards. Regex catches obvious typos. Don't use it as authentication of validity — use the Luhn check, or just send it to your payment processor.
  • Cleaning user input. Regex often catches the obvious cases and misses the smart ones. For HTML-stripping or SQL escaping, use a real escaper from your platform's standard library.
  • Anything across multiple lines with structure. Multi-line state usually means parsing.

Practical rules

  • For untrusted patterns, use a DFA engine (RE2 family) or run with a timeout.
  • Use non-capturing groups by default. Capture only when you need the value.
  • Use anchors. ^pattern$ is much faster than pattern if you know it should match the whole string.
  • Prefer character classes over .*? when applicable.
  • Watch for (x+)+, (x|x)+, (.*)* shapes. Rewrite or use possessive.
  • When a regex starts feeling unreadable, that's a sign you should split it into stages or use a parser.
  • Test against pathological inputs before shipping a regex against user data.

Test patterns interactively

The regex tool on this site lets you build patterns against test text with live highlighting, walks you through capture groups, and warns about constructs that backtrack catastrophically. All client-side.

Open the regex tool

Related guides

Keep the session useful with adjacent reading instead of exiting after one article.

View all guides

Cookie Consent

We use cookies to enhance your experience and show relevant ads. You can customize your preferences.