Format 2026-05-29 10 min read

SQL, Properly Explained

SQL was standardized as ANSI X3.135 in 1986 and ISO 9075 in 1987. The standard has been revised roughly every five years since (SQL:1992, :1999, :2003, :2008, :2011, :2016, :2023), and at no point has any major database fully implemented it. The dialects you actually meet — PostgreSQL, MySQL, SQLite, T-SQL, Oracle, BigQuery, Snowflake, Spark, ClickHouse — agree on the core 80% of CRUD and disagree on virtually everything else. A SQL formatter that doesn't know which dialect it's looking at is guessing.

SQLANSI SQLPostgreSQLMySQLNULLJOINSQL Injection

SQL has the most successful syntax in computing — it has outlived four generations of frameworks and is still what every analyst, engineer, and AI system reaches for when there's a database in the loop. It also has more legacy weirdness per square inch than almost any language in current use. Both facts trace back to the same source: SQL is older than C, older than Unix as we know it, and was designed for a market (relational database vendors in the late 1970s) that explicitly wanted to embrace-extend-extinguish each other's syntax.

This post is for the person who reads SQL across multiple systems and would like the recurring confusions to stop being surprises.

What SQL actually is

SQL is a declarative query language for relational data. You describe the result you want; the database planner figures out how to compute it. Every SQL statement is fundamentally one of:

DML (Data Manipulation Language): SELECT, INSERT, UPDATE, DELETE, MERGE.
DDL (Data Definition Language): CREATE, ALTER, DROP, TRUNCATE for tables, indexes, views, schemas.
DCL (Data Control Language): GRANT, REVOKE.
TCL (Transaction Control Language): BEGIN, COMMIT, ROLLBACK, SAVEPOINT.

The SELECT statement is overwhelmingly the most-used and most-misunderstood. Its full grammar (in the standard, simplified):

SELECT [DISTINCT] columns
FROM tables
[JOIN ... ON ...]
WHERE row_predicate
GROUP BY grouping_columns
HAVING group_predicate
ORDER BY columns
LIMIT n OFFSET m

Almost everyone who's written SQL more than once knows that order. The thing they don't know is that the database doesn't execute it in that order.

The clause execution order

The order you write SQL clauses is for human convenience. The order the database processes them, conceptually, is:

FROM — pick tables, apply joins.
WHERE — filter rows.
GROUP BY — collapse rows into groups.
HAVING — filter groups.
SELECT — compute output columns.
DISTINCT — drop duplicate rows.
ORDER BY — sort.
LIMIT/OFFSET — paginate.

Three consequences fall out of this order, and they're responsible for most of the "why doesn't this query work" questions:

You can't reference a SELECT alias in WHERE. The alias doesn't exist yet at WHERE-evaluation time. PostgreSQL is strict about this; some dialects (MySQL, BigQuery) allow it as an extension.
You can reference a SELECT alias in ORDER BY. ORDER BY runs after SELECT.
HAVING operates on groups; WHERE operates on rows. Filtering on COUNT(*) > 5 belongs in HAVING. Filtering on country = 'US' belongs in WHERE — pushing it into HAVING still works, but is slower because you're carrying rows through GROUP BY that you'll then throw away.

The execution order is the conceptual one; the actual query planner reorders aggressively for performance. But the dependency rules above are real semantics, not just optimization.

NULL is not what you think

NULL is SQL's "unknown." It is not zero. It is not empty string. It is not false. The semantics propagate eerily.

SELECT NULL = NULL;         -- NULL (NOT true!)
SELECT NULL <> NULL;        -- NULL
SELECT NULL = '';           -- NULL
SELECT NULL + 1;            -- NULL
SELECT NULL || 'x';         -- NULL (Postgres)
SELECT 1 IN (1, 2, NULL);   -- TRUE
SELECT 1 NOT IN (1, 2, NULL); -- NULL — not FALSE

The rule: any operation involving NULL returns NULL (with a few documented exceptions like COALESCE, IS NULL, IS NOT NULL).

The killer is NOT IN with NULL in the subquery result:

SELECT * FROM users WHERE id NOT IN (SELECT user_id FROM banned);

If banned contains a single NULL user_id, this query returns no rows, ever. Because for any user id, id NOT IN (..., NULL, ...) evaluates as NULL, which is not TRUE, so WHERE rejects the row. This bug has shipped to production at every company.

Defenses: prefer NOT EXISTS over NOT IN. Or filter NULLs out of the subquery: WHERE id NOT IN (SELECT user_id FROM banned WHERE user_id IS NOT NULL). Or use a left anti-join.

Another NULL surprise: COUNT(*) includes NULLs; COUNT(column) does not. COUNT(DISTINCT column) also drops NULLs. People who use the wrong COUNT to "count rows where this column is set" usually meant COUNT(*) and got COUNT(column).

Sorting and grouping treat NULLs as a single group, even though NULL = NULL is NULL. Consistency wasn't the point.

JOINs: the four kinds and the cross-join trap

Five join types, in increasing scariness:

INNER JOIN (or just JOIN): rows where the predicate matches in both tables.
LEFT JOIN (or LEFT OUTER JOIN): every row from the left, plus matching rows from the right (NULLs where no match).
RIGHT JOIN: mirror of LEFT. Avoid; people read top-to-bottom and reverse-direction joins are confusing. Rewrite as LEFT.
FULL OUTER JOIN: every row from either side, NULLs where no match. MySQL doesn't support this directly until 8.0; emulate with LEFT JOIN UNION ALL RIGHT JOIN.
CROSS JOIN (or , in old syntax): every row from the left paired with every row from the right. O(N*M) rows.

The trap: forgetting the JOIN predicate.

SELECT * FROM orders, customers;

That's a CROSS JOIN. If orders has 1M rows and customers has 100k, you've just asked for 100 billion rows. Most planners will start producing them. ANSI-style JOIN ... ON is loud about missing predicates; comma-style is silent. Stop using comma-joins.

Another subtle one: LEFT JOIN ... ON vs LEFT JOIN ... WHERE. The predicates aren't equivalent.

-- A: keeps every order, even ones with no matching shipment
SELECT * FROM orders o
LEFT JOIN shipments s ON s.order_id = o.id AND s.status = 'sent';

-- B: drops orders whose shipment isn't 'sent' OR missing
SELECT * FROM orders o
LEFT JOIN shipments s ON s.order_id = o.id
WHERE s.status = 'sent';

In B, the WHERE clause turns the LEFT JOIN back into an INNER JOIN, because a non-matching s.status is NULL, and NULL = 'sent' is NULL, which fails the WHERE. People write B intending A all the time.

The dialect map

The SQL standard nominally agrees on the keywords. The dialects disagree on:

Identifier quoting. Standard: double quotes ("my column"). MySQL: backticks. T-SQL: square brackets [...] or double quotes if SET QUOTED_IDENTIFIER ON. Postgres: double quotes, case-sensitive.
String literals. Standard: single quotes. MySQL allows double quotes for strings unless ANSI_QUOTES is set.
String concatenation. Standard: ||. MySQL: CONCAT(...) (or || only with PIPES_AS_CONCAT). T-SQL: +.
LIMIT n OFFSET m in Postgres / MySQL / SQLite. T-SQL uses OFFSET m ROWS FETCH NEXT n ROWS ONLY (since 2012). Oracle: similar FETCH FIRST syntax in 12c+.
Auto-increment. Postgres: SERIAL (legacy) or GENERATED ... AS IDENTITY (preferred). MySQL: AUTO_INCREMENT. SQLite: INTEGER PRIMARY KEY is implicitly rowid. T-SQL: IDENTITY(1,1).
Upsert. Postgres: ON CONFLICT ... DO UPDATE. MySQL: ON DUPLICATE KEY UPDATE or INSERT ... ON DUPLICATE KEY UPDATE. SQLite: ON CONFLICT (similar to Postgres). T-SQL: MERGE (with footnotes — Microsoft has documented enough MERGE bugs that careful users prefer atomic INSERT+UPDATE).
Boolean type. Postgres: real BOOLEAN. MySQL: BOOLEAN is alias for TINYINT(1). SQL Server: no boolean — use BIT. SQLite: any expression result.
JSON functions. Postgres: ->, ->>, jsonb_*, mature. MySQL 5.7+: JSON_EXTRACT, ->, similar but different operator semantics. SQLite: json_extract. SQL Server: JSON_VALUE, JSON_QUERY. BigQuery, Snowflake: their own. Don't expect portability.
Date arithmetic. Postgres: now() - interval '1 day'. MySQL: DATE_SUB(NOW(), INTERVAL 1 DAY). T-SQL: DATEADD(DAY, -1, GETDATE()). SQLite: datetime('now', '-1 day').

A formatter that produces output for "SQL" without committing to a dialect will silently mis-tokenize at least some inputs. The version on this site supports 14 dialects explicitly because that's the only way the keyword lists, function names, and identifier rules stay correct.

SQL injection

The lesson, restated for the millionth time because every cohort of new programmers re-learns it: never build SQL by string concatenation with user input.

# Catastrophically wrong
cursor.execute(f"SELECT * FROM users WHERE name = '{name}'")

# Correct
cursor.execute("SELECT * FROM users WHERE name = %s", (name,))

The wrong version, given name = "'; DROP TABLE users; --", builds:

SELECT * FROM users WHERE name = ''; DROP TABLE users; --'

The right version sends the SQL and the parameter to the database separately; the database parses the SQL once and binds the parameter as data, never as code. There is no clever escape function that's safe — the only safe path is parameterized queries (a.k.a. prepared statements, a.k.a. bind variables).

Edge cases people miss:

Identifier interpolation. Parameters bind values, not column or table names. SELECT * FROM ? doesn't work. If you must dynamically name a table, validate against an allowlist before string-concatenating.
LIKE wildcards. A user-supplied search term might contain % or _, which are wildcards. Escape them in the application or use database-specific functions.
ORM injections. Most ORMs are safe by default but offer raw-SQL escape hatches. Audit the escape hatches.
Stored procedures. A stored procedure that builds dynamic SQL with EXEC and concatenated input is just as injectable as application code.
Second-order injection. Storing a malicious string and executing it later (e.g., from a log table). Defense: parameterize on the use, not just the input.

The first generation of injection defenses (escape ' to '') is provably incomplete and you should never rely on it. Use parameterized queries; use ORMs; if you must build SQL dynamically, build the structure with code-controlled string concatenation and bind the values separately.

Performance: the pieces a formatter can't help with

A pretty SQL query can still be a slow SQL query. The headline failure modes:

Missing indexes on JOIN columns, WHERE filter columns, ORDER BY columns. Always check the query plan (EXPLAIN ANALYZE on Postgres, EXPLAIN on MySQL, query plan on SSMS).
Implicit type coercion preventing index use. WHERE phone = 5551234 where phone is VARCHAR may force a full-table scan on MySQL because of how it casts. Always match the column type literally.
SELECT * in production code. It's fine for ad-hoc; it's a footgun in code where someone adds a BLOB column three years later and queries get 100x slower.
N+1 queries from ORMs that lazy-load relationships. A formatter shows you the SQL but not the loop.
OR predicates that split queries into two index scans. Sometimes faster as UNION ALL of two separate selects.
Functions on the left side of WHERE. WHERE LOWER(name) = 'alice' can't use an index on name. Use a functional index or store a lowercase column.

Formatting the query nicely makes the bug legible. It doesn't make it fast.

Common pitfalls

NOT IN with a NULL-returning subquery. Rewrite as NOT EXISTS.
COUNT(column) when you wanted COUNT(*).
GROUP BY column ordinals. GROUP BY 1 works in MySQL/Postgres/BigQuery; ANSI standard requires named columns; T-SQL allows it. Names are clearer.
Implicit cross-joins from a forgotten ON clause.
Mixing aggregates and non-aggregates without grouping. Postgres rejects; MySQL pre-5.7 silently picks an arbitrary row. Use GROUP BY properly.
Date arithmetic across timezones. Always store UTC. Display in the user's timezone at the application layer. The blog post on Unix timestamps goes deeper.
0 = '' in MySQL. Implicit coercion strikes again. Quote strings.
COUNT(DISTINCT a, b). Postgres counts NULLs as distinct; some other dialects don't. Check.
Locking surprises. SELECT ... FOR UPDATE semantics differ subtly across MySQL InnoDB, Postgres, and SQL Server. Read the docs for the engine you're on.
MERGE in T-SQL has documented bugs around concurrent execution. Aaron Bertrand's "Use Caution with SQL Server's MERGE Statement" is required reading if you write MERGE for SQL Server.

Reading SQL well

If you're handed a 200-line query and asked to debug it, the order of operations is:

Format it. Indent the clauses. Line up the JOINs. Lowercase or uppercase the keywords consistently.
Identify the FROM and JOIN structure. Draw it on paper if it's hairy.
Read the WHERE clause for filtering, then GROUP BY for aggregation, then HAVING for post-aggregate filtering.
SELECT last — it's the output, not the logic.
ORDER BY and LIMIT are presentation, not query semantics.

Most "what does this query do" questions resolve to "the formatter unmixed the clauses and the structure became visible." That's most of what a SQL formatter is for: not making the query fast, not making it correct, just making it readable enough that the person debugging it can find the bug.

Beyond formatting, the rules that actually keep you out of trouble: parameterize every user input, treat NULL as a third value not a falsy one, prefer NOT EXISTS over NOT IN, never trust comma-joins, and don't expect anything beyond core CRUD to be portable across dialects.

Format SQL for any major dialect

The SQL formatter on this site supports 14 dialects (PostgreSQL, MySQL, SQLite, T-SQL, PL/SQL, BigQuery, Redshift, Spark SQL, and more) with configurable indentation, keyword case, and minification. Useful when you're reading a one-line CI-generated query that's actually 200 columns of joined CTEs. All client-side.

Open the SQL tool

Related guides

Keep the session useful with adjacent reading instead of exiting after one article.

View all guides

QR Code 2026-06-10