Databases & Indexes - The Structures That Make Queries Fast
Helpful context:
- Trees & Heaps - When Flat Structures Hit Their Limit
- Operating Systems - The Software That Runs All Other Software
A startup runs a query: “get all orders for this user.” The table has fifty thousand rows. The query returns in 3ms. A year later, the table has fifty million rows. The same query takes 30 seconds. The CTO declares a database crisis. The engineer who wrote the query adds one line - CREATE INDEX idx_orders_user_id ON orders(user_id) - and the query returns in 2ms. Fifty million rows. Two milliseconds.
What just happened? The answer involves B-trees, disk physics, and a counterintuitive truth: the data structure holding your index is more important than the data structure holding your data.
What an Index Actually Is
The concept is older than computers. At the back of most textbooks there is an index: an alphabetically sorted list of terms, each followed by the page numbers where that term appears. When you need to find everything the author says about “Byzantine fault tolerance,” you do not read the book from page one. You flip to the index, find “Byzantine fault tolerance” in alphabetical order, read “pages 247, 312, 418,” and go directly to those pages.
A database index is this exact structure, applied to rows. It is a separate, sorted data structure that the database maintains alongside the main table. It maps column values to row locations. When you query WHERE user_id = 42, the database consults the index - “where are all the rows with user_id = 42?” - and jumps directly to those rows without reading anything else.
The main table itself has no guaranteed ordering. Rows are stored in whatever order they were inserted - which is usually close to insertion order, and not sorted by any column your queries care about. The index provides the sorted structure the table lacks.
The cost: every index must be updated whenever the underlying data changes. Adding a row to a table with three indexes means writing to four places on disk: the main table and each index. This is the fundamental tradeoff that runs through all of index design - faster reads, slower writes.
What It Actually Looks Like
Take a small orders table. The main table - called the heap - stores rows in insertion order, completely unsorted:
Heap (insertion order)
row | id | user_id | status | created_at | total_cents
1 | 1 | 9823 | delivered | 2024-03-15 10:22:00 UTC | 4999
2 | 2 | 42 | pending | 2024-03-15 10:23:00 UTC | 1200
3 | 3 | 7701 | shipped | 2024-03-15 10:24:00 UTC | 8750
4 | 4 | 42 | delivered | 2024-03-14 09:15:00 UTC | 3300
5 | 5 | 1054 | pending | 2024-03-15 10:25:00 UTC | 500
6 | 6 | 42 | shipped | 2024-03-13 14:30:00 UTC | 7200
To find all rows where user_id = 42, the database reads all six rows and checks each one. With six rows this is instant. With 50 million rows this is catastrophic.
Now you run CREATE INDEX idx_orders_user ON orders(user_id). The database builds a separate sorted structure. Its leaf level - the bottom of the B-tree where the actual entries live - looks like this:
Index on (user_id) - leaf level, sorted
user_id | pointer to heap row
42 | → row 2
42 | → row 4
42 | → row 6
1054 | → row 5
7701 | → row 3
9823 | → row 1
The entries are sorted by user_id. To find all rows where user_id = 42, the database traverses the B-tree from the root to the leaf position where 42 lives, then scans forward until user_id is no longer 42. It finds three entries, follows each pointer to fetch the actual row from the heap. Six rows read from the index, three heap fetches. With 50 million rows, the ratio is the same: O(log n) to reach the first match, then exactly as many heap fetches as there are matching rows.
Above the leaf level, the B-tree’s internal nodes act as a directory: each internal node stores a range of key values and a pointer to the child node covering that range. The root might say “everything below 5000 is left, everything from 5000 to 8000 is middle, everything above 8000 is right.” You follow one branch at each level until you arrive at the leaf. With branching factor 200, 50 million rows fit in a tree just four levels deep - four disk reads to locate any value.
What a composite index looks like
CREATE INDEX idx_orders_user_created ON orders(user_id, created_at DESC) produces a leaf level sorted first by user_id, then by created_at newest-first within each user:
Index on (user_id, created_at DESC) - leaf level
user_id | created_at | pointer
42 | 2024-03-15 10:23:00 UTC | → row 2
42 | 2024-03-14 09:15:00 UTC | → row 4
42 | 2024-03-13 14:30:00 UTC | → row 6
1054 | 2024-03-15 10:25:00 UTC | → row 5
7701 | 2024-03-15 10:24:00 UTC | → row 3
9823 | 2024-03-15 10:22:00 UTC | → row 1
A query WHERE user_id = 42 ORDER BY created_at DESC LIMIT 20 navigates to user_id = 42, then reads the next 20 entries in leaf order - they are already in the right order, so no sort step is needed at all. That is the point of ordering columns in an index: the sort the query needs is already the order the index stores things in.
Notice that the (user_id, created_at DESC) index also helps a plain WHERE user_id = 42 query - the leftmost column alone is a valid entry point into the tree. A query WHERE created_at > X with no user_id filter cannot use this index at all: you cannot skip the first column.
What a partial index looks like
CREATE INDEX idx_orders_pending ON orders(id) WHERE status = 'pending' contains only the rows matching the predicate:
Partial index on (id) WHERE status = 'pending'
id | pointer
2 | → row 2
5 | → row 5
Two entries instead of six. At 50 million rows where 2% are pending, this index has 1 million entries instead of 50 million - fifty times smaller. Fifty times less storage, fifty times less memory used when this index is cached, fifty times less write overhead when non-pending rows are inserted or updated.
What a covering index looks like
Say the query is SELECT status, created_at FROM orders WHERE user_id = 42. With the basic index on (user_id), the database finds the matching leaf entries and then follows a pointer to the heap for each one to read the status and created_at columns - an extra disk read per result row.
With an index on (user_id, status, created_at), the leaf entries already contain all three values:
Index on (user_id, status, created_at) - leaf level
user_id | status | created_at | pointer
42 | delivered | 2024-03-14 09:15:00 UTC | → row 4
42 | pending | 2024-03-15 10:23:00 UTC | → row 2
42 | shipped | 2024-03-13 14:30:00 UTC | → row 6
...
The query needs status and created_at - both are right there in the index. The database never follows the pointer to the heap. This is called an index-only scan, and EXPLAIN ANALYZE will tell you when the planner is using one.
Why Indexes Exist: The Disk Problem
A table without an index is a heap - an unordered pile of rows. Finding orders for user 42 means reading every single row and checking whether user_id = 42. This is a full table scan. At 50 million rows, reading every row from disk takes seconds even with fast SSDs.
The deeper issue is not just volume - it is random I/O. SSDs are fast for sequential reads but significantly slower for random access. A full table scan is sequential: read blocks from beginning to end. That is fast. An inefficient query that jumps around the disk following pointer chains is slow. The entire art of index design is about enabling sequential or highly localized reads for common query patterns.
Without an index, the cost of WHERE user_id = 42 is O(n) - linear in the number of rows. With a B-tree index, it is O(log n) - logarithmic. At 50 million rows, log₂(50,000,000) is roughly 25. Twenty-five comparisons instead of fifty million reads. That is the 30-second-to-2ms gap.
B-Trees: Why Not a Binary Search Tree?
The obvious data structure for a sorted index is a binary search tree. O(log n) lookup, easy to understand. Why doesn’t every database use a BST?
Disk physics. A BST node holds one key and two children. Traversing a BST requires following pointers - each pointer is a disk read. For a table with 50 million rows, a BST has depth ~26. That is 26 random disk reads per lookup. At 0.1ms per random disk read (a fast SSD), 26 reads is 2.6ms of pure seek time before you have even found your row.
A B-tree node can hold hundreds of keys. The exact number is tuned to fill one disk page (typically 4KB or 8KB). PostgreSQL’s default page size is 8KB; a B-tree node at this size holds roughly 100 - 200 keys plus pointers. A tree over 50 million rows with branching factor 200 has depth ~4. That is 4 disk reads instead of 26 - a 6x improvement just from the data structure choice.
The B+ tree variant (used by PostgreSQL and MySQL InnoDB) goes further: all actual data resides in leaf nodes, and leaf nodes are linked in a doubly-linked list. This makes range scans - WHERE user_id BETWEEN 1000 AND 2000 - extremely efficient: find the first matching leaf node, then follow the linked list forward until the range is exhausted. No backtracking through internal nodes.
B-trees support:
- Equality:
WHERE user_id = 42- traverse the tree to the leaf - Range queries:
WHERE created_at BETWEEN '2024-01-01' AND '2024-03-31'- find the start, scan forward - Prefix matching:
WHERE username LIKE 'meg%'- find the ‘meg’ prefix, scan forward - ORDER BY / MIN / MAX - the tree is already sorted; no extra sort step
Hash Indexes: The Deceptively Simple Alternative
Hash indexes store a hash of the indexed value mapped to the row location. Lookup is O(1) for equality queries - faster in theory than the O(log n) B-tree lookup.
But hashing destroys ordering. A hash index cannot serve range queries, prefix matches, or ORDER BY. It is exclusively for WHERE column = exact_value. For high-cardinality columns where you only ever query for exact values - UUIDs, session tokens, email addresses - hash indexes seem ideal.
In practice, most engineers default to B-tree even for equality-only workloads. The reasons: PostgreSQL’s hash indexes were historically not crash-safe (not WAL-logged until version 10 - the WAL, or Write-Ahead Log, is the journal the database writes changes to before applying them, ensuring they survive crashes; anything not WAL-logged can be lost or corrupted on an unexpected shutdown). More fundamentally, a B-tree equality lookup is log₂(n) - at 50 million rows that is 26 comparisons, which takes microseconds. The theoretical advantage of hash lookup over B-tree lookup is barely measurable in practice. The inflexibility of hash indexes often is not worth the marginal gain.
Discomfort Check: More Indexes Does Not Mean Faster
This is the most common index misconception. Indexes have a write cost. Every INSERT, UPDATE, or DELETE must update every index on that table. A table with eight indexes takes roughly eight times as many disk writes as an unindexed table for every write operation.
For read-heavy tables (a product catalog, a user profile store), this is usually acceptable. For write-heavy tables - a time-series events table ingesting thousands of rows per second, an order processing table under peak load - excessive indexing directly degrades write throughput.
Concrete example: a logging table that receives 10,000 inserts per second. Adding six indexes to support various analytics queries may reduce the sustainable insert rate to 3,000 per second. You have improved read performance at the cost of crippling write performance.
pg_stat_user_indexes shows how often each index is actually used. Indexes with zero or near-zero usage should be candidates for removal. pg_stat_user_tables shows the ratio of sequential scans to index scans - if a table has high sequential scan count despite having indexes, either the indexes are not matching queries or the planner correctly deems them unhelpful.
Composite Indexes and the Leftmost Prefix Rule
A composite index covers multiple columns: CREATE INDEX idx_orders_user_status ON orders(user_id, status).
The order of columns is not arbitrary. This index helps:
WHERE user_id = 42- the leftmost column alone is a valid prefixWHERE user_id = 42 AND status = 'shipped'- both columns, in order
This index does not help:
WHERE status = 'shipped'- starts with the second column, skipping the first
Think of a physical phone book sorted by (last name, first name). You can find everyone named “Smith” (leading column). You can find “Smith, Megha” (both columns). But finding everyone named “Megha” without knowing their last name requires reading the entire book - the second column is only useful after the first has constrained the search space.
For query optimization: put equality conditions first, range conditions last. A query WHERE user_id = 42 AND created_at > '2024-01-01' is best served by (user_id, created_at). User ID equality narrows to one user’s rows, then the B-tree’s ordered structure scans the created_at range efficiently within that subset.
Covering Indexes and Index-Only Scans
After a B-tree lookup finds the matching key, the database normally does a second lookup - a “heap fetch” - to retrieve the actual row from the main table. This is a second random disk read per result row.
A covering index includes all columns a query needs in the index itself, eliminating the heap fetch entirely:
-- Index on (user_id, status, created_at)
SELECT status, created_at FROM orders WHERE user_id = 42;
-- PostgreSQL EXPLAIN shows: "Index Only Scan" - no heap access
The tradeoff: wider indexes consume more storage and have higher write overhead, since more columns must be updated in the index when the row changes. Measure before adding extra columns to indexes purely for coverage.
Partial Indexes: Indexing a Subset
A partial index covers only rows matching a predicate:
CREATE INDEX idx_orders_pending ON orders(created_at)
WHERE status = 'pending';
If your application frequently queries WHERE status = 'pending' AND created_at > X, this index is far smaller than a full index on created_at - it only contains pending orders, not the vastly larger set of completed and cancelled orders. Smaller index means faster scans, less write overhead, and less memory pressure on the buffer cache.
Partial indexes are underused. Common patterns that benefit: WHERE is_deleted = false (active records only), WHERE status IN ('processing', 'pending') (open transactions), WHERE account_tier = 'premium' (premium features for a small percentage of users).
Cardinality: When Indexes Make Things Worse
An index on a boolean column has cardinality 2. Suppose is_active is true for 60% of rows. An index scan on WHERE is_active = true must follow index pointers to 60% of the rows scattered across the heap. Accessing 60% of a heap scattered randomly is slower than a single sequential scan.
The query planner knows this. It estimates the selectivity of the predicate and chooses the plan with the lowest expected cost. When you run EXPLAIN and find the planner chose a sequential scan despite an existing index, it is usually correct - do not force it.
The threshold is roughly: if a predicate matches more than 5 - 10% of rows, a sequential scan is often faster than an index scan (the exact threshold depends on table size, page cache warmth, and hardware).
This is why composite indexes with a high-cardinality leading column (user_id, UUID, email) are effective: they narrow the result set dramatically before the scan begins.
Worked Example: Identifying Indexes for a Real Table
This is the skill that actually matters. Given a schema and a set of queries, how do you decide what to index and in what order?
The schema:
CREATE TABLE orders (
id BIGSERIAL PRIMARY KEY,
user_id BIGINT NOT NULL,
status VARCHAR(20) NOT NULL, -- 'pending', 'processing', 'shipped', 'delivered', 'cancelled'
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
total_cents INT NOT NULL
);
Assume 50 million rows. 10 million distinct users (average 5 orders per user). Status distribution: ‘pending’ 2%, ‘processing’ 3%, ‘shipped’ 15%, ‘delivered’ 75%, ‘cancelled’ 5%.
The queries the application actually runs:
-- Q1: Load a user's order history (every page load for the orders page)
SELECT id, status, created_at, total_cents
FROM orders
WHERE user_id = $1
ORDER BY created_at DESC
LIMIT 20;
-- Q2: Check a specific order belongs to a user (security check before showing it)
SELECT id FROM orders WHERE id = $1 AND user_id = $2;
-- Q3: Background worker - pick up pending orders to process
SELECT id, user_id FROM orders WHERE status = 'pending';
-- Q4: Admin dashboard - revenue in the last hour
SELECT COUNT(*), SUM(total_cents) FROM orders
WHERE created_at > NOW() - INTERVAL '1 hour';
-- Q5: User's orders filtered by status (less frequent than Q1)
SELECT id, created_at, total_cents
FROM orders
WHERE user_id = $1 AND status = $2
ORDER BY created_at DESC;
Analyzing each query:
Q1 - WHERE user_id = $1 ORDER BY created_at DESC LIMIT 20
user_id has high cardinality: 10 million distinct values across 50 million rows. A user has on average 5 orders. The index needs to locate those 5 rows out of 50 million. This is the most frequent query in the application.
The ORDER BY tells you something important: if your index is (user_id, created_at DESC), the database can use the index to both filter by user_id and return rows in the right order without a separate sort step. Compare: with only an index on user_id, the database finds the user’s rows and then sorts them. With (user_id, created_at DESC), the rows come back sorted from the index directly.
Candidate: CREATE INDEX idx_orders_user_created ON orders(user_id, created_at DESC);
Q2 - WHERE id = $1 AND user_id = $2
id is the primary key - PostgreSQL creates a B-tree index on it automatically. The database finds the row by its primary key in one index lookup, then verifies user_id matches. One index lookup, one comparison. No additional index needed.
Q3 - WHERE status = 'pending'
Status has cardinality 5. Only 2% of rows are ‘pending’ - roughly 1 million rows out of 50 million. Two percent is selective enough that an index helps (below the 5-10% threshold).
But do not create a plain index on status. A full index on status contains all 50 million rows. When the planner uses it to find ‘pending’ rows, it follows 1 million pointers scattered across the heap. Every other status value in the index is dead weight you are paying write overhead to maintain.
Better: a partial index that only indexes rows where status = 'pending'. This index contains only those 1 million rows - it is 50 times smaller, faster to scan, and has lower write overhead (only inserts and updates involving ‘pending’ rows touch it).
Candidate: CREATE INDEX idx_orders_pending ON orders(id) WHERE status = 'pending';
When you update a row from ‘pending’ to ‘processing’, PostgreSQL removes it from this index automatically.
Q4 - WHERE created_at > NOW() - INTERVAL '1 hour'
A range query on created_at run by admins. At 50 million rows inserted over, say, 3 years, one hour’s worth is roughly 0.001% of the table - around 500 rows. Very selective. An index on created_at lets the B-tree find the right starting point and scan forward the short distance to now.
This query is run infrequently (admin dashboard). You might choose to add the index anyway because admin dashboards tend to be slow and complained about. Or you might skip it if the table is very write-heavy and the write overhead matters.
Candidate (if deemed worth it): CREATE INDEX idx_orders_created ON orders(created_at);
Q5 - WHERE user_id = $1 AND status = $2 ORDER BY created_at DESC
Two equality conditions plus a sort. The (user_id, created_at DESC) index from Q1 already handles this partially: it uses user_id equality, then scans created_at in order, filtering status as it goes. For most users this is fine because a user has few enough orders that the status filter is cheap even if it is not using an index.
If Q5 is measured as slow in production (a specific user with tens of thousands of orders, or status filtering that discards 90% of results), you could add (user_id, status, created_at DESC). But start with the Q1 index and see if it is needed.
Final index decisions:
-- Serves Q1 (order history page), Q5 (filtered by status)
-- Leftmost prefix user_id also serves any future WHERE user_id = $1 queries
CREATE INDEX idx_orders_user_created ON orders(user_id, created_at DESC);
-- Serves Q3 (pending order worker) - partial, 50x smaller than a full status index
CREATE INDEX idx_orders_pending ON orders(id) WHERE status = 'pending';
-- Serves Q4 (admin dashboard) - add only if admin queries are a pain point
CREATE INDEX idx_orders_created ON orders(created_at);
What you deliberately did not index:
total_cents- it appears only in SELECT, never in WHERE. Indexing it helps nothing.- A full index on
status- the partial index is strictly better for this workload. (user_id, status)withoutcreated_at- Q5 still needs a sort step; the three-column composite is only worth it if measured as a bottleneck.
The three questions to ask for any table:
-
What WHERE clauses hit this table, and how often? Pull from query logs or application code. Frequency matters - a slow query run once a day may be acceptable; the same query run 10,000 times per minute is a crisis.
-
What is the cardinality and selectivity of each predicate? Run
SELECT COUNT(DISTINCT column) FROM tableto see cardinality. A predicate that returns 0.01% of rows is highly selective and benefits from an index. A predicate that returns 60% of rows does not. -
What is the read/write ratio of this table? A product catalog with 100:1 read/write can tolerate many indexes. An events table with 1:100 read/write should have as few indexes as possible.
Full-Text Search: GIN Indexes
GIN (Generalized Inverted Index) handles composite values: arrays, JSONB, and full-text. For full-text search, a GIN index builds an inverted index - a mapping from each lexeme (a stemmed, normalized word form - “running,” “ran,” and “runs” all reduce to the lexeme “run”) to the rows containing it.
-- Create a GIN index for full-text search
CREATE INDEX idx_articles_fts ON articles
USING GIN(to_tsvector('english', title || ' ' || body));
-- Use it
SELECT title FROM articles
WHERE to_tsvector('english', title || ' ' || body)
@@ to_tsquery('english', 'database & performance & index');
GIN indexes update more slowly than B-trees - they batch internal writes. The FASTUPDATE option buffers pending insertions in a separate list and flushes them periodically, improving write performance at the cost of slightly higher worst-case query time during a flush. For high-ingestion workloads, tune FASTUPDATE and vacuum settings carefully.
For search requirements beyond PostgreSQL’s built-in full-text - fuzzy matching, multilingual analysis, relevance ranking at scale - Elasticsearch and OpenSearch provide dedicated inverted index engines with richer query capabilities.
Reading EXPLAIN ANALYZE
EXPLAIN ANALYZE is the most important diagnostic tool for query performance. It shows the actual execution plan and real timing:
EXPLAIN ANALYZE
SELECT user_id, SUM(amount) FROM orders
WHERE user_id = 42 AND created_at > NOW() - INTERVAL '30 days'
GROUP BY user_id;
Output (simplified):
GroupAggregate (cost=0.56..12.34 rows=1 width=16)
(actual time=0.234..0.241 rows=1 loops=1)
-> Index Scan using idx_orders_user_created on orders
(cost=0.56..12.20 rows=28 width=16)
(actual time=0.055..0.198 rows=28 loops=1)
Index Cond: (user_id = 42 AND created_at > ...)
Planning Time: 0.8 ms
Execution Time: 0.3 ms
Key things to read:
- cost=startup..total: planner’s estimate in abstract units. High startup cost = first rows are delayed.
- actual time=X..Y: real milliseconds. If these diverge wildly from cost estimates, run
ANALYZE table_nameto refresh statistics. - rows estimate vs actual: if the planner estimated 28 rows but found 2.8 million, it made a catastrophically wrong plan decision. Stale statistics cause this.
- Seq Scan on a large table: almost always indicates a missing index or a predicate too broad for the index to help.
- Nested Loop with many loops: in a join, a nested loop that loops 50,000 times over an inner scan is usually the sign of a missing index on the join column.
Cloud Connections: PostgreSQL vs MySQL vs Aurora
PostgreSQL and MySQL both default to B-tree indexes for everything that is not explicitly specified otherwise. The difference is in the storage engine.
MySQL InnoDB stores table rows in a clustered index - the primary key is the physical storage order of the table. Queries on the primary key are always index-only scans. Secondary indexes store the primary key value as the row pointer, meaning a secondary index lookup involves two B-tree traversals: one to find the primary key, one to fetch the row. This is generally fine, but it means you want to be deliberate about primary key choice - a UUID primary key on a large table causes random write ordering that hurts insert performance.
PostgreSQL’s heap-based storage separates the index from the row data. Clustered indexes are possible (CLUSTER table_name USING index_name) but are a one-time operation, not maintained on subsequent writes. PostgreSQL uses MVCC (Multi-Version Concurrency Control) to handle concurrent reads and writes without blocking - rather than locking a row when it is being updated, the database keeps multiple versions of that row simultaneously so readers can still see the old version while a write is in progress. The side effect: the heap accumulates dead row versions (old versions no longer visible to any transaction). VACUUM is the background process that reclaims this space. Understanding vacuum tuning is essential for high-write PostgreSQL deployments.
AWS RDS runs PostgreSQL and MySQL as managed services, handling backups, failover, and replication. One important RDS behavior: read replicas have a small replication lag. Queries routed to read replicas may see stale data. For reads that must be current (read-your-own-write scenarios), route to the primary.
Amazon Aurora reimplements the storage engine. Writes go to a distributed storage layer that replicates across 6 copies in 3 availability zones. Aurora replicas read from the same storage as the primary, meaning replication lag is typically under 10ms (versus seconds for RDS read replicas). For OLTP workloads with strict consistency requirements on reads, Aurora’s replica architecture is a significant advantage.
The Critique: ORMs and the N+1 Problem
An ORM (Object-Relational Mapper) is a library that lets you write database queries as method calls on objects rather than raw SQL - user.orders instead of SELECT * FROM orders WHERE user_id = $1. Most web frameworks include one: Django’s ORM, SQLAlchemy, ActiveRecord, Hibernate. They are convenient but they hide what SQL is actually being generated, which is where the trouble starts.
ORM-generated queries are the primary source of index-related performance problems in application code. The N+1 problem is endemic:
# Fetches 100 users - 1 query
users = session.query(User).limit(100).all()
# For each user, fetches their order count - 100 separate queries
for user in users:
print(user.name, len(user.orders)) # lazy loading triggers a query per user
# Total: 101 queries. Should be 1 or 2.
The fix is explicit eager loading - tell the ORM to join or subquery-fetch related data upfront. But this requires awareness of the problem. Many engineers do not realize lazy-loaded ORM relationships generate per-row queries until they profile production traffic and discover hundreds of queries per page load.
ORMs also generate suboptimal SQL in less obvious ways: unnecessary SELECT * when only two columns are needed, missing LIMIT clauses on unfiltered queries, unnecessary ORDER BY on result sets that do not need ordering (which forces a sort operation). The discipline of reviewing ORM-generated SQL in development - logging queries, using tools like Django Debug Toolbar or SQLAlchemy’s echo=True - catches these problems before production.
COUNT(*) vs COUNT(column)
COUNT(*) counts rows. COUNT(column_name) counts non-NULL values in that column. This is not just a semantic distinction - it has real performance implications.
When you write COUNT(*), the database does not need to read the value of any column. It only needs to verify that a row exists. InnoDB can satisfy COUNT(*) by scanning the smallest available secondary index rather than the full table. When you write COUNT(email), the database must read the email column for every row and check whether it is NULL.
-- COUNT(*): counts all rows, no column reads needed
SELECT COUNT(*) FROM users;
-- COUNT(email): counts rows where email IS NOT NULL
-- may return a smaller number if email can be null
SELECT COUNT(email) FROM users;
Use COUNT(*) unless you specifically need to exclude NULL values. COUNT(1) is not faster than COUNT(*) - every modern database treats them identically.
Future Outlook
Learned indexes replace B-tree structures with machine learning models trained on the data distribution. A simple linear model can map a sorted column’s range to disk positions more compactly than a B-tree if the data is uniform. Research continues (Google’s “The Case for Learned Index Structures,” 2018), and learned indexes are entering production in specialized read-heavy systems.
Columnar storage inverts the fundamental layout assumption. A row-oriented database (PostgreSQL, MySQL) stores all columns for a row together. A columnar store (DuckDB, Apache Parquet, BigQuery, Redshift) stores each column contiguously. For analytics queries that read one or two columns from millions of rows (SELECT AVG(revenue) FROM orders WHERE year = 2024), columnar storage is dramatically faster - it reads only the relevant columns and compresses columns of the same type far more efficiently than mixed-type rows.
Two terms worth knowing here: OLTP (Online Transaction Processing) describes the kind of workload a user-facing application generates - many small, fast reads and writes, one user’s data at a time. OLAP (Online Analytical Processing) describes analytics workloads - a few very expensive queries that scan large portions of the table to produce aggregates and reports. Row-oriented databases are optimized for OLTP; columnar stores are optimized for OLAP.
For workloads that mix both, the architecture is increasingly to write to PostgreSQL for transactional workloads and export to Parquet/DuckDB for analytical queries - rather than trying to serve both from one database.
| Concept | Key Point |
|---|---|
| Full table scan | O(n); reads every row; fast for tiny tables, catastrophic at scale |
| B-tree | O(log n); sorted; supports equality, range, prefix, ORDER BY |
| B+ tree | Data only at leaves; leaves linked for range scan efficiency |
| B-tree vs BST | B-tree has high branching factor (100 - 200); fits disk pages; fewer random reads |
| Hash index | O(1) equality only; no range queries; rarely worth it over B-tree |
| Write amplification | Every index adds a write on INSERT/UPDATE/DELETE; more indexes = slower writes |
| Composite index | Column order matters; equality columns first, range columns last; leftmost prefix rule |
| Covering index | All needed columns in the index; eliminates heap fetch |
| Partial index | Index a subset of rows; smaller, faster, lower write overhead |
| Cardinality | Low cardinality (booleans) makes indexes counterproductive for large result sets |
| GIN | Inverted index for arrays, JSONB, full-text search |
| EXPLAIN ANALYZE | Shows actual plan and timing; stale statistics cause bad plan estimates |
| When to index | High-cardinality WHERE columns, frequent queries, selectivity under ~10% |
| ORM N+1 | Lazy-loaded relationships generate per-row queries; use eager loading |
| Aurora vs RDS | Aurora replicas read shared storage; ~10ms lag vs seconds for RDS replicas |
| Columnar stores | Column-per-file layout; DuckDB, Parquet; ideal for analytics queries |
| COUNT(*) vs COUNT(col) | COUNT(*) skips reading column values; COUNT(col) must check for NULLs |
Read Next: