Search & ElasticSearch - Indexing the World So It Can Be Found // Megha Bose

Helpful context:

Databases & Indexes - The Structures That Make Queries Fast

You type “nearby pizza” into Yelp. In under 100 milliseconds, you get twenty results - ranked by relevance, sorted partly by distance, filtered by whether they’re open right now. Yelp has indexed over 8 million businesses. The query touched none of them directly; it touched a data structure specifically designed so that it never has to.

This is the inverted index, and it is the reason search engines exist as a separate category of infrastructure from relational databases.

Why Not Just Use a Database?

A SQL LIKE '%pizza%' query cannot use an ordinary B-tree index because of the leading wildcard, so it falls back to a full table scan. With 8 million rows, each containing text fields of variable length, that is not a query that returns in 100ms without extraordinary hardware. And “nearby” makes it worse: without a spatial index, finding the nearest N businesses requires computing distances to a large number of candidates.

Relational databases optimize for structured queries over normalized data with well-defined relationships. Search is a different problem: it’s matching user intent (often fuzzy, often misspelled, often in natural language) against a corpus of documents, returning results ranked by relevance rather than by a deterministic predicate. The data structures required are fundamentally different.

The Inverted Index

The inverted index flips the data model. Instead of storing documents and scanning them at query time, the index maps each term to the list of documents containing it.

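When a document is indexed, its text goes through an analyzer. A typical English analysis chain tokenizes the text (splits it into words), lowercases it, removes stop words (“the”, “a”, “of”), and often stems what remains (“running” → “run”). Which steps run depends on the analyzer configured for the field: Elasticsearch’s default standard analyzer tokenizes and lowercases but does not remove stop words or stem. The result is a list of tokens. For each token, the document’s ID is added to that token’s posting list.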
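The analysis chain and the posting lists it feeds can be sketched in a few lines. This is a toy illustration, not Lucene's analyzer: the stop-word set, the crude `stem` function, and the sample documents are all hypothetical stand-ins (real stemmers such as Porter's handle far more cases).

```python
import re
from collections import defaultdict

STOP_WORDS = {"the", "a", "an", "and", "of", "in"}  # illustrative subset only


def stem(token: str) -> str:
    """Crude suffix stripping; a stand-in for a real stemmer."""
    if token.endswith("ing") and len(token) > 5:
        base = token[:-3]
        # Undo a doubled final consonant: "runn" -> "run".
        if len(base) >= 2 and base[-1] == base[-2]:
            base = base[:-1]
        return base
    if token.endswith("s") and not token.endswith("ss") and len(token) > 3:
        return token[:-1]
    return token


def analyze(text: str) -> list[str]:
    """Tokenize, lowercase, drop stop words, then stem."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [stem(t) for t in tokens if t not in STOP_WORDS]


def build_index(docs: dict[int, str]) -> dict[str, list[int]]:
    """Map each term to a sorted posting list of document IDs."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for token in analyze(text):
            postings[token].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


docs = {
    1: "The best pizza in Brooklyn",
    2: "Pizzas and pasta",
    3: "Running a pasta restaurant",
}
index = build_index(docs)
```

Note that the same `analyze` function has to run on the query text as well; otherwise a search for “Pizzas” would never match the stored term “pizza”.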

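A query for “pizza” looks the term up in the index’s term dictionary and returns its posting list - potentially millions of document IDs - without touching the documents themselves. A query for “pizza restaurant” (treated as an AND) intersects the posting list for “pizza” with the posting list for “restaurant.” Because posting lists are kept sorted by document ID, a linear merge intersects them in O(m + n) time, and skip pointers let the engine jump through the longer list so that in practice the cost is driven by the shorter one.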
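Intersecting two sorted posting lists is a two-pointer merge. The sketch below is the plain linear version, which runs in time proportional to the combined length of the lists; production engines add skip structures so they can leap over long runs of non-matching IDs. The document IDs are made up for illustration.

```python
def intersect(a: list[int], b: list[int]) -> list[int]:
    """Return the IDs present in both sorted posting lists."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1  # advance whichever list is behind
        else:
            j += 1
    return out


pizza = [2, 5, 9, 14, 21]
restaurant = [1, 5, 7, 14, 30]
both = intersect(pizza, restaurant)  # documents containing both terms
```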

This is why Elasticsearch can return results in under 100ms across millions of documents when a SQL LIKE query on the same dataset could take many seconds or longer. The work happened at index time, not at query time.

Relevance: TF-IDF and BM25

Not all matches are equal. A restaurant review that mentions “pizza” twenty times is more relevant to a pizza query than a blog post that mentions pizza once in passing. Term Frequency captures this: the more times a term appears in a document, the more relevant the document is for that term.

But a word that appears in every document is not useful for distinguishing relevance. “The” appears in every English document and tells you nothing. Inverse Document Frequency captures this: a term is more discriminating when it appears in fewer documents. High IDF means the term is rare and therefore more informative when it matches.

TF-IDF combines these two signals. Elasticsearch’s default scorer is BM25, a probabilistic refinement of the same idea. BM25 adds two tunable parameters: $k_1$ (default 1.2) controls how term frequency saturates (in classic TF-IDF, relevance grows linearly with term count; BM25 adds diminishing returns - the tenth mention of “pizza” adds less than the first), and $b$ (default 0.75) controls field-length normalization (a document that mentions pizza once in a 50-word review is more pizza-focused than one that mentions it once in a 5,000-word essay). BM25 generally outperforms raw TF-IDF in search quality benchmarks.
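To make the saturation and length-normalization effects concrete, here is a minimal sketch of the per-term BM25 score, using the commonly published form of the formula and the Lucene-style IDF. The function name and the sample numbers are assumptions for illustration; some implementations drop the constant (k1 + 1) factor, which does not change the ordering of results.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    """Score one query term against one document.

    tf: occurrences of the term in the document
    doc_freq: number of documents containing the term
    """
    idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    length_norm = 1 - b + b * (doc_len / avg_doc_len)
    return idf * (tf * (k1 + 1)) / (tf + k1 * length_norm)


# Hypothetical corpus statistics: 1,000 docs, "pizza" appears in 50 of them.
stats = dict(avg_doc_len=100, n_docs=1000, doc_freq=50)

first_mention = bm25_term_score(1, 100, **stats) - bm25_term_score(0, 100, **stats)
tenth_mention = bm25_term_score(10, 100, **stats) - bm25_term_score(9, 100, **stats)

short_review = bm25_term_score(1, 50, **stats)
long_essay = bm25_term_score(1, 5000, **stats)
```

Here `first_mention` is much larger than `tenth_mention` (saturation, governed by k1), and `short_review` scores higher than `long_essay` (length normalization, governed by b).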

Elasticsearch Architecture

Elasticsearch distributes the inverted index across a cluster. Documents are organized into indices (conceptually similar to tables). Each index is split into shards - the unit of parallelism and distribution. Shards are spread across nodes so that a query can execute in parallel across multiple machines, each returning partial results that are merged on the coordinating node.
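The merge step on the coordinating node can be pictured as a k-way merge of per-shard result lists. This is a simplified sketch under the assumption that every shard has already returned its own hits sorted by descending score; the scores and document IDs are invented.

```python
import heapq
from itertools import islice


def merge_top_k(shard_results, k):
    """Merge per-shard hit lists (each sorted by descending score) into a global top k."""
    # heapq.merge expects ascending order, so sort on the negated score.
    merged = heapq.merge(*shard_results, key=lambda hit: -hit[0])
    return list(islice(merged, k))


shard_0 = [(9.1, "a"), (4.2, "d")]
shard_1 = [(7.5, "b"), (7.0, "c")]
shard_2 = [(8.3, "e")]

top3 = merge_top_k([shard_0, shard_1, shard_2], k=3)
```

Each shard only has to return its own top k, not every match, which is what keeps the coordinating node's work small.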

Each primary shard can also have one or more replicas (the default is one). Replicas serve read requests (increasing read throughput) and provide fault tolerance: if a primary shard’s node goes down, the master node promotes one of its replicas to primary. With 3 nodes and one replica per shard, you can lose any one node without losing data.

The write path: a document write is routed to the primary shard first. The primary indexes it locally, then forwards it to the in-sync replicas and acknowledges the client once they have responded. Each shard copy makes new documents searchable on its own refresh schedule, however, which creates a brief window where the document is visible on one copy but not on another. If a search hits a replica during this window, it may not see the document.

Near-Real-Time Indexing: The 1-Second Gap

Elasticsearch is “near-real-time” in a specific way that surprises new users. A freshly indexed document is not immediately searchable. Lucene writes new documents to an in-memory buffer; every second (by default), this buffer is “refreshed” - written out as a new segment and opened for search. (A refresh is distinct from a flush, which commits segments durably to disk.) The refresh interval is configurable via index.refresh_interval, but below 1 second you start to pay significant performance costs.

This 1-second gap matters for use cases like search-as-you-type for content you just created. If a user posts a listing on Craigslist and then immediately searches for it, they may not find it. Systems that need read-your-writes consistency after a write must either wait for the refresh, force a refresh (expensive at scale), or use the primary store (the database) for immediate reads and Elasticsearch only for search queries.
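The gap is easy to see in a toy model. The class below is not Lucene; it is a hypothetical sketch that keeps an unsearchable buffer and a list of searchable segments, with `refresh` moving documents from one to the other.

```python
class ToyShard:
    """Toy model of near-real-time indexing: buffer first, searchable after refresh."""

    def __init__(self):
        self._buffer = []    # indexed but not yet searchable
        self._segments = []  # immutable, searchable batches

    def index(self, doc_id, text):
        self._buffer.append((doc_id, text))

    def refresh(self):
        # Turn the buffered documents into a new searchable segment.
        if self._buffer:
            self._segments.append(tuple(self._buffer))
            self._buffer = []

    def search(self, term):
        return [
            doc_id
            for segment in self._segments
            for doc_id, text in segment
            if term in text.lower().split()
        ]


shard = ToyShard()
shard.index(1, "Vintage bike for sale")
before = shard.search("bike")  # the listing is still in the buffer
shard.refresh()
after = shard.search("bike")   # now visible
```

In Elasticsearch itself, an index request can carry `refresh=wait_for`, which holds the response until the next refresh has made the document visible; that is usually cheaper than forcing a refresh on every write.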

Geospatial Search: Finding “Nearby”

Geohashing

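A geohash encodes a latitude/longitude pair as a string in which a shared prefix implies proximity. The longer the shared prefix, the smaller the cell both locations are guaranteed to fall in. The converse does not hold: two points on either side of a cell boundary can be close together yet share only a short prefix. The string u4pruydq represents a small area in northern Denmark; u4pruy is a larger area around it; u4p is a broad region. Searching within a geohash prefix finds all points in the corresponding geographic cell.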
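The encoding itself is a short loop: alternately halve the longitude and latitude ranges, emit one bit per halving, and pack every five bits into a base-32 character. The sketch below follows the standard public geohash scheme; the function name is my own.

```python
_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)


def geohash_encode(lat: float, lon: float, precision: int = 8) -> str:
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits = 0
    bit_count = 0
    use_lon = True  # bits alternate, starting with longitude
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits = (bits << 1) | 1
                lon_lo = mid
            else:
                bits <<= 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits = (bits << 1) | 1
                lat_lo = mid
            else:
                bits <<= 1
                lat_hi = mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # five bits make one base-32 character
            chars.append(_BASE32[bits])
            bits = 0
            bit_count = 0
    return "".join(chars)


coarse = geohash_encode(57.64911, 10.40744, precision=3)  # "u4p"
fine = geohash_encode(57.64911, 10.40744, precision=8)    # extends the same prefix
```

Because each extra character only subdivides the current cell, a longer hash always starts with the shorter one, which is what makes prefix queries work.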

Cell-based indexing of this kind is a common way for ride-hailing and delivery services to match drivers to riders in real time (Uber, for example, built its own hexagonal grid system, H3, for the purpose). Every driver’s location is indexed by the geohash of the cell it currently falls in. A rider request computes the geohash of the pickup location and queries for drivers in that cell and its eight neighbors, since a nearby driver may sit just across a cell boundary. Each cell lookup is a string prefix scan, not a distance computation over every driver in the database.

Elasticsearch Geo Queries

Elasticsearch supports several geo query types over geo_point fields:

geo_distance: find all documents within a given radius of a point. This is the typical “nearby pizza” query - center on the user’s location, set a radius of 2km, filter to open restaurants, sort by a combination of relevance score and distance.
geo_bounding_box: find documents within a rectangular region defined by top-left and bottom-right coordinates. Used in map-based search where the user sees a map viewport and wants results within it.
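geo_polygon: find documents within an arbitrary polygon - useful for neighborhood-level queries (“restaurants in SoHo”). Recent Elasticsearch versions deprecate this query in favor of the geo_shape query, which also accepts polygons.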

Geo filters in Elasticsearch run in filter context: they don’t affect relevance scores and are heavily cached, making them much cheaper than scoring queries. A compound query might use a full-text match query in the must clause (affecting score) and a geo_distance filter in the filter clause (not affecting score, cached), combining relevance ranking with geographic filtering efficiently.
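As a sketch, the request body for such a compound query could be built like this. The field names `name`, `location`, and `is_open` are hypothetical and assume a mapping where `location` is a geo_point; a real system would more likely derive “open right now” from stored opening hours.

```python
import json


def nearby_query(text: str, lat: float, lon: float, radius: str = "2km") -> dict:
    """Build a bool query: scored full-text match plus unscored geo and status filters."""
    return {
        "query": {
            "bool": {
                # Scoring context: contributes to the relevance score.
                "must": [{"match": {"name": text}}],
                # Filter context: yes/no only, no effect on the score.
                "filter": [
                    {
                        "geo_distance": {
                            "distance": radius,
                            "location": {"lat": lat, "lon": lon},
                        }
                    },
                    {"term": {"is_open": True}},
                ],
            }
        }
    }


body = nearby_query("pizza", lat=40.7306, lon=-73.9866)
payload = json.dumps(body)  # what would be sent to the _search endpoint
```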

How Airbnb Uses Elasticsearch

Airbnb’s search problem is a canonical hard case: users search for accommodation with structured filters (dates, guest count, amenities) and unstructured text (“oceanview cabin with fireplace”), ranked by relevance plus a proprietary score that factors in host quality, booking history, and price competitiveness.

Elasticsearch handles the text search and geospatial filtering. Airbnb’s ranking layer adds a learned ranking model on top: Elasticsearch returns the top N candidates (by combined text and geo score), and a separate ML service re-ranks them by the fuller feature set. The Elasticsearch query does the heavy lifting of eliminating irrelevant candidates from millions of listings; the ML model does the fine-grained ranking over a smaller set.

This pattern - Elasticsearch for candidate retrieval, ML for re-ranking - is common across search-intensive products. Elasticsearch is good at “find everything relevant”; ML ranking is good at “from the relevant set, order by expected utility.”

AWS OpenSearch vs Elasticsearch vs Algolia

AWS OpenSearch is Amazon’s fork of Elasticsearch (forked from version 7.10 in 2021, after Elastic moved the license from Apache 2.0 to SSPL and the Elastic License). For AWS-native deployments, the managed Amazon OpenSearch Service integrates natively with IAM, VPC, CloudWatch, and other AWS services. If you’re already running on AWS and don’t need Elasticsearch-specific features, OpenSearch is the pragmatic choice.

Elasticsearch (self-managed or via Elastic Cloud) gives you the full Elastic Stack - Kibana for visualization, Logstash for ingestion pipelines, APM. If you need the complete observability ecosystem or are not AWS-bound, Elastic Cloud is competitive.

Algolia is not open source and is not self-hosted, but it is the simplest operational path. Algolia provides a managed search API with excellent developer experience, sub-10ms response times, and built-in relevance tuning. The tradeoff: it’s expensive at scale and you have less control over the ranking model. For many product teams, the operational burden of running Elasticsearch clusters (shard sizing, heap tuning, cluster upgrades) makes Algolia’s cost worth it.

Typesense is an open-source, self-hostable Algolia alternative. It is simpler to operate than Elasticsearch and is optimized for e-commerce and instant-search use cases, with built-in typo tolerance.

The Operational Complexity Problem

Elasticsearch is operationally intensive. The common failure modes:

Shard sizing: each shard is a Lucene index. Shards that are too large (over ~50GB) cause slow queries and difficult recovery from node failures. Too many small shards cause overhead on the master node, which must track all shard metadata. The right shard count depends on your document volume and expected query patterns - and it’s hard to change after the fact, because the number of primary shards is fixed when the index is created; changing it means reindexing or using the split and shrink APIs.

Heap tuning: Elasticsearch runs on the JVM, and the heap must be sized carefully: too small and the JVM collects garbage constantly or runs out of memory; too large and individual GC pauses get long while less RAM is left for the filesystem cache. The rule of thumb is to keep the heap below roughly 32GB (to stay under the JVM’s compressed oops threshold) and at no more than half of the machine’s RAM, leaving the rest for the filesystem cache (which Lucene relies on heavily).

Mapping explosions: Elasticsearch’s dynamic mapping automatically creates new fields when new JSON keys are indexed. In a logs-over-time use case where log structures are inconsistent, this can create thousands of fields per index - a “mapping explosion” that causes master node instability. The fix is explicit mappings that define the schema in advance and set dynamic: false (unknown fields are ignored rather than indexed) or dynamic: strict (documents containing unknown fields are rejected).
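A minimal sketch of what such an index definition might look like, written as the Python dict you would send as the create-index request body. The field names and the limit of 200 are hypothetical; `index.mapping.total_fields.limit` is the setting that caps how many fields an index may have.

```python
import json

log_index_body = {
    "settings": {
        # Hard cap on field count; exceeding it fails the mapping update.
        "index.mapping.total_fields.limit": 200,
    },
    "mappings": {
        # Keys not listed under "properties" are kept in _source but not indexed.
        "dynamic": False,
        "properties": {
            "@timestamp": {"type": "date"},
            "level": {"type": "keyword"},
            "service": {"type": "keyword"},
            "message": {"type": "text"},
        },
    },
}

request_json = json.dumps(log_index_body, indent=2)
```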

Split-brain: in older Elasticsearch versions, a network partition could cause two halves of a cluster to each elect a master node - each half accepts writes, and the data diverges. Setting discovery.zen.minimum_master_nodes to a quorum of the master-eligible nodes prevents this. From version 7.0 onward, a redesigned consensus-based cluster coordination layer (similar in spirit to Raft) manages the voting quorum automatically, and that setting no longer exists.
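The quorum arithmetic behind that protection is worth spelling out: a majority is more than half of the master-eligible nodes, so at most one side of any partition can reach it. A small sketch (function names are my own):

```python
def quorum(master_eligible_nodes: int) -> int:
    """Smallest number of nodes that constitutes a strict majority."""
    return master_eligible_nodes // 2 + 1


def can_elect_master(partition_size: int, master_eligible_nodes: int) -> bool:
    return partition_size >= quorum(master_eligible_nodes)


# A 5-node cluster split 3 / 2: only the larger side can elect a master.
majority_side = can_elect_master(3, 5)
minority_side = can_elect_master(2, 5)

# A 4-node cluster split 2 / 2: neither side can, so the cluster stalls
# rather than diverging. This is one reason odd node counts are preferred.
even_split = can_elect_master(2, 4)
```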

Vector Search: The Next Layer

Search is moving beyond keyword matching. A search for “comfortable shoes for standing all day” may not return the most relevant results from a keyword index - “comfortable” and “standing all day” may not appear verbatim in product descriptions. Semantic search uses dense vector embeddings: encode both the query and the documents into high-dimensional vectors using a language model; find the documents whose vectors are most similar to the query vector.
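Stripped of the language model, “most similar vectors” is just a similarity ranking. The sketch below does an exact, brute-force cosine-similarity search over three hand-made 3-dimensional vectors; the vectors and product names are invented, and real embeddings have hundreds or thousands of dimensions. Approximate indexes exist precisely because this exact scan does not scale.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest(query_vec, doc_vecs, k=2):
    """Exact k-nearest-neighbour search by scanning every document vector."""
    ranked = sorted(
        doc_vecs,
        key=lambda doc_id: cosine(query_vec, doc_vecs[doc_id]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical embeddings; no keyword overlap with the query is required.
doc_vecs = {
    "cushioned work clogs": [0.9, 0.8, 0.1],
    "steel-toe boots": [0.3, 0.9, 0.2],
    "espresso machine": [0.0, 0.1, 0.9],
}
query_vec = [0.8, 0.7, 0.0]  # stands in for "comfortable shoes for standing all day"

results = nearest(query_vec, doc_vecs, k=2)
```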

Elasticsearch added approximate nearest neighbor (ANN) search via the HNSW algorithm and dense_vector fields. Dedicated vector databases - Pinecone, Weaviate, Qdrant - are purpose-built for this workload and can offer better performance characteristics than Elasticsearch for pure vector search.

The current state of the art is hybrid search: combine sparse retrieval (BM25 on the inverted index) with dense retrieval (ANN on vector embeddings), and merge the results. Sparse retrieval is precise when keywords match; dense retrieval handles semantic similarity and synonyms. Together, they outperform either approach alone. A common way to merge the two result lists is Reciprocal Rank Fusion (RRF), which scores each document by its rank in each list rather than by the raw scores, which are not comparable across retrievers.
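Reciprocal Rank Fusion itself is a small formula: a document's fused score is the sum, over every result list it appears in, of 1 / (k + rank). A minimal sketch follows; k = 60 is the constant commonly used in the literature, and the hit lists are invented.

```python
def rrf(rankings, k=60):
    """Fuse several ranked lists of doc IDs with Reciprocal Rank Fusion."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


bm25_hits = ["d1", "d2", "d3"]  # sparse retrieval, best first
ann_hits = ["d3", "d1", "d4"]   # dense retrieval, best first

fused = rrf([bm25_hits, ann_hits])
```

Documents that rank well in both lists (d1, d3) rise above documents that appear in only one, and no score normalization between BM25 and vector similarity is needed.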

Retrieval-Augmented Generation (RAG) systems - LLMs that retrieve relevant context from a corpus before generating answers - are the dominant driver of vector search adoption. Elasticsearch and OpenSearch both support hybrid sparse+dense search for RAG use cases.

What Not to Use Elasticsearch For

Elasticsearch is a read-optimized search index. It is not a database. Common mistakes:

Using Elasticsearch as the primary store: Elasticsearch does not have ACID transactions. A single-document write is atomic, but there is no way to commit or roll back a group of writes together, and a bulk request can succeed for some documents and fail for others. Use a database as the source of truth; synchronize to Elasticsearch asynchronously.

Relying on Elasticsearch for strong consistency: the 1-second refresh gap and eventual replication mean Elasticsearch is not appropriate for workloads that require reading your own writes immediately.

Running complex aggregations on large datasets: Elasticsearch aggregations (group by, sum, count) work, but they hold intermediate results in heap memory. Very large aggregations can cause JVM heap pressure and cluster instability. For analytics over large datasets, a proper OLAP system (BigQuery, Redshift, ClickHouse) is the right tool.

Multi-Region Search

Running a globally distributed search cluster introduces consistency and latency tradeoffs. Elasticsearch’s cross-cluster replication (CCR) replicates indices from a primary cluster to follower clusters in other regions. Reads in each region hit the local follower cluster (low latency); writes go to the primary cluster and are replicated asynchronously. This means follower clusters may lag the primary by seconds - acceptable for search, where the 1-second refresh gap already exists.

For data sovereignty requirements - GDPR, for example, restricts transfers of EU personal data outside the EU/EEA, and many organizations respond by keeping that data in EU regions - the index architecture must be segmented by region. EU users’ data lives in an EU cluster; US users’ data lives in a US cluster. Queries that cross regions must either federate across clusters or be prohibited.

Future Outlook

The trend is toward hybrid sparse+dense search as the default, with hardware acceleration (GPU inference for embedding generation) making vector search fast enough for sub-100ms latency at scale. Multimodal search - searching across text, images, and video together - is the next frontier for platforms like Pinterest and YouTube.

For many product teams, managed vector databases (Pinecone, Weaviate Cloud) will replace self-managed Elasticsearch for pure semantic search use cases. Elasticsearch’s strength - the combination of full-text, structured, and geo search with deep observability integration - keeps it relevant for the complex, multi-faceted search problems that most product search involves.

Summary

Dimension	Elasticsearch	Algolia	Typesense	Pinecone/pgvector
Full-text search	Excellent	Excellent	Good	Poor
Geospatial	Excellent	Good	Basic	None
Vector/semantic	Good (HNSW)	None	Basic	Excellent
Operational complexity	High	None (managed)	Low	None (managed Pinecone); low to medium (pgvector, inside your own Postgres)
Cost at scale	Medium	High	Low	Medium
Use case	Complex search, logs, APM	Product search, SaaS	Self-hosted Algolia alternative	RAG, semantic search

Read Next:

Identity & OTP - Proving Who You Are Without Sharing a Password