A language model knows what it was trained on and nothing else. If its training data ends in January 2024 and you ask it about something that happened in February 2024, it will either confess ignorance or, more dangerously, hallucinate a plausible-sounding answer. The same problem applies to private knowledge - a company’s internal documentation, a user’s personal files, a specialized corpus that never made it into the training data.

The naive solution is: retrain the model, or fine-tune it on the new information. Both are expensive, slow, and have their own failure modes. A fine-tuned model tends to forget what it knew before (catastrophic forgetting), and retraining for every knowledge update is impractical.

The better solution is to not bake the knowledge into the model’s weights at all. Instead, at inference time, retrieve the relevant pieces of knowledge from an external store and hand them to the model as context. The model’s job then becomes: reason over this retrieved information to produce an answer. This is Retrieval-Augmented Generation, or RAG.


The Retrieval Problem

Given a query, find the most relevant documents from a large corpus. This sounds like search, and it is, but the way “relevant” is defined is what changes everything.

Traditional search is lexical - it matches keywords. A query for “cardiovascular exercise benefits” would return documents that contain those exact words or close morphological variants. This works well when the user knows the right vocabulary and when documents use it. It fails when a user asks “how does running help your heart” and the relevant document talks about “aerobic activity and cardiovascular health” - same concept, different words.
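
To make the vocabulary gap concrete, here is a minimal sketch. It assumes a deliberately naive tokenizer (lowercase, whitespace split), and `keyword_overlap` is a hypothetical helper, not part of any search library. Real lexical engines add stemming and BM25 weighting, but they hit the same wall when query and document share no terms.

```python
def tokenize(text):
    # Naive tokenizer for illustration only: lowercase, split on whitespace.
    return set(text.lower().split())

def keyword_overlap(query, document):
    # Tokens shared by query and document; an empty set means no lexical match.
    return tokenize(query) & tokenize(document)

query = "how does running help your heart"
relevant_doc = "aerobic activity and cardiovascular health"
unrelated_doc = "how to help your team run a meeting"

print(keyword_overlap(query, relevant_doc))   # set() - no shared words at all
print(keyword_overlap(query, unrelated_doc))  # shares 'how', 'help' and 'your'
```

The relevant document scores zero while an unrelated one matches on filler words, which is exactly the failure described above.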

The fix is to search over meaning, not tokens. And the way you represent meaning computationally is with embeddings: dense vectors in a high-dimensional space where semantically similar things are geometrically nearby.

An embedding model (a neural network, usually a transformer) takes a piece of text and produces a vector - say, 768 or 1536 numbers. The model is trained such that “I enjoy running” and “I like jogging” produce vectors that are close together, while “I enjoy running” and “the French Revolution began in 1789” produce vectors that are far apart. The embedding is a learned compression of semantic content.

Now retrieval becomes a geometric problem: given a query vector, find the document vectors that are most similar to it. Similarity is typically measured by cosine similarity (the angle between vectors) or dot product.

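import numpy as np
from sentence_transformers import SentenceTransformer

# Assumption: any sentence-embedding model works here; this model name is just one example.
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")

def cosine_similarity(a, b):
    # Cosine of the angle between a and b: 1.0 = same direction, 0.0 = orthogonal.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Embed the query and all documents
query_vec = embedding_model.encode("how does running help your heart")
doc_vecs  = embedding_model.encode(["cardiovascular exercise benefits",
                                     "French Revolution 1789",
                                     "aerobic activity and cardiovascular health"])

similarities = [cosine_similarity(query_vec, d) for d in doc_vecs]
# Illustrative values (exact numbers depend on the model):
# [0.91, 0.12, 0.89] -- retrieves docs 0 and 2 as most relevant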

This is semantic search: retrieval based on meaning rather than lexical overlap.


The Scale Problem: Why Exact Search Fails

Searching over thousands of documents with exact cosine similarity is fine. Over millions or billions, it is not. Computing the cosine similarity between a query vector and every document vector in a corpus of 100 million items, where each vector has 768 dimensions, requires around 150 billion floating-point operations per query (one multiply and one add per dimension) and a scan over roughly 300 GB of float32 vectors. Even on fast hardware, this takes seconds. At high query volumes, this is untenable.
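
The arithmetic behind that figure, plus what exact search actually does, can be sketched in a few lines. The function names are illustrative, and the scoring uses a plain dot product on a toy corpus.

```python
def brute_force_flops(num_vectors, dim):
    # One multiply and one add per dimension for each dot product.
    return 2 * num_vectors * dim

def exact_top_k(query, docs, k):
    # Exact search: score every document, then keep the k best.
    # The cost grows linearly with len(docs).
    scores = [sum(q * d for q, d in zip(query, doc)) for doc in docs]
    ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]

print(brute_force_flops(100_000_000, 768))  # 153600000000, i.e. ~150 billion

docs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
print(exact_top_k([1.0, 0.1], docs, k=2))   # [0, 2]
```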

This is the approximate nearest neighbor (ANN) problem: find the vectors most similar to a query, not exactly, but fast enough to be practical. The approximation is a deliberate tradeoff - you will occasionally miss the single most similar vector, but you will get very good results in milliseconds rather than exact results in seconds.

The most successful algorithm for this problem in practice is HNSW.


HNSW: Hierarchical Navigable Small World

HNSW (Malkov & Yashunin, 2018) is a graph-based index that enables sub-linear search over very large collections of vectors. Understanding it requires two ideas: small world graphs, and hierarchy.

Small World Graphs

A small world graph is one with two seemingly contradictory properties simultaneously: high local clustering (your neighbors are neighbors of each other) and short paths between any two nodes (the famous “six degrees of separation” result). In such a graph, navigating from any arbitrary node to any other takes very few hops, even if the graph has millions of nodes.

HNSW constructs a proximity graph where each node is a vector, and nodes are connected to their nearest neighbors. Navigating this graph from an entry point to the query is a greedy local search: at each step, move to whichever neighbor is closest to the query. This finds very good approximate nearest neighbors very fast.

Hierarchy

The problem with a flat proximity graph is that greedy local search can get stuck in local optima - regions where every neighbor is farther from the query than the current node, but you are not actually near the global nearest neighbor. This is the coarse navigation problem: you need long-range connections to quickly cross large distances in the space before using short-range connections for fine-grained search.

HNSW solves this by building a hierarchy of graphs. The top layer contains only a small subset of the nodes, so its connections are long-range and span the whole dataset - you can cross large distances quickly. The bottom layer contains every node, each connected to its closest neighbors for fine-grained precision. Intermediate layers have intermediate density and connection ranges.

graph TB
    subgraph L2["Layer 2 - sparse, long-range"]
        direction LR
        A2((A)) --- E2((E))
    end
    subgraph L1["Layer 1 - medium-range"]
        direction LR
        A1((A)) --- C1((C)) --- D1((D)) --- E1((E))
    end
    subgraph L0["Layer 0 - dense, short-range"]
        direction LR
        A0((A)) --- B0((B)) --- C0((C)) --- Cp(("C'")) --- D0((D)) --- Dp(("D'")) --- E0((E)) --- Ep(("E'"))
    end
    A2 -. search drops .-> A1
    A1 -. search drops .-> A0

Search starts at the top layer, greedily navigates to the node closest to the query, then drops to the next layer and repeats. Each layer refines the result. The final search in layer 0 gives the approximate nearest neighbors.
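
A toy version of this layered descent can be sketched as follows. It is a simplification under stated assumptions: the nodes reuse the names from the diagram but sit at made-up positions on a line, each layer is a simple chain, and the search keeps a single current node, whereas real HNSW maintains a candidate list of size ef on the bottom layer.

```python
import math

def greedy_search_layer(query, entry, neighbors, vectors):
    # Move to the neighbor closest to the query until no neighbor improves.
    current = entry
    while True:
        best = min(neighbors.get(current, []),
                   key=lambda n: math.dist(vectors[n], query),
                   default=current)
        if math.dist(vectors[best], query) >= math.dist(vectors[current], query):
            return current
        current = best

def hierarchical_search(query, entry, layers, vectors):
    # layers[0] is the sparse top layer; the last entry is the dense layer 0.
    current = entry
    for neighbors in layers:
        current = greedy_search_layer(query, current, neighbors, vectors)
    return current

def chain(nodes):
    # Undirected path graph: each node links to its predecessor and successor.
    adj = {n: [] for n in nodes}
    for a, b in zip(nodes, nodes[1:]):
        adj[a].append(b)
        adj[b].append(a)
    return adj

names = ["A", "B", "C", "C'", "D", "D'", "E", "E'"]
vectors = {name: (float(i),) for i, name in enumerate(names)}  # hypothetical 1-D positions
layers = [chain(["A", "E"]), chain(["A", "C", "D", "E"]), chain(names)]

print(hierarchical_search((5.2,), "A", layers, vectors))  # D'
```

For the query at 5.2, the top layer jumps from A straight to E, layer 1 cannot improve on E, and layer 0 refines to D'. Reaching the same node by walking the bottom chain from A would take five hops.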

Construction: nodes are inserted one at a time. Each new node is assigned to a random maximum layer (exponential distribution, so most nodes appear only in layer 0). It is connected to its closest neighbors in every layer up to its maximum.
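
The level assignment can be sketched as below, using the formula from the HNSW paper, floor(-ln(U) * mL) with the recommended mL = 1 / ln(M). The function name and the sample size are illustrative.

```python
import math
import random

def random_level(m, rng):
    # Level multiplier recommended in the HNSW paper: mL = 1 / ln(M).
    m_l = 1.0 / math.log(m)
    # 1 - random() lies in (0, 1], which avoids log(0).
    return int(-math.log(1.0 - rng.random()) * m_l)

rng = random.Random(0)
levels = [random_level(16, rng) for _ in range(100_000)]
share_layer0_only = levels.count(0) / len(levels)
print(f"{share_layer0_only:.3f}")  # close to 1 - 1/16 = 0.9375
```

With M=16, roughly 15 out of 16 nodes never leave layer 0, and each higher layer holds about 1/16 of the nodes of the layer below it.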

The Tradeoffs

HNSW has several key parameters:

  • M: the number of bidirectional connections per node at construction time. Higher M gives better recall (you find the true nearest neighbors more often) at the cost of more memory and slower construction.
  • ef_construction: the size of the dynamic candidate list during construction. Higher values give better index quality at the cost of construction time.
  • ef_search (or ef): the size of the candidate list during search. This is the main recall-vs-latency knob at query time. Higher ef finds better nearest neighbors but takes longer.

The key insight is that ef_search can be tuned independently of how the index was built. You build once, then adjust ef at query time based on your latency budget.

HNSW characteristics:

  • Memory: $O(nM)$ for the graph, on top of $O(nd)$ for the vectors themselves, where $n$ is the number of vectors and $d$ their dimension. With M=16 and 768-dimensional float32 vectors, storing 1 million vectors takes about 3 GB for the vectors plus ~100 MB for the graph structure.
  • Search time: $O(\log n)$ in practice.
  • Recall at ef=100: typically 95-99% for well-tuned parameters.
  • Does not support deletion well (marks nodes as deleted but does not reclaim graph space).
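
The memory figures above can be reproduced with back-of-the-envelope arithmetic. This sketch assumes 4-byte link IDs and about 2M links per node on layer 0 (the setting recommended in the HNSW paper), and it ignores the small contribution of the upper layers. Actual implementations differ in the details.

```python
def hnsw_memory_bytes(n, dim, m, bytes_per_float=4, bytes_per_link=4):
    # Raw vector storage: n vectors of `dim` float32 values.
    vector_bytes = n * dim * bytes_per_float
    # Assumption: ~2*M links per node on layer 0 dominate the graph size.
    graph_bytes = n * 2 * m * bytes_per_link
    return vector_bytes, graph_bytes

vec, graph = hnsw_memory_bytes(n=1_000_000, dim=768, m=16)
print(vec / 1e9)    # 3.072 (GB of raw vectors)
print(graph / 1e6)  # 128.0 (MB of layer-0 links)
```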

Other ANN algorithms exist - IVF (inverted file index) partitions the space into Voronoi cells, ScaNN (Google) uses learned (anisotropic) quantization, Annoy (Spotify) uses random projection trees - but HNSW dominates modern vector database implementations because of its strong recall-latency tradeoff and the fact that it works well across different data distributions without requiring a separate quantization step.


Vector Databases

A vector database is a storage and retrieval system built around dense vector search. It handles the infrastructure problems you would otherwise have to solve yourself: indexing, persistence, updates, scaling, filtering, and the operationalization of ANN search at production load.

The three most commonly encountered options are Qdrant, Pinecone, and pgvector. They represent three distinct design philosophies.

Qdrant

Qdrant is an open-source vector database written in Rust, designed to be self-hosted or deployed on Qdrant Cloud. It stores vectors alongside a payload (arbitrary JSON metadata), and its killer feature is filtered search: you can query for nearest neighbors subject to a filter on the payload, and the filter is applied efficiently without a post-filtering step.

The naive approach to filtered search is: find the top-K nearest neighbors, then apply the filter and hope that enough of the top-K pass. This fails when the filter is selective - if only 1% of your corpus matches the filter, you need to retrieve the top-10000 to get 100 results that pass. Qdrant handles this with its own indexing over payload fields that integrates with the vector index, keeping latency sensible even for selective filters.
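
The post-filtering failure is easy to reproduce on a toy corpus. This sketch is not how Qdrant implements filtering internally. It only contrasts the two orderings, using made-up scores and a hypothetical `year` field where 1% of items match.

```python
def post_filter(items, predicate, k):
    # Take the k most similar items first, then drop those failing the filter.
    top_k = sorted(items, key=lambda it: it["score"], reverse=True)[:k]
    return [it for it in top_k if predicate(it)]

def pre_filter(items, predicate, k):
    # Restrict to matching items first, then rank what is left.
    matching = [it for it in items if predicate(it)]
    return sorted(matching, key=lambda it: it["score"], reverse=True)[:k]

# Toy corpus: 10,000 items, similarity falls with id, every 100th item is from 2023.
items = [{"id": i, "score": 1.0 - i / 10_000, "year": 2023 if i % 100 == 0 else 2020}
         for i in range(10_000)]
is_2023 = lambda it: it["year"] == 2023

post = post_filter(items, is_2023, k=100)
pre = pre_filter(items, is_2023, k=5)
print(len(post))                 # 1
print([it["id"] for it in pre])  # [0, 100, 200, 300, 400]
```

Fetching the top 100 and filtering afterwards leaves a single result, while filtering first returns the five best matches. The hard part, which a vector database has to solve, is getting the second behavior without scanning every matching item.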

Qdrant uses HNSW internally and supports quantization to compress vectors: scalar quantization (float32 to int8) cuts vector memory by about 4x, and product quantization compresses further at a larger cost in recall.

from qdrant_client import QdrantClient
from qdrant_client.models import (
    VectorParams, Distance, PointStruct, Filter, FieldCondition, MatchValue
)

# embed() is assumed to be your own helper returning a 768-dimensional list of floats.
client = QdrantClient("localhost", port=6333)

# Create a collection
client.create_collection(
    collection_name="docs",
    vectors_config=VectorParams(size=768, distance=Distance.COSINE)
)

# Insert vectors with metadata payloads
client.upsert(
    collection_name="docs",
    points=[
        PointStruct(id=1, vector=embed("Climate change effects"), payload={"source": "ipcc", "year": 2023}),
        PointStruct(id=2, vector=embed("Quantum computing basics"), payload={"source": "arxiv", "year": 2022}),
    ]
)

# Semantic search with a filter
# (recent client versions expose the same query through client.query_points)
results = client.search(
    collection_name="docs",
    query_vector=embed("global warming impacts"),
    query_filter=Filter(
        must=[FieldCondition(key="year", match=MatchValue(value=2023))]
    ),
    limit=5
)

Qdrant is a strong choice when: you need full control over your infrastructure, you have selective filtering requirements, or you want to run everything on your own hardware.

Pinecone

Pinecone is a fully managed, proprietary vector database offered as a cloud service. You never operate it - there is no server to configure, no index to tune. You get a namespace, a client library, and an API.

Its design is oriented around operational simplicity: you send vectors in, query vectors out. The tradeoffs are the natural cloud tradeoffs - you give up control (no tuning of index parameters, no access to internals) in exchange for not having to manage the infrastructure. It scales automatically, handles updates gracefully, and has a generous free tier for prototyping.

Pinecone introduced namespaces for multi-tenancy (different users or datasets can share an index without seeing each other’s data) and sparse-dense hybrid search (combining BM25 keyword scores with vector similarity in a single query). Hybrid search is important in practice because pure semantic search can miss exact matches - if a user asks for “GPT-4”, you want to find documents that mention “GPT-4” even if the semantic vector for “GPT-4” is not particularly close to documents about language models in general.

from pinecone import Pinecone

# embed() is assumed to be your own helper returning a list of floats
# whose length matches the index dimension.
pc = Pinecone(api_key="your-api-key")
index = pc.Index("my-index")

# Upsert: (id, vector, metadata) tuples
index.upsert(vectors=[
    ("id1", embed("Climate change effects"), {"source": "ipcc"}),
    ("id2", embed("Quantum computing basics"), {"source": "arxiv"}),
])

# Query with a metadata filter
results = index.query(
    vector=embed("global warming impacts"),
    top_k=5,
    filter={"source": {"$eq": "ipcc"}},
    include_metadata=True
)

Pinecone is the path of least resistance for teams that want to prototype and ship quickly without deep infrastructure investment.

pgvector

pgvector is a PostgreSQL extension that adds a vector data type, distance operators (such as <=> for cosine distance), and approximate nearest neighbor indexes (IVFFlat and HNSW). It is not a standalone vector database - it is vector search living inside your relational database.

CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE documents (
    id        SERIAL PRIMARY KEY,
    content   TEXT,
    embedding VECTOR(768),
    source    TEXT,
    year      INT
);

-- Create HNSW index
CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);

-- Semantic search with relational filters
SELECT content, source, 1 - (embedding <=> '[0.1, 0.2, ...]') AS similarity
FROM documents
WHERE year = 2023
ORDER BY embedding <=> '[0.1, 0.2, ...]'
LIMIT 5;

The appeal is enormous: if you already run Postgres, you do not need to learn, operate, or pay for a new system. Your vectors live alongside your relational data. You can join vector search results with your existing tables in a single query. Transactions, backups, permissions - all handled by the infrastructure you already have.

The tradeoff is performance at scale. pgvector’s HNSW index is solid but optimized for the constraints of a general-purpose database engine, not for vector search alone. For collections beyond a few million vectors with high query throughput, dedicated vector databases are measurably faster. For most applications - especially internal tools, small-to-medium datasets, or anything where “fast enough” is fast enough - pgvector is the right answer.

A rough guide:

| System   | Best For                                      | Not Ideal For                                  |
|----------|-----------------------------------------------|------------------------------------------------|
| Qdrant   | Self-hosted, selective filters, control       | Teams that want zero-ops                       |
| Pinecone | Fastest to production, managed, hybrid search | Cost-sensitive, large datasets, control needed |
| pgvector | Existing Postgres users, joins, simplicity    | Very high throughput, billions of vectors      |

RAG: Retrieval-Augmented Generation

With semantic search in place, RAG is the architecture that connects retrieval to generation.

The core pipeline:

  1. Offline (index time): take your corpus, split it into chunks, embed each chunk, store the embeddings and the original text in a vector database.
  2. Online (query time): embed the user’s query, retrieve the top-K most similar chunks, construct a prompt that includes those chunks as context, pass the prompt to the language model, return the model’s response.

def rag_query(user_question, embedding_model, retriever, llm, top_k=5):
    # embedding_model, retriever, and llm are placeholders for your own components.
    # Step 1: embed and retrieve
    query_vec = embedding_model.encode(user_question)
    chunks = retriever.search(query_vec, top_k=top_k)

    # Step 2: construct context
    context = "\n\n".join([c.text for c in chunks])

    # Step 3: generate
    prompt = f"""Answer the question using the provided context.
Context:
{context}

Question: {user_question}
Answer:"""

    return llm.complete(prompt)

The model is not asked to recall from training. It is asked to read and reason. This dramatically reduces hallucination for knowledge-grounded questions: failures now come mainly from retrieval returning nothing useful, rather than from stale or incorrect information in the model’s weights.

Chunking

How you split documents into chunks has significant impact on retrieval quality. A chunk needs to be small enough that a single chunk is about one coherent thing (large chunks introduce irrelevant content alongside relevant content), but large enough to contain enough context to be useful (a chunk that is a single sentence may not have enough information to answer anything on its own).

Common strategies:

  • Fixed-size chunks with overlap: split every 512 tokens, overlap by 64. Simple and surprisingly robust. The overlap means a short passage that straddles a chunk boundary still appears whole in one of the two neighboring chunks.
  • Semantic chunking: split at sentence or paragraph boundaries, then merge adjacent chunks as long as they stay semantically similar. Produces more coherent units but is more expensive.
  • Document-level hierarchy: embed individual sentences or paragraphs for retrieval, but retrieve the surrounding paragraph or section as context. Small-to-big retrieval: the retrieval granularity is fine, but the context window you hand to the model is coarser.
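
The first strategy can be sketched as follows. The function name is illustrative, and the input is assumed to be an already tokenized document. A real pipeline would use the embedding model’s own tokenizer rather than the placeholder tokens shown here.

```python
def chunk_with_overlap(tokens, chunk_size=512, overlap=64):
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this chunk already reaches the end of the document
    return chunks

tokens = [f"t{i}" for i in range(1000)]  # placeholder tokens
chunks = chunk_with_overlap(tokens)
print([len(c) for c in chunks])          # [512, 512, 104]
print(chunks[1][:64] == chunks[0][-64:]) # True: 64 tokens are shared
```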

A common failure mode: chunks are too large, so each chunk contains many topics, and the retrieval score is diluted across all of them. The chunk that is 80% relevant on one topic and 20% about something else will score lower than a chunk that is 100% relevant on its own.

Reranking

The embedding model used for retrieval is optimized for speed across a large corpus - it needs to embed millions of documents efficiently. It is not optimized for precision in ranking a small set of retrieved candidates.

Reranking is a two-stage approach: retrieve a larger set of candidates with the fast embedding model (top-50), then use a slower but more accurate cross-encoder to score each candidate against the query and re-rank them.

A cross-encoder takes (query, document) as a joint input and produces a single relevance score. Because it sees both together, it can model interactions between them that a bi-encoder (which embeds each independently) cannot. The tradeoff is cost: nothing can be precomputed, so every (query, document) pair needs its own forward pass at query time, which is $O(n)$ model calls in the number of candidates - you cannot use it to search over millions of documents, but you can use it to precisely rank the top 50 that the embedding model retrieved.

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

# candidates: results from vector search, each assumed to expose a .text attribute
query = "effects of climate change on agriculture"
candidate_texts = [c.text for c in candidates]
scores = reranker.predict([[query, text] for text in candidate_texts])

# Re-sort by cross-encoder scores, highest first
reranked = sorted(zip(candidate_texts, scores), key=lambda x: x[1], reverse=True)
top_5 = reranked[:5]

Reranking consistently improves answer quality in RAG systems and is worth doing whenever you have latency budget for the extra model call.

Hybrid Search

Pure semantic search misses exact matches. If a user asks about “RFC 7231” and your documents contain that exact string, the semantic embedding of “RFC 7231” may not be particularly close to any document’s embedding in the vector space. Keyword search would find it trivially.

Hybrid search combines sparse retrieval (BM25 or TF-IDF, operating on token overlap) with dense retrieval (embedding similarity), then merges the two result lists. Reciprocal rank fusion is a simple and effective merging strategy: for each document, compute $1 / (k + \text{rank})$ from each retriever and sum them, where rank is the document’s 1-based position in that retriever’s list and $k$ is a smoothing constant (60 is a common default).
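
Reciprocal rank fusion fits in a few lines. In this sketch the document IDs and both rankings are made up for illustration, and k=60 is the commonly used default rather than a tuned value.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    # rankings: list of ranked doc-id lists, best first. Ranks start at 1.
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = ["rfc7231", "http_intro", "tls_guide"]     # hypothetical sparse results
dense_ranking = ["http_intro", "rest_design", "rfc7231"]  # hypothetical dense results
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(fused)  # ['http_intro', 'rfc7231', 'rest_design', 'tls_guide']
```

Because it uses only ranks, the fusion needs no calibration between BM25 scores and cosine similarities, which live on different scales.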

Most production RAG systems use hybrid search because the domains where exact string matching helps (proper nouns, technical identifiers, version numbers) are exactly the domains where semantic search struggles most.


Embeddings: Which Model?

The embedding model is the bottleneck for retrieval quality. Common options:

  • OpenAI text-embedding-3-small / text-embedding-3-large: very strong, cheap per call, but adds an external API dependency and latency.
  • Cohere Embed v3: competitive with OpenAI, also managed API.
  • BGE (BAAI General Embedding): open-source, strong on the MTEB benchmark, can be self-hosted. BGE-M3 supports sparse and dense retrieval in a single model.
  • E5 (Microsoft): another strong open-source family, good for long documents.
  • Nomic Embed: open-source, optimized for long contexts.

The MTEB (Massive Text Embedding Benchmark) leaderboard is the practical reference for comparing embedding models across tasks. Different models perform differently on different retrieval tasks - domain-specific data (legal, medical, code) often benefits from fine-tuning a smaller model on in-domain data rather than using a large general-purpose model.

Embedding dimensionality is also a choice. Higher dimensionality is more expressive but costs more storage and slower similarity computation. Many models are trained with Matryoshka representation learning (MRL), which lets you truncate the embedding to a smaller dimension with graceful degradation, giving you a knob to trade quality for cost.
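
Truncation itself is a small operation, sketched below with an illustrative function name and a made-up 4-dimensional vector. It only preserves quality for models trained with MRL. Applying it to an arbitrary embedding model will usually hurt retrieval badly.

```python
import math

def truncate_embedding(vec, dim):
    # Keep the first `dim` components, then re-normalize to unit length
    # so cosine and dot-product scores stay comparable.
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in head]

full = [0.6, 0.0, 0.8, 0.0]          # toy unit-length embedding
small = truncate_embedding(full, 2)
print(small)                         # [1.0, 0.0]
```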


Failure Modes

RAG is not magic. Understanding when it fails is essential.

Retrieval failure: the relevant information exists in the corpus but does not get retrieved. Causes: bad chunking, wrong embedding model for the domain, query too vague, or ANN recall that is too low for the corpus size (for example, ef_search set too small).

Context window overflow: the retrieved chunks together exceed the model’s context limit. Fix: retrieve fewer or shorter chunks, or use a model with a larger context window.
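
One way to apply the first fix is to keep chunks in relevance order until a token budget is spent. In this sketch `fit_to_budget` is a hypothetical helper and token counts are approximated by whitespace splitting. A real system would count with the model’s tokenizer and reserve room for the question and the answer.

```python
def fit_to_budget(chunks, max_tokens, count_tokens=lambda text: len(text.split())):
    # chunks are assumed to be sorted best-first; stop at the first one that does not fit.
    selected, used = [], 0
    for chunk in chunks:
        cost = count_tokens(chunk)
        if used + cost > max_tokens:
            break
        selected.append(chunk)
        used += cost
    return selected

ranked_chunks = ["a b c", "d e", "f g h i"]
print(fit_to_budget(ranked_chunks, max_tokens=5))  # ['a b c', 'd e']
```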

Context dilution: too many retrieved chunks, most of which are marginally relevant. The model attends to the irrelevant content and produces a muddled answer. Fix: use a reranker, reduce top-K.

Position bias: language models tend to pay less attention to information in the middle of long contexts (the “lost in the middle” problem). Fix: put the most relevant chunk first or last, not buried in the middle.
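
A simple reordering that follows this advice alternates chunks between the front and the back of the context, so the weakest ones land in the middle. The function name is illustrative and the input is assumed to be sorted best-first.

```python
def reorder_for_position_bias(chunks_best_first):
    # Even positions (best, 3rd best, ...) fill the front; odd positions fill the back.
    front, back = [], []
    for i, chunk in enumerate(chunks_best_first):
        (front if i % 2 == 0 else back).append(chunk)
    # Reverse the back half so the second-best chunk ends up last.
    return front + back[::-1]

print(reorder_for_position_bias(["r1", "r2", "r3", "r4", "r5"]))
# ['r1', 'r3', 'r5', 'r4', 'r2']
```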

Hallucination despite retrieval: the model ignores the context and generates from its weights. This happens when the prompt does not clearly instruct the model to restrict itself to the provided context, or when the model is poorly calibrated for grounded generation. Fix: explicit system instructions to only use the provided context, and to say “I don’t know” when the context does not contain the answer.

Stale index: documents were updated but the vectors were not re-indexed. Fix: incremental indexing triggered by document updates.
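
A minimal sketch of change detection for incremental indexing, assuming you store a content hash next to each indexed document. `plan_reindex` and the document IDs are hypothetical. The actual re-embedding and upsert calls depend on your vector database.

```python
import hashlib

def content_hash(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def plan_reindex(documents, indexed_hashes):
    # documents: doc_id -> current text; indexed_hashes: doc_id -> hash stored at index time.
    to_upsert = [doc_id for doc_id, text in documents.items()
                 if indexed_hashes.get(doc_id) != content_hash(text)]
    to_delete = [doc_id for doc_id in indexed_hashes if doc_id not in documents]
    return to_upsert, to_delete

documents = {"a": "hello", "b": "changed text", "c": "new doc"}
indexed_hashes = {"a": content_hash("hello"),
                  "b": content_hash("old text"),
                  "d": content_hash("removed")}
print(plan_reindex(documents, indexed_hashes))  # (['b', 'c'], ['d'])
```

Only changed or new documents are re-embedded, and vectors for deleted documents are removed, which keeps the cost of staying fresh proportional to the amount of change.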


When Not to Use RAG

RAG adds latency (retrieval + model call), operational complexity (maintaining the index), and a new failure mode (retrieval failure). It is not always the right answer.

Fine-tuning is better when: the knowledge is stable and unlikely to change, you need the model to internalize patterns of reasoning (not just facts), the corpus is small enough to fit in training, or you cannot afford the added latency of a retrieval step.

Long-context models (with context windows up to around 1M tokens) can sometimes replace RAG when the entire relevant corpus fits in context - just stuff it all in. This works for small codebases, individual documents, or narrow domains.

RAG shines when: the knowledge changes frequently (news, financial data, company documentation), the corpus is too large to fit in a context window or as training data, you need citations or source attribution, or you need to combine knowledge from multiple domains that would conflict if all fine-tuned into one model.

Most serious production systems that need knowledge-grounded generation use some combination of all three: a base model fine-tuned on the domain, RAG for current information, and occasionally full context stuffing for small high-precision tasks. The right answer is always situational.