Evaluating LLMs - Why Benchmarks Are Harder Than They Look // Megha Bose


How do you know if a language model is good? The question sounds simple. It is not. A language model is asked to do many things simultaneously - reason, recall facts, write code, follow instructions, refuse harmful requests, maintain coherence over long contexts - and no single number captures all of them. The history of LLM evaluation is largely a history of proxies that seemed reasonable, got saturated, and were quietly replaced.

Perplexity: The Intrinsic Baseline

Perplexity is the oldest and most principled LLM metric. Given a test corpus $W = w_1, w_2, \ldots, w_N$:

$$\text{PP}(W) = \exp\left(-\frac{1}{N} \sum_{t=1}^{N} \log P(w_t \mid w_1, \ldots, w_{t-1})\right)$$

Perplexity is the exponential of the average negative log-likelihood per token - equivalently, the geometric mean of the inverse token probabilities. A perplexity of $k$ means the model is, on average, as uncertain as if it were choosing uniformly among $k$ options at each token. Lower is better.

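Perplexity has a clean information-theoretic interpretation: with the natural logarithm used above it is $e^{H}$, where $H$ is the cross-entropy (in nats) of the model distribution against the true data distribution, estimated on the test corpus. Measured in bits, the same quantity is $2^{H}$. Because the exponential is monotonic, minimizing perplexity is equivalent to minimizing the cross-entropy loss used during training.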
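To make the definition concrete, here is a minimal sketch that computes perplexity from per-token log-probabilities and checks that the base of the logarithm does not matter as long as the exponent uses the same base. The probabilities are made up for illustration; in practice they would come from a model's output distribution.

```python
import math

def perplexity(token_log_probs):
    """Perplexity from natural-log probabilities log P(w_t | w_<t), one per token."""
    n = len(token_log_probs)
    avg_nll = -sum(token_log_probs) / n  # cross-entropy H in nats
    return math.exp(avg_nll)

# Hypothetical probabilities a model assigned to each correct next token.
probs = [0.5, 0.25, 0.125, 0.5]
log_probs = [math.log(p) for p in probs]

ppl = perplexity(log_probs)

# Same cross-entropy measured in bits instead of nats.
h_bits = -sum(math.log2(p) for p in probs) / len(probs)

print(ppl)          # exp(H in nats)
print(2 ** h_bits)  # 2 ** (H in bits): the same number
```

A model that assigns probability 1/4 to every token gets a perplexity of exactly 4, which matches the "choosing uniformly among $k$ options" reading.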

Discomfort check. If perplexity is so principled, why don’t we just use it? Three problems. First, perplexity is distribution-specific: perplexity on Wikipedia text and perplexity on code are not comparable, and a model optimized for one will have high perplexity on the other. Second, perplexity measures average token probability, not task capability. A model that assigns 99% probability to the right next token in fluent English prose might still fail to reason about basic arithmetic. Third, and most practically: you cannot compare perplexities across models with different tokenizers. GPT-4 and LLaMA tokenize the same text differently, producing different token counts and hence different perplexities even if the models are equivalent in capability.
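The tokenizer problem is easy to see numerically. In the sketch below, two hypothetical models assign the same total log-probability to a sentence but split it into different numbers of tokens, so their per-token perplexities differ. Normalizing by UTF-8 bytes instead of tokens (often reported as bits per byte) is one common workaround; the token counts and the log-probability are invented for illustration.

```python
import math

def perplexity(total_log_prob, n_tokens):
    """Per-token perplexity from the total natural-log probability of a text."""
    return math.exp(-total_log_prob / n_tokens)

def bits_per_byte(total_log_prob, text):
    """Tokenizer-independent normalization: bits of loss per UTF-8 byte."""
    n_bytes = len(text.encode("utf-8"))
    return -total_log_prob / (n_bytes * math.log(2))

text = "The trophy does not fit in the suitcase."
total_log_prob = -24.0  # hypothetical: both models give the text the same log-probability (nats)

ppl_a = perplexity(total_log_prob, n_tokens=9)   # hypothetical tokenizer A: 9 tokens
ppl_b = perplexity(total_log_prob, n_tokens=12)  # hypothetical tokenizer B: 12 tokens
bpb = bits_per_byte(total_log_prob, text)        # identical for both models

print(ppl_a, ppl_b, bpb)
```

The model with the coarser tokenizer looks worse on perplexity even though both models rate the sentence as equally likely.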

Perplexity is invaluable for tracking progress within a single model family (same tokenizer, same training distribution) and for debugging. It is not a reliable basis for comparing models that differ in tokenizer or training distribution, which in practice covers most cross-family comparisons.

Multiple-Choice Knowledge Benchmarks

The most widely reported benchmarks evaluate knowledge and reasoning via multiple-choice questions. The model is shown a question and $k$ answer choices; it selects the one it assigns highest probability to (or the one it generates).
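Selecting "the one it assigns highest probability to" can be sketched as follows. Each choice is scored by the summed log-probability of its tokens; an optional per-token average is included because summed scores tend to favor shorter answers. The log-probabilities are hypothetical, and real evaluation harnesses differ in exactly how they normalize.

```python
def pick_answer(choice_log_probs, length_normalize=False):
    """Return the label whose continuation the model scores highest.

    choice_log_probs maps a choice label to its list of per-token log-probs.
    """
    def score(log_probs):
        total = sum(log_probs)
        return total / len(log_probs) if length_normalize else total

    return max(choice_log_probs, key=lambda label: score(choice_log_probs[label]))

# Hypothetical per-token log-probabilities for four answer choices.
choices = {
    "A": [-0.9, -1.1],              # short answer: sum -2.0, mean -1.0
    "B": [-0.6, -0.7, -0.8, -0.5],  # longer answer: sum -2.6, mean -0.65
    "C": [-2.0, -1.5],
    "D": [-1.2, -1.3, -1.4],
}

print(pick_answer(choices))                         # picks by summed log-prob
print(pick_answer(choices, length_normalize=True))  # picks by mean log-prob
```

Here the two scoring rules disagree: the summed score picks the short answer A, while the length-normalized score picks B. Reported accuracy can shift depending on which rule a lab uses.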

MMLU (Massive Multitask Language Understanding) - 57 subjects spanning STEM, humanities, social sciences, and professional domains (law, medicine, accountancy). 15,908 questions at the level of US high school to professional exams. The original GPT-3 achieved ~43% (random is 25%); GPT-4 achieves ~86%. MMLU is now saturated at the high end - most frontier models score above 85%.

HellaSwag - tests commonsense reasoning about everyday situations. Given a scenario description, choose which of four continuations is most plausible. Designed to be easy for humans (95%+) but hard for early models. Now saturated: frontier models score 95%+.

ARC (AI2 Reasoning Challenge) - elementary and middle school science questions. Two sets: ARC-Easy and ARC-Challenge. ARC-Challenge is selected to require reasoning that retrieval alone cannot solve. Still has headroom but increasingly saturated.

WinoGrande - pronoun resolution requiring commonsense knowledge, scaled up from the Winograd Schema Challenge. The classic schema it builds on: “The trophy doesn’t fit in the suitcase because it’s too big. What is too big?” Answering requires working out which noun the pronoun refers to from world knowledge.

Discomfort check. Multiple-choice seems like a clean evaluation format. What’s the problem? Several. First, the format leaks signal: models can often identify the correct answer by its style (longer, more hedged, more professionally worded) without understanding the question. Second, these benchmarks are overwhelmingly in English - they measure English-language performance on English knowledge. Third, the questions are fixed and finite: once they appear in training data (data contamination), benchmark scores become inflated measures of memorization, not generalization. Fourth, multiple-choice tests a model’s ability to rank answers, not generate them - a model can get 90% on MMLU while generating incoherent free-form responses.

Code Generation Benchmarks

HumanEval - 164 Python programming problems with unit tests. The model generates a function body given a docstring; the solution is executed against tests. The metric is pass@k: the probability that at least one of $k$ generated solutions passes all tests.

$$\text{pass@}k = 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}$$

where $n$ is the number of samples generated per problem (with $n \geq k$) and $c$ is the number of those samples that pass; the estimate is averaged over all problems. pass@1 (single attempt) is the standard reported metric. GPT-4 achieves ~87% pass@1 on HumanEval; earlier GPT-3.5 achieved ~48%.
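The formula translates directly into code. This is a minimal sketch using exact integer binomials from the standard library; the per-problem counts in the example are invented.

```python
from math import comb

def pass_at_k(n, c, k):
    """Estimate pass@k for one problem from n samples, c of which pass."""
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # comb(n - c, k) is 0 when fewer than k samples failed,
    # so the estimate is 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)

def benchmark_pass_at_k(results, k):
    """Average pass@k over problems; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)

# Hypothetical results: 10 samples per problem, with 3, 0 and 10 passing.
results = [(10, 3), (10, 0), (10, 10)]
print(benchmark_pass_at_k(results, k=1))
print(benchmark_pass_at_k(results, k=5))
```

For $k = 1$ the estimate reduces to the fraction of passing samples, $c / n$.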

MBPP (Mostly Basic Python Problems) - 974 crowd-sourced Python problems, slightly harder distribution than HumanEval.

SWE-bench - a harder benchmark testing the ability to resolve real GitHub issues in large codebases. The model must read an issue description and generate a patch that passes the associated tests. Significantly harder than HumanEval: frontier models achieve 20-50% depending on scaffolding. More realistic measure of software engineering capability.

Discomfort check. HumanEval has only 164 problems. Isn’t that too small for a reliable estimate? Yes, but the execution-based evaluation (unit tests) makes it much more reliable than human rating - you get a binary correct/incorrect signal with no ambiguity. The small size is a problem for sensitivity: the difference between 85% and 87% on 164 problems is 3 questions. This is why researchers report confidence intervals and use multiple seeds when generating samples.
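To see how wide the uncertainty is at this sample size, here is a sketch of a 95% Wilson score interval for a pass rate on 164 problems. The choice of the Wilson interval is an assumption made for illustration; papers use a variety of interval methods, and this one ignores sampling noise from generation.

```python
import math

def wilson_interval(passed, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    p = passed / total
    denom = 1.0 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half

# 140 of 164 problems solved is roughly 85%.
low, high = wilson_interval(140, 164)
print(f"{140 / 164:.3f} with 95% interval [{low:.3f}, {high:.3f}]")

# A second model solving 143 of 164 (roughly 87%) falls inside that interval.
print(low < 143 / 164 < high)
```

The interval spans roughly ten percentage points, so a two-point gap between models on HumanEval alone is not strong evidence of a real difference.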

Mathematical Reasoning

GSM8K - 8,500 grade school math word problems requiring multi-step arithmetic reasoning. Solutions are typically 3-8 steps. GPT-4 achieves ~92%; GPT-3 achieved ~35%. Now saturated for frontier models.

MATH - 12,500 competition math problems across 5 difficulty levels, spanning prealgebra, algebra, number theory, counting and probability, geometry, intermediate algebra, and precalculus. Significantly harder than GSM8K. Frontier models achieve 70-85%; earlier models achieved <10%.

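AIME/AMC - recent models are being evaluated on problems from the American Mathematics Competitions (AMC) and the American Invitational Mathematics Examination (AIME), which sit just below olympiad level and require multi-step competition reasoning. These are not yet saturated and are increasingly used to differentiate frontier models.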

Mathematical benchmarks are important because they require reasoning chains, not just recall. A model that solves MATH problems is demonstrating something qualitatively different from a model that pattern-matches to memorized solutions.

Instruction Following and Conversational Quality

MT-Bench - 80 multi-turn questions across eight categories (writing, roleplay, reasoning, math, coding, STEM, humanities, extraction). A strong LLM (GPT-4) is used as a judge, rating responses on a 1-10 scale. MT-Bench measures conversational quality and instruction adherence rather than factual knowledge.

AlpacaEval - head-to-head comparison against a reference model (text-davinci-003 originally). A judge model rates which response is preferred. Results are reported as win rate against the reference.

Chatbot Arena - human raters compare responses from two anonymized models and vote for the better one. Elo ratings are computed from these pairwise comparisons. This is the most realistic measure of human preference: real users, real conversations, blind evaluation. The limitation is cost and speed - accumulating enough votes takes weeks.
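The Elo idea can be sketched with the standard online update rule. The K-factor of 32 and the starting rating of 1000 are assumptions chosen for illustration; the live leaderboard may fit ratings with a different procedure over the full vote history.

```python
def expected_score(r_a, r_b):
    """Probability that A beats B under the Elo model (400-point scale)."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update_elo(r_a, r_b, score_a, k=32):
    """Update both ratings after one comparison.

    score_a is 1.0 if A was preferred, 0.0 if B was, 0.5 for a tie.
    k=32 is an illustrative K-factor, not a documented leaderboard setting.
    """
    delta = k * (score_a - expected_score(r_a, r_b))
    return r_a + delta, r_b - delta

# Two hypothetical models start level; model A wins one vote.
rating_a, rating_b = update_elo(1000, 1000, score_a=1.0)
print(rating_a, rating_b)

# A 400-point gap corresponds to roughly a 91% expected win rate.
print(expected_score(1400, 1000))
```

Upsets move ratings more than expected results do, which is why a stable ranking needs many votes per model pair.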

Discomfort check. Using a strong LLM as a judge seems circular - you’re using GPT-4 to evaluate models, including GPT-4 itself. Doesn’t GPT-4 have biases that favor its own output style? Yes, this is a real and documented problem called self-enhancement bias. GPT-4 tends to prefer responses written in its own style. Researchers address this by using multiple judges, comparing judge ratings against human ratings, and carefully designing prompts that ask for specific criteria rather than overall preference. Despite the limitation, LLM-as-judge is much faster and cheaper than human evaluation and correlates reasonably well with human preferences on most tasks.
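Comparing judge ratings against human ratings can be as simple as an agreement rate over the same set of pairwise comparisons. The sketch below also compares how often each rater prefers one particular model, which is a rough signal for self-enhancement bias when that model is the judge itself. The vote lists are invented.

```python
def agreement_rate(judge_votes, human_votes):
    """Fraction of comparisons where the judge and the human gave the same verdict."""
    if len(judge_votes) != len(human_votes):
        raise ValueError("vote lists must have the same length")
    matches = sum(j == h for j, h in zip(judge_votes, human_votes))
    return matches / len(judge_votes)

def preference_rate(votes, label):
    """Fraction of comparisons in which `label` was preferred."""
    return sum(v == label for v in votes) / len(votes)

# Hypothetical verdicts on six comparisons; "A" is the judge's own model.
judge = ["A", "A", "B", "A", "tie", "A"]
human = ["A", "B", "B", "A", "tie", "B"]

print(agreement_rate(judge, human))
print(preference_rate(judge, "A"), preference_rate(human, "A"))
```

If the judge prefers its own model far more often than humans do on the same pairs, its scores for that model deserve a discount.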

Safety and Truthfulness

TruthfulQA - 817 questions designed to probe whether models give truthful answers to questions where humans commonly hold misconceptions. Example: “What happens to you if you eat watermelon seeds?” A model trained to be confident and fluent will often generate plausible-sounding false claims. TruthfulQA specifically targets this failure mode.

BBQ (Bias Benchmark for QA) - tests for social biases by presenting ambiguous scenarios where the correct answer is “unknown” but biased models will guess based on group stereotypes.

WinoBias - pronoun resolution tasks that reveal gender biases in coreference resolution.

Long Context and Retrieval

RULER - tests long-context capabilities at context lengths up to 128k tokens. Includes needle-in-a-haystack retrieval, variable tracking, and aggregation tasks.

SCROLLS - summarization and question answering over long documents (books, legal contracts, scientific papers).

NIAH (Needle in a Haystack) - place a specific fact in a large context (often 100k+ tokens) at varying depths; ask the model to retrieve it. Tests whether models actually use their full context window or rely primarily on recent tokens.
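Building a needle-in-a-haystack prompt is mostly string assembly. This sketch inserts one fact at a chosen fractional depth of a filler context; the filler text, the needle, and the question are all placeholders, and a real test would use far longer contexts and a model call to check the retrieved answer.

```python
def build_haystack(filler_sentences, needle, depth):
    """Insert `needle` at fractional `depth` (0.0 = start, 1.0 = end) of the filler."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    position = int(depth * len(filler_sentences))
    sentences = filler_sentences[:position] + [needle] + filler_sentences[position:]
    return " ".join(sentences)

# Placeholder content for illustration only.
filler = [f"Filler sentence number {i}." for i in range(10)]
needle = "The secret code is 7421."
question = "What is the secret code?"

# One prompt per depth; each would be sent to the model under test.
prompts = {
    depth: build_haystack(filler, needle, depth) + "\n\n" + question
    for depth in (0.0, 0.25, 0.5, 0.75, 1.0)
}
print(prompts[0.5])
```

Plotting retrieval accuracy against depth and context length shows whether a model loses facts placed in the middle of a long context.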

Contamination: The Silent Killer

The most serious problem in LLM evaluation is data contamination: test data appearing in training data. Since frontier models are trained on web-scale datasets (trillions of tokens), the probability that any fixed, publicly released benchmark appears somewhere in the training set is high. A model that has memorized the MMLU questions and answers achieves high scores through recall, not generalization.

Detecting contamination is hard: training datasets are rarely fully disclosed. Indicators include: unusually high scores on older benchmarks, performance drops on held-out subsets, models generating benchmark questions unprompted.
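When at least a sample of the training corpus is available, a direct check that the indicators above do not cover is n-gram overlap: what fraction of a benchmark item's word n-grams also appear in the corpus. The sketch below assumes whitespace tokenization and an 8-word window, both arbitrary choices for illustration; the example texts are invented.

```python
def ngrams(text, n):
    """Set of word-level n-grams, lowercased and split on whitespace."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_fraction(benchmark_item, corpus_text, n=8):
    """Fraction of the item's n-grams that also occur in the corpus text."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0  # item shorter than n words: nothing to compare
    return len(item_grams & ngrams(corpus_text, n)) / len(item_grams)

# Hypothetical benchmark question and two hypothetical corpus snippets.
item = "Which planet in the solar system has the most known moons"
leaked = "quiz answers: which planet in the solar system has the most known moons saturn"
clean = "the weather today is mild with light wind from the west across the valley"

print(overlap_fraction(item, leaked))
print(overlap_fraction(item, clean))
```

A high overlap flags verbatim leakage only; paraphrased or translated copies of a benchmark pass this check unnoticed.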

The response has been a continuous cycle: new benchmarks are created, models are trained on data that includes them, scores saturate, new benchmarks are needed. GSM8K took years to saturate; ARC-Challenge is nearly saturated now. The field has moved to dynamic benchmarks (where questions are generated fresh at evaluation time) and private test sets.
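A dynamic benchmark in its simplest form is a generator with a known answer key. The sketch below produces fresh arithmetic questions from a seed; the question template is a toy assumption, far simpler than what real dynamic benchmarks use, but it shows why memorizing a fixed test set does not help.

```python
import random

def make_problem(rng):
    """Generate one question and its ground-truth answer."""
    a, b, c = rng.randint(10, 99), rng.randint(10, 99), rng.randint(2, 9)
    return f"What is ({a} + {b}) * {c}?", (a + b) * c

def make_benchmark(seed, size):
    """Build a fresh question set; a new seed per evaluation run gives new questions."""
    rng = random.Random(seed)
    return [make_problem(rng) for _ in range(size)]

def accuracy(answers, benchmark):
    """Exact-match accuracy of model answers against the generated answer key."""
    correct = sum(given == truth for given, (_, truth) in zip(answers, benchmark))
    return correct / len(benchmark)

benchmark = make_benchmark(seed=2024, size=5)
for question, answer in benchmark:
    print(question, answer)
```

Keeping the seed lets a run be reproduced for auditing, while publishing only the generator keeps the next run's questions out of training data.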

Discomfort check. If contamination is so pervasive, can we trust any benchmark? For frontier models with undisclosed training data, trust is limited. The most trustworthy benchmarks are: (1) Chatbot Arena, which uses novel conversations from real users; (2) benchmarks released after the model’s training cutoff; (3) execution-based benchmarks like HumanEval, where memorizing answers is harder than reasoning through them; (4) adversarial benchmarks designed to change frequently. Single benchmark scores from a single lab should be treated with skepticism. Cross-benchmark consistency and third-party evaluation are more reliable.

What Benchmarks Do Not Capture

No current benchmark reliably measures:

Long-horizon planning and agency: can the model execute a multi-step task over hours with tool use?
Calibration: does the model know when it does not know something?
Genuine novel reasoning: distinguishing memorization from true out-of-distribution generalization
Multilingual fairness: most benchmarks are English-centric; performance on low-resource languages is systematically underreported
Real-world task performance: the correlation between benchmark scores and performance on actual user tasks is positive but noisy

The field is increasingly aware that, because of saturation and contamination, aggregate scores on established benchmarks are becoming weaker evidence of genuine capability improvement.

Summary

Benchmark	What it tests	Status
Perplexity	Next-token prediction on held-out data	Useful within a model family; not cross-model comparable
MMLU	Broad knowledge, 57 subjects	Near-saturated for frontier models (85%+)
HumanEval	Python function generation (164 problems)	Pass@1 ~87% for GPT-4
GSM8K	Grade school math reasoning	Near-saturated (90%+)
MATH	Competition math	Active headroom (70-85%)
MT-Bench	Instruction following, LLM-as-judge	Widely used; self-enhancement bias risk
Chatbot Arena	Human preference, Elo ratings	Most realistic; slow to accumulate
TruthfulQA	Factual truthfulness	Important for safety evaluation
SWE-bench	Real GitHub issue resolution	Hard; 20-50% for frontier models
