Message Authentication & Hashing - Proof That No One Tampered With This // Megha Bose

Helpful context:

Encryption provides confidentiality but not integrity. An adversary who cannot read a ciphertext can still modify it - flipping bits, reordering blocks, or splicing together parts of different ciphertexts - and the recipient may decrypt a meaningless or adversarially controlled message without knowing anything went wrong. The solution is a message authentication code: a keyed tag attached to the message that an adversary cannot forge without knowing the key.

Hash functions add another layer: collision-resistant functions that compress arbitrary-length inputs to short digests, enabling efficient commitments and combined authentication. The two ideas - keyed MACs and unkeyed collision-resistant hashing - interact in subtle ways. A naive attempt to combine them (just prepend the key to the message and hash) breaks in ways that are not immediately obvious. Building secure constructions requires understanding exactly what each primitive guarantees and where each one fails.

This post builds both from scratch. We start with the definition of a MAC and the precise security game it must win, examine several constructions that look plausible but break, build up to provably secure CBC-MAC variants, then develop collision-resistant hashing and the Merkle-Damgård transform before arriving at HMAC - the construction that ties everything together and eliminates the length extension problem that undermines simpler hash-based MACs.

Data Integrity

The data integrity problem: given a message $m$ and a tag $t$, verify that $m$ was not modified in transit by anyone who doesn’t know the secret key.

MAC definition. A message authentication code is a pair $(\text{Mac}, \text{Vrfy})$:

$\text{Mac}_k(m)$ produces a tag $t$.
$\text{Vrfy}_k(m, t)$ outputs accept or reject.
Correctness: $\text{Vrfy}_k(m, \text{Mac}_k(m)) = \text{accept}$ always.

Security (EUF-CMA: existential unforgeability under chosen message attack). The adversary can adaptively query $\text{Mac}_k(\cdot)$ on any messages of its choice. It wins if it produces a valid $(m^, t^)$ where $m^*$ was not previously queried. The advantage must be negligible for all PPT adversaries.

Problematic Taggings

Before building a secure MAC, it helps to see what breaks.

Issue 1: Fixed-prefix extension. Tag $m$ by computing $t = F_k(m)$ (one PRF evaluation). For variable-length messages, append blocks: $t = F_k(m_1) | F_k(m_2) | \cdots$. Attack: the adversary queries $\text{Mac}$ on separate single-block messages $m_1$ and $m_2$, getting $t_1 = F_k(m_1)$ and $t_2 = F_k(m_2)$. It then forges a tag for the two-block message $m_1 | m_2$ as $(t_1, t_2)$ - a valid concatenated tag it never queried.

Issue 2: Length extension. Many naive hash-based MACs suffer from length extension: given $H(k | m)$, an adversary can compute $H(k | m | \text{pad} | m')$ without knowing $k$, simply by continuing the hash computation from the internal state. This breaks any MAC of the form $H(k | m)$.

Issue 3: Prefix reuse in CBC-MAC. If the adversary knows $\text{Mac}_k(m) = t$, and the MAC is computed by running a CBC chain, it can often forge a tag for a related message by XOR-ing the first block with $t$ - because that cancels out the chaining and reuses the MAC computation.

CBC-MAC

Construction. For a fixed-length message $m = m_1 | m_2 | \cdots | m_\ell$ of $\ell$ blocks: $$t_0 = 0^n, \quad t_i = F_k(t_{i-1} \oplus m_i), \quad \text{Mac}k(m) = t\ell$$ Apply the block cipher in CBC mode, starting from the all-zeros IV, and output the final block.

Security. For a fixed message length $\ell$, CBC-MAC (also called raw CBC-MAC) is EUF-CMA secure assuming $F_k$ is a PRF. The proof uses the PRF security to replace $F_k$ with a random function, then argues directly that the random-function CBC chain is unpredictable.

Variable-Length CBC-MAC and the Attack

Problem. If we use the same key for messages of different lengths, CBC-MAC is broken.

Attack. Let $m = m_1$ be a single block and suppose $\text{Mac}_k(m_1) = t$. Now consider the two-block message $m_1 | (t \oplus m_1)$:

$t_1 = F_k(0 \oplus m_1) = t$
$t_2 = F_k(t \oplus (t \oplus m_1)) = F_k(m_1) = t$

The two-block message gets the same tag as the one-block message, and the adversary never queried it. This is a valid forgery.

The root cause: the chaining structure means a one-block MAC output perfectly predicts where the two-block computation will land.

Secure CBC-MAC for Variable Lengths

Fix 2: ECBC-MAC (encrypted CBC-MAC). Use two independent keys $k_1, k_2$. Compute the raw CBC-MAC under $k_1$, then encrypt the final tag with $F_{k_2}$: $\text{Mac} = F_{k_2}(\text{CBC-MAC}_{k_1}(m))$. The extra encryption destroys the algebraic structure that the variable-length attack exploits. ECBC-MAC is provably secure for variable-length messages.

Fix 3: CMAC (standardized variant). Use the length prefix approach with a specific padding scheme: pad the last block with 1 followed by zeros ($10^*$ padding) to make it distinguishable, then XOR with one of two subkeys derived from $k$ depending on whether the message was already block-aligned. CMAC is the NIST standard.

The general requirement: the MAC over messages of different lengths must ensure that a valid $(\ell, m)$ pair cannot help forge a tag for any different $(\ell', m')$. The internal prefix trick ($| \ell | m |$ or $| i | m |$) ensures this by making each call to the MAC function structurally distinct.

Collision-Resistant Hash Functions

A hash function $H: \{0,1\}^* \to \{0,1\}^n$ is collision-resistant if no PPT algorithm can find $x \neq x'$ with $H(x) = H(x')$ except with negligible probability.

Note: collisions must exist (infinitely many inputs, finite outputs - pigeonhole). The question is whether they can be found efficiently.

Hash functions are public (no key). They provide a short, fixed-size “fingerprint” of data: if $H(x) = H(y)$, the parties can check large files by comparing $n$-bit digests.

Applications of collision resistance. If $H$ is collision-resistant:

Commitments: commit to $x$ by publishing $H(x)$; reveal $x$ later.
Hash-and-sign: to sign a long message $m$, sign $H(m)$ instead. If the signature scheme is secure and $H$ is collision-resistant, no adversary can forge $H(m') = H(m)$ to substitute a different message.

Birthday Attack

The birthday paradox. If we evaluate $H$ on $q$ random inputs, the expected number of collisions becomes $\approx q^2 / 2^n$. A collision occurs with probability $\approx 1/2$ when $q \approx 2^{n/2}$.

Birthday attack on hash functions. An adversary finding a collision:

Sample $q = 2^{n/2}$ random messages.
Compute $H(m_i)$ for each.
By birthday bound, with probability $\approx 1/2$, two messages have the same hash.

This means an $n$-bit hash function gives only $n/2$ bits of collision resistance. SHA-256 (256-bit output) has 128-bit collision resistance; this is why MD5 (128-bit output, only 64-bit collision resistance) is broken for collision resistance.

Merkle-Damgård Transform

The MD transform builds a collision-resistant hash over arbitrary-length inputs from a collision-resistant compression function $h: \{0,1\}^{n+t} \to \{0,1\}^n$ (fixed input size).

Construction. Pad the message to a multiple of $t$ blocks. Append the length in the final padding block. Then: $$z_0 = \text{IV}, \quad z_i = h(z_{i-1} | m_i), \quad H(m) = z_\ell$$

Theorem. If $h$ is collision-resistant, then the MD hash is collision-resistant.

Proof sketch. Any collision $H(m) = H(m')$ with $m \neq m'$ implies a collision in $h$: trace back the chains to find the first block where they diverge. $\square$

SHA-1, SHA-2 (SHA-256, SHA-512) all use Merkle-Damgård. The weakness: MD hashes are vulnerable to length extension attacks - given $H(m)$, one can compute $H(m | \text{pad} | m')$ for any $m'$ by continuing the computation from the final state $z_\ell$. This is why $H(k | m)$ is not a secure MAC (the adversary can extend without knowing $k$).

SHA-3 (Keccak) uses a sponge construction instead of MD, eliminating length extension.

HMAC

Construction. Given a hash function $H$ and a key $k$: $$\text{HMAC}_k(m) = H\bigl((k \oplus \text{opad}) | H((k \oplus \text{ipad}) | m)\bigr)$$ where $\text{opad} = 0x5c5c\ldots$ and $\text{ipad} = 0x3636\ldots$ are fixed constants (the outer and inner padding).

Why this works. The inner hash $H((k \oplus \text{ipad}) | m)$ commits to both the key and the message. The outer hash $H((k \oplus \text{opad}) | \cdot)$ with a different key-derived value prevents length extension: even if an adversary can extend the inner hash, the outer key makes that useless. HMAC is provably secure under the assumption that $H$ is a PRF (or under weaker assumptions about the compression function).

HMAC-SHA256 is the standard MAC in TLS, SSH, and most modern protocols. It inherits SHA-256’s 128-bit collision resistance and, crucially, is not vulnerable to length extension.

Bit Commitment Protocol

A commitment scheme lets a sender commit to a value without revealing it, then reveal it later - with two properties:

Hiding: the commitment reveals nothing about the committed value.
Binding: the sender cannot open the commitment to a different value.

Hash-based construction. To commit to a bit $b$: choose random $r \leftarrow \{0,1\}^n$; send $c = H(b | r)$. To reveal: send $(b, r)$. The receiver checks $H(b | r) = c$.

Hiding: $H(b | r)$ is computationally indistinguishable from random (pseudorandom output of $H$).
Binding: opening to a different $b'$ requires finding $r'$ with $H(b' | r') = H(b | r)$ - a collision in $H$.

Bit commitment is a foundational primitive for zero-knowledge proofs, coin-flipping protocols, and contract signing.

Summary

Concept	Key idea
MAC (EUF-CMA)	Adversary with encryption oracle cannot forge a new tag
CBC-MAC (fixed length)	Secure for fixed $\ell$; broken for variable length
Variable-length attack	One-block MAC used to forge two-block tag
ECBC-MAC / length prefix	Fix variable-length by encrypting final block or prepending $\|\ell\|$
Collision resistance	Hard to find $x \neq x'$ with $H(x) = H(x')$
Birthday attack	$2^{n/2}$ evaluations find collision; halves effective security
Merkle-Damgård	CR compression function $\to$ CR hash over $\{0,1\}^*$
Length extension	MD hash leaks internal state; $H(k\|m)$ is insecure
HMAC	Double hash with key-derived padding; eliminates length extension
Commitment	Hiding + binding; collision resistance implies binding

Read next: