The Wavelet Transform - Frequency Analysis That Knows Where It Is
Helpful context:
- Fourier Analysis - Every Signal Is a Sum of Sines
- Sequences & Series - Infinite Sums That Sometimes Converge
The Fourier transform tells you which frequencies are present in a signal. But it has no idea when they occur.
Take a recording of a Beethoven piano sonata. It contains hundreds of frequencies - the fundamental tones, their harmonics, the attack transients when keys are struck, the decay as they fade. Compute the Fourier transform of the entire recording: you get a list of which frequencies are present. But a low note in the first bar and the same low note in the last bar produce the same Fourier coefficient. A cymbal crash in the second minute and a cymbal crash in the eighth minute are indistinguishable. The Fourier transform has erased all temporal information.
For many applications - seismology, speech recognition, audio processing, ECG analysis, data compression - you need to know both what frequency and when. A seismologist needs to know when the P-wave arrived. A speech recognizer needs to know when each phoneme begins. A cardiologist needs to know where in the heartbeat cycle the QRS complex sits. The Fourier transform cannot tell you any of this.
This is the time-frequency problem. And it has a fundamental mathematical obstacle.
Section 1: The Uncertainty Principle
This section establishes the mathematical bedrock before we describe any particular method. You need to understand the limit before you can appreciate why wavelets are the answer to it.
Before describing any method for joint time-frequency analysis, you need to understand why there is no perfect solution. Not because our methods are inadequate - because the mathematics itself forbids it.
Theorem (Time-frequency uncertainty). For any function $f(t)$ with Fourier transform $F(\omega)$, define the time spread:
$$(\Delta t)^2 = \frac{\int t^2 |f(t)|^2 dt}{\int |f(t)|^2 dt}$$
and the frequency spread:
$$(\Delta\omega)^2 = \frac{\int \omega^2 |F(\omega)|^2 d\omega}{\int |F(\omega)|^2 d\omega}.$$
Then:
$$\Delta t \cdot \Delta\omega \geq \frac{1}{2}.$$
You cannot simultaneously have small $\Delta t$ (localized in time) and small $\Delta\omega$ (localized in frequency). The product has an irreducible lower bound.
The extremal functions - those achieving equality - are Gaussians: $f(t) = e^{-t^2/2}$ has $\Delta t = \Delta\omega = 1/\sqrt{2}$, giving $\Delta t \cdot \Delta\omega = 1/2$.
The uncertainty principle has concrete manifestations:
- A pure sinusoid $e^{i\omega_0 t}$ has $\Delta\omega = 0$ (perfectly localized in frequency) but $\Delta t = \infty$ (completely delocalized in time).
- A delta function $\delta(t - t_0)$ has $\Delta t = 0$ (perfectly localized in time) but $\Delta\omega = \infty$ (flat frequency content).
- Any practical signal occupies a “cell” of area $\geq 1/2$ in the time-frequency plane.
This is not a limitation of any particular method. It is a theorem. Any method for time-frequency analysis is choosing how to partition the uncertainty budget - what time resolution to sacrifice for frequency resolution, and vice versa, at different locations in the signal.
Section 2: The Short-Time Fourier Transform
The natural first idea: take the Fourier transform of small segments of the signal.
Choose a window function $g(t)$ - a smooth, localized function that is concentrated near $t = 0$ and decays away from it. Typical choices: a Gaussian, or a tapered cosine (Hann window). The Short-Time Fourier Transform (STFT) is:
$$\text{STFT}(b, \omega) = \int_{-\infty}^{\infty} f(t) \overline{g(t - b)} e^{-i\omega t} dt.$$
The window $g(t-b)$ is centered at time $b$. For each pair $(b, \omega)$, you are computing how much of frequency $\omega$ is present in $f$ near time $b$. Sliding $b$ across the signal and computing for all $\omega$ produces a 2D map of the signal’s time-frequency content - a spectrogram.
The STFT works. It gives genuine time-frequency information. But it has a serious limitation.
Once you choose the window $g$, the time resolution and frequency resolution are fixed for the entire signal. The window has some time width $\sigma_t$ and some frequency width $\sigma_\omega$ (related by the uncertainty principle: $\sigma_t \sigma_\omega \approx 1/2$). Every STFT coefficient corresponds to a rectangular “Heisenberg box” of dimensions $\sigma_t \times \sigma_\omega$ in the time-frequency plane. The entire signal is analyzed at this one resolution.
For stationary signals (signals whose statistical properties do not change over time), this is fine. A sustained musical note has roughly the same frequency content throughout; a fixed resolution works.
For non-stationary signals with features at multiple scales, the fixed resolution is a problem. A seismic signal contains slow surface waves (lasting minutes, low frequency) and fast body waves (lasting milliseconds, high frequency). A fixed window that is long enough to resolve the surface wave frequencies will smear the body wave arrival time; a short window that captures the body wave arrival will blur the surface wave frequencies. There is no single window size that works for both.
The STFT is stuck. Its resolution is the same everywhere in the time-frequency plane, regardless of what the signal is doing.
Section 3: Why the STFT Resolution Is Fixed
Here is the precise statement of the STFT’s limitation.
The STFT coefficient $\text{STFT}(b, \omega)$ can be thought of as the inner product of $f$ with the function $g(t-b)e^{i\omega t}$ - a windowed complex exponential centered at time $b$ and oscillating at frequency $\omega$. In the time-frequency plane, this function occupies a “Heisenberg box” - a region of dimensions roughly $\sigma_t \times \sigma_\omega$ centered at $(b, \omega)$.
The dimensions $\sigma_t$ (time width) and $\sigma_\omega$ (frequency width) are fixed by the window $g$: $\sigma_t = \Delta t(g)$ and $\sigma_\omega = \Delta\omega(\hat{g})$. By the uncertainty principle, $\sigma_t \sigma_\omega \geq 1/2$.
The entire time-frequency plane is tiled with boxes of the same fixed size. Every frequency is analyzed with the same time window. This is the fundamental rigidity of the STFT.
In contrast, at a given frequency $\omega_0$, the relevant time scale for oscillations is $T_0 = 2\pi/\omega_0$. For high-frequency oscillations (large $\omega_0$, small $T_0$), a short time window captures many cycles and gives good frequency resolution. For low-frequency oscillations (small $\omega_0$, large $T_0$), a long time window is needed to see even one cycle. The STFT uses the same window everywhere, which means it is using either a too-long window for high frequencies or a too-short window for low frequencies - it cannot be optimal for both simultaneously.
Wavelets fix this by making the window duration proportional to the oscillation period: the wavelet at scale $a$ (corresponding to frequency $\sim\omega_0/a$) has duration proportional to $a$. As scale increases (lower frequency), the wavelet stretches; as scale decreases (higher frequency), the wavelet compresses. Each Heisenberg box has the same area (the uncertainty bound) but different aspect ratios - tall and narrow for low frequencies, short and wide for high frequencies.
Section 4: Wavelets - The Adaptive Solution
Wavelets replace the fixed window with analysis functions that automatically adapt their duration to the frequency being analyzed.
The key idea: at high frequencies, use a short, narrow analysis function (good time resolution - you care about when the high-frequency event happened). At low frequencies, use a long, wide analysis function (good frequency resolution - high-frequency components are short enough that temporal precision is less important, but you want to distinguish nearby low frequencies).
A wavelet $\psi(t)$ is a function satisfying two conditions:
-
It is localized in both time and frequency - concentrated near $t = 0$ and near some nonzero frequency $\omega_0$.
-
It has zero mean: $\int_{-\infty}^{\infty} \psi(t) dt = 0$.
The zero-mean condition is mandatory. In the Fourier domain, $\hat{\psi}(0) = \int \psi(t) dt = 0$. This means the wavelet has no DC component - it is a band-pass function that oscillates and integrates to zero.
Discomfort check. Why must a wavelet have zero mean? Consider what happens if $\int \psi(t) dt \neq 0$. Then $|\hat{\psi}(0)| > 0$: the wavelet has energy at frequency zero, meaning it responds to the constant (DC) level of the signal. But the constant level carries no information about change or oscillation - it is just the baseline. A wavelet is meant to detect oscillatory features, transients, and variations. It should be blind to the background level. The admissibility condition $\hat{\psi}(0) = 0$ ensures the wavelet ignores DC and responds only to variations. Without it, the wavelet transform would confound signal features with signal level.
Given a “mother wavelet” $\psi(t)$, generate a family by scaling (stretching or compressing) and translating (shifting):
$$\psi_{a,b}(t) = \frac{1}{\sqrt{a}}\psi\left(\frac{t-b}{a}\right), \qquad a > 0,; b \in \mathbb{R}.$$
The scale parameter $a$ controls the duration: large $a$ stretches $\psi$ (longer duration, lower frequency); small $a$ compresses $\psi$ (shorter duration, higher frequency). The translation parameter $b$ shifts the center time. The $1/\sqrt{a}$ normalization keeps $|\psi_{a,b}|_2 = |\psi|_2$ for all $a, b$.
Discomfort check. What is the relationship between scale $a$ and frequency? If $\psi(t)$ has most of its energy at frequency $\omega_0$, then $\psi(t/a)$ has most of its energy at frequency $\omega_0/a$. Smaller $a$ means the wavelet is compressed in time, so it oscillates more rapidly - higher frequency. Larger $a$ means the wavelet is stretched out - lower frequency. The scale parameter is inversely proportional to frequency. This is why wavelet analysis is sometimes called a “constant-$Q$” analysis: the ratio of center frequency to bandwidth is constant, whereas the STFT maintains constant absolute bandwidth.
Section 5: The Continuous Wavelet Transform
Definition. The Continuous Wavelet Transform (CWT) of $f(t)$ is:
$$W(a, b) = \langle f, \psi_{a,b}\rangle = \int_{-\infty}^{\infty} f(t) \psi_{a,b}^(t) dt = \frac{1}{\sqrt{a}}\int_{-\infty}^{\infty} f(t) \psi^\left(\frac{t-b}{a}\right) dt.$$
The coefficient $W(a, b)$ measures how much the signal $f$ at time $b$ resembles the wavelet at scale $a$.
Interpretation. Large $|W(a, b)|$ means $f$ has a feature near time $b$ that looks like $\psi$ at scale $a$ (i.e., at frequency $\sim \omega_0/a$). Small $|W(a, b)|$ means the signal does not have that feature at that scale and time.
Reconstruction. Under the admissibility condition $C_\psi = \int_0^{\infty} |\hat{\psi}(\omega)|^2/\omega d\omega < \infty$ (which is guaranteed when $\hat{\psi}(0) = 0$ and $\hat{\psi}$ is well-behaved), the original signal can be recovered:
$$f(t) = \frac{1}{C_\psi}\int_0^{\infty}\int_{-\infty}^{\infty} W(a,b) \psi_{a,b}(t) \frac{db da}{a^2}.$$
This is the inverse CWT. It says the signal is a continuous superposition of scaled-and-shifted wavelets, weighted by the CWT coefficients.
Energy conservation (CWT Parseval). $|f|^2 = \frac{1}{C_\psi}\int_0^{\infty}\int_{-\infty}^{\infty}|W(a,b)|^2 \frac{db da}{a^2}$.
The CWT is redundant - there are infinitely many $(a, b)$ pairs, far more than the original signal has degrees of freedom. This redundancy is often useful for analysis (you can see the structure at every scale and time smoothly), but wasteful for compression or computation.
Section 6: The Scalogram
The natural way to display the CWT output is as a two-dimensional plot in the $(b, a)$ plane, with the coefficient magnitude $|W(a,b)|$ shown as a color or intensity. This is the scalogram (or wavelet power spectrum).
Convention: typically the vertical axis shows frequency (high at top, low at bottom), so small scale (high frequency) is at the top and large scale (low frequency) at the bottom. Time runs along the horizontal axis.
Reading a scalogram: bright regions indicate where $|W(a,b)|$ is large - the signal has features at that time and scale. A horizontal bright band indicates a persistent oscillation at that frequency. A vertical bright band (cutting across all scales) indicates a sharp transient (like an impulse or step change). A bright blob localized in both time and frequency indicates a brief burst of oscillation.
Example: a chirp. A chirp signal increases in frequency over time: $f(t) = \cos(t^2/2)$. Its Fourier transform shows all frequencies from low to high but cannot tell you they increase with time. Its CWT scalogram shows a diagonal ridge: at early times, the bright region is at low frequency (large scale); at late times, the bright region is at high frequency (small scale). The scalogram reveals the time-frequency structure directly.
Example: speech. The scalogram of a spoken word shows vowels as horizontal bands (sustained harmonics at specific frequencies), consonants as brief wideband bursts, and silence as empty regions. The temporal evolution of frequency content - the formant structure - is immediately visible.
The scalogram trades resolution for insight. Because of the uncertainty principle, each point in the scalogram represents a region of area $\geq 1/2$ in the time-frequency plane. High-frequency details are localized in time but smeared in frequency; low-frequency content is localized in frequency but smeared in time. This is the inherent geometry of any time-frequency analysis.
Section 7: Multi-Resolution Analysis
To make the wavelet transform computationally practical and non-redundant, we need discrete sampling in the $(a, b)$ space, structured carefully. The right framework is multi-resolution analysis (MRA).
An MRA is a ladder of approximation spaces. The signal is analyzed at multiple scales simultaneously: at each scale, you have a coarse approximation and the detail (what you would need to add to the approximation to get to the next finer scale).
More precisely: an MRA consists of a sequence of nested closed subspaces $\ldots \subset V_{-1} \subset V_0 \subset V_1 \subset \cdots$ of $L^2(\mathbb{R})$, where:
- $\bigcup_j V_j$ is dense in $L^2(\mathbb{R})$ (approximations get arbitrarily good at fine scales).
- $\bigcap_j V_j = \{0\}$ (the coarsest approximation is trivial).
- $f(t) \in V_j \iff f(2t) \in V_{j+1}$ (finer scales are obtained by halving the scale).
- There is a scaling function (father wavelet) $\phi(t)$ such that $\{\phi(t-k) : k \in \mathbb{Z}\}$ forms an orthonormal basis for $V_0$.
The orthogonal complement of $V_j$ in $V_{j+1}$ is a space $W_j$ spanned by the shifted and scaled mother wavelet $\psi$. Decompose $V_{j+1} = V_j \oplus W_j$. Expand: $L^2(\mathbb{R}) = V_0 \oplus W_0 \oplus W_1 \oplus W_2 \oplus \cdots$ (or $\cdots \oplus W_{-2} \oplus W_{-1} \oplus W_0 \oplus \cdots$ for the doubly infinite case).
In words: any signal can be decomposed into a coarse approximation (in $V_0$) plus detail at successively finer scales ($W_0, W_1, \ldots$). The coarse approximation captures the “low-frequency content”; each detail level captures the “detail content” at the corresponding scale.
The scaling function $\phi$ and wavelet $\psi$ satisfy two refinement equations:
$$\phi(t) = \sqrt{2}\sum_{k} h[k] \phi(2t - k)$$
$$\psi(t) = \sqrt{2}\sum_{k} g[k] \phi(2t - k)$$
where $h[k]$ are low-pass filter coefficients and $g[k] = (-1)^k h[1-k]$ are high-pass filter coefficients (the quadrature mirror filter relation). The filter coefficients $h[k]$ completely determine both $\phi$ and $\psi$.
Section 8: The Discrete Wavelet Transform
The MRA framework directly leads to an efficient algorithm: the Discrete Wavelet Transform (DWT), implemented as a filter bank.
Decomposition. Given a discrete signal $x[n]$ at the finest scale:
- Convolve with the low-pass filter $h[k]$ and downsample by 2 (keep every other sample). Output: approximation coefficients $c[n]$ at the next coarser scale.
- Convolve with the high-pass filter $g[k]$ and downsample by 2. Output: detail coefficients $d[n]$ at the current scale.
- Apply the same two operations to the approximation coefficients $c[n]$ to get a coarser approximation and detail. Repeat.
After $J$ levels of decomposition, a signal of length $N$ produces:
- $N/2^J$ approximation coefficients (the coarsest picture)
- $N/2^j$ detail coefficients at each level $j = 1, 2, \ldots, J$
Total coefficients: $N/2^J + N/2^J + N/2^{J-1} + \cdots + N/2 = N$. The DWT is non-redundant.
Reconstruction. Reverse the process: upsample (insert zeros between samples) and convolve with reconstruction filters $\tilde{h}$ and $\tilde{g}$. Sum the approximation and detail at each level. After $J$ steps, you recover $x[n]$ exactly (perfect reconstruction, a key property of properly designed wavelet filter banks).
Discomfort check. Why is the DWT $O(N)$ even though it operates at $J$ levels? At the first level, process $N$ samples. At the second level, process $N/2$ approximation coefficients. At the third level, $N/4$. And so on. The total work is $N + N/2 + N/4 + \cdots + N/2^J \leq 2N$. This is $O(N)$, regardless of $J$. Compare to the FFT, which is $O(N\log N)$. The DWT is faster because each level of decomposition reduces the problem size by half. The filter bank is a recursion with geometrically decreasing workloads, not logarithmically.
The DWT is implemented in software as two standard convolution-and-downsample operations, repeated. It is fast, memory-efficient, and produces a hierarchical multi-scale representation.
Section 9: The Wavelet Coefficients in Practice
What does the output of a DWT actually look like? Consider a simple example: a signal of $N = 8$ samples, $x = [x_0, x_1, x_2, x_3, x_4, x_5, x_6, x_7]$.
With the Haar wavelet ($h = [1/\sqrt{2}, 1/\sqrt{2}]$, $g = [1/\sqrt{2}, -1/\sqrt{2}]$):
Level 1 decomposition.
- Approximation: $c_1[k] = (x_{2k} + x_{2k+1})/\sqrt{2}$. Average adjacent pairs. 4 values.
- Detail: $d_1[k] = (x_{2k} - x_{2k+1})/\sqrt{2}$. Difference adjacent pairs. 4 values.
Level 2 decomposition. Apply the same to the 4 approximation coefficients:
- Approximation $c_2$: 2 values (averages of adjacent pairs from $c_1$).
- Detail $d_2$: 2 values (differences of adjacent pairs from $c_1$).
Level 3 decomposition. Apply to the 2 remaining approximation coefficients:
- Approximation $c_3$: 1 value.
- Detail $d_3$: 1 value.
Final result: $1 + 1 + 2 + 4 = 8$ coefficients total. The wavelet coefficients are arranged as:
$$[c_3, d_3, d_2[0], d_2[1], d_1[0], d_1[1], d_1[2], d_1[3]].$$
The single $c_3$ coefficient is the overall average of the signal. The four $d_1$ coefficients capture the finest-scale variation (between adjacent samples). The two $d_2$ coefficients capture coarser variation. The one $d_3$ coefficient captures the coarsest detail.
If the signal is smooth (neighboring samples have similar values), the detail coefficients are small - they measure differences, and differences are small when the signal varies slowly. If the signal has a sharp feature, only the detail coefficients at the appropriate scale and location will be large. This is sparsity emerging directly from the algorithm structure.
Section 10: Sparsity - Why Wavelets Beat Fourier for Some Tasks
The Fourier transform represents signals in terms of sinusoids - functions that extend over all time with no spatial localization. A localized event (a click, a spike, a sharp edge) spreads its energy across all Fourier coefficients. You need many Fourier coefficients to accurately represent a localized feature.
Wavelets are localized in both time and frequency. A spike in the signal near time $b$ at scale $a$ is captured by the wavelet coefficient $W(a, b)$ with $a$ small and $b$ near the spike location. All other coefficients $W(a', b')$ with $b'$ far from the spike or $a'$ much different from $a$ will be near zero. The spike’s energy is concentrated in a few wavelet coefficients.
This is sparsity: the wavelet representation of a signal with localized features requires far fewer significant coefficients than its Fourier representation.
Why sparsity matters for compression. A compression algorithm typically:
- Computes some transform of the signal.
- Retains only the $k$ largest-magnitude coefficients; sets the rest to zero.
- Stores the positions and values of those $k$ coefficients.
- Reconstructs by inverting the transform.
If the representation is sparse (most energy in few coefficients), step 2 discards little energy and reconstruction is accurate. The more sparse the representation, the fewer coefficients you need and the higher the compression ratio.
For natural images: smooth regions are efficiently represented by low-frequency Fourier coefficients. But edges - boundaries between objects - create high-frequency content that spreads across all Fourier coefficients. JPEG’s DCT-based compression handles smooth regions well but produces “ringing” artifacts near edges. Wavelet-based compression (JPEG2000) represents smooth regions with coarse-scale wavelet coefficients and edges with fine-scale coefficients at the edge locations. The total number of significant coefficients is much smaller, especially at high compression ratios.
Concrete example. Consider an image with one sharp edge: white on the left, black on the right. Its Fourier representation has coefficients $1/(\pi k)$ for all odd $k$ - slowly decaying, requiring many terms for accurate reconstruction. Its 1D Haar wavelet representation has one coefficient at the scale corresponding to the edge width and at the edge location, with all other coefficients zero. The wavelet representation is exactly $2$ significant coefficients; the Fourier representation requires $O(N)$.
Section 11: Vanishing Moments - A Closer Look
The concept of vanishing moments is central to understanding wavelet efficiency.
A wavelet $\psi$ has $M$ vanishing moments if:
$$\int_{-\infty}^{\infty} t^k \psi(t) dt = 0 \quad \text{for } k = 0, 1, \ldots, M-1.$$
We already know $k = 0$ (zero mean) is required. Vanishing moments for $k \geq 1$ are extra.
Why vanishing moments matter. Suppose $f(t)$ is smooth near the point $b$. Expand in a Taylor series: $f(t) = f(b) + f'(b)(t-b) + \frac{f''(b)}{2!}(t-b)^2 + \cdots$. The wavelet coefficient at scale $a$ and time $b$ is:
$$W(a,b) = \frac{1}{\sqrt{a}}\int f(t)\psi^*\left(\frac{t-b}{a}\right) dt.$$
Substitute the Taylor expansion. The zero-mean condition kills the constant term: $\int \psi^* dt = 0$ means the $f(b)$ term contributes zero. If $\psi$ has $M$ vanishing moments, then the terms for $k = 0, 1, \ldots, M-1$ all vanish. The first surviving term is order $a^{M+1/2}$ (after scaling). For small $a$ (fine scale, high frequency), $W(a,b) \to 0$ very fast.
In practical terms: near a smooth region, the detail wavelet coefficients are extremely small (order $a^{M+1/2}$). Near a discontinuity (where the Taylor expansion breaks down), the coefficients are large. More vanishing moments means smoother functions have smaller coefficients, giving sparser representations.
Consequences for filter design. More vanishing moments require longer filters (more coefficients). This is a tradeoff: Haar has 1 vanishing moment and 2 filter coefficients. $D_4$ has 2 vanishing moments and 4 filter coefficients. $D_8$ has 4 vanishing moments and 8 coefficients. The longer the filter, the smoother the wavelet and the sparser the representation for smooth signals - but also the more computation per level.
Discomfort check. Why do vanishing moments affect smoothness? The connection is through the two-scale relation: $\phi(t) = \sqrt{2}\sum_k h[k]\phi(2t-k)$. The scaling function’s smoothness is controlled by the zeros of the filter $H(e^{j\omega}) = \sum_k h[k]e^{-j\omega k}$ at $\omega = \pi$. Each vanishing moment contributes one zero at $\omega = \pi$: $M$ vanishing moments means $H(e^{j\pi}) = H'(e^{j\pi}) = \cdots = H^{(M-1)}(e^{j\pi}) = 0$. More zeros at $\pi$ means the high-pass filter rolls off more steeply, and the resulting wavelet is smoother.
Section 12: Common Wavelets
The choice of wavelet affects the sparsity of the representation, the smoothness of reconstruction, and the computational properties of the transform.
Haar wavelet. The oldest and simplest wavelet:
$$\psi_{\text{Haar}}(t) = \begin{cases} 1 & 0 \leq t < 1/2 \\ -1 & 1/2 \leq t < 1 \\ 0 & \text{otherwise} \end{cases}$$
A step function that goes up then immediately down. The corresponding scaling function is $\phi(t) = \mathbf{1}_{[0,1)}$. The filter coefficients: $h = [1/\sqrt{2}, 1/\sqrt{2}]$ (low-pass: average adjacent pairs), $g = [1/\sqrt{2}, -1/\sqrt{2}]$ (high-pass: difference adjacent pairs). The DWT with Haar wavelets is just iterative averaging and differencing.
Haar is orthogonal, has compact support (nonzero only on $[0,1]$), and is the simplest possible wavelet. Its weakness: it is not continuous - the scaling function and wavelet have discontinuities. This means the Haar DWT does not produce sparse representations for smooth signals (the discontinuous wavelet does not match smooth regions well). It is good for piecewise-constant signals and for understanding DWT algorithms.
Daubechies wavelets ($D_{2N}$). Ingrid Daubechies (1988) constructed a family of wavelets with compact support, orthogonality, and the maximum number of vanishing moments for their support size.
A wavelet $\psi$ has $M$ vanishing moments if $\int t^k \psi(t) dt = 0$ for $k = 0, 1, \ldots, M-1$. More vanishing moments means the wavelet coefficients of smooth functions are smaller: if $f$ is $M$-times differentiable, then wavelet coefficients at fine scale $a$ decay as $a^{M+1/2}$. More vanishing moments produce sparser representations for smooth signals.
The $D_4$ wavelet (Daubechies-4, using a 4-tap filter) has 2 vanishing moments and is the most common general-purpose choice. It is smoother than Haar (it is continuous) while remaining compactly supported. The $D_6$, $D_8$, $\ldots$ wavelets have progressively more vanishing moments and smoother shapes.
Morlet wavelet. A complex sinusoid modulated by a Gaussian:
$$\psi_{\text{Morlet}}(t) = \pi^{-1/4} e^{i\omega_0 t} e^{-t^2/2}$$
where $\omega_0$ is typically set to $5$ or $6$ radians (to satisfy the admissibility condition approximately). The Morlet wavelet is not compactly supported and does not generate an MRA, so it is used only for the continuous wavelet transform.
The Gaussian envelope gives the Morlet wavelet optimal time-frequency localization (it achieves the uncertainty bound $\Delta t \cdot \Delta\omega = 1/2$). The sinusoidal carrier makes it well-suited for analyzing oscillatory signals - the CWT with a Morlet wavelet is closely related to the STFT with a Gaussian window.
Morlet wavelets are widely used in neuroscience (EEG, fMRI data analysis), seismology, and audio analysis. Anywhere you want a smooth, interpretable time-frequency map.
Biorthogonal wavelets. Used in JPEG2000. They relax orthogonality (the analysis and reconstruction filters are different but dual) in order to achieve symmetric filters. Symmetric filters have linear phase - all frequencies are delayed by the same number of samples, which preserves the shape of transients. This is important for image compression, where phase distortion produces visible artifacts.
The specific biorthogonal wavelet used in JPEG2000 (the CDF 9/7 wavelet) has 4 and 4 vanishing moments for the analysis and synthesis filters respectively, providing good sparsity for natural images.
Section 13: Applications
JPEG2000. The successor to JPEG uses the 2D DWT with the CDF 9/7 biorthogonal wavelet. The image is decomposed into coarse approximation and multi-scale detail subbands. Coefficients are quantized, with more precision given to low-frequency (coarse) subbands. The quantized coefficients are entropy-coded. JPEG2000 outperforms JPEG at low bitrates: at $0.1$ bits per pixel, JPEG2000 produces recognizable images while JPEG produces severe blocking artifacts. JPEG2000 is used in digital cinema (DCI standard), medical imaging (DICOM), and satellite imagery.
Signal denoising. Model the signal as $y = f + n$ where $f$ is the true signal and $n$ is white noise. In the wavelet domain, $f$ concentrates its energy in a few large coefficients (by sparsity); white noise spreads its energy uniformly across all coefficients with roughly equal magnitude ($\sigma/\sqrt{N}$ for noise power $\sigma^2$). Soft thresholding: for each wavelet coefficient $W$, set $\hat{W} = \text{sign}(W)\max(|W| - \lambda, 0)$. Reconstruct from the thresholded coefficients. Coefficients below $\lambda$ are noise; coefficients above $\lambda$ are signal. The Donoho-Johnstone theorem (1994) shows that universal thresholding at $\lambda = \sigma\sqrt{2\log N}$ achieves near-optimal denoising: the mean squared error is within a logarithmic factor of the best possible.
ECG analysis. The electrocardiogram records the electrical activity of the heart over time. The QRS complex - a sharp, localized waveform representing ventricular depolarization - is the most important feature. It lasts roughly $80$-$120$ milliseconds and contains energy at $5$-$50$ Hz. Wavelet analysis at the appropriate scale captures the QRS complex cleanly, suppressing the lower-frequency baseline wander (scale too large) and higher-frequency muscle noise (scale too small). The precise timing and shape of the QRS complex, extracted by wavelet analysis, is used to detect arrhythmias, measure heart rate variability, and diagnose cardiac conditions.
Seismic analysis. Seismic signals from earthquakes contain body waves (P- and S-waves, arriving first, high frequency, short duration) and surface waves (arriving later, lower frequency, longer duration). The Fourier transform cannot separate these temporally because they overlap in time. The CWT with a Morlet wavelet produces a scalogram (plot of $|W(a,b)|^2$) showing body waves as high-frequency bursts at their arrival times, and surface waves as lower-frequency sustained energy. The arrival time of the P-wave, identified from the scalogram, is used to locate the earthquake’s epicenter.
Section 14: Why JPEG Uses DCT, Not DWT
A natural question: why does JPEG use the Discrete Cosine Transform (DCT) rather than the DWT? And why does JPEG2000 switch to wavelets?
The DCT is the DFT applied to a symmetrically extended signal (which produces real-valued coefficients). Applied to $8 \times 8$ image blocks, the DCT concentrates most energy in the upper-left corner (low-frequency components); the high-frequency components are typically near zero for natural images. Quantizing these coefficients (rounding to fewer bits) and discarding near-zero ones gives JPEG compression.
The DCT has excellent energy compaction for smooth signals - it achieves near-optimal compression for Markov random field image models. Its weakness: it operates on independent $8 \times 8$ blocks. At high compression ratios, you can see block boundaries in the compressed image (the “blocky” JPEG artifact). The blocks do not share information, so sharp edges that cross block boundaries are not handled efficiently.
The DWT operates on the entire image simultaneously. There are no block boundaries and no blocking artifacts. The wavelet representation captures edges as localized fine-scale coefficients regardless of where they fall in the image. At high compression, JPEG2000 produces “blurry” degradation (smooth fine-scale details are discarded) rather than “blocky” degradation. Perceptually, blurry is better than blocky at the same compression ratio.
The tradeoff: the DWT is computationally more complex to implement efficiently, especially for arbitrary image sizes (you need to handle boundaries carefully). The DCT on $8 \times 8$ blocks is simple, fast, and well-suited to hardware implementation. This is why JPEG persists in practice despite JPEG2000’s superior quality: the ecosystem and hardware support for JPEG is enormous.
Section 15: Thresholding in Detail
The thresholding procedure for denoising deserves a careful look because it illustrates how sparsity works in practice.
Given noisy observations $y = f + n$ where $n$ is white Gaussian noise with variance $\sigma^2$, compute the DWT: $\mathbf{W} = \mathbf{W}y = \mathbf{W}f + \mathbf{W}n$. The noise $\mathbf{W}n$ has the same distribution as $n$ (the DWT is orthogonal, so it preserves white Gaussianness). The signal $\mathbf{W}f$ is concentrated in a few large coefficients (by sparsity of $f$ in the wavelet domain).
Hard thresholding. Set coefficient $w$ to zero if $|w| < \lambda$, keep unchanged if $|w| \geq \lambda$:
$$\hat{w} = w \cdot \mathbf{1}[|w| \geq \lambda].$$
Removes small coefficients that are likely noise, keeps large ones that are likely signal.
Soft thresholding (shrinkage). Shrink all coefficients toward zero:
$$\hat{w} = \text{sign}(w)\max(|w| - \lambda, 0).$$
This produces a continuous estimator (unlike hard thresholding, which has a jump at $\pm\lambda$) and corresponds to minimizing $|y - f|^2/2 + \lambda|f|_{\ell^1}$ (Lasso regression in the wavelet domain).
Universal threshold. The Donoho-Johnstone universal threshold is $\lambda = \sigma\sqrt{2\log N}$ where $N$ is the signal length and $\sigma$ is the noise standard deviation (estimated from the finest-scale wavelet coefficients). At this threshold, the probability that any noise coefficient exceeds $\lambda$ is less than $1/N$ - so with high probability, all retained coefficients are signal. The resulting estimator achieves mean squared error within a logarithmic factor of the oracle estimator (which knows the true signal location).
In practice: estimate $\sigma$ from the median absolute deviation of the finest-scale detail coefficients $d_1[k]$: $\hat{\sigma} = \text{median}(|d_1[k]|)/0.6745$. Apply soft thresholding to all detail coefficients except the finest scale (to avoid over-smoothing transients). Invert the DWT. The result is a denoised estimate of $f$.
Section 16: Compressed Sensing - A Brief Detour
Wavelet sparsity has implications that go beyond compression. Compressed sensing (Donoho, Candès, Tao, ~2005) makes a startling claim: if a signal has a sparse representation in some basis, you can reconstruct it exactly from far fewer measurements than the Nyquist-Shannon sampling theorem would suggest.
The idea: you have a signal $f$ that has only $k$ nonzero coefficients in the wavelet domain - the other $N - k$ are zero. Naively, you would need $N$ samples to determine $f$. But it can be shown that with $O(k\log N)$ random linear measurements (projections of $f$ onto random vectors), you can recover $f$ exactly via convex optimization (specifically, $\ell^1$ minimization).
Why $\ell^1$? The $\ell^1$ norm $|c|_1 = \sum |c_k|$ promotes sparsity in optimization because the level sets of $|c|_1$ are cross-polytopes (diamond shapes in 2D, octahedra in 3D) - they have corners on the coordinate axes. An optimization problem minimizing $\ell^1$ subject to constraints tends to find solutions where many coordinates are exactly zero.
Wavelet sparsity is the key to making compressed sensing work for natural signals. Seismic imaging uses compressed sensing with wavelet-sparse signal models to reconstruct subsurface structure from far fewer measurements than traditional methods require. MRI uses compressed sensing with wavelet sparsity to reduce scan time (fewer $k$-space measurements, faster scans, less patient discomfort).
This represents a shift in the philosophy of signal acquisition: instead of sampling at the Nyquist rate and then compressing, you acquire fewer measurements from the start, choosing them in a way that enables reconstruction of the sparse wavelet representation.
Section 17: Connection to Deep Learning
One of the most intellectually satisfying results in modern machine learning is the connection between wavelets and convolutional neural networks.
CNN filters look like wavelets. When a CNN trained on images is examined, the filters learned in the first convolutional layer look remarkably like wavelets and Gabor functions - they are localized in space, oriented (sensitive to edges and gratings at particular angles), and appear at multiple scales. This is not surprising in retrospect. CNNs solve tasks that require detecting edges, textures, and shapes at multiple scales. Wavelets are the optimal tool for exactly these tasks. The network learns filters that are near-optimal for the data; the data requires wavelet-like filters.
Scattering transforms. Stéphane Mallat (2012) introduced the scattering network: a deep architecture in which the learned filters are replaced by fixed, predefined wavelets. The architecture is:
- Apply wavelet filters at multiple scales and orientations to produce wavelet coefficients.
- Apply the absolute value (modulus) to extract the magnitude, making the representation invariant to small shifts.
- Apply another round of wavelets to the magnitudes and take absolute values again.
- Repeat for several layers; average at the end.
The scattering transform produces feature representations that are:
- Invariant to translations (by design of the averaging).
- Stable to small deformations: if the input is deformed by a diffeomorphism close to the identity, the output changes by at most a constant times the deformation magnitude. This stability guarantee is mathematically proven.
- Informative: the scattering representation retains enough information to classify images accurately.
The stability to deformations is essentially absent in standard CNNs (which can be fooled by small adversarial perturbations). The scattering transform sacrifices some flexibility (fixed rather than learned filters) and achieves provable robustness in return. Scattering networks work well on small datasets where learned filters overfit, and they have been successfully applied to audio classification, texture recognition, and molecular property prediction.
Summary
| Concept | Content |
|---|---|
| Time-frequency problem | Need both when and what frequency; Fourier gives frequency but no time |
| Uncertainty principle | $\Delta t \cdot \Delta\omega \geq 1/2$; fundamental, not a method limitation |
| STFT | Fixed window; fixed resolution everywhere; good for stationary signals |
| Wavelet $\psi(t)$ | Localized in time and frequency; zero mean required |
| Zero mean | $\hat{\psi}(0) = 0$; wavelet is blind to DC; responds only to variations |
| Scale $a$ | Smaller $a$ = compressed wavelet = higher frequency; larger $a$ = lower frequency |
| Translation $b$ | Center time of the wavelet; where in time you are analyzing |
| CWT | $W(a,b) = \frac{1}{\sqrt{a}}\int f(t)\psi^*\left(\frac{t-b}{a}\right)dt$ |
| Admissibility | $\hat{\psi}(0) = 0$ ensures reconstruction formula converges |
| MRA | Ladder of approximation spaces; signal = coarse part + multi-scale details |
| DWT | Filter bank: low-pass for approximation, high-pass for detail; repeated |
| DWT complexity | $O(N)$: geometric series of subband sizes adds to $2N$ |
| Sparsity | Localized features concentrate in few wavelet coefficients; enables compression |
| Haar | Simplest wavelet; step function; orthogonal; discontinuous |
| Daubechies $D_{2N}$ | Smooth; compact support; $N$ vanishing moments; general purpose |
| Morlet | Complex sinusoid $\times$ Gaussian; optimal uncertainty; CWT analysis |
| Biorthogonal | Symmetric filters; linear phase; used in JPEG2000 |
| Applications | JPEG2000, denoising, ECG, seismology, scattering networks |
| Compressed sensing | Sparse wavelet representation enables exact recovery from $O(k\log N)$ measurements |
| Scalogram | 2D plot of $|W(a,b)|^2$ vs. time $b$ and scale $a$; visual time-frequency map |
| Vanishing moments | $\int t^k\psi(t) dt = 0$ for $k < M$; controls coefficient decay near smooth regions |
| Frequency warping | STFT has constant absolute bandwidth; wavelets have constant relative bandwidth ($Q$-factor) |
Three Distinctions That Matter
Looking back at the full arc from Fourier to wavelets, three contrasts deserve a clear statement.
Fixed basis vs. adaptive analysis. The Fourier transform uses a fixed basis: the complex exponentials, which are the same regardless of what signal you give it. The CWT uses a family indexed by two parameters $(a, b)$ - you look at the signal at every scale and every time location. This is adaptive in the sense that the coefficients tell you where interesting things happen, even if the basis functions themselves are fixed.
Global vs. local analysis. A Fourier coefficient $\hat{f}(\omega)$ depends on $f(t)$ for all $t$ - it is inherently global. A wavelet coefficient $W(a, b)$ depends primarily on $f(t)$ near time $b$ at scale $a$ - it is inherently local. This locality is what makes wavelets useful for detecting transients, edges, and events.
Redundant vs. non-redundant. The CWT is highly redundant: you compute $W(a, b)$ for all $(a, b) \in \mathbb{R}^+ \times \mathbb{R}$, producing a 2D function from a 1D signal. The DWT is non-redundant: exactly $N$ input samples produce exactly $N$ output coefficients. Redundancy is useful for analysis and interpretation; non-redundancy is essential for compression and storage.
The right tool depends on your goal. For compression: DWT. For analysis of oscillatory signals with unknown timing: CWT. For stationary signals where only frequency content matters: Fourier. For signals that are locally stationary within windows: STFT. The transforms are not competing alternatives; they are different tools suited to different aspects of the time-frequency tradeoff.
A signal processing practitioner should think of the uncertainty principle not as a barrier but as a guide. Every analysis method is a choice of how to tile the time-frequency plane - what shape of Heisenberg boxes to use and how to arrange them. Fixed-size rectangles (STFT), dyadic rectangles of varying aspect ratios (DWT), or continuous families (CWT): each choice is optimal for different signal structures. Knowing which structure your signal has is the most important decision in time-frequency analysis.
Read next: