Wavelet Transform
Prerequisite:
Where Fourier Analysis Falls Short
The Fourier transform tells us which frequencies are present in a signal - but not when. For a stationary signal like a pure tone, this is fine. For a non-stationary signal - speech that transitions between phonemes, seismic data with a transient spike, an ECG with a brief arrhythmia - knowing that a certain frequency exists somewhere in the signal is almost useless.
Consider two signals of length $N$: one that plays a 440 Hz tone for the first half and a 880 Hz tone for the second half, and one that plays both simultaneously throughout. Their Fourier magnitudes are identical. The Fourier transform has destroyed all temporal information.
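This is easy to check numerically; a quick NumPy sketch (the sampling rate and tone layout are illustrative choices):

```python
import numpy as np

fs, N = 8000, 8000                       # 1 second at 8 kHz -> 1 Hz frequency bins
t = np.arange(N) / fs

# Signal A: 440 Hz for the first half-second, 880 Hz for the second.
a = np.where(t < 0.5, np.sin(2*np.pi*440*t), np.sin(2*np.pi*880*t))
# Signal B: both tones at half amplitude, sustained throughout.
b = 0.5 * (np.sin(2*np.pi*440*t) + np.sin(2*np.pi*880*t))

# Both magnitude spectra have their two dominant peaks at exactly the
# 440 Hz and 880 Hz bins: the spectrum alone cannot tell the signals apart.
peaks = [sorted(int(k) for k in np.argsort(np.abs(np.fft.rfft(x)))[-2:])
         for x in (a, b)]
print(peaks)   # [[440, 880], [440, 880]]
```

The time-domain signals are obviously different, yet their dominant frequency content is the same - the "when" has been integrated away.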
Short-Time Fourier Transform: A Partial Fix
The Short-Time Fourier Transform (STFT) applies a sliding window $g(t)$ to the signal before Fourier-transforming:
$$\text{STFT}_f(\tau, \xi) = \int_{-\infty}^\infty f(t)\,\overline{g(t-\tau)}\,e^{-2\pi i \xi t}\,dt$$
By localizing the window around time $\tau$, we get a local frequency estimate. The result is a 2D time-frequency representation sometimes called a spectrogram.
The problem is that the window is fixed. A wide window gives good frequency resolution but poor time resolution; a narrow window gives good time resolution but poor frequency resolution. This is a direct consequence of the uncertainty principle: no window can achieve high resolution in both dimensions simultaneously. In the time-frequency plane, each STFT coefficient corresponds to a rectangle (Heisenberg box) of fixed area.
STFT: uniform Heisenberg boxes
Frequency
|
| [--][--][--][--][--]
| [--][--][--][--][--]
| [--][--][--][--][--]
+--+--+--+--+--+--+---> Time
All boxes same size. Fixed trade-off.
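A minimal STFT sketch in NumPy makes the trade-off concrete (the helper `stft_mag`, the Hann window, and the 512-sample window/hop are illustrative choices, not a canonical implementation):

```python
import numpy as np

def stft_mag(x, fs, win_len, hop):
    """Magnitude STFT: slide a Hann window, FFT each frame."""
    win = np.hanning(win_len)
    frames = [x[i:i + win_len] * win
              for i in range(0, len(x) - win_len + 1, hop)]
    return np.abs(np.fft.rfft(frames, axis=1))   # shape: (n_frames, n_bins)

fs = 8000
t = np.arange(8000) / fs
x = np.where(t < 0.5, np.sin(2*np.pi*440*t), np.sin(2*np.pi*880*t))

S = stft_mag(x, fs, win_len=512, hop=512)
# Dominant bin per frame, converted to Hz (bin width = fs / win_len).
f_peak = S.argmax(axis=1) * fs / 512
print(f_peak)   # early frames near 440 Hz, late frames near 880 Hz
```

With a 512-sample window at 8 kHz each coefficient spans 15.6 Hz in frequency and 64 ms in time; shrinking the window would sharpen the timing of the transition at the cost of coarser frequency bins. That fixed trade-off is exactly the uniform-box picture above.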
The wavelet transform breaks this constraint by using boxes of varying shape.
The Continuous Wavelet Transform
Mother Wavelet
A mother wavelet $\psi \in L^2(\mathbb{R})$ must satisfy the admissibility condition:
$$C_\psi = \int_{-\infty}^\infty \frac{|\hat{\psi}(\xi)|^2}{|\xi|}\,d\xi < \infty$$
A necessary consequence of admissibility is that $\psi$ must have zero mean:
$$\int_{-\infty}^\infty \psi(t)\,dt = 0$$
This is why the wavelet “wiggles” - it must average to zero. It is a small wave, which is the etymology of “wavelet.”
Definition
The Continuous Wavelet Transform (CWT) of a signal $f \in L^2(\mathbb{R})$ with respect to mother wavelet $\psi$ is
$$W_f(a, b) = \frac{1}{\sqrt{|a|}}\int_{-\infty}^\infty f(t)\,\overline{\psi\!\left(\frac{t-b}{a}\right)}\,dt$$
where $a \in \mathbb{R}\setminus\{0\}$ is the scale parameter and $b \in \mathbb{R}$ is the translation parameter. The factor $1/\sqrt{|a|}$ normalizes the dilated wavelet to have constant $L^2$ norm.
Large $|a|$ gives a stretched wavelet sensitive to low-frequency (coarse) structure; small $|a|$ gives a compressed wavelet sensitive to high-frequency (fine) structure. This is multi-resolution analysis in action.
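A direct (and deliberately naive) discretization of the CWT integral - sampling the dilated wavelet and correlating at every shift $b$ - can be sketched in NumPy. The `ricker` and `cwt` helpers and the roughly $\pm 5a$ truncation of the wavelet's support are our own illustrative choices:

```python
import numpy as np

def ricker(t, sigma=1.0):
    """Mexican hat (Ricker) wavelet, L2-normalized."""
    c = 2 / (np.sqrt(3 * sigma) * np.pi**0.25)
    return c * (1 - (t / sigma)**2) * np.exp(-t**2 / (2 * sigma**2))

def cwt(x, scales, dt=1.0):
    """Naive CWT: correlate the signal with the scaled wavelet at each scale.
    Returns W with shape (len(scales), len(x))."""
    W = np.empty((len(scales), len(x)))
    for i, a in enumerate(scales):
        M = int(10 * a / dt) | 1                  # odd-length support, about +/-5a
        tt = (np.arange(M) - M // 2) * dt
        psi = ricker(tt / a) / np.sqrt(a)         # (1/sqrt(a)) psi(t/a), real-valued
        W[i] = np.convolve(x, psi[::-1], mode='same') * dt   # correlation
    return W

# An impulse: |W(a, b)| is maximized at the impulse location at every scale.
x = np.zeros(256)
x[100] = 1.0
W = cwt(x, scales=[2.0, 4.0, 8.0])
print([int(np.abs(row).argmax()) for row in W])   # [100, 100, 100]
```

Because the Ricker wavelet is real and even, the reversal in the correlation is a no-op here; for a complex wavelet like the Morlet one would also conjugate.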
Reconstruction (inverse CWT):
$$f(t) = \frac{1}{C_\psi}\int_0^\infty\int_{-\infty}^\infty W_f(a,b)\,\psi\!\left(\frac{t-b}{a}\right)\frac{da\,db}{a^2}$$
Wavelet: varying Heisenberg boxes
Frequency
|
| [----][----] <- large scale: wide time, low freq
| [--][--][--]
| [-][-][-][-]
| [][][][][][] <- small scale: narrow time, high freq
+--+--+--+--+--+--> Time
Area of each box is constant (uncertainty principle),
but shape adapts to scale.
Examples of Wavelets
Haar Wavelet
The simplest wavelet, piecewise constant:
$$\psi_{\text{Haar}}(t) = \begin{cases} 1 & 0 \leq t < 1/2 \\ -1 & 1/2 \leq t < 1 \\ 0 & \text{otherwise} \end{cases}$$
The Haar wavelet detects step discontinuities. It is compactly supported, and the orthonormal family of its dyadic dilates and translates forms an unconditional basis for $L^2([0,1])$.
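One level of the orthonormal Haar transform amounts to paired averages and differences; a minimal sketch showing how the detail coefficients flag a step:

```python
import numpy as np

def haar_dwt(x):
    """One level of the orthonormal Haar DWT (len(x) must be even)."""
    x = np.asarray(x, dtype=float)
    even, odd = x[0::2], x[1::2]
    cA = (even + odd) / np.sqrt(2)   # local averages (approximation)
    cD = (even - odd) / np.sqrt(2)   # local differences (detail)
    return cA, cD

# A step signal: only the pair straddling the jump has a nonzero detail.
x = np.concatenate([np.ones(7), 3 * np.ones(9)])   # jump between x[6] and x[7]
cA, cD = haar_dwt(x)
print(np.nonzero(cD)[0])   # [3]: the pair (x[6], x[7]) straddles the jump
```

A jump that lands exactly between two pairs is invisible at this level but shows up in the details of some coarser level of the decomposition.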
Morlet Wavelet
A Gaussian-modulated complex sinusoid:
$$\psi_{\text{Morlet}}(t) = e^{i\omega_0 t}\,e^{-t^2/2}$$
(with a small correction term to enforce zero mean). The Morlet wavelet gives excellent time-frequency localization and is widely used in neuroscience and geophysics.
Mexican Hat (Ricker) Wavelet
The negative of the second derivative of a Gaussian, suitably normalized:
$$\psi_{\text{MH}}(t) = \frac{2}{\sqrt{3\sigma}\pi^{1/4}}\left(1 - \frac{t^2}{\sigma^2}\right)e^{-t^2/(2\sigma^2)}$$
Its shape resembles a Mexican hat in profile. It is the $L^2$-normalized one-dimensional analogue of the Laplacian-of-Gaussian operator, making it natural for edge detection.
Daubechies Wavelets
The Daubechies family $\psi_{D_N}$ (for $N = 1, 2, \ldots$) achieves compactly supported, orthonormal wavelets with $N$ vanishing moments ($\int t^k\,\psi(t)\,dt = 0$ for $k = 0, \ldots, N-1$). The $D_1$ wavelet is the Haar wavelet. Higher-order Daubechies wavelets are increasingly smooth; the closely related biorthogonal Cohen-Daubechies-Feauveau wavelets are the basis for JPEG 2000.
Multi-Resolution Analysis (MRA)
Definition. A Multi-Resolution Analysis of $L^2(\mathbb{R})$ is a sequence of closed subspaces $\{V_j\}_{j\in\mathbb{Z}}$ satisfying:
- $\cdots \subset V_{-1} \subset V_0 \subset V_1 \subset \cdots$
- $\overline{\bigcup_j V_j} = L^2(\mathbb{R})$ and $\bigcap_j V_j = \{0\}$
- $f(t) \in V_j \iff f(2t) \in V_{j+1}$ (scaling consistency)
- There exists a scaling function $\phi \in V_0$ such that $\{\phi(t-k)\}_{k\in\mathbb{Z}}$ is an orthonormal basis for $V_0$.
The detail spaces $W_j$ are orthogonal complements: $V_{j+1} = V_j \oplus W_j$. Then $W_j \perp W_k$ for $j \neq k$, and
$$L^2(\mathbb{R}) = \bigoplus_{j \in \mathbb{Z}} W_j$$
The wavelet $\psi$ generates orthonormal bases for each $W_j$ via translation and scaling: $\{\psi_{j,k}(t) = 2^{j/2}\psi(2^j t - k)\}_{k\in\mathbb{Z}}$ is an orthonormal basis for $W_j$.
Theorem. Given an MRA with scaling function $\phi$, the associated wavelet $\psi$ defined by the relation $\psi(t) = \sqrt{2}\sum_k (-1)^k\,\overline{h[1-k]}\,\phi(2t-k)$ (where $h[k]$ are the scaling filter coefficients) satisfies the admissibility condition and generates an orthonormal wavelet basis for $L^2(\mathbb{R})$.
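For the Haar MRA this construction can be checked directly: with $\phi = \mathbf{1}_{[0,1)}$ and $h = (1/\sqrt{2},\, 1/\sqrt{2})$, the two-scale relation (with the $\sqrt{2}$ normalization that makes the basis orthonormal) reproduces the Haar wavelet. A numerical sketch:

```python
import numpy as np

# Haar scaling function (indicator of [0,1)) and its scaling filter.
phi = lambda t: ((t >= 0) & (t < 1)).astype(float)
h = np.array([1, 1]) / np.sqrt(2)

t = np.linspace(0, 1, 1000, endpoint=False)
# psi(t) = sqrt(2) * sum_k (-1)^k h[1-k] phi(2t - k), k in {0, 1}
psi = np.sqrt(2) * sum((-1)**k * h[1 - k] * phi(2*t - k) for k in (0, 1))

# This is exactly the Haar wavelet: +1 on [0, 1/2), -1 on [1/2, 1).
expected = np.where(t < 0.5, 1.0, -1.0)
print(np.allclose(psi, expected))   # True
```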
Discrete Wavelet Transform and Filter Banks
The Discrete Wavelet Transform (DWT) implements MRA computationally using a filter bank: alternating convolution with a low-pass filter $h$ (the scaling filter, retaining approximation) and a high-pass filter $g$ (the wavelet filter, retaining detail), followed by downsampling by 2.
One level of DWT decomposition:
x[n] --+--> [h] --> [downsample 2] --> cA (approximation)
       |
       +--> [g] --> [downsample 2] --> cD (detail)
Applying this recursively to the approximation coefficients gives the full wavelet decomposition:
Multi-level DWT:
Level 0: x[n] (N samples)
Level 1: cA1 (N/2), cD1 (N/2)
Level 2: cA2 (N/4), cD2 (N/4), cD1
Level 3: cA3 (N/8), cD3 (N/8), cD2, cD1
Each level halves the number of approximation coefficients and adds a detail array. The total count remains $N$.
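A recursive Haar decomposition makes the bookkeeping concrete. The helpers below are a minimal sketch; the output ordering `[cA_L, cD_L, ..., cD_1]` mimics the convention used by PyWavelets' `wavedec`:

```python
import numpy as np

def haar_step(x):
    """One analysis level: orthonormal Haar low-pass / high-pass + downsample."""
    return (x[0::2] + x[1::2]) / np.sqrt(2), (x[0::2] - x[1::2]) / np.sqrt(2)

def wavedec(x, levels):
    """Multi-level Haar DWT: returns [cA_L, cD_L, ..., cD_1]."""
    details = []
    cA = np.asarray(x, dtype=float)
    for _ in range(levels):
        cA, cD = haar_step(cA)
        details.append(cD)
    return [cA] + details[::-1]

rng = np.random.default_rng(1)
x = rng.standard_normal(64)
coeffs = wavedec(x, levels=3)
print([len(c) for c in coeffs])   # [8, 8, 16, 32] -> total of 64 coefficients

# Orthonormal filters also preserve energy across the decomposition.
print(np.isclose(sum(np.sum(c**2) for c in coeffs), np.sum(x**2)))   # True
```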
Perfect Reconstruction. The filters $h$ and $g$ must satisfy the quadrature mirror filter (QMF) conditions to allow perfect reconstruction of $x[n]$ from the wavelet coefficients. For orthogonal wavelets: $g[n] = (-1)^n h[1-n]$.
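For the Haar pair this is easy to verify end to end. A minimal sketch of one analysis/synthesis level, written in polyphase form since the filters have length 2:

```python
import numpy as np

# Orthonormal Haar filter pair; g follows from h via g[n] = (-1)^n h[1-n].
h = np.array([1, 1]) / np.sqrt(2)        # low-pass (scaling filter)
g = np.array([1, -1]) / np.sqrt(2)       # high-pass (wavelet filter)

def analysis(x):
    """Filter + downsample by 2 (even/odd polyphase components)."""
    cA = h[0] * x[0::2] + h[1] * x[1::2]
    cD = g[0] * x[0::2] + g[1] * x[1::2]
    return cA, cD

def synthesis(cA, cD):
    """Upsample + synthesis filters; inverts analysis exactly."""
    x = np.empty(2 * len(cA))
    x[0::2] = h[0] * cA + g[0] * cD
    x[1::2] = h[1] * cA + g[1] * cD
    return x

x = np.random.default_rng(2).standard_normal(32)
cA, cD = analysis(x)
print(np.allclose(synthesis(cA, cD), x))   # True: perfect reconstruction
```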
Computational complexity: The DWT runs in $O(N)$ time (level $j$ processes $N/2^{j-1}$ samples, so the total work sums as $N + N/2 + N/4 + \cdots < 2N$). This is asymptotically faster than the FFT's $O(N\log N)$.
Applications
JPEG 2000. Uses the Cohen-Daubechies-Feauveau 9/7 biorthogonal wavelet. Unlike JPEG’s block-DCT, wavelets decompose the entire image, avoiding block-boundary artifacts and achieving better compression at low bit rates.
Signal Denoising. Wavelet shrinkage (Donoho and Johnstone, 1994): compute DWT, apply a threshold to the detail coefficients (soft or hard thresholding), then invert. This exploits the sparsity of wavelet representations: clean signals concentrate energy in few large coefficients, while Gaussian noise spreads across all coefficients. Thresholding kills the noise while preserving the signal structure.
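A self-contained sketch of the procedure using the Haar DWT. The piecewise-constant test signal, the noise level, and the 4-level depth are illustrative choices; the threshold is the Donoho-Johnstone universal threshold $\sigma\sqrt{2\ln N}$:

```python
import numpy as np

def haar_step(x):
    return (x[0::2] + x[1::2]) / np.sqrt(2), (x[0::2] - x[1::2]) / np.sqrt(2)

def haar_inv(cA, cD):
    x = np.empty(2 * len(cA))
    x[0::2] = (cA + cD) / np.sqrt(2)
    x[1::2] = (cA - cD) / np.sqrt(2)
    return x

def denoise(x, thresh, levels=4):
    """Haar wavelet shrinkage: soft-threshold the detail coefficients, invert."""
    cA, details = np.asarray(x, dtype=float), []
    for _ in range(levels):
        cA, cD = haar_step(cA)
        cD = np.sign(cD) * np.maximum(np.abs(cD) - thresh, 0)   # soft threshold
        details.append(cD)
    for cD in reversed(details):
        cA = haar_inv(cA, cD)
    return cA

rng = np.random.default_rng(0)
clean = np.repeat([0.0, 4.0, -2.0, 1.0], 64)         # piecewise-constant signal
noisy = clean + 0.5 * rng.standard_normal(256)

# Universal threshold sigma * sqrt(2 ln N), with sigma known here.
denoised = denoise(noisy, 0.5 * np.sqrt(2 * np.log(256)))
print(np.mean((denoised - clean)**2) < np.mean((noisy - clean)**2))   # True
```

In practice $\sigma$ is unknown and is usually estimated from the median absolute deviation of the finest detail coefficients.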
Edge Detection. The CWT with Mexican hat or Daubechies wavelets detects edges at multiple scales. The maxima of $|W_f(a, b)|$ across scales $a$ trace the locations of singularities in $f$.
Examples
Wavelet Scattering Networks. Mallat’s scattering transform applies alternating wavelet transforms and modulus nonlinearities to build translation-invariant, deformation-stable signal representations - a handcrafted deep network. It inspired early theoretical work explaining why convolutional networks are stable to input deformations.
Signal Preprocessing for ML. Wavelet-based denoising is used upstream of classification pipelines for time-series data (EEG, ECG, vibration signals). Retaining only wavelet coefficients above a threshold compresses the input while preserving discriminative features, reducing the dimensionality presented to a classifier or neural network.
Compression in Generative Models. Latent diffusion models (Stable Diffusion) first compress images into a lower-dimensional latent space using a VAE. Some architectures substitute or augment this with wavelet-based multi-scale decompositions, processing low- and high-frequency components with separate network branches.
Read Next: