Speech & Audio ML - Teaching Machines to Hear // Megha Bose

Audio is the hardest modality to work with. Vision has a natural two-dimensional grid of pixels; text has a discrete vocabulary of tokens. Audio is a one-dimensional continuous pressure wave, varying thousands of times per second, carrying simultaneous information about phonemes, prosody, speaker identity, emotion, room acoustics, and background noise, all entangled in the same signal. The central challenge in audio ML is feature extraction: how do you transform a raw waveform into a representation that a model can actually learn from?

Sound as a Signal

A microphone converts air pressure variations into a voltage signal. A digital audio system samples this signal at a fixed rate - the sampling rate $f_s$, measured in Hz. CD audio uses $f_s = 44100$ Hz; speech systems typically use $f_s = 16000$ Hz.

The Nyquist-Shannon theorem says you can perfectly reconstruct any signal with maximum frequency $f_{\max}$ from samples taken at $f_s > 2f_{\max}$. Human speech contains energy up to about 8 kHz, so 16 kHz sampling is sufficient to capture all phonetically relevant information.
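To see what goes wrong when the Nyquist condition is violated, here is a minimal sketch (assuming NumPy; the helper name `dominant_frequency` is made up for illustration). It samples a 7 kHz tone and a 9 kHz tone at 16 kHz and looks for the strongest frequency in each.

```python
import numpy as np

def dominant_frequency(signal, fs):
    """Return the frequency (Hz) of the largest-magnitude bin of the real FFT."""
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    return float(freqs[np.argmax(spectrum)])

fs = 16000                       # sampling rate in Hz, so fs / 2 = 8000 Hz
t = np.arange(fs) / fs           # one second of sample times

ok_tone = np.sin(2 * np.pi * 7000 * t)        # below fs / 2: represented correctly
aliased_tone = np.sin(2 * np.pi * 9000 * t)   # above fs / 2: folds back

f_ok = dominant_frequency(ok_tone, fs)
f_aliased = dominant_frequency(aliased_tone, fs)
print(f_ok, f_aliased)
```

Both tones show up at 7000 Hz: once sampled, the 9 kHz tone is indistinguishable from a 7 kHz one. This is why audio is low-pass filtered before it is downsampled to 16 kHz.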

Each sample is a real number representing amplitude. A one-second clip at 16 kHz is a vector in $\mathbb{R}^{16000}$. This raw representation is a poor starting point for learning: nearby samples are highly correlated, the interesting structure is in frequency patterns rather than individual amplitude values, and the representation does not make explicit what actually matters - which frequencies are active when.

From Waveform to Spectrogram: STFT Derived

The DFT (see Fourier Analysis) assumes the signal is periodic and stationary over the entire window. Speech violates this completely: a phoneme lasts 50-100ms, and the frequency content changes dramatically from one phoneme to the next. Applying the DFT to an entire sentence gives you a blurred average of all the frequencies present at all times - useless for understanding what was said when.

The fix is to make the DFT local in time. Slide a short window across the signal, apply the DFT inside each window, and stack the results. This is the Short-Time Fourier Transform (STFT).

Step-by-step derivation. Start with the DFT of the full signal: $$X[k] = \sum_{n=0}^{N-1} x[n] \cdot e^{-i2\pi kn/N}$$

To localize this to frame $t$, multiply $x[n]$ by a window function $w[n - tH]$ positioned at sample $tH$ (where $H$ is the hop length in samples and $w[m]$ is nonzero only for $0 \le m < N$): $$X[k, t] = \sum_{n} x[n] \cdot w[n - tH] \cdot e^{-i2\pi kn/N}$$

The window $w$ selects a short segment of $N$ samples starting at position $tH$ and tapers it to zero at the edges. Substituting $m = n - tH$ to shift the index to the local frame: $$X[k, t] = \sum_{m=0}^{N-1} x[m + tH] \cdot w[m] \cdot e^{-i2\pi k(m + tH)/N}$$

The term $e^{-i2\pi k \cdot tH/N}$ is a phase offset that depends on the frame index $t$ but not on the local index $m$, so it factors out of the sum. It has unit magnitude, so for the spectrogram (which uses $|X[k,t]|^2$) it drops out: $$|X[k, t]|^2 = \left|\sum_{m=0}^{N-1} x[m + tH] \cdot w[m] \cdot e^{-i2\pi km/N}\right|^2$$

This is exactly the squared magnitude of the DFT of the windowed frame $x[m + tH] \cdot w[m]$. Practically: extract a frame of $N$ samples starting at position $tH$, multiply pointwise by the window, apply the DFT, take the magnitude squared. Repeat by advancing $t$ by one, which moves the frame forward by one hop of $H$ samples.

Parameter choices: $N = 400$ samples (25ms at 16 kHz) captures enough of a phoneme to be informative. $H = 160$ samples (10ms) means consecutive frames share $N - H = 240$ samples, a 60% overlap, which ensures smooth spectral evolution. The output shape is $(\lfloor N/2 \rfloor + 1) \times T$ where $T$ is the number of frames.
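The recipe above (frame, window, DFT, magnitude squared) can be sketched in a few lines of NumPy. This is a minimal illustration, not a production STFT: the function name `power_spectrogram` is an assumption, the input must be at least `n_fft` samples long, and there is no padding, so trailing samples that do not fill a whole frame are dropped.

```python
import numpy as np

def power_spectrogram(x, n_fft=400, hop=160):
    """Frame x, apply a Hann window, return |DFT|^2 with shape (n_fft//2 + 1, T)."""
    # Hann window w[m] = sin^2(pi * m / N)
    window = np.sin(np.pi * np.arange(n_fft) / n_fft) ** 2
    n_frames = 1 + (len(x) - n_fft) // hop          # frames that fit without padding
    # Frame t covers samples [t * hop, t * hop + n_fft)
    frames = np.stack([x[t * hop : t * hop + n_fft] for t in range(n_frames)])
    spectrum = np.fft.rfft(frames * window, axis=1)  # one DFT per windowed frame
    return (np.abs(spectrum) ** 2).T

fs = 16000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 1000 * t)      # one second of a 1 kHz test tone

S = power_spectrogram(x)
peak_bin = int(np.argmax(S[:, 0]))    # strongest frequency bin in the first frame
print(S.shape, peak_bin * fs / 400)   # bin spacing is fs / N = 40 Hz
```

With $N = 400$ the bins are 40 Hz apart, so the 1 kHz tone lands in bin 25, and one second of audio yields 98 full frames of 201 bins each.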

Discomfort check. Why does the window matter? If you use a rectangular window - just cutting out the frame with sharp edges at both ends - the DFT sees a signal that abruptly jumps from nonzero to zero at the boundaries. This artificial discontinuity spreads energy into neighboring frequency bins (spectral leakage), masking the true spectrum. The Hann window $w[m] = \sin^2(\pi m / N)$ smoothly tapers to zero at both ends, removing the discontinuity: its highest sidelobe sits at roughly -31 dB relative to the main lobe, versus roughly -13 dB for a rectangular window, and its sidelobes fall off much faster. The tradeoff is slightly reduced frequency resolution (the main lobe is wider), but in speech this is almost always worth it.
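Leakage is easy to observe numerically. The sketch below (assuming NumPy; `far_leakage_db` and the choice of "far" bins are illustrative) windows a tone that falls between two DFT bins and measures how much energy ends up far from the tone, relative to the peak.

```python
import numpy as np

n_fft, fs = 400, 16000
m = np.arange(n_fft)
# 1010 Hz sits between DFT bins (bin spacing is fs / n_fft = 40 Hz)
frame = np.sin(2 * np.pi * 1010 * m / fs)

rect = np.ones(n_fft)                      # rectangular window: plain truncation
hann = np.sin(np.pi * m / n_fft) ** 2      # Hann window

def far_leakage_db(frame, window, first_far_bin=80):
    """Largest level at bins >= first_far_bin (3200 Hz here), in dB relative to the peak."""
    mag = np.abs(np.fft.rfft(frame * window))
    return 20 * np.log10(mag[first_far_bin:].max() / mag.max())

rect_db = far_leakage_db(frame, rect)
hann_db = far_leakage_db(frame, hann)
print(round(rect_db, 1), round(hann_db, 1))
```

The Hann figure comes out far lower than the rectangular one, which is the sidelobe suppression described above; the exact numbers depend on the tone frequency and on which bins are counted as far away.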

Discomfort check. What exactly is the STFT outputting - is it complex or real? The raw STFT $X[k,t]$ is complex: it has magnitude $|X[k,t]|$ and phase $\angle X[k,t]$. The magnitude tells you how much of frequency $k$ is present at time $t$; the phase tells you where in the cycle that frequency is. For speech recognition, the magnitude carries nearly all of the useful information - phase depends heavily on speaker and microphone and contributes little to phonetic discrimination. The spectrogram discards phase by taking $|X[k,t]|^2$. For speech synthesis (going from spectrogram back to audio), you need the phase, which is why phase-reconstruction algorithms like Griffin-Lim or neural vocoders like WaveGlow are used to recover a waveform.

The Mel Scale and Mel Spectrograms

The human auditory system does not perceive frequency linearly. We are more sensitive to differences in the low frequencies (distinguishing 100 Hz from 200 Hz is easy) than in high frequencies (distinguishing 5000 Hz from 5100 Hz is hard). The mel scale approximates this perceptual mapping:

$$m = 2595 \log_{10}\left(1 + \frac{f}{700}\right)$$

A mel filterbank places $M$ triangular filters (typically $M = 80$) at evenly spaced positions on the mel scale. Each filter computes a weighted sum of the STFT magnitudes in its frequency range. Applying the filterbank to the STFT magnitude spectrum gives the mel spectrogram: an $M \times T$ matrix where $M$ is the number of mel filters and $T$ is the number of time frames.
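A minimal sketch of the mel mapping and a triangular filterbank, assuming NumPy. The function names are illustrative, and the filters here are left unnormalized; libraries differ in their normalization and in the exact mel formula, so this will not reproduce any particular toolkit's numbers.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels=80, n_fft=400, fs=16000):
    """Triangular filters evenly spaced in mel; shape (n_mels, n_fft//2 + 1)."""
    bin_hz = np.linspace(0.0, fs / 2, n_fft // 2 + 1)   # frequency of each DFT bin
    # n_mels + 2 edge points: each filter uses three consecutive edges
    edges_hz = mel_to_hz(np.linspace(0.0, hz_to_mel(fs / 2), n_mels + 2))
    fb = np.zeros((n_mels, len(bin_hz)))
    for i in range(n_mels):
        lo, mid, hi = edges_hz[i], edges_hz[i + 1], edges_hz[i + 2]
        rising = (bin_hz - lo) / (mid - lo)     # ramps up from lo to the peak at mid
        falling = (hi - bin_hz) / (hi - mid)    # ramps down from mid to hi
        fb[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb

fb = mel_filterbank()
print(fb.shape, round(float(hz_to_mel(1000.0)), 1))
```

For any magnitude array `S` of shape `(201, T)`, the mel spectrogram is then the matrix product `fb @ S`, of shape `(80, T)`. The constants in the formula are chosen so that 1000 Hz maps to roughly 1000 mel.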

Mel spectrograms are the standard input to modern speech models. They discard phase information (keeping only magnitude), compress the frequency axis to emphasize perceptually important ranges, and reduce the input from $\lfloor N/2 \rfloor + 1$ frequency bins (201 for $N = 400$) to $M = 80$ mel bins. Whisper takes as input an $80 \times 3000$ log-mel spectrogram (30 seconds of audio).

Discomfort check. Why log-mel and not just mel? The log compression further mimics the auditory system, which perceives loudness on a logarithmic scale. It also makes the model more robust to volume variations: doubling the volume adds a constant to the log-mel spectrogram rather than scaling all values, which is much easier to handle. In practice, models trained on log-mel spectrograms generalize much better across recording conditions.
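The volume argument can be checked directly. In this small sketch (assuming NumPy, with random numbers standing in for a real mel power spectrogram), doubling the amplitude multiplies power by 4, which becomes a constant offset of $\log 4$ after the log.

```python
import numpy as np

rng = np.random.default_rng(0)
mel_power = rng.uniform(0.1, 10.0, size=(80, 5))   # stand-in for a mel power spectrogram

quiet = np.log(mel_power)
loud = np.log(4.0 * mel_power)      # doubling amplitude multiplies power by 4

offset = loud - quiet               # the same value, log(4), in every cell
print(offset.min(), offset.max())

# Subtracting the mean removes the offset entirely
same_after_centering = np.allclose(loud - loud.mean(), quiet - quiet.mean())
```

Because the gain change is a pure shift, simple mean normalization makes the two recordings identical, which is not true of the linear-scale spectrogram.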

MFCCs: The Classic Feature

Before deep learning, the dominant speech feature was Mel Frequency Cepstral Coefficients (MFCCs). The computation pipeline: waveform → STFT magnitude → mel filterbank → log → discrete cosine transform (DCT) → keep the first 13 coefficients.

The DCT step decorrelates the mel filterbank outputs (adjacent mel filters overlap substantially) and compresses the representation. The first few DCT coefficients capture the gross spectral shape; higher coefficients capture fine detail. Dropping the high-order coefficients discards fine spectral detail, such as pitch harmonics, while retaining the smooth spectral envelope that carries most of the phonetically discriminative information.
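The last step of the pipeline, log-mel to MFCC, is just a matrix multiply with a cosine basis. A minimal sketch assuming NumPy, using an unnormalized DCT-II (real toolkits usually apply an orthonormal scaling and sometimes liftering, both omitted here; the function names are illustrative):

```python
import numpy as np

def dct2_matrix(n_out, n_in):
    """Unnormalized DCT-II basis: row k is cos(pi * k * (2j + 1) / (2 * n_in))."""
    k = np.arange(n_out)[:, None]
    j = np.arange(n_in)[None, :]
    return np.cos(np.pi * k * (2 * j + 1) / (2 * n_in))

def mfcc_from_log_mel(log_mel, n_coeffs=13):
    """log_mel has shape (n_mels, T); returns the first n_coeffs coefficients, (n_coeffs, T)."""
    return dct2_matrix(n_coeffs, log_mel.shape[0]) @ log_mel

rng = np.random.default_rng(0)
log_mel = rng.normal(size=(80, 10))        # placeholder log-mel frames
mfcc = mfcc_from_log_mel(log_mel)

flat = mfcc_from_log_mel(np.ones((80, 1)))  # a perfectly flat spectrum
print(mfcc.shape, np.round(flat[:3, 0], 6))
```

A flat spectrum puts all of its energy into coefficient 0 and none into the others, which matches the reading of low-order coefficients as gross spectral shape.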

MFCCs dominated speech recognition for decades. Modern deep learning systems use log-mel spectrograms directly, letting the model learn its own frequency decomposition. MFCCs remain common in resource-constrained environments (on-device keyword spotting) because they are cheap to compute and have low dimensionality.

Voice Activity Detection (VAD)

VAD is the binary classification problem: given an audio frame, is it speech or non-speech (silence, background noise, music)?

It sounds trivial. It is not. Whispered speech, noisy environments, overlapping speakers, and the boundary regions between speech and silence all create hard cases. VAD is a prerequisite for virtually every downstream speech task: sending only speech frames to an ASR model, detecting when to start/stop recording, suppressing noise before enhancement.

GMM-based VAD (the classic approach): fit Gaussian Mixture Models to speech and non-speech feature distributions. At inference, compute the likelihood ratio $\log P(\mathbf{x} | \text{speech}) - \log P(\mathbf{x} | \text{non-speech})$ and threshold. The features are typically MFCCs or energy-based features (short-time energy, zero-crossing rate). This is what WebRTC VAD uses - a GMM trained on spectral features with adaptive thresholding. Fast, runs in real time on CPU, works surprisingly well in clean conditions.

Neural VAD: a small neural network (often a simple LSTM or 1D CNN) trained on speech/non-speech labels. Better in difficult conditions, handles noise more robustly, slightly more compute.

WebRTC VAD specifically uses a combination of spectral energy analysis and GMM modeling, with four aggressiveness modes (0-3) trading off false positives against false negatives. Mode 3 aggressively filters out non-speech; mode 0 is very permissive.

Discomfort check. Can’t you just threshold energy? If the signal is loud, it’s speech; if it’s quiet, it’s not? This fails everywhere: breathing, coughing, and AC noise are loud but not speech; whispered speech is quiet; the start and end of utterances have ambiguous energy. Energy thresholding is a common baseline but has poor recall on quiet speech and poor precision in noisy environments. Spectral features are necessary to distinguish speech-shaped noise from actual speech.
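The energy baseline and its failure modes fit in a short sketch. This assumes NumPy; `energy_vad`, the -30 dB threshold, and the synthetic signals are all illustrative choices, with a loud hum standing in for non-speech noise and a very quiet tone standing in for whispered speech.

```python
import numpy as np

def energy_vad(x, frame_len=400, hop=160, threshold_db=-30.0):
    """Flag a frame as speech when its RMS level (dB re amplitude 1.0) exceeds threshold_db."""
    n_frames = 1 + (len(x) - frame_len) // hop
    decisions = np.zeros(n_frames, dtype=bool)
    for t in range(n_frames):
        frame = x[t * hop : t * hop + frame_len]
        rms = np.sqrt(np.mean(frame ** 2) + 1e-12)   # small constant avoids log(0)
        decisions[t] = 20 * np.log10(rms) > threshold_db
    return decisions

fs = 16000
t = np.arange(fs) / fs
silence = np.zeros(fs)
loud_hum = 0.5 * np.sin(2 * np.pi * 60 * t)        # loud, but not speech
quiet_tone = 0.01 * np.sin(2 * np.pi * 300 * t)    # quiet stand-in for a whisper

print(energy_vad(silence).any(), energy_vad(loud_hum).all(), energy_vad(quiet_tone).any())
```

The detector rejects silence correctly, but it flags every frame of the hum as speech and misses the quiet signal entirely. No single threshold fixes both errors, which is the argument for spectral features.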

ASR: The Full Pipeline

Automatic Speech Recognition (ASR) maps audio to text. The modern end-to-end approach (Whisper, wav2vec 2.0) collapses what was a multi-stage pipeline into a single model, but understanding the stages clarifies what the model must learn.

Classical pipeline:

Feature extraction: waveform → MFCC or log-mel spectrogram
Acoustic model: maps feature frames to phoneme probabilities. Hidden Markov Models (HMMs) were standard; now a neural network (LSTM, transformer)
Pronunciation lexicon: maps phonemes to words
Language model: $n$-gram or neural LM, provides prior over word sequences
Decoder: beam search combining acoustic and language model scores

The decoder finds $\hat{W} = \arg\max_W P(W | X) = \arg\max_W P(X | W) P(W)$, where $P(X | W)$ is the acoustic model and $P(W)$ is the language model. This decomposition (acoustic × language) was the core of classical ASR.

CTC (Connectionist Temporal Classification): a loss function that allows training a sequence-to-sequence model without aligned labels. The model outputs a distribution over characters (plus a blank symbol) at each time step; CTC marginalizes over all valid alignments between the output sequence and the target text. CTC-trained models can be decoded greedily or with beam search.
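Greedy CTC decoding takes the most likely label at each frame and then applies the CTC collapsing rule: merge repeated labels, then remove blanks. A minimal sketch (the three-letter vocabulary and the frame labels are made up for illustration):

```python
from itertools import groupby

def ctc_greedy_decode(frame_ids, blank=0):
    """Collapse runs of repeated labels, then drop blanks."""
    collapsed = [label for label, _ in groupby(frame_ids)]
    return [label for label in collapsed if label != blank]

# Hypothetical vocabulary: 0 = blank, 1 = "c", 2 = "a", 3 = "t"
vocab = {1: "c", 2: "a", 3: "t"}

# Per-frame argmax labels, as a model might emit them
frame_ids = [0, 1, 1, 0, 2, 2, 2, 0, 0, 3, 3]
text = "".join(vocab[i] for i in ctc_greedy_decode(frame_ids))
print(text)

# A blank between two identical labels keeps them distinct (a doubled letter)
print(ctc_greedy_decode([1, 1, 0, 1]))
```

The blank symbol is what lets the model emit a doubled letter: without a blank in between, two adjacent identical labels would collapse into one.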

End-to-end models (Whisper, wav2vec 2.0, Conformer) learn the full mapping from audio to text in a single model, removing the need for separate acoustic models, lexicons, and language models. They require more data but achieve state-of-the-art WER.

Whisper Architecture

Whisper (Radford et al., 2022) is an encoder-decoder transformer trained on 680,000 hours of weakly supervised audio-text pairs scraped from the web.

Encoder: audio is processed in 30-second chunks. Each chunk's log-mel spectrogram is passed through two convolutional layers (the second with stride 2, halving the time dimension from 3000 to 1500 frames), then a transformer encoder with sinusoidal positional embeddings (the decoder uses learned ones). The convolutional frontend acts as a learned feature extractor, capturing local temporal patterns before the global attention.
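The sequence lengths involved follow from simple arithmetic. The sketch below uses the standard 1D convolution output-length formula; the kernel width of 3 and padding of 1 are assumptions made for the illustration, as is placing the stride on the second convolution only.

```python
def conv1d_out_len(n_in, kernel=3, stride=1, padding=1):
    """Output length of a 1D convolution (standard formula)."""
    return (n_in + 2 * padding - kernel) // stride + 1

fs, hop, seconds = 16000, 160, 30

n_samples = seconds * fs                               # 480000 samples per chunk
n_frames = n_samples // hop                            # 3000 spectrogram frames (10 ms hop)
after_conv1 = conv1d_out_len(n_frames, stride=1)       # length preserved
after_conv2 = conv1d_out_len(after_conv1, stride=2)    # halved

print(n_samples, n_frames, after_conv1, after_conv2)
```

So the transformer attends over 1500 positions per 30-second chunk, each covering about 20 ms of audio.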

Decoder: a standard transformer decoder that autoregressively generates tokens. The decoder attends to the encoder output via cross-attention at every layer. The input sequence includes special tokens specifying the task (transcribe vs. translate), language, and timestamp mode - making Whisper a multitask model.

Why encoder-decoder for ASR? The audio (encoded by the encoder) and the transcript (generated by the decoder) have very different sequence lengths and granularities. Cross-attention lets the decoder selectively attend to the relevant portion of the audio for each output token, rather than compressing all of the audio into a fixed vector. This alignment capability is crucial for long-form transcription.

Whisper-JAX reimplements Whisper in JAX with XLA compilation and batched inference, achieving near-real-time transcription by parallelizing the convolutional frontend and using efficient attention implementations. For streaming, the 30-second chunk constraint becomes a design challenge: either use a sliding window of overlapping chunks and merge the overlapping transcripts, or use a smaller model with shorter context.

WER and Evaluation Metrics

Word Error Rate (WER) is the standard ASR metric:

$$\text{WER} = \frac{S + D + I}{N}$$

where $S$ = substitutions, $D$ = deletions, $I$ = insertions, $N$ = total words in the reference. Computed via dynamic programming (edit distance at the word level). Lower is better; 0% is perfect.
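A minimal sketch of WER as word-level edit distance, in pure Python. It does no text normalization (case, punctuation), which real evaluations usually apply first; the function name and example sentences are illustrative.

```python
def word_error_rate(reference, hypothesis):
    """(S + D + I) / N via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # d[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # delete all i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = d[i - 1][j] + 1
            insertion = d[i][j - 1] + 1
            d[i][j] = min(substitution, deletion, insertion)
    return d[len(ref)][len(hyp)] / len(ref)

wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(round(wer, 3))   # one substitution + one deletion over six reference words
```

Because insertions count against the hypothesis but $N$ is the reference length, WER can exceed 100%: `word_error_rate("a", "a b c")` is 2.0.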

WER has significant limitations. It treats all words equally: substituting “cat” for “dog” is the same error as substituting “1000” for “2000”. It is case-sensitive by default. It is sensitive to punctuation and normalization choices. A WER of 5% on clean read speech (LibriSpeech) is very different from 5% on noisy conversational speech (CHiME-6), but both are reported as “5% WER”.

Character Error Rate (CER) uses characters instead of words - more informative for morphologically rich languages. BLEU is used when evaluating speech translation. For speaker diarization, Diarization Error Rate (DER) is the metric.

Discomfort check. If Whisper achieves near-human WER on LibriSpeech, is ASR solved? No. LibriSpeech is read speech from audiobooks - clean, slow, and carefully enunciated. Real-world performance degrades substantially on: spontaneous speech with disfluencies, accented speech, far-field microphones, overlapping speakers, domain-specific vocabulary (medical, legal), and low-resource languages. The gap between benchmark WER and production WER is a persistent and underappreciated problem in ASR.

Latency in Production

For real-time speech applications, latency is as important as accuracy. The key metrics:

p50/p95/p99 latency: the 50th/95th/99th percentile of processing time across requests. p95 < 800ms means 95% of requests complete within 800ms.
Real-time factor (RTF): processing time / audio duration. RTF < 1 means faster than real-time.
Streaming vs. batch: batch processes a complete utterance; streaming processes frames as they arrive, with lower latency but potentially higher WER.
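Computing these metrics from logged measurements is straightforward. A small sketch assuming NumPy, with made-up per-request timings (one slow outlier is included on purpose):

```python
import numpy as np

# Hypothetical measurements for ten requests, in seconds
processing_s = np.array([0.21, 0.25, 0.30, 0.28, 0.95, 0.24, 0.27, 0.33, 0.26, 0.29])
audio_s = np.array([2.0, 2.5, 3.0, 2.8, 3.1, 2.4, 2.7, 3.3, 2.6, 2.9])

p50, p95, p99 = np.percentile(processing_s, [50, 95, 99])

# Aggregate real-time factor: total processing time / total audio duration
rtf = processing_s.sum() / audio_s.sum()

print(round(p50, 3), round(p95, 3), round(p99, 3), round(rtf, 3))
```

The median barely notices the single slow request while the tail percentiles are dominated by it, which is why production targets are usually stated at p95 or p99 rather than as an average.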

For streaming ASR, smaller Whisper variants (Whisper-tiny, Whisper-base) are often used with chunked audio, trading some accuracy for latency. Reduced precision (fp16), quantization (int8), and model distillation further reduce compute. CPU vs. GPU inference involves a different tradeoff: GPU has higher throughput, CPU has lower fixed cost for sparse traffic.

Latent Space and Latent Space Decomposition

What is the latent space?

When a neural network encodes an input $\mathbf{x} \in \mathbb{R}^d$ through layers into a representation $\mathbf{z} \in \mathbb{R}^k$ with $k \ll d$, $\mathbf{z}$ lives in the latent space. “Latent” means hidden - the representation is never directly observed, only inferred. The encoder $f: \mathbb{R}^d \to \mathbb{R}^k$ compresses the input; a decoder $g: \mathbb{R}^k \to \mathbb{R}^d$ reconstructs it. This is exactly an autoencoder.

Why is the latent space useful? Because it is structured. A well-trained encoder maps semantically similar inputs to nearby points in latent space. An audio encoder trained on speech should map two utterances of the same word by different speakers to nearby $\mathbf{z}$ vectors - because the underlying content is the same, even though the raw waveforms are very different. The latent space is a coordinate system over the space of meanings, not the space of signals.

You can explore this geometry: interpolating between two latent codes $\mathbf{z}_1$ and $\mathbf{z}_2$ along the line $\alpha \mathbf{z}_1 + (1-\alpha)\mathbf{z}_2$ and decoding each intermediate point should produce a smooth transition between the two corresponding utterances. This would be incoherent in the raw signal space - interpolating two waveforms just gives a blurry superposition.

The problem: entanglement.

A standard autoencoder learns a latent code that captures everything about the input in a single vector: content, speaker, prosody, noise, microphone characteristics, all compressed together. This is fine for reconstruction. It is useless if you want to separate these factors - you cannot change the speaker without changing everything else, because the latent code does not factor them out.

Latent space decomposition (LSD) is the approach of training the model to produce multiple, independent latent codes, each capturing a specific factor:

$$(\mathbf{z}_{\text{content}}, \mathbf{z}_{\text{speaker}}, \mathbf{z}_{\text{prosody}}) = (E_c(\mathbf{x}), E_s(\mathbf{x}), E_p(\mathbf{x}))$$

A decoder then reconstructs from the combination: $\hat{\mathbf{x}} = D(\mathbf{z}_{\text{content}}, \mathbf{z}_{\text{speaker}}, \mathbf{z}_{\text{prosody}})$. Voice conversion becomes: take $\mathbf{z}_{\text{content}}$ from utterance A and $\mathbf{z}_{\text{speaker}}$ from utterance B, feed both to $D$. The output speaks the content of A in the voice of B.

How disentanglement is enforced.

The model does not automatically learn to separate these factors - without explicit pressure, it will collapse everything into $\mathbf{z}_{\text{content}}$ and ignore the others. Three main approaches:

Adversarial training: train a speaker classifier on $\mathbf{z}_{\text{content}}$ and penalize the content encoder for making speaker classification easy. The content encoder is forced to remove speaker information to fool the classifier. This is typically implemented with a gradient reversal layer - the classifier gradient is negated before it reaches the encoder.
Information bottleneck: constrain the capacity of $\mathbf{z}_{\text{content}}$ to be small (few dimensions, quantized, or through a variational bottleneck). Content varies quickly, frame by frame; speaker identity varies slowly and is roughly constant across an utterance. If the bottleneck is tight enough, the encoder can only afford to pass the fast-varying content, and the decoder must get speaker information from $\mathbf{z}_{\text{speaker}}$ instead.
Vector quantization (VQ-VAE): discretize $\mathbf{z}_{\text{content}}$ to a finite codebook. The model must express content as a sequence of discrete codes (analogous to phonemes). Speaker information is hard to express in such a small, phoneme-like discrete code and tends to be excluded.
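The quantization step in the last approach is a nearest-neighbor lookup. Here is a minimal forward-pass sketch assuming NumPy, with a toy 2-dimensional codebook. The training side (codebook updates and passing gradients through the non-differentiable lookup) is deliberately left out, and the names are illustrative.

```python
import numpy as np

def quantize(z, codebook):
    """Map each row of z (T, k) to its nearest codebook row (K, k) by Euclidean distance."""
    # Squared distance between every frame and every code: shape (T, K)
    d2 = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    codes = d2.argmin(axis=1)          # index of the closest code per frame
    return codes, codebook[codes]      # discrete codes and their vectors

codebook = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])            # toy codebook, K = 3
z = np.array([[0.1, -0.1], [0.9, 0.2], [0.2, 0.8], [0.8, 0.1]])     # four content frames

codes, z_q = quantize(z, codebook)
print(codes.tolist())
```

Frames 2 and 4 differ slightly but map to the same code, which is the point: small continuous variation, such as speaker-dependent detail, is thrown away by the lookup, and only the identity of the code reaches the decoder.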

Discomfort check. Why is disentanglement so hard? Because the factors are genuinely entangled in the signal: the same phoneme /a/ sounds different when spoken by a child vs. an adult, in a whisper vs. a shout, in a quiet room vs. a car. The mapping from signal to factor is many-to-one and context-dependent. Any decomposition is an approximation - there is no clean boundary between “content” and “speaker” at the level of the waveform. The model must learn a useful approximation, not a perfect factorization.

Discomfort check. Is the latent space always continuous? No. VAEs learn a continuous latent space with a Gaussian prior, enabling smooth interpolation. VQ-VAEs quantize to a discrete codebook - interpolation between codes is not meaningful, but the discrete codes are often more interpretable and correspond more cleanly to linguistic units like phonemes. For speech content, discrete codes tend to work better; for speaker embeddings, continuous codes are standard.

Summary

Concept	What it is	Why it matters
Sampling rate	Samples per second (Hz)	Nyquist: determines max representable frequency
STFT	Windowed Fourier transform	Time-frequency representation of audio
Mel spectrogram	Log-mel filterbank output	Perceptually motivated feature; input to most speech models
MFCC	DCT of log-mel	Classic feature; low-dim, fast to compute
VAD	Speech vs. non-speech classification	Gating for all downstream speech tasks
WER	$(S+D+I)/N$	Primary ASR metric; limited by normalization choices
CTC	Alignment-free sequence loss	Enables end-to-end ASR training
Whisper	Encoder-decoder transformer on 680k hrs	SOTA multilingual ASR; encoder = conv + transformer
Latent decomposition	Separate content/speaker/style	Enables voice conversion and style transfer
