Floating-Point - Representing Infinity on a Finite Machine // Megha Bose

Helpful context:

Binary & Number Systems - How Computers Count

In 1994, Intel shipped a Pentium processor with a flaw in its floating-point division unit. The error was tiny - at most 5 parts in 10 billion - and affected only specific operands. Intel’s engineers initially dismissed it as inconsequential for most users. The public disagreed. After a professor of mathematics posted his findings online, the story spread. Intel eventually took a $475 million charge and recalled every affected chip.

The incident exposed something important: floating-point arithmetic is not a clean approximation of real arithmetic. It has specific, bounded behavior governed by a 1985 standard. When that behavior is misunderstood, the results range from annoying (0.1 + 0.2 ≠ 0.3) to catastrophic (ML training divergence, financial miscalculation, missile guidance error). When it is understood, you can write numerically stable code and reason precisely about where errors come from.

Why 0.1 Cannot Be Represented Exactly

Binary fractions work the same way decimal fractions do - but in base 2.

Just as 1/3 has no finite decimal representation (0.3333…), many simple decimal fractions have no finite binary representation. 0.1 in binary is 0.0001100110011... repeating forever. The hardware stores a finite approximation.

The actual stored value for 0.1 as a 64-bit double is:

$$0.1000000000000000055511151231257827021181583404541015625$$

That 55… at the 17th decimal place is the error. Add two of these approximations together and the errors compound. The result of 0.1 + 0.2 is not 0.3 (which is also approximated) but a different approximation that, printed with full precision, shows as 0.30000000000000004.

This is not a Python bug, not a hardware defect, and not worth losing sleep over in most contexts. It is the correct consequence of a standard design decision made in 1985.

IEEE 754: The Standard That Rules Everything

The IEEE 754 standard, published in 1985, governs floating-point arithmetic on virtually every CPU, GPU, and DSP manufactured since. Before it, different hardware vendors used different floating-point formats. Code that ran correctly on one machine could produce entirely different results on another. IEEE 754 ended that chaos.

A 64-bit double (float64) stores three fields:

Field	Bits	Purpose
Sign	1	0 = positive, 1 = negative
Exponent	11	Biased by 1023; encodes the power of 2
Mantissa	52	Fractional binary digits

The represented value is:

$$(-1)^{\text{sign}} \times 1.\text{mantissa} \times 2^{\text{exponent} - 1023}$$

The leading 1. before the mantissa is implicit - since every normalized binary number starts with 1, storing that bit would waste space. This “hidden bit” gives float64 an effective 53 bits of mantissa, producing about 15 - 16 significant decimal digits of precision. The range spans roughly $5 \times 10^{-324}$ to $1.8 \times 10^{308}$.

A 32-bit float (float32) has 8 exponent bits and 23 mantissa bits - about 7 significant decimal digits and range $1.2 \times 10^{-38}$ to $3.4 \times 10^{38}$.

Special Values: IEEE 754’s Edge Cases

IEEE 754 reserves certain bit patterns for special values:

NaN (Not a Number): The result of 0.0 / 0.0, sqrt(-1), inf - inf, or any operation on an existing NaN. NaN is contagious - any arithmetic involving NaN produces NaN. The strangest property: NaN != NaN is True. This is intentional and specified by the standard; it is the only reliable way to test for NaN without a dedicated function.

import math
x = float('nan')
print(x == x)          # False - NaN is not equal to itself
print(math.isnan(x))   # True - the correct test

Infinity: The result of dividing by zero (in IEEE 754; Python raises ZeroDivisionError instead) or overflow. Infinity follows algebraic rules: inf + 1 == inf, 1 / inf == 0.0, -inf < any_finite < inf.

Negative zero: -0.0 compares equal to 0.0 (-0.0 == 0.0 is True) but has a distinct bit pattern. It matters in edge cases: 1 / -0.0 == -inf while 1 / 0.0 == inf. You can test for it with math.copysign(1, x) < 0.

Subnormal numbers: When the exponent field is all zeros, the implicit leading 1 is dropped. This allows gradual underflow near zero - instead of snapping to zero, numbers get progressively less precise as they approach the minimum. Without subnormals, a - b == 0 could be true even when a != b for small values.

Machine Epsilon: The Unit of Precision

Machine epsilon $\varepsilon$ is the smallest value such that $1.0 + \varepsilon \neq 1.0$. For float64:

$$\varepsilon \approx 2.22 \times 10^{-16}$$

This is the gap between 1.0 and the next representable float. Near $10^{10}$, the gap between adjacent representable floats is about $10^{-6}$. Floating-point precision is relative to magnitude, not absolute.

import sys
print(sys.float_info.epsilon)  # 2.220446049250313e-16

This is why you should never write x == 0.0 for a float computed from arithmetic. Use abs(x) < 1e-10 or math.isclose(x, 0.0, abs_tol=1e-10) depending on context.

Catastrophic Cancellation: The Silent Killer

When two nearly-equal floating-point numbers are subtracted, the leading significant digits cancel, and the result is dominated by rounding noise. The relative error can become catastrophic.

a = 1000000.0000001
b = 1000000.0000000
result = a - b
# Mathematically: 1e-7
# Computed: something with ~50% relative error

The naive formula for population variance is $\text{Var} = \frac{\sum x_i^2}{n} - \bar{x}^2$. For a dataset like [10000001, 10000002, 10000003], you are subtracting two large, nearly-equal numbers. The classic textbook example of how not to compute variance. Welford’s online algorithm avoids this by computing variance incrementally without ever forming $\sum x_i^2$.

Catastrophic cancellation appears whenever you:

Subtract large nearly-equal numbers
Compute $\sqrt{x+1} - 1$ for small $x$ (use math.expm1 or math.log1p instead)
Use the quadratic formula naively for nearly-equal roots

Kahan Summation: Compensated Accumulation

Summing a long sequence of floats accumulates rounding error. Each addition introduces up to 0.5 ULP (unit in the last place) of error. Over $n$ terms, naive summation accumulates $O(n \cdot \varepsilon)$ error.

The Kahan summation algorithm tracks the rounding error from each addition and compensates for it in the next:

def kahan_sum(values):
    total = 0.0
    compensation = 0.0
    for v in values:
        y = v - compensation          # subtract the previous rounding error
        t = total + y
        compensation = (t - total) - y  # new rounding error
        total = t
    return total

# A sequence designed to expose the difference:
vals = [1e20, -1e20, 1.0]     # mathematically: 1.0
print(sum(vals))               # 0.0 - catastrophic cancellation
print(kahan_sum(vals))         # 1.0 - correct

NumPy’s np.sum uses pairwise summation - a divide-and-conquer approach that achieves $O(\log n \cdot \varepsilon)$ error, better than naive but not as precise as Kahan.

“Just Use Decimal” - And When Not To

Python’s decimal module provides arbitrary-precision decimal arithmetic:

from decimal import Decimal, getcontext
getcontext().prec = 28

a = Decimal('0.1')
b = Decimal('0.2')
print(a + b)   # Decimal('0.3') - exact in decimal

This is the right choice for financial calculations. A bank that stores $0.10 as a float will occasionally compute $0.09999… and round incorrectly. Banks store amounts in integer cents (or millicents for FX), or use Decimal. The US Federal Reserve’s payment processing systems use decimal arithmetic exactly for this reason.

But Decimal is 50 - 100x slower than native float. For scientific computing, ML, simulation, signal processing - anything where you’re computing thousands of operations per second - Decimal is prohibitively slow. The question is not “should I use float or Decimal” but “does the rounding error matter for this computation?”

For most ML training, the answer is emphatically no. Using float32 instead of float64 costs you 7 decimal digits of precision vs 15, but the gradients in a neural network are noisy anyway - the difference between float64 and float32 is swamped by stochastic gradient descent’s inherent randomness.

How ML Uses float16 and bfloat16

Modern ML hardware (NVIDIA A100, Google TPU, Apple Neural Engine) can perform float16 or bfloat16 operations at 2 - 8x the throughput of float32. This is not an approximation - it is the designed operating mode.

float16 has 5 exponent bits and 10 mantissa bits:

Range: $6 \times 10^{-5}$ to $65504$ - very limited
Precision: ~3 significant decimal digits

bfloat16 (Brain Floating Point, developed at Google Brain) has 8 exponent bits and 7 mantissa bits:

Range: same as float32 (the exponent field is identical)
Precision: ~2 - 3 significant decimal digits

The choice was deliberate. ML training suffers from gradient overflow and gradient underflow more than from insufficient mantissa precision. bfloat16 preserves float32’s dynamic range (crucial for avoiding overflow in activations and gradients) while halving the storage and bandwidth requirement. You can train most models in bfloat16 mixed precision with minimal accuracy loss.

float16’s narrow range causes training instability when gradients occasionally become very large or very small - a problem that led to the invention of loss scaling in mixed-precision training. bfloat16 sidesteps this by keeping the full exponent range.

float8 is now emerging (NVIDIA H100 has hardware support). Two variants: E4M3 (4 exponent bits, 3 mantissa bits, for forward pass) and E5M2 (5 exponent bits, 2 mantissa bits, for gradients). At 8 bits, a float holds 2 - 3 significant decimal digits at most. That this is sufficient for inference - and sometimes training - tells you something about the inherent noise tolerance of large neural networks.

The 1985 Standard Ruling 2024 ML

IEEE 754 was designed for scientific computing on workstations. Its authors could not have imagined matrix multiply units running at 312 TFLOPS on bfloat16. The standard has been extended (IEEE 754-2008 added bfloat16-adjacent concepts; hardware vendors added float8 before standards caught up), but the core model - sign, exponent, mantissa, hidden bit - has not changed.

This has practical consequences. The default rounding mode (round-to-nearest-even) was chosen to minimize statistical bias in scientific computing. It is not necessarily optimal for ML training. Research into stochastic rounding - round up or down randomly in proportion to proximity - shows lower training loss for small models, but hardware does not support it by default.

The GPU manufacturers are now the de-facto standards body for floating-point formats in ML. NVIDIA’s H100 supports float8, bfloat16, float16, float32, and float64, each at different throughput levels. The choice of numeric format is now a performance engineering decision, not just a precision decision.

Failure Modes You Should Know

The Patriot missile failure (1991): A timing counter tracking tenths of a second was stored in a 24-bit integer, converted to float for arithmetic. The representation of 0.1 in the 24-bit system had a small error. After 100 hours of uptime, the accumulated error was 0.34 seconds. The radar system missed a Scud missile. 28 soldiers died.

Financial rounding bugs: Repeated float arithmetic on prices can accumulate errors that sometimes round to wrong penny amounts. The canonical fix: store monetary values as integers (cents, or millicents for FX), do all arithmetic as integers, convert to display format only at presentation.

ML training divergence: float16 gradient overflow, unfixed, causes NaN to propagate through the entire model. One NaN in a gradient multiplied through every parameter means every weight becomes NaN. Loss explodes to NaN, which then propagates backward. The fix: gradient clipping and loss scaling. The cause: insufficient exponent range in float16.

Format	Exponent bits	Mantissa bits	Approx. decimal digits	Range
float8 E4M3	4	3	~2	±448
float8 E5M2	5	2	~1 - 2	±57344
float16	5	10	~3	±65504
bfloat16	8	7	~2 - 3	±3.4×10³⁸
float32	8	23	~7	±3.4×10³⁸
float64	11	52	~15 - 16	±1.8×10³⁰⁸

Read Next:

Cache-Friendly Code - Letting the Hardware Do What It Was Built For