Prerequisite: How Computers Represent Numbers


Every programmer eventually runs into this:

>>> 0.1 + 0.2
0.30000000000000004

This is not a Python bug. It is not a hardware bug. It is the correct answer - given the constraints of how floating-point numbers are stored. Understanding why requires looking at IEEE 754, the standard that governs floating-point arithmetic on virtually every processor made in the last 40 years.

IEEE 754: How Floats Are Stored

A 64-bit double (float64) stores a number in three fields:

Field      Bits   Purpose
Sign       1      0 = positive, 1 = negative
Exponent   11     Biased by 1023; encodes the power of 2
Mantissa   52     Fractional digits in base 2

The value is: $(-1)^\text{sign} \times 1.\text{mantissa} \times 2^{\text{exponent} - 1023}$

The leading 1. before the mantissa is implicit - since every normalised binary number starts with 1, we get a free extra bit of precision. This gives float64 about 15–16 significant decimal digits of precision and a range from roughly $5 \times 10^{-324}$ to $1.8 \times 10^{308}$.
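
To make the three fields concrete, here is a small sketch using only the standard struct module to pull the sign, exponent, and mantissa out of a float64's bit pattern (the helper name decode_float64 is just for illustration):

import struct

def decode_float64(x):
    # Reinterpret the 8 bytes of the double as a 64-bit unsigned integer
    bits = struct.unpack('>Q', struct.pack('>d', x))[0]
    sign     = bits >> 63                 # top bit
    exponent = (bits >> 52) & 0x7FF       # next 11 bits, biased by 1023
    mantissa = bits & ((1 << 52) - 1)     # low 52 bits
    return sign, exponent, mantissa

s, e, m = decode_float64(0.1)
print(s, e - 1023)   # 0 -4  (0.1 is stored as 1.6... x 2**-4)
print(m)             # 2702159776422298 - the 52 stored fraction bits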

A 32-bit float32 has 8 exponent bits and 23 mantissa bits - about 7 decimal digits of precision.

Why 0.1 Cannot Be Represented Exactly

0.1 in base 10 is the infinite repeating fraction 0.0001100110011... in base 2 - exactly like 1/3 repeats in base 10. The 52 mantissa bits can only store a finite prefix of this infinite sequence. The stored value is the nearest representable float, which is:

$$0.1000000000000000055511151231257827021181583404541015625$$

When you add two of these slightly-off values, the errors compound. The result of 0.1 + 0.2 is not 0.3 (which also cannot be represented exactly) but a float that, when printed with full precision, shows as 0.30000000000000004.
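
You can see these stored values directly: constructing a Decimal from a float converts the exact binary value to decimal with no rounding along the way:

from decimal import Decimal

print(Decimal(0.1))        # 0.1000000000000000055511151231257827021181583404541015625
print(Decimal(0.1 + 0.2))  # 0.3000000000000000444089209850062616169452667236328125
print(Decimal(0.3))        # 0.299999999999999988897769753748434595763683319091796875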

Special Values

IEEE 754 defines several special cases:

  • NaN (Not a Number): Result of 0.0/0.0, sqrt(-1), or inf - inf. NaN is contagious - any operation involving NaN produces NaN. Notably, NaN != NaN is True.
  • +Infinity / -Infinity: Result of 1.0/0.0 or overflow. Infinity follows algebraic rules: inf + 1 == inf, 1/inf == 0.
  • -0: Negative zero compares equal to positive zero (-0.0 == 0.0 is True) but has a distinct bit pattern. It matters in edge cases: in IEEE arithmetic 1/-0.0 yields -inf (Python raises ZeroDivisionError instead, but math.copysign(1.0, -0.0) reveals the sign).
  • Subnormal numbers: When the exponent field is all zeros, the implicit leading 1 is dropped, allowing gradual underflow near zero at reduced precision.

import math

print(math.isnan(float('nan')))      # True
print(float('nan') == float('nan'))  # False - NaN != NaN
print(math.inf + 1 == math.inf)      # True - infinity absorbs finite values
# 1.0 / 0.0 raises ZeroDivisionError in Python; C gives +inf
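
Gradual underflow is also easy to observe from Python: sys.float_info.min is the smallest normal float, and values below it fall into the subnormal range until they finally flush to zero:

import sys

print(sys.float_info.min)       # 2.2250738585072014e-308 - smallest normal float
print(sys.float_info.min / 2)   # 1.1125369292536007e-308 - subnormal, still nonzero
print(5e-324)                   # 5e-324 - smallest positive subnormal
print(5e-324 / 2)               # 0.0 - finally underflows to zero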

Rounding Modes

IEEE 754 specifies four rounding modes. The default is round-to-nearest-even (also called banker’s rounding): when a value falls exactly halfway between two representable floats, round to the one with an even least-significant bit. This avoids systematic upward bias in statistical computations.
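
Python's built-in round() applies the same tie-breaking rule, which makes it easy to see in action:

print(round(0.5))   # 0 - the tie goes to the even neighbour
print(round(1.5))   # 2
print(round(2.5))   # 2, not 3 - again the even neighbour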

Other modes (round toward zero, round toward $+\infty$, round toward $-\infty$) are used in interval arithmetic to produce rigorous error bounds.
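
CPython doesn't expose the hardware rounding mode for binary floats, but the decimal module lets you switch modes per context, which is enough to sketch how directed rounding brackets a true value:

from decimal import Decimal, getcontext, ROUND_FLOOR, ROUND_CEILING

getcontext().prec = 6
getcontext().rounding = ROUND_FLOOR
print(Decimal(1) / Decimal(3))   # 0.333333 - rounded toward -infinity
getcontext().rounding = ROUND_CEILING
print(Decimal(1) / Decimal(3))   # 0.333334 - rounded toward +infinity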

Machine Epsilon

Machine epsilon $\epsilon_\text{mach}$ is the gap between 1.0 and the next representable float (the often-quoted "smallest $\epsilon$ such that $1.0 + \epsilon \ne 1.0$" is actually about half this under round-to-nearest). For float64:

$$\epsilon_\text{mach} \approx 2.2 \times 10^{-16}$$

This is the unit in the last place (ULP) at 1.0. The precision you have at any given magnitude scales with the magnitude - near $10^{10}$, adjacent representable floats are about $2 \times 10^{-6}$ apart.

import sys
print(sys.float_info.epsilon)  # 2.220446049250313e-16
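
Since Python 3.9, math.ulp reports the spacing at any magnitude, which makes this scaling visible:

import math

print(math.ulp(1.0))         # 2.220446049250313e-16 - equals machine epsilon
print(math.ulp(1e10))        # 1.9073486328125e-06 - gaps widen with magnitude
print(1e10 + 1e-7 == 1e10)   # True - an addend below half a ULP simply vanishes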

Catastrophic Cancellation

When you subtract two nearly-equal floating-point numbers, you lose significant digits. The leading digits cancel, and the result is dominated by rounding noise.

a = 1.000000000000001
b = 1.000000000000000
print(a - b)   # 1.1102230246251565e-15  (correct is 1e-15; ~11% error)

# Severe case - absorption at large magnitude:
x = 1e16 + 1
y = 1e16
print(x - y)   # 0.0 - the +1 was rounded away (floats near 1e16 are 2.0 apart)

This arises in computing variance with the naive formula $\sum x_i^2 - n\bar{x}^2$ - a classic textbook mistake. Use Welford’s online algorithm instead.
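
Welford's algorithm keeps a running mean and a running sum of squared deviations, so it never subtracts two large, nearly-equal quantities. A minimal single-pass sketch (the function name and test data are illustrative):

def welford_variance(values):
    # Single-pass, numerically stable sample variance (Welford's algorithm)
    n = 0
    mean = 0.0
    m2 = 0.0   # sum of squared deviations from the running mean
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)   # uses the updated mean
    return m2 / (n - 1)

data = [1e9 + 4, 1e9 + 7, 1e9 + 13, 1e9 + 16]
print(welford_variance(data))   # 30.0 - the naive formula loses this at offset 1e9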

Accumulation of Error: Kahan Summation

Summing a long sequence of floats accumulates rounding error. Each addition introduces up to $0.5$ ULP of error; over $n$ terms this can reach $O(n \cdot \epsilon_\text{mach})$. The Kahan summation algorithm compensates by tracking the lost low-order bits:

def kahan_sum(values):
    total = 0.0
    compensation = 0.0   # running record of lost low-order bits
    for v in values:
        y = v - compensation             # correct the incoming term
        t = total + y                    # big + small: low-order bits of y are lost here
        compensation = (t - total) - y   # algebraically zero; recovers the rounding error
        total = t
    return total

n = 10_000_000
naive = 0.0
for _ in range(n):
    naive += 0.1                   # plain sequential addition
kahan = kahan_sum(0.1 for _ in range(n))
exact = n * 0.1

print(f"Naive: {naive:.6f}")   # roughly 999999.999839 - accumulated error
print(f"Kahan: {kahan:.6f}")   # 1000000.000000
print(f"Exact: {exact:.6f}")   # 1000000.000000

NumPy’s np.sum uses pairwise summation (divide and conquer), which achieves $O(\log n \cdot \epsilon_\text{mach})$ error - much better than naive sequential addition, though not as tight as compensated summation. (Since Python 3.12 the built-in sum() itself uses Neumaier’s compensated algorithm for floats, and math.fsum returns the correctly rounded sum - which is why the demo above accumulates by hand.)
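
If NumPy is available, the difference is easy to check (the array size here is just for illustration):

import numpy as np

arr = np.full(10_000_000, 0.1)
print(np.sum(arr))   # within a few ULPs of 1000000.0 - error grows only logarithmically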

When to Use Integer Arithmetic

For financial calculations, never use binary floats: $0.10 cannot be represented exactly. Instead, store amounts in integer cents (or millicents) and perform all arithmetic on integers:

# Wrong:
price = 1.10
tax   = 0.09
total = price + tax   # 1.1900000000000002 - not exactly 1.19

# Right: work in cents
price_cents = 110
tax_cents   = 9
total_cents = price_cents + tax_cents  # 119 - exact
print(f"${total_cents // 100}.{total_cents % 100:02d}")  # $1.19

Python’s decimal and fractions

When you need exact decimal arithmetic (not just integer), Python’s decimal module uses arbitrary-precision base-10 arithmetic:

from decimal import Decimal, getcontext
getcontext().prec = 28

a = Decimal('0.1')
b = Decimal('0.2')
print(a + b)   # 0.3 - exact in decimal

# For exact rational arithmetic:
from fractions import Fraction
print(Fraction(1, 10) + Fraction(2, 10))  # 3/10 - perfectly exact

Both are slower than float - Decimal by ~50–100x, Fraction by more. Use them only when exact results justify the overhead.

Examples

Demonstrate 0.1 + 0.2 and safe comparison:

import math

a = 0.1 + 0.2
b = 0.3

# Wrong: never use == for floats
print(a == b)              # False

# Right: use math.isclose
print(math.isclose(a, b, rel_tol=1e-9, abs_tol=1e-12))  # True

# Or for near-zero comparisons:
epsilon = 1e-10
print(abs(a - b) < epsilon)  # True

Kahan summation vs naive for a tricky case:

def kahan_sum(vals):
    s, c = 0.0, 0.0
    for v in vals:
        y = v - c
        t = s + y
        c = (t - s) - y
        s = t
    return s

# A sum where plain addition loses every small term
vals = [1e16] + [1.0] * 10   # mathematically: 1e16 + 10
naive = 0.0
for v in vals:
    naive += v               # each +1 is rounded away at magnitude 1e16
print(naive)                 # 1e+16 - the ten 1.0s vanished
print(kahan_sum(vals))       # 1.000000000000001e+16 - correct
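
One caveat worth knowing: plain Kahan assumes the running total dominates each incoming term. When a term dwarfs the total - say [1.0, 1e20, -1e20] - the compensation itself is rounded away. Neumaier's variant (the algorithm CPython's sum() has used for floats since 3.12) compares magnitudes before extracting the error; a sketch:

def neumaier_sum(vals):
    s = 0.0
    c = 0.0   # accumulated correction
    for v in vals:
        t = s + v
        if abs(s) >= abs(v):
            c += (s - t) + v   # low-order bits of v were lost
        else:
            c += (v - t) + s   # low-order bits of s were lost
        s = t
    return s + c

print(kahan_sum([1.0, 1e20, -1e20]))     # 0.0 - plain Kahan loses the 1.0
print(neumaier_sum([1.0, 1e20, -1e20]))  # 1.0 - correct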

Floating-point arithmetic is a tool with known limitations, not a broken approximation. The key discipline is knowing when those limitations matter - financial calculations and ill-conditioned linear systems demand special care; most ML training is fine with float32 or even bfloat16. Knowing where you are on that spectrum is half the battle.