Prerequisite: Operating Systems Fundamentals


Concurrency is the hardest problem in everyday programming. Not because the concepts are mathematically deep, but because the bugs are non-deterministic - they appear in production, disappear under a debugger, and reproduce only on 16-core machines under specific load. Yet concurrency is unavoidable: every web server handles concurrent requests, every GUI runs background threads, and every database serialises writes from multiple clients. Understanding the failure modes is the first step to writing correct concurrent code.

Why Concurrency?

Two distinct motivations, often conflated:

I/O-bound tasks: Your program spends most of its time waiting - for a database query, an HTTP response, a file read. While one task waits, another could be making progress. Even a single CPU core can handle thousands of concurrent I/O-bound tasks with minimal overhead.

CPU-bound tasks: Your program does heavy computation (matrix multiplication, image encoding, ML inference). To actually run faster, you need multiple CPU cores executing in parallel. Threads and processes both work; the choice depends on language (Python’s GIL complicates this - see below).

Race Conditions

A race condition occurs when the outcome of a computation depends on the non-deterministic interleaving of operations from multiple threads.

import threading

counter = 0

def increment():
    global counter
    for _ in range(100_000):
        counter += 1   # NOT atomic: read, add, write

t1 = threading.Thread(target=increment)
t2 = threading.Thread(target=increment)
t1.start(); t2.start()
t1.join();  t2.join()

print(counter)  # Expected: 200000. Actual: usually less than 200000, and different on every run

The line counter += 1 compiles to three operations: load counter into a register, add 1, store back. If two threads both load the value 42 before either stores, both write 43 - one increment is lost.
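You can watch the non-atomicity directly. A small illustration using the standard dis module (exact opcode names vary between Python versions):

import dis

counter = 0

def increment_once():
    global counter
    counter += 1

dis.dis(increment_once)
# Typical output (names vary by version):
#   LOAD_GLOBAL   counter            <- read
#   LOAD_CONST    1
#   BINARY_OP     += / INPLACE_ADD   <- add
#   STORE_GLOBAL  counter            <- write

A thread can be preempted between any two of these bytecodes, which is exactly where the lost update happens.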

Mutexes

A mutex (mutual exclusion lock) ensures only one thread executes a critical section at a time:

import threading

counter = 0
lock = threading.Lock()

def increment():
    global counter
    for _ in range(100_000):
        with lock:          # acquire on entry, release on exit
            counter += 1

t1 = threading.Thread(target=increment)
t2 = threading.Thread(target=increment)
t1.start(); t2.start()
t1.join();  t2.join()

print(counter)  # Always 200000

threading.Lock is built on the platform's native locking primitives (POSIX semaphores or mutexes on Linux). The with lock: pattern is essential - a bare lock.acquire() without a matching release() in a finally block leaves the lock held forever if an exception occurs, deadlocking every thread that later tries to acquire it.

Other synchronisation primitives:

  • threading.Semaphore(n): Allow up to n threads into the critical section simultaneously. Useful for rate-limiting (e.g., max 10 concurrent database connections) - see the sketch after this list.
  • threading.Event: One thread signals, others wait. Used for “work is ready” notifications.
  • threading.Condition: Combine a lock with wait/notify for producer-consumer patterns.
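A minimal sketch of the semaphore pattern (the sleep stands in for real I/O such as a database query):

import threading, time

sem = threading.Semaphore(3)     # at most 3 threads inside at once

def fetch_one(i):
    with sem:                    # blocks while 3 others hold the semaphore
        print(f"worker {i} running")
        time.sleep(0.1)          # stand-in for the actual I/O

threads = [threading.Thread(target=fetch_one, args=(i,)) for i in range(10)]
for t in threads: t.start()
for t in threads: t.join()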

Deadlock

A deadlock occurs when two or more threads are each waiting for a resource held by the other, and none can proceed.

import threading, time

lock_a = threading.Lock()
lock_b = threading.Lock()

def do_work():
    pass                   # placeholder for the real critical-section work

def thread1():
    with lock_a:
        time.sleep(0.01)   # context switch likely here
        with lock_b:       # waits forever if thread2 holds lock_b
            do_work()

def thread2():
    with lock_b:
        time.sleep(0.01)
        with lock_a:       # waits forever if thread1 holds lock_a
            do_work()

Deadlock requires four conditions (Coffman conditions), all of which must hold simultaneously:

  1. Mutual exclusion: Resources cannot be shared.
  2. Hold and wait: A thread holds one resource while waiting for another.
  3. No preemption: Resources cannot be forcibly taken.
  4. Circular wait: Thread A waits for a resource held by Thread B, which in turn waits for a resource held by Thread A.

Prevention strategies: Impose a global lock ordering (always acquire lock_a before lock_b); use lock.acquire(timeout=1) and retry; use higher-level abstractions (queues, actors) that hide locks entirely.
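A sketch of the first strategy. The ordered_locks helper is illustrative, not a library function: it derives a global order from the lock objects' ids, so every thread acquires in the same order and a circular wait cannot form:

def ordered_locks(l1, l2):
    # Any total order works; id() gives one for free.
    return (l1, l2) if id(l1) < id(l2) else (l2, l1)

def thread1_safe():
    first, second = ordered_locks(lock_a, lock_b)
    with first:
        with second:
            do_work()

def thread2_safe():
    first, second = ordered_locks(lock_a, lock_b)   # same order as thread1_safe
    with first:
        with second:
            do_work()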

Python’s GIL

The Global Interpreter Lock is a mutex in CPython that prevents more than one thread from executing Python bytecode at a time. This is necessary because CPython’s reference counting is not thread-safe - without the GIL, two threads could simultaneously decrement a reference count to zero and double-free an object.
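You can observe the reference counts the GIL protects with sys.getrefcount (the count reads one higher than you might expect, because the argument itself is a temporary reference):

import sys

x = [1, 2, 3]
print(sys.getrefcount(x))   # 2: x plus the temporary argument reference
y = x                       # a second reference to the same list
print(sys.getrefcount(x))   # 3
del y
print(sys.getrefcount(x))   # back to 2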

Practical consequences:

  • I/O-bound threads: The GIL is released during I/O system calls. requests.get(url) releases the GIL while waiting for the network, so multiple threads genuinely run concurrently for I/O. Threading is effective here.
  • CPU-bound threads: Two threads crunching Python loops cannot actually run in parallel - they take turns holding the GIL. Threading does not help for CPU-bound Python code.

The GIL is being removed: Python 3.13 ships an experimental free-threaded build with the GIL disabled (PEP 703), and 3.14 keeps it as a separate, opt-in build - extension modules that assumed the GIL exists will need updates before they are safe under it.

# I/O-bound: threading works well
import threading, requests

urls = ["https://example.com"] * 10
results = []

def fetch(url):
    results.append(requests.get(url).status_code)

threads = [threading.Thread(target=fetch, args=(u,)) for u in urls]
for t in threads: t.start()
for t in threads: t.join()
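For contrast, a CPU-bound sketch (the busy loop is an illustrative stand-in for real computation). Under the GIL, two threads take roughly as long as running the loop twice sequentially:

# CPU-bound: threading does not help under the GIL
import threading, time

def busy():
    total = 0
    for i in range(10_000_000):
        total += i

t0 = time.perf_counter()
threads = [threading.Thread(target=busy) for _ in range(2)]
for t in threads: t.start()
for t in threads: t.join()
print(f"2 CPU-bound threads: {time.perf_counter() - t0:.2f}s")
# roughly the same wall time as calling busy() twice in a row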

multiprocessing: Bypassing the GIL

For CPU-bound work, use separate processes. Each has its own GIL and its own memory space:

from multiprocessing import Pool
import math

def heavy(n):
    return sum(math.sqrt(i) for i in range(n))

if __name__ == "__main__":   # required on spawn platforms: child processes re-import this module
    with Pool(processes=4) as pool:
        results = pool.map(heavy, [10_000_000] * 8)

The overhead is higher (process creation, IPC for passing data), but four processes on four cores genuinely run in parallel.

concurrent.futures: A Unified Interface

concurrent.futures provides ThreadPoolExecutor and ProcessPoolExecutor behind the same API:

from concurrent.futures import ThreadPoolExecutor, as_completed
import requests

urls = ["https://httpbin.org/delay/1"] * 8

with ThreadPoolExecutor(max_workers=8) as executor:
    futures = {executor.submit(requests.get, url): url for url in urls}
    for future in as_completed(futures):
        url = futures[future]
        print(f"{url}: {future.result().status_code}")

# 8 requests, each taking 1s, complete in ~1s total (not 8s)

Switch ThreadPoolExecutor to ProcessPoolExecutor to bypass the GIL for CPU-bound tasks. The API is identical, though process pools add two requirements: the function and its arguments must be picklable, and the submitting code must run under an if __name__ == "__main__" guard.
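A sketch of the process-pool variant, reusing the heavy function from the multiprocessing example above:

from concurrent.futures import ProcessPoolExecutor
import math

def heavy(n):
    return sum(math.sqrt(i) for i in range(n))

if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=4) as executor:
        results = list(executor.map(heavy, [10_000_000] * 8))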

asyncio: Single-Threaded Concurrency

asyncio handles concurrency with a single thread and an event loop. Instead of blocking on I/O, coroutines await and yield control back to the event loop, which runs other coroutines until the I/O completes.

Because everything runs on one thread, shared state needs no locks (coroutines can only interleave at await points), but CPU-bound work blocks the event loop entirely:

import asyncio
import aiohttp

async def fetch(session, url):
    async with session.get(url) as response:
        return await response.text()

async def main():
    urls = ["https://httpbin.org/get"] * 10
    async with aiohttp.ClientSession() as session:
        tasks = [fetch(session, url) for url in urls]
        results = await asyncio.gather(*tasks)
    print(f"Fetched {len(results)} pages")

asyncio.run(main())

asyncio.gather runs all tasks concurrently in a single thread. This scales to tens of thousands of concurrent I/O operations - far beyond what threading (limited by memory and GIL contention) can achieve.
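When a coroutine does need blocking or heavy work, it can hand it off so the event loop stays responsive. A minimal sketch using asyncio.to_thread (Python 3.9+); the expensive function is an illustrative stand-in:

import asyncio, hashlib

def expensive(data):
    return hashlib.sha256(data).hexdigest()   # blocking work

async def main():
    # Runs expensive() in a worker thread; the event loop keeps serving other tasks.
    digest = await asyncio.to_thread(expensive, b"payload" * 1_000_000)
    print(digest[:16])

asyncio.run(main())

A worker thread still contends for the GIL, so for genuinely CPU-bound work, loop.run_in_executor with a ProcessPoolExecutor is the stronger option.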

Atomic Operations

For simple shared counters, atomic operations avoid the overhead of a full mutex. CPython's threading module does not expose atomics directly, but the effect of an atomic increment can be achieved with threading.Lock. At the C level, GCC's __sync_fetch_and_add or C11's <stdatomic.h> provide true hardware atomics - typically a single CPU instruction that cannot be interrupted mid-operation.
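A lock-based sketch of the same effect in pure Python - the AtomicCounter class is illustrative, not a standard-library type:

import threading

class AtomicCounter:
    """Counter whose increments happen atomically under a private lock."""
    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def increment(self, delta=1):
        with self._lock:
            self._value += delta
            return self._value

    @property
    def value(self):
        with self._lock:
            return self._value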

Examples

Race condition and the fix:

import threading

# Broken
count = 0
def bad_increment():
    global count
    for _ in range(50_000):
        count += 1

# Fixed
count = 0
lock = threading.Lock()
def good_increment():
    global count
    for _ in range(50_000):
        with lock:
            count += 1

for fn in [bad_increment, good_increment]:
    count = 0
    threads = [threading.Thread(target=fn) for _ in range(4)]
    for t in threads: t.start()
    for t in threads: t.join()
    print(f"{fn.__name__}: {count}")
# bad_increment: < 200000 (varies)
# good_increment: 200000 (always)

asyncio web fetcher:

import asyncio, aiohttp, time

async def fetch_status(session, url):
    async with session.get(url) as response:   # context manager releases the connection
        return response.status

async def fetch_all(urls):
    async with aiohttp.ClientSession() as session:
        tasks = [fetch_status(session, url) for url in urls]
        return await asyncio.gather(*tasks)

urls = ["https://httpbin.org/status/200"] * 20
t0 = time.perf_counter()
statuses = asyncio.run(fetch_all(urls))
print(f"20 requests in {time.perf_counter()-t0:.2f}s: {set(statuses)}")
# ~1–2s total, not 20s

The key insight that cuts through concurrency complexity: shared mutable state is the root cause of most bugs. Minimise it. Where it cannot be avoided, protect it with the narrowest lock possible and hold the lock for the shortest time possible. When in doubt for I/O-heavy Python code, asyncio with no shared state is often the right answer.
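A final sketch of the narrowest-lock principle: do the expensive work outside the critical section and hold the lock only for the shared-state update (process_item and results are illustrative names):

import threading

results = {}
results_lock = threading.Lock()

def process_item(item):
    return item * item           # stand-in for expensive, lock-free computation

def worker(item):
    value = process_item(item)   # expensive part: no lock held
    with results_lock:           # lock held only for the dictionary write
        results[item] = value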

