Concurrency & Race Conditions - When Two Threads Meet on a Shared Variable // Megha Bose

Helpful context:

Operating Systems - The Software That Runs All Other Software

Two users click “Transfer $500” on their banking app at exactly the same moment. Both requests hit the server simultaneously. Both read the same balance - $700. Both compute the new balance: $200. Both write $200 back. One transfer disappears. The account that should show $200 shows $200, but the balance that should show -$300 shows $200 too. $500 has been conjured from nothing.

This is a race condition. It is not a theoretical problem. Knight Capital Group lost $440 million in 45 minutes in 2012, partly due to a race condition in their order-routing system. Therac-25 radiation therapy machines killed patients in the late 1980s due to a race condition in their control software. Concurrency bugs do not show up in testing and they do not care about your code review process.

Why Concurrency Exists: Two Different Problems

Before diving into solutions, it is worth separating the two motivations for concurrent code - they have different solutions.

I/O-bound concurrency: Your program spends most of its time waiting. Waiting for a database query to return. Waiting for an HTTP request to complete. Waiting for a file read. A single-threaded program that makes 100 HTTP requests does so serially - request 1 finishes, then request 2 starts. But while request 1 is in flight, the CPU is idle. Concurrency lets you have 100 requests in flight simultaneously, even on a single CPU core.

CPU-bound parallelism: Your program does heavy computation. Image processing, video encoding, ML inference, scientific simulation. To actually go faster, you need multiple CPU cores doing work simultaneously. This is true parallelism, not just concurrent waiting.

Conflating these two leads to wrong solutions. Using 100 threads to parallelize CPU-bound Python code on a single-GIL interpreter does not help. Using async/await for a computationally intensive algorithm on a single-core event loop does not help. The right tool depends on which problem you have.

Race Conditions: When Interleaving Goes Wrong

counter += 1 looks atomic. It is not. It compiles to three operations: load the current value from memory into a register, add 1 to the register, store the result back to memory. The OS scheduler can preempt a thread between any two of these operations.

import threading

counter = 0

def increment():
    global counter
    for _ in range(100_000):
        counter += 1   # load, add, store - three steps, not one

t1 = threading.Thread(target=increment)
t2 = threading.Thread(target=increment)
t1.start(); t2.start()
t1.join();  t2.join()

print(counter)
# Expected: 200,000
# Actual: anywhere between 100,001 and 200,000

What happens: Thread 1 loads counter = 42. Thread 2 loads counter = 42 before Thread 1 has stored. Thread 1 stores 43. Thread 2 stores 43. Two increments, one result. One increment is lost.

This is a lost update - the classic race condition pattern. It is non-deterministic; it may not appear under light load, only manifesting under specific timing and CPU scheduling conditions. Production load surfaces it.

The bank balance scenario is the same structure: read, compute, write, with an adversarial context switch between read and write.

Mutual Exclusion: The Lock Solution

A mutex (mutual exclusion lock) ensures only one thread can execute a protected section at a time. When a thread acquires the lock, other threads block until it releases:

import threading

counter = 0
lock = threading.Lock()

def safe_increment():
    global counter
    for _ in range(100_000):
        with lock:          # acquire on enter, guaranteed release on exit
            counter += 1

t1 = threading.Thread(target=safe_increment)
t2 = threading.Thread(target=safe_increment)
t1.start(); t2.start()
t1.join();  t2.join()

print(counter)  # Always 200,000

The with lock: pattern is critical. A bare lock.acquire() without lock.release() in a finally block will deadlock if the protected code raises an exception. Python’s context manager guarantees release.

Mutexes serialize access, which eliminates the race but also eliminates parallelism in the protected section. The critical section should be as small as possible - lock granularity matters. A lock that covers an entire function is safe but slow; a lock that covers only the shared state update is fast but requires careful reasoning about what “shared state” means.

Deadlock: Mutexes in Conflict

Deadlock occurs when two threads each hold a lock the other needs. Neither can proceed. The program freezes permanently, not crashing - just stopped.

import threading

lock_a = threading.Lock()
lock_b = threading.Lock()

def thread1():
    with lock_a:
        # ... do some work ...
        with lock_b:      # blocks if thread2 holds lock_b
            do_work()

def thread2():
    with lock_b:
        # ... do some work ...
        with lock_a:      # blocks if thread1 holds lock_a
            do_work()

Thread 1 acquires lock_a. Thread 2 acquires lock_b. Thread 1 tries to acquire lock_b - blocked. Thread 2 tries to acquire lock_a - blocked. Neither will ever release what the other needs. This is the dining philosophers problem in code: $n$ philosophers, $n$ forks, each needs two forks to eat. If every philosopher picks up their left fork simultaneously, everyone starves.

Deadlock requires four conditions, all simultaneously (Coffman, 1971):

Mutual exclusion: At least one resource is held exclusively
Hold and wait: A thread holds a resource while waiting for another
No preemption: Resources are not forcibly taken
Circular wait: A cycle of threads each waiting for the next

Breaking any one condition prevents deadlock. The practical solution: impose a global lock ordering. If every thread that needs both lock_a and lock_b always acquires them in the same order (always lock_a first), circular wait is impossible. This is tedious to enforce manually at scale, which is why higher-level abstractions (message queues, the actor model) that hide locks entirely are often preferable.

Python’s GIL: The Elephant in the Room

The Global Interpreter Lock is a single mutex in CPython that prevents more than one thread from executing Python bytecode at any time. It exists because CPython’s reference counting for garbage collection is not thread-safe: without the GIL, two threads could simultaneously decrement a reference count to zero and both free the same object.

This has practical consequences that confuse most Python programmers:

I/O-bound work with threads works fine. When a thread makes a blocking system call (reading from a socket, waiting for a subprocess, sleeping), it releases the GIL. Other threads can run during that wait. requests.get(url) in a thread releases the GIL while waiting for the response. Running 100 such threads concurrently is effective - they all wait simultaneously.

CPU-bound work with threads does not parallelize. Two threads running pure Python loops never actually run simultaneously, even on an 8-core machine. They take turns holding the GIL, each holding it for a few milliseconds before yielding. No speedup. For CPU-bound work, use multiprocessing - separate processes, separate GILs.

# I/O-bound: threading is effective
import threading, requests

def fetch(url, results, i):
    results[i] = requests.get(url).status_code  # releases GIL while waiting

urls = ["https://httpbin.org/status/200"] * 20
results = [None] * 20
threads = [threading.Thread(target=fetch, args=(u, results, i))
           for i, u in enumerate(urls)]
for t in threads: t.start()
for t in threads: t.join()
# 20 requests complete in ~1 second, not 20 seconds

Python 3.13 and the No-GIL Experiment

PEP 703, merged into Python 3.13 as an opt-in, removes the GIL. Python 3.13 can be built with --disable-gil. What changes:

CPU-bound Python threads can now run truly in parallel on multiple cores
threading becomes genuinely useful for CPU-bound work
The reference counting mechanism is replaced with thread-safe atomic operations

What does not change immediately:

Extension modules (NumPy, pandas, etc.) that assumed the GIL were the synchronization mechanism may have latent thread-safety bugs exposed
Performance of single-threaded Python may be slightly slower (atomic operations have overhead)
The ecosystem needs time to audit and update C extensions

The transition is deliberately cautious. Python 3.13 ships with both GIL and no-GIL builds. Python 3.14 will make no-GIL the default only after the ecosystem has had time to adapt. This is the right call - breaking the scientific Python ecosystem would be a catastrophe.

Async/Await: Concurrency Without Threads

Here is the thing about threads: they are expensive. Each thread needs a stack (typically 1 - 8 MB), a kernel thread structure, and scheduler metadata. 10,000 simultaneous HTTP connections → 10,000 threads → 10 - 80 GB of stack space. This is the C10K problem, documented in 1999, that broke the Apache model of “one thread per connection.”

asyncio solves this with a single-threaded event loop model. There are no threads - just coroutines that voluntarily yield when waiting for I/O:

import asyncio, aiohttp

async def fetch(session, url):
    async with session.get(url) as resp:
        return await resp.text()   # yields control while waiting; no thread blocked

async def main():
    async with aiohttp.ClientSession() as session:
        tasks = [fetch(session, url) for url in urls * 100]
        results = await asyncio.gather(*tasks)   # all 100 in flight simultaneously

asyncio.run(main())

The event loop runs in a single thread. When a coroutine hits an await, it suspends and the event loop runs another coroutine that is ready. When the awaited I/O completes (signaled by epoll/kqueue at the OS level), the event loop resumes the suspended coroutine.

This is not “async is just threads.” The threading model is preemptive - the scheduler interrupts threads at arbitrary points, requiring locks to protect shared state. The async model is cooperative - a coroutine yields only at await points, which you control. Between awaits, you have exclusive use of the CPU, so shared state is safe without locks (in a single-event-loop setup). This eliminates entire categories of race conditions.

The cost: CPU-bound work blocks the event loop. If a coroutine runs a tight computation for 100 ms without any await, every other coroutine waits 100 ms. Long CPU work must be offloaded to asyncio.run_in_executor (a thread or process pool) or broken into smaller chunks with explicit await asyncio.sleep(0) yields.

A practical rule of thumb: for I/O-bound Python servers handling thousands of concurrent connections, asyncio is the right model. For CPU-bound work, multiprocessing. For small numbers of concurrent tasks with shared state, threading is often simpler than async. Do not reach for async because it sounds modern.

Distributed Locks: When the Lock is Across Machines

The bank balance race condition from the opening does not disappear in a distributed system - it gets worse. If two servers can both handle “Transfer $500” requests for the same account simultaneously, you need a distributed lock.

Redis SETNX is the classic approach: before processing a transaction, try to set a key with SETNX (Set if Not eXists). If the set succeeds, you hold the lock. If not, another server holds it. Expire the key automatically so a crashed server does not hold the lock forever.

import redis
r = redis.Redis()

lock_key = f"lock:account:{account_id}"
lock_held = r.set(lock_key, "locked", nx=True, ex=5)  # nx=only if not exists, ex=expire in 5s

if lock_held:
    try:
        # safe to update account balance
        process_transfer(account_id, amount)
    finally:
        r.delete(lock_key)   # release lock

The hard problem with distributed locks: what if the lock holder crashes after acquiring the lock but before releasing it? The ex=5 expiry handles this - the lock expires after 5 seconds. But what if the operation takes longer than 5 seconds? What if network partitions cause both servers to think they hold the lock? The Redlock algorithm (Martin Kleppmann’s analysis of which is required reading) shows that even with Redis clustering, distributed locks have fundamental uncertainty under failure conditions.

The deeper lesson: distributed locking is hard enough that many systems avoid it by design. Database transactions with proper isolation levels handle this problem without application-level locks. The actor model (Erlang, Akka) gives each entity a mailbox - only one message processes at a time, no explicit locking. The cloud function model (AWS Lambda) scales by running many single-threaded function instances; the database handles coordination.

Lock-Free and Wait-Free Algorithms

For performance-critical code, locks have overhead: context switches when a thread blocks, cache coherence traffic, lock convoys (one slow thread delays all others). Lock-free algorithms use atomic CPU operations to achieve synchronization without locks.

Modern CPUs provide compare-and-swap (CAS): atomically check if a memory location has an expected value and, if so, replace it with a new value. This is a single hardware instruction that cannot be interrupted.

// Atomic compare-and-swap: if *ptr == old, store new; return true if swapped
bool __atomic_compare_exchange_n(int *ptr, int *old, int new, ...);

Lock-free counters, stacks, and queues can be built from CAS. They are correct under arbitrary thread interleaving without any blocking. The tradeoff: they are much harder to implement correctly, and the ABA problem (a location changes from A to B and back to A, making CAS think nothing changed) lurks everywhere.

Python exposes none of this directly. threading.Lock is the appropriate tool. At the C level, <stdatomic.h> (C11) provides portable atomic operations. Rust’s std::sync::atomic provides them with the ownership model guaranteeing correct usage.

Structured Concurrency: Making Async Manageable

Classic async code creates “fire and forget” tasks that are hard to reason about. If one task fails, when does the parent know? What happens to the other tasks?

Python 3.11’s asyncio.TaskGroup implements structured concurrency - the idea that concurrent tasks should have a well-defined lifetime tied to a scope:

async def main():
    async with asyncio.TaskGroup() as tg:
        task1 = tg.create_task(fetch(url1))
        task2 = tg.create_task(fetch(url2))
        task3 = tg.create_task(fetch(url3))
    # All tasks complete (or fail) before this line runs
    # If any task raises, the exception propagates here
    # and remaining tasks are cancelled

If fetch(url2) raises an exception, the TaskGroup cancels task1 and task3 and re-raises. You cannot leak tasks - the context manager enforces that all tasks either complete or are cancelled before the scope exits. This makes async code as readable and safe as sequential code, while maintaining the concurrency.

Kotlin coroutines and Swift’s structured concurrency have similar designs. This is where async programming is heading: structured lifetimes, automatic cancellation, no leaked tasks.

Rust: Data Race Prevention at Compile Time

Rust’s ownership and borrowing rules make data races impossible by construction. The borrow checker enforces:

Multiple threads can have immutable references to data simultaneously (reading is fine)
Only one thread can have a mutable reference at a time (exclusive write access)
References cannot outlive the data they point to

This is enforced at compile time, with zero runtime overhead. A program that has a data race does not compile. There is no runtime check, no segfault, no subtle bug discoverable only in production - just a compile error.

The practical consequence: safe concurrent Rust code is normal Rust code that happens to use Arc<Mutex<T>> or message-passing channels. The ownership model is the synchronization model.

Mechanism	Use Case	Cost	Risk
Mutex	Shared mutable state, any language	Thread blocking, contention	Deadlock if ordered wrong
Atomic operations	Simple counters, flags	Minimal (hardware instruction)	ABA problem, complexity
Async/event loop	I/O-bound, many concurrent connections	Minimal (no thread stack)	CPU work blocks loop
multiprocessing	CPU-bound Python, GIL bypass	Process creation overhead, IPC cost	Higher memory use
Distributed lock (Redis)	Cross-server coordination	Network RTT + Redis overhead	Complex failure modes
Database transactions	Financial operations, consistency	DB locking overhead	Deadlock, lock contention
Actor model	High-concurrency, message-driven	Mailbox overhead	Message ordering complexity

Read Next:

Networking - How Packets Find Their Way Across the Internet