You open a file from six months ago. The function is called proc. It takes d, x, and flag. There are no comments. The logic involves a nested loop, three early returns, and a magic number 17 that you no longer remember the significance of. You wrote this. It made perfect sense at the time.

This experience is so universal that it has become the canonical argument for clean code: code is written once and read many times. A commonly cited rule of thumb, repeated in Martin’s Clean Code, puts the ratio of time spent reading to time spent writing at more than ten to one, across code review, debugging, onboarding, and future modification. The primary audience for your code is not the compiler - it is the next programmer who has to understand it, which is often yourself.

But “clean code” has also become a loaded term, attached to a particular book (Robert C. Martin’s Clean Code, 2008) and a set of rules that have since generated significant debate. This post tries to give you the underlying principles - the reasons the rules exist - and the judgment to know when to break them.

The Historical Arc: Clean Code, TDD, and the Backlash

The Clean Code movement emerged in the late 1990s and early 2000s from the Extreme Programming (XP) school, which advocated short iterations, pair programming, and - centrally - test-driven development (TDD). The thesis was that tests written before code would force better design, because untestable code is usually bad code.

Martin’s Clean Code (2008) codified a set of practices: functions should be tiny (often under five lines), names should be long and descriptive, comments should be rare, classes should have single responsibilities. These rules produced cleaner code in many contexts and became influential across the industry.

By the 2010s, the backlash had begun. Critics pointed out that Martin’s examples often fragmented simple logic into dozens of tiny functions, making code harder to follow, not easier. The “function should do one thing” rule, taken to an extreme, produces code where understanding a simple operation requires tracing through ten layers of abstraction. Dan McKinley’s “Choose Boring Technology” and similar pragmatist writing pushed back against over-engineering. The Hacker News thread when Martin posted his response to the “FizzBuzz enterprise edition” parody is worth reading if you want to understand the fault lines.

Where have we landed? The underlying principles are sound: clear names, small well-defined functions, tests that verify behavior. The rigid rules about maximum function length or banning comments entirely are now understood as heuristics that apply in many contexts but not all. The job is judgment.

Naming: The Cheapest Documentation You Have

A good name is the cheapest form of documentation. It costs nothing at runtime and communicates intent to every reader.

The goal is that a reader should understand what a function does from its name without reading its body. If reading the body is required to understand the name, the name is wrong.

# Unclear: what is d? what does it compute? what does flag do?
def calc(d, x, flag):
    return d * x * (1 - flag)

# Clear: the intent is unambiguous from the signature alone
def compute_discounted_price(
    unit_price: float,
    quantity: int,
    discount_rate: float
) -> float:
    return unit_price * quantity * (1 - discount_rate)

Practical naming rules:

  • Functions are verb phrases: calculate_tax, fetch_user, validate_email. Not taxCalc, not email.
  • Variables are nouns: user, order_total, retry_count.
  • Booleans are assertions: is_valid, has_permission, user_exists. Not valid, permission, flag.
  • Collections are plural: users, orders, error_messages.
  • Avoid abbreviations unless they are universally known (e.g., url, id). usr, cnt, tmp obscure rather than compress.
  • Be specific about units: timeout_ms or duration_seconds rather than timeout or duration. Units in names eliminate entire categories of bugs.
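As a rough illustration of several of these rules working together, here is a small sketch. The function and all of its parameter names are invented for this example; they are not taken from any particular codebase.

```python
def is_retry_allowed(
    retry_count: int,
    max_retries: int,
    elapsed_ms: int,
    timeout_ms: int,
) -> bool:
    """Return True if another attempt is permitted (hypothetical retry policy)."""
    # Booleans read as assertions
    has_attempts_left = retry_count < max_retries
    # The unit (milliseconds) is visible at every comparison
    is_within_timeout = elapsed_ms < timeout_ms
    return has_attempts_left and is_within_timeout


allowed = is_retry_allowed(retry_count=1, max_retries=3, elapsed_ms=200, timeout_ms=1000)
```

With keyword arguments at the call site, a reader can tell that 200 and 1000 are both milliseconds without opening the function body.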

The naming question to ask yourself: if I showed only this function’s name to someone who hasn’t seen the code, would they know what it does and why it exists?

Functions: One Thing, Testably Well

A function should do one thing. The test: can you describe what the function does without using the word “and”? If the honest description is “it fetches the data and parses it and writes to the database,” you have three functions pretending to be one.

import json
import requests

# Three responsibilities in one function
def process_report(url: str, filename: str) -> None:
    response = requests.get(url)        # responsibility 1: fetch
    data = json.loads(response.text)   # responsibility 2: parse
    with open(filename, "w") as f:     # responsibility 3: write
        json.dump(data, f)

# Three separate, testable functions
def fetch_report(url: str) -> str:
    response = requests.get(url)
    return response.text

def parse_report(raw_text: str) -> dict:
    return json.loads(raw_text)

def save_report(data: dict, filename: str) -> None:
    with open(filename, "w") as f:
        json.dump(data, f)

# The orchestrator only wires the three steps together
def process_report(url: str, filename: str) -> None:
    raw_text = fetch_report(url)
    data = parse_report(raw_text)
    save_report(data, filename)

The separated version is more testable: fetch_report can be tested with a mock HTTP response, save_report can be tested with a temporary file, and neither needs the other to be exercised. The monolithic version requires mocking both the HTTP call and the file system simultaneously to test anything.

Pure functions are the gold standard. A pure function has no side effects and returns the same output for the same input, always. It does not modify global state, write to disk, make network calls, or depend on the current time. Pure functions are trivially testable (no setup required, no cleanup needed), easy to reason about (no hidden state), and safe to parallelize.

# Pure: output depends only on input, no side effects
def calculate_tax(amount: float, rate: float) -> float:
    return amount * rate

from datetime import datetime

# Impure: depends on external state (current time)
def is_business_hours() -> bool:
    return 9 <= datetime.now().hour < 17

Push impurity to the edges of your system. Business logic should be pure where possible; side effects (database writes, API calls, file I/O) should be isolated in thin outer layers. This is the idea behind Hexagonal Architecture (also known as Ports and Adapters) - not coincidentally, an architecture that is also easy to unit test.
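One way to apply this to the business-hours check is to pass the time in as an argument and leave the clock read to a thin wrapper. This is a minimal sketch; the split into two functions and the name is_business_hours_now are choices made for this example.

```python
from datetime import datetime


def is_business_hours(moment: datetime) -> bool:
    """Pure: the answer depends only on the argument."""
    return 9 <= moment.hour < 17


def is_business_hours_now() -> bool:
    """Impure edge: the only place that reads the system clock."""
    return is_business_hours(datetime.now())


# The pure core can be tested with fixed timestamps, no mocking required
morning = datetime(2024, 1, 15, 10, 30)
evening = datetime(2024, 1, 15, 17, 0)
```

The rule itself (09:00 up to but not including 17:00) is now testable at any time of day, and only the one-line wrapper remains untestable without controlling the clock.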

Comments: Explain Why, Not What

Code shows what is happening. Comments should explain why - the business rule, the historical constraint, the non-obvious performance decision.

# Bad: restates what the code already shows
i += 1   # increment i by 1

# Bad: "explains" code that should just be clearer
# Check if list is not empty and has valid first element
if len(items) > 0 and items[0] is not None:
    process(items[0])

# Good: explains a non-obvious business rule
# Skip the first row - it is always the header, even in auto-generated exports
for row in csv_rows[1:]:
    process(row)

# Good: documents a known limitation or tradeoff
# Using a list here instead of a set because the downstream renderer
# depends on insertion order, and sets do not preserve insertion order
# in any Python version. See issue #412.
ordered_items = []

If a comment explains what the code does rather than why, ask: could I write the code so clearly that the comment is unnecessary? Often yes. Sometimes no - and in those cases, a good comment is worth more than a forced simplification.
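For the "list is not empty and has valid first element" case, one possible rewrite moves the explanation into a name. The helper has_valid_first_item and the stand-in process function are hypothetical, introduced only to keep the sketch runnable.

```python
from typing import Sequence


def has_valid_first_item(items: Sequence[object]) -> bool:
    """True when the sequence is non-empty and its first element is not None."""
    return len(items) > 0 and items[0] is not None


def process(item: object) -> str:
    # Stand-in for the real work (hypothetical)
    return str(item).upper()


items = ["alpha", None, "gamma"]
result = None
if has_valid_first_item(items):   # the condition now explains itself
    result = process(items[0])
```

Whether the extra function is worth it depends on how often the check recurs; for a single use, the original two lines with no comment at all may be just as clear.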

The DRY Principle and Its Limits

DRY - Don’t Repeat Yourself - says each piece of knowledge should have a single, authoritative representation in the codebase. If you copy-paste a block of logic three times and the logic later changes, you must find and update all three copies. Miss one, and you have a bug.

The tension is premature abstraction: extracting a shared abstraction before you understand the actual pattern. Two pieces of code that look similar might be similar by coincidence rather than by shared meaning. If you abstract them prematurely, you couple two things that should be free to evolve independently.

The heuristic: wait until you have three concrete instances before abstracting. Two is coincidence; three is a pattern. This is sometimes called the “rule of three.”

Sandi Metz has a useful inversion: “duplication is far cheaper than the wrong abstraction.” A copy-pasted function can be independently modified. A wrong abstraction causes every caller to adapt to its mistakes.
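A small sketch of coincidental duplication, with made-up validation rules (both function names and the length limits are assumptions for illustration):

```python
# Two rules that happen to look identical today
def is_valid_username(username: str) -> bool:
    return 3 <= len(username) <= 20


def is_valid_product_code(product_code: str) -> bool:
    return 3 <= len(product_code) <= 20


# Tempting but premature: a shared helper would couple two unrelated
# business rules that merely share a shape right now.
def is_valid_length(text: str) -> bool:
    return 3 <= len(text) <= 20
```

If usernames are later allowed 30 characters, the duplicated version needs a one-line change in is_valid_username. The shared helper would need a new parameter or a special case, and every caller would have to be re-examined.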

Type Hints: Lightweight Specification

Python’s type hints are optional annotations that document the expected types of function arguments and return values. They do not affect runtime behavior, but they enable static analysis tools (mypy, pyright) to catch type errors before execution, and they serve as machine-checkable documentation.

def send_email(
    recipient: str,
    subject: str,
    body: str,
    cc: list[str] | None = None,
    priority: int = 1,
) -> bool:
    """Returns True if the email was successfully queued."""
    ...

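The signature tells you everything you need to know to call this function correctly: what each parameter expects, which are optional, and what you get back. Without type hints, you might need to read the function body or the docs to know whether cc can be omitted. (The list[str] | None syntax requires Python 3.10 or later; on older versions, Optional[List[str]] from the typing module expresses the same thing.)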

Run mypy in CI to catch regressions:

pip install mypy
mypy my_module.py

Type annotations are not all-or-nothing. Annotating function signatures (the public interface) is high value; annotating every local variable is often noise. Start with function signatures, especially in code that crosses module boundaries.
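To make the static-analysis benefit concrete, here is a sketch of the kind of mistake a checker can flag before the code runs. The functions and data are hypothetical; the commented-out call is the line a tool such as mypy would be expected to reject.

```python
from typing import Optional


def find_email(emails_by_user_id: dict[int, str], user_id: int) -> Optional[str]:
    """Return the user's email, or None if the user is unknown."""
    return emails_by_user_id.get(user_id)


def email_domain(email: str) -> str:
    """Return the part after the @ sign."""
    return email.split("@", 1)[1]


emails_by_user_id = {1: "alice@example.com"}

# Type error: find_email may return None, but email_domain requires str.
#     email_domain(find_email(emails_by_user_id, 2))

email = find_email(emails_by_user_id, 1)
domain = None
if email is not None:   # the None check narrows Optional[str] to str
    domain = email_domain(email)
```

Without the annotations, the unchecked call would fail only at runtime, and only for user IDs that happen to be missing.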

Unit Testing with pytest

A unit test verifies one isolated behavior of the code. The Arrange-Act-Assert pattern gives each test a clear structure:

# calculator.py
def divide(a: float, b: float) -> float:
    if b == 0:
        raise ValueError("Cannot divide by zero")
    return a / b

# test_calculator.py
import pytest
from calculator import divide

def test_divide_normal_case():
    # Arrange: nothing to set up, pure function
    # Act
    result = divide(10, 2)
    # Assert
    assert result == 5.0

def test_divide_by_zero_raises_value_error():
    with pytest.raises(ValueError, match="divide by zero"):
        divide(10, 0)

def test_divide_returns_float_not_int():
    result = divide(7, 2)
    assert result == 3.5
    assert isinstance(result, float)

Each test function tests exactly one behavior. The test name says what behavior it tests. When a test fails, you know from the name alone what broke.

Run with pytest from the command line. pytest discovers tests automatically in files matching test_*.py or *_test.py.
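One caveat about the assertions above: exact equality works for inputs like 10 / 2, but it is fragile for floats in general. A sketch of a tolerance-based test using the standard library's math.isclose follows (pytest.approx serves the same purpose inside pytest). The divide function is repeated here only so the snippet runs on its own.

```python
import math


def divide(a: float, b: float) -> float:
    # Same function as in calculator.py, repeated for self-containment
    if b == 0:
        raise ValueError("Cannot divide by zero")
    return a / b


def test_divide_result_is_close_to_expected():
    # 0.3 / 0.1 is not exactly 3.0 in binary floating point,
    # so compare within a tolerance instead of using ==
    result = divide(0.3, 0.1)
    assert math.isclose(result, 3.0, rel_tol=1e-9)


def test_divide_negative_divisor_flips_sign():
    assert divide(10, -2) == -5.0
```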

Mocking: Isolating Your Code from the World

Tests should be fast and deterministic. Network calls, database queries, and filesystem operations are neither. Use mocking to replace them with controlled stand-ins.

from unittest.mock import patch, MagicMock
from my_module import fetch_user_data

def test_fetch_user_data_returns_parsed_json():
    mock_response = MagicMock()
    mock_response.json.return_value = {"id": 1, "name": "alice"}
    mock_response.status_code = 200

    with patch("my_module.requests.get", return_value=mock_response):
        result = fetch_user_data(user_id=1)

    assert result["name"] == "alice"

def test_fetch_user_data_handles_404():
    mock_response = MagicMock()
    mock_response.status_code = 404

    with patch("my_module.requests.get", return_value=mock_response):
        result = fetch_user_data(user_id=999)

    assert result is None

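The real HTTP call never happens. The test is instant, requires no network, and controls both the success and failure paths. Patching works here because fetch_user_data resolves requests.get at call time, so patch can temporarily replace that attribute for the duration of the with block. Strictly speaking, this is not dependency inversion: the function still hard-codes its dependency, and the test has to know where it is imported. Passing the HTTP client in as a parameter would invert the dependency and make the patch unnecessary.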
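For comparison, here is a sketch of a variant that takes the HTTP function as a parameter, so the test needs no patch at all. This is not the fetch_user_data from my_module (whose body is not shown in this post); the extra http_get parameter and the placeholder URL are assumptions made for the example.

```python
from typing import Any, Callable, Optional
from unittest.mock import MagicMock


def fetch_user_data(user_id: int, http_get: Callable[[str], Any]) -> Optional[dict]:
    """Hypothetical variant: the caller supplies the function that does the GET."""
    response = http_get(f"https://api.example.com/users/{user_id}")  # placeholder URL
    if response.status_code != 200:
        return None
    return response.json()


def test_fetch_user_data_returns_parsed_json():
    mock_response = MagicMock()
    mock_response.status_code = 200
    mock_response.json.return_value = {"id": 1, "name": "alice"}
    fake_get = MagicMock(return_value=mock_response)

    result = fetch_user_data(user_id=1, http_get=fake_get)

    assert result == {"id": 1, "name": "alice"}
    fake_get.assert_called_once_with("https://api.example.com/users/1")


def test_fetch_user_data_handles_404():
    mock_response = MagicMock()
    mock_response.status_code = 404

    result = fetch_user_data(user_id=999, http_get=MagicMock(return_value=mock_response))

    assert result is None
```

In production code the caller would pass requests.get (or a session's get method). The tradeoff is an extra parameter at every call site in exchange for tests that do not depend on import paths.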

Test-Driven Development: When It Is Worth It

TDD (test-driven development) is a discipline where you write the failing test first, then write the minimum code to make it pass, then refactor. The cycle is: Red → Green → Refactor.
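A compressed sketch of one pass through the cycle, using a made-up slugify function as the target. In real TDD the tests and the implementation would live in separate files and be written at different moments; they share a block here only for brevity.

```python
import re


# Red: these tests are written first and fail, because slugify does not exist yet
def test_slugify_lowercases_and_hyphenates():
    assert slugify("Hello World") == "hello-world"


def test_slugify_strips_punctuation():
    assert slugify("Clean Code, Revisited!") == "clean-code-revisited"


# Green: the minimum implementation that makes both tests pass
def slugify(title: str) -> str:
    words = re.findall(r"[a-z0-9]+", title.lower())
    return "-".join(words)


# Refactor: with the tests passing, internals can be reshaped safely
# (for example, precompiling the regex) while the tests stay unchanged.
```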

The argued benefits are real: tests written first force you to think about the interface before the implementation, which often produces better-designed code. It is harder to write a test for a function that mixes IO with logic - the pain of testing pushes you toward separation.

The honest critique: TDD feels slower, especially at first, and especially for exploratory code where you do not yet know what the interface should be. For code you are writing to figure out what it should do, writing tests first can be counterproductive - you end up throwing away both the tests and the code.

The resolution that most experienced practitioners have landed on: TDD is most valuable for code that is well-understood, highly consequential, and will change frequently. It is least valuable for exploratory prototypes and stable, rarely-modified library code. For a payment processing system or a cryptographic implementation, write the tests first. For a script to parse a one-off data format, just write the script.

The testing instinct - that code should be tested, that untested critical paths are a liability - is always right. The specific discipline of tests-first is a useful tool, not a moral requirement.

Test Coverage: A Proxy, Not a Goal

Test coverage measures what percentage of your code was executed by your test suite. High coverage is generally good, but 100% coverage is not the goal and pursuing it mechanically causes harm.

Coverage can be 100% with tests that make no assertions. Coverage at 85% might mean every critical code path is tested and only some trivial utility functions are not. Coverage at 100% might mean every line is executed but no edge cases are tested.
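A minimal sketch of the gap between coverage and verification, using a hypothetical apply_discount function. Both tests below execute every line of it, so a line-coverage tool would be expected to report the same figure for either one, but only the second would notice a wrong result.

```python
def apply_discount(price: float, discount_rate: float) -> float:
    """Hypothetical function under test."""
    if discount_rate > 1:
        discount_rate = 1.0   # clamp: a discount cannot exceed 100%
    return price * (1 - discount_rate)


def test_apply_discount_runs():
    # Executes both branches, asserts nothing: full coverage, zero verification
    apply_discount(100.0, 0.25)
    apply_discount(100.0, 1.5)


def test_apply_discount_returns_expected_prices():
    # Same lines executed, but the results are actually checked
    assert apply_discount(100.0, 0.25) == 75.0
    assert apply_discount(100.0, 1.5) == 0.0
```

Neither test tries a negative discount_rate, which would silently raise the price. That edge case is invisible to the coverage number.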

pip install pytest-cov
pytest --cov=my_module --cov-report=term-missing

The term-missing report lists the line numbers that were not executed. Use this to find important paths that are not tested - error handlers, edge cases, boundary conditions. Do not use it to find lines to write trivial tests for in order to hit a percentage target.

The real goal is confidence: you want to be confident that the code does what you think it does, and that changes you make do not break things you care about. Coverage is one signal toward that goal, not the goal itself.

The Critique of Clean Code Dogma

Martin’s Clean Code has specific rules that deserve scrutiny, not wholesale acceptance:

“Functions should be very small” taken to extremes produces call graphs so deep that understanding any single operation requires tracing through dozens of functions. Human working memory holds only a handful of items at once (Miller’s classic estimate was about seven; later research suggests closer to four), and a ten-layer call stack exceeds that. Sometimes a 50-line function that does its job completely is easier to understand than five 10-line functions that must be mentally assembled.

“Never use comments” is an overstatement. The real rule is “do not use comments to explain what code does.” Comments that explain why - architectural decisions, business rules, known limitations - are valuable. Eliminating them in pursuit of “self-documenting code” produces code that is opaque to anyone who wasn’t there when the decision was made.

“Getters and setters for everything” (from the related advice to encapsulate fields) often produces Java-style boilerplate in Python where it adds no value. Python’s @property decorator exists exactly so you can start with a plain attribute and add behavior later without changing the interface. Start plain; add encapsulation when you need it.
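A sketch of what "start plain; add encapsulation when you need it" can look like. The Account class is hypothetical. Assume its first version stored balance as a plain attribute, and the validation below was added later.

```python
class Account:
    def __init__(self, balance: float) -> None:
        self.balance = balance   # goes through the setter below

    @property
    def balance(self) -> float:
        return self._balance

    @balance.setter
    def balance(self, value: float) -> None:
        # Behavior added after the fact; callers did not have to change
        if value < 0:
            raise ValueError("balance cannot be negative")
        self._balance = value


account = Account(100.0)
account.balance = 40.0            # same syntax as a plain attribute
current_balance = account.balance
```

Code written against the plain-attribute version (account.balance = 40.0) keeps working unchanged, which is why there is little reason to write get_balance and set_balance up front.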

The healthier framing: these are heuristics, not laws. They were derived from real problems in real codebases, and they solve those problems in many contexts. Learn why they exist. Apply them with judgment.

Future Directions: Property-Based Testing and Beyond

Property-based testing (Hypothesis for Python) is a step beyond example-based testing. Instead of testing specific inputs, you define properties that should hold for all inputs, and the framework generates many random inputs (100 per test by default in Hypothesis) trying to find a counterexample. When it finds one, it shrinks it to a minimal failing case.

from hypothesis import given, strategies as st

@given(st.lists(st.integers()))
def test_sort_is_idempotent(lst):
    """Sorting an already-sorted list should produce the same result."""
    sorted_once = sorted(lst)
    sorted_twice = sorted(sorted_once)
    assert sorted_once == sorted_twice

@given(st.lists(st.integers()))
def test_sort_length_preserved(lst):
    """Sorting should not change the number of elements."""
    assert len(sorted(lst)) == len(lst)

Because the inputs are generated rather than hand-picked, property-based tests tend to surface edge cases (empty lists, duplicates, extreme integers) that a suite of manually chosen examples would likely never include. Property-based testing is particularly effective for: pure functions with well-defined mathematical properties, parsers and serializers (encode then decode should round-trip), and data transformations that should preserve invariants.
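The round-trip property can be sketched without any framework. The example below hand-rolls the idea with the standard library's random module instead of Hypothesis, so it has none of Hypothesis's input shrinking or smarter generation. The helper names are invented for this illustration.

```python
import json
import random


def random_int_list(rng: random.Random, max_length: int = 20) -> list:
    """Generate a list of random integers, possibly empty."""
    length = rng.randint(0, max_length)
    return [rng.randint(-10**6, 10**6) for _ in range(length)]


def check_json_round_trip(trials: int = 200, seed: int = 0) -> None:
    """Property: decoding an encoded value gives back the original value."""
    rng = random.Random(seed)   # fixed seed keeps any failure reproducible
    for _ in range(trials):
        original = random_int_list(rng)
        decoded = json.loads(json.dumps(original))
        assert decoded == original, f"round trip changed {original!r}"


check_json_round_trip()
```

With Hypothesis, the same property would be a single @given-decorated test, and a failing input would be reduced to its simplest form automatically.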

Type annotations as specification are becoming more powerful. mypy and pyright can now catch a wide range of errors that would previously only surface at runtime: calling a function with the wrong number of arguments, passing None where a non-None value is required, accessing attributes that don’t exist on a type. The trend is toward stricter typing as a lightweight formal specification layer.

Tests in CI/CD are the gatekeepers between development and deployment. Every pull request should trigger the full test suite. Failing tests block merging. This is not just a tooling detail - it is the cultural signal that tests are non-negotiable, not aspirational. Code review culture that takes testing seriously is the environment where clean, testable code becomes the norm rather than the exception.

Summary

| Practice | What It Solves | When to Bend the Rule |
| --- | --- | --- |
| Descriptive naming | “What does this do?” at a glance | Domain jargon that teams know universally |
| Small, focused functions | Testability, reasoning in isolation | When splitting fragments logic unnaturally |
| Pure functions | Triviality of testing, reasoning | IO and side effects are unavoidable at boundaries |
| Comments explain why, not what | Undocumented intent and constraints | Eliminate by clarifying code when possible |
| Unit tests with Arrange-Act-Assert | Confidence in isolated behaviors | Exploratory prototyping phases |
| TDD | Better interface design, earlier feedback | Exploratory code, stable library code |
| Type hints | Interface documentation, static analysis | One-off scripts, internal variables |
| Coverage as proxy metric | Finding untested critical paths | Never chase 100% mechanically |
| Property-based testing | Finding edge cases you wouldn’t think to test | Mathematical/invariant-heavy transformations |

Clean code is not a destination. It is a direction. The practical version of the principle: leave code easier to understand than you found it, test the things that matter, and design systems so that the next person (including future you) can work in them without dread.

