Helpful context:


You’re standing somewhere in a city and you want to reach a specific destination. Your map gives you two kinds of moves:

  • Move A: Go 2 blocks east and 1 block north.
  • Move B: Go 1 block east and 3 blocks north.

You can do these moves as many times as you want, in any order, and you can go backwards (negative copies). The question: can you reach any point in the city using only these two moves? And if you want to reach a specific destination, how many of each do you need?

This problem is not a toy. It is, in disguised form, exactly the question at the heart of linear algebra. Almost everything in this post is an attempt to answer it precisely - and then to generalise what that answer tells us.


What Is a Move? (What Is a Vector?)

A “move” has two pieces of information: how far east, and how far north. Two numbers. So let’s write Move A as:

$$\mathbf{a} = \begin{pmatrix} 2 \\ 1 \end{pmatrix}$$

The top number is the east component, the bottom is the north component. Move B is:

$$\mathbf{b} = \begin{pmatrix} 1 \\ 3 \end{pmatrix}$$

These are vectors. A vector in $\mathbb{R}^2$ is just a pair of real numbers - but the pair has structure. It’s not just a list; it represents a displacement in space, with direction and magnitude both encoded.

Geometric view: Draw an arrow from the origin to the point $(2, 1)$. That arrow is $\mathbf{a}$. The length of the arrow is the magnitude; the way it points is the direction. When you perform Move A, you’re following that arrow.

Algebraic view: Two numbers in a specific order. That’s it. The geometry is optional, a way of thinking, not the definition. Later, when vectors show up with 1000 components - as in machine learning, where a datapoint might have 1000 features - you can’t draw them. But the algebra still works.

Discomfort check. If you’re uncomfortable with the two views not obviously being the same thing: that’s correct. They are the same thing, but seeing why takes time. For now, hold both loosely. The arrow helps you think geometrically; the column of numbers lets you calculate.

Combining moves: vector addition

If you do Move A and then Move B, you end up 3 blocks east and 4 blocks north:

$$\mathbf{a} + \mathbf{b} = \begin{pmatrix} 2 \\ 1 \end{pmatrix} + \begin{pmatrix} 1 \\ 3 \end{pmatrix} = \begin{pmatrix} 3 \\ 4 \end{pmatrix}$$

Vector addition is just component-wise addition. Geometrically, place the tail of $\mathbf{b}$ at the tip of $\mathbf{a}$; the result is the arrow from start to finish. Algebraically, add the numbers.

Scaling a move: scalar multiplication

What if you want to do Move A twice? You end up at $(4, 2)$. What if you reverse it? You go to $(-2, -1)$. What about half of Move A? You get $(1, 0.5)$.

$$2\mathbf{a} = \begin{pmatrix} 4 \\ 2 \end{pmatrix}, \quad -\mathbf{a} = \begin{pmatrix} -2 \\ -1 \end{pmatrix}, \quad \frac{1}{2}\mathbf{a} = \begin{pmatrix} 1 \\ 0.5 \end{pmatrix}$$

Scalar multiplication scales both components by the same number. Geometrically: same direction, different length (flip the direction if the scalar is negative).

Combinations: the key operation

Now suppose you do 3 copies of Move A and 2 copies of Move B. You’d end up at:

$$3\mathbf{a} + 2\mathbf{b} = 3\begin{pmatrix} 2 \\ 1 \end{pmatrix} + 2\begin{pmatrix} 1 \\ 3 \end{pmatrix} = \begin{pmatrix} 6 \\ 3 \end{pmatrix} + \begin{pmatrix} 2 \\ 6 \end{pmatrix} = \begin{pmatrix} 8 \\ 9 \end{pmatrix}$$

An expression like $3\mathbf{a} + 2\mathbf{b}$ - scaling vectors and adding them - is called a linear combination. The numbers 3 and 2 are the coefficients of the combination.

Every journey you can take using Moves A and B is a linear combination of $\mathbf{a}$ and $\mathbf{b}$.

Physics perspective. Imagine two forces acting on an object simultaneously: one pushing northeast with magnitude 2 in $x$ and 1 in $y$, another pushing with magnitude 1 in $x$ and 3 in $y$. The net force is their sum. Scaling a force is like doubling or halving its strength. The “moves” and “forces” are the same mathematical object; the physical story just helps you interpret what the numbers mean.


The Real Question: What Destinations Are Reachable?

Back to the city. You can reach any destination that can be written as some $x_1 \mathbf{a} + x_2 \mathbf{b}$ for real numbers $x_1, x_2$. The set of all such destinations is called the span of $\mathbf{a}$ and $\mathbf{b}$.

Can the span of two 2D vectors cover all of $\mathbb{R}^2$? Sometimes yes, sometimes no. It depends on whether the two moves are “genuinely different directions.”

Look at $\mathbf{a} = (2, 1)$ and $\mathbf{b} = (1, 3)$. These two arrows point in different directions - not just different lengths, fundamentally different. Intuitively, any destination should be reachable. We’ll verify this soon.

Now consider if instead of $\mathbf{b} = (1, 3)$, you had $\mathbf{b}' = (4, 2)$. Notice $\mathbf{b}' = 2\mathbf{a}$. Move B' is just “do Move A twice.” You never escape the line that goes through the origin in direction $(2, 1)$. All your linear combinations $x_1 \mathbf{a} + x_2 \mathbf{b}'$ lie on a single line through the origin. This is a far smaller set than all of $\mathbb{R}^2$.

Two vectors that are multiples of each other are called linearly dependent. They don’t give you two independent directions; you only really have one. If the two vectors point in genuinely different directions - neither is a multiple of the other - they are linearly independent, and together they can reach any point in $\mathbb{R}^2$.

The span of two linearly independent vectors in $\mathbb{R}^2$ is all of $\mathbb{R}^2$. The span of two linearly dependent vectors in $\mathbb{R}^2$ is just a line. That’s the difference between “can reach any destination” and “trapped on a single road.”

Why two independent directions always span the plane. Take $\mathbf{a} = (2, 1)$ and $\mathbf{b} = (1, 3)$. Both have components in the $x$ and $y$ directions. The key move: combine them to cancel one component at a time.

To kill the $y$-component: form $3\mathbf{a} - \mathbf{b} = (6, 3) - (1, 3) = (5, 0)$. This is a pure $x$-direction vector.

To kill the $x$-component: form $\mathbf{a} - 2\mathbf{b} = (2, 1) - (2, 6) = (0, -5)$. This is a pure $y$-direction vector.

Scale each: $\frac{1}{5}(5, 0) = (1, 0)$ and $-\frac{1}{5}(0, -5) = (0, 1)$. You now have the standard unit vectors - all built from $\mathbf{a}$ and $\mathbf{b}$ alone.

Any point $(x, y)$ in the plane is just $x(1, 0) + y(0, 1)$. Since both unit vectors are in the span of $\mathbf{a}$ and $\mathbf{b}$, every point in $\mathbb{R}^2$ is reachable.

This construction works whenever the two vectors are not multiples of each other. If they were - say $\mathbf{b} = 2\mathbf{a}$ - then $3\mathbf{a} - \mathbf{b}$ would still point in the same direction as $\mathbf{a}$, and you could never isolate a perpendicular direction. You would be stuck on a line forever.

Extending to three dimensions. In $\mathbb{R}^3$ you have three vectors $\mathbf{a}, \mathbf{b}, \mathbf{c}$ and need to build $(1,0,0)$, $(0,1,0)$, $(0,0,1)$. You can’t cancel two components in one step - but you can do it in two stages.

Stage 1 - flatten to a plane: combine $\mathbf{a}$ and $\mathbf{b}$ to cancel their $z$-components, giving a vector $\mathbf{p}$ that lives in the $xy$-plane (zero $z$-entry). Do the same with $\mathbf{a}$ and $\mathbf{c}$ to get another vector $\mathbf{q}$ also in the $xy$-plane. You now have two vectors in a 2D subspace.

Stage 2 - apply the 2D argument: if the original three were independent, then $\mathbf{p}$ and $\mathbf{q}$ are independent in the $xy$-plane - so the same cancellation from before gives you $(1,0,0)$ and $(0,1,0)$.

Recovering the third unit vector: take any original vector, say $\mathbf{a} = (a_1, a_2, a_3)$. Subtract off its $x$ and $y$ parts - which you can now build - to get $\mathbf{a} - a_1(1,0,0) - a_2(0,1,0) = (0,0,a_3)$. Scale to get $(0,0,1)$.

Proof by induction. The 3D argument is really just the 2D argument applied one level deeper. That recursive structure is exactly what induction formalises.

Base case ($n = 1$). A single nonzero vector $(a_1)$ in $\mathbb{R}^1$ spans $\mathbb{R}^1$: scale by $1/a_1$ to get $(1)$, and every real number is a multiple of that.

Inductive step. Assume any $k$ linearly independent vectors in $\mathbb{R}^k$ span $\mathbb{R}^k$. Now take $k+1$ linearly independent vectors $\mathbf{v}_1, \ldots, \mathbf{v}_{k+1}$ in $\mathbb{R}^{k+1}$.

Step 1 - find a vector with nonzero last coordinate. If every $\mathbf{v}_i$ had a zero last coordinate, all $k+1$ vectors would live in the hyperplane $x_{k+1} = 0$, which is a copy of $\mathbb{R}^k$. But you cannot have $k+1$ linearly independent vectors in $\mathbb{R}^k$, so at least one - call it $\mathbf{v}_{k+1}$ - has last coordinate $c \neq 0$.

Step 2 - flatten to $\mathbb{R}^k$. For each $i \leq k$, form $\mathbf{w}_i = c \mathbf{v}_i - (v_i)_{k+1} \mathbf{v}_{k+1}$. The last coordinate cancels to zero, so $\mathbf{w}_i$ lives in $\mathbb{R}^k$. These $k$ vectors are linearly independent: any dependence $\sum \lambda_i \mathbf{w}_i = 0$ would unpack into a dependence among the original $\mathbf{v}_i$, which don’t have one.

Step 3 - apply the inductive hypothesis. Since $\mathbf{w}_1, \ldots, \mathbf{w}_k$ are $k$ independent vectors in $\mathbb{R}^k$, they span $\mathbb{R}^k$ by assumption. This gives unit vectors $\mathbf{e}_1, \ldots, \mathbf{e}_k$ (with zero last coordinate), each expressible as a combination of the $\mathbf{w}_i$, hence of the original $\mathbf{v}_i$.

Step 4 - recover $\mathbf{e}_{k+1}$. Take $\mathbf{v}_{k+1}$ and subtract off its first $k$ components using the unit vectors just built. What remains is $(0, \ldots, 0, c)$; scale by $1/c$ to get $\mathbf{e}_{k+1}$.

All $k+1$ unit vectors are now in the span, so every point in $\mathbb{R}^{k+1}$ is reachable. By induction, the result holds for all $n$.


Packaging the Problem: Matrices

Now we want to find the actual route. Suppose you want to reach the destination $\mathbf{d} = (8, 7)$. You want to find $x_1$ and $x_2$ such that:

$$x_1 \begin{pmatrix} 2 \\ 1 \end{pmatrix} + x_2 \begin{pmatrix} 1 \\ 3 \end{pmatrix} = \begin{pmatrix} 8 \\ 7 \end{pmatrix}$$

Writing this out component by component:

$$2x_1 + 1x_2 = 8$$ $$1x_1 + 3x_2 = 7$$

Two equations, two unknowns. You’ve seen this before. But there’s a cleaner way to write the whole thing by packaging your two available moves into a single object:

$$A = \begin{pmatrix} 2 & 1 \\ 1 & 3 \end{pmatrix}$$

This is a matrix. Its first column is Move A; its second column is Move B. The system of equations above is exactly:

$$A\mathbf{x} = \mathbf{d}, \quad \text{where } \mathbf{x} = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}, \quad \mathbf{d} = \begin{pmatrix} 8 \\ 7 \end{pmatrix}$$

This is what a matrix-vector product means: $A\mathbf{x}$ is a linear combination of the columns of $A$, with coefficients given by the entries of $\mathbf{x}$.

$$A\mathbf{x} = x_1 \begin{pmatrix}2\\1\end{pmatrix} + x_2 \begin{pmatrix}1\\3\end{pmatrix}$$

This is the column picture of matrix-vector multiplication. It’s the most important way to think about it. The row picture - each row of $A$ dotted with $\mathbf{x}$ - gives you the same result but makes it harder to see what’s going on.

Why the columns are vectors - and why that’s not obvious. Look at how we built the matrix: we took two moves (vectors) and placed them side by side as columns. So the columns being vectors is not a coincidence or a theorem - it’s literally what we put in. The matrix is just a way of storing multiple vectors in a grid so we can refer to them together.

The confusion comes from the standard way of computing $A\mathbf{x}$: multiply each row of $A$ by $\mathbf{x}$ entry-by-entry and sum. That mechanical rule gives you the answer, but it looks nothing like “combining columns.” The first entry of the result is $2x_1 + x_2$, which mixes contributions from both columns. How is that “using column 1 and column 2 separately”?

The answer: regroup the arithmetic. The full result is $\begin{pmatrix} 2x_1 + x_2 \\ x_1 + 3x_2 \end{pmatrix}$. Split it by which part came from $x_1$ and which from $x_2$: $x_1 \begin{pmatrix} 2 \\ 1 \end{pmatrix} + x_2 \begin{pmatrix} 1 \\ 3 \end{pmatrix}$. Those two column vectors appear exactly. The row rule and the column picture are the same computation, just grouped differently.

The column picture is the geometrically meaningful one: $\mathbf{x}$ is a recipe (“take $x_1$ of Move A and $x_2$ of Move B”), and the columns are the moves. Multiplying $A\mathbf{x}$ executes the recipe. This is true for any matrix, not just this example: the columns of a matrix always represent the “elementary moves” that the matrix knows how to make, and a matrix-vector product is always a weighted combination of those moves.

Row picture (briefly). The first equation $2x_1 + x_2 = 8$ defines a line in the $(x_1, x_2)$ plane. The second equation defines another line. Solving the system is finding where the two lines intersect. This is geometrically valid but hard to generalise to higher dimensions. The column picture generalises cleanly.

What does the matrix do?

A matrix is a function. $A$ takes the vector $\mathbf{x}$ (your instructions: “do $x_1$ of Move A and $x_2$ of Move B”) and outputs a vector in the destination space. It transforms the “recipe” space into the “destination” space.

When you change the matrix - put different columns in - you change what destinations are reachable and how you reach them.


Solving the System

To find $x_1$ and $x_2$, solve:

$$2x_1 + x_2 = 8 \qquad (1)$$ $$x_1 + 3x_2 = 7 \qquad (2)$$

From (1): $x_2 = 8 - 2x_1$. Substitute into (2):

$$x_1 + 3(8 - 2x_1) = 7 \implies x_1 + 24 - 6x_1 = 7 \implies -5x_1 = -17 \implies x_1 = \frac{17}{5}$$

Then $x_2 = 8 - 2(17/5) = 8 - 34/5 = 6/5$.

Check: $\frac{17}{5}(2, 1) + \frac{6}{5}(1, 3) = (34/5 + 6/5, 17/5 + 18/5) = (40/5, 35/5) = (8, 7)$. ✓

The system had a unique solution. This means there’s exactly one way to combine Moves A and B to reach $(8, 7)$.


When Does a Solution Exist? When Is It Unique?

Now the general question. Given $A\mathbf{x} = \mathbf{b}$, when does a solution exist, and when is it unique?

Existence: A solution exists if and only if $\mathbf{b}$ is in the span of the columns of $A$ - that is, if $\mathbf{b}$ is reachable using the available moves. In the city: if your destination lies on roads you can actually travel. The set of all reachable destinations is called the column space of $A$, written $\text{col}(A)$.

Uniqueness: If a solution exists, it’s unique if and only if the only way to combine the columns and get $\mathbf{0}$ is to use all-zero coefficients. Formally: the only solution to $A\mathbf{x} = \mathbf{0}$ is $\mathbf{x} = \mathbf{0}$. The set of all solutions to $A\mathbf{x} = \mathbf{0}$ is called the null space of $A$.

If the null space contains a nonzero vector $\mathbf{z}$ - meaning $A\mathbf{z} = \mathbf{0}$ - then whenever you have one solution $\mathbf{x}_p$, you also have infinitely many: $\mathbf{x}_p + t\mathbf{z}$ for any $t$. Why? Because $A(\mathbf{x}_p + t\mathbf{z}) = A\mathbf{x}_p + tA\mathbf{z} = \mathbf{b} + \mathbf{0} = \mathbf{b}$.

In the city: The null space is the set of all “do-nothing trips” - combinations of moves that bring you back to where you started. If the two moves are linearly independent, the only do-nothing trip is the empty one. If they’re dependent (one is a multiple of the other), you can do Move A forward and then the equivalent of Move A backward via Move B, ending where you started with a nonzero combination.


Rank: How Many Independent Directions Do You Actually Have?

The rank of a matrix is the number of linearly independent columns (equivalently, rows). It measures how many genuinely different directions your moves span.

For our matrix $A = \begin{pmatrix} 2 & 1 \\ 1 & 3 \end{pmatrix}$: the two columns $(2, 1)$ and $(1, 3)$ are not multiples of each other, so they are linearly independent. Rank = 2.

If instead we had $A = \begin{pmatrix} 2 & 4 \\ 1 & 2 \end{pmatrix}$: the second column is twice the first. They point in the same direction. Rank = 1. The column space is a single line through the origin, and most destinations are unreachable.

The rank-nullity theorem says:

$$\text{rank}(A) + \text{nullity}(A) = n$$

where $n$ is the number of columns. In our $2 \times 2$ case: rank + nullity = 2. If rank = 2 (full column rank), nullity = 0, meaning the null space is just $\{\mathbf{0}\}$ - unique solutions whenever a solution exists. If rank = 1, nullity = 1 - a whole line of “do-nothing trips.”

Intuition: You start with $n$ available moves. Some of them are redundant (reachable by combining other moves). Rank counts the non-redundant ones. The nullity counts the redundancy - how many independent ways to combine all moves and go nowhere.

Proof of rank-nullity. Let $k = \text{nullity}(A)$ and let $\mathbf{z}_1, \ldots, \mathbf{z}_k$ be a basis for the null space of $A$. Any set of $k$ independent vectors in $\mathbb{R}^n$ can be extended to a full basis, so add $n - k$ further vectors $\mathbf{u}_1, \ldots, \mathbf{u}_{n-k}$ so that $\mathbf{z}_1, \ldots, \mathbf{z}_k, \mathbf{u}_1, \ldots, \mathbf{u}_{n-k}$ is a basis for $\mathbb{R}^n$.

Claim: $A\mathbf{u}_1, \ldots, A\mathbf{u}_{n-k}$ is a basis for $\text{col}(A)$. If true, then $\text{rank}(A) = n - k$, which is the theorem.

They span $\text{col}(A)$: Take any $\mathbf{b} \in \text{col}(A)$, so $\mathbf{b} = A\mathbf{x}$ for some $\mathbf{x}$. Expand $\mathbf{x}$ in the basis: $\mathbf{x} = \sum c_i \mathbf{z}_i + \sum d_j \mathbf{u}_j$. Apply $A$: since each $A\mathbf{z}_i = \mathbf{0}$ (the $\mathbf{z}_i$ are in the null space), the null-space terms vanish and $\mathbf{b} = A\mathbf{x} = \sum d_j A\mathbf{u}_j$. So every vector in $\text{col}(A)$ is a linear combination of $A\mathbf{u}_1, \ldots, A\mathbf{u}_{n-k}$.

They are linearly independent: Suppose $\sum d_j A\mathbf{u}_j = \mathbf{0}$. Then $A\bigl(\sum d_j \mathbf{u}_j\bigr) = \mathbf{0}$, so $\sum d_j \mathbf{u}_j$ is in the null space. Every null-space vector is a combination of the null-space basis, so $\sum d_j \mathbf{u}_j = \sum c_i \mathbf{z}_i$ for some $c_i$. Rearranging: $\sum c_i \mathbf{z}_i - \sum d_j \mathbf{u}_j = \mathbf{0}$. But $\mathbf{z}_1, \ldots, \mathbf{z}_k, \mathbf{u}_1, \ldots, \mathbf{u}_{n-k}$ is a basis - hence linearly independent - so all coefficients are zero. In particular each $d_j = 0$.

Since $A\mathbf{u}_1, \ldots, A\mathbf{u}_{n-k}$ spans $\text{col}(A)$ and is independent, it is a basis of size $n - k$. Therefore $\text{rank}(A) = n - k$, giving $\text{rank}(A) + \text{nullity}(A) = n$.


The Bigger Picture: Other Perspectives

We solved a city-navigation problem. Here’s why this same structure appears everywhere else:

Data and machine learning. A dataset is a matrix $X \in \mathbb{R}^{n \times d}$ - $n$ data points, each with $d$ features. A linear model predicts outcomes as $X\mathbf{w} = \hat{\mathbf{y}}$ for some weight vector $\mathbf{w}$. Finding $\mathbf{w}$ from observed outcomes $\mathbf{y}$ is solving a linear system. The column space of $X$ is the set of predictions your model can make. If two features are perfectly correlated (one is a linear combination of others), the matrix is rank-deficient and infinitely many weight vectors make the same prediction.

Graphics and transformations. A 2D rotation by angle $\theta$ is:

$$R_\theta = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}$$

Applying this matrix to a vector rotates it. A matrix here is a transformation of space, not a lookup table. This view - matrices as linear maps - is the right one for graphics, robotics, and physics.

Networks and systems. The equations governing currents in an electrical circuit, or flows in a network, or concentrations in a chemical system at equilibrium, all take the form $A\mathbf{x} = \mathbf{b}$. The matrix encodes the structure of the system; the vector $\mathbf{b}$ encodes the inputs; the solution $\mathbf{x}$ is the state.


The Rigorous Underpinning (for when you want it)

Everything above was intuition-first. Here is where the intuition connects to the formal definitions.

A vector space over $\mathbb{R}$ is a set $V$ with two operations - addition and scalar multiplication - satisfying eight axioms (commutativity, associativity, identity, inverses, distributivity). $\mathbb{R}^n$ is the canonical example, but polynomial spaces and function spaces also qualify, which is why the same linear algebra applies in differential equations and quantum mechanics.

Linear independence is the formal condition: vectors $\mathbf{v}_1, \ldots, \mathbf{v}_k$ are linearly independent if the only solution to $c_1 \mathbf{v}_1 + \cdots + c_k \mathbf{v}_k = \mathbf{0}$ is $c_1 = \cdots = c_k = 0$. A basis for a subspace is a linearly independent spanning set. Its size is the dimension.

Column space and null space are both subspaces (closed under addition and scalar multiplication). Rank-nullity is a dimension-counting theorem: the dimensions of these two subspaces add up to the number of columns. It is one of the cleanest facts in all of mathematics.

A matrix $A$ is invertible if and only if it has full rank - equivalently, its column space is the entire output space, and its null space contains only $\mathbf{0}$. In the city: invertible means you can reach any destination and do so uniquely.


Summary: The Same Question, Many Forms

Question Linear algebra language
Can I reach destination $\mathbf{b}$ with moves $A$? Is $\mathbf{b} \in \text{col}(A)$?
Is there a unique route? Is $\text{null}(A) = \{\mathbf{0}\}$?
How many independent moves do I really have? What is $\text{rank}(A)$?
How much redundancy is in my moves? What is $\text{nullity}(A)$?
Can I reach anywhere? Does $A$ have full rank?

The city problem is solved. The reason to learn linear algebra is that the same matrix $A\mathbf{x} = \mathbf{b}$ is the skeleton of problems in physics, data science, engineering, economics, and geometry - all at once. The names of the variables change. The structure doesn’t.


Read next: