Linear Transformations - Geometry Encoded as Arithmetic
We have been writing $A\mathbf{x}$ and treating it as a mechanical computation: multiply rows by the vector, add up the entries, get a new vector. But there is a different and more powerful way to read this expression. A matrix is not just a table of numbers sitting in front of a vector - it is a machine for transforming space itself.
Every time you compute $A\mathbf{x}$, you are applying a function from one vector space to another. The output vector lives in a (possibly different) space from the input. The matrix encodes everything about what the function does. This reframing - from “computation” to “transformation” - is the shift that makes linear algebra applicable everywhere, including machine learning, graphics, physics, and differential equations.
This post is about understanding what that transformation actually does.
Watching a Transformation in Action
Start in $\mathbb{R}^2$ where we can see things concretely. Consider the matrix:
$$R = \begin{pmatrix} 0 & -1 \\ 1 & 0 \end{pmatrix}$$
What does it do to vectors? Try the standard basis vectors $\mathbf{e_1} = (1, 0)$ and $\mathbf{e_2} = (0, 1)$:
$$R\mathbf{e_1} = \begin{pmatrix} 0 & -1 \\ 1 & 0 \end{pmatrix}\begin{pmatrix} 1 \\ 0 \end{pmatrix} = \begin{pmatrix} 0 \\ 1 \end{pmatrix}, \qquad R\mathbf{e_2} = \begin{pmatrix} 0 & -1 \\ 1 & 0 \end{pmatrix}\begin{pmatrix} 0 \\ 1 \end{pmatrix} = \begin{pmatrix} -1 \\ 0 \end{pmatrix}$$
The east direction $(1, 0)$ became the north direction $(0, 1)$. The north direction $(0, 1)$ became the west direction $(-1, 0)$. Every vector in the plane is rotated by $90°$ counterclockwise. $R$ is a rotation matrix.
Notice what just happened: to understand what $R$ does to the entire plane, we only needed to check two vectors, because once you know where the basis vectors go, you know where every vector goes.
A general vector can be written $\mathbf{v} = (v_1, v_2) = v_1\mathbf{e_1} + v_2\mathbf{e_2}$, so:
$$R\mathbf{v} = R(v_1\mathbf{e_1} + v_2\mathbf{e_2}) = v_1(R\mathbf{e_1}) + v_2(R\mathbf{e_2}) = v_1\begin{pmatrix}0\\1\end{pmatrix} + v_2\begin{pmatrix}-1\\0\end{pmatrix} = \begin{pmatrix}-v_2\\v_1\end{pmatrix}$$
The two columns of $R$ are exactly $R\mathbf{e_1}$ and $R\mathbf{e_2}$. The columns of a matrix are the images of the standard basis vectors. This is not a coincidence - it follows directly from the multiplication rule: multiplying a matrix by $\mathbf{e_j}$ picks out its $j$-th column.
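The column observation is easy to check numerically. Here is a minimal sketch in plain Python (matrices stored as lists of rows; the helper names `matvec` and `column` are made up for this illustration):

```python
def matvec(A, x):
    """Multiply matrix A (a list of rows) by vector x (a list of numbers)."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def column(A, j):
    """Return column j of A as a list."""
    return [row[j] for row in A]


R = [[0, -1],
     [1,  0]]

e1, e2 = [1, 0], [0, 1]

# The images of the basis vectors are exactly the columns of R.
print(matvec(R, e1), column(R, 0))   # [0, 1] [0, 1]
print(matvec(R, e2), column(R, 1))   # [-1, 0] [-1, 0]

# A general vector (v1, v2) lands on (-v2, v1).
print(matvec(R, [3, 4]))             # [-4, 3]
```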
A Gallery of Transformations
Different matrices do qualitatively different things to space. Here are the most important families.
Scaling. The matrix $S = \begin{pmatrix} 2 & 0 \\ 0 & 3 \end{pmatrix}$ sends $\mathbf{e_1} \to (2, 0)$ and $\mathbf{e_2} \to (0, 3)$. It stretches the horizontal axis by 2 and the vertical axis by 3. A circle becomes an ellipse. Distances change but the coordinate axes remain perpendicular.
Projection. The matrix $P = \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix}$ sends $(x, y) \to (x, 0)$: every vector gets projected onto the $x$-axis. The $y$-component is wiped out. Information is lost. This transformation is not invertible - you cannot recover $(x, y)$ from $(x, 0)$ because the original $y$ is gone.
Reflection. The matrix $F = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$ sends $\mathbf{e_1} \to \mathbf{e_2}$ and $\mathbf{e_2} \to \mathbf{e_1}$: it swaps the two coordinates. Geometrically, every vector is reflected across the line $y = x$. No information is lost - reflect twice and you return to the start.
Shear. The matrix $H = \begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix}$ sends $\mathbf{e_1} \to (1, 0)$ (unchanged) and $\mathbf{e_2} \to (1, 1)$. The $x$-axis stays put; the $y$-axis tips to the right. A square grid of points becomes a parallelogram grid. Shear is invertible.
Rotation by angle $\theta$. For a general angle:
$$R_\theta = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}$$
This rotates every vector counterclockwise by $\theta$. It preserves lengths (since $\cos^2\theta + \sin^2\theta = 1$) and angles between vectors.
All of these are examples of the same underlying thing: a linear transformation.
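To get a feel for the gallery, it can help to push one vector through each matrix. The sketch below uses plain Python lists; the test vector $(1, 2)$ and the angle $\pi/3$ are arbitrary choices, and `rotation` is an illustrative helper name.

```python
import math


def matvec(A, x):
    """Multiply matrix A (a list of rows) by vector x."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def rotation(theta):
    """Matrix that rotates the plane counterclockwise by theta radians."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s],
            [s,  c]]


S = [[2, 0], [0, 3]]   # scaling
P = [[1, 0], [0, 0]]   # projection onto the x-axis
F = [[0, 1], [1, 0]]   # reflection across the line y = x
H = [[1, 1], [0, 1]]   # shear

v = [1, 2]
print(matvec(S, v))              # [2, 6]
print(matvec(P, v))              # [1, 0]
print(matvec(F, v))              # [2, 1]
print(matvec(H, v))              # [3, 2]

# Reflecting twice returns the original vector.
print(matvec(F, matvec(F, v)))   # [1, 2]

# Rotation preserves length, up to floating-point rounding.
w = matvec(rotation(math.pi / 3), v)
print(math.isclose(math.hypot(*v), math.hypot(*w)))   # True
```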
What Makes a Transformation “Linear”?
A function $T: V \to W$ between vector spaces is called a linear transformation if it satisfies two conditions for all vectors and scalars:
- $T(\mathbf{u} + \mathbf{v}) = T(\mathbf{u}) + T(\mathbf{v})$
- $T(c\mathbf{v}) = c T(\mathbf{v})$
Together, these say: $T$ preserves linear combinations. If you scale and add first then transform, you get the same result as transforming first then scaling and adding:
$$T(c_1 \mathbf{v_1} + c_2 \mathbf{v_2}) = c_1 T(\mathbf{v_1}) + c_2 T(\mathbf{v_2})$$
This is the single most important property. It is what makes linear transformations compatible with the vector space structure.
What linearity implies geometrically:
- $T(\mathbf{0}) = \mathbf{0}$: the zero vector always maps to zero (set $c = 0$ in condition 2). Any transformation that does not fix the origin is not linear.
- Lines through the origin map to lines through the origin (or collapse to the origin).
- Parallel lines map to parallel lines (possibly the same line), or each collapses to a single point.
- Grids map to grids - possibly distorted, possibly degenerate, but the regular spacing is preserved.
Non-examples:
- Translation: $T(\mathbf{v}) = \mathbf{v} + \mathbf{b}$ with $\mathbf{b} \neq \mathbf{0}$. Check: $T(\mathbf{0}) = \mathbf{b} \neq \mathbf{0}$. Not linear.
- Squaring: $T(v) = v^2$ on $\mathbb{R}$. Check: $T(2v) = 4v^2 \neq 2v^2 = 2T(v)$. Not linear.
- Absolute value: $T(v) = |v|$. Check: $T(-1) = 1 \neq -1 = -T(1)$. Not linear.
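The two conditions can be spot-checked in code. The sketch below tests additivity and homogeneity on a handful of sample vectors; the function name `looks_linear` and the samples are made up for illustration. A failure proves a map is not linear, but passing on finitely many samples is only evidence, not a proof.

```python
def add(u, v):
    return [a + b for a, b in zip(u, v)]


def scale(c, v):
    return [c * a for a in v]


def looks_linear(T, samples, scalars):
    """Return False if T violates either linearity condition on the samples."""
    for u in samples:
        for v in samples:
            if T(add(u, v)) != add(T(u), T(v)):      # additivity
                return False
        for c in scalars:
            if T(scale(c, u)) != scale(c, T(u)):     # homogeneity
                return False
    return True


def rotate(v):       # 90 degree counterclockwise rotation
    return [-v[1], v[0]]


def translate(v):    # shift by b = (1, 2)
    return [v[0] + 1, v[1] + 2]


def square(v):       # entrywise squaring
    return [a * a for a in v]


def absolute(v):     # entrywise absolute value
    return [abs(a) for a in v]


samples = [[1, 0], [0, 1], [2, -3]]
scalars = [0, 2, -1]

print(looks_linear(rotate, samples, scalars))      # True
print(looks_linear(translate, samples, scalars))   # False
print(looks_linear(square, samples, scalars))      # False
print(looks_linear(absolute, samples, scalars))    # False
```

Integer inputs keep every comparison exact; with floats, the equality tests would need a tolerance.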
Why this matters for ML. A neural network layer computes $T(\mathbf{x}) = A\mathbf{x} + \mathbf{b}$. The $A\mathbf{x}$ part is a linear transformation. The $+\mathbf{b}$ bias makes it an affine transformation - linear plus a translation. The bias shifts the “decision boundary” away from the origin; the weight matrix $A$ is where the interesting geometric transformation lives.
The Key Theorem: Determined by a Basis
This is the most important structural fact about linear transformations:
Theorem. A linear transformation $T: V \to W$ is completely determined by the images of a basis of $V$.
Proof. Let $\{\mathbf{b_1}, \ldots, \mathbf{b_n}\}$ be a basis for $V$. Every $\mathbf{v} \in V$ has a unique representation $\mathbf{v} = c_1\mathbf{b_1} + \cdots + c_n\mathbf{b_n}$. By linearity:
$$T(\mathbf{v}) = c_1T(\mathbf{b_1}) + \cdots + c_nT(\mathbf{b_n})$$
So $T(\mathbf{v})$ is completely determined by the $n$ values $T(\mathbf{b_1}), \ldots, T(\mathbf{b_n})$. $\square$
What this means in practice. To define a linear transformation from an $n$-dimensional space, specify $n$ output vectors - one per basis element. You are free to choose them however you like; once chosen, the transformation is determined everywhere.
What this means for matrices. The standard basis of $\mathbb{R}^n$ is $\{\mathbf{e_1}, \ldots, \mathbf{e_n}\}$. For any linear $T: \mathbb{R}^n \to \mathbb{R}^m$:
$$T(\mathbf{x}) = x_1T(\mathbf{e_1}) + \cdots + x_nT(\mathbf{e_n})$$
Stack $T(\mathbf{e_1}), \ldots, T(\mathbf{e_n})$ as columns into a matrix $A$. Then $T(\mathbf{x}) = A\mathbf{x}$.
Every linear transformation $T: \mathbb{R}^n \to \mathbb{R}^m$ is multiplication by a unique $m \times n$ matrix, and the columns of that matrix are the images of the standard basis vectors.
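This recipe translates almost word for word into code. In the sketch below, `matrix_of` is an illustrative helper and the map `T` from $\mathbb{R}^3$ to $\mathbb{R}^2$ is a made-up example; the code assumes `T` really is linear and does not check it.

```python
def standard_basis(n):
    """Return e_1, ..., e_n as lists."""
    return [[1 if i == j else 0 for i in range(n)] for j in range(n)]


def matrix_of(T, n):
    """Matrix (list of rows) of a linear map T defined on R^n."""
    images = [T(e) for e in standard_basis(n)]   # T(e_1), ..., T(e_n)
    m = len(images[0])
    # images[j] is column j, so the entry in row i, column j is images[j][i].
    return [[images[j][i] for j in range(n)] for i in range(m)]


def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def T(x):
    """A hypothetical linear map from R^3 to R^2."""
    return [x[0] + 2 * x[1], 3 * x[2] - x[0]]


A = matrix_of(T, 3)
print(A)                       # [[1, 2, 0], [-1, 0, 3]]

# Multiplying by A reproduces T on any input.
x = [4, 5, 6]
print(T(x), matvec(A, x))      # [14, 14] [14, 14]
```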
Assembling the Matrix from the Transformation
The recipe: apply $T$ to each standard basis vector and write the result as a column.
Example. “Rotate $90°$ counterclockwise” in $\mathbb{R}^2$:
$$T\begin{pmatrix}1\\0\end{pmatrix} = \begin{pmatrix}0\\1\end{pmatrix}, \qquad T\begin{pmatrix}0\\1\end{pmatrix} = \begin{pmatrix}-1\\0\end{pmatrix}$$
Matrix: $\begin{pmatrix} 0 & -1 \\ 1 & 0 \end{pmatrix}$. This recovers $R$ from the opening.
Example. “Project onto the line $y = x$.” A vector $(x, y)$ projects to the nearest point on $y = x$, which is $\left(\frac{x+y}{2}, \frac{x+y}{2}\right)$.
$$T\begin{pmatrix}1\\0\end{pmatrix} = \begin{pmatrix}1/2\\1/2\end{pmatrix}, \qquad T\begin{pmatrix}0\\1\end{pmatrix} = \begin{pmatrix}1/2\\1/2\end{pmatrix}$$
Matrix: $\begin{pmatrix} 1/2 & 1/2 \\ 1/2 & 1/2 \end{pmatrix}$. Notice both columns are the same - the image is 1-dimensional (the line $y = x$). Applying the projection twice gives the same result as applying it once: $P^2 = P$. (Check: $\begin{pmatrix} 1/2 & 1/2 \\ 1/2 & 1/2 \end{pmatrix}^2 = \begin{pmatrix} 1/2 & 1/2 \\ 1/2 & 1/2 \end{pmatrix}$. ✓) A transformation satisfying $P^2 = P$ is called a projection. The identical columns are specific to this example; what every projection shares is the property $P^2 = P$ itself. This one is also symmetric, which is what makes it an orthogonal projection.
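The $P^2 = P$ check can be done exactly with Python's `fractions` module. In this sketch, the second matrix `Q` is an extra example not taken from the text above: it also satisfies $Q^2 = Q$, but its columns differ, so it shows that equal columns are not what defines a projection.

```python
from fractions import Fraction


def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]


def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


half = Fraction(1, 2)
P = [[half, half],
     [half, half]]

print(matmul(P, P) == P)          # True: projecting twice changes nothing

# (3, 1) lands on the nearest point of the line y = x, namely (2, 2).
print(matvec(P, [3, 1]) == [2, 2])   # True

# An added example: idempotent, but not symmetric (an oblique projection).
Q = [[1, 1],
     [0, 0]]
print(matmul(Q, Q) == Q)          # True
```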
Kernel and Image
Every linear transformation $T: V \to W$ comes with two fundamental subspaces.
The kernel of $T$ (null space) is the set of inputs that map to zero:
$$\ker(T) = \{\mathbf{v} \in V : T(\mathbf{v}) = \mathbf{0}\}$$
The image of $T$ (column space) is the set of all possible outputs:
$$\text{im}(T) = \{T(\mathbf{v}) : \mathbf{v} \in V\}$$
Both are subspaces. For $T(\mathbf{x}) = A\mathbf{x}$: kernel = null space of $A$, image = column space of $A$. The linear transformation viewpoint explains why these were the right things to study - they capture what the transformation loses (kernel) and what it produces (image).
The rank-nullity theorem in this language:
$$\dim(\ker T) + \dim(\text{im} T) = \dim(V)$$
Every dimension of the domain either collapses to zero (contributing to the kernel) or survives in the output (contributing to the image). No dimension is created or destroyed - the transformation just redistributes them.
Geometric reading:
- $\ker(T) = \{\mathbf{0}\}$: no information is lost. $T$ is injective (one-to-one).
- $\text{im}(T) = W$: every output is reachable. $T$ is surjective (onto).
- Both: $T$ is invertible.
Example. Projection $P = \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix}$: $\ker(P) = \{(0, y) : y \in \mathbb{R}\}$ (the $y$-axis, dimension 1); $\text{im}(P) = \{(x, 0) : x \in \mathbb{R}\}$ (the $x$-axis, dimension 1). Rank-nullity: $1 + 1 = 2$. ✓ Information about $y$ is lost; only $x$ survives.
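For larger matrices these dimensions are computed rather than read off. The sketch below finds the rank (the dimension of the image) by Gaussian elimination with exact fractions, then uses rank-nullity to get the dimension of the kernel. Note that `nullity` here applies the theorem instead of verifying it independently; the function names are illustrative.

```python
from fractions import Fraction


def rank(A):
    """Rank of A (a list of rows), i.e. the dimension of its column space."""
    M = [[Fraction(a) for a in row] for row in A]
    rows, cols = len(M), len(M[0])
    r = 0                                    # number of pivots found so far
    for c in range(cols):
        # Look for a row at or below r with a nonzero entry in column c.
        pivot = next((i for i in range(r, rows) if M[i][c] != 0), None)
        if pivot is None:
            continue                         # no pivot in this column
        M[r], M[pivot] = M[pivot], M[r]
        for i in range(r + 1, rows):         # clear the entries below the pivot
            factor = M[i][c] / M[r][c]
            M[i] = [a - factor * b for a, b in zip(M[i], M[r])]
        r += 1
        if r == rows:
            break
    return r


def nullity(A):
    """Dimension of the kernel, obtained from rank-nullity."""
    return len(A[0]) - rank(A)


def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


P = [[1, 0], [0, 0]]     # projection onto the x-axis
R = [[0, -1], [1, 0]]    # 90 degree rotation

print(rank(P), nullity(P))     # 1 1
print(rank(R), nullity(R))     # 2 0

# Direct spot check: a vector on the y-axis is sent to zero by P.
print(matvec(P, [0, 7]))       # [0, 0]
```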
Composition = Matrix Multiplication
Here is the deepest reason why matrix multiplication is defined the way it is.
Suppose you apply $T_1: \mathbb{R}^n \to \mathbb{R}^m$ first, then $T_2: \mathbb{R}^m \to \mathbb{R}^p$. The composition does $T_1$ then $T_2$:
$$(T_2 \circ T_1)(\mathbf{x}) = T_2(T_1(\mathbf{x})) = A_2(A_1\mathbf{x}) = (A_2 A_1)\mathbf{x}$$
The matrix of the composition is the product of the matrices. Matrix multiplication was defined precisely so this works. The row-times-column formula is not arbitrary - it is exactly what you get when you ask “what single matrix represents applying $A_1$ and then $A_2$?”
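Here is a short derivation of that claim, writing $(M)_{ij}$ for the entry of a matrix $M$ in row $i$ and column $j$ (notation introduced only for this step). Column $j$ of the composition's matrix is the image of $\mathbf{e_j}$, and $A_1\mathbf{e_j}$ is column $j$ of $A_1$, so by linearity:

$$(A_2 A_1)\mathbf{e_j} = A_2(A_1\mathbf{e_j}) = A_2\begin{pmatrix}(A_1)_{1j}\\ \vdots \\ (A_1)_{mj}\end{pmatrix} = \sum_{k=1}^{m} (A_1)_{kj}\,(A_2\mathbf{e_k})$$

Since $A_2\mathbf{e_k}$ is column $k$ of $A_2$, reading off the $i$-th entry gives the familiar row-times-column rule:

$$(A_2 A_1)_{ij} = \sum_{k=1}^{m} (A_2)_{ik}\,(A_1)_{kj}$$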
A concrete check. Two $90°$ rotations = one $180°$ rotation:
$$R^2 = \begin{pmatrix} 0 & -1 \\ 1 & 0 \end{pmatrix}\begin{pmatrix} 0 & -1 \\ 1 & 0 \end{pmatrix} = \begin{pmatrix} -1 & 0 \\ 0 & -1 \end{pmatrix}$$
This sends $(x, y) \to (-x, -y)$: a $180°$ rotation. ✓
Why $AB \neq BA$ in general. Applying $A$ then $B$ is a different sequence from applying $B$ then $A$. Function composition is not commutative, so matrix multiplication is not commutative. But composition is associative - the order of grouping does not matter, only the order of application - so $(AB)C = A(BC)$.
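These claims can be checked on the matrices from the gallery. The sketch below uses the rotation $R$, the shear $H$, and the scaling $S$; the test vector is arbitrary.

```python
def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]


def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


R = [[0, -1], [1, 0]]    # 90 degree rotation
H = [[1, 1], [0, 1]]     # shear
S = [[2, 0], [0, 3]]     # scaling

# Two 90 degree rotations make one 180 degree rotation.
print(matmul(R, R))                      # [[-1, 0], [0, -1]]

# "Shear, then rotate" is the single matrix RH.
x = [2, 5]
print(matvec(R, matvec(H, x)))           # [-5, 7]
print(matvec(matmul(R, H), x))           # [-5, 7]

# Order of application matters...
print(matmul(R, H))                      # [[0, -1], [1, 1]]
print(matmul(H, R))                      # [[1, -1], [1, 0]]

# ...but grouping does not.
print(matmul(matmul(R, H), S) == matmul(R, matmul(H, S)))   # True
```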
Invertibility
$T: V \to W$ is invertible if there exists a linear $T^{-1}: W \to V$ that undoes $T$ in both directions.
From the kernel-image picture: invertibility requires $\ker(T) = \{\mathbf{0}\}$ (no two inputs share an output, so no output is ambiguous to undo) AND $\text{im}(T) = W$ (every output has a preimage, so $T^{-1}$ is defined everywhere).
For square maps $T: \mathbb{R}^n \to \mathbb{R}^n$, these two conditions are equivalent: one implies the other by rank-nullity. The condition is simply that $T$ has full rank $n$.
In matrix language: $A$ is invertible iff $A^{-1}$ exists with $A^{-1}A = AA^{-1} = I$. Gaussian elimination gives the computational procedure for finding $A^{-1}$ or confirming that it does not exist.
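As a rough sketch of that procedure, here is Gauss-Jordan elimination on the augmented matrix $[A \mid I]$ with exact fractions. The function name `inverse` and the choice to return `None` for a singular matrix are assumptions of this sketch; it is meant for small examples, not as a numerically careful implementation.

```python
from fractions import Fraction


def inverse(A):
    """Return the inverse of square matrix A, or None if A is not invertible."""
    n = len(A)
    # Build the augmented matrix [A | I].
    M = [[Fraction(a) for a in row] + [Fraction(int(i == j)) for j in range(n)]
         for i, row in enumerate(A)]
    for c in range(n):
        # Find a usable pivot in column c, at or below the diagonal.
        pivot = next((i for i in range(c, n) if M[i][c] != 0), None)
        if pivot is None:
            return None                  # missing pivot: rank is below n
        M[c], M[pivot] = M[pivot], M[c]
        p = M[c][c]
        M[c] = [a / p for a in M[c]]     # scale so the pivot equals 1
        for i in range(n):               # clear the rest of column c
            if i != c:
                factor = M[i][c]
                M[i] = [a - factor * b for a, b in zip(M[i], M[c])]
    return [row[n:] for row in M]        # the right half is now the inverse


H = [[1, 1], [0, 1]]     # shear: invertible
R = [[0, -1], [1, 0]]    # rotation: invertible
P = [[1, 0], [0, 0]]     # projection: not invertible

print(inverse(H) == [[1, -1], [0, 1]])   # True: shear back the other way
print(inverse(R) == [[0, 1], [-1, 0]])   # True: rotate 90 degrees clockwise
print(inverse(P))                        # None
```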
Beyond $\mathbb{R}^n$: Differentiation
The full power of the linear transformation viewpoint comes when you apply it to spaces other than $\mathbb{R}^n$.
Consider the differentiation operator $D: \mathcal{P}_3 \to \mathcal{P}_2$ which maps a polynomial to its derivative. Is $D$ linear?
- $D(f + g) = f' + g' = D(f) + D(g)$. Yes.
- $D(cf) = cf' = cD(f)$. Yes.
Differentiation is a linear transformation. Its domain is the vector space $\mathcal{P}_3$ of polynomials of degree at most 3; its codomain is $\mathcal{P}_2$.
To find its matrix, apply $D$ to the basis $\{1, x, x^2, x^3\}$ of $\mathcal{P}_3$ and express each output in the basis $\{1, x, x^2\}$ of $\mathcal{P}_2$:
$$D(1) = 0, \quad D(x) = 1, \quad D(x^2) = 2x, \quad D(x^3) = 3x^2$$
Stacking coordinate vectors as columns:
$$[D] = \begin{pmatrix} 0 & 1 & 0 & 0 \\ 0 & 0 & 2 & 0 \\ 0 & 0 & 0 & 3 \end{pmatrix}$$
To differentiate $f = 2 + 3x + 5x^2 + x^3$, write its coordinate vector $(2, 3, 5, 1)^T$ and multiply:
$$[D]\begin{pmatrix}2\\3\\5\\1\end{pmatrix} = \begin{pmatrix} 0(2)+1(3)+0(5)+0(1) \\ 0(2)+0(3)+2(5)+0(1) \\ 0(2)+0(3)+0(5)+3(1) \end{pmatrix} = \begin{pmatrix} 3 \\ 10 \\ 3 \end{pmatrix}$$
This is the coordinate vector of $3 + 10x + 3x^2$, which is indeed $f'(x)$. ✓
The kernel of $D$ is the set of polynomials with zero derivative: the constants $\{a : a \in \mathbb{R}\}$, a 1-dimensional subspace. Rank-nullity: $\dim(\mathcal{P}_3) = 4$, so $\dim(\text{im}\, D) = 4 - 1 = 3 = \dim(\mathcal{P}_2)$, which means $D$ is surjective. ✓
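The same construction works for any degree. The sketch below builds the matrix of $D: \mathcal{P}_n \to \mathcal{P}_{n-1}$ in the monomial bases and reruns the example; `diff_matrix` is an illustrative name, and polynomials are stored as coefficient lists starting from the constant term.

```python
def diff_matrix(n):
    """Matrix of D: P_n -> P_(n-1) in the bases {1, x, ..., x^n} and {1, ..., x^(n-1)}."""
    # Column j holds the coordinates of D(x^j) = j * x^(j-1):
    # the single entry j in row j - 1, and an all-zero column for j = 0.
    return [[j if j == i + 1 else 0 for j in range(n + 1)] for i in range(n)]


def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


D = diff_matrix(3)
print(D)                      # [[0, 1, 0, 0], [0, 0, 2, 0], [0, 0, 0, 3]]

f = [2, 3, 5, 1]              # 2 + 3x + 5x^2 + x^3
print(matvec(D, f))           # [3, 10, 3], i.e. 3 + 10x + 3x^2

# Constants are in the kernel.
print(matvec(D, [7, 0, 0, 0]))   # [0, 0, 0]
```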
This is the bridge between linear algebra and calculus. Differential equations are linear systems in function space; the operators are linear transformations; and the solution sets are subspaces.
Linear Transformations in Machine Learning
Layers are built on linear transformations. A fully connected layer computes $\mathbf{y} = A\mathbf{x} + \mathbf{b}$, an affine map whose linear part is the weight matrix $A$, a linear transformation from input space $\mathbb{R}^n$ to output space $\mathbb{R}^m$. Learning the layer means finding an $A$ (and $\mathbf{b}$) such that the transformation maps the input distribution to something useful for the task.
Why you cannot stack only linear layers. The composition of linear transformations is linear. Two layers $A_2(A_1\mathbf{x}) = (A_2 A_1)\mathbf{x}$ is just one matrix. A ten-layer linear network can always be replaced by a single-layer network with matrix $A_{10} \cdots A_1$. Depth without nonlinearity adds nothing. The ReLU, sigmoid, or other activation function at each layer is what breaks this collapse - it introduces a nonlinearity that cannot be “absorbed” into a single matrix.
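The collapse is easy to see on a toy example. In the sketch below the weights `A1` and `A2` are made-up numbers and biases are left out to keep the focus on the linear part. With a ReLU in between, the network fails the homogeneity test $f(-\mathbf{x}) = -f(\mathbf{x})$, so no single matrix can reproduce it.

```python
def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]


def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def relu(v):
    return [max(0, a) for a in v]


A1 = [[1, -2], [0, 1], [3, 1]]     # layer 1: R^2 -> R^3 (made-up weights)
A2 = [[1, 0, -1], [2, 1, 0]]       # layer 2: R^3 -> R^2 (made-up weights)
x = [1, 2]

# Two linear layers equal one layer with matrix A2 A1.
print(matvec(A2, matvec(A1, x)))           # [-8, -4]
print(matvec(matmul(A2, A1), x))           # [-8, -4]


def net(x):
    """Two layers with a ReLU in between."""
    return matvec(A2, relu(matvec(A1, x)))


# With the ReLU, negating the input no longer negates the output,
# so the network is not a linear map.
print(net(x))                  # [-5, 2]
print(net([-1, -2]))           # [3, 6]
```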
Dimensionality reduction. PCA finds the linear transformation $W \in \mathbb{R}^{k \times n}$ (with $k \ll n$) that projects high-dimensional data onto the $k$ directions of highest variance. This is a deliberate information loss: the kernel of $W$ is the $(n-k)$-dimensional subspace spanned by the remaining low-variance directions. The image is all of $\mathbb{R}^k$: the coordinates of each data point along the $k$ retained directions, which span the $k$-dimensional subspace that best captures the data.
Attention in transformers. The input sequence $X \in \mathbb{R}^{n \times d}$ is projected by three learned linear transformations: $Q = XW_Q$, $K = XW_K$, $V = XW_V$. Each $W$ is a matrix that maps the $d$-dimensional token representations to query, key, and value spaces. The attention scores are computed from $Q$ and $K$; the output is a linear combination of the values. Every operation is built from linear transformations, with the softmax providing the nonlinearity.
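A minimal single-head sketch of this computation in plain Python follows. It assumes the standard scaled dot-product form, $\text{softmax}(QK^T / \sqrt{d_k})\,V$; the scaling factor and the single head are assumptions of this sketch, not details given above, and the input and weight matrices are made-up numbers.

```python
import math


def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]


def transpose(A):
    return [list(col) for col in zip(*A)]


def softmax(row):
    """Turn a list of scores into positive weights that sum to 1."""
    m = max(row)                              # subtract the max for stability
    exps = [math.exp(a - m) for a in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(X, W_Q, W_K, W_V):
    """Return (attention weights, output) for one head."""
    Q, K, V = matmul(X, W_Q), matmul(X, W_K), matmul(X, W_V)   # linear maps
    d_k = len(K[0])
    scores = [[s / math.sqrt(d_k) for s in row]
              for row in matmul(Q, transpose(K))]
    weights = [softmax(row) for row in scores]                 # the nonlinearity
    return weights, matmul(weights, V)                         # mix the values


# Three tokens with d = 2 (made-up numbers).
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_Q = [[1.0, 0.0], [0.0, 1.0]]
W_K = [[0.5, 0.0], [0.0, 2.0]]
W_V = [[1.0, 1.0], [0.0, 1.0]]

weights, out = attention(X, W_Q, W_K, W_V)
print([round(sum(row), 6) for row in weights])   # [1.0, 1.0, 1.0]
print(len(out), len(out[0]))                     # 3 2
```

Each output row is a weighted average of the rows of $V$, with weights that are positive and sum to 1.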
What the weight matrix “learns”. A linear transformation can rotate and reflect (reorganize directions), scale (emphasize some features over others), project (discard some information), or expand to higher dimensions (create new feature combinations). A trained neural network is a composition of many such operations, interleaved with nonlinearities, that has learned to map input data into a space where the task becomes easy.
The Rigorous Underpinning
Formal definition. $T: V \to W$ is a linear transformation if for all $\mathbf{u}, \mathbf{v} \in V$ and $\alpha, \beta \in \mathbb{R}$:
$$T(\alpha\mathbf{u} + \beta\mathbf{v}) = \alpha T(\mathbf{u}) + \beta T(\mathbf{v})$$
The matrix representation theorem. Let $B_V = \{\mathbf{b_1}, \ldots, \mathbf{b_n}\}$ be a basis for $V$ and $B_W = \{\mathbf{c_1}, \ldots, \mathbf{c_m}\}$ a basis for $W$. Any linear $T: V \to W$ has a unique matrix $[T]$ such that: if $\mathbf{v}$ has coordinate vector $\mathbf{c}$ in basis $B_V$, then $T(\mathbf{v})$ has coordinate vector $[T]\mathbf{c}$ in basis $B_W$. The columns of $[T]$ are the $B_W$-coordinate vectors of $T(\mathbf{b_1}), \ldots, T(\mathbf{b_n})$.
Different choices of bases give different matrices for the same transformation. Changing basis transforms the matrix by a similarity transformation (when $V = W$ and the same basis is used for inputs and outputs) or, more generally, by an equivalence transformation - the transformation itself is unchanged, only its coordinate description changes.
Isomorphism. A bijective linear transformation is an isomorphism. Two vector spaces are isomorphic if there is an isomorphism between them - they have identical linear-algebraic structure. Every $n$-dimensional real vector space is isomorphic to $\mathbb{R}^n$: the isomorphism is the coordinate map sending each vector to its tuple of basis coefficients.
The space of linear maps. $\mathcal{L}(V, W)$, the set of all linear transformations from $V$ to $W$, is itself a vector space of dimension $mn$ (where $m = \dim W$, $n = \dim V$). The matrix representation identifies it with the space of $m \times n$ matrices.
Summary
| Concept | What it means |
|---|---|
| Linear transformation | Preserves linear combinations; maps vector spaces to vector spaces |
| Columns of matrix | Images of the standard basis vectors |
| Kernel | Inputs mapped to zero; captures what the transformation “loses” |
| Image | All reachable outputs; captures what the transformation “produces” |
| Rank-nullity | $\dim(\ker) + \dim(\text{im}) = \dim(\text{domain})$ |
| Matrix multiplication | Composition of linear transformations |
| Invertibility | Trivial kernel and full image; full rank |
| Matrix representation | Coordinate description relative to a choice of basis |
The shift from “multiply a matrix” to “apply a linear transformation” is the shift from computation to understanding. Once you see matrices as transformations, the question stops being “how do I multiply these?” and starts being “what does this transformation do to space?” - and that question applies in polynomials, in differential equations, in neural networks, and everywhere else the same structure appears.