Linear Transformations - Geometry Encoded as Arithmetic // Megha Bose

Helpful context:

Vector Spaces & Subspaces - The Geometry of Abstract Addition and Scaling

We have been writing $A\mathbf{x}$ and treating it as a mechanical computation: multiply rows by the vector, add up the entries, get a new vector. But there is a different and more powerful way to read this expression. A matrix is not just a table of numbers sitting in front of a vector - it is a machine for transforming space itself.

Every time you compute $A\mathbf{x}$, you are applying a function from one vector space to another. The output vector may live in a different space from the input. The matrix encodes everything about what the function does. This reframing - from “computation” to “transformation” - is the shift that makes linear algebra applicable everywhere, including machine learning, graphics, physics, and differential equations.

This post is about understanding what that transformation actually does.

Watching a Transformation in Action

Start in $\mathbb{R}^2$ where we can see things concretely. Consider the matrix:

$$R = \begin{pmatrix} 0 & -1 \\ 1 & 0 \end{pmatrix}$$

What does it do to vectors? Try the standard basis vectors $\mathbf{e_1} = (1, 0)$ and $\mathbf{e_2} = (0, 1)$:

$$R\mathbf{e_1} = \begin{pmatrix} 0 & -1 \\ 1 & 0 \end{pmatrix}\begin{pmatrix} 1 \\ 0 \end{pmatrix} = \begin{pmatrix} 0 \\ 1 \end{pmatrix}, \qquad R\mathbf{e_2} = \begin{pmatrix} 0 & -1 \\ 1 & 0 \end{pmatrix}\begin{pmatrix} 0 \\ 1 \end{pmatrix} = \begin{pmatrix} -1 \\ 0 \end{pmatrix}$$

The east direction $(1, 0)$ became the north direction $(0, 1)$. The north direction $(0, 1)$ became the west direction $(-1, 0)$. Every vector in the plane rotated by $90°$ counterclockwise. $R$ is a rotation matrix.

Notice what just happened: to understand what $R$ does to the entire plane, we only needed to check two vectors. Because once you know where the basis vectors go, you know where every vector goes.

A general vector $\mathbf{v} = (v_1, v_2) = v_1\mathbf{e_1} + v_2\mathbf{e_2}$, so:

$$R\mathbf{v} = R(v_1\mathbf{e_1} + v_2\mathbf{e_2}) = v_1(R\mathbf{e_1}) + v_2(R\mathbf{e_2}) = v_1\begin{pmatrix}0\\1\end{pmatrix} + v_2\begin{pmatrix}-1\\0\end{pmatrix} = \begin{pmatrix}-v_2\\v_1\end{pmatrix}$$

The two columns of $R$ are exactly $R\mathbf{e_1}$ and $R\mathbf{e_2}$. The columns of a matrix are the images of the standard basis vectors. This is not a coincidence - multiplying a matrix by $\mathbf{e_j}$ picks out its $j$-th column.
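The column claim is easy to check numerically. The following is a minimal sketch using NumPy; the test vector `v` is an arbitrary value chosen for illustration and does not come from the text.

```python
import numpy as np

# 90-degree counterclockwise rotation from the text.
R = np.array([[0, -1],
              [1,  0]])

e1 = np.array([1, 0])
e2 = np.array([0, 1])

# The images of the standard basis vectors are the columns of R.
image_e1 = R @ e1
image_e2 = R @ e2
print(np.array_equal(image_e1, R[:, 0]))  # True
print(np.array_equal(image_e2, R[:, 1]))  # True

# An arbitrary vector (assumed values, for illustration only).
v = np.array([3, 2])
direct = R @ v                                     # multiply directly
from_columns = v[0] * image_e1 + v[1] * image_e2   # rebuild from the basis images
print(np.array_equal(direct, from_columns))        # True
```

Both routes give $(-v_2, v_1) = (-2, 3)$, which matches the formula derived above.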

A Gallery of Transformations

Different matrices do qualitatively different things to space. Here are the most important families.

Scaling. The matrix $S = \begin{pmatrix} 2 & 0 \\ 0 & 3 \end{pmatrix}$ sends $\mathbf{e_1} \to (2, 0)$ and $\mathbf{e_2} \to (0, 3)$. It stretches the horizontal axis by 2 and the vertical axis by 3. A circle becomes an ellipse. Distances change but the coordinate axes remain perpendicular.

Projection. The matrix $P = \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix}$ sends $(x, y) \to (x, 0)$: every vector gets projected onto the $x$-axis. The $y$-component is wiped out. Information is lost. This transformation is not invertible - you cannot recover $(x, y)$ from $(x, 0)$ because the original $y$ is gone.

Reflection. The matrix $F = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$ sends $\mathbf{e_1} \to \mathbf{e_2}$ and $\mathbf{e_2} \to \mathbf{e_1}$: it swaps the two coordinates. Geometrically, every vector is reflected across the line $y = x$. No information is lost - reflect twice and you return to the start.

Shear. The matrix $H = \begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix}$ sends $\mathbf{e_1} \to (1, 0)$ (unchanged) and $\mathbf{e_2} \to (1, 1)$. The $x$-axis stays put; the $y$-axis tips to the right. A square grid of points becomes a parallelogram grid. Shear is invertible.

Rotation by angle $\theta$. For a general angle:

$$R_\theta = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}$$

This rotates every vector counterclockwise by $\theta$. It preserves lengths (since $\cos^2\theta + \sin^2\theta = 1$) and angles between vectors.
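To see the gallery in action, one option is to push the corners of the unit square through each matrix. This is a rough sketch, not a plotting routine; the angle `theta` and the test vector `v` are assumptions chosen for illustration.

```python
import numpy as np

theta = np.pi / 6  # assumed angle (30 degrees), chosen only for illustration

transforms = {
    "scaling":    np.array([[2.0, 0.0], [0.0, 3.0]]),
    "projection": np.array([[1.0, 0.0], [0.0, 0.0]]),
    "reflection": np.array([[0.0, 1.0], [1.0, 0.0]]),
    "shear":      np.array([[1.0, 1.0], [0.0, 1.0]]),
    "rotation":   np.array([[np.cos(theta), -np.sin(theta)],
                            [np.sin(theta),  np.cos(theta)]]),
}

# Corners of the unit square, one corner per column.
square = np.array([[0.0, 1.0, 1.0, 0.0],
                   [0.0, 0.0, 1.0, 1.0]])

# Image of the square under each transformation.
images = {name: M @ square for name, M in transforms.items()}
for name, corners in images.items():
    print(name)
    print(corners.round(3))

# Rotation preserves length; projection generally does not.
v = np.array([3.0, 4.0])  # length 5
rot_len = np.linalg.norm(transforms["rotation"] @ v)     # stays 5
proj_len = np.linalg.norm(transforms["projection"] @ v)  # shrinks to 3
```

Under the shear, the square's corners land on $(0,0), (1,0), (2,1), (1,1)$: the parallelogram described above.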

All of these are examples of the same underlying thing: a linear transformation.

What Makes a Transformation “Linear”?

A function $T: V \to W$ between vector spaces is called a linear transformation if it satisfies two conditions for all vectors and scalars:

1. Additivity: $T(\mathbf{u} + \mathbf{v}) = T(\mathbf{u}) + T(\mathbf{v})$
2. Homogeneity: $T(c\mathbf{v}) = c T(\mathbf{v})$

Together, these say: $T$ preserves linear combinations. If you scale and add first then transform, you get the same result as transforming first then scaling and adding:

$$T(c_1 \mathbf{v_1} + c_2 \mathbf{v_2}) = c_1 T(\mathbf{v_1}) + c_2 T(\mathbf{v_2})$$

This is the single most important property. It is what makes linear transformations compatible with the vector space structure.

What linearity implies geometrically:

$T(\mathbf{0}) = \mathbf{0}$: the zero vector always maps to zero (set $c = 0$ in condition 2). Any transformation that does not fix the origin is not linear.
Lines through the origin map to lines through the origin (or collapse to the origin).
Parallel lines map to parallel lines (or, in degenerate cases, land on the same line or each collapse to a single point).
Grids map to grids - possibly distorted, possibly degenerate, but the regular spacing is preserved.

Non-examples:

Translation: $T(\mathbf{v}) = \mathbf{v} + \mathbf{b}$ with $\mathbf{b} \neq \mathbf{0}$. Check: $T(\mathbf{0}) = \mathbf{b} \neq \mathbf{0}$. Not linear.
Squaring: $T(v) = v^2$ on $\mathbb{R}$. Check: $T(2v) = 4v^2 \neq 2v^2 = 2T(v)$. Not linear.
Absolute value: $T(v) = |v|$. Check: $T(-1) = 1 \neq -1 = -T(1)$. Not linear.
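The two conditions can be spot-checked on random inputs. The helper `looks_linear` below is a hypothetical name introduced for this sketch; passing the check is only evidence of linearity, while a single failure is a genuine counterexample. The translation vector `shift` is an assumed value.

```python
import numpy as np

def looks_linear(T, dim, trials=100, seed=0):
    """Spot-check T(a*u + b*v) == a*T(u) + b*T(v) on random inputs.

    Passing is evidence, not proof; one failure disproves linearity.
    """
    rng = np.random.default_rng(seed)
    for _ in range(trials):
        u, v = rng.normal(size=dim), rng.normal(size=dim)
        a, b = rng.normal(size=2)
        if not np.allclose(T(a * u + b * v), a * T(u) + b * T(v)):
            return False
    return True

A = np.array([[1.0, 1.0], [0.0, 1.0]])  # the shear H from the gallery
shift = np.array([1.0, -2.0])           # assumed translation vector b

print(looks_linear(lambda x: A @ x, 2))      # True:  matrix multiplication
print(looks_linear(lambda x: x + shift, 2))  # False: translation
print(looks_linear(lambda x: x ** 2, 1))     # False: squaring
print(looks_linear(np.abs, 1))               # False: absolute value
```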

Why this matters for ML. A neural network layer computes $T(\mathbf{x}) = A\mathbf{x} + \mathbf{b}$. The $A\mathbf{x}$ part is a linear transformation. The $+\mathbf{b}$ bias makes it an affine transformation - linear plus a translation. The bias shifts the “decision boundary” away from the origin; the weight matrix $A$ is where the interesting geometric transformation lives.

The Key Theorem: Determined by a Basis

This is the most important structural fact about linear transformations:

Theorem. A linear transformation $T: V \to W$ is completely determined by the images of a basis of $V$.

Proof. Let $\{\mathbf{b_1}, \ldots, \mathbf{b_n}\}$ be a basis for $V$. Every $\mathbf{v} \in V$ has a unique representation $\mathbf{v} = c_1\mathbf{b_1} + \cdots + c_n\mathbf{b_n}$. By linearity:

$$T(\mathbf{v}) = c_1T(\mathbf{b_1}) + \cdots + c_nT(\mathbf{b_n})$$

So $T(\mathbf{v})$ is completely determined by the $n$ values $T(\mathbf{b_1}), \ldots, T(\mathbf{b_n})$. $\square$

What this means in practice. To define a linear transformation from an $n$-dimensional space, specify $n$ output vectors - one per basis element. You are free to choose them however you like; once chosen, the transformation is determined everywhere.

What this means for matrices. The standard basis of $\mathbb{R}^n$ is $\{\mathbf{e_1}, \ldots, \mathbf{e_n}\}$. For any linear $T: \mathbb{R}^n \to \mathbb{R}^m$:

$$T(\mathbf{x}) = x_1T(\mathbf{e_1}) + \cdots + x_nT(\mathbf{e_n})$$

Stack $T(\mathbf{e_1}), \ldots, T(\mathbf{e_n})$ as columns into a matrix $A$. Then $T(\mathbf{x}) = A\mathbf{x}$.

Every linear transformation $T: \mathbb{R}^n \to \mathbb{R}^m$ is multiplication by a unique $m \times n$ matrix, and the columns of that matrix are the images of the standard basis vectors.

Assembling the Matrix from the Transformation

The recipe: apply $T$ to each standard basis vector and write the result as a column.

Example. “Rotate $90°$ counterclockwise” in $\mathbb{R}^2$:

$$T\begin{pmatrix}1\\0\end{pmatrix} = \begin{pmatrix}0\\1\end{pmatrix}, \qquad T\begin{pmatrix}0\\1\end{pmatrix} = \begin{pmatrix}-1\\0\end{pmatrix}$$

Matrix: $\begin{pmatrix} 0 & -1 \\ 1 & 0 \end{pmatrix}$. This recovers $R$ from the opening.

Example. “Project onto the line $y = x$.” A vector $(x, y)$ projects to the nearest point on $y = x$, which is $\left(\frac{x+y}{2}, \frac{x+y}{2}\right)$.

$$T\begin{pmatrix}1\\0\end{pmatrix} = \begin{pmatrix}1/2\\1/2\end{pmatrix}, \qquad T\begin{pmatrix}0\\1\end{pmatrix} = \begin{pmatrix}1/2\\1/2\end{pmatrix}$$

Matrix: $\begin{pmatrix} 1/2 & 1/2 \\ 1/2 & 1/2 \end{pmatrix}$. Notice both columns are the same - the image is 1-dimensional (the line $y = x$). Applying the projection twice gives the same result as applying it once: $P^2 = P$. (Check: $\begin{pmatrix} 1/2 & 1/2 \\ 1/2 & 1/2 \end{pmatrix}^2 = \begin{pmatrix} 1/2 & 1/2 \\ 1/2 & 1/2 \end{pmatrix}$. ✓) A linear transformation satisfying $P^2 = P$ is called a projection: once a vector has landed in the image, applying $P$ again leaves it where it is.
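The recipe translates directly into code. In this sketch, `matrix_of`, `rotate_90`, and `project_onto_diagonal` are hypothetical helper names, and the function passed in is assumed to be linear (the code does not check this).

```python
import numpy as np

def matrix_of(T, n):
    """Build the matrix of a linear map T: R^n -> R^m, one column per basis vector."""
    basis = np.eye(n)
    return np.column_stack([T(basis[:, j]) for j in range(n)])

def rotate_90(v):
    """Rotate a 2D vector 90 degrees counterclockwise."""
    x, y = v
    return np.array([-y, x])

def project_onto_diagonal(v):
    """Project a 2D vector onto the line y = x."""
    x, y = v
    return np.array([(x + y) / 2, (x + y) / 2])

R = matrix_of(rotate_90, 2)
P = matrix_of(project_onto_diagonal, 2)

print(np.array_equal(R, np.array([[0, -1], [1, 0]])))  # True: recovers R
print(np.allclose(P, 0.5))                             # True: every entry is 1/2
print(np.allclose(P @ P, P))                           # True: projecting twice changes nothing
```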

Kernel and Image

Every linear transformation $T: V \to W$ comes with two fundamental subspaces.

The kernel of $T$ (null space) is the set of inputs that map to zero:

$$\ker(T) = \{\mathbf{v} \in V : T(\mathbf{v}) = \mathbf{0}\}$$

The image of $T$ (column space) is the set of all possible outputs:

$$\text{im}(T) = \{T(\mathbf{v}) : \mathbf{v} \in V\}$$

Both are subspaces. For $T(\mathbf{x}) = A\mathbf{x}$: kernel = null space of $A$, image = column space of $A$. The linear transformation viewpoint explains why these were the right things to study - they capture what the transformation loses (kernel) and what it produces (image).

The rank-nullity theorem in this language:

$$\dim(\ker T) + \dim(\text{im} T) = \dim(V)$$

Every dimension of the domain either collapses to zero (contributing to the kernel) or survives in the output (contributing to the image). No dimension is created or destroyed - the transformation just redistributes them.

Geometric reading:

$\ker(T) = \{\mathbf{0}\}$: no information is lost. $T$ is injective (one-to-one).
$\text{im}(T) = W$: every output is reachable. $T$ is surjective (onto).
Both: $T$ is invertible.

Example. Projection $P = \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix}$: $\ker(P) = \{(0, y) : y \in \mathbb{R}\}$ (the $y$-axis, dimension 1); $\text{im}(P) = \{(x, 0) : x \in \mathbb{R}\}$ (the $x$-axis, dimension 1). Rank-nullity: $1 + 1 = 2$. ✓ Information about $y$ is lost; only $x$ survives.
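One way to compute bases for the kernel and image numerically is the singular value decomposition. That technique is not covered in this post, so treat the following as a sketch under that assumption. The helper name `kernel_and_image` and the tolerance `tol` are illustrative choices.

```python
import numpy as np

def kernel_and_image(A, tol=1e-10):
    """Return orthonormal bases (as columns) for ker(A) and im(A) using the SVD."""
    U, s, Vt = np.linalg.svd(A)
    rank = int(np.sum(s > tol))
    kernel_basis = Vt[rank:].T   # right singular vectors beyond the rank span ker(A)
    image_basis = U[:, :rank]    # first `rank` left singular vectors span im(A)
    return kernel_basis, image_basis

P = np.array([[1.0, 0.0],
              [0.0, 0.0]])       # projection onto the x-axis

ker_P, im_P = kernel_and_image(P)

print(np.allclose(P @ ker_P, 0))                    # True: kernel vectors map to zero
print(ker_P.shape[1] + im_P.shape[1] == P.shape[1]) # True: rank-nullity, 1 + 1 = 2
```

Up to sign, `ker_P` is the unit vector along the $y$-axis and `im_P` is the unit vector along the $x$-axis, in agreement with the example.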

Composition = Matrix Multiplication

Here is the deepest reason why matrix multiplication is defined the way it is.

Suppose you apply $T_1: \mathbb{R}^n \to \mathbb{R}^m$ first, then $T_2: \mathbb{R}^m \to \mathbb{R}^p$. The composition does $T_1$ then $T_2$:

$$(T_2 \circ T_1)(\mathbf{x}) = T_2(T_1(\mathbf{x})) = A_2(A_1\mathbf{x}) = (A_2 A_1)\mathbf{x}$$

The matrix of the composition is the product of the matrices. Matrix multiplication was defined precisely so this works. The row-times-column formula is not arbitrary - it is exactly what you get when you ask “what single matrix represents applying $A_1$ and then $A_2$?”

A concrete check. Two $90°$ rotations = one $180°$ rotation:

$$R^2 = \begin{pmatrix} 0 & -1 \\ 1 & 0 \end{pmatrix}\begin{pmatrix} 0 & -1 \\ 1 & 0 \end{pmatrix} = \begin{pmatrix} -1 & 0 \\ 0 & -1 \end{pmatrix}$$

This sends $(x, y) \to (-x, -y)$: a $180°$ rotation. ✓

Why $AB \neq BA$ in general. Applying $A$ then $B$ is a different sequence from applying $B$ then $A$. Function composition is not commutative, so matrix multiplication is not commutative. But composition is associative - the order of grouping does not matter, only the order of application - so $(AB)C = A(BC)$.
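These claims can be confirmed with the matrices from the gallery. This is a small sketch; the test vector `x` is an arbitrary assumed value.

```python
import numpy as np

R = np.array([[0, -1], [1, 0]])   # rotate 90 degrees counterclockwise
H = np.array([[1, 1], [0, 1]])    # shear
F = np.array([[0, 1], [1, 0]])    # reflection across y = x

x = np.array([2, 5])  # assumed test vector

# Applying R twice is the same as multiplying once by R @ R.
step_by_step = R @ (R @ x)
one_matrix = (R @ R) @ x
print(np.array_equal(step_by_step, one_matrix))  # True

# Order matters: R @ H means "shear first, then rotate"; H @ R is the reverse.
shear_then_rotate = R @ H
rotate_then_shear = H @ R
print(np.array_equal(shear_then_rotate, rotate_then_shear))  # False

# Grouping does not matter.
print(np.array_equal((R @ H) @ F, R @ (H @ F)))  # True
```

Note the reading order: in a product like `R @ H`, the rightmost matrix acts on the vector first.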

Invertibility

$T: V \to W$ is invertible if there exists a linear $T^{-1}: W \to V$ that undoes $T$ in both directions.

From the kernel-image picture: invertibility requires $\ker(T) = \{\mathbf{0}\}$ (no two inputs share an output, so no output is ambiguous to undo) AND $\text{im}(T) = W$ (every output has a preimage, so $T^{-1}$ is defined everywhere).

For square maps $T: \mathbb{R}^n \to \mathbb{R}^n$, these two conditions are equivalent: one implies the other by rank-nullity. The condition is simply that $T$ has full rank $n$.

In matrix language: $A$ is invertible iff $A^{-1}$ exists with $A^{-1}A = AA^{-1} = I$. Gaussian elimination gives the computational procedure for finding $A^{-1}$ or confirming that it does not exist.
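As a rough illustration of the full-rank condition, the sketch below checks the rank before inverting. The helper `try_invert` is a hypothetical name, and it relies on NumPy's `matrix_rank` and `inv` rather than a hand-written Gaussian elimination.

```python
import numpy as np

def try_invert(A):
    """Return the inverse of a square matrix A if it has full rank, otherwise None."""
    n = A.shape[0]
    if np.linalg.matrix_rank(A) < n:
        return None
    return np.linalg.inv(A)

H = np.array([[1.0, 1.0], [0.0, 1.0]])  # shear: invertible
P = np.array([[1.0, 0.0], [0.0, 0.0]])  # projection onto the x-axis: rank 1

H_inv = try_invert(H)
print(np.allclose(H_inv @ H, np.eye(2)))  # True
print(np.allclose(H @ H_inv, np.eye(2)))  # True
print(try_invert(P))                      # None
```

The inverse of the shear is the shear in the opposite direction, $\begin{pmatrix} 1 & -1 \\ 0 & 1 \end{pmatrix}$.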

Beyond $\mathbb{R}^n$: Differentiation

The full power of the linear transformation viewpoint comes when you apply it to spaces other than $\mathbb{R}^n$.

Consider the differentiation operator $D: \mathcal{P}_3 \to \mathcal{P}_2$ which maps a polynomial to its derivative. Is $D$ linear?

$D(f + g) = f' + g' = D(f) + D(g)$. Yes.
$D(cf) = cf' = cD(f)$. Yes.

Differentiation is a linear transformation. Its domain is the vector space $\mathcal{P}_3$ of polynomials of degree at most 3; its codomain is $\mathcal{P}_2$.

To find its matrix, apply $D$ to the basis $\{1, x, x^2, x^3\}$ of $\mathcal{P}_3$ and express each output in the basis $\{1, x, x^2\}$ of $\mathcal{P}_2$:

$$D(1) = 0, \quad D(x) = 1, \quad D(x^2) = 2x, \quad D(x^3) = 3x^2$$

Stacking coordinate vectors as columns:

$$[D] = \begin{pmatrix} 0 & 1 & 0 & 0 \\ 0 & 0 & 2 & 0 \\ 0 & 0 & 0 & 3 \end{pmatrix}$$

To differentiate $f = 2 + 3x + 5x^2 + x^3$, write its coordinate vector $(2, 3, 5, 1)^T$ and multiply:

$$[D]\begin{pmatrix}2\\3\\5\\1\end{pmatrix} = \begin{pmatrix} 0(2)+1(3)+0(5)+0(1) \\ 0(2)+0(3)+2(5)+0(1) \\ 0(2)+0(3)+0(5)+3(1) \end{pmatrix} = \begin{pmatrix} 3 \\ 10 \\ 3 \end{pmatrix}$$

This is the coordinate vector of $3 + 10x + 3x^2$, which is indeed $f'(x)$. ✓

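The kernel of $D$ is the set of polynomials with zero derivative: the constants $\{a : a \in \mathbb{R}\}$, a 1-dimensional subspace. Rank-nullity then gives $\dim(\text{im}\, D) = 4 - 1 = 3 = \dim(\mathcal{P}_2)$, so $D$ is onto: every polynomial of degree at most 2 is the derivative of some polynomial in $\mathcal{P}_3$. ✓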
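The same computation can be sketched in NumPy, storing polynomials as coefficient vectors ordered from the constant term upward. The helper `diff_matrix` is a hypothetical name. NumPy's `polyder` from `numpy.polynomial.polynomial` uses the same coefficient ordering and serves as an independent check.

```python
import numpy as np
from numpy.polynomial import polynomial as npoly

def diff_matrix(n):
    """Matrix of d/dx from P_n to P_(n-1) in the monomial bases {1, x, ..., x^n}."""
    return np.diag(np.arange(1, n + 1), k=1)[:n]

D = diff_matrix(3)           # the 3 x 4 matrix [D] from the text
f = np.array([2, 3, 5, 1])   # 2 + 3x + 5x^2 + x^3

df = D @ f                     # coordinates of f'
reference = npoly.polyder(f)   # NumPy's derivative of the same coefficients

print(np.allclose(df, reference))                 # True
print(np.linalg.matrix_rank(D))                   # 3, the dimension of the image
print(np.array_equal(D @ np.array([7, 0, 0, 0]),  # constants lie in the kernel
                     np.zeros(3)))                # True
```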

This is the bridge between linear algebra and calculus. Linear differential equations are linear systems in function space: the differential operators are linear transformations, and the solution set of a homogeneous linear equation is a subspace (the kernel of the operator).

Linear Transformations in Machine Learning

Layers are built on linear transformations. A fully connected layer computes $\mathbf{y} = A\mathbf{x} + \mathbf{b}$ before its activation function. The weight matrix $A$ defines a linear transformation from input space $\mathbb{R}^n$ to output space $\mathbb{R}^m$, and the bias $\mathbf{b}$ makes the layer affine. Learning the layer means finding an $A$ (and $\mathbf{b}$) such that the transformation maps the input distribution to something useful for the task.

Why you cannot stack only linear layers. The composition of linear transformations is linear. Two stacked layers compute $A_2(A_1\mathbf{x}) = (A_2 A_1)\mathbf{x}$, which is multiplication by the single matrix $A_2 A_1$. A ten-layer linear network can always be replaced by a single-layer network with matrix $A_{10} \cdots A_1$ (adding biases does not help: a composition of affine maps is still affine). Depth without nonlinearity adds no expressive power. The ReLU, sigmoid, or other activation function at each layer is what breaks this collapse - it introduces a nonlinearity that cannot be “absorbed” into a single matrix.
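The collapse is easy to reproduce. In this sketch the weights are random placeholders rather than trained values, and biases are left out for simplicity. The last check uses the fact that any linear map satisfies $T(\mathbf{x}) + T(-\mathbf{x}) = \mathbf{0}$.

```python
import numpy as np

rng = np.random.default_rng(0)
A1 = rng.normal(size=(4, 3))   # assumed first-layer weights:  R^3 -> R^4
A2 = rng.normal(size=(2, 4))   # assumed second-layer weights: R^4 -> R^2
x = rng.normal(size=3)

# Two linear layers equal one layer with the product matrix.
two_linear_layers = A2 @ (A1 @ x)
one_layer = (A2 @ A1) @ x
print(np.allclose(two_linear_layers, one_layer))  # True

def relu(z):
    return np.maximum(z, 0.0)

def net(v):
    """Two layers with a ReLU in between."""
    return A2 @ relu(A1 @ v)

# A linear map would give net(x) + net(-x) = 0; the ReLU network does not.
print(np.allclose(net(x) + net(-x), 0))  # False
```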

Dimensionality reduction. PCA finds the linear transformation $W \in \mathbb{R}^{k \times n}$ (with $k \ll n$) that projects centered high-dimensional data onto the $k$ directions of highest variance. This is a deliberate information loss: the kernel of $W$ is the $(n-k)$-dimensional subspace spanned by the low-variance directions. The image is all of $\mathbb{R}^k$: each output holds the coordinates of a data point along the $k$ retained directions, which span the $k$-dimensional subspace of $\mathbb{R}^n$ that best captures the data.
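A minimal PCA sketch, assuming the principal directions are obtained from the SVD of the centered data matrix (one common way to compute them). The data is synthetic and the sizes are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, samples = 5, 2, 200   # assumed sizes, for illustration only

# Synthetic data: large variance along 2 axes, small variance along the other 3.
scales = np.array([5.0, 3.0, 0.1, 0.1, 0.1])
X = rng.normal(size=(samples, n)) * scales
X = X - X.mean(axis=0)      # PCA operates on centered data

# Rows of Vt are the principal directions, ordered by decreasing variance.
_, s, Vt = np.linalg.svd(X, full_matrices=False)
W = Vt[:k]                  # k x n projection matrix
Z = X @ W.T                 # reduced coordinates, one row per sample

rank = np.linalg.matrix_rank(W)
kernel_dim = n - rank       # rank-nullity

print(W.shape, Z.shape)                 # (2, 5) (200, 2)
print(rank, kernel_dim)                 # 2 3
print(np.allclose(W @ Vt[k:].T, 0))     # True: discarded directions lie in ker(W)
```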

Attention in transformers. The input sequence $X \in \mathbb{R}^{n \times d}$ is projected by three learned linear transformations: $Q = XW_Q$, $K = XW_K$, $V = XW_V$. Each $W$ is a matrix that maps the $d$-dimensional token representations to query, key, and value spaces. The attention scores are computed from $Q$ and $K$; the output is a linear combination of the values. Every operation is built from linear transformations, with the softmax providing the nonlinearity.
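The pipeline can be sketched for a single attention head. The dimensions and random weights below are placeholders. The division by $\sqrt{d_k}$ is the usual scaled dot-product convention, which is an assumption here since the text does not specify how the scores are normalized.

```python
import numpy as np

def softmax(z, axis=-1):
    """Row-wise softmax; subtracting the max keeps the exponentials stable."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n, d, d_k = 4, 8, 3   # assumed sequence length, token width, head width

X = rng.normal(size=(n, d))                                     # one row per token
W_Q, W_K, W_V = (rng.normal(size=(d, d_k)) for _ in range(3))   # placeholder weights

Q, K, V = X @ W_Q, X @ W_K, X @ W_V          # three linear maps applied to every token
weights = softmax(Q @ K.T / np.sqrt(d_k))    # the only nonlinear step
output = weights @ V                         # each row is a weighted combination of value rows

print(output.shape)                          # (4, 3)
print(np.allclose(weights.sum(axis=1), 1))   # True: each row of weights sums to 1
```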

What the weight matrix “learns”. A linear transformation can rotate and reflect (reorganize directions), scale (emphasize some features over others), project (discard some information), or expand to higher dimensions (create new feature combinations). A trained neural network is a composition of many such operations, interleaved with nonlinearities, that has learned to map input data into a space where the task becomes easy.

The Rigorous Underpinning

Formal definition. $T: V \to W$ is a linear transformation if for all $\mathbf{u}, \mathbf{v} \in V$ and $\alpha, \beta \in \mathbb{R}$:

$$T(\alpha\mathbf{u} + \beta\mathbf{v}) = \alpha T(\mathbf{u}) + \beta T(\mathbf{v})$$

The matrix representation theorem. Let $B_V = \{\mathbf{b_1}, \ldots, \mathbf{b_n}\}$ be a basis for $V$ and $B_W = \{\mathbf{c_1}, \ldots, \mathbf{c_m}\}$ a basis for $W$. Any linear $T: V \to W$ has a unique matrix $[T]$ such that: if $\mathbf{v}$ has coordinate vector $\mathbf{c}$ in basis $B_V$, then $T(\mathbf{v})$ has coordinate vector $[T]\mathbf{c}$ in basis $B_W$. The columns of $[T]$ are the $B_W$-coordinate vectors of $T(\mathbf{b_1}), \ldots, T(\mathbf{b_n})$.

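Different choices of bases give different matrices for the same transformation. Changing bases in $V$ and $W$ independently replaces $[T]$ by $Q^{-1}[T]P$ for invertible change-of-basis matrices $P$ and $Q$ (an equivalence transformation); when $V = W$ and the same basis is used on both sides, this becomes a similarity transformation $P^{-1}[T]P$. The transformation itself is unchanged, only its coordinate description changes.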
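For a concrete instance, take the projection onto the line $y = x$ from earlier and describe it in the basis $\mathbf{b_1} = (1, 1)$, $\mathbf{b_2} = (1, -1)$, a basis chosen here for illustration. With the new basis vectors as the columns of $B$, the matrix in the new coordinates is $B^{-1}PB$. A small sketch:

```python
import numpy as np

P = np.array([[0.5, 0.5],
              [0.5, 0.5]])     # projection onto y = x in the standard basis

B = np.array([[1.0,  1.0],
              [1.0, -1.0]])    # columns: b1 = (1, 1) along the line, b2 = (1, -1) across it

# Same transformation, described in the basis {b1, b2}.
P_new = np.linalg.inv(B) @ P @ B

print(np.allclose(P_new, np.diag([1.0, 0.0])))  # True
```

In the adapted basis the projection reads "keep the $\mathbf{b_1}$ coordinate, drop the $\mathbf{b_2}$ coordinate", which is the same geometric action with a simpler coordinate description.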

Isomorphism. A bijective linear transformation is an isomorphism. Two vector spaces are isomorphic if there is an isomorphism between them - they have identical linear-algebraic structure. Every $n$-dimensional real vector space is isomorphic to $\mathbb{R}^n$: the isomorphism is the coordinate map sending each vector to its tuple of basis coefficients.

The space of linear maps. $\mathcal{L}(V, W)$, the set of all linear transformations from $V$ to $W$, is itself a vector space of dimension $mn$ (where $m = \dim W$, $n = \dim V$). The matrix representation identifies it with the space of $m \times n$ matrices.

Summary

Concept	What it means
Linear transformation	Preserves linear combinations; maps vector spaces to vector spaces
Columns of matrix	Images of the standard basis vectors
Kernel	Inputs mapped to zero; captures what the transformation “loses”
Image	All reachable outputs; captures what the transformation “produces”
Rank-nullity	$\dim(\ker) + \dim(\text{im}) = \dim(\text{domain})$
Matrix multiplication	Composition of linear transformations
Invertibility	Trivial kernel and full image; full rank
Matrix representation	Coordinate description relative to a choice of basis

The shift from “multiply a matrix” to “apply a linear transformation” is the shift from computation to understanding. Once you see matrices as transformations, the question stops being “how do I multiply these?” and starts being “what does this transformation do to space?” - and that question applies in polynomials, in differential equations, in neural networks, and everywhere else the same structure appears.
