Brunton Textbook Review - SVD Interpretations

2023/06/27

Steve Brunton recently released the second edition of his textbook, entitled “Data-Driven Science and Engineering: Machine Learning, Dynamical Systems, and Control.” Since ML, dynamical systems, and control are three of my favorite things to learn about, I’ve decided to read all 761 pages of the book and do a writeup in many parts. This is the first part, and will cover interpretations of the Singular Value Decomposition (SVD), pages 1-55 of the book.

SVD Overview

SVD is a matrix factorization method that represents any matrix $A$ with $n$ rows and $m$ columns as a product of three other matrices:

$$A = U \Sigma V^*$$

Where $X^*$ is the complex conjugate transpose of $X$. The shapes of these matrices are: $U$ is $n \times n$ and unitary, $\Sigma$ is $n \times m$ with non-negative entries on its diagonal and zeros elsewhere, and $V$ is $m \times m$ and unitary.

(A unitary matrix is one for which $U U^{*} = U^{*} U = I$ - that is, its complex conjugate transpose is its inverse).
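
To make the shapes concrete, here's a minimal NumPy sketch (mine, not from the book) that computes a full SVD and checks the factorization and unitarity claims above. Note that `np.linalg.svd` returns $V^*$ (as `Vh`) rather than $V$.

```python
import numpy as np

n, m = 6, 4
A = np.random.randn(n, m)

# Full SVD: U is n x n, Sigma is n x m (returned as a vector of singular
# values), and Vh is V*, which is m x m.
U, s, Vh = np.linalg.svd(A)
Sigma = np.zeros((n, m))
Sigma[:m, :m] = np.diag(s)

assert np.allclose(U @ Sigma @ Vh, A)            # A = U Sigma V*
assert np.allclose(U.conj().T @ U, np.eye(n))    # U is unitary
assert np.allclose(Vh @ Vh.conj().T, np.eye(m))  # V is unitary
```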

The matrix product form of SVD is equivalent to:

$$A = \sum\limits_{k=1}^{r} \sigma_{k} u_{k} v^{*}_{k}$$

Where $u_{k}$ and $v_{k}$ are the $k$th columns of $U$ and $V$ respectively, $v^{*}_{k}$ is the conjugate transpose of $v_{k}$ (so that $u_{k} v^{*}_{k}$ is an $n \times m$ rank-1 matrix), and $r$ is the rank of $A$.
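
As a quick sanity check (again my own sketch, not the book's), the same reconstruction can be built term by term as a weighted sum of rank-1 outer products:

```python
import numpy as np

n, m = 6, 4
A = np.random.randn(n, m)
U, s, Vh = np.linalg.svd(A)

# For a real-valued A, v_k* is just the k-th row of Vh, so each term
# s[k] * outer(u_k, v_k*) is an n x m rank-1 matrix. A random A has full
# rank, so here r = min(n, m) = len(s).
A_sum = sum(s[k] * np.outer(U[:, k], Vh[k, :]) for k in range(len(s)))
assert np.allclose(A_sum, A)
```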

In the event that $n \geq m$ (a tall-skinny matrix), $\Sigma$ is also tall-skinny, meaning that it’s sort of a diagonal matrix resting on a block of zeros:

$$\Sigma = \begin{bmatrix} \hat{\Sigma} \\ 0 \end{bmatrix}$$

If we similarly break $U$ into $U = [\hat{U}\ \hat{U}^\bot]$, we see that $\hat{U}^\bot$ will be nulled out by the block of zeros in $\Sigma$, so we can exactly represent $A$ using the economy SVD:

$$A = \hat{U} \hat{\Sigma} V^*$$
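
NumPy exposes the economy SVD via `full_matrices=False`; here's a small sketch (mine) with a tall-skinny matrix:

```python
import numpy as np

n, m = 1000, 10                # tall-skinny: n >> m
A = np.random.randn(n, m)

# Economy SVD: U_hat is n x m, Sigma_hat is m x m, Vh is m x m.
U_hat, s_hat, Vh = np.linalg.svd(A, full_matrices=False)
print(U_hat.shape, s_hat.shape, Vh.shape)   # (1000, 10) (10,) (10, 10)

# A is reconstructed exactly, without ever forming the full n x n matrix U.
assert np.allclose(U_hat @ np.diag(s_hat) @ Vh, A)
```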

The diagonal elements of $\Sigma$ are called the singular values of the matrix, and their associated columns in $U$ and $V$ are called the left and right singular vectors.

Interpretation 1: Matrix Approximation

Looking at the summation form of the SVD:

$$A = \sum\limits_{k=1}^{r} \sigma_{k} u_{k} v^{*}_{k}$$

$u_{k}$ and $v_{k}$ are both columns of an orthonormal basis, meaning that all their entries are bounded between -1 and 1. Each entry of the matrix $u_{k} v^{*}_{k}$ must then have the same bounds. The singular value $\sigma_{k}$ can thus be thought of as assigning a weight to each $(u_{k}, v_{k})$ pair. If we keep only the $\tilde{r}$ largest singular values and their associated singular vectors, we get a rank-$\tilde{r}$ approximation of $A$ in the same form as the economy SVD.

As it turns out, this truncated SVD is the optimal rank-$\tilde{r}$ approximation of $A$ in the least-squares sense, for every $\tilde{r}$ (this is the Eckart-Young theorem). This means that we can, for example, use SVD as an image compression algorithm.
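
As an illustration, here's a rough sketch (not from the book) of rank-$\tilde{r}$ compression; the random array stands in for a grayscale image, and any 2-D array would work:

```python
import numpy as np

def low_rank_approx(A, r_tilde):
    """Best rank-r_tilde approximation of A, built from the truncated SVD."""
    U, s, Vh = np.linalg.svd(A, full_matrices=False)
    return U[:, :r_tilde] @ np.diag(s[:r_tilde]) @ Vh[:r_tilde, :]

img = np.random.rand(512, 512)       # stand-in for a grayscale image
img_50 = low_rank_approx(img, 50)    # stores ~50*(512 + 512 + 1) numbers vs 512*512

# Relative reconstruction error (small for real images with fast-decaying
# singular values, large for random noise like this stand-in).
print(np.linalg.norm(img - img_50) / np.linalg.norm(img))
```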

Interpretation 2: Directions of Maximum Variance

In the singular value decomposition of a matrix $A$, we can see that the left singular vectors are eigenvectors of the row-wise correlation matrix $AA^*$ and the right singular vectors are eigenvectors of the column-wise correlation matrix $A^{ * }A$. The obvious question is: why should we care about this? What can we do with the eigenvectors of correlation matrices?

The answer is: we can do PCA. As a refresher, PCA, or Principal Components Analysis, is a dimensionality reduction technique where we try to find the $k$ directions that explain as much of the variance in $A$ as possible (in this case, $A$ is probably a collection of datapoints, so the row-wise variance, by convention, is the variance between datapoints). As it turns out, the eigenvectors of the correlation matrix of $A$ are exactly these directions of maximum variance, which is why SVD can be used to perform PCA. We might intuitively expect something like this to be true based on interpretation 1 (if the first $\tilde{r}$ components explain most of the data in the matrix, surely they also explain most of the variance), but when I asked GPT-4 to prove that the eigenvectors of the correlation matrix are the directions of maximum variance, it spent like 5 minutes writing an answer that relied on “Rayleigh quotients” (fun new LLM game: “real math thing, or hallucination?”), so I’m just going to trust that someone has figured this out.
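
To close the loop, here's a minimal PCA-via-SVD sketch (my own; note that PCA conventionally mean-centers the data first). It checks that the right singular vectors of the centered data matrix match the eigenvectors of its covariance matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
# 500 datapoints in 3 dimensions with correlated features.
X = rng.normal(size=(500, 3)) @ np.array([[3.0, 0.0, 0.0],
                                          [1.0, 1.0, 0.0],
                                          [0.0, 0.0, 0.1]])

B = X - X.mean(axis=0)                  # mean-center: rows are datapoints
U, s, Vh = np.linalg.svd(B, full_matrices=False)

# Principal directions are the right singular vectors; explained variance
# comes from the singular values.
explained_var = s**2 / (len(X) - 1)

# Cross-check against the eigendecomposition of the covariance matrix.
eigvals, eigvecs = np.linalg.eigh(np.cov(B, rowvar=False))
print(np.allclose(np.sort(eigvals)[::-1], explained_var))    # True
print(np.allclose(np.abs(Vh), np.abs(eigvecs[:, ::-1].T)))   # True (up to sign)
```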

Wrapup

Steve Brunton’s book goes into way more detail than I have here, which is why this post is like two pages and the book chapter is 50. In particular, he spends a lot of time on when the SVD does and doesn’t provide good approximations of a matrix/dataset, gives some alternatives and generalizations (like randomized SVD and tensor decomposition), and provides cool examples.

The next chapter of the book covers one of my favorite techniques: Fourier and wavelet transforms.