Brunton Textbook Review - Data Driven Dynamics

2023/09/21

Welcome! This post is my first review in the Dynamics and Control section of “Data-Driven Science and Engineering: Machine Learning, Dynamical Systems, and Control.”

This chapter focuses on analyzing dynamical systems by pure observation (that is, deducing their properties purely by observing the system, rather than from some first-principles analysis). First, it introduces some basic principles of dynamical systems, then gives a few methods for discovering their dynamics.

There are a few things we want to do with dynamical systems:

There are two main things in our way:

And to deal with these things, we will use two main approaches:

Basics of Dynamical Systems

A dynamical system is typically defined by its derivative function:

$$\dot{x} = f(x, t; \beta)$$

Where $\dot{x}$ is the time derivative of $x$, $t$ is the time, and $\beta$ is a set of parameters. However, we can also define discrete-time dynamical systems:

$$x_ {t+1} = F(x_ {t})$$

Discrete-time functions are, perhaps unintuitively, more general than continuous-time functions, in that they can describe discontinuous behavior. You can turn a continuous-time dynamical system into a discrete-time system by forward-integrating it over the appropriate timestep.
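As a quick illustration, here's a minimal sketch of that conversion, assuming SciPy's `solve_ivp` and a made-up derivative function `f`:

```python
import numpy as np
from scipy.integrate import solve_ivp

def f(t, x, beta=1.0):
    # Made-up continuous-time dynamics: a lightly damped linear oscillator.
    return np.array([x[1], -beta * x[0] - 0.1 * x[1]])

def F(x, dt=0.01):
    # Discrete-time map: forward-integrate the continuous dynamics over one timestep.
    sol = solve_ivp(f, (0.0, dt), x, t_eval=[dt])
    return sol.y[:, -1]

x = np.array([1.0, 0.0])
x_next = F(x)   # x_{t+1} = F(x_t)
```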

Linear Systems, Spectral Decomposition

Basically the only kinds of dynamical systems we actually know how to work with are linear systems ($\dot{x} = Ax$). For these, we can forward-integrate solutions for arbitrary times: $x_ {t_ {0} + t} = e^{At}x_ {t_ {0}}$. But we can also find the spectral decomposition, which is a fancy way to say that we find the eigenvectors and eigenvalues of $A$:

$$AT = T\Lambda$$

Where $T$ is a matrix whose columns are eigenvectors of $A$ and $\Lambda$ is a diagonal matrix of the corresponding eigenvalues. A useful property of the spectral decomposition is that $A = T \Lambda T^{-1}$, so $\dot{x} = Ax = T \Lambda T^{-1}x$. If we define a new coordinate space $z = T^{-1}x$, then the dynamics in this space are equivalent:

$$\begin{align} \dot{x} &= Ax \\ \dot{Tz} &= ATz \\ T\dot{z} &= T \Lambda T^{-1} T z \\ \dot{z} &= \Lambda z \end{align}$$

Which is super cool: since $\Lambda$ is a diagonal matrix of eigenvalues, the dynamics decouple in this space, so we can evolve each coordinate of $z$ independently and then transform back to the original coordinates.
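As a quick numerical check of this idea (plain NumPy/SciPy with a made-up $A$), propagating directly with the matrix exponential and propagating in the eigenvector coordinates give the same answer:

```python
import numpy as np
from scipy.linalg import expm

A = np.array([[0.0, 1.0],
              [-2.0, -0.3]])          # made-up linear dynamics x_dot = A x
lam, T = np.linalg.eig(A)             # A T = T Lambda

x0 = np.array([1.0, 0.0])
t = 2.5

# Propagate directly with the matrix exponential...
x_direct = expm(A * t) @ x0

# ...or go to eigenvector coordinates z = T^{-1} x, where each z_j evolves
# independently as z_j(t) = exp(lambda_j * t) z_j(0), then map back.
z0 = np.linalg.solve(T, x0.astype(complex))
x_modal = (T @ (np.exp(lam * t) * z0)).real

assert np.allclose(x_direct, x_modal)
```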

Dynamic Mode Decomposition (DMD)

Intuition, Basic Math

Let’s say you have some data that represents a system evolving over time. The example in the book is vortex shedding behind a cylinder, which means you have a series of 2D matrices whose values represent vorticity at each point in space. The dimensionality of your data doesn’t really matter, since you immediately vectorize it into a series of $n$-dimensional vectors $x_ {1} \dots x_ {m}$. What you get out of DMD is (the equivalent of) a matrix $A$, a best-fit linear approximation of how the system tends to evolve over one timestep: you can imagine that it minimizes $\|x_ {k+1} - Ax_ {k}\|$ over all pairs of consecutive snapshots.

I say you get the equivalent of $A$ because, as you can see, $A$ is $n \times n$, and if $n$ is the dimension of a vectorized image, $A$ might be huge. So instead we solve for a rank-$r$ approximation of $A$ built from the truncated SVD of the data. To do this, we start by splitting our data into $X = [x_ {1}\dots x_ {m-1}], X’ = [x_ {2}\dots x_ {m}]$, so that $X’ \approx AX$ and, therefore, $A = X’X^{\dagger}$ (where $X^{\dagger}$ is the pseudoinverse of $X$). Since we wish to find a version of $A$ that works in a reduced state space, we now compute the rank-$r$ truncated SVD $X \approx \tilde{U} \tilde{\Sigma} \tilde{V}^{*}$ (cf: SVD Interpretations). The columns of $\tilde{U}$ are also known as the POD (Proper Orthogonal Decomposition) modes of $X$.

Since $A = X’X^{\dagger}$ and the pseudoinverse of $X$ can easily be found from the SVD ($X^{\dagger} = \tilde{V} \tilde{\Sigma}^{-1} \tilde{U}^{*}$), we can now project $A$ onto the POD basis $\tilde{U}$:

$$\tilde{A} = \tilde{U}^{* } A \tilde{U} = \tilde{U}^{* } X’ \tilde{V} \tilde{\Sigma}^{-1}$$

Now, if we want to use $\tilde{A}$, we can simply find $\tilde{x}_ {k} = \tilde{U}^{* }x_ {k}$, run it through the dynamical system ($\tilde{x}_ {k+1} = \tilde{A}\tilde{x}_ {k}$), and finally project it back into the full space with $x_ {k+1} = \tilde{U}\tilde{x}_ {k+1}$.

Here’s the part of this that was a little tricky for me to understand: using the spectral decomposition of $\tilde{A}$ and a little extra math that I didn’t really follow, you can find the spectral decomposition of the high-dimensional $A$ matrix. The eigenvectors in this decomposition are called the “DMD modes.” Understanding that the modes are just eigenvectors and associated eigenvalues can be helpful in realizing that the DMD modes are really just components of a low-rank approximation of $A$. However, DMD is special because these modes are both spatially and temporally coherent. One of the big reasons we like to use it is that it offers us a way to decompose a system into linear combinations of oscillating coherent structures.
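Here’s a minimal NumPy sketch of the whole pipeline; the step I glossed over, recovering the high-dimensional modes from the eigenvectors $W$ of $\tilde{A}$, uses the standard “exact DMD” formula $\Phi = X’\tilde{V}\tilde{\Sigma}^{-1}W$:

```python
import numpy as np

def dmd(X_full, r):
    """Rank-r DMD of a snapshot matrix whose columns x_1..x_m are states over time."""
    X, Xp = X_full[:, :-1], X_full[:, 1:]          # X = [x_1..x_{m-1}], X' = [x_2..x_m]

    # Rank-r truncated SVD of X (POD modes in the columns of U).
    U, S, Vh = np.linalg.svd(X, full_matrices=False)
    U, S, V = U[:, :r], S[:r], Vh[:r].conj().T

    # Project A = X' X^+ onto the POD basis: A_tilde = U* X' V S^{-1}.
    A_tilde = U.conj().T @ Xp @ V / S

    # Spectral decomposition of A_tilde; lift the eigenvectors to DMD modes (exact DMD).
    lam, W = np.linalg.eig(A_tilde)
    Phi = Xp @ V / S @ W

    # Mode amplitudes: least-squares fit of the first snapshot onto the modes.
    b = np.linalg.lstsq(Phi, X[:, 0], rcond=None)[0]
    return Phi, lam, b
```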

We can leverage the spectral decomposition of $A$ by using it to do another spectral decomposition of a state $x_ {k}$:

$$x_ {k} = \sum\limits_ {j=1}^{r} \phi_ {j} \lambda_ {j}^{k-1} b_ {j}$$

Where $\phi_ {j}$ and $\lambda_ {j}$ are the $j$th eigenvector/eigenvalue pair of $A$, and $b_ {j}$ is a phasor giving mode $\phi_ {j}$ an initial amplitude and phase. Since these are discrete-time eigenvalues raised to the power $k-1$, the magnitude of each $\lambda_ {j}$ governs exponential growth or decay of its mode, while its complex argument sets the oscillation frequency (equivalently, for the continuous-time eigenvalues $\omega_ {j} = \ln(\lambda_ {j})/\Delta t$, where $\Delta t$ is the time between snapshots, the real part gives growth/decay and the imaginary part gives frequency). The initial amplitude and phase, as stated, come from $b_ {j}$.
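In code, reconstructing (or forecasting) the $k$th snapshot from the outputs of the sketch above is just:

```python
def dmd_reconstruct(Phi, lam, b, k):
    # x_k ~= sum_j phi_j * lambda_j^(k-1) * b_j   (k = 1 is the first snapshot)
    return (Phi @ (lam ** (k - 1) * b)).real
```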

Another way of thinking of the DMD is that it’s sort of a cross between a time-domain Fourier transform and a spatial PCA, extracting sets of coherent spatial structures which evolve similarly over time.

Weaknesses and Variations

One big problem with vanilla DMD is that noisy data can create significant biases in the distribution of the eigenvalues. Worse, the bias is not resolved by larger sample sizes. A simple way to mitigate this is to compute the DMD matrices for both the forward and backward directions in time, then take the average of the forward estimate and the inverse of the backward estimate. There are also a number of techniques which attempt to address the problem of bias by solving for the matrices, and also for the phasors in $b$, in explicit optimization problems which are handed off to nonlinear solvers.
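A rough sketch of that forward/backward averaging, assuming the state dimension is small enough to form the operators directly (for high-dimensional data you would do the same thing on the reduced operators instead):

```python
import numpy as np

def fb_debiased_A(X_full):
    # Rough sketch: assumes the state dimension is small enough to form A directly.
    X, Xp = X_full[:, :-1], X_full[:, 1:]
    A_fwd = Xp @ np.linalg.pinv(X)    # forward fit:  x_{k+1} ~ A_fwd @ x_k
    A_bwd = X @ np.linalg.pinv(Xp)    # backward fit: x_k ~ A_bwd @ x_{k+1}, so A_bwd ~ A^{-1}
    return 0.5 * (A_fwd + np.linalg.inv(A_bwd))   # average forward and inverse-backward estimates
```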

DMD also can’t handle nonlinear and nonperiodic features, such as multiple fixed points or chaos. Further, the failure mode of DMD in these cases is such that it completely fails to capture even the linear features of the system. The next algorithm discussed, SINDy, can fully extract nonlinear dynamics, but doesn’t scale well to high-dimensional systems unless they admit projection into a low-dimensional subspace. In the case of a high-dimensional, nonlinear system, a more recent algorithm called LANDO (linear and nonlinear disambiguation optimization) may be useful. LANDO extracts the full nonlinear dynamics, and allows you to use DMD to extract modes for a linearization of the system around some point.

A few types of content in the DMD input data are particularly hard to capture. Travelling waves are difficult to break into stationary spatial modes. Transient spikes are both nonlinear and nonperiodic. Nonlinear interactions between a few dominant modes can produce what appears to be broadband frequency content, which will result in a large number of modes appearing in the DMD analysis.

Extensions, Applications

DMD can be useful in compression and compressed sensing, since a relatively small number of modes can often describe most of the data. Extensions of DMD which optimize for sparsity in $b$ can improve quality here. Further, the modes themselves often admit sparse representation in a Fourier basis.

Extensions: Variations of DMD exist for a variety of systems. DMDc handles systems which are subject to control inputs rather than evolving naturally; eDMD handles nonlinear measurement functions; mrDMD (multiresolution DMD) handles transient dynamics or dynamics on vastly different timescales, like climate data. There are also variants for incomplete or low-rank measurements, which can paradoxically produce worse results than overdefined systems, because DMD needs a sufficiently high-rank measurement matrix to produce complex-conjugate pairs of eigenvalues. The solution here is to stack delayed measurements into a larger matrix, as in the sketch below.
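For the low-rank measurement case, the delay-stacking trick is easy to sketch (`dmd` is the hypothetical function from the earlier sketch):

```python
import numpy as np

def delay_embed(X, d):
    """Stack d consecutive snapshots into each column (a Hankel-style matrix)."""
    n, m = X.shape
    return np.vstack([X[:, i:m - d + 1 + i] for i in range(d)])

# Usage: run DMD on the delay-embedded data instead of the raw measurements.
# Phi, lam, b = dmd(delay_embed(X_full, d=10), r=20)
```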

DMD is also used for compression in streaming applications, so there are variants which support incremental updates and parallelized reconstruction of the data for speed.

There are tensor formulations of DMD which preserve the spatial structure of the data rather than vectorizing it.

Applications: DMD was originally developed for analyzing fluid flows and is still used for this purpose, but it also shows up in epidemiology (modelling disease spread over time), neuroscience (analyzing neural recordings), video analysis (for example, foreground/background separation can be done by noting that the background corresponds to a mode that is nearly constant in time, i.e. its continuous-time eigenvalue is close to zero), plasma dynamics, robotics, and probably many other fields.

Sparse Identification of Nonlinear Dynamics (SINDy)

SINDy is an extremely simple concept: it solves a sparse regression problem to fit coefficients for a library of candidate functions to a dynamics model. Specifically, if you have a data matrix $X$ with derivative $\dot{X}$ (which can be calculated numerically from $X$ using the total-variation regularized derivative), you can create a library $\Theta(X)$ of functions applied to $X$. The matrix of coefficients for $\Theta$ will be denoted $\Xi$, so we’d ideally like $\dot{X} = \Theta(X)\Xi$. But instead of solving directly (with a pseudoinverse, for example), we solve a sparsity-promoting minimization problem:

$$\text{argmin}_ {\Xi} ||\dot{X} - \Theta(X)\Xi||_ {2} + \lambda ||\Xi||_ {1}$$

Where $\lambda$ is a tunable parameter to promote sparsity in $\Xi$.
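Here’s a minimal sketch with a made-up polynomial library, using sequentially thresholded least squares (the simple iterative trick the SINDy paper uses) as a stand-in for the optimization above; the threshold and library choices are purely illustrative:

```python
import numpy as np

def library(X):
    # Candidate functions Theta(X): constant, linear, and quadratic terms (illustrative choice).
    m, n = X.shape
    cols = [np.ones((m, 1)), X]
    cols += [(X[:, i] * X[:, j])[:, None] for i in range(n) for j in range(i, n)]
    return np.hstack(cols)

def sindy(X, X_dot, threshold=0.1, n_iter=10):
    Theta = library(X)
    Xi = np.linalg.lstsq(Theta, X_dot, rcond=None)[0]        # initial dense fit
    for _ in range(n_iter):
        small = np.abs(Xi) < threshold                       # zero out small coefficients...
        Xi[small] = 0.0
        for k in range(X_dot.shape[1]):                      # ...and refit the surviving ones
            big = ~small[:, k]
            Xi[big, k] = np.linalg.lstsq(Theta[:, big], X_dot[:, k], rcond=None)[0]
    return Xi   # X_dot ~= Theta @ Xi
```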

You can also do SINDy with discrete-time dynamics; if the library contains only linear terms and you set $\lambda$ to zero, the discrete-time SINDy problem reduces to DMD.

Koopman Theory

The cool thing that Koopman Theory lets us do is express nonlinear dynamics in a high-dimensional linear space. In theory, this space is actually an infinite-dimensional Hilbert space (that is, a space of functions with some property), but in practice it is sometimes possible to find finite-dimensional approximations.
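For intuition, here’s the classic toy example (I believe it’s the one the book uses) of a nonlinear system that becomes exactly linear in a small, hand-picked set of measurement functions. Take

$$\dot{x}_ {1} = \mu x_ {1}, \qquad \dot{x}_ {2} = \lambda(x_ {2} - x_ {1}^{2})$$

and measure $y_ {1} = x_ {1}$, $y_ {2} = x_ {2}$, $y_ {3} = x_ {1}^{2}$. Then $\dot{y}_ {3} = 2x_ {1}\dot{x}_ {1} = 2\mu y_ {3}$, so the measurements close on themselves and evolve linearly:

$$\frac{d}{dt}\begin{bmatrix} y_ {1} \\ y_ {2} \\ y_ {3} \end{bmatrix} = \begin{bmatrix} \mu & 0 & 0 \\ 0 & \lambda & -\lambda \\ 0 & 0 & 2\mu \end{bmatrix} \begin{bmatrix} y_ {1} \\ y_ {2} \\ y_ {3} \end{bmatrix}$$

In general we aren’t this lucky, and the required set of measurement functions is infinite.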

Koopman Operators

Suppose you have a system being measured by measurement function $g$, a member of some infinite-dimensional Hilbert space. The Koopman operator for some timestep, $\mathcal{K}_ {\Delta t} g(x_ {t})$, returns the measurement value for the state of the system at that point in the future, $g(x_ {t+\Delta t})$. In a discrete-time system with timestep $\Delta t$, this becomes $g(x_ {t+1})$. There is also a fully continuous-time version of the Koopman operator, which is essentially equivalent to the time-derivative of $g$ (think of the limit definition of the derivative).

The infinite dimensionality of the Koopman operator is, of course, inconvenient. Capturing just any subspace of the Hilbert space probably won’t be sufficient, because there’s no guarantee that a random subspace actually behaves linearly under the Koopman operator. However, eigenfunctions of the Koopman operator are, by definition, guaranteed to have this property. In discrete time, an eigenfunction $\phi$ with eigenvalue $\lambda$ would satisfy:

$$\phi(x_ {t+1}) = \mathcal{K}_ {\Delta t} \phi(x_ {t}) = \lambda \phi(x_ {t})$$

And in continuous time:

$$\frac{d}{dt} \phi(x_ {t}) = \mathcal{K}\phi(x_ {t}) = \lambda \phi(x_ {t})$$

If we wanted to solve for the eigenfunction, we would apply the chain rule and try to solve the resulting PDE:

$$\frac{d}{dt} \phi(x) = \nabla \phi(x) \cdot \dot{x} = \nabla \phi(x) \cdot f(x) = \lambda \phi(x)$$

The dynamics of a nonlinear system are linear in these eigenfunction coordinates. Eigenfunctions can also carry physical meaning: any conserved quantity in a physical system corresponds to an eigenfunction of its Koopman operator with eigenvalue zero (the Hamiltonian of a conservative system, for example).

Learning Koopman Operators

It’s possible to train a NN to give finite-dimensional approximations of a Koopman operator using a variation on autoencoders. A normal autoencoder trains two halves of a network, one an encoding function ($\phi$) and one a decoding function ($\phi^{-1}$). Typically, the loss function just enforces that $\phi^{-1}(\phi(x)) = x$. However, a Koopman network adds one fully-connected layer between the $\phi$ and $\phi^{-1}$ portions of the network. Since one fully-connected layer with no nonlinearities is equivalent to matrix multiplication, this layer is the $K$ matrix. In addition to the typical autoencoder loss, the Koopman networks require that advancing the latent state with $K$ matches the encoding of the next state ($\phi(x_ {t+1}) = K\phi(x_ {t})$), and that decoding the advanced latent state reproduces the next state ($x_ {t+1} = \phi^{-1}(K\phi(x_ {t}))$).
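Here’s a minimal PyTorch-style sketch of that architecture and its losses; the layer sizes are made up, and real implementations weight and schedule these loss terms more carefully:

```python
import torch
import torch.nn as nn

class KoopmanAutoencoder(nn.Module):
    # Minimal sketch: encoder phi, decoder phi^{-1}, and a single bias-free
    # linear layer K acting as the finite-dimensional Koopman matrix.
    def __init__(self, n, latent):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n, 64), nn.ReLU(), nn.Linear(64, latent))
        self.decoder = nn.Sequential(nn.Linear(latent, 64), nn.ReLU(), nn.Linear(64, n))
        self.K = nn.Linear(latent, latent, bias=False)

def koopman_loss(model, x_t, x_t1):
    z_t = model.encoder(x_t)
    # Reconstruction: the decoder inverts the encoder.
    recon = nn.functional.mse_loss(model.decoder(z_t), x_t)
    # Linearity: advancing the latent state with K matches the encoding of the next state.
    linear = nn.functional.mse_loss(model.K(z_t), model.encoder(x_t1))
    # Prediction: decoding the advanced latent state reproduces the next state.
    pred = nn.functional.mse_loss(model.decoder(model.K(z_t)), x_t1)
    return recon + linear + pred
```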

A variation on this technique has an additional network to predict the eigenvalues of the $K$ matrix, which is then constructed based on these eigenvalues; this makes it possible to linearize even systems which normally don’t admit compact Koopman representations (such as those with an escalating series of harmonic eigenvalues or broadband frequency content).