Welcome back to my Kalman filtering notes! In this post we’ll first back up and talk about when and why we would want to use a Kalman filter, and then we’ll go all the way to deriving an estimator with the same properties as the KF.
Kalman Filter Setup and Motivation
Suppose you’re designing a satellite that needs to estimate its location in space very precisely. The two most obvious ways to do this are to use GPS and to infer your location by measuring it relative to background stars. Suppose for a moment that your satellite is totally immobile and can take a number of independent measurements of its altitude, which we’ll call $y_i, i \in 1..m$. Each of these measurements has some uncertainty associated with it, which we’ll describe using measurement variances $\sigma^2_i$. If we want the estimate with the minimum variance, it’s hopefully intuitive that we can find it by taking an average weighted by inverse variance:
$$\hat{x} = \dfrac{\sum_{i=1}^{m} \left( \sigma^2_i \right)^{-1} y_i}{\sum_{i=1}^{m} \left( \sigma^2_i \right)^{-1}}$$
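As a quick sanity check, here’s a minimal NumPy sketch of that weighted average. The measurement values and variances below are made up purely for illustration:

```python
import numpy as np

# Hypothetical altitude measurements of the same quantity from different sensors
y = np.array([408.2, 407.9, 408.5])   # measurements y_i
var = np.array([0.5, 0.1, 1.0])       # measurement variances sigma_i^2

# Inverse-variance weighted average: sum(y_i / var_i) / sum(1 / var_i)
w = 1.0 / var
x_hat = np.sum(w * y) / np.sum(w)

# The variance of the combined estimate, 1 / sum(1 / var_i),
# is smaller than any individual measurement variance.
var_hat = 1.0 / np.sum(w)

print(x_hat, var_hat)
```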
This is a good start, but we’ve made a number of assumptions that won’t necessarily hold in general. We’ve assumed that the noise in our measurements is independent, and we’ve also assumed a perfectly stationary satellite. In reality, we’ll have a series of measurements of different states, with different variances, taken at different times, and we’ll need to combine those measurements to estimate the current state.
When is the Kalman filter a good idea? If you’re measuring a dynamical system that you have unimodal dynamics and measurement models for, and the dynamics aren’t insanely nonlinear (strong nonlinearity will usually produce multimodal state distributions anyway), KF is probably a good fit. I mean unimodal in the statistical sense here: if you need a filter that can simultaneously maintain multiple hypotheses, you should probably consider a particle filter.
Kalman Filter Properties
The Kalman Filter is a LUMVE: Linear, Unbiased, Minimum-Variance Estimator. It’s linear in that its estimate is a linear function of the measurements, so all of the estimation boils down to matrix operations. Unbiased means that the estimates it produces aren’t consistently higher or lower than the actual value if it has correct system and measurement models (mathematically: $E[\hat{x} - x] = 0$). Minimum-variance means pretty much what you think it does.
Let’s derive a LUMVE estimator!
LUMVE
Problem Setup
We have a set of observations $y_i, i=1..m$. If we wish to fit a line to $y$, we would fit: $y(t) = a + bt = G(a,b; t)$. In general, we’ll use $G$ as our model function, so we’ll have $y = G(x, t)$, where $x$ is our vector of model parameters.
Some models are nonlinear but can be linearized. For instance, if we have $y = a e^{bt}$, we can linearize by taking the natural log of both sides: $\ln(y) = \ln(a) + bt$, then substituting $y’ = \ln(y)$ and $a’ = \ln(a)$, so we get $y’ = a’ + bt$. Here, $y$ is not linear in the parameters $(a, b)$, but $y’$ is linear in $(a’, b)$.
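As a quick sketch of this trick (everything below is invented for illustration; it just runs NumPy’s linear least-squares solver on the transformed problem):

```python
import numpy as np

# Synthetic, noise-free data from y = a * exp(b * t), with made-up "true" values
t = np.linspace(0.0, 2.0, 20)
a_true, b_true = 3.0, 0.7
y = a_true * np.exp(b_true * t)

# Linearize: ln(y) = ln(a) + b*t, i.e. y' = a' + b*t
y_prime = np.log(y)

# Solve the now-linear least-squares problem for [a', b]
A = np.column_stack([np.ones_like(t), t])
(a_prime, b_est), *_ = np.linalg.lstsq(A, y_prime, rcond=None)

a_est = np.exp(a_prime)   # undo the substitution a' = ln(a)
print(a_est, b_est)       # recovers a_true and b_true
```

One caveat: with noisy data, least squares on $\ln(y)$ minimizes error in log space, which is not quite the same objective as minimizing error in $y$ itself.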
Our problem is now to find the set of parameters that minimizes the sum of squared errors between $G(x, t)$ and $y$; call this cost $J(x) = \sum_{i=1}^{m} \left[ y_i - G(x, t_i) \right]^2$.
From multivariate calculus, we know that the conditions for a minimum are:
- $$\dfrac{\partial J}{\partial x_i} = 0, i \in 1..n$$
- $H(J, x)$ is positive definite, where $H(J, x)$ is the Hessian of $J$: the $n \times n$ matrix of second-order partial derivatives of $J$ with respect to pairs of variables in $x$, $H_{ij} = \dfrac{\partial^2 J}{\partial x_i \partial x_j}$ (equivalently, $H = \nabla \nabla^T J$, where $\nabla$ is the $n \times 1$ vector of partial-derivative operators $\partial / \partial x_i$).
A useful (in fact, defining) property of a positive definite matrix $A$ is that $x^T A x > 0$ for all nonzero $x$.
Modeling Error
In real life, we will have measurement error. Keeping our $G$ function, we’ll model this error as $y_i = G(x, t_i) + \epsilon_i$, where $\epsilon$ is a vector of random variables representing our error. However, $G$ may be a nonlinear function which we can’t represent as a matrix. We address this by finding $H$, a linearization of $G$ about $x$, where row $i$ of $H$ is $H^{(i)} = \left[ \dfrac{\partial G_t^{(i)}}{\partial x_1}\ \dfrac{\partial G_t^{(i)}}{\partial x_2}\ \dots \right]$ (that is, $H$ is the partial derivatives matrix, or Jacobian, of $G$ with respect to $x$ at time $t$). Here, $H$ is $m \times n$, where $m$ is the number of observations and $n$ is the dimensionality of the state space. We can now say that locally:
$$y = Hx + \epsilon$$
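If working out the Jacobian analytically is a pain, a finite-difference approximation is often good enough for a sketch. Here’s a minimal NumPy version; the helper name, the toy model, and the linearization point are all my own inventions:

```python
import numpy as np

def numerical_jacobian(G, x, eps=1e-6):
    """Approximate H, the m x n Jacobian of G at x, one column at a time."""
    y0 = np.asarray(G(x), dtype=float)
    H = np.zeros((y0.size, x.size))
    for j in range(x.size):
        dx = np.zeros_like(x, dtype=float)
        dx[j] = eps
        H[:, j] = (np.asarray(G(x + dx), dtype=float) - y0) / eps
    return H

# A toy nonlinear observation model: one predicted observation per time
times = np.array([0.0, 1.0, 2.0, 3.0])
def G(x):
    a, b = x
    return a * np.exp(b * times)

x_nominal = np.array([3.0, 0.7])
H = numerical_jacobian(G, x_nominal)
print(H.shape)   # (4, 2): m observations by n parameters
```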
To do this, we need to make a crucial switch: we’ll change from estimating the state of the system directly to estimating the deviation between where the system is and where we would have thought it would be based on our last estimate. In other words, say we have a system dynamics function $F$, where $F(X_{i-1}) = X_i$ (we don’t actually have this yet, for the sake of simplicity; we’ll go back to ignoring it after this paragraph by assuming the system is stationary). We now want to estimate $\hat{x} = X_i - F(X^*_{i-1})$, where $X^*$ is the estimated system state (previously denoted $\hat{x}$). Similarly, we’ll switch from using the observation itself to the observation residual, $y = Y - G(F(X^*_{i-1}))$. As a brief note on notation: I denoted the state and observation as uppercase $X$ and $Y$ in the last sentences not because they somehow turned into matrices but because I’m running out of variants to attach to letters. Fortunately, these should be the last variants, and $x$ and $y$ will denote residual values from here on out.
With this linear model, our cost function is $J(\hat{x}) = [y - H\hat{x}]^T [y - H\hat{x}]$. Taking the derivative and setting it to zero, $\dfrac{\partial J}{\partial \hat{x}} = -2y^T H + 2\hat{x}^T H^T H = 0$, gives the normal equations $H^T H \hat{x} = H^T y$, and so the error-minimizing estimate without weighting:
$$\hat{x} = (H^T H)^{-1}H^Ty$$
This is essentially a rederivation of the Moore-Penrose (left) pseudoinverse of $H$, since the left pseudoinverse $H^+$ of matrix $H$ gives the least-squares solution to $y = H\hat{x}$. In other words, we have found that the left pseudoinverse is defined as $H^+ = (H^T H)^{-1}H^T$.
Note that $H^T H$ must be invertible - this requires $H$ to have rank $n$, i.e., to contain $n$ linearly independent observations/rows.
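Here’s a minimal sketch of the unweighted solution on a synthetic line fit (the numbers are invented; in practice an SVD/QR-based solver such as np.linalg.lstsq is numerically kinder than forming $H^T H$ explicitly):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic line-fit problem: y = a + b*t + noise, with H as the design matrix
t = np.linspace(0.0, 10.0, 30)
H = np.column_stack([np.ones_like(t), t])          # m x n, with n = 2
x_true = np.array([1.5, -0.3])
y = H @ x_true + 0.1 * rng.standard_normal(t.size)

# Normal-equations form of the left pseudoinverse: x_hat = (H^T H)^-1 H^T y
x_hat = np.linalg.solve(H.T @ H, H.T @ y)

# Same answer (more numerically robust) from the built-in pseudoinverse
x_hat_pinv = np.linalg.pinv(H) @ y
print(x_hat, x_hat_pinv)
```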
But what if our error varies across different observations? We can address this by weighting each observation separately: add an $m \times m$ weight matrix $W$ and set $J’(\hat{x}) = [y - H\hat{x}]^T W [y - H\hat{x}]$. You can probably tell by inspection that if $W$ is a diagonal matrix, this just scales the relative importance of each observation residual in the objective. This time, I’ll show a more detailed derivation for finding the optimal solution.
We start with
$$\begin{align} J’(\hat{x}) &= [y - H\hat{x}]^T W [y - H\hat{x}] \\ &= y^TWy - y^TWH\hat{x} - \hat{x}^T H^T Wy + \hat{x}^TH^T W H\hat{x} \\ \end{align}$$
Taking the partial derivative (assuming a symmetric $W$) and setting it to zero will let us find the minimum:
$$\frac{\partial J’}{\partial \hat{x}} = -2y^T W H + 2\hat{x}^T H^T W H = 0$$
$$\hat{x} = (H^T W H)^{-1}H^T W y$$
Note that now $H^T W H$ must be invertible.
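A small sketch of the weighted version, again on invented data, where half of the observations are made much noisier than the rest:

```python
import numpy as np

rng = np.random.default_rng(1)

# Same synthetic line fit, but the second half of the data is much noisier
t = np.linspace(0.0, 10.0, 30)
H = np.column_stack([np.ones_like(t), t])
x_true = np.array([1.5, -0.3])
sigmas = np.where(t < 5.0, 0.05, 0.5)              # per-observation noise std
y = H @ x_true + sigmas * rng.standard_normal(t.size)

# Weight each observation by 1/sigma^2 (a diagonal W)
W = np.diag(1.0 / sigmas**2)

# x_hat = (H^T W H)^-1 H^T W y
x_hat = np.linalg.solve(H.T @ W @ H, H.T @ W @ y)
print(x_hat)
```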
LUMVE
By treating our weighting matrix from before in a statistically rigorous way, we can arrive at a Linear, Unbiased, Minimum-Variance Estimate (LUMVE). To do this, we simply replace $W$ with the inverse of the covariance matrix of our measurement noise. We’ll denote the covariance matrix of the measurement noise as $R = E[(y - Hx)(y - Hx)^T] = E[\epsilon \epsilon^T]$, so we’re setting $W = R^{-1}$. Note that if the covariance matrix is diagonal, this is equivalent to weighting by the inverse variance of each measurement, which makes sense.
We now find that the optimal gain matrix is:
$$M = (H^T R^{-1} H)^{-1}H^T R^{-1}$$
$M$ has some interesting properties. For instance, it’s actually still a left pseudoinverse of $H$, meaning that $MH = I$. However, it’s now a pseudoinverse that also satisfies the optimality condition of minimizing the covariance (which we accomplished by weighting our least-squares estimator with the inverse of the measurement covariance matrix).
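Here’s a tiny numeric sketch (the design matrix and measurement covariance below are made up) confirming that $M$ built this way really is a left inverse of $H$:

```python
import numpy as np

rng = np.random.default_rng(2)

# A made-up m x n design matrix and a made-up diagonal measurement covariance
H = np.column_stack([np.ones(10), np.linspace(0.0, 1.0, 10)])
R = np.diag(rng.uniform(0.01, 0.25, size=10))

# Optimal gain: M = (H^T R^-1 H)^-1 H^T R^-1
R_inv = np.linalg.inv(R)
M = np.linalg.solve(H.T @ R_inv @ H, H.T @ R_inv)

print(np.allclose(M @ H, np.eye(2)))    # True: M is a left pseudoinverse of H
```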
By using this in $\hat{x} = My$, we have a LUMVE estimator. We can use our setup from before to find the covariance of the estimator:
$$\begin{align} \hat{P} &= Cov(\hat{x}) \\ &= Cov(My) \\ &= M\ Cov(y)\ M^T \\ &= M R M^T \\ &= (H^T R^{-1} H)^{-1} H^T R^{-1} R \left( (H^T R^{-1} H)^{-1} H^T R^{-1} \right)^T \\ &= (H^T R^{-1} H)^{-1} H^T R^{-T} H (H^T R^{-1} H)^{-T} \\ &= (H^T R^{-1} H)^{-1} H^T R^{-1} H (H^T R^{-1} H)^{-1} \\ &= (H^T R^{-1} H)^{-1} \end{align}$$
Wow, that was a pain! Note that here, the notation $A^{-T}$ is used to denote the inverse transpose of $A$, and that the second-to-last step uses the fact that $R$ and $H^T R^{-1} H$ are both symmetric. But now we also have the interesting substitution:
$$M = \hat{P} H^T R^{-1}$$
Or, if we wanted $\hat{P}$ in terms of $M$ (as already appeared mid-derivation above):
$$\hat{P} = M R M^T$$
Though the first substitution feels somehow more deep to me. This, as usual, assumes that $H$ is full-rank (that is, it has rank $n$), so that $H^T R^{-1} H$ is invertible.
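Continuing the same kind of made-up numeric sketch, both the substitution and the covariance identity are easy to check:

```python
import numpy as np

rng = np.random.default_rng(3)

# Made-up m x n design matrix and diagonal measurement covariance
H = np.column_stack([np.ones(10), np.linspace(0.0, 1.0, 10)])
R = np.diag(rng.uniform(0.01, 0.25, size=10))
R_inv = np.linalg.inv(R)

P_hat = np.linalg.inv(H.T @ R_inv @ H)          # (H^T R^-1 H)^-1

# The substitution M = P_hat H^T R^-1 matches the original gain definition...
M_direct = np.linalg.solve(H.T @ R_inv @ H, H.T @ R_inv)
M_sub = P_hat @ H.T @ R_inv
print(np.allclose(M_direct, M_sub))             # True

# ...and M R M^T collapses back to P_hat
print(np.allclose(M_sub @ R @ M_sub.T, P_hat))  # True
```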
Dealing with Biased Noise
If $\epsilon$ has some bias $b$, how can we deal with it? If $b$ is known, we can just solve the problem for $y - b$, but what if it’s not? It’s actually very simple - we can simply add $b$ as a parameter in $x$ (with a corresponding column in $H$), so that it will be estimated as part of the filter. Dealing with biased noise isn’t any different from dealing with regular biases in the data.
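As a sketch (with an invented slope-only model, chosen so that the bias column isn’t something $H$ already contains), estimating a constant sensor bias looks like this:

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic problem: the model is y = slope * t, but the sensor adds an
# unknown constant bias b to every reading.
t = np.linspace(1.0, 10.0, 40)
slope_true, b_true = -0.3, 0.8
y = slope_true * t + b_true + 0.05 * rng.standard_normal(t.size)

# The original design matrix has just the slope column...
H = t[:, None]

# ...so we augment the state to [slope, b] and H with a column of ones for b.
H_aug = np.column_stack([H, np.ones_like(t)])
x_aug_hat = np.linalg.solve(H_aug.T @ H_aug, H_aug.T @ y)
print(x_aug_hat)    # ~[-0.3, 0.8]: the bias is recovered along with the state
```

Note that this only works if the bias is actually observable: if $H$ already had a column of ones (say, an intercept term), the bias would be indistinguishable from it and $H^T H$ would become singular.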
Incorporating Priors
A priori information can be incorporated by making it an observation with appropriate weight. Since a priori information usually takes the form of an initial guess about $x$ with some associated covariance, we can write it as:
$$\bar{x} = x + \eta, \bar{P} = E[\eta \eta^T]$$
Now we can write a new measurement system:
$$\tilde{y} = \tilde{H}x + \tilde{\epsilon}$$
Where $\tilde{y} = [y,\ \bar{x}]^T$, $\tilde{H} = [H,\ I]^T$, $\tilde{\epsilon} = [\epsilon,\ \eta]^T$
We can now use the original LUMVE formulation:
$$\hat{x} = (\tilde{H}^T \tilde{R}^{-1} \tilde{H})^{-1} \tilde{H}^T \tilde{R}^{-1} \tilde{y}$$ $$\hat{P} = (\tilde{H}^T \tilde{R}^{-1} \tilde{H})^{-1}$$
Noting that:
$$\tilde{R}^{-1} = \begin{bmatrix}R^{-1} & 0\\0 & \bar{P}^{-1}\end{bmatrix}$$
If $\epsilon$ and $\eta$ are uncorrelated (that is, $E[\epsilon \eta^T] = E[\eta \epsilon^T] = 0$), performing the matrix multiplication shows:
$$\tilde{H}^T \tilde{R}^{-1} \tilde{H} = H^T R^{-1} H + \bar{P}^{-1}$$ $$\tilde{H}^T \tilde{R}^{-1} \tilde{y} = H^T R^{-1} y + \bar{P}^{-1} \bar{x}$$
So our final estimation is then:
$$\hat{x} = (H^T R^{-1} H + \bar{P}^{-1})^{-1} (H^T R^{-1} y + \bar{P}^{-1} \bar{x})$$
A few notes:
- $\bar{P}$ and $(H^T R^{-1} H + \bar{P}^{-1})$ must be invertible. Interestingly, though, $H^T R^{-1} H$ need not be invertible - meaning that we no longer need a full-rank $H$ matrix. In other words, $\bar{P}$ can be used to resolve singularities in your system.
- Remember that the formulation here relies on $\epsilon$ and $\eta$ being uncorrelated. However, the derivation would be similar (with a different result) if they were correlated.
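To wrap up, here’s a quick numeric sanity check (all the numbers below are made up) that stacking the prior as an extra block of “observations” reproduces the closed-form estimate above:

```python
import numpy as np

rng = np.random.default_rng(5)

# Made-up measurement system, noise covariance, and prior
H = np.column_stack([np.ones(8), np.linspace(0.0, 1.0, 8)])
R = np.diag(rng.uniform(0.01, 0.1, size=8))
y = rng.standard_normal(8)
x_bar = np.array([0.5, -0.2])          # a priori state
P_bar = np.diag([0.3, 0.3])            # a priori covariance

R_inv, P_bar_inv = np.linalg.inv(R), np.linalg.inv(P_bar)

# Direct formula: x_hat = (H^T R^-1 H + P_bar^-1)^-1 (H^T R^-1 y + P_bar^-1 x_bar)
x_hat_direct = np.linalg.solve(H.T @ R_inv @ H + P_bar_inv,
                               H.T @ R_inv @ y + P_bar_inv @ x_bar)

# Augmented system: stack the prior as one more block of observations
H_tilde = np.vstack([H, np.eye(2)])
y_tilde = np.concatenate([y, x_bar])
R_tilde_inv = np.block([[R_inv, np.zeros((8, 2))],
                        [np.zeros((2, 8)), P_bar_inv]])

x_hat_aug = np.linalg.solve(H_tilde.T @ R_tilde_inv @ H_tilde,
                            H_tilde.T @ R_tilde_inv @ y_tilde)

print(np.allclose(x_hat_direct, x_hat_aug))   # True
```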