Derive yourself a Kalman filter

If you have already tried to understand Kalman filters, you may very likely have found yourself dumbfounded, as I was when I did. Yet Kalman filters are actually quite simple. In this post, I will try to explain them as I would have liked them to be explained to me. What follows is a quite personal take on them and may hence fail to use the usual vocabulary of the field, or lack exhaustiveness, sorry about that.

Introduction

The classical example of the use of a Kalman filter is the following. Say you want to program a remote piloting interface for a small robot. This robot is moving around and we want to track its position. To track the position of this robot we have two possible sources of information:

We have access to some continuous measurement of the position of the robot (say GPS)
We also know the starting position of the robot and the movements that should have been done so far ("We have commanded the wheels to move x centimeters in such or such direction."). From this two things, we can compute the position where the robot should currently stand.

Now this two sources of information may disagree, and we are left with the question of how to merge them into one. One may wish to simply average all the estimators of the position we have access to, but a more rigorous analysis is possible.

Formal setting

Lets consider a dynamical system with state $x_{i}$ evolving in discrete steps, and that at each step, we can get some measure $y_{i}$ of the state. Then Kalman filters are an algorithm allowing you estimate the state $x_n$ of your system at step $n$, given measurements $y_1$ ... $y_n$, provided that your system follows the following constraints.

Affine system evolution + Gaussian noise: $$x_{n+1} = F_{n+1} \cdot x_{n} + u_n + v_{n}$$ where $F_{n+1}$ is a matrix, $u_n$ is a vector representing user input and $v_n$ is a centered random Gaussian variable (of appropriate dimension).
Linear measurements + Gaussian noise: $$y_n = H_{n} \cdot x_n + w_n$$ where $H_n$ is a matrix, and $w_n$ a centered random Gaussian variable.
Initial state $x_0$ is either normally distributed or fixed (if we abuse notation, we will say that a fixed value is distributed normally, but with a nul variance.)

The Great and Central Theorem of Kalman Filters

To compute the estimation of the system's state at each step, Kalman filters rely on the following simple Theorem:

Theorem

Provided that the assumptions from the previous section hold, then for any $n$, $x_{n}$ is normally distributed given $y_1, ..., y_{n}$. And there exist a formula to compute both the variance and the mean of this distribution at each step.

A remark here is that it is fairly easy to conclude that $x_n$ and $y_n$ are normally distributed (by induction on $n$, and the fact that all operations preserve Gaussianness, see below), what is less obvious is the conditional distribution of $x_n$ knowing $y_n$. This conditional expectation however stems directly from the Gaussianness of $(x_n,y_n)$.

From now on, upper case symbols will denote random variable and their lower case counterpart will denote a realization of this random variable.

This section will be dedicated to the proof of the previous theorem and to outlining the formulas of the Kalman filter. We will first list several "Gaussianness preserving" operations, and their effect on the variance and mean, and we will then show that the posterior distribution of $x_n$ knowing $y_1$ ... $y_n$ is Gaussian and give its moment by successively applying operations listed before.

It is worth reminding here that a Gaussian distribution only has a probability density if its covariance matrix is definite (ie. invertible).

Some syntaxic sugar

To be able to better reason about these Gaussianness preserving operations in what follows, let me first introduce some syntaxic sugar.

Let $A$ and $B$ be two random variables. We can define $A|B$ as being a function which, to a value $b$ of $B$ associate the random variable $A|(B=b)$, ie. the random variable which takes values in the same domain as $A$ but with probability: $$P(A|B(b) \in a) = P((A|(B=b)) ) \in a) = P(A \in a | B=b)$$

For any function $f$ we can define $f(A|B)$ by: $$f(A|B)(b) = f(A|B(b))$$. Indeed, as $A|B(b)$ is a random variable, most properties of random variables can be lifted to these "conditioned random variables". For example, we can straightforwardly define characteristic functions, covariance, expectancy. This quantities will also be functions of $b$.

As well, we say that $A|B$ is multinormal if for all $b$, $A|b(b)$ is.

Gaussianness preserving operations

Affine function: Let $A$, $B$ be random variable, $A$ of dimension $n$, so that $A|B$ is multi-normally distributed with mean $\mu$ and covariance matrix $\Sigma$. Let $F$ be a $m \times n$ matrix, and $H$ be an $m$-dimensional vector. Then: $F \cdot A|B + H$ is normally distributed, with mean $F \mu + H$ and variance $F \Sigma F^T$.
Sum of two independent multi normal variables Let $A$, $B$, $C$, be random variables. $A$ and $B$ of dimension $m$, so that $A|C$ and $B|C$ are normally distributed and independent (for every value of $B$), with means $\mu_A$ and $\mu_B$ and covariances $\Sigma_A$ and $\Sigma_B$. Then $(A + B)|C$ is normally distributed, with mean $\mu_A + \mu_B$ and covariance $\Sigma_A + \Sigma_B$.
Conditional distribution of two jointly Gaussian variable. Let $A$, $B$, $C$ be random variables, so that $A|C$ and $B|C$ are jointly Gaussian with mean $\mu$ and covariance $\Sigma$: $$ \mu = \begin{pmatrix}\mu_A \\ \mu_B\end{pmatrix} \ \ \Sigma = \begin{pmatrix}\Sigma_{A,A} & \Sigma_{A,B} \\ \Sigma_{B,A} & \Sigma_{B,B}\end{pmatrix} $$ Then $A|(C,B=b)$ is also Gaussian, with mean $\mu'$ and covariance $\Sigma'$: $$ \mu' = \mu_A + \Sigma_{A,B}\Sigma_{B,B}^{-1}(b - \mu_B) $$ $$ \Sigma' = \Sigma_{A,A} - \Sigma_{A,B}\Sigma_{B,B}^{-1}\Sigma_{B,A} $$ where "${}^{-1}$" denotes the generalized inverse.

Building the recursive estimate.

Let n be an integer. Let's assume that $X_n|Y_1,...,Y_n$ is Gaussian, with mean $\mu_n$ and covariance $\Sigma_n$.

We know that $X_{n+1} = F_{n+1}X_{n} + u_n + v_n$, hence: $$X_{n+1}|Y_1,...,Y_n = F_{n+1}(X_{n}|Y_1,...,Y_n) + u_n + V_n$$ As this quantity is only made by successively applying affine functions and summing independent Gaussian variables, we have that $X_{n+1}|Y_1,...,Y_n$ is normally distributed with moments: $$ \mu_X = F_{n+1}\mu_n + u_n $$ $$ \Sigma_X = F_{n+1}\Sigma_{n}F_{n+1}^T + Cov(V_n) $$

Now, let's find out the joint distribution of $X_{n+1},Y_{n+1}$ . We have:

$$\begin{pmatrix} X_{n+1} \\ Y_{n+1} \end{pmatrix} = \begin{pmatrix} Id\\ H_{n+1} \end{pmatrix} \cdot X_{n+1} + \begin{pmatrix} 0\\ W_{n+1} \end{pmatrix}$$,

This expression is as previously the result of successively applying an affine function and summing independent Gaussian variables. Hence $X_{n+1}, Y_{n+1} | Y_1...Y_n$ are jointly Gaussian the mean $\mu_J$ and covariance $\Sigma_J$ of this joint distribution are: $$ \mu_J = \begin{pmatrix} Id\\ H_{n+1} \end{pmatrix} \cdot \mu_X $$ $$ \Sigma_J = \begin{pmatrix} Id\\ H_{n+1} \end{pmatrix} \Sigma_X \begin{pmatrix} Id & H_{n+1}^T \end{pmatrix} + \begin{pmatrix}0 & 0\\ 0 & Cov(W_{n+1})\end{pmatrix} $$

They simplify to:

$$ \mu_J = \begin{pmatrix} \mu_X\\ H_{n+1}\mu_X \end{pmatrix} $$ $$ \Sigma_J = \begin{pmatrix}\Sigma_X & \Sigma_X H_{n+1}^T\\ H_{n+1}\Sigma_X & H_{n+1}\Sigma_X H_{n+1}^T + Cov(W_n)\end{pmatrix} $$

Given that $X_{n+1}, Y_{n+1} | Y_1...Y_n$ are jointly Gaussian, per our third property, the posterior $X_{n+1} | Y_1...Y_{n+1}$ is as well, and when $Y_{n+1} = y_{n+1}$, its moments are:

$$ \mu_{n+1} = \mu_X + \Sigma_X H_{n+1}^T \left(H_{n+1}\Sigma_X H_{n+1}^T + Cov(W_n)\right)^{-1}(y_{n+1}-H_{n+1}\mu_X) $$ $$ \Sigma_{n+1} = \Sigma_X - \Sigma_X H_{n+1}^T \left(H_{n+1}\Sigma_X H_{n+1}^T + Cov(W_n)\right)^{-1}H_{n+1}\Sigma_X $$

Recap

Here we have obtained the following recursive formulas for the moments of the posterior at step $n$ (I leave intermediate variables for readability):

$$ \mu_X = F_{n+1}\mu_n + u_n $$ $$ \Sigma_X = F_{n+1}\Sigma_{n}F_{n+1}^T + Cov(V_n) $$ $$ \mu_{n+1} = \mu_X + \Sigma_X H_{n+1}^T \left(H_{n+1}\Sigma_X H_{n+1}^T + Cov(W_n)\right)^{-1}(y_{n+1}-H_{n+1}\mu_X) $$ $$ \Sigma_{n+1} = \Sigma_X - \Sigma_X H_{n+1}^T \left(H_{n+1}\Sigma_X H_{n+1}^T + Cov(W_n)\right)^{-1}H_{n+1}\Sigma_X $$

These formulas are the usual one of Kalman filters, and, going back to our robot example, give a rigorous way of estimating the position of our robot. If the dynamic of your system is not linear, then you can use what is called an "extended Kalman filter", which is just a fancy name for linearizing the dynamic of your system (i.e. taking $F_i$ equal to the Jacobian) and applying the usual Kalman filter.