Expectation Maximization, Part 1: Motivation and Recipe

Introduction

This is the first in a series of posts on Expectation Maximization (EM) type algorithms. Our goal will be to motivate some of the theory behind these algorithms. In later posts we will implement examples in C++, often with the help of the Eigen linear algebra library.

Maximum likelihood

A large subset of statistics is concerned with determining properties of a distribution from data that is assumed to be generated by that distribution. A common example is Maximum Likelihood Estimation (MLE). Here one assumes that a vector of observed data \(\vec{x}\in\mathbb{R}^N\) is the realization of a random vector \(\vec{X}\) with a probability density \(p(\vec{X} \ | \ \theta)\) that depends on a vector of parameters \(\theta\). MLE amounts to estimating \(\theta\) with the value that makes this probability density as high as possible for the observed data:

\[ \hat{\theta} := \text{argmax}_{\theta} \ \ p(\vec{x} \ | \ \theta) \] As a function of \(\theta\), the density \(\mathcal{L}(\theta; \vec{x}) := p(\vec{x} \ | \ \theta)\) is called the likelihood. Because probability densities are positive for the realized values \(\vec{x}\) of \(\vec{X}\), and the logarithm is monotonically increasing, the above problem is equivalent to maximizing the logarithm of the likelihood: \[ \hat{\theta} := \text{argmax}_{\theta} \ \ \log(p(\vec{x} \ | \ \theta)) \] (The main practical reason for this log transformation is that it often makes the problem numerically easier. The theoretical advantage is that it ties MLE to the theory of the Fisher Information.)
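As a concrete preview of the die example used later in the post: if we fully observe a batch of die throws, maximizing the log-likelihood of a categorical model gives the relative frequencies of the faces. A minimal C++ sketch (the function name is ours, chosen for illustration):

```cpp
#include <array>
#include <vector>

// MLE for a categorical (die) distribution: with fully observed faces,
// maximizing the log-likelihood yields the relative frequencies.
std::array<double, 6> mleDieProbabilities(const std::vector<int>& faces) {
    std::array<double, 6> theta{};                    // counts, then normalized
    for (int f : faces) theta[f - 1] += 1.0;          // faces are 1..6
    for (double& p : theta) p /= static_cast<double>(faces.size());
    return theta;
}
```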

Dependence structures and problems with hidden variables

The situation in the last section can be summarized by the simple dependence structure (or Markov diagram) \(\theta \to \vec{X}\). That is, given the value of \(\theta\) we can determine the distribution of \(\vec{X}\), namely \(p(\vec{X} \ | \ \theta)\).

However, in many applications we may only have partial observations of the data we want, with some of the relevant information remaining unobserved/hidden. For example, suppose 100 identical and independent dice are thrown in an experiment. The dice are not necessarily uniformly weighted; the probabilities of landing on 1, 2, …, 6 are given by \(\theta = [p_1, p_2,...,p_6]\). Suppose the dice land with values represented by \(\vec{X} = [X_1, X_2, ... X_{100}]\), with \(X_i\) being the number the \(i^{th}\) die lands on. Suppose also that in the experiment we are only able to observe whether each die landed on an even or odd number. That is, we observe a vector \(\vec{Y}\) given by

\[ Y_i = X_i \mod 2 \]

for \(i \in \{1, 2, ... 100\}\). In this case the dependence structure is a little more complex: \(\theta \to \vec{X} \to \vec{Y}\). That is, once we know the value of \(\vec{X}\) we can fully specify the distribution of \(\vec{Y}\) without knowing the value of \(\theta\). We would have the Markov property for densities:

\[ p(\vec{y} \ | \ \vec{x}, \theta) = p(\vec{y} \ | \ \vec{x}) \]

In general, we have a dependence structure given by \(\theta \to \vec{X} \to \vec{Y}\), we observe only \(\vec{Y}=\vec{y}\), and we want to estimate the parameters \(\theta\). The MLE estimator would be: \[ \hat{\theta} := \text{argmax}_{\theta} \ \ \log(p(\vec{y} \ | \ \theta)) \] All of the theory of MLE applies in this case. In the example above this would be relatively easy. However, there are times when this maximization problem is very difficult. Often the density \(p(\vec{y} \ | \ \theta)\) is much more complicated than the density \(p(\vec{x} \ | \ \theta)\) of the hidden data that we wish we had.
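In the parity-observed die example, the observed-data likelihood \(p(\vec{y}|\theta)\) depends on \(\theta\) only through \(p_{\text{even}} = p_2 + p_4 + p_6\). A hedged sketch of this log-likelihood (our own helper, using the convention \(Y_i = X_i \bmod 2\), so an even throw gives \(y_i = 0\)):

```cpp
#include <array>
#include <cmath>
#include <vector>

// Observed-data log-likelihood log p(y | theta) for the parity example:
// each y_i in {0, 1} (even/odd) contributes log(p_even) or log(p_odd),
// where p_even = p2 + p4 + p6.
double parityLogLikelihood(const std::vector<int>& y,
                           const std::array<double, 6>& theta) {
    double pEven = theta[1] + theta[3] + theta[5];  // faces 2, 4, 6
    double logLik = 0.0;
    for (int yi : y)
        logLik += std::log(yi == 0 ? pEven : 1.0 - pEven);
    return logLik;
}
```

Note that only \(p_{\text{even}}\) is identifiable from \(\vec{y}\) alone, which is exactly the kind of information loss EM is designed to work around.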

In such a situation, if we knew \(\vec{x}\) we could replace the above problem with \(\hat{\theta} := \text{argmax}_{\theta} \ \ \log(p(\vec{x} \ | \ \theta))\). In fact, we wouldn’t even need to know exactly what the value of \(\vec{x}\) is, but only what the value of \(\log(p(\vec{x} \ | \ \theta))\) is for a given \(\theta\).

A general recipe for EM algorithms

The idea of EM is indeed to try and maximize \(\log(p(\vec{x} \ | \ \theta))\) instead of \(\log(p(\vec{y} \ | \ \theta))\), but because we do not know \(\log(p(\vec{x} \ | \ \theta))\) to instead use an approximation/estimate of it.

How to approximate such an expression? The quantity \(\log(p(\vec{x} \ | \ \theta))\) is a random variable (depending on the unknown value \(\vec{x}\) of \(\vec{X}\)). To estimate it in a meaningful way we need to use the most informative distribution related to \(\vec{x}\). Because we know \(\vec{y}\) the best such distribution is \(p(\vec{x}|\vec{y},\theta)\).

The problem is that this distribution (or any other) will necessarily depend on \(\theta\), which we do not know! At first this seems like a circular trap, because to estimate \(\log(p(\vec{x} \ | \ \theta))\) using \(p(\vec{x}|\vec{y},\theta)\) we need to know \(\theta\), but to estimate \(\theta\) we need to know \(\log(p(\vec{x} \ | \ \theta))\). However, the trap hints at a solution: simply alternate between estimating the random variable \(\log(p(\vec{x} \ | \ \theta))\) using a current guess of \(\theta\), and then using this updated estimate of \(\log(p(\vec{x} \ | \ \theta))\) to update our guess of \(\theta\). More formally we can summarize EM in 5 steps:

  • Step 1: Let \(m = 0\). Make an initial estimate \(\theta_m\) for \(\theta\).

  • Step 2: Given the observed data \(\vec{y}\) and pretending for the moment that our current guess \(\theta_m\) is correct, construct the conditional probability distribution \(p(\vec{x}|\vec{y},\theta_m)\) of the hidden data \(\vec{x}\) given all known information.

  • Step 3: Using the distribution \(p(\vec{x}|\vec{y},\theta_m)\) construct an estimator/approximation of the desired log-likelihood \(\log(p(\vec{x} \ | \ \theta))\) for arbitrary \(\theta\). We denote this approximation by \(Q(\theta|\theta_m)\).

  • Step 4: Set \(\theta_{m+1}\) equal to a value of \(\theta\) that maximizes the current approximation \(Q(\theta|\theta_m)\) of \(\log(p(\vec{x} \ | \ \theta))\).

  • Step 5: Return to step 2 and repeat until some stopping criterion is met.1
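The five steps above can be sketched as a generic loop. The callables `buildQ` and `maximizeQ` are placeholders for steps 2–3 and step 4 respectively; this is an illustrative skeleton under our own naming, not the implementation we'll use in later posts:

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <functional>
#include <vector>

using Theta = std::vector<double>;
using QFunction = std::function<double(const Theta&)>;

// Generic EM skeleton. buildQ performs steps 2-3 (it returns the map
// theta -> Q(theta | theta_m)); maximizeQ performs step 4.
Theta runEM(Theta theta,                                    // step 1: initial guess
            const std::function<QFunction(const Theta&)>& buildQ,
            const std::function<Theta(const QFunction&)>& maximizeQ,
            int maxIterations, double tolerance) {
    for (int m = 0; m < maxIterations; ++m) {               // step 5: iterate
        QFunction q = buildQ(theta);                        // steps 2-3
        Theta next = maximizeQ(q);                          // step 4
        double change = 0.0;                                // simple stopping rule:
        for (std::size_t i = 0; i < theta.size(); ++i)      // max parameter change
            change = std::max(change, std::abs(next[i] - theta[i]));
        theta = std::move(next);
        if (change < tolerance) break;
    }
    return theta;
}
```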

Practically speaking, this algorithm would be applied when each of these steps is significantly easier than the original MLE problem of \(\hat{\theta} := \text{argmax}_{\theta} \ \ \log(p(\vec{y} \ | \ \theta))\). As a general example, this is often the case when the model is linear with respect to \(\vec{X}\), but the information loss of going from \(\vec{X}\) to \(\vec{Y}\) is nonlinear and non-invertible (we’ll give examples in later posts).

Constructing an estimator for \(\log(p(\vec{X} \ | \ \theta))\)

How do we fill in the blank left by step 3 above? That is, how do we use the probability density \(p(\vec{x}|\vec{y},\theta_m)\) to estimate the value of the random variable \(\log(p(\vec{X} \ | \ \theta))\)? Two possibilities come to mind.

Point-estimate type EM

One possibility is to let \[ \vec{x}_m = \vec{x}_m(\vec{y}, \theta_m) := \text{argmax}_{\vec{x}} \ p(\vec{x}|\vec{y},\theta_m) \] and then define \[ Q(\theta | \theta_m) := \log(p(\vec{x}_m \ | \ \theta)) \] This is called point-estimate EM. Here we use “the most likely” value of \(\vec{X}\) as determined by the density \(p(\vec{x}|\vec{y},\theta_m)\) and then impute this value into our log-likelihood that we want to maximize. Another possibility would be to let

\[ \vec{x}_m = \vec{x}_m(\vec{y}, \theta_m) := \ \text{E}_{\vec{X} \ | \ \vec{Y} \ = \ \vec{y}, \ \theta_m} \big[ \vec{X}\big] \] and let \(Q\) be as before.

The idea of these types of EM algorithms is to first estimate the missing data \(\vec{x}\) and then impute the result into \(\log(p(\vec{x}\ |\ \theta))\).
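For the die/parity example, the point-estimate E-step reduces to picking the most probable face consistent with the observed parity. A small sketch (the function name is ours; faces are 1..6 and \(y = x \bmod 2\)):

```cpp
#include <array>

// Point-estimate E-step for the die/parity example: the most likely
// hidden face consistent with the observed parity y under theta_m.
int mostLikelyFace(int y, const std::array<double, 6>& theta) {
    int best = -1;
    double bestProb = -1.0;
    for (int face = 1; face <= 6; ++face) {
        if (face % 2 != y) continue;          // parity must match y
        if (theta[face - 1] > bestProb) {
            bestProb = theta[face - 1];
            best = face;
        }
    }
    return best;
}
```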

Expectation EM (i.e. standard EM)

There is a theoretically more elegant way. As mentioned earlier, we do not in fact need an estimate of the missing data \(\vec{x}\). One of the best ways to estimate the value of a random variable with respect to a conditional distribution is to simply compute the conditional expectation of that variable with respect to that conditional distribution:

\[ Q(\theta | \theta_m) \ := \ \text{E}_{\vec{X} \ | \ \vec{Y} \ = \ \vec{y}, \ \theta_m} \big[ \log(p(\vec{X} \ | \ \theta)) \big] = \int_{\mathcal{X}} \log(p(\vec{x} \ | \ \theta)) \ p(\vec{x}|\vec{y},\theta_m) \ d\vec{x} \] Here we’re computing the mean of \(\log(p(\vec{X} \ | \ \theta))\) with respect to the density \(p(\vec{x}|\vec{y},\theta_m)\). As is common when using expectations, this second method has some advantages we’ll see later. When we refer to EM we will always mean this case, unless otherwise specified.
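For the die/parity example this expectation works out cleanly: the E-step spreads each observation over the three faces matching its parity in proportion to \(\theta_m\), and maximizing \(Q\) for a categorical model gives the normalized expected counts. A sketch of one full EM update (our own illustration, not the later posts' implementation):

```cpp
#include <array>
#include <vector>

// One EM update for the die/parity example using the expectation form of Q.
std::array<double, 6> emUpdate(const std::vector<int>& y,
                               const std::array<double, 6>& thetaM) {
    std::array<double, 6> expectedCounts{};
    for (int yi : y) {
        // Total mass on faces whose parity matches y_i (face % 2 == y_i).
        double mass = 0.0;
        for (int face = 1; face <= 6; ++face)
            if (face % 2 == yi) mass += thetaM[face - 1];
        // E-step: p(x_i = face | y_i, theta_m) = theta_m[face] / mass.
        for (int face = 1; face <= 6; ++face)
            if (face % 2 == yi) expectedCounts[face - 1] += thetaM[face - 1] / mass;
    }
    // M-step: the argmax of Q for a categorical model is the normalized counts.
    std::array<double, 6> thetaNext{};
    for (int i = 0; i < 6; ++i)
        thetaNext[i] = expectedCounts[i] / static_cast<double>(y.size());
    return thetaNext;
}
```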

\(Q(\theta|\theta_m)\) for I.I.D. samples

So far everything has been rather general, applying to any random vectors \(\vec{X}\) and \(\vec{Y}\). Many problems assume that data are generated independently and identically distributed (I.I.D.) so it’s helpful to have a formulation for this particular case.

Proposition: Suppose that the components of \(\vec{X}\) are I.I.D. (given \(\theta\)), that is: \[ p(\vec{x}|\theta) = \prod_{i = 1}^N p(x_i|\theta) \ \qquad \forall \vec{x}, \theta \] Suppose also that the dependence structure \(\theta \to \vec{X} \to \vec{Y}\) splits into subgraphs as \(\theta \to X_i \to Y_i\) for all \(i = 1, 2, ..., N\). This just means that given \(X_i\), the distribution of \(Y_i\) is independent of all other variables: \[ p(y_i|\vec{x}, \theta, y_1, y_2, ..., y_{i-1}, y_{i+1}, ..., y_N) = p(y_i|x_i) \] Then \(Q(\theta|\theta_m):= \ \text{E}_{\vec{X} \ | \ \vec{Y} \ = \ \vec{y}, \ \theta_m} \big[ \log(p(\vec{X} \ | \ \theta)) \big]\) can be written as: \[ Q(\theta|\theta_m) = \sum_{i=1}^N Q_i(\theta|\theta_m) \] where \[ Q_i(\theta|\theta_m) := \ \text{E}_{X_i \ | \ Y_i \ = \ y_i, \ \theta_m} \big[ \log(p(X_i \ | \ \theta)) \big]. \] Proof: The proof begins by showing that the pairs \((X_i,Y_i)\) are independent across \(i\), that is: \[ p(\vec{x},\vec{y}|\theta) = \prod_{i = 1}^N p(x_i, y_i|\theta). \] To prove this we start by applying the multiplication theorem for probability densities: \[\begin{equation} \begin{aligned} p(\vec{x},\vec{y}|\theta) &= p(y_1|y_2,y_3, ..., y_N, \vec{x},\theta)...p(y_N|\vec{x},\theta)p(\vec{x}|\theta) \qquad \text{by the multiplication theorem}\\ &= p(y_1|x_1,\theta)...p(y_N|x_N,\theta)p(\vec{x}|\theta) \qquad \text{by conditional independence (keeping } \theta \text{)} \\ &= p(\vec{x}|\theta)\prod_{i=1}^Np(y_i|x_i,\theta) \\ &=\prod_{i=1}^Np(y_i|x_i,\theta)p(x_i|\theta) \qquad \text{by independence of the } x_i \\ &= \prod_{i=1}^Np(y_i,x_i|\theta). \end{aligned} \end{equation}\]

Next we have that for each \(i\)

\[\begin{equation} \begin{aligned} p(x_i|\vec{y},\theta) &= \frac{p(x_i,\vec{y}|\theta)}{p(\vec{y}|\theta)} \qquad \text{by Bayes} \\ &= \frac{\int p(\vec{x},\vec{y}|\theta)dx_1...dx_{i-1}dx_{i+1}...dx_n}{\int p(\vec{x},\vec{y}|\theta)d\vec{x}} \\ &= \frac{\int \prod_{j=1}^Np(y_j,x_j|\theta)dx_1...dx_{i-1}dx_{i+1}...dx_n}{\int \prod_{j=1}^Np(y_j,x_j|\theta)d\vec{x}} \\ &= \frac{p(x_i,y_i|\theta) \prod_{j=1, j \ne i}^N \int p(y_j,x_j|\theta)dx_j}{\prod_{j=1}^N\int p(y_j,x_j|\theta) dx_j} \\ &= \frac{p(x_i,y_i|\theta) \prod_{j=1, j \ne i}^N p(y_j|\theta)}{\prod_{j=1}^N p(y_j|\theta)}\\ &= \frac{p(x_i,y_i|\theta)}{p(y_i|\theta)}\\ &= p(x_i|y_i,\theta) \end{aligned} \end{equation}\]

Hence \(p(x_i|\vec{y},\theta) = p(x_i|y_i,\theta)\). Therefore we have

\[\begin{equation} \begin{aligned} Q(\theta | \theta_m) &= \text{E}_{\vec{X} \ | \ \vec{Y} \ = \ \vec{y}, \ \theta_m} \big[ \log(p(\vec{X} \ | \ \theta)) \big] \\ &= \text{E}_{\vec{X} \ | \ \vec{Y} \ = \ \vec{y}, \ \theta_m} \big[ \log(\prod_{i = 1}^N p(X_i \ | \ \theta)) \big] \\ &= \text{E}_{\vec{X} \ | \ \vec{Y} \ = \ \vec{y}, \ \theta_m} \big[ \sum_{i = 1}^N \log(p(X_i \ | \ \theta)) \big]\\ &= \sum_{i = 1}^N \text{E}_{X_i \ | \ \vec{Y} \ = \ \vec{y}, \ \theta_m} \big[ \log(p(X_i \ | \ \theta)) \big] \\ &= \sum_{i = 1}^N \text{E}_{X_i \ | \ Y_i \ = \ y_i, \ \theta_m} \big[ \log(p(X_i \ | \ \theta)) \big] = \sum_{i=1}^N Q_i(\theta|\theta_m) \end{aligned} \end{equation}\] where we used \(p(x_i|\vec{y},\theta) = p(x_i|y_i,\theta)\) in the second-to-last equality. QED

Maximum A Posteriori EM and regularizing priors

The EM algorithm extends easily to regularized MLE. In the Bayesian setting, whose terminology we adopt, this is usually referred to as Maximum A Posteriori (MAP) estimation. Here the penalty term is interpreted as the log of the prior density on \(\theta\), and the function to be maximized is the posterior density of \(\theta\) given the data. By Bayes’ Theorem \[ p(\theta|\vec{y}) \propto p(\vec{y}|\theta)p(\theta) \] where the proportionality constant is independent of \(\theta\). Thus the MAP estimator is \[ \hat{\theta}_{MAP} := \text{argmax}_{\theta} \ \log(p(\theta|\vec{y})) \ = \text{argmax}_{\theta} \ \log(p(\vec{y}|\theta)) + \log(p(\theta)) \] The way to extend EM to this situation is clear (at least formally): simply replace the maximization step (step 4 above) with maximizing \(Q(\theta|\theta_m) + \log(p(\theta))\) instead of simply \(Q(\theta|\theta_m)\):

\[ \theta_{m+1} := \text{argmax}_{\theta} \ \ \ Q(\theta|\theta_m) + \log(p(\theta)) \] where as before \(Q\) is given by

\[ Q(\theta | \theta_m) \ := \ \text{E}_{\vec{X} \ | \ \vec{Y} \ = \ \vec{y}, \ \theta_m} \big[ \log(p(\vec{X} \ | \ \theta)) \big] = \int_{\mathcal{X}} \log(p(\vec{x} \ | \ \theta)) \ p(\vec{x}|\vec{y},\theta_m) \ d\vec{x} \]
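For the die example, a natural illustrative choice of prior (ours, not the post's) is a Dirichlet with parameters \(\alpha_i \ge 1\); then maximizing \(Q(\theta|\theta_m) + \log(p(\theta))\) simply adds pseudo-counts \(\alpha_i - 1\) to the expected counts before normalizing. A sketch of this MAP M-step:

```cpp
#include <array>

// MAP M-step for the die example with a Dirichlet(alpha) prior on theta:
// maximizing Q(theta|theta_m) + log p(theta) adds pseudo-counts
// (alpha_i - 1) to the expected counts before normalizing.
// Assumes alpha_i >= 1 so the maximizer stays in the simplex interior.
std::array<double, 6> mapMStep(const std::array<double, 6>& expectedCounts,
                               const std::array<double, 6>& alpha) {
    std::array<double, 6> theta{};
    double total = 0.0;
    for (int i = 0; i < 6; ++i) {
        theta[i] = expectedCounts[i] + alpha[i] - 1.0;  // add pseudo-counts
        total += theta[i];
    }
    for (double& p : theta) p /= total;                 // normalize
    return theta;
}
```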

Monotonicity of the EM algorithm

At this point the reader has all the theory needed to begin applying EM where they believe it’s a good fit. Before we end the post, though, let’s mention at least one result showing that EM is indeed a generalization of MLE to the case of hidden data: under relatively weak assumptions, the algorithm causes the log-likelihood \(\log(p(\vec{y}|\theta_m))\) to be a non-decreasing sequence in \(m\).


  1. Step 3 allows for a whole family of such algorithms, one for each possible approximator to \(\log(p(\vec{x} \ | \ \theta))\). Step 4 can also be generalized. Since the value of \(\theta_{m+1}\) maximizes \(Q(\theta | \theta_m)\), we have \(Q(\theta_{m+1} | \theta_m) \ge Q(\theta_m | \theta_m)\). Instead of seeking to maximize \(Q(\theta | \theta_m)\) we may simply seek a value of \(\theta_{m+1}\) that improves on \(\theta_m\) in the sense of this inequality. For step 5, the stopping criteria are up to the implementer.↩︎
