Programmatically understanding Expectation Maximization

Maximize based on what you know, re-estimate!
ML
Author

Nipun Batra

Published

June 1, 2014

I was recently studying the Expectation Maximization (EM) algorithm from this well-cited Nature article. To be honest, I found it hard to get all the maths right initially, and I in fact had to resort to looking up a few forums to get a clear understanding of the algorithm. I decided to take a programming approach to clear up some of the concepts. Another reason to do this is the amount of calculation the algorithm requires (not difficult, but time consuming by hand).

Maximum Likelihood

We are given two coins, A and B, each with a certain probability of landing heads. We choose one of the coins at random (with equal probability) and toss it 10 times, noting down the heads-tails pattern. We also carefully record which coin was tossed. We repeat this procedure 5 times. The coin tosses observed in this setting are shown as case A in the figure below.

Our aim is to determine the probability of getting a head on coin A, and likewise for coin B. Intuitively, if we add up the number of heads observed when A was tossed and divide it by the total number of times A was tossed, we would get this number. This comes from the well-known principle of maximum likelihood.

This procedure of tossing a coin which may land as either heads or tails is an example of a Bernoulli trial. As per this Wiki page, its definition is as follows:

In the theory of probability and statistics, a Bernoulli trial (or binomial trial) is a random experiment with exactly two possible outcomes, “success” and “failure”, in which the probability of success is the same every time the experiment is conducted

When \(n\) such trials are performed, it is called a binomial experiment. In the case of the coin toss experiment, if we have:

  • \(n\) coin tosses
  • \(p\), the probability of a head in each toss (and hence \(1-p\), the probability of a tail)

then the probability of observing \(k\) heads is given by:

\[{n\choose{k}} p^k(1-p)^{n-k}\]

Let us write some code to see how this function varies. We fix \(n=10\) and observe how the probability mass function (pmf) varies with \(p\).

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns
from ipywidgets import StaticInteract, RangeWidget
%matplotlib inline
a = range(11)
def plot_binomial(p=0.5):
    fig, ax = plt.subplots(figsize=(4,3))
    y = [0]*11
    # PMF of observing i heads in 10 tosses when the head probability is p
    for i in a:
        y[i] = stats.binom.pmf(i, 10, p)
    ax.bar(a, y, label="$p = %.1f$" % p)
    ax.set_ylabel('PMF of $k$ heads')
    ax.set_xlabel('$k$')
    ax.set_ylim((0, 0.5))
    ax.set_xlim((0, 10))
    ax.legend()
    return fig
StaticInteract(plot_binomial, p=RangeWidget(0.0, 1.0, 0.1))

As expected, as we increase \(p\), the probability mass shifts towards a higher number of heads. Now, let us look at the maximum likelihood problem. Say we are told that \(n\) coin tosses were made and \(x\) heads were obtained. Using this information and the form of the distribution, let us find which value of \(p\) would most likely have produced this data.

If we choose the probability of heads to be \(p\), the likelihood (probability) of obtaining \(x\) heads in \(n\) throws in our data \(D\) is: \[L(D|\theta) \propto p^x(1-p)^{n-x}\] If we differentiate this term with respect to \(p\) and equate it to 0, we get: \[ xp^{x-1}(1-p)^{n-x} - p^x (n-x)(1-p)^{n-x-1} = 0\] or \[ p^{x-1}(1-p)^{n-x-1}[x(1-p) - p(n-x)]= 0\] Since \(0 < p < 1\), the bracketed term must vanish: \[ x - xp - pn + xp = 0\] or \[ p = \frac{x}{n}\]
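
As a quick numerical sanity check of this result, we can evaluate the binomial likelihood over a grid of \(p\) values and confirm that it peaks at \(x/n\). The sketch below uses a hypothetical example of 7 heads in 10 tosses (this example is my own, not from the paper):

# Hypothetical example: x = 7 heads in n = 10 tosses; the MLE should be x/n = 0.7
n, x = 10, 7
p_grid = np.linspace(0, 1, 101)                 # candidate values of p
likelihood = stats.binom.pmf(x, n, p_grid)      # likelihood of the data at each p
print(p_grid[np.argmax(likelihood)])            # maximised at p ~ 0.7, i.e. x/n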

This also generalizes to the case where multiple sets of throws are made. Thus, for the data shown in the figure above, we find the probability of a head for coin A, given the data, as: \[ P (A=head|D) = \frac{\mathrm{heads~observed~when~A~was~thrown}}{\mathrm{times~A~was~thrown}}\] \[ = \frac{24}{24+6} = 0.8\] Similarly, for coin B, this ratio works out to 0.45.

Problem description

Now we come to the main problem. What if we didn’t note down which coin was tossed in which iteration? Can we still find the probabilities of heads for the two coins? It turns out that we can, using the EM algorithm. I would recommend reading the theory in the Nature paper before resuming this section. Below, we take the programming route through EM and its different components under this setting.

Creating the dataset

A head is labeled as a 1 and a tail as a 0. The five rows correspond to the five sets of throws.

observations = np.array([[1,0,0,0,1,1,0,1,0,1],
                         [1,1,1,1,0,1,1,1,1,1],
                         [1,0,1,1,1,1,1,0,1,1],
                         [1,0,1,0,0,0,1,1,0,0],
                         [0,1,1,1,0,1,1,1,0,1]])

The true coin choice: A = True, B = False.

coins_id = np.array([False,True,True,False,True])

Completely observed case

As discussed before, if we know which coin was used when, this task reduces to maximum likelihood estimation.

The sets of observations corresponding to coin A can be found as:

observations[coins_id]
array([[1, 1, 1, 1, 0, 1, 1, 1, 1, 1],
       [1, 0, 1, 1, 1, 1, 1, 0, 1, 1],
       [0, 1, 1, 1, 0, 1, 1, 1, 0, 1]])

Number of heads recorded when A was thrown

np.sum(observations[coins_id])
24

Thus, the probability of head for A given the data would be:

1.0*np.sum(observations[coins_id])/observations[coins_id].size
0.8

The same quantity for coin B would be the following:

1.0*np.sum(observations[~coins_id])/observations[~coins_id].size
0.45

Unseen data setting

Now, we follow Step b in the first figure on this page (which is Figure 1 from the Nature article).

Step 0: Assuming initial set of parameters

\[\theta_A^0 = 0.6\] \[\theta_B^0 = 0.5\]

Iteration 1, Step 1 : E-step

Let us take the first set of observations, in which we have 5 heads and 5 tails. The pmf of the binomial distribution is given by \({n\choose{k}}p^k(1-p)^{n-k}\). For coin A we have \(p=\theta^0_A=0.6\) and \(n=10\). We calculate the pmf as follows:

coin_A_pmf_observation_1 = stats.binom.pmf(5,10,0.6)
coin_A_pmf_observation_1
0.20065812480000034

Similarly, for coin B, we get

coin_B_pmf_observation_1 = stats.binom.pmf(5,10,0.5)
coin_B_pmf_observation_1
0.24609375000000025

Since coin_B_pmf_observation_1 is greater than coin_A_pmf_observation_1, coin B is more likely to have produced this sequence of 5 heads and 5 tails. We now normalize these pmfs so that they sum to 1 and use them to weight our observations.

normalized_coin_A_pmf_observation_1 = coin_A_pmf_observation_1/(coin_A_pmf_observation_1+coin_B_pmf_observation_1)
print("%0.2f" % normalized_coin_A_pmf_observation_1)
normalized_coin_B_pmf_observation_1 = coin_B_pmf_observation_1/(coin_A_pmf_observation_1+coin_B_pmf_observation_1)
print("%0.2f" % normalized_coin_B_pmf_observation_1)
0.45
0.55

We now weight the observations according to this ratio. Thus, for observation set 1, we have:

weighted_heads_A_observation_1 = 5*normalized_coin_A_pmf_observation_1
print("Coin A Weighted count for heads in observation 1: %0.2f" % weighted_heads_A_observation_1)
weighted_tails_A_observation_1 = 5*normalized_coin_A_pmf_observation_1
print("Coin A Weighted count for tails in observation 1: %0.2f" % weighted_tails_A_observation_1)
weighted_heads_B_observation_1 = 5*normalized_coin_B_pmf_observation_1
print("Coin B Weighted count for heads in observation 1: %0.2f" % weighted_heads_B_observation_1)
weighted_tails_B_observation_1 = 5*normalized_coin_B_pmf_observation_1
print("Coin B Weighted count for tails in observation 1: %0.2f" % weighted_tails_B_observation_1)
Coin A Weighted count for heads in observation 1: 2.25
Coin A Weighted count for tails in observation 1: 2.25
Coin B Weighted count for heads in observation 1: 2.75
Coin B Weighted count for tails in observation 1: 2.75

We can similarly find the weighted counts in each observation set for both coins A and B. For now, we will pick up these numbers from the paper, though we can also compute them ourselves, as sketched below.
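
The following sketch (variable names are my own) loops over all five observation sets and accumulates the weighted counts, which should reproduce the paper's totals of roughly 21.3 heads and 8.6 tails for coin A:

theta_A, theta_B = 0.6, 0.5
counts = {'A': {'H': 0, 'T': 0}, 'B': {'H': 0, 'T': 0}}
for observation in observations:
    num_heads = observation.sum()
    num_tails = len(observation) - num_heads
    # likelihood of this observation set under each coin
    pmf_A = stats.binom.pmf(num_heads, len(observation), theta_A)
    pmf_B = stats.binom.pmf(num_heads, len(observation), theta_B)
    # normalized weights: the responsibility of each coin for this set
    weight_A = pmf_A/(pmf_A + pmf_B)
    weight_B = pmf_B/(pmf_A + pmf_B)
    counts['A']['H'] += weight_A*num_heads
    counts['A']['T'] += weight_A*num_tails
    counts['B']['H'] += weight_B*num_heads
    counts['B']['T'] += weight_B*num_tails
print(counts['A'])   # roughly 21.3 heads and 8.6 tails, matching the paper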

Iteration 1, Step 2: M-step

We now take the sum of the weighted counts of heads and tails for each coin. For coin A, we have 21.3 heads and 8.6 tails. Thus, the estimated probability of getting a head on coin A at the end of this iteration is:

21.3/(21.3+8.6)
0.7123745819397994

We can find the same quantity for coin B as well (see the short check below). Next, we repeat the whole procedure using these newly calculated probabilities of obtaining heads for coins A and B.
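
Using the counts accumulated in the E-step sketch above, the corresponding update for coin B would be:

counts['B']['H']/(counts['B']['H'] + counts['B']['T'])   # comes out to roughly 0.58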

EM single iteration

Let us now write a procedure to perform a single iteration of the EM algorithm. The function takes the following arguments:

  • the priors \(\theta_A\) and \(\theta_B\)
  • the observation matrix (5 X 10 in this case)

and outputs the new set of priors after one EM iteration.

def em_single(priors, observations):
    """
    Performs a single EM step
    Arguments
    ---------
    priors : [theta_A, theta_B]
    observations : [m X n matrix]
    
    Returns
    --------
    new_priors: [new_theta_A, new_theta_B]
    """
    counts = {'A':{'H':0,'T':0}, 'B':{'H':0,'T':0}}
    theta_A = priors[0]
    theta_B = priors[1]
    # E step
    for observation in observations: 
        len_observation = len(observation)
        num_heads = observation.sum()
        num_tails = len_observation - num_heads
        contribution_A = stats.binom.pmf(num_heads,len_observation,theta_A)
        contribution_B = stats.binom.pmf(num_heads,len_observation,theta_B)
        weight_A = contribution_A/(contribution_A+contribution_B)
        weight_B = contribution_B/(contribution_A+contribution_B)
        # Incrementing counts
        counts['A']['H']+= weight_A*num_heads
        counts['A']['T']+= weight_A*num_tails
        counts['B']['H']+= weight_B*num_heads
        counts['B']['T']+= weight_B*num_tails
    # M step
    new_theta_A = counts['A']['H']/(counts['A']['H']+counts['A']['T'])
    new_theta_B = counts['B']['H']/(counts['B']['H']+counts['B']['T'])
    return [new_theta_A, new_theta_B]
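
As a quick check, a single step from the starting parameters used above should reproduce the first-iteration estimates we worked out by hand (roughly 0.71 for coin A and 0.58 for coin B):

em_single([0.6, 0.5], observations)   # expect approximately [0.71, 0.58]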

EM procedure

This procedure calls the single EM iteration until convergence or until a stopping condition is met. We specify two stopping conditions, and the procedure stops when either is satisfied:

  • a maximum of 10000 iterations
  • the change in the prior is less than 1e-6

def em(observations, prior, tol=1e-6, iterations=10000):
    iteration = 0
    while iteration < iterations:
        new_prior = em_single(prior, observations)
        # stop if the estimate of theta_A has changed by less than tol
        delta_change = np.abs(prior[0]-new_prior[0])
        if delta_change < tol:
            break
        else:
            prior = new_prior
            iteration += 1
    return [new_prior, iteration]

Results

Let us run the algorithm with the priors used in the paper:

em(observations, [0.6,0.5])
[[0.79678875938310978, 0.51958393567528027], 14]

Great! Our results match the paper exactly! It took the EM algorithm 14 iterations to reach these values.

What if we reverse the priors for A and B?

em(observations, [0.5,0.6])
[[0.51958345063012845, 0.79678895444393927], 15]

Ok, so EM has no notion of A and B! As far as it is concerned, there exist two coins, and it again finds the same estimates, only with the labels swapped.

What if the priors for both A and B were equal?

em(observations, [0.3,0.3])
[[0.66000000000000003, 0.66000000000000003], 1]

So this clearly is not a good initialization strategy! With identical starting values, the E-step assigns a weight of 0.5 to each coin for every observation set, so both estimates collapse to the overall fraction of heads and stay stuck there.
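
Indeed, 0.66 is simply the overall fraction of heads in our data:

1.0*observations.sum()/observations.size   # 33 heads out of 50 tosses = 0.66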

What if one of the priors is very close to 1?

em(observations, [0.9999,0.00000001])
[[0.79678850504581944, 0.51958235686544463], 13]

So EM is still smart enough!

References

  1. Question on Math Stack Exchange
  2. Another question on Math Stack Exchange

If you have any suggestions, feel free to let me know.