import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns
from ipywidgets import StaticInteract, RangeWidget
%matplotlib inline
I was recently studying the Expectation Maximization (EM) algorithm from this well-cited Nature article. To be honest, I initially found it hard to get all the maths right, and in fact I had to resort to a few forums to get a clear understanding of the algorithm. I decided to take a programming approach to clear up some of the concepts. Another reason for doing this is the amount of calculation the algorithm requires: the steps are not difficult, but they can be time consuming to work out by hand.
Maximum Likelihood
We are given two coins, A and B. Each coin has a certain probability of landing heads. We choose one of the coins at random (with equal probability) and toss it 10 times, noting down the heads-tails pattern. We also carefully record which coin was tossed. We repeat this procedure 5 times. The coin tosses observed in this case are shown in the figure below (case a).
Our aim is to determine the probability of getting a head on coin A, and likewise for coin B. Intuitively, if we add up the number of heads observed when A was tossed and divide it by the total number of times A was tossed, we would get this number. This comes from the well-known principle of maximum likelihood.
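As a quick illustration of this counting intuition, here is a minimal simulation sketch (not part of the original article); the true head probability (0.8) and the number of tosses (1000) are made-up values, and the snippet relies on the numpy import at the top of the post:

```python
# Minimal sketch: estimate P(head) of a single coin by counting.
# The true probability (0.8) and the number of tosses (1000) are made-up values.
p_true = 0.8
tosses = np.random.binomial(1, p_true, 1000)   # 1 = head, 0 = tail
print("Maximum likelihood estimate of P(head): %.3f" % tosses.mean())
```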
This procedure of tossing a coin which may land as either heads or tails is an example of a Bernoulli trial. As per this Wiki page, its definition is as follows:
In the theory of probability and statistics, a Bernoulli trial (or binomial trial) is a random experiment with exactly two possible outcomes, “success” and “failure”, in which the probability of success is the same every time the experiment is conducted
When \(n\) such trials are performed, it is called a binomial experiment. In the case of the coin toss experiment, if we have:
- \(n\) coin tosses
- \(p\) probability of a head in each trial, and hence \(1-p\) probability of a tail in each trial,
then the probability of observing \(k\) heads is:
\[{n\choose{k}} p^k(1-p)^{n-k}\]
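As a small sanity check (not from the original article), we can confirm that `stats.binom.pmf` agrees with the explicit formula above; `scipy.special.comb` is used here for the binomial coefficient, and the values of \(n\) and \(p\) are arbitrary:

```python
from scipy.special import comb

n, p = 10, 0.6   # arbitrary values for this check
for k in range(n + 1):
    explicit = comb(n, k) * p**k * (1 - p)**(n - k)   # n-choose-k * p^k * (1-p)^(n-k)
    assert np.isclose(explicit, stats.binom.pmf(k, n, p))
```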
Let us write some code to see how this function varies. We fix \(n=10\), vary \(p\), and observe how the probability mass function (PMF) changes.
a = range(11)

def plot_binomial(p=0.5):
    fig, ax = plt.subplots(figsize=(4, 3))
    y = [0]*11
    for i in a:
        y[i] = stats.binom.pmf(i, 10, p)   # PMF of observing i heads in 10 tosses
    ax.bar(a, y, label="$p = %.1f$" % p)
    ax.set_ylabel('PMF of $k$ heads')
    ax.set_xlabel('$k$')
    ax.set_ylim((0, 0.5))
    ax.set_xlim((0, 10))
    ax.legend()
    return fig

StaticInteract(plot_binomial, p=RangeWidget(0.0, 1.0, 0.1))
As expected, as we increase \(p\), there is a higher probability of getting more heads. Now, let us look at the maximum likelihood problem. Say we are told that \(n\) coin tosses were made and \(x\) heads were obtained. Using this information and the form of the distribution, let us find the value of \(p\) that would most likely have resulted in this data.
If we choose the probability of heads to be \(p\), the likelihood (probability) of obtaining \(x\) heads in \(n\) throws, i.e. of our data \(D\), is: \[L(D|\theta) \propto p^x(1-p)^{n-x}\] If we differentiate this term with respect to \(p\) and equate it to 0, we get: \[ xp^{x-1}(1-p)^{n-x} - p^x (n-x)(1-p)^{n-x-1} = 0\] or \[ p^{x-1}(1-p)^{n-x-1}[x(1-p) - p(n-x)]= 0\] or \[ x - xp - pn + xp = 0\] or \[ p = \frac{x}{n}\]
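As a rough numerical cross-check of this derivation (not part of the original article), we can evaluate the likelihood on a grid of \(p\) values for some made-up numbers, say \(n=10\) and \(x=7\), and confirm that the maximum sits at \(x/n\):

```python
# Numerical check: the likelihood p^x (1-p)^(n-x) should peak at p = x/n.
# n and x below are made-up illustrative values.
n, x = 10, 7
p_grid = np.linspace(0.001, 0.999, 999)
likelihood = p_grid**x * (1 - p_grid)**(n - x)
print("Likelihood is maximized at p = %.3f (x/n = %.3f)" % (p_grid[np.argmax(likelihood)], 1.0*x/n))
```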
This also generalizes to the case where multiple sets of throws are made. Thus, for the figure shown above, we find the probability of a head for coins A and B, given the data, as: \[ P (A=\mathrm{head}|D) = \frac{\mathrm{heads~observed~when~A~was~tossed}}{\mathrm{total~tosses~made~with~A}}\] \[ = \frac{24}{24+6} = 0.8\] Similarly, for coin B, we can find this ratio to be 0.45.
Problem description
Now, we come to the main problem. What if we didn’t note down which coin was tossed in which iteration? Can we still find the probabilities of heads for the two coins? It turns out that we can, using the EM algorithm. I would recommend reading the theory in the Nature paper before continuing with this section. Below, we present the programming route to EM and its different components in this setting.
Creating the dataset
A head is labeled as a 1 and a tail as a 0. The five rows correspond to the five sets of throws.
observations = np.array([[1,0,0,0,1,1,0,1,0,1],
                         [1,1,1,1,0,1,1,1,1,1],
                         [1,0,1,1,1,1,1,0,1,1],
                         [1,0,1,0,0,0,1,1,0,0],
                         [0,1,1,1,0,1,1,1,0,1]])
The true coin choices: A = True, B = False.
coins_id = np.array([False, True, True, False, True])
Completely observed case
As discussed before, if we know which coin was used when, this task reduces to maximum likelihood estimation.
The sets of observations corresponding to coin A can be found as:
observations[coins_id]
array([[1, 1, 1, 1, 0, 1, 1, 1, 1, 1],
[1, 0, 1, 1, 1, 1, 1, 0, 1, 1],
[0, 1, 1, 1, 0, 1, 1, 1, 0, 1]])
The number of heads recorded when A was thrown:
np.sum(observations[coins_id])
24
Thus, the probability of head for A given the data would be:
1.0*np.sum(observations[coins_id])/observations[coins_id].size
0.8
The same quantity for coin B would be the following:
1.0*np.sum(observations[~coins_id])/observations[~coins_id].size
0.45
Unseen data settings
Now, we follow step (b) in the first figure on this page (which is Figure 1 from the Nature article).
Step 0: Assume an initial set of parameters
\[\theta_A^0 = 0.6\] \[\theta_B^0 = 0.5\]
Iteration 1, Step 1: E-step
Let us take the first group of observations. We have 5 heads and 5 tails. The PMF according to the binomial distribution is given by \({n\choose{k}}p^k(1-p)^{n-k}\). For coin A we have \(p=\theta^0_A=0.6\) and \(n=10\). We calculate the PMF as follows:
coin_A_pmf_observation_1 = stats.binom.pmf(5, 10, 0.6)
coin_A_pmf_observation_1
0.20065812480000034
Similarly, for coin B, we get
coin_B_pmf_observation_1 = stats.binom.pmf(5, 10, 0.5)
coin_B_pmf_observation_1
0.24609375000000025
Since coin_B_pmf_observation_1 is greater than coin_A_pmf_observation_1, coin B is more likely to have produced the sequence of 5 heads and 5 tails. We now normalize these PMFs so that they sum to 1 and use them to weigh our observations.
normalized_coin_A_pmf_observation_1 = coin_A_pmf_observation_1/(coin_A_pmf_observation_1+coin_B_pmf_observation_1)
print("%0.2f" % normalized_coin_A_pmf_observation_1)
normalized_coin_B_pmf_observation_1 = coin_B_pmf_observation_1/(coin_A_pmf_observation_1+coin_B_pmf_observation_1)
print("%0.2f" % normalized_coin_B_pmf_observation_1)
0.45
0.55
We now weigh the observations according to this ratio. Thus, for observation set 1, we have:
# Observation set 1 has 5 heads and 5 tails; both are split according to the weights above.
weighted_heads_A_observation_1 = 5*normalized_coin_A_pmf_observation_1
print("Coin A Weighted count for heads in observation 1: %0.2f" % weighted_heads_A_observation_1)
weighted_tails_A_observation_1 = 5*normalized_coin_A_pmf_observation_1
print("Coin A Weighted count for tails in observation 1: %0.2f" % weighted_tails_A_observation_1)
weighted_heads_B_observation_1 = 5*normalized_coin_B_pmf_observation_1
print("Coin B Weighted count for heads in observation 1: %0.2f" % weighted_heads_B_observation_1)
weighted_tails_B_observation_1 = 5*normalized_coin_B_pmf_observation_1
print("Coin B Weighted count for tails in observation 1: %0.2f" % weighted_tails_B_observation_1)
Coin A Weighted count for heads in observation 1: 2.25
Coin A Weighted count for tails in observation 1: 2.25
Coin B Weighted count for heads in observation 1: 2.75
Coin B Weighted count for tails in observation 1: 2.75
We can similarly find the weighted counts in each observation set for both coins A and B; a short sketch that repeats this calculation for all five sets is given below. For the remaining steps, we will pick up the numbers from the paper.
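For completeness, here is a small sketch (my own transcription of the steps above, not code from the paper) that repeats the E-step for all five observation sets, using the `observations` array defined earlier, and accumulates the weighted counts; the totals for coin A should come out close to the numbers quoted from the paper in the next step.

```python
# E-step for all five observation sets, with theta_A = 0.6 and theta_B = 0.5.
theta_A, theta_B = 0.6, 0.5
totals = {'A': {'H': 0.0, 'T': 0.0}, 'B': {'H': 0.0, 'T': 0.0}}
for idx, observation in enumerate(observations):
    n_throws = len(observation)
    heads = observation.sum()
    tails = n_throws - heads
    pmf_A = stats.binom.pmf(heads, n_throws, theta_A)
    pmf_B = stats.binom.pmf(heads, n_throws, theta_B)
    weight_A = pmf_A / (pmf_A + pmf_B)   # responsibility of coin A for this set
    weight_B = 1 - weight_A
    totals['A']['H'] += weight_A * heads
    totals['A']['T'] += weight_A * tails
    totals['B']['H'] += weight_B * heads
    totals['B']['T'] += weight_B * tails
    print("Set %d: A -> %.1f H, %.1f T | B -> %.1f H, %.1f T"
          % (idx + 1, weight_A * heads, weight_A * tails, weight_B * heads, weight_B * tails))
print("Totals: A -> %.1f H, %.1f T | B -> %.1f H, %.1f T"
      % (totals['A']['H'], totals['A']['T'], totals['B']['H'], totals['B']['T']))
```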
Iteration 1, Step 2: M-step
We now take the sums of the weighted counts of heads and tails for both coins A and B. For coin A, we have approximately 21.3 heads and 8.6 tails. Thus, we get the probability of getting a head at the end of this iteration as:
21.3/(21.3+8.6)
0.7123745819397994
We can find the same quantity for coin B as well; a compact, vectorized version of this whole first iteration is sketched below. Next, we repeat the same procedure using these newly calculated probabilities of obtaining heads for coins A and B.
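For reference, the same first iteration can also be written compactly with vectorized numpy operations. This is just an alternative sketch of the two steps above, again assuming the `observations` array and the initial guesses of 0.6 and 0.5; it should land at roughly 0.71 for coin A and roughly 0.58 for coin B.

```python
# One full EM iteration (E-step + M-step) in vectorized form.
heads = observations.sum(axis=1)            # number of heads per observation set
n_throws = observations.shape[1]            # throws per set (10)
pmf_A = stats.binom.pmf(heads, n_throws, 0.6)
pmf_B = stats.binom.pmf(heads, n_throws, 0.5)
weight_A = pmf_A / (pmf_A + pmf_B)          # responsibility of coin A per set
weight_B = 1 - weight_A
theta_A_1 = (weight_A * heads).sum() / (weight_A * n_throws).sum()
theta_B_1 = (weight_B * heads).sum() / (weight_B * n_throws).sum()
print("theta_A after iteration 1: %.2f" % theta_A_1)
print("theta_B after iteration 1: %.2f" % theta_B_1)
```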
EM single iteration
Let us now write a procedure that performs a single iteration of the EM algorithm. The function takes the following arguments:
- the priors \(\theta_A\) and \(\theta_B\)
- the observation matrix (5 x 10 in this case)
and outputs the new set of priors based on one EM iteration.
def em_single(priors, observations):
    """
    Performs a single EM step
    Arguments
    ---------
    priors : [theta_A, theta_B]
    observations : [m X n matrix]
    Returns
    --------
    new_priors: [new_theta_A, new_theta_B]
    """
    counts = {'A': {'H': 0, 'T': 0}, 'B': {'H': 0, 'T': 0}}
    theta_A = priors[0]
    theta_B = priors[1]
    # E step
    for observation in observations:
        len_observation = len(observation)
        num_heads = observation.sum()
        num_tails = len_observation - num_heads
        contribution_A = stats.binom.pmf(num_heads, len_observation, theta_A)
        contribution_B = stats.binom.pmf(num_heads, len_observation, theta_B)
        weight_A = contribution_A/(contribution_A+contribution_B)
        weight_B = contribution_B/(contribution_A+contribution_B)
        # Incrementing counts
        counts['A']['H'] += weight_A*num_heads
        counts['A']['T'] += weight_A*num_tails
        counts['B']['H'] += weight_B*num_heads
        counts['B']['T'] += weight_B*num_tails
    # M step
    new_theta_A = counts['A']['H']/(counts['A']['H']+counts['A']['T'])
    new_theta_B = counts['B']['H']/(counts['B']['H']+counts['B']['T'])
    return [new_theta_A, new_theta_B]
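As a quick check (assuming the `observations` array defined earlier), a single call with the Step 0 priors should approximately reproduce the numbers worked out by hand above, roughly 0.71 for coin A and 0.58 for coin B:

```python
# Single EM step starting from the Step 0 priors.
em_single([0.6, 0.5], observations)   # expected to be roughly [0.71, 0.58]
```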
EM procedure
This procedure calls the single EM iteration until convergence or until a stopping condition is met. We specify two stopping conditions, and the procedure stops when either one is met:
- a maximum of 10000 iterations
- the change in the prior is less than 1e-6
def em(observations, prior, tol=1e-6, iterations=10000):
    iteration = 0
    while iteration < iterations:
        new_prior = em_single(prior, observations)
        delta_change = np.abs(prior[0]-new_prior[0])
        if delta_change < tol:
            break
        else:
            prior = new_prior
            iteration += 1
    return [new_prior, iteration]
Results
Let us run the algorithm with the priors used in the paper:
em(observations, [0.6, 0.5])
[[0.79678875938310978, 0.51958393567528027], 14]
Great! Our results match those reported in the paper. It took 14 iterations of the EM algorithm to reach these values.
What if we reverse the priors for A and B?
em(observations, [0.5, 0.6])
[[0.51958345063012845, 0.79678895444393927], 15]
OK, so EM has no notion of A and B! As far as it is concerned, there are simply two coins, and it again finds the same estimates, just with the labels swapped.
What if the priors for both A and B were equal?
em(observations, [0.3, 0.3])
[[0.66000000000000003, 0.66000000000000003], 1]
So this is clearly not a good initialization strategy! With identical priors, both coins receive identical weights in every E-step, so their estimates stay identical and the algorithm immediately settles on the overall fraction of heads (33/50 = 0.66) for both coins.
What if one of the priors is very close to 1?
em(observations, [0.9999, 0.00000001])
[[0.79678850504581944, 0.51958235686544463], 13]
So EM is still smart enough!
If you have any suggestions, feel free to let me know.