K-means and EM for Gaussian mixtures are two clustering algorithms commonly covered in machine learning courses. In this post, I’ll go through my implementations on some sample data.

I won’t be going through much theory, as that can be easily found elsewhere. Instead I’ve focused on highlighting the following:

- Pretty visualizations in `ggplot`, with the helper packages `deldir`, `ellipse`, and `knitr` for animations.
- Structural similarities in the algorithms, by splitting up K-means into an E and M step.
- Animations showing that EM reduces to the K-means algorithm in a particular limit.

This last point is covered in Section 9.3.2 of Bishop’s *Pattern Recognition and Machine Learning*, which I recommend taking a look at for additional theoretical intuition.

So let’s get started! Our data comes from Barbara Engelhardt’s Spring 2013 Duke course, STA613/CBB540: Statistical methods in computational biology, homework 4.

First, we load our library functions and data points:

```
library(deldir)
library(ellipse)
library(pryr)
library(ggplot2)
center_title = theme(plot.title = element_text(hjust = 0.5))
no_legend = theme(legend.position="none")
```

```
points = read.table('../data/points_hw4.txt', col.names=c("x", "y"))
ggplot(points, aes(x = x, y = y)) + geom_point() +
labs(title = "Scatter plot of data") +
center_title
```

One of the aims of this post is to show how the common EM clustering algorithm reduces to K-means in a particular limit. To do this, we should first put both algorithms into a common form.

If you’ve worked through the algorithms before, you’ll see that both K-means and EM consist of a step where points are assigned to clusters, followed by a step where parameter updates are computed from those assignments. They look like the following:

- EM E step: compute soft assignments, i.e. for each point, a probability distribution over the \(K\) clusters
- K-means “E step”: compute hard assignments, assigning each data point to its nearest cluster center
- EM M step: using the soft assignments, update \(\mathbf{\mu}_k\), the Gaussian means, \(\mathbf{\Sigma}_k\), the Gaussian covariance matrices, and \(\mathbf{\pi}\), the cluster weights
- K-means “M step”: using the hard assignments, update \(\mathbf{\mu}_k\), the cluster centers
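For reference, these are the standard update formulas behind those steps (following Bishop, Section 9.2), with \(\gamma_{nk}\) denoting the responsibility of cluster \(k\) for point \(\mathbf{x}_n\) (the `resp` matrix in the code below):

\[
\gamma_{nk} = \frac{\pi_k \, \mathcal{N}(\mathbf{x}_n \mid \mathbf{\mu}_k, \mathbf{\Sigma}_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(\mathbf{x}_n \mid \mathbf{\mu}_j, \mathbf{\Sigma}_j)},
\qquad
N_k = \sum_{n=1}^{N} \gamma_{nk},
\]
\[
\mathbf{\mu}_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma_{nk} \mathbf{x}_n,
\qquad
\mathbf{\Sigma}_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma_{nk} (\mathbf{x}_n - \mathbf{\mu}_k)(\mathbf{x}_n - \mathbf{\mu}_k)^T,
\qquad
\pi_k = \frac{N_k}{N}.
\]

K-means is the same loop with \(\gamma_{nk}\) replaced by a hard 0/1 indicator of the nearest center, leaving only the \(\mathbf{\mu}_k\) update.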

Our approach will be to implement these four helper functions, and then string them together using a common interface. This also cuts down on duplication. The high-level program takes an E function, an M function, and starting inputs to an E step, and alternates the two steps while keeping track of all intermediate results:

```
iterate_em = function(nsteps, K, points, e_step, m_step, m_params.init) {
  # Alternate E and M steps, saving every intermediate result
  m_list = list(m_params.init)
  e_list = list()
  i = 1
  while (i <= nsteps) {
    e_list[[i]] = e_step(K, points, m_list[[i]])
    m_list[[i+1]] = m_step(K, points, e_list[[i]])
    i = i + 1
  }
  return(list(m_list=m_list, e_list=e_list))
}
```

The rest of this section is pretty dry, and just consists of my R code for the two algorithms. In the E step for EM, I make use of the log-sum-exp trick, which turns out to be helpful for numerical precision; you can read more about that here.
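
As a quick illustration of why the trick matters (a toy example, not part of the homework data): naively exponentiating large negative log-probabilities underflows to zero, but subtracting the maximum first keeps everything in range.

```
# Toy illustration of the log-sum-exp trick
x = c(-1000, -1001, -1002)        # log-probabilities with large magnitude
log(sum(exp(x)))                  # exp() underflows to 0, so this is -Inf
m = max(x)
m + log(sum(exp(x - m)))          # stable: roughly -999.59
```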

```
# K-means as functions
# points: N x D matrix
# e_params: list with
#   clusters: vector of assignments
# m_params: list with
#   centers: matrix of cluster centers
kmeans.e = function(K, points, m_params) {
  N = dim(points)[1]
  D = dim(points)[2]
  # squared distance from every point to every cluster center
  distances2 = matrix(0, N, K)
  for (k in 1:K) {
    for (j in 1:D) {
      distances2[,k] = distances2[,k] + (points[,j] - m_params$centers[k,j])^2
    }
  }
  # hard assignment: each point goes to its nearest center
  clusters = apply(distances2, 1, which.min)
  e_params = list(clusters=clusters)
  return(e_params)
}

kmeans.m = function(K, points, e_params) {
  N = dim(points)[1]
  D = dim(points)[2]
  # each center moves to the mean of its assigned points
  centers = matrix(0, K, D)
  for (k in 1:K) {
    centers[k,] = colMeans(points[e_params$clusters == k,])
  }
  m_params = list(centers=centers)
  return(m_params)
}

# EM as functions
# points: N x D matrix
# m_params: list with
#   mu: K x D, MoG centers
#   sigma: list of length K of D x D matrices, MoG covariances
#   weights: K, MoG weights
# e_params: list with
#   resp: responsibilities, N x K
#   ll: log-likelihood, for debugging
em.e = function(K, points, m_params) {
  N = dim(points)[1]
  D = dim(points)[2]
  mu = m_params$mu
  sigma = m_params$sigma
  weights = m_params$weights
  # update responsibilities (log space first, then exponentiate)
  resp = matrix(rep(0, N*K), N, K)
  for (k in 1:K) {
    constant_k = log(weights[k]) - 0.5*log(det(sigma[[k]])) -
      log(2*pi)*(D/2)
    displacement = points - as.numeric(matrix(mu[k,], N, D, byrow = TRUE))
    log_probs = -1/2 * colSums(t(displacement) * (
      solve(sigma[[k]]) %*% t(displacement)))
    resp[,k] = log_probs + constant_k
  }
  # log-sum-exp trick
  max_log_probs = apply(resp, 1, max)
  resp = resp - matrix(max_log_probs, N, K)
  resp = exp(resp)
  ll = mean(log(rowSums(resp))) + mean(max_log_probs)  # log likelihood
  resp = resp / matrix(rowSums(resp), N, K)
  e_params = list(resp=resp, ll=ll)
  return(e_params)
}

em.m = function(K, points, e_params, fix_sigma=NULL, fix_weights=NULL) {
  N = dim(points)[1]
  D = dim(points)[2]
  resp = e_params$resp
  # update means
  mu = matrix(0, K, D)
  for (k in 1:K) {
    mu[k,] = colSums(resp[,k]*points) / sum(resp[,k])
  }
  # update covariances (unless held fixed)
  if (is.null(fix_sigma)) {
    sigma = NULL
    for (k in 1:K) {
      sigma[[k]] = matrix(0, D, D)
      displacement = points - as.numeric(matrix(mu[k,], N, D, byrow = TRUE))
      for (j in 1:D) {
        sigma[[k]][j,] = colSums(displacement[,j]*displacement*resp[,k]) / sum(resp[,k])
      }
    }
  } else {
    sigma = fix_sigma
  }
  # update component weights (unless held fixed)
  if (is.null(fix_weights)) {
    weights = colSums(resp) / sum(resp)
  } else {
    weights = fix_weights
  }
  m_params = list(mu=mu, sigma=sigma, weights=weights)
  return(m_params)
}
```

Now, we can choose `K = 3` and `nsteps = 20` for an initial run. We randomly choose three points as our starting centers for both K-means and EM. For EM, we additionally initialize with identity covariance matrices and equal weights over the mixture components.

```
# Run K-means and EM
K = 3
nsteps = 20
N = dim(points)[1]
D = dim(points)[2]
set.seed(3)
centers = points[sample(1:N, K),]
row.names(centers) = NULL
m_params.init = list(centers=centers)
kmeans_results = iterate_em(nsteps, K, points, kmeans.e, kmeans.m, m_params.init)
mu = centers
sigma = NULL
for(k in 1:K) {
  sigma[[k]] = diag(D) # covariances initialized to identity matrix
}
weights = rep(1, K) / K # weights initialized to uniform
m_params.init = list(mu=mu, sigma=sigma, weights=weights)
em_results = iterate_em(nsteps, K, points, em.e, em.m, m_params.init)
```

The result of each E step contains data for each of the \(N\) points, which is verbose. Instead, let’s print the results of the M step, which are more compact.

`kmeans_results$m_list[1:3]`

```
## [[1]]
## [[1]]$centers
## x y
## 1 4.236116 4.322595
## 2 3.115771 3.307241
## 3 4.474370 2.387970
##
##
## [[2]]
## [[2]]$centers
## [,1] [,2]
## [1,] 3.903263 4.585723
## [2,] 2.434461 3.479530
## [3,] 4.907858 2.282699
##
##
## [[3]]
## [[3]]$centers
## [,1] [,2]
## [1,] 3.837077 4.520047
## [2,] 2.366460 3.438372
## [3,] 4.907858 2.282699
```

`em_results$m_list[1:3]`

```
## [[1]]
## [[1]]$mu
## x y
## 1 4.236116 4.322595
## 2 3.115771 3.307241
## 3 4.474370 2.387970
##
## [[1]]$sigma
## [[1]]$sigma[[1]]
## [,1] [,2]
## [1,] 1 0
## [2,] 0 1
##
## [[1]]$sigma[[2]]
## [,1] [,2]
## [1,] 1 0
## [2,] 0 1
##
## [[1]]$sigma[[3]]
## [,1] [,2]
## [1,] 1 0
## [2,] 0 1
##
##
## [[1]]$weights
## [1] 0.3333333 0.3333333 0.3333333
##
##
## [[2]]
## [[2]]$mu
## [,1] [,2]
## [1,] 3.596083 4.280319
## [2,] 2.652853 3.548945
## [3,] 4.318030 2.668293
##
## [[2]]$sigma
## [[2]]$sigma[[1]]
## [,1] [,2]
## [1,] 0.66681006 0.05709793
## [2,] 0.05709793 0.56318406
##
## [[2]]$sigma[[2]]
## [,1] [,2]
## [1,] 0.6406161 0.1080031
## [2,] 0.1080031 0.5753433
##
## [[2]]$sigma[[3]]
## [,1] [,2]
## [1,] 1.3820866 -0.4615455
## [2,] -0.4615455 0.7275576
##
##
## [[2]]$weights
## [1] 0.3032386 0.5042118 0.1925496
##
##
## [[3]]
## [[3]]$mu
## [,1] [,2]
## [1,] 3.602189 4.367311
## [2,] 2.578687 3.553098
## [3,] 4.475524 2.552581
##
## [[3]]$sigma
## [[3]]$sigma[[1]]
## [,1] [,2]
## [1,] 0.5694985 0.1191194
## [2,] 0.1191194 0.4292038
##
## [[3]]$sigma[[2]]
## [,1] [,2]
## [1,] 0.5033149 0.1763752
## [2,] 0.1763752 0.5461334
##
## [[3]]$sigma[[3]]
## [,1] [,2]
## [1,] 1.2389029 -0.4628892
## [2,] -0.4628892 0.5710045
##
##
## [[3]]$weights
## [1] 0.3006975 0.5026309 0.1966716
```

Looks sensible. It’s also a good idea to check that the EM log-likelihood always increases.

```
lls = rep(0, nsteps)
for (i in 1:nsteps) {
  lls[i] = em_results$e_list[[i]]$ll
}
ggplot(data=data.frame(x=1:nsteps, y=lls)) +
  geom_line(aes(x=x, y=y)) + geom_point(aes(x=x, y=y)) +
  labs(title="Log likelihood for EM", x="step", y="log likelihood") +
  center_title
```

We’d like to visualize our results, and since my aim is to compare K-means with EM, I’ve chosen to display them side by side using `ggplot`’s `facet_grid` option. Points are colored by the assigned cluster (K-means) or the most likely cluster (EM); an alternative visualization would use blended colors for EM. I used the `deldir` package to compute the K-means decision boundaries, which come from Voronoi diagrams, and the `ellipse` package to plot the shape of each Gaussian mixture component.
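
If you haven’t used these two packages before, here is a quick sanity check of what they return (a toy example, not part of the post’s pipeline): `deldir()` gives, among other things, a `$dirsgs` data frame whose `x1`, `y1`, `x2`, `y2` columns are the endpoints of the Voronoi tile edges, and `ellipse()` returns a matrix of points tracing a covariance contour.

```
library(deldir)
library(ellipse)
vor = deldir(c(0, 1, 0.5), c(0, 0, 1))          # Voronoi diagram of 3 toy points
head(vor$dirsgs[, c("x1", "y1", "x2", "y2")])   # segment endpoints for plotting
contour_pts = ellipse(diag(2), level = 0.95)    # 95% contour of a unit covariance
dim(contour_pts)                                # 100 x 2 points by default
```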

```
make_visualization = function(points, kmeans_data, em_data, nsteps, K) {
  for (i in 1:nsteps) {
    # colored points
    df_points = rbind(
      data.frame(x = points[,1], y = points[,2], type = "K-means",
                 cluster = kmeans_data$e_list[[i]]$clusters),
      data.frame(x = points[,1], y = points[,2], type = "EM",
                 cluster = apply(em_data$e_list[[i]]$resp, 1, which.max)))
    # K-means decision boundaries
    centers = kmeans_data$m_list[[i]]$centers
    df_voronoi = deldir(centers[,1], centers[,2])$dirsgs
    df_voronoi$type = factor("K-means", levels=c("K-means", "EM"))
    # ellipses
    mu = em_data$m_list[[i]]$mu
    sigma = em_data$m_list[[i]]$sigma
    all_ellipses = NULL
    for (k in 1:K) {
      ellipse_data = ellipse(sigma[[k]], level=pchisq(1, df=D))
      all_ellipses[[k]] = data.frame(
        x=ellipse_data[,1] + mu[k,1], y=ellipse_data[,2] + mu[k,2],
        cluster=k, type="EM")
    }
    df_ellipses = do.call(rbind, all_ellipses)
    print(
      ggplot() +
        geom_point(data=df_points, aes(x=x, y=y, color=factor(cluster))) +
        geom_point(data=data.frame(x=centers[,1], y=centers[,2], type="K-means"),
                   aes(x=x, y=y), shape=17, size=3) +
        geom_segment(data=df_voronoi, linetype = 1, color = "#FFB958",
                     aes(x = x1, y = y1, xend = x2, yend = y2)) +
        geom_path(data=df_ellipses, aes(x=x, y=y, color=factor(cluster))) +
        facet_grid(. ~ type) +
        ggtitle(paste0("Most likely cluster, K = ", K, ", step = ", i)) +
        center_title + no_legend)
  }
}
```

Since `knitr` / R Markdown supports animations, we can simply plot each frame in a for loop. In my case, I’m using `ffmpeg` with an `.mp4` format, and hacked `knitr` to add some flags for Apple support, which was necessary for me to get things (hopefully) viewable in Safari.
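
As a rough sketch (the chunk label and option values are my guesses, not this post’s actual source, and the surrounding chunk backticks are omitted so it renders here), an animated chunk using `knitr`’s `fig.show` / `ffmpeg.format` options would look something like:

```
{r animation, fig.show='animate', interval=0.5, ffmpeg.format='mp4'}
make_visualization(points, kmeans_results, em_results, nsteps, K)
```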

With this visualization code, we can finally take a look at our results!

`make_visualization(points, kmeans_results, em_results, nsteps, K)`

To make the EM algorithm more like K-means, we start by limiting the M step to only change the \(\mathbf{\mu}\) parameters. The correspondence is pretty clear – the Gaussian means correspond to the K-means cluster centers.

If you look closely above, you’ll see that I added extra arguments to the EM M step that allow for this change. Since the `iterate_em` function accepts a function for the M step, we can use the `partial` function from the `pryr` package to set these arguments appropriately.
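
As a quick aside (a toy example, not from this post’s pipeline): `pryr::partial` takes a function plus some pre-filled arguments and returns a new function with those arguments fixed, which is exactly what we need to freeze `fix_sigma` and `fix_weights`.

```
# Toy illustration of pryr::partial
library(pryr)
add = function(a, b) a + b
add_five = partial(add, a = 5)  # 'a' is now fixed at 5
add_five(b = 2)                 # returns 7
```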

```
fixed_sigma = partial(em.m, fix_sigma=sigma, fix_weights=weights)
em_results = iterate_em(nsteps, K, points, em.e, fixed_sigma, m_params.init)
make_visualization(points, kmeans_results, em_results, nsteps, K)
```

In the above animation, you’ll see that the shapes of the Gaussians do not change; only their centers do. One can show that this leads to linear decision boundaries for the most likely cluster, just like K-means. However, the algorithm evolution is still not the same as K-means.

To allow the two algorithms to finally match up, we need to take the limit where the fixed covariance for each mixture component is \(\epsilon I\) and \(\epsilon \to 0\). Here, we set \(\epsilon\) to 0.01, corresponding to a standard deviation of 0.1. The log-sum-exp trick I mentioned earlier was necessary to keep my results from underflowing or overflowing in this case.
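
To see why this works (following the argument in Bishop, Section 9.3.2), with covariances fixed at \(\epsilon I\) and equal weights, the responsibilities reduce to a softmax over squared distances:

\[
\gamma_{nk} = \frac{\exp\!\left(-\|\mathbf{x}_n - \mathbf{\mu}_k\|^2 / 2\epsilon\right)}{\sum_{j=1}^{K} \exp\!\left(-\|\mathbf{x}_n - \mathbf{\mu}_j\|^2 / 2\epsilon\right)}.
\]

As \(\epsilon \to 0\), \(\gamma_{nk}\) goes to 1 for the nearest center and to 0 for all others, so the E step becomes K-means’ hard assignment and the \(\mathbf{\mu}_k\) update becomes a plain average over each cluster’s points.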

```
sigma001 = NULL
for(k in 1:K) {
  sigma001[[k]] = diag(D)*0.01
}
m_params.init = list(mu=mu, sigma=sigma001, weights=weights)
fixed_sigma001 = partial(em.m, fix_sigma=sigma001, fix_weights=weights)
em_results = iterate_em(nsteps, K, points, em.e, fixed_sigma001, m_params.init)
make_visualization(points, kmeans_results, em_results, nsteps, K)
```

The two sides match very well!

We can repeat this entire process for a different value of \(K\). With regular EM:

```
# Run K-means and EM
K = 8
set.seed(3)
centers = points[sample(1:N, K),]
row.names(centers) = NULL
m_params.init = list(centers=centers)
kmeans_results = iterate_em(nsteps, K, points, kmeans.e, kmeans.m, m_params.init)
mu = centers
sigma = NULL
for(k in 1:K) {
  sigma[[k]] = diag(D) # covariances initialized to identity matrix
}
weights = rep(1, K) / K # weights initialized to uniform
m_params.init = list(mu=mu, sigma=sigma, weights=weights)
em_results = iterate_em(nsteps, K, points, em.e, em.m, m_params.init)
# Visualize
make_visualization(points, kmeans_results, em_results, nsteps, K)
```

With only \(\mathbf{\mu}\) updates, and \(I\) covariance:

```
fixed_sigma = partial(em.m, fix_sigma=sigma, fix_weights=weights)
em_results = iterate_em(nsteps, K, points, em.e, fixed_sigma, m_params.init)
make_visualization(points, kmeans_results, em_results, nsteps, K)
```

With only \(\mathbf{\mu}\) updates, and \(0.01I\) covariance:

```
sigma001 = NULL
for(k in 1:K) {
  sigma001[[k]] = diag(D)*0.01
}
m_params.init = list(mu=mu, sigma=sigma001, weights=weights)
fixed_sigma001 = partial(em.m, fix_sigma=sigma001, fix_weights=weights)
em_results = iterate_em(nsteps, K, points, em.e, fixed_sigma001, m_params.init)
make_visualization(points, kmeans_results, em_results, nsteps, K)
```

Besides the links included above, I found this link useful for plotting Voronoi diagrams using `ggplot`.

*This blog post was generated from an R Markdown file using the knitr and blogdown packages. The original source can be downloaded from GitHub.*