This is the fourth part of a series of notes taken while reading the Pattern Recognition and Machine Learning (Springer, 2006) book, and it includes simple examples with implementations for demonstration. As in the first, second, and third posts, the goal of these posts is not to be rigorous, comprehensive, or to provide any new or advanced information. The target audience was first and foremost myself, but these notes are made available for anyone else who happens to find them useful.
Some contents are hidden away from the default view inside modal dialogs where no page redirection takes place. These links are underlined. Clicking outside the window, clicking the close button, or clicking the back button in the browser will close the window.
A basic background in mathematics, including basic calculus, basic linear algebra, exposure to optimization, and basic statistics, is assumed; see the suggested readings at the end of this page. If your statistics is rusty, fear not, as the author of the PR & ML book writes:
All snippets are done using Python 3 using only the numpy and matplotlib add-on packages. Also, the ffmpeg package needs to be installed to create the videos.
Reference: Sections 9.2 and 9.3
Initially, the discussion will be limited to Gaussian mixture models with the Expectation Maximization (EM) algorithm previously introduced in Part 2.
Given a dataset $\{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$, assign each data point to a different class or component of a multimodal distribution.
Consider Gaussian mixture models composed of $K$ components constructed through a linear superposition of Gaussian distributions

$$p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k),$$

where it is required that $0 \le \pi_k \le 1$ and $\sum_{k=1}^{K} \pi_k = 1$ for the above model to be a valid distribution. The coefficients $\pi_k$ above denote the contribution of the $k$-th Gaussian and are referred to as the mixing coefficients.
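As a concrete illustration, the following is a minimal numpy sketch (not the code used for the results in this post) of evaluating such a mixture density; the helper names gaussian_pdf and mixture_pdf as well as the two-component parameter values are made up purely for demonstration.

```python
# Minimal sketch: evaluate p(x) = sum_k pi_k N(x | mu_k, Sigma_k) for a 2-D example.
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Density of a multivariate Gaussian N(x | mu, sigma) at a single point x."""
    d = mu.shape[0]
    diff = x - mu
    expo = -0.5 * diff @ np.linalg.solve(sigma, diff)
    norm = np.sqrt((2.0 * np.pi) ** d * np.linalg.det(sigma))
    return np.exp(expo) / norm

def mixture_pdf(x, pis, mus, sigmas):
    """Linear superposition of K Gaussians with mixing coefficients pis."""
    return sum(pi * gaussian_pdf(x, mu, sigma)
               for pi, mu, sigma in zip(pis, mus, sigmas))

# Made-up example with K = 2 components in 2 dimensions.
pis = np.array([0.4, 0.6])                       # mixing coefficients, sum to 1
mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
sigmas = [np.eye(2), 2.0 * np.eye(2)]
print(mixture_pdf(np.array([1.0, 1.0]), pis, mus, sigmas))
```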
Similar to the previous section, consider the introduction of a $K$-dimensional binary vector $\mathbf{z}$ for a data point $\mathbf{x}$ such that $z_k \in \{0, 1\}$ (and consequently $\sum_k z_k = 1$) for a 1-of-$K$ coding scheme to indicate which class a data point belongs to. The marginal distribution over $\mathbf{z}$ is then specified in terms of the mixing coefficients such that

$$p(z_k = 1) = \pi_k, \qquad p(\mathbf{z}) = \prod_{k=1}^{K} \pi_k^{z_k}.$$
Note that $\sum_{\mathbf{z}} p(\mathbf{z}) = \sum_{k=1}^{K} \pi_k = 1$ as required. The conditional distribution of a data point $\mathbf{x}$ for a given $\mathbf{z}$ is written as

$$p(\mathbf{x} \mid z_k = 1) = \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k), \qquad p(\mathbf{x} \mid \mathbf{z}) = \prod_{k=1}^{K} \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)^{z_k},$$
and the marginalized distribution

$$p(\mathbf{x}) = \sum_{\mathbf{z}} p(\mathbf{z}) \, p(\mathbf{x} \mid \mathbf{z}) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k).$$
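The generative view implied by the latent variable $\mathbf{z}$ can be sketched as follows: first draw which component is active with probabilities given by the mixing coefficients, then draw $\mathbf{x}$ from the selected Gaussian. The snippet below is illustrative only; the parameter values are made up and the function name sample_mixture is not part of any library.

```python
# Minimal sketch of sampling from the generative model: z first, then x | z.
import numpy as np

rng = np.random.default_rng(0)

pis = np.array([0.4, 0.6])                        # mixing coefficients, sum to 1
mus = np.array([[0.0, 0.0], [3.0, 3.0]])          # component means (K x D)
sigmas = np.array([np.eye(2), 2.0 * np.eye(2)])   # component covariances (K x D x D)

def sample_mixture(n):
    # 1-of-K latent variable: pick component index k with probability pi_k.
    ks = rng.choice(len(pis), size=n, p=pis)
    # Conditional distribution p(x | z_k = 1) = N(x | mu_k, Sigma_k).
    x = np.array([rng.multivariate_normal(mus[k], sigmas[k]) for k in ks])
    return x, ks

x, z = sample_mixture(5)
print(x)
```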
Note that each component distribution has independent unknown parameters $\boldsymbol{\mu}_k$, $\boldsymbol{\Sigma}_k$, and $\pi_k$. As seen before, these quantities may be determined by maximum likelihood. Let $\mathbf{X}$ denote the training set where each row $\mathbf{x}_n^{\mathsf{T}}$ contains a single data point. Similarly, let $\mathbf{Z}$ denote the corresponding latent variables with rows $\mathbf{z}_n^{\mathsf{T}}$.
Consider now the maximization of the log likelihood of the complete data set $\{\mathbf{X}, \mathbf{Z}\}$, given by

$$\ln p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}, \boldsymbol{\pi}) = \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk} \left\{ \ln \pi_k + \ln \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right\},$$
subject to the constraint $\sum_{k=1}^{K} \pi_k = 1$.
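For illustration, a minimal numpy sketch of evaluating this complete-data log likelihood is given below, assuming the 1-of-$K$ assignments $\mathbf{Z}$ are known; the helper names and the tiny example values are made up for demonstration.

```python
# Minimal sketch: complete-data log likelihood
#   sum_n sum_k z_nk { ln pi_k + ln N(x_n | mu_k, Sigma_k) }.
import numpy as np

def log_gaussian(X, mu, sigma):
    """Log density of N(x | mu, sigma) for each row of X."""
    d = mu.shape[0]
    diff = X - mu
    maha = np.einsum('ni,ij,nj->n', diff, np.linalg.inv(sigma), diff)
    return -0.5 * (d * np.log(2.0 * np.pi) + np.log(np.linalg.det(sigma)) + maha)

def complete_data_log_likelihood(X, Z, pis, mus, sigmas):
    # Z is N x K with exactly one 1 per row (1-of-K coding).
    ll = 0.0
    for k in range(len(pis)):
        ll += np.sum(Z[:, k] * (np.log(pis[k]) + log_gaussian(X, mus[k], sigmas[k])))
    return ll

# Tiny made-up example: two points, two components.
X = np.array([[0.1, -0.2], [3.2, 2.9]])
Z = np.array([[1, 0], [0, 1]])
pis = np.array([0.5, 0.5])
mus = np.array([[0.0, 0.0], [3.0, 3.0]])
sigmas = np.array([np.eye(2), np.eye(2)])
print(complete_data_log_likelihood(X, Z, pis, mus, sigmas))
```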
Next, we proceed using the Expectation Maximization algorithm discussed in Part 2. As a reminder, consider the maximization of the expected value of the complete-data log likelihood under the conditional (posterior) distribution of the latent variables:

$$\mathbb{E}_{\mathbf{Z}} \left[ \ln p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}, \boldsymbol{\pi}) \right],$$
where the expectation is taken with respect to

$$p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\mu}, \boldsymbol{\Sigma}, \boldsymbol{\pi}) \propto \prod_{n=1}^{N} \prod_{k=1}^{K} \left[ \pi_k \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right]^{z_{nk}}.$$
As a reminder, $z_{nk}$ corresponds to the indicator term for the $k$-th class of the $n$-th point, the data points are independent, and each point can be assigned to only one class. When computing the expected value of the complete-data log likelihood, the expected value of $z_{nk}$ under this distribution (the responsibility) is

$$\gamma(z_{nk}) \equiv \mathbb{E}[z_{nk}] = \frac{\pi_k \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}.$$
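A minimal numpy sketch of this E-step computation is given below; the helper functions and example values are made up for illustration and are not the exact code used for the video later in this post.

```python
# Minimal sketch of the E-step: gamma(z_nk) is the posterior probability that
# point n was generated by component k.
import numpy as np

def gaussian_pdf(X, mu, sigma):
    """Density of N(x | mu, sigma) evaluated at each row of X."""
    d = mu.shape[0]
    diff = X - mu
    maha = np.einsum('ni,ij,nj->n', diff, np.linalg.inv(sigma), diff)
    return np.exp(-0.5 * maha) / np.sqrt((2.0 * np.pi) ** d * np.linalg.det(sigma))

def e_step(X, pis, mus, sigmas):
    """Return the N x K matrix of responsibilities gamma(z_nk)."""
    weighted = np.column_stack([pis[k] * gaussian_pdf(X, mus[k], sigmas[k])
                                for k in range(len(pis))])
    return weighted / weighted.sum(axis=1, keepdims=True)

# Tiny made-up example with two components.
X = np.array([[0.0, 0.1], [3.1, 2.8], [0.2, -0.1]])
pis = np.array([0.5, 0.5])
mus = np.array([[0.0, 0.0], [3.0, 3.0]])
sigmas = np.array([np.eye(2), np.eye(2)])
print(e_step(X, pis, mus, sigmas))   # each row sums to 1
```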
The expected value of the complete-data log likelihood under this distribution, with the constraint $\sum_k \pi_k = 1$ built in through a Lagrange multiplier $\lambda$, then becomes

$$\sum_{n=1}^{N} \sum_{k=1}^{K} \gamma(z_{nk}) \left\{ \ln \pi_k + \ln \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right\} + \lambda \left( \sum_{k=1}^{K} \pi_k - 1 \right).$$
Maximization of the above objective function with respect to $\boldsymbol{\mu}_k$, $\boldsymbol{\Sigma}_k$, and $\pi_k$ leads to

$$\boldsymbol{\mu}_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) \, \mathbf{x}_n, \qquad \boldsymbol{\Sigma}_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) \, (\mathbf{x}_n - \boldsymbol{\mu}_k)(\mathbf{x}_n - \boldsymbol{\mu}_k)^{\mathsf{T}}, \qquad \pi_k = \frac{N_k}{N},$$

where $N_k = \sum_{n=1}^{N} \gamma(z_{nk})$.
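The corresponding M-step updates can be sketched as follows; this is again an illustrative implementation with made-up example inputs, not the exact code used for the results below.

```python
# Minimal sketch of the M-step updates derived above, given the
# N x K matrix of responsibilities from the E-step.
import numpy as np

def m_step(X, gamma):
    """Return updated (pis, mus, sigmas) from data X (N x D) and responsibilities gamma (N x K)."""
    N, D = X.shape
    Nk = gamma.sum(axis=0)                # effective number of points per component
    pis = Nk / N                          # new mixing coefficients
    mus = (gamma.T @ X) / Nk[:, None]     # new means
    sigmas = np.empty((gamma.shape[1], D, D))
    for k in range(gamma.shape[1]):
        diff = X - mus[k]
        sigmas[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]   # new covariances
    return pis, mus, sigmas

# Made-up data and responsibilities for demonstration.
X = np.array([[0.0, 0.1], [3.1, 2.8], [0.2, -0.1]])
gamma = np.array([[0.9, 0.1], [0.05, 0.95], [0.8, 0.2]])
print(m_step(X, gamma))
```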
The Expectation Maximization algorithm thus reduces to the following steps (a minimal sketch of the full loop is given after the list):
Initialize the means $\boldsymbol{\mu}_k$, covariances $\boldsymbol{\Sigma}_k$, and mixing coefficients $\pi_k$.
Using the current parameter values, evaluate the expected values of the indicator variables (responsibilities) $\gamma(z_{nk})$.
Using the evaluated responsibilities from the E-step, evaluate new parameter values $\boldsymbol{\mu}_k$, $\boldsymbol{\Sigma}_k$, and $\pi_k$ using the equations above. Repeat the last two steps until convergence.
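Putting the pieces together, the following is a minimal, self-contained numpy sketch of the full EM loop described above. The synthetic two-blob dataset, the choice $K = 2$, and the convergence tolerance are all made-up choices for demonstration and differ from the experiment discussed below.

```python
# Minimal self-contained sketch of EM for a Gaussian mixture model.
import numpy as np

def gaussian_pdf(X, mu, sigma):
    d = mu.shape[0]
    diff = X - mu
    maha = np.einsum('ni,ij,nj->n', diff, np.linalg.inv(sigma), diff)
    return np.exp(-0.5 * maha) / np.sqrt((2.0 * np.pi) ** d * np.linalg.det(sigma))

def run_em(X, K, n_iter=100, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    N, D = X.shape
    # Initialization: means picked at random from the data, identity covariances,
    # uniform mixing coefficients.
    mus = X[rng.choice(N, size=K, replace=False)]
    sigmas = np.array([np.eye(D) for _ in range(K)])
    pis = np.full(K, 1.0 / K)
    prev_ll = -np.inf
    for _ in range(n_iter):
        # E-step: responsibilities gamma(z_nk).
        weighted = np.column_stack([pis[k] * gaussian_pdf(X, mus[k], sigmas[k])
                                    for k in range(K)])
        ll = np.sum(np.log(weighted.sum(axis=1)))     # incomplete-data log likelihood
        gamma = weighted / weighted.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters from the responsibilities.
        Nk = gamma.sum(axis=0)
        pis = Nk / N
        mus = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mus[k]
            sigmas[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]
        if np.abs(ll - prev_ll) < tol:                # convergence check
            break
        prev_ll = ll
    return pis, mus, sigmas, gamma

# Made-up demonstration data: two Gaussian blobs.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal([0, 0], 1.0, size=(100, 2)),
               rng.normal([4, 4], 1.0, size=(100, 2))])
pis, mus, sigmas, gamma = run_em(X, K=2)
print(mus)
```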
The same dataset considered in the neural network example is reconsidered. The clusters are treated as an incomplete dataset, i.e., it is assumed that the cluster assignments are not known.
A fixed number of clusters $K$ is considered. The means are randomly initialized, the mixing coefficients are taken to be uniform, and the covariances are taken to be the identity. The EM iterations described above are performed until convergence. The video below shows the Gaussian mixture model along with the means of the clusters for each iteration of the EM algorithm. All clusters are plotted using the same color to indicate that the dataset is incomplete.
For reference, the means of the Gaussian distributions from which the clusters were drawn were .
Suggested general reference books:
Bishop, Christopher M. Pattern Recognition and Machine Learning. Springer, 2006.
Base reference for content of this page, though certainly not the best book that I have read.
Boyd, Stephen, and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
A classic and a must-read introductory book on (convex) optimization for everyone regardless of their background. A PDF copy is also available on the Stanford webpage as of this writing.
Greenberg, Michael D. Foundations of Applied Mathematics. Courier Corporation, 2013.
Good reference book or refresher on applied math.
Bulmer, Michael George. Principles of Statistics. Courier Corporation, 1979.
Very easy read and short refresher on basics of statistics.