Introduction to PR and ML (Part 4): Intro to Clustering

2019-10-22

This is the fourth part of a series of notes taken while reading the Pattern Recognition and Machine Learning (Springer, 2006) book, and includes simple examples with implementations for demonstration. As in the first, second, and third posts, the goal of these posts is not to be rigorous or comprehensive, or to provide any new or advanced information. The target audience was first and foremost myself, but these notes are made available for anyone else who happens to find them useful.

Some contents are hidden away from the default view inside modal dialogs, where no page redirection takes place. These links are underlined. Clicking outside the window, clicking the close button, or clicking the back button in the browser will close the window.

A basic background in mathematics is assumed, including basic calculus, basic linear algebra, exposure to optimization, and basic statistics; see the suggested readings at the end of this page. If your statistics are rusty, fear not, as the author of the PR & ML book writes:

All snippets are written in Python 3 using only the numpy and matplotlib add-on packages. The ffmpeg package also needs to be installed to create the videos.

Clustering - Gaussian Mixture Models

Reference: Sections 9.2 and 9.3

Initially, the discussion will be limited to Gaussian mixture models fit with the Expectation Maximization algorithm previously introduced in Part 2.

Clustering

Given a dataset $\{\mathbf{x}_1, \dots, \mathbf{x}_N\}$, assign each data point to a different class, or component, of a multimodal distribution.

Consider Gaussian mixture models composed of $K$ components constructed through a linear superposition of Gaussian distributions

$$p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$$

where it is required that $0 \le \pi_k \le 1$ and $\sum_{k=1}^{K} \pi_k = 1$ for the above model to be a valid distribution. The coefficients $\pi_k$ above denote the contribution of the $k$-th Gaussian and are referred to as the mixing coefficients.
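As a quick concrete check of this definition, here is a minimal numpy sketch of evaluating the mixture density at a set of points; the function names (`gaussian_pdf`, `mixture_pdf`) are my own and not from the book or the post's linked code.

```python
import numpy as np

def gaussian_pdf(X, mean, cov):
    """Evaluate the multivariate normal density N(x | mean, cov) at each row of X."""
    D = X.shape[1]
    diff = X - mean
    cov_inv = np.linalg.inv(cov)
    # Quadratic form (x - mu)^T Sigma^{-1} (x - mu) for every row at once.
    exponent = -0.5 * np.einsum('nd,dk,nk->n', diff, cov_inv, diff)
    norm = np.sqrt((2.0 * np.pi) ** D * np.linalg.det(cov))
    return np.exp(exponent) / norm

def mixture_pdf(X, pis, means, covs):
    """Evaluate p(x) = sum_k pi_k N(x | mu_k, Sigma_k) at each row of X."""
    return sum(pi_k * gaussian_pdf(X, mu_k, cov_k)
               for pi_k, mu_k, cov_k in zip(pis, means, covs))
```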

Similar to the previous section, consider the introduction of a $K$-dimensional binary vector $\mathbf{z}$ for a data point $\mathbf{x}$ such that $z_k \in \{0, 1\}$ (and consequently $\sum_k z_k = 1$) for a 1-of-$K$ coding scheme to indicate which class a data point belongs to. The marginal distribution over $\mathbf{z}$ is then specified in terms of the mixing coefficients such that

$$p(z_k = 1) = \pi_k, \qquad p(\mathbf{z}) = \prod_{k=1}^{K} \pi_k^{z_k}.$$

Note that $\sum_{\mathbf{z}} p(\mathbf{z}) = \sum_{k=1}^{K} \pi_k = 1$ as required. The conditional distribution of a data point $\mathbf{x}$ for a given $\mathbf{z}$ is written as

$$p(\mathbf{x} \mid \mathbf{z}) = \prod_{k=1}^{K} \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)^{z_k}$$

and the marginalized distribution

$$p(\mathbf{x}) = \sum_{\mathbf{z}} p(\mathbf{z}) \, p(\mathbf{x} \mid \mathbf{z}) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k).$$
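The latent-variable formulation also makes sampling from the model straightforward: draw $\mathbf{z}$ from the categorical distribution defined by the mixing coefficients, then draw $\mathbf{x}$ from the selected Gaussian. A small sketch under the same naming assumptions as above:

```python
import numpy as np

def sample_mixture(n_samples, pis, means, covs, seed=None):
    """Ancestral sampling: z ~ Categorical(pi), then x ~ N(mu_z, Sigma_z)."""
    rng = np.random.default_rng(seed)
    K = len(pis)
    # Sample latent class indices according to the mixing coefficients.
    z = rng.choice(K, size=n_samples, p=pis)
    # Sample each data point from the Gaussian component selected by z.
    X = np.stack([rng.multivariate_normal(means[k], covs[k]) for k in z])
    return X, z
```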

Note that each distribution has independent unknown parameters $\{\pi_k, \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k\}$. As seen before, these quantities may be determined by maximum likelihood. Let $\mathbf{X} \in \mathbb{R}^{N \times D}$ denote the training set where each row $\mathbf{x}_n^{\mathsf{T}}$ contains a single data point. Similarly, let $\mathbf{Z} \in \{0, 1\}^{N \times K}$ denote the corresponding latent variables with rows $\mathbf{z}_n^{\mathsf{T}}$.

Consider now the maximization of the likelihood of the complete data set $\{\mathbf{X}, \mathbf{Z}\}$, given by

$$\ln p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk} \left\{ \ln \pi_k + \ln \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right\}$$

subject to the constraint $\sum_{k=1}^{K} \pi_k = 1$.
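If the latent assignments $\mathbf{Z}$ were actually observed, this complete-data log likelihood could be evaluated directly. A small sketch, assuming $\mathbf{Z}$ is stored as an $N \times K$ one-hot matrix and reusing the hypothetical `gaussian_pdf` helper from above:

```python
import numpy as np

def complete_data_log_likelihood(X, Z, pis, means, covs):
    """ln p(X, Z) = sum_n sum_k z_nk [ ln pi_k + ln N(x_n | mu_k, Sigma_k) ]."""
    N, K = Z.shape
    log_terms = np.zeros((N, K))
    for k in range(K):
        log_terms[:, k] = np.log(pis[k]) + np.log(gaussian_pdf(X, means[k], covs[k]))
    # The one-hot Z picks out exactly one term per data point.
    return np.sum(Z * log_terms)
```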

Next we proceed using the Expectation Maximization algorithm discussed in Part 2. As a reminder, consider the maximization of the expected value of the complete-data log likelihood under the posterior distribution of the latent variables:

$$Q(\boldsymbol{\theta}, \boldsymbol{\theta}^{\text{old}}) = \mathbb{E}_{\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{\text{old}}} \left[ \ln p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta}) \right]$$

where,

$$p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{\text{old}}) \propto \prod_{n=1}^{N} \prod_{k=1}^{K} \left[ \pi_k \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right]^{z_{nk}}.$$

As a reminder, $z_{nk}$ corresponds to the indicator term for the $k$-th class of the $n$-th point, the rows $\mathbf{z}_n$ are independent, and each point can be assigned to only one class. When computing the expected value of the complete-data log likelihood, the expected value of $z_{nk}$ under this distribution is

$$\gamma(z_{nk}) \equiv \mathbb{E}[z_{nk}] = \frac{\pi_k \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)},$$

referred to as the responsibility of component $k$ for point $\mathbf{x}_n$.
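In code, this E-step is just a per-point normalization of the weighted component densities. A sketch reusing the hypothetical `gaussian_pdf` from above:

```python
import numpy as np

def e_step(X, pis, means, covs):
    """Return the N x K matrix of responsibilities gamma(z_nk)."""
    K = len(pis)
    # Numerators pi_k N(x_n | mu_k, Sigma_k) for every point and component.
    weighted = np.stack([pis[k] * gaussian_pdf(X, means[k], covs[k])
                         for k in range(K)], axis=1)
    # Normalize across components so each row sums to one.
    return weighted / weighted.sum(axis=1, keepdims=True)
```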

The expected value of the complete-data log likelihood under this distribution, with the constraint built in through a Lagrange multiplier $\lambda$, then becomes

$$\mathbb{E}_{\mathbf{Z}} \left[ \ln p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma}) \right] = \sum_{n=1}^{N} \sum_{k=1}^{K} \gamma(z_{nk}) \left\{ \ln \pi_k + \ln \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right\} + \lambda \left( \sum_{k=1}^{K} \pi_k - 1 \right)$$
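As an example of the maximization that follows, differentiating this objective with respect to $\pi_k$ and setting the result to zero (a standard step, spelled out here for completeness) gives

$$\sum_{n=1}^{N} \frac{\gamma(z_{nk})}{\pi_k} + \lambda = 0
\quad\Rightarrow\quad
\pi_k = -\frac{N_k}{\lambda}, \qquad N_k \equiv \sum_{n=1}^{N} \gamma(z_{nk}),$$

and enforcing $\sum_k \pi_k = 1$ yields $\lambda = -N$, hence $\pi_k = N_k / N$.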

Maximization of the above objective function with respect to $\boldsymbol{\mu}_k$, $\boldsymbol{\Sigma}_k$, and $\pi_k$ leads to

$$\boldsymbol{\mu}_k^{\text{new}} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) \, \mathbf{x}_n, \qquad
\boldsymbol{\Sigma}_k^{\text{new}} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) \, (\mathbf{x}_n - \boldsymbol{\mu}_k^{\text{new}})(\mathbf{x}_n - \boldsymbol{\mu}_k^{\text{new}})^{\mathsf{T}}, \qquad
\pi_k^{\text{new}} = \frac{N_k}{N},$$

where $N_k = \sum_{n=1}^{N} \gamma(z_{nk})$.
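A minimal numpy sketch of these M-step updates (again, function and variable names are my own assumptions, not the post's original code):

```python
import numpy as np

def m_step(X, gamma):
    """Re-estimate mixing coefficients, means, and covariances from responsibilities."""
    N, D = X.shape
    Nk = gamma.sum(axis=0)                       # effective number of points per component
    means = (gamma.T @ X) / Nk[:, None]          # weighted means, shape (K, D)
    covs = []
    for k in range(gamma.shape[1]):
        diff = X - means[k]
        covs.append((gamma[:, k, None] * diff).T @ diff / Nk[k])  # weighted covariance
    pis = Nk / N                                 # updated mixing coefficients
    return pis, means, np.array(covs)
```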

The Expectation Maximization algorithm thus reduces to:

EM for Gaussian Mixtures
  1. Initialization:

    Initialize the means $\boldsymbol{\mu}_k$, covariances $\boldsymbol{\Sigma}_k$, and mixing coefficients $\pi_k$.

  2. E(xpectation) step:

    Using the current parameter values, evaluate the expected values of the indicator variables (responsibilities) $\gamma(z_{nk})$.

  3. M(aximization) step:

    Using the evaluated responsibilities from the E-step, evaluate the new parameters $\boldsymbol{\mu}_k$, $\boldsymbol{\Sigma}_k$, and $\pi_k$ using the equations above.

  4. Go to Step 2, and repeat until convergence (a sketch of the full loop follows below).
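Putting the pieces together, here is a minimal sketch of the full loop, reusing the hypothetical `mixture_pdf`, `e_step`, and `m_step` helpers from above; convergence is checked here via the change in the log likelihood, which is one reasonable criterion among several.

```python
import numpy as np

def log_likelihood(X, pis, means, covs):
    """ln p(X) = sum_n ln sum_k pi_k N(x_n | mu_k, Sigma_k)."""
    return np.sum(np.log(mixture_pdf(X, pis, means, covs)))

def fit_gmm(X, K, max_iter=200, tol=1e-6, seed=None):
    """Fit a K-component Gaussian mixture to the rows of X with EM."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    # 1. Initialization: random means drawn from the data, identity covariances, equal weights.
    means = X[rng.choice(N, size=K, replace=False)]
    covs = np.stack([np.eye(D) for _ in range(K)])
    pis = np.full(K, 1.0 / K)
    prev_ll = -np.inf
    for _ in range(max_iter):
        gamma = e_step(X, pis, means, covs)      # 2. E-step: responsibilities
        pis, means, covs = m_step(X, gamma)      # 3. M-step: updated parameters
        ll = log_likelihood(X, pis, means, covs)
        if abs(ll - prev_ll) < tol:              # 4. repeat until convergence
            break
        prev_ll = ll
    return pis, means, covs, gamma
```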
Example: Gaussian Mixtures for clustering

The same dataset considered in the neural network example is reconsidered. The data are treated as an incomplete dataset, i.e. it is assumed that the cluster assignments are not known.

A fixed number of clusters $K$ is considered. The means are randomly initialized, the mixing coefficients are taken to be equal, and the covariances are taken to be identity. The EM iterations described above are performed until convergence. The video below shows the Gaussian mixture model along with the means of the clusters for each iteration of the EM algorithm. All clusters are plotted using the same color to indicate that the dataset is incomplete.
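As a rough, hypothetical stand-in for the animation (the actual dataset and plotting code are linked below), one could run the sketch above on synthetic data and plot a single frame:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical synthetic data: three Gaussian blobs (not the dataset used in this post).
rng = np.random.default_rng(0)
blob_means = np.array([[0.0, 0.0], [3.0, 3.0], [-3.0, 3.0]])
X = np.vstack([rng.multivariate_normal(m, np.eye(2), size=100) for m in blob_means])

pis, means, covs, gamma = fit_gmm(X, K=3, seed=0)

plt.scatter(X[:, 0], X[:, 1], c='gray', s=10)                       # incomplete data, one color
plt.scatter(means[:, 0], means[:, 1], c='red', marker='x', s=100)   # fitted means
plt.show()
```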

Source

For reference, the means of the Gaussian distributions from which the clusters were drawn were .

References

Suggested general reference books: