This is the fourth part of a series of notes taken while reading the Pattern Recognition and Machine Learning (Springer, 2006) book, and it includes simple examples with implementations for demonstration. As in the first, second, and third posts, the goal of these posts is not to be rigorous, comprehensive, or to provide any new or advanced information. The target audience was first and foremost myself, but these notes are made available for anyone else who happens to find them useful.
Some contents are hidden away from the default view inside modal dialogs where no page redirection takes place. These links are underlined. Clicking outside the window, clicking the close button, or clicking the back button in the browser will close the window.
A basic background in mathematics, including basic calculus, basic linear algebra, exposure to optimization, and basic statistics, is assumed; see the suggested readings at the end of this page. If your statistics is rusty, fear not, as the author of the PR & ML book writes:
All snippets are done using Python 3 using only the numpy and matplotlib add-on packages. Also, the ffmpeg package needs to be installed to create the videos.
Reference: Sections 9.2 and 9.3
Initially, the discussion will be limited to Gaussian mixture models with the Expectation Maximization (EM) algorithm previously introduced in Part 2.
Given a dataset $\{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$, assign each data point to a different class or component of a multimodal distribution.
Consider Gaussian mixture models composed of $K$ components constructed through a linear superposition of Gaussian distributions

$$p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k),$$

where it is required that $0 \le \pi_k \le 1$ and $\sum_{k=1}^{K} \pi_k = 1$ for the above model to be a valid distribution. The coefficients $\pi_k$ above denote the contribution of the $k$-th Gaussian and are referred to as the mixing coefficients.
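As a concrete illustration, the following is a minimal numpy sketch (not the code used for the results in this post) of evaluating such a mixture density; the helper names gaussian_pdf and mixture_pdf as well as the two-component parameter values are made up purely for demonstration.

```python
# Minimal sketch: evaluate p(x) = sum_k pi_k N(x | mu_k, Sigma_k) for a 2-D example.
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Density of a multivariate Gaussian N(x | mu, sigma) at a single point x."""
    d = mu.shape[0]
    diff = x - mu
    expo = -0.5 * diff @ np.linalg.solve(sigma, diff)
    norm = np.sqrt((2.0 * np.pi) ** d * np.linalg.det(sigma))
    return np.exp(expo) / norm

def mixture_pdf(x, pis, mus, sigmas):
    """Linear superposition of K Gaussians with mixing coefficients pis."""
    return sum(pi * gaussian_pdf(x, mu, sigma)
               for pi, mu, sigma in zip(pis, mus, sigmas))

# Made-up example with K = 2 components in 2 dimensions.
pis = np.array([0.4, 0.6])                       # mixing coefficients, sum to 1
mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
sigmas = [np.eye(2), 2.0 * np.eye(2)]
print(mixture_pdf(np.array([1.0, 1.0]), pis, mus, sigmas))
```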
Similar to the previous section, consider the introduction of a $K$-dimensional binary vector $\mathbf{z}$ for a data point $\mathbf{x}$ such that $z_k \in \{0, 1\}$ (and consequently $\sum_k z_k = 1$) for a 1-of-$K$ coding scheme to indicate which class a data point belongs to. The marginal distribution over $\mathbf{z}$ is then specified in terms of the mixing coefficients such that

$$p(z_k = 1) = \pi_k, \qquad p(\mathbf{z}) = \prod_{k=1}^{K} \pi_k^{z_k}.$$
Note that $\sum_{\mathbf{z}} p(\mathbf{z}) = \sum_{k=1}^{K} \pi_k = 1$ as required. The conditional distribution of a data point $\mathbf{x}$ for a given $\mathbf{z}$ is written as

$$p(\mathbf{x} \mid z_k = 1) = \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k), \qquad p(\mathbf{x} \mid \mathbf{z}) = \prod_{k=1}^{K} \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)^{z_k},$$
and the marginalized distribution

$$p(\mathbf{x}) = \sum_{\mathbf{z}} p(\mathbf{z}) \, p(\mathbf{x} \mid \mathbf{z}) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k).$$
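The generative view implied by the latent variable $\mathbf{z}$ can be sketched as follows: first draw which component is active with probabilities given by the mixing coefficients, then draw $\mathbf{x}$ from the selected Gaussian. The snippet below is illustrative only; the parameter values are made up and the function name sample_mixture is not part of any library.

```python
# Minimal sketch of sampling from the generative model: z first, then x | z.
import numpy as np

rng = np.random.default_rng(0)

pis = np.array([0.4, 0.6])                        # mixing coefficients, sum to 1
mus = np.array([[0.0, 0.0], [3.0, 3.0]])          # component means (K x D)
sigmas = np.array([np.eye(2), 2.0 * np.eye(2)])   # component covariances (K x D x D)

def sample_mixture(n):
    # 1-of-K latent variable: pick component index k with probability pi_k.
    ks = rng.choice(len(pis), size=n, p=pis)
    # Conditional distribution p(x | z_k = 1) = N(x | mu_k, Sigma_k).
    x = np.array([rng.multivariate_normal(mus[k], sigmas[k]) for k in ks])
    return x, ks

x, z = sample_mixture(5)
print(x)
```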
Note that each component distribution has independent unknown parameters $\boldsymbol{\mu}_k$, $\boldsymbol{\Sigma}_k$, and $\pi_k$. As seen before, these quantities may be determined by maximum likelihood. Let $\mathbf{X}$ denote the training set where each row $\mathbf{x}_n^{\mathsf{T}}$ contains a single data point. Similarly, let $\mathbf{Z}$ denote the corresponding latent variables with rows $\mathbf{z}_n^{\mathsf{T}}$.
Consider now the maximization of the log likelihood of the complete data set $\{\mathbf{X}, \mathbf{Z}\}$, given by

$$\ln p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}, \boldsymbol{\pi}) = \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk} \left\{ \ln \pi_k + \ln \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right\},$$
subject to the constraint $\sum_{k=1}^{K} \pi_k = 1$.
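For illustration, a minimal numpy sketch of evaluating this complete-data log likelihood is given below, assuming the 1-of-$K$ assignments $\mathbf{Z}$ are known; the helper names and the tiny example values are made up for demonstration.

```python
# Minimal sketch: complete-data log likelihood
#   sum_n sum_k z_nk { ln pi_k + ln N(x_n | mu_k, Sigma_k) }.
import numpy as np

def log_gaussian(X, mu, sigma):
    """Log density of N(x | mu, sigma) for each row of X."""
    d = mu.shape[0]
    diff = X - mu
    maha = np.einsum('ni,ij,nj->n', diff, np.linalg.inv(sigma), diff)
    return -0.5 * (d * np.log(2.0 * np.pi) + np.log(np.linalg.det(sigma)) + maha)

def complete_data_log_likelihood(X, Z, pis, mus, sigmas):
    # Z is N x K with exactly one 1 per row (1-of-K coding).
    ll = 0.0
    for k in range(len(pis)):
        ll += np.sum(Z[:, k] * (np.log(pis[k]) + log_gaussian(X, mus[k], sigmas[k])))
    return ll

# Tiny made-up example: two points, two components.
X = np.array([[0.1, -0.2], [3.2, 2.9]])
Z = np.array([[1, 0], [0, 1]])
pis = np.array([0.5, 0.5])
mus = np.array([[0.0, 0.0], [3.0, 3.0]])
sigmas = np.array([np.eye(2), np.eye(2)])
print(complete_data_log_likelihood(X, Z, pis, mus, sigmas))
```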
Next, we proceed using the Expectation Maximization algorithm discussed in Part 2. As a reminder, consider the maximization of the expected value of the complete-data log likelihood under the conditional (posterior) distribution of the latent variables:

$$\mathbb{E}_{\mathbf{Z}} \left[ \ln p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}, \boldsymbol{\pi}) \right],$$
where the expectation is taken with respect to

$$p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\mu}, \boldsymbol{\Sigma}, \boldsymbol{\pi}) \propto \prod_{n=1}^{N} \prod_{k=1}^{K} \left[ \pi_k \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right]^{z_{nk}}.$$
As a reminder, $z_{nk}$ corresponds to the indicator term for the $k$-th class of the $n$-th point, the data points are independent, and each point can be assigned to only one class. When computing the expected value of the complete-data log likelihood, the expected value of $z_{nk}$ under this distribution (the responsibility) is

$$\gamma(z_{nk}) \equiv \mathbb{E}[z_{nk}] = \frac{\pi_k \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}.$$
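A minimal numpy sketch of this E-step computation is given below; the helper functions and example values are made up for illustration and are not the exact code used for the video later in this post.

```python
# Minimal sketch of the E-step: gamma(z_nk) is the posterior probability that
# point n was generated by component k.
import numpy as np

def gaussian_pdf(X, mu, sigma):
    """Density of N(x | mu, sigma) evaluated at each row of X."""
    d = mu.shape[0]
    diff = X - mu
    maha = np.einsum('ni,ij,nj->n', diff, np.linalg.inv(sigma), diff)
    return np.exp(-0.5 * maha) / np.sqrt((2.0 * np.pi) ** d * np.linalg.det(sigma))

def e_step(X, pis, mus, sigmas):
    """Return the N x K matrix of responsibilities gamma(z_nk)."""
    weighted = np.column_stack([pis[k] * gaussian_pdf(X, mus[k], sigmas[k])
                                for k in range(len(pis))])
    return weighted / weighted.sum(axis=1, keepdims=True)

# Tiny made-up example with two components.
X = np.array([[0.0, 0.1], [3.1, 2.8], [0.2, -0.1]])
pis = np.array([0.5, 0.5])
mus = np.array([[0.0, 0.0], [3.0, 3.0]])
sigmas = np.array([np.eye(2), np.eye(2)])
print(e_step(X, pis, mus, sigmas))   # each row sums to 1
```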
The expected value of the complete-data log likelihood under this distribution, with the constraint $\sum_k \pi_k = 1$ built in through a Lagrange multiplier $\lambda$, then becomes

$$\sum_{n=1}^{N} \sum_{k=1}^{K} \gamma(z_{nk}) \left\{ \ln \pi_k + \ln \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right\} + \lambda \left( \sum_{k=1}^{K} \pi_k - 1 \right).$$
Maximization of the above objective function with respect to $\boldsymbol{\mu}_k$, $\boldsymbol{\Sigma}_k$, and $\pi_k$ leads to

$$\boldsymbol{\mu}_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) \, \mathbf{x}_n, \qquad \boldsymbol{\Sigma}_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) \, (\mathbf{x}_n - \boldsymbol{\mu}_k)(\mathbf{x}_n - \boldsymbol{\mu}_k)^{\mathsf{T}}, \qquad \pi_k = \frac{N_k}{N},$$

where $N_k = \sum_{n=1}^{N} \gamma(z_{nk})$.
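The corresponding M-step updates can be sketched as follows; this is again an illustrative implementation with made-up example inputs, not the exact code used for the results below.

```python
# Minimal sketch of the M-step updates derived above, given the
# N x K matrix of responsibilities from the E-step.
import numpy as np

def m_step(X, gamma):
    """Return updated (pis, mus, sigmas) from data X (N x D) and responsibilities gamma (N x K)."""
    N, D = X.shape
    Nk = gamma.sum(axis=0)                # effective number of points per component
    pis = Nk / N                          # new mixing coefficients
    mus = (gamma.T @ X) / Nk[:, None]     # new means
    sigmas = np.empty((gamma.shape[1], D, D))
    for k in range(gamma.shape[1]):
        diff = X - mus[k]
        sigmas[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]   # new covariances
    return pis, mus, sigmas

# Made-up data and responsibilities for demonstration.
X = np.array([[0.0, 0.1], [3.1, 2.8], [0.2, -0.1]])
gamma = np.array([[0.9, 0.1], [0.05, 0.95], [0.8, 0.2]])
print(m_step(X, gamma))
```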
The Expectation Maximization algorithm thus reduces to the following steps (a minimal sketch of the full loop is given after the list):
Initialize the means $\boldsymbol{\mu}_k$, covariances $\boldsymbol{\Sigma}_k$, and mixing coefficients $\pi_k$.
Using the current parameter values, evaluate the expected values of the indicator variables (responsibilities) $\gamma(z_{nk})$.
Using the evaluated responsibilities from the E-step, evaluate new parameter values $\boldsymbol{\mu}_k$, $\boldsymbol{\Sigma}_k$, and $\pi_k$ using the equations above. Repeat the last two steps until convergence.
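Putting the pieces together, the following is a minimal, self-contained numpy sketch of the full EM loop described above. The synthetic two-blob dataset, the choice $K = 2$, and the convergence tolerance are all made-up choices for demonstration and differ from the experiment discussed below.

```python
# Minimal self-contained sketch of EM for a Gaussian mixture model.
import numpy as np

def gaussian_pdf(X, mu, sigma):
    d = mu.shape[0]
    diff = X - mu
    maha = np.einsum('ni,ij,nj->n', diff, np.linalg.inv(sigma), diff)
    return np.exp(-0.5 * maha) / np.sqrt((2.0 * np.pi) ** d * np.linalg.det(sigma))

def run_em(X, K, n_iter=100, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    N, D = X.shape
    # Initialization: means picked at random from the data, identity covariances,
    # uniform mixing coefficients.
    mus = X[rng.choice(N, size=K, replace=False)]
    sigmas = np.array([np.eye(D) for _ in range(K)])
    pis = np.full(K, 1.0 / K)
    prev_ll = -np.inf
    for _ in range(n_iter):
        # E-step: responsibilities gamma(z_nk).
        weighted = np.column_stack([pis[k] * gaussian_pdf(X, mus[k], sigmas[k])
                                    for k in range(K)])
        ll = np.sum(np.log(weighted.sum(axis=1)))     # incomplete-data log likelihood
        gamma = weighted / weighted.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters from the responsibilities.
        Nk = gamma.sum(axis=0)
        pis = Nk / N
        mus = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mus[k]
            sigmas[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]
        if np.abs(ll - prev_ll) < tol:                # convergence check
            break
        prev_ll = ll
    return pis, mus, sigmas, gamma

# Made-up demonstration data: two Gaussian blobs.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal([0, 0], 1.0, size=(100, 2)),
               rng.normal([4, 4], 1.0, size=(100, 2))])
pis, mus, sigmas, gamma = run_em(X, K=2)
print(mus)
```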
The same dataset considered in the neural network example is reconsidered. The clusters are treated as an incomplete dataset, i.e., it is assumed that the cluster assignments are not known.
A fixed number of clusters $K$ is considered. The means are randomly initialized, the mixing coefficients are taken to be uniform, and the covariances are taken to be the identity. The EM iterations described above are performed until convergence. The video below shows the Gaussian mixture model along with the means of the clusters for each iteration of the EM algorithm. All clusters are plotted using the same color to indicate that the dataset is incomplete.
For reference, the means of the Gaussian distributions from which the clusters were drawn were .
Suggested general reference books:
Bishop, Christopher M. Pattern Recognition and Machine Learning. Springer, 2006.
Base reference for content of this page, though certainly not the best book that I have read.
Boyd, Stephen, and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
A classic and a must-read introductory book on (convex) optimization for everyone regardless of their background. A PDF copy is also available on the Stanford webpage as of this writing.
Greenberg, Michael D. Foundations of Applied Mathematics. Courier Corporation, 2013.
Good reference book or refresher on applied math.
Bulmer, Michael George. Principles of Statistics. Courier Corporation, 1979.
Very easy read and short refresher on basics of statistics.