This is the second part of a series of notes taken while reading the Pattern Recognition and Machine Learning (Springer, 2006) book, together with simple example implementations for demonstration. As in the first post, the goal of these posts is not to be rigorous, comprehensive, or to provide any new or advanced information. The target audience was first and foremost myself, but these notes are made available for anyone else who happens to find them useful.
Some content is hidden from the default view inside modal dialogs, where no page redirection takes place. These links are underlined (see e.g. here). Clicking outside the window, clicking the close button, or clicking the back button in the browser will close the window.
A basic background in mathematics is assumed, including basic calculus, basic linear algebra, some exposure to optimization, and basic statistics; see the suggested readings at the end of this page. If your statistics is rusty, fear not, as the author of the PR&ML book writes:
All snippets are written in Python 3, using only the numpy and matplotlib add-on packages.
Reference: Section 3.5 of PR&ML book
In the treatment of the regression problem in the first post, the model complexity and the hyperparameters were assumed to be known ahead of time. An extension of the approach described there can be used to learn these parameters during the learning phase.
In a full Bayesian treatment of the problem, prior probabilities would be introduced over the hyperparameters. This approach will be revisited later, but for now consider the maximization of the marginalized likelihood function. Recall that the likelihood function of the data set was defined to be
where
is the Gaussian distribution normalization constant. Also, the prior distribution introduced over model weights was given by
where
is the Gaussian normalization constant for the prior distribution. Then, the marginalized likelihood function, given the training set alone, is
The integral can be evaluated the same way as in Part 1 of the discussion, namely by expanding and completing the square, or by using the general results for the joint distribution (see notes in Part 1). Consider expanding it out once again as an exercise. Expanding the exponential term, excluding the preceding factor, gives
The last line follows by completing the square, and is defined as
As a side note, with a little advance insight, the last two terms can be rewritten as (as is done in the PR&ML book)
Isolating the terms for integration, it follows that the original integral can be written as
In the above, the unit-integral property of the multivariate Gaussian distribution is used. As before, it may be convenient to consider the log of the marginal likelihood function:
Note that the quantities appearing in the above expression themselves depend on the hyperparameters, so varying the hyperparameters affects the value of the log marginal likelihood.
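As a concrete illustration (not part of the derivation), the log marginal likelihood can be evaluated numerically. The snippet below is a minimal sketch assuming a polynomial basis; it follows the standard evidence expression from Section 3.5 of the PR&ML book, and the helper names (design_matrix, log_evidence) are my own rather than anything used elsewhere in this series.

```python
import numpy as np

def design_matrix(x, M):
    """Polynomial design matrix with columns x**0, ..., x**M."""
    return np.vander(x, M + 1, increasing=True)

def log_evidence(x, t, M, alpha, beta):
    """Log marginal likelihood ln p(t | alpha, beta) for a polynomial model of order M.

    Follows the standard evidence expression of Section 3.5 of PR&ML:
    M/2 ln(alpha) + N/2 ln(beta) - E(m_N) - 1/2 ln|A| - N/2 ln(2 pi).
    """
    Phi = design_matrix(x, M)
    N, num_basis = Phi.shape
    A = alpha * np.eye(num_basis) + beta * Phi.T @ Phi     # posterior precision
    m_N = beta * np.linalg.solve(A, Phi.T @ t)             # posterior mean of the weights
    E_mN = (beta / 2.0) * np.sum((t - Phi @ m_N) ** 2) \
           + (alpha / 2.0) * m_N @ m_N                     # regularized sum-of-squares error
    _, logdet_A = np.linalg.slogdet(A)
    return (num_basis / 2.0) * np.log(alpha) + (N / 2.0) * np.log(beta) \
           - E_mN - 0.5 * logdet_A - (N / 2.0) * np.log(2.0 * np.pi)
```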
Consider now the maximization of the log-likelihood with respect to the hyperparameters α and β. Recall that if λ_i are the eigenvalues of the symmetric matrix βΦᵀΦ, then α + λ_i are the eigenvalues of αI + βΦᵀΦ. The log-likelihood can thus be expressed as
where λ_i are the eigenvalues of βΦᵀΦ, as above. Optimizing with respect to α requires
or,
leading to an implicit expression for α
Consider next the maximization of the log-likelihood with respect to β. First, note that if the eigenvalues of βΦᵀΦ are λ_i, then the eigenvalues of ΦᵀΦ are λ_i/β, so that dλ_i/dβ = λ_i/β. Then,
and maximization of the log-likelihood requires
Rearranging,
which is again an implicit expression for β.
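A naive fixed-count iteration of these two implicit expressions, of the kind used in the example below, might look as follows. This is only a sketch: it reuses the hypothetical design_matrix helper from the previous snippet, and the update formulas are the standard ones from Section 3.5 of the PR&ML book.

```python
import numpy as np

def update_hyperparameters(x, t, M, alpha=1.0, beta=1.0, n_iter=20):
    """Naive fixed-count iteration of the implicit expressions for alpha and beta."""
    Phi = design_matrix(x, M)
    N = len(t)
    # eigenvalues of beta * Phi^T Phi are beta times the eigenvalues of Phi^T Phi
    eig_gram = np.linalg.eigvalsh(Phi.T @ Phi)
    for _ in range(n_iter):
        lam = beta * eig_gram
        A = alpha * np.eye(Phi.shape[1]) + beta * Phi.T @ Phi
        m_N = beta * np.linalg.solve(A, Phi.T @ t)
        gamma = np.sum(lam / (alpha + lam))                 # effective number of parameters
        alpha = gamma / (m_N @ m_N)                         # implicit expression for alpha
        beta = (N - gamma) / np.sum((t - Phi @ m_N) ** 2)   # implicit expression for beta
    return alpha, beta
```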
In this example, the noisy dataset generated from a sine function is reconsidered, but this time with the goal of gaining insight into the optimal model complexity.
Two separate sets of prior and model precision hyperparameters are considered in the figures below. For the first choice of hyperparameters, it is not evident which model complexity maximizes the log-likelihood. From the second set of graphs, it may be seen that the likelihood is highest at one particular model complexity. Note that this value is determined strictly by looking at the training set. For comparison, and while not needed for evaluating the log-likelihood, graphs of models with different complexities trained on the data set using the same hyperparameters are also shown.
The model complexity indicated as optimal above is considered next, and the hyperparameters are optimized. A simple and naive fixed-count iteration is performed to optimize them. In the figures below, the values obtained after the prescribed number of iterations are used to re-evaluate the log-likelihood for various complexities, and the resulting model for the optimal complexity is also shown.
The log-likelihood re-evaluated for various complexities now shows a much more pronounced preference for that model complexity.
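Putting the two sketches above together, this kind of comparison could be reproduced along the following lines; the sine-plus-noise data generation, the seed, and the range of complexities are only illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 10)                        # illustrative training inputs
t = np.sin(2.0 * np.pi * x) + rng.normal(scale=0.25, size=x.shape)

for M in range(8):                                   # candidate model complexities
    alpha, beta = update_hyperparameters(x, t, M)
    print(M, log_evidence(x, t, M, alpha, beta))
```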
Reference: Section 9.3 of PR&ML book
In the Bayesian treatment of the problem in the previous post, the hyperparameters in the prior distribution were assumed to be known a priori. The previous section showed an example of how these hyperparameters can be learned from the training data, e.g. by maximizing the (log) likelihood function. This section looks at an alternative approach to learning the hyperparameters by maximizing the expected joint distribution through what is known as Expectation Maximization (EM).
Expectation maximization is a more general concept, as will be seen later, and, as the name suggests, is an algorithm for finding maximum likelihood solutions for models having latent variables. Within the context of the Bayesian treatment of the regression problem, the latent variables are the weights, and the model parameters are the hyperparameters α and β.
Specifically, consider the full observation dataset denoted by X:
where x_n denotes the n-th observation point. Similarly, the latent (unobserved) values are denoted by Z (e.g. the weights) and the model parameters are denoted by θ (e.g. the model mean and covariance).
The log likelihood function of the observed data is expressed as a marginalization of the joint distribution of the observed and latent variables.
The data set {X, Z} is referred to as the complete data set, and the data set X as the incomplete data set.
Once again, within the context of regression, the latent variables are the weights and are not known in advance. Under EM, since the complete data set is (generally) not known, the expected value of the complete-data log likelihood under the posterior distribution of the latent variables is considered instead. Namely, the problem statement may be expressed as
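For reference, this expected complete-data log likelihood can be written, in the notation of Section 9.3 of the PR&ML book (which may differ slightly from the symbols used on this page), as

```latex
\mathcal{Q}(\boldsymbol{\theta}, \boldsymbol{\theta}^{\mathrm{old}})
  = \sum_{\mathbf{Z}} p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{\mathrm{old}})
    \, \ln p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta}) ,
```

and the M step below maximizes this quantity with respect to θ.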
The Expectation Maximization (EM) algorithm staggers the individual components of this objective, using the previous values of the model parameters at each step; each cycle is guaranteed to increase the incomplete-data log likelihood (unless it is already at a local maximum). The steps are summarized below and are followed by a schematic sketch in code.
1. Initialize the model parameters.
2. E step: evaluate the posterior distribution of the latent variables.
3. M step: update the model parameters by maximizing the expected value of the log likelihood.
4. Check for convergence. If not converged, replace the previous parameter values with the updated ones and go to Step 2.
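Schematically, and only as a sketch (the e_step, m_step, and log_likelihood callables and the tolerance are placeholders rather than a prescribed interface), the loop could be written as:

```python
def expectation_maximization(theta, e_step, m_step, log_likelihood,
                             tol=1e-6, max_iter=100):
    """Generic EM loop: alternate E and M steps until the incomplete-data
    log likelihood stops improving."""
    previous = float("-inf")
    for _ in range(max_iter):
        posterior = e_step(theta)        # E step: evaluate p(Z | X, theta_old)
        theta = m_step(posterior)        # M step: maximize the expected log likelihood
        current = log_likelihood(theta)
        if abs(current - previous) < tol:    # convergence check (Step 4)
            break
        previous = current
    return theta
```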
Reference: Section 9.3.4 of PR&ML book
For the linear regression problem, following the EM algorithm above, implicit expressions can be obtained for the hyperparameters, using their previous values together with the posterior and prior distributions given above.
For the linear regression problem, the latent (unobserved) variables are the model weights, and the model parameters are the hyperparameters α and β. The complete-data (joint) log-likelihood is given by
where, as before,
Recall that the posterior distribution within the Bayesian treatment was taken to be
where
The model parameters are evaluated by maximizing the expected log-likelihood of the joint distribution:
The expectation can be evaluated as
Optimizing with respect to α leads to
Similarly, optimizing with respect to β leads to
or,
The expectation can be evaluated by using the standard result for the second moment of the weights with respect to the posterior distribution,
leading to
Note that the above expressions for α and β are implicit, as before, and their solution requires some iteration.
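A sketch of the resulting iteration, again with illustrative helper names and a design matrix Phi built as in the earlier snippets, might look like the following; the update formulas are the standard EM results for this model (Section 9.3.4 of the PR&ML book).

```python
import numpy as np

def em_hyperparameters(Phi, t, alpha=1.0, beta=1.0, n_iter=50):
    """EM updates of alpha and beta for Bayesian linear regression.

    E step: posterior N(w | m_N, S_N); M step: the implicit expressions above.
    """
    N, M = Phi.shape
    for _ in range(n_iter):
        # E step: posterior mean and covariance of the weights
        S_N = np.linalg.inv(alpha * np.eye(M) + beta * Phi.T @ Phi)
        m_N = beta * S_N @ Phi.T @ t
        # M step: maximize the expected complete-data log likelihood
        alpha = M / (m_N @ m_N + np.trace(S_N))
        beta = N / (np.sum((t - Phi @ m_N) ** 2) + np.trace(Phi.T @ Phi @ S_N))
    return alpha, beta
```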
Reference: Section 10.1 of PR&ML book
Consider now a full Bayesian model where a prior distribution is introduced over all variables, including the hyperparameters. The observation dataset will be denoted by X and the latent variables (including the hyperparameters, which will be assumed to be stochastic) will be denoted by Z. The probabilistic model specifies the joint distribution of the observed and latent variables, i.e. p(X, Z), and the objective is to find (an approximation to) the posterior distribution p(Z | X) as well as the model evidence p(X).
It will be instructive to note that the log marginal distribution can be expressed as
where KL(q ∥ p) is the Kullback-Leibler divergence and L(q) is a lower bound on ln p(X).
A key result in variational inference follows from considering the distribution q(Z) to be separable as
where the groups Z_i are disjoint. The lower bound under this distribution becomes
where . From the above, it can be found that maximization of the lower bound with respect to a single factor requires that
The above is an implicit expression showing that the log of the factor that maximizes the lower bound on the marginalized log distribution is obtained by taking the expectation of the log of the joint distribution over all variables with respect to all the other factors q_i, i ≠ j.
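For reference, this general mean-field result, as given in Section 10.1 of the PR&ML book, can be written as

```latex
\ln q_j^{\ast}(\mathbf{Z}_j)
  = \mathbb{E}_{i \neq j}\!\left[ \ln p(\mathbf{X}, \mathbf{Z}) \right] + \mathrm{const} ,
```

where the expectation is taken with respect to the factors q_i for all i ≠ j, and the additive constant is fixed by normalization.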
Reference: Section 10.3 of PR&ML book
Reconsider the linear regression problem where the likelihood and weight priors are repeated for convenience:
This time, priors are introduced over α and β. Since the conjugate prior for the precision of a Gaussian distribution with known mean is a Gamma distribution, the priors over α and β are taken to be
As above, the marginalized distribution over the latent variables is decomposed into a Kullback-Leibler divergence and a lower bound,
with the optimal factors determined using
More specifically, in this context, the joint distribution is
Following the discussion above, the distribution q, with some abuse of notation, is assumed to factorize as
The optimal individual factors in the above expression are determined using the general result above:
In the above, the constant denotes any terms that do not depend on the variable of interest. Comparing the above expression with the log of a Gamma distribution indicates that the resulting distribution (as expected, due to the choice of prior) is a Gamma distribution given by
Repeating the same process for the weights:
where the constant again collects the terms that are not a function of the weights, and the last line follows by completing the square, with
The corresponding distribution therefore is
Finally,
Evaluating each integral is straightforward:
Since both of the relevant distributions are Gamma distributions, the evaluation of this expectation proceeds identically, giving
Combining the terms, and ignoring the integrals that result in constant terms, gives
Similar to the first factor, this indicates that the optimum distribution is once again a Gamma distribution:
With the three factors determined (a Gaussian distribution for the weights and Gamma distributions for the two precision hyperparameters), the expected values appearing in the above expressions can be evaluated using the standard result for the first moment of a Gamma distribution, together with the earlier results for Gaussian distributions:
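Putting these together, a minimal sketch of the full set of coupled updates might look as follows. The Gamma prior parameters a0, b0, c0, d0 and the helper names are illustrative choices of mine, and the updates follow the standard conjugate results for this factorization, iterated until the expectations stop changing.

```python
import numpy as np

def variational_linear_regression(Phi, t, a0=1e-4, b0=1e-4, c0=1e-4, d0=1e-4,
                                  n_iter=100):
    """Mean-field updates for q(w) q(alpha) q(beta) in Bayesian linear regression."""
    N, M = Phi.shape
    E_alpha, E_beta = a0 / b0, c0 / d0               # initial first moments of the Gammas
    for _ in range(n_iter):
        # q(w): Gaussian with precision E[alpha] I + E[beta] Phi^T Phi
        S_N = np.linalg.inv(E_alpha * np.eye(M) + E_beta * Phi.T @ Phi)
        m_N = E_beta * S_N @ Phi.T @ t
        # q(alpha): Gamma with updated shape and rate
        a_N = a0 + M / 2.0
        b_N = b0 + 0.5 * (m_N @ m_N + np.trace(S_N))
        # q(beta): Gamma with updated shape and rate
        c_N = c0 + N / 2.0
        d_N = d0 + 0.5 * (np.sum((t - Phi @ m_N) ** 2)
                          + np.trace(Phi.T @ Phi @ S_N))
        # first moment of a Gamma(a, b) distribution is a / b
        E_alpha, E_beta = a_N / b_N, c_N / d_N
    return m_N, S_N, E_alpha, E_beta
```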
For completeness, the predictive distribution is considered, as in Part 1. Recall that the goal is, given a new observation point, to make predictions of the corresponding target value. The model weights are marginalized as before:
where the last equality follows using the same approach(es) discussed in Part 1. Note that integrating over the priors on the hyperparameters can quickly become intractable; these integrals have therefore been left out, and the hyperparameters are assumed to have been learned from the training set.
The same noisy dataset generated from a sine function is reconsidered once again. In this example, optimal estimates of the hyperparameters are determined using variational inference.
Note that the converged values for the hyperparameters are different from those obtained using the Evidence Approximation (with no priors introduced over the hyperparameters). The log-likelihood, evaluated using the same equations as in the previous example, still indicates the same optimum complexity for this dataset. Also note that for the optimal model complexity, the evaluated log-likelihood is still close to the previously obtained value.
The second graph shows the mean and standard deviation of the prediction using the predictive distribution, where the hyperparameters are those learned from the dataset using variational inference. At each point, the preceding results are used to evaluate the mean and the standard deviation using
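As an illustration of how such a graph could be produced, the snippet below reuses design_matrix, the training data x and t, and a chosen complexity M from the earlier sketches, and treats the variance expression 1/E[β] + φ(x)ᵀ S_N φ(x) as an approximation in the spirit of Part 1.

```python
import numpy as np
import matplotlib.pyplot as plt

# fit with the variational sketch above (x, t, and M as in the earlier example)
m_N, S_N, E_alpha, E_beta = variational_linear_regression(design_matrix(x, M), t)

x_grid = np.linspace(0.0, 1.0, 200)
Phi_grid = design_matrix(x_grid, M)                   # same basis as the training fit
mean = Phi_grid @ m_N                                 # predictive mean
std = np.sqrt(1.0 / E_beta + np.sum(Phi_grid @ S_N * Phi_grid, axis=1))

plt.plot(x_grid, mean, label="predictive mean")
plt.fill_between(x_grid, mean - std, mean + std, alpha=0.3, label="+/- 1 std dev")
plt.legend()
plt.show()
```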
| Approach | Notes |
| --- | --- |
| Evidence Approximation | λ_i are the eigenvalues of the Gram matrix. |
| Expectation Maximization | M is the model complexity and N is the training dataset count; see above for the other expressions. |
| Variational Linear Regression | The Gamma prior distribution parameters are prescribed and known. |
Show that with respect to the distribution .
Solution
Show that with respect to the distribution .
Solution
Show that if are the eigenvalues of the symmetric matrix , then are the eigenvalues of .
Solution
The covariance matrix of the Gaussian distribution may be taken to be symmetric.
Solution
Show that
Solution
Show that can be expressed as
Solution
Show that the conjugate prior of the Gaussian likelihood function with known variance is Gaussian.
Solution
Suggested general reference books:
Bishop, Christopher M. Pattern Recognition and Machine Learning. Springer, 2006.
Base reference for content of this page, though certainly not the best book that I have read.
Boyd, Stephen, and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
A classic and a must-read introductory book on (convex) optimization for everyone, regardless of their background. A PDF copy is also available on the Stanford webpage as of this writing.
Greenberg, Michael D. Foundations of Applied Mathematics. Courier Corporation, 2013.
Good reference book or refresher on applied math.
Bulmer, Michael George. Principles of Statistics. Courier Corporation, 1979.
Very easy read and short refresher on basics of statistics.