Introduction to PR and ML (Part 2): Regression

2019-09-08

This is the second part of a series of notes taken while reading the Pattern Recognition and Machine Learning (Springer, 2006) book, and it includes simple examples with implementations for demonstration. As in the first post, the goal of these posts is not to be rigorous, comprehensive, or to provide any new or advanced information. The target audience was first and foremost myself, but these notes are made available for anyone else who happens to find them useful.

Some contents are hidden away from the default view inside modal dialogs where no page redirection takes place. These links are underlined (see e.g. here). Clicking outside the window, clicking the close button, or clicking the back button in the browser will close the window.

A basic background in mathematics, including basic calculus, basic linear algebra, exposure to optimization, and basic statistics, is assumed; see the suggested readings at the end of this page. If your statistics is rusty, fear not, as the author of the PR & ML book writes:

All snippets are written in Python 3 using only the numpy and matplotlib add-on packages.

Evidence Approximation

Reference: Section 3.5 of PR&ML book

In the treatment of the regression problem in the first post, the model complexity and the hyperparameters were assumed to be known ahead of time. An extension of the approach described there can be used to learn these parameters during the learning phase.

In a full Bayesian treatment of the problem, prior probabilities would be introduced over the hyperparameters. This approach will be revisited later, but for now consider the maximization of the marginalized likelihood function. Recall that the likelihood function of the data set was defined to be

where

is the Gaussian distribution normalization constant. Also, the prior distribution introduced over model weights was given by

where

is the Gaussian normalization constant for the prior distribution. Then, the marginalized likelihood function, given the training set alone, is

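For reference, the standard forms from Section 3.5 of the book, in notation assumed to match Part 1 (design matrix Φ, target vector t, weights w, noise precision β, prior precision α, and M basis functions), are

$$
p(\mathbf{t}\mid\mathbf{w},\beta)=\left(\frac{\beta}{2\pi}\right)^{N/2}\exp\!\left(-\frac{\beta}{2}\lVert\mathbf{t}-\Phi\mathbf{w}\rVert^{2}\right),
\qquad
p(\mathbf{w}\mid\alpha)=\left(\frac{\alpha}{2\pi}\right)^{M/2}\exp\!\left(-\frac{\alpha}{2}\mathbf{w}^{\mathsf{T}}\mathbf{w}\right),
$$

$$
p(\mathbf{t}\mid\alpha,\beta)=\int p(\mathbf{t}\mid\mathbf{w},\beta)\,p(\mathbf{w}\mid\alpha)\,\mathrm{d}\mathbf{w}.
$$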
The integral can be evaluated the same way as in Part 1 of the discussion, namely expanding and completing the square, or using the general results of evaluating the joint distribution (see notes in Part 1). Consider expanding it out once again as an exercise. Expanding the exponential term excluding the preceding factor gives

The last line follows by completing the square, and is defined as

As a side note, with a little advance insight, the last two terms can be rewritten as (as they are in the PR&ML book)

Isolating the terms for integration, it follows that the original integral can be written as

In the above, the unit-integral property of the multivariate Gaussian distribution is used. As before, it may be convenient to consider the log of the marginal likelihood function:

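In the book's notation (assumed here), the completed square and the resulting log marginal likelihood take the form

$$
E(\mathbf{w})=E(\mathbf{m}_N)+\tfrac{1}{2}(\mathbf{w}-\mathbf{m}_N)^{\mathsf{T}}\mathbf{A}(\mathbf{w}-\mathbf{m}_N),
\qquad
\mathbf{A}=\alpha\mathbf{I}+\beta\Phi^{\mathsf{T}}\Phi,
\qquad
\mathbf{m}_N=\beta\mathbf{A}^{-1}\Phi^{\mathsf{T}}\mathbf{t},
$$

$$
\ln p(\mathbf{t}\mid\alpha,\beta)=\frac{M}{2}\ln\alpha+\frac{N}{2}\ln\beta-E(\mathbf{m}_N)-\frac{1}{2}\ln\lvert\mathbf{A}\rvert-\frac{N}{2}\ln(2\pi),
$$

where $E(\mathbf{m}_N)=\frac{\beta}{2}\lVert\mathbf{t}-\Phi\mathbf{m}_N\rVert^{2}+\frac{\alpha}{2}\mathbf{m}_N^{\mathsf{T}}\mathbf{m}_N$.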
Note that the quantities appearing in the above expression themselves depend on the hyperparameters, so varying the hyperparameters also changes their values.

Consider now the maximization of the log-likelihood with respect to the hyperparameters α and β. Recall that if λ_i are the eigenvalues of the symmetric matrix βΦᵀΦ, then α + λ_i are the eigenvalues of αI + βΦᵀΦ. The log-likelihood can thus be expressed as

where λ_i are the eigenvalues of βΦᵀΦ. Optimizing with respect to α requires

or,

leading to an implicit expression for α

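Writing this out explicitly in the notation assumed above, the stationarity condition defines the quantity γ and gives

$$
\gamma=\sum_{i}\frac{\lambda_i}{\alpha+\lambda_i},
\qquad
\alpha=\frac{\gamma}{\mathbf{m}_N^{\mathsf{T}}\mathbf{m}_N}.
$$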
Consider next the maximization of the log-likelihood with respect to β. First, note that if the eigenvalues of βΦᵀΦ are λ_i, then the eigenvalues of ΦᵀΦ are λ_i/β, so that dλ_i/dβ = λ_i/β. Then,

and the maximization of the log-likelihood requires

Rearranging,

which is again an implicit expression for β.

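In the same notation, the resulting re-estimation equation reads

$$
\frac{1}{\beta}=\frac{1}{N-\gamma}\sum_{n=1}^{N}\bigl(t_n-\mathbf{m}_N^{\mathsf{T}}\boldsymbol{\phi}(x_n)\bigr)^{2}.
$$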
Example: Model Complexity

In this example, the noisy dataset generated from a sine function is reconsidered, but this time with the goal of gaining insight into the optimal model complexity.

Two separate sets of prior and model precision hyperparameters are considered in the figures below. For the first choice of hyperparameters, it is not evident which model complexity maximizes the log-likelihood. From the second set of graphs, the model complexity at which the likelihood is highest can be identified. Note that this value is determined strictly by looking at the training set. For comparison, while not needed for evaluating the log likelihood, graphs of models with different complexities trained on the data set are shown using the same hyperparameters.

Source

The model complexity indicated above to be optimal is considered, and the hyperparameters are optimized. A simple and naive fixed-count iteration is performed to optimize the hyperparameters; a minimal sketch of such an iteration is shown after the source link below. In the figures below, the values after the final iteration are used to re-evaluate the log-likelihood for various complexities, and to show the resulting model for the optimal complexity.

Source
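The linked source contains the actual implementation; the following is only a minimal numpy sketch of such a fixed-count iteration and of the log-evidence evaluation, assuming a design matrix `Phi` of shape (N, M) and a target vector `t` (the function names and initial values are illustrative, not those used in the post).

```python
import numpy as np

def evidence_hyperparameters(Phi, t, alpha=1.0, beta=1.0, n_iter=20):
    """Naive fixed-count iteration of the implicit alpha/beta expressions."""
    N, M = Phi.shape
    eig0 = np.linalg.eigvalsh(Phi.T @ Phi)       # eigenvalues of Phi^T Phi
    for _ in range(n_iter):
        A = alpha * np.eye(M) + beta * Phi.T @ Phi
        m_N = beta * np.linalg.solve(A, Phi.T @ t)
        lam = beta * eig0                        # eigenvalues of beta Phi^T Phi
        gamma = np.sum(lam / (alpha + lam))
        alpha = gamma / (m_N @ m_N)
        beta = (N - gamma) / np.sum((t - Phi @ m_N) ** 2)
    return alpha, beta, m_N

def log_evidence(Phi, t, alpha, beta):
    """Log marginal likelihood ln p(t | alpha, beta)."""
    N, M = Phi.shape
    A = alpha * np.eye(M) + beta * Phi.T @ Phi
    m_N = beta * np.linalg.solve(A, Phi.T @ t)
    E_mN = beta / 2 * np.sum((t - Phi @ m_N) ** 2) + alpha / 2 * m_N @ m_N
    return (M / 2 * np.log(alpha) + N / 2 * np.log(beta) - E_mN
            - 0.5 * np.linalg.slogdet(A)[1] - N / 2 * np.log(2 * np.pi))
```

Evaluating `log_evidence` over a range of model complexities (i.e. different design matrices) with the converged hyperparameters gives the kind of comparison shown in the figures.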

The new log-likelihood evaluated for various complexities now shows an exaggerated preference for the optimal model complexity.

Expectation Maximization

Reference: Section 9.3 of PR&ML book

In the Bayesian treatment of the problem in the previous post, the hyperparameters in the prior distribution were assumed to be known a priori. The previous section showed an example of how these hyperparameters can be learned from the training data, e.g. by maximizing the (log) likelihood function. This section looks at an alternative approach of learning the hyperparameters by maximizing the expected joint distribution through what is known as Expectation Maximization (EM).

Expectation maximization is a more general concept, as will be seen later, and, as the name suggests, is an algorithm for finding maximum likelihood solutions for the model parameters of models having latent variables. Within the context of the Bayesian treatment of the regression problem, the latent variables are the weights, and the model parameters are the hyperparameters α and β.

Specifically, consider the full observation data set, denoted by X:

X = {x_1, …, x_N}

where x_n denotes the n-th observation point. Similarly, the latent (unobserved) values are denoted by Z (e.g. the weights) and the model parameters by θ (e.g. model mean and covariance).

The log likelihood function of the observed data is expressed as a marginalization of the joint distribution of the observed and latent variables.

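In the notation assumed here, this marginalization reads

$$
\ln p(\mathbf{X}\mid\boldsymbol{\theta})=\ln\left\{\sum_{\mathbf{Z}}p(\mathbf{X},\mathbf{Z}\mid\boldsymbol{\theta})\right\},
$$

with the sum replaced by an integral for continuous latent variables (as is the case for the regression weights).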
The data set {X, Z} is referred to as the complete data set, and the data set X as the incomplete data set.

Once again, within the context of regression, the latent variables are the model weights and are not known in advance. Under EM, since the complete data set is (generally) not known, the expected value of the complete-data log likelihood under the posterior distribution of the latent variables is considered. Namely, the problem statement may be expressed as

Expectation Maximization (EM)
Given the (incomplete) data set X and a joint distribution p(X, Z | θ), find the parameters θ that maximize the expected value of the complete-data log likelihood under the posterior distribution of the latent variables:

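In other words, with θ_old denoting the previous parameter values (notation assumed), the quantity being maximized is

$$
Q(\boldsymbol{\theta},\boldsymbol{\theta}^{\text{old}})=\sum_{\mathbf{Z}}p(\mathbf{Z}\mid\mathbf{X},\boldsymbol{\theta}^{\text{old}})\,\ln p(\mathbf{X},\mathbf{Z}\mid\boldsymbol{\theta}),
\qquad
\boldsymbol{\theta}^{\text{new}}=\arg\max_{\boldsymbol{\theta}}Q(\boldsymbol{\theta},\boldsymbol{\theta}^{\text{old}}).
$$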
The Expectation Maximization (EM) algorithm alternates between evaluating this expectation using the previous values of θ and maximizing it with respect to θ, which is guaranteed to increase the incomplete-data log likelihood (unless it is already at a local maximum).

Expectation Maximization Algorithm
  1. Initialize the model parameters θ_old.

  2. E Step: Evaluate the posterior distribution p(Z | X, θ_old).

  3. M Step: Update the model parameters by maximizing the expected value of the complete-data log likelihood, θ_new = argmax_θ Q(θ, θ_old).

  4. Check for convergence. If not converged, set θ_old ← θ_new and go to Step 2.

EM Bayesian Regression

Reference: Section 9.3.4 of PR&ML book

For the linear regression problem, following the EM algorithm above, implicit expressions can be obtained for the hyperparameters using the previously derived prior and posterior distributions.

For the linear regression problem, the latent (unobserved) variables are the model weights w, and the model parameters are the hyperparameters α and β. The complete-data (joint) log-likelihood is given by

where, as before,

Recall that the posterior distribution within the Bayesian treatment was taken to be

where

The model parameters are evaluated by maximizing the expected value of the complete-data (joint) log-likelihood:

The expectation can be evaluated as

Optimizing with respect to α leads to

Similarly, optimizing with respect to β leads to

or,

The expectation can be evaluated by using the result for expectations with respect to the (Gaussian) posterior distribution of the weights,

leading to

Note that the above expressions for α and β are implicit, as before, and their solution requires some iterations.
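For reference, the re-estimation equations that these implicit expressions lead to (the α update is given in Section 9.3.4 of the book; the β update is stated here under the same assumptions) are

$$
\alpha=\frac{M}{\mathbb{E}\bigl[\mathbf{w}^{\mathsf{T}}\mathbf{w}\bigr]}=\frac{M}{\mathbf{m}_N^{\mathsf{T}}\mathbf{m}_N+\operatorname{Tr}(\mathbf{S}_N)},
\qquad
\frac{1}{\beta}=\frac{1}{N}\left(\lVert\mathbf{t}-\Phi\mathbf{m}_N\rVert^{2}+\operatorname{Tr}\bigl(\Phi^{\mathsf{T}}\Phi\,\mathbf{S}_N\bigr)\right),
$$

where m_N and S_N are the posterior mean and covariance of the weights evaluated with the previous hyperparameter values.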

Variational Inference

Reference: Section 10.1 of PR&ML book

Consider now a full Bayesian model where a prior distribution is introduced over all variables, including the hyperparameters. The observation data set will be denoted by X and the latent variables (including the hyperparameters, which will be assumed to be stochastic) will be denoted by Z. The probabilistic model specifies the joint distribution of the observed and the latent variables, i.e. p(X, Z), and the objective is to find (an approximation to) the posterior distribution p(Z | X) as well as the model evidence p(X).

It will be instructive to note that the log marginal distribution can be expressed as

where KL(q ∥ p) is the Kullback-Leibler divergence between the approximating distribution q(Z) and the true posterior, and L(q) is a lower bound on ln p(X).
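Explicitly, in the standard notation of Section 10.1 (assumed here), the decomposition is

$$
\ln p(\mathbf{X})=\mathcal{L}(q)+\mathrm{KL}(q\,\Vert\,p),
$$

$$
\mathcal{L}(q)=\int q(\mathbf{Z})\ln\frac{p(\mathbf{X},\mathbf{Z})}{q(\mathbf{Z})}\,\mathrm{d}\mathbf{Z},
\qquad
\mathrm{KL}(q\,\Vert\,p)=-\int q(\mathbf{Z})\ln\frac{p(\mathbf{Z}\mid\mathbf{X})}{q(\mathbf{Z})}\,\mathrm{d}\mathbf{Z}.
$$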

A key result in variational inference follows from considering the distribution q(Z) to be separable as

where the groups of latent variables Z_i are disjoint. The lower bound under this distribution becomes

where . From the above, it can be found that maximization of the lower bound with respect to a single factor requires that

The above is an implicit expression which shows that the log of the factor that maximizes the lower bound on the marginalized log distribution is obtained by taking the expectation of the log of the joint distribution over all variables with respect to all the other factors q_i, i ≠ j.
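In symbols, this is the standard mean-field result (notation assumed to follow the book):

$$
\ln q_j^{\ast}(\mathbf{Z}_j)=\mathbb{E}_{i\neq j}\bigl[\ln p(\mathbf{X},\mathbf{Z})\bigr]+\text{const},
$$

where the constant is fixed by normalization.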

Variational Linear Regression

Reference: Section 10.3 of PR&ML book

Reconsider the linear regression problem where the likelihood and weight priors are repeated for convenience:

This time, priors are introduced over α and β. Since the conjugate prior for the precision of a Gaussian distribution with known mean is a Gamma distribution, the priors over α and β are taken to be

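Writing these out with prior parameters a_0, b_0 and c_0, d_0 (symbols assumed here; the actual choices are in the linked source), the priors are

$$
p(\alpha)=\mathrm{Gam}(\alpha\mid a_0,b_0),
\qquad
p(\beta)=\mathrm{Gam}(\beta\mid c_0,d_0),
\qquad
\mathrm{Gam}(x\mid a,b)=\frac{b^{a}}{\Gamma(a)}\,x^{a-1}e^{-bx}.
$$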
The marginalized distribution over the latent variables is decomposed into the Kullback-Leibler divergence and the lower bound as before,

with the optimal factors determined using

More specifically, in this context, the joint distribution is

Following the discussion above, the distribution q(w, α, β), with a slight abuse of notation, is assumed to factorize as

The optimal individual factors of the above expression are determined using the general result above:

In the above, the constant collects any terms that are not a function of α. Comparing the above expression with the log of the Gamma distribution indicates that the resulting expression (as expected due to the choice of the prior) is a Gamma distribution given by

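With the assumed prior parameters above, this works out to (cf. Section 10.3 of the book)

$$
q^{\ast}(\alpha)=\mathrm{Gam}(\alpha\mid a_N,b_N),
\qquad
a_N=a_0+\frac{M}{2},
\qquad
b_N=b_0+\frac{1}{2}\,\mathbb{E}\bigl[\mathbf{w}^{\mathsf{T}}\mathbf{w}\bigr].
$$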
Repeating the same process for w:

where the omitted terms are not a function of w, and the last line follows by completing the square with

The corresponding distribution therefore is

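Under the factorization assumed here (with β also treated variationally, so that its expectation appears where Section 10.3 of the book uses a known β), the resulting Gaussian is

$$
q^{\ast}(\mathbf{w})=\mathcal{N}(\mathbf{w}\mid\mathbf{m}_N,\mathbf{S}_N),
\qquad
\mathbf{S}_N=\bigl(\mathbb{E}[\alpha]\,\mathbf{I}+\mathbb{E}[\beta]\,\Phi^{\mathsf{T}}\Phi\bigr)^{-1},
\qquad
\mathbf{m}_N=\mathbb{E}[\beta]\,\mathbf{S}_N\Phi^{\mathsf{T}}\mathbf{t}.
$$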
Finally,

Evaluating each integral is straightforward:

Since the priors over α and β are both Gamma distributions, evaluation of this term is identical to

Combining the terms, and ignoring the integrals that result in terms independent of β, gives

Similar to the case of α, this indicates that the optimum distribution is once again a Gamma distribution:

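Under the same assumptions, the analogous result for β is

$$
q^{\ast}(\beta)=\mathrm{Gam}(\beta\mid c_N,d_N),
\qquad
c_N=c_0+\frac{N}{2},
\qquad
d_N=d_0+\frac{1}{2}\,\mathbb{E}\bigl[\lVert\mathbf{t}-\Phi\mathbf{w}\rVert^{2}\bigr].
$$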
With the distributions of w, α, and β determined (Gaussian and Gamma distributions), the expected values appearing in the above expressions can be evaluated using the standard result for the first moment of the Gamma distribution, and the previous expressions for Gaussian distributions:

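Using the notation assumed above, these expectations are

$$
\mathbb{E}[\alpha]=\frac{a_N}{b_N},
\qquad
\mathbb{E}[\beta]=\frac{c_N}{d_N},
\qquad
\mathbb{E}\bigl[\mathbf{w}^{\mathsf{T}}\mathbf{w}\bigr]=\mathbf{m}_N^{\mathsf{T}}\mathbf{m}_N+\operatorname{Tr}(\mathbf{S}_N),
\qquad
\mathbb{E}\bigl[\lVert\mathbf{t}-\Phi\mathbf{w}\rVert^{2}\bigr]=\lVert\mathbf{t}-\Phi\mathbf{m}_N\rVert^{2}+\operatorname{Tr}\bigl(\Phi^{\mathsf{T}}\Phi\,\mathbf{S}_N\bigr).
$$

Cycling through the three factors with these expectations gives an iterative scheme of the kind used in the example below.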
Predictive Distribution

For completeness, the predictive distribution is considered as in Part 1. Recall that the goal is, given a new observation point, to make predictions of the corresponding target value. The model weights are marginalized as before

where the last equality follows using the same approach(es) discussed in Part 1. Note that the integrations against the priors over α and β can quickly become intractable; they have therefore been left out, and the hyperparameters are assumed to have been learned from the training set.
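Under these assumptions (marginalizing the weights against q*(w) and plugging in the learned expectation of β), the predictive distribution at a new input x takes the familiar Gaussian form

$$
p(\hat{t}\mid x,\mathbf{t})\approx\mathcal{N}\bigl(\hat{t}\mid\mathbf{m}_N^{\mathsf{T}}\boldsymbol{\phi}(x),\;\sigma^{2}(x)\bigr),
\qquad
\sigma^{2}(x)=\frac{1}{\mathbb{E}[\beta]}+\boldsymbol{\phi}(x)^{\mathsf{T}}\mathbf{S}_N\boldsymbol{\phi}(x).
$$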

Example: Optimum hyperparameters

The same noisy dataset generated from a sine function is reconsidered once again. In this example, optimum estimates of the hyperparameters are determined using variational inference; a minimal sketch of the update loop is shown after the source link below.

Source
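The linked source contains the actual implementation; the following is only a minimal numpy sketch of the update cycle under the assumptions and notation used above (the prior parameters `a0`–`d0` and the initial expectations are illustrative).

```python
import numpy as np

def variational_hyperparameters(Phi, t, a0=1e-6, b0=1e-6, c0=1e-6, d0=1e-6,
                                n_iter=50):
    """Cycle the variational updates for q(w), q(alpha) and q(beta)."""
    N, M = Phi.shape
    E_alpha, E_beta = 1.0, 1.0                   # initial expectations
    for _ in range(n_iter):
        # q(w) = N(w | m_N, S_N)
        S_N = np.linalg.inv(E_alpha * np.eye(M) + E_beta * Phi.T @ Phi)
        m_N = E_beta * S_N @ Phi.T @ t
        # q(alpha) = Gam(alpha | a_N, b_N)
        a_N = a0 + M / 2
        b_N = b0 + 0.5 * (m_N @ m_N + np.trace(S_N))
        E_alpha = a_N / b_N
        # q(beta) = Gam(beta | c_N, d_N)
        resid = t - Phi @ m_N
        c_N = c0 + N / 2
        d_N = d0 + 0.5 * (resid @ resid + np.trace(Phi.T @ Phi @ S_N))
        E_beta = c_N / d_N
    return E_alpha, E_beta, m_N, S_N
```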

Note that the converged values of the hyperparameters are different from those obtained using the Evidence Approximation (with no priors introduced over the hyperparameters). The log-likelihood, evaluated using the same equations as in the previous example, still indicates the same optimum complexity for this dataset. Also note that, for that optimal model complexity, the evaluated log-likelihood is still close to the previously obtained value.

The second graph shows the mean and the standard deviation obtained using the predictive distribution, where the hyperparameters are those learned from the dataset using variational inference. At each point, the preceding result is used to evaluate the mean and the standard deviation using

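A hypothetical helper for this evaluation, reusing the outputs of the sketch above (with `phi` mapping an input to its basis-function vector):

```python
import numpy as np

def predictive_mean_std(x_query, phi, m_N, S_N, E_beta):
    """Predictive mean and standard deviation at the query points."""
    means, stds = [], []
    for x in np.atleast_1d(x_query):
        px = phi(x)                              # (M,) basis-function vector
        means.append(px @ m_N)
        stds.append(np.sqrt(1.0 / E_beta + px @ S_N @ px))
    return np.array(means), np.array(stds)
```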
Summary

Approach | Notes
Evidence Approximation | λ_i are the eigenvalues of the Gram matrix; see the implicit expressions for α and β above.
Expectation Maximization | M is the model complexity, N is the training dataset count. See above for the other expressions.
Variational Linear Regression | The Gamma prior distribution parameters are prescribed and known.

Exercises

References

Suggested general reference books: