Introduction to PR and ML (Part 2): Regression

2019-09-08

This is the second part of a series of notes taken while reading the Pattern Recognition and Machine Learning (Springer, 2006) book, and it includes simple examples with implementations for demonstration. As in the first post, the goal of these posts is not to be rigorous, comprehensive, or to provide any new or advanced information. The target audience was first and foremost myself, but these notes are made available for anyone else who happens to find them useful.

Some contents are hidden away from the default view inside modal dialogs where no page redirection takes place. These links are underlined (see e.g. here). Clicking outside the window, clicking the close button, or clicking the back button in the browser will close the window.

A basic background in mathematics, including basic calculus, basic linear algebra, exposure to optimization, and basic statistics, is assumed; see the suggested readings at the end of this page. If your statistics is rusty, fear not, as the author of the PR & ML book writes:

All snippets are written in Python 3 using only the numpy and matplotlib add-on packages.

Evidence Approximation

Reference: Section 3.5 of PR&ML book

In the treatment of the regression problem in the first post, the model complexity and the hyperparameters were assumed to be known ahead of time. An extension of the approach described there can be used to learn these parameters during the learning phase.

In a full Bayesian treatment of the problem, prior probabilities would be introduced over the hyperparameters. This approach will be revisited later, but for now consider the maximization of the marginalized likelihood function. Recall that the likelihood function of the data set was defined to be

where

is the Gaussian distribution normalization constant. Also, the prior distribution introduced over model weights was given by

where

is the Gaussian normalization constant for the prior distribution. Then, the marginalized likelihood function, given the training set alone, is

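For reference, the standard forms from Section 3.5 of the book, in notation assumed to match Part 1 (design matrix Φ, target vector t, weights w, noise precision β, prior precision α, and M basis functions), are

$$
p(\mathbf{t}\mid\mathbf{w},\beta)=\left(\frac{\beta}{2\pi}\right)^{N/2}\exp\!\left(-\frac{\beta}{2}\lVert\mathbf{t}-\Phi\mathbf{w}\rVert^{2}\right),
\qquad
p(\mathbf{w}\mid\alpha)=\left(\frac{\alpha}{2\pi}\right)^{M/2}\exp\!\left(-\frac{\alpha}{2}\mathbf{w}^{\mathsf{T}}\mathbf{w}\right),
$$

$$
p(\mathbf{t}\mid\alpha,\beta)=\int p(\mathbf{t}\mid\mathbf{w},\beta)\,p(\mathbf{w}\mid\alpha)\,\mathrm{d}\mathbf{w}.
$$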
The integral can be evaluated the same way as in Part 1 of the discussion, namely expanding and completing the square, or using the general results of evaluating the joint distribution (see notes in Part 1). Consider expanding it out once again as an exercise. Expanding the exponential term excluding the preceding factor gives

The last line follows by completing the square, and is defined as

As a side note, with a little advance insight, the last two terms can be rewritten as (as they are in the PR&ML book)

Isolating the terms for integration, it follows that the original integral can be written as

In the above, the unit-integral property of the multivariate Gaussian distribution is used. As before, it may be convenient to consider the log of the marginal likelihood function:

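In the book's notation (assumed here), the completed square and the resulting log marginal likelihood take the form

$$
E(\mathbf{w})=E(\mathbf{m}_N)+\tfrac{1}{2}(\mathbf{w}-\mathbf{m}_N)^{\mathsf{T}}\mathbf{A}(\mathbf{w}-\mathbf{m}_N),
\qquad
\mathbf{A}=\alpha\mathbf{I}+\beta\Phi^{\mathsf{T}}\Phi,
\qquad
\mathbf{m}_N=\beta\mathbf{A}^{-1}\Phi^{\mathsf{T}}\mathbf{t},
$$

$$
\ln p(\mathbf{t}\mid\alpha,\beta)=\frac{M}{2}\ln\alpha+\frac{N}{2}\ln\beta-E(\mathbf{m}_N)-\frac{1}{2}\ln\lvert\mathbf{A}\rvert-\frac{N}{2}\ln(2\pi),
$$

where $E(\mathbf{m}_N)=\frac{\beta}{2}\lVert\mathbf{t}-\Phi\mathbf{m}_N\rVert^{2}+\frac{\alpha}{2}\mathbf{m}_N^{\mathsf{T}}\mathbf{m}_N$.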
Note that the quantities appearing in the above expression themselves depend on the hyperparameters, so varying the hyperparameters also changes their values.

Consider now the maximization of the log-likelihood with respect to the hyperparameters α and β. Recall that if λ_i are the eigenvalues of the symmetric matrix βΦᵀΦ, then α + λ_i are the eigenvalues of αI + βΦᵀΦ. The log-likelihood can thus be expressed as

where λ_i are the eigenvalues of βΦᵀΦ. Optimizing with respect to α requires

or,

leading to an implicit expression for α

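Writing this out explicitly in the notation assumed above, the stationarity condition defines the quantity γ and gives

$$
\gamma=\sum_{i}\frac{\lambda_i}{\alpha+\lambda_i},
\qquad
\alpha=\frac{\gamma}{\mathbf{m}_N^{\mathsf{T}}\mathbf{m}_N}.
$$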
Consider next the maximization of the log-likelihood with respect to β. First, note that if the eigenvalues of βΦᵀΦ are λ_i, then the eigenvalues of ΦᵀΦ are λ_i/β, so that dλ_i/dβ = λ_i/β. Then,

and the maximization of the log-likelihood requires

Rearranging,

which is again an implicit expression for β.

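In the same notation, the resulting re-estimation equation reads

$$
\frac{1}{\beta}=\frac{1}{N-\gamma}\sum_{n=1}^{N}\bigl(t_n-\mathbf{m}_N^{\mathsf{T}}\boldsymbol{\phi}(x_n)\bigr)^{2}.
$$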
Example: Model Complexity

In this example, the noisy dataset generated from a sine function is reconsidered, but this time with the goal of gaining insight into the optimal model complexity.

Two separate sets of prior and model precision hyperparameters are considered in the figures below. For the first choice of hyperparameters, it is not evident which model complexity maximizes the log-likelihood. From the second set of graphs, the model complexity at which the likelihood is highest can be identified. Note that this value is determined strictly by looking at the training set. For comparison, while not needed for evaluating the log likelihood, graphs of models with different complexities trained on the data set are shown using the same hyperparameters.

Source

The model complexity indicated above to be optimal is considered, and the hyperparameters are optimized. A simple and naive fixed-count iteration is performed to optimize the hyperparameters; a minimal sketch of such an iteration is shown after the source link below. In the figures below, the values after the final iteration are used to re-evaluate the log-likelihood for various complexities, and to show the resulting model for the optimal complexity.

Source
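The linked source contains the actual implementation; the following is only a minimal numpy sketch of such a fixed-count iteration and of the log-evidence evaluation, assuming a design matrix `Phi` of shape (N, M) and a target vector `t` (the function names and initial values are illustrative, not those used in the post).

```python
import numpy as np

def evidence_hyperparameters(Phi, t, alpha=1.0, beta=1.0, n_iter=20):
    """Naive fixed-count iteration of the implicit alpha/beta expressions."""
    N, M = Phi.shape
    eig0 = np.linalg.eigvalsh(Phi.T @ Phi)       # eigenvalues of Phi^T Phi
    for _ in range(n_iter):
        A = alpha * np.eye(M) + beta * Phi.T @ Phi
        m_N = beta * np.linalg.solve(A, Phi.T @ t)
        lam = beta * eig0                        # eigenvalues of beta Phi^T Phi
        gamma = np.sum(lam / (alpha + lam))
        alpha = gamma / (m_N @ m_N)
        beta = (N - gamma) / np.sum((t - Phi @ m_N) ** 2)
    return alpha, beta, m_N

def log_evidence(Phi, t, alpha, beta):
    """Log marginal likelihood ln p(t | alpha, beta)."""
    N, M = Phi.shape
    A = alpha * np.eye(M) + beta * Phi.T @ Phi
    m_N = beta * np.linalg.solve(A, Phi.T @ t)
    E_mN = beta / 2 * np.sum((t - Phi @ m_N) ** 2) + alpha / 2 * m_N @ m_N
    return (M / 2 * np.log(alpha) + N / 2 * np.log(beta) - E_mN
            - 0.5 * np.linalg.slogdet(A)[1] - N / 2 * np.log(2 * np.pi))
```

Evaluating `log_evidence` over a range of model complexities (i.e. different design matrices) with the converged hyperparameters gives the kind of comparison shown in the figures.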

The new log-likelihood evaluated for various complexities now shows an exaggerated preference for the optimal model complexity.

Expectation Maximization

Reference: Section 9.3 of PR&ML book

In the Bayesian treatment of the problem in the previous post, the hyperparameters in the prior distribution were assumed to be known a priori. The previous section showed an example of how these hyperparameters can be learned from the training data, e.g. by maximizing the (log) likelihood function. This section looks at an alternative approach of learning the hyperparameters by maximizing the expected joint distribution through what is known as Expectation Maximization (EM).

Expectation maximization is a more general concept, as will be seen later, and, as the name suggests, is an algorithm for finding maximum likelihood solutions for the model parameters of models having latent variables. Within the context of the Bayesian treatment of the regression problem, the latent variables are the weights, and the model parameters are the hyperparameters α and β.

Specifically, consider the full observation data set, denoted by X:

X = {x_1, …, x_N}

where x_n denotes the n-th observation point. Similarly, the latent (unobserved) values are denoted by Z (e.g. the weights) and the model parameters by θ (e.g. model mean and covariance).

The log likelihood function of the observed data is expressed as a marginalization of the joint distribution of the observed and latent variables.

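In the notation assumed here, this marginalization reads

$$
\ln p(\mathbf{X}\mid\boldsymbol{\theta})=\ln\left\{\sum_{\mathbf{Z}}p(\mathbf{X},\mathbf{Z}\mid\boldsymbol{\theta})\right\},
$$

with the sum replaced by an integral for continuous latent variables (as is the case for the regression weights).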
The data set {X, Z} is referred to as the complete data set, and the data set X as the incomplete data set.

Once again, within the context of regression, the latent variables are the model weights and are not known in advance. Under EM, since the complete data set is (generally) not known, the expected value of the complete-data log likelihood under the posterior distribution of the latent variables is considered. Namely, the problem statement may be expressed as

Expectation Maximization (EM)
Given the (incomplete) data set X and a joint distribution p(X, Z | θ), find the parameters θ that maximize the expected value of the complete-data log likelihood under the posterior distribution of the latent variables:

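In other words, with θ_old denoting the previous parameter values (notation assumed), the quantity being maximized is

$$
Q(\boldsymbol{\theta},\boldsymbol{\theta}^{\text{old}})=\sum_{\mathbf{Z}}p(\mathbf{Z}\mid\mathbf{X},\boldsymbol{\theta}^{\text{old}})\,\ln p(\mathbf{X},\mathbf{Z}\mid\boldsymbol{\theta}),
\qquad
\boldsymbol{\theta}^{\text{new}}=\arg\max_{\boldsymbol{\theta}}Q(\boldsymbol{\theta},\boldsymbol{\theta}^{\text{old}}).
$$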
The Expectation Maximization (EM) algorithm alternates between evaluating this expectation using the previous values of θ and maximizing it with respect to θ, which is guaranteed to increase the incomplete-data log likelihood (unless it is already at a local maximum).

Expectation Maximization Algorithm
  1. Initialize the model parameters θ_old.

  2. E Step: Evaluate the posterior distribution p(Z | X, θ_old).

  3. M Step: Update the model parameters by maximizing the expected value of the complete-data log likelihood, θ_new = argmax_θ Q(θ, θ_old).

  4. Check for convergence. If not converged, set θ_old ← θ_new and go to Step 2.

EM Bayesian Regression

Reference: Section 9.3.4 of PR&ML book

For the linear regression problem, following the EM algorithm above, implicit expressions can be obtained for the hyperparameters using the previously derived prior and posterior distributions.

For the linear regression problem, the latent (unobserved) variables are the model weights w, and the model parameters are the hyperparameters α and β. The complete-data (joint) log-likelihood is given by

where, as before,

Recall that the posterior distribution within the Bayesian treatment was taken to be

where

The model parameters are evaluated by maximizing the expected value of the complete-data (joint) log-likelihood:

The expectation can be evaluated as

Optimizing with respect to α leads to

Similarly, optimizing with respect to β leads to

or,

The expectation can be evaluated by using the result for expectations with respect to the (Gaussian) posterior distribution of the weights,

leading to

Note that the above expressions for α and β are implicit, as before, and their solution requires some iterations.
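For reference, the re-estimation equations that these implicit expressions lead to (the α update is given in Section 9.3.4 of the book; the β update is stated here under the same assumptions) are

$$
\alpha=\frac{M}{\mathbb{E}\bigl[\mathbf{w}^{\mathsf{T}}\mathbf{w}\bigr]}=\frac{M}{\mathbf{m}_N^{\mathsf{T}}\mathbf{m}_N+\operatorname{Tr}(\mathbf{S}_N)},
\qquad
\frac{1}{\beta}=\frac{1}{N}\left(\lVert\mathbf{t}-\Phi\mathbf{m}_N\rVert^{2}+\operatorname{Tr}\bigl(\Phi^{\mathsf{T}}\Phi\,\mathbf{S}_N\bigr)\right),
$$

where m_N and S_N are the posterior mean and covariance of the weights evaluated with the previous hyperparameter values.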

Variational Inference

Reference: Section 10.1 of PR&ML book

Consider now a full Bayesian model where a prior distribution is introduced over all variables, including the hyperparameters. The observation data set will be denoted by X and the latent variables (including the hyperparameters, which will be assumed to be stochastic) will be denoted by Z. The probabilistic model specifies the joint distribution of the observed and the latent variables, i.e. p(X, Z), and the objective is to find (an approximation to) the posterior distribution p(Z | X) as well as the model evidence p(X).

It will be instructive to note that the log marginal distribution can be expressed as

where KL(q ∥ p) is the Kullback-Leibler divergence between the approximating distribution q(Z) and the true posterior, and L(q) is a lower bound on ln p(X).
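Explicitly, in the standard notation of Section 10.1 (assumed here), the decomposition is

$$
\ln p(\mathbf{X})=\mathcal{L}(q)+\mathrm{KL}(q\,\Vert\,p),
$$

$$
\mathcal{L}(q)=\int q(\mathbf{Z})\ln\frac{p(\mathbf{X},\mathbf{Z})}{q(\mathbf{Z})}\,\mathrm{d}\mathbf{Z},
\qquad
\mathrm{KL}(q\,\Vert\,p)=-\int q(\mathbf{Z})\ln\frac{p(\mathbf{Z}\mid\mathbf{X})}{q(\mathbf{Z})}\,\mathrm{d}\mathbf{Z}.
$$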

A key result in variational inference follows from considering the distribution q(Z) to be separable as

where the groups of latent variables Z_i are disjoint. The lower bound under this distribution becomes

where . From the above, it can be found that maximization of the lower bound with respect to a single factor requires that

The above is an implicit expression which shows that the log of the factor that maximizes the lower bound on the marginalized log distribution is obtained by taking the expectation of the log of the joint distribution over all variables with respect to all the other factors q_i, i ≠ j.
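In symbols, this is the standard mean-field result (notation assumed to follow the book):

$$
\ln q_j^{\ast}(\mathbf{Z}_j)=\mathbb{E}_{i\neq j}\bigl[\ln p(\mathbf{X},\mathbf{Z})\bigr]+\text{const},
$$

where the constant is fixed by normalization.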

Variational Linear Regression

Reference: Section 10.3 of PR&ML book

Reconsider the linear regression problem where the likelihood and weight priors are repeated for convenience:

This time, priors are introduced over α and β. Since the conjugate prior for the precision of a Gaussian distribution with known mean is a Gamma distribution, the priors over α and β are taken to be

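Writing these out with prior parameters a_0, b_0 and c_0, d_0 (symbols assumed here; the actual choices are in the linked source), the priors are

$$
p(\alpha)=\mathrm{Gam}(\alpha\mid a_0,b_0),
\qquad
p(\beta)=\mathrm{Gam}(\beta\mid c_0,d_0),
\qquad
\mathrm{Gam}(x\mid a,b)=\frac{b^{a}}{\Gamma(a)}\,x^{a-1}e^{-bx}.
$$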
The marginalized distribution over the latent variables is decomposed into the Kullback-Leibler divergence and the lower bound as before,

with the optimal factors determined using

More specifically, in this context, the joint distribution is

Following the discussion above, the distribution q(w, α, β), with a slight abuse of notation, is assumed to factorize as

The optimal individual factors of the above expression are determined using the general result above:

In the above, the constant collects any terms that are not a function of α. Comparing the above expression with the log of the Gamma distribution indicates that the resulting expression (as expected due to the choice of the prior) is a Gamma distribution given by

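With the assumed prior parameters above, this works out to (cf. Section 10.3 of the book)

$$
q^{\ast}(\alpha)=\mathrm{Gam}(\alpha\mid a_N,b_N),
\qquad
a_N=a_0+\frac{M}{2},
\qquad
b_N=b_0+\frac{1}{2}\,\mathbb{E}\bigl[\mathbf{w}^{\mathsf{T}}\mathbf{w}\bigr].
$$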
Repeating the same process for w:

where the omitted terms are not a function of w, and the last line follows by completing the square with

The corresponding distribution therefore is

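Under the factorization assumed here (with β also treated variationally, so that its expectation appears where Section 10.3 of the book uses a known β), the resulting Gaussian is

$$
q^{\ast}(\mathbf{w})=\mathcal{N}(\mathbf{w}\mid\mathbf{m}_N,\mathbf{S}_N),
\qquad
\mathbf{S}_N=\bigl(\mathbb{E}[\alpha]\,\mathbf{I}+\mathbb{E}[\beta]\,\Phi^{\mathsf{T}}\Phi\bigr)^{-1},
\qquad
\mathbf{m}_N=\mathbb{E}[\beta]\,\mathbf{S}_N\Phi^{\mathsf{T}}\mathbf{t}.
$$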
Finally,

Evaluating each integral is straightforward:

Since the priors over α and β are both Gamma distributions, evaluation of this term is identical to

Combining the terms, and ignoring the integrals that result in terms independent of β, gives

Similar to the case of α, this indicates that the optimum distribution is once again a Gamma distribution:

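Under the same assumptions, the analogous result for β is

$$
q^{\ast}(\beta)=\mathrm{Gam}(\beta\mid c_N,d_N),
\qquad
c_N=c_0+\frac{N}{2},
\qquad
d_N=d_0+\frac{1}{2}\,\mathbb{E}\bigl[\lVert\mathbf{t}-\Phi\mathbf{w}\rVert^{2}\bigr].
$$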
With the distributions of w, α, and β determined (Gaussian and Gamma distributions), the expected values appearing in the above expressions can be evaluated using the standard result for the first moment of the Gamma distribution, and the previous expressions for Gaussian distributions:

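Using the notation assumed above, these expectations are

$$
\mathbb{E}[\alpha]=\frac{a_N}{b_N},
\qquad
\mathbb{E}[\beta]=\frac{c_N}{d_N},
\qquad
\mathbb{E}\bigl[\mathbf{w}^{\mathsf{T}}\mathbf{w}\bigr]=\mathbf{m}_N^{\mathsf{T}}\mathbf{m}_N+\operatorname{Tr}(\mathbf{S}_N),
\qquad
\mathbb{E}\bigl[\lVert\mathbf{t}-\Phi\mathbf{w}\rVert^{2}\bigr]=\lVert\mathbf{t}-\Phi\mathbf{m}_N\rVert^{2}+\operatorname{Tr}\bigl(\Phi^{\mathsf{T}}\Phi\,\mathbf{S}_N\bigr).
$$

Cycling through the three factors with these expectations gives an iterative scheme of the kind used in the example below.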
Predictive Distribution

For completeness, the predictive distribution is considered as in Part 1. Recall that the goal is, given a new observation point, to make predictions of the corresponding target value. The model weights are marginalized as before

where the last equality follows using the same approach(es) discussed in Part 1. Note that the integrations against the priors over α and β can quickly become intractable; they have therefore been left out, and the hyperparameters are assumed to have been learned from the training set.
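Under these assumptions (marginalizing the weights against q*(w) and plugging in the learned expectation of β), the predictive distribution at a new input x takes the familiar Gaussian form

$$
p(\hat{t}\mid x,\mathbf{t})\approx\mathcal{N}\bigl(\hat{t}\mid\mathbf{m}_N^{\mathsf{T}}\boldsymbol{\phi}(x),\;\sigma^{2}(x)\bigr),
\qquad
\sigma^{2}(x)=\frac{1}{\mathbb{E}[\beta]}+\boldsymbol{\phi}(x)^{\mathsf{T}}\mathbf{S}_N\boldsymbol{\phi}(x).
$$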

Example: Optimum hyperparameters

The same noisy dataset generated from a sine function is reconsidered once again. In this example, optimum estimates of the hyperparameters are determined using variational inference; a minimal sketch of the update loop is shown after the source link below.

Source
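The linked source contains the actual implementation; the following is only a minimal numpy sketch of the update cycle under the assumptions and notation used above (the prior parameters `a0`–`d0` and the initial expectations are illustrative).

```python
import numpy as np

def variational_hyperparameters(Phi, t, a0=1e-6, b0=1e-6, c0=1e-6, d0=1e-6,
                                n_iter=50):
    """Cycle the variational updates for q(w), q(alpha) and q(beta)."""
    N, M = Phi.shape
    E_alpha, E_beta = 1.0, 1.0                   # initial expectations
    for _ in range(n_iter):
        # q(w) = N(w | m_N, S_N)
        S_N = np.linalg.inv(E_alpha * np.eye(M) + E_beta * Phi.T @ Phi)
        m_N = E_beta * S_N @ Phi.T @ t
        # q(alpha) = Gam(alpha | a_N, b_N)
        a_N = a0 + M / 2
        b_N = b0 + 0.5 * (m_N @ m_N + np.trace(S_N))
        E_alpha = a_N / b_N
        # q(beta) = Gam(beta | c_N, d_N)
        resid = t - Phi @ m_N
        c_N = c0 + N / 2
        d_N = d0 + 0.5 * (resid @ resid + np.trace(Phi.T @ Phi @ S_N))
        E_beta = c_N / d_N
    return E_alpha, E_beta, m_N, S_N
```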

Note that the converged values of the hyperparameters are different from those obtained using the Evidence Approximation (with no priors introduced over the hyperparameters). The log-likelihood, evaluated using the same equations as in the previous example, still indicates the same optimum complexity for this dataset. Also note that, for that optimal model complexity, the evaluated log-likelihood is still close to the previously obtained value.

The second graph shows the mean and the standard deviation obtained using the predictive distribution, where the hyperparameters are those learned from the dataset using variational inference. At each point, the preceding result is used to evaluate the mean and the standard deviation using

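A hypothetical helper for this evaluation, reusing the outputs of the sketch above (with `phi` mapping an input to its basis-function vector):

```python
import numpy as np

def predictive_mean_std(x_query, phi, m_N, S_N, E_beta):
    """Predictive mean and standard deviation at the query points."""
    means, stds = [], []
    for x in np.atleast_1d(x_query):
        px = phi(x)                              # (M,) basis-function vector
        means.append(px @ m_N)
        stds.append(np.sqrt(1.0 / E_beta + px @ S_N @ px))
    return np.array(means), np.array(stds)
```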
Summary

Approach | Notes
Evidence Approximation | λ_i are the eigenvalues of the Gram matrix; see the implicit expressions for α and β above.
Expectation Maximization | M is the model complexity, N is the training dataset count. See above for the other expressions.
Variational Linear Regression | The Gamma prior distribution parameters are prescribed and known.

Exercises

References

Suggested general reference books: