This is the first part of a series of notes taken while reading the Pattern Recognition and Machine Learning (Springer, 2006) book, together with simple example implementations for demonstration.
The goal of these posts is not to be rigorous, comprehensive, or to provide new or advanced information. The target audience was first and foremost myself, but these notes are made available for anyone else who happens to find them useful.
This first post covers regression exclusively and its connection to machine learning (ML) concepts. The treatment is terse and engineering-oriented, i.e., overly practical, with possibly even more hand-waving than the original book and plenty of abuse of notation.
A basic background in mathematics is assumed, including calculus, linear algebra, some exposure to optimization, and basic statistics; see the suggested readings at the end of this page. If your statistics is rusty, fear not, as the author of the PR&ML book writes:
[...] probability theory can be expressed in terms of two simple equations corresponding to the sum rule and the product rule. All of the probabilistic inference and learning manipulations discussed in this book, no matter how complex, amount to repeated application of these two equations.
All snippets are written in Python 3 with the NumPy and Matplotlib packages.
Reference: Chapter 3 of PR&ML book
A good starting point is to try to exhaust a simple regression analysis by approaching it from different viewpoints and considering variations of the same general problem. The basic regression problem reads:
The word 'best' inherently implies specification of a metric which can be used to quantitatively compare the adequacy of different polynomials. The $(M-1)$-th degree polynomial, referred to as the model, with $M$ being the model complexity, is denoted by $y(x, \mathbf{w})$ and may be expressed as

$$y(x, \mathbf{w}) = \sum_{j=0}^{M-1} w_j x^j$$

where $w_j$ are the polynomial coefficients (or weights) that determine the behavior of the model, and $x^j$ are the basis functions. The choice of the model complexity is referred to as model selection. In future posts, determination of the model complexity will be considered, but for the time being it is assumed that this information is known a priori. As a side note, one convenient way to think of this model is to note that it corresponds to a truncated Taylor series expansion of the underlying function.
Aside: it is convenient to introduce a small generalization of the basis functions and rewrite the above equation as

$$y(x, \mathbf{w}) = \sum_{j=0}^{M-1} w_j \phi_j(x)$$

where $\phi_j(x) = x^j$. In later sections, the definition of $\phi_j$ will be changed to consider various model types.

Noting that $\phi_0(x) = 1$, the above equation can be written as

$$y(x, \mathbf{w}) = w_0 + \sum_{j=1}^{M-1} w_j \phi_j(x)$$

where $w_0$ is referred to as the bias parameter. Also note that a model with complexity $M$ is an $(M-1)$-th degree polynomial, e.g., a model with complexity $M = 2$ corresponds to a first-order polynomial.
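As a minimal illustration (a sketch; the helper names `design_matrix` and `model` are mine, not from the book), the polynomial model can be evaluated by collecting the basis functions $\phi_j(x) = x^j$ into a matrix:

```python
import numpy as np

def design_matrix(x, M):
    """Polynomial design matrix: column j holds phi_j(x) = x**j, j = 0..M-1."""
    return np.vander(x, M, increasing=True)

def model(x, w):
    """Evaluate y(x, w) = sum_j w_j * x**j at every point in x."""
    return design_matrix(x, len(w)) @ w

x = np.linspace(0.0, 1.0, 5)
w = np.array([1.0, -2.0, 0.5])   # M = 3, i.e. a quadratic (degree M-1 = 2)
print(model(x, w))
```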
The adequacy of the model can be measured by the "closeness" of the model to the input dataset at the given observation points (training data). A particularly convenient and common measure is the (scaled) square of the $\ell_2$ norm of the error,

$$E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \left( y(x_n, \mathbf{w}) - t_n \right)^2$$

Here, $\mathbf{y}$ contains all the model predictions, which where convenient may be written as

$$\mathbf{y} = \boldsymbol{\Phi} \mathbf{w}$$

where $\Phi_{nj} = \phi_j(x_n)$. The matrix $\boldsymbol{\Phi}$ contains all of the basis functions evaluated at every input point. As alluded to, the choice of the $\ell_2$ norm is rather arbitrary, but it is a common and convenient one with interesting interpretations and properties. Other norms can be used; see, e.g., the Convex Optimization book for notes on norm choices and their implications.
The "best" fit, given the error metric, is the choice of which minimizes . The original probelm can thus be re-estated as:
Within the context of "machine learning" (ML), the (M-1)-th degree polynomial is referred to as the model and the input data referred to as the training set, and within the context of optimization, is known as the objective function. The above minimziation problem is a well-known unconstrained convex optimization problem seen in optimization.
Within the context of ML, the goal of the model is to make future predictions on inputs (as opposed to, for example, look for individual approximations of the coefficients themselves). For the regression problem, this then reduces to the following general steps.
Sticking with the polynomial regression problem and slightly generalizing the above steps, the general framework (in supervised learning) in ML may be expressed as
For the minimization, a general solution can be obtained. It is simpler for these kinds of problems to briefly revert to index notation (see any book on tensor algebra):
Setting the above equation to zero, solving for the weights, and reverting to matrix notation:
Consider the general solution to the above problem, namely the unconstrained minimization of the $\ell_2$ norm of the error function as measured above. Considering a slight generalization of the basis functions, the model is written as
Then, the minimizer of $E(\mathbf{w})$ can be computed (noting that $E$ is convex) as
Or,
These coefficients define the model, and computing them can be considered the training step. Future predictions are then evaluated using the trained model $y(x, \mathbf{w})$.
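A minimal sketch of this training step, assuming polynomial basis functions (the function name is mine; `np.linalg.lstsq` is used rather than forming the normal equations explicitly, which is numerically preferable but equivalent here):

```python
import numpy as np

def fit_least_squares(x, t, M):
    """Minimize E(w) = 0.5 * ||Phi w - t||^2 over the weights w."""
    Phi = np.vander(x, M, increasing=True)       # design matrix
    w, *_ = np.linalg.lstsq(Phi, t, rcond=None)  # least-squares solution
    return w

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, x.size)
print(fit_least_squares(x, t, M=4))
```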
For linear regression, where the model is a first-order polynomial, the simple and familiar closed form is obtained. First, note that
i.e.
Then, a closed-form solution for the weights is found to be
where $N$ denotes the number of observed (training) data points, and the inversion of the 2×2 matrix has been replaced by its closed-form expression.
In the image below, a fictitious data set is generated by perturbing a line with unit slope passing through the origin with Gaussian noise of zero mean and fixed standard deviation. A linear model is then fitted to this data set. The original target and the reconstructed model are shown in the figure.
While this gives the "best" linear fit (measured using the $\ell_2$ norm of the deviation of the model from the training data), it also heavily penalizes outliers, which can lead to results that do not look visually desirable. More on this later.
In the preceding example, the error function (objective function) was the $\ell_2$ norm of the error between the model and the observation points. In the following section, this is given an interesting interpretation: the observations being perturbed by Gaussian noise independently at each observation point.
If the error is instead measured orthogonal to the linear model, then an alternative model is obtained. Consider the error function
where the error is measured orthogonal to the model, i.e.
Note that this can be thought of as a weighted least squares problem where the weights are given by $1/(1 + w_1^2)$.
After some tedious but otherwise straightforward algebra, a closed-form solution for the model coefficients can be found by minimization of the objective function:
The same data set as in the previous example is considered in the figure below, with the difference that the model coefficients are determined by minimizing the objective function described above.
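As a sketch of one standard way to implement orthogonal (total least squares) line fitting, using the principal-component identity rather than the closed-form algebra above (the helper name is mine):

```python
import numpy as np

def fit_orthogonal_line(x, t):
    """Fit a line by minimizing perpendicular distances to the data.
    Equivalent to the leading principal component of the centered points."""
    X = np.column_stack([x, t])
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean)
    direction = Vt[0]                     # leading right singular vector
    slope = direction[1] / direction[0]
    intercept = mean[1] - slope * mean[0]
    return slope, intercept

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 30)
t = x + rng.normal(0, 0.1, x.size)        # unit-slope line through the origin
print(fit_orthogonal_line(x, t))
```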
Reference: Sections 1.2.5 and 3.3 of PR&ML book
Assume that the observed values $t$ are normally distributed with mean equal to the model value $y(x, \mathbf{w})$ and a known, fixed variance $1/\beta$, namely

$$p(t \mid x, \mathbf{w}, \beta) = \mathcal{N}\left(t \mid y(x, \mathbf{w}), \beta^{-1}\right)$$

where $\mathcal{N}$ denotes the single-variable Gaussian distribution. Refer to the figure below. Note the underlying assumption: the observations are normally distributed at each known observation point. For problems where there is uncertainty in both the observation points and the measurements (e.g., both are measured using noisy sensors), this assumption may not be ideal.
To introduce the concept of the likelihood function, consider the application of Bayes' theorem to determine the model weights $\mathbf{w}$ after observing the input dataset $\mathcal{D}$:
In the above, $p(\mathcal{D} \mid \mathbf{w})$ is referred to as the likelihood function and expresses the probability of observing the dataset $\mathcal{D}$ given model parameters $\mathbf{w}$. As a side note, the denominator can be viewed as a normalization constant ensuring that the posterior distribution is normalized:
Returning to the regression problem, if the data is assumed to be drawn independently from the previously mentioned distribution, then the likelihood function for the full training dataset is
Recall that $\ln$ is concave on $(0, \infty)$ (this can be verified using second-order conditions), and that the likelihood function is positive-valued. Equivalently, therefore, the minimization of the negative log-likelihood or the maximization of the log-likelihood can be considered, which will prove convenient and will be a common theme: maximize
Maximization of the above log-likelihood (objective function) with respect to $\mathbf{w}$ leads to:
where a slight generalization of the basis-function notation has been introduced, namely the vector $\boldsymbol{\phi}(x)$ collecting all basis functions evaluated at $x$. Maximizing with respect to $\beta$ results in:
Here the ML subscripts denote maximum likelihood and are not to be confused with machine learning.
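A minimal sketch of the maximum-likelihood estimates (assuming polynomial basis functions; the names are mine): $\mathbf{w}_{ML}$ coincides with the least-squares solution, and $1/\beta_{ML}$ is the mean squared residual of the fitted model.

```python
import numpy as np

def max_likelihood_fit(x, t, M):
    """w_ML is the least-squares solution; 1/beta_ML is the mean
    squared residual of the fitted model."""
    Phi = np.vander(x, M, increasing=True)        # polynomial design matrix
    w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)
    beta_ml = 1.0 / np.mean((t - Phi @ w_ml) ** 2)
    return w_ml, beta_ml

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 30)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, x.size)
w_ml, beta_ml = max_likelihood_fit(x, t, M=4)
print(w_ml, 1.0 / np.sqrt(beta_ml))               # recovered noise std ~ 0.2
```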
With the maximum-likelihood parameters determined, a predictive distribution for new values of $t$ becomes $p(t \mid x, \mathbf{w}_{ML}, \beta_{ML}) = \mathcal{N}(t \mid y(x, \mathbf{w}_{ML}), \beta_{ML}^{-1})$. In other words, the mean of the distribution is precisely that obtained by minimization of the error function. While the mean is the same as before, a probabilistic estimate of the model is now available, with both a mean and a variance. Alternatively, this can be thought of as the maximum-likelihood fit of the dataset where the data is assumed to be normally distributed at each observation point.
Consider now moving towards a more Bayesian approach. The general framework will be
Here, a prior over $\mathbf{w}$ is introduced, assumed to be normally distributed with zero mean:
Then, the posterior distribution of $\mathbf{w}$ is evaluated using Bayes' theorem
The expression for the posterior distribution can be found using general posterior results (or alternatively, as an exercise, via the direct derivation available here, which goes one step beyond Exercise 3.8 in the PR&ML book and requires only direct use of the previously derived equations) to be
where
In the above, as will be seen in the Kernel Methods section, $\boldsymbol{\Phi}$ is referred to as the design matrix and $\boldsymbol{\Phi} \boldsymbol{\Phi}^T$ as the Gram matrix.
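A minimal sketch of these posterior expressions (variable names are mine; `alpha` and `beta` denote the prior and noise precisions):

```python
import numpy as np

def posterior(Phi, t, alpha, beta):
    """Posterior over the weights for a zero-mean isotropic Gaussian prior:
    S_N^{-1} = alpha*I + beta*Phi^T Phi,  m_N = beta * S_N Phi^T t."""
    M = Phi.shape[1]
    S_inv = alpha * np.eye(M) + beta * Phi.T @ Phi
    S = np.linalg.inv(S_inv)
    m = beta * S @ Phi.T @ t
    return m, S
```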
Consider posterior estimation of the weights in a linear model given a training set. Since the model space is two-dimensional, the posterior distribution can be conveniently visualized in a contour plot.
We proceed as follows (see the linked source file), with the data randomly generated as follows:
This process is repeated with more data points added in each frame.
The source file implements a naive update approach for demonstration purposes, where the mean and variance are re-evaluated from scratch in each frame. A sequential approach can instead be taken by noting that some of the matrices can be updated incrementally as data points are observed, as sketched below.
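A sketch of such a sequential update under the same assumptions (names are mine; the posterior precision starts at `alpha * np.eye(M)` and `b` at zero):

```python
import numpy as np

def sequential_update(S_inv, b, phi_new, t_new, beta):
    """Fold a single new observation into the posterior without refitting:
    track S_N^{-1} and b = beta * Phi^T t incrementally.
    Initialize with S_inv = alpha * np.eye(M), b = np.zeros(M)."""
    S_inv = S_inv + beta * np.outer(phi_new, phi_new)
    b = b + beta * t_new * phi_new
    m = np.linalg.solve(S_inv, b)    # posterior mean after this observation
    return S_inv, b, m
```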
A few notes worth pointing out:
Alternatively, the maximum a posteriori (MAP) estimate can be found by directly expanding the negative log of the above posterior distribution:
where the constant term collects the normalization factors. Maximization of the above equation with respect to $\mathbf{w}$ leads to
where the positive ratio of the precisions naturally appears and plays the role of a regularization parameter that helps avoid overfitting the data. Note that this is precisely the same result obtained for the mean of the weights earlier. This type of regularization can be, and sometimes is, included in an ad hoc manner in the previously encountered regression problem.
The coefficients maximizing this objective function, denoted as before, can be used to evaluate the mean of the probability distribution.
This example considers a sine function sampled and perturbed with zero-mean Gaussian noise. A model with complexity M = 10 is selected, with and without a regularization parameter.
The first image shows the results of the model with and without regularization terms, trained on a dataset of size N = 10. Without a regularization parameter, the model exhibits overfitting and excessive oscillations.
The second image shows the results with an increased dataset size, where the overfitting behavior without a regularization parameter is reduced.
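A minimal sketch reproducing the flavor of this experiment (the data sizes, noise level, and `lam` are illustrative choices, not the exact values behind the figures):

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0, 1, 10)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, x.size)

M, lam = 10, 1e-3
Phi = np.vander(x, M, increasing=True)

# Unregularized vs. regularized (ridge) solutions
w_plain = np.linalg.lstsq(Phi, t, rcond=None)[0]
w_reg = np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)
print(np.abs(w_plain).max(), np.abs(w_reg).max())  # regularization shrinks weights
```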
Reference: Sections 1.2.6 and 3.3.2 of PR&ML book
As pointed out in the previous example, the posterior distribution is not directly used. The dependence on the model weights can be removed by integrating them out, i.e., by evaluating the marginal distribution.
where,
The last equation for the marginal distribution follows either by expanding the integrand, separating, and integrating out the components (which turn out to be Gaussian), or by directly using the general results for Gaussian marginals.
In contrast to the previous examples, which focused on simple linear basis functions, this example considers fitting a higher-order model to a sine function perturbed by Gaussian noise with zero mean and constant standard deviation.
Specifically, a polynomial basis function of 9th order (M = 10) is considered.
The mean is evaluated as
where the posterior mean is evaluated using the training set, and the basis functions are evaluated at the query point. Note that each prediction now comes with an associated standard deviation.
The train-predict paradigm thus remains:
The video below shows the predicted model mean and its standard deviation being continuously updated as more random observation points are added to the training set.
The implementation linked above, as in previous examples, uses a naive update of the training phase, where the mean and covariance are re-evaluated from scratch whenever the training set is updated.
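A minimal sketch of the predictive mean and standard deviation under the above expressions (polynomial basis assumed; `m` and `S` are the posterior mean and covariance from the earlier sketch, and the names are mine):

```python
import numpy as np

def predictive(x_new, m, S, beta, M):
    """Predictive distribution at query points:
    mean = m_N^T phi(x),  var = 1/beta + phi(x)^T S_N phi(x)."""
    phi = np.vander(np.atleast_1d(x_new), M, increasing=True)
    mean = phi @ m
    var = 1.0 / beta + np.einsum('ij,jk,ik->i', phi, S, phi)
    return mean, np.sqrt(var)
```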
One of the caveats thus far has been the introduction of fixed hyperparameters. A full Bayesian treatment would place prior distributions over these parameters as well as the model weights. Doing so, however, no longer results in analytical solutions for the posterior and predictive distributions, and approximate solutions to the corresponding distributions must be sought instead. More on this later.
Reference: Section 3.3.3 and Chapter 6 of PR&ML book
In the preceding sections, the primary focus was on restricting the basis functions to polynomials.
Consider the MAP solution encountered before, i.e., the minimization of the objective function
whose minimizer was found to be
Now, define the vector $\mathbf{a}$ as
and note that, although not explicitly specified, $\mathbf{a}$ is itself a function of $\mathbf{w}$. The minimizer can thus be implicitly expressed as
and the original objective function can be re-written as
where $\boldsymbol{\Phi}$ is known as the design matrix, and the symmetric matrix $\mathbf{K} = \boldsymbol{\Phi} \boldsymbol{\Phi}^T$ is known as the Gram matrix. Note that the individual components of the Gram matrix can be written as
where $x_n$ is the $n$-th observation point, and $k(x, x')$ is referred to as the kernel function.
Minimization of the objective function in terms of $\mathbf{a}$ leads to
and the target values can therefore be written as
where $\mathbf{k}(x)$ is the vector with components $k_n(x) = k(x, x_n)$.
Side note: this form of the solution is not new; it was already encountered in a previous example, where the model-weight prior was assumed to be Gaussian with zero mean and isotropic covariance. Repeating the results from that example:
The second-to-last line follows since, from linear algebra, a matrix-vector multiplication is a linear combination of the columns of the matrix scaled by the entries of the multiplying vector.
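A minimal sketch of this dual-form (kernel) solution, using a Gaussian kernel and a hand-picked regularization constant `lam` (both are illustrative assumptions; names are mine):

```python
import numpy as np

def gaussian_kernel(a, b, s=0.1):
    """k(x, x') = exp(-(x - x')^2 / (2 s^2)), evaluated on all pairs."""
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * s ** 2))

def fit_dual(x_train, t, lam=1e-2):
    """Dual solution: a = (K + lam*I)^{-1} t."""
    K = gaussian_kernel(x_train, x_train)
    return np.linalg.solve(K + lam * np.eye(x_train.size), t)

def predict_dual(x_new, x_train, a):
    """y(x) = k(x)^T a, with k_n(x) = k(x, x_n)."""
    return gaussian_kernel(x_new, x_train) @ a
```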
In this example we consider how Kernel Density Estimation (KDE) leads to a class of kernels.
Consider the kernel density estimate of the joint distribution from discrete samples, with the kernel density estimator component denoted by $f$. The joint probability is then estimated through the sampled points as
As before, the model is sought as the mean of the conditional distribution
Consider restricting the kernel density estimator to functions with zero mean, i.e., let
then, by introducing a change of variables and the corresponding normalized kernel, the mean can be expressed as
Consider zero-mean Gaussian kernel density estimator functions with isotropic covariance, i.e. let
Then,
The mean, the conditional distribution, and the standard deviation then become
The last equality follows from a simple integration by parts, using the unit-integral property of the Gaussian. In this example, the above equation is used to compute the standard deviation; wherever the computed variance is negative, it is overwritten with zero.
Note that under this form, the exponentials must be re-evaluated at all points, and the train and predict steps previously mentioned are no longer strictly distinguishable. This can lead to computationally expensive predictions with large training datasets. One possible approach to reducing the computational cost is to subsample the training set.
The example below shows samples of a sine function perturbed by isotropic Gaussian noise with zero mean. All data points are used for determining the predicted value.
The next figure shows the resulting model when a subset of the original data, picked uniformly at random, is used for training.
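A minimal sketch of the Nadaraya-Watson estimator with a Gaussian kernel (the kernel width `s` and the data are illustrative; names are mine):

```python
import numpy as np

def nadaraya_watson(x_new, x_train, t_train, s=0.1):
    """Predictive mean as a weighted average of the training targets,
    with Gaussian weights normalized over the training points."""
    W = np.exp(-(x_new[:, None] - x_train[None, :]) ** 2 / (2 * s ** 2))
    W = W / W.sum(axis=1, keepdims=True)
    return W @ t_train

rng = np.random.default_rng(3)
x = np.sort(rng.uniform(0, 1, 50))
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.1, x.size)
x_q = np.linspace(0, 1, 200)
print(nadaraya_watson(x_q, x, t)[:5])
```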
A random process is a Gaussian process if it is defined as a probability distribution over functions such that the function values evaluated at any arbitrary set of points are jointly Gaussian. In such processes, the joint distribution is completely specified by the mean and the covariance.
It is often convenient to assume that the random process has zero mean, in which case it is completely specified by its covariance (kernel) function. A zero-mean Gaussian process can then be expressed as
We can consider kernel functions directly, e.g.
The two figures below show five samples of zero-mean Gaussian processes with Gaussian and exponential kernels (covariances) over a prespecified range.
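A sketch of how such samples can be drawn (the kernel widths and the jitter term are hand-picked; names are mine):

```python
import numpy as np
import matplotlib.pyplot as plt

def gauss_kern(a, b, s=0.2):
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * s ** 2))

def exp_kern(a, b, theta=4.0):
    return np.exp(-theta * np.abs(a[:, None] - b[None, :]))

x = np.linspace(-1, 1, 200)
rng = np.random.default_rng(4)
for kern, name in [(gauss_kern, 'Gaussian'), (exp_kern, 'Exponential')]:
    K = kern(x, x) + 1e-8 * np.eye(x.size)   # jitter for numerical stability
    samples = rng.multivariate_normal(np.zeros(x.size), K, size=5)
    plt.plot(x, samples.T)
    plt.title(f'Zero-mean GP samples, {name} kernel')
    plt.show()
```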
To relate this back to the regression model previously encountered, consider revisiting the linear regression model, where the target values are expressed as a linear combination of weights and basis functions, and reconsider the zero-mean isotropic Gaussian prior introduced over the weights:
Considering the mean and covariance of the target values, we have
The last equality defines the kernel function for a linear-regression basis with a zero-mean isotropic Gaussian prior over the weights.
If the target value distribution conditioned on the model values is assumed to be isotropic Gaussian, i.e.
while restricting the model to be a zero-mean Gaussian process, namely
then the marginalized distribution of the model trained on the data set can be written as
where
where $\delta_{nm}$ is the Kronecker delta and $k$ is a kernel function; the last line follows, as before, either by using the generalized results for predictive distributions or by following a process similar to the marginal-distribution evaluation mentioned previously.
Using the trained results to evaluate predictions for new observations is straightforward. Consider the joint distribution
where the new covariance blocks are built from the kernel evaluated at the query point. The conditional distribution can then be found, using similar approaches as before, to be
Note that, as in the previous example, there is generally no training phase distinguishable from the prediction phase; specifically, the kernel vector must be constructed from the training data at each evaluation point. If the basis functions are limited to a finite set, then the original equations encountered in the Bayesian treatment of the regression problem are recovered. Additionally, note that the solution involves the inversion of an $N \times N$ matrix, whose cost grows cubically with the training-set size. The advantage of the Gaussian-process viewpoint, however, is that it admits covariance functions which can only be expressed in terms of infinite-dimensional basis functions.
This example uses the Gaussian kernel for regression on noisy sine data. The sine function is sampled at a fixed sampling frequency over a fixed range and perturbed by zero-mean Gaussian noise.
Using the results above, the mean and standard deviation of the model are computed for each point as
where $k$ is the kernel function.
The figure considers a Gaussian and a simple exponential kernel for creating the target model. In both cases, the last 3 points are removed from consideration.
Note that the standard deviation increases outside the training region.
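A minimal sketch of these GP predictive equations (Gaussian kernel; `beta` and the kernel width `s` are illustrative choices, and the names are mine):

```python
import numpy as np

def gp_predict(x_new, x_train, t_train, beta=25.0, s=0.1):
    """GP regression: C_N = K + (1/beta) I; for each query point,
    mean = k^T C_N^{-1} t and var = c - k^T C_N^{-1} k."""
    def kern(a, b):
        return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * s ** 2))
    C = kern(x_train, x_train) + np.eye(x_train.size) / beta
    C_inv = np.linalg.inv(C)
    k = kern(x_new, x_train)                  # shape (N_new, N)
    mean = k @ C_inv @ t_train
    c = 1.0 + 1.0 / beta                      # k(x, x) = 1 for this kernel
    var = c - np.einsum('ij,jk,ik->i', k, C_inv, k)
    return mean, np.sqrt(np.maximum(var, 0.0))
```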
Reference: Chapter 5 of PR&ML book
The basic neural network can be described as a series of functional transformations. The word network refers to the (nonlinear) coupling of the weights between basis functions.
The approximation at an input point is expressed in terms of the model's network coefficients at different "layers". For a two-layer network, following the regression theme, the model is given by
where the superscripts denote the layer number of the neural network and $M$ is the model complexity. Following the earlier notation, the $w$'s are referred to as the weights and the $w_0$'s as the model biases (of the first layer). Additionally, the $a_j$ are referred to as the activations, and $h(\cdot)$ is called the activation function.
The above expressions can be made less intimidating by rewriting them as
For $D$-dimensional input variables (not to be confused with the input data set) and $K$-dimensional output variables, the above equation at a point is generalized as
where
Neural networks therefore describe nonlinear models and generally require iterative solution.
Recall that model training in the least-squares case was an unconstrained minimization of the $\ell_2$ norm of the error function. The same process is followed for training neural networks, i.e., the minimizer of
is sought, where the model values at each observation point are given by the above expressions. The key difference is that the model is now nonlinear, and minimization of the objective function must be carried out iteratively.
A regression problem similar to those considered in past examples is used here: a sine function perturbed by zero-mean Gaussian noise is sampled, and two-layer networks of various complexities are trained on the sampled data. The output activation function is taken to be the identity.
Due to the nonlinearity of the model, at minimum the first derivatives of the objective function with respect to the weights at each layer are required. The derivatives are given as
First derivatives are sufficient for performing gradient descent with approximate line search for minimization of the error. The initial guess is picked randomly from a given range; a sketch follows.
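A minimal sketch of training such a two-layer tanh network with plain gradient descent (no line search here; the learning rate, iteration count, and network width are hand-picked and may need tuning):

```python
import numpy as np

rng = np.random.default_rng(5)
x = np.linspace(0, 1, 50)[:, None]                  # N x 1 inputs
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.1, x.shape)

M, lr = 8, 0.2                                      # hidden units, step size
W1 = rng.normal(0, 1, (1, M)); b1 = np.zeros(M)
W2 = rng.normal(0, 1, (M, 1)); b2 = np.zeros(1)

for step in range(20000):
    # Forward pass: z = tanh(x W1 + b1), y = z W2 + b2 (identity output)
    z = np.tanh(x @ W1 + b1)
    y = z @ W2 + b2
    err = (y - t) / x.shape[0]                      # dE/dy, averaged over data
    # Backward pass (chain rule through both layers)
    dW2 = z.T @ err
    db2 = err.sum(axis=0)
    delta = (err @ W2.T) * (1 - z ** 2)             # tanh'(a) = 1 - tanh(a)^2
    dW1 = x.T @ delta
    db1 = delta.sum(axis=0)
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print(float(np.mean((y - t) ** 2)))                 # final training MSE
```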
Note that not all of the results have converged to a sufficiently low gradient. The displayed results may vary considerably depending on the initial guess.
Key findings for each approach are tersely summarized in the table below, meant to be used only as a refresher of the overall approach.
| Approach | Model | Likelihood | Objective | Prior | Conditional | Post. Mean | Post. Var |
|---|---|---|---|---|---|---|---|
| Regression | | - | | | | | |
| MAP | | - | | | | | |
| Bayesian Regression | | | | | | | |
| Nadaraya-Watson (Gaussian kernel) | | - | | | | | |
| Gaussian Process | | | | | | | |
| Neural Network | Requires iterations | - | | | | | |
There are various other topics like sparse kernel machines, approximate inference, etc. that will be addressed in future posts.
Suggested general reference books:
Bishop, Christopher M. Pattern Recognition and Machine Learning. Springer, 2006.
The base reference for the content of this page, though certainly not the best book I have read.
Boyd, Stephen, and Lieven Vandenberghe. Convex Optimization. Cambridge university press, 2004.
A classic and a must-read introduction to (convex) optimization for everyone, regardless of background. A PDF copy is also available on the Stanford webpage as of this writing.
Greenberg, Michael D. Foundations of Applied Mathematics. Courier Corporation, 2013.
Good reference book or refresher on applied math.
Bulmer, Michael George. Principles of Statistics. Courier Corporation, 1979.
Very easy read and short refresher on basics of statistics.