This is the first part of a series of notes taken while reading the Pattern Recognition and Machine Learning (Springer, 2006) book, together with simple example implementations for demonstration.
The goal of these posts is not to be rigorous, comprehensive, or to provide new or advanced information. The target audience was first and foremost myself, but these notes are made available for anyone else who happens to find them useful.
This first post covers regression exclusively and its connection to machine learning (ML) concepts. The treatment is terse and engineering-oriented, i.e., overly practical, with possibly even more hand-waving than the original book and plenty of abuse of notation.
A basic background in mathematics is assumed, including calculus, linear algebra, some exposure to optimization, and basic statistics; see the suggested readings at the end of this page. If your statistics is rusty, fear not, as the author of the PR&ML book writes:
[...] probability theory can be expressed in terms of two simple equations corresponding to the sum rule and the product rule. All of the probabilistic inference and learning manipulations discussed in this book, no matter how complex, amount to repeated application of these two equations.
All snippets are written in Python 3 with the NumPy and Matplotlib packages.
Reference: Chapter 3 of PR&ML book
A good starting point is to try to exhaust a simple regression analysis by approaching it from different viewpoints and considering variations of the same general problem. The basic regression problem reads:
The word 'best' inherently implies specification of a metric which can be used to quantitatively compare the adequacy of different polynomials. The $(M-1)$-th degree polynomial, referred to as the model, with $M$ being the model complexity, is denoted by $y(x, \mathbf{w})$ and may be expressed as

$$y(x, \mathbf{w}) = \sum_{j=0}^{M-1} w_j x^j$$

where $w_j$ are the polynomial coefficients (or weights) that determine the behavior of the model, and $x^j$ are the basis functions. The choice of the model complexity is referred to as model selection. In future posts, determination of the model complexity will be considered, but for the time being it is assumed that this information is known a priori. As a side note, one convenient way to think of this model is to note that it corresponds to a truncated Taylor series expansion of the underlying function.
Aside: it is convenient to introduce a small generalization of the basis functions and rewrite the above equation as

$$y(x, \mathbf{w}) = \sum_{j=0}^{M-1} w_j \phi_j(x)$$

where $\phi_j(x) = x^j$. In later sections, the definition of $\phi_j$ will be changed to consider various model types.

Noting that $\phi_0(x) = 1$, the above equation can be written as

$$y(x, \mathbf{w}) = w_0 + \sum_{j=1}^{M-1} w_j \phi_j(x)$$

where $w_0$ is referred to as the bias parameter. Also note that a model with complexity $M$ is an $(M-1)$-th degree polynomial, e.g., a model with complexity $M = 2$ corresponds to a first-order polynomial.
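As a minimal illustration (a sketch; the helper names `design_matrix` and `model` are mine, not from the book), the polynomial model can be evaluated by collecting the basis functions $\phi_j(x) = x^j$ into a matrix:

```python
import numpy as np

def design_matrix(x, M):
    """Polynomial design matrix: column j holds phi_j(x) = x**j, j = 0..M-1."""
    return np.vander(x, M, increasing=True)

def model(x, w):
    """Evaluate y(x, w) = sum_j w_j * x**j at every point in x."""
    return design_matrix(x, len(w)) @ w

x = np.linspace(0.0, 1.0, 5)
w = np.array([1.0, -2.0, 0.5])   # M = 3, i.e. a quadratic (degree M-1 = 2)
print(model(x, w))
```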
The adequacy of the model can be measured by the "closeness" of the model to the input dataset at the given observation points (training data). A particularly convenient and common measure is the (scaled) square of the $\ell_2$ norm of the error,

$$E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \left( y(x_n, \mathbf{w}) - t_n \right)^2$$

Here, $\mathbf{y}$ contains all the model predictions, which where convenient may be written as

$$\mathbf{y} = \boldsymbol{\Phi} \mathbf{w}$$

where $\Phi_{nj} = \phi_j(x_n)$. The matrix $\boldsymbol{\Phi}$ contains all of the basis functions evaluated at every input point. As alluded to, the choice of the $\ell_2$ norm is rather arbitrary, but it is a common and convenient one with interesting interpretations and properties. Other norms can be used; see, e.g., the Convex Optimization book for notes on norm choices and their implications.
The "best" fit, given the error metric, is the choice of which minimizes . The original probelm can thus be re-estated as:
Within the context of "machine learning" (ML), the (M-1)-th degree polynomial is referred to as the model and the input data referred to as the training set, and within the context of optimization, is known as the objective function. The above minimziation problem is a well-known unconstrained convex optimization problem seen in optimization.
Within the context of ML, the goal of the model is to make future predictions on inputs (as opposed to, for example, look for individual approximations of the coefficients themselves). For the regression problem, this then reduces to the following general steps.
Sticking with the polynomial regression problem and slightly generalizing the above steps, the general framework (in supervised learning) in ML may be expressed as
For the minimization, a general solution can be obtained. It is simpler for these kinds of problems to briefly revert to index notation (see any book on tensor algebra):
Setting the above equation to zero, solving for the weights, and reverting to matrix notation:
Consider the general solution to the above problem, namely the unconstrained minimization of the $\ell_2$ norm of the error function as measured above. Considering a slight generalization of the basis functions, the model is written as
Then, the minimizer of $E(\mathbf{w})$ can be computed (noting that $E$ is convex) as
Or,
These coefficients define the model, and computing them can be considered the training step. Future predictions are then evaluated using the trained model $y(x, \mathbf{w})$.
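A minimal sketch of this training step, assuming polynomial basis functions (the function name is mine; `np.linalg.lstsq` is used rather than forming the normal equations explicitly, which is numerically preferable but equivalent here):

```python
import numpy as np

def fit_least_squares(x, t, M):
    """Minimize E(w) = 0.5 * ||Phi w - t||^2 over the weights w."""
    Phi = np.vander(x, M, increasing=True)       # design matrix
    w, *_ = np.linalg.lstsq(Phi, t, rcond=None)  # least-squares solution
    return w

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, x.size)
print(fit_least_squares(x, t, M=4))
```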
For linear regression, where the model is a first-order polynomial, the simple and familiar closed form is obtained. First, note that
i.e.
Then, a closed-form solution for the weights is found to be
where $N$ denotes the number of observed (training) data points, and the inversion of the 2×2 matrix has been replaced by its closed-form expression.
In the image below, a fictitious data set is generated by perturbing a line with unit slope passing through the origin with Gaussian noise of zero mean and fixed standard deviation. A linear model is then fitted to this data set. The original target and the reconstructed model are shown in the figure.
While this gives the "best" linear fit (measured using the $\ell_2$ norm of the deviation of the model from the training data), it also heavily penalizes outliers, which can lead to results that do not look visually desirable. More on this later.
In the preceding example, the error function (objective function) was the $\ell_2$ norm of the error between the model and the observation points. In the following section, this is given an interesting interpretation: the observations being perturbed by Gaussian noise independently at each observation point.
If the error is instead measured orthogonal to the linear model, then an alternative model is obtained. Consider the error function
where the error is measured orthogonal to the model, i.e.
Note that this can be thought of as a weighted least squares problem where the weights are given by $1/(1 + w_1^2)$.
After some tedious but otherwise straightforward algebra, a closed-form solution for the model coefficients can be found by minimization of the objective function:
The same data set as in the previous example is considered in the figure below, with the difference that the model coefficients are determined by minimizing the objective function described above.
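As a sketch of one standard way to implement orthogonal (total least squares) line fitting, using the principal-component identity rather than the closed-form algebra above (the helper name is mine):

```python
import numpy as np

def fit_orthogonal_line(x, t):
    """Fit a line by minimizing perpendicular distances to the data.
    Equivalent to the leading principal component of the centered points."""
    X = np.column_stack([x, t])
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean)
    direction = Vt[0]                     # leading right singular vector
    slope = direction[1] / direction[0]
    intercept = mean[1] - slope * mean[0]
    return slope, intercept

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 30)
t = x + rng.normal(0, 0.1, x.size)        # unit-slope line through the origin
print(fit_orthogonal_line(x, t))
```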
Reference: Sections 1.2.5 and 3.3 of PR&ML book
Assume that the observed values $t$ are normally distributed with mean equal to the model value $y(x, \mathbf{w})$ and a known, fixed variance $1/\beta$, namely

$$p(t \mid x, \mathbf{w}, \beta) = \mathcal{N}\left(t \mid y(x, \mathbf{w}), \beta^{-1}\right)$$

where $\mathcal{N}$ denotes the single-variable Gaussian distribution. Refer to the figure below. Note the underlying assumption: the observations are normally distributed at each known observation point. For problems where there is uncertainty in both the observation points and the measurements (e.g., both are measured using noisy sensors), this assumption may not be ideal.
To introduce the concept of the likelihood function, consider the application of Bayes' theorem to determine the model weights $\mathbf{w}$ after observing the input dataset $\mathcal{D}$:
In the above, $p(\mathcal{D} \mid \mathbf{w})$ is referred to as the likelihood function and expresses the probability of observing the dataset $\mathcal{D}$ given model parameters $\mathbf{w}$. As a side note, the denominator can be viewed as a normalization constant ensuring that the posterior distribution is normalized:
Returning to the regression problem, if the data is assumed to be drawn independently from the previously mentioned distribution, then the likelihood function for the full training dataset is
Recall that $\ln$ is concave on $(0, \infty)$ (this can be verified using second-order conditions), and that the likelihood function is positive-valued. Equivalently, therefore, the minimization of the negative log-likelihood or the maximization of the log-likelihood can be considered, which will prove convenient and will be a common theme: maximize
Maximization of the above log-likelihood (objective function) with respect to $\mathbf{w}$ leads to:
where a slight generalization of the basis-function notation has been introduced, namely the vector $\boldsymbol{\phi}(x)$ collecting all basis functions evaluated at $x$. Maximizing with respect to $\beta$ results in:
Here the ML subscripts denote maximum likelihood and are not to be confused with machine learning.
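A minimal sketch of the maximum-likelihood estimates (assuming polynomial basis functions; the names are mine): $\mathbf{w}_{ML}$ coincides with the least-squares solution, and $1/\beta_{ML}$ is the mean squared residual of the fitted model.

```python
import numpy as np

def max_likelihood_fit(x, t, M):
    """w_ML is the least-squares solution; 1/beta_ML is the mean
    squared residual of the fitted model."""
    Phi = np.vander(x, M, increasing=True)        # polynomial design matrix
    w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)
    beta_ml = 1.0 / np.mean((t - Phi @ w_ml) ** 2)
    return w_ml, beta_ml

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 30)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, x.size)
w_ml, beta_ml = max_likelihood_fit(x, t, M=4)
print(w_ml, 1.0 / np.sqrt(beta_ml))               # recovered noise std ~ 0.2
```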
With the maximum-likelihood parameters determined, a predictive distribution for new values of $t$ becomes $p(t \mid x, \mathbf{w}_{ML}, \beta_{ML}) = \mathcal{N}(t \mid y(x, \mathbf{w}_{ML}), \beta_{ML}^{-1})$. In other words, the mean of the distribution is precisely that obtained by minimization of the error function. While the mean is the same as before, a probabilistic estimate of the model is now available, with both a mean and a variance. Alternatively, this can be thought of as the maximum-likelihood fit of the dataset where the data is assumed to be normally distributed at each observation point.
Consider now moving towards a more Bayesian approach. The general framework will be
Here, a prior over $\mathbf{w}$ is introduced, assumed to be normally distributed with zero mean:
Then, the posterior distribution of $\mathbf{w}$ is evaluated using Bayes' theorem
The expression for the posterior distribution can be found using general posterior results (or alternatively, as an exercise, via the direct derivation available here, which goes one step beyond Exercise 3.8 in the PR&ML book and requires only direct use of the previously derived equations) to be
where
In the above, as will be seen in the Kernel Methods section, $\boldsymbol{\Phi}$ is referred to as the design matrix and $\boldsymbol{\Phi} \boldsymbol{\Phi}^T$ as the Gram matrix.
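A minimal sketch of these posterior expressions (variable names are mine; `alpha` and `beta` denote the prior and noise precisions):

```python
import numpy as np

def posterior(Phi, t, alpha, beta):
    """Posterior over the weights for a zero-mean isotropic Gaussian prior:
    S_N^{-1} = alpha*I + beta*Phi^T Phi,  m_N = beta * S_N Phi^T t."""
    M = Phi.shape[1]
    S_inv = alpha * np.eye(M) + beta * Phi.T @ Phi
    S = np.linalg.inv(S_inv)
    m = beta * S @ Phi.T @ t
    return m, S
```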
Consider posterior estimation of the weights in a linear model given a training set. Since the model space is two-dimensional, the posterior distribution can be conveniently visualized in a contour plot.
We proceed as follows (see the linked source file), with the data randomly generated as follows:
This process is repeated with more data points added in each frame.
The source file implements a naive update approach for demonstration purposes, where the mean and variance are re-evaluated from scratch in each frame. A sequential approach can instead be taken by noting that some of the matrices can be updated incrementally as data points are observed, as sketched below.
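A sketch of such a sequential update under the same assumptions (names are mine; the posterior precision starts at `alpha * np.eye(M)` and `b` at zero):

```python
import numpy as np

def sequential_update(S_inv, b, phi_new, t_new, beta):
    """Fold a single new observation into the posterior without refitting:
    track S_N^{-1} and b = beta * Phi^T t incrementally.
    Initialize with S_inv = alpha * np.eye(M), b = np.zeros(M)."""
    S_inv = S_inv + beta * np.outer(phi_new, phi_new)
    b = b + beta * t_new * phi_new
    m = np.linalg.solve(S_inv, b)    # posterior mean after this observation
    return S_inv, b, m
```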
A few notes worth pointing out:
Alternatively, the maximum a posteriori (MAP) estimate can be found by directly expanding the negative log of the above posterior distribution:
where the constant term collects the normalization factors. Maximization of the above equation with respect to $\mathbf{w}$ leads to
where the positive ratio of the precisions naturally appears and plays the role of a regularization parameter that helps avoid overfitting the data. Note that this is precisely the same result obtained for the mean of the weights earlier. This type of regularization can be, and sometimes is, included in an ad hoc manner in the previously encountered regression problem.
The coefficients maximizing this objective function, denoted as before, can be used to evaluate the mean of the probability distribution.
This example considers a sine function sampled and perturbed with zero-mean Gaussian noise. A model with complexity M = 10 is selected, with and without a regularization parameter.
The first image shows the results of the model with and without regularization terms, trained on a dataset of size N = 10. Without a regularization parameter, the model exhibits overfitting and excessive oscillations.
The second image shows the results with an increased dataset size, where the overfitting behavior without a regularization parameter is reduced.
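A minimal sketch reproducing the flavor of this experiment (the data sizes, noise level, and `lam` are illustrative choices, not the exact values behind the figures):

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0, 1, 10)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, x.size)

M, lam = 10, 1e-3
Phi = np.vander(x, M, increasing=True)

# Unregularized vs. regularized (ridge) solutions
w_plain = np.linalg.lstsq(Phi, t, rcond=None)[0]
w_reg = np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)
print(np.abs(w_plain).max(), np.abs(w_reg).max())  # regularization shrinks weights
```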
Reference: Sections 1.2.6 and 3.3.2 of PR&ML book
As pointed out in the previous example, the posterior distribution is not directly used. The dependence on the model weights can be removed by integrating them out, i.e., by evaluating the marginal distribution.
where,
The last equation for the marginal distribution follows either by expanding the integrand, separating, and integrating out the components (which turn out to be Gaussian), or by directly using the general results for Gaussian marginals.
In contrast to the previous examples, which focused on simple linear basis functions, this example considers fitting a higher-order model to a sine function perturbed by Gaussian noise with zero mean and constant standard deviation.
Specifically, a polynomial basis function of 9th order (M = 10) is considered.
The mean is evaluated as
where the posterior mean is evaluated using the training set, and the basis functions are evaluated at the query point. Note that each prediction now comes with an associated standard deviation.
The train-predict paradigm thus remains:
The video below shows the predicted model mean and its standard deviation being continuously updated as more random observation points are added to the training set.
The implementation linked above, as in previous examples, uses a naive update of the training phase, where the mean and covariance are re-evaluated from scratch whenever the training set is updated.
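A minimal sketch of the predictive mean and standard deviation under the above expressions (polynomial basis assumed; `m` and `S` are the posterior mean and covariance from the earlier sketch, and the names are mine):

```python
import numpy as np

def predictive(x_new, m, S, beta, M):
    """Predictive distribution at query points:
    mean = m_N^T phi(x),  var = 1/beta + phi(x)^T S_N phi(x)."""
    phi = np.vander(np.atleast_1d(x_new), M, increasing=True)
    mean = phi @ m
    var = 1.0 / beta + np.einsum('ij,jk,ik->i', phi, S, phi)
    return mean, np.sqrt(var)
```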
One of the caveats thus far has been the introduction of fixed hyperparameters. A full Bayesian treatment would place prior distributions over these parameters as well as the model weights. Doing so, however, no longer results in analytical solutions for the posterior and predictive distributions, and approximate solutions to the corresponding distributions must be sought instead. More on this later.
Reference: Section 3.3.3 and Chapter 6 of PR&ML book
In the preceding sections, the primary focus was on restricting the basis functions to polynomials.
Consider the MAP solution encountered before, i.e., the minimization of the objective function
whose minimizer was found to be
Now, define the vector $\mathbf{a}$ as
and note that, although not explicitly specified, $\mathbf{a}$ is itself a function of $\mathbf{w}$. The minimizer can thus be implicitly expressed as
and the original objective function can be re-written as
where $\boldsymbol{\Phi}$ is known as the design matrix, and the symmetric matrix $\mathbf{K} = \boldsymbol{\Phi} \boldsymbol{\Phi}^T$ is known as the Gram matrix. Note that the individual components of the Gram matrix can be written as
where $x_n$ is the $n$-th observation point, and $k(x, x')$ is referred to as the kernel function.
Minimization of the objective function in terms of $\mathbf{a}$ leads to
and the target values can therefore be written as
where $\mathbf{k}(x)$ is the vector with components $k_n(x) = k(x, x_n)$.
Side note: this form of the solution is not new; it was already encountered in a previous example, where the model-weight prior was assumed to be Gaussian with zero mean and isotropic covariance. Repeating the results from that example:
The second-to-last line follows since, from linear algebra, a matrix-vector multiplication is a linear combination of the columns of the matrix scaled by the entries of the multiplying vector.
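A minimal sketch of this dual-form (kernel) solution, using a Gaussian kernel and a hand-picked regularization constant `lam` (both are illustrative assumptions; names are mine):

```python
import numpy as np

def gaussian_kernel(a, b, s=0.1):
    """k(x, x') = exp(-(x - x')^2 / (2 s^2)), evaluated on all pairs."""
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * s ** 2))

def fit_dual(x_train, t, lam=1e-2):
    """Dual solution: a = (K + lam*I)^{-1} t."""
    K = gaussian_kernel(x_train, x_train)
    return np.linalg.solve(K + lam * np.eye(x_train.size), t)

def predict_dual(x_new, x_train, a):
    """y(x) = k(x)^T a, with k_n(x) = k(x, x_n)."""
    return gaussian_kernel(x_new, x_train) @ a
```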
In this example we consider how Kernel Density Estimation (KDE) leads to a class of kernels.
Consider the kernel density estimate of the joint distribution from discrete samples, with the kernel density estimator component denoted by $f$. The joint probability is then estimated through the sampled points as
As before, the model is sought as the mean of the conditional distribution
Consider restricting the kernel density estimator to functions with zero mean, i.e., let
then, by introducing a change of variables and the corresponding normalized kernel, the mean can be expressed as
Consider zero-mean Gaussian kernel density estimator functions with isotropic covariance, i.e. let
Then,
The mean, the conditional distribution, and the standard deviation then become
The last equality follows from a simple integration by parts, using the unit-integral property of the Gaussian. In this example, the above equation is used to compute the standard deviation; wherever the computed variance is negative, it is overwritten with zero.
Note that under this form, the exponentials must be re-evaluated at all points, and the train and predict steps previously mentioned are no longer strictly distinguishable. This can lead to computationally expensive predictions with large training datasets. One possible approach to reducing the computational cost is to subsample the training set.
The example below shows samples of a sine function perturbed by isotropic Gaussian noise with zero mean. All data points are used for determining the predicted value.
The next figure shows the resulting model when a subset of the original data, picked uniformly at random, is used for training.
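A minimal sketch of the Nadaraya-Watson estimator with a Gaussian kernel (the kernel width `s` and the data are illustrative; names are mine):

```python
import numpy as np

def nadaraya_watson(x_new, x_train, t_train, s=0.1):
    """Predictive mean as a weighted average of the training targets,
    with Gaussian weights normalized over the training points."""
    W = np.exp(-(x_new[:, None] - x_train[None, :]) ** 2 / (2 * s ** 2))
    W = W / W.sum(axis=1, keepdims=True)
    return W @ t_train

rng = np.random.default_rng(3)
x = np.sort(rng.uniform(0, 1, 50))
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.1, x.size)
x_q = np.linspace(0, 1, 200)
print(nadaraya_watson(x_q, x, t)[:5])
```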
A random process is a Gaussian process if it is defined as a probability distribution over functions such that the function values evaluated at any arbitrary set of points are jointly Gaussian. In such processes, the joint distribution is completely specified by the mean and the covariance.
It is often convenient to assume that the random process has zero mean, in which case it is completely specified by its covariance (kernel) function. A zero-mean Gaussian process can then be expressed as
We can consider kernel functions directly, e.g.
The two figures below show five samples of zero-mean Gaussian processes with Gaussian and exponential kernels (covariances) over a prespecified range.
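A sketch of how such samples can be drawn (the kernel widths and the jitter term are hand-picked; names are mine):

```python
import numpy as np
import matplotlib.pyplot as plt

def gauss_kern(a, b, s=0.2):
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * s ** 2))

def exp_kern(a, b, theta=4.0):
    return np.exp(-theta * np.abs(a[:, None] - b[None, :]))

x = np.linspace(-1, 1, 200)
rng = np.random.default_rng(4)
for kern, name in [(gauss_kern, 'Gaussian'), (exp_kern, 'Exponential')]:
    K = kern(x, x) + 1e-8 * np.eye(x.size)   # jitter for numerical stability
    samples = rng.multivariate_normal(np.zeros(x.size), K, size=5)
    plt.plot(x, samples.T)
    plt.title(f'Zero-mean GP samples, {name} kernel')
    plt.show()
```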
To relate this back to the regression model previously encountered, consider revisiting the linear regression model, where the target values are expressed as a linear combination of weights and basis functions, and reconsider the zero-mean isotropic Gaussian prior introduced over the weights:
Considering the mean and covariance of the target values, we have
The last equality defines the kernel function for a linear-regression basis with a zero-mean isotropic Gaussian prior over the weights.
If the target value distribution conditioned on the model values is assumed to be isotropic Gaussian, i.e.
while restricting the model to be a zero-mean Gaussian process, namely
then the marginalized distribution of the model trained on the data set can be written as
where
where $\delta_{nm}$ is the Kronecker delta and $k$ is a kernel function; the last line follows, as before, either by using the generalized results for predictive distributions or by following a process similar to the marginal-distribution evaluation mentioned previously.
Using the trained results to evaluate predictions for new observations is straightforward. Consider the joint distribution
where the new covariance blocks are built from the kernel evaluated at the query point. The conditional distribution can then be found, using similar approaches as before, to be
Note that, as in the previous example, there is generally no training phase distinguishable from the prediction phase; specifically, the kernel vector must be constructed from the training data at each evaluation point. If the basis functions are limited to a finite set, then the original equations encountered in the Bayesian treatment of the regression problem are recovered. Additionally, note that the solution involves the inversion of an $N \times N$ matrix, whose cost grows cubically with the training-set size. The advantage of the Gaussian-process viewpoint, however, is that it admits covariance functions which can only be expressed in terms of infinite-dimensional basis functions.
This example uses the Gaussian kernel for regression on noisy sine data. The sine function is sampled at a fixed sampling frequency over a fixed range and perturbed by zero-mean Gaussian noise.
Using the results above, the mean and standard deviation of the model are computed for each point as
where $k$ is the kernel function.
The figure considers a Gaussian and a simple exponential kernel for creating the target model. In both cases, the last 3 points are removed from consideration.
Note that the standard deviation increases outside the training region.
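A minimal sketch of these GP predictive equations (Gaussian kernel; `beta` and the kernel width `s` are illustrative choices, and the names are mine):

```python
import numpy as np

def gp_predict(x_new, x_train, t_train, beta=25.0, s=0.1):
    """GP regression: C_N = K + (1/beta) I; for each query point,
    mean = k^T C_N^{-1} t and var = c - k^T C_N^{-1} k."""
    def kern(a, b):
        return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * s ** 2))
    C = kern(x_train, x_train) + np.eye(x_train.size) / beta
    C_inv = np.linalg.inv(C)
    k = kern(x_new, x_train)                  # shape (N_new, N)
    mean = k @ C_inv @ t_train
    c = 1.0 + 1.0 / beta                      # k(x, x) = 1 for this kernel
    var = c - np.einsum('ij,jk,ik->i', k, C_inv, k)
    return mean, np.sqrt(np.maximum(var, 0.0))
```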
Reference: Chapter 5 of PR&ML book
The basic neural network can be described as a series of functional transformations. The word network refers to the (nonlinear) coupling of the weights between basis functions.
The approximation at an input point is expressed in terms of the model's network coefficients at different "layers". For a two-layer network, following the regression theme, the model is given by
where the superscripts denote the layer number of the neural network and $M$ is the model complexity. Following the earlier notation, the $w$'s are referred to as the weights and the $w_0$'s as the model biases (of the first layer). Additionally, the $a_j$ are referred to as the activations, and $h(\cdot)$ is called the activation function.
The above expressions can be made less intimidating by rewriting them as
For $D$-dimensional input variables (not to be confused with the input data set) and $K$-dimensional output variables, the above equation at a point is generalized as
where
Neural networks therefore describe nonlinear models and generally require iterative solution.
Recall that model training in the least-squares case was an unconstrained minimization of the $\ell_2$ norm of the error function. The same process is followed for training neural networks, i.e., the minimizer of
is sought, where the model values at each observation point are given by the above expressions. The key difference is that the model is now nonlinear, and minimization of the objective function must be carried out iteratively.
A regression problem similar to those considered in past examples is used here: a sine function perturbed by zero-mean Gaussian noise is sampled, and two-layer networks of various complexities are trained on the sampled data. The output activation function is taken to be the identity.
Due to the nonlinearity of the model, at minimum the first derivatives of the objective function with respect to the weights at each layer are required. The derivatives are given as
First derivatives are sufficient for performing gradient descent with approximate line search for minimization of the error. The initial guess is picked randomly from a given range; a sketch follows.
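A minimal sketch of training such a two-layer tanh network with plain gradient descent (no line search here; the learning rate, iteration count, and network width are hand-picked and may need tuning):

```python
import numpy as np

rng = np.random.default_rng(5)
x = np.linspace(0, 1, 50)[:, None]                  # N x 1 inputs
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.1, x.shape)

M, lr = 8, 0.2                                      # hidden units, step size
W1 = rng.normal(0, 1, (1, M)); b1 = np.zeros(M)
W2 = rng.normal(0, 1, (M, 1)); b2 = np.zeros(1)

for step in range(20000):
    # Forward pass: z = tanh(x W1 + b1), y = z W2 + b2 (identity output)
    z = np.tanh(x @ W1 + b1)
    y = z @ W2 + b2
    err = (y - t) / x.shape[0]                      # dE/dy, averaged over data
    # Backward pass (chain rule through both layers)
    dW2 = z.T @ err
    db2 = err.sum(axis=0)
    delta = (err @ W2.T) * (1 - z ** 2)             # tanh'(a) = 1 - tanh(a)^2
    dW1 = x.T @ delta
    db1 = delta.sum(axis=0)
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print(float(np.mean((y - t) ** 2)))                 # final training MSE
```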
Note that not all of the results have converged to a sufficiently low gradient. The displayed results may vary considerably depending on the initial guess.
Key findings for each approach are tersely summarized in the table below, meant to be used only as a refresher of the overall approach.
| Approach | Model | Likelihood | Objective | Prior | Conditional | Post. Mean | Post. Var |
|---|---|---|---|---|---|---|---|
| Regression | | - | | | | | |
| MAP | | - | | | | | |
| Bayesian Regression | | | | | | | |
| Nadaraya-Watson (Gaussian kernel) | | - | | | | | |
| Gaussian Process | | | | | | | |
| Neural Network | Requires iterations | - | | | | | |
There are various other topics like sparse kernel machines, approximate inference, etc. that will be addressed in future posts.
Suggested general reference books:
Bishop, Christopher M. Pattern Recognition and Machine Learning. Springer, 2006.
The base reference for the content of this page, though certainly not the best book I have read.
Boyd, Stephen, and Lieven Vandenberghe. Convex Optimization. Cambridge university press, 2004.
A classic and a must-read introduction to (convex) optimization for everyone, regardless of background. A PDF copy is also available on the Stanford webpage as of this writing.
Greenberg, Michael D. Foundations of Applied Mathematics. Courier Corporation, 2013.
Good reference book or refresher on applied math.
Bulmer, Michael George. Principles of Statistics. Courier Corporation, 1979.
Very easy read and short refresher on basics of statistics.