This is the third part of a series of notes taken while reading the Pattern Recognition and Machine Learning (Springer, 2006) book, and includes simple examples with implementations for demonstration. As in the first and second posts, the goal of these posts is not to be rigorous, comprehensive, or to provide any new or advanced information. The target audience was first and foremost myself, but these notes are made available for anyone else who happens to find them useful.
Some contents are hidden away from the default view inside modal dialogs where no page redirection takes place. These links are underlined (see e.g. here). Clicking outside the window, clicking the close button, or clicking the back button in the browser will close the window.
A basic background in mathematics is assumed, including basic calculus, basic linear algebra, exposure to optimization, and basic statistics; see suggested readings at the end of this page. If your statistics is rusty, fear not, as the author of the PR & ML book writes:
All snippets are done using Python 3, using only the numpy and matplotlib add-on packages. Also, the ffmpeg package needs to be installed to create the videos.
Reference: Section 4.1.5
Part 1 and Part 2 of the ML&PR notes considered supervised learning applied to regression problems. This post takes a step in a different direction and considers the application of supervised learning to classification problems. Specifically, consider an input training dataset where each data point is assigned to a known target class. Then the classification problem can be stated as follows.
As in Part 1, this problem can be approached by optimizing an objective function. Consider first the restriction to a simpler problem, namely the linear classification model for a two-class problem. The hyperplane (decision surface) defined by $y(\mathbf{x}) \equiv \mathbf{w}^\top\mathbf{x} + w_0 = 0$
may be used to implicitly subdivide the input space into two halfspaces, where each halfspace identifies a separate class. Specifically, referring to the figure below, note that $y(\mathbf{x})$ provides a signed measure of the distance from the decision surface. The two classes can thus be implicitly defined through the relationship $y(\mathbf{x}) > 0$ for class $\mathcal{C}_1$ and $y(\mathbf{x}) < 0$ for class $\mathcal{C}_2$,
while the decision surface corresponds to $y(\mathbf{x}) = 0$.
The orientation of the separating hyperplane may be determined by considering the minimization of an appropriate objective function, as was considered in Part 1 and Part 2 of the posts covering the treatment of regression problems. Specifically, consider the sum-of-squares objective function defined by $E = \tfrac{1}{2}\sum_{n=1}^{N}\left(\mathbf{w}^\top\mathbf{x}_n + w_0 - t_n\right)^2.$
In the above, $t_n$ denotes the target class of point $\mathbf{x}_n$, chosen as $t_n = N/N_1$ for class $\mathcal{C}_1$ and $t_n = -N/N_2$ for class $\mathcal{C}_2$. Note that under this choice the sum of the target values over the dataset is zero: $\sum_n t_n = 0$.
Consider now the minimization of the objective function. Minimization with respect to $w_0$ requires that $\sum_{n=1}^{N}\left(\mathbf{w}^\top\mathbf{x}_n + w_0 - t_n\right) = 0,$ or $w_0 = -\mathbf{w}^\top\mathbf{m},$
where $\mathbf{m}$ is understood to be the mean of the total dataset. If the individual class means are defined to be $\mathbf{m}_k = \frac{1}{N_k}\sum_{n\in\mathcal{C}_k}\mathbf{x}_n,$ then $\mathbf{m} = \frac{1}{N}\left(N_1\mathbf{m}_1 + N_2\mathbf{m}_2\right).$
Similarly, minimization with respect to $\mathbf{w}$ requires that $\sum_{n=1}^{N}\left(\mathbf{w}^\top\mathbf{x}_n + w_0 - t_n\right)\mathbf{x}_n = \mathbf{0}.$ After some work, the above equation can be re-expressed as $\left(S_W + \frac{N_1 N_2}{N} S_B\right)\mathbf{w} = N\left(\mathbf{m}_1 - \mathbf{m}_2\right),$ where $S_W = \sum_{n\in\mathcal{C}_1}(\mathbf{x}_n - \mathbf{m}_1)(\mathbf{x}_n - \mathbf{m}_1)^\top + \sum_{n\in\mathcal{C}_2}(\mathbf{x}_n - \mathbf{m}_2)(\mathbf{x}_n - \mathbf{m}_2)^\top, \qquad S_B = (\mathbf{m}_2 - \mathbf{m}_1)(\mathbf{m}_2 - \mathbf{m}_1)^\top.$ It is worth noting that $S_B\mathbf{w}$ is in the direction of $(\mathbf{m}_2 - \mathbf{m}_1)$, and thus the previous equation can be re-expressed as $\mathbf{w} \propto S_W^{-1}\left(\mathbf{m}_2 - \mathbf{m}_1\right).$
The orientation of the separating hyperplane can therefore be explicitly learned from the data. Specifically, note that the previously described Training/Learning phase followed by a Prediction phase is still applicable:
Training: Given an input training set $\{\mathbf{x}_n\}$ and corresponding target classes $\{t_n\}$, determine the direction of the hyperplane (model weights) through $\mathbf{w} \propto S_W^{-1}\left(\mathbf{m}_2 - \mathbf{m}_1\right).$
Note that there are no iterations involved in the training phase. Also note that the in-class means can be evaluated sequentially, and that $w_0 = -\mathbf{w}^\top\mathbf{m}$.
Prediction: For a new observation $\mathbf{x}$, compute $y(\mathbf{x}) = \mathbf{w}^\top\mathbf{x} + w_0$ and assign a class based on its sign.
Though it may not be immediately obvious, it will be shown below that this result corresponds to the maximization of the separation of the class means projected onto $\mathbf{w}$ while the in-class variance is minimized.
Two datasets are randomly created with pre-specified means and covariances, such that the two classes are likely to overlap. The linear discriminant is evaluated (trained) on the input dataset.
The same paradigm is followed as before for classification:
The figure below shows the evaluated linear discriminant for the two classes.
Consider now an alternative viewpoint of the same problem, namely the Fisher's linear discriminant:
Given datasets belonging to classes $\mathcal{C}_1$ and $\mathcal{C}_2$, find the projection of the dataset to one dimension using $y = \mathbf{w}^\top\mathbf{x}$ such that $J(\mathbf{w}) = \frac{\left(m_2' - m_1'\right)^2}{s_1^2 + s_2^2}$ is maximized, where $m_k' = \mathbf{w}^\top\mathbf{m}_k, \qquad s_k^2 = \sum_{n\in\mathcal{C}_k}\left(y_n - m_k'\right)^2,$ and $N_k$ denotes the number of points belonging to class $\mathcal{C}_k$.
The numerator of Fisher's criterion provides a measure of the (projected) class-mean separation, while the denominator provides a measure of the (summed) in-class variance (see figure below). Maximizing Fisher's criterion thus maximizes the separation of the class means projected onto $\mathbf{w}$ while the in-class variance is minimized.
Note that the objective function is to be optimized in terms of the model weights $\mathbf{w}$. Using the definitions above, the objective function can be explicitly written in terms of the model weights as $J(\mathbf{w}) = \frac{\mathbf{w}^\top S_B\,\mathbf{w}}{\mathbf{w}^\top S_W\,\mathbf{w}},$ where $S_B = (\mathbf{m}_2 - \mathbf{m}_1)(\mathbf{m}_2 - \mathbf{m}_1)^\top$ is the between-class scatter matrix and $S_W = \sum_k\sum_{n\in\mathcal{C}_k}(\mathbf{x}_n - \mathbf{m}_k)(\mathbf{x}_n - \mathbf{m}_k)^\top$ is the within-class scatter matrix.
The maximization of the above objective function is straightforward. Taking the derivative with respect to $\mathbf{w}$ and rearranging results in $\left(\mathbf{w}^\top S_B\,\mathbf{w}\right) S_W\mathbf{w} = \left(\mathbf{w}^\top S_W\,\mathbf{w}\right) S_B\mathbf{w}.$
Using the definition of $S_B$, note that $S_B\mathbf{w}$ scales in the direction of $(\mathbf{m}_2 - \mathbf{m}_1)$. Also, the leading scalar coefficients in the above equation can be neglected, since the objective is only to find the orientation of $\mathbf{w}$. Hence, we can find an expression for the model weights as $\mathbf{w} \propto S_W^{-1}\left(\mathbf{m}_2 - \mathbf{m}_1\right),$ where the constant of proportionality can be treated as a scaling constant. Note that, as alluded to earlier, this result is the same as the one obtained for the least squares solution considered in the preceding section.
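As a concrete illustration, the closed-form training and prediction steps can be sketched in numpy. The synthetic means and covariance below are my own illustrative choices, not values from the book:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two synthetic classes with pre-specified means and a shared covariance
# (illustrative values of my own choosing).
cov = [[1.0, 0.3], [0.3, 1.0]]
X1 = rng.multivariate_normal([0.0, 0.0], cov, size=100)
X2 = rng.multivariate_normal([3.0, 3.0], cov, size=100)

# In-class means and the total mean
m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
m = np.concatenate([X1, X2]).mean(axis=0)

# Within-class scatter S_W (sum of the per-class scatter matrices)
S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)

# Training: w is proportional to S_W^{-1} (m2 - m1); normalized for convenience
w = np.linalg.solve(S_W, m2 - m1)
w /= np.linalg.norm(w)

# Prediction: signed distance y(x) = w^T (x - m); the sign picks the class
def predict(x):
    return w @ (x - m)
```

Under this sign convention, points on the class-2 side give `predict(x) > 0`.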
Reference: Section 4.2
Consider now taking a probabilistic approach to classification, such that it generalizes to more than two classes. Assume a joint distribution $p(\mathbf{x}, \mathcal{C}_k)$ over data points and classes. Assuming a prior class distribution $p(\mathcal{C}_k)$, the posterior distribution can be expressed as $p(\mathcal{C}_k|\mathbf{x}) = \frac{p(\mathbf{x}|\mathcal{C}_k)\,p(\mathcal{C}_k)}{\sum_j p(\mathbf{x}|\mathcal{C}_j)\,p(\mathcal{C}_j)}.$ By using $a_k = \ln\left(p(\mathbf{x}|\mathcal{C}_k)\,p(\mathcal{C}_k)\right),$ the above can be expressed as the normalized exponential (softmax) $p(\mathcal{C}_k|\mathbf{x}) = \frac{\exp(a_k)}{\sum_j \exp(a_j)}.$
Now consider assuming that the class-conditional distributions are Gaussian with a shared covariance $\Sigma$, given by $p(\mathbf{x}|\mathcal{C}_k) = \frac{1}{(2\pi)^{D/2}|\Sigma|^{1/2}}\exp\left(-\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_k)^\top\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu}_k)\right),$ and consider finding the parameters by the maximum likelihood solution. Let $\mathbf{t}_n$ denote the (one-hot) target class for the $n$-th point, where $t_{nk}$ is zero except for the correct class. Also, let the prior class probabilities be $p(\mathcal{C}_k) = \pi_k$. The likelihood function is given by $p(\mathbf{T}|\boldsymbol{\pi}, \{\boldsymbol{\mu}_k\}, \Sigma) = \prod_{n=1}^{N}\prod_{k}\left(\pi_k\,\mathcal{N}(\mathbf{x}_n|\boldsymbol{\mu}_k, \Sigma)\right)^{t_{nk}}.$
As before, consider the maximization of the log likelihood. Before doing so, recall that the prior probabilities must sum to one, which can be included in the likelihood function as an equality constraint through the introduction of a Lagrange multiplier $\lambda$: $\mathcal{L} = \ln p(\mathbf{T}|\boldsymbol{\pi}, \{\boldsymbol{\mu}_k\}, \Sigma) + \lambda\left(\sum_k \pi_k - 1\right).$ The priors are then found to be $\pi_k = N_k/N$ by maximizing the constrained log likelihood with respect to $\pi_k$.
Consider now the maximization of the log likelihood with respect to the model means: $\frac{\partial}{\partial\boldsymbol{\mu}_k}\sum_{n=1}^{N} t_{nk}\ln\mathcal{N}(\mathbf{x}_n|\boldsymbol{\mu}_k,\Sigma) = \sum_{n=1}^{N} t_{nk}\,\Sigma^{-1}\left(\mathbf{x}_n - \boldsymbol{\mu}_k\right) = \mathbf{0}.$ By rearranging the above expression, an explicit expression for the model mean is found to be $\boldsymbol{\mu}_k = \frac{1}{N_k}\sum_{n=1}^{N} t_{nk}\,\mathbf{x}_n.$
Finally, consider the maximization of the objective function with respect to the covariance matrix. Before proceeding, it is worth mentioning two identities: $\frac{\partial}{\partial A}\ln|A| = \left(A^{-1}\right)^\top \qquad \text{and} \qquad \frac{\partial}{\partial A}\operatorname{tr}\left(A^{-1}B\right) = -\left(A^{-1} B A^{-1}\right)^\top.$
See Convex Optimization section A4.1 (or a book on tensor algebra) for proof.
The full log likelihood, including all terms involving the covariance matrix, is $-\frac{N}{2}\ln|\Sigma| - \frac{1}{2}\sum_{n=1}^{N}\sum_{k} t_{nk}\left(\mathbf{x}_n - \boldsymbol{\mu}_k\right)^\top\Sigma^{-1}\left(\mathbf{x}_n - \boldsymbol{\mu}_k\right) + \text{const}.$ Then the gradient with respect to the covariance matrix, utilizing the identities listed above, becomes $-\frac{N}{2}\Sigma^{-1} + \frac{1}{2}\,\Sigma^{-1}\left(\sum_{n}\sum_{k} t_{nk}\left(\mathbf{x}_n - \boldsymbol{\mu}_k\right)\left(\mathbf{x}_n - \boldsymbol{\mu}_k\right)^\top\right)\Sigma^{-1} = \mathbf{0}.$ Rearranging and multiplying through by $\Sigma$ gives $\Sigma = \frac{1}{N}\sum_{k}\sum_{n} t_{nk}\left(\mathbf{x}_n - \boldsymbol{\mu}_k\right)\left(\mathbf{x}_n - \boldsymbol{\mu}_k\right)^\top.$
Summarizing, given Gaussian class-conditional probability densities with shared covariance $\Sigma$ and class priors $\pi_k$, the posterior distributions are given by $p(\mathcal{C}_k|\mathbf{x}) = \frac{\exp(a_k)}{\sum_j\exp(a_j)}, \qquad a_k = \boldsymbol{\mu}_k^\top\Sigma^{-1}\mathbf{x} - \tfrac{1}{2}\boldsymbol{\mu}_k^\top\Sigma^{-1}\boldsymbol{\mu}_k + \ln\pi_k.$
The previous training followed by prediction paradigm then holds:
Two separate training sets are considered. The first dataset comprises four classes, while the second contains five. The two-step training-prediction paradigm is followed to find the model priors, means, and the shared covariance, and subsequently the input space is classified based on these trained values.
Note that the shared boundaries between each classification region are linear.
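The training and prediction steps for the shared-covariance Gaussian model can be sketched as follows; the function names and the one-hot target layout are my own conventions, not the book's:

```python
import numpy as np

def fit_gaussian_generative(X, T):
    """Maximum likelihood estimates for Gaussian class-conditionals with a
    shared covariance. X: (N, D) data, T: (N, K) one-hot targets."""
    N, K = T.shape
    Nk = T.sum(axis=0)                  # per-class counts N_k
    priors = Nk / N                     # pi_k = N_k / N
    means = (T.T @ X) / Nk[:, None]     # mu_k = (1/N_k) sum_n t_nk x_n
    D = X.shape[1]
    Sigma = np.zeros((D, D))
    for k in range(K):                  # shared covariance: pooled scatter
        Xc = X - means[k]
        Sigma += (T[:, k][:, None] * Xc).T @ Xc
    return priors, means, Sigma / N

def posterior(x, priors, means, Sigma):
    """Posterior p(C_k|x) via softmax over a_k = ln p(x|C_k) + ln pi_k.
    The shared -0.5*ln|Sigma| term cancels in the softmax and is dropped."""
    Si = np.linalg.inv(Sigma)
    a = np.array([-0.5 * (x - m) @ Si @ (x - m) + np.log(p)
                  for m, p in zip(means, priors)])
    a -= a.max()                        # numerical stability
    e = np.exp(a)
    return e / e.sum()
```

A prediction then assigns the class with the largest posterior.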
Reference: Section 4.3
In the preceding section, the class posterior distributions were found by assuming class-conditional probabilities and priors over the classes, and using maximum likelihood to determine the parameters defining these probabilities. Alternatively, one may consider working directly with the functional form of the posterior distribution (and defining decision boundaries).
As indicated above, the class posterior distribution is a normalized exponential, $p(\mathcal{C}_k|\mathbf{x}) = \frac{\exp(a_k)}{\sum_j \exp(a_j)}.$ Note that $\sum_k p(\mathcal{C}_k|\mathbf{x}) = 1$.
Consider now the special case where the activations are linear functions of the input, $a_k = \mathbf{w}_k^\top\mathbf{x},$ which resembles the linear decision surfaces seen in the preceding sections and alluded to in the preceding example.
Without introducing priors over the distributions, consider now the maximization of the likelihood function $p(\mathbf{T}|\mathbf{w}_1,\ldots,\mathbf{w}_K) = \prod_{n=1}^{N}\prod_{k=1}^{K} y_{nk}^{\,t_{nk}}, \qquad y_{nk} = p(\mathcal{C}_k|\mathbf{x}_n),$ where $\mathbf{T}$ contains the target variables with elements $t_{nk}$. As before, this is equivalent to the minimization of the negative log likelihood, namely $E = -\sum_{n=1}^{N}\sum_{k=1}^{K} t_{nk}\ln y_{nk}.$
Note that this is a nonlinear function of the model weights, and its optimization requires at least the first derivative. Note that $\frac{\partial y_{nk}}{\partial a_{nj}} = y_{nk}\left(I_{kj} - y_{nj}\right),$ and thus $\nabla_{\mathbf{w}_j} E = \sum_{n=1}^{N}\left(y_{nj} - t_{nj}\right)\mathbf{x}_n.$
A training dataset comprising four different clusters is considered. Backtracking is used to solve the nonlinear optimization problem.
The discriminants and the classification regions are shown for each step of the optimization iteration. The color of the lines corresponds to the cluster coloring, though the contour-plot coloring is arbitrary.
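A minimal sketch of this gradient driving the optimization, written as $\Phi^\top(\mathbf{Y} - \mathbf{T})$ in matrix form. The post uses backtracking; the fixed step size here is an illustrative simplification of my own:

```python
import numpy as np

def softmax(A):
    A = A - A.max(axis=1, keepdims=True)   # numerical stability
    E = np.exp(A)
    return E / E.sum(axis=1, keepdims=True)

def fit_logistic(Phi, T, steps=500, eta=0.1):
    """Multiclass logistic regression by plain gradient descent.
    Phi: (N, M) design matrix, T: (N, K) one-hot targets."""
    N, M = Phi.shape
    K = T.shape[1]
    W = np.zeros((M, K))
    for _ in range(steps):
        Y = softmax(Phi @ W)               # predicted posteriors y_nk
        W -= eta * Phi.T @ (Y - T) / N     # grad E = Phi^T (Y - T)
    return W
```

On three well-separated clusters this converges to near-perfect classification in a few hundred steps.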
Reference: Chapter 7
This is a good place to take a look at support vector machines (SVMs). Unlike the logistic regression approach considered above, which relied on a probabilistic argument to arrive at the objective function, here we start from a geometric viewpoint.
The support vector machine is inherently a two-class classifier (though in practice it is used for multi-class problems as well). Consider once again the classification of a linearly separable dataset. As before, we consider finding a hyperplane that "best" separates the two classes. In SVM, "best" means the hyperplane that maximizes the orthogonal distance between the hyperplane and the closest data points from each class.
Recall that $y(\mathbf{x}_n)$ provides a signed measure of the distance of a data point to the hyperplane. Without going into detail (see Section 7.1 of the PRML book), we can choose to construct the weights such that $t_n y(\mathbf{x}_n) = 1$ for the points closest to the hyperplane, which then requires that $t_n y(\mathbf{x}_n) \geq 1$ for all points, where $t_n y(\mathbf{x}_n) < 0$ implies an incorrectly classified point.
With some handwaving, we can then set the objective function as the classification error given by $E = \sum_{n=1}^{N}\max\left(0,\; 1 - t_n\, y(\mathbf{x}_n)\right).$
Despite SVM inherently being a two-class classifier, we can heuristically extend it to multi-class classification. One approach, commonly known as "one-vs-all", does this by constructing the error function such that a non-zero error is introduced whenever a wrong class is associated with a given data point, where $k$ is the known correct class index for the data point under consideration.
Consider the pseudo-example below:
Consider a pseudo-example consisting of a three-class classification problem. Let
be the weighting coefficients. Now suppose that, at a given iteration, we find
The predicted error then becomes
Now suppose that, for the same iteration, we find
The corresponding predicted error then becomes
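The one-vs-all error computation in the pseudo-example can be sketched as below. The margin-based form is my own concrete stand-in for the heuristic error function described above, not the book's exact formulation:

```python
import numpy as np

def ova_error(W, x, k_true):
    """One-vs-all error for a single point: penalize every wrong class
    whose activation comes within a unit margin of the true class's
    activation (a_true >= a_j + 1 is wanted for all j != k_true)."""
    a = W @ x                                 # per-class activations w_k^T x
    margins = a - a[k_true] + 1.0             # margin violations per class
    margins[k_true] = 0.0                     # the true class incurs no error
    return np.maximum(0.0, margins).sum()
```

When the correct class's activation dominates all others by at least the margin, the error is zero; otherwise each offending class contributes.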
Reference: Chapter 5
In the preceding section, direct use of the functional form of the posterior distribution (and defining decision boundaries) was considered. Specifically, the discriminants and decision boundaries followed from activations that were linear functions of the input. Consider now revisiting neural networks, which were briefly introduced in Part 1 with application to regression.
Consider a more generalized form of the two-layer network:
In the above,
Also, within the context of classification, $y_k$ represents the posterior probability of class $\mathcal{C}_k$, while $t_{nk}$ denotes the membership of the $n$-th data point in class $\mathcal{C}_k$.
Recall that the log likelihood function from the previous section was
Let
where component-wise operation is understood where necessary. The solution of the resulting nonlinear optimization problem requires at minimum an evaluation of the first derivative of the objective function with respect to the model weights. Using expressions for the derivative of the model output with respect to the model weights, the first derivatives of the objective function with respect to the model weights can be computed as
where $\mathbf{w}$ represents the model weights, including biases.
This example considers application of a neural network in classification of an input training dataset.
The training set consists of five clusters created randomly such that there exists some overlap of the data. A two-layer neural network of 8th order complexity is considered, and all model weights of the neural network are initialized by sampling them from a uniform distribution. The associated network diagram is shown in the image below.
A naive backtracking approach with a fixed number of iterations is considered. The video below shows the evolution of the learned classification regions at each step of the iteration.
Note that the decision boundaries are no longer linear.
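The forward pass and the backpropagated first derivatives for a two-layer classification network of this kind can be sketched as follows. The tanh hidden units and softmax outputs are assumptions on my part, as are the function names:

```python
import numpy as np

def forward(X, W1, b1, W2, b2):
    """Two-layer network: tanh hidden units, softmax outputs."""
    Z = np.tanh(X @ W1 + b1)                    # hidden-layer activations
    A = Z @ W2 + b2
    A = A - A.max(axis=1, keepdims=True)        # numerical stability
    E = np.exp(A)
    return Z, E / E.sum(axis=1, keepdims=True)  # Y[n, k] ~ p(C_k | x_n)

def cross_entropy(X, T, W1, b1, W2, b2):
    """Negative log likelihood E = -sum_n sum_k t_nk ln y_nk."""
    _, Y = forward(X, W1, b1, W2, b2)
    return -np.sum(T * np.log(Y))

def gradients(X, T, W1, b1, W2, b2):
    """Backpropagated gradients of the cross-entropy error."""
    Z, Y = forward(X, W1, b1, W2, b2)
    d2 = Y - T                                  # output-layer deltas
    dW2, db2 = Z.T @ d2, d2.sum(axis=0)
    d1 = (d2 @ W2.T) * (1.0 - Z**2)             # tanh'(a) = 1 - tanh(a)^2
    dW1, db1 = X.T @ d1, d1.sum(axis=0)
    return dW1, db1, dW2, db2
```

A finite-difference check on any single weight is a cheap way to validate the backpropagated derivatives before handing them to the optimizer.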
| Approach | Objective Function | Target | Model weights/properties | Notes |
|---|---|---|---|---|
| Least Squares (2-classes) | | | | Same results using Fisher's Discriminant |
| Probabilistic Generative Models | | | | — |
| Logistic Regression | | | Nonlinear iterations to find weights | — |
| Neural Networks | | | Nonlinear iterations to find weights | Two layer network example |
Suggested general reference books:
Bishop, Christopher M. Pattern Recognition and Machine Learning. Springer, 2006.
Base reference for content of this page, though certainly not the best book that I have read.
Boyd, Stephen, and Lieven Vandenberghe. Convex Optimization. Cambridge university press, 2004.
A classic and a must-read introductory book to (convex) optimization for everyone regardless of their background. A PDF copy is also available on the Stanford webpage as of this writing.
Greenberg, Michael D. Foundations of Applied Mathematics. Courier Corporation, 2013.
Good reference book or refresher on applied math.
Bulmer, Michael George. Principles of Statistics. Courier Corporation, 1979.
Very easy read and short refresher on basics of statistics.