Introduction to PR and ML (Part 3): Classification

2019-10-20

This is the third part of a series of notes taken while reading the Pattern Recognition and Machine Learning (Springer, 2006) book, together with simple examples and implementations for demonstration. As in the first and second posts, the goal of these posts is not to be rigorous, comprehensive, or to provide any new or advanced information. The target audience was first and foremost myself, but these notes are made available for anyone else who happens to find them useful.

Some contents are hidden from the default view inside modal dialogs where no page redirection takes place. These links are underlined (see e.g. here). Clicking outside the window, clicking the close button, or clicking the back button in the browser will close the window.

A basic background in mathematics, including basic calculus, basic linear algebra, exposure to optimization, and basic statistics, is assumed; see the suggested readings at the end of this page. If your statistics is rusty, fear not, as the author of the PR & ML book writes:

All snippets are written in Python 3 using only the numpy and matplotlib add-on packages. The ffmpeg package also needs to be installed to create the videos.

Linear Discriminants

Reference: Section 4.1.5

Part 1 and Part 2 of the ML & PR notes considered supervised learning applied to regression problems. This post takes a step in a different direction and considers the application of supervised learning to classification problems. Specifically, consider an input training dataset where each data point is assigned to a known target class. The classification problem can then be stated as follows.

Given the training dataset $\{\mathbf{x}_n, t_n\}_{n=1}^N$, where $t_n$ denotes the known class of point $\mathbf{x}_n$, make future predictions on the class a new observation $\mathbf{x}$ belongs to.

As in Part 1, this problem can be approached by optimizing an objective function. Consider first the restriction to a simpler problem, namely the linear classification model for a two-class problem ($t_n \in \{\mathcal{C}_1, \mathcal{C}_2\}$). The hyperplane (decision surface) defined by

$$y(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + w_0 = 0$$

may be used to implicitly subdivide the space into two halfspaces, where each halfspace identifies a separate class. Specifically, referring to the figure below, note that $y(\mathbf{x})$ provides a signed measure of the distance from the decision surface. The two classes can thus be implicitly defined through the relationship

$$y(\mathbf{x}) > 0 \;\Rightarrow\; \mathbf{x} \in \mathcal{C}_1, \qquad y(\mathbf{x}) < 0 \;\Rightarrow\; \mathbf{x} \in \mathcal{C}_2,$$

while the decision surface corresponds to $y(\mathbf{x}) = 0$.

The orientation of the supporting hyperplane may be determined by minimizing an appropriate objective function, as was considered in Part 1 and Part 2 of the posts covering the treatment of regression problems. Specifically, consider the objective function defined by

$$E(\mathbf{w}, w_0) = \frac{1}{2} \sum_{n=1}^{N} \left( \mathbf{w}^T \mathbf{x}_n + w_0 - t_n \right)^2$$

In the above, $t_n$ denotes the target value of point $\mathbf{x}_n$, chosen as $t_n = N/N_1$ for class $\mathcal{C}_1$ and $t_n = -N/N_2$ for class $\mathcal{C}_2$. Note that under this choice, the sum of the target values over the dataset is zero:

$$\sum_{n=1}^{N} t_n = N_1 \frac{N}{N_1} - N_2 \frac{N}{N_2} = 0$$

Consider now the minimization of the objective function. Setting the derivative with respect to the bias $w_0$ to zero requires that

$$\sum_{n=1}^{N} \left( \mathbf{w}^T \mathbf{x}_n + w_0 - t_n \right) = 0$$

or, using $\sum_n t_n = 0$,

$$w_0 = -\mathbf{w}^T \mathbf{m}$$

where $\mathbf{m} = \frac{1}{N} \sum_{n=1}^{N} \mathbf{x}_n$ is understood to be the mean of the total dataset. If the individual class means are defined to be

$$\mathbf{m}_k = \frac{1}{N_k} \sum_{n \in \mathcal{C}_k} \mathbf{x}_n$$

then

$$\mathbf{m} = \frac{N_1 \mathbf{m}_1 + N_2 \mathbf{m}_2}{N}$$

Similarly, minimization with respect to $\mathbf{w}$ requires that

$$\sum_{n=1}^{N} \left( \mathbf{w}^T \mathbf{x}_n + w_0 - t_n \right) \mathbf{x}_n = 0$$

After some work, the above equation can be re-expressed as

$$\left( S_W + \frac{N_1 N_2}{N} S_B \right) \mathbf{w} = N \left( \mathbf{m}_1 - \mathbf{m}_2 \right)$$

where

$$S_W = \sum_{n \in \mathcal{C}_1} (\mathbf{x}_n - \mathbf{m}_1)(\mathbf{x}_n - \mathbf{m}_1)^T + \sum_{n \in \mathcal{C}_2} (\mathbf{x}_n - \mathbf{m}_2)(\mathbf{x}_n - \mathbf{m}_2)^T, \qquad S_B = (\mathbf{m}_2 - \mathbf{m}_1)(\mathbf{m}_2 - \mathbf{m}_1)^T$$

It is worth noting that $S_B \mathbf{w}$ is in the direction of $(\mathbf{m}_2 - \mathbf{m}_1)$, and thus we can re-express the previous equation as

$$\mathbf{w} \propto S_W^{-1} (\mathbf{m}_2 - \mathbf{m}_1)$$

The orientation of the supporting hyperplane can therefore be explicitly learned from the data. Specifically, note that the previously described Training/Learning phase followed by the Prediction phase is still applicable:

  1. Training: Given the input training set and corresponding target classes, determine the direction of the hyperplane (model weights) through

    $$\mathbf{w} \propto S_W^{-1} (\mathbf{m}_2 - \mathbf{m}_1), \qquad w_0 = -\mathbf{w}^T \mathbf{m}$$

    Note that there are no iterations involved in the training phase. Also, note that the in-class means can be evaluated sequentially, and that $\mathbf{m} = (N_1 \mathbf{m}_1 + N_2 \mathbf{m}_2)/N$.

  2. Prediction: For a new observation $\mathbf{x}$, compute

    $$y(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + w_0$$

    and assign a class based on its sign.
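The two phases above can be sketched in numpy as follows. This is a minimal sketch, not the implementation used in the example below; the function names are illustrative, and the sign of $\mathbf{w}$ is fixed here so that $y(\mathbf{x}) > 0$ identifies class 1, matching the convention above.

```python
import numpy as np

def train_linear_discriminant(X1, X2):
    """Learn hyperplane weights from two classes of points (rows of X1, X2)."""
    N1, N2 = len(X1), len(X2)
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)                # in-class means
    # Within-class scatter matrix S_W
    Sw = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
    # Direction S_W^{-1} (m1 - m2); sign chosen so y(x) > 0 on the class-1 side
    w = np.linalg.solve(Sw, m1 - m2)
    m = (N1 * m1 + N2 * m2) / (N1 + N2)                      # total dataset mean
    w0 = -w @ m                                              # bias: w0 = -w^T m
    return w, w0

def predict(x, w, w0):
    """Assign class 1 if y(x) = w^T x + w0 > 0, else class 2."""
    return 1 if w @ x + w0 > 0 else 2
```

Note that, as stated above, training involves no iterations: the means and scatter matrix are accumulated in a single pass over the data.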

Though it may not be immediately obvious, it will be shown below that this result corresponds to the maximization of the separation of the projected class means onto $\mathbf{w}$ while the in-class variance is minimized.

Example: Linear discriminant

Two datasets are randomly created with pre-specified means and covariances. The data is such that there is some likelihood of overlap. The linear discriminant is evaluated (trained) on the input dataset.

The same paradigm is followed as before for classification:

  1. Training: Using the input dataset, learn the discriminant weights $\mathbf{w}$ and $w_0$.
  2. Prediction: Given a new observation $\mathbf{x}$, compute $y(\mathbf{x})$. The class of the data point is then determined by the sign of $y(\mathbf{x})$.

The figure below shows the evaluated linear discriminant for the two classes.

Source

Fisher's Discriminant

Consider now an alternative viewpoint of the same problem, namely Fisher's linear discriminant:

Fisher's Discriminant

Given datasets $\mathcal{C}_1$ and $\mathcal{C}_2$, find a projection of the dataset to one dimension using $y = \mathbf{w}^T \mathbf{x}$ such that

$$J(\mathbf{w}) = \frac{(m_2 - m_1)^2}{s_1^2 + s_2^2}$$

is maximized, where

$$m_k = \mathbf{w}^T \mathbf{m}_k, \qquad s_k^2 = \sum_{n \in \mathcal{C}_k} (y_n - m_k)^2$$

and $N_k$ denotes the number of points belonging to class $\mathcal{C}_k$.

The numerator of Fisher's criterion provides a measure of the (projected) class mean separation, while the denominator provides a measure of the (summed) in-class variance (see figure below). Maximization of Fisher's criterion thus maximizes the separation of the projected class means onto $\mathbf{w}$ while the in-class variance is minimized.

Note that the objective function is to be optimized in terms of the model weights $\mathbf{w}$. Using the definitions above, the objective function can be explicitly written in terms of the model weights as

$$J(\mathbf{w}) = \frac{\mathbf{w}^T S_B \mathbf{w}}{\mathbf{w}^T S_W \mathbf{w}}$$

where

$$S_B = (\mathbf{m}_2 - \mathbf{m}_1)(\mathbf{m}_2 - \mathbf{m}_1)^T, \qquad S_W = \sum_{k} \sum_{n \in \mathcal{C}_k} (\mathbf{x}_n - \mathbf{m}_k)(\mathbf{x}_n - \mathbf{m}_k)^T$$

The maximization of the above objective function is straightforward. Taking the derivative with respect to $\mathbf{w}$ and rearranging results in

$$(\mathbf{w}^T S_B \mathbf{w}) \, S_W \mathbf{w} = (\mathbf{w}^T S_W \mathbf{w}) \, S_B \mathbf{w}$$

Using the definition of $S_B$, note that $S_B \mathbf{w}$ scales in the direction of $(\mathbf{m}_2 - \mathbf{m}_1)$. Also, the leading scalar coefficients in the above equation can be neglected, since the objective is only to find the orientation of $\mathbf{w}$. Hence, we can find an expression for the model weights as

$$\mathbf{w} = \alpha \, S_W^{-1} (\mathbf{m}_2 - \mathbf{m}_1)$$

where $\alpha$ can be treated as a scaling constant. Note that, as alluded to earlier, this result is the same as the one obtained for the least squares solution considered in the preceding section.
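As a quick numerical sanity check (not part of the original example), the closed-form direction can be compared against arbitrary directions: since the closed-form $\mathbf{w}$ maximizes Fisher's criterion, no other direction should score higher. The helper names below are illustrative.

```python
import numpy as np

def fisher_J(w, X1, X2):
    """Fisher criterion: squared projected-mean separation over summed in-class variance."""
    y1, y2 = X1 @ w, X2 @ w
    return (y2.mean() - y1.mean()) ** 2 / (
        ((y1 - y1.mean()) ** 2).sum() + ((y2 - y2.mean()) ** 2).sum())

def fisher_w(X1, X2):
    """Closed-form direction w = S_W^{-1} (m2 - m1), up to scale."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    Sw = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
    return np.linalg.solve(Sw, m2 - m1)
```

Note that $J$ is invariant to the scale of $\mathbf{w}$, so only the direction matters, consistent with neglecting the scalar coefficients above.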

Probabilistic Generative Models

Reference: Section 4.2

Consider now taking a probabilistic approach to classification such that it is generalizable to more than two classes. Assume a joint distribution of class and data point given by $p(\mathbf{x}, \mathcal{C}_k)$. Assuming a prior class distribution $p(\mathcal{C}_k)$, the posterior distribution can be expressed as

$$p(\mathcal{C}_k \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid \mathcal{C}_k) \, p(\mathcal{C}_k)}{\sum_j p(\mathbf{x} \mid \mathcal{C}_j) \, p(\mathcal{C}_j)}$$

By using

$$a_k = \ln \left( p(\mathbf{x} \mid \mathcal{C}_k) \, p(\mathcal{C}_k) \right)$$

the above expression can be expressed as

$$p(\mathcal{C}_k \mid \mathbf{x}) = \frac{\exp(a_k)}{\sum_j \exp(a_j)}$$

Now assume that the class-conditional distributions are Gaussian with a shared covariance, given by

$$p(\mathbf{x} \mid \mathcal{C}_k) = \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma})$$

and consider finding the parameters by the maximum likelihood solution. Let $\mathbf{t}_n$ denote the target class vector for the $n$-th point, where $t_{nk}$ is zero except for the correct class. Also, let the prior class probabilities be $p(\mathcal{C}_k) = \pi_k$. The likelihood function is given by

$$p(\{\mathbf{t}_n\} \mid \boldsymbol{\pi}, \{\boldsymbol{\mu}_k\}, \boldsymbol{\Sigma}) = \prod_{n=1}^{N} \prod_{k} \left[ \pi_k \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}) \right]^{t_{nk}}$$

As before, consider the maximization of the log likelihood. Before doing so, recall that the prior probabilities must sum to one; this can be included in the likelihood function as an equality constraint through the introduction of a Lagrange multiplier:

$$\ln p(\{\mathbf{t}_n\} \mid \boldsymbol{\pi}, \{\boldsymbol{\mu}_k\}, \boldsymbol{\Sigma}) + \lambda \left( \sum_k \pi_k - 1 \right)$$

The priors can then be found to be $\pi_k = N_k / N$ by maximizing the constrained log likelihood with respect to $\pi_k$.

Consider now the maximization of the log likelihood with respect to the model means:

$$\nabla_{\boldsymbol{\mu}_k} \ln p = \sum_{n=1}^{N} t_{nk} \, \boldsymbol{\Sigma}^{-1} (\mathbf{x}_n - \boldsymbol{\mu}_k) = 0$$

By rearranging the above expression, an explicit expression for the model mean is found to be

$$\boldsymbol{\mu}_k = \frac{1}{N_k} \sum_{n=1}^{N} t_{nk} \, \mathbf{x}_n$$

Finally, consider the maximization of the objective function with respect to the covariance matrix. Before proceeding, it is worth mentioning two identities:

  1. $\frac{\partial}{\partial \mathbf{A}} \ln |\mathbf{A}| = \mathbf{A}^{-T}$, or $\mathbf{A}^{-1}$ for symmetric $\mathbf{A}$.

  2. $\frac{\partial}{\partial \mathbf{A}} \operatorname{tr}\left( \mathbf{A}^{-1} \mathbf{B} \right) = -\mathbf{A}^{-1} \mathbf{B} \mathbf{A}^{-1}$ for symmetric $\mathbf{A}$ and $\mathbf{B}$. See Convex Optimization, Section A.4.1 (or a book on tensor algebra) for proofs.

The terms of the full log likelihood involving the covariance matrix are

$$-\frac{N}{2} \ln |\boldsymbol{\Sigma}| - \frac{N}{2} \operatorname{tr}\left( \boldsymbol{\Sigma}^{-1} \mathbf{S} \right), \qquad \mathbf{S} = \frac{1}{N} \sum_{k} \sum_{n=1}^{N} t_{nk} (\mathbf{x}_n - \boldsymbol{\mu}_k)(\mathbf{x}_n - \boldsymbol{\mu}_k)^T$$

Then the gradient with respect to the covariance matrix, utilizing the identities listed above, becomes

$$-\frac{N}{2} \boldsymbol{\Sigma}^{-1} + \frac{N}{2} \boldsymbol{\Sigma}^{-1} \mathbf{S} \boldsymbol{\Sigma}^{-1} = 0$$

Rearranging, multiplying through by $\boldsymbol{\Sigma}$ and rearranging again gives

$$\boldsymbol{\Sigma} = \mathbf{S}$$

Summarizing, given Gaussian class-conditional probability densities with shared covariance $\boldsymbol{\Sigma}$ and class priors $\pi_k$, the posterior distributions are given by

$$p(\mathcal{C}_k \mid \mathbf{x}) = \frac{\exp(a_k)}{\sum_j \exp(a_j)}, \qquad a_k = \mathbf{w}_k^T \mathbf{x} + w_{k0}, \qquad \mathbf{w}_k = \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_k, \qquad w_{k0} = -\frac{1}{2} \boldsymbol{\mu}_k^T \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_k + \ln \pi_k$$

The previous training followed by prediction paradigm then holds:

  1. Training: Given the input dataset and class count $K$, compute the model parameters $\pi_k$, $\boldsymbol{\mu}_k$, and $\boldsymbol{\Sigma}$.
  2. Prediction: Given an observation $\mathbf{x}$, compute the class posterior distributions using the above expression for $p(\mathcal{C}_k \mid \mathbf{x})$. The point is assigned to the class which has the highest posterior probability.

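The training and prediction steps above can be sketched in numpy as follows. This is a minimal sketch assuming one-hot target rows in `T`; the function names are illustrative, not those of the implementation linked below.

```python
import numpy as np

def fit_generative(X, T):
    """ML estimates for shared-covariance Gaussian class-conditionals.
    X: (N, D) data; T: (N, K) one-hot targets."""
    N, K = T.shape
    Nk = T.sum(axis=0)                      # points per class, N_k
    pi = Nk / N                             # priors: pi_k = N_k / N
    mu = (T.T @ X) / Nk[:, None]            # class means mu_k
    Sigma = np.zeros((X.shape[1], X.shape[1]))
    for k in range(K):                      # shared covariance: weighted in-class scatter
        d = X - mu[k]
        Sigma += (T[:, k][:, None] * d).T @ d / N
    return pi, mu, Sigma

def posterior(x, pi, mu, Sigma):
    """Softmax over the linear activations a_k = w_k^T x + w_k0."""
    Si = np.linalg.inv(Sigma)
    a = np.array([mu[k] @ Si @ x - 0.5 * mu[k] @ Si @ mu[k] + np.log(pi[k])
                  for k in range(len(pi))])
    a -= a.max()                            # shift for numerical stability
    p = np.exp(a)
    return p / p.sum()
```

As noted in the training step above, the fit is closed form: no iterations are required.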
Example: Probabilistic Generative Models

Two separate training sets are considered. The first dataset is comprised of four classes, while the second contains five. The two-step training-prediction paradigm is followed to find the model priors, means, and the shared covariance; subsequently, the input space is classified based on these trained values.

Source

Note that the shared boundaries between each classification region are linear.

Logistic Regression

Reference: Section 4.3

In the preceding section, the class posterior distributions were found by assuming class-conditional probabilities and priors over the classes, and using maximum likelihood to determine the parameters defining these probabilities. Alternatively, one may consider working directly with the functional form of the posterior distribution (and defining decision boundaries).

As indicated above, the class posterior distribution is a normalized exponential (softmax)

$$p(\mathcal{C}_k \mid \mathbf{x}) = \frac{\exp(a_k)}{\sum_j \exp(a_j)}$$

Note that $\sum_k p(\mathcal{C}_k \mid \mathbf{x}) = 1$.

Consider now the special case where the activations are given by

$$a_k = \mathbf{w}_k^T \mathbf{x}$$

which resembles the linear decision surfaces seen in the preceding sections and alluded to in the preceding example.

Without introducing priors over the distributions, consider now the maximization of the likelihood function

$$p(\mathbf{T} \mid \mathbf{w}_1, \ldots, \mathbf{w}_K) = \prod_{n=1}^{N} \prod_{k=1}^{K} y_{nk}^{t_{nk}}, \qquad y_{nk} = p(\mathcal{C}_k \mid \mathbf{x}_n)$$

where $\mathbf{T}$ contains the target variables with elements $t_{nk}$. As before, this is equivalent to the minimization of the negative log likelihood, namely

$$E(\mathbf{w}_1, \ldots, \mathbf{w}_K) = -\sum_{n=1}^{N} \sum_{k=1}^{K} t_{nk} \ln y_{nk}$$

Let's briefly review the objective function above.

Note that this is a nonlinear function of the model weights, and its minimization requires at least its first derivative. Note that

$$\frac{\partial y_{nk}}{\partial a_{nj}} = y_{nk} \left( I_{kj} - y_{nj} \right)$$

and thus

$$\nabla_{\mathbf{w}_j} E = \sum_{n=1}^{N} \left( y_{nj} - t_{nj} \right) \mathbf{x}_n$$
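The gradient can be used directly in a descent loop. The sketch below uses a fixed step size for simplicity rather than the backtracking line search used in the example below, and assumes the caller has appended a bias column to `X`; the names are illustrative.

```python
import numpy as np

def softmax(A):
    A = A - A.max(axis=1, keepdims=True)    # shift for numerical stability
    E = np.exp(A)
    return E / E.sum(axis=1, keepdims=True)

def fit_logistic(X, T, steps=500, lr=0.1):
    """Minimize E = -sum_{n,k} t_nk ln y_nk by gradient descent.
    X: (N, D) with a bias column appended by the caller; T: (N, K) one-hot."""
    W = np.zeros((X.shape[1], T.shape[1]))
    for _ in range(steps):
        Y = softmax(X @ W)                  # y_nk from activations a = X W
        W -= lr * X.T @ (Y - T) / len(X)    # grad_wj E = sum_n (y_nj - t_nj) x_n
    return W
```

Unlike the closed-form fits of the previous sections, the weights here are found iteratively.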

Example: Logistic Regression

A training dataset comprised of four different clusters is considered. Backtracking line search is used to solve the nonlinear optimization problem.

The discriminants and the classification regions are shown for each step of the optimization iteration. The color of the lines corresponds to the cluster coloring, though the contour plot coloring is arbitrary.

Source

Support Vector Machines (SVM)

Reference: Chapter 7

This is a good place to take a look at SVMs. Unlike the logistic regression approach considered above, which relied on a probabilistic approach for arriving at the objective function, here we'll need to start out with a geometric viewpoint.

The support vector machine is inherently a two-class classifier (in practice it is used for multi-class problems as well). Consider once again the classification of a linearly separable dataset. As before, we consider finding a hyperplane that "best" separates the two classes. In SVM, we choose "best" to mean the hyperplane that maximizes the orthogonal distance between the hyperplane and the closest data point from each class.

Recall that $y(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + w_0$ provides a signed distance measure of a data point to the hyperplane. Without going into detail (see Section 7.1 of the PRML book), we can choose to construct the weights such that $t_n y(\mathbf{x}_n) = 1$ for the points that are closest to the hyperplane, which then requires that, for all points,

$$t_n y(\mathbf{x}_n) \geq 1$$

where $t_n y(\mathbf{x}_n) < 0$ implies an incorrectly classified point.

With some handwaving, we can then set the objective function to the classification error given by

$$E = \sum_{n=1}^{N} \max \left( 0, \, 1 - t_n y(\mathbf{x}_n) \right)$$

Despite SVM inherently being a two-class classifier, we can heuristically extend it to multi-class classification. One approach, commonly known as "one-vs-all", does this by constructing the error function such that a non-zero error is introduced when a wrong class is associated with a given data point,

where the known correct class index for the given data point under consideration is singled out.

Consider the pseudo-example below:

Consider a pseudo-example consisting of a three-class classification problem. Let

be the weighting coefficients.

Now suppose that, at a given iteration, for a given data point, we find

The predicted error then becomes

Now suppose that, for the same iteration, for another data point, we find

The predicted error corresponding to this second point then becomes
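The per-point error computation in the pseudo-example above can be sketched as follows. Note that this sketch assumes one common one-vs-all formulation, where each class activation $y_k(\mathbf{x})$ is compared against a target of $+1$ if $k$ is the correct class and $-1$ otherwise; this particular choice is an assumption for illustration, not necessarily the exact formulation of the example above.

```python
import numpy as np

def one_vs_all_hinge_error(Y, k_correct):
    """Hinge-style error for class activations Y (length-K array of y_k(x)),
    assuming target +1 for the correct class and -1 for all others."""
    t = -np.ones(len(Y))            # wrong classes target -1
    t[k_correct] = 1.0              # correct class targets +1
    # A term contributes only when t_k * y_k falls below the margin of 1
    return np.maximum(0.0, 1.0 - t * Y).sum()
```

Under this formulation, a point whose correct-class activation exceeds $+1$ and whose other activations fall below $-1$ contributes zero error.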

Neural Networks

Reference: Chapter 5

In the preceding section, direct use of the functional form of the posterior distribution (and defining decision boundaries) was considered. Specifically, the discriminants and the decision boundaries were chosen to be a linear function of the "activations". Consider now revisiting neural networks, which were briefly introduced in Part 1 with application to regression.

Consider a more generalized form of the two-layer network:

$$y_k(\mathbf{x}, \mathbf{w}) = \sigma \left( \sum_{j=1}^{M} w_{kj}^{(2)} \, h \left( \sum_{i=1}^{D} w_{ji}^{(1)} x_i + w_{j0}^{(1)} \right) + w_{k0}^{(2)} \right)$$

In the above,

$M$: Model complexity (number of hidden units)
$D$: Dimension of the dataset (e.g. $D = 2$ for a two-dimensional dataset)
$K$: Number of target classes
$w_{ji}^{(1)}$: First layer model weights
$w_{j0}^{(1)}$: First layer model biases
$w_{kj}^{(2)}$: Second layer model weights
$w_{k0}^{(2)}$: Second layer model biases
$h$: (First layer) activation function, commonly taken to be the sigmoid or $\tanh$ functions
$\sigma$: Second layer activation function, e.g. a logistic sigmoid function

Also, within the context of classification, $y_k$ represents the posterior probability of class $\mathcal{C}_k$, while $t_{nk}$ denotes whether the $n$-th data point belongs to class $\mathcal{C}_k$.

Recall that the negative log likelihood function from the previous section was

$$E(\mathbf{w}) = -\sum_{n=1}^{N} \sum_{k=1}^{K} t_{nk} \ln y_{nk}$$

where component-wise operation is understood where necessary. The solution to the resulting nonlinear optimization problem requires at minimum an evaluation of the first derivatives of the model outputs with respect to the model weights. Using expressions for these derivatives (backpropagation), the first derivatives of the objective function with respect to the model weights can be computed as

$$\nabla_{\mathbf{w}} E = -\sum_{n=1}^{N} \sum_{k=1}^{K} \frac{t_{nk}}{y_{nk}} \nabla_{\mathbf{w}} y_{nk}$$

where $\mathbf{w}$ represents the model weights including biases.
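The derivative evaluation can be sketched as follows. This minimal sketch assumes a $\tanh$ hidden activation and a softmax output with cross-entropy error (one possible choice of $h$ and $\sigma$, not necessarily that of the example below), with the gradients obtained by backpropagation.

```python
import numpy as np

def forward(X, W1, b1, W2, b2):
    """Two-layer network: tanh hidden layer, softmax output."""
    Z = np.tanh(X @ W1 + b1)                        # hidden-layer activations
    A = Z @ W2 + b2                                 # output activations
    A = A - A.max(axis=1, keepdims=True)            # shift for numerical stability
    Y = np.exp(A) / np.exp(A).sum(axis=1, keepdims=True)
    return Z, Y

def gradients(X, T, W1, b1, W2, b2):
    """Backpropagated gradients of E = -sum_{n,k} t_nk ln y_nk."""
    Z, Y = forward(X, W1, b1, W2, b2)
    d2 = Y - T                                      # output-layer deltas
    d1 = (d2 @ W2.T) * (1.0 - Z ** 2)               # backpropagate through tanh
    return X.T @ d1, d1.sum(axis=0), Z.T @ d2, d2.sum(axis=0)
```

A useful check on such an implementation is to compare the backpropagated gradients against finite differences of the objective function.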

Example: Supervised Learning - Neural Network

This example considers the application of a neural network to the classification of an input training dataset.

The training set consists of five clusters created randomly such that there exists some overlap of the data. A two-layer neural network with 8th-order complexity ($M = 8$) is considered, and all model weights of the neural network are initialized by sampling them from a uniform distribution. The associated network diagram is shown in the image below.

A naive backtracking approach with a fixed number of iterations is considered. The video below shows the evolution of the learned classification regions at each step of the iteration.

Source

Note that the decision boundaries are no longer linear.

Summary

| Approach | Objective Function | Target | Model weights/properties | Notes |
| --- | --- | --- | --- | --- |
| Least Squares (2 classes) | Sum-of-squares error | $t_n \in \{N/N_1, -N/N_2\}$ | $\mathbf{w} \propto S_W^{-1}(\mathbf{m}_2 - \mathbf{m}_1)$ | Same results using Fisher's Discriminant |
| Probabilistic Generative Models | Log likelihood | one-hot $t_{nk}$ | $\pi_k = N_k/N$, $\boldsymbol{\mu}_k$, shared $\boldsymbol{\Sigma}$ | Closed form (no iterations) |
| Logistic Regression | Cross-entropy | one-hot $t_{nk}$ | $\mathbf{w}_k$ | Nonlinear iterations to find $\mathbf{w}_k$ |
| Neural Networks | Cross-entropy | one-hot $t_{nk}$ | Layer weights and biases | Nonlinear iterations to find weights; two-layer network example |

References

Suggested general reference books: