This is the third part of a series of notes taken while reading the Pattern Recognition and Machine Learning (Springer, 2006) book, and includes simple examples with implementations for demonstration. As in the first and second posts, the goal of these posts is not to be rigorous, comprehensive, or to provide any new or advanced information. The target audience was first and foremost myself, but these notes are made available for anyone else who happens to find them useful.
Some contents are hidden away from the default view inside modal dialogs where no page redirection takes place. These links are underlined (see e.g. here). Clicking outside the window, clicking the close button, or clicking the back button in the browser will close the window.
A basic background in mathematics is assumed, including basic calculus, basic linear algebra, exposure to optimization, and basic statistics; see suggested readings at the end of this page. If your statistics is rusty, fear not, as the author of the PR & ML book writes:
All snippets are done using Python 3, using only the numpy and matplotlib add-on packages. Also, the ffmpeg package needs to be installed to create the videos.
Reference: Section 4.1.5
Part 1 and Part 2 of the ML&PR notes considered supervised learning applied to regression problems. This post takes a step in a different direction and considers the application of supervised learning to classification problems. Specifically, consider an input training dataset where each data point is assigned to a known target class. Then the classification problem can be stated as follows.
As in Part 1, this problem can be approached by optimizing an objective function. Consider first the restriction to a simpler problem, namely the linear classification model for a two-class problem. The hyperplane (decision surface) defined by $y(\mathbf{x}) \equiv \mathbf{w}^\top\mathbf{x} + w_0 = 0$
may be used to implicitly subdivide the input space into two halfspaces, where each halfspace identifies a separate class. Specifically, referring to the figure below, note that $y(\mathbf{x})$ provides a signed measure of the distance from the decision surface. The two classes can thus be implicitly defined through the relationship $y(\mathbf{x}) > 0$ for class $\mathcal{C}_1$ and $y(\mathbf{x}) < 0$ for class $\mathcal{C}_2$,
while the decision surface corresponds to $y(\mathbf{x}) = 0$.
The orientation of the separating hyperplane may be determined by considering the minimization of an appropriate objective function, as was considered in Part 1 and Part 2 of the posts covering the treatment of regression problems. Specifically, consider the sum-of-squares objective function defined by $E = \tfrac{1}{2}\sum_{n=1}^{N}\left(\mathbf{w}^\top\mathbf{x}_n + w_0 - t_n\right)^2.$
In the above, $t_n$ denotes the target class of point $\mathbf{x}_n$, chosen as $t_n = N/N_1$ for class $\mathcal{C}_1$ and $t_n = -N/N_2$ for class $\mathcal{C}_2$. Note that under this choice the sum of the target values over the dataset is zero: $\sum_n t_n = 0$.
Consider now the minimization of the objective function. Minimization with respect to $w_0$ requires that $\sum_{n=1}^{N}\left(\mathbf{w}^\top\mathbf{x}_n + w_0 - t_n\right) = 0,$ or $w_0 = -\mathbf{w}^\top\mathbf{m},$
where $\mathbf{m}$ is understood to be the mean of the total dataset. If the individual class means are defined to be $\mathbf{m}_k = \frac{1}{N_k}\sum_{n\in\mathcal{C}_k}\mathbf{x}_n,$ then $\mathbf{m} = \frac{1}{N}\left(N_1\mathbf{m}_1 + N_2\mathbf{m}_2\right).$
Similarly, minimization with respect to $\mathbf{w}$ requires that $\sum_{n=1}^{N}\left(\mathbf{w}^\top\mathbf{x}_n + w_0 - t_n\right)\mathbf{x}_n = \mathbf{0}.$ After some work, the above equation can be re-expressed as $\left(S_W + \frac{N_1 N_2}{N} S_B\right)\mathbf{w} = N\left(\mathbf{m}_1 - \mathbf{m}_2\right),$ where $S_W = \sum_{n\in\mathcal{C}_1}(\mathbf{x}_n - \mathbf{m}_1)(\mathbf{x}_n - \mathbf{m}_1)^\top + \sum_{n\in\mathcal{C}_2}(\mathbf{x}_n - \mathbf{m}_2)(\mathbf{x}_n - \mathbf{m}_2)^\top, \qquad S_B = (\mathbf{m}_2 - \mathbf{m}_1)(\mathbf{m}_2 - \mathbf{m}_1)^\top.$ It is worth noting that $S_B\mathbf{w}$ is in the direction of $(\mathbf{m}_2 - \mathbf{m}_1)$, and thus the previous equation can be re-expressed as $\mathbf{w} \propto S_W^{-1}\left(\mathbf{m}_2 - \mathbf{m}_1\right).$
The orientation of the separating hyperplane can therefore be explicitly learned from the data. Specifically, note that the previously described Training/Learning phase followed by a Prediction phase is still applicable:
Training: Given an input training set $\{\mathbf{x}_n\}$ and corresponding target classes $\{t_n\}$, determine the direction of the hyperplane (model weights) through $\mathbf{w} \propto S_W^{-1}\left(\mathbf{m}_2 - \mathbf{m}_1\right).$
Note that there are no iterations involved in the training phase. Also note that the in-class means can be evaluated sequentially, and that $w_0 = -\mathbf{w}^\top\mathbf{m}$.
Prediction: For a new observation $\mathbf{x}$, compute $y(\mathbf{x}) = \mathbf{w}^\top\mathbf{x} + w_0$ and assign a class based on its sign.
Though it may not be immediately obvious, it will be shown below that this result corresponds to the maximization of the separation of the class means projected onto $\mathbf{w}$ while the in-class variance is minimized.
Two datasets are randomly created with pre-specified means and covariances, such that the two classes are likely to overlap. The linear discriminant is evaluated (trained) on the input dataset.
The same paradigm is followed as before for classification:
The figure below shows the evaluated linear discriminant for the two classes.
Consider now an alternative viewpoint of the same problem, namely the Fisher's linear discriminant:
Given datasets belonging to classes $\mathcal{C}_1$ and $\mathcal{C}_2$, find the projection of the dataset to one dimension using $y = \mathbf{w}^\top\mathbf{x}$ such that $J(\mathbf{w}) = \frac{\left(m_2' - m_1'\right)^2}{s_1^2 + s_2^2}$ is maximized, where $m_k' = \mathbf{w}^\top\mathbf{m}_k, \qquad s_k^2 = \sum_{n\in\mathcal{C}_k}\left(y_n - m_k'\right)^2,$ and $N_k$ denotes the number of points belonging to class $\mathcal{C}_k$.
The numerator of Fisher's criterion provides a measure of the (projected) class-mean separation, while the denominator provides a measure of the (summed) in-class variance (see figure below). Maximizing Fisher's criterion thus maximizes the separation of the class means projected onto $\mathbf{w}$ while the in-class variance is minimized.
Note that the objective function is to be optimized in terms of the model weights $\mathbf{w}$. Using the definitions above, the objective function can be explicitly written in terms of the model weights as $J(\mathbf{w}) = \frac{\mathbf{w}^\top S_B\,\mathbf{w}}{\mathbf{w}^\top S_W\,\mathbf{w}},$ where $S_B = (\mathbf{m}_2 - \mathbf{m}_1)(\mathbf{m}_2 - \mathbf{m}_1)^\top$ is the between-class scatter matrix and $S_W = \sum_k\sum_{n\in\mathcal{C}_k}(\mathbf{x}_n - \mathbf{m}_k)(\mathbf{x}_n - \mathbf{m}_k)^\top$ is the within-class scatter matrix.
The maximization of the above objective function is straightforward. Taking the derivative with respect to $\mathbf{w}$ and rearranging results in $\left(\mathbf{w}^\top S_B\,\mathbf{w}\right) S_W\mathbf{w} = \left(\mathbf{w}^\top S_W\,\mathbf{w}\right) S_B\mathbf{w}.$
Using the definition of $S_B$, note that $S_B\mathbf{w}$ scales in the direction of $(\mathbf{m}_2 - \mathbf{m}_1)$. Also, the leading scalar coefficients in the above equation can be neglected, since the objective is only to find the orientation of $\mathbf{w}$. Hence, we can find an expression for the model weights as $\mathbf{w} \propto S_W^{-1}\left(\mathbf{m}_2 - \mathbf{m}_1\right),$ where the constant of proportionality can be treated as a scaling constant. Note that, as alluded to earlier, this result is the same as the one obtained for the least squares solution considered in the preceding section.
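As a concrete illustration, the closed-form training and prediction steps can be sketched in numpy. The synthetic means and covariance below are my own illustrative choices, not values from the book:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two synthetic classes with pre-specified means and a shared covariance
# (illustrative values of my own choosing).
cov = [[1.0, 0.3], [0.3, 1.0]]
X1 = rng.multivariate_normal([0.0, 0.0], cov, size=100)
X2 = rng.multivariate_normal([3.0, 3.0], cov, size=100)

# In-class means and the total mean
m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
m = np.concatenate([X1, X2]).mean(axis=0)

# Within-class scatter S_W (sum of the per-class scatter matrices)
S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)

# Training: w is proportional to S_W^{-1} (m2 - m1); normalized for convenience
w = np.linalg.solve(S_W, m2 - m1)
w /= np.linalg.norm(w)

# Prediction: signed distance y(x) = w^T (x - m); the sign picks the class
def predict(x):
    return w @ (x - m)
```

Under this sign convention, points on the class-2 side give `predict(x) > 0`.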
Reference: Section 4.2
Consider now taking a probabilistic approach to classification, such that it generalizes to more than two classes. Assume a joint distribution $p(\mathbf{x}, \mathcal{C}_k)$ over data points and classes. Assuming a prior class distribution $p(\mathcal{C}_k)$, the posterior distribution can be expressed as $p(\mathcal{C}_k|\mathbf{x}) = \frac{p(\mathbf{x}|\mathcal{C}_k)\,p(\mathcal{C}_k)}{\sum_j p(\mathbf{x}|\mathcal{C}_j)\,p(\mathcal{C}_j)}.$ By using $a_k = \ln\left(p(\mathbf{x}|\mathcal{C}_k)\,p(\mathcal{C}_k)\right),$ the above can be expressed as the normalized exponential (softmax) $p(\mathcal{C}_k|\mathbf{x}) = \frac{\exp(a_k)}{\sum_j \exp(a_j)}.$
Now consider assuming that the class-conditional distributions are Gaussian with a shared covariance $\Sigma$, given by $p(\mathbf{x}|\mathcal{C}_k) = \frac{1}{(2\pi)^{D/2}|\Sigma|^{1/2}}\exp\left(-\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_k)^\top\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu}_k)\right),$ and consider finding the parameters by the maximum likelihood solution. Let $\mathbf{t}_n$ denote the (one-hot) target class for the $n$-th point, where $t_{nk}$ is zero except for the correct class. Also, let the prior class probabilities be $p(\mathcal{C}_k) = \pi_k$. The likelihood function is given by $p(\mathbf{T}|\boldsymbol{\pi}, \{\boldsymbol{\mu}_k\}, \Sigma) = \prod_{n=1}^{N}\prod_{k}\left(\pi_k\,\mathcal{N}(\mathbf{x}_n|\boldsymbol{\mu}_k, \Sigma)\right)^{t_{nk}}.$
As before, consider the maximization of the log likelihood. Before doing so, recall that the prior probabilities must sum to one, which can be included in the likelihood function as an equality constraint through the introduction of a Lagrange multiplier $\lambda$: $\mathcal{L} = \ln p(\mathbf{T}|\boldsymbol{\pi}, \{\boldsymbol{\mu}_k\}, \Sigma) + \lambda\left(\sum_k \pi_k - 1\right).$ The priors are then found to be $\pi_k = N_k/N$ by maximizing the constrained log likelihood with respect to $\pi_k$.
Consider now the maximization of the log likelihood with respect to the model means: $\frac{\partial}{\partial\boldsymbol{\mu}_k}\sum_{n=1}^{N} t_{nk}\ln\mathcal{N}(\mathbf{x}_n|\boldsymbol{\mu}_k,\Sigma) = \sum_{n=1}^{N} t_{nk}\,\Sigma^{-1}\left(\mathbf{x}_n - \boldsymbol{\mu}_k\right) = \mathbf{0}.$ By rearranging the above expression, an explicit expression for the model mean is found to be $\boldsymbol{\mu}_k = \frac{1}{N_k}\sum_{n=1}^{N} t_{nk}\,\mathbf{x}_n.$
Finally, consider the maximization of the objective function with respect to the covariance matrix. Before proceeding, it is worth mentioning two identities: $\frac{\partial}{\partial A}\ln|A| = \left(A^{-1}\right)^\top \qquad \text{and} \qquad \frac{\partial}{\partial A}\operatorname{tr}\left(A^{-1}B\right) = -\left(A^{-1} B A^{-1}\right)^\top.$
See Convex Optimization section A4.1 (or a book on tensor algebra) for proof.
The full log likelihood, including all terms involving the covariance matrix, is $-\frac{N}{2}\ln|\Sigma| - \frac{1}{2}\sum_{n=1}^{N}\sum_{k} t_{nk}\left(\mathbf{x}_n - \boldsymbol{\mu}_k\right)^\top\Sigma^{-1}\left(\mathbf{x}_n - \boldsymbol{\mu}_k\right) + \text{const}.$ Then the gradient with respect to the covariance matrix, utilizing the identities listed above, becomes $-\frac{N}{2}\Sigma^{-1} + \frac{1}{2}\,\Sigma^{-1}\left(\sum_{n}\sum_{k} t_{nk}\left(\mathbf{x}_n - \boldsymbol{\mu}_k\right)\left(\mathbf{x}_n - \boldsymbol{\mu}_k\right)^\top\right)\Sigma^{-1} = \mathbf{0}.$ Rearranging and multiplying through by $\Sigma$ gives $\Sigma = \frac{1}{N}\sum_{k}\sum_{n} t_{nk}\left(\mathbf{x}_n - \boldsymbol{\mu}_k\right)\left(\mathbf{x}_n - \boldsymbol{\mu}_k\right)^\top.$
Summarizing, given Gaussian class-conditional probability densities with shared covariance $\Sigma$ and class priors $\pi_k$, the posterior distributions are given by $p(\mathcal{C}_k|\mathbf{x}) = \frac{\exp(a_k)}{\sum_j\exp(a_j)}, \qquad a_k = \boldsymbol{\mu}_k^\top\Sigma^{-1}\mathbf{x} - \tfrac{1}{2}\boldsymbol{\mu}_k^\top\Sigma^{-1}\boldsymbol{\mu}_k + \ln\pi_k.$
The previous training followed by prediction paradigm then holds:
Two separate training sets are considered. The first dataset comprises four classes, while the second contains five. The two-step training-prediction paradigm is followed to find the model priors, means, and the shared covariance, and subsequently the input space is classified based on these trained values.
Note that the shared boundaries between each classification region are linear.
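The training and prediction steps for the shared-covariance Gaussian model can be sketched as follows; the function names and the one-hot target layout are my own conventions, not the book's:

```python
import numpy as np

def fit_gaussian_generative(X, T):
    """Maximum likelihood estimates for Gaussian class-conditionals with a
    shared covariance. X: (N, D) data, T: (N, K) one-hot targets."""
    N, K = T.shape
    Nk = T.sum(axis=0)                  # per-class counts N_k
    priors = Nk / N                     # pi_k = N_k / N
    means = (T.T @ X) / Nk[:, None]     # mu_k = (1/N_k) sum_n t_nk x_n
    D = X.shape[1]
    Sigma = np.zeros((D, D))
    for k in range(K):                  # shared covariance: pooled scatter
        Xc = X - means[k]
        Sigma += (T[:, k][:, None] * Xc).T @ Xc
    return priors, means, Sigma / N

def posterior(x, priors, means, Sigma):
    """Posterior p(C_k|x) via softmax over a_k = ln p(x|C_k) + ln pi_k.
    The shared -0.5*ln|Sigma| term cancels in the softmax and is dropped."""
    Si = np.linalg.inv(Sigma)
    a = np.array([-0.5 * (x - m) @ Si @ (x - m) + np.log(p)
                  for m, p in zip(means, priors)])
    a -= a.max()                        # numerical stability
    e = np.exp(a)
    return e / e.sum()
```

A prediction then assigns the class with the largest posterior.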
Reference: Section 4.3
In the preceding section, the class posterior distributions were found by assuming class-conditional probabilities and priors over the classes, and using maximum likelihood to determine the parameters defining these probabilities. Alternatively, one may consider working directly with the functional form of the posterior distribution (and defining decision boundaries).
As indicated above, the class posterior distribution is a normalized exponential, $p(\mathcal{C}_k|\mathbf{x}) = \frac{\exp(a_k)}{\sum_j \exp(a_j)}.$ Note that $\sum_k p(\mathcal{C}_k|\mathbf{x}) = 1$.
Consider now the special case where the activations are linear functions of the input, $a_k = \mathbf{w}_k^\top\mathbf{x},$ which resembles the linear decision surfaces seen in the preceding sections and alluded to in the preceding example.
Without introducing priors over the distributions, consider now the maximization of the likelihood function $p(\mathbf{T}|\mathbf{w}_1,\ldots,\mathbf{w}_K) = \prod_{n=1}^{N}\prod_{k=1}^{K} y_{nk}^{\,t_{nk}}, \qquad y_{nk} = p(\mathcal{C}_k|\mathbf{x}_n),$ where $\mathbf{T}$ contains the target variables with elements $t_{nk}$. As before, this is equivalent to the minimization of the negative log likelihood, namely $E = -\sum_{n=1}^{N}\sum_{k=1}^{K} t_{nk}\ln y_{nk}.$
Note that this is a nonlinear function of the model weights, and its optimization requires at least the first derivative. Note that $\frac{\partial y_{nk}}{\partial a_{nj}} = y_{nk}\left(I_{kj} - y_{nj}\right),$ and thus $\nabla_{\mathbf{w}_j} E = \sum_{n=1}^{N}\left(y_{nj} - t_{nj}\right)\mathbf{x}_n.$
A training dataset comprising four different clusters is considered. Backtracking is used to solve the nonlinear optimization problem.
The discriminants and the classification regions are shown for each step of the optimization iteration. The color of the lines corresponds to the cluster coloring, though the contour-plot coloring is arbitrary.
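A minimal sketch of this gradient driving the optimization, written as $\Phi^\top(\mathbf{Y} - \mathbf{T})$ in matrix form. The post uses backtracking; the fixed step size here is an illustrative simplification of my own:

```python
import numpy as np

def softmax(A):
    A = A - A.max(axis=1, keepdims=True)   # numerical stability
    E = np.exp(A)
    return E / E.sum(axis=1, keepdims=True)

def fit_logistic(Phi, T, steps=500, eta=0.1):
    """Multiclass logistic regression by plain gradient descent.
    Phi: (N, M) design matrix, T: (N, K) one-hot targets."""
    N, M = Phi.shape
    K = T.shape[1]
    W = np.zeros((M, K))
    for _ in range(steps):
        Y = softmax(Phi @ W)               # predicted posteriors y_nk
        W -= eta * Phi.T @ (Y - T) / N     # grad E = Phi^T (Y - T)
    return W
```

On three well-separated clusters this converges to near-perfect classification in a few hundred steps.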
Reference: Chapter 7
This is a good place to take a look at support vector machines (SVMs). Unlike the logistic regression approach considered above, which relied on a probabilistic argument to arrive at the objective function, here we start from a geometric viewpoint.
The support vector machine is inherently a two-class classifier (though in practice it is used for multi-class problems as well). Consider once again the classification of a linearly separable dataset. As before, we consider finding a hyperplane that "best" separates the two classes. In SVM, "best" means the hyperplane that maximizes the orthogonal distance between the hyperplane and the closest data points from each class.
Recall that $y(\mathbf{x}_n)$ provides a signed measure of the distance of a data point to the hyperplane. Without going into detail (see Section 7.1 of the PRML book), we can choose to construct the weights such that $t_n y(\mathbf{x}_n) = 1$ for the points closest to the hyperplane, which then requires that $t_n y(\mathbf{x}_n) \geq 1$ for all points, where $t_n y(\mathbf{x}_n) < 0$ implies an incorrectly classified point.
With some handwaving, we can then set the objective function as the classification error given by $E = \sum_{n=1}^{N}\max\left(0,\; 1 - t_n\, y(\mathbf{x}_n)\right).$
Despite SVM inherently being a two-class classifier, we can heuristically extend it to multi-class classification. One approach, commonly known as "one-vs-all", does this by constructing the error function such that a non-zero error is introduced whenever a wrong class is associated with a given data point, where $k$ is the known correct class index for the data point under consideration.
Consider the pseudo-example below:
Consider a pseudo-example consisting of a three-class classification problem. Let
be the weighting coefficients. Now suppose that, at a given iteration, we find
The predicted error then becomes
Now suppose that, for the same iteration, we find
The corresponding predicted error then becomes
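The one-vs-all error computation in the pseudo-example can be sketched as below. The margin-based form is my own concrete stand-in for the heuristic error function described above, not the book's exact formulation:

```python
import numpy as np

def ova_error(W, x, k_true):
    """One-vs-all error for a single point: penalize every wrong class
    whose activation comes within a unit margin of the true class's
    activation (a_true >= a_j + 1 is wanted for all j != k_true)."""
    a = W @ x                                 # per-class activations w_k^T x
    margins = a - a[k_true] + 1.0             # margin violations per class
    margins[k_true] = 0.0                     # the true class incurs no error
    return np.maximum(0.0, margins).sum()
```

When the correct class's activation dominates all others by at least the margin, the error is zero; otherwise each offending class contributes.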
Reference: Chapter 5
In the preceding section, direct use of the functional form of the posterior distribution (and defining decision boundaries) was considered. Specifically, the discriminants and decision boundaries followed from activations that were linear functions of the input. Consider now revisiting neural networks, which were briefly introduced in Part 1 with application to regression.
Consider a more generalized form of the two-layer network:
In the above,
Also, within the context of classification, $y_k$ represents the posterior probability of class $\mathcal{C}_k$, while $t_{nk}$ denotes the membership of the $n$-th data point in class $\mathcal{C}_k$.
Recall that the log likelihood function from the previous section was
Let
where component-wise operation is understood where necessary. The solution of the resulting nonlinear optimization problem requires at minimum an evaluation of the first derivative of the objective function with respect to the model weights. Using expressions for the derivative of the model output with respect to the model weights, the first derivatives of the objective function with respect to the model weights can be computed as
where $\mathbf{w}$ represents the model weights, including biases.
This example considers application of a neural network in classification of an input training dataset.
The training set consists of five clusters created randomly such that there exists some overlap of the data. A two-layer neural network of 8th order complexity is considered, and all model weights of the neural network are initialized by sampling them from a uniform distribution. The associated network diagram is shown in the image below.
A naive backtracking approach with a fixed number of iterations is considered. The video below shows the evolution of the learned classification regions at each step of the iteration.
Note that the decision boundaries are no longer linear.
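The forward pass and the backpropagated first derivatives for a two-layer classification network of this kind can be sketched as follows. The tanh hidden units and softmax outputs are assumptions on my part, as are the function names:

```python
import numpy as np

def forward(X, W1, b1, W2, b2):
    """Two-layer network: tanh hidden units, softmax outputs."""
    Z = np.tanh(X @ W1 + b1)                    # hidden-layer activations
    A = Z @ W2 + b2
    A = A - A.max(axis=1, keepdims=True)        # numerical stability
    E = np.exp(A)
    return Z, E / E.sum(axis=1, keepdims=True)  # Y[n, k] ~ p(C_k | x_n)

def cross_entropy(X, T, W1, b1, W2, b2):
    """Negative log likelihood E = -sum_n sum_k t_nk ln y_nk."""
    _, Y = forward(X, W1, b1, W2, b2)
    return -np.sum(T * np.log(Y))

def gradients(X, T, W1, b1, W2, b2):
    """Backpropagated gradients of the cross-entropy error."""
    Z, Y = forward(X, W1, b1, W2, b2)
    d2 = Y - T                                  # output-layer deltas
    dW2, db2 = Z.T @ d2, d2.sum(axis=0)
    d1 = (d2 @ W2.T) * (1.0 - Z**2)             # tanh'(a) = 1 - tanh(a)^2
    dW1, db1 = X.T @ d1, d1.sum(axis=0)
    return dW1, db1, dW2, db2
```

A finite-difference check on any single weight is a cheap way to validate the backpropagated derivatives before handing them to the optimizer.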
| Approach | Objective Function | Target | Model weights/properties | Notes |
|---|---|---|---|---|
| Least Squares (2-classes) | | | | Same results using Fisher's Discriminant |
| Probabilistic Generative Models | | | | — |
| Logistic Regression | | | Nonlinear iterations to find weights | — |
| Neural Networks | | | Nonlinear iterations to find weights | Two layer network example |
Suggested general reference books:
Bishop, Christopher M. Pattern Recognition and Machine Learning. Springer, 2006.
Base reference for content of this page, though certainly not the best book that I have read.
Boyd, Stephen, and Lieven Vandenberghe. Convex Optimization. Cambridge university press, 2004.
A classic and a must-read introductory book to (convex) optimization for everyone regardless of their background. A PDF copy is also available on the Stanford webpage as of this writing.
Greenberg, Michael D. Foundations of Applied Mathematics. Courier Corporation, 2013.
Good reference book or refresher on applied math.
Bulmer, Michael George. Principles of Statistics. Courier Corporation, 1979.
Very easy read and short refresher on basics of statistics.