A Nontraditional Introduction to pytorch

2020-07-24

Before implementing or even using various ML models, it is a good idea to form a mental model of the toolset that pytorch provides. A particularly good way to accomplish this is to look at how it can be used to tackle trivial problems that you are already very familiar with. This accomplishes two things:

  1. You see how to use the framework to accomplish a task that you already understand.
  2. You can sanity-check the outputs, since you already know what they should be.

Since machine learning inherently involves optimization problems, a good example of a "trivial" problem is solving linear systems of equations (LSE). That is in fact the objective of this post: to incrementally work through different components of pytorch while solving LSEs iteratively, keeping in mind that we are doing this merely to become comfortable with pytorch and to learn about some aspects that may be counter-intuitive.

Linear system of equations

Here's the problem that is of interest to us:

Find $x$ such that $Ax = b$, where $A \in \mathbb{R}^{n \times n}$ and $b \in \mathbb{R}^{n}$.
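
Concretely, throughout the post we will use the small system below (hard-coded as A_MAT and B_VEC in the code), whose exact solution follows from back-substitution and gives us a sanity check for the iterative results:

$$A = \begin{pmatrix} 1 & 2 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}, \qquad b = \begin{pmatrix} 1 \\ 2 \\ 3 \end{pmatrix}, \qquad x = \begin{pmatrix} -3 \\ 2 \\ 3 \end{pmatrix}.$$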

In particular, we are interested in solving the problem with an iterative method.

This is a particularly nice problem because it allows us to incrementally use pytorch's tensors, autograd, built-in loss functions, modules, and optimizers.

For simplicity, we will use gradient descent for iteratively solving LSEs.

Gradient descent

Plain gradient descent is simple. We start with an initial guess for what $x$ is, and define an error metric (a "loss function"). Then we iterate until the error metric is smaller than our tolerance. In each iteration, the guess for the solution is updated using the gradient of the error metric. Here is the pseudo-code:

x = initial_guess()
while error_metric(x) > tolerance:
    grad_err = gradient_of_error_wrt_x(x)
    x = x - gamma * grad_err

gamma is the step size (more accurately, a scaling factor for the step) that is commonly referred to as the "learning rate".
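
In equation form, each pass through the loop performs the update

$$x_{k+1} = x_k - \gamma\,\nabla_x E(x_k),$$

where $E$ is the error metric (defined concretely in the next section) and $\gamma$ is the learning rate.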

numpy Implementation

Let's define the error metric to be one half of the squared $\ell_2$-norm of the error, namely

$$E(x) = \frac{1}{2}\,\lVert Ax - b \rVert_2^2.$$

Also, note that the gradient of the objective function with respect to $x$ is

$$\nabla_x E(x) = A^{\top}(Ax - b).$$
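
One way to see this is to expand the norm and differentiate term by term:

$$E(x) = \tfrac{1}{2}\left(x^{\top}A^{\top}Ax - 2\,b^{\top}Ax + b^{\top}b\right) \quad\Longrightarrow\quad \nabla_x E(x) = A^{\top}Ax - A^{\top}b = A^{\top}(Ax - b).$$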

Before looking at pytorch, let's form a baseline that we can refer to. We will use numpy. The implementation is very simple:

import numpy as np

A_MAT = [
    [1, 2, 0],
    [0, 1, 0],
    [0, 0, 1],
]

B_VEC = [1, 2, 3]

# Learning rate (step-size scaler) and iteration count shared by every
# implementation below. ALPHA = 0.1 is an assumed value that converges
# for this small system.
ALPHA = 0.1
ITER_COUNT = 50

def numpy_impl():
    A = np.array(A_MAT, dtype=np.float64)
    b = np.array(B_VEC, dtype=np.float64)
    x = np.array([0, 0, 0], dtype=np.float64)

    for i in range(ITER_COUNT):
        Ax = A.dot(x)

        # Objective function
        err_norm = 0.5 * np.linalg.norm(Ax - b)**2

        if i % 10 == 0:
            print(f'iter {i:2d} -- error: {err_norm.item():0.3e}')

        # Backward step: update estimate
        err_grad_x = A.T.dot(Ax - b)
        x -= ALPHA*err_grad_x

This is our baseline. Our pytorch implementations should give identical results for this trivial problem, so long as we use the same data type and the same initial guess.

Before moving on, note that err_norm is not a Python float, but a numpy.float64. For practical purposes, though, it behaves the same as a float on your machine.
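
A quick way to check this (the numbers here are arbitrary):

val = 0.5 * np.linalg.norm(np.array([3.0, 4.0]))**2
print(type(val))   # <class 'numpy.float64'>
print(float(val))  # 12.5 -- converts cleanly to a plain Python float
print(val.item())  # 12.5 -- .item() does the same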

First pytorch implementation

We have already hard-coded A_MAT, B_VEC, and our initial guess. Let's use them:

import torch
from torch import nn

def torch_A_x_b():
    A = torch.tensor([A_MAT], dtype=torch.float64, requires_grad=False)
    b = torch.tensor([[[B_VEC[0]], [B_VEC[1]], [B_VEC[2]]]], dtype=torch.float64, requires_grad=False)
    x = torch.tensor([[[0], [0], [0]]], dtype=torch.float64, requires_grad=True)
    return A, x, b
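
Note the extra outer brackets: they add a leading batch dimension of size 1, because torch.bmm (used below) only accepts 3-dimensional, batched tensors. We can confirm the shapes:

A, x, b = torch_A_x_b()
print(A.shape, x.shape, b.shape)
# torch.Size([1, 3, 3]) torch.Size([1, 3, 1]) torch.Size([1, 3, 1])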

Now we can write the first version of our pytorch implementation. Here, we make use of the autograd functionality built into pytorch to compute the gradients for us:

def torch_impl_1():
    A, x, b = torch_A_x_b()
    for i in range(ITER_COUNT):
        y = torch.bmm(A, x)
        err_norm = 0.5 * torch.norm(b - y) ** 2

        # See below
        err_norm.backward()

        if i%10 == 0:
            print(f'iter {i:2d} -- error: {err_norm.item():0.3e}')

        # See below
        with torch.no_grad():
            x -= ALPHA*x.grad

        # See below
        x.grad = None

Note: err_norm is not a float, but a pytorch tensor. Specifically, it is a tensor of zero dimensions. Do not confuse pytorch tensors with tensors as mathematical objects; pytorch tensors are just multidimensional (0-dimensional in this case) blocks of numbers.
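
For example (with arbitrary values):

err_norm = 0.5 * torch.norm(torch.tensor([3.0, 4.0])) ** 2
print(err_norm.shape)   # torch.Size([]) -- a 0-dimensional tensor
print(err_norm.item())  # 12.5 -- extracts a plain Python float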

When we performed the operations preceding err_norm, pytorch internally constructed and stored a call graph. Calling err_norm.backward() triggers a backward pass that evaluates the gradient of err_norm with respect to every tensor that requires gradients.

This brings us to the next item. Intuitively, we may expect x.grad to be the gradient of x itself (with respect to what?). But it is not, and that would not even make sense: such a gradient would in general not have the same dimensions as x, so it could not be used to update x the way we do below.

Of course, by comparing this with the numpy implementation above, we can recognize that x.grad is actually the gradient of err_norm (the objective function) with respect to x. This can be counter-intuitive and confusing when you first start using pytorch.
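
We can verify this numerically with a small sanity check, reusing torch_A_x_b from above:

A, x, b = torch_A_x_b()
y = torch.bmm(A, x)
err_norm = 0.5 * torch.norm(b - y) ** 2
err_norm.backward()
# x.grad matches the analytic gradient A^T (Ax - b) from the numpy section
print(torch.allclose(x.grad, torch.matmul(A.transpose(1, 2), y - b)))  # True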

Next, since we are updating x in place, we wrap the update in a torch.no_grad() block to tell pytorch not to record this operation in the call graph. Without it, pytorch raises a RuntimeError, because in-place operations on a leaf tensor that requires gradients are not allowed while gradients are being tracked.

Lastly, pytorch does not know when our iteration has ended. Therefore, we have to clear the accumulated gradient by setting x.grad = None to prepare for the next iteration. If we fail to do that, gradients from subsequent backward passes will accumulate on top of the old ones.
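
A tiny standalone example of this accumulation behavior (values are arbitrary):

x = torch.tensor([1.0, 2.0], requires_grad=True)
(x * x).sum().backward()
print(x.grad)   # tensor([2., 4.])
(x * x).sum().backward()
print(x.grad)   # tensor([4., 8.]) -- the second gradient was added to the first
x.grad = None   # clear before the next backward pass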

Second pytorch implementation

Pytorch has various loss functions (objective functions) built in. In our second implementation, we will make use of one of them.

def torch_impl_2():
    A, x, b = torch_A_x_b()

    # MSE : Mean-Square-Error
    metric = torch.nn.MSELoss(reduction='sum')

    for i in range(ITER_COUNT):
        yt = torch.matmul(A, x)
        err_norm = 0.5 * metric(yt, b)
        err_norm.backward()

        if i%10 == 0:
            print(f'iter {i:2d} -- error: {err_norm.item():0.3e}')

        with torch.no_grad():
            x -= ALPHA*x.grad
        x.grad = None

The MSE in torch.nn.MSELoss stands for Mean Squared Error, but depending on the reduction argument it computes either the 'mean' or the 'sum' of the squared errors. To be consistent with our numpy implementation, we pick 'sum' for the reduction argument and multiply by 0.5 ourselves, since MSELoss does not do that.
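
To see that the 'sum' reduction matches our hand-written squared norm (with arbitrary values):

yt = torch.tensor([1.0, 2.0, 3.0])
b = torch.tensor([0.0, 0.0, 0.0])
metric = torch.nn.MSELoss(reduction='sum')
print(metric(yt, b))            # 14.0 -- sum of squared differences
print(torch.norm(yt - b) ** 2)  # 14.0, up to floating-point rounding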

Since we are using the same initial guess and 64-bit data types, the results of this implementation are identical to the previous one, iteration by iteration.

Third pytorch implementation

In pytorch, we often work with model 'modules' so that we can stack them. In this implementation, we move the computation of Ax into a new model 'module'. This is going to look rather odd for our linear-system example, but we will do it anyway to become familiar with how it looks.

def torch_impl_3():
    # Wrap the computation of A x in a pytorch module
    A, x, b = torch_A_x_b()

    class LSEModel(nn.Module):
        def __init__(self, initial_guess):
            super().__init__()
            self.A = A
            self.x = nn.Parameter(initial_guess)

        def forward(self):
            return torch.matmul(self.A, self.x)

    model = LSEModel(x)
    metric = torch.nn.MSELoss(reduction='sum')

    for i in range(ITER_COUNT):
        # Model output --
        # Our input is already baked into the model as a parameter
        yt = model()

        # forward pass
        err_norm = 0.5 * metric(yt, b)

        err_norm.backward()

        if i%10 == 0:
            print(f'iter {i:2d} -- error: {err_norm.item():0.3e}')

        with torch.no_grad():
            model.x -= ALPHA*model.x.grad
        model.x.grad = None

Fourth pytorch implementation

Next, we will use pytorch's built-in 'optimizers' to do the parameter updates.

def torch_impl_4():
    # Linear System of Equations
    class LSEModule(nn.Module):
        def __init__(self, A, x):
            super().__init__()
            self.A = A
            self.x = nn.Parameter(x)

        def forward(self):
            return torch.matmul(self.A, self.x)

    A, x, b = torch_A_x_b()
    model = LSEModule(A, x)
    metric = torch.nn.MSELoss(reduction='sum')
    optim = torch.optim.SGD(model.parameters(), lr=ALPHA)

    for i in range(ITER_COUNT):
        # forward
        y = model()

        # objective function
        err_norm = 0.5 * metric(y, b)

        # backward
        optim.zero_grad()
        # Compute gradient
        err_norm.backward()
        # update parameters (x)
        optim.step()

        if i%10 == 0:
            print(f'iter {i:2d} -- error: {err_norm.item():0.3e}')
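
For plain SGD with no momentum or weight decay, optim.zero_grad() and optim.step() do essentially the same work we did by hand in the earlier implementations. Roughly, a sketch of the equivalent manual updates:

# Roughly what optim.zero_grad() does: clear the accumulated gradients
for p in model.parameters():
    p.grad = None

# ... forward pass and err_norm.backward() go here ...

# Roughly what optim.step() does for vanilla SGD: p <- p - lr * grad
with torch.no_grad():
    for p in model.parameters():
        if p.grad is not None:
            p -= ALPHA * p.grad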

How does the optimizer know what the model parameters are? We handed them to it explicitly: model.parameters() returns every tensor that has been registered as a parameter on the model:

for param in model.parameters():
    print(type(param.data), param.size())

Here's a pitfall: if the model's submodules have been put into a plain Python list, their parameters will not get listed:

class MyModel(nn.Module):
    def __init__(self):
        super().__init__()
        # A plain Python list: pytorch does not register these submodules
        self.my_params = []
        self.my_params.append(nn.Linear(10, 20))
        self.my_params.append(nn.Linear(20, 40))
        self.my_params.append(nn.Linear(40, 50))

model = MyModel()
# No output!
for param in model.parameters():
    print(type(param.data), param.size())
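
A common fix, shown here as a sketch, is to register the submodules with nn.ModuleList so that model.parameters() can find them:

class MyModelFixed(nn.Module):
    def __init__(self):
        super().__init__()
        # nn.ModuleList registers each submodule with the parent module
        self.my_params = nn.ModuleList([
            nn.Linear(10, 20),
            nn.Linear(20, 40),
            nn.Linear(40, 50),
        ])

model = MyModelFixed()
# Now the parameters are listed
for param in model.parameters():
    print(type(param.data), param.size())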