Before implementing or even using various ML models, it is a good idea to form a mental model of the toolset that pytorch provides. A particularly good way to accomplish this is to look at how it can be used to tackle a trivial problem that you may already be very familiar with. This accomplishes two things: it lets us focus on the tooling rather than on the problem itself, and it makes the counter-intuitive parts of pytorch easier to spot.
Since machine learning inherently involves optimization problems, a good example of a "trivial" problem is solving linear systems of equations (LSEs). This is in fact the objective of this post: incrementally working through different components of pytorch to solve LSEs iteratively, keeping in mind that we are doing this merely to become comfortable with pytorch and to learn about some aspects that may be counter-intuitive.
Here's the problem that is of interest to us:
Find $x \in \mathbb{R}^n$ such that $Ax = b$, where $A \in \mathbb{R}^{n \times n}$ and $b \in \mathbb{R}^n$.
In particular, we are interested in solving the problem iteratively.
This is a particularly nice problem because it allows us to incrementally use pytorch's autograd, built-in loss functions, modules, and optimizers.
For simplicity, we will use gradient descent as the iterative method for solving the LSE.
Plain gradient descent is simple. We start with an initial guess for $x$ and define an error metric (a "loss function"). Then we iterate until the error metric is smaller than our tolerance; in each iteration, the guess for the solution is updated using the gradient of the error metric. Here's the pseudo-code:
x = initial_guess()
while error_metric(x) > tolerance:
    grad_err = gradient_of_error_wrt_x(x)
    x = x - gamma * grad_err
Here, gamma is a step size (more accurately, a scaling factor for the step) that is commonly referred to as the "learning rate".
Let's define the error metric to be half the squared $\ell_2$-norm of the error, namely $E(x) = \frac{1}{2}\,\lVert Ax - b\rVert_2^2$.
Also, note that the gradient of the objective function with respect to $x$ is $\nabla_x E(x) = A^T (Ax - b)$.
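For completeness, here is the short derivation of that gradient, using the same symbols as above:
\[
E(x) = \tfrac{1}{2}\,(Ax - b)^T (Ax - b)
     = \tfrac{1}{2}\,x^T A^T A x - b^T A x + \tfrac{1}{2}\,b^T b,
\qquad
\nabla_x E(x) = A^T A x - A^T b = A^T (Ax - b).
\]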
Before looking at pytorch, let's form a baseline that we can refer to. We will use numpy. The implementation is very simple:
import numpy as np

A_MAT = [
    [1, 2, 0],
    [0, 1, 0],
    [0, 0, 1],
]
B_VEC = [1, 2, 3]
# Step size ("learning rate"); 0.1 is an arbitrary choice that converges for this A
ALPHA = 0.1
# Number of gradient-descent iterations
ITER_COUNT = 50

def numpy_impl():
    A = np.array(A_MAT, dtype=np.float64)
    b = np.array(B_VEC, dtype=np.float64)
    x = np.array([0, 0, 0], dtype=np.float64)
    for i in range(ITER_COUNT):
        Ax = A.dot(x)
        # Objective function
        err_norm = 0.5 * np.linalg.norm(Ax - b)**2
        if i % 10 == 0:
            print(f'iter {i :2d} -- error: {err_norm.item() :0.3e}')
        # Backward step: update estimate
        err_grad_x = A.T.dot(Ax - b)
        x -= ALPHA * err_grad_x
This is our baseline. Our pytorch implementation should give identical results for this trivial problem, so long as we use the same data type and the same initial guess.
Before moving on, note that err_norm is not a Python float, but a numpy.float64. For practical purposes, though, it behaves the same as the float type on your machine.
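If you want to check this yourself, a quick standalone snippet (not part of the solver, with made-up values) makes the point:

import numpy as np

v = 0.5 * np.linalg.norm(np.array([3.0, 4.0]))**2
print(type(v))               # <class 'numpy.float64'>
print(isinstance(v, float))  # True: numpy.float64 subclasses Python's float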
We already have hard-coded A_MAT, B_VEC, and our initial guess. Let's use them:
import torch
import torch.nn as nn

def torch_A_x_b():
    A = torch.tensor([A_MAT], dtype=torch.float64, requires_grad=False)
    b = torch.tensor([[[B_VEC[0]], [B_VEC[1]], [B_VEC[2]]]], dtype=torch.float64, requires_grad=False)
    x = torch.tensor([[[0], [0], [0]]], dtype=torch.float64, requires_grad=True)
    return A, x, b
Now we can write the first version of our pytorch implementation. In it, we will make use of the autograd functionality built into pytorch to compute the gradients for us:
def torch_impl_1():
    A, x, b = torch_A_x_b()
    for i in range(ITER_COUNT):
        y = torch.bmm(A, x)
        err_norm = 0.5 * torch.norm(b - y) ** 2
        # See below
        err_norm.backward()
        if i % 10 == 0:
            print(f'iter {i :2d} -- error: {err_norm.item() :0.3e}')
        # See below
        with torch.no_grad():
            x -= ALPHA * x.grad
        # See below
        x.grad = None
Note: err_norm is not a float, but a pytorch tensor. Specifically, it is a zero-dimensional tensor (a scalar). Do not confuse pytorch tensors with the mathematical objects called tensors: pytorch tensors are just multidimensional (zero-dimensional, in this case) blocks of numbers.
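A quick standalone check of this (the values here are just for illustration):

import torch

t = 0.5 * torch.norm(torch.tensor([3.0, 4.0])) ** 2
print(type(t))   # <class 'torch.Tensor'>
print(t.shape)   # torch.Size([]) -- a zero-dimensional tensor
print(t.item())  # 12.5, extracted as a plain Python float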
While we performed the operations preceding err_norm, pytorch internally constructed and stored a computation graph. Calling err_norm.backward() performs a backward pass through that graph, evaluating the gradient of err_norm with respect to all tensors that require gradients.
This brings us to the next item. Intuitively, we may expect x.grad to be the gradient of x (with respect to what?). But it's not, and that would not even make sense: such a gradient would not, in general, have the same dimensions as x, so it could not be used to update it.
Of course, by comparing this with the numpy implementation above, we can recognize that x.grad is actually the gradient of err_norm (the objective function) with respect to x. This can be very counter-intuitive and confusing when you first start using pytorch.
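If you want to verify this, a standalone sanity check (assuming torch_A_x_b from above is in scope) can compare x.grad against the analytic gradient A^T(Ax - b):

def check_grad():
    A, x, b = torch_A_x_b()
    err_norm = 0.5 * torch.norm(b - torch.bmm(A, x)) ** 2
    err_norm.backward()
    with torch.no_grad():
        # Analytic gradient of the objective with respect to x: A^T (Ax - b)
        analytic = torch.bmm(A.transpose(1, 2), torch.bmm(A, x) - b)
    print(torch.allclose(x.grad, analytic))  # expected: True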
Next, since we are updating x in place, we wrap the update in torch.no_grad() to tell pytorch not to record this operation in the computation graph while the variable is being overwritten.
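To see why the context manager is needed, here is a minimal sketch (with a throwaway toy tensor, not our solver's x): updating a leaf tensor that requires gradients in place, outside of torch.no_grad(), is typically rejected by pytorch with a RuntimeError:

import torch

x = torch.zeros(3, dtype=torch.float64, requires_grad=True)
try:
    # In-place update of a leaf tensor that requires grad, without no_grad()
    x -= 0.1 * torch.ones(3, dtype=torch.float64)
except RuntimeError as exc:
    print('in-place update rejected:', exc)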
Lastly, pytorch does not know when our iteration has ended, and gradients accumulate across backward() calls. Therefore, we have to clear the accumulated gradient by setting x.grad = None to prepare for the next iteration. If we fail to do that, the gradients of all subsequent iterations will be added on top of one another.
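Here is a minimal, standalone illustration of that accumulation behavior:

import torch

x = torch.tensor([1.0, 2.0], requires_grad=True)
for _ in range(2):
    loss = (x ** 2).sum()
    loss.backward()
    print(x.grad)  # first pass: [2., 4.]; second pass: [4., 8.] -- the gradients were added
x.grad = None      # clearing the gradient resets the accumulation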
Pytorch has various loss functions (objective functions) built in. In our second implementation, we will make use of one of them.
def torch_impl_2():
    A, x, b = torch_A_x_b()
    # MSE: Mean-Square-Error
    metric = torch.nn.MSELoss(reduction='sum')
    for i in range(ITER_COUNT):
        yt = torch.matmul(A, x)
        err_norm = 0.5 * metric(yt, b)
        err_norm.backward()
        if i % 10 == 0:
            print(f'iter {i :2d} -- error: {err_norm.item() :0.3e}')
        with torch.no_grad():
            x -= ALPHA * x.grad
        x.grad = None
The MSE in torch.nn.MSELoss stands for Mean Square Error, but torch.nn.MSELoss computes either the 'mean' or the 'sum' of the squared error, depending on its reduction argument. To be consistent with our numpy implementation, we pick 'sum' for the reduction, and we have to multiply by 0.5 ourselves since MSELoss does not do that.
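A standalone snippet (with an arbitrary residual vector) confirms that the two formulations agree:

import torch

r = torch.tensor([1.0, -2.0, 3.0])   # some residual, standing in for Ax - b
target = torch.zeros(3)
mse_sum = torch.nn.MSELoss(reduction='sum')
print(0.5 * mse_sum(r, target))      # tensor(7.) == 0.5 * (1 + 4 + 9)
print(0.5 * torch.norm(r) ** 2)      # tensor(7.) -- identical to the manual objective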
Since we are using the same initial guess and 64-bit data types, the result of this implementation, including the per-iteration error values, is identical to the previous one.
In pytorch, we often work with model 'modules' so that we can stack them. In this implementation, we move the computation of Ax into a new 'module'. This is going to look rather odd for our linear-system-of-equations example, but we will do it anyway to become familiar with how it looks.
def torch_impl_3():
    # Replace A with a pytorch module
    A, x, b = torch_A_x_b()

    class LSEModel(nn.Module):
        def __init__(self, initial_guess):
            super().__init__()
            self.A = A
            self.x = nn.Parameter(initial_guess)

        def forward(self):
            return torch.matmul(self.A, self.x)

    model = LSEModel(x)
    metric = torch.nn.MSELoss(reduction='sum')
    for i in range(ITER_COUNT):
        # Model output --
        # Our input is already baked into the model as a parameter
        yt = model()
        # forward pass
        err_norm = 0.5 * metric(yt, b)
        err_norm.backward()
        if i % 10 == 0:
            print(f'iter {i :2d} -- error: {err_norm.item() :0.3e}')
        with torch.no_grad():
            # The parameter (not the original x) now holds the estimate
            model.x -= ALPHA * model.x.grad
        model.x.grad = None
Next, we will use pytorch's built-in optimizers to do the parameter updates.
def torch_impl_4():
    # Linear System of Equations, as a module
    class LSEModule(nn.Module):
        def __init__(self, A, x):
            super().__init__()
            self.A = A
            self.x = nn.Parameter(x)

        def forward(self):
            return torch.matmul(self.A, self.x)

    A, x, b = torch_A_x_b()
    model = LSEModule(A, x)
    metric = torch.nn.MSELoss(reduction='sum')
    # Create optimizer
    optim = torch.optim.SGD(model.parameters(), lr=ALPHA)
    for i in range(ITER_COUNT):
        # forward
        y = model()
        # objective function
        err_norm = 0.5 * metric(y, b)
        # backward: clear old gradients, then compute new ones
        optim.zero_grad()
        err_norm.backward()
        # update parameters (x)
        optim.step()
        if i % 10 == 0:
            print(f'iter {i :2d} -- error: {err_norm.item() :0.3e}')
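With SGD's default settings (no momentum or weight decay), optim.step() performs exactly the same update we wrote by hand earlier. The following toy sketch, with a made-up parameter p, illustrates that:

import torch

p = torch.nn.Parameter(torch.tensor([1.0, 2.0]))
optim = torch.optim.SGD([p], lr=0.1)
loss = (p ** 2).sum()
optim.zero_grad()
loss.backward()
optim.step()
print(p.data)  # tensor([0.8000, 1.6000]) -- i.e. p - lr * grad, with grad = 2 * p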
How does the optimizer know what the model parameters are? You can get the model parameters via model.parameters():

for param in model.parameters():
    print(type(param.data), param.size())
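Roughly speaking, nn.Module registers every attribute assigned an nn.Parameter (and the parameters of any sub-module assigned as an attribute). Continuing with the model from the previous example, named_parameters() shows what got registered:

for name, param in model.named_parameters():
    print(name, param.size())
# For LSEModule above this prints: x torch.Size([1, 3, 1])
# Note that A is a plain tensor, not a Parameter, so it is not listed.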
Here's a pitfall: if the model parameters have been put in a plain Python list, they will not get listed:
class MyModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Plain Python list -- these sub-modules are NOT registered
        self.my_params = []
        self.my_params.append(nn.Linear(10, 20))
        self.my_params.append(nn.Linear(20, 40))
        self.my_params.append(nn.Linear(40, 50))

model = MyModel()
# No output!
for param in model.parameters():
    print(type(param.data), param.size())
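The usual fix, shown here as a sketch rather than something from the original code, is to use nn.ModuleList instead of a plain list, so the sub-modules get registered and their parameters show up:

import torch.nn as nn

class MyModelFixed(nn.Module):
    def __init__(self):
        super().__init__()
        # nn.ModuleList registers each sub-module, so their parameters become visible
        self.my_params = nn.ModuleList([
            nn.Linear(10, 20),
            nn.Linear(20, 40),
            nn.Linear(40, 50),
        ])

model = MyModelFixed()
for param in model.parameters():
    print(type(param.data), param.size())  # now prints the weight and bias of each layer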