A Dive into Automatic Differentiation
How torch.autograd is implemented.
In this blog, I will share a small discussion on the underlying technology that the popular PyTorch module torch.autograd uses. This technology, known as Automatic/Algorithmic Differentiation (AD), has been the subject of much discussion, especially in the fields of deep learning and machine learning. This blog post is based on the CSC321 course from the University of Toronto, as well as this paper, which explores the connection between AD and machine learning.
Finally, note that this blog does not discuss in detail how torch.autograd is really implemented. It rather demonstrates the fundamental concepts the module is built upon. If you would like to see how the module is implemented in detail, look no further than the autograd/__init__.py file found in PyTorch’s GitHub repository. Without further ado, let us dive into this discussion.
What is Automatic Differentiation?
We can start this discussion by first answering what torch.autograd attempts to solve in the first place. autograd is a module that is primarily used when updating the weights of some neural network. More specifically, autograd is used during the forward iteration of a model’s training cycle. Here, autograd’s main purpose is to calculate the derivatives of our neurons with respect to each of their components, which are later used during backpropagation. What is backpropagation? Well, this blog is not intended to be an in-depth discussion of how neural networks are trained, so I will not go into detail on how backpropagation works. If you are curious or confused about what this paragraph is talking about, GeeksForGeeks has a great article about this step of a model’s training.
So, autograd mainly calculates and keeps track of the derivatives of the network and its neurons with respect to their components. But how does it do that? This is the bulk of our conversation today, and it is based upon a vital technology within the ML/AI discussion known as automatic/algorithmic differentiation (AD).
In essence, AD breaks down a network’s calculations into indivisible components (such as a sum, subtraction, multiplication, or exponential) known as primitives. These primitives are then merged together. As they are merged, the derivatives of these primitives with respect to each other are calculated and stored. The order in which these primitives should be merged is known as a computational graph. Once all primitives are combined and the graph is completed, the entire forward-pass calculation of the network will be done, with the huge side benefit that the derivative of this computation with respect to each primitive it is built upon has also been calculated.
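As a quick illustration (this is my own toy example, not one taken from the course notes), a function like \(f(x, y) = xy + \sin(x)\) would be broken down into primitives such as:
\[t_1 = xy\qquad t_2 = \sin(x)\qquad f = t_1 + t_2\]Each of these is a primitive the library knows how to evaluate and differentiate, and the order in which they must be evaluated (\(t_1\) and \(t_2\) before \(f\)) is precisely the computational graph.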
Cool, but why do we even use AD in the first place? The quickest answer is simply to say that it makes the training cycle a whole lot faster. But it goes much further than that. A neural network which uses AD is usually:
- Numerically stable.
- An AD machine usually calculates the derivative of two primitives with very high precision.
- Highly scalable.
- AD’s linear complexity, and the fact that it breaks down all calculations into elementary primitives, allows this.
- AD is a technology that scales from your MNIST convolutional network all the way to an LLM such as ChatGPT or Claude.
Not only that, AD can be used for many applications outside ML (e.g. Computational Fluid Dynamics). This is why I believe it is a technology worth understanding.
Types of Automatic Differentiation
Note that AD first constructs a computational graph. One can easily think about this graph as a flowchart describing the order in which primitives must be calculated. Later in this blog we will see how to implement a computational graph. Once the computational graph is built, there are two main approaches AD describes:
- Forward-mode automatic differentiation
- Reverse-mode automatic differentiation
Forward-mode AD is often used in fields other than ML/AI. More precisely, it is used when the number of inputs \(n\) is far smaller than the number of outputs \(m\) (i.e. \(m \gg n\)). The paper on computational fluid dynamics is an example of using forward-mode AD. But this is not the focus for today. For ML, it is far more useful to discuss the other flavour of AD.
Reverse-mode AD is often used in the realm of machine learning. This is because reverse-mode AD is very effective when the number of inputs \(n\) far outweighs the number of outputs \(m\) (i.e. \(m \ll n\)), a scenario we encounter constantly in machine learning. As an example, a typical ImageNet classifier takes roughly 150,000 pixel values as inputs and produces only 1,000 possible classifications as outputs.
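A rough way to see why (this is the standard cost argument for AD, not something specific to torch.autograd): for a function \(f:\mathbb{R}^n \to \mathbb{R}^m\), forward mode needs on the order of \(n\) sweeps to recover the full Jacobian, while reverse mode needs on the order of \(m\):
\[\text{cost}_{\text{forward}} \approx n \cdot \text{cost}(f)\qquad \text{cost}_{\text{reverse}} \approx m \cdot \text{cost}(f)\]So when the output is a single scalar loss (\(m = 1\)), one reverse sweep yields the gradient with respect to every single input at once.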
How does Reverse Mode AD Work?
Let’s now dive into how reverse-mode AD functions. The main goal of this algorithm is to calculate the partial derivative of the network’s output with respect to any component. That is, given some neuron in layer \(j\), noted as \(a_i^{(j)}\), and the output of our network \(y\), we would like to calculate \(\frac{\partial y}{\partial a_i^{(j)}}\). Even further, we would like to calculate the derivative of \(y\) with respect to the weight assigned to this neuron, \(w_i^{(j)}\). Conceptually, the approach is composed of two ‘sweeps’ over the computational graph: a forward pass and a backward pass. See the resemblance to backpropagation?
In our forward pass, not only do we evaluate each primitive’s output given the incoming variables, but at each primitive we also calculate the partial derivative of the primitive function with respect to its inputs. Remember, a primitive is defined as some simple (multivariate) function \(a\).
For example, consider a computational graph with two distinct layers. The first layer has variables \(x\) and \(y\). The second layer contains primitives \(a_1\ldots a_{N}\), which take \(x\) and \(y\) as input. During the forward pass, we would like to calculate the partial derivative of each primitive function \(a_1\ldots a_N\) with respect to each input. Once calculated, we would like to remember them, as they will be used during the backward pass, which in turn calculates the partial derivatives of the graph’s output with respect to our inputs. Note that with a model this simple, we have already calculated these partial derivatives, so there is no need for a backward pass. Nevertheless, the second stage of reverse-mode AD will be necessary once we increase the number of layers. During implementation, one would usually store these derivatives in the same place the input being differentiated is stored. You will soon see how we can do something like this in Python.
The image below shows the procedure just described. Recall that the partial derivatives will be stored in the parent’s node. In this example, \(\frac{\partial a_1}{\partial x}\) is stored in the \(x\) node, and \(\frac{\partial a_N}{\partial y}\) is stored in the \(y\) node.
Once we have completed a forward pass, we should have the partial derivative of every single primitive function we use with respect to its inputs. Only then can we move to the second stage: the backward pass!
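As a tiny concrete example (my own, mirroring the two-layer graph above): if the second layer consisted of \(a_1 = xy\) and \(a_2 = x + y\), the forward pass would evaluate both primitives and store
\[\frac{\partial a_1}{\partial x} = y,\quad \frac{\partial a_1}{\partial y} = x,\quad \frac{\partial a_2}{\partial x} = 1,\quad \frac{\partial a_2}{\partial y} = 1\]in the \(x\) and \(y\) nodes, ready to be consumed by the backward pass.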
Again, our main goal is to calculate the partial derivative of the network’s output with respect to some weight. But imagine we have multiple layers, so a direct path (as shown above) is non-existent. For that, we can simply use the trusty chain rule, which states that:
\[\frac{\partial y}{\partial w}=\sum_{i}\frac{\partial y}{\partial a_i}\frac{\partial a_i}{\partial w}\]where each \(a_i\) is an intermediate variable that \(w\) feeds into on its way to \(y\). In the example of a network, these would be the neurons connecting our weight to the output value.
But note that we have already calculated all of these partial derivatives during the forward pass, so all of this information is known. Thus we can simply add up the products along each related path until we reach our desired weight.
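To make this concrete with the notation above (a hypothetical two-path example rather than a real network): suppose \(w\) feeds into exactly two primitives \(a_1\) and \(a_2\), both of which feed into \(y\). The forward pass already stored \(\frac{\partial a_1}{\partial w}\), \(\frac{\partial a_2}{\partial w}\), \(\frac{\partial y}{\partial a_1}\) and \(\frac{\partial y}{\partial a_2}\), so the backward pass only has to accumulate
\[\frac{\partial y}{\partial w} = \frac{\partial y}{\partial a_1}\frac{\partial a_1}{\partial w} + \frac{\partial y}{\partial a_2}\frac{\partial a_2}{\partial w}\]with no symbolic manipulation and no extra evaluations of the network.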
Notice the high resemblance to backpropagation? That is because backpropagation is a special flavour of AD, so this procedure can be seen as, roughly, how backpropagation is implemented in PyTorch.
Implementing Reverse Mode AD in Python
The theory is cool and all, but how do we implement reverse-mode AD in Python? Let us create a small example to show how one can create an AD model from scratch. Consider the following Python functions:
import numpy as np

def calculate_weighted_sum(X, weights, bias) -> np.float64:
    # X - Vector containing all n inputs (from previous neurons).
    # weights - Corresponding n weights for each input X[i].
    # bias - Bias of the neuron.
    a = X * weights + bias
    return a

def calculate_activation_function(a) -> np.float64:
    # Note that here sigma is some function we choose as our activation function.
    # Common choices are linear, ReLU, logistic, or hyperbolic tangent.
    # In this case we simply use tanh.
    def sigma(x):
        return np.tanh(x)
    y_pred = sigma(a)
    return y_pred

def calculate_cost_function(y_pred, y_actual) -> np.float64:
    # Similar to the activation function, we are able to choose the cost function
    # from a variety of options (e.g. MSE, log loss, hinge loss, etc.).
    # In this case, we simply use MSE:
    def J(y_hat, y):
        return 1/2 * (y_hat - y) ** 2
    cost = J(y_pred, y_actual)
    return cost
As discussed in the inner workings of reverse-mode AD, the idea is now to break down each of these functions into some set of primitive functions. With our defined Python functions, we can break down the entire chain of calculation as:
\[t_1 = Wx\\ a = t_1 + b\\ y_{pred} = \tanh(a)\\ t_2 = y_{pred} - y_{actual}\\ t_3 = (t_2)^2\\ J = \frac{1}{2}t_3\\\]Note that here \(t_i\) is simply some intermediary step that ensures every calculation done is some primitive we can predefine. With these primitives, autograd then creates a computational graph that relates each of these primitive functions with the others. That is, once we parse and generate these primitives, our autograd library should be able to create some sort of graph that shows how we can merge these primitives together to construct the entire line of computation. Below is an example of how our computational graph may look for the primitives above.
Now, the biggest question is: how do we compute this computational graph? Similar to autograd, we will do something known as tracing. That is, we define a new floating-point-like data type which overloads its basic operations; these overloaded operations will serve as our primitive functions. So, whenever two instances of this new class perform a basic operation (e.g. addition, subtraction, multiplication, etc.), the overloaded operator will (of course) perform the operation as intended, but it will also calculate the partial derivatives of the result with respect to each operand, which are predefined within the function. Once we calculate these partial derivatives, we store them in a list that each parent node contains, together with a reference to the new instance that represents the result of the operation.
Below is an example of how this could look:
# We can overload some primitive operators to trace each function.
import numpy as np

class Variable:
    def __init__(self, val: np.float64):
        self.val = val
        self.val_hat = None
        # Here we store the derivatives of OUR CHILDREN with respect to ourselves.
        self._children_diff = []

    def __str__(self) -> str:
        return f'Variable(val=({self.val}))'

    ###################################################
    # Define all the primitives needed.               #
    # This demo only defines the necessary ones.      #
    ###################################################
    def __add__(self, other):
        if not isinstance(other, Variable):
            raise TypeError(f"Cannot operate '{type(self)}' with '{type(other)}'")
        f = Variable(self.val + other.val)
        # Append partial diff of self and other with this new variable.
        self._children_diff.append((1.0, f))   # d/d(self)(self + other) = 1
        other._children_diff.append((1.0, f))  # d/d(other)(self + other) = 1
        return f

    def __sub__(self, other):
        if not isinstance(other, Variable):
            raise TypeError(f"Cannot operate '{type(self)}' with '{type(other)}'")
        f = Variable(self.val - other.val)
        # Append partial diff of self and other with this new variable.
        # Note that order does matter here!
        self._children_diff.append((1.0, f))    # d/d(self)(self - other) = 1
        other._children_diff.append((-1.0, f))  # d/d(other)(self - other) = -1
        return f

    def __mul__(self, other):
        if not isinstance(other, Variable):
            raise TypeError(f"Cannot operate '{type(self)}' with '{type(other)}'")
        f = Variable(self.val * other.val)
        # Append partial diff of self and other with this new variable.
        self._children_diff.append((other.val, f))  # d/d(self)(self * other) = other
        other._children_diff.append((self.val, f))  # d/d(other)(self * other) = self
        return f

    def __pow__(self, other):
        if not isinstance(other, Variable):
            raise TypeError(f"Cannot operate '{type(self)}' with '{type(other)}'")
        f = Variable(self.val ** other.val)
        # Append partial diff of self and other with this new variable.
        # Note that order does matter here!
        # d/d(self)(self ** other) = other * (self ** (other - 1))
        self._children_diff.append((other.val * (self.val ** (other.val - 1)), f))
        # There may be the case where self.val <= 0. In this case, we will do a 'cheesy'
        # solution and only evaluate its absolute value.
        # d/d(other)(self ** other) = (self ** other) * ln(self)
        other._children_diff.append(((self.val ** other.val) * np.log(abs(self.val)), f))
        return f

    def __truediv__(self, other):
        if not isinstance(other, Variable):
            raise TypeError(f"Cannot operate '{type(self)}' with '{type(other)}'")
        if other.val == 0:
            raise ZeroDivisionError()
        f = Variable(self.val / other.val)
        # Append partial diff of self and other with this new variable.
        # Note that order does matter here!
        self._children_diff.append((1 / other.val, f))                  # d/d(self)(self / other) = 1 / other
        other._children_diff.append((-self.val / (other.val ** 2), f))  # d/d(other)(self / other) = -self / (other ** 2)
        return f

    def tanh(self):
        f = Variable(np.tanh(self.val))
        # Append partial diff of self with this new variable.
        # d/d(self)(tanh(self)) = 1 - tanh(self) ** 2
        self._children_diff.append((1 - np.tanh(self.val) ** 2, f))
        return f

    def get_children(self):
        return self._children_diff

a = Variable(2)
b = Variable(10)
c = Variable(8)
print(a + b + c)
print(*a.get_children(), *b.get_children(), *c.get_children())
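If we run this snippet, we should see output along these lines (I am only sketching the shape of the output here; the memory addresses will differ on every run):
Variable(val=(20))
(1.0, <__main__.Variable object at 0x...>) (1.0, <__main__.Variable object at 0x...>) (1.0, <__main__.Variable object at 0x...>)
Note that the first two tuples reference the same object (\(t_1=a+b\)), while the third references \(t_2=t_1+c\).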
Notice how each value contains its own set of children and how we can access them. This essentially forms the computational graph we discussed before. In addition, note that, while not shown here, new variables \(t_1=a+b\) and \(t_2=t_1+c\) are created. These new variables do not contain any children, as they are the outputs of our calculations. In fact, notice that printing get_children gives us the result of our partial derivatives as well as the memory references to \(t_1\) and \(t_2\) respectively.
Cool, so now we have our computational graph working. What is the next step? Well, we can now move on to the backward sweep of our algorithm. The logic of this sweep is implemented in the gradient(var: Variable) function below. The function in itself is a basic recursive (depth-first) traversal that iterates through every child of some variable and accumulates the chain rule in v_hat. Once we reach our limit (i.e. there are no more children to traverse, meaning we have hit the output) we return 1.0, and the accumulated v_hat in turn yields \(\partial y/\partial w\) for some arbitrary primitive \(w\).
# Extend our Variable class with the backward sweep of reverse-mode AD.
import numpy as np

class Variable(Variable):
    ###################################################
    # Backward sweep: recursively apply the chain     #
    # rule over the children of the given variable.   #
    ###################################################
    @staticmethod
    def gradient(var: Variable):
        # Sum, over all children, of (local partial derivative) * (child's gradient).
        if len(var.get_children()) == 0:
            # No children means this is the final output: d(output)/d(output) = 1.
            return 1.0
        v_hat = 0.0
        for diff, child in var.get_children():
            v_hat += diff * Variable.gradient(child)
        return v_hat
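Before moving on to the full demo, here is a quick sanity check (my own addition) that the backward sweep returns the derivatives we expect for a single multiplication:
# Sanity check: for z = x * y we expect dz/dx = y and dz/dy = x.
x = Variable(3.0)
y = Variable(4.0)
z = x * y
print(Variable.gradient(x))  # prints 4.0
print(Variable.gradient(y))  # prints 3.0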
Neat! Now we have a fully functional reverse-mode AD model. We can close this discussion by creating a small demo and progressing through the entire chain of computation we defined at the start. This resembles a classic weighted sum done in some deep network, and updating the weights of our variables should be trivial now.
def calculate_weighted_sum(X: Variable, weights: Variable, bias: Variable) -> Variable:
    # X - Vector containing all n inputs (from previous neurons).
    # weights - Corresponding n weights for each input X[i].
    # bias - Bias of the neuron.
    a = X * weights + bias
    return a

def calculate_activation_function(a: Variable) -> Variable:
    # Note that here sigma is some function we choose as our activation function.
    # Common choices are linear, ReLU, logistic, or hyperbolic tangent.
    # In this case we simply use tanh.
    def sigma(x):
        return x.tanh()
    y_pred = sigma(a)
    return y_pred

def calculate_cost_function(y_pred: Variable, y_actual: Variable) -> Variable:
    # Similar to the activation function, we are able to choose the cost function
    # from a variety of options (e.g. MSE, log loss, hinge loss, etc.).
    # In this case, we simply use MSE:
    def J(y_hat, y):
        return (Variable(1) / Variable(2)) * ((y_hat - y) ** Variable(2))
    cost = J(y_pred, y_actual)
    return cost

# In this case X is simply a singleton vector.
X = Variable(2)
w = Variable(1)
b = Variable(0)

a = calculate_weighted_sum(X, w, b)
y_pred = calculate_activation_function(a)

y_actual = Variable(1)
cost = calculate_cost_function(y_pred, y_actual)

# Once we have the cost (our output) we can backtrace all the way to w and get
# the partial derivative of the cost with respect to it, which in turn will
# allow us to perform a weight update.
grad = Variable.gradient(w)
print(f"The partial derivative of the cost with respect to our weight is :: {grad}")
The partial derivative of the cost with respect to our weight is :: -0.005082962...
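With the gradient in hand, a plain gradient-descent update of the weight is a one-liner (a sketch; the learning rate of 0.1 is an arbitrary choice for illustration):
learning_rate = 0.1
# Gradient descent: nudge the weight against the gradient of the cost.
w = Variable(w.val - learning_rate * grad)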
This is a very coarse demonstration of how to implement something similar to what torch.autograd has cooking under the hood. But nonetheless, it is fundamental to the understanding of the module.