Ditching Backpropagation: Automatic Differentiation and Dual Numbers

I spent several months intermittently reading about Automatic Differentiation, and I was always left scratching my head. Until one day it clicked. Hopefully, I will spare you all that, and Dual Numbers will click for you by the end of this post.

The Quest for your Derivatives

If you want to differentiate a function, you usually take one of these options:

1- Numerical Differentiation (ND)
2- Symbolic Differentiation (SD)

Symbolic is the way you went when you took your Calculus course: most of the time you take a derivative on paper to solve a math problem, or you use software such as Mathematica or SymPy.

Numerical is the way you go when you code in, say, C, or make physics simulations.

(Of course, you could also make physics simulations using symbolic differentiation, or take a numerical derivative on paper, but I guess the other way round is more common.)

Each one has pros and cons which we will analyze later.

But there happens to be a completely different new kid on the block:

3- Automatic Differentiation (AD)

Here is a comparison of the features of the 3 methods:

                              Symbolic              Numerical   Automatic
Exact?                        yes                   no          yes
Returns                       an expression         a number    a number
Fast?                         yes/no¹               yes         yes
Needs                         symbolic expression   function    function
Handles arbitrary functions?  no²                   yes         yes

¹ Computing and then simplifying a derivative is usually slow. But you only do that once. After that, you can use the resulting expression to evaluate the derivative at as many points as you want, and that’s usually fast.

² Functions with for loops, if branches and other programming constructs are beyond the scope of SD (what’s the derivative of a for loop?). ND and AD, on the other hand, handle these with ease.

If you are only familiar with Symbolic and Numerical, that table might look quite weird to you (it’s exact but returns a number? How come?).

In order to introduce you to Automatic Differentiation, I must first introduce you to Dual Numbers.

Dual Numbers

Remember complex numbers? This has nothing to do with them, but…

You know how we defined
i^2 = -1
and then suddenly all functions defined on the reals were also defined on the complex numbers, by virtue of their Taylor expansion? (If you don’t remember, that’s OK.)

Well, this is similar, only this time we define

\epsilon ^ 2 = 0 (but \epsilon \neq 0 )

You can think of \epsilon as a number so small that if you square it, the result is 0. Actually 0. Yet it’s not so small that we could say \epsilon is just 0. 😕 Does that make any sense? That’s about as far as the intuition goes…

Unable to resist the temptation to follow the mathematical abstraction caravan, we go ahead and define Dual Numbers as any number w of the form a + \epsilon b , and we call a = Re(w) the real part of w and b = In(w) the infinitesimal part of w , analogous to the real and imaginary parts of complex numbers.
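For instance, addition and multiplication of dual numbers follow directly from the definition of \epsilon (just expand and drop the \epsilon^2 term):

(a + \epsilon b) + (c + \epsilon d) = (a+c) + \epsilon (b+d)

(a + \epsilon b)(c + \epsilon d) = ac + \epsilon (ad + bc) + \epsilon^2 bd = ac + \epsilon (ad + bc)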

What shall come out of this unholy abomination? Let’s see…

Evaluating functions on Dual Numbers

Let us take a dual number w = a + \epsilon b and see what’s the result of applying a function f to it:

f(w) = f(a+\epsilon b) = f(a) + f'(a)(\epsilon b) + f''(a) \frac{(b \epsilon)^2}{2}+O(\epsilon^3)

by virtue of the Taylor expansion of f around a . Using the definition of \epsilon above, we finally get that

f(w) = f(a+\epsilon b) = f(a) + \epsilon (f'(a) b)

For the special case when b=1 , we get:

f(a+\epsilon) = f(a) +\epsilon f'(a)

That is, we get the value of f in the real part of f(w) and the derivative of f in the infinitesimal part of f(w) .
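For a concrete example, take f(x) = x^2 and a = 3 :

f(3+\epsilon) = (3+\epsilon)^2 = 9 + 6\epsilon + \epsilon^2 = 9 + 6\epsilon

so the real part is f(3) = 9 and the infinitesimal part is f'(3) = 6 , as expected.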

“Alright, but what’s the point?
1- We still need to calculate both f(a) and f'(a) …
2- Also, setting the infinitesimal part of w , b , to anything other than 1 is quite pointless, right?”

Well… hold on. I’ll answer question 2 first.

Chain Rule

I assume you took your basic Calculus course and remember the chain rule. Ok, so let’s see what the result is of applying a composition of functions, h(x) = f(g(x)) , to the dual number a + \epsilon . We know the result should be:

h(a+\epsilon) = f(g(a+\epsilon)) = h(a) + \epsilon h'(a) = f(g(a)) + \epsilon f'(g(a)) g'(a)

because we know the chain rule. But a computer doesn’t, and we don’t want to implement it. At most, we could implement a DualNumber class and overload some operators, but that’s it.

If we do so, and ask the computer to evaluate h(a+\epsilon) = f(g(a+\epsilon )) , it would first compute

k = g(a+\epsilon) = g(a) + \epsilon g'(a)

and then

f(k) = f(Re(k)) + \epsilon f'(Re(k)) In(k) =  f(g(a)) + \epsilon f'(g(a)) g'(a)

thus, effectively computing the derivative of h without knowing anything about the chain rule, only knowing how to apply functions on dual numbers.
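To make this concrete, take g(x) = x^2 and f(x) = sin(x) , so that h(x) = sin(x^2) . Evaluating at a + \epsilon , the computer would first get

k = (a+\epsilon)^2 = a^2 + \epsilon 2a

and then

sin(k) = sin(a^2) + \epsilon 2a cos(a^2)

and 2a cos(a^2) is exactly h'(a) , just as the chain rule says.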

To sum up, having a non-1 infinitesimal part on the dual number k above enabled us to compute the derivative of a composition of functions without knowing the chain rule. That answers question 2. Let’s tackle question 1.

Automatic Differentiation

If you understood everything above, you are pretty much done. You know that, if you had an implementation for a function f , and evaluated it in a + \epsilon , you would get

f(a+\epsilon) = f(a) + \epsilon f'(a)

thus, you get the value of f and the derivative for free… Hence the name Automatic Differentiation.

“Wow, hold on, hold on, not so fast…
1- My function f doesn’t take DualNumber ; it takes, say, float .
2- Also, if I overloaded f to also take DualNumber , I would still need to implement the code to compute f' .
3- Evaluating f(a+\epsilon) will be slow, because it has to compute both f(a) and f'(a) .”

1- Yes, if you work in a statically typed language (say, C++). If you work in a dynamically typed language (say, Python), f will take anything you throw at it. In a language like C++, you would ideally implement your functions with templates, so they would take any type of number.

2- No. f will likely be composed of a chain of elementary functions, and all of those would already work on DualNumber , since the class definition would overload all the basic operators. And, by the section above, any composition of functions which work on DualNumber also works on DualNumber .

3- That’s the worst-case scenario. Even then it wouldn’t be that bad, since you are usually interested in both values. But many times you can compute both in the same time it would take to compute only one.
For example, to compute sin(x+\epsilon) you would have to compute both sin(x) and cos(x) , but there is a sincos function (a GNU extension to the C <math.h> library) which returns both sin and cos in the same time it would take to compute either separately. It does so by using the x87 FSINCOS assembly instruction.
As another example, e^{x+\epsilon} has a trivial optimization (see why?).

The takeaway is that the intermediate results for computing f and f' can be shared, which enables optimizations; and there are other, more sophisticated methods for computing both simultaneously.
Either way, you wouldn’t have to worry about any of that, since your f s would already work with DualNumber out of the box, because they are compositions of elementary functions.

Code Sample

Here I show you how this would look in practice. This will be a hypothetical API (at the end of the post I list real ones), which contains a Dual class to represent dual numbers. Dual receives two numbers as arguments: the first represents the real part and the second represents the infinitesimal part.

from AD import Dual

def square_it_baby(x):
    return x**2

def factorial(x):
    """
    function to showcase that AD handles
    programming constructs such as
    recursion and 'if' with no hassle
    """
    if x<1:
        return 1
    return x*factorial(x-1)

print(square_it_baby(Dual(0, 1)))    # prints Dual(0, 0)
print(square_it_baby(Dual(-1, 1)))   # prints Dual(1, -2)
print(factorial(Dual(3, 1)))         # prints Dual(6, 11)
print(factorial(Dual(2.5, 1)))       # prints Dual(3.75, 4)
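Just to give an idea, here is a minimal sketch of how such a Dual class could be implemented (a toy version, not the code of any real AD library; it only overloads the handful of operations used above):

class Dual:
    """Toy dual number a + epsilon*b."""
    def __init__(self, re, inf=0):
        self.re = re        # real part
        self.inf = inf      # infinitesimal part

    def _wrap(self, other):
        # promote plain numbers (int, float) to dual numbers with zero infinitesimal part
        return other if isinstance(other, Dual) else Dual(other)

    def __add__(self, other):
        other = self._wrap(other)
        return Dual(self.re + other.re, self.inf + other.inf)

    def __sub__(self, other):
        other = self._wrap(other)
        return Dual(self.re - other.re, self.inf - other.inf)

    def __mul__(self, other):
        # (a + eps b)(c + eps d) = ac + eps (ad + bc), since eps**2 = 0
        other = self._wrap(other)
        return Dual(self.re * other.re,
                    self.re * other.inf + self.inf * other.re)

    __radd__ = __add__
    __rmul__ = __mul__

    def __pow__(self, n):
        # non-negative integer powers by repeated multiplication (enough for x**2)
        result = Dual(1)
        for _ in range(n):
            result = result * self
        return result

    def __lt__(self, other):
        # comparisons only look at the real part
        return self.re < self._wrap(other).re

    def __repr__(self):
        return "Dual({}, {})".format(self.re, self.inf)

With this sketch, the print statements above produce the values shown in the comments. A real library would also overload division, unary minus, the remaining comparisons, and the elementary functions (sin, exp, log, …), each one carrying its derivative in the infinitesimal part.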

N-dimensional case

Your functions will usually take many arguments, not just one. So how do we compute the derivative in such a case?

It depends on what you mean by derivative. You could mean at least 3 things:

1- Partial Derivatives

If you evaluate f(x+\epsilon, y, z) you will get:

f(x+\epsilon, y, z) = f(x, y, z) +\epsilon f_x(x, y, z)

a partial derivative.
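For instance, using the hypothetical Dual class from the code sample above, seeding only the first argument with an infinitesimal part gives the partial derivative with respect to it:

from AD import Dual

def f(x, y, z):
    return x*y + z

# only x carries an infinitesimal part; y and z stay plain numbers
print(f(Dual(2.0, 1), 3.0, 5.0))   # prints Dual(11.0, 3.0): f = 11, df/dx = y = 3

What if you put dual numbers for the other arguments as well? We generalize this to…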

2- Directional Derivatives

We can say your function takes an n-dimensional vector as an argument. Then, if you evaluate it in \textbf{a} +\epsilon \textbf{ b} , where \textbf{ a} and \textbf{ b} are vectors, you get:

f(\textbf{ a} +\epsilon \textbf{ b} ) = f(\textbf{a} ) +\epsilon \nabla f(\textbf{a} ) . \textbf{ b}

by the Taylor expansion of f , as usual. Note that what we get for the infinitesimal part is a directional derivative in the direction of \textbf{ b} .
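In code, this just means seeding each argument’s infinitesimal part with the corresponding component of \textbf{ b} (again with the hypothetical Dual class; note that setting all but one component of \textbf{ b} to zero recovers a single partial derivative):

from AD import Dual

def f(x, y):
    return x*y + x*x              # grad f = (y + 2x, x)

a = (3.0, 2.0)                    # evaluation point
b = (1.0, -1.0)                   # direction

print(f(Dual(a[0], b[0]), Dual(a[1], b[1])))
# prints Dual(15.0, 5.0): f(a) = 15, grad f(a) . b = 8*1 + 3*(-1) = 5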

Directional derivatives are many times enough for our purposes (such as optimization), but it may happen that what we really want is \nabla f . Setting \textbf{ b} = \textbf{ 1} won’t help us, since we would get the directional derivative in the direction of the vector \textbf{ 1} . So, what we do is…

3- Vector Derivative (or Gradient)

We define a vector \boldsymbol{ \epsilon} such that:

\boldsymbol{ \epsilon} \otimes \boldsymbol{ \epsilon} = \textbf{0} (but \boldsymbol{ \epsilon} \neq \textbf{ 0} )

Or, in terms of components, we define n numbers \epsilon_i for i \in [1,n] such that:

\forall_{i, j} \:\epsilon_i \epsilon_j = 0  (but \forall_i \:\epsilon_i \neq  0)

Our multi-dual numbers will now have multiple infinitesimal parts (or a multi-infinitesimal part). They will be numbers of the form w=   a +\textbf{b} .\boldsymbol{\epsilon} , and we will call a=Re(w) the real part of w and vector \textbf{ b} =In(w) the multi-infinitesimal part of w .

Then we evaluate our function in a vector of multi-dual numbers, \textbf{ a} + \textbf{ B} . \boldsymbol{\epsilon} where \textbf{ a} is a vector and \textbf{ B} is a square matrix. We get:

f(\textbf{a} +\textbf{B} . \boldsymbol{\epsilon} ) = f(\textbf{a} ) + \nabla f(\textbf{a} ) .\textbf{B} . \boldsymbol{ \epsilon}

by the Taylor expansion of f , as usual. For the special case when \textbf{B} = \textbf{I} (the identity matrix), we get:

f(\textbf{a} +\boldsymbol{ \epsilon} ) = f(\textbf{a} ) +\nabla f(\textbf{a} ). \boldsymbol{ \epsilon}

that is, we get \nabla f in the multi-infinitesimal part of f(\textbf{a} +\boldsymbol{\epsilon} ) .

Application case: Backpropagation

N-dimensional code sample
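Here is a sketch of how this could look, with a hypothetical MultiDual class (again a toy, not the API of any real library) whose infinitesimal part is a vector, so that a single evaluation yields the full gradient:

class MultiDual:
    """Toy multi-dual number a + b . eps, with a vector infinitesimal part."""
    def __init__(self, re, inf):
        self.re = re                 # real part (a scalar)
        self.inf = list(inf)         # multi-infinitesimal part (a vector)

    def _wrap(self, other):
        if isinstance(other, MultiDual):
            return other
        return MultiDual(other, [0.0] * len(self.inf))

    def __add__(self, other):
        other = self._wrap(other)
        return MultiDual(self.re + other.re,
                         [u + v for u, v in zip(self.inf, other.inf)])

    def __mul__(self, other):
        # eps_i * eps_j = 0, so only the terms linear in eps survive
        other = self._wrap(other)
        return MultiDual(self.re * other.re,
                         [self.re * v + u * other.re
                          for u, v in zip(self.inf, other.inf)])

    __radd__ = __add__
    __rmul__ = __mul__

    def __repr__(self):
        return "MultiDual({}, {})".format(self.re, self.inf)


def f(x, y):
    return x*y + x*x                 # grad f = (y + 2x, x)

# seed with B = I: x carries eps_1, y carries eps_2
x = MultiDual(3.0, [1.0, 0.0])
y = MultiDual(2.0, [0.0, 1.0])
print(f(x, y))   # prints MultiDual(15.0, [8.0, 3.0]): f = 15, grad f = (8, 3)

As before, a real implementation would cover the remaining operators and elementary functions; the important part is the seeding, which is exactly the \textbf{B} = \textbf{I} case above.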

Generalizations

Forward-reverse
Higher order Derivatives

AD libraries

Author: matiasmorant
