Tutorial 1: Writing your own tangent-linear and adjoint code#

In this tutorial, we essentially follow Appendix A of Marotzke et al. (1999). The difference is that we do this in Python. The paper does not give the tangent linear code, but we provide one here, since it is helpful for writing the adjoint code. The tape (storage) for the adjoint is mimicked using a Python list object used as a stack.

We start by importing the only library we need, which is numpy.

import numpy as np

Forward model#

The forward model (not to be confused with tangent linear model) performs non-linear quadratic operations on the input \(\mathbf{x}_0\), which has been copied to \(x\). This is followed by “convective adjustment”. \(\mathbf{x}_0\) is our control variable, and for the given choice of \(\mathbf{x}_0 = (1,3,3)\), the cost function \(fc\) evaluates to 40.25.

We call the function forward_without_tape because it does not store the interim vectors to tape, which would later be needed for the adjoint calculation. A separate function forward does that; it is illustrated further below.

def forward_without_tape(x0 = np.array([1, 3, 3])):
    
    x = np.zeros(3)

    for i in range(3):
        x[i] = x0[i]

    y = x[0]**2

    for i in range(3):

        x[i] = y + x[i]**2 

    # Convective adjustment

    for i in range(2):

        if (x[i] < x[i+1]):
            x[i] = 0.5*(x[i] + x[i+1])
            x[i+1] = x[i]

    # Cost function
    
    fc = (x[0]-5.5)**2 + 2.*x[1] + 3.*x[2]

    return fc

print(f"Cost function for given x0: {forward_without_tape()}")
Cost function for given x0: 40.25

Getting the finite difference gradient from the Forward model#

We can use finite differences with a central differencing scheme to get the gradient evaluated at the given value of \(\mathbf{x}_0\). For an \(N\)-dimensional control vector, the forward model needs to be called \(2N\) times to get an approximate gradient: finite differences only give us directional derivatives, we need \(N\) of them (one per basis direction) to assemble the gradient, and each directional derivative evaluation, in turn, requires 2 forward model evaluations.

Mathematically, this can be formulated as

\[ \left (\nabla_{\mathbf{x}_0} fc(\mathbf{x}_0), \hat{\mathbf{e}} \right) = \frac{fc\left(\mathbf{x}_0+\epsilon \hat{\mathbf{e}}\right) - fc\left(\mathbf{x}_0-\epsilon \hat{\mathbf{e}}\right)}{2\epsilon} + \mathcal{O}(\epsilon^2) \]

Here \((\cdot,\cdot)\) denotes the standard inner product of two discrete vectors in \(\mathbb{R}^N\).

The left side is the directional derivative of \(fc\) at \(\mathbf{x}_0\) in the direction \(\hat{\mathbf{e}}\), while the right side is the central differences approximation of the directional derivative.

The choice of \(\epsilon\) can be critical; however, we do not discuss that here and simply choose \(\epsilon = 0.001\). In general, one would perform a convergence analysis over a range of values, e.g. \(\epsilon = 0.1, 0.01, 0.001, \ldots\). In practice, this ideal convergence often isn't actually observed.

To get the \(k\)-th component of the gradient at \(\mathbf{x}_0\), we choose \(\epsilon = 0.001\) and \(\hat{\mathbf{e}} = \hat{\mathbf{e}}_k\), i.e. \(\hat{\mathbf{e}}_1 = (1,0,0)\) for the first component, \(\hat{\mathbf{e}}_2 = (0,1,0)\) for the second, and \(\hat{\mathbf{e}}_3 = (0,0,1)\) for the third.

eps = 0.001
x0 = np.array([1,3,3])
e1 = np.array([1,0,0])
e2 = np.array([0,1,0])
e3 = np.array([0,0,1])
g1 = (forward_without_tape(x0 = x0 + eps*e1) - forward_without_tape(x0 = x0 - eps*e1))/(2*eps)

print(f"First component of gradient for given x0: {g1:.2f}")
First component of gradient for given x0: 15.50
g2 = (forward_without_tape(x0 = x0 + eps*e2) - forward_without_tape(x0 = x0 - eps*e2))/(2*eps)

print(f"Second component of gradient for given x0: {g2:.2f}")
Second component of gradient for given x0: 10.50
g3 = (forward_without_tape(x0 = x0 + eps*e3) - forward_without_tape(x0 = x0 - eps*e3))/(2*eps)

print(f"Third component of gradient for given x0: {g3:.2f}")
Third component of gradient for given x0: 15.00
print("Therefore the approximated finite differences gradient at x_0 = (1, 3, 3) is given by g = (%.2f, %.2f, %.2f)"%(g1, g2, g3))
Therefore the approximated finite differences gradient at x_0 = (1, 3, 3) is given by g = (15.50, 10.50, 15.00)
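
The three evaluations above can also be wrapped in a small helper that loops over the basis directions. fd_gradient below is a convenience sketch introduced here (not part of the original recipe), built around forward_without_tape; it makes the \(2N\) model calls explicit.

def fd_gradient(x0 = np.array([1, 3, 3]), eps = 0.001):

    # Central differences: 2 calls to forward_without_tape per control variable, 2N in total.

    N = len(x0)
    g = np.zeros(N)

    for i in range(N):
        e = np.zeros(N)
        e[i] = 1.0
        g[i] = (forward_without_tape(x0 = x0 + eps*e) - forward_without_tape(x0 = x0 - eps*e))/(2*eps)

    return g

g = fd_gradient()
print("Finite differences gradient at x_0 = (1, 3, 3): g = (%.2f, %.2f, %.2f)"%(g[0], g[1], g[2]))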

Tangent linear mode#

We now illustrate how to write the tangent linear code for the forward model shown above. Mathematically, a tangent linear model is a linearization of a non-linear forward model.

If the forward model is \(\mathbf{x}_{\text{new}} = A(\mathbf{x})\), where \(A\) can be non-linear, the linearized statement reads \(\mathbf{dx}_{\text{new}} = \mathbf{B}\,\mathbf{dx}\), where \(\mathbf{B}_{ij} = \frac{\partial [A(\mathbf{x})]_i}{\partial x_j}\) is the Jacobian of the non-linear operator \(A\).

An important subtlety here is that \(\mathbf{B}\) depends on \(\mathbf{x}\), and not \(\mathbf{x}_{\text{new}}\). From a coding perspective, this means that the linearized statements are always called before the original non-linear ones. This is because we generally need the old values of the control vector for the linearized statement. If the non-linear statements are called first, they will generally update \(\mathbf{x}\) to \(\mathbf{x}_{\text{new}}\).
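
As a minimal sketch of this ordering, consider linearizing a single squaring statement, the same y = x**2 pattern that appears in the forward model; square_tlm is an illustrative name introduced here.

def square_tlm(x, xd):

    # The linearized statement comes first: it needs the OLD value of x,
    # which the non-linear statement below overwrites.

    xd = 2*x*xd    # dx_new = B(x) dx, with B(x) = 2*x
    x = x**2       # original non-linear statement

    return x, xd

print(square_tlm(3.0, 1.0))    # (9.0, 6.0): the derivative of x**2 at x = 3 is 6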

One would expect a linearized model to be computationally cheaper than its non-linear counterpart. However, the linearized statements require variables from the original non-linear model, so the non-linear forward model has to be evaluated alongside the linearized one. The cost is therefore roughly twice that of the original non-linear model.

This can be clearly seen below. For each statement in the non-linear forward model, the tangent linear model forward_d has 2 statements. It is helpful to look at each line of code as a general non-linear operation \(\mathbf{x}_{\text{new}} = A(\mathbf{x})\), which we linearize to \(\mathbf{dx}_{\text{new}} = \mathbf{B}\,\mathbf{dx}\). The Jacobian matrices \(\mathbf{B}\), which are functions of \(\mathbf{x}\), are given in the comments above the respective code statements.

def forward_d(x0 = np.array([1, 3, 3]), x0d = [1, 0, 0]):

    x = np.zeros(3)
    xd = np.zeros(3)

    for i in range(3):
        
        # [xd[i]]  = [0. 1.] [xd[i]]
        # [x0d[i]] = [0. 1.] [x0d[i]]
    
        xd[i] = x0d[i]
        x[i] = x0[i]

    # [yd]    = [0. 2*x[0]] [yd]
    # [xd[0]] = [0.     1.] [xd[0]] 
    
    yd = 2*x[0]*xd[0]
    y = x[0]**2
    
    for i in range(3):

        # [xd[i]]  = [2*x[i]  1.] [xd[i]]
        # [yd]     = [0.      1.] [yd]
        
        xd[i] = yd + 2*x[i]*xd[i]
        x[i] = y + x[i]**2 
        
    # Convective adjustment

    for i in range(2):

        if (x[i] < x[i+1]):
            
            # [xd[i]]   = [0.5 0.5] [xd[i]]
            # [xd[i+1]] = [0.   1.] [xd[i+1]]
        
            xd[i] = 0.5*(xd[i] + xd[i+1])
            x[i] = 0.5*(x[i] + x[i+1])

            # [xd[i]]   = [1. 0.] [xd[i]]
            # [xd[i+1]] = [1. 0.] [xd[i+1]]
            
            xd[i+1] = xd[i]
            x[i+1] = x[i]

    # [xd[0]] = [1.           0. 0. 0.] [xd[0]]
    # [xd[1]] = [0.           1. 0. 0.] [xd[1]]
    # [xd[2]] = [0.           0. 1. 0.] [xd[2]]
    # [fcd]   = [2*(x[0]-5.5) 2. 3. 0.] [fcd]
    
    fcd = 2*(x[0]-5.5)*xd[0] + 2.*xd[1] + 3.*xd[2]
    fc = (x[0]-5.5)**2 + 2.*x[1] + 3.*x[2]


    return fcd

The tangent linear model, for a given \(\mathbf{dx}_0\) (represented by x0d in the code above), returns a scalar value fcd, which is the directional derivative in the direction \(\mathbf{dx}_0\). To get the gradient, we have to evaluate the directional derivative in all of the basis directions \(\hat{\mathbf{e}}_1, \hat{\mathbf{e}}_2, \hat{\mathbf{e}}_3\), just like we did for the finite differences gradient.

(g1, g2, g3) = forward_d(x0d = [1, 0, 0]), forward_d(x0d = [0, 1, 0]), forward_d(x0d = [0, 0, 1])
print("Therefore the tangent linear evaluated gradient at x_0 = (1, 3, 3) is given by g = (%.2f, %.2f, %.2f)"%(g1, g2, g3))
Therefore the tangent linear evaluated gradient at x_0 = (1, 3, 3) is given by g = (15.50, 10.50, 15.00)
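
As a quick consistency check (not part of the original recipe), the tangent linear output in an arbitrary direction should agree with a central-difference estimate of the same directional derivative; the direction dx0 below is chosen arbitrarily.

eps = 0.001
x0 = np.array([1, 3, 3])
dx0 = np.array([0.3, -1.2, 0.7])   # arbitrary direction

tlm = forward_d(x0 = x0, x0d = dx0)
fd  = (forward_without_tape(x0 = x0 + eps*dx0) - forward_without_tape(x0 = x0 - eps*dx0))/(2*eps)

print(f"Directional derivative from TLM: {tlm:.4f}, from finite differences: {fd:.4f}")
# Both should be close to (g, dx0) = 15.5*0.3 + 10.5*(-1.2) + 15.0*0.7 = 2.55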

Adjoint mode#

We now illustrate how to write the adjoint code for the forward model shown above. To better understand this, we consider a non-linear forward model \(\mathbf{y} = A(\mathbf{x})\), where the non-linear operator \(A\) can be seen as a composition of functions, \(A = A_2 \circ A_1\). This composition is what makes the reverse flow of the adjoint apparent.

\[ \therefore \mathbf{y} = A_2(A_1(\mathbf{x})) \]

One can even interpret \(A_1, A_2\) as being individual lines of code. We also define the interim variables as \(\mathbf{x}_1 = A_1(\mathbf{x}), \mathbf{y} = A_2(A_1(\mathbf{x})) = A_2(\mathbf{x}_1)\).

Let us assume that the cost function \(\mathcal{J}\) is a function of \(\mathbf{y}\), i.e. \(\mathcal{J} = \mathcal{J}(\mathbf{y})\). We wish to evaluate the sensitivity of \(\mathcal{J}\) to \(\mathbf{x}\), i.e. \(\nabla_{\mathbf{x}} \mathcal{J}\), which can be evaluated using the chain rule. Both the LHS and the RHS below are row vectors; repeated indices follow the Einstein summation convention.

\[ \mathbf{g}^T = \nabla_{\mathbf{x}} \mathcal{J}^T = \frac{\partial \mathcal{J}}{\partial \mathbf{x}^i} = \frac{\partial \mathcal{J}}{\partial \mathbf{y}^j} \frac{\partial \mathbf{y}^j}{\partial \mathbf{x}^i} = \frac{\partial \mathcal{J}}{\partial \mathbf{y}^j} \frac{\partial \mathbf{y}^j}{\partial \mathbf{x}_1^k} \frac{\partial \mathbf{x}_1^k}{\partial \mathbf{x}^i} = \frac{\partial \mathcal{J}}{\partial \mathbf{y}^j} \mathbf{B}_2^{jk}(\mathbf{x}_1)\,\mathbf{B}_1^{ki}(\mathbf{x}) = \nabla_{\mathbf{y}} \mathcal{J}^T \,\mathbf{B}_2(\mathbf{x}_1)\,\mathbf{B}_1(\mathbf{x}) \]

Here \(\mathbf{B}_i\) is the Jacobian matrix of the non-linear operator \(A_i\). There are two ways to calculate this sensitivity:

  1. Take the inner product of the LHS and the RHS with the unit basis vectors \(\hat{\mathbf{e}}_i\). Do this \(N\) times to get all \(N\) components of the gradient.

\[ \mathbf{g}^T \hat{\mathbf{e}}_i = \mathbf{g}_i = \nabla_{\mathbf{x}} \mathcal{J}^T \hat{\mathbf{e}}_i = \nabla_{\mathbf{y}} \mathcal{J}^T \:\mathbf{B}_2(\mathbf{x}_1) \:\mathbf{B}_1(\mathbf{x}) \hat{\mathbf{e}}_i \]

This is exactly what the tangent linear model does. If you go back to the discussion of the tangent linear model, you see that we substitute \(\hat{\mathbf{e}}_i\) for \(\mathbf{dx}\) and evaluate the directional derivative \(N\) times to get the gradient. This works out conveniently because \(\mathbf{x}, \mathbf{x}_1\) are accessed in the same order they are computed. Therefore, this is also known as the forward mode.

Another, equivalent, way of looking at this is that to get the whole gradient in this mode we effectively have to compute a chain of matrix-matrix products (building up the full Jacobian) before finally doing one vector-matrix product.

  2. Transpose both the LHS and the RHS.

\[ \mathbf{g} = \nabla_{\mathbf{x}} \mathcal{J} = \mathbf{B}_1^T(\mathbf{x}) \:\mathbf{B}_2^T(\mathbf{x}_1) \:\nabla_{\mathbf{y}} \mathcal{J} \]

We see the advantage clearly here: if we multiply starting from the right, we only ever have to do matrix-vector multiplications, and we get the entire gradient in one go. This is what the adjoint model does.
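
The difference between the two orderings can be seen in a small, purely illustrative numpy sketch, with random matrices standing in for the Jacobians \(\mathbf{B}_1, \mathbf{B}_2\) and a random vector standing in for \(\nabla_{\mathbf{y}} \mathcal{J}\).

rng = np.random.default_rng(0)
N = 5
B1 = rng.standard_normal((N, N))      # stand-in for the Jacobian B_1(x)
B2 = rng.standard_normal((N, N))      # stand-in for the Jacobian B_2(x_1)
gradJ_y = rng.standard_normal(N)      # stand-in for grad_y J

# Forward (tangent linear) mode: one sweep of matrix-vector products per basis vector, N sweeps in total.
g_forward = np.array([gradJ_y @ (B2 @ (B1 @ e)) for e in np.eye(N)])

# Reverse (adjoint) mode: a single sweep of matrix-vector products gives the whole gradient.
g_reverse = B1.T @ (B2.T @ gradJ_y)

print(np.allclose(g_forward, g_reverse))    # True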

The reverse nature of the adjoint model is clearly seen here. In the non-linear forward model and the tangent linear model, \(A_1, \mathbf{B}_1\) act first, followed by \(A_2, \mathbf{B}_2\). For the adjoint model, \(\mathbf{B}_2^T\) acts first, followed by \(\mathbf{B}_1^T\). Furthermore, the adjoint calculation requires \(\mathbf{y}\) first, even though it is computed last in the forward model, followed by \(\mathbf{x}_1\) and then \(\mathbf{x}\).

Therefore, we need to run the non-linear forward model, store \(\mathbf{x}, \mathbf{x}_1, \mathbf{y}\) to tape, and then retrieve them in the reverse order during the adjoint computation. The tape is a stack data structure: it is Last-In-First-Out (LIFO), so the vectors stored during the non-linear forward run are popped in reverse order for the adjoint computation.
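
A minimal sketch of the tape mechanism, using a plain Python list as the LIFO stack (the same mechanism the forward function defined below uses); the values pushed here are just placeholders.

tape = []

for intermediate in ([1., 3., 3.], [2., 10., 10.], [6., 6., 10.]):
    tape.append(np.copy(intermediate))    # push intermediate states during the forward sweep

while tape:
    print(tape.pop())                     # popped in reverse (LIFO) order during the adjoint sweep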

Any code can be viewed as a composition of its individual lines, each of which can be thought of as a non-linear operator, so that the whole code reads \(A = A_M \circ A_{M-1} \circ \ldots \circ A_1\). If we view \(\mathbf{x}_i = A_i(\mathbf{x}_{i-1})\) as one line of code, then \(d\mathbf{x}_i = \mathbf{B}_i(\mathbf{x}_{i-1})\,d\mathbf{x}_{i-1}\) is the linearization of that line and \(\nabla_{\mathbf{x}_{i-1}} \mathcal{J} = \mathbf{B}_i^T \nabla_{\mathbf{x}_{i}} \mathcal{J}\) is its adjoint. By convention, within the code the adjoint variable \(xib\) represents \(\nabla_{\mathbf{x}_{i}} \mathcal{J}\). The adjoint of the cost function itself is therefore \(fcb = \frac{d\mathcal{J}}{d\mathcal{J}} = 1\), which is how the adjoint sweep is initialized.

Now we see why writing the tangent linear code was useful: the Jacobian matrices computed and illustrated in its comments just need to be transposed and used in the adjoint code.
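
As a small concrete example, take the averaging statement x[i] = 0.5*(x[i] + x[i+1]) from the convective adjustment. Its tangent linear Jacobian (copied from the comments above) and the transpose applied by the adjoint are:

# Tangent linear of x[i] = 0.5*(x[i] + x[i+1]), acting on the pair (xd[i], xd[i+1])
B = np.array([[0.5, 0.5],
              [0. , 1. ]])

# The adjoint applies the transpose to the pair (xb[i], xb[i+1]),
# which is exactly the matrix written in the comments of forward_b below.
print(B.T)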

Forward mode with tape used for adjoint#

We now define the forward model with tape functionality included, which is useful to compute the adjoint.

def forward(x0 = np.array([1, 3, 3])):
    
    x = np.zeros(3)
    tape = []
    
    for i in range(3):
        x[i] = x0[i]

    y = x[0]**2
    
    tape.append(np.copy(x))

    for i in range(3):

        x[i] = y + x[i]**2 

    # Convective adjustment

    for i in range(2):

        tape.append(np.copy(x))

        if (x[i] < x[i+1]):
            x[i] = 0.5*(x[i] + x[i+1])
            x[i+1] = x[i]

    tape.append(np.copy(x))

    fc = (x[0]-5.5)**2 + 2.*x[1] + 3.*x[2]

    return fc, tape

print(f"Cost function for given x0: {forward()[0]}")
print(f"Tape of x for given x0: {forward()[1]}")
Cost function for given x0: 40.25
Tape of x for given x0: [array([1., 3., 3.]), array([ 2., 10., 10.]), array([ 6.,  6., 10.]), array([6., 8., 8.])]

Code for adjoint mode#

In this case, all the adjoint variables are suffixed with \(b\). The comments show the adjoint matrices used for the operations; they are exactly the transposes of the Jacobian matrices given in the tangent linear code comments. We pop the required value of \(\mathbf{x}\) from the tape whenever it is needed. The example below stores everything that is needed to tape and then pops it, but modern AD tools make smarter choices and recompute some of the intermediate variables when that saves storage.

There is an important subtlety here as well: sometimes you must not implement the rows of the matrix-vector product shown in the comments in their written order. Look for the comment marked LOOK HERE! below to understand why.

_, tape = forward(x0 = np.array([1, 3, 3]))

def forward_b(x0 = np.array([1, 3, 3]), tape = tape):
    
    xb = np.zeros(3)
    x0b = np.zeros(3)
    yb = 0.0
    
    fcb = 1.0
    
    # [xb[0]] = [1. 0. 0. 2*(x[0]-5.5)] [xb[0]]
    # [xb[1]] = [0. 1. 0. 2.          ] [xb[1]]
    # [xb[2]] = [0. 0. 1. 3.          ] [xb[2]]
    # [fcb]   = [0. 0. 0. 0.          ] [fcb] 
    
    x = tape.pop()
    
    xb[0] = xb[0] +  2*(x[0]-5.5) * fcb
    xb[1] = xb[1] +  2.           * fcb
    xb[2] = xb[2] +  3.           * fcb
    
    for i in range(1,-1,-1):
        
        x = tape.pop()
        
        if (x[i] < x[i+1]):
        
            # [xb[i]]   = [1. 1.] [xb[i]]
            # [xb[i+1]] = [0. 0.] [xb[i+1]]

            xb[i] = xb[i] + xb[i+1]
            xb[i+1] = 0.

            # [xb[i]]   = [0.5 0.] [xb[i]]
            # [xb[i+1]] = [0.5 1.] [xb[i+1]]
            # LOOK HERE! 
            # We first compute xb[i+1] even though it is the second entry in the matrix.
            # This is because it depends on old value of xb[i].
            # If we compute xb[i] first, we lose that old value which is needed for xb[i+1].

            xb[i+1] = 0.5*xb[i] + xb[i+1]
            xb[i]   = 0.5*xb[i]
            
    x = tape.pop()
    
    for i in range(2,-1,-1):

        # [xb[i]]  = [2*x[i] 0.] [xb[i]]
        # [yb]     = [1.      1.] [yb] 
        
        yb = yb + xb[i]
        xb[i] = 2*x[i]*xb[i]
        
    # [yb]    = [0.     0.] [yb]
    # [xb[0]] = [2*x[0] 1.] [xb[0]] 
    
    xb[0] = xb[0] + 2*x[0]*yb
    yb    = 0.0

    for i in range(2,-1,-1):
        
        # [xb[i]]  = [0. 0.] [xb[i]]
        # [x0b[i]] = [1. 1.] [x0b[i]]
    
        x0b[i] = x0b[i] + xb[i]
        xb[i]  = 0.0
        
    return x0b

As mentioned above, we have to store the required variables to tape by running the non-linear forward code first. Then we use this tape for our adjoint calculations.

_, tape = forward(x0 = np.array([1, 3, 3]))
(g1, g2, g3) = forward_b(x0 = np.array([1, 3, 3]), tape = tape)
print("Therefore the adjoint evaluated gradient at x_0 = (1, 3, 3) is given by g = (%.2f, %.2f, %.2f)"%(g1, g2, g3))
Therefore the adjoint evaluated gradient at x_0 = (1, 3, 3) is given by g = (15.50, 10.50, 15.00)

We see that we get the same gradient from finite differences, the tangent linear model, and the adjoint model. The adjoint model is the computationally efficient choice when the control vector is large: it delivers the entire gradient from a single reverse sweep, whereas finite differences and the tangent linear model require \(2N\) and \(N\) model evaluations, respectively.
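
As a final sanity check (often called the dot product test; not part of the original text), the tangent linear output for an arbitrary direction should equal the inner product of that direction with the adjoint gradient.

dx0 = np.array([0.5, -1.0, 2.0])    # arbitrary direction

tlm = forward_d(x0 = np.array([1, 3, 3]), x0d = dx0)

_, tape = forward(x0 = np.array([1, 3, 3]))    # fresh tape for the adjoint
adj = forward_b(x0 = np.array([1, 3, 3]), tape = tape)

print(np.isclose(tlm, np.dot(dx0, adj)))       # True: both equal (g, dx0)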

Acknowledgments#

Thanks to An T. Nguyen for fruitful discussions regarding this code.