Common ways to compute derivatives video transcript

Hello again and welcome to Practical MDO. Today we’re talking about common ways to compute derivatives. There are many ways to compute derivatives, more than even what I’ll cover today, but we’ll focus on four main methods: finite differencing, complex step, analytic derivation, and algorithmic or automatic differentiation. The best method to use is highly dependent on the problem that you’re trying to solve, and I’ll talk through some of the advantages and disadvantages of each of these methods. Usually the most computationally efficient way to obtain derivatives is some mix of the methods that I’ll cover today.

This firmly falls under the differentiation category covered within this course and helps set us up for the optimization as well.

So like I mentioned I’ll really focus on four different methods here today.

Finite difference works by first querying the function and then querying it again at a slightly perturbed point. By looking at the change in function value divided by that perturbation size you can get an approximation for the slope. Now there are many different subcategories of finite difference. There’s forward, backward, central finite differencing and higher order methods. But for these purposes I’ll just show you forward first order finite differencing here. The notion of the order for finite differencing methods comes from the expansion of the Taylor series and where the truncation error is introduced. I’ll have a link in the description to the relevant section within the Engineering Design Optimization book where you can look for more info.

Let’s take a look at a simple quadratic function. For demonstration purposes here I’m going to show a ridiculously large step value. Nobody would use dx or h equal to one but here I want to show you what we’re dealing with. So imagine two points here; we have one dot in red which is the point at which we’re trying to approximate the derivative and the other point in white is the finite difference perturbation point. Again we’re just taking dy over dx. I’m going to really spell this out in detail to drive home the point of what’s going on here. We obtain the slope by dividing dy by dx. If we move along the graph where we’re trying to approximate the derivative you can see in some places this approximation does better or worse. Ideally we would want the red line to be tangent to the graph. However what you’re really seeing now is more of a secant line, and I mentioned before that we don’t really use a finite difference step size of one. Let’s see what it looks like with a much smaller finite difference step size.

So if we shrink this down the secant line becomes closer and closer to a tangent line, so it better approximates the derivative. This is fantastic. For the simple quadratic case, if we use a finite difference step size of 10^-9 we get a dy/dx of 2.00000017, compared with the exact value of 2. So this is a pretty good approximation for the derivative at this point. You can see the tangent line looks nice and it’s right against the graph. So that’s what first order finite differencing looks like.

If we were to spell out the very simple case we just looked at visually, this is what it would look like. Using the formula that I presented before, we take the function value at 3.5 plus 10^-9 minus the function value at 3.5, all of that over 10^-9. This gives us an approximation of the derivative. Again I’m really spelling this out here to drive home what’s going on. Now you might be thinking, okay, we used a smaller step size and we got a better derivative approximation. Why not use a step even smaller than 10^-9? Unfortunately, just like ice cream, you can have too much of a good thing. If you use 10^-16 here in a computer code it would be too small compared to machine precision. What this means is that the perturbation in the function is very close to the limit of what a double-precision floating point number can represent in a computer. This error is known as subtractive cancellation. Any information you were getting from your function is canceled out by the noise within the floating point representation. In some cases, if you were to actually query this model with this small of a step size, you would see no difference in the function at all. This is because as far as the computer is concerned it’s the same number; it’s not being perturbed at all; the function is being called at exactly the same point. Thus the perceived slope is actually zero. So there’s a really fine balance in choosing the correct step size when using finite difference. I highly recommend reading section 6.4 in Engineering Design Optimization by Martins and Ning. They have some great graphs and a few figures that really show where it shines and where it doesn’t shine. It also helps motivate some of the other methods that I’ll be showing in this lecture.
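To make that arithmetic concrete, here is a minimal Python sketch of forward finite differencing. The lecture only tells us that the derivative of the quadratic is x minus 1.5, so the specific function below, 0.5 x^2 - 1.5 x, is an assumption chosen to match that derivative, and the names `f` and `forward_difference` are just for illustration.

```python
def f(x):
    # Assumed quadratic, chosen so that dy/dx = x - 1.5 as in the lecture.
    return 0.5 * x * x - 1.5 * x


def forward_difference(func, x, h):
    # First-order forward finite difference: (f(x + h) - f(x)) / h
    return (func(x + h) - func(x)) / h


x0 = 3.5
exact = x0 - 1.5  # analytic derivative at x0

fd_large = forward_difference(f, x0, 1.0)    # the ridiculously large step
fd_good = forward_difference(f, x0, 1e-9)    # a reasonable step
fd_tiny = forward_difference(f, x0, 1e-16)   # too small for double precision

print("exact     :", exact)
print("h = 1     :", fd_large)
print("h = 1e-9  :", fd_good)
print("h = 1e-16 :", fd_tiny)

# With h = 1e-16 the perturbed input rounds back to the original input,
# so the function is queried at exactly the same point twice.
print("3.5 + 1e-16 == 3.5 ?", x0 + 1e-16 == x0)
```

The last line is the subtractive cancellation story in miniature: the step is smaller than the spacing between representable doubles near 3.5, so the numerator is exactly zero.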

So let’s talk about some of the advantages of finite differencing or FD. It’s simple and versatile; you don’t need any knowledge of the model or the physical system that you’re approximating the derivatives of. It’s useful for black box functions. You don’t even need the source code; you can just have an input and an output. You perturb that input and get a new perturbed output.

Now it has some disadvantages. It’s inaccurate and it’s computationally expensive. I go into more detail about this in other lectures that really motivate using analytic methods, but here I’ll just briefly say that it’s inaccurate partly because of the floating point representation concerns that I mentioned before. It’s also inaccurate because there’s usually a whole lot more than just a quadratic function between your inputs and outputs. Imagine an entire computational model, maybe a multidisciplinary model with solvers, a few different loops, things like that. If you’re perturbing the input and looking at how the output changes, that small change might not be captured well throughout the model. It might get squashed along the way, and not just due to the limitations of floating point representation in computers. In other situations, like when you have an internal solver, multiple loops, or a while loop, finite differencing can produce terrible results because the model does not always converge the same way for the perturbed input. You can also imagine that if you have tens of thousands of lines of code between your inputs and outputs, there’s a whole lot of noise that could be introduced in between. Additionally, finite differencing is computationally expensive. This is because you have to do it for each and every one of the design variables that you’re interested in learning the sensitivity of. If you only have one or two or five or ten design variables it might be okay, but the moment you get into the tens, hundreds, or thousands, FD would be a terrible option. So when I say it’s computationally expensive just imagine running your model for each and every design variable that you care about just to get the gradients. That’s why it quickly becomes intractable.
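The cost argument can be sketched in a few lines of Python. Everything here is hypothetical: `CountingModel` is a stand-in for an expensive model that simply counts how often it gets called, and the step size of 1e-6 is an arbitrary choice for the illustration.

```python
def forward_difference_gradient(func, x, h=1e-6):
    # Approximate the gradient of a scalar function of n design variables.
    # Needs one baseline evaluation plus one extra evaluation per variable.
    f0 = func(x)
    grad = []
    for i in range(len(x)):
        x_pert = list(x)
        x_pert[i] += h  # perturb only the i-th design variable
        grad.append((func(x_pert) - f0) / h)
    return grad


class CountingModel:
    # Hypothetical stand-in for an expensive model; it just counts calls.
    def __init__(self):
        self.calls = 0

    def __call__(self, x):
        self.calls += 1
        return sum(xi * xi for xi in x)


model = CountingModel()
x = [1.0] * 100  # 100 design variables
grad = forward_difference_gradient(model, x)

print("design variables :", len(x))
print("model evaluations:", model.calls)  # n + 1 evaluations
```

If a single model run takes minutes or hours, those n + 1 runs per gradient are what make finite differencing impractical for large numbers of design variables.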

So that being said about finite difference let’s now talk about complex step. You can imagine the complex step method as being a sort of different variation of the finite difference method. It almost works like magic and it returns much better results with just a few caveats.

So the main difference here is that instead of perturbing the inputs to our model in the real space we perturb them in the imaginary space. Now I’m not talking imaginary like ghosts, I’m talking about imaginary like i. If we perturb a model or a function in the complex plane we have much more fidelity or understanding at the floating point level. This all has to do with details of how floating point numbers are represented in the computer and I won’t necessarily get into that here. But just know it’s the idea that perturbing a model in the complex plane means that you can look at how that perturbation - and that perturbation alone - propagates throughout the model. Imagine trying to weigh a fruit fly alone versus trying to weigh an elephant and a fruit fly together and then subtracting the elephant’s mass away from that. You can use a much more accurate instrument to measure the mass of just a fruit fly instead of looking at the elephant plus the fruit fly. This is the same idea as using complex step to only perturb in the complex plane without any muddying from the real part. This allows you to amazingly avoid any subtractive cancellation problems. You can use a very, very small step size with complex step and get fantastic results in terms of accuracy.
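Here is a minimal sketch of the complex step idea in Python, which has complex numbers built in. The derivative is read off as the imaginary part of the output divided by the step, so nothing gets subtracted. The quadratic is the same assumed one with derivative x minus 1.5, `wavy` is a made-up second test function, and the step of 1e-30 is just an illustrative choice.

```python
import cmath
import math


def complex_step(func, x, h=1e-30):
    # Perturb along the imaginary axis and read the derivative off the
    # imaginary part of the output; no subtraction is involved.
    return func(complex(x, h)).imag / h


def quadratic(x):
    # Assumed quadratic, chosen so that dy/dx = x - 1.5.
    return 0.5 * x * x - 1.5 * x


def wavy(x):
    # Hypothetical second test function; cmath.sin accepts complex input.
    return cmath.sin(x) * x


d_quad = complex_step(quadratic, 3.5)
d_wavy = complex_step(wavy, 1.0)

print("quadratic:", d_quad, "exact:", 3.5 - 1.5)
print("wavy     :", d_wavy, "exact:", math.sin(1.0) + math.cos(1.0))
```

Note that a step of 1e-30 would be hopeless for finite differencing, yet here it causes no trouble because the perturbation never has to be added to the much larger real value.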

Let’s take a look at a figure here. This is Figure 6.9 from Engineering Design Optimization by Martins and Ning and I want to highlight here how complex step is operating and how it’s operating better than either of the finite differencing options. So I showed you forward difference before. That’s that blue line up top. If we take a look at the y-axis, relative error, it slowly decreases as we get to a smaller and smaller step size. Eventually it starts to increase as we decrease our step size, again due to the subtractive cancellation issues with the floating point representation. We could use central difference, a higher order finite differencing scheme, to get better accuracy. But even then we’re just delaying the inevitable in terms of the step size and we get the same relative error increasing as we decrease step size. In the left-hand portion of this graph the complex step method and the central difference method are directly on top of each other. This is because they both have truncation errors of the second order. Now like I mentioned before complex step does not suffer from the same subtractive cancellation problems. This is again because we’re essentially perturbing the model only in the imaginary part, so the tiny perturbation is never added to the much larger real value and we never have to subtract two nearly equal numbers. This allows you to use a very, very small step size which gives you very accurate derivatives. When I say accurate I mean they’re accurate to machine precision. So that’s as good as you can do.
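If you want to reproduce the general shape of that comparison yourself, a small step-size sweep like the one below is enough. This is only a sketch: the test function is a hypothetical one (not necessarily the function behind Figure 6.9), and the helper names are made up for the illustration.

```python
import cmath
import math


def f(x):
    # Hypothetical test function built from cmath calls so that it accepts
    # both real and complex input.
    return cmath.exp(x) * cmath.sin(x)


def exact_derivative(x):
    return math.exp(x) * (math.sin(x) + math.cos(x))


def forward_diff(x, h):
    return (f(x + h).real - f(x).real) / h


def central_diff(x, h):
    return (f(x + h).real - f(x - h).real) / (2.0 * h)


def complex_step(x, h):
    return f(complex(x, h)).imag / h


x0 = 1.5
exact = exact_derivative(x0)
errors = {}

print(f"{'h':>8} {'forward':>12} {'central':>12} {'complex':>12}")
for exponent in range(1, 21, 3):
    h = 10.0 ** (-exponent)
    # Relative error of each method at this step size
    errs = tuple(abs(method(x0, h) - exact) / abs(exact)
                 for method in (forward_diff, central_diff, complex_step))
    errors[exponent] = errs
    print(f"{h:8.0e} {errs[0]:12.2e} {errs[1]:12.2e} {errs[2]:12.2e}")
```

Reading down the table, the two finite difference columns should improve and then fall apart as the step shrinks, ending with a relative error of one once the step is too small to change the input at all, while the complex step column keeps improving and then stays at machine precision.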

Now you might be saying “Hey John, what are these caveats? What are the nuances about complex step that I need to worry about?” Well, first and foremost, you might have guessed your model needs to be complex safe. What I mean by that is if you have an input which has a real part and an imaginary part, that imaginary part needs to propagate entirely through your model. It needs to pop out on your outputs. It needs to be represented correctly throughout the entire process of your model. If you’re writing a code from scratch you might be able to do it and be mindful of what you’re doing. But if you have some code that you were handed, especially if it’s in Fortran or C but even if it’s only in Python, it might not be complex safe. There are a lot of methods, functions, and data types that squash any imaginary parts. So it takes a little bit of special care to complexify your code. Luckily there are a few helper functions to help you convert your code and they’re mentioned in Section 6.5 in the Engineering Design Optimization textbook.
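As a small illustration of what not being complex safe can look like, consider the absolute value. In Python, `abs()` of a complex number returns its magnitude as a plain real number, so the imaginary perturbation is lost at that point. The model functions and the replacement `complex_safe_abs` below are hypothetical names for this sketch; branching on the real part is one common way to write such a replacement.

```python
def complex_step(func, x, h=1e-30):
    # Derivative from the imaginary part of the output
    return func(complex(x, h)).imag / h


def unsafe_model(x):
    # abs() on a complex number returns its magnitude as a plain float,
    # so the imaginary perturbation is thrown away at this point.
    return abs(x) * x


def complex_safe_abs(x):
    # Complex-safe replacement: branch on the real part and keep the
    # imaginary part attached.
    return -x if x.real < 0 else x


def safe_model(x):
    return complex_safe_abs(x) * x


x0 = -2.0
exact = -2.0 * x0  # for x < 0 the model is -x**2, so the derivative is -2x

d_unsafe = complex_step(unsafe_model, x0)
d_safe = complex_step(safe_model, x0)

print("exact       :", exact)
print("unsafe model:", d_unsafe)  # silently wrong, no error is raised
print("safe model  :", d_safe)
```

The worrying part is that the unsafe version does not crash; it just quietly returns the wrong derivative, which is why it pays to check a complexified code against another method at least once.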

One main advantage of CS or complex step is that it’s accurate; it’s accurate to machine precision. That’s fantastic; that means that you can trust your derivatives and use gradient-based methods and know that you can hit a pretty tight optimization tolerance.

However there are some disadvantages. Like I mentioned it’s computationally expensive. Again just like for regular finite differencing you have to do this for every design variable that you care about. And then also it requires source code modification or at least knowledge of the source code. You can no longer just treat your code as a black box; it has to be a sort of gray box where you have either some knowledge of what’s going on or the ability to modify it to make sure that it is complex safe.

So I highly recommend using complex step instead of finite differencing. You never have to perform a step size study or worry about your step size because it is accurate to machine precision. The only issue - and this might be a big or small issue depending on your model - is that you have to be aware of your source code.

Next up, our third method that we cover today is analytically or by hand. What I mean by this is that you can get the derivatives of a function or model by sitting down with a pen and paper or a computer and computing analytic expressions for the derivatives by hand. When I say by hand you can use Wolfram Alpha you can use Mathematica or Maple you can use whatever you want to get a symbolic or analytic representation of the derivatives. All that I mean by analytic is that you have some kind of function or method that writes out the derivatives.

Let’s take a look at a real simple example here. This will harken back to your calculus days. If we have a simple quadratic - the same one that we looked at visually before - let’s take the derivative of it. So if we’re interested in dy/dx we simply use the power rule here and we get x minus 1.5. Cool, that was easy, I like that, that’s pretty solid. We understand the power rule. I could do that.
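In code, a hand-derived derivative is just another function sitting next to the original one. As before, the exact quadratic is an assumption picked to match the stated derivative of x minus 1.5.

```python
def y(x):
    # Assumed quadratic, chosen so that dy/dx = x - 1.5.
    return 0.5 * x**2 - 1.5 * x


def dy_dx(x):
    # Hand-derived with the power rule:
    # d/dx (0.5 x^2) = x and d/dx (1.5 x) = 1.5
    return x - 1.5


x0 = 3.5
slope = dy_dx(x0)
print("dy/dx at x = 3.5:", slope)

# Sanity check of the hand derivation against a central finite difference
h = 1e-6
check = (y(x0 + h) - y(x0 - h)) / (2.0 * h)
print("central difference check:", check)
```

There is no step size to tune here and evaluating the derivative costs about as much as evaluating the function itself.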

Unfortunately, all too often engineering equations do not look like that. Here’s one from my PhD which is more challenging but not that challenging. It’s just a few different terms; you have to kind of track them. Here I was getting the change in temperature with respect to time, so you have to consider all the inflow and outflow of heat in this kind of fuel recirculation system. Again in this case it’s not much tougher than multiplying, dividing, adding, or subtracting.

However, you might be trying to model something as complex as three-dimensional turbulent fluid flow. Then if you’re trying to get an analytic representation of the Navier-Stokes equations and a solution for the derivatives, you can’t. So analytic expressions for derivatives have their place and it pays to know when to use them and when to spend the developer cost to do them.

The main advantage of analytic or hand-derived expressions is accuracy. You know that they’re going to be exact. They’re also potentially efficient. If you code them in a way that makes sense and takes advantage of the structure you can get the full Jacobian for a system at a much reduced cost compared to finite differencing or complex step. I say potentially because it’s all about your implementation or how you actually obtain these.

Now there are some disadvantages with analytic approaches. There’s often a large developer cost. You can imagine sitting down and getting an analytic closed form solution for your derivatives may be challenging. It may take some brain power. It may take some sitting down and understanding what’s going on. It’s not nearly as easy as just using finite difference or complex step. Maybe it just wouldn’t be worth your developer time versus the computational savings.

Our fourth and final method that we cover in today’s lecture is algorithmic or automatic differentiation. Now algorithmic differentiation or AD is pretty cool. The idea of AD is that you can take any set of numerical code and at the end of the day it’s simply made up of addition, subtraction, multiplication, division, and other basic operators. If we were to unroll all of the code - all the operations that occur from inputs to outputs - we would be able to write them as a combination of these basic operators. Automatic differentiation or AD then takes these basic operators and simply computes the derivatives of each one of these operations piece by piece and then combines them together using the chain rule and gives you an expression for the derivative of the outputs with respect to the inputs.

Intuitively this makes sense. You know at some level that the computer has to process what you’re asking it to do and it brings it down into smaller and easier operations. By tracking this we’re able to obtain the derivative information as well.

Here’s just a quick example with this vector v here. Treat the top blue part as the inputs and the bottom part as the outputs. Everything in between would be intermediary variables that are needed for calculating the outputs. So here you can imagine we have v1, which is some variable. v2 uses v1 to compute a value and then there’s some loop here. Let’s say v3 is based on v2 but it loops through here. Maybe there’s a for loop or a while loop and eventually this gets fed into v4. So in this very simple case we might have some output v4 that we care about based on some input v1 and some intermediary variables. AD works by taking all of these operations and really unrolling them, getting one flat look at them from front to back where it’s just a combination of all these smaller more basic operations. Then it gets the derivatives of these basic operations and chains them together using the chain rule.

For this very simple example here’s what that would look like. We have v4 dot here which is based on the chain rule between each and every one of these intermediary variables. This allows us to see the total derivative df/dx. Now you can imagine for an actual engineering code you ask an AD tool to help you out here and get the derivatives and if it works it produces a set of code that you can use to get the derivatives. That’d be amazing.
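To show the mechanics, here is a toy forward-mode AD sketch in pure Python using what are often called dual numbers: every variable carries its value and its derivative with respect to the input, and each basic operation updates both. The `Dual` class, the `sin` wrapper, and the particular operations inside `model` are all made up for illustration; only the v1 to v4 structure with a loop in the middle follows the example above. Real AD tools are far more complete than this.

```python
import math


class Dual:
    """A value together with its derivative with respect to one input."""

    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    @staticmethod
    def _lift(other):
        # Treat plain numbers as constants (derivative zero)
        return other if isinstance(other, Dual) else Dual(other)

    def __add__(self, other):
        other = Dual._lift(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)

    __radd__ = __add__

    def __mul__(self, other):
        other = Dual._lift(other)
        # Product rule
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

    __rmul__ = __mul__


def sin(v):
    # Basic operation with a known derivative: d/dv sin(v) = cos(v)
    return Dual(math.sin(v.value), math.cos(v.value) * v.deriv)


def model(v1):
    v2 = v1 * v1           # v2 is computed from v1
    v3 = v2
    for _ in range(3):     # the loop is unrolled simply by executing it
        v3 = sin(v3) + v2
    v4 = 2.0 * v3          # output
    return v4


x = 0.7
out = model(Dual(x, 1.0))  # seed with dv1/dv1 = 1

print("v4      :", out.value)
print("dv4/dv1 :", out.deriv)
```

Nothing in `model` had to know about derivatives; the chain rule is applied one basic operation at a time as the code runs, loop included.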

Now there are a few subtleties, like you have to program your code in a way that can be tracked by whatever tool you’re using to do automatic differentiation. I find myself when I’m trying to do this accidentally using some data types or arrays or other methods to store data that are not supported by a lot of AD engines. All that being said if you have a code base and either you know how to change it or it already comes as AD-able, you can use algorithmic differentiation to obtain the derivatives.

I’m really glossing over all of the details here and I really suggest looking at section 6.6 in Engineering Design Optimization for much more information about AD. One more point I want to drive home is that AD can be done in the forward or reverse modes and this corresponds to the direct and adjoint modes for derivative computation. If these words don’t mean anything to you, don’t worry, I’ll link some lectures that are relevant below.
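For a rough feel of the difference between the two modes: forward mode pushes one input’s perturbation through to all outputs, so you need one pass per input, while reverse mode pulls one output’s sensitivity back to all inputs, so you need one pass per output. Below is a toy reverse-mode sketch in pure Python; the `Var` class, `sin`, and `backward` are hypothetical names for this illustration and not how any particular AD tool is implemented.

```python
import math


class Var:
    """Node in a recorded computation; stores local partials to its parents."""

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # tuples of (parent_node, d(self)/d(parent))
        self.grad = 0.0

    def __add__(self, other):
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))


def sin(v):
    return Var(math.sin(v.value), ((v, math.cos(v.value)),))


def backward(output):
    # Order the nodes so each one is processed after everything that uses
    # it, then push derivatives from the output back toward the inputs.
    order, seen = [], set()

    def visit(node):
        if id(node) not in seen:
            seen.add(id(node))
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)

    visit(output)
    output.grad = 1.0
    for node in reversed(order):
        for parent, local in node.parents:
            parent.grad += node.grad * local  # chain rule, accumulated


a, b, c = Var(0.5), Var(2.0), Var(-1.0)
f = sin(a * b) + b * c  # one output, three inputs
backward(f)             # a single reverse sweep gives all three derivatives

print("df/da:", a.grad)
print("df/db:", b.grad)
print("df/dc:", c.grad)
```

With one output and many inputs, which is the typical optimization setup of one objective and many design variables, that single reverse sweep is what makes the reverse or adjoint approach so attractive. The price is that the intermediate values have to be stored so they can be revisited on the way back.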

So some advantages for AD. It’s accurate. You end up with an analytic solution. Again if you have code that’s computing the derivative it’s just as good as you sitting down with pen and paper and computing the derivative. It also has potentially less developer cost. If you don’t actually have to sit down and compute the derivatives by hand and can let a computer do it instead, that’s a win.

Now I say potentially because here’s a disadvantage. It might require some code reworking. If it takes you more time to rework the code and get it ready for AD than you would save, then you probably shouldn’t do that. Additionally it might be computationally intensive. And specifically I don’t necessarily mean CPU expense. Tracking the derivatives through a complex and large model might introduce large amounts of memory overhead. In some cases for AD this becomes a problem and in many others it’s not an issue. But if you’re solving a large series of equations with millions of degrees of freedom, it’s good to be aware of your CPU and memory costs, especially when using AD.

So thanks for sticking with me. We talked about just four main methods for computing derivatives. I highly suggest if you’re looking to use any one of these methods within your models to do some more research and to dig deeper into the details. For each one I’ll have some links below to other lectures that are relevant, as well as Engineering Design Optimization, the textbook that I’ve mentioned in a few places. It often takes some practice to understand which derivative computation method you should use for your model, but hopefully some of these advantages and disadvantages that we discussed will help you make that decision.

As always, make sure to mash those like and subscribe buttons if you’ve enjoyed what you’ve seen today and thank you very much for watching.