Calculus on Computational Graphs: Backpropagation (2015) (colah.github.io)
170 points by throwup238 on Jan 19, 2024 | 24 comments


If you want more on this, check out this paper by Terence Parr and Jeremy Howard: https://arxiv.org/abs/1802.01528

It is rather accessible.


Discussed at the time:

Calculus on Computational Graphs: Backpropagation - https://news.ycombinator.com/item?id=10148064 - Aug 2015 (9 comments)


Or do Andrej Karpathy's YouTube "backprop ninja" exercise in a Jupyter notebook, by hand, no?


Protip here: do it by hand, though when you start getting tired of typing LaTeX you can switch to an IDE and let GitHub Copilot complete it (it will mostly be incorrect), then go in and fix its mistakes. It still saves a bunch of time. For example:

```
The important thing is that each derivative is accumulated from a number of elements.

logits = h @ W + b

h = [[h11 h12 h13],
     [h21 h22 h23]]

W = [[w11 w12],
     [w21 w22],
     [w31 w32]]

b = [b1, b2]

Logit11 = h11*w11 + h12*w21 + h13*w31 + b1   - Eq.1
Logit12 = h11*w12 + h12*w22 + h13*w32 + b2   - Eq.2
Logit21 = h21*w11 + h22*w21 + h23*w31 + b1   - Eq.3
Logit22 = h21*w12 + h22*w22 + h23*w32 + b2   - Eq.4

dL/dh11 = dL/dLogit11 * dLogit11/dh11 + dL/dLogit12 * dLogit12/dh11 = dL/dLogit11 * w11 + dL/dLogit12 * w12
dL/dh12 = dL/dLogit11 * dLogit11/dh12 + dL/dLogit12 * dLogit12/dh12 = dL/dLogit11 * w21 + dL/dLogit12 * w22
dL/dh13 = dL/dLogit11 * dLogit11/dh13 + dL/dLogit12 * dLogit12/dh13 = dL/dLogit11 * w31 + dL/dLogit12 * w32
dL/dh21 = dL/dLogit21 * dLogit21/dh21 + dL/dLogit22 * dLogit22/dh21 = dL/dLogit21 * w11 + dL/dLogit22 * w12
dL/dh22 = dL/dLogit21 * dLogit21/dh22 + dL/dLogit22 * dLogit22/dh22 = dL/dLogit21 * w21 + dL/dLogit22 * w22
dL/dh23 = dL/dLogit21 * dLogit21/dh23 + dL/dLogit22 * dLogit22/dh23 = dL/dLogit21 * w31 + dL/dLogit22 * w32

dL/dh = [[dL/dh11 dL/dh12 dL/dh13],    [[dL/dLogit11 dL/dLogit12],     [[w11 w21 w31],
         [dL/dh21 dL/dh22 dL/dh23]] =   [dL/dLogit21 dL/dLogit22]]  @   [w12 w22 w32]]

      = dL/dLogit @ W^T -------------> This is the final gradient. Note that it is a matrix multiplication, i.e. a projection of the logit gradient onto the weight matrix.

Now let's compute dL/dW:

dL/dW = dL/dLogit * dLogit/dW

dL/dW11 = dL/dLogit11 * dLogit11/dW11 + dL/dLogit21 * dLogit21/dW11 = dL/dLogit11 * h11 + dL/dLogit21 * h21
dL/dW12 = dL/dLogit12 * dLogit12/dW12 + dL/dLogit22 * dLogit22/dW12 = dL/dLogit12 * h11 + dL/dLogit22 * h21
dL/dW21 = dL/dLogit11 * dLogit11/dW21 + dL/dLogit21 * dLogit21/dW21 = dL/dLogit11 * h12 + dL/dLogit21 * h22
dL/dW22 = dL/dLogit12 * dLogit12/dW22 + dL/dLogit22 * dLogit22/dW22 = dL/dLogit12 * h12 + dL/dLogit22 * h22
dL/dW31 = dL/dLogit11 * dLogit11/dW31 + dL/dLogit21 * dLogit21/dW31 = dL/dLogit11 * h13 + dL/dLogit21 * h23
dL/dW32 = dL/dLogit12 * dLogit12/dW32 + dL/dLogit22 * dLogit22/dW32 = dL/dLogit12 * h13 + dL/dLogit22 * h23

dL/dW = [[h11 h21],     [[dL/dLogit11 dL/dLogit12],
         [h12 h22],  @   [dL/dLogit21 dL/dLogit22]]
         [h13 h23]]

      = h^T @ dL/dLogit -------------> This is the final gradient. Note that it is a matrix multiplication, i.e. a projection of the logit gradient onto the hidden layer.
```
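If you want to sanity-check the result by machine, here is a minimal PyTorch sketch (my own, not part of the exercise) that compares the hand-derived formulas against autograd, assuming a dummy scalar loss of logits.sum() so that dL/dlogits is a matrix of ones:

```python
# Numerical check of the hand-derived gradients (a sketch, not from the thread).
import torch

h = torch.randn(2, 3, requires_grad=True)   # hidden activations, 2x3 as above
W = torch.randn(3, 2, requires_grad=True)   # weights, 3x2
b = torch.randn(2, requires_grad=True)      # bias, broadcast over rows

logits = h @ W + b
logits.retain_grad()      # keep dL/dlogits for the comparison below
L = logits.sum()          # any scalar loss works; sum() keeps the check simple
L.backward()

dlogits = logits.grad
# Hand-derived formulas from the comment above:
assert torch.allclose(h.grad, dlogits @ W.T)   # dL/dh = dL/dlogits @ W^T
assert torch.allclose(W.grad, h.T @ dlogits)   # dL/dW = h^T @ dL/dlogits
assert torch.allclose(b.grad, dlogits.sum(0))  # dL/db = column sums of dL/dlogits
```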


Nice blog. I'll be provocative/pedantic for no good reason and say that what's described isn't "calculus" per se, because you can't do calculus on discrete objects like a graph. However, you can define the derivative purely algebraically (as a linear operation which satisfies the Leibniz chain/product rule), which is more accurately what is being described.
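For what it's worth, a quick sketch (my wording, not the article's) of that purely algebraic definition: a derivation on an algebra is just a linear map satisfying the Leibniz rule, and on polynomials it reproduces the usual formal derivative without any limits.

```latex
% A derivation on an algebra A over a field k is a k-linear map D : A -> A with
\[
  D(f + g) = D(f) + D(g), \qquad D(c f) = c\, D(f), \qquad D(fg) = D(f)\,g + f\,D(g).
\]
% On k[x], linearity and the Leibniz rule pin down the formal derivative, e.g.
\[
  D(x^2) = D(x)\,x + x\,D(x) = 2x\,D(x), \qquad \text{and with } D(x) = 1,\quad D(x^n) = n x^{n-1}.
\]
```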


You’re not doing calculus on a graph- you’re using a graph algorithm to automate the derivative taking process.

Essentially, you transform your function into a “circuit” or just a graph with edge labels according to the relationship between parts of the expression. The circuit has the nice property that there is an algorithm you can run on it, with very simple rules, which gets you the derivative of the function used to create that circuit.

So taking the derivative becomes:

1. Transform function F into circuit C.

2. Run compute_gradient(C) to get the gradient of F.

Lots of useful examples here: https://cs231n.github.io/optimization-2/
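A minimal sketch of that circuit idea (in the spirit of Karpathy's micrograd, not the exact cs231n code; names here are my own):

```python
# A tiny scalar "circuit" with reverse-mode gradient computation (a sketch).
class Value:
    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents          # nodes feeding into this one
        self._local_grads = local_grads  # d(self)/d(parent) for each parent

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def backward(self):
        # Chain rule over the graph: walk nodes in reverse topological order,
        # pushing each node's gradient onto its parents.
        topo, seen = [], set()
        def build(v):
            if v not in seen:
                seen.add(v)
                for p in v._parents:
                    build(p)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            for p, g in zip(v._parents, v._local_grads):
                p.grad += v.grad * g

# F(x, y) = x*y + x  ->  dF/dx = y + 1, dF/dy = x
x, y = Value(3.0), Value(4.0)
F = x * y + x
F.backward()
print(x.grad, y.grad)  # 5.0 3.0
```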


That is a great example. It's rarely bad to be pedantic if it leads to better understanding!


If we're being pedantic, then there's also a more general definition of calculus, which is the first definition in Merriam-Webster: "a method of computation or calculation in a special notation (as of logic or symbolic logic)." One example of this is the lambda calculus. Differential and integral calculus are just special cases of this general definition.


Right, but this is about differential calculus (the chain rule).


>because you can't do calculus on discrete objects like a graph

Of course you can, what do you think shift operators and recurrence relations are? https://en.wikipedia.org/wiki/Finite_difference?#Calculus_of...
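A concrete illustration of that discrete calculus (my own sketch, not from the linked article): the forward difference operator plays the role of the derivative and obeys a shifted product (Leibniz-like) rule.

```python
# Finite differences on a discrete domain.
# Discrete product rule: D(f*g)(n) = f(n+1)*Dg(n) + Df(n)*g(n).
def D(f):
    return lambda n: f(n + 1) - f(n)

f = lambda n: n * n      # f(n) = n^2
g = lambda n: 2 * n + 1  # g(n) = 2n + 1
fg = lambda n: f(n) * g(n)

n = 5
assert D(fg)(n) == f(n + 1) * D(g)(n) + D(f)(n) * g(n)
print(D(f)(n))  # 11 = 2n + 1, the discrete analogue of d(n^2)/dn
```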


All fields are graphs.

We do calculus to predict behavior in fields.

We observe metrics and conservational symmetry (or not) over paths in fields.

Nonlinearity is approximated with backpropagation.

What are field operators (graph operators)?



That’s not what’s being done here.


I still don’t understand the process of learning ML. Like, sure, we build micrograd, but is it only a didactic exercise, or can we use it to train something serious on our own hardware?


I don’t understand this comment; for one, we’re engineers/hackers and should be curious how this stuff works. It’s exciting. Practically speaking, this is like asking why learn how to write a simple forum or blog when we can’t host Facebook on our own hardware: it’s going to be hard to work on the latest models if you don’t first understand the basics.


You learn what a DB is so you can use a DB for your own custom task.

What do I do with micrograd? Hang the code on my wall?


Yes, exactly. Learn this so you can then apply it to your own software.


This is like building a tiny C compiler. At some program scale, optimizations become important.


You can totally do some visual classification problems (like object detection) on current consumer hardware. Even more: you can also take some smaller existing language models and fine-tune them for some special task - also completely feasible.


>do something serious

I guess it depends on what you mean by serious. Pre-training a competitive LLM with current methods and consumer hardware is prohibitive for sure. Solving a classification problem could be totally doable depending on the domain.


You'll probably need some hardware acceleration. There's a good course that builds something like micrograd in the beginning and extends it: https://dlsyscourse.org/lectures/


I have a 3060 Ti, that enough?


8 GB of memory is a little restrictive but still usable for many problems, like visual classification. Training would take a little more time, though, as you can't do as much batch processing as would be possible with more memory.


Hardware acceleration should basically be worthwhile with any kind of hardware that has more TFLOP/s (and memory bandwidth) than your CPU.

I mean, enough for what?



