Calculus on Computational Graphs: Backpropagation (2015) (colah.github.io)
170 points by throwup238 on Jan 19, 2024 | 24 comments


If you want more on this, check out this paper by Terence Parr and Jeremy Howard: https://arxiv.org/abs/1802.01528

It is rather accessible.


Discussed at the time:

Calculus on Computational Graphs: Backpropagation - https://news.ycombinator.com/item?id=10148064 - Aug 2015 (9 comments)


Or do Andrej Karpathy's YouTube "backprop ninja" exercise in a Jupyter notebook, by hand, no?


Protip here: do it by hand, though when you start getting tired of typing LaTeX you can switch to an IDE and let GitHub Copilot complete it (it will mostly be incorrect), then go in and fix its mistakes. It still saves a bunch of time. For example:

```
The important thing is that each derivative is accumulated from a number of elements.

logits = h @ W + b

h = [[h11 h12 h13],
     [h21 h22 h23]]

W = [[w11 w12],
     [w21 w22],
     [w31 w32]]

b = [b1, b2]

Logit11 = h11*w11 + h12*w21 + h13*w31 + b1   - Eq.1
Logit12 = h11*w12 + h12*w22 + h13*w32 + b2   - Eq.2
Logit21 = h21*w11 + h22*w21 + h23*w31 + b1   - Eq.3
Logit22 = h21*w12 + h22*w22 + h23*w32 + b2   - Eq.4

dL/dh11 = dL/dLogit11 * dLogit11/dh11 + dL/dLogit12 * dLogit12/dh11 = dL/dLogit11 * w11 + dL/dLogit12 * w12
dL/dh12 = dL/dLogit11 * dLogit11/dh12 + dL/dLogit12 * dLogit12/dh12 = dL/dLogit11 * w21 + dL/dLogit12 * w22
dL/dh13 = dL/dLogit11 * dLogit11/dh13 + dL/dLogit12 * dLogit12/dh13 = dL/dLogit11 * w31 + dL/dLogit12 * w32
dL/dh21 = dL/dLogit21 * dLogit21/dh21 + dL/dLogit22 * dLogit22/dh21 = dL/dLogit21 * w11 + dL/dLogit22 * w12
dL/dh22 = dL/dLogit21 * dLogit21/dh22 + dL/dLogit22 * dLogit22/dh22 = dL/dLogit21 * w21 + dL/dLogit22 * w22
dL/dh23 = dL/dLogit21 * dLogit21/dh23 + dL/dLogit22 * dLogit22/dh23 = dL/dLogit21 * w31 + dL/dLogit22 * w32

dL/dh = [[dL/dh11 dL/dh12 dL/dh13],    [[dL/dLogit11 dL/dLogit12],     [[w11 w21 w31],
         [dL/dh21 dL/dh22 dL/dh23]] =   [dL/dLogit21 dL/dLogit22]]  @   [w12 w22 w32]]

      = dL/dLogit @ W^T -------------> This is the final gradient. Note that it is a matrix multiplication, i.e. a projection of the logit gradient onto the weight matrix.

Now let's compute dL/dW:

dL/dW = dL/dLogit * dLogit/dW

dL/dW11 = dL/dLogit11 * dLogit11/dW11 + dL/dLogit21 * dLogit21/dW11 = dL/dLogit11 * h11 + dL/dLogit21 * h21
dL/dW12 = dL/dLogit12 * dLogit12/dW12 + dL/dLogit22 * dLogit22/dW12 = dL/dLogit12 * h11 + dL/dLogit22 * h21
dL/dW21 = dL/dLogit11 * dLogit11/dW21 + dL/dLogit21 * dLogit21/dW21 = dL/dLogit11 * h12 + dL/dLogit21 * h22
dL/dW22 = dL/dLogit12 * dLogit12/dW22 + dL/dLogit22 * dLogit22/dW22 = dL/dLogit12 * h12 + dL/dLogit22 * h22
dL/dW31 = dL/dLogit11 * dLogit11/dW31 + dL/dLogit21 * dLogit21/dW31 = dL/dLogit11 * h13 + dL/dLogit21 * h23
dL/dW32 = dL/dLogit12 * dLogit12/dW32 + dL/dLogit22 * dLogit22/dW32 = dL/dLogit12 * h13 + dL/dLogit22 * h23

dL/dW = [[h11 h21],     [[dL/dLogit11 dL/dLogit12],
         [h12 h22],  @   [dL/dLogit21 dL/dLogit22]]
         [h13 h23]]

      = h^T @ dL/dLogit -------------> This is the final gradient. Note that it is a matrix multiplication, i.e. a projection of the logit gradient onto the hidden layer.
```
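If you want to sanity-check the result by machine, here is a minimal PyTorch sketch (my own, not part of the exercise) that compares the hand-derived formulas against autograd, assuming a dummy scalar loss of logits.sum() so that dL/dlogits is a matrix of ones:

```python
# Numerical check of the hand-derived gradients (a sketch, not from the thread).
import torch

h = torch.randn(2, 3, requires_grad=True)   # hidden activations, 2x3 as above
W = torch.randn(3, 2, requires_grad=True)   # weights, 3x2
b = torch.randn(2, requires_grad=True)      # bias, broadcast over rows

logits = h @ W + b
logits.retain_grad()      # keep dL/dlogits for the comparison below
L = logits.sum()          # any scalar loss works; sum() keeps the check simple
L.backward()

dlogits = logits.grad
# Hand-derived formulas from the comment above:
assert torch.allclose(h.grad, dlogits @ W.T)   # dL/dh = dL/dlogits @ W^T
assert torch.allclose(W.grad, h.T @ dlogits)   # dL/dW = h^T @ dL/dlogits
assert torch.allclose(b.grad, dlogits.sum(0))  # dL/db = column sums of dL/dlogits
```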


Nice blog. I'll be provocative/pedantic for no good reason and say that what's described isn't "calculus" per se, because you can't do calculus on discrete objects like a graph. However, you can define the derivative purely algebraically (as a linear operation which satisfies the Leibniz chain/product rule), which is more accurately what is being described.
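For what it's worth, a quick sketch (my wording, not the article's) of that purely algebraic definition: a derivation on an algebra is just a linear map satisfying the Leibniz rule, and on polynomials it reproduces the usual formal derivative without any limits.

```latex
% A derivation on an algebra A over a field k is a k-linear map D : A -> A with
\[
  D(f + g) = D(f) + D(g), \qquad D(c f) = c\, D(f), \qquad D(fg) = D(f)\,g + f\,D(g).
\]
% On k[x], linearity and the Leibniz rule pin down the formal derivative, e.g.
\[
  D(x^2) = D(x)\,x + x\,D(x) = 2x\,D(x), \qquad \text{and with } D(x) = 1,\quad D(x^n) = n x^{n-1}.
\]
```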


You’re not doing calculus on a graph- you’re using a graph algorithm to automate the derivative taking process.

Essentially, you transform your function into a “circuit” or just a graph with edge labels according to the relationship between parts of the expression. The circuit has the nice property that there is an algorithm you can run on it, with very simple rules, which gets you the derivative of the function used to create that circuit.

So taking the derivative becomes:

1. Transform function F into circuit C.

2. Run compute_gradient(C) to get the gradient of F.

Lots of useful examples here: https://cs231n.github.io/optimization-2/
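A minimal sketch of that circuit idea (in the spirit of Karpathy's micrograd, not the exact cs231n code; names here are my own):

```python
# A tiny scalar "circuit" with reverse-mode gradient computation (a sketch).
class Value:
    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents          # nodes feeding into this one
        self._local_grads = local_grads  # d(self)/d(parent) for each parent

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def backward(self):
        # Chain rule over the graph: walk nodes in reverse topological order,
        # pushing each node's gradient onto its parents.
        topo, seen = [], set()
        def build(v):
            if v not in seen:
                seen.add(v)
                for p in v._parents:
                    build(p)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            for p, g in zip(v._parents, v._local_grads):
                p.grad += v.grad * g

# F(x, y) = x*y + x  ->  dF/dx = y + 1, dF/dy = x
x, y = Value(3.0), Value(4.0)
F = x * y + x
F.backward()
print(x.grad, y.grad)  # 5.0 3.0
```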


That is a great example. It's rarely bad to be pedantic if it leads to better understanding!


If we're being pedantic, then there's also a more general definition of calculus, which is the first definition in Merriam-Webster: "a method of computation or calculation in a special notation (as of logic or symbolic logic)." One example of this is the lambda calculus. Differential and integral calculus are just special cases of this general definition.


Right, but this is about differential calculus (the chain rule).


>because you can't do calculus on discrete objects like a graph

Of course you can, what do you think shift operators and recurrence relations are? https://en.wikipedia.org/wiki/Finite_difference?#Calculus_of...
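A concrete illustration of that discrete calculus (my own sketch, not from the linked article): the forward difference operator plays the role of the derivative and obeys a shifted product (Leibniz-like) rule.

```python
# Finite differences on a discrete domain.
# Discrete product rule: D(f*g)(n) = f(n+1)*Dg(n) + Df(n)*g(n).
def D(f):
    return lambda n: f(n + 1) - f(n)

f = lambda n: n * n      # f(n) = n^2
g = lambda n: 2 * n + 1  # g(n) = 2n + 1
fg = lambda n: f(n) * g(n)

n = 5
assert D(fg)(n) == f(n + 1) * D(g)(n) + D(f)(n) * g(n)
print(D(f)(n))  # 11 = 2n + 1, the discrete analogue of d(n^2)/dn
```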


All fields are graphs.

We do calculus to predict behavior in fields.

We observe metrics and conservational symmetry (or not) over paths in fields.

Nonlinearity is approximated with backpropagation.

What are field operators (graph operators)?



That’s not what’s being done here.


I still don’t understand the process of learning ML. Like, sure, we build micrograd, but is it only a didactic exercise, or can we use it to train something serious on our own hardware?


I don’t understand this comment; for one, we’re engineers/hackers and should be curious how this stuff works. It’s exciting. Practically speaking, this is like asking why learn how to write a simple forum or blog when we can’t host Facebook on our own hardware: it’s going to be hard to work on the latest models if you don’t first understand the basics.


You learn what a DB is so you can use a DB for your own custom task.

What do I do with micrograd? Hang the code on my wall?


Yes, exactly. Learn this so you can then apply it to your own software.


This is like building a tiny C compiler. At some program scale, optimizations become important.


You can totally do some visual classification problems (like object detection) on current consumer hardware. Even more: you can also take some smaller existing language models and fine-tune them for some special task - also completely feasible.


>do something serious

I guess it depends on what you mean by serious. Pre-training a competitive LLM with current methods and consumer hardware is prohibitive for sure. Solving a classification problem could be totally doable depending on the domain.


You'll probably need some hardware acceleration. There's a good course that builds something like micrograd in the beginning and extends it: https://dlsyscourse.org/lectures/


I have a 3060 Ti, that enough?


8 GB of memory is a little restrictive but still usable for many problems, like visual classification. Training would take a little more time, though, as you can't do as much batch processing as would be possible with more memory.


Hardware acceleration should basically be worthwhile with any kind of hardware that has more TFLOP/s (and memory bandwidth) than your CPU.

I mean, enough for what?



