If you cannot predict the running time of an algorithm without running it on the target processor, then by definition the documentation of that processor is incomplete.
For a completely documented processor, it must be possible to run a simulation model that yields the running time of a given program, at least when the processor is so complex that simpler methods of computing the execution time do not work.
For older NVIDIA GPUs, there exist such simulation models, but they are only partially accurate, because they are based on reverse engineering, without cooperation from the GPU vendor.
The point being made is that in a production environment you can't run/simulate all the possible candidate implementations to find the fastest one -- it would take far longer than just picking one at random. So you need an algorithmic way of choosing a good candidate out of the many you have, and you can't take forever to make that selection either, because the clock is ticking the moment you receive a request to run that matrix multiply.
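To make that concrete, here is a minimal sketch of the kind of heuristic selection involved. The candidate list, the 132-SM figure, and the scoring formula are all invented for illustration -- not any vendor's actual heuristic:

```python
# Hypothetical example: pick a GEMM tile configuration for a request of
# shape (m, n, k) by scoring each candidate with a cheap analytic proxy
# instead of benchmarking it. Every number here is made up.

CANDIDATES = [
    {"tile_m": 64,  "tile_n": 64,  "tile_k": 32},
    {"tile_m": 128, "tile_n": 64,  "tile_k": 32},
    {"tile_m": 128, "tile_n": 128, "tile_k": 16},
    {"tile_m": 256, "tile_n": 64,  "tile_k": 16},
]

def score(cfg, m, n):
    # Toy proxy: penalize wasted work from partial edge tiles and reward
    # having enough tiles to keep every SM busy (132 SMs, as on an H100).
    tiles = -(-m // cfg["tile_m"]) * -(-n // cfg["tile_n"])  # ceil division
    waste = (tiles * cfg["tile_m"] * cfg["tile_n"]) / (m * n)
    occupancy = min(tiles / 132, 1.0)
    return occupancy / waste

def pick(m, n, k):
    # k is ignored by this toy proxy. Selection is a few arithmetic
    # operations per candidate: microseconds at dispatch time, versus
    # hours to actually benchmark every candidate.
    return max(CANDIDATES, key=lambda c: score(c, m, n))

print(pick(4096, 4096, 4096))
```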
You can't precompute all the possible options in advance and fetch the running time from a database either, because the parameter space is just way too huge.
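A quick back-of-the-envelope shows why. The dimension counts below are invented but conservative for a real GEMM kernel generator:

```python
# Rough size of a precomputed "best kernel" lookup table. All counts are
# illustrative guesses, not any particular library's search space.
tile_shapes  = 50    # (tile_m, tile_n, tile_k) combinations
stage_depths = 5     # software-pipelining depths
warp_layouts = 8     # ways warps split a tile
swizzles     = 4     # shared-memory layouts
epilogues    = 10    # bias, activation, dtype conversion variants
configs = tile_shapes * stage_depths * warp_layouts * swizzles * epilogues

# Timings also depend on problem shape: even a coarse (M, N, K) grid up
# to 16384 in steps of 256 gives 64**3 shapes.
shapes = 64 ** 3
print(f"{configs:,} configs x {shapes:,} shapes = {configs * shapes:,} entries")
# -> 80,000 configs x 262,144 shapes = 20,971,520,000 entries
```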
Notice that none of this has anything to do with having accurate models of the system. Empirical selection is what people who do this for a living, and who have perfect knowledge of the system, choose to do, for good reasons.
Nobody writing high-performance code for these machines has that documentation. They largely do OK anyway, because the era of cycle counting is long past and the name of the game is cache effects and synchronization: very hard to reason about individually, but clearly visible in aggregate. You don't get a cookie for accurately timing 10 instructions, but you do if your matrix multiply over a hundred million of them is 1% faster.
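As a rough illustration of "visible in aggregate", here is a sketch using numpy's matmul as a stand-in for a hand-tuned kernel:

```python
import time
import numpy as np

a = np.random.rand(1024, 1024).astype(np.float32)
b = np.random.rand(1024, 1024).astype(np.float32)

def bench(reps=200):
    a @ b  # warm-up: caches, BLAS thread pool, etc.
    t0 = time.perf_counter()
    for _ in range(reps):
        a @ b
    return (time.perf_counter() - t0) / reps

# Any single call is noisy -- timings of identical code commonly jitter
# by a few percent -- but the average over hundreds of calls is stable
# enough that a real 1% improvement shows up reliably.
print(f"{bench() * 1e3:.3f} ms per multiply, averaged")
```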