If you cannot predict the running time of an algorithm without running it on the target processor, then by definition the documentation of that processor is incomplete.
For a completely documented processor, it must be possible to run a simulation model that yields the running time of a given program, at least when the processor is so complex that simpler methods of computing the execution time do not work.
For older NVIDIA GPUs, there exist such simulation models, but they are only partially accurate, because they are based on reverse engineering, without cooperation from the GPU vendor.
The point being made is that in a production environment you can't run/simulate all the possible candidate implementations to find the fastest one -- it would take far longer than just picking one at random. So you need an algorithmic way of choosing a good candidate out of the many you have, and you can't take forever to make that selection either, because the clock is ticking the moment you receive a request to run that matrix multiply.
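To make that concrete, here is a minimal sketch of the kind of heuristic selection involved. The candidate list, the 132-SM figure, and the scoring formula are all invented for illustration -- not any vendor's actual heuristic:

```python
# Hypothetical example: pick a GEMM tile configuration for a request of
# shape (m, n, k) by scoring each candidate with a cheap analytic proxy
# instead of benchmarking it. Every number here is made up.

CANDIDATES = [
    {"tile_m": 64,  "tile_n": 64,  "tile_k": 32},
    {"tile_m": 128, "tile_n": 64,  "tile_k": 32},
    {"tile_m": 128, "tile_n": 128, "tile_k": 16},
    {"tile_m": 256, "tile_n": 64,  "tile_k": 16},
]

def score(cfg, m, n):
    # Toy proxy: penalize wasted work from partial edge tiles and reward
    # having enough tiles to keep every SM busy (132 SMs, as on an H100).
    tiles = -(-m // cfg["tile_m"]) * -(-n // cfg["tile_n"])  # ceil division
    waste = (tiles * cfg["tile_m"] * cfg["tile_n"]) / (m * n)
    occupancy = min(tiles / 132, 1.0)
    return occupancy / waste

def pick(m, n, k):
    # k is ignored by this toy proxy. Selection is a few arithmetic
    # operations per candidate: microseconds at dispatch time, versus
    # hours to actually benchmark every candidate.
    return max(CANDIDATES, key=lambda c: score(c, m, n))

print(pick(4096, 4096, 4096))
```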
You can't precompute all the possible options in advance and fetch the running time from a database either, because the parameter space is just way too huge.
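A quick back-of-the-envelope shows why. The dimension counts below are invented but conservative for a real GEMM kernel generator:

```python
# Rough size of a precomputed "best kernel" lookup table. All counts are
# illustrative guesses, not any particular library's search space.
tile_shapes  = 50    # (tile_m, tile_n, tile_k) combinations
stage_depths = 5     # software-pipelining depths
warp_layouts = 8     # ways warps split a tile
swizzles     = 4     # shared-memory layouts
epilogues    = 10    # bias, activation, dtype conversion variants
configs = tile_shapes * stage_depths * warp_layouts * swizzles * epilogues

# Timings also depend on problem shape: even a coarse (M, N, K) grid up
# to 16384 in steps of 256 gives 64**3 shapes.
shapes = 64 ** 3
print(f"{configs:,} configs x {shapes:,} shapes = {configs * shapes:,} entries")
# -> 80,000 configs x 262,144 shapes = 20,971,520,000 entries
```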
Notice that none of this has anything to do with having accurate models of the system. Empirical selection is what people who do this for a living, and who have perfect knowledge of the system, choose to do, for good reasons.
Nobody writing high-performance code for these machines has that documentation. They largely do OK anyway, because the era of cycle counting is long past and the name of the game is cache effects and synchronization: very hard to reason about individually, but clearly visible in aggregate. You don't get a cookie for accurately timing 10 instructions, but you do if your matrix multiply over a hundred million of them is 1% faster.
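As a rough illustration of "visible in aggregate", here is a sketch using numpy's matmul as a stand-in for a hand-tuned kernel:

```python
import time
import numpy as np

a = np.random.rand(1024, 1024).astype(np.float32)
b = np.random.rand(1024, 1024).astype(np.float32)

def bench(reps=200):
    a @ b  # warm-up: caches, BLAS thread pool, etc.
    t0 = time.perf_counter()
    for _ in range(reps):
        a @ b
    return (time.perf_counter() - t0) / reps

# Any single call is noisy -- timings of identical code commonly jitter
# by a few percent -- but the average over hundreds of calls is stable
# enough that a real 1% improvement shows up reliably.
print(f"{bench() * 1e3:.3f} ms per multiply, averaged")
```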