Benchmarking is a mess everywhere. Sure you can get some level of accuracy but reproducing any kind of benchmark results across machines is impossible. That's why perf people focus on things like CPU cycles, heap size, cache access etc instead of time. Even with multiple runs and averaged out results you can only get a surface level idea of how your code is actually performing.