Important precision: the async part is absolutely not python specific, but comes from CUDA, indeed for performance, and you will have to use cuda events too in C++ to properly time it.
For ONNX the runtimes I know of are synchronous as we don't do each operation individually but whole models at once, there is no need for async, the timings should be correct.
Yes, it isn't python, it is... hardware. Not even CUDA specific. It is about memory moving around and optimization (remember, even the CPUs do speculative execution). I say a little more in the larger comment.
I'm less concerned about the CPU baseline and more concerned about the NPU timing. Especially given the other issues
For ONNX the runtimes I know of are synchronous as we don't do each operation individually but whole models at once, there is no need for async, the timings should be correct.