On the CH32V003, a load should be two cycles if the code is executed from SRAM, there are additional wait states for load from flash. The V2A does only cache a single 32 bit instruction word, so there is basically no cache.
This publication seems to describe more details on the arduino implementation:
It appears that the code is even using floats in some implementations, which have to be emulated. So I'd wager that both on algorithmic level (QAT-NN) and implementation level there are some discrepancies that lead to better performance on the CH32V003.
This publication seems to describe more details on the arduino implementation:
https://arxiv.org/abs/2105.02953
It appears that the code is even using floats in some implementations, which have to be emulated. So I'd wager that both on algorithmic level (QAT-NN) and implementation level there are some discrepancies that lead to better performance on the CH32V003.