
None of these combined shift-and-add instructions need a full barrel shifter, though, do they? Usually they're selecting from 2-4 possible shift amounts, not 64 of them.


My suggestion was that all adds go through the first two (of six, on 64 bit) layers of the barrel shifter, not the whole barrel shifter.
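To make the "first two of six layers" idea concrete, here is a rough software model of a logarithmic barrel shifter. It is only a sketch: the function names and the `layers` parameter are made up for illustration, and real hardware does this with mux stages, not a loop. Stage k either passes the value through or shifts it by 2**k, so six stages cover shifts 0-63 and the first two stages cover shifts 0-3.

```python
MASK64 = (1 << 64) - 1  # keep every result within 64 bits


def barrel_shift_left(value, amount, layers=6):
    """Shift left by passing the value through `layers` mux stages.

    Stage k shifts by 2**k when bit k of `amount` is set, otherwise it
    passes the value through unchanged. Bits of `amount` at or above
    `layers` are ignored, just as they would be if those stages were absent.
    """
    for k in range(layers):
        if (amount >> k) & 1:
            value = (value << (1 << k)) & MASK64
    return value


def shift_add_two_layers(a, b, amount):
    """Hypothetical fast path: a + (b << amount), for amount in 0..3 only."""
    if not 0 <= amount <= 3:
        raise ValueError("two layers only cover shifts 0..3")
    return (a + barrel_shift_left(b, amount, layers=2)) & MASK64


if __name__ == "__main__":
    print(shift_add_two_layers(10, 3, 3))  # 10 + (3 << 3) = 34
```

With only two stages in the path of every add, a shift of 4 or more would have to go through the remaining stages some other way, which is the cost being weighed in this thread.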


Mostly. ARM actually has instructions that allow a full 64-bit shift then add.

But I checked the Cortex A78 optimisation manual. They take 1 cycle if the shift is 4 or less and 2 cycles in other cases.


0-4 shift not 0-3? That is a little bit weird.


Arm64 has fast 128-bit loads. Not just with NEON, but with regular integer instructions, you can quickly load 128 bits into a pair of 64-bit registers.

So it kind of makes sense to support a fast shift by four, since that scales an index by 16 bytes, the size of one such pair. Though, it's more likely they just profiled a bunch of code and decided a fast shift by four was worth budgeting for.
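One way to read the link between 128-bit loads and a shift by four: a pair of 64-bit registers holds 16 bytes, so the address of the i-th pair in an array is the base plus i shifted left by four. The sketch below just checks that arithmetic. `pair_address` is a hypothetical helper, not anything from the ARM documentation.

```python
PAIR_BYTES = 16  # two 64-bit registers, 128 bits in total


def pair_address(base, index):
    """Byte address of the index-th 16-byte element starting at `base`.

    Shifting left by 4 is the same as multiplying by 16, which is the
    shift-then-add form discussed above.
    """
    return base + (index << 4)


if __name__ == "__main__":
    print(hex(pair_address(0x1000, 3)))  # 0x1000 + 3 * 16 = 0x1030
```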



