What do you think of the RISC-V argument that a flags register is hard to make fast for a superscalar ooo core, and thus the RISC-V choice of two-register branching instructions is better than the traditional choice of a jmp instruction that reads the flags?

This also avoids the need to macro fuse cmp+jmp, but OTOH eats quite severely into the destination offset.




I think it's smart: most branches are short, and almost all important branches are short (e.g. loops). The flags register adds an extra implicit input/output everywhere, and renaming is expensive enough without that.

However, given they are already paying the cost of a variable-length instruction set (for compression), I think a wide set of reg/reg, reg/imm8, reg/imm16, reg/imm32 compare+jumps would have been even better. Right now compare+jump against an imm32 is 3 instructions (4 if it's long) vs 2 for x86; that number could have been 1 (given variable length).
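To make the "3 instructions" count concrete: on RV32 a full 32-bit constant normally has to be built with `lui` plus `addi` before a two-register branch can compare against it. Below is a rough sketch of that split, assuming the usual hi20/lo12 rounding trick that compensates for `addi` sign-extending its 12-bit immediate. The function names, the scratch register and the label are made up for illustration.

```python
def split_imm32(imm):
    """Split a 32-bit constant into a lui part (20 bits) and an addi part (signed 12 bits)."""
    imm &= 0xFFFFFFFF
    # Round up when the low 12 bits will be treated as negative by addi.
    hi20 = ((imm + 0x800) >> 12) & 0xFFFFF
    lo12 = imm & 0xFFF
    if lo12 >= 0x800:  # addi sign-extends its immediate
        lo12 -= 0x1000
    return hi20, lo12


def compare_and_branch_imm32(reg, imm, label, scratch="t0"):
    """Illustrative RV32 sequence for 'branch to label if reg != imm'."""
    hi20, lo12 = split_imm32(imm)
    return [
        f"lui  {scratch}, {hi20:#x}",
        f"addi {scratch}, {scratch}, {lo12}",
        f"bne  {reg}, {scratch}, {label}",
    ]


if __name__ == "__main__":
    for line in compare_and_branch_imm32("a0", 0x12345FFF, "target"):
        print(line)
```

The x86 counterpart would be a `cmp` with a 32-bit immediate followed by a conditional jump, which is where the "2 for x86" comes from.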

Personally I think only two good choices exist for instruction sets, fully fixed size at 32b or variable with 2b control (max 2 inputs and 1 output reg!) followed by 0b, 2b, 4b, etc data (and thus very powerful instructions like the aforementioned). Anything else seems to be off the pareto curve. Goodbye cmov.


Flags don't have to add an extra implicit input/output everywhere. Both ARM and PowerPC avoid updating the flags unless explicitly requested.

Besides fixed-size instructions and the traditional variable-size instructions, one can do variable-size instructions in bundles. An example would be 25-bit and 50-bit instructions packed into 128-bit bundles, with the remaining 3 bits used to specify all the sizes. (eight patterns: nnnnn, nww, wnw, wwn, nnnw, nnwn, nwnn, wnnn) Extending that out to a typical cache line of 512 bits might be better. Another option is to use 1 of every 16 bits to indicate where instructions start.
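A quick way to check the 3-bit budget: with 125 payload bits there are five 25-bit slots, a narrow instruction takes one slot and a wide one takes two. Enumerating every way to fill exactly five slots yields eight patterns, so 3 selector bits are enough. The sketch below does that enumeration and decodes a bundle; the bit layout (selector in the top 3 bits, instructions from most to least significant) and the selector numbering are assumptions for illustration, not anything specified above.

```python
def bundle_patterns(slots=5):
    """All ways to fill `slots` 25-bit slots with n (1 slot) and w (2 slots) instructions."""
    if slots == 0:
        return [""]
    patterns = ["n" + rest for rest in bundle_patterns(slots - 1)]
    if slots >= 2:
        patterns += ["w" + rest for rest in bundle_patterns(slots - 2)]
    return patterns


def decode_bundle(bundle, patterns):
    """Split a 128-bit integer into (kind, bits) fields.

    Assumed layout (hypothetical): the top 3 bits index into `patterns`,
    and the instructions follow from most to least significant bit.
    """
    selector = bundle >> 125
    shift = 125
    fields = []
    for kind in patterns[selector]:
        width = 25 if kind == "n" else 50
        shift -= width
        fields.append((kind, (bundle >> shift) & ((1 << width) - 1)))
    return fields


if __name__ == "__main__":
    pats = bundle_patterns()
    print(len(pats), pats)
```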

Where RISC-V got wasteful was the registers. Compilers are seldom able to use anywhere near 32 registers. On normal code, normal compilers seem to need about 8 to 10 registers free after deducting the ones reserved by the ABI. The ABI might need 3 to 5 registers. (stack, PLT, GOT, TLS, etc.) That means that roughly 11 to 15 registers are needed. Clearly, 4 bits (16 registers) is enough. Shoving some of those ABI-reserved registers out of the general-purpose set wouldn't be a bad idea; most of those are just used for addressing.


> Flags don't have to add an extra implicit input/output everywhere. Both ARM and PowerPC avoid updating the flags unless explicitly requested.

Well, ultimately they do. Not updating the flags means more options for the compiler (things can be scheduled in between the compare and jump), although with cmp+jmp fusion that's now a bad idea, which makes the concept dated.

Ultimately each instruction pending execution in an OOO core needs to sit somewhere waiting for its inputs to be available. If you are x86 and you support cmov then potentially you need to wait for 3 registers and flags, meaning every slot in this structure needs to be capable of waiting for 4 things to happen before becoming ready. In RISC-V you only need to wait for 2 things for any instruction.
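As a toy model of that argument (not how any real core is built): each waiting instruction holds a set of source tags it still needs, and every slot has to be sized for the widest instruction it might hold. The class and function names are invented for this sketch, and the source counts are simply the ones claimed above.

```python
class SchedulerEntry:
    """One waiting instruction: a name plus the source tags not yet produced."""

    def __init__(self, name, sources):
        self.name = name
        self.pending = set(sources)

    @property
    def ready(self):
        return not self.pending

    def wakeup(self, tag):
        """A result tagged `tag` was broadcast; returns True once nothing is pending."""
        self.pending.discard(tag)
        return self.ready


def comparators_needed(entries):
    """Every slot must be able to track the largest source count in the ISA."""
    return max(len(e.pending) for e in entries)


if __name__ == "__main__":
    # Source counts as stated in the comment: 3 registers + flags vs. 2 registers.
    cmov = SchedulerEntry("x86 cmov", ["p1", "p2", "p3", "flags"])
    bne = SchedulerEntry("riscv bne", ["p4", "p5"])
    print(comparators_needed([cmov]), comparators_needed([bne]))
```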


> Flags don't have to add an extra implicit input/output everywhere. Both ARM and PowerPC avoid updating the flags unless explicitly requested.

You mean things like having variants of common arithmetic instructions that update or don't update flags?

> Besides fixed-size instructions and the traditional variable-size instructions, one can do variable-size instructions in bundles. An example would be 25-bit and 50-bit instructions packed into 128-bit bundles, with the remaining 3 bits used to specify all the sizes. (eight patterns: nnnnn, nww, wnw, wwn, nnnw, nnwn, nwnn, wnnn) Extending that out to a typical cache line of 512 bits might be better. Another option is to use 1 of every 16 bits to indicate where instructions start.

Yeah, something like that could be nice. Though how would jump instructions be encoded? Bundle + offset within bundle?

> Where RISC-V got wasteful was the registers. Compilers are seldom able to use anywhere near 32 registers. On normal code, normal compilers seem to need about 8 to 10 registers free after deducting the ones reserved by the ABI. The ABI might need 3 to 5 registers. (stack, PLT, GOT, TLS, etc.) That means that roughly 11 to 15 registers are needed. Clearly, 4 bits (16 registers) is enough. Shoving some of those ABI-reserved registers out of the general-purpose set wouldn't be a bad idea; most of those are just used for addressing.

Nah, I think 32 registers was a good choice. (Relatively) common loop optimizations like unrolling or pipelining need more registers. Also, some of those registers are callee saved and some are call clobbered; by making use of this information the compiler can avoid spilling and reloading of registers around function calls.

For x86-64 16 registers is fine, partly because in many cases one can operate directly on memory without needing to explicitly load/store to architectural registers, and partly because the target was and is OoO cores that aren't as dependent on those register-consuming compiler optimizations.


It is common to have a bit which causes an instruction to update flag bits. PowerPC arithmetic instructions have an "Rc" field, usually the LSB, indicated by a trailing "." in the assembly syntax. ARM arithmetic instructions have an "S" field, usually bit 20, indicated by a trailing "S" in the assembly syntax.
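For anyone who wants to poke at this, here is a small sketch that tests those two bits on raw 32-bit instruction words. It only makes sense for instruction forms that actually have the field (hence the "usually"). The helper names are made up, and the example encodings were worked out by hand from the field layouts, so treat them as illustrative rather than authoritative.

```python
def ppc_has_rc(insn):
    """PowerPC: Rc is the least significant bit of the 32-bit instruction word."""
    return bool(insn & 1)


def arm_has_s(insn):
    """32-bit ARM data-processing instructions: S is bit 20."""
    return bool((insn >> 20) & 1)


# Hand-assembled examples (assumed correct, not checked against an assembler).
PPC_ADD = 0x7C642A14      # add  r3, r4, r5
PPC_ADD_DOT = 0x7C642A15  # add. r3, r4, r5
ARM_ADD = 0xE0810002      # ADD  r0, r1, r2
ARM_ADDS = 0xE0910002     # ADDS r0, r1, r2

if __name__ == "__main__":
    print(ppc_has_rc(PPC_ADD), ppc_has_rc(PPC_ADD_DOT))
    print(arm_has_s(ARM_ADD), arm_has_s(ARM_ADDS))
```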

Bundle + offset is fine. The offsets don't need to be real. In the example given with 25-bit and 50-bit instructions, allowable low nibbles of instruction addresses might be: 0 1 2 3 4 (so it goes 0x77777773, 0x77777774, 0x77777780, 0x77777781, etc.)
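A minimal sketch of stepping through such pseudo-addresses, assuming the low nibble is the 25-bit slot index (0 to 4) and the remaining bits select the 16-byte bundle. The function names and the `width_slots` parameter (1 for a narrow instruction, 2 for a wide one) are illustrative choices, not part of the scheme as described.

```python
SLOTS_PER_BUNDLE = 5  # five 25-bit slots in a 128-bit bundle


def next_slot_address(addr, width_slots=1):
    """Advance a 'bundle << 4 | slot' pseudo-address past one instruction."""
    bundle, slot = addr >> 4, addr & 0xF
    if slot >= SLOTS_PER_BUNDLE:
        raise ValueError(f"invalid slot nibble {slot:#x}")
    slot += width_slots
    if slot > SLOTS_PER_BUNDLE:
        raise ValueError("instruction would cross a bundle boundary")
    if slot == SLOTS_PER_BUNDLE:  # fell off the end: first slot of the next bundle
        bundle, slot = bundle + 1, 0
    return (bundle << 4) | slot


def bundle_byte_address(addr):
    """Real memory address of the 16-byte bundle holding this slot."""
    return addr & ~0xF


if __name__ == "__main__":
    a = 0x77777773
    for _ in range(4):
        print(hex(a))
        a = next_slot_address(a)
```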

I disassemble binary executables as my full-time job. I've dealt with over a dozen different architectures. I commonly deal with PowerPC, ARM, MIPS, x86-64, and ColdFire. The extra registers of PowerPC and MIPS are always wasted. Even with ARM and x86-64, unused registers are the norm. It simply isn't normal for a compiler to be able to make effective use of lots of registers. Surely there is an example somewhere that I haven't yet seen, but that would be highly abnormal code.

If more registers could be used by compilers, the Itanium would have been a success.


I'm not an expert so pardon any ignorance, but couldn't compilers be acting conservatively about registers due to the "long shadow" of x86? Perhaps the modest increase of 8 registers for x64 didn't cause compiler developers to ever start considering registers as a generally abundant resource, thus constraining their designs.

> If more registers could be used by compilers, the Itanium would have been a success.

I kind of feel the Itanic never made it far enough for its register count to have mattered to anyone. I wonder if SPARC would be a better comparison... it was somewhat popular in the 90's and 00's, and was a RISC chip with oodles of registers, wasn't it?


What about something like having fixed 64-bit instruction bundles, where the first 4 bits indicate if it contains 4x15-bit, 3x20-bit, or 2x30-bit instructions? This limits the combinatorial explosion in cases hardware needs to handle for variable-width decoding (effects don't cross 64-bit boundaries). One could reserve some of those 4-bit patterns for indicating instruction dependencies (a la EPIC/Itanium) and/or indicating extended functional units with completely different opcodes (similar to EPIC/Itanium opcodes having completely different meaning if marked as FPU instructions).

Presumably some of the 4-bit prefix patterns would indicate that the final instruction in the bundle was an immediate value instead of an instruction.
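Roughly, the decode side could look like the sketch below. The prefix values in `LAYOUTS` are arbitrary placeholders, since the idea as stated does not assign any; unlisted prefixes stand in for the reserved patterns (dependency hints, immediates, extended units).

```python
# Hypothetical prefix assignments: prefix -> instruction widths in bits.
LAYOUTS = {
    0b0000: (15, 15, 15, 15),
    0b0001: (20, 20, 20),
    0b0010: (30, 30),
}


def decode_bundle64(bundle):
    """Split a 64-bit bundle into instruction fields based on its 4-bit prefix."""
    prefix = (bundle >> 60) & 0xF
    widths = LAYOUTS.get(prefix)
    if widths is None:
        raise ValueError(f"reserved prefix {prefix:#06b}")
    shift = 60  # the payload starts right below the prefix
    fields = []
    for width in widths:
        shift -= width
        fields.append((bundle >> shift) & ((1 << width) - 1))
    return fields


if __name__ == "__main__":
    print(decode_bundle64((0b0001 << 60) | (1 << 40) | (2 << 20) | 3))
```

Since each layout uses exactly 60 payload bits, the decoder never has to look past the 64-bit boundary, which is the point made above about limiting the cases hardware must handle.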

Alternatively, one could take a page from Sun's MAJC playbook and shorten the opcode of the first instruction in the 64-bit bundle by 2 bits to indicate 4x16-bit, 2x32-bit, 16+16+32-bit or 32+16+16-bit instructions in the bundle.


You can also fold a cjmp+4 with a subsequent jump, though that's tough to do before branch prediction.


It's an obvious mistake but macro op fusion is a "good enough" solution for x86. The type of macro op fusion that RISC-V needs is far more complicated.



