Hacker Newsnew | past | comments | ask | show | jobs | submitlogin
The Genius of RISC-V Microprocessors (erik-engheim.medium.com)
160 points by socialdemocrat on Dec 27, 2020 | hide | past | favorite | 158 comments



I find it interesting that, in one aspect, ARM went into the opposite direction with its newer 64-bit ISA. Like with RISC-V, the designers of the 64-bit ARM ISA "knew about instruction compression and macro-ops fusion when they began designing their ISA" (32-bit ARM already had the alternative compressed ISAs Thumb/Thumb2/ThumbEE); however, they chose to use fixed-size 32-bit instructions without any compressed format. I suspect that they did it expecting that, once compatibility with the 32-bit ARM ISAs is removed on newer processors, the fixed-size instructions would allow for very wide decoders, like we've seen with Apple M1 (which can decode 8 instructions in parallel). It remains to be seen whether the RISC-V choice (the first two bits of every instruction are enough to determine the instruction length) will allow for a similarly wide decoder.


Macro-op fusion is something that is really quite painful to implement in the front-end. Essentially once you have both macro-op fusion and potentially multiple micro-ops for a single instruction you have fully variable length instructions. Potentially it becomes worse than x86, where macro-op fusion has no need to be pervasive, you just have to cover very few simple cases (eg. cmp+jmp). For RISC-V even trivial things like 'load a 32 bit constant' need macro-op fusion to be competitive with x86.

An 8 wide front-end for ARM64 needs to run a decoder every 4 bytes (so 8 in total), then combine the results. It's an entirely trivial affair.

For RISC-V, without any macro-op fusion, you first need a unit every 2-bytes that determines the instruction length using the first 2 bits (16 of these), then you need a very wide mux to expand out any compressed instructions (the last one can come from 7 or 8 different offsets) so you have 8x 32b instructions, then you need your 8 decoders. It's looking much more unpleasant because of that very wide muxing (8x 8:1 32-bit muxes is not a pleasant sight). Alternatively you could do it the brute force way and have 16 decoders followed by a 16->8 reduction layer (ignoring the decode results that were at bad offsets). The designer is going to have a hard time choosing between those two bad options (personally I would go with the latter I think).

Now let's throw in macro-op fusion, suddenly rather than have two possible lengths (16b or 32b) you can have 16b+16b, 32b+32b, 16b+32b, 32b+16b. Complexity just explodes. If anybody really tried to implement a 8 wide RISC-V front-end with pervasive macro-op fusion then it would be a truly nasty thing.


What do you think of the RISC-V argument that a flags register is hard to make fast for a superscalar ooo core, and thus the RISC-V choice of two-register branching instructions is better than the traditional choice of a jmp instruction that reads the flags?

This also avoids the need to macro fuse cmp+jmp, but OTOH eats quite severely into the destination offset.


I think it’s smart, most branches are short and almost all important branches are short (eg. loops). The flags register adds an extra implicit input/output everywhere, and renaming is expensive enough without that.

However given they are already paying the cost of a variable length instruction set (for compression) I think a wide set of reg/reg, reg/imm8, reg/imm16, reg/imm32 compare+jumps would have been even better. Right now compare+jump against an imm32 is 3 instructions (4 if it’s long) vs 2 for x86, that number could have been 1 (given variable length).

Personally I think only two good choices exist for instruction sets, fully fixed size at 32b or variable with 2b control (max 2 inputs and 1 output reg!) followed by 0b, 2b, 4b, etc data (and thus very powerful instructions like the aforementioned). Anything else seems to be off the pareto curve. Goodbye cmov.


Flags don't have to add an extra implicit input/output everywhere. Both ARM and PowerPC avoid updating the flags unless explicitly requested.

Besides fixed-size instructions and the traditional variable-size instructions, one can do variable-size instructions in bundles. An example would be 25-bit and 50-bit instructions packed into 128-bit bundles, with the remaining 3 bits used to specify all the sizes. (eight patterns: nnnnn, nww, wnw, wwn, nnnw, wwnw, wnww, wnnn) Extending that out to a typical cache line of 512 bits might be better. Another option is to use 1 of every 16 bits to indicate where instructions start.

Where RISC-V got wasteful was the registers. Compilers are seldom able to use anywhere near 32 registers. On normal code, normal compilers seem to need about 8 to 10 registers free after deducting the ones reserved by the ABI. The ABI might need 3 to 5 registers. (stack, PLT, GOT, TLS, etc.) That means that roughly 11 to 15 registers are needed. Clearly, 4 bits (16 registers) is enough. Shoving some of those ABI-reserved registers out of the general-purpose set wouldn't be a bad idea; most of those are just used for addressing.


> Flags don't have to add an extra implicit input/output everywhere. Both ARM and PowerPC avoid updating the flags unless explicitly requested.

Well ultimately they do, not updating the flags means more options for the compiler (things can be scheduled in between the compare and jump), although with cmp+jmp fusion that's now a bad idea making the concept dates.

Ultimately each instruction pending execution in an OOO core needs to sit somewhere waiting for its inputs to be available. If you are x86 and you suggest cmov then potentially you need to wait for 3 registers and flags, meaning every slot in this structure needs to be capable of waiting for 4 things to happen before becoming ready. In RISC-V you only need to wait for 2 things for any instruction.


> Flags don't have to add an extra implicit input/output everywhere. Both ARM and PowerPC avoid updating the flags unless explicitly requested.

You mean things like having variants of common arithmetic instructions that update or don't update flags?

> Besides fixed-size instructions and the traditional variable-size instructions, one can do variable-size instructions in bundles. An example would be 25-bit and 50-bit instructions packed into 128-bit bundles, with the remaining 3 bits used to specify all the sizes. (eight patterns: nnnnn, nww, wnw, wwn, nnnw, wwnw, wnww, wnnn) Extending that out to a typical cache line of 512 bits might be better. Another option is to use 1 of every 16 bits to indicate where instructions start.

Yeah, something like that could be nice. Though how would jump instructions be encoded? Bundle + offset within bundle?

> Where RISC-V got wasteful was the registers. Compilers are seldom able to use anywhere near 32 registers. On normal code, normal compilers seem to need about 8 to 10 registers free after deducting the ones reserved by the ABI. The ABI might need 3 to 5 registers. (stack, PLT, GOT, TLS, etc.) That means that roughly 11 to 15 registers are needed. Clearly, 4 bits (16 registers) is enough. Shoving some of those ABI-reserved registers out of the general-purpose set wouldn't be a bad idea; most of those are just used for addressing.

Nah, I think 32 registers was a good choice. (Relatively) common loop optimizations like unrolling or pipelining need more registers. Also, some of those registers are callee saved and some are call clobbered; by making use of this information the compiler can avoid spilling and reloading of registers around function calls.

For x86-64 16 registers is fine, partly because in many cases one can operate directly on memory without needing to explicitly load/store to architectural registers, and partly because the target was and is OoO cores that aren't as dependent on those register-consuming compiler optimizations.


It is common to have a bit which causes an instruction to update flag bits. PowerPC arithmetic instructions have an "Rc" field, usually the LSB, indicated by a trailing "." in the assembly syntax. ARM arithmetic instructions have an "S" field, usually bit 20, indicated by a trailing "S" in the assembly syntax.

Bundle + offset is fine. The offsets don't need to be real. In the example given with 25-bit and 50-bit instructions, allowable low nibbles of instruction addresses might be: 0 1 2 3 4 (so it goes 0x77777773, 0x77777774, 0x77777780, 0x77777781, etc.)

I disassemble binary executables as my full-time job. I've dealt with over a dozen different architectures. I commonly deal with PowerPC, ARM, MIPS, x86-64, and ColdFire. The extra registers of PowerPC and MIPS are always wasted. Even with ARM and x86-64, unused registers are the norm. It simply isn't normal for a compiler to be able to make effective use of lots of registers. Surely there is an example somewhere that I haven't yet seen, but that would be highly abnormal code.

If more registers could be used by compilers, the Itanium would have been a success.


I'm not an expert so pardon any ignorance, but couldn't compilers be acting conservative about registers due to the "long shadow" of x86? Perhaps the modest increase of 8 registers for x64 didn't cause compiler developers to ever start considering registers as a generally abundant resource, thus constraining their designs.

> If more registers could be used by compilers, the Itanium would have been a success.

I kind of feel the Itanic never made it far enough for its register count to have mattered to anyone. I wonder if SPARC would be a better comparison... it was somewhat popular in the 90's and 00's, and was a RISC chip with oodles of registers, wasn't it?


What about something like having fixed 64-bit instruction bundles, where the first 4 bits indicate if it contains 4x15-bit, 3x20-bit, or 2x30-bit instructions? This limits the combinatorial explosion in cases hardware needs to handle for variable-width decoding (effects don't cross 64-bit boundaries). One could reserve some of those 4-bit patterns for indicating instruction dependencies (a la EPIC/Itanium) and/or indicating extended functional units with completely different opcodes (similar to EPIC/Itanium opcodes having completely different meaning if marked as FPU instructions).

Presumably some of the 4-bit prefix patterns would indicate that the final instruction in the bundle was an immediate value instead of an instruction.

Alternatively, one could take a page form Sun's MAJC playbook and shorten the opcode of the first instruction in the 64-bit bundle by 2 bits to indicate 4x16-bit, 2x32-bit, 16+16+32-bit or 32+16+16-bit instructions in the bundle.


you can also fold a cjmp+4 with a subsequent jump, though that's tough to do before branch prediction


It's an obvious mistake but macro op fusion is a "good enough" solution for x86. The type of macro op fusion that RISC-V needs is far more complicated.


Agreed: I am ready for a Risc6 fork. Keep it close enough to RISC-V that work proving out designs can be ported more or less mechanically, but take advantage of binary incompatibility to fix the worst design choices.

Status bits have turned out not to be such a problem as was once thought. Nowadays they just rename status registers along with the others.

Perhaps RISC-V's worst choice was the coarse-grained extensions architecture that locks up the most useful instructions with dozens of marginal ones, making them unusable in portable compiled code.

As a result, no RISC-V program can assume POPCOUNT is available, because it is locked away in the huge (still-unratified) "B" extension.

The most difficult problem modern CPU designers face is coming up with a way to put another hundred-million transistors to productive use -- i.e., that will make programs actually run faster, but without running too hot. Modern chips are crammed with dodgy, poorly documented gimcracks that make programs run absurdly fast much of the time, but not always.

Programmers now have two problems: discovering whether their programs actually are running fast, and discovering why the ones that aren't don't. Usually the only way to know our program is slow is when we see a faster one. It was fast until exactly that moment, then became instantly slow.


>> Perhaps RISC-V's worst choice was the coarse-grained extensions architecture that locks up the most useful instructions with dozens of marginal ones, making them unusable in portable compiled code.

Yes, this is the unintended consequence of trying to combine minimalism with too much modularity. Ironically this ends up with a more bloated core. Or perhaps the modularity is probably designed to provide more differentiation.

Need POPCOUNT, have to implement the B extension. Need vectors, have to implement scatter gather. Also Can't merge the register file with the FP registers.

In reality what will happen is people will choose instructions from these extensions and implement them as custom isa.

This will lead to a lot of non-portable code. Until some one combines all these useful ops into another standard extension. Repeat.


The solution to this is profiles. Profiles are the idea to combine extensions for specific fields. There are also special standard extensions that you can implement that overlap with the standard extensions for some special instruction.

If you want to only have core + popcount you can do that in your own core. The code will still be portable to all cores that have B extension. If there is a whole industry that needs that combination, it can just be its own profile.

There simply is no perfect solution for every possible solution. If you increase the core, you hurt those that have absolutely no use for some of these things.

RISC-V did an amazing overall job to find a combination of ideas to attempted to solve everything for all people but the reality is, no single thing they could have designed would not end up with some expert on HN explaining why the choice is wrong.


It’s a trade off, for sure. By using optional extensions, you encourage adoption in the embedded space. But if they’re not part of the base spec, good luck getting mainstream adoption through a RISC-V based desktop computer. The big problem is that people are claiming RISC-V to be the processor that’ll take over the world while ignoring that it can’t if the base spec is too limiting.


That's why you have profiles, nobody expects the base spec to take over the desktop world. However, RV64GC is a perfectly fine base profile for desktop computing.

Its not yet complete and eventually for desktop a more extended profile will be needed, but that profile will be the standard that all open source distros will build around, just as now RV64GC is the standard profile.

I don't understand why you even think the base spec should cover all things need for desktop, that was never the intention. In fact it was exactly not the intention.

RISC-V ecosystem explicit goal is to work form the smallest embedded core, to the largest super computer.


Going by what the article says

>First it is combining two instructions into one through compression.

>Then it splits it into two through decompression.

>The combine them back into one operation through macro-op fusion.

The general pattern appears to be: load 8 32bit instructions (2 compressed instructions count as 1 32bit instruction), convert them to up to 16 32 bit instructions. Now your macro op fusion would only have to operate on 32bit instructions.

Although the decoder sounds trivial the macro op fusion still has a massive problem. x86 macro op fusion only needs to look at the first instruction to know which macro op fusion to apply because x86 does not fuse more than 2 instructions.

The type of macro op fusion that RISC-V needs to be competitive would span up to 6 instructions. At that point your instruction decoding problem is as complicated as a regex which means unbounded complexity compared to whatever x86 does. Solving this problem efficiently would require some sort of "hint prefix" on the first instruction. This is far more complicated than just adding a new instruction.


Sure, but as I understand it RISC-V was not intended to be used for big high performance general CPUs. It’s specifically optimised for small core applications, with extensions allowing efficient support for only the additional functionality needed for a specific product. Complaining that RISC-V isn’t well suited to designs competing with Snapdragons and M1 is somewhat missing the point.

There has been some talk of a breaking fork of RISC-V better suited to large high performance designs. I think that makes a lot of sense and would allow RISC-V to stay good at what it was intended for.


Right, but the OP and most of HN seem to believe that RISC-V is some magic thing that is going to beat M1/Zen/whatever one day while also providing amazingly low power for tiny 32-bit cores.

Also, RISC-V is also essentially split into two incompatible instruction sets (32-bit and 64-bit, even if you forget the huge amount of extensions). It would have been better just to design two completely separate instruction sets (32-bit for low power, 64-bit for high performance), but at core it's a teaching instruction set like MIPS before it, so they went for simplicity not optimality.


Please tell me they're calling it RISC-VI.


How wasteful is it to have a bimodal decoder that assumes 8 4-byte instructions or 8 2-byte instructions, with bail-out circuitry to dispatch fewer instructions and advance to the next instruction-size boundary in the case of a stream of mixed widths? You don't need to handle the entire realm of possibilities in a single cycle if you're willing to pay a performance penalty for mixed-width instruction streams. You'd pay a performance penalty for mixed-width runs of instructions, but hopefully profile-guided-optimization would allow the compiler to know where to make the tradeoff for speed and were to make the tradeoff for size.

Also, you mentioned elsewhere that you're a hardware designer. I'm aware that split register files are advantageous for speed, while unified integer and fp register files are advantageous for small/low power implementations. Given that the high-performance implementations are going to use register renaming anyway, has anyone quantified the penalty in using register renaming to allow for a split register file in high-performance implementations while allowing compatible low-power implementations with a single register file? Maybe the transistor savings are dwarfed by other concerns in a modern hardware implementation, but given RISC-V's academic goals, economy for FPGA implementation seems useful.


> Has anyone quantified the penalty in using register renaming to allow for a split register file in high-performance implementations while allowing compatible low-power implementations with a single register file? Maybe the transistor savings are dwarfed by other concerns in a modern hardware implementation, but given RISC-V's academic goals, economy for FPGA implementation seems useful.

I'm not sure I fully get your point. A low power implementation would have one physical register for each register in the instruction set (ie. for RISC-V it's already split). If you have a unified file in the instruction set then that means that you're going to have fewer GPRs available (ok for RISC-V as 16 is enough really) and you either need unified in your OOO implementation or you need to make integer->floatingpoint forwarding an exception (at which point you'd be better just to split 16/16).


It's rare to have one function that uses both a lot of integer operations and a lot of fp operations. My understanding is that there are two advantages to having split gp and fp register files: (1) it is advantageous to not have integer/address calculations and fp operations not contend with each other over register file read ports (2) you get twice as many visible (architectural) registers without taking up more bits in the instruction stream.

My understanding is that the designers of both POWER and aarch64 did a bunch of research before settling on 32 for the number of gp registers, and there are some workloads where 32 are really needed, so having a fixed split is suboptimal for some workloads. Aarch64 isn't that old, and I'm not aware of major improvements in compiler register allocation algorithms since it was designed. Are you contending that ARM made a mistake in their analysis when expanding from 16 to 31/32 gp registers for aarch64?

So, my assumptions are (1) optimal split between fp and gp depends on workload (2) splitting them has a cost in die area for low-power implementations and (3) high performance chips can use the register renaming hardware to get the extra register file read ports from split register files and effectively tailor the gp/fp split to the workload. You wouldn't need to make integer-fp forwarding an exception, though the forwarding might have latency similar to an L1 cache read. Presumably, the register renaming hardware would keep a single bit (or small 2-3 bit saturating counter) to keep track of the last usage of each register, so that load instructions were likely to be stored to the optimal register file.

It just seems that if you're going for a brand new green field architecture design, you can get rid of the cost of a split register file for low-power implementations and get all of the benefits of a split register file in high-performance designs that are going to have register renaming hardware anyway.

Maybe I'm missing something with regard to your comment about integer-fp forwarding needing to be an exception instead of incurring slight latency.


I guess my comment about 16 registers being enough was more in the context of low power. In benchmarks I've seen going from 8->16 is a big boost, but going from 16->32 is generally only a few %. In ARM64 32 is definitely the right choice but in low power 16 is probably going to be best because with 32 your register file ends up taking up >50% of the die area and having 32 prevents you from fitting instructions into 16 bits (3 registers now uses 15 of your 16 bits, with 16 regs you have 4 bits leftover).

On the register file split I think you're generally right, it just comes down to what the penalties actually look like from having the unified file. I think the loads are a cause of trouble, really you need two different types of loads rather than having a predictor I think (low-power implementations could ignore that single bit). It's really just a matter of those register file read ports, you are going to need a lot more or need less backends or get bottlenecked on read ports (remember not everything gets forwarded, much is already retired).

One thing that's also relevant to the discussion is almost always when you add FP to a CPU these days you also add some kind of vector ops, and given vector registers are not the same length as the GPs you can't have a unified file. If I was doing green field I would probably go for a split file, but then say low power implementations shouldn't implement FP/vector at all (which is what RISC-V has done), unfortunately although this idea is popular with hardware designers it is not with software developers who are used to being able to drop a 'float' or 'double' into their C code and have it work (or worse may use a library that includes floats in its internals).


> How wasteful is it to have a bimodal decoder that assumes 8 4-byte instructions or 8 2-byte instructions, with bail-out circuitry to dispatch fewer instructions and advance to the next instruction-size boundary in the case of a stream of mixed widths?

I suspect mixed-width streams are extremely common (at least going by compiled RISC-V code I've seen). Potentially you can alter compilers to change this but the compressed instruction set is very limited, it naturally mixes compressed and uncompressed and forcing otherwise will likely produce poorer code overall.

I'd also note that branch prediction accuracy is a key part of modern processor performance, they're maybe around 95-99% accurate most of the time. Your scheme looks to have similar mispredict penalties to branch prediction (you have to chuck things away and start over, though at least you don't have to refetch) so will likely require similar level of accuracy to work well. Plus on any mixed stream performance will be very poor as you've explicitly not built a decoder for mixed streams, so you end up dripping things through one at a time or you allow limited decoding on mixed streams (maybe 2 instructions per cycle).

Edit: Perhaps the mis-predict penalty isn't so bad compared to a mis-predicted branch as detecting the mis-predict is trivial. Indeed having just realised this you may be better off without the predictor at all, just look at the bytes and decide which you should do.

Though depending on the decoders you may end up needing two entirely separate sets of logic for the compressed instruction decoder vs uncompressed instruction decoder anyway, in which case you just decode your stream in both ways then decide which set of decodings you want to use. You still have the issue with any mixed width stream killing your performance though which I suspect would be a major issue with this design (so much so it's not useful).


I don't see why macro-op fusion should be harder for RISC-V than x86. x86 ISA was designed without this in mind. The RISC-V designers in contrast seem to deliberately design instructions to make it easier to perform macro-op fusion. You can see specific revisions to the spec being made to make macro-op fusion easier.

Actual real world performance tests already suggests that RISC-V tend to require less memory for their programs as well as fewer instructions than x86 and ARM. Hence even without pervasive macro-op fusion they are in a good spot.

E.g. for CoreMark used to benchmark embedded processors RISC-V BOOM implementation came ahead ARM-32 Cortex-A9 with about 10% fewer instructions, 10% smaller memory usage and about 10% higher clock frequency (thanks to smaller core).


How? I was under the impression that RISC-V was deliberately minimalist in the hope of obtaining smaller cores, and I can see how that would allow higher clock frequency, but I would've expected the trade-off to be longer programs. How did it get shorter programs than ARM? What were the cases where a single RISC-V instruction corresponded to multiple ARM instructions?


A lot of it is about having shorter versions for frequently used instructions. E.g. compare and branch is just one instruction on RISC-V. More complex instructions simply are not used frequently so their ability to reduce code size isn't as big as people think.

In addition RISC-V has twice as many registers as ARM-32 Cortex-A9, which the particular CoreMark was comparing against. Also a different ABI conventions reduce the number of instructions spent on saving and restoring registers. E.g. RISC-V has certain registers marked as saved, temporary, arguments etc.

Then of course for RISC-V it isn't expensive to support compressed instructions. You use the same decoder not different modes and different decoders like ARM. Just 400 gates to implement, so it is kind of a no-brainer.

If comparing SPEC CPU2006 benchmark code using GCC compiler then uncompressed RISC-V vs uncompressed ARM is basically the same. Compressed RISC-V and compressed ARM is basically the same. The latter beats x86 by about 26%.

Uncompressed RISC-V is just 8% bigger than x86. So I would say in general RISC-V is paying a very small price for its simplicity. One could say there is a benefit in being able to learn from past mistakes.


The most common instructions have a compressed 16 bit representation. The compressed instructions are optional though and simple cores just have to support 32 bit instructions.


This paper here: https://carrv.github.io/2020/papers/CARRV2020_paper_12_Perot...

seems to show that RISC-V has code density ~11% worse than ARM (although against Thumb2, so not 64-bit)...


of course what you do is run the 32-bit decoders and 16-bit decoders in parallel making the same internal instruction bundle formats - then you fuse those bundles.

BTW your analysis ignores that in the compressed ISA you can have 4-byte instructions aligned on a 2 byte boundary


That kind of destroys most benefits of fixed length instructions. I expected compressed instructions to be aligned on 4 bytes which means your 32bit decoder merely has to have a 2x 16bit fallback but nope, now you gotta decode with a 2byte granularity and join two 2 byte instruction pieces into a 4 byte instruction.


Yes but it's just a 2:1 mux in the path, it's not like you need twice as many 32bit decoders


> It remains to be seen whether the RISC-V choice (the first two bits of every instruction are enough to determine the instruction length) will allow for a similarly wide decoder.

It's common practice that CPU designers actually evaluate every possible choice in an emulated environment, and make a quantitative analysis, before actually choosing to go down a particular path. At least, that's typically part of CPU design courses. This is the game that has allowed Apple to make a major leap in a short amount of iterations, and other CPU designers should become better at it. We can discuss all kinds of optimizations, but before we throw them in a model there's little we can say about their effectiveness, while the model is a very cheap way to learn more.


I'm sure ISA designers use modeling, but maybe those models differ in their precognition abilities, considering how different choices eg risc-v and aarch64 have made.


As you mention, ARM had Thumb for ARMv7. It's important to remember that ARMv7 is still a product. The Innovator's Dilemma may blindly say to cannibalize, but instead, ARM probably thinks that these are two different sectors. The 64b sector doesn't need Thumb (which comes at an opcode space + decoder cost). The chips that ARMv8 are going into have massive DRAMs and while Thumb saves some, it wasn't enough. RISC-V on the other hand is all things to all people. So an embedded chip needs the possibility of Cray style vectors because reasons.

I do think that if someone wants an 8-wide RISC-V decoder all they have to do is throw the transistors at it. That's a consequence of fixed width. Yeah, RVC will require more transistors, but not n^2 (?) like X86 does.

RISC-V is pretty elegant, well except for the bit manipulation nonsense. They really should have plagiarized ARM instead of Intel there.


I'd imagine it would since they limit the number of possible instruction lengths. The hard thing about x86 is the fact that instructions can by anywhere from 8 bits->256 bits in length (in 8 bit increments). So there's really no way to know if you are loading up an x86 instruction or you are splitting an instruction.

The fact that risc-v only supports 32,64, or 128bit instructions means that you don't have to go through a lot of extra effort to determine where instructions start or end.


Note that RISC-V can have 32, 64 or 128 bit wide registers. The instruction length can be any multiple of 16 bits, though the base instructions are all 32 bits and the C extension adds 16 bit instructions and other lengths have not yet been used. This diagram shows how instructions up to 192 bits long can be encoded:

https://www.embarcados.com.br/wp-content/uploads/2017/05/RIS...

It is a very simple circuit that can take the bottom 16 bits of any RISC-V instruction and tell you exactly how long it is, unlike the x86 where you need a sequence of steps.


Nitpicking, but the documented RISC-V encoding supports anything from 16-bit through 176-bit long instruction words in 16-bit increments, with instruction length being determined by the first 16 bits of the instruction word. Some encoding space has been reserved for 192-bits or longer. Recent versions of the RISC-V spec have actually de-emphasized this length encoding, so implementations may well be allowed to use the encoding space in different ways as long as they don't conflict with existing 32-bit (or 16-bit if the C extension is specified) instructions.


It's more than just limiting the number of possible instruction lengths; it's also that you only need the first few bits of the instruction to determine its length. With x86, you have to decode the first byte to know if the instruction has more bytes, decode these bytes to know if the instruction has even more bytes, and so on.

But since I'm not a hardware designer, I don't know if the RISC-V design is enough to make a high-performance wide decoder. With 64-bit ARM it seems very easy; once you loaded a n-byte line from the cache, the first decoder gets the first four bytes, the second decoder gets the next four bytes, and so on. With compressed RISC-V, the first decoder gets the first four bytes (0-3); the second decoder gets either bytes 4-7 or bytes 2-5; the third decoder can get bytes 8-11, 6-9, or 4-7, depending on how many of the preceding instructions were compressed; and so on. Determining which bytes each decoder gets seems very easy (it's a simple boolean formula of the first two bits of each pair of bytes), but I don't know enough about hardware design to know if the propagation delay from this slows things down enough to need an extra pipeline step once the decode gets too wide (for instance, the eighth decoder would have 8 possible choices for its input), or if there are tricks to avoid this delay (similar to a carry-skip adder).


I am a hardware designer. See my comment above, it's going to be ugly. Fast adders are still slow (and awkward), but they only have to propagate a single bit of information, this is much messier.

As a side-note carry-skip doesn't really work in modern VLSI, I guess you were probably thinking of carry-lookahead.


Ok so you are saying not only are there implicit super instructions via macro op fusion there are also variable length instructions in there too? Ok, I'm not a RISC-V expert but damn that kind of ruins even the last tiny shred of its original value proposition of being simple. Sure the core instruction set is easy but once you add extensions it's just plain ugly.


Nitpick: the longest valid x86 instruction is 15 bytes, or 120 bits (longer instructions might be structurally sound, but no hardware or software decoder will accept them).

Variable length isn't actually a huge problem at the instruction fetch level (and modern compilers will align x86 instructions to keep ifetch moving smoothly), but it does make both hardware and software decoding significantly more complex.


You seem very confused. RISC-V supports variable length instructions from 16 bits up to 192 bits (and maybe longer in future), and it's easy to tell in an instruction stream when the next instruction starts (although the stream is not self-synchronising).


Why would you need more that 32 bits for an instruction, (other than backward compatibility, which I'm assuming riscv is free from.)


RISC-V is designed to be extensible. As well as demanding obviously more than 32 bit instruction coding to support many extensions, we expect there will be some (corner) cases where very long instructions might be desirable for particular extensions.


As far as I understand a RISC-V CPU will have fixed length instructions potentially having some of them compressed. I don't think you are going to see 32, 64 and 128 length instructions mixed in the same program.

But I might be wrong about this. Do you have any source that suggests that instructions of different length should be mixed in one RISC-V program other than use of compressed instructions?


You are completely wrong, as well as mixing up instruction encoding with size of the data registers. Please read section 1.5 of the user spec.


No, need to be such a dick about it. I was frank about not being certain about this. I have read section 1.5, and I cannot see it supporting your claim. The very first sentence says:

"The base RISC-V ISA has fixed-length 32-bit instructions that must be naturally aligned on 32-bit boundaries."

Later it talks about:

"For implementations supporting only a base instruction set, ILEN is 32 bits. Implementations supporting longer instructions have larger values of ILEN."

It seems clear to me that a standard RISC-V implementation today is 32-bit fixed sized on a 32-bit boundary. There may however be support for future architectures with longer instructions. None of this suggests that a regular RISC-V implementation has to assume that instructions can be any length.

These things are not even part of the standard yet. So please, don't be such a dick about something that isn't all that clear at the moment.


RISC-V also has fixed size 32-bit instructions. Are you talking about the optional compressed extension?


The Linux distros are generally expecting that the hardware supports compressed


Sure, but that’s a single configuration option away from being changed if the hardware is altered.


Not really, it requires everything to be recompiled, which takes a few weeks assuming you have access to the cluster of RISC-V machines required to do it. We (Fedora) have rejected one platform already that wasn't going to support the compressed extension.


Why would you need to do the builds natively instead of cross compiled?


The whole toolchain just makes this assumption, and we want to build it the same way Fedora is built for other arches. Plus we're expecting server-class RISC-V hardware soon enough, at least comparable to ARM servers.


I don't understand the hype for RISC-V at all. It's an instruction set that would... just require a new backend for clang/gcc/Java and mean nothing to me unless somebody makes something that just obliterates a Ryzen.

Why should developers care? We barely have a decent ARM ecosystem after years of ARM routers and Raspberry Pi. Now we are supposed to be enthusiastic we will do that yet again for RISC-V?

Is it because it's somehow more open? But ARM wasn't that closed afaik - lost of folks produced it.

And Sun Microsystems open sourced a 64 thread CPU, the OpenSparc and nothing happened.

So, in a nutshell, why?


The approach of making a small instruction-set with optional extensions means RISC-V is very well suited for making specialized co-processors. This story discuss that more in detail: https://erik-engheim.medium.com/apple-m1-foreshadows-risc-v-...

On example of this is Esperanto Technologies, which has created an SoC with slightly more transistors than the M1, which has over 1000 RISC-V cores which implement the RISC-V vector instruction set extension to allow the processing of a large number of matrices and vectors. Basically the ET-SoC-1 as they call it is supposed to offer superior performance in the Machine Learning domain. 30-50x better performance with 100x less power consumption.

Esperanto Technologies are using the full flexibility of RISC-V by having more general purpose RISC-V cores, four of them, which are meant to run an operating system, which schedules machine learning tasks to this large number of smaller vector oriented RISC-V cores.

My understanding is that creating a good ISA is actually quite a task. RISC-V has over 1000 contributors over years who have made it happen. Esperanto Technologies apparently began with their own proprietary ISA for their coprocessors but found they could not beat RISC-V, and that making something better would just cost a lot more money and resources.

So in short the value proposition is in having a highly customizable ISA, that is well designed. There are no such other options on the market. ARM isn't highly customizable, since it has over 1000 instructions you must implement. RISC-V only has 47 instruction you must implement. All fairly simple ones.


>> There are no such other options on the market. ARM isn't highly customizable, since it has over 1000 instructions you must implement. RISC-V only has 47 instruction you must implement. All fairly simple ones.

I think this is a wrong comparison. There is a subtle difference here that is being missed.

CPU's become bloated over time as they get used differently, as the applications shift, as they try to address newer areas. And with each iteration, they still have to support the legacy code. That is how you end up with thousands of instructions

It is easy to have a lesser instruction count when you are starting from scratch and have no legacy.

What is going to stop RISC-V from becoming another ARM or perhaps x86 even, in another couple of decades ? When it spans such a large application space that most of the extensions become default and the core becomes bloated ? Time for another ISA then.


No, because verification tools are made to verify that you don’t take an extension for granted. Compilers as far as I know are made to generate code for different extensions. The CPU also contains registers to check what extensions are present.

ARM is what it is because license holders are required to implement the whole ISA. RISC-V is kind of the opposite. You are required to assume extensions are optional.

And OS is supposed to trap instructions not implemented and jump to a software emulation of them.

I don’t know all the details. But the point is they are planning for this from the start and building their tools around this. That was never the case for ARM or x86. Hence it is premature to assume it will end up the same.


It's time to talk about a completely different domain that makes programming be heaven. In electrical engineering there is no such thing as a generic component. There are only implementations of them and you have to pick among the company provided implementations. Imagine if you had to specifically choose a 4Ghz grade if statement in your programming language or make sure your while loop supports 1.3V on the CPU. You need to have a good mental model of how the statement are implemented to choose the right one.

That's how bad electrical engineering is. Programming is bliss in comparison. You can do the dumbest things and it still works out. Your computer will never go up in flames even in the most bug ridden C code base. RISC-V will bring this to a lesser extent to software development. There will be no "general" RISC-V CPU. Instead you have to constantly worry about whether your CPU supports these instructions. Implementing fallbacks will just cause very inconsistent performance drops among CPU manufacturers. Think of things like the Intel compiler disabling SIMD on AMD cpus except this time it's because the manufacturer didn't bother to implement a "bloated extension" where only one instruction is actually used but when it is used it's in Fortnite or some other popular game so you do extremely poorly in benchmarks.


> You can do the dumbest things and it still works out. Your computer will never go up in flames even in the most bug ridden C code base.

...and that, unfortunately, explains the quality of a lot of software and the low entry bar for the industry.


The issue is not whether SW can work around optional extensions. The point is that if you want a general purpose performant cpu (like ARM/x86) targeting a wide swath of applications, you will end up with these optional extensions as default. And then it is as bloated as others. We are not talking about micro-architecture optimization here to make it more performant, but isa architecure.

Ofcourse, you can change the computing module to minimal base isa and specific accelerations in ISA/HW. But then you will be restricted to niche applications and never replace the incumbents for general purpose workload.

>> Hence it is premature to assume it will end up the same.

And it is wishful thinking that it will happen otherwise. Whether ARM/x86 thought about extension from the ground up doesn't matter if you are targeting the same application space as them.

I have not seen anything done to even acknowledge this issue, let alone do something about it, for the simple reason that this lies far into the future and has no impact on the current scenario.

Given time and legacy software, it will morph into one.


> The point is that if you want a general purpose performant cpu (like ARM/x86) targeting a wide swath of applications, you will end up with these optional extensions as default.

Yes, but the number of instructions is still minuscule compared to ARM and x86. It am sitting with the reference sheet here in my hand. It is just two pages. Very quick to read over.

> And then it is as bloated as others.

Eh... nope. A huge number of those instructions exist due to cruft building up over years, not due to an actual need. A major source the huge number of instructions is tons of SIMD instructions which have superseded each other. RISC-V designers have specifically sought to avoid this preferring a Vector extension instead which adds a lot fewer instructions as it is more flexible.

People keep saying that is complex, but several companies have already made implementations of this. Esperanto Technologies is making SoCs with over 1000 RISC-V cores with vector extensions. So it cannot be that complex if they can fit that many cores.

And the compiler technology is better developed for vector instructions than for SIMD.

> And it is wishful thinking that it will happen otherwise.

It is not wishful thinking to assume a couple of lessons have been learned decades later in ISA design. You can just read up on the details about how the RISC-V ISA has been designed to understand how they are able to keep things simple while still retaining flexibility.

> Whether ARM/x86 thought about extension from the ground up doesn't matter if you are targeting the same application space as them.

The flaw in your assumption is that you seem to think all those instructions are actually used and need to be able to handle modern tasks. Tell me how much software today needs special instructions for Binary encoded decimal numbers? x86 is full of that kind of legacy bloating the ISA, with no value to modern software.

> I have not seen anything done to even acknowledge this issue, let alone do something about it, for the simple reason that this lies far into the future and has no impact on the current scenario.

I think you should just read what they have said about all these things. They have thought a lot more about this than you seem to have. They been doing ISA design for many decades and seen problem develop over time. RISC-V is a response to all those problems of the past. To not build yet another bloated ISA with redundant legacy instructions.


ARM licenses and royalties are not free. The royalties are a small but significant cost of a chip (1-2%) and licenses on the IP are a large R&D cost for the designer.

China is obviously interested in RISC-V as their access to the US market is threatened, and they could be a driver of wider use.


India is also focusing on RISC-V, as they do not have their own chip industry yet. Depending on the Western countries for your processors is something any country wants to de-risk.


This seems relevant if my corporation wants to put some $M into producing a thing. But I paid like $200 for my CPU and I would love a $500 Ryzen. Either way, I don't really care if 1% goes to royalties -- my price fluctuates more than that just because of our national currency.


You are not the target market. Maybe someday RISC-V will compete with desktop processors, but at the moment it is most interesting for embeded systems and coprocessors.


Seems fair enough. Still, the hype is so big on HN it does seem like the hope is for RISC-V everywhere. It's like saying soon enough we will all program in Haskell.


It seems unlikely that I’ll ever use a risc-v system in anger, but I’m interested in them anyway because at heart I’m a hacker, which to me means (in part) that I’m interested in technology.

Until recently, the technology world has been shrinking relative to the 90s. I cut my teeth on machines running z80, 6502 and 68k CPUs, and during my early career I developed and deployed on 88k (DG/UX), PA-RISC (HP/UX), SPARC (Solaris and DRS/NX), MIPS (Irix) and probably a bunch I forget. Things were even more diverse (but less accessible) in earlier decades - each system had its own OS, and sometimes more than one (I’ve used RSTS/E on PDP and CSM on System 25...).

Today, everything in the commercial world runs x86 and Linux. That’s it. Even Windows people are using WSL now. There are loads of reasons why that’s good, but the hacker in me finds it dead boring.

So when I see things like M1 or Raspberry Pi or ESP8266 or RISC-V, it reminds me of a time when there was a lot more diversity in computing, where I could wonder about how things worked and read about them and imagine what applications I might have for them. It’s exciting! It gets my hacker juices flowing. And this is hacker news, right?

So despite the chances of me ever using risc-v in anger being approximately zero, I still think the whole thing is fascinating, and I learned heaps about CPUs just from people comparing RV to ARM and x64.

I think that’s the reason the “hype” for RISC-V is “so big”. It’s not because it’s going to be everywhere, but because it’s quite literally hacker news.


> I think that’s the reason the “hype” for RISC-V is “so big”. It’s not because it’s going to be everywhere, but because it’s quite literally hacker news.

:-)

Your whole post was a pleasure to read.


I think that's the wrong question to ask. Aside from those writing kernels, compilers, JITs, etc., developers don't need to care, that's the point of programming languages and compilers. Throw a toolchain at a developer and tell them to do it and they can do it.

Comparing RISC-V to OpenSPARC is comparing apples to oranges. One is an ISA, the other is a hardware project.

What do you mean by "We barely have a decent ARM ecosystem". Are you talking about the tools to build for those processors? Or the tools to build new ones of those processors? Or the actual chips that are available?

I'm basically always within 10m of at least a half dozen ARM processors, and a handful of 8051s. At worst, RISC-V could be equivalent to ARM, without many of the parts that suck about it.


Are we talking about RISC-V as an abstract idea, the same way we would talk about the benefits of functional or object oriented programming here?

Or is the end goal to have hardware in stores that people can buy?

Wasn't SPARC a RISC architecture? (Wikipedia says so, but I'm no expert).

> What do you mean by "We barely have a decent ARM ecosystem". Are you talking about the tools to build for those processors? Or the tools to build new ones of those processors? Or the actual chips that are available?

I remember the early OpenWRT / RPi days when many packages weren't even compiling for ARM. Now I think we are in a much better shape. The chips were available, but the software ecosystem was way behind.

> At worst, RISC-V could be equivalent to ARM, without many of the parts that suck about it.

What I do know about ARM sucking is that the whole board design makes each product need a customer kernel more or less. Which is why we still don't have a "Linux for your phones". What I've read about RISC is that it's going to be roughly the same: encourage a lot of co-processors which mean custom boards to me and probably custom kernels.


> What I do know about ARM sucking is that the whole board design makes each product need a customer kernel more or less. Which is why we still don't have a "Linux for your phones". What I've read about RISC is that it's going to be roughly the same: encourage a lot of co-processors which mean custom boards to me and probably custom kernels.

FWIW, this has nothing to do with the ISA per se, and particularly absolutely nothing to do with whether the ISA is RISC, CISC, or something else.

The problem is that like so many other embedded systems, ARM systems don't have peripherals which are discoverable at boot, hence all that has to be hardcoded (or specified at boot via devicetree) into the kernel (e.g. the interrupt controller is model XYZZY and is accessible at address 0xBEEF). So you end up with a kernel binary that works only on a specific board.

The "server-class" ARM systems adhere to something called SBSA which is a specification incorporating things like UEFI, ACPI, PCIe etc. The end result being that SBSA systems can use a common distro kernel just like x86 systems can.


I learned something, thank you! Considering how performant modern phones are I hope all adhere to SBSA or similar so I can install any OS on my phone just like I could on my 100Mhz PC way back.


Because you can't make an open ARM or x86 core. Yes SPARC was open before and not much happened from it, but just because something didn't work the first time, doesn't mean it not a good idea. The world is different now compared to then.

If we ever want to live in a world where we have open implementation that have a huge software ecosystem that is ready, we need an open well support ecosystem.

So, if you don't care about open source, then you don't care about RISC-V. If you do however, you probably should care.


> However unlike the ARM, MIPS and x86 designers, RISC-V designers knew about instruction compression and macro-ops fusion when they began designing their ISA.

Since ARM64 was announced in 2011 this is plainly false for 64 bit ARM.

(And there is similar confusion between 32 and 64 bit ARM throughout the article).


Instruction compression dates at least to MIPS16, 1996 [1]. Macro-op fusion dates at least to the Pentium M, to 2003 [2]. These are old ideas.

[1] https://en.wikipedia.org/wiki/MIPS_architecture

[2] https://www.agner.org/optimize/microarchitecture.pdf


Indeed and ARM7TDMI with Thumb was released in 1994 so clearly Arm had forgotten about it by 2011!


I still feel that the criticisms in this article still stand: https://gist.github.com/erincandescent/8a10eeeea1918ee4f9d99...

The chief and most difficult to address among which is the lack of advanced addressing modes ('lea' in x86) means that address generation for array access requires 3 more instructions. The writer suggests that Macro-OP fusion will pick up the slack in this case, however it has been pointed out that this level of fusion (tons of instructions, which can have various read and write locations) is very complex to do in HW and is typically not done even in high-end commercial CPUs today.


> Highly unconstrained extensibility. While this is a goal of RISC-V, it is also a recipe for a fragmented, incompatible ecosystem and will have to be managed with extreme care.

This seems to me to be under-debated. If / when RISC-V gets more adoption what is there to stop vendors trying to 'win' by adding proprietary extensions that fragment the ecosystem.


Convincing the Chrome (or Firefox, Mathematica, MATLAB, $distro's emacs builds, ...) team to switch to their proprietary gcc/LLVM fork that supports the extensions, I assume.

Proprietary extensions work better in the embedded space, where the status quo is a proprietary toolchain; I doubt you'd be able to get a ton of buy-in from application distributors before you have market dominance.


Fair challenge. I fear though that once a firm gets to say Qualcomm or Apple size in the market then they have the scale to continue to maintain their own fork and develop their own apps with that (e.g. Apple could easily do with Safari etc.)


Its not 'under-debated' it been literally debated in every single forum that ever had a post about RISC-V. Its literally the most debated thing.

The reason its not very likely to happen is because we have a gigantic open source ecosystem that is unlikely to for free adopt your extension unless you pay them to do it.

At the end of the day, you simply can't force hardware designer into some standard unless your ISA is proprietary.


"Unlikely" and "for free". So it could happen and yes firms do pay for changes to open source software. You really haven't answered the point about what's stopping a big vendor adding its own extensions and fragmenting the ecosystem.


Nothing is stopping it, just as nothing is stopping people from copying any open source software. My point is exactly, its not designed to 'stop it'. In fact, the opposite.

If somebody wants to replicate large parts of the open source ecosystem to support their extension then that's ok.

The point is, that most of the time this simply doesn't happen and if it does, it doesn't actually hurt those that want to hold to the standard.

In some segments of the market, doing so is perfectly fine and encouraged. In others it's much more unlikely, difficult, and expensive.

RISC-V is designed to work from the smallest embedded chip to the largest server. Trying to impose some sort of universality would never work, and trying to use legal means to prevent it is simply a bad idea and an anti-pattern.

The ecosystem is designed so that all software is able to handle the different configurations and understand when software is not supported and why. It's designed to have the base system be maximally portable and to cover most things in standard extensions, so that new implementations can access a huge amount of tooling and software without issue.


I believe RISC-V follows a pretty simple rule, where one looks at the destination register only. Intel CPUs have been doing macro fusion for years, and they had to tack that on in hindsight. I don't see why RISC-V cannot make this much simpler than Intel, since they are deliberately designing their instructions to work well with macro-op fusion, something Intel never did.

If Intel is able to do it, I have trouble seeing why RISC-V cannot, when they are even planning for it. I don't buy this complexity argument at all. It doesn't add up.


x86 only needs to macro-op fuse two instructions at most, meaning you can look at the first instruction and immediately see which fusion to apply.

With RISC-V there are proposals to macro-op fuse things like array out-of-bounds checks, which comprise 5 instructions. You can't just check the first instruction and then the second. It's entirely possible that instruction 1 has 6 different macro-op fusion patterns and you have to read instruction 2 to cut down on the number of potential patterns. It's also possible that the second instruction is not relevant in determining the pattern, meaning you have to scan 3 instructions to determine the pattern. You have to start this pattern search at the beginning of every instruction you decode, which limits your ability to add more decoders. Of course, with enough spare transistors this is merely an extremely difficult problem, not an impossible one.
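
To give a feel for what such a multi-instruction pattern looks like at the source level, here is a rough, hedged C sketch (the exact RISC-V sequence varies with the compiler and surrounding code; checked_get is just an illustrative name):

    #include <stddef.h>
    #include <stdint.h>

    /* A checked array read, roughly the kind of idiom the multi-instruction
     * fusion proposals target. */
    int32_t checked_get(const int32_t *a, size_t len, size_t i) {
        if (i >= len)      /* bounds check: typically a single bgeu to an
                              out-of-line error path */
            return -1;     /* stand-in for the out-of-bounds handler */
        return a[i];       /* then slli + add + lw for the actual access */
    }
    /* A front-end that wants to fuse the whole check-and-load has to match
     * the branch plus the three-instruction address computation, i.e. four
     * or five instructions, and it only knows which pattern applies after
     * looking at several of them. */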


Yeah, but each CPU can choose exactly how much fusion it wants to do. That is not part of the ISA or a promise made to developers. Doing macro-op fusion like x86, combined with compressed instructions, should be a big win anyway.


A simple 'lea' equivalent does not need multiple "read and write locations". ISTM that this ought to be an ideal case for macro-op fusion.


Yup, you are correct. However, the CPU still needs to figure that out by examining the read and write locations of all 3 instructions, which need to be decoded, then fused before execution.

I'm a dilettante in this area, but it doesn't sound trivial to me.


What I have gathered boils down to: the RISC-V core instruction set is simple but there are too many extensions.

The way they intend to speed up RISC-V processors is simply by throwing more hardware at it, hoping that at some point x86 and ARM will hit a complexity roadblock so that throwing even more hardware is simply not possible for them.

For example, keeping the instruction set simple means you can start off with a non-superscalar architecture and get energy efficiency for microcontrollers. Once you need to keep up with the giants, you can throw extreme macro-op fusion into the pile. Macro-op fusing up to 6 instructions is extremely difficult on RISC-V, but it's impossible on ARM and x86. So eventually RISC-V will win simply because you can throw more transistors at the problem. Scaling up instruction decoding is viable on ARM and RISC-V, but the smaller instruction set means you can have more decoders on RISC-V.

However, doubling or tripling the number of decoded instructions per clock cycle is only relevant because RISC-V programs need more instructions in the first place to encode the same program.


Remind me, how is this done in ARM?


Since ARM has plenty of space in a 32 bit instruction, both the shift and the add are baked into the ldr instruction.


Even in Cortex-M, most array accesses are two instructions (lsls+ldr), as the ldr includes a "free" register+register add. For example, given an index in r0 and an array base address in r3:

    lsls    r0, r0, #2        @ scale the index: r0 = index * 4
    ldr     r0, [r0, r3]      @ load from base + scaled index


I won’t sign up for medium to view this (let alone download another app!), does anyone have a different link?


Opening in private mode usually works for Medium links (I never understood why they reward people who hide their cookies, but OK).


who pays for blogs? lol



Point 2 claims that RISC-V is the dominant architecture used for teaching. The link provides no evidence for that. I know lots of unis teaching ARM, and none teaching RISC-V. Is there any evidence that RISC-V is the dominant architecture for teaching?


Originally MIPS (or DLX) was the dominant architecture used for teaching computer architecture, because the standard computer architecture textbook (by one of the main designers of MIPS, David Patterson [1], along with John L. Hennessy [2]) was used in most universities [3]. These two authors were basically the leading university designers of the RISC philosophy. Patterson's team designed the RISC-I and RISC-II processors (Berkeley RISC [4]). Hennessy and his team designed the MIPS processors (Stanford MIPS [5]). This work eventually begot RISC-V. So yeah, RISC-V is now the dominant architecture used for teaching computer architecture, as they now use RISC-V in their latest book edition [6]. Also, for more information on that, read [7].

[1] https://en.wikipedia.org/wiki/David_Patterson_(computer_scie...

[2] https://en.wikipedia.org/wiki/John_L._Hennessy

[3] https://www.amazon.com/Computer-Architecture-Quantitative-Jo...

[4] https://en.wikipedia.org/wiki/Berkeley_RISC

[5] https://en.wikipedia.org/wiki/Stanford_MIPS

[6] https://www.amazon.com/Computer-Organization-Design-RISC-V-A...

[7] https://en.wikipedia.org/wiki/RISC-V#History


It probably refers to digital design education, not software development education.

And it does seem like universities are rapidly starting to adopt the ISA for that purpose. A few years ago, those classes would use a custom instruction set written by a professor, or maybe a simple proprietary ISA like MSP430. I don't think ARM was very common.

For example, MIT used to teach computer design with an arbitrary CPU that they called "beta". Now they use RISC-V.

https://computationstructures.org/notes/pdfs/beta.pdf

https://6004.mit.edu/web/fall20/schedule


Almost all universities I know of have shifted to RISC-V: UC Berkeley, MIT, any of the UCs, UW, even Harvey Mudd College. So it is not an unreasonable assertion.


My UC (Santa Barbara) used MIPS in its computer architecture classes.


There’s almost no difference between the two as far as the undergrad level intro to computer architecture course is concerned.


Well, other than the fact that the ISA is different and MIPS has a couple of strange legacy quirks (branch delay slots?)


From your list, this could be a US/UK separation.


I could give Indian universities too.


From what timeframe is your information?

I noticed a lot of institutions switching to RISC-V recently.

The simple core instruction set is very appealing for teaching. It makes implementing a full interpreter in a semester attainable.

Also, compiling some C code and having students actually know/understand all the instructions is a big win.

All while having access to open source and already-solid tooling (compilers, emulators).


At least a lot of the "important" American ones seem to have switched. At my university we used Nios II which, while basically no one will directly use it in their lives, was good enough for learning assembly. It's all pretty transferable anyways.


Compressed Instructions and Macro-Operation Fusion

It's strange to put these two together as RISC-V's killer feature. First, they are not new; they go back more than a decade. But more to the point, they are at odds.

Instruction compression says that DRAM+cache are a scarce resource. Macro-op fusion says don't worry about those big instructions, we'll combine them after decode. Which is it? BTW, micro-op fusion is not free. It has to search the instruction stream for possible matches.


With compression you avoid the cost of having multiple instructions to fuse. I don't see why that isn't a good combination.

With two instructions compressed into one and then fused into one micro-op, you get less pressure on the cache while still getting high throughput in the micro-op execution pipeline.

The "killer feature" here is about designing the ISA around the fact that one knows macro-op fusion and compression exists.

E.g. the x86 ISA was designed without any thought to the possible existence of macro-op fusion.

ARM also seems to have mostly ignored the existence of this in high-end processors when designing their ISA. Innovation is really just about using existing stuff in clever new ways.

Btw Macro-op fusion and micro-op fusion are not the same thing.


Doh. Yes, I fat fingered micro-op for macro-op.

> With compression you avoid the cost of having multiple instructions to fuse

I don't understand this. RVC is only a 16-bit 'abbreviation' for a subset of RISC-V instructions. It's ultimately the same instruction.

Also, RVC is an extension; it isn't part of the base, and Andrew Waterman's thesis [1] doesn't even mention macro-op fusion. FWIW, the A72 uses macro-op fusion [2]. Neither architecture ruled it out, but then neither architecture was anticipating it. It's just a microarchitectural optimization.

No, the x86 (1978) certainly wasn't designed with macro-op fusion in mind. Instead, macro-op fusion (2000) was developed by Intel to improve x86 performance [3].

[1] https://people.eecs.berkeley.edu/~krste/papers/EECS-2016-1.p...

[2] https://techreport.com/review/28189/inside-arms-cortex-a72-m...

[3] https://patents.google.com/patent/US6675376


If I get two 16-bit instructions packed into a 32-bit word which then get fused, what I have achieved is equivalent to adding a complex instruction to the ISA without really adding it.

It is like being able to make arbitrary CISC instructions by combining various RVC instructions.

How is that not a good thing? It is like being able to add tons of instructions without consuming any extra ISA encoding space.

I don’t see why compressed instructions and macro fusion need to be part of the base instructions.

These are just micro-architecture optimizations available to you if you want higher performance.

For a cheap small microcontroller you don’t want it.


AFAICS the argument is that the alternative approach of adding the most common instruction combinations to the base ISA (like base+index<<shift addressing mode) could have largely avoided the need for the C extension as well as macro-op fusion. This would have simplified the implementation of almost all cores, the exception being the smallest possible microcontroller.


But adding compression just costs 400 gates, how on earth is that an issue, even on a small controller?

The C extension saves a lot more memory than these more complex instructions would. So even if you don't add macro fusion, you are still getting advantages from fewer cache misses or less money spent on cache.

You seem to talk theory, when in practice we know the BOOM RISC-V CPU outperforms a ARM-32 Cortex-A9, while requiring half the silicon area. The RISC-V is 0.27 mm2 while the ARM is 0.53 mm2 using same technology.

And what you are missing from the overall picture is that a key requirement for RISC-V is that it is useable in academia and for teaching. It is supposed to be easy for students to learn as well as to implement simple RISC-V CPU cores. All of that is quickly out the window if you go down the ARM road.

That RISC-V pulls off all these things (higher performance, smaller die, simpler implementation, easier to teach) IMHO validates their choices. I don't see how your argument has any legs to stand on.


> But adding compression just costs 400 gates, how on earth is that an issue, even on a small controller?

If it's so cheap and good, why is it an extension and not part of the base then?

Anyway, the problem isn't how few gates you can get away with for a low performance microcontroller, but rather how to design a wide and fast decoder for a higher end core. As the instruction stream isn't self-synchronizing, you need to decode previous instructions to know where the instruction boundary for the next instruction is. Sure, you could speculatively start to decode following instructions, but that gets hairy and consumes extra power.
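
To illustrate that serial dependency, here is a minimal C sketch of the length rule, assuming only the standard 16-bit (compressed) and 32-bit encodings are in play; the reserved longer encodings are ignored:

    #include <stdint.h>
    #include <stddef.h>

    /* Length of one RISC-V instruction, given its first 16-bit parcel:
     * low two bits == 0b11 means a 32-bit instruction, anything else is a
     * 16-bit compressed one.  (48/64-bit encodings exist in the spec but
     * are left out of this sketch.) */
    size_t insn_length(uint16_t first_parcel) {
        return (first_parcel & 0x3) == 0x3 ? 4 : 2;
    }

    /* Finding instruction boundaries is inherently serial: you only know
     * where instruction N+1 starts after looking at instruction N's first
     * parcel.  A wide decoder either does this speculatively at every
     * 16-bit offset or pays for the chained length computation. */
    size_t count_insns(const uint16_t *parcels, size_t n_parcels) {
        size_t count = 0;
        for (size_t i = 0; i < n_parcels; i += insn_length(parcels[i]) / 2)
            count++;
        return count;
    }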

> You seem to talk theory, when in practice we know the BOOM RISC-V CPU outperforms a ARM-32 Cortex-A9, while requiring half the silicon area. The RISC-V is 0.27 mm2 while the ARM is 0.53 mm2 using same technology.

Yes, BOOM is a nice design, and the (original) author used to hang around here on HN. That being said, having read the paper where those area claims were made, I think it's quite hard to do cross-ISA comparisons like this. E.g. the A9 has to carry around all the 32-bit legacy baggage (in fact, it doesn't even support aarch64, which isn't that surprising since it's an old core dating back all the way to 2010), it has a vector floating point unit, it supports the ARM 32-bit compressed ISA, and whatnot.

> And what you are missing from the overall picture is that a key requirement for RISC-V is that it is useable in academia and for teaching. It is supposed to be easy for students to learn as well as to implement simple RISC-V CPU cores. All of that is quickly out the window if you go down the ARM road.

I'm not forgetting that, and that's certainly an argument in favor of RISC-V. That doesn't mean it's a particularly relevant argument for evaluating ISAs for production usage.

I'm not saying RISC-V is a bad idea. Certainly it seems good enough that, combined with the lack of licensing costs as well as geopolitical factors that matter to some prospective users, it has a good future ahead of it. I'm just saying that with some modest changes when the ISA was designed, it could have been even better.


> If it's so cheap and good, why is it an extension and not part of the base then?

I think what you mean is why it is not in the G extension which encompasses IMAFD but not C. I agree that is a bit odd.

I think it would have been very wrong if it was part of the I base instruction set. That should be as minimal as possible.

But I guess a question like this easily becomes very philosophical. For me it makes sense that C is not in G, because C is really an optimization and not about capability. A software developer, toolmaker, etc. would, I think, care more about the instructions available than about particular optimizations.

> E.g. the A9 has to carry around all the 32-bit legacy baggage

But surely that counts in RISC-V's favor, as ARM has no alternative modern minimal 32-bit instruction set. With RISC-V you can use the 64-bit and 32-bit variants with minimal code change.

And I don't see how ARM-64 would have made any of this any better, as it has over 1000 instructions. I am highly skeptical that you can make tiny cores out of that. But I am not a CPU guy so I am okay with being told I am wrong ;-) As long as you can give me a proper reason.

> Doesn't mean that it's a particularly relevant argument for evaluating ISA's for production usage.

True, but I think there is value in the whole package. You see people evaluating RISC-V and finding that, sure, there are commercial offerings performing slightly better, or that they could make a custom ISA that does better. But the conclusion for many is that RISC-V is good enough, and with the growing ecosystem, that still makes it a better choice overall. If you are going to make a custom ISA today, it had better be a lot better than RISC-V to be worth it, I would think.

I would also think there is value for hardware makers in being on the same platform that universities and research institutions are going to be using, as well as the same platform students are going to come out of university knowing.

Anyway, thanks for the discussion. While I am (seemingly) pushing back on everything, I do find this kind of discussion very valuable for learning the pros and cons better. It spurs me to look up and learn more things.


Obviously the plan is to take a short string of bytes, expand it to a longer one, and then identify a single instruction that those bytes map to. This is much cleverer than just mapping the original string of bytes directly to an instruction. /s


> Keeping code small is advantageous to performance because it makes it easier to keep the code you are running inside high speed CPU cache.

It's not called the Iron Law of Performance[0] for nothing:

Time/Program = Instructions/Program * Cycles/Instruction * Time/Cycle

[0]: https://compas.cs.stonybrook.edu/~nhonarmand/courses/sp15/cs...


I've seen quite a few presentations, but never have I seen such a direct approach to showcasing the merit of RISC-V. Well done.


I am really curious about the content of that article. I work closely with different embedded CPU architectures, and in most of our multi-expert evaluations of RISC-V (from different suppliers) it could not compete with commercial offerings (avoiding names).


Alright, I've used the incognito "trick" to gain access to that article, and I must say that it is utter crap, written by a person that has never seen why and how processors are chosen for integration into projects. Comes off as a piece written by a paid pseudo-evangelist. Not going to say that the content is somehow wrong (it's not), but it's totally irrelevant to the success of particular architectures or ISAs in general. Jeez.


> Comes off as a piece written by a paid pseudo-evangelist.

Yeah, I feel like Medium somehow manages to attract that sort of writing. Bizarrely, I've seen this with some authors who write on both Medium and elsewhere -- their Medium content seems weird in the same way. I wonder if it's just that after enough of that content is there, it's the norm, or whether something about the UI makes it feel more "just jot down your thoughts," while a blogpost makes the same authors draft, research, revise, etc., but it's uncanny.


Could you at least elaborate with specifics?


The author has been writing about RISC-V in relation to the Apple M1 and basically fanboying about it. It is basically the same as some guy coming in and saying Rust will overtake C and C++.

And yet the article got interest, since it aligns with the general public and mainstream media view of RISC-V being the rising star. It is free, it must be better, etc.

I could have repeated most of his points with OpenPOWER (Microwatt [1]). And for many applications there are lots of reasons why OpenPOWER is much better than RISC-V. (Rather unfortunate that IBM is loathed by lots of people in the industry.)

[1] https://github.com/antonblanchard/microwatt


A number of actors in the industry will disagree with you. Nvidia, after careful analysis, decided RISC-V was the superior replacement for their general purpose Falcon Chip: https://riscv.org/wp-content/uploads/2017/05/Tue1345pm-NVIDI...

Dave Ditzel, who is no dummy, saw RISC-V as the better choice for the machine learning accelerator Esperanto Technologies is building: https://www.eejournal.com/article/another-risc-v-religious-c...

There are numerous research articles and examples online of companies deciding on RISC-V and their reasons for it.

Unless I misunderstand, I cannot see how OpenPOWER matches RISC-V at all. It is a large ISA, which means you don't have the option of making e.g. small specialized co-processors based on the ISA. Nor can you make simple CPUs for the microcontroller market.

The RISC-V BOOM has half the silicon area requirement of a comparable ARM Cortex CPU; are OpenPOWER-based chips able to match that?

If you already have a large instruction set, I don't quite see the value in an extensible ISA. You cannot keep the ISA lean as microarchitecture best practices change.


I think the appeal lies less in the technical merits of RISC-V than in the fact that you are not locked in to a core vendor and its software tools ecosystem.

RISC-V has commoditized the CPU.

It doesn't need to be orders of magnitude better. Even if it is on par or even marginally worse, people will start using it.


>general purpose Falcon Chip

general purpose embedded processor

All the examples you cite are embedded controllers, whereas the author has been consistently comparing RISC-V to general-purpose CPU ISAs like x86 and ARMv8/AArch64.

>The RISC-V BOOM has half the silicon area requirement of a comparable ARM Cortex CPU; are OpenPOWER-based chips able to match that?

I did include the Microwatt link in the post above.


x86 and ARM are what RISC-V designers frequently compare against in their own documentation, likely because those are well-known architectures.

But your argument doesn't really make sense, as ARM is a frequent choice for this kind of thing. In fact, it is one of the main competing CPUs considered by Nvidia. ARM is widely used in embedded systems. That it became a desktop CPU is a fairly recent phenomenon.

Esperanto Technologies uses RISC-V both for general-purpose processors and for specialized coprocessors.

Anyway, the whole point of RISC-V is to be able to span from small embedded systems to supercomputers.

> I did include the Microwatt link in the post above.

I cannot find any figure for the number of transistors it uses, or any comparison to other designs. Given that it is supposed to implement the IBM POWER ISA v3.0, which is quite huge, I don't see how you can get the transistor count below that of RISC-V while still having decent performance.


I got a similar impression. Discussing addressing modes,

> This is probably a more faithful representation of what happens in the x86 code as well. I doubt you can multiply with anything but multiples of 2, since multiplication is a fairly complex operation.

For an article comparing instruction sets, ‘I dunno and didn't care enough to spend 30 seconds Googling’ does not inspire confidence.


“since it aligns with the general public and mainstream media view of RISC-V being the rising star”

I don’t think “the general public” even knows what “RISC-V” means. It’s more some specific circles where this is going on.

The good thing for those writing such articles is that it’s hard to refute them, as they only discuss the instruction set, not particular designs, and few, if any, people can meaningfully judge how true claims like “this thing my/your favorite CPU does is worth/not worth the transistors; you’ll see it once advanced designs of my favorite CPU exist” are (even more so because feature A may be worth it for a CPU that also does B but doesn’t do C, and that has a given power budget).


On purely commercial terms, RISC-V may not compete. But as already stated in other comments, CPUs and chips are getting political.


Interesting point; it depends on what the parameters were. When it comes to performance/cost compared to ARM processors, RISC-V was seen as a better choice by most semiconductor companies.

Maybe in your experience the criteria might have been different.


> When it comes to performance/cost compared to ARM processors, RISC-V was seen as a better choice by most semiconductor companies.

Can you give some evidence to back this assertion up?


Sorry, internal numbers at two FPGA companies that were benchmarking which hard core to choose.


Interesting - thanks!



I always thought it was insane that CPUs have instructions specific to a particular application in a particular operating system embedded in them. I think the closer we get to the metal, the more abstract and general-purpose things should get. For example, things like "Floating-point Javascript Convert to Signed fixed-point, rounding toward Zero" should not exist at all.


This sounds nice and clean and theoretical, but in practice you get better results by matching the design to the workload. Computers aren't new; we know what the workload might be. Javascript is a couple of decades old. The challenge is getting compilers to use it, which is much easier with non-AOT languages, where you can update the compiler for existing code without involving the developer.

Otherwise you end up using too many instructions to achieve simple things. Sometimes different semantics can be much slower; you can see this when emulating architecture A on architecture B, and it's why Apple's switchable memory-ordering semantics are absolute genius.

Also, this is why GPUs and DSPs exist.


https://community.arm.com/developer/ip-products/processors/b...

> Javascript uses the double-precision floating-point format for all numbers. However, it needs to convert this common number format to 32-bit integers in order to perform bit-wise operations. Conversions from double-precision float to integer, as well as the need to check if the number converted really was an integer, are therefore relatively common occurrences.

> Armv8.3-A adds instructions that convert a double-precision floating-point number to a signed 32-bit integer with round towards zero. Where the integer result is outside the range of a signed 32-bit integer (DP float supports integer precision up to 53 bits), the value stored as the result is the integer conversion modulo 2^32, taking the same sign as the input float.

The semantics of numbers in JavaScript behave in a specific way. Since we are talking about numbers, it hardly gets more low-level, and thus it makes good sense to design ALU operations targeting common numerical operations.
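
As a concrete illustration of what that one instruction buys, here is a rough C sketch of JavaScript's ToInt32 conversion, the semantics FJCVTZS implements in a single step (hedged: this models the ECMAScript rule, not ARM's pseudocode, and js_toint32 is just an illustrative name):

    #include <stdint.h>
    #include <math.h>

    /* Approximation of ECMAScript ToInt32: truncate toward zero, wrap
     * modulo 2^32, reinterpret as a signed 32-bit value.  NaN and the
     * infinities map to 0.  The real instruction also reports, via a
     * condition flag, whether the conversion was exact; that is not
     * modeled here. */
    int32_t js_toint32(double d) {
        if (!isfinite(d))
            return 0;                        /* NaN, +Inf, -Inf -> 0 */
        double t = trunc(d);                 /* round toward zero */
        double m = fmod(t, 4294967296.0);    /* reduce modulo 2^32 (keeps sign of t) */
        if (m < 0)
            m += 4294967296.0;               /* shift into [0, 2^32) */
        return (int32_t)(uint32_t)m;         /* wrap into [-2^31, 2^31) */
    }

Doing this with plain scalar conversion instructions needs an out-of-range check plus extra arithmetic on the slow path, which is the overhead the dedicated instruction is meant to remove from hot JavaScript code.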


The hardware overhead to implement this particular instruction is likely very small.

The floating-point operations themselves are done in fixed point with large registers.

This instruction tells the CPU to bypass part of the circuit. The JavaScript thing is just branding to make devs aware of the instruction's application.

Ironically, this actually gives more control of the hardware rather than increasing abstraction.


The reason CISC became so complicated was that microcode stores were getting larger.

  Once the decision is made to use microprogrammed control, the cost to expand an instruction set is very small; only a few more words of control store. Since the sizes of control memories are often powers of 2, sometimes the instruction set can be made more complex at no extra hardware cost by expanding the microprogram to completely fill the control memory. [1]
The reasons for an instruction like that Javascript one are similar: reports of Moore's Law's demise are greatly exaggerated and DRAMs are getting bigger. Yes, these instructions may offend your sensibilities, but the transistors are available and there are important benchmarks for important customers. Chips+ISAs aren't abelian groups, especially ARM chips+ISAs. RISC-V has extensions which promise to be the wild west all over again.

[1] https://inst.eecs.berkeley.edu/~n252/paper/RISC-patterson.pd...


It sure is an interesting instruction to implement. I can only assume that someone, somewhere, has graphs of commonly executed code sequences and concluded that there was merit in supporting Javascript's requirements in the ISA.


>For example, things like "Floating-point Javascript Convert to Signed fixed-point, rounding toward Zero" should not exist at all.

It should exist, because the alternative would be C-style undefined behavior in Javascript. Remember how everyone says that C is fast because undefined behavior means C does not have to encode the semantics of any particular architecture, and is therefore the fastest on every architecture. Things like bit shifting have subtle differences between architectures and therefore have to be undefined behavior to be fast everywhere.

In Javascript it's the opposite. It cannot have any undefined behavior, because it is untrusted code that has to be executed inside a sandbox and must not infect the user's computer. But people wanted Javascript to be faster, so they simply took the behavior of the fastest and most common architecture that Javascript was used on and put it into the Javascript spec. That architecture was x86, and, to our surprise, ARM is not x86. This means that if you want fast Javascript on ARM, it has to adopt x86 rounding semantics. Compared to the unbounded costs of undefined behavior, it's a microscopic price to pay.


I think the instruction is just misnamed. It definitely shouldn't have the word Javascript in it. It's just a different floating point format. As others have noted, it's not more complicated than any other floating point instruction.


if there is a computation people are doing a lot, and a hardware instruction can make it faster, i will take it.

so in a sense i agree with you, but i only think that because i think javascript shouldn’t exist!


All five reasons mentioned apply to the MIPS ISA as well, no?


It would be awesome if folks would make a collective decision to turn off the Medium paywall on their articles there, or simply make a markdown blog with NextJS and Vercel for free.


Am I the only one who can't read Medium articles anymore?


%gp




First articles on Medium are usually free.


This probably happens because these links aren't paywalled for everyone. I, for instance, have never encountered any paywall on any medium.com article; whether that is because I have my browser configured to discard all cookies whenever the browser is closed (and I close my browser very often), or because I have it configured to disable Javascript by default unless the site is whitelisted (and medium.com is not on the whitelist), or something else, I don't know.


It's both, actually.

The Medium popup only comes after visiting a few articles, and it is shown with JS.


Since I do most of my reading on my phone, I find this iOS shortcut someone posted on HN a few months back of great use for bypassing almost every paywall I have come across.

I extended it to also work on Medium by using Twitter as the referrer.

Shortcut: https://www.icloud.com/shortcuts/95928cdef80a46d892d05a58a88...


The FAQ already says paywalled links are allowed as long as there's a workaround.


It does, but that was a bad decision that has made HN worse; it's time to revisit it.




I see this comment on every single medium article.

Just open it in private mode, or use noscript, or block cookies.



