
Compressed Instructions and Macro-Operation Fusion

It's strange to put these two together as RISC-V's killer feature. First, they are not new; they go back more than a decade. But more to the point, they are at odds.

Instruction compression says that DRAM+cache are a scarce resource. Macro-op fusion says don't worry about those big instructions, we'll combine them after decode. Which is it? BTW, micro-op fusion is not free. It has to search the instruction stream for possible matches.
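To make that cost concrete, here is roughly the pairwise matching a fusing decoder does every cycle. This is only a sketch: the decoded_insn struct and the idiom table are invented for illustration, not taken from any real core.

    /* Sketch of the pair matching a fusing decoder performs each cycle.
       decoded_insn and the idiom list are invented for illustration. */
    #include <stdbool.h>
    #include <stddef.h>

    typedef enum { OP_LUI, OP_ADDI, OP_SLLI, OP_ADD, OP_OTHER } opcode_t;

    typedef struct {
        opcode_t op;
        int rd, rs1;    /* destination and first source register */
    } decoded_insn;

    /* A pair fuses only if the opcodes match a known idiom AND the
       second instruction consumes the first one's result. */
    static bool fusable(const decoded_insn *a, const decoded_insn *b) {
        bool idiom = (a->op == OP_LUI  && b->op == OP_ADDI)   /* build constant */
                  || (a->op == OP_SLLI && b->op == OP_ADD);   /* scaled index  */
        return idiom && b->rs1 == a->rd;
    }

    /* Comparing every adjacent slot in the decode window is the
       "not free" part: extra logic on the decode critical path.
       Caller zero-initializes fuse[]. */
    static size_t mark_fusions(const decoded_insn *win, size_t n, bool *fuse) {
        size_t pairs = 0;
        for (size_t i = 0; i + 1 < n; i++) {
            fuse[i] = fusable(&win[i], &win[i + 1]);
            if (fuse[i]) { pairs++; i++; }   /* skip the consumed slot */
        }
        return pairs;
    }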




With compression you avoid the cost of having multiple instructions to fuse. I don't see why that isn't a good combination.

With two instructions compressed into the space of one and then fused into one micro-op, you get less pressure on the cache while still getting high throughput in the micro-op execution pipeline.

The "killer feature" here is about designing the ISA around the fact that one knows macro-op fusion and compression exists.

E.g. the x86 ISA was designed without any thought to the possible existence of macro-op fusion.

ARM also seems to have mostly ignored its existence when designing their ISA for high-end processors. Innovation is often just about using existing techniques in clever new ways.

Btw Macro-op fusion and micro-op fusion are not the same thing.


Doh. Yes, I fat fingered micro-op for macro-op.

> With compression you avoid the cost of having multiple instructions to fuse.

I don't understand this. RVC is only a 16-bit 'abbreviation' for a subset of RISC-V instructions. It's ultimately the same instruction.
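For example, expanding C.ADDI back into the full ADDI is a pure bit rearrangement. A sketch (field positions per the RVC spec; the rd != 0 / imm != 0 validity checks a real decoder needs are omitted):

    #include <stdint.h>

    /* Expand the 16-bit C.ADDI into the 32-bit ADDI it abbreviates. */
    uint32_t expand_c_addi(uint16_t c) {
        uint32_t rd  = (c >> 7) & 0x1f;            /* rd doubles as rs1 */
        int32_t  imm = (int32_t)(((c >> 2) & 0x1f) | (((c >> 12) & 1) << 5));
        imm = (imm ^ 0x20) - 0x20;                 /* sign-extend 6 bits */
        /* addi rd, rd, imm : imm[11:0] rs1 funct3=000 rd opcode=0010011 */
        return ((uint32_t)(imm & 0xfff) << 20) | (rd << 15) | (rd << 7) | 0x13;
    }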

Also, RVC is an extension, not part of the base ISA, and Andrew Waterman's thesis [1] doesn't even mention macro-op fusion. FWIW, the A72 uses macro-op fusion [2]. Neither architecture ruled it out, but then neither was anticipating it. It's just a microarchitectural optimization.

No, the x86 (1978) certainly wasn't designed with macro-op fusion in mind. Instead, macro-op fusion (2000) was developed by Intel to improve x86 performance [3].

[1] https://people.eecs.berkeley.edu/~krste/papers/EECS-2016-1.p...

[2] https://techreport.com/review/28189/inside-arms-cortex-a72-m...

[3] https://patents.google.com/patent/US6675376


If I get two 16-bit instructions packed into a 32-bit word which then get fused, what I have achieved is equivalent to adding a complex instruction to the ISA without really adding it.

It is like being able to make arbitrary CISC-style instructions by combining various RVC instructions.

How is that not a good thing? It is like being able to add tons of instructions without consuming any extra ISA encoding space.
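As a toy model of that (the macro_op struct is invented, and the nonzero-register checks the spec requires are skipped), the pair c.slli + c.add sitting in one 32-bit fetch word can be recognized as a single shifted-add, the kind of compound operation a CISC ISA would encode directly:

    #include <stdbool.h>
    #include <stdint.h>

    /* Two 16-bit instructions in one 32-bit fetch word become one
       internal op with semantics  rd = (rd << shamt) + rs2. */
    typedef struct { unsigned shamt, rd, rs2; } macro_op;

    bool try_fuse_shift_add(uint16_t lo, uint16_t hi, macro_op *out) {
        bool is_slli = (lo & 0xe003) == 0x0002;    /* c.slli rd, shamt */
        bool is_add  = (hi & 0xf003) == 0x9002;    /* c.add  rd, rs2   */
        unsigned rd_lo = (lo >> 7) & 0x1f, rd_hi = (hi >> 7) & 0x1f;
        if (!is_slli || !is_add || rd_lo != rd_hi)
            return false;                          /* not a fusable pair */
        out->shamt = ((lo >> 2) & 0x1f) | (((lo >> 12) & 1) << 5);
        out->rd    = rd_lo;
        out->rs2   = (hi >> 2) & 0x1f;
        return true;
    }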

I don't see why compressed instructions and macro fusion need to be part of the base instruction set.

These are just micro-architecture optimizations available to you if you want higher performance.

For a cheap small microcontroller you don’t want it.


AFAICS the argument is that the alternative approach of adding the most common instruction combinations to the base ISA (like base+index<<shift addressing mode) could have largely avoided the need for the C extension as well as macro-op fusion. This would have simplified the implementation of almost all cores, the exception being the smallest possible microcontroller.
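As an illustration (compiler output sketched from memory, so exact registers and scheduling will vary):

    #include <stdint.h>

    /* The addressing-mode argument in one line of C. On AArch64 this is
       a single  ldr x0, [x0, x1, lsl #3]  thanks to the base+index<<shift
       mode; base RV64I needs roughly
           slli a1, a1, 3
           add  a0, a0, a1
           ld   a0, 0(a0)
       which is exactly the sequence macro-op fusion (or the later Zba
       extension's sh3add) tries to turn back into one operation. */
    int64_t index_load(const int64_t *base, uint64_t i) {
        return base[i];
    }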


But adding compression costs just 400 gates; how on earth is that an issue, even on a small microcontroller?

The C extension saves a lot more memory than those more complex instructions would. So even if you don't add macro fusion, you still get the advantage of fewer cache misses, or less money spent on cache.

You seem to be talking theory, when in practice we know the BOOM RISC-V CPU outperforms an ARM 32-bit Cortex-A9 while requiring half the silicon area: 0.27 mm² for the RISC-V core versus 0.53 mm² for the ARM on the same process technology.

And what you are missing from the overall picture is that a key requirement for RISC-V is that it is usable in academia and for teaching. It is supposed to be easy for students to learn, as well as to implement simple RISC-V CPU cores. All of that quickly goes out the window if you go down the ARM road.

That RISC-V pulls off all these things (higher performance, smaller die, simpler implementation, easier teaching) IMHO validates their choices. I don't see how your argument has a leg to stand on.


> But adding compression costs just 400 gates; how on earth is that an issue, even on a small microcontroller?

If it's so cheap and good, why is it an extension and not part of the base then?

Anyway, the problem isn't how few gates you can get away with for a low performance microcontroller, but rather how to design a wide and fast decoder for a higher end core. As the instruction stream isn't self-synchronizing, you need to decode previous instructions to know where the instruction boundary for the next instruction is. Sure, you could speculatively start to decode following instructions, but that gets hairy and consumes extra power.
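To make the boundary problem concrete, here is the serial version of boundary finding, simplified to the 16- and 32-bit lengths in current use:

    #include <stddef.h>
    #include <stdint.h>

    /* Find instruction boundaries in a RISC-V stream. With the C
       extension, length hides in the low two bits of the first 16-bit
       parcel (== 0b11 means 32-bit, anything else 16-bit), so boundary
       N+1 depends on boundary N. Trivial in this serial loop; a wide
       parallel decoder must instead speculate a boundary at every
       halfword, or wait. */
    size_t find_boundaries(const uint16_t *parcels, size_t n, size_t *starts) {
        size_t count = 0, i = 0;
        while (i < n) {
            starts[count++] = i;                         /* parcel index  */
            i += ((parcels[i] & 0x3) == 0x3) ? 2 : 1;    /* 32- vs 16-bit */
        }
        return count;
    }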

> You seem to be talking theory, when in practice we know the BOOM RISC-V CPU outperforms an ARM 32-bit Cortex-A9 while requiring half the silicon area: 0.27 mm² for the RISC-V core versus 0.53 mm² for the ARM on the same process technology.

Yes, BOOM is a nice design, and the (original) author used to hang around here on HN. That being said, having read the paper where those area claims were made, I think it's quite hard to do cross-ISA comparisons like this. E.g. the A9 has to carry around all the 32-bit legacy baggage (in fact, it doesn't even support aarch64, which isn't that surprising since it's an old core dating back all the way to 2010), it has a vector floating point unit, it supports the ARM 32-bit compressed ISA (Thumb-2), and whatnot.

> And what you are missing from the overall picture is that a key requirement for RISC-V is that it is usable in academia and for teaching. It is supposed to be easy for students to learn, as well as to implement simple RISC-V CPU cores. All of that quickly goes out the window if you go down the ARM road.

I'm not forgetting that, and that's certainly an argument in favor of RISC-V. Doesn't mean that it's a particularly relevant argument for evaluating ISAs for production usage.

I'm not saying RISC-V is a bad idea. It certainly seems good enough that, combined with the absence of licensing costs and the geopolitical factors that matter to some prospective users, it has a good future ahead of it. I'm just saying that with some modest changes when the ISA was designed, it could have been even better.


> If it's so cheap and good, why is it an extension and not part of the base then?

I think what you mean is why it is not in G, which encompasses IMAFD but not C. I agree that is a bit odd.

I think it would have been very wrong if it were part of the I base instruction set. That should be as minimal as possible.

But I guess a question like this easily becomes very philosophical. For me it makes sense that C is not in G, because C is really an optimization and not about capability. A software developer, tool maker, etc. cares more about which instructions are available than about particular optimizations, I think.

> E.g. the A9 has to carry around all the 32-bit legacy baggage

But surely that counts in RISC-V's favor, as ARM has no modern, minimal 32-bit alternative. With RISC-V you can use the 64-bit and 32-bit variants with minimal code change.

And I don't see how ARM-64 would have made any of this better, as it has over 1000 instructions. I am highly skeptical that you can make tiny cores out of that. But I am not a CPU guy, so I am okay with being told I am wrong ;-) as long as you can give me a proper reason.

> Doesn't mean that it's a particularly relevant argument for evaluating ISAs for production usage.

True, but I think there is value in the whole package. You see people evaluating RISC-V and finding that, sure, there are commercial offerings performing slightly better, or they could make a custom ISA that does better. But the conclusion for many is that RISC-V is good enough, and with the growing ecosystem, that still makes it the better choice in sum. If you are going to make a custom ISA today, it had better be a lot better than RISC-V to be worth it, I would think.

I would also think there is value for hardware makers in being on the same platform that universities and research institutions are going to be using, as well as the platform students will come out of university knowing.

Anyway, thanks for the discussion. While I am (seemingly) pushing back on everything, I do find this kind of discussion very valuable for learning the pros and cons better. It spurs me to look things up and learn more.


Obviously the plan is to take a short string of bytes, expand it to a longer one, and then identify a single instruction that those bytes map to. This is much cleverer than just mapping the original string of bytes directly to an instruction. /s



