ribit · on June 12, 2024

Modern GPUs are exposing the SIMD behind the SIMT model and investing heavily in SIMD features such as shuffles, votes, and reduces. This leads to an interesting programming model. One interesting challenge is that flow control is done very differently on different hardware. AMD has a separate scalar instruction pipeline which can set the SIMD mask. Apple uses an interesting per-lane stack counter approach where a value of zero means that the lane is active and a non-zero value indicates how many blocks need to be exited for the thread to become active again. Not really sure how Nvidia does it.
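
To make the per-lane counter idea concrete, here is a toy model in Python. It is only a sketch built from the one-sentence description above, not a description of the actual hardware; the function names (`enter_if`, `exit_block`, `active_mask`) are made up for illustration.

```python
# Toy model of per-lane "nesting counter" control flow.
# Assumption: counter == 0 means the lane is active; a non-zero counter is the
# number of enclosing blocks the lane must leave before it runs again.

def enter_if(counters, condition):
    """Enter an `if` block; condition[i] is lane i's branch predicate."""
    out = []
    for count, cond in zip(counters, condition):
        if count > 0:
            out.append(count + 1)   # already inactive: one more block to exit
        elif cond:
            out.append(0)           # active and takes the branch
        else:
            out.append(1)           # active but skips this block
    return out

def exit_block(counters):
    """Leave the innermost block: every inactive lane gets one step closer."""
    return [max(count - 1, 0) for count in counters]

def active_mask(counters):
    """Lanes with a zero counter are the ones that execute."""
    return [count == 0 for count in counters]

lanes = [0, 0, 0, 0]
outer = enter_if(lanes, [True, True, False, False])   # [0, 0, 1, 1]
inner = enter_if(outer, [True, False, True, True])    # [0, 1, 2, 2]
after_inner = exit_block(inner)                       # [0, 0, 1, 1]
after_outer = exit_block(after_inner)                 # [0, 0, 0, 0]
```

The execution mask is never stored separately in this model; it falls out of comparing each counter with zero.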

ribit · on June 11, 2024

How would you envision that working at the hardware level? GPUs are massively parallel devices, they need to keep the scheduler and ALU logic as simple and compact as possible. SIMD is a natural way to implement this. In the real world, SIMT is just SIMD with some additional capabilities for control flow and a programming model that focuses on SIMD lanes as threads of execution.

What’s interesting is that modern SIMT is exposing quite a lot of its SIMD underpinnings, because that allows you to implement things much more efficiently. A hardware-accelerated SIMD sum is way faster than adding values in shared memory.
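
As a rough illustration of why a cross-lane sum is cheap, here is a Python simulation of a shuffle-based tree reduction. It is a sketch of the general technique, not any vendor's instruction; `shuffle_down` and `simd_sum` are hypothetical names, and real hardware does each step as a single cross-lane operation rather than a Python loop.

```python
def shuffle_down(values, delta):
    """Each lane i reads the value held by lane i + delta
    (or keeps its own value if that lane is out of range)."""
    n = len(values)
    return [values[i + delta] if i + delta < n else values[i] for i in range(n)]

def simd_sum(values):
    """Tree reduction over a power-of-two number of lanes.
    Returns (total held by lane 0, number of shuffle+add steps)."""
    n = len(values)
    assert n > 0 and n & (n - 1) == 0, "lane count must be a power of two"
    delta = n // 2
    steps = 0
    while delta >= 1:
        shifted = shuffle_down(values, delta)
        values = [a + b for a, b in zip(values, shifted)]
        delta //= 2
        steps += 1
    return values[0], steps

total, steps = simd_sum(list(range(32)))  # 32 lanes holding 0..31
print(total, steps)                       # prints "496 5"
```

Thirty-two lanes are summed in five steps with no round trip through shared memory and no barrier, which is the point being made above.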

avianes · on June 11, 2024

> GPUs are massively parallel devices, they need to keep the scheduler and ALU logic as simple and compact as possible

The simplest hardware implementation is not always the most compact or the most efficient. This is a misconception; example below.

> SIMT is just SIMD with some additional capabilities for control flow ..

In the Nvidia uarch, it is not. The key part of the Nvidia uarch is the "operand-collector" and the emulation of a multi-ported register-file using banked SRAM (single or dual port). In a classical SIMD uarch, you just retrieve the full vector from the register-file and execute each lane in parallel. In the Nvidia uarch, by contrast, each ALU has an "operand-collector" that tracks and collects the operands of multiple in-flight operations. This makes it possible to read from the register-file in an asynchronous fashion (by "asynchronous" here I mean not all in the same cycle) without introducing any stall.

When a warp is selected, the instruction is decoded, an entry is allocated in the operand-collector of each used ALU, and the list of registers to read is sent to the register-file. The register-file dispatches register reads to the proper SRAM banks (probably with some queuing when read collisions occur). All operand-collectors independently wait for their operands to come from the register-file. When an operand-collector entry has received all the required operands, the entry is marked as ready and can now be selected by the ALU for execution.
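
The flow described above can be sketched as a small Python simulation. Everything here is an assumption made for illustration: registers map to banks by `reg % num_banks`, each bank serves one read per cycle in FIFO order, and `collect_operands` is a made-up name. It is not a model of any real Nvidia part.

```python
from collections import deque

def collect_operands(operations, num_banks=4):
    """operations: one list of source registers per in-flight operation.
    Returns, per operation, the cycle in which its last operand arrives."""
    queues = [deque() for _ in range(num_banks)]
    pending = []
    for op_index, regs in enumerate(operations):
        pending.append(len(regs))
        for reg in regs:
            queues[reg % num_banks].append(op_index)  # assumed bank mapping
    ready_cycle = [0 if count == 0 else None for count in pending]
    cycle = 0
    while any(queues):
        cycle += 1
        for queue in queues:            # each bank serves one read per cycle
            if queue:
                op_index = queue.popleft()
                pending[op_index] -= 1
                if pending[op_index] == 0:
                    ready_cycle[op_index] = cycle  # collector entry is ready
    return ready_cycle

# op 0 and op 1 spread their reads over banks 0..2; op 2 hits bank 3 three times.
print(collect_operands([[0, 1, 2], [4, 5, 6], [3, 7, 11]]))  # prints "[1, 2, 3]"
```

In this toy run the operation with three bank conflicts needs three cycles to collect its operands, but the other two entries become ready in the meantime, so the ALU still has work to pick from.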

That is why (or one of the reasons) you need to sync your threads in the SIMT programming model and not in a SIMD programming model.

Obviously you can emulate a SIMT uarch using a SIMD uarch, but I think that misses the whole point of the SIMT uarch.

Nvidia does all of this because it allows a more compact register-file design (memories with a high number of ports are costly) and probably because it helps make better use of the available compute resources with masked operations.

ribit · on June 12, 2024

In an operand-collector architecture the threads are still executed in lockstep. I don't think this makes the basic architecture less "SIMD-y". Operand collectors are a smart way to avoid multi-ported register files, which enables a more compact implementation. Different vendors use different approaches to achieve a similar result. Nvidia uses operand collectors, Apple uses explicit cache control flags, etc.

> This makes it possible to read from the register-file in an asynchronous fashion (by "asynchronous" here I mean not all in the same cycle) without introducing any stall.

You can still get stalls if an EU is available in a given cycle but not all operands have been collected yet. The way I understand the published patents is that operand collectors are a data gateway to the SIMD units. The instructions are already scheduled at this point and the job of the collector is to signal whether the data is ready. Do modern Nvidia implementations actually reorder instructions based on feedback from operand collectors?

> That is why (or one of the reasons) you need to sync your threads in the SIMT programming model and not in a SIMD programming model.

It is my understanding that you need to synchronize threads when accessing shared memory. Not only can different threads execute on different SIMD units, but threads on the same SIMD unit can also access shared memory over multiple cycles on some architectures. I do not see how thread synchronization relates to operand collectors.

avianes · on June 12, 2024

> In an operand-collector architecture the threads are still executed in lockstep.
> [...]
> It is my understanding that you need to synchronize threads when accessing shared memory.

Not sure what you mean by lockstep here. When an operand-collector entry is ready, it is dispatched for execution as soon as possible (write arbitration aside), even if other operand-collector entries from the same warp are not ready yet (so not really what I would call "thread lock-step"). But it's possible that Nvidia enforces that all threads from a warp must complete before the next warp instruction is sent (I would call that something like "instruction lock-step"). This can simplify data dependency hazard checks. But that is an implementation detail; it's not required by the SIMT scheme.

And yes, it's hard to expose de-synchronization without memory operations, so you only need sync for memory operations (load/store units also have operand-collectors).

> You can still get stalls if an EU is available in a given cycle but not all operands have been collected yet

That's true, but you have multiple operand-collector entries to minimize the probability that no entry is ready. I should have said "to minimize bubbles".

> The way I understand the published patents is that operand collectors are a data gateway to the SIMD units. The instructions are already scheduled at this point and the job of the collector is to signal whether the data is ready. Do modern Nvidia implementations actually reorder instructions based on feedback from operand collectors?

Calling an EU a "SIMD unit" in a SIMT uarch adds a lot of ambiguity, so I'm not sure I understand your point correctly. But yes, the (warp) instruction is already scheduled, but the (ALU) operations are re-scheduled by the operand-collector and its dispatch. In the Nvidia patent they mention the possibility of dispatching operations in an order that prevents write collisions, for example.

ribit · on June 12, 2024

> Not sure what you mean by lockstep here. When an operand-collector entry is ready, it is dispatched for execution as soon as possible (write arbitration aside), even if other operand-collector entries from the same warp are not ready yet (so not really what I would call "thread lock-step"). But it's possible that Nvidia enforces that all threads from a warp must complete before the next warp instruction is sent (I would call that something like "instruction lock-step"). This can simplify data dependency hazard checks. But that is an implementation detail; it's not required by the SIMT scheme.

Hm, the way I understood it is that a single instruction is executed on a 16-wide SIMD unit, thus processing 16 elements/threads/lanes simultaneously (subject to execution mask of course). This is what I mean by "in lockstep". In my understanding the role of the operand collector was to make sure that all register arguments are available before the instruction starts executing. If the operand collector needs multiple cycles to fetch the arguments from the register file, the instruction execution would stall.

So you are saying that my understanding is incorrect and that the instruction can be executed in multiple passes with different masks depending on which arguments are available? What is the benefit as opposed to stalling and executing the instruction only when all arguments are available? To me it seems like the end result is the same, and stalling is simpler and probably more energy efficient (if EUs are power-gated).

> But yes, the (warp) instruction is already scheduled, but the (ALU) operations are re-scheduled by the operand-collector and its dispatch. In the Nvidia patent they mention the possibility of dispatching operations in an order that prevents write collisions, for example.

Ah, that is interesting, so the operand collector provides a limited reordering capability to maximize hardware utilization, right? I must have missed that bit in the patent, that is a very smart idea.

> But it's possible that Nvidia enforces that all threads from a warp must complete before the next warp instruction is sent (I would call that something like "instruction lock-step"). This can simplify data dependency hazard checks. But that is an implementation detail; it's not required by the SIMT scheme.

Is any existing GPU actually doing superscalar execution from the same software thread (I mean the program thread, i.e., warp, not a SIMT thread)? Many GPUs claim dual-issue capability, but that either refers to interleaved execution from different programs (Nvidia, Apple) or a SIMD-within SIMT or maybe even a form of long instruction word (AMD). If I remember correctly, Nvidia instructions contain some scheduling information that tells the scheduler when it is safe to issue the next instruction from the same wave after the previous one started execution. I don't know how others do it, probably via some static instruction timing information. Apple does have a very recent patent describing dependency detection in an in-order processor, no idea whether it is intended for the GPU or something else.

> you have multiple operand-collector entries to minimize the probability that no entry is ready. I should have said "to minimize bubbles".

I think this is essentially what some architectures describe as the "register file cache". What is nice about Nvidia's approach is that it seems to be fully automatic and can really make the best use of a constrained register file.

avianes · on June 12, 2024

> I understood it is that a single instruction is executed on a 16-wide SIMD unit, thus processing 16 elements/threads/lanes simultaneously (subject to execution mask of course). This is what I mean by "in lockstep".

OK, I see; that is definitely not what I understood from my study of the Nvidia SIMT uarch. And yes, I will claim that "the instruction can be executed in multiple passes with different masks depending on which arguments are available" (using your words).

> So the operand collector provides a limited reordering capability to maximize hardware utilization, right?

Yes, that is my understanding, and that's why I claim it's different from "classical" SIMD.

> What is the benefit as opposed to stalling and executing the instruction only when all arguments are available?

That's a good question. Note that I think the Apple GPU uarch does not work like the Nvidia one; my understanding is that the Apple uarch is way closer to a classical SIMD unit. So moving away from Nvidia's original SIMT uarch is definitely not a killer.

That said, I think the SIMT uarch from Nvidia is way more flexible and better at maximizing hardware utilization (executing instructions as soon as possible always helps utilization). And let's say you have 2 warps with complementary masking: with Nvidia's SIMT uarch it is natural to issue both warps simultaneously, and they can be executed in the same cycle on different ALUs/cores. With a classical SIMD uarch it may be possible, but you need extra hardware to handle overlapping warp execution, and even more hardware to enable overlapping more than 2 threads.
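
The complementary-masking case can be illustrated with plain bitmasks. This is a toy sketch in Python with made-up helper names (`can_co_issue`, `lane_utilization`); it only shows the bookkeeping, not how any hardware decides what to issue.

```python
def can_co_issue(mask_a, mask_b):
    """True if no lane is active in both warps (bit i of a mask = lane i)."""
    return (mask_a & mask_b) == 0

def lane_utilization(masks, width=32):
    """Fraction of lanes doing useful work when the masks share one issue slot."""
    combined = 0
    for mask in masks:
        combined |= mask
    return bin(combined).count("1") / width

warp_a = 0x0000FFFF   # lower 16 lanes active (e.g. took the `if` side)
warp_b = 0xFFFF0000   # upper 16 lanes active (e.g. took the `else` side)

print(can_co_issue(warp_a, warp_b))        # prints "True"
print(lane_utilization([warp_a]))          # prints "0.5"
print(lane_utilization([warp_a, warp_b]))  # prints "1.0"
```

Issued alone, each half-masked warp leaves half of the lanes idle; issued together, the two masks fill the whole width.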

Also, Nvidia's operand-collector makes it possible to emulate a multi-ported register-file, which probably helps with register sharing. There are actually multiple patents from Nvidia about non-trivial register allocation within the register-file banks, depending on how the registers will be used, to minimize conflicts.

> Is any existing GPU actually doing superscalar execution from the same software thread (I mean the program thread, i.e., warp, not a SIMT thread)?

It's not obvious what "superscalar" would mean in a SIMT context. For me a superscalar core is a core that can extract instruction parallelism from sequential code (associated with a single thread) and therefore dispatch/issue/execute more than 1 instruction per cycle per thread. With SIMT most of the instruction parallelism is very explicit (as thread parallelism), so it's not really "extracted" (and not from the same thread). But anyway, if your question is whether multiple instructions from a single warp can be executed in parallel (across different threads), then I would say probably yes for Nvidia (not sure, there is very little information available). At the very least, 2 instructions from the same thread block (from the same program, but different warps) should be able to execute in parallel.

> I think this is essentially what some architectures describe as the "register file cache"

I'm not sure about that; there are actually some published papers (and probably some patents) from Nvidia about a register-file cache for SIMT uarchs, and those came after the operand-collector patent. But in the end it really depends on what concept you are referring to with "register-file cache".

In the Nvidia case a "register-file cache" is a cache placed between the register-file and the operand-collector. It makes sense in their case since the register-file has variable latency (depending on collisions) and because it saves SRAM read power.

ribit · on June 13, 2024

> Yes, that is my understanding, and that's why I claim it's different from "classical" SIMD

I understand, yes, it makes sense. Of course, other architectures can make other optimizations, like selecting warps that are more likely to have data ready etc., but Nvidia's implementation does sound like a very smart approach.

> And let's say you have 2 warps with complementary masking: with Nvidia's SIMT uarch it is natural to issue both warps simultaneously, and they can be executed in the same cycle on different ALUs/cores

That is indeed a powerful technique

> It's not obvious what "superscalar" would mean in a SIMT context. For me a superscalar core is a core that can extract instruction parallelism from sequential code (associated with a single thread) and therefore dispatch/issue/execute more than 1 instruction per cycle per thread.

Yes, I meant executing multiple instructions from the same warp/thread concurrently, depending on the execution granularity of course. Executing instructions from different warps in the same block is slightly different, since warps don't need to be at the same execution state. Applying the CPU terminology, a warp is more like a "CPU thread". It does seem like Nvidia indeed moved quite far into the SIMT direction and their threads/lanes can have independent program state. So I think I can see the validity of your arguments that Nvidia can remap SIMD ALUs on the fly to suitable threads in order to achieve high hardware utilization.

> In the Nvidia case a "register-file cache" is a cache placed between the register-file and the operand-collector. It makes sense in their case since the register-file has variable latency (depending on collisions) and because it saves SRAM read power.

Got it, thanks!

P.S. By the way, wanted to thank you for this very interesting conversation. I learned a lot.

ribit · on June 11, 2024

You need to consider this in the context of the relevant task. Nvidia GPUs have extremely high peak performance for GEMM, but when working with LLMs, bandwidth (and RAM capacity) becomes the limiting factor. There is a reason why real ML-focused datacenter Nvidia GPUs use much wider RAM interfaces and a much higher price point. The M2 Ultra might not have the raw compute, but it has a lot of RAM and large caches.

ribit · on June 8, 2024

I fully support the idea of open instruction sets. I am not as much sold on the idea of cookie-cutter one-size-fits-all instruction sets. RISC-V is very nice for teaching CPU basics, and it is a great fit for tiny cores or specialized microcontrollers. Unfortunately, since it has been designed for simplicity, it appears that it makes it harder to build high-performance cores. The RISC-V philosophy for high-performance OoO cores relies on instruction fusion, and thus would require the compiler to emit fusion-friendly sequences for best performance, and these sequences might differ from CPU to CPU. To me this seems to go against the very idea of a common open ISA. We already see quite a lot of fragmentation and I fear it will only get worse as time goes on. More complex instructions that combine multiple processing steps would help, but it seems that the core RISC-V community is opposed to that idea for purely ideological reasons.

camel-cdr · on June 8, 2024

> More complex instructions that combine multiple processing steps would help, but it seems that the core RISC-V community is opposed to that idea for purely ideological reasons

That's not true, the Scalar Efficiency SIG is currently working on such an extension.

See this spreadsheet of discussed instructions: https://docs.google.com/spreadsheets/u/0/d/1dQYU7QQ-SnIoXp9v... and charter: https://github.com/riscv-admin/riscv-scalar-efficiency/blob/...

ribit · on June 8, 2024

I remember last year (?) Qualcomm proposing an ISA extension that brings ARM-like addressing modes and paired stores to RISC-V, and the community reaction being very negative. Happy to hear that there are now initiatives to streamline these proposals and make RISC-V a better fit for high-performance CPUs. I am looking forward to future developments!

camel-cdr · on June 8, 2024

The negative responses were because Qualcomm wanted to remove the C extension from the application profiles.

Qualcomm preferred a strict 32-bit instruction set, with potentially 64-bit naturally aligned instructions. RISC-V is designed for 16-, 32-, 48-, and 64-bit instructions that are 16-bit aligned, and retroactively changing that wouldn't have been a good decision. Both sides of the argument agreed that both options are reasonable and don't hinder high-performance designs.
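
For readers unfamiliar with how the variable-length scheme works: the length of an instruction can be read from the low bits of its first 16-bit parcel. The sketch below follows my understanding of the RISC-V unprivileged spec; the 16/32-bit rule is the established one, while the 48- and 64-bit patterns follow the scheme the spec outlines for longer encodings, which as far as I know is not frozen. `instruction_length_bits` is a hypothetical helper name.

```python
def instruction_length_bits(first_parcel):
    """Length in bits of an instruction, given its first 16-bit parcel."""
    if first_parcel & 0b11 != 0b11:
        return 16                       # compressed (C extension) encoding
    if first_parcel & 0b11100 != 0b11100:
        return 32                       # standard 32-bit encoding
    if first_parcel & 0b111111 == 0b011111:
        return 48
    if first_parcel & 0b1111111 == 0b0111111:
        return 64
    return None  # longer or reserved encodings, not modeled here

print(instruction_length_bits(0x4501))  # c.li a0, 0          -> prints "16"
print(instruction_length_bits(0x0013))  # low half of a nop   -> prints "32"
```

A decoder therefore has to look at every 16-bit boundary to find instruction starts, which is the cost Qualcomm wanted to avoid and the flexibility the other side wanted to keep.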

Qualcomm seems to have accepted this now, as they propose, e.g., 48-bit instructions with larger immediates.

ribit · on June 6, 2024

Bugs notwithstanding (which I agree are a significant concern for Metal), I'd frankly much prefer to work with a well-designed, streamlined API like Metal instead of the needlessly verbose and complex Vulkan.

ribit · on May 24, 2024

No, per-clock performance improvements between M3 and M4 range from 0% to 20%, ignoring the two subtests that benefit from SME. That Twitter post is moot. GB results show high variation, so it is easy enough to cherry-pick pairs of results that show any point you might want. You have to compare result distributions. Some users on the AnandTech forums did that, and the results are very clear.

ribit · on May 24, 2024

Geekbench supports Intel AMX and AVX-512. This is all in GB documentation.

ribit · on May 24, 2024

They have not adopted ARMv9. This is still ARMv8, but with SME.

hajile · on May 24, 2024

ARMv9.0 is very similar to ARMv8.5 (9.0 supersets 8.5 with SVE2, TME, TLA, and CCA), so it's not a massive deal. SME implies v8.7 which is basically identical to v9.2 except for those couple extensions previously mentioned.

I wonder if there is licensing at play though. Apple may have gotten a really great licensing deal on ARMv8 that they wouldn't be offered for ARMv9.

rmccue · on May 24, 2024

From what I’ve read previously, Apple has a special licensing deal already as they were part of founding Arm, although I don’t know if there are any details on exactly how that works.

pkaye · on May 24, 2024

I believe it's an architectural license which lets them design their own cores based on the ARM instruction set. I think a few other companies may have this license but it's not disclosed.

https://www.electronicsweekly.com/news/business/finance/arm-...

zimpenfish · on May 24, 2024

I believe they also don't have a per-chip license cost either (which, at Apple scale, probably adds up.)

astrange · on May 24, 2024

That seems like something people just made up, seeing as Apple didn't use ARM for something like a decade or two after that.

However, Apple basically commissioned ARMv8 in the first place to develop the A/M chips, so that presumably helps.

NobodyNada · on May 24, 2024

Apple cofounded ARM for use in the Newton product line; they released new Newton products from 1993-97 and discontinued them in 1998. They then used ARM again for the iPod, released in 2001.

astrange · on May 25, 2024

They didn't design the iPod ARM SoC though, nor the ones in iPhones for quite some time, and the microcontrollers in Macs for power management and such were not ARM. (I mean, some of them might've been, but the one I'm thinking of was SH or something.)

zimpenfish · on May 24, 2024

> Apple didn't use ARM for something like a decade or two after that.

They used the ARM610 in the Newton in 1993 (ARM was founded in late 1990) and then an 8 year gap to the iPod in 2001 (ARM7TDMI which are ARM designs.) Their first in-house ARM design (I believe) is the iPhone 4 in 2010.

They definitely didn't "architect/design ARM" for nearly a couple of decades after founding ARM, yeah, but they did use them.

ksec · on May 25, 2024

> That seems like something people just made up,

Yes. Watching the misinformation spread by itself and feed back into its own loop is both interesting and tiring. It took months and some hard work to tamp down the idea that Unified Memory is SRAM and something special. But this ARM deal? 5 years and counting.

There are two points:

Apple has an architectural license, which somehow became a "special deal" as if they are the only one with it. They are not; even Ampere Computing has one.

Apple was one of the founders of ARM, and somehow this gave people the impression that they have a "special deal".

And had hajile not stepped up and pointed out that ARMv9 is a superset of ARMv8, etc., I would have had to spend some time looking up the details of the superset just to stamp out this type of nonsense. (Each ARMv8+ and v9 revision has way too many features and optional extensions; I don't even remember which is which.)

And this is not the first time YouTuber Vadim Yuryev has put out something that is completely wrong.

skavi · on May 24, 2024

Does anyone have insight into why arm CPU vendors seem so hesitant about implementing SVE2? ~They seem~ *Apple seems to have no issue with SSVE2 or SME.

Edit: Only Apple has implemented SSVE and SME I think.

brigade · on May 24, 2024

What is the measurable benefit to implementing 128b SVE2? Like, ARM has CPUs that implement that, and it's not even disabled on some chips. So there must be benchmarks somewhere showing how worthwhile it is.

And implementing 256b SVE has different issues depending on how you do it. 4x256b vector ALUs are more power hungry than generally useful. 2x256b is only beneficial over 4x128b if you're limited by decode width, which isn't an issue now that A32/T32 support has been dropped. 3x256b would probably imply 3x128b which would regress existing NEON code. And little cores don't really want to double the transistors spent on vector code, but you can't have a different vector length than the big cores...

hajile · on May 24, 2024

I'd say that the theoretical ability to gang units together would be appealing.

If you have four 128-bit packed SIMD, you must execute 4 different instructions at once or the others go to waste. With SVE, you could (in theory) use all 4 as a single, very wide vector for common operations if there weren't a lot of instructions competing for execution ports. You could even dynamically allocate them based on expected vector size or amount of vector instructions coming down the pipeline.

Additionally, adding two 2048-bit vectors using NEON (128-bit packed SIMD) would require 16 add instructions while SVE would require just one. That's a massive code size reduction which matters for I-cache and the frontend throughput.

ribit · on May 25, 2024

I don't see how this would work out beneficially. Let's say your hardware can join 4x128b units as a virtual 512-bit SVE SIMD unit. This means you have to advertise VL as 512 bits for reasons of consistency. Yes, you will save some entries in the reorder buffer if you encounter a single SVE instruction, but if the code contains independent SVE streams, you will be stalled. Moreover, not all operations will utilize all 512 register bits, so your occupancy might suffer. The only scenario in which I see this feature working out is if you are decode or reorder buffer limited. Neither is a problem for modern high-performance ARM cores. With x86, it might be a different story. From what I understand, AVX512 instructions can be quite large.

Modern out-of-order cores are already good at superscalar execution, so why not let them do their job? 4x128b units give you much more flexibility and better execution granularity.

janwas · on May 26, 2024

On x86 at least, the cost of OoO is astonishing - more pJ per instruction dispatch than the operation itself. Amortizing that over more operations is the whole point of SIMD. I have not yet seen such data for Arm.

That aside, see the "cmp" sibling thread for a major (4x penalty) downside to 4x128.

ribit · on May 26, 2024

Yes, OoO is expensive — after all, that is the cost of performance. Very wide SIMD is great for energy efficiency if that is what your compute patterns require (there is a good reason why GPUs are in-order very wide SMT SIMD processors). Is this the best choice for a general-purpose CPU? That I am not so sure about. A CPU needs to be able to run all kinds of code. A single wide SIMD unit is great for some problems, but it won't deliver good performance if you need more flexibility.

Could you point me to the "cmp" thread you mentioned? I don't know where to look for it.

janwas · on May 26, 2024

I agree with you we do not only want "very wide SIMD", and it seems to me that 2x512-bit (Intel) or 4x256 (AMD) are actually a good middle ground.

Sure, it's https://news.ycombinator.com/item?id=40465090.

ribit · on May 27, 2024

> I agree with you we do not only want "very wide SIMD", and it seems to me that 2x512-bit (Intel) or 4x256 (AMD) are actually a good middle ground.

I'd already classify this as "very wide". And the story is far from being that simple. Intel's 512-bit implementation is very area- and power-hungry, so much so that Intel is dropping the 512-bit SIMD altogether. AMD has 4x add units, but only two are capable of multiplication. So if your code mostly does FP addition, you get good performance. If your workflows are more complex, not so much.

The thing is that on many real-world SIMD workloads, Apple's 4x128bit either matches or outperforms either Intel's or AMD's implementation. And that on a core that runs at a lower clock and has less L1D bandwidth. Flexibility and symmetric ALU capabilities seem to be king here.

> Sure, it's https://news.ycombinator.com/item?id=40465090

Ah, that is what you meant. Thank you for linking the post! My comment would be that this is not about 128b or 256b SIMD per se but about implementation details. There is nothing stopping ARM from designing a core with more mask write ports. Apparently, they felt this was not worth the cost. Other vendors might feel differently. I'd say this is similar to AMD shipping only two FMA units instead of four.

janwas · on May 27, 2024

For very wide, I'm thinking of Semidynamics' 2048-bit HW, which with LMUL=8 gives 2048-byte vectors, or the NEC vector machines.

AFAIK it has not been publicly disclosed why Intel did not get AVX-512 into their e-cores, and I heard surprise and anger over this decision. AMD's version of them (Zen4c) is proof that it is achievable.

I am personally happy with the performance of AMD Genoa e.g. for Gemma.cpp; f32 multipliers are not a bottleneck.

> The thing is that on many real-world SIMD workloads, Apple's 4x128bit either matches or outperforms either Intel's or AMD's implementation

Perhaps, though on VQSort it was more like 50% the performance. And if so, it's more likely due to the astonishingly anemic memory BW on current x86 servers. Bolting on more cores for ever more imbalanced systems does not sound like progress to me, except for poorly optimized, branch-heavy code.

ribit · on May 28, 2024

> Perhaps, though on VQSort it was more like 50% the performance.

I looked at the paper and my interpretation is that the performance delta between M1 (Neon) and the Xeon (AVX2) can be fully explained by the difference in clock (3.7 vs 3.3 GHz) and the difference in L1D bandwidth (48 bytes/cycle vs. 128 bytes/cycle). I don't see any evidence here that narrow SIMD is less efficient.

The AVX-512 is much faster, but that is because it has hardware features (most importantly, compact) that are central to the algorithm. On AVX2 and Neon these are emulated with slower sequences.

janwas · on May 28, 2024

Note that compact/compress are not actually the key enablers: also with AVX-512 we use table lookups for u64 keys, because this allows us to actually partition a vector and write it both to the left and right sides, as opposed to compressing twice and writing those individually.

Isn't the L1d bandwidth tied to the SIMD width, i.e., it would be unachievable on Skylake if also only using 128-bit vectors there?

ribit · on May 28, 2024

> Note that compact/compress are not actually the key enablers: also with AVX-512 we use table lookups for u64 keys, because this allows us to actually partition a vector and write it both to the left and right sides, as opposed to compressing twice and writing those individually.

That is interesting! So do I understand you correctly that the 512b vectors allow you to implement the algorithm more efficiently? That would indeed be a nice argument for longer SIMD

> Isn't the L1d bandwidth tied to the SIMD width, i.e., it would be unachievable on Skylake if also only using 128-bit vectors there?

It's a hardware detail. Intel does tie it to SIMD width, but it doesn't have to be the case. For example, Apple has 4x128b units but can only load up to 48 bytes (I am not sure about the granularity of the loads) per cycle.

janwas · on May 28, 2024

Right, longer vectors let us write more elements at a time.

I agree that the number of L1 load ports (or issue width) is also a parameter: that times the SIMD width gives us the bandwidth. It will be interesting to see what AMD Zen5 brings to the table here.
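
As a back-of-the-envelope check of "ports times width", here is the arithmetic in Python. The 48 and 128 bytes/cycle figures come from this thread; the port counts (2 and 3) are my assumptions chosen to reproduce them, not numbers stated above.

```python
def l1_load_bandwidth(load_ports, bytes_per_load):
    """Peak L1D load bandwidth in bytes per cycle."""
    return load_ports * bytes_per_load

# Assumed: 2 load ports, 64-byte (512-bit) loads.
print(l1_load_bandwidth(2, 64))  # prints "128"
# Same assumed core restricted to 16-byte (128-bit) loads.
print(l1_load_bandwidth(2, 16))  # prints "32"
# Assumed: 3 load ports, 16-byte (128-bit) loads.
print(l1_load_bandwidth(3, 16))  # prints "48"
```

Under these assumptions a design that ties load width to SIMD width loses three quarters of its peak bandwidth when it only uses 128-bit vectors, while a design with more, narrower ports does not depend on vector width in the same way.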

camel-cdr · on May 26, 2024

> but if the code contains independent SVE streams, you will be stalled.

Can you explain why that's bad?

Don't you still get full utilisation of the 4x128b units?

ribit · on May 26, 2024

If you do streaming-type operations on long arrays, yes. If your data sizes are small, however, four smaller units might be more flexible. As a naive example, let's take the popular SIMD acceleration of hash tables. Since the key is likely to be found close to its optimal location, long SIMD will waste compute. With small SIMD however you could do multiple lookups in parallel courtesy of OoO.
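
The hash table example can be made concrete with a small Python sketch. It is a simplification under stated assumptions: one tag per slot, unique tags, linear probing in groups of `width` lanes, and a made-up function name `probe`. It counts lane comparisons, which is only a proxy for wasted work, not for cycles.

```python
def probe(tags, wanted, start, width):
    """Scan `tags` from `start` in groups of `width` lanes (wrapping around).
    Returns (index of the first match or None, lane comparisons performed)."""
    n = len(tags)
    compared = 0
    for base in range(0, n, width):
        group = [(start + base + lane) % n for lane in range(width)]
        compared += width               # one SIMD compare touches every lane
        hits = [idx for idx in group if tags[idx] == wanted]
        if hits:
            return hits[0], compared
    return None, compared

tags = list(range(64))                  # toy table: slot i holds tag i
print(probe(tags, wanted=5, start=3, width=16))  # prints "(5, 16)"
print(probe(tags, wanted=5, start=3, width=64))  # prints "(5, 64)"
```

The key sits two slots from its ideal position, so the 64-lane compare does four times the lane work of the 16-lane one to find the same slot; with narrow units the spare ALUs could be running other lookups instead.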

This is why I like the ARM/Apple design with "regular SIMD" and "streaming SIMD". The regular SIMD is latency-optimized and offers versatile functionality for more flexible data swizzling, while the streaming SIMD uses long vectors and is optimized for throughput.

dzaima · on May 24, 2024

You can't do 2048 bits of addition in one SVE instruction; not portably, at least (and definitely not on any existing hardware). While the maximum SVE register size is 2048 bits, the minimum is 128 bits, and the hardware chooses the supported register size, not the programmer. For portable SVE, your code needs to work for all of those widths, not just the smallest or largest. (of related note is RISC-V RVV, which allows you to group up to 8 registers together, allowing a minimum portable operation width of 128×8 = 1024 bits in a single instruction (and up to 65536×8 = 64KB for hypothetical crazy hardware with max VLEN), but SVE/SVE2 don't have any equivalent)
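
The arithmetic behind this can be written out in a few lines of Python. This is just a sketch of the counting argument; the helper names are made up, and the vector-length-agnostic loop is a plain-Python stand-in for what SVE code does with predicated instructions.

```python
def vector_add_instructions(total_bits, vector_bits):
    """Number of full-width add instructions needed to cover total_bits."""
    return -(-total_bits // vector_bits)  # ceiling division

def rvv_group_bits(vlen_bits, lmul):
    """Bits covered by one RVV instruction on a register group of size lmul."""
    return vlen_bits * lmul

def vla_add(a, b, lanes):
    """Vector-length-agnostic add: the same loop works for any lane count."""
    out = []
    for i in range(0, len(a), lanes):
        out.extend(x + y for x, y in zip(a[i:i + lanes], b[i:i + lanes]))
    return out

# Adding 2048 bits of data on hardware with different vector lengths:
for vl in (128, 256, 512, 2048):
    print(vl, vector_add_instructions(2048, vl))  # 16, 8, 4, 1 instructions

print(rvv_group_bits(128, 8))          # prints "1024"
print(rvv_group_bits(65536, 8) // 8)   # prints "65536" (bytes, i.e. 64 KiB)
```

So the "one instruction" case only holds on hardware that actually implements 2048-bit registers; on a 128-bit SVE implementation the same portable loop issues 16 adds, just like NEON.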

brigade · on May 24, 2024

A for() loop does the same thing at the cost of like 3 instructions. 4x128b has the flexibility that you don't need 512b wide operations on the same data to keep the ALUs fed. If you have 512b wide operations being split to 4x128b instructions, great, otherwise the massive OoOE window of modern chips can decode the next few loop iterations to keep the ALUs fed, or even pull instructions from a completely different kernel.

camel-cdr · on May 25, 2024

> What is the measurable benefit to implementing 128b SVE2

Probably not much. SVE2 has some nicer instructions, but NEON already is quite solid.

> And implementing 256b SVE has different issues depending on how you do it

For in-order and not very aggressively out-of-order cores, having a larger vector length can be very useful to still get a lot of throughput out of your design. It also helps hide memory latency.

Here is a paper: https://ar5iv.labs.arxiv.org/html/2309.06865

and presentation: http://riscv.epcc.ed.ac.uk/assets/files/sc23/Short-reasons-f...

For aggressively out-of-order cores it should, for the most part, just be about decode, and somewhat about memory latency hiding.

> 2x256b is only beneficial over 4x128b if you're limited by decode width [...] 3x256b would probably imply 3x128b which would regress existing NEON code.

I agree, that's why I don't get why people are "excited" for Zen5 to have 512b execution units, instead of 256b ones. At best there won't be a performance improvement for avx/avx2 code, at worst a regression.

janwas · on May 25, 2024

Anyone interested in getting such numbers could run github.com/google/gemma.cpp on Arm hardware with hwy::DisableTargets(HWY_ALL_NEON) or HWY_ALL_SVE to compare the two :) I'd be curious to see the result.

Calling hwy::DispatchedTarget indicates which target is actually being used.

skavi · on May 24, 2024

Masked instructions primarily. But apart from that it’s just a more complete ISA vs NEON. More comparable to AVX512/AVX10.

> 2x256b is only beneficial over 4x128b if you're limited by decode width

This is only true if we ignore more complex instructions and focus on things like adding two vectors.

brigade · on May 24, 2024

What is the percentage gain of using masked instructions on any benchmark/task of your choice? It can be negative on weird kernels that do lots of vector cmp since even ARM decided the cost of more than one write port in the predicate register file wasn't worth it, or if the masking adds lots of unnecessary and possibly false dependencies on the destination registers.

> This is only true if we ignore more complex instructions and focus on things like adding two vectors.

ARM implemented a CPU that had 2x256b SVE and 4x128b NEON. Literally the only benchmarks that benefitted from SVE were because they were limited by the 5-wide decode in NEON.

Do you have an actual real-world counterexample?

janwas · on May 25, 2024

It's great you bring up cmp; it helps to understand why 4x128 is not necessarily as good as 1x512. Quicksort, hardly a 'weird kernel', does comparisons followed by compaction. Because comparisons return a predicate, and the predicate register file has only a single write port, we can only do 128 bits of comparisons per cycle. Ouch.

However, masking can still help our VQSort [1], for example when writing the rightmost partition right to left without stomping on subsequent elements, or in a sorting network, only updating every second element.

[1] https://github.com/google/highway/tree/master/hwy/contrib/so...
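For readers unfamiliar with the "comparisons followed by compaction" pattern, a scalar sketch follows. This is not VQSort's implementation: `CompareLt` and `Compact` are hypothetical names, and the bitmask models the predicate that a vector compare would write.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Compare step: bit i of the result is set when v[i] < pivot. On SVE this
// result would be a predicate register.
std::uint32_t CompareLt(const std::vector<int>& v, int pivot) {
  assert(v.size() <= 32);
  std::uint32_t mask = 0;
  for (std::size_t i = 0; i < v.size(); ++i) {
    mask |= static_cast<std::uint32_t>(v[i] < pivot) << i;
  }
  return mask;
}

// Compaction step: lanes whose mask bit is set are packed to the front in
// their original order, and the remaining lanes follow. Returns the number of
// lanes that went to the front.
std::size_t Compact(std::vector<int>& v, std::uint32_t mask) {
  assert(v.size() <= 32);
  std::vector<int> left, right;
  for (std::size_t i = 0; i < v.size(); ++i) {
    if ((mask >> i) & 1u) {
      left.push_back(v[i]);
    } else {
      right.push_back(v[i]);
    }
  }
  const std::size_t num_left = left.size();
  left.insert(left.end(), right.begin(), right.end());
  v = left;
  return num_left;
}
```

Every partition step needs one compare result before it can compact, so the rate at which predicates can be produced bounds the whole loop.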

skavi · on May 24, 2024

I think it's somewhat unfair to ask for real world examples when there really aren't many people writing optimized SVE code right now. Probably because there are hardly any devices with the extension.

I think the transition from AVX2 to AVX512 is comparable in that it provided not only larger vectors, but also a much nicer ISA. There were certainly a few projects that benefited significantly from that move. simdjson is probably the most famous example [0].

[0]: https://lemire.me/blog/2022/05/25/parsing-json-faster-with-i...

snvzz · on May 25, 2024

>I think it's somewhat unfair to ask for real world examples when there really aren't many people writing optimized SVE code right now. Probably because there are hardly any devices with the extension.

Ironically, on the RISC-V side, RVV 1.0 hardware is readily available and cheap. BananaPI BPI-F3 (spacemiT K1) is RVA22+RVV, as well as some C908-based MCUs.

brigade · on May 24, 2024

CPUs with SVE have been generally available for two years now. SME and AVX-512 got benchmarks written showing them off before the CPUs were even available. Seems fair to me.

simdjson specifically benefitted from Intel's hardware decision to implement a 512b permute from 2x 512b registers with a throughput of 1/cycle. That's area-expensive, which is (probably) why ARM has historically skimped on tbl performance, only changing as of the Cortex-X4.

Anyway simdjson is an argument for 256b/512b vector permute, not 128b SVE.

Having written a lot of NEON and investigated SVE... I disagree that SVE is a nicer ISA. The set of what's 2-operand destructive, what instructions have maskable forms vs. needing movprfx that's only fused on A64FX, and dealing with the intrinsics issues that come from sizeless types are all unneeded headaches. Plus I prefer NEON's variable shift to SVE's variable shifts.

janwas · on May 25, 2024

Fair point about movprfx, I understand they were short on encoding space. This can be mitigated by using *_x versions of intrinsics where masks are not used.

The sizeless headache is anyway there if you want to support RISC-V V, which we do.

One other data point in favor of SVE: its backend in Highway is only 6KLOC vs NEON's 10K, with a similar ratio of #if (indicating less fragmentation, more orthogonal).

skavi · on May 24, 2024

It’s been a while since I looked, but I remember SVE2 being much more usable than SVE. A64FX was SVE IIRC. I think SVE did not do a great job of fully replacing NEON.

neonsunset · on May 24, 2024

This.

AVX512 is all around a nice addition as JIT-based runtimes like .NET (8+) can use it for most common operations: text search, zeroing, copying, floating point conversion, more efficient forms of V256 idioms with AVX512VL (select-like patterns replaced with vpternlog).

SVE2 will follow the same route.
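As an illustration of the select-to-vpternlog replacement mentioned above, here is a scalar model of a three-input bitwise logic operation driven by an 8-bit truth table. `Ternlog`, `Select`, and `kSelectImm` are hypothetical names for this sketch; the indexing (first operand as the most significant index bit) follows my understanding of how VPTERNLOG is documented, so treat it as an assumption and check the instruction reference.

```cpp
#include <cstdint>

// For each bit position, the bits of a, b and c form a 3-bit index
// (a is the most significant) into the truth table imm8.
std::uint32_t Ternlog(std::uint32_t a, std::uint32_t b, std::uint32_t c,
                      std::uint8_t imm8) {
  std::uint32_t result = 0;
  for (int bit = 0; bit < 32; ++bit) {
    const unsigned idx = (((a >> bit) & 1u) << 2) |
                         (((b >> bit) & 1u) << 1) |
                         ((c >> bit) & 1u);
    result |= static_cast<std::uint32_t>((imm8 >> idx) & 1u) << bit;
  }
  return result;
}

// Conventional select: take b where mask is set, c elsewhere. Written this
// way it costs three logic operations (AND, ANDNOT, OR).
std::uint32_t Select(std::uint32_t mask, std::uint32_t b, std::uint32_t c) {
  return (mask & b) | (~mask & c);
}

// Truth table for "a ? b : c": rows 7..0 read 1,1,0,0,1,0,1,0.
constexpr std::uint8_t kSelectImm = 0xCA;
```

`Ternlog(mask, b, c, kSelectImm)` gives the same result as `Select(mask, b, c)`, which is why the three-operation idiom can collapse into one instruction.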

hajile · on May 24, 2024

SVE2 is an extension on top of SVE which some stuff already implements. The issue is more likely to be the politics of moving to ARMv9 than anything else.

As to SVE though, I'd guess variable execution time makes the implementation require a bit of work. Normally, multi-cycle tasks have a fixed number. Your scheduler knows that MUL takes N cycles and plans accordingly.

SVE seems like it should require anywhere from N to M cycles depending on what is passed. That must be determined and scheduled around. This would affect the OoO parts of the core all the way from ordering through to the end of the pipeline.

That's definitely bordering on new uarch territory and if that is the case, it would take 4-5 years from start to finish to implement. This would explain why all the ARMv8 guys never got around to it. ARMv9 makes it mandatory, but that was released in 2021 or so which means non-ARM implementors probably have a ways to go.

dzaima · on May 24, 2024

SVE doesn't need variable-execution-time instructions, outside of perhaps masked load/store, but those are already non-constant. Everything else is just traditional instructions (given that, from the perspective of the hardware, it has a fixed vector size), with a blend.
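The "traditional instruction with a blend" view can be sketched in scalar code. The names `Vec`, `Mask`, `Add`, `Blend`, and `MaskedAdd` are hypothetical, and the four-lane width is an arbitrary stand-in for whatever fixed size a given implementation has.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kLanes = 4;  // fixed for a given implementation
using Vec = std::array<std::int32_t, kLanes>;
using Mask = std::array<bool, kLanes>;

// Full-width add: always processes every lane, so its cost is constant.
Vec Add(const Vec& a, const Vec& b) {
  Vec r{};
  for (std::size_t i = 0; i < kLanes; ++i) r[i] = a[i] + b[i];
  return r;
}

// Blend: picks `yes` where the mask is set and `no` elsewhere.
Vec Blend(const Mask& m, const Vec& yes, const Vec& no) {
  Vec r{};
  for (std::size_t i = 0; i < kLanes; ++i) r[i] = m[i] ? yes[i] : no[i];
  return r;
}

// Merging-predicated add: inactive lanes keep the value of a. The work done
// does not depend on how many mask bits are set.
Vec MaskedAdd(const Mask& m, const Vec& a, const Vec& b) {
  return Blend(m, Add(a, b), a);
}
```

Both steps touch all lanes regardless of the mask contents, which is the sense in which the predicated operation stays fixed-latency.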

ribit · on May 25, 2024

I am curious, which SVE instructions imply variable execution time? I’d guess that first fault load could be tricky to implement…

skavi · on May 24, 2024

This isn’t a convincing explanation to me. There are plenty of variable latency instructions on existing high performance arm64 cores.

anticensor · on May 25, 2024

Variable execution time instructions can always be divided into smaller fixed execution time microinstructions.

ribit · on May 24, 2024

What do you mean? Apple is the only one who has an SME/SSVE implementation.

skavi · on May 24, 2024

I misremembered. Looks like it is only Apple. I appreciate the correction.

ribit · on May 24, 2024

My guess is that Apple is simply not interested in some of the ARMv9 features. They are not eager to implement SVE and the secure virtualization features are probably not that relevant to them.

axoltl · on May 24, 2024

Yep, the binaries are all arm64e.

saagarjha · on May 24, 2024

This doesn’t really say much

ribit · on May 7, 2024

Right, so you are disabling all performance features and effectively turning your CPU into a low-end low-power SKU. Of course you’d get better battery life. It’s not the same thing though.

ribit · on May 7, 2024

M1 Ultra did benchmark close to 3090 in some synthetic gaming tests. The claim was not outlandish, just largely irrelevant for any reasonable purpose.

Apple does usually explain their testing methodology and they don’t cheat on benchmarks like some other companies. It’s just that the results are still marketing and should be treated as such.

Outlandish claims notwithstanding, I don’t think anyone can deny the progress they achieved with their CPU and especially GPU IP. Improving performance on complex workloads by 30–50% in a single year is very impressive.

tsimionescu · on May 8, 2024

It did not get anywhere close to a 3090 in any test when the 3090 was running at full power. They were only comparable at specific power usage thresholds.

LoganDark · on May 11, 2024

Different chips are generally compared at similar power levels, ime. If you ran 400 watts through an M1 Ultra and somehow avoided instantly vaporizing the chip in the process, I'm sure it wouldn't be far behind the 3090.

AlphaSeagull · on May 17, 2024

Ok, but that doesn't matter if you can't actually run 400 watts through an M1 Ultra. If you wanna compare how efficient a chip is, sure, that's a great way to test. But you can't make the claim that your chip is as good as a 3090 if the end user is never going to see the performance of an actual 3090.