This is a great post showing why you have to measure the specific tasks you care about rather than relying on general assumptions. Another example I remember seeing was crypto/hashing performance, where you could find embedded processors competing with much faster general-purpose chips because they had dedicated instructions for those use cases, and performance would fall off a cliff if you used different encryption or hashing settings or an unoptimized libssl.
I’d be curious how the unified memory architecture shifts the cost dynamic for GPU acceleration. There’s a fair amount of SIMD work where the cost of copying to/from the GPU is greater than the savings until you get over a particular amount of data, and that threshold should be different on systems like the M1.
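Something like this (a rough, untested sketch of my own, not anything from the article) is how I'd go looking for that threshold: time a blocking host-to-device copy at increasing sizes and compare it against what a CPU-side SIMD pass over the same data costs. The interesting question is where the crossover lands on a unified-memory part like the M1.

    /* Hypothetical sketch (not from the article): time a blocking
     * host->device copy at several buffer sizes. Comparing these times with
     * a CPU SIMD pass over the same data shows where offloading starts to
     * pay for the transfer. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    #include <CL/cl.h>

    static double now_sec(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec * 1e-9;
    }

    int main(void) {
        cl_int err;
        cl_platform_id plat;
        cl_device_id dev;
        clGetPlatformIDs(1, &plat, NULL);
        clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, &dev, NULL);
        cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, &err);
        cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, &err);

        /* 4 KiB up to 64 MiB, multiplying the size by 4 each step. */
        for (size_t size = 1 << 12; size <= (size_t)1 << 26; size <<= 2) {
            float *host = malloc(size);
            for (size_t i = 0; i < size / sizeof(float); i++) host[i] = 1.0f;
            cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_ONLY, size, NULL, &err);

            double t0 = now_sec();
            /* Blocking write: returns only once the copy has completed. */
            clEnqueueWriteBuffer(q, buf, CL_TRUE, 0, size, host, 0, NULL, NULL);
            double copy_ms = (now_sec() - t0) * 1e3;

            printf("%8zu KiB  copy %.3f ms\n", size >> 10, copy_ms);
            clReleaseMemObject(buf);
            free(host);
        }
        clReleaseCommandQueue(q);
        clReleaseContext(ctx);
        return 0;
    }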
> This is a great post showing why you have to measure the specific tasks you care about rather than relying on general assumptions.
If anything, it turns out to be an argument that benchmarks mislead far too easily, and that first-principles arguments (which would quickly refute the idea of a 3.5x slowdown due to vector width) should always be used to double-check your working.
Indeed!
This reminds me of a fun issue I ran into years ago with SIMD code (gcc, Linux). I was experimenting with various vector sizes and found significant slowdowns for some of them. I was about to call it quits, as in 'well, I'll have to do things differently', when I realized it didn't make any sense.
I double-checked the actual values computed by the benchmark, and they turned out to be completely wrong. What I had actually found was a compiler bug!
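The general lesson I took away: have the benchmark check its own answers against a dumb scalar reference before you believe any of the timings. A minimal sketch of the idea (my own reconstruction, not the code I had back then), using GCC's vector extensions:

    /* A sketch of the sanity check: compute the same reduction with a plain
     * scalar loop and with GCC vector extensions, and refuse to report
     * timings if the answers disagree. */
    #include <math.h>
    #include <stdio.h>
    #include <stdlib.h>

    typedef float v8sf __attribute__((vector_size(32)));  /* 8 floats */

    static float sum_scalar(const float *a, size_t n) {
        float s = 0.0f;
        for (size_t i = 0; i < n; i++) s += a[i];
        return s;
    }

    static float sum_vector(const float *a, size_t n) {  /* assumes n % 8 == 0 */
        v8sf acc = {0};
        for (size_t i = 0; i < n; i += 8) {
            v8sf v;
            __builtin_memcpy(&v, a + i, sizeof v);
            acc += v;
        }
        float s = 0.0f;
        for (int j = 0; j < 8; j++) s += acc[j];
        return s;
    }

    int main(void) {
        enum { N = 1 << 16 };
        float *a = malloc(N * sizeof *a);
        for (size_t i = 0; i < N; i++) a[i] = (float)(i % 7) * 0.25f;

        float ref = sum_scalar(a, N), simd = sum_vector(a, N);
        if (fabsf(ref - simd) > 1e-2f * fabsf(ref)) {
            fprintf(stderr, "mismatch: %f vs %f -- don't trust the timings\n",
                    ref, simd);
            return 1;
        }
        printf("results agree: %f\n", ref);
        free(a);
        return 0;
    }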
> I’d be curious how the unified memory architecture shifts the cost dynamic for GPU acceleration.
Correct me if I'm wrong, but is this actually different from the regular integrated graphics that have been in Intel and AMD chips for decades? I remember there being some initiatives from AMD proposing similar offloading under the name HSA almost a decade ago.
I don't think there is actually any software really using it.
I don't think it is different. For example, the OpenCL specification allows for the possibility that data doesn't need to be copied between CPU and GPU.
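For what it's worth, the zero-copy path the spec allows looks roughly like this (a minimal sketch of mine, not code from the article): allocate the buffer with CL_MEM_ALLOC_HOST_PTR and map it into host memory. On an integrated GPU with shared physical memory the runtime is free to back the mapping with the same pages the GPU reads, so nothing has to be copied; whether a given driver actually does that is another matter.

    /* Minimal sketch of the OpenCL zero-copy pattern: allocate with
     * CL_MEM_ALLOC_HOST_PTR and map the buffer so the host writes into
     * memory the device can use directly. On shared-physical-memory GPUs
     * the runtime may avoid any copy; on discrete GPUs the same code works
     * but may still copy behind the scenes. */
    #include <stdio.h>
    #include <CL/cl.h>

    int main(void) {
        cl_int err;
        cl_platform_id plat;
        cl_device_id dev;
        clGetPlatformIDs(1, &plat, NULL);
        clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, &dev, NULL);
        cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, &err);
        cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, &err);

        const size_t size = 1 << 20;
        cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
                                    size, NULL, &err);

        /* Map the buffer and fill it in place -- no explicit WriteBuffer. */
        float *p = clEnqueueMapBuffer(q, buf, CL_TRUE, CL_MAP_WRITE,
                                      0, size, 0, NULL, NULL, &err);
        for (size_t i = 0; i < size / sizeof(float); i++) p[i] = 1.0f;
        clEnqueueUnmapMemObject(q, buf, p, 0, NULL, NULL);
        /* ... enqueue kernels that read buf here ... */

        clReleaseMemObject(buf);
        clReleaseCommandQueue(q);
        clReleaseContext(ctx);
        printf("buffer mapped, filled, and released\n");
        return 0;
    }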
In a recent interview (I think with the Changelog podcast) I heard an Apple engineer explain that the M1 had an advantage over previous systems: not only did the data not need to be copied (which implies this isn't new), but no changes to the format of the data were needed either, given Apple's end-to-end control.
Yeah. It’s a real feat that Apple was able to get heterogeneous computing done (something AMD was touting with OpenCL). Not having to copy data from system RAM to GPU buffers etc. is really great.
That's probably the difference: AMD and Intel implemented zero-copy years ago but no software used it, while the Metal stack on macOS probably does take advantage of it.
One difference (as I understand it) is that on Intel's integrated graphics the RAM used by the GPU is a dedicated segment set aside for the GPU's use. You still need to copy data from the CPU's segment to the GPU's segment. While that might be faster than copying over PCIe, it's still a copy operation. With the M1's GPU there's no segmentation, so no copying.
That's how I understand it works but I might be completely wrong.
> Shared Physical Memory: The host and the device share the same physical DRAM. This is different from shared virtual memory, when the host and device share the same virtual addresses, and is not the subject of this paper. The key hardware feature that enables zero copy is the fact that the CPU and GPU have shared physical memory. Shared physical and shared virtual memories are not mutually exclusive.
Good point. Some Intel chips have had an on-package (on-die?) 128MB "L4 cache" made of DRAM. That certainly sounds a lot like the M1's integrated memory.
> crypto/hashing performance where you could find embedded processors
You mean an ASIC, I guess.
I think this was Apple's idea in the first place: instead of having a purely general-purpose computational machine, why not have some general-purpose silicon alongside specialised silicon for the most common tasks?
After all, isn't the GPU just another specialised unit? Why not have similar stuff for everything relevant?
It's a poor post, much like the last one, if for no other reason than that it's done so sloppily. There's nothing wrong with running simple, informal benchmarks, but at a minimum, showing one's build and run details would make the limitations and outright mistakes more obvious.
No, I didn't. I mean just showing what your run looks like, inline in the blog, which is pretty typical, just like in the comment where someone tried to reproduce the results.