This is a great post showing why you have to measure the specific tasks you care about rather than relying on general assumptions. Another example I remember seeing was crypto/hashing performance, where you could find embedded processors competing with much faster general-purpose chips because they had dedicated instructions for those use cases, and performance would fall off a cliff if you used different encryption or hashing settings or an unoptimized libssl.
I’d be curious how the unified memory architecture shifts the cost dynamic for GPU acceleration. There’s a fair amount of SIMD work where the cost of copying to/from the GPU is greater than the savings until you get over a particular amount of data, and that threshold should be different on systems like the M1.
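Something like this (a rough, untested sketch of my own, not anything from the article) is how I'd go looking for that threshold: time a blocking host-to-device copy at increasing sizes and compare it against what a CPU-side SIMD pass over the same data costs. The interesting question is where the crossover lands on a unified-memory part like the M1.

    /* Hypothetical sketch (not from the article): time a blocking
     * host->device copy at several buffer sizes. Comparing these times with
     * a CPU SIMD pass over the same data shows where offloading starts to
     * pay for the transfer. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    #include <CL/cl.h>

    static double now_sec(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec * 1e-9;
    }

    int main(void) {
        cl_int err;
        cl_platform_id plat;
        cl_device_id dev;
        clGetPlatformIDs(1, &plat, NULL);
        clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, &dev, NULL);
        cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, &err);
        cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, &err);

        /* 4 KiB up to 64 MiB, multiplying the size by 4 each step. */
        for (size_t size = 1 << 12; size <= (size_t)1 << 26; size <<= 2) {
            float *host = malloc(size);
            for (size_t i = 0; i < size / sizeof(float); i++) host[i] = 1.0f;
            cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_ONLY, size, NULL, &err);

            double t0 = now_sec();
            /* Blocking write: returns only once the copy has completed. */
            clEnqueueWriteBuffer(q, buf, CL_TRUE, 0, size, host, 0, NULL, NULL);
            double copy_ms = (now_sec() - t0) * 1e3;

            printf("%8zu KiB  copy %.3f ms\n", size >> 10, copy_ms);
            clReleaseMemObject(buf);
            free(host);
        }
        clReleaseCommandQueue(q);
        clReleaseContext(ctx);
        return 0;
    }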
> This is a great post showing why you have to measure the specific tasks you care about rather than relying on general assumptions.
If anything, it turns out to be an argument that benchmarks mislead far too easily, and that first-principles arguments (which would quickly refute the idea of a 3.5x slowdown due to vector width) should always be used to double-check your working.
Indeed!
This reminds me of a fun issue I ran into years ago with SIMD code (gcc, Linux). I was experimenting with various vector sizes and found significant slowdowns for some of them. I was about to call it quits, as in 'well, I'll have to do things differently', when I realized it didn't make any sense.
I double-checked the actual values computed by the benchmark, and they turned out to be completely wrong. What I had actually found was a compiler bug!
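The general lesson I took away: have the benchmark check its own answers against a dumb scalar reference before you believe any of the timings. A minimal sketch of the idea (my own reconstruction, not the code I had back then), using GCC's vector extensions:

    /* A sketch of the sanity check: compute the same reduction with a plain
     * scalar loop and with GCC vector extensions, and refuse to report
     * timings if the answers disagree. */
    #include <math.h>
    #include <stdio.h>
    #include <stdlib.h>

    typedef float v8sf __attribute__((vector_size(32)));  /* 8 floats */

    static float sum_scalar(const float *a, size_t n) {
        float s = 0.0f;
        for (size_t i = 0; i < n; i++) s += a[i];
        return s;
    }

    static float sum_vector(const float *a, size_t n) {  /* assumes n % 8 == 0 */
        v8sf acc = {0};
        for (size_t i = 0; i < n; i += 8) {
            v8sf v;
            __builtin_memcpy(&v, a + i, sizeof v);
            acc += v;
        }
        float s = 0.0f;
        for (int j = 0; j < 8; j++) s += acc[j];
        return s;
    }

    int main(void) {
        enum { N = 1 << 16 };
        float *a = malloc(N * sizeof *a);
        for (size_t i = 0; i < N; i++) a[i] = (float)(i % 7) * 0.25f;

        float ref = sum_scalar(a, N), simd = sum_vector(a, N);
        if (fabsf(ref - simd) > 1e-2f * fabsf(ref)) {
            fprintf(stderr, "mismatch: %f vs %f -- don't trust the timings\n",
                    ref, simd);
            return 1;
        }
        printf("results agree: %f\n", ref);
        free(a);
        return 0;
    }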
> I’d be curious how the unified memory architecture shifts the cost dynamic for GPU acceleration.
Correct me if I'm wrong, but is this actually different from the regular integrated graphics that have been in Intel and AMD chips for decades? I remember there being some initiatives from AMD proposing similar offloading under the name HSA almost a decade ago.
I don't think there is actually any software really using it.
I don't think it is different. For example, the OpenCL specification allows for the possibility that data doesn't need to be copied between CPU and GPU.
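For what it's worth, the zero-copy path the spec allows looks roughly like this (a minimal sketch of mine, not code from the article): allocate the buffer with CL_MEM_ALLOC_HOST_PTR and map it into host memory. On an integrated GPU with shared physical memory the runtime is free to back the mapping with the same pages the GPU reads, so nothing has to be copied; whether a given driver actually does that is another matter.

    /* Minimal sketch of the OpenCL zero-copy pattern: allocate with
     * CL_MEM_ALLOC_HOST_PTR and map the buffer so the host writes into
     * memory the device can use directly. On shared-physical-memory GPUs
     * the runtime may avoid any copy; on discrete GPUs the same code works
     * but may still copy behind the scenes. */
    #include <stdio.h>
    #include <CL/cl.h>

    int main(void) {
        cl_int err;
        cl_platform_id plat;
        cl_device_id dev;
        clGetPlatformIDs(1, &plat, NULL);
        clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, &dev, NULL);
        cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, &err);
        cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, &err);

        const size_t size = 1 << 20;
        cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
                                    size, NULL, &err);

        /* Map the buffer and fill it in place -- no explicit WriteBuffer. */
        float *p = clEnqueueMapBuffer(q, buf, CL_TRUE, CL_MAP_WRITE,
                                      0, size, 0, NULL, NULL, &err);
        for (size_t i = 0; i < size / sizeof(float); i++) p[i] = 1.0f;
        clEnqueueUnmapMemObject(q, buf, p, 0, NULL, NULL);
        /* ... enqueue kernels that read buf here ... */

        clReleaseMemObject(buf);
        clReleaseCommandQueue(q);
        clReleaseContext(ctx);
        printf("buffer mapped, filled, and released\n");
        return 0;
    }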
In a recent interview (I think with the Changelog podcast) I heard an Apple engineer explain that the M1 had an advantage over previous systems: not only did the data not need to be copied (which implies this isn't new), but no changes to the format of the data were needed either, given Apple's end-to-end control.
Yeah. It’s a real feat that Apple was able to get heterogeneous computing done (something AMD was touting with OpenCL). Not having to copy data from system RAM to GPU buffers etc. is really great.
That's probably the difference: AMD and Intel implemented zero-copy years ago but no software used it, while the Metal stack on macOS probably does take advantage of it.
One difference (as I understand it) is that on Intel's integrated graphics the RAM used by the GPU is a dedicated segment set aside for the GPU's use. You still need to copy data from the CPU's segment to the GPU's segment. While that might be faster than copying over PCIe, it's still a copy operation. With the M1's GPU there's no segmentation, so no copying.
That's how I understand it works but I might be completely wrong.
> Shared Physical Memory: The host and the device share the same physical DRAM. This is different from shared virtual memory, when the host and device share the same virtual addresses, and is not the subject of this paper. The key hardware feature that enables zero copy is the fact that the CPU and GPU have shared physical memory. Shared physical and shared virtual memories are not mutually exclusive.
Good point. Some Intel chips have had an on-package (on-die?) 128MB "L4 cache" made of DRAM. That certainly sounds a lot like the M1's integrated memory.
> crypto/hashing performance where you could find embedded processors
You mean an ASIC, I guess.
I think this was Apple's idea in the first place: instead of having a purely general-purpose computational machine, why not have some general-purpose silicon alongside specialised silicon for the most common tasks?
After all, isn't the GPU just another specialised unit? Why not have similar stuff for everything relevant?
It's a poor post, much like the last one, if for no other reason than that it's done so sloppily. There's nothing wrong with running simple, informal benchmarks, but at a minimum, showing one's build and run details would make the limitations and outright mistakes more obvious.
No, I didn't. I mean just showing what your run looks like, inline in the blog, which is pretty typical, just like in the comment where someone tried to reproduce the results.