Incidentally, that's another case where the 512-bit-ness is the least interesting part. The new instructions are useful for efficiently emulating ARM NEON (Switch) and Cell SPU (PlayStation 3) code, but those platforms are themselves only 128 bits wide, so I don't believe the emulators have any use for the 512-bit (or even 256-bit?) variants of the AVX-512 instructions.
I haven't looked into the code for these, but are they possibly pipelining multiple ops per clock? If the ops aren't dependency-chained, the CPU can probably execute several of them in the same cycle.
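
To make the dependency-chain point concrete, here's a minimal C sketch. It is not taken from any of the code being discussed, and the function names are made up for illustration. The first loop feeds every add into the next one, so the adds have to happen one after another. The second splits the work across four accumulators that don't depend on each other, which is the kind of code a superscalar, out-of-order core can overlap.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One accumulator: each add needs the result of the previous add,
   so the adds form a single serial dependency chain. */
uint64_t sum_chained(const uint32_t *v, size_t n) {
    uint64_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += v[i];
    return acc;
}

/* Four accumulators: four independent chains. Nothing in one chain
   waits on another, so the hardware is free to overlap them. */
uint64_t sum_independent(const uint32_t *v, size_t n) {
    uint64_t a0 = 0, a1 = 0, a2 = 0, a3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        a0 += v[i];
        a1 += v[i + 1];
        a2 += v[i + 2];
        a3 += v[i + 3];
    }
    for (; i < n; i++)  /* leftover elements when n is not a multiple of 4 */
        a0 += v[i];
    return a0 + a1 + a2 + a3;
}

/* Both versions must return the same total; only the scheduling differs. */
void check_sums_agree(const uint32_t *v, size_t n) {
    assert(sum_chained(v, n) == sum_independent(v, n));
}
```

Whether the second version is actually faster depends on the CPU and the compiler. An optimizing compiler may well unroll or vectorize the first loop by itself, so treat this as an illustration of the idea and not as a benchmark.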