At best 0.5 cycles per instruction (i.e. two per cycle) and a 3.7GHz clock, that's 7.4e9 instructions per second per core. If I'm reading it right, that instruction does 16 4-wide dot products, which is ~128 ops if you count each multiply and each add separately (16 × 4 × 2). So ~950Gops peak per core in int8 precision on a server class Xeon, assuming no clock throttling.
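
For anyone who wants to check the arithmetic, here's a quick sketch in Python. It uses only the figures quoted above. The 2-ops-per-element convention (one multiply plus one add) is my assumption about how the ~128 is counted, and the function and constant names are made up for illustration.

```python
# Back-of-envelope peak throughput from the figures quoted above.
CYCLES_PER_INSTRUCTION = 0.5       # best-case figure from the comment
CLOCK_HZ = 3.7e9                   # assumes no clock throttling
DOT_PRODUCTS_PER_INSTRUCTION = 16
ELEMENTS_PER_DOT_PRODUCT = 4
OPS_PER_ELEMENT = 2                # assumption: one multiply + one add each


def ops_per_instruction() -> int:
    """Ops done by one instruction under the counting convention above."""
    return (DOT_PRODUCTS_PER_INSTRUCTION
            * ELEMENTS_PER_DOT_PRODUCT
            * OPS_PER_ELEMENT)


def peak_ops_per_second() -> float:
    """Instructions per second times ops per instruction."""
    instructions_per_second = CLOCK_HZ / CYCLES_PER_INSTRUCTION
    return instructions_per_second * ops_per_instruction()


if __name__ == "__main__":
    print(f"ops per instruction: {ops_per_instruction()}")
    print(f"peak: {peak_ops_per_second() / 1e9:.1f} Gops")
```

The result comes out a little under 950 Gops, which is where the rounded figure comes from. Sustained numbers will be lower whenever the clock throttles or the instruction can't be issued at its best-case rate.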