Marat_Dukhan's comments

Linux-capable RISC-V cores are often 64-bit and lack SIMD/vector processing capabilities.


IMO the author alludes to some enterprise software running on Wintel that has per-core licensing costs.


Oracle database and some VMware products have per-core licensing.


He's a Software Engineer on the TPU team. Are you confusing him with Thomas Kurian, GCloud SVP?

Note: I work for Google, but speak for myself.


To add to the confusion, NetApp CEO George Kurian is Thomas' twin brother: https://archive.is/jxtu6


Thank you! That's where I was confused; I worked at NetApp :(


It performs fixed-point arithmetic on 8-bit integers. You can mimic lower-than-8-bit precision by using the output_min/output_max parameters in XNNPACK operators (see the sketch below), but keep in mind that:

1. This functionality is experimental and not exposed in TFLite; you'd need to call XNNPACK APIs directly from C/C++ code.

2. Computations would still be done on 8-bit numbers.
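To make the clamping trick concrete, here is a rough C++ sketch: restricting the 8-bit output range to 2^N levels mimics N-bit precision while the arithmetic itself still runs on 8-bit integers. The helper below is hypothetical, and the real XNNPACK operator-creation signatures are elided.

  // Illustrative sketch only: not the actual XNNPACK operator-creation call.
  #include <cstdint>
  #include <cstdio>

  // Hypothetical helper: pick clamp bounds that leave only 2^bits
  // representable levels, centered on the quantization zero point.
  void simulated_nbit_clamp(int bits, int32_t zero_point,
                            int8_t* output_min, int8_t* output_max) {
    const int32_t levels = 1 << bits;        // e.g. 16 levels for 4 bits
    int32_t lo = zero_point - levels / 2;
    int32_t hi = zero_point + levels / 2 - 1;
    if (lo < -128) lo = -128;                // stay within the int8 range
    if (hi > 127) hi = 127;
    *output_min = static_cast<int8_t>(lo);
    *output_max = static_cast<int8_t>(hi);
  }

  int main() {
    int8_t omin, omax;
    simulated_nbit_clamp(/*bits=*/4, /*zero_point=*/0, &omin, &omax);
    // These values would go into the output_min/output_max arguments
    // when creating a quantized XNNPACK operator from C/C++.
    std::printf("output_min=%d output_max=%d\n", omin, omax);
    return 0;
  }

There is no speedup from this, since the computation still happens on 8-bit numbers; only the set of representable output values shrinks.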


Author here, happy to take your questions.


Do I understand correctly that using XNNPACK and mobile acceleration is mutually exclusive? I.e. it's either XNNPACK or NNAPI/CoreML?

Should I consider XNNPACK for a modern mobile phone?


If by acceleration you mean offloading inference to a different IP block (GPU/DSP/NPU), then yes. XNNPACK is the inference engine for the CPU (a minimal sketch of applying it explicitly from C++ follows the list below).

CPU is the default backend in TensorFlow Lite, and CPU inference always works and produces correct results. GPU/DSP/NPU inference can be faster, particularly for large models on high-end SoCs, but generally you need to make sure that the model is supported on the IP block, that the results are correct, and that performance is better than the CPU baseline. And that quickly gets very complicated:

1. NN API and the TFLite GPU/DSP backends support a limited subset of TensorFlow Lite operators, and if a model is only partially offloaded to the GPU/DSP/NPU, part of it will still run on the CPU, and the synchronization overhead commonly kills any potential speedup from the specialized hardware. The situation is even worse in CoreML, which doesn't even provide an API to learn which operators failed to offload to the GPU/NPU.

2. Bugs in GPU shader compilers and NN API drivers do happen, and unless your model is a standard MobileNet, you're likely to hit them on at least some mobile phones. Then you need infrastructure to detect this situation and disable offloading the model to that IP block on the affected phones.

3. Low-end SoCs usually lack a DSP and NPU entirely, and their GPU is often slower than the CPU even in nominal peak performance. This happens because the CPU cores in low-end SoCs are typically just downclocked versions of the CPU cores in high-end SoCs, but low-end GPUs have 8-16 times fewer cores than their high-end counterparts.
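For reference, here is a minimal C++ sketch of applying the XNNPACK delegate explicitly in TensorFlow Lite. It assumes a TF Lite build that ships the XNNPACK delegate headers; depending on the version, XNNPACK may already be applied by default.

  // Minimal sketch: run a .tflite model on the CPU via the XNNPACK delegate.
  #include <memory>

  #include "tensorflow/lite/delegates/xnnpack/xnnpack_delegate.h"
  #include "tensorflow/lite/interpreter.h"
  #include "tensorflow/lite/kernels/register.h"
  #include "tensorflow/lite/model.h"

  int main() {
    auto model = tflite::FlatBufferModel::BuildFromFile("model.tflite");
    tflite::ops::builtin::BuiltinOpResolver resolver;
    std::unique_ptr<tflite::Interpreter> interpreter;
    tflite::InterpreterBuilder(*model, resolver)(&interpreter);

    // Create the XNNPACK delegate with a chosen number of CPU threads.
    TfLiteXNNPackDelegateOptions options = TfLiteXNNPackDelegateOptionsDefault();
    options.num_threads = 4;
    TfLiteDelegate* xnnpack = TfLiteXNNPackDelegateCreate(&options);

    // Supported operators run inside XNNPACK; unsupported ones fall back to
    // the default TFLite CPU kernels, so there is no cross-IP-block sync.
    interpreter->ModifyGraphWithDelegate(xnnpack);
    interpreter->AllocateTensors();

    // ... fill input tensors here, then run inference:
    interpreter->Invoke();

    // The delegate must outlive the interpreter.
    interpreter.reset();
    TfLiteXNNPackDelegateDelete(xnnpack);
    return 0;
  }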


Wow! Thanks for such a detailed answer. It's much clearer now.


Is this a drop-in solution that works with every existing tflite model?


Yes, these optimizations work with existing tflite models, so long as the quantized operators they use are supported in XNNPACK.


I see; in order to benefit, the model has to be quantized. It's not super clear which kinds of quantization are supported. Both FP16 and INT8?


In order to benefit from the optimizations in *this blog post*, the model needs to be quantized to 8-bit integers. However, XNNPACK supports floating-point inference as well (including with FP16 weights); see https://blog.tensorflow.org/2020/07/accelerating-tensorflow-...


Thanks!


Do the same optimizations apply to TensorFlow / TensorFlow Serving?


TensorFlow doesn't support quantized inference (it supports only mimicking quantization in floating-point for quantization-aware training), so it can't immediately benefit from these optimizations.


Good. I was surprised that Apple Silicon Macs don't have a built-in cellular modem. This reminds me of how the iPhone launched without a 3G modem, and I hope Apple will similarly fix the lack of cellular connectivity in the next generation of MacBooks.


I don't understand; couldn't Apple just add a Qualcomm modem into every Mac?


Supposedly Qualcomm's licensing costs are even more egregious for PCs, which makes it impossible to include modems across the board.


I'm sure they could. But Apple wants to own the vertical.


More likely cost


Even WebGL2 doesn't expose compute shaders, so any NN computation works by abusing the graphics pipeline, with many inefficiencies involved: shader dispatch is more expensive, there's no access to local memory, and no control over dispatch blocks. Hopefully the upcoming WebGPU specification will close these efficiency gaps.


What a time to be alive!


FBGEMM is faster than the theoretical peak FP32 (single-precision floating-point) performance, and therefore faster than SGEMM/DGEMM in any BLAS library.


If they're showing something higher than a 'theoretical peak' then it's a fantastic result that must be investigated carefully for any error in data collection.

Also, that doesn't stop them from showing an apples-to-apples comparison against other libraries that provide GEMM. If other libraries are reporting the same 'beyond theoretical peak' then it most certainly is a data collection error.


Performance on the plot is higher than the FP32 peak, but there's no error, because FBGEMM does not compute in FP32; it computes in 8-bit fixed point. On a Broadwell CPU, you can do 16 FP32 multiply-adds per cycle (2x 8-wide FMA via VFMADDxxxPS instructions), but 32 8-bit multiply-adds (1x 32-wide multiplication with accumulation of adjacent results via the VPMADDUBSW instruction).
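For the curious, here is a sketch of that AVX2 u8 x s8 dot-product pattern; it shows the general idea behind the numbers, not FBGEMM's actual micro-kernel. Compile with -mavx2 on an AVX2-capable CPU (e.g. Broadwell).

  #include <immintrin.h>
  #include <cstdint>
  #include <cstdio>

  int main() {
    alignas(32) uint8_t a[32];   // activations (unsigned 8-bit)
    alignas(32) int8_t b[32];    // weights (signed 8-bit)
    for (int i = 0; i < 32; i++) { a[i] = static_cast<uint8_t>(i); b[i] = 1; }

    __m256i va = _mm256_load_si256(reinterpret_cast<const __m256i*>(a));
    __m256i vb = _mm256_load_si256(reinterpret_cast<const __m256i*>(b));

    // VPMADDUBSW: 32 u8 x s8 products, adjacent pairs summed into 16 x int16.
    __m256i prod16 = _mm256_maddubs_epi16(va, vb);
    // VPMADDWD with a vector of ones widens the int16 pairs into 8 x int32,
    // so running sums can be accumulated in 32-bit lanes without overflow.
    __m256i prod32 = _mm256_madd_epi16(prod16, _mm256_set1_epi16(1));

    __m256i acc = _mm256_setzero_si256();
    acc = _mm256_add_epi32(acc, prod32);

    alignas(32) int32_t lanes[8];
    _mm256_store_si256(reinterpret_cast<__m256i*>(lanes), acc);
    int32_t total = 0;
    for (int i = 0; i < 8; i++) total += lanes[i];
    std::printf("dot product = %d\n", total);  // 0 + 1 + ... + 31 = 496
    return 0;
  }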


Ok. Then this will introduce significant truncation errors and it's not a general GEMM. That's like claiming you've made the fastest FEM routine in the world by doing everything in half-precision.


QNNPACK directly competes with the CPU backend of TensorFlow Lite and the gemmlowp library. The Caffe2 backend of PyTorch 1.0 integrates QNNPACK, and directly competes with TensorFlow Lite. QNNPACK targets only mobile CPUs, but Caffe2 integrates other backends for non-CPU targets, e.g. Apple's MPSCNN library for iPhone GPUs, Qualcomm's Snapdragon NPE for Qualcomm GPUs and DSPs, and ARM Compute Library for Android GPUs. Not sure what you mean by TensorFlow Cores: NVIDIA has Tensor Cores and TensorRT, and Google has Tensor Processing Units (TPUs), but neither of those technologies is for mobile.


Thanks for the clarifications!

I was referring to TensorRT from Nvidia and TPUs from Google.

One of the strengths of the TFLite API is that the same exported tflite model can run on both mobile devices and servers. It may make less sense to run lite models on servers because of the loss of precision, but it may also have its own use cases for very big models on cheap servers.

Nvidia sells Android devices and embedded boards for robotics, which will surely get some sort of TensorRT-derived cores if they don't have them already. Google could one day integrate their specialized cores (security and TPUs) into their phones too, or into AI-oriented IoT devices.

