Marat_Dukhan's comments

Linux-capable RISC-V cores are often 64-bit and lack SIMD/vector processing capabilities.


IMO the author alludes to some enterprise software running on Wintel that has per-core licensing costs.


Oracle database and some VMware products have per-core licensing.


He's a Software Engineer on the TPU team. Are you confusing him with Thomas Kurian, GCloud SVP?

Note: I work for Google, but speak for myself.


To add to the confusion, NetApp CEO George Kurian is Thomas' twin brother: https://archive.is/jxtu6


Thank you! That's where I was confused; I worked at NetApp :(


It performs fixed-point arithmetic on 8-bit integers. You can mimic lower-than-8-bit precision by using the output_min/output_max parameters in XNNPACK operators (see the sketch below), but keep in mind that:

1. This functionality is experimental and not exposed in TFLite; you'd need to call XNNPACK APIs directly from C/C++ code.

2. Computations would still be done on 8-bit numbers.
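To make the clamping trick concrete, here is a rough C++ sketch: restricting the 8-bit output range to 2^N levels mimics N-bit precision while the arithmetic itself still runs on 8-bit integers. The helper below is hypothetical, and the real XNNPACK operator-creation signatures are elided.

  // Illustrative sketch only: not the actual XNNPACK operator-creation call.
  #include <cstdint>
  #include <cstdio>

  // Hypothetical helper: pick clamp bounds that leave only 2^bits
  // representable levels, centered on the quantization zero point.
  void simulated_nbit_clamp(int bits, int32_t zero_point,
                            int8_t* output_min, int8_t* output_max) {
    const int32_t levels = 1 << bits;        // e.g. 16 levels for 4 bits
    int32_t lo = zero_point - levels / 2;
    int32_t hi = zero_point + levels / 2 - 1;
    if (lo < -128) lo = -128;                // stay within the int8 range
    if (hi > 127) hi = 127;
    *output_min = static_cast<int8_t>(lo);
    *output_max = static_cast<int8_t>(hi);
  }

  int main() {
    int8_t omin, omax;
    simulated_nbit_clamp(/*bits=*/4, /*zero_point=*/0, &omin, &omax);
    // These values would go into the output_min/output_max arguments
    // when creating a quantized XNNPACK operator from C/C++.
    std::printf("output_min=%d output_max=%d\n", omin, omax);
    return 0;
  }

There is no speedup from this, since the computation still happens on 8-bit numbers; only the set of representable output values shrinks.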


Author here, happy to take your questions.


Do I understand correctly that using XNNPACK and mobile acceleration is mutually exclusive? I.e. it's either XNNPACK or NNAPI/CoreML?

Should I consider XNNPACK for a modern mobile phone?


If by acceleration you mean offloading inference to a different IP block (GPU/DSP/NPU), then yes. XNNPACK is the inference engine for the CPU (a minimal sketch of applying it explicitly from C++ follows the list below).

CPU is the default backend in TensorFlow Lite, and CPU inference always works and produces correct results. GPU/DSP/NPU inference can be faster, particularly for large models on high-end SoCs, but generally you need to make sure that the model is supported on the IP block, that the results are correct, and that performance is better than the CPU baseline. And that quickly gets very complicated:

1. NN API and the TFLite GPU/DSP backends support a limited subset of TensorFlow Lite operators, and if a model is only partially offloaded to the GPU/DSP/NPU, part of it will still run on the CPU, and the synchronization overhead commonly kills any potential speedup from the specialized hardware. The situation is even worse in CoreML, which doesn't even provide an API to learn which operators failed to offload to the GPU/NPU.

2. Bugs in GPU shader compilers and NN API drivers do happen, and unless your model is a standard MobileNet, you're likely to hit them on at least some mobile phones. Then you need infrastructure to detect this situation and disable offloading the model to that IP block on the affected phones.

3. Low-end SoCs usually lack a DSP and NPU entirely, and their GPU is often slower than the CPU even in nominal peak performance. This happens because the CPU cores in low-end SoCs are typically just downclocked versions of the CPU cores in high-end SoCs, but low-end GPUs have 8-16 times fewer cores than their high-end counterparts.
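For reference, here is a minimal C++ sketch of applying the XNNPACK delegate explicitly in TensorFlow Lite. It assumes a TF Lite build that ships the XNNPACK delegate headers; depending on the version, XNNPACK may already be applied by default.

  // Minimal sketch: run a .tflite model on the CPU via the XNNPACK delegate.
  #include <memory>

  #include "tensorflow/lite/delegates/xnnpack/xnnpack_delegate.h"
  #include "tensorflow/lite/interpreter.h"
  #include "tensorflow/lite/kernels/register.h"
  #include "tensorflow/lite/model.h"

  int main() {
    auto model = tflite::FlatBufferModel::BuildFromFile("model.tflite");
    tflite::ops::builtin::BuiltinOpResolver resolver;
    std::unique_ptr<tflite::Interpreter> interpreter;
    tflite::InterpreterBuilder(*model, resolver)(&interpreter);

    // Create the XNNPACK delegate with a chosen number of CPU threads.
    TfLiteXNNPackDelegateOptions options = TfLiteXNNPackDelegateOptionsDefault();
    options.num_threads = 4;
    TfLiteDelegate* xnnpack = TfLiteXNNPackDelegateCreate(&options);

    // Supported operators run inside XNNPACK; unsupported ones fall back to
    // the default TFLite CPU kernels, so there is no cross-IP-block sync.
    interpreter->ModifyGraphWithDelegate(xnnpack);
    interpreter->AllocateTensors();

    // ... fill input tensors here, then run inference:
    interpreter->Invoke();

    // The delegate must outlive the interpreter.
    interpreter.reset();
    TfLiteXNNPackDelegateDelete(xnnpack);
    return 0;
  }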


Wow! Thanks for such a detailed answer. It's much clearer now.


Is this a drop-in solution that works with every existing tflite model?


Yes, these optimizations work with existing tflite models, so long as the quantized operators they use are supported in XNNPACK.


I see; in order to benefit, the model has to be quantized. It's not super clear which kinds of quantization are supported. Both FP16 and INT8?


In order to benefit from the optimizations in *this blog post*, the model needs to be quantized to 8-bit integers. However, XNNPACK supports floating-point inference as well (including with FP16 weights); see https://blog.tensorflow.org/2020/07/accelerating-tensorflow-...


Thanks!


Do the same optimizations apply to TensorFlow / TensorFlow Serving?


TensorFlow doesn't support quantized inference (it supports only mimicking quantization in floating-point for quantization-aware training), so it can't immediately benefit from these optimizations.


Good. I was surprised that Apple Silicon Macs don't have a built-in cellular modem. This reminds me of how the iPhone launched without a 3G modem, and I hope Apple will similarly fix the lack of cellular connectivity in the next generation of MacBooks.


I don't understand; couldn't Apple just add a Qualcomm modem into every Mac?


Supposedly Qualcomm's licensing costs are even more egregious for PCs, which makes it impossible to include modems across the board.


I'm sure they could. But Apple wants to own the vertical.


More likely cost


Even WebGL2 doesn't expose compute shaders, so any NN computation works by abusing the graphics pipeline, with many inefficiencies involved: shader dispatch is more expensive, there's no access to local memory, and no control over dispatch blocks. Hopefully the upcoming WebGPU specification will close these efficiency gaps.


What a time to be alive!


FBGEMM is faster than the theoretical peak FP32 (single-precision floating-point) performance, and therefore faster than SGEMM/DGEMM in any BLAS library.


If they're showing something higher than a 'theoretical peak' then it's a fantastic result that must be investigated carefully for any error in data collection.

Also, that doesn't stop them from showing an apples-to-apples comparison against other libraries that provide GEMM. If other libraries are reporting the same 'beyond theoretical peak' then it most certainly is a data collection error.


Performance on the plot is higher than the FP32 peak, but there's no error, because FBGEMM does not compute in FP32; it computes in 8-bit fixed point. On a Broadwell CPU, you can do 16 FP32 multiply-adds per cycle (2x 8-wide FMA via VFMADDxxxPS instructions), but 32 8-bit multiply-adds (1x 32-wide multiplication with accumulation of adjacent results via the VPMADDUBSW instruction).
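For the curious, here is a sketch of that AVX2 u8 x s8 dot-product pattern; it shows the general idea behind the numbers, not FBGEMM's actual micro-kernel. Compile with -mavx2 on an AVX2-capable CPU (e.g. Broadwell).

  #include <immintrin.h>
  #include <cstdint>
  #include <cstdio>

  int main() {
    alignas(32) uint8_t a[32];   // activations (unsigned 8-bit)
    alignas(32) int8_t b[32];    // weights (signed 8-bit)
    for (int i = 0; i < 32; i++) { a[i] = static_cast<uint8_t>(i); b[i] = 1; }

    __m256i va = _mm256_load_si256(reinterpret_cast<const __m256i*>(a));
    __m256i vb = _mm256_load_si256(reinterpret_cast<const __m256i*>(b));

    // VPMADDUBSW: 32 u8 x s8 products, adjacent pairs summed into 16 x int16.
    __m256i prod16 = _mm256_maddubs_epi16(va, vb);
    // VPMADDWD with a vector of ones widens the int16 pairs into 8 x int32,
    // so running sums can be accumulated in 32-bit lanes without overflow.
    __m256i prod32 = _mm256_madd_epi16(prod16, _mm256_set1_epi16(1));

    __m256i acc = _mm256_setzero_si256();
    acc = _mm256_add_epi32(acc, prod32);

    alignas(32) int32_t lanes[8];
    _mm256_store_si256(reinterpret_cast<__m256i*>(lanes), acc);
    int32_t total = 0;
    for (int i = 0; i < 8; i++) total += lanes[i];
    std::printf("dot product = %d\n", total);  // 0 + 1 + ... + 31 = 496
    return 0;
  }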


Ok. Then this will introduce significant truncation errors and it's not a general GEMM. That's like claiming you've made the fastest FEM routine in the world by doing everything in half-precision.


QNNPACK directly competes with the CPU backend of TensorFlow Lite and the gemmlowp library. The Caffe2 backend of PyTorch 1.0 integrates QNNPACK, and directly competes with TensorFlow Lite. QNNPACK targets only mobile CPUs, but Caffe2 integrates other backends for non-CPU targets, e.g. Apple's MPSCNN library for iPhone GPUs, Qualcomm's Snapdragon NPE for Qualcomm GPUs and DSPs, and ARM Compute Library for Android GPUs. Not sure what you mean by TensorFlow Cores: NVIDIA has Tensor Cores and TensorRT, and Google has Tensor Processing Units (TPUs), but neither of those technologies is for mobile.


Thanks for the clarifications!

I was referring to TensorRT from Nvidia and TPUs from Google.

One of the strengths of the TFLite API is that the same exported tflite model can run on both mobile devices and servers. It may make less sense to run lite models on servers because of the loss of precision, but it may also have its own use cases for very big models on cheap servers.

Nvidia sells Android devices and embedded boards for robotics, which will surely get some sort of TensorRT-derived cores if they don't have them already. Google could one day integrate their specialized cores (security and TPUs) into their phones too, or into AI-oriented IoT devices.

