ONNX is cool. For one, it runs (large) transformer models on the CPU about twice as fast as pytorch/transformers. But at the moment it lacks a number of crucial features. Specifically:
Its reliance on Google's protobuf, with its 2GB single-file limit, is an extreme limitation. Yes, you can keep weights outside your model file, but even so many operations (e.g. model slicing) fail.
Second, the inability to offload parts of the model to disk or CPU (like HuggingFace Accelerate does) while the rest executes on the GPU.
Third, the inability to partition existing large models easily. You can delete nodes, but then fixing the input/output formats means manually editing text files. The workflow is ridiculous (convert the ONNX to text with pdoc, edit it in a text editor, convert it back to binary).
As ONNX models are protobufs, you can edit them at a Python or Java REPL (or in another language, but those are the two I've personally used). Dumping them out as text seems like a lot more work, and a lot less typesafe.
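For instance, a minimal sketch of that REPL-style editing with the onnx Python package (the file name and the node name "Identity_42" are hypothetical, just to illustrate the protobuf accessors):

    import onnx

    # The top-level object is a ModelProto, i.e. an ordinary protobuf message.
    model = onnx.load("model.onnx")

    # Inspect the graph interactively.
    print(len(model.graph.node), "nodes")

    # Example edit: drop one node by name and rewire its consumers to its input.
    target = next(n for n in model.graph.node if n.name == "Identity_42")  # hypothetical name
    rewire = {out: target.input[0] for out in target.output}
    model.graph.node.remove(target)
    for node in model.graph.node:
        new_inputs = [rewire.get(i, i) for i in node.input]
        del node.input[:]
        node.input.extend(new_inputs)

    # Validate and save, no text round-trip involved.
    onnx.checker.check_model(model)
    onnx.save(model, "model_edited.onnx")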
Protobuf has a serialized file limit of 2GB? Or is that something in the definition file? I haven't touched PB in ages, and certainly not for anything that would be transmitting that much in a single message.
Is your complaint that the whole context/weights need to be sent through a single file?
I'm asking from a position of ignorance; I'm surprised to see a serialized transport of that size, and I'm wondering why it's specifically limited to 2GB. Is it so it can be mmapped on 32-bit hardware?
I'm not sure if this is a file size limit too or just an object memory representation size limit. For me using a library designed for message passing to save/read your AI models is a bad design decision.
> Is your complaint that the whole context/weights need to be sent through a single file?
I store large ONNX models with "external" weights, but even so many operations fail with the dreaded "ModelProto exceeds maximum protobuf size of 2GB: 3385275542". So the complaint is that you simply can't do a lot of stuff with models over 2GB.
You can create a session to execute the model, and you can run the vanilla optimisation over it. But trying to run the transformer-specific optimisations errors out, as do any attempts at slicing the model, and so do some conversion processes.
It appears the code that does that just looks up the model size and errors out if it's over 2GB. It doesn't even try loading it.
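For reference, the external-data workaround looks roughly like this (a sketch with made-up file names; whether downstream tooling then accepts the model is exactly the problem described above):

    import onnx

    # If the original .onnx already exceeds 2GB as a single protobuf message,
    # it cannot even be parsed; it has to have been exported with external data.
    model = onnx.load("big_model.onnx")
    print("in-memory proto size:", model.ByteSize())  # tools bail out past 2**31 - 1

    # Re-save with all large tensors moved out of the ModelProto into a side file,
    # leaving only small metadata inside the protobuf itself.
    onnx.save_model(
        model,
        "big_model_external.onnx",
        save_as_external_data=True,
        all_tensors_to_one_file=True,
        location="big_model_weights.bin",  # stored next to the .onnx file
        size_threshold=1024,               # externalise tensors larger than 1 KiB
    )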
> I'm not sure if this is a file size limit too or just an object memory representation size limit.
Message. Each message cannot be bigger than 2GB (and usually shouldn't be anywhere near that big), but a file can contain multiple messages. The limitation helps them prevent integer overflows: every length is 32-bit, but every application process is 64-bit, so you can convert the length values to 64-bit before doing any arithmetic on them. Therefore, fundamentally, there is no way to make it secure on 32-bit platforms, and no way to support more than 2GB on 64-bit platforms, without totally rewriting the code.
We wanted to use ONNX Runtime as a "model driver" for MD simulations, where any ML model can be used for molecular dynamics simulations. The problem was that it was way too immature. For example, the ceiling function only works with single precision in ONNX. But the biggest issue was that we could not take derivatives in ONNX Runtime, so any complicated model that uses derivatives internally was a no-go. Does that limitation still exist? Do you know if it can take derivatives in training mode now?
Yeah, ONNX Runtime is mostly used for inference. The requirements for training and inference differ quite a lot: training requires a library that can calculate gradients for back propagation, loop over large datasets, split the model across multiple GPUs, and so on. During inference you need to run a quantized version of the model on specific target hardware, whether that's CPU, GPU, or mobile. So typically you use one library for training and convert the model to a different library for deployment.
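A minimal sketch of that split, assuming a toy PyTorch model (the names and shapes are made up):

    import numpy as np
    import torch
    import onnxruntime as ort

    # Training side: gradients, data loading, multi-GPU etc. live in PyTorch.
    model = torch.nn.Sequential(torch.nn.Linear(16, 8), torch.nn.ReLU(), torch.nn.Linear(8, 2))
    model.eval()

    # Freeze the trained model into an ONNX graph for deployment.
    dummy = torch.randn(1, 16)
    torch.onnx.export(
        model, dummy, "classifier.onnx",
        input_names=["features"], output_names=["logits"],
        dynamic_axes={"features": {0: "batch"}, "logits": {0: "batch"}},
        opset_version=17,
    )

    # Inference side: no autograd, no optimizer, just a session on the target hardware.
    sess = ort.InferenceSession("classifier.onnx", providers=["CPUExecutionProvider"])
    logits = sess.run(None, {"features": np.random.randn(4, 16).astype(np.float32)})[0]
    print(logits.shape)  # (4, 2)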
> And this is a superficial difference carried from old days when we need to do deployment and deployment-specific optimizations.
Is it? From what I understand, to use an analogy, ONNX is the bytecode specification plus the JVM, whereas PyTorch, TF and other frameworks, combined with converter tools, are the Java compilers.
That is true if your model is statically shaped. But ONNX is also a collection of C++ code, a C++ library for doing shape inference. You cannot easily port that C++ code to another high-level programming language.
The model itself doesn't contain C++ code. The runtime supports Python, JS, etc. I don't think they are shipping a copy of clang as part of the runtime; I could be wrong, I haven't looked. Of course it needs the scaffolding to get data in and out.
I just did an install of the runtime for Python (pip install onnxruntime). Here are the additional packages it installs.
ONNX is a spec. ONNX Runtime is an implementation of ONNX, and there are other implementations too. But ONNX is not a text spec like the RFCs for network protocols: ONNX is also a collection of C/C++ code, and ONNX's implementations rely on this code to do type and shape inference. My point was: if someone wants to implement ONNX (i.e. write a library that can load and run ONNX models), they have to reuse this C/C++ code, or totally rewrite it in their favorite programming language (which I don't think is very practical).
If an ONNX implementation wants to do codegen, like what XLA does, then usually it is based on LLVM and needs to ship with a copy of LLVM.
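To make the "implementations rely on this code" point concrete: even from Python, shape and type inference dispatch into the C++ implementation bundled with the onnx package. A small sketch (the model file name is hypothetical):

    import onnx
    from onnx import shape_inference

    model = onnx.load("model.onnx")

    # infer_shapes is a thin Python wrapper over the C++ per-op inference rules;
    # a from-scratch ONNX implementation would have to reproduce all of them.
    inferred = shape_inference.infer_shapes(model)

    for vi in inferred.graph.value_info:
        dims = [d.dim_value or d.dim_param for d in vi.type.tensor_type.shape.dim]
        print(vi.name, dims)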
I'm personally more excited by StableHLO and/or Tinygrad as portable intermediate languages for ML. They're more RISC-like: ONNX seems to have almost 200 ops, StableHLO about 100, and Tinygrad about 30.
TensorRT with TorchScript is king, as it's a lot easier to modify, but ONNX is fine since you can also import some ONNX models into TensorRT, as sketched at the end of this comment -> https://docs.nvidia.com/deeplearning/tensorrt/api/python_api... But of course, as with everything, it depends on the opset version of the ONNX model.
ONNX is pretty old, at least 5 years, and it's still mostly useful on Nvidia GPUs or x64 CPUs.
TBH it's cool that projects like that are still alive, but MLIR looks like the future of proper model storage, and a custom format and loader is still king today because you can easily modify the model and optimise it or even fine-tune it, which isn't really possible in ONNX without a ton of work (also, the static spec and versioning in protobuf suck; I wish they'd migrate to FlatBuffers).
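For reference, the ONNX-to-TensorRT import mentioned above looks roughly like this with the Python API (a sketch only; details vary across TensorRT versions, and the file name is made up):

    import tensorrt as trt

    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
    )
    parser = trt.OnnxParser(network, logger)

    # Parsing is where opset mismatches typically surface.
    with open("model.onnx", "rb") as f:
        if not parser.parse(f.read()):
            for i in range(parser.num_errors):
                print(parser.get_error(i))
            raise SystemExit("ONNX parse failed")

    config = builder.create_builder_config()
    engine = builder.build_serialized_network(network, config)  # serialized TRT engine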
For people looking at deploying ML: this comment is not even wrong [1]; there's no real way to respond to it substantively. It's sort of like saying Swift came out 7 years ago and it's mostly useful for the iPhone X and the first iPad Air.
ONNX is nice in principle, but pretty limited. Core ops like non-max suppression can't be converted properly. Also, model deployment is not great; memory consumption, and control over it, is worse than with TensorFlow.
One option for your case is OpenVINO. It's written in C++ and has Python bindings. It can also be used to train new nets, and you can use ONNX files with OpenVINO too.
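A quick sketch of that path with the OpenVINO Python API (the 2022+ openvino.runtime flavour; the file name and input shape are made up):

    import numpy as np
    from openvino.runtime import Core

    core = Core()
    model = core.read_model("model.onnx")        # ONNX files are read directly
    compiled = core.compile_model(model, "CPU")  # or "GPU" for Intel graphics

    output_layer = compiled.output(0)
    dummy = np.random.randn(1, 3, 224, 224).astype(np.float32)  # assumed input shape
    result = compiled([dummy])[output_layer]
    print(result.shape)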
(disclaimer: I work at GH/MSFT, not connected to the Llama 2 project)