ONNX runtime: Cross-platform accelerated machine learning (onnxruntime.ai)
149 points by valgaze on July 25, 2023 | 36 comments



Maybe relevant, since Azure is used as an example, MSFT & Meta recently worked on ONNX-based deployment of Llama 2 in Azure and WSL: https://blogs.microsoft.com/blog/2023/07/18/microsoft-and-me...

(disclaimer: I work at GH/MSFT, not connected to the Llama 2 project)


I would say onnx.ai [0] provides more information about ONNX for those who aren’t working with ML/DL.

[0] https://onnx.ai


ONNX is cool. For one, it runs (large) transformer models on the CPU about twice as fast as pytorch/transformers. But at the moment it lacks a number of crucial features. Specifically:

First, its reliance on Google's protobuf, with its 2 GB single-file limit, is an extreme limitation. Yes, you can keep weights outside your model file, but many operations (model slicing) still fail.

Second, the inability to offload parts of the model to disk or CPU (like huggingface accelerate) while the rest executes on the GPU.

Third, the inability to partition existing large models easily. You can delete nodes, but then fixing the input/output formats means manually editing text files. The workflow is ridiculous (convert the onnx to text with pdoc, edit it in a text editor, convert it back to binary).

I really wish they'd fix all this stuff and more.


As ONNX models are protobufs, you can edit them at a Python or Java REPL (or another language, but I've personally used those two). Dumping them out as text seems like a lot more work, and a lot less typesafe.
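For instance, a rough sketch of what that looks like in Python with the onnx package (the file name and the specific rename are just placeholders):

    import onnx

    model = onnx.load("model.onnx")          # parses the protobuf into Python objects

    # The graph is just a protobuf message: nodes, inputs, outputs, initializers.
    for node in model.graph.node[:5]:
        print(node.op_type, list(node.input), list(node.output))

    # Example edit: rename the first graph output and patch the node that produces it.
    old_name = model.graph.output[0].name
    model.graph.output[0].name = "logits"
    for node in model.graph.node:
        for i, out_name in enumerate(node.output):
            if out_name == old_name:
                node.output[i] = "logits"

    onnx.checker.check_model(model)          # sanity check the edited graph
    onnx.save(model, "edited.onnx")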


protobuf has a serialized file limit of 2gb? or like a definition file? I haven't touched PB in ages and certainly not something that would be transmitting that much in a single message.

is your complaint that the whole context/weights needs to be sent through the whole file?

I'm asking from a position of ignorance; I'm surprised to see a serialized transport of that size, and wondering why it's specifically limited to 2 GB. Like, to be able to mmap on 32-bit hardware?


>protobuf has a serialized file limit of 2gb

I'm not sure if this is a file size limit too or just a limit on the in-memory object representation. For me, using a library designed for message passing to save/read your AI models is a bad design decision.

>is your complaint that the whole context/weights needs to be sent through the whole file?

I store large onnx models with "external" weights, but even so many operations fail with the dreaded "ModelProto exceeds maximum protobuf size of 2GB: 3385275542". So the complaint is that you simply can't do a lot of stuff with models over 2gb.

You can create a session to execute the model, and you can run the vanilla optimisation over it. But trying to run the transformer-specific optimisations errors out, as does any attempt at slicing the model, and so do some conversion processes.

It appears the code that does that looks up the model size and just errors out if it's over 2 GB; it doesn't even try loading it.
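For reference, this is roughly how I save the external-weights variant with the onnx Python package (paths are placeholders; the 2 GB errors show up in the downstream tooling, not at this step):

    import onnx

    model = onnx.load("model.onnx", load_external_data=True)

    # Store all tensors in a side file so model.onnx itself stays small, but tools
    # that rebuild a single in-memory ModelProto can still hit the 2 GB cap later.
    onnx.save_model(
        model,
        "model_ext.onnx",
        save_as_external_data=True,
        all_tensors_to_one_file=True,
        location="weights.bin",
        size_threshold=1024,        # tensors above ~1 KB go to the external file
    )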


> I'm not sure if this is a file size limit too or just an object memory representation size limit.

Message. Each message cannot be bigger than 2 GB (and usually shouldn't be anywhere near that big), but a file can contain multiple messages. The limitation helps prevent integer overflows: every length is 32-bit, so as long as the application process is 64-bit you can convert the lengths to 64-bit before doing any arithmetic. Fundamentally, then, there is no way to make it safe on 32-bit platforms, and no way to support more than 2 GB on 64-bit platforms without totally rewriting the code.


There are two kinds of runtime: training and inference. ONNX Runtime, as far as I know, is only for inference, which is open to all.


The training support is much less mature and much less widely used, but it does exist: https://onnxruntime.ai/docs/get-started/training-on-device.h... https://onnxruntime.ai/docs/get-started/training-pytorch.htm...


We wanted to use ONNX Runtime for a "model driver" for MD simulations, where any ML model can be used for molecular dynamics simulations. The problem was that it was way too immature. For example, the ceiling function only works with single precision in ONNX. But the biggest issue was that we could not take derivatives in ONNX Runtime, so any complicated model that uses derivatives internally was a no-go. Does that limitation still exist? Do you know if it can take derivatives in training mode now?

Eventually we went with PyTorch-only support for the time being, while still exploring OpenXLA in place of ONNX as a universal adapter: https://github.com/ipcamit/colabfit-model-driver


Yea, ONNX Runtime is mostly used for inference. The requirements for training and inference differ quite a lot: training requires a library that can calculate gradients for backpropagation, loop over large datasets, split the model across multiple GPUs, etc. During inference you need to run a quantized version of the model on specific target hardware, whether it be CPU, GPU, or mobile. So typically you will use one library for training and convert the model to a different format for deployment.
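The usual hand-off looks roughly like this: train in PyTorch, export the graph to ONNX, and leave serving to whatever runtime you deploy with. A sketch (the torchvision model and input shape are just placeholders):

    import torch
    import torchvision

    # Any trained torch.nn.Module works here; a pretrained ResNet is just an example.
    model = torchvision.models.resnet18(weights="DEFAULT").eval()
    dummy_input = torch.randn(1, 3, 224, 224)   # example input that fixes the graph shapes

    torch.onnx.export(
        model,
        dummy_input,
        "resnet18.onnx",
        input_names=["input"],
        output_names=["logits"],
        dynamic_axes={"input": {0: "batch"}},   # keep the batch dimension dynamic
        opset_version=17,
    )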


There's a training runtime too (and it enables edge training, as the sibling reply hopes for in the next decade).


And this is a superficial difference carried over from the old days when we needed to do deployment and deployment-specific optimizations.

With LoRA / QLoRA, my bet is that edge training capabilities will be just as important in the next decade. I don't have any citations though.


> And this is a superficial difference carried over from the old days when we needed to do deployment and deployment-specific optimizations.

Is it? From what I understand, to use an analogy, ONNX is the bytecode specification and the JVM, whereas PyTorch, TF, and other frameworks combined with converter tools are the Java compilers.


Onnx is just a serialisation format (using protobuf iirc) for the network, weights, etc.

Your training framework and a suitable export is the compiler.

Onnx Runtime (which really has various backends), tensorrt, .. (whatever inference engine you are using) is your JVM.


That is my understanding: ONNX is the weights and the operators. You could then project that model into SPIR-V or Verilog, or run it via native code.


That is true only if your model is statically shaped. ONNX is also a collection of C++ code, a C++ library for doing shape inference. You cannot project that C++ code to another high-level programming language.


The model itself doesn't contain C++ code. The runtime supports Python, JS, etc. I don't think they are shipping a copy of clang as part of the runtime; I could be wrong, I haven't looked. Of course it needs the scaffolding to get data in and out.

I just did an install of the runtime in Python (pip install onnxruntime). Here are the additional packages it installs:

    Package       Version
    ------------- -------
    coloredlogs   15.0.1
    flatbuffers   23.5.26
    humanfriendly 10.0
    mpmath        1.3.0
    numpy         1.25.2
    onnxruntime   1.15.1
    packaging     23.1
    protobuf      4.23.4
    sympy         1.12
https://onnxruntime.ai/docs/install/
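And running a model from Python is just a few lines. A minimal sketch, assuming a model.onnx with a single float32 input (the shape below is a placeholder):

    import numpy as np
    import onnxruntime as ort

    # The runtime picks an execution provider; CPU is always available.
    sess = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])

    input_name = sess.get_inputs()[0].name
    x = np.random.rand(1, 3, 224, 224).astype(np.float32)   # placeholder input

    outputs = sess.run(None, {input_name: x})   # None = return all outputs
    print(outputs[0].shape)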


ONNX is a spec. ONNX Runtime is an implementation of ONNX; there are other implementations too. But ONNX is not a text spec like the RFCs for network protocols. ONNX is also a collection of C/C++ code, and ONNX's implementations rely on this code to do type and shape inference. My point was: if someone wants to implement ONNX (write a library that can load and run ONNX models), they have to reuse this C/C++ code, or totally rewrite it in their favorite programming language (which I don't think is very practical).

If an ONNX implementation wants to do codegen, like what XLA does, then usually it is based on LLVM and it needs to be shipped with a copy of LLVM.
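For what it's worth, the onnx Python package exposes that shape-inference machinery as a thin wrapper over the C++ code. A rough sketch (file name is a placeholder):

    import onnx
    from onnx import shape_inference

    model = onnx.load("model.onnx")

    # infer_shapes calls into ONNX's C++ shape/type inference for each operator.
    inferred = shape_inference.infer_shapes(model)

    # Intermediate tensors now carry inferred types/shapes in graph.value_info.
    for vi in inferred.graph.value_info[:5]:
        dims = [d.dim_param or d.dim_value for d in vi.type.tensor_type.shape.dim]
        print(vi.name, dims)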


The biggest problem with onnx models is that you can't reshape them :/


I'm personally more excited by StableHLO and/or Tinygrad as portable intermediate languages for ML. They're more RISC-like: ONNX seems to have almost 200 ops, StableHLO about 100, and Tinygrad about 30.


There's also a third-party WebGPU implementation: https://github.com/webonnx/wonnx


Is anyone using Onnx-compiled models with Triton Inference Server? Is it worth it? How does it compare to other options like torchscript or tensorrt?


TensorRT with TorchScript is king, as it's a lot easier to modify, but ONNX is fine, since you can also import some ONNX models into TensorRT -> https://docs.nvidia.com/deeplearning/tensorrt/api/python_api... But of course, as with everything, it depends on the opset version of the ONNX model.
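For the ONNX -> TensorRT route, the flow is roughly this (a sketch against the TensorRT 8.x Python API; file names are placeholders):

    import tensorrt as trt

    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    parser = trt.OnnxParser(network, logger)

    # Parse the ONNX model into a TensorRT network definition.
    with open("model.onnx", "rb") as f:
        if not parser.parse(f.read()):
            for i in range(parser.num_errors):
                print(parser.get_error(i))
            raise SystemExit("ONNX parse failed")

    # Build and serialize an engine for the current GPU.
    config = builder.create_builder_config()
    engine_bytes = builder.build_serialized_network(network, config)
    with open("model.engine", "wb") as f:
        f.write(engine_bytes)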


Super worth it.


Nice. A while ago, there were new AI Python projects that came out and needed the binaries, and the website install wasn't available or documented.

Many users didn't want to install random binaries (security), and the devs didn't document or link directly to the corp websites.

Now it's as easy as a pip install, which is going to make things easier.

The community is moving faster than the corps making the tools.


ONNX is pretty old, at least 5 years, and it's still mostly useful on Nvidia GPUs or x64 CPUs. TBH it's cool that projects like this are still alive, but MLIR looks like the future of proper model storage, and a custom format and loader is still king today because you can easily modify the model and optimise or even fine-tune it, which isn't even possible in ONNX without a ton of work (also, the static spec and versioning in protobuf suck; I wish they'd migrate to flatbuffers).


For people looking at deploying ML: this comment is not even wrong [1]; there's no real way to respond to it substantively. It's sort of like saying Swift came out 7 years ago and it's mostly useful for the iPhone X and the first iPad Air.

[1] https://en.wikipedia.org/wiki/Not_even_wrong


Ok :)


What's cool is that you can run Onnx models in the browser!

I have written about it in my blog: https://www.zaynetro.com/post/run-ml-on-devices



ONNX is nice in principle, but pretty limited. Core ops like non-max suppression can't be properly converted. Also, model deployment is not great; memory consumption and control thereof are worse than with TensorFlow.


Would it not be better to use https://github.com/tinygrad/tinygrad as an intermediary framework?


Tinygrad is Python-only, right? Can it provide gradients during a C++ runtime as well? ONNX Runtime has multiple language backends for inference.


One option for your case is OpenVINO. It's written in C++ and has Python bindings. Also, it can be used to train new nets. You can use ONNX files with OpenVINO too.


Does it run on any of the BSDs?



