ONNX is cool. For one, it runs (large) transformer models on the CPU about twice as fast as pytorch/transformers. But at the moment it lacks a number of crucial features. Specifically:
Its reliance on Google's protobuf, with its 2GB single-file limit, is an extreme limitation. Yes, you can keep weights outside your model file, but even so many operations (e.g. model slicing) fail.
Second, the inability to offload parts of the model to disk or CPU (like HuggingFace Accelerate does) while the rest executes on the GPU.
Third, the inability to partition existing large models easily. You can delete nodes, but then fixing the input/output formats means manually editing text files. The workflow is ridiculous (convert the ONNX to text with pdoc, edit it in a text editor, convert it back to binary).
As ONNX models are protobufs, you can edit them at a Python or Java REPL (or in another language, but those are the two I've personally used). Dumping them out as text seems like a lot more work, and a lot less typesafe.
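For instance, a minimal sketch of that REPL-style editing with the onnx Python package (the file name and the node name "Identity_42" are hypothetical, just to illustrate the protobuf accessors):

    import onnx

    # The top-level object is a ModelProto, i.e. an ordinary protobuf message.
    model = onnx.load("model.onnx")

    # Inspect the graph interactively.
    print(len(model.graph.node), "nodes")

    # Example edit: drop one node by name and rewire its consumers to its input.
    target = next(n for n in model.graph.node if n.name == "Identity_42")  # hypothetical name
    rewire = {out: target.input[0] for out in target.output}
    model.graph.node.remove(target)
    for node in model.graph.node:
        new_inputs = [rewire.get(i, i) for i in node.input]
        del node.input[:]
        node.input.extend(new_inputs)

    # Validate and save, no text round-trip involved.
    onnx.checker.check_model(model)
    onnx.save(model, "model_edited.onnx")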
Protobuf has a serialized file limit of 2GB? Or is that something in the definition file? I haven't touched PB in ages, and certainly not for anything that would be transmitting that much in a single message.
Is your complaint that the whole context/weights need to be sent through a single file?
I'm asking from a position of ignorance; I'm surprised to see a serialized transport of that size, and I'm wondering why it's specifically limited to 2GB. Is it so it can be mmapped on 32-bit hardware?
I'm not sure if this is a file size limit too or just an object memory representation size limit. For me using a library designed for message passing to save/read your AI models is a bad design decision.
> Is your complaint that the whole context/weights need to be sent through a single file?
I store large ONNX models with "external" weights, but even so many operations fail with the dreaded "ModelProto exceeds maximum protobuf size of 2GB: 3385275542". So the complaint is that you simply can't do a lot of stuff with models over 2GB.
You can create a session to execute the model, and you can run the vanilla optimisation over it. But trying to run the transformer-specific optimisations errors out, as do any attempts at slicing the model, and so do some conversion processes.
It appears the code that does that just looks up the model size and errors out if it's over 2GB. It doesn't even try loading it.
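For reference, the external-data workaround looks roughly like this (a sketch with made-up file names; whether downstream tooling then accepts the model is exactly the problem described above):

    import onnx

    # If the original .onnx already exceeds 2GB as a single protobuf message,
    # it cannot even be parsed; it has to have been exported with external data.
    model = onnx.load("big_model.onnx")
    print("in-memory proto size:", model.ByteSize())  # tools bail out past 2**31 - 1

    # Re-save with all large tensors moved out of the ModelProto into a side file,
    # leaving only small metadata inside the protobuf itself.
    onnx.save_model(
        model,
        "big_model_external.onnx",
        save_as_external_data=True,
        all_tensors_to_one_file=True,
        location="big_model_weights.bin",  # stored next to the .onnx file
        size_threshold=1024,               # externalise tensors larger than 1 KiB
    )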
> I'm not sure if this is a file size limit too or just an object memory representation size limit.
Message. Each message cannot be bigger than 2GB (and usually shouldn't be anywhere near that big), but a file can contain multiple messages. The limitation helps them prevent integer overflows: every length is 32-bit, but every application process is 64-bit, so you can convert the length values to 64-bit before doing any arithmetic on them. Therefore, fundamentally, there is no way to make it secure on 32-bit platforms, and no way to support more than 2GB on 64-bit platforms, without totally rewriting the code.
We wanted to use ONNX Runtime as a "model driver" for MD simulations, where any ML model can be used for molecular dynamics simulations. The problem was that it was way too immature. For example, the ceiling function only works with single precision in ONNX. But the biggest issue was that we could not take derivatives in ONNX Runtime, so any complicated model that uses derivatives internally was a no-go. Does that limitation still exist? Do you know if it can take derivatives in training mode now?
Yeah, ONNX Runtime is mostly used for inference. The requirements for training and inference differ quite a lot: training requires a library that can calculate gradients for back propagation, loop over large datasets, split the model across multiple GPUs, and so on. During inference you need to run a quantized version of the model on specific target hardware, whether that's CPU, GPU, or mobile. So typically you use one library for training and convert the model to a different library for deployment.
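A minimal sketch of that split, assuming a toy PyTorch model (the names and shapes are made up):

    import numpy as np
    import torch
    import onnxruntime as ort

    # Training side: gradients, data loading, multi-GPU etc. live in PyTorch.
    model = torch.nn.Sequential(torch.nn.Linear(16, 8), torch.nn.ReLU(), torch.nn.Linear(8, 2))
    model.eval()

    # Freeze the trained model into an ONNX graph for deployment.
    dummy = torch.randn(1, 16)
    torch.onnx.export(
        model, dummy, "classifier.onnx",
        input_names=["features"], output_names=["logits"],
        dynamic_axes={"features": {0: "batch"}, "logits": {0: "batch"}},
        opset_version=17,
    )

    # Inference side: no autograd, no optimizer, just a session on the target hardware.
    sess = ort.InferenceSession("classifier.onnx", providers=["CPUExecutionProvider"])
    logits = sess.run(None, {"features": np.random.randn(4, 16).astype(np.float32)})[0]
    print(logits.shape)  # (4, 2)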
> And this is a superficial difference carried from old days when we need to do deployment and deployment-specific optimizations.
Is it? From what I understand, to use an analogy, ONNX is the bytecode specification plus the JVM, whereas PyTorch, TF and other frameworks, combined with converter tools, are the Java compilers.
That is true if your model is statically shaped. But ONNX is also a collection of C++ code, a C++ library for doing shape inference. You cannot easily port that C++ code to another high-level programming language.
The model itself doesn't contain C++ code. The runtime supports Python, JS, etc. I don't think they are shipping a copy of clang as part of the runtime; I could be wrong, I haven't looked. Of course it needs the scaffolding to get data in and out.
I just did an install of the runtime for Python (pip install onnxruntime). Here are the additional packages it installs.
ONNX is a spec. ONNX Runtime is an implementation of ONNX, and there are other implementations too. But ONNX is not a text spec like the RFCs for network protocols: ONNX is also a collection of C/C++ code, and ONNX's implementations rely on this code to do type and shape inference. My point was: if someone wants to implement ONNX (i.e. write a library that can load and run ONNX models), they have to reuse this C/C++ code, or totally rewrite it in their favorite programming language (which I don't think is very practical).
If an ONNX implementation wants to do codegen, like what XLA does, then usually it is based on LLVM and needs to ship with a copy of LLVM.
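To make the "implementations rely on this code" point concrete: even from Python, shape and type inference dispatch into the C++ implementation bundled with the onnx package. A small sketch (the model file name is hypothetical):

    import onnx
    from onnx import shape_inference

    model = onnx.load("model.onnx")

    # infer_shapes is a thin Python wrapper over the C++ per-op inference rules;
    # a from-scratch ONNX implementation would have to reproduce all of them.
    inferred = shape_inference.infer_shapes(model)

    for vi in inferred.graph.value_info:
        dims = [d.dim_value or d.dim_param for d in vi.type.tensor_type.shape.dim]
        print(vi.name, dims)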
I'm personally more excited by StableHLO and/or Tinygrad as portable intermediate languages for ML. They're more RISC-like: ONNX seems to have almost 200 ops, StableHLO about 100, and Tinygrad about 30.
TensorRT with TorchScript is king, as it's a lot easier to modify, but ONNX is fine since you can also import some ONNX models into TensorRT, as sketched at the end of this comment -> https://docs.nvidia.com/deeplearning/tensorrt/api/python_api... But of course, as with everything, it depends on the opset version of the ONNX model.
ONNX is pretty old, at least 5 years, and it's still mostly useful on Nvidia GPUs or x64 CPUs.
TBH it's cool that projects like that are still alive, but MLIR looks like the future of proper model storage, and a custom format and loader is still king today because you can easily modify the model and optimise it or even fine-tune it, which isn't really possible in ONNX without a ton of work (also, the static spec and versioning in protobuf suck; I wish they'd migrate to FlatBuffers).
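For reference, the ONNX-to-TensorRT import mentioned above looks roughly like this with the Python API (a sketch only; details vary across TensorRT versions, and the file name is made up):

    import tensorrt as trt

    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
    )
    parser = trt.OnnxParser(network, logger)

    # Parsing is where opset mismatches typically surface.
    with open("model.onnx", "rb") as f:
        if not parser.parse(f.read()):
            for i in range(parser.num_errors):
                print(parser.get_error(i))
            raise SystemExit("ONNX parse failed")

    config = builder.create_builder_config()
    engine = builder.build_serialized_network(network, config)  # serialized TRT engine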
For people looking at deploying ML: this comment is not even wrong [1]; there's no real way to respond to it substantively. It's sort of like saying Swift came out 7 years ago and it's mostly useful for the iPhone X and the first iPad Air.
ONNX is nice in principle, but pretty limited. Core ops like non-max suppression can't be converted properly. Also, model deployment is not great; memory consumption, and control over it, is worse than with TensorFlow.
One option for your case is OpenVINO. It's written in C++ and has Python bindings. It can also be used to train new nets, and you can use ONNX files with OpenVINO too.
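A quick sketch of that path with the OpenVINO Python API (the 2022+ openvino.runtime flavour; the file name and input shape are made up):

    import numpy as np
    from openvino.runtime import Core

    core = Core()
    model = core.read_model("model.onnx")        # ONNX files are read directly
    compiled = core.compile_model(model, "CPU")  # or "GPU" for Intel graphics

    output_layer = compiled.output(0)
    dummy = np.random.randn(1, 3, 224, 224).astype(np.float32)  # assumed input shape
    result = compiled([dummy])[output_layer]
    print(result.shape)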
(disclaimer: I work at GH/MSFT, not connected to the Llama 2 project)