OneAPI – A cross-industry, open, standards-based unified programming model

touisteur · on March 1, 2021

I'm having a hard time trusting Intel with anything AI mid to long term. OpenVINO was great until they removed support for Intel FPGAs, and they've been closing and deprecating low-level APIs that were necessary for low-latency, low-watt work on ex-Movidius hardware (and KeemBay always seems to be 'next semester'...). We already have huge lock-in with Nvidia, but at least we get wide expertise and huge perf boost, what does oneAPI bringing expect 'portability' and 'runs as fast as tf/torch/TVM'... OpenCL? Rocm? Vulkan targets? Who's going to debug and support that?

I'm struggling to see what's enticing in this for python people that already have largely optimised torch and tf cpu and gpu backends, especially for batch work. And for latency-sensitive inference, I thought the 'industry' was going for TVM or other 'target all the things at code level'.

I'm thinking: gimme your network in onnx format, everyone gives a C inference/compilation API, and let everyone optimize /behind/ that... Xilinx, AMD, whoever-Google-bought-last-week...

RcouF1uZ4gsC · on March 1, 2021

Seems like yet another CUDA alternative. Thing is, even if you were wanting to move away from CUDA, the churn of the alternative models would be scary. Do you want to develop for CUDA which has been stable and developed for quite some time, or do you want to try a new standard that you don’t known if it will be obsolete in a couple of years?

BadInformatics · on March 1, 2021

> do you want to try a new standard that you don’t known if it will be obsolete in a couple of years

Some incarnation of oneAPI is bound to exist as long as Intel has a foothold in the HPC market. For example, MKL and MKL-DNN have been rebranded with a oneAPI prefix. So no, the long term stability argument doesn't hold water.

I'd also note that they appear to have added support for cuBLAS and cuDNN as backends in their respective oneAPI libraries. It would be hilarious if that lead to more people running oneAPI on non-Intel hardware than first party stuff.

jcelerier · on March 1, 2021

otoh Intel has over the years developed a few parallel libraries / extensions that have then been deprecated, for instance Cilk Plus...

volta83 · on March 1, 2021

and ISPC, and... and...

celrod · on March 1, 2021

Has ispc been deprecated? Last GitHub commit was three days ago.

volta83 · on March 2, 2021

its open source, so it has been kept "alive" on and off for years already

moonbug · on March 1, 2021

it's a pet project.

g0xA52A2A · on March 1, 2021

> Seems like yet another CUDA alternative.

How many real alternatives are there though aside from OpenCL? Not to mention if you're interested in targeting something that isn't an Nvidia GPU.

hctaw · on March 1, 2021

The problem of a heterogeneity in GPGPUs is that AMD and Intel can't make products that are competitive with NVidia. The "programming model" for "cross industries" (whatever those mean, the whole naming of this project is weird) aren't particularly deep moats if there were competitive solutions at lower prices.

shmerl · on March 1, 2021

AMD GPUs are better than Nvidia for that - they had async compute way longer. I thought the problem was CUDA lock-in and lack of a nice programming model using something modern like Rust may be.

Jhsto · on March 1, 2021

Then again CUDA is pretty accessible language to program SIMD machines. I am bit sceptic if most people would recognise subjectively well-done SIMD language (lets say from correctness and performance standpoint). The programming model for SIMD is inherently parallel with exceptions for sequential code, which is pretty much the exact opposite of the programming you do on the CPU side. I'd somehow imagine most people would not find that modern at all, or alternatively, it would not be the language with "product market fit" since it's less likely to catch on. AMD or Intel may have a "better" language in some sense, but it seems most people prefer familiarity with what they already know.

marmaduke · on March 1, 2021

> I am bit sceptic if most people would recognise subjectively well-done SIMD language (lets say from correctness and performance standpoint)

I found ISPC (https://ispc.github.io) to be a little easier to work with than CUDA, for SIMD on CPUs, as it sits a bit in the middle of the CUDA and regular C models. Interop can work via C ABI i.e. trivial, it generates regular object files, and so on.

shmerl · on March 1, 2021

CUDA is still stuck with Nvidia only. So improving on that and on the language itself is clearly an open item.

Jhsto · on March 1, 2021

Well, oneAPI can be used to do that. This submission is most likely in reference of a post a few days ago [1]. Here, PTX which CUDA targets is retargeted into SPIR-V which is ran atop oneAPI [2].

[1]: https://news.ycombinator.com/item?id=26262038

[2]: https://spec.oneapi.com/level-zero/latest/index.html

moonbug · on March 1, 2021

but almost no one codes to the CUDA driver api.

oivey · on March 1, 2021

By async compute, do you mean something different than what CUDA streams expose?

AMD’s compute cards have generally been worse than Nvidia’s as far as I know.

shmerl · on March 1, 2021

I mean actual hardware capabilities. AMD GPUs were ahead in that.

See: http://developer.amd.com/wordpress/media/2012/10/Asynchronou...

oivey · on March 1, 2021

This only seems to applicable to shaders? GPGPU in most contexts is referring to code divorced from any graphics pipeline, like CUDA or ROCm. CUDA has had asynchronous and concurrent kernel calls for a long time. How are asynchronous compute shaders relevant in that context?

monocasa · on March 1, 2021

Compute shaders are divorced from the graphics pipeline as well. As far as the hardware is concerned, CUDA/ROCm and compute shaders are the same thing.

Jasper_ · on March 1, 2021

Compute shaders and CUDA have very different execution and driver models. Just take a look at the way memory is managed in each.

monocasa · on March 1, 2021

Can you be specific? Everything I've seen looks like a user space host driver difference.

Also, I the case of "asynchronous shaders" it's a command list dispatching feature, which is equally applicable to CUDA/ROCm and compute shaders.

Jasper_ · on March 2, 2021

Have you seen CUDA? It's C++, where half of it runs on the CPU, and half of it runs on the GPU. Memory transfers are automatically scheduled for you! No, obviously the hardware doesn't magically transform when you run a CUDA program, but there's a lot of features in the hardware that just aren't exercised by compute shaders, and the 'user-space host driver' is very different.

> Also, I the case of "asynchronous shaders" it's a command list dispatching feature, which is equally applicable to CUDA/ROCm and compute shaders.

Async compute affects how you have to synchronize your code, and what your hazards are. If you have a queue of items, you know when one starts that the other is done. With async compute, you can give up that guarantee in exchange for overlapping work. NVIDIA still doesn't have this in modern GPUs.

my123 · on March 2, 2021

You have async compute on NVIDIA GPUs.

See this slide deck about it: https://developer.download.nvidia.com/CUDA/training/StreamsA...

(for the implementation in Fermi, it's even more flexible since then)

oivey · on March 1, 2021

CUDA has had asynchronous command lists (multiple, actually, via streams) forever. What does AMD’s special hardware enable? Is it exposed in ROCm somehow?

Edit: Maybe I should be more precise with my language. Using CUDA streams, you can queue up and execute multiple kernels (basically commands?) in parallel. Kernels invoked on the same stream are serialized. I have no idea what the exact analog is in the compute shader world.

monocasa · on March 1, 2021

"Asynchronous shaders" doesn't mean that host code is unblocked on job dispatch (that's been the case for nearly all GPU APIs for going on 40 years). It's having multiple hardware queues with different priorities to be able to have other work to do when a kernel/shader is only using a subset of the SMs/CUs. So it's a hardware balancing feature for non heterogeneous workloads.

Nvidia hardware won't dispatch multiple kernels simultaneously, all of the streams stuff you're seeing is hints to the user space driver in how it orders the work in it's single hardware command queue.

oivey · on March 2, 2021

Is that not what this is? https://docs.nvidia.com/cuda/cuda-c-programming-guide/index....

I thought this allowed kernels to run in parallel if both kernels could fit in the open resources. It doesn't have a priority system, and I'm guessing isn't smart enough to reorder kernels to fit together better. Further Googling to be sure that they mean parallel AND concurrent didn't turn up much, but Nvidia does appear to mean both.

my123 · on March 2, 2021

cudaStreamCreateWithPriority is what you'd like to use for creating a stream with a specifically defined priority.

The GPU is able to run up to 128 kernels in parallel on modern NVIDIA GPUs, concurrently.

oivey · on March 2, 2021

Ahh ok. I’ve had a similar conversation with a graphics guy that strongly paralleled this conversation, and I was confused then, too. Seemingly the asynchronous compute issues are a limitation when it only comes to shaders I guess? I write some CUDA but no shaders, and certainly not so much that I’d make super concrete claims about more esoteric features like streams.

my123 · on March 1, 2021

Were. AMD totally squandered their advantage that they had with launching GCN early through an absolutely awful software stack.

shmerl · on March 1, 2021

RDNA 2 only improved on the above. Nvidia were and remain a dead end with CUDA lock-in, but I agree that there should be more combined efforts to break it.

my123 · on March 1, 2021

Lol? ROCm is a mess.

It's awful. And it is not even supported on RDNA and RDNA2. They even dropped Polaris support last month so that it's only supported on Vega.

And that's very much not a high quality GPU compute stack, sorry...

CUDA isn't a dead end and isn't going away any time soon.

shmerl · on March 1, 2021

I was talking about hardware, not about ROCm. I said above, GPU compute could benefit from a nice programming model with a modern language. That will dislodge CUDA lock-in for good.

CUDA isn't dead, but there should be a bigger effort to get rid of it because it's not a proper GPU programming tool but rather a market manipulation tool by Nvidia.

my123 · on March 1, 2021

Nowadays, modern Nvidia hardware (Volta onwards) handles compute workloads better, notably through independent thread scheduling.

RDNA2 with its cache and low main memory bandwidth didn't help either, because while the cache is there, you're going to exceed what it can absorb by a lot in compute workloads...

shmerl · on March 1, 2021

I doubt there is major difference. And if anything, AMD caught up to Nvidia in other areas. CUDA isn't used because Nvidia hardware is better. If it was about hardware, Nvidia wouldn't have used lock-in.

my123 · on March 2, 2021

AMD doesn't even have the performance crown, by a long shot. NVIDIA did hold that pretty much non-stop since... Maxwell 2?

As of no difference in compute workloads, I can put Blender as an example: https://techgage.com/article/blender-2-91-best-cpus-gpus-for...

(6900 XT is 20.6 TFLOPS FP32, RTX 3090 is 35.6 TFLOPS FP32 _and_ also has the tensor cores, AMD can't win there because of sheer brute force)

And AMD's software stack removes them from consideration for a huge extent of compute workloads. If they aren't ready to have a proper one, it doesn't matter how good how much their hardware is good or not, it might as well be a brick.

shmerl · on March 2, 2021

I don't see Nvidia leading in raw compute power, both in GPGPU and in gaming. At most there will be parity going forward. Which will prompt Nvidia to double down on lock-ins like CUDA.

Either way, CUDA should be gone for good and replaced with something that's not tied to specific GPU. I welcome competition in hardware. I have no respect for lock-in shenanigans.

oivey · on March 2, 2021

They’re leading right now by a bit, and when it comes to workloads involving tensor cores or ray tracing they’re burying AMD.

I wish CUDA wasn’t vendor locked, too, but AMD and Intel need to start taking GPGPU software seriously. They’ve dug themselves a massive hole at this point.

shmerl · on March 2, 2021

I don't see a need to whitewash Nvidia here. They are nasty lock-in proponents. That doesn't mean AMD and Intel shouldn't invest more in improving the software side of things. Which I think they are doing (like the actual post example), so it should be getting better.

nynx · on March 1, 2021

rust-gpu (https://shader.rs) is a spirv backend for Rust. It's early days, but already really promising. In my opinion, applying a good package manager (cargo) and a good language (rust) to gpgpu and rendering could result in great things.

shmerl · on March 2, 2021

Yeah, that could be interesting. But it's still the shader approach, you just write shaders in Rust here.

May be some other model can be designed that's nicer. CUDA isn't really using shader paradigm, right?

nynx · on March 2, 2021

Compute shaders are already supported and I believe support for opencl kernels, which are essentially expanded compute shaders, is on the long-term roadmap.

volta83 · on March 1, 2021

> AMD GPUs are better than Nvidia for that

Better in what sense?

> they had async compute way longer

What is that?

shmerl · on March 1, 2021

Something like this: http://developer.amd.com/wordpress/media/2012/10/Asynchronou...

volta83 · on March 2, 2021

How is that better than CUDA streams, CUDA graphs, etc. ?

dljsjr · on March 1, 2021

Something similar that has been around for ages and isn't controlled by Intel is MAGMA: https://bitbucket.org/icl/magma/src/master/

jdc · on March 1, 2021

MAGMA Users' Guide
Univ. of Tennessee, Knoxville
Univ. of California, Berkeley
Univ. of Colorado, Denver
Date
October 2020
The goal of the MAGMA project is to create a new generation of linear algebra libraries that achieves the fastest possible time to an accurate solution on heterogeneous architectures, starting with current multicore + multi-GPU systems. To address the complex challenges stemming from these systems' heterogeneity, massive parallelism, and the gap between compute speed and CPU-GPU communication speed, MAGMA's research is based on the idea that optimal software solutions will themselves have to hybridize, combining the strengths of different algorithms within a single framework. Building on this idea, the goal is to design linear algebra algorithms and frameworks for hybrid multicore and multi-GPU systems that can enable applications to fully exploit the power that each of the hybrid components offers.
Designed to be similar to LAPACK in functionality, data storage, and interface, the MAGMA library allows scientists to easily port their existing software components from LAPACK to MAGMA, to take advantage of the new hybrid architectures. MAGMA users do not have to know CUDA in order to use the library.

There are two types of LAPACK-style interfaces. The first one, referred to as the CPU interface, takes the input and produces the result in the CPU's memory. The second, referred to as the GPU interface, takes the input and produces the result in the GPU's memory. In both cases, a hybrid CPU/GPU algorithm is used. Also included is MAGMA BLAS, a complementary to CUBLAS routines.

http://icl.utk.edu/projectsfiles/magma/doxygen/index.html

oivey · on March 1, 2021

Isn’t MAGMA more like a canned set of linear algebra routines rather than a general interface for executing arbitrary code on GPUs? That’s pretty different.

dljsjr · on March 2, 2021

Well at the end of the day they're both libraries for doing heterogeneous parallel compute (CPU + GPU w/ one API).

oneAPI might be more general but at the end of the day heterogeneous HPC converges on linear algebra. So it might have different ergonomics but it's very similar.

xiphias2 · on March 2, 2021

This should be called oneMoreAPI.

The right way to go would be to first upstream, finish and optimize AMD and Apple M1 support to PyTorch and Tensorflow and XLA, while trying to abstract as much to common libraries as possible.

Trying to create an API first just slows development down even more.

g0xA52A2A · on March 1, 2021

Both Intel's oneAPI and AMDs ROCm/HIP stacks success are obviously tied to (and part of) the success of each companies future products. And today I'd bet AMD will come off better. They have multiple contracts for HPC systems with their future GPU products as well as an existing presence in consumer gaming products.

Intel's effort seem to be banking on the Aroura system at ANL. Beyond that I don't know who's lining up to by Xe GPUs? Though I guess oneAPI will fit into whatever they do further down the line.

jdc · on March 1, 2021

There's also OCCA which does JIT compilation to C++, OpenMP, CUDA, HIP, OpenCL and Metal. Originally built at Virginia Tech and now maintained by the Center for Efficient Exascale Discretizations of the DOE.

And it seems to have decent Python bindings.

https://libocca.org

DethNinja · on March 1, 2021

Why would someone use this instead of plain old OpenCL(or CUDA) with C++?

What is the value proposal here?

jpsamaroo · on March 1, 2021

OpenCL and various other solutions basically require that one writes kernels in C/C++. This is an unfortunate limitation, and can make it hard for less experienced users (researchers especially) to write correct and performant GPU code, since neither language lends itself to writing many mathematical and scientific models in a clean, maintainable manner (in my opinion).

What oneAPI (the runtime), and also AMD's ROCm (specifically the ROCR runtime), do that is new is that they enable packages like oneAPI.jl [1] and AMDGPU.jl [2] to exist (both Julia packages), without having to go through OpenCL or C++ transpilation (which we've tried out before, and it's quite painful). This is a great thing, because now users of an entirely different language can still utilize their GPUs effectively and with near-optimal performance (optimal w.r.t what the device can reasonably attain).

[1] https://github.com/JuliaGPU/oneAPI.jl [2] https://github.com/JuliaGPU/AMDGPU.jl

my123 · on March 1, 2021

Which is something that CUDA provided since the very beginning. (with PTX)

pjmlp · on March 1, 2021

No one will take over CUDA's dominance until they realize that one reason why most researchers flocked into it were its polyglot capabilities, and graphical debuggers.

moonbug · on March 1, 2021

in other words, Nvidia's product execution was spot-on.

pjmlp · on March 2, 2021

Helped by Khronos focus of never supporting anything other than C and letting the community come up with tools.

So for years, before they started having a beating OpenCL was all about a C99 dialect with printf debugging.

SPIR and support for C++ came later as they were already taking a beating, trying to get up.

Apparently that is also the reason why Apple gave up on OpenCL, disagreements on how it should be going, after they gave 1.0 to Khronos.

Just compare Metal, a OOP API for GPUs, with Objective-C/Swift bindings, using a C++ dialect as shading language, a framework for data management, with Vulkan/OpenGL/OpenCL.

moonbug · on March 3, 2021

well, yeah. All of the opencl implementations were (and probably still are) awful. Other than requiring a lot of boilerplate, I found it adequate, essentially a clone of the the CUDA C driver API.

opencl 2 went into the weeds going full c++, but that all got rolled back with version 3.

jpsamaroo · on March 1, 2021

Yes, and that's why Julia gained CUDA support first. My point was to respond to "Why would someone use this instead of plain old OpenCL(or CUDA) with C++?", and my answer was, "you can use something other than OpenCL C or C++". I'm not trying to say that CUDA is any lesser of a platform because of this; instead, other vendor's GPUs are now becoming easier to use and program.

mwkaufma · on March 1, 2021

Problem: too many standards. Solution: add another standard.

hestefisk · on March 2, 2021

<Insert obligatory xkcd strip here> :)

mhd · on March 1, 2021

My Rational Unified Process PTSD is triggered.

hestefisk · on March 2, 2021

What does RUP have to do with APIs?

incrudible · on March 1, 2021

We already have way more APIs than vendors, letting every vendor have their own one API would actually simplify things.

RocketSyntax · on March 1, 2021

Is "oneAPI Deep Neural Network Library (oneDNN)" an alternative to ONNX?

taylorlapeyre · on March 1, 2021

Situation: there are now n+1 competing standards.

https://xkcd.com/927/

rizzir · on March 1, 2021

By reading the name alone I can't help but think of the xkcd related to standards.

brianberns · on March 1, 2021

Classic relevant xkcd: https://xkcd.com/927/

harles · on March 1, 2021

Exactly what I was thinking of. I’m having a lot of trouble understanding the value proposition to vendors here.

BadInformatics · on March 1, 2021

It's better to think of this as a more friendly (i.e. open source development model) first party compute stack than some kind of pan-vendor standard. For example, anyone using MKL is now nominally using oneAPI libraries. They also went to the trouble of implementing/pushing existing standards instead of baking their own thing: SPIR-V for the IR format, SYCL for the high level programming interface, an OpenCL 3 implementation (AIUI they have the most complete implementation of 2.x and 3), etc.

dyingkneepad · on March 1, 2021

So "Open Source development model" (developing in the open) is not the same thing as "Open Source Code" (developing in closed doors, then throwing code to the other side of the wall once in a while). Intel has a history of doing both depending on the project and internal groups involved, where only really the Open Development projects are actually successful and long-term. OneAPI uses a lot of components and I am not entirely sure they all follow the Open Development model. There's a lot of Open Source stuff out there that you basically can't contribute to: your contributions are ignored because source code is open but development isn't. Does anybody here know about how this is done for the OneAPI-related projects?

BadInformatics · on March 2, 2021

My sense is that they're somewhat better than, say, AMD at keeping development work in the open instead of just throwing code over the wall. Hence the choice of "development model" instead of just "FOSS". It's a low bar though, IIRC a lot of ROCm is just periodic code dumps with no visible PR strategy.

someguydave · on March 1, 2021

The value is that Intel wants to compete with nvidia

stonogo · on March 1, 2021

I can't help but assume this idea seemed much more compelling when Intel was still selling Omnipath.