I'm having a hard time trusting Intel with anything AI mid to long term. OpenVINO was great until they removed support for Intel FPGAs, and they've been closing off and deprecating the low-level APIs that were necessary for low-latency, low-watt work on ex-Movidius hardware (and Keem Bay always seems to be 'next semester'...). We already have huge lock-in with Nvidia, but at least there we get broad expertise and a huge perf boost. What does oneAPI bring except 'portability' and 'runs as fast as tf/torch/TVM'... OpenCL? ROCm? Vulkan targets? Who's going to debug and support all that?
I'm struggling to see what's enticing in this for Python people who already have largely optimised torch and tf CPU and GPU backends, especially for batch work. And for latency-sensitive inference, I thought the 'industry' was going for TVM or other 'target all the things at the code level' approaches.
I'm thinking: gimme your network in ONNX format, every vendor gives a C inference/compilation API, and let everyone optimize /behind/ that... Xilinx, AMD, whoever-Google-bought-last-week...
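ONNX Runtime's execution-provider setup is roughly that shape already: the consumer side is a small C/C++ API, and vendors optimize behind it. A rough sketch of the caller side (the tensor names and shapes here are made up; they depend entirely on the model):

    #include <onnxruntime_cxx_api.h>
    #include <vector>

    int main() {
        Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "demo");
        Ort::SessionOptions opts;
        // Vendors plug in /behind/ this line: CUDA, ROCm, OpenVINO, TensorRT, ...
        // e.g. opts.AppendExecutionProvider_CUDA(OrtCUDAProviderOptions{});
        Ort::Session session(env, "model.onnx", opts);

        std::vector<float> input(1 * 3 * 224 * 224, 0.0f);   // dummy image-shaped input
        std::vector<int64_t> shape{1, 3, 224, 224};
        auto mem = Ort::MemoryInfo::CreateCpu(OrtArenaAllocator, OrtMemTypeDefault);
        Ort::Value tensor = Ort::Value::CreateTensor<float>(
            mem, input.data(), input.size(), shape.data(), shape.size());

        const char* in_names[]  = {"input"};    // hypothetical tensor names
        const char* out_names[] = {"output"};
        auto out = session.Run(Ort::RunOptions{nullptr},
                               in_names, &tensor, 1, out_names, 1);
        return 0;
    }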
Seems like yet another CUDA alternative. Thing is, even if you wanted to move away from CUDA, the churn of the alternative models would be scary. Do you want to develop for CUDA, which has been stable and actively developed for quite some time, or do you want to bet on a new standard that might be obsolete in a couple of years?
> do you want to bet on a new standard that might be obsolete in a couple of years
Some incarnation of oneAPI is bound to exist as long as Intel has a foothold in the HPC market. For example, MKL and MKL-DNN have been rebranded with a oneAPI prefix. So no, the long-term stability argument doesn't hold water.
I'd also note that they appear to have added support for cuBLAS and cuDNN as backends in their respective oneAPI libraries. It would be hilarious if that led to more people running oneAPI on non-Intel hardware than on Intel's own first-party stuff.
The problem with heterogeneity in GPGPU is that AMD and Intel can't make products that are competitive with Nvidia. A "programming model" for "cross industries" (whatever that means; the whole naming of this project is weird) wouldn't be a particularly deep moat if there were competitive solutions at lower prices.
AMD GPUs are better than Nvidia for that - they've had async compute way longer. I thought the problem was CUDA lock-in and the lack of a nice programming model using something modern like Rust, maybe.
Then again, CUDA is a pretty accessible language for programming SIMD machines. I am a bit sceptical that most people would recognise a subjectively well-done SIMD language (let's say from a correctness and performance standpoint). The programming model for SIMD is inherently parallel, with exceptions for sequential code, which is pretty much the exact opposite of the programming you do on the CPU side. I'd imagine most people would not find that modern at all, or alternatively, it would not be the language with "product-market fit", since it's less likely to catch on. AMD or Intel may have a "better" language in some sense, but it seems most people prefer the familiarity of what they already know.
> I am a bit sceptical that most people would recognise a subjectively well-done SIMD language (let's say from a correctness and performance standpoint)
I found ISPC (https://ispc.github.io) to be a little easier to work with than CUDA for SIMD on CPUs, as it sits somewhere between the CUDA and regular C models. Interop is trivial since it works over the C ABI: it generates regular object files (and headers), and so on.
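For what it's worth, the interop side really is just a normal function call against the header ispc generates. A rough sketch (the `scale` kernel and its signature are invented for illustration):

    // Hypothetical kernel in simple.ispc:
    //   export void scale(uniform float vin[], uniform float vout[],
    //                     uniform float k, uniform int count) { ... }
    // Built with something like: ispc simple.ispc -o simple.o -h simple_ispc.h

    #include "simple_ispc.h"   // generated header; wraps the exports in namespace ispc for C++
    #include <vector>

    int main() {
        std::vector<float> in(1024, 1.0f), out(1024);
        // Just a plain call across the C ABI; link simple.o alongside this file.
        ispc::scale(in.data(), out.data(), 2.0f, static_cast<int>(in.size()));
        return 0;
    }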
Well, oneAPI can be used to do that. This submission is most likely in reference to a post from a few days ago [1]. There, the PTX that CUDA targets is retargeted into SPIR-V, which is run atop oneAPI [2].
This only seems to be applicable to shaders? GPGPU in most contexts refers to code divorced from any graphics pipeline, like CUDA or ROCm. CUDA has had asynchronous and concurrent kernel calls for a long time. How are asynchronous compute shaders relevant in that context?
Compute shaders are divorced from the graphics pipeline as well. As far as the hardware is concerned, CUDA/ROCm and compute shaders are the same thing.
Have you seen CUDA? It's C++, where half of it runs on the CPU and half of it runs on the GPU. Memory transfers are automatically scheduled for you! No, obviously the hardware doesn't magically transform when you run a CUDA program, but there are a lot of features in the hardware that just aren't exercised by compute shaders, and the 'user-space host driver' is very different.
> Also, in the case of "asynchronous shaders" it's a command list dispatching feature, which is equally applicable to CUDA/ROCm and compute shaders.
Async compute affects how you have to synchronize your code, and what your hazards are. If you have a single queue of items, you know that when one starts, the previous one is done. With async compute, you can give up that guarantee in exchange for overlapping work. NVIDIA still doesn't have this in modern GPUs.
CUDA has had asynchronous command lists (multiple, actually, via streams) forever. What does AMD’s special hardware enable? Is it exposed in ROCm somehow?
Edit: Maybe I should be more precise with my language. Using CUDA streams, you can queue up and execute multiple kernels (basically commands?) in parallel. Kernels invoked on the same stream are serialized. I have no idea what the exact analog is in the compute shader world.
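For reference, a minimal stream sketch of what I mean (error checking omitted; d_x0/d_y0/d_x1/d_y1 are assumed to be device buffers allocated elsewhere):

    __global__ void saxpy(int n, float a, const float* x, float* y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i] + y[i];
    }

    // Host side, inside some function:
    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    int n = 1 << 20, threads = 256, blocks = (n + threads - 1) / threads;
    saxpy<<<blocks, threads, 0, s0>>>(n, 2.0f, d_x0, d_y0);  // these launches return immediately
    saxpy<<<blocks, threads, 0, s1>>>(n, 2.0f, d_x1, d_y1);  // and may overlap with the s0 kernel
    saxpy<<<blocks, threads, 0, s0>>>(n, 3.0f, d_x0, d_y0);  // but this waits for the first s0 kernel

    cudaStreamSynchronize(s0);  // the host only blocks when asked to
    cudaStreamSynchronize(s1);
    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);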
"Asynchronous shaders" doesn't mean that host code is unblocked on job dispatch (that's been the case for nearly all GPU APIs for going on 40 years). It's having multiple hardware queues with different priorities to be able to have other work to do when a kernel/shader is only using a subset of the SMs/CUs. So it's a hardware balancing feature for non heterogeneous workloads.
Nvidia hardware won't dispatch multiple kernels simultaneously; all of the streams stuff you're seeing is hints to the user-space driver about how it orders the work in its single hardware command queue.
I thought this allowed kernels to run in parallel if both kernels could fit in the available resources. It doesn't have a priority system, and I'm guessing it isn't smart enough to reorder kernels to fit together better. Further Googling to be sure that they mean parallel AND concurrent didn't turn up much, but Nvidia does appear to mean both.
Ahh ok. I've had a conversation with a graphics guy that strongly paralleled this one, and I was confused then, too. Seemingly the asynchronous compute issues are only a limitation when it comes to shaders, I guess? I write some CUDA but no shaders, and certainly not so much that I'd make super concrete claims about more esoteric features like streams.
RDNA 2 only improved on the above. Nvidia were and remain a dead end with CUDA lock-in, but I agree that there should be more combined efforts to break it.
I was talking about hardware, not about ROCm. As I said above, GPU compute could benefit from a nice programming model with a modern language. That would dislodge CUDA lock-in for good.
CUDA isn't dead, but there should be a bigger effort to get rid of it because it's not a proper GPU programming tool but rather a market manipulation tool by Nvidia.
Nowadays, modern Nvidia hardware (Volta onwards) handles compute workloads better, notably through independent thread scheduling.
RDNA2's cache and low main-memory bandwidth didn't help either: the cache is there, but compute workloads are going to exceed what it can absorb by a lot...
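Re the independent thread scheduling point: the textbook illustration is a fine-grained lock taken by individual threads of the same warp. A sketch like the one below can hang under pre-Volta lockstep SIMT scheduling (the lock holder gets starved by its own warp-mates spinning), while Volta and later make forward progress. This is the standard Volta example, not production code:

    __global__ void per_thread_increment(int* lock, int* counter) {
        // Every thread takes the same spinlock. Pre-Volta, the warp executes in
        // lockstep, so the thread holding the lock can be starved by the rest of
        // its warp spinning; independent thread scheduling avoids that.
        bool done = false;
        while (!done) {
            if (atomicCAS(lock, 0, 1) == 0) {   // try to take the lock
                atomicAdd(counter, 1);          // critical section
                __threadfence();
                atomicExch(lock, 0);            // release
                done = true;
            }
        }
    }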
I doubt there is a major difference. And if anything, AMD has caught up to Nvidia in other areas. CUDA isn't used because Nvidia hardware is better. If it were about hardware, Nvidia wouldn't have resorted to lock-in.
(The 6900 XT is 20.6 TFLOPS FP32; the RTX 3090 is 35.6 TFLOPS FP32 _and_ also has tensor cores. AMD can't win there against that sheer brute force.)
And AMD's software stack removes them from consideration for a huge swath of compute workloads. If they aren't ready to ship a proper one, it doesn't matter how good their hardware is; it might as well be a brick.
I don't see Nvidia leading in raw compute power, either in GPGPU or in gaming. At most there will be parity going forward. Which will prompt Nvidia to double down on lock-ins like CUDA.
Either way, CUDA should be gone for good and replaced with something that's not tied to a specific GPU vendor. I welcome competition in hardware. I have no respect for lock-in shenanigans.
They’re leading right now by a bit, and when it comes to workloads involving tensor cores or ray tracing they’re burying AMD.
I wish CUDA wasn’t vendor locked, too, but AMD and Intel need to start taking GPGPU software seriously. They’ve dug themselves a massive hole at this point.
I don't see a need to whitewash Nvidia here. They are nasty lock-in proponents. That doesn't mean AMD and Intel shouldn't invest more in improving the software side of things, which I think they are doing (this very post being an example), so it should be getting better.
rust-gpu (https://shader.rs) is a SPIR-V backend for Rust. It's early days, but already really promising. In my opinion, applying a good package manager (cargo) and a good language (Rust) to GPGPU and rendering could result in great things.
Compute shaders are already supported, and I believe support for OpenCL kernels, which are essentially expanded compute shaders, is on the long-term roadmap.
The goal of the MAGMA project is to create a new generation of linear algebra libraries that achieves the fastest possible time to an accurate solution on heterogeneous architectures, starting with current multicore + multi-GPU systems. To address the complex challenges stemming from these systems' heterogeneity, massive parallelism, and the gap between compute speed and CPU-GPU communication speed, MAGMA's research is based on the idea that optimal software solutions will themselves have to hybridize, combining the strengths of different algorithms within a single framework. Building on this idea, the goal is to design linear algebra algorithms and frameworks for hybrid multicore and multi-GPU systems that can enable applications to fully exploit the power that each of the hybrid components offers.
Designed to be similar to LAPACK in functionality, data storage, and interface, the MAGMA library allows scientists to easily port their existing software components from LAPACK to MAGMA, to take advantage of the new hybrid architectures. MAGMA users do not have to know CUDA in order to use the library.
There are two types of LAPACK-style interfaces. The first one, referred to as the CPU interface, takes the input and produces the result in the CPU's memory. The second, referred to as the GPU interface, takes the input and produces the result in the GPU's memory. In both cases, a hybrid CPU/GPU algorithm is used. Also included is MAGMA BLAS, a set of routines complementary to cuBLAS.
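Roughly, the two interfaces look like this for an LU factorization (a sketch assuming the LAPACK-style dgetrf naming and MAGMA 2.x headers; check the MAGMA docs for exact signatures, and note the two calls are shown back to back only for illustration):

    #include <magma_v2.h>
    #include <cuda_runtime.h>

    // A is an m-by-n column-major matrix already filled in on the host.
    void factor(double* A, magma_int_t m, magma_int_t n, magma_int_t* ipiv) {
        magma_init();
        magma_int_t info = 0;

        // CPU interface: input and result live in host memory; MAGMA moves the
        // data and splits the work across CPU and GPU behind the scenes.
        magma_dgetrf(m, n, A, m, ipiv, &info);

        // GPU interface: input and result live in device memory.
        double* dA = nullptr;
        cudaMalloc(&dA, sizeof(double) * m * n);
        cudaMemcpy(dA, A, sizeof(double) * m * n, cudaMemcpyHostToDevice);
        magma_dgetrf_gpu(m, n, dA, m, ipiv, &info);
        cudaMemcpy(A, dA, sizeof(double) * m * n, cudaMemcpyDeviceToHost);
        cudaFree(dA);

        magma_finalize();
    }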
Isn’t MAGMA more like a canned set of linear algebra routines rather than a general interface for executing arbitrary code on GPUs? That’s pretty different.
Well at the end of the day they're both libraries for doing heterogeneous parallel compute (CPU + GPU w/ one API).
oneAPI might be more general but at the end of the day heterogeneous HPC converges on linear algebra. So it might have different ergonomics but it's very similar.
The right way to go would be to first upstream, finish, and optimize AMD and Apple M1 support in PyTorch, TensorFlow, and XLA, while trying to abstract as much as possible into common libraries.
Trying to create an API first just slows development down even more.
The success of both Intel's oneAPI and AMD's ROCm/HIP stacks is obviously tied to (and part of) the success of each company's future products. And today I'd bet AMD will come off better: they have multiple contracts for HPC systems with their future GPU products, as well as an existing presence in consumer gaming products.
Intel's effort seems to be banking on the Aurora system at ANL. Beyond that, I don't know who's lining up to buy Xe GPUs. Though I guess oneAPI will fit into whatever they do further down the line.
There's also OCCA which does JIT compilation to C++, OpenMP, CUDA, HIP, OpenCL and Metal. Originally built at Virginia Tech and now maintained by the Center for Efficient Exascale Discretizations of the DOE.
OpenCL and various other solutions basically require that one writes kernels in C/C++. This is an unfortunate limitation, and can make it hard for less experienced users (researchers especially) to write correct and performant GPU code, since neither language lends itself to writing many mathematical and scientific models in a clean, maintainable manner (in my opinion).
What oneAPI (the runtime), and also AMD's ROCm (specifically the ROCR runtime), do that is new is that they enable packages like oneAPI.jl [1] and AMDGPU.jl [2] to exist (both Julia packages), without having to go through OpenCL or C++ transpilation (which we've tried out before, and it's quite painful). This is a great thing, because now users of an entirely different language can still utilize their GPUs effectively and with near-optimal performance (optimal w.r.t what the device can reasonably attain).
No one will take over CUDA's dominance until they realize that one reason most researchers flocked to it was its polyglot capabilities and graphical debuggers.
Helped by Khronos's focus on never supporting anything other than C and leaving the community to come up with the tools.
So for years, before they started taking a beating, OpenCL was all about a C99 dialect with printf debugging.
SPIR and support for C++ came later, when they were already taking a beating and trying to get back up.
Apparently that is also the reason why Apple gave up on OpenCL: disagreements over where it should be going, after they handed 1.0 over to Khronos.
Just compare Metal, an OOP API for GPUs with Objective-C/Swift bindings, a C++ dialect as its shading language, and a framework for data management, against Vulkan/OpenGL/OpenCL.
Well, yeah. All of the OpenCL implementations were (and probably still are) awful. Other than requiring a lot of boilerplate, I found the API itself adequate, essentially a clone of the CUDA C driver API.
OpenCL 2 went into the weeds by going full C++, but that all got rolled back with version 3.
Yes, and that's why Julia gained CUDA support first. My point was to respond to "Why would someone use this instead of plain old OpenCL(or CUDA) with C++?", and my answer was, "you can use something other than OpenCL C or C++". I'm not trying to say that CUDA is any lesser of a platform because of this; instead, other vendor's GPUs are now becoming easier to use and program.
It's better to think of this as a more friendly (i.e. open source development model) first party compute stack than some kind of pan-vendor standard. For example, anyone using MKL is now nominally using oneAPI libraries. They also went to the trouble of implementing/pushing existing standards instead of baking their own thing: SPIR-V for the IR format, SYCL for the high level programming interface, an OpenCL 3 implementation (AIUI they have the most complete implementation of 2.x and 3), etc.
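To make the SYCL part concrete: the high-level interface is plain C++ with a queue and lambdas, no separate kernel language. A minimal vector-add sketch in SYCL 2020 style (vendor-agnostic; the default selector just picks whatever device is available):

    #include <sycl/sycl.hpp>   // <CL/sycl.hpp> on older toolchains
    #include <vector>

    int main() {
        constexpr size_t N = 1024;
        std::vector<float> a(N, 1.0f), b(N, 2.0f), c(N, 0.0f);

        sycl::queue q;  // default selector: a GPU if one is available, else the CPU
        {
            sycl::buffer<float> A(a.data(), sycl::range<1>(N));
            sycl::buffer<float> B(b.data(), sycl::range<1>(N));
            sycl::buffer<float> C(c.data(), sycl::range<1>(N));

            q.submit([&](sycl::handler& h) {
                sycl::accessor ra(A, h, sycl::read_only);
                sycl::accessor rb(B, h, sycl::read_only);
                sycl::accessor wc(C, h, sycl::write_only, sycl::no_init);
                h.parallel_for(sycl::range<1>(N), [=](sycl::id<1> i) {
                    wc[i] = ra[i] + rb[i];  // the "kernel" is just a C++ lambda
                });
            });
        }  // buffers go out of scope here, which copies the results back into c
        return 0;
    }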
So "Open Source development model" (developing in the open) is not the same thing as "Open Source Code" (developing in closed doors, then throwing code to the other side of the wall once in a while). Intel has a history of doing both depending on the project and internal groups involved, where only really the Open Development projects are actually successful and long-term. OneAPI uses a lot of components and I am not entirely sure they all follow the Open Development model. There's a lot of Open Source stuff out there that you basically can't contribute to: your contributions are ignored because source code is open but development isn't. Does anybody here know about how this is done for the OneAPI-related projects?
My sense is that they're somewhat better than, say, AMD at keeping development work in the open instead of just throwing code over the wall. Hence the choice of "development model" instead of just "FOSS". It's a low bar, though; IIRC a lot of ROCm is just periodic code dumps with no visible PR strategy.