CUDA C++ _can_ work like that. But I would say that these are mostly kiddie wheels for convenience. And because, in GPU programming, performance is king, most (?) kernel developers are likely to eventually need to drop those wheels. And then:
* No single source (although some headers might be shared)
* Kernels are compiled and linked at runtime, for the platform you're on, but also, in the general case, with extra definitions not known a priori (which differ for different inputs / over the course of running your program), and which have a massive effect on the generated code.
* You may or may not use some kind of compiled-kernel caching mechanism, but you certainly don't have all possible combinations of targets and definitions available ahead of time, since that would be millions of compiled kernels.
It should also be mentioned that OpenCL never included those kiddie wheels to begin with; although I have to admit that makes it less convenient to start working with.
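The caching point above can be sketched as a map keyed by everything that affects the compiled artifact: target architecture plus the exact set of definitions. This is a hypothetical in-memory sketch, not any real runtime's cache; production caches (e.g. on-disk ones) typically also key on compiler version and a hash of the source.

```cpp
#include <map>
#include <optional>
#include <string>
#include <tuple>
#include <vector>

// Key: everything that changes the generated code. A different target or a
// different set of -D definitions is, in effect, a different kernel.
struct KernelKey {
    std::string arch;                  // e.g. "compute_80", detected at runtime
    std::vector<std::string> defines;  // kept sorted for a canonical key
    bool operator<(const KernelKey& o) const {
        return std::tie(arch, defines) < std::tie(o.arch, o.defines);
    }
};

class KernelCache {
public:
    // Returns the cached binary blob (PTX / SPIR-V / device binary),
    // or nothing, in which case the caller must JIT-compile and store.
    std::optional<std::string> lookup(const KernelKey& k) const {
        auto it = cache_.find(k);
        if (it == cache_.end()) return std::nullopt;
        return it->second;
    }
    void store(const KernelKey& k, std::string binary) {
        cache_.emplace(k, std::move(binary));
    }
private:
    std::map<KernelKey, std::string> cache_;
};
```

A hit skips recompilation; a miss triggers a JIT compile followed by a `store`. The bullet's point is visible in the key type: the space of `arch` × definition combinations is far too large to enumerate ahead of time, so only the combinations actually encountered ever get compiled.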