I actually really hate CUDA's programming model and feel like it's too low-level to get any productive work done. I don't really blame Nvidia: they basically invented the programmable GPU, and it wouldn't be fair to expect them to also come up with the perfect programming model right out of the gate. But at this point it's pretty clear that having independent threads each working on their own program makes no sense. High-performance code requires coordinating work across many threads in a way that is completely different from what you're used to if you're coming from CPUs.
Of course, one might point out that GPUs are nothing like CPUs, but the programming model works super hard to hide this, so it's not really well designed in my book. I actually quite like the compilers people are designing these days for writing block-level code, because I feel like that better represents the work people actually want to do, and then you pick how you want it lowered.
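To make the contrast concrete, here is a minimal sketch (my own toy example, not from the comment above) of what a block-wide sum looks like when hand-written in plain per-thread CUDA, with explicit shared memory and barriers. This is roughly the boilerplate a block-level compiler lets you collapse into something like "out = sum(tile)":

    // Block-wide sum written the "classic" per-thread CUDA way.
    // Every thread stages one element, then the block cooperates through
    // shared memory with explicit __syncthreads() barriers.
    __global__ void block_sum(const float* in, float* out, int n) {
        __shared__ float smem[256];                 // assumes blockDim.x == 256 (power of two)
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        smem[threadIdx.x] = (i < n) ? in[i] : 0.0f; // each thread loads one value
        __syncthreads();

        // Tree reduction inside the block: half the active threads drop out each step.
        for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
            if (threadIdx.x < stride)
                smem[threadIdx.x] += smem[threadIdx.x + stride];
            __syncthreads();
        }
        if (threadIdx.x == 0)
            out[blockIdx.x] = smem[0];              // one partial sum per block
    }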
As for Nsight Systems, it is… ok, I guess? It's fine for games and such, but for HPC or AI it doesn't really surface the information you want. People who run their GPUs really hard already know they have kernels running all the time and what the performance characteristics of those kernels are. Nsight Compute is the tool that's supposed to tell you that, but it's kind of a mediocre profiler (some of this may be limitations of the hardware performance counters), and to use it effectively you basically have to read a bunch of blog posts by other people instead of official documentation.
Despite not having used it much, my impression was that Nvidia's "moat" is that they have good networking libraries, that they are (relatively) good at making sure all their tools work, and that they have invested in this consistently for a decade.
GPUs are a type of barrel processor, which are optimized for workloads without cache locality. As a fundamental principle, they replace the CPU cache with latency hiding behavior. Consequently, you can't use algorithms and data structures designed for CPUs, since most of those assume the existence of a CPU cache. Some things are very cheap on a barrel processor that are very expensive on a CPU and vice versa, which changes the way you think about optimization.
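As a rough illustration of the latency-hiding point (my own sketch; numSMs in the commented-out launch is a hypothetical value), the idiomatic pattern is to launch far more threads than there are execution units and let the warp scheduler overlap their memory stalls:

    // Instead of relying on a cache, the GPU keeps far more threads resident
    // than it has execution units. While one warp waits on a global-memory
    // load, the SM issues instructions from another warp, so memory latency
    // is overlapped with useful work.
    __global__ void saxpy(float a, const float* x, const float* y, float* out, int n) {
        // Grid-stride loop: the launch deliberately creates more threads than
        // strictly needed to cover the array once.
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
             i += blockDim.x * gridDim.x)
            out[i] = a * x[i] + y[i];   // load latency hidden by other resident warps
    }

    // Hypothetical launch: oversubscribe each SM with several blocks of 256
    // threads so the scheduler always has runnable warps.
    // saxpy<<<numSMs * 8, 256>>>(2.0f, x, y, out, n);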
The wide vectors on GPUs are somewhat irrelevant. Scalar barrel processors exist and have the same issues. A scalar barrel processor feels deceptively CPU-like and will happily compile and run normal CPU code. The performance will nonetheless be poor unless the C++ code is designed to be a good fit for the nature of a barrel processor, code which will look weird and non-idiomatic to someone who has only written code for CPUs.
There is no way to hide that a barrel processor is not a CPU, even though they superficially have a lot of CPU-like properties. Barrel processors are extremely efficient once you learn to write code for them and exceptionally well-suited to HPC, since they are not latency-sensitive. However, most people never learn how to write proper code for barrel processors.
Ironically, barrel processor style code architecture is easy to translate into highly optimized CPU code, just not the reverse.
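A toy illustration of that asymmetry (my own example, not the commenter's): pointer-chasing, cache-friendly CPU code gives a barrel processor nothing to overlap, while flat-array, independent-work-item code runs well on a GPU and also drops straight into a tight, vectorizable CPU loop:

    struct Node { float value; Node* next; };   // CPU-style linked list

    // CPU-idiomatic: each step depends on the previous load, so there is
    // nothing for the hardware to overlap; on a GPU this serializes badly.
    float sum_list(const Node* head) {
        float s = 0.0f;
        for (const Node* p = head; p != nullptr; p = p->next) s += p->value;
        return s;
    }

    // Barrel-processor-style: independent elements in a flat array. The same
    // structure works as a GPU kernel or as an auto-vectorized CPU loop.
    __global__ void scale(const float* in, float* out, int n, float a) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = a * in[i];          // no cross-element dependency
    }

    void scale_cpu(const float* in, float* out, int n, float a) {
        for (int i = 0; i < n; ++i) out[i] = a * in[i];  // same layout, same loop
    }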
I wanted to upvote you originally, but I'm afraid this is not correct. A GPU is not a barrel processor. In a barrel processor, a single context is switched between multiple threads after each instruction; a barrel processor design has a single instruction pipeline and a single cache shared across all threads. In a GPU, because the execution units are independent, threads execute their instructions concurrently on all cores, as long as a program-level instruction dependency between threads is not introduced. It's true parallelism. Furthermore, each execution unit embeds its own instruction scheduler, its own pipeline, and its own L1 cache (see [1] for NVidia's architecture).
Barrel processors are a spectrum, and GPUs are at one end of it. Yes, the classic canonical barrel processors (e.g. the Tera architecture) more or less work the way you outline, but that is a 40-year-old microarchitecture; they haven't been designed that way for decades.
Modern barrel processor implementations have complex microarchitectures that are much closer to a modern GPU in design. That is not accidental; the lineage is clearly there if you've worked on both. I will grant that vanishingly few people have ever seen or worked on a modern non-GPU barrel processor, since they are almost exclusively the domain of exotics built for government applications AFAICT.
They are similar enough with respect to how they hide memory access latency within each processing core ("streaming multiprocessor") by switching across hardware threads ("wavefronts").
A context cannot be shared by multiple threads. Each thread must have its own context, otherwise all threads will crash immediately. Thus your description of a barrel processor is completely contrary to reality.
When threads are implemented only in software, without hardware support, you have what is called coarse-grained multithreading. In this case, a CPU core executes one thread until that thread must wait for a long time, e.g. for the completion of some I/O operation. Then the operating system switches the context from the stalled thread to another thread that is ready to run, by saving all the registers used by the old thread and restoring the registers of the new thread from the values that were saved the last time the new thread ran.
Such multithreading is coarse-grained, because saving and restoring the registers is expensive so it cannot be done often.
When hardware assists context switching, by being able to store multiple sets of registers, i.e. multiple thread contexts, inside the CPU core, you get FGMT (fine-grained multithreading). In the earliest CPUs with FGMT the thread contexts were switched after each executed instruction, but in all more recent CPUs and GPUs with FGMT the context switch can happen after each clock cycle.
Barrel processors are a subset of FGMT processors, the simplest and least efficient of them, and they are now only of historical interest; nobody has built one in decades. In a barrel processor, the threads are switched round-robin, i.e. in a fixed order: you cannot choose the next thread to run. This wastes clock cycles, because the next thread in the fixed order may be stalled waiting for some event, so nothing can be done during its allocated clock cycle.
The name "barrel", introduced by CDC 6600 in 1964, refers to the similarity with the barrel of a revolver, you can rotate it with a position, bringing the next thread for execution, but you cannot jump over a thread to reach some arbitrary position.
What is switched between threads in a barrel CPU at each clock cycle is not a context, i.e. not the registers, but the execution units of the CPU, which become attached to the context of the current thread, i.e. to its registers. Each thread has its own distinct set of registers that stores its context.
The descriptions of the internal architecture of GPUs are extremely confusing, because NVIDIA has chosen to replace, in its documentation, all the words that have been used for decades when describing CPUs with different words, for no apparent reason except obfuscating the GPU architecture. AMD has followed NVIDIA and created a third set of architectural terms, mapped one to one to NVIDIA's, but using yet other words, for maximum confusion.
For instance, NVIDIA calls a "warp" what in a CPU is called a "thread". What NVIDIA calls a "thread" is what in a CPU is called a "vector lane" or "SIMD lane". What NVIDIA calls a "streaming multiprocessor" is what in a CPU is called a "core".
Both GPUs and CPUs are made of multiple cores, which can execute programs in parallel.
Each core can execute multiple threads, which share the same execution units. For executing multiple threads, most if not all GPUs use FGMT, while most modern CPUs use SMT (Simultaneous Multithreading).
Unlike FGMT, SMT can exist only on superscalar processors, i.e. processors that can initiate the execution of multiple instructions in the same clock cycle. Only in that case is it also possible to initiate the execution of instructions from distinct threads in the same clock cycle.
Some GPUs may be able to initiate 2 instructions per clock cycle when certain conditions are met, but the descriptions of all such GPUs are typically so vague that it may be impossible to determine whether those 2 instructions can come from different threads, i.e. from different warps in NVIDIA terminology.
I've only done a little work on CUDA, but I was pretty impressed with it and with their NSys tools.
I'm curious what you wish was different.