I'm not sure whether the approach you are talking about is coarse grained or fine grained; it feels pretty fine grained to me. Programming with data flow is already used by some high performance computing frameworks, e.g. CUDA and oneTBB support task graphs, which are not instruction-level data flow but coarser-grained task flow.
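For a concrete (if tiny) example, here is a minimal sketch of such a coarse-grained data-flow graph using oneTBB's flow graph API; the node names and the toy computation are made up for illustration:

```cpp
#include <tbb/flow_graph.h>
#include <iostream>

int main() {
    using namespace tbb::flow;

    graph g;

    // A "transform" node: squares each incoming int, may run concurrently.
    function_node<int, int> square(g, unlimited,
        [](int v) { return v * v; });

    // A "sink"-like node: prints results, serialized to keep output readable.
    function_node<int, int> print(g, serial,
        [](int v) { std::cout << v << '\n'; return v; });

    // The edge encodes the data dependency explicitly; there is no
    // instruction stream ordering these two steps.
    make_edge(square, print);

    for (int i = 0; i < 4; ++i)
        square.try_put(i);

    g.wait_for_all();   // block until all in-flight messages are processed
}
```

The scheduler, not the programmer, decides the interleaving; only the edges constrain execution order.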
Not sure if exposing the underlying pipelining to programmers will be that useful.
1. Graphs are more general but more complicated. Control flow graphs are nice; data dependence graphs are messier.
2. You still need an underlying representation for the graph. If you consider instructions as a certain representation of a data flow graph, you are already getting a graph, just perhaps not one that is flexible for programmers. And if you want to encode dependencies more precisely, you will need longer instructions, while modern CPUs are typically a lot faster than main memory, so spending more fetch bandwidth on encoding hurts.
3. Exposing pipelining details in the ISA may limit future optimization. Due to hardware limitations, the maximum number of in-flight operations must be bounded, so you either need some way to restrict programmers from building crazy data flow graphs, or limit it implicitly by blocking when there are too many in-flight operations (a sketch of the blocking variant follows below). The latter case is quite similar to what we currently have, and the former would leak some implementation details into the ISA.
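To make the second option concrete, here is a minimal, hypothetical sketch in plain C++20 (nothing to do with a real ISA; `submit_operation` and the bound of 64 are invented for the example) of the "block when too many operations are in flight" strategy:

```cpp
#include <semaphore>
#include <thread>
#include <vector>

// A counting semaphore caps the number of in-flight operations, much like a
// full reorder buffer or reservation station stalls issue in today's CPUs.
std::counting_semaphore<64> inflight_slots(64);   // the (hardware-imposed) bound

void submit_operation(std::vector<std::jthread>& workers) {
    inflight_slots.acquire();             // blocks if 64 ops are already in flight
    workers.emplace_back([] {
        // ... perform one data-flow node's work here ...
        inflight_slots.release();         // free the slot when the op "retires"
    });
}
```

The programmer never sees the bound directly; the machine just stalls, much as it does today.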
The previous post was mostly about changing perspective, not really about the details.
You're actually right that something like that would need to operate on the granularity of "(micro) tasks", because any more fine-grained approach isn't efficient, as we know by now.
Also, I've noted that we already have (and had) some hardware that goes in that direction to some extent.
> Not sure if exposing the underlying pipelining to programmers will be that useful.
I think that would be fundamental. In fact this would be a crucial part of the reconfigurability of the system.
But it wouldn't only be about exposing pipelining. The steps in the pipeline, as well as its structure, would also need to be programmable. Think microcode, but more powerful, and available to the programmer.
> You still need an underlying representation for the graph.
Sure.
That's some data structure produced by a compiler.
Think FPGA bitstream.
Only that it would need to be possible to dynamically load and change it on the fly.
> If you consider instructions as a certain representation of a data flow graph […]
This expresses some (of the usual) misunderstanding of my idea.
The whole point is to change the viewpoint from programming a computer through a command stream (a stream of instructions) to something more "declarative" (a data-flow "execution graph").
Instead of a command stream you would load (parts of) said "execution graph", which encodes the function and wiring of your processing units (as said above, think FPGA bitstream), and then "just pipe" data through that graph.
If the graph does "not fit" (does not map completely) into the available hardware resources (the programmable processing units and their programmable interconnects, and the available smart memory cells, a kind of "intelligent" scratchpad memory as a replacement for registers), it needs to get "tiled". Then the "tiles" could be "moved underneath" the (fixed) hardware resources. This would actually be the usual way things work, as otherwise the CPU would need to integrate all of the memory of the computer (which is not realistic).
The nice thing about that concept would be that the actual data doesn't need to be moved as the graph-tiles get (virtually) "moved" by rewiring and reprogramming the HW processing units / smart memory cells.
Doing those replacements of execution graph-tiles, following the "waves of data-flow through the system", is something that does not involve any "jumps", just an upfront-known, constant stream of "bitstream", and it could be nicely parallelized with the actual computations in the processing units (after the "data wavefront" has moved forward you can safely replace the graph tiles "behind" it without interrupting the rest of the computations).
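To make the tiling idea a bit more tangible, here is a purely hypothetical software simulation; every name in it (Tile, ProcessingUnit, the "swap tiles behind the wavefront" loop) is invented for illustration, and it runs sequentially where the real thing would reconfigure units in parallel with the ongoing computation:

```cpp
#include <cstdio>
#include <deque>
#include <functional>
#include <vector>

// Hypothetical simulation of the "tiled execution graph" idea sketched above.
// No such hardware or API exists; this only illustrates the control flow.

using Tile = std::function<int(int)>;    // one tile = the function of a (re)programmed unit

struct ProcessingUnit {
    Tile tile;                            // current "wiring" / "bitstream" fragment
    void reconfigure(Tile t) { tile = std::move(t); }
};

int main() {
    // The "execution graph", pre-tiled by a (hypothetical) compiler.
    std::deque<Tile> tiles = {
        [](int x) { return x + 1; },
        [](int x) { return x * 2; },
        [](int x) { return x - 3; },
        [](int x) { return x * x; },
    };

    // Only two physical units: the graph does "not fit", so tiles are
    // swapped in behind the data wavefront instead of moving the data.
    std::vector<ProcessingUnit> units(2);

    int value = 5;                        // the "data" flowing through the graph
    while (!tiles.empty()) {
        for (auto& u : units) {
            if (tiles.empty()) break;
            u.reconfigure(tiles.front()); // load the next graph tile onto an idle unit
            tiles.pop_front();
            value = u.tile(value);        // let the data wavefront pass through it
        }
    }
    std::printf("result: %d\n", value);   // ((5+1)*2-3)^2 = 81
}
```

Note that the data never moves between memories here; only each unit's configuration changes as the tiles are streamed in.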
Doing things this way seems complicated at first, but my intuition is that all the magic done in today's CPUs is actually even more complicated. Today's hardware already does parts of the described things (only that it's not reconfigurable). But always having the possibility of arbitrary jumps around memory hanging over your head, and not having the data-flow structures available to the hardware (it needs to "guess" them!), makes things way more complex than imho needed. In my opinion we're doing things "backwards" by now!
If the actual hardware is a data-flow machine at its core anyway, we should embrace that on the higher levels, and not pretend that we still program a very fast PDP-11.
The "only" problem is that the C/C++ (and the-like) abstract machines (and therefore the "machine code" such languages get compiled to) pretends that our computers are still PDTs. As the above article says:
>> A processor designed purely for speed, not for a compromise between speed and C support, would likely support large numbers of threads, have wide vector units, and have a much simpler memory model. Running C code on such a system would be problematic, so, given the large amount of legacy C code in the world, it would not likely be a commercial success. [emphasis mine]
"large numbers of threads" == a lot of in flight "micro tasks" (at some stages of their "virtual pipeline")
"wide vector units" == powerful programmable "processing units", with wide "buses" (connected by some hyper-cube or hyper-torus on-chip network)
"much simpler memory model" == Just let the data flow! No "moving parts" at all (besides the flowing "application data", or the equivalent from some external "device"-ports)
TL;DR: Just feed the machine two data-streams, one describing the ever-updating "execution graph" (as tiles get "moved"), and one providing the actual data that needs to flow through that "virtually pipelined" system (graph structure). No direct RAM access (besides "virtually local" smart memory scratchpad access by the processing units, mediated through software-controlled on-chip network routing), as "external" system memory would be seen as a "device" lying behind some appropriate processing unit. No jumps, as these would be "just branches described by upfront-known data" in the "execution graph", through which data could possibly flow; which is actually not much different from speculatively executing code.
The processing units could be of three kinds I think (but forgot to mention): they could be programmed to be sources, transformers, or sinks. Sources and sinks would be I/O elements, data generators, or interfaces to persistent storage; transformers map nicely to ("clustered", or better said "sequenced and glued together") functions, or, given they would have access to (private, "virtually local") scratchpad memory, also to co-routines (which would be the place where the effects happen; which is of course synonymous with state manipulation). For more efficiency and programmer flexibility they could include FPGA hardware and make the implementation of their specific functions a matter of "changing the code".
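A rough, hypothetical sketch of that classification in ordinary C++ (all names made up; real processing units would of course be programmable hardware, not std::function objects):

```cpp
#include <functional>
#include <iostream>
#include <optional>

// Three kinds of processing units, modelled as plain callables for illustration.
using Source      = std::function<std::optional<int>()>;  // produces data (I/O, generator, storage)
using Transformer = std::function<int(int)>;               // function, may use local scratchpad
using Sink        = std::function<void(int)>;              // consumes data (I/O, storage)

int main() {
    int counter = 0;
    Source countTo3 = [&]() -> std::optional<int> {
        return counter < 3 ? std::optional<int>(counter++) : std::nullopt;
    };
    Transformer triple = [](int x) { return 3 * x; };
    Sink print = [](int x) { std::cout << x << '\n'; };

    // "Piping" data through the (trivial, linear) graph: source -> transformer -> sink.
    while (auto v = countTo3())
        print(triple(*v));
}
```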
We have plenty of transistors these days. We could really start trying to overcome the PDP…
The hardware actually isn't the problem. It's (as always!) the software, and of course the compilers.
That said, it's clear to me you still need a "language to talk to the hardware". But it would be more declarative than the current imperative "machine code", I guess. Designing such a language and its abstract machine (a truly new kind of ISA and ASM) is the challenge here. Synthesizing the appropriate hardware from that in a next step wouldn't be too complicated, I think.
Happy to hear some opinions on those ideas!
I know they're unusual, to say the least (so there's much to criticize); but isn't "reinventing the computer" actually the topic of the original submission? :-)