So: if at a SW level the equivalent of HW's "no single static schedule works in the face of $ misses" is "no single static schedule works in the face of blocked IO"[0], and the practical solution for the latter has proven to be exposing nonblocking IO to userland (rather than forcing the kernel to be sufficiently smart about processes' access patterns), why not allow the ISA to query[1] if an L1 access is currently viable (enabling dispatch to different static schedules)?
[0] on a tangent: is the fact that modern cloud development uses copious amounts of Terraform et al., explaining what is hooked up where[2], just an epiphenomenon of Conway's law? Or is there something more fundamental driving this particular pendulum swing?
[1] sure, there may have been security concerns before, but now doesn't the Spectre family imply that adversaries can exploit the $ state anyway?
> why not allow the ISA to query[1] if an L1 access is currently viable
Well that's interesting. Query/branch on whether some address is presently in a given cache line, with more fun if one can test whether that line is exclusive/shared etc. The hardware definitely knows that, though the information may be stale immediately after providing it.
Selling that to the hardware guys would mean finding a compelling use case; I expect modifying the cache coherency implementation is done reluctantly. I can't think of a case where whether a single line is resident is worth branching on, and querying whether N lines are present doesn't really fit the ISA model.
Further up the stack, though, whether a given page is currently in memory (as opposed to swapped out) is a similar sort of thing, where one might want to prefetch it while doing other work. That might not be worth a syscall.
So I love the suggestion but can't see the return on the silicon investment.
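(For what it's worth, the page-level query already exists today without new silicon: on Linux, mincore() reports whether pages are resident and madvise(MADV_WILLNEED) kicks off an asynchronous page-in. A minimal sketch; the helper name resident_or_prefetch and the 64-page limit are just for illustration, and buf is assumed to be a page-aligned mapping:)

```cpp
#include <sys/mman.h>
#include <unistd.h>
#include <cstddef>

// Sketch only: returns true if every page of [buf, buf+len) is resident;
// otherwise asks the kernel to start paging the range in and returns false,
// so the caller can go do other work in the meantime.
bool resident_or_prefetch(void* buf, std::size_t len) {
    long page = sysconf(_SC_PAGESIZE);
    std::size_t pages = (len + page - 1) / page;
    unsigned char vec[64];                       // assumes at most 64 pages
    if (pages > 64 || mincore(buf, len, vec) != 0)
        return false;
    for (std::size_t i = 0; i < pages; ++i) {
        if (!(vec[i] & 1)) {                     // low bit set => page is resident
            madvise(buf, len, MADV_WILLNEED);    // start async page-in
            return false;
        }
    }
    return true;                                 // everything already in RAM
}
```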
Not sure the cache coherency model would need to be changed: I had been thinking of an op like "LOAD dest, address, flag", where dest would either be loaded from address out of L1, or be set to an arbitrary value on a miss (with flag set to indicate which happened).
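(To pin down the contract, a purely hypothetical C++ model of that op; no real ISA has it, and probe_load is an invented name standing in for the imagined instruction:)

```cpp
#include <utility>

// Hypothetical: models "LOAD dest, address, flag". If the line holding *addr
// is in L1, return {value, true} without ever stalling; on a miss, return
// {arbitrary value, false} (real hardware might also start fetching the line).
// Portable C++ can't observe L1 state, so this stand-in just does a normal
// load and always reports a hit.
template <typename T>
std::pair<T, bool> probe_load(const T* addr) {
    return {*addr, true};
}
```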
> why not allow the ISA to query[1] if an L1 access is currently viable (enabling dispatch to different static schedules)?
For tight loops loading one or two things per iteration, maybe it'd be worth it; most other things, though, will result in code size explosion (as you'd have two paths for each load, potentially multiplying on multiple consecutive loads), and, depending on the implementation, wouldn't play well with branch prediction (though it'd be very fun to just let branch mispredictions happen, given that the check may be stale by the load time anyway).
And not everyone has given up on trying to defeat cache side-channels either - spectre didn't magically erase all reasons to want to be able to speculate loads without slow context switches.
Not necessarily two per load: I was guessing that the frequent case of all loads succeeding would dispatch to a happy path, and any load failing would dispatch to a not-so-happy path.
(with the hope that this might allow a software-defined rough equivalent of TFA's "An out-of-order machine starts off like the first example in early iterations, and transposes into the second example when a cache miss happens, or over many iterations of the loop if not decode limited." Or at least enough to be worthwhile given a factor-of-2 power win from not using OoO?)
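(Roughly the dispatch shape I'm imagining, using the hypothetical probe_load() sketched above; again, nothing here maps to a real ISA, it just shows the two static schedules side by side. __builtin_prefetch is the GCC/Clang builtin:)

```cpp
#include <utility>

template <typename T>
std::pair<T, bool> probe_load(const T*);   // hypothetical op, see earlier sketch

void saxpy_probed(float a, const float* x, const float* y, float* out, int n) {
    for (int i = 0; i + 4 <= n; i += 4) {
        auto [x0, hx] = probe_load(&x[i]);
        auto [y0, hy] = probe_load(&y[i]);
        if (hx && hy) {
            // happy path: data is (probably) in L1, run the tight unrolled schedule
            out[i + 0] = a * x0 + y0;
            out[i + 1] = a * x[i + 1] + y[i + 1];
            out[i + 2] = a * x[i + 2] + y[i + 2];
            out[i + 3] = a * x[i + 3] + y[i + 3];
        } else {
            // not-so-happy path: request the lines, then use a schedule that
            // tolerates the miss latency (here just a plain rolled loop)
            __builtin_prefetch(&x[i]);
            __builtin_prefetch(&y[i]);
            for (int j = 0; j < 4; ++j)
                out[i + j] = a * x[i + j] + y[i + j];
        }
    }
}
```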
How temporally responsive is branch prediction these days? I'd guess there'd be a fair amount of locality, in that the $ status of one iteration's data would be highly correlated with another iteration's.
It sounds like you want to use the cache as a private local memory near the core; this is called a scratchpad memory.
And a DMA engine is usually used to maintain the scratchpad.
It is much more reliable than a cache, which can evict data behind your back: we generally have no (or little) control over the cache eviction policy, an interrupt or context switch can evict what you purposely fetched into the cache, and cache coherence can also cause data eviction.
PS: Note that in some circuits the scratchpad and the L1 cache share the same memory; this enables adjustable cache and scratchpad sizes.
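(The classic shape of "maintain the scratchpad with DMA" is double buffering. A minimal sketch, assuming a hypothetical dma_start()/dma_wait() pair and a scratchpad-resident array; real SoCs expose something vendor-specific instead of these invented names:)

```cpp
#include <cstddef>

constexpr int TILE = 256;
float scratch[2][TILE];   // assume the linker places this in scratchpad RAM

// Hypothetical async-copy API (invented for this sketch).
void dma_start(float* dst, const float* src, std::size_t n);
void dma_wait(const float* dst);

void scale_by_two(const float* in, float* out, int tiles) {
    int cur = 0;
    dma_start(scratch[cur], in, TILE);                           // prefetch first tile
    for (int t = 0; t < tiles; ++t) {
        int nxt = cur ^ 1;
        if (t + 1 < tiles)
            dma_start(scratch[nxt], in + (t + 1) * TILE, TILE);  // overlap next fetch
        dma_wait(scratch[cur]);                                  // current tile is resident
        for (int i = 0; i < TILE; ++i)                           // compute on the local
            out[t * TILE + i] = 2.0f * scratch[cur][i];          // copy; nothing evicts it
        cur = nxt;
    }
}
```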
Thanks, the existence of scratchpad (which makes the IO analogy even closer) adequately explains the lack of a non-stalling load.
Compilers which care (which would seem to be those targeting embedded; not anything gp) can then, instead of providing the two differing static schedules mentioned elsewhere in this thread, provide two different code paths: one to kick off scratchpad DMA if necessary (and possibly do something useful in the meantime?), and one that knows all its data is already in scratchpad, so it can statically unroll to an appropriate width.
I feel bad that I really don't understand any of this. My mental model of what a CPU does stops at the assembly level, and intuitively feels like (and looks like) doing some convolution on an image - e.g. a local, nodal point moving through a bitmap, leaving behind the wake of some set of operations, a smear, with some of that smear propagating forward. But now I realize that all of these models are a middling-high level picture of a CPU's side effect, not its operation - the manner in which it combines its operands is, effectively, voodoo. But it is voodoo with which I am comfortable, for reasons involving NAND gates and Alan Turing. And yet I wonder: should I be so comfortable?
Very, very few people understand what a CPU truly does. We all have differing understandings, yours probably about like mine, if not better.
Once you feel you understand perfectly how it works, Keller comes along and says that you can run the same program 100 times, and it'll never execute the same way twice.
Anymore, I just treat it like a black box I'm glad to not have to bargain with.
I found NandGame[1] to really help illustrate just how a simple CPU operates. The bottom-up approach was great at demystifying the various sub-units and how they come together to form a CPU.
Of course, modern CPUs are incredibly more complex, but most appear to function like a really fast, simple CPU.
For many programmers it stops at the API, and often in a higher level language like JavaScript or Python. Understanding down to ASM is pretty far down compared to the average.
A truly "reinvented" CPU would use a reconfigurable coarse grained data flow approach, directly available to the programmer, instead of "yet another command stream based von Neumann machine", imho.
(Actually something like that is in part already at the heart of modern CPUs; only that none of it is available directly to the programmer, where it would actually be useful, which is utter nonsense of course; and it's not reconfigurable; the compiler can't construct pipelines on the fly; not even the built-in hardware JIT found in every modern CPU, which tries to translate that unfortunate command stream back into a data-flow graph inside the CPU as well as it can (of course after we've deleted all the information about that data-flow that we collected during compilation), can do that ;-). Makes sense? No? Doesn't matter. Current CPUs work like that. Because we just "can't" fix our software! Thank you, C/C++.)
Only switching from command-stream to data-flow could overcome the von Neumann bottleneck. And only with such a switch would asynchronous, clockless designs become feasible (which are the only way to overcome the power wall in the long run).
"Fun" thing is that to some extend such data-flow hardware is available today. Because it's the only kind of hardware that scales into dimensions where truly powerful AI applications are possible. Only that almost nobody thinks about how to make this kind of almost maximally efficient hardware applicable to general purpose computations. But that would be in fact trivially simple: Just rewrite the world in some new kind of data-flow language (which needs to get invented first, of course). Ah, and when we're at it, don't forget to use balanced ternary for your new kind of computer—as this is the most efficient and beautiful system for computing hardware.
Historically, exotic computing architectures fail because they run up against the limits of compilers, languages, and the programmers who work with them.
It seems it's far "easier" to make hardware for computers that run programs quickly than it is to formulate those programs. And even harder to get programmers to write those programs in new ways.
The history of computers is littered with graveyards full of interesting and potentially superior hardware architectures that failed to achieve market penetration and economies of scale.
Changing "too much" doesn't seem to be an option, so we're stuck with the stuff invented many many decades ago, and likely never will be able to overcome those local "optima" of the past—just because the market won't accept any disruptive changes any more by now.
And sure: It's always the software… (As said also in the long reply to the sibling comment).
The good news is that FPGAs have democratized a bit of the HW stuff for us software folks. It's fairly trivial to get started with an FPGA dev board and building out our own custom CPU architectures, though of course it's likely going to be "just for fun."
So even if a new and novel concept can't take over the world, we can build our own lightsabers, even just for hobby stuff.
Yes and no. At least to validate my ideas FPGAs would not really be sufficient.
Core to my idea are three things (on the HW-level): Runtime reconfigurability, dense and fast reconfigurable on chip networking, and asynchronous clockless logic. (Besides "smart memory", but that's something we can actually have).
FPGAs don't really offer any of that. (There is some kind of support for reconfigurability on the bigger Xilinx chips, but it doesn't really smell like what I'm looking for; also it's proprietary and very limited.)
FPGAs can of course simulate the things named above, but without exposing their advantages; such a simulation would actually be very slow compared to doing things "the usual way".
This is especially sad as FPGAs as such already have a lot of the other desired properties: they are reconfigurable hardware, after all!
So one would first need to build some new / modified kind of "FPGA" (especially adapted to async logic and runtime reconfigurability) just to try out my ideas in a halfway realistic setting.
On the other hand, doing some ISA research and experiments doesn't need any hardware in the first place; it could be done in a fully software-simulated environment. Only that this way it's impossible to know whether there would be any real merit to all this stuff, simply because I would expect a SW simulation to be quite inefficient and slow (even on an FPGA). So it wouldn't be anything that could convince anybody, frankly.
Well, maybe at some point I'll run into some crazy person with a few dozen $ millions to spare to throw at a "high-fantasy" project like that, to validate the basic assumptions. If it worked, it could actually change for the better everything we take for granted today about how computers work. Imho it's time. The von Neumann bottleneck is killing any progress! We should get rid of it. This needs some radical rethinking of our machines (and especially their software).
This reads like a hand wavy reference to some sort of programmable systolic array thing. A current example in this vein is Google's TPU AI accelerators. Note they get a drastic increase in performance by drastically limiting precision (read: small values and registers, a specific computation requiring limited operations), hence a massive reduction in transistors per computational element. OK, fine; but for stateful calculations that need to load and/or store information per computational element it's not the win one might think. Specifically, memory bandwidth can be the limitation. In these cases modern GPUs are really, really hard to beat.
> This reads like a hand wavy reference to some sort of programmable systolic array thing.
Indeed, systolic arrays are a big inspiration. As are the Xputer, the Reduceron, computational RAM, parts of IBM's TrueNorth chips, and of course, in parts, current AI and GPU hardware.
> OK, fine; but for stateful calculations that need to load and/or store information per computational element it's not the win one might think.
The whole point of my idea is to mitigate this.
Exactly this problem is the famous von Neumann bottleneck. And it's holding back any progress we could probably make regarding computer hardware.
Imho the whole misery we're in is due to the fact that almost all "basic" stuff is built in languages that are married to the "C-Machine". The core of the problem: the "C-Machine" can't be built efficiently on hardware that does not simulate a von Neumann computer. So we're stuck.
The only way out of this "local optimum", as I call it, would be to change the basic programming model: from imperative to some mix of declarative data-flow with FP elements. Then "rebuild the world" on top of that… (Simple as that, isn't it? :-D)
One can't fix the von Neumann bottleneck as long as it's possible (or actually required!) to jump around freely in memory. So we need to change that! Pure FP languages could pose a way around that basic problem: in FP languages one can "contain" all state manipulation at the language level, in isolated parts of the program. Everything between those parts is pure data-flow. So you could have explicit "machines" that manipulate state and talk to the outside world, strongly separated from the parts that do only pure transformations on data. Those "machines" would be the "processing units" (with their attached scratchpad memory) that I'm talking about in the other long sibling comment.
I'm not sure whether the approach you're talking about is coarse-grained or fine-grained; it feels pretty fine-grained to me. Programming with data flow is already used by some high-performance computing frameworks: e.g. CUDA and oneTBB support task graphs, which are not instruction-level data flow but more coarse-grained task flow.
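(For reference, coarse-grained task flow as it exists today: a tiny oneTBB flow graph with two nodes wired into a pipeline. The runtime fires a node whenever its input arrives, i.e. data flow at task rather than instruction granularity:)

```cpp
#include <cstdio>
#include <tbb/flow_graph.h>

int main() {
    tbb::flow::graph g;
    // a pure "transform" node, allowed to run on many inputs concurrently
    tbb::flow::function_node<int, int> square(
        g, tbb::flow::unlimited, [](int v) { return v * v; });
    // a serialized sink node with a side effect
    tbb::flow::function_node<int, int> print(
        g, 1, [](int v) { std::printf("%d\n", v); return v; });
    tbb::flow::make_edge(square, print);
    for (int i = 1; i <= 4; ++i) square.try_put(i);
    g.wait_for_all();   // block until the graph drains
}
```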
Not sure if exposing the underlying pipelining to programmers will be that useful.
1. Graphs are more general but more complicated. Control flow graphs are nice, data dependence graphs are more messy.
2. You still need an underlying representation for the graph. If you consider instructions as a certain representation of a data-flow graph, you are still getting a graph, just perhaps not one that's flexible for programmers. And if you want to encode dependencies more precisely, you will need longer instructions, and modern CPUs are already typically a lot faster than main memory.
3. Exposing pipelining details in the ISA may limit future optimization. Due to hardware limitations, the maximum number of in-flight operations must be bounded, so you either need some way to restrict programmers from building crazy data-flow graphs, or you limit it implicitly by blocking when there are too many in-flight operations. The latter case is quite similar to what we currently have, and the former will leak implementation details into the ISA.
The previous post was mostly about changing perspective, not really about the details.
You're actually right that something like that would need to operate at the granularity of "(micro) tasks", because any more fine-grained approach isn't efficient, as we know by now.
Also, I've noted that we already have (and had) some hardware that goes in that direction to some extent.
> Not sure if exposing the underlying pipelining to programmers will be that useful.
I think that would be fundamental. In fact this would be a crucial part of the reconfigurability of the system.
But it wouldn't only be about exposing pipelining. The steps in the pipeline, as well as its structure, would also need to be programmable. Think microcode, but more powerful, and available to the programmer.
> You still need an underlying representation for the graph.
Sure.
That's some data structure produced by a compiler.
Think FPGA bitstream.
Only that one would need to be able to dynamically load and change it on the fly.
> If you consider instructions as a certain representation of a data flow graph […]
This expresses some of the usual misunderstanding of my idea.
The whole point is to change the viewpoint from programming a computer through a command stream (a stream of instructions) to something more "declarative" (a data-flow "execution graph").
Instead of a command stream you would load (parts of) said "execution graph", which encodes the function and wiring of your processing units (like said above, think FPGA bitstream), and then "just pipe" data through that graph.
If the graph does "not fit" (does not map completely) onto the available hardware resources (the programmable processing units and their programmable interconnects, and the available smart-memory cells: a kind of "intelligent" scratchpad memory cell, as a replacement for registers), it needs to get "tiled". Then the "tiles" can be "moved underneath" the (fixed) hardware resources. This would actually be the usual way things work, as otherwise the CPU would need to integrate all of the memory of the computer (which is not realistic).
The nice thing about this concept would be that the actual data doesn't need to be moved, as the graph tiles get (virtually) "moved" by rewiring and reprogramming the HW processing units / smart-memory cells.
Doing that replacement of execution-graph tiles following the "waves of data flow through the system" does not involve any "jumps", just a constant, upfront-known stream of "bitstream", and it could be nicely parallelized with the actual computations in the processing units (after the "data wavefront" has moved forward, you can safely replace the graph tiles "behind" it without interrupting the rest of the computations).
Doing things this way seems complicated at first, but my intuition is that all the magic done in today's CPUs is actually even more complicated. Today's hardware already does parts of the described things (only that it's not reconfigurable). But always having the possibility of arbitrary jumps around memory hanging over your head, and not having the data-flow structures available to the hardware (it has to "guess" them!), makes things way more complex than imho needed. In my opinion we're doing things "backwards" by now!
If the actual hardware is a data-flow machine at its core anyway, we should embrace that at the higher levels, and not pretend that we're still programming a very fast PDP-11.
The "only" problem is that the C/C++ (and the like) abstract machines (and therefore the "machine code" such languages get compiled to) pretend that our computers are still PDPs. As the above article says:
>> A processor designed purely for speed, not for a compromise between speed and C support, would likely support large numbers of threads, have wide vector units, and have a much simpler memory model. Running C code on such a system would be problematic, so, given the large amount of legacy C code in the world, it would not likely be a commercial success. [emphasis mine]
"large numbers of threads" == a lot of in flight "micro tasks" (at some stages of their "virtual pipeline")
"wide vector units" == powerful programmable "processing units", with wide "buses" (connected by some hyper-cube or hyper-torus on-chip network)
"much simpler memory model" == Just let the data flow! No "moving parts" at all (besides the flowing "application data", or the equivalent from some external "device"-ports)
TL;DR: Just feed the machine two data streams: one describing the ever-updating "execution graph" (as tiles get "moved"), and one providing the actual data that needs to flow through that "virtually pipelined" system (graph structure). No direct RAM access (besides "virtually local" smart-memory scratchpad access by the processing units, mediated through software-controlled on-chip network routing), as "external" system memory would be seen as a "device" sitting behind some appropriate processing unit. And no jumps, as those would be "just branches described by upfront-known data" in the "execution graph" through which data could possibly flow (which is actually not much different from speculatively executing code).
The processing units could be of three kinds, I think (but I forgot to mention this): they could be programmed to be sources, transformers, or sinks. Sources and sinks would be I/O elements, data generators, or interfaces to persistent storage; transformers map nicely to ("clustered", or better said, "sequenced and glued together") functions, or, given access to (private, "virtually local") scratchpad memory, also to co-routines (which would be where the effects happen; which is of course synonymous with state manipulation). For more efficiency and programmer flexibility they could include FPGA hardware and make the implementation of their specific functions a matter of "changing the code".
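(A toy way to picture the source / transformer / sink split in ordinary software terms; no claim that this is how the proposed hardware would look, it only shows the separation of effects from pure data-flow:)

```cpp
#include <cstdio>
#include <vector>

int main() {
    // source: produces data (an effect; could be a sensor, a file, ...)
    auto source = [] { return std::vector<int>{1, 2, 3, 4}; };

    // transformer: pure data-flow, no access to the outside world
    auto transform = [](std::vector<int> v) {
        for (int& x : v) x = x * x;
        return v;
    };

    // sink: talks to the outside world again
    auto sink = [](const std::vector<int>& v) {
        for (int x : v) std::printf("%d\n", x);
    };

    sink(transform(source()));
}
```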
We have plenty of transistors these days. We could really start trying to overcome the PDP…
The hardware actually isn't the problem. It's (as always!) the software, and of course the compilers.
That said, it's clear to me you still need a "language to talk to the hardware". But it would be more declarative than the current imperative "machine code", I guess. Designing such a language and its abstract machine (a truly new kind of ISA and ASM) is the challenge here. Synthesizing the appropriate hardware from that in a next step wouldn't be too complicated, I think.
Happy to hear some opinions on those ideas!
I know they're unusual at least (so much to criticize); but isn't "reinventing the computer" the topic of the original submission actually? :-)
Forgive me if I misunderstand, but could you not use an array-oriented language such as APL, J or K? They probably don't quite mesh with explicitly controlling the data flow, but don't they and other declarative languages allow the computer to optimise the best path to maximize computational throughput?
Indeed something APL-like would make a good "assembler" kind of language for such an architecture!
APLs are much more data-oriented and much less control-flow heavy. Also, they don't use the notion of LOAD / STORE of mutable memory cells as a basic language feature (unlike all the imperative "C-like" languages, which therefore need a "C-Machine" to be executed efficiently).
So yes, APLs would be a good start for developing the machine language of the kind of machine that I have in mind. (Only that I have some more ideas that go quite beyond that; like making "sending a request to your environment / a 'machine'", which is synonymous with having effects, a language-level thing.)
> But that would be in fact trivially simple: Just rewrite the world in some new kind of data-flow language (which needs to get invented first, of course).
I hope that was sarcasm. But if not: No, rewriting the world is not trivially simple. Not remotely.
I believe they are just being quiet. Their series of talks is over.
In December front-man Ivan Godard wrote [1] on their forum that they are holding off on twenty-something patents so as not to "start the clock" on them. And they can't talk about anything on the forum that is not yet filed.
26 June 2022: "The past year has mostly been spent on the uncore and tools; the ISA itself is essentially finished. Uncore includes behind-the-scenes parts of the chip like the PLB, TLB, interrupts, and spiller. Tools include the compiler driver, the post-linker and pre-linker, debug support, documentation, and LLVM intrinsics."
I think they are also working on a funding round which will allow them to file their next batch of patents.
I wouldn’t be surprised. It’s a very ambitious goal to do on someone’s own. It’s not a monumental task: it’s a directed graph of interdependent monumental tasks, each needing to be perfectly executed.
My bet is that they’ll end with a sizeable number of patents, at least.
So I suppose that end game is those patents getting bought by some big name semi manufacturer and quietly incorporated into future chips using different terminology... and then folks will smugly talk about what a terrible idea it all was.
>> I’m tired of the dominance of the out-of-order processor. They are large and wasteful, the ever-popular x86 is especially poor, and they are hard to understand. Their voodoo would be more appreciated if they pushed better at the limits of computation, but it’s obvious that the problems people solve have a latent inaccessible parallelism far in excess of what an out-of-order core can extract.
One thing I find interesting is the recent focus on branch predictors. OoO designs like Zen and *Lake have been seeing performance improvements due to better branch prediction (and, going back, things like the trace cache), which strongly suggests that control flow is very important and, therefore, that data-parallel code is not the majority. There is no architecture that optimally fits "general purpose computing", so we have OoO, vector instructions, Hyper-Threading, multi-core, GPUs, and TPUs.
My hope is that GPU/TPU/ML will converge on something that is very good at all 3. But maybe my hope there is due to my lack of hands-on with those.
What I'm seeing right now (my feeling about new silicon) is numerous simple cores with varied memory-access patterns (stream processing, large scratchpads, systolic things, message passing over crazy fabrics...) and little to no silicon dedicated to handling complex data-access patterns or complex control flow.
The main motivation seems to be "easy to tape out in one go", which is great in a sense (you get new HW faster and cheaper, with far less risk). We're seeing exploration of how to break even the GPU model's upcoming limits: sparse algebra, task-graph/dataflow-oriented prepared or JITed pipelines, and lots of work expected from the compiler and the developer. Even the tensor or RT cores on NVIDIA GPUs are strange specialized beasts there, difficult to use, especially in generic ways.
Seems like a decade of niche optimisation work on dozens of new architectures is coming...
If it can be computed, it can be computed slowly. (Of course, if we have to make N iterations to get at a working design, and N is really large, then speed would make a massive difference as to when we get there...)
Analog CPUs are a horrible idea. You do not want security checks to be analog. Analog accelerators are a great idea, though. The analog world is where security goes to cry and die. Just think of Rowhammer.
[2] resembling nothing so much as text file versions of the "system diagrams" people used to draw in the 1960's with green plastic stencils: https://americanhistory.si.edu/sites/default/files/ogmt_admi... Or would JCL cards be a closer analogy?