
Reinvent? Without any substantial improvement?

A truly "reinvented" CPU would use a reconfigurable coarse grained data flow approach, directly available to the programmer, instead of "yet another command stream based von Neumann machine", imho.

(Actually, something like that is in parts already at the heart of modern CPUs; only none of it is available directly to the programmer, where it would actually be useful, which is utter nonsense of course; and it's not reconfigurable; the compiler can't construct pipelines on the fly; not even the hardware's built-in JIT, which can be found in every modern CPU and which tries to translate that unfortunate command stream into a data-flow graph inside the CPU as well as it can (of course after we deleted all the information about that data-flow that we collected during compilation), can do that ;-). Makes sense? No? Doesn't matter. Current CPUs work like that. Because we just "can't" fix our software! Thank you, C/C++.)
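
To make the mismatch concrete, here's a toy sketch in Haskell (the encodings are made up, purely illustrative): the same computation once as a flat command stream, and once as the data-flow graph that an out-of-order core has to reverse-engineer from that stream at runtime.

    -- Toy sketch (hypothetical encodings): (a + b) * (a - b) as a stack-style
    -- command stream vs. as an explicit data-flow graph.
    data Instr = Load String | Add | Sub | Mul
      deriving Show

    commandStream :: [Instr]
    commandStream = [Load "a", Load "b", Add, Load "a", Load "b", Sub, Mul]

    data Node = Input String | Op String Node Node
      deriving Show

    dataFlowGraph :: Node
    dataFlowGraph =
      Op "mul" (Op "add" (Input "a") (Input "b"))
               (Op "sub" (Input "a") (Input "b"))

    main :: IO ()
    main = do
      print commandStream   -- the linear view the CPU is fed
      print dataFlowGraph   -- the structure it has to reconstruct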

Only switching from command streams to data-flow could overcome the von Neumann bottleneck. And only with such a switch would asynchronous, clockless designs become feasible (which are the only way to overcome the power wall in the long run).

"Fun" thing is that to some extend such data-flow hardware is available today. Because it's the only kind of hardware that scales into dimensions where truly powerful AI applications are possible. Only that almost nobody thinks about how to make this kind of almost maximally efficient hardware applicable to general purpose computations. But that would be in fact trivially simple: Just rewrite the world in some new kind of data-flow language (which needs to get invented first, of course). Ah, and when we're at it, don't forget to use balanced ternary for your new kind of computer—as this is the most efficient and beautiful system for computing hardware.




Historically, exotic computing architectures fail because they run up against the limits of compilers, languages, and the programmers who work with them.

It seems it's far "easier" to make hardware for computers that run programs quickly than it is to formulate those programs. And it's even harder to get programmers to write those programs in new ways.

The history of computers is littered with graveyards full of interesting and potentially superior hardware architectures that failed to achieve market penetration and economies of scale.


I know that.

I've said something similar (only regarding operating systems, and approaches to security) elsewhere today:

https://news.ycombinator.com/item?id=32468423

Changing "too much" doesn't seem to be an option, so we're stuck with the stuff invented many many decades ago, and likely never will be able to overcome those local "optima" of the past—just because the market won't accept any disruptive changes any more by now.

And sure: It's always the software… (As said also in the long reply to the sibling comment).


The good news is that FPGAs have democratized a bit of the HW stuff for us software folks. It's fairly trivial to get started with an FPGA dev board and build out our own custom CPU architectures, though of course it's likely going to be "just for fun." So even if a new and novel concept can't take over the world, we can build our own lightsabers, even if just for hobby stuff.


Yes and no. At least to validate my ideas FPGAs would not really be sufficient.

Core to my idea are three things (on the HW level): runtime reconfigurability, dense and fast reconfigurable on-chip networking, and asynchronous clockless logic. (Besides "smart memory", but that's something we can actually have.)

FPGAs don't really offer any of that. (There is some kind of support for reconfigurability on the bigger Xilinx chips, but it doesn't really smell like what I'm looking for; also, it's proprietary and very limited.)

FPGAs can of course simulate the named things, but without exposing their advantages. Actually, such a simulation would be very slow compared to doing things "the usual way".

This is especially sad as FPGAs as such already have a lot of the other desired properties: they're reconfigurable hardware, after all!

So one would first need to build some new / modified kind of "FPGA" (especially adapted to async logic and runtime reconfigurability) just to try out my ideas in a halfway realistic setting.

On the other hand, doing some ISA research and experiments doesn't need any hardware in the first place. This could be done fully in a software-simulated environment. Only that this way it's impossible to know whether there would be any real merit in all this stuff, simply because I would expect a SW simulation to be quite inefficient and slow (even on an FPGA). So this wouldn't be anything that could convince anybody, frankly.

Well, maybe at some point I'll run into some crazy person who has a few dozen million dollars to spare to throw at a "high-fantasy" project like that, to validate the basic assumptions. IF it worked, it could actually change for the better everything we take for granted today about how computers work. Imho it's time. The von Neumann bottleneck is killing any progress! We should get rid of it. This needs some radical¹ rethinking of our machines (and especially their software).

¹ from lat. "radix", eng. "the root"


This reads like a hand-wavy reference to some sort of programmable systolic array thing. A current example in this vein is Google's TPU AI accelerators. Note they get a drastic increase in performance by drastically limiting precision (read: small values and registers, a specific computation requiring limited operations), hence a massive reduction in transistors per computational element. OK, fine; but for stateful calculations that need to load and/or store information per computational element, it's not the win one might think. Specifically, memory bandwidth can be the limitation. In these cases modern GPUs are really, really hard to beat.
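
To put a rough number on the bandwidth point, a back-of-the-envelope sketch in Haskell (figures are illustrative, not measurements): arithmetic intensity, i.e. FLOPs per byte moved, is what decides whether a kernel is compute-bound or memory-bandwidth-bound on any accelerator.

    -- Illustrative arithmetic-intensity estimates (FLOPs per byte moved).
    matmulIntensity :: Double -> Double -> Double
    matmulIntensity n bytesPerElem =
      (2 * n ** 3) / (3 * n ** 2 * bytesPerElem)  -- ~2n^3 FLOPs, ~3n^2 elements moved

    elementwiseIntensity :: Double -> Double
    elementwiseIntensity bytesPerElem =
      1 / (2 * bytesPerElem)                      -- 1 FLOP per load + store

    -- matmulIntensity 1024 4   ~= 170 FLOPs/byte (compute-bound, TPU territory)
    -- elementwiseIntensity 4   == 0.125 FLOPs/byte (bandwidth-bound, HBM wins)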


> This reads like a hand-wavy reference to some sort of programmable systolic array thing.

Indeed, systolic arrays are a big inspiration. As are the Xputer, the Reduceron, computational RAM, parts of IBM's TrueNorth chips, and of course, in parts, current AI and GPU hardware.

> OK, fine; but for stateful calculations that need to load and/or store information per computational element, it's not the win one might think.

The whole point of my idea is to mitigate this.

Exactly this problem is the famous von Neumann bottleneck. And it's holding back any progress we could probably make regarding computer hardware.

Imho the whole misery we're in is due to the fact that almost all "basic" stuff is built in languages that are married to the "C-Machine". The core of the problem: the "C-Machine" can't be built efficiently with hardware that does not simulate a von Neumann computer. So we're stuck.

The only way out of this "local optimum", as I call it, would be to change the basic programming model: from imperative to some mix of declarative data-flow with FP elements. Then "rebuild the world" on top of that… (Simple as that, isn't it? :-D).

One can't fix the von Neumann bottleneck as long as it's possible (or actually required!) to jump around freely in memory. So we need to change that! Pure FP languages could offer a way around that basic problem. In FP languages one can "contain" all state manipulation, on the language level, in isolated parts of the program. Everything between these parts is pure data-flow. So you could have explicit "machines" that manipulate state and talk to the outside world, strongly separated from the parts that do only pure transformations on data. Those "machines" would be the "processing units" (with their attached scratchpad memory) that I'm talking about in the other long sibling comment.
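
A minimal toy sketch in Haskell of that separation (names are mine, purely illustrative): one small "machine" owns the mutable scratchpad and the effects, and everything in between is pure data-flow.

    import Data.IORef (newIORef, modifyIORef', readIORef)

    -- pure data-flow: no loads, no stores, no jumps, just transformations
    transform :: [Double] -> [Double]
    transform = map (* 2) . filter (> 0)

    -- the "machine": the only place that touches mutable state / the outside world
    runMachine :: [Double] -> IO Double
    runMachine input = do
      acc <- newIORef 0                              -- its private "scratchpad"
      mapM_ (\x -> modifyIORef' acc (+ x)) (transform input)
      readIORef acc

    main :: IO ()
    main = runMachine [1, -2, 3] >>= print           -- prints 8.0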


I'm not sure whether the approach you are talking about is coarse-grained or fine-grained; it feels pretty fine-grained to me. Programming with data flow is already used by some high-performance computing frameworks, e.g. CUDA and oneTBB support task graphs, which are not instruction-level data flow but more coarse-grained task flow.

Not sure if exposing the underlying pipelining to programmers will be that useful.

1. Graphs are more general but more complicated. Control flow graphs are nice; data dependence graphs are messier.

2. You still need an underlying representation for the graph. If you consider instructions as a certain representation of a data flow graph, you are still getting a graph, just perhaps not one that flexible for programmers. And if you want to encode dependencies more precisely, you will need longer instructions, and modern CPUs are typically a lot faster than main memory.

3. Exposing pipelining details in the ISA may limit future optimization. Due to hardware limitations, the maximum number of in-flight operations must be bounded, so you either need some way to restrict programmers from building crazy data flow graphs, or limit it implicitly by blocking when there are too many in-flight operations. The latter case is quite similar to what we currently have, and the former case leaks some implementation details into the ISA.


The previous post was mostly about changing perspective, not really about the details.

You're actually right that something like that would need to operate at the granularity of "(micro) tasks", because more fine-grained approaches aren't efficient, as we know by now.

Also, I've noted that we already have (and had) some hardware that goes in that direction to some extent.

> Not sure if exposing the underlying pipelining to programmers will be that useful.

I think that would be fundamental. In fact this would be a crucial part of the reconfigurability of the system.

But it wouldn't only be about exposing pipelining. The steps in the pipeline, as well as its structure, would need to be programmable. Think microcode, but more powerful, and available to the programmer.
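
Very roughly, and purely as a software toy (names made up by me): the pipeline stages themselves become data that can be swapped at runtime instead of being baked into silicon.

    type Stage a = a -> a

    buildPipeline :: [Stage a] -> Stage a
    buildPipeline = foldr (.) id        -- wire the stages together on the fly

    -- "reconfigure" simply by loading a different stage list:
    intPipe :: Stage Int
    intPipe = buildPipeline [(+ 1), (* 3), subtract 2]  -- applied right-to-left

    -- intPipe 10 == ((10 - 2) * 3) + 1 == 25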

> You still need an underlying representation for the graph.

Sure.

That's some data structure produced by a compiler.

Think FPGA bitstream.

Only that one would need to be able to dynamically load and change it on the fly.

> If you consider instructions as a certain representation of a data flow graph […]

This expresses some (of the usual) misunderstanding of my idea.

The whole point is to change the viewpoint from programming a computer through a command stream (a stream of instructions) to something more "declarative" (a data-flow "execution graph").

Instead of a command stream, you would load (parts of) said "execution graph", which encodes the function and wiring of your processing units (as said above, think FPGA bitstream), and then "just pipe" data through that graph.

If the graph does "not fit" (does not map completely) into the available hardware resources (the programmable processing units and their programmable interconnects, and the available smart memory cells: a kind of "intelligent" scratchpad memory cell, as a replacement for registers), it needs to get "tiled". Then the "tiles" could be "moved underneath" the (fixed) hardware resources. This would actually be the usual way things work, as otherwise the CPU would need to integrate all the memory of the computer (which is not realistic).
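
A rough sketch of what such an "execution graph" plus naive tiling could look like as a data structure (all names hypothetical; a real interconnect description would of course be far richer):

    -- One entry per processing unit: what it computes and where its outputs go.
    data UnitConfig = UnitConfig
      { unitOp    :: String     -- the function programmed into this unit
      , feedsInto :: [Int]      -- on-chip network wiring to downstream units
      } deriving Show

    type ExecGraph = [UnitConfig]

    -- Naive tiling: cut the graph into chunks that fit the fixed number of
    -- physical units, so tiles can be streamed in like an FPGA bitstream.
    tile :: Int -> ExecGraph -> [ExecGraph]
    tile _ [] = []
    tile unitsAvailable g =
      let (t, rest) = splitAt unitsAvailable g
      in  t : tile unitsAvailable rest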

The nice thing about that concept would be that the actual data doesn't need to be moved as the graph-tiles get (virtually) "moved" by rewiring and reprogramming the HW processing units / smart memory cells.

Doing that replacement of execution-graph tiles, following the "waves of data-flow through the system", is something that does not involve any "jumps", just an upfront-known constant stream of "bitstream", and it could be nicely parallelized with the actual computations in the processing units (after the "data wavefront" has moved forward you can safely replace the graph tiles "behind" it without interrupting the rest of the computations).

Doing things this way seems complicated at first, but my intuition is that all the magic done in today's CPUs is actually even more complicated. Today's hardware is already doing parts of the described things. (Only that it's not reconfigurable.) But always having the possibility of arbitrary jumps around memory hanging over your head, and not having the data-flow structures available to the hardware (it needs to "guess" them!), makes things way more complex than imho needed. In my opinion we're doing things "backwards" by now!

If the actual hardware is a data-flow machine at its core anyway, we should embrace that on the higher levels, and not pretend that we still program a very fast PDP-11.

https://queue.acm.org/detail.cfm?id=3212479

The "only" problem is that the C/C++ (and the-like) abstract machines (and therefore the "machine code" such languages get compiled to) pretends that our computers are still PDTs. As the above article says:

>> A processor designed purely for speed, not for a compromise between speed and C support, would likely support large numbers of threads, have wide vector units, and have a much simpler memory model. Running C code on such a system would be problematic, so, given the large amount of legacy C code in the world, it would not likely be a commercial success. [emphasis mine]

"large numbers of threads" == a lot of in flight "micro tasks" (at some stages of their "virtual pipeline")

"wide vector units" == powerful programmable "processing units", with wide "buses" (connected by some hyper-cube or hyper-torus on-chip network)

"much simpler memory model" == Just let the data flow! No "moving parts" at all (besides the flowing "application data", or the equivalent from some external "device"-ports)

TL;DR: Just feed the machine two data streams, one describing the ever-updating "execution graph" (as tiles get "moved"), and one providing the actual data that needs to flow through that "virtually pipelined" system (graph structure). No direct RAM access (besides "virtually local" smart-memory scratchpad access by the processing units, mediated through software-controlled on-chip network routing), as "external" system memory would be seen as a "device" lying behind some appropriate processing unit. No jumps (as these would be "just branches described by upfront-known data" in the "execution graph", through which data could possibly flow; which is actually not much different from speculatively executing code).
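
As a toy model of the "two streams" idea (purely illustrative, a single processing unit, made-up types): the machine only ever sees data packets and reconfiguration packets, never an address or a jump target.

    data Config = SetOp (Int -> Int)         -- reprogram the (single) toy unit
    data Packet = Data Int | Reconfigure Config

    run :: (Int -> Int) -> [Packet] -> [Int]
    run _  []                            = []
    run op (Data x : ps)                 = op x : run op ps
    run _  (Reconfigure (SetOp f) : ps)  = run f ps

    -- run (+ 1) [Data 1, Data 2, Reconfigure (SetOp (* 10)), Data 3] == [2,3,30]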

The processing units could be of three kinds, I think (but I forgot to mention that): they could be programmed to be sources, transformers, or sinks. Sources and sinks would be I/O elements, data generators, or interfaces to persistent storage. Transformers map nicely to ("clustered", or better said, "sequenced and glued together") functions, or, given that they would have access to (private, "virtually local") scratchpad memory, also to co-routines (which would be the place where the effects happen; which is of course synonymous with state manipulation). For more efficiency and programmer flexibility they could include FPGA hardware and make the implementation of their specific functions a matter of "changing the code".
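
Typed out as a (very rough) Haskell sketch, with names of my own choosing:

    data Unit a b
      = Source      (IO a)        -- I/O element, data generator, storage reader
      | Transformer (a -> b)      -- pure, "glued together" transformation
      | Sink        (b -> IO ())  -- where effects / state manipulation happen

    -- the smallest possible "graph": one of each kind, wired in sequence
    runChain :: IO a -> (a -> b) -> (b -> IO ()) -> IO ()
    runChain src f snk = src >>= snk . f

    -- e.g. runChain (pure [3, 1, 2]) maximum print   -- prints 3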

We have plenty of transistors these days. We could really start trying to overcome the PDP…

The hardware actually isn't the problem. It's (as always!) the software, and of course the compilers.

That said, it's clear to me you still need a "language to talk to the hardware". But it would be more declarative than the current imperative "machine code", I guess. Designing such a language and its abstract machine (a truly new kind of ISA and ASM) is the challenge here. Synthesizing the appropriate hardware from that in a next step wouldn't be too complicated, I think.

Happy to hear some opinions on those ideas!

I know they're unusual, to say the least (so there's much to criticize); but isn't "reinventing the computer" actually the topic of the original submission? :-)


> data-flow language

Forgive me if I misunderstand, but could you not use an array-oriented language such as APL, J or K? They probably don't quite mesh with explicitly controlling the data flow, but don't they and other declarative languages allow the computer to optimise the best path to maximize computational throughput?


Indeed something APL-like would make a good "assembler" kind of language for such an architecture!

APLs are much more data-oriented and much less control-flow heavy. Also, they don't use the notion of LOAD / STORE of mutable memory cells as a basic language feature (unlike all the imperative "C-like" languages, which therefore need a "C-Machine" to be executed efficiently).
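
Not APL, but here's a rough flavor (in Haskell, example and names are mine) of that style: whole-collection pipelines, no index arithmetic, no mutable cells, no explicit control flow.

    import Data.List (sort)

    -- "average of the top 3 values" as one data-flow pipeline
    topThreeAvg :: [Double] -> Double
    topThreeAvg = (/ 3) . sum . take 3 . reverse . sort

    -- topThreeAvg [5, 1, 9, 7, 2] == (9 + 7 + 5) / 3 == 7.0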

So yes, APLs would be a good starting point for developing the machine language of the kind of machine I have in mind. (Only I have some more ideas about it that go quite beyond that; like making "sending requests to your environment / a 'machine'", which is synonymous with having effects, a language-level thing.)


> But that would be in fact trivially simple: Just rewrite the world in some new kind of data-flow language (which needs to get invented first, of course).

I hope that was sarcasm. But if not: No, rewriting the world is not trivially simple. Not remotely.


I see, nothing is obvious on the interwebs! :-D

The "rewrite the world" part is actually where such ideas obviously fall apart (as I said also in the sibling replies)…

So I'm very well aware of that "minor issue" regarding my idea.


I thought that was a bit over the top to be actually serious.



