Well, past experiences are not that useful right now. I don't think the Itanium or the Cell were great architectures, but it's clear that their most obvious failure mode (mainstream architectures improving faster than them) isn't a showstopper anymore.
For a guess at the next big thing, I would imagine that architectures which avoid the need for memory coherence have good odds.
The most obvious failure for the Cell architecture was that it was horrendous to program. That was partly caused by the lack of memory coherence between the PPE (main core that ran the OS) and the SPEs (the faster cores meant for offloading computations to), but also caused by a lack of developer tools.
The main lesson I would take away from Cell is that you want to design an architecture that allows gradual performance refinement. With the Cell, it was more of a step function: good performance came from using the SPEs well, but using the SPEs well was hard. There wasn't much of an in-between.
Yes, I misspoke by saying the SPEs lacked memory coherence, since you used the same physical addresses to load memory into the SPEs' local stores. What I meant is that the SPEs were divorced from the normal memory hierarchy: you had to explicitly fetch and send back all data for each SPE, which was a significant burden on programmers.
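To make that concrete, here's roughly what the SPE side looked like, written from memory against the Cell SDK's spu_mfcio.h intrinsics (treat the exact names and signatures as my recollection, not gospel): DMA a block from main memory into local store, wait on the transfer tag, compute, then DMA the result back.

    #include <spu_mfcio.h>

    #define N 1024
    /* Local store buffer; DMA wants 16-byte (ideally 128-byte) alignment. */
    static float buf[N] __attribute__((aligned(128)));

    void process_block(unsigned long long ea)  /* effective address in main memory */
    {
        unsigned int tag = 1;

        /* Pull the block from main memory into local store. */
        mfc_get(buf, ea, sizeof(buf), tag, 0, 0);
        mfc_write_tag_mask(1 << tag);
        mfc_read_tag_status_all();             /* block until the DMA completes */

        for (int i = 0; i < N; i++)            /* compute on the local copy */
            buf[i] *= 2.0f;

        /* Push the result back out to main memory. */
        mfc_put(buf, ea, sizeof(buf), tag, 0, 0);
        mfc_write_tag_mask(1 << tag);
        mfc_read_tag_status_all();
    }

Every access to main memory went through a dance like that, with double buffering on top if you wanted to hide the DMA latency. That's a big part of why "using the SPEs well" was hard.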
Does this mean it was a mistake or just ahead of its time? I think software-managed memory tiers have been a dream for advanced architectures for a very long time. The problem, perhaps, is assuming software-managed means programmer-managed.
In a single system there is precedent in virtual memory systems using software-managed page mappings rather than static page table data structures. In HPC there is of course the precedent of distributed parallel systems with message-passing rather than shared memory abstractions. And the separation of GPU memory from system memory is certainly common today. Similar things are happening in software-defined storage (i.e. RAM/SSD/disk/tape tiering).
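The GPU case is probably the most familiar version of that split today. A minimal sketch of the explicit-copy model, calling the CUDA runtime API from plain C (the kernel launch itself is elided, and scale_on_gpu is just an illustrative name):

    #include <cuda_runtime.h>
    #include <stddef.h>

    /* The device has its own memory; nothing is shared implicitly.
       You allocate on the device, copy in, compute, and copy back. */
    void scale_on_gpu(float *host_buf, size_t n)
    {
        float *dev_buf;
        size_t bytes = n * sizeof(float);

        cudaMalloc((void **)&dev_buf, bytes);                          /* device-side allocation */
        cudaMemcpy(dev_buf, host_buf, bytes, cudaMemcpyHostToDevice);  /* explicit copy in */

        /* ... launch a kernel that works on dev_buf here (elided) ... */

        cudaMemcpy(host_buf, dev_buf, bytes, cudaMemcpyDeviceToHost);  /* explicit copy out */
        cudaFree(dev_buf);
    }

Structurally it's the same dance as the SPE sketch above; the difference is that libraries and frameworks now hide most of it from application programmers, which is exactly the gap-straddling mentioned below.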
Computation libraries and frameworks help straddle the gap between application programmer needs and current language/runtime/architecture semantics. I think one problem of current markets is that people tend to want to evaluate hardware independently of software, or in terms of yesterday's software.
There is also a strange history of wanting to discard the software/firmware offered by the hardware vendor (for being insecure/inept/whatever), while blindly accepting the complex hardware design that enables our naive, platform-independent software to run well. The whole Spectre debacle shows it may have been wishful thinking to believe we can divorce software and hardware design...
Does anyone else have real-world Itanium experience? We've got a couple of 5+ year old HP Integrity servers at work running OpenVMS on Itanium, which we use to batch-process large ASCII vendor files in an ETL process written in C. They certainly don't embarrass themselves. We'll be connecting them to a new Pure Storage SAN in a few months and the IT guys are really excited to see what happens to performance.
I take the exact same C code, compile it with Visual Studio, and run it on the same ASCII files locally on my 7th-gen 4c/8t i7 desktop, and I'm not seeing any improvement. That's about the best I can do as far as benchmarking goes.
I used Itanium from 2003-2006 for scientific computing on a big cluster. For that purpose it was much faster than Intel's Xeons of similar vintage. It was also significantly faster than the MIPS and POWER systems we had. Caveats:
- The simulation suite that I used most heavily was developed in-house at the same lab that bought the hardware. It was profiled and tuned specifically for our hardware. The development team was a real software development team with experience writing parallel simulation code since the early 1990s. It wasn't just a bunch of PhD students trying to shove new features into the software to finish their theses.
- There was also close cooperation between the lab, the system vendor (HP), and Intel.
- Other software (like all the packages shipped with RHEL for Itanium) didn't seem particularly fast.
- God help you if you needed to run software that was available only as binary for x86. The x86 emulation was not fast at all.
It was great for numerical code that had been specifically tuned for it. It was pretty good for numerical code that had been tuned for other contemporary machines. Otherwise I didn't see particularly good performance from it. I don't know if it really was a design that was only good for HPC or if (e.g.) it also would have been good for Java/databases, given sufficient software investments.
Maybe it would have been competitive against AMD64, even considering the difficulty of architecture-switching, if it had not been so expensive. But I'm not sure Intel had wiggle room to price Itanium to pressure AMD64 even if they had wanted to; Itaniums were quite big, complicated chips.
At my previous job we used Itanium servers for the Secure64 DNS platform [1]. It performed well; according to Secure64, the logic behind Itanium is security rather than performance, but performance was never an issue. I do know the hardware support for RSA gave nice performance for DNSSEC signing.
The logic behind Itanium was mostly to have a 64-bit architecture that was specific to Intel, without any cross-licensing to other vendors like AMD, and which would also replace proprietary RISC flavors like PA-RISC.
It did also have a fair bit of security acceleration built-in. As I recall, one of the principals behind Secure64 was one of the HP leads associated with the original IA64 announcement. Secure64 pushed Itanium for security applications when it was becoming clear that it was never going to be a mainstream 64-bit architecture.