Couldn't find many simple alternatives, so I wrote my own while on vacation for fun. Hopefully it's useful as open source; I'll fix up CI over the weekend and add a lock-free (not atomic-free) allocator while I'm at it.
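For anyone wondering about the distinction: by lock-free (not atomic-free) I mean roughly the classic Treiber-stack free list below, where threads make progress via CAS loops on atomics rather than by taking a mutex. This is just a minimal sketch with made-up names, not the actual allocator, and it ignores the ABA problem a real version has to deal with.

```cpp
#include <atomic>

// Minimal lock-free free list: recycled blocks live on a CAS'd singly
// linked list. Lock-free, but clearly not atomic-free.
struct FreeNode { FreeNode* next; };

class FreeList {
    std::atomic<FreeNode*> head{nullptr};
public:
    void push(void* block) {
        auto* node = static_cast<FreeNode*>(block);
        node->next = head.load(std::memory_order_relaxed);
        // retry until head swings from node->next to node
        while (!head.compare_exchange_weak(node->next, node,
                                           std::memory_order_release,
                                           std::memory_order_relaxed)) {}
    }
    void* pop() {
        FreeNode* node = head.load(std::memory_order_acquire);
        // NOTE: a real allocator must also guard against ABA/use-after-free here
        while (node && !head.compare_exchange_weak(node, node->next,
                                                   std::memory_order_acquire,
                                                   std::memory_order_relaxed)) {}
        return node;
    }
};
```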
Depends. Right now they're in two different spaces: one is still hanging on to graphics/gaming, the other is coming from the dense compute space. We'll see how long it takes them to converge.
For all intents and purposes, they've already converged: the underlying GPU microarchitectures have been fairly general purpose SIMD-ish for a long time now.
And (as one example), CUDA happily runs on any gamer "consumer" GPU.
What do you mean by programmable gather/scatter? GPUs already do efficient gather and scatter operations. I think Knights Landing's AVX-512 even has efficient gather and scatter.
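For instance, with AVX-512 intrinsics a gather already looks something like this (just a sketch of the existing software-visible instruction, not anything from the article):

```cpp
#include <immintrin.h>

// Gather 16 floats from arbitrary offsets in `base` into one vector register.
// This is the software view; what the memory system does underneath to
// service the 16 addresses is a separate question.
__m512 gather16(const float* base, const int* idx)
{
    const __m512i vindex = _mm512_loadu_si512(idx);           // 16 x int32 indices
    return _mm512_i32gather_ps(vindex, base, sizeof(float));  // scale = 4 bytes
}
```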
Read the linked article, and the paper linked from there. Basically the idea is that gather/scatter can be very inefficient from a cache and bandwidth perspective; in the worst case you're using only a single element per cache line. So the idea is to "move" the scatter/gather engine to the memory controller, and pack the vectors while they're still in the cache rather than in the register file.
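To put numbers on that worst case (hypothetical struct, not from the paper): gathering one 4-byte field out of 64-byte records drags in a full cache line per access for 4 useful bytes, i.e. only 4/64 = 6.25% of the fetched bandwidth is actually used.

```cpp
#include <cstddef>

// Hypothetical 64-byte record: one "hot" field plus whatever else lives there.
struct alignas(64) Record {
    float hot;        // the 4 bytes we actually want
    char  other[60];  // the rest of the cache line comes along for the ride
};

// Gathering `hot` from n randomly scattered records reads n * 64 bytes from
// memory to deliver n * 4 bytes of useful data. A gather engine at the
// memory controller could instead hand the cache a densely packed array of
// just the hot fields.
float sum_hot(const Record* r, const std::size_t* idx, std::size_t n)
{
    float s = 0.0f;
    for (std::size_t i = 0; i < n; ++i)
        s += r[idx[i]].hot;
    return s;
}
```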
Will it work in reality? No idea, but it's an interesting idea certainly worth exploring.
> The key technology presented in this paper is the Sparse Data Reduction Engine (SPDRE)
It's a shame you missed the opportunity to call this a Sparse Data Engine (SpaDE) - you would then get nice terminology about shovelling data around. On the other hand, the work around cache invalidation looks solid, and one out of two ain't bad :).
As you note in the paper, a key difference between SPDRE and things like Impulse is that SPDRE works in batch, whereas Impulse is on-demand. That means a higher up-front cost for setting up a reorganisation, but a lower cost for accessing it. Do we know how that trade-off helps or hurts the two approaches in different domains?
I can imagine that for classic HPC stuff like working on matrices, batch is better. You have a matrix of structures, and you're going to access some particular field in every one of them, potentially several times. So, all of the work done during reorganisation is useful.
On the other hand, I can imagine that for searchy tasks, as in databases, access might be much sparser and harder to predict. I might have a graph of data that I want to find a path through; I expect to touch only a tiny number of the nodes in the graph, but I don't know which ones up front. Reorganising the relevant data out of all the nodes would be a huge amount of wasted work.
The programmer interface to both approaches seems like it could be pretty similar: define a reorganisation that should exist somewhere in memory, wait for a signal that it is ready, access it, tear it down. Does that open the door to hybrid approaches which combine on-line calculation with speculative bulk work? That would limit the interface to a lowest-common-denominator way to specify reorganisations; would that sacrifice too much power?
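Purely as a software sketch of that lowest-common-denominator interface (every name here is hypothetical, nothing from the SPDRE paper): both a batch and an on-demand engine could plausibly sit behind something like this, with the hybrid question being whether wait_ready() blocks on a bulk copy or returns immediately and fills lazily. Here it's emulated with a plain copy; the idea is that memory-side hardware would do the packing instead.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical descriptor: a strided view of memory to be packed densely.
struct Reorganisation {
    const void* base;    // source region
    std::size_t stride;  // bytes between consecutive elements in the source
    std::size_t elem;    // bytes per element to extract
    std::size_t count;   // number of elements
};

class ReorgHandle {
    std::vector<unsigned char> packed_;
public:
    explicit ReorgHandle(const Reorganisation& r)
        : packed_(r.elem * r.count)
    {
        // "batch" style: do all the packing up front
        const auto* src = static_cast<const unsigned char*>(r.base);
        for (std::size_t i = 0; i < r.count; ++i)
            std::memcpy(&packed_[i * r.elem], src + i * r.stride, r.elem);
    }
    void wait_ready() {}                                   // no-op in this emulation
    const void* packed() const { return packed_.data(); }  // dense view for the kernel
    // destructor tears the view down; a scatter variant would write back here
};
```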
It'll be called SPiDRE (pronounced spider) at MEMSYS. Name coined by a colleague, I can't take credit.
Thanks for reading! You'll have to wait for the presentation and follow-on papers for some of those answers :). If you read the Dark Bandwidth paper, there are some solutions mentioned there and in the presentation (http://www.jonathanbeard.io/slides/beard_hcpm2017.pdf) that could apply to what you suggest.
Seems very much "back to the future." Systolic array processors were used to accelerate neural networks in the 1980s, and they're great for matrix math too (ref: http://repository.cmu.edu/cgi/viewcontent.cgi?article=2939&c...). These aren't quite the systolic array processors of old, but they're too close to be considered a new arch/micro-arch. The formula is simple: take a low-precision matrix multiply to accelerate, drop in a matrix multiply unit that can be blocked for, add high-bandwidth memory to feed it, and let it go. I'm waiting for more new takes on old architectures. As fabbing chips becomes more economical, I hope to see more retro chips, especially things that didn't quite make the jump from research to production because of scaling (or other reasons) but might now make sense.
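For what it's worth, "blocked for" just means tiling the multiply so each tile stays hot in fast memory and keeps the MM unit fed; something along these lines (plain scalar C++, illustrative tile size, not any particular chip's kernel):

```cpp
#include <cstddef>

// Blocked (tiled) matrix multiply C += A * B for n x n row-major matrices.
// Each TILE x TILE block is reused while it is hot in cache / local memory,
// which is what keeps a matrix-multiply unit (or a systolic array) fed from
// high-bandwidth memory.
constexpr std::size_t TILE = 64;  // illustrative tile size

void matmul_blocked(const float* A, const float* B, float* C, std::size_t n)
{
    for (std::size_t ii = 0; ii < n; ii += TILE)
        for (std::size_t kk = 0; kk < n; kk += TILE)
            for (std::size_t jj = 0; jj < n; jj += TILE)
                for (std::size_t i = ii; i < ii + TILE && i < n; ++i)
                    for (std::size_t k = kk; k < kk + TILE && k < n; ++k) {
                        const float a = A[i * n + k];
                        for (std::size_t j = jj; j < jj + TILE && j < n; ++j)
                            C[i * n + j] += a * B[k * n + j];
                    }
}
```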
Server/HPC: there are quite a few coming online. The Cavium presenter at @goingarm listed https://www.packet.net on their slides; that might be decent for getting a cloud instance to try out. There are many others. A good place to start finding them is http://arm-hpc.gitlab.io.
A nice thing about ARM is that you get lots of different micro-architectures under a single ISA. Once the ball gets rolling there will be many processors to choose from, which I think is a great thing (in full disclosure, I'm one of the authors and I work for ARM Research on our HPC program, although all statements above are my own opinion and not those of ARM).
Officially, I think the only thing that can be said has been said by Paul Messina: "I believe that the Aurora system contract is being reviewed for potential changes that would result in a subsequent system in a different time frame from the original Aurora system. But since that's just early negotiations, I don't think we can be any more specific on that." source: https://insidehpc.com/2017/06/told-aurora-morphing-novel-arc...
C++Now is branching out. I gave a talk this year on FIFO communications and a tutorial on RaftLib; one not so C++-focused and the other definitely C++-focused (it's a C++ library). I enjoyed the wide variety of people and topics. Will try to go again next year.