Fractal: An Execution Model for Fine-Grain Nested Speculative Parallelism [pdf] (csail.mit.edu)
83 points by deepnotderp on July 5, 2017 | 26 comments



I tried to understand what is going on in this paper. Am I right to say that they are parallelizing the subtasks within each domain, and taking advantage of unordered tasks wherever possible?


Correct. One of the goals of this work was to expose and exploit the abundant parallelism within large atomic tasks. However, the tasks within a domain can be ordered or unordered. Our system will exploit any kind of parallelism that is available.


Hi all. This is Suvinay Subramanian from MIT. I am one of the authors on the Fractal paper. I wanted to provide some context for this work, and briefly summarize our paper. I hope this will address some of the comments on this thread. And I am happy to answer any further queries you may have.

You may also be interested in the slides of this work presented at the International Symposium on Computer Architecture (ISCA) last week: http://bit.ly/fractal-slides

What motivated this work? While multicores are pervasive today, programming these systems and extracting performance from a range of real-world applications remains challenging. Systems that support speculative parallelization, such as Transactional Memory (supported in hardware in commercial processors today, e.g., Intel Haswell, IBM POWER8), simplify parallel programming by allowing programmers to designate blocks of code as atomic, instead of requiring complex synchronization with locks, semaphores, and the like.
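To make the contrast concrete, here is a minimal sketch (not from our paper) using GCC's transactional memory extension (compile with -fgnu-tm); hardware TM versions look similar:

    #include <mutex>

    struct Account { long balance; };
    std::mutex m;

    // Lock-based: the programmer must pick locks and reason about
    // deadlock, lock granularity, etc.
    void transfer_locked(Account& a, Account& b, long amt) {
        std::lock_guard<std::mutex> g(m);
        a.balance -= amt;
        b.balance += amt;
    }

    // Transactional: the block is simply declared atomic, and the
    // TM system detects and resolves conflicts.
    void transfer_tm(Account& a, Account& b, long amt) {
        __transaction_atomic {
            a.balance -= amt;
            b.balance += amt;
        }
    }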

However, the performance of these systems on real-world applications remains poor. This is because real-world applications are complex and often have large atomic tasks, whether out of necessity or convenience. For example, in database workloads, transactions have to remain atomic (i.e., all actions within a transaction have to "appear as a single unit"). But each transaction can be long (a few million cycles), and such large tasks are prone to aborts and expensive to track (in software or hardware).

Now, there is often abundant parallelism inside such large tasks (as prior work has noted). The challenge is to extract this parallelism while keeping the entire unit atomic. Prior approaches, including nested Transactional Memory and Thread-Level Speculation (TLS), have been unsuccessful at extracting it -- section 7 of our paper explains why (or see the slides).

Fractal provides a new execution model that addresses the limitations of prior approaches, and proposes a hardware architecture that successfully scales several challenging workloads -- some of which have eluded parallelization for decades. Importantly, Fractal lets us operate at the granularity of "small" tasks that are easy to track. Further, Fractal enables programmers to seamlessly compose speculative parallel algorithms.

What are the key ideas of Fractal? There are two key insights (see the sketch after these two points):

1. Decouple atomicity from parallelism -- We introduce the notion of a domain, where all tasks belonging to a domain appear to execute as a single atomic unit. By doing this, we can execute the "subtasks" in parallel, while ensuring atomicity of the "collection of subtasks" via other mechanisms.

2. Ordering to enforce atomicity -- Unlike prior techniques that "merge" subtasks to enforce atomicity, we leverage ordering to enforce atomicity.
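To make these two insights concrete, here is a minimal sketch (illustrative pseudocode; the names are hypothetical, not our exact API). A large transaction becomes a domain of small subtasks, and atomicity falls out of ordering: if every subtask of one domain is ordered before every subtask of the next, the two domains can never interleave.

    #include <cstdint>
    #include <tuple>

    // (1) A large atomic transaction becomes a domain of small,
    //     cheap-to-track subtasks that may run in parallel.
    void process_order(Order& o) {
        fractal::domain d;  // everything in d appears as one atomic unit
        for (Item& item : o.items)
            d.enqueue([&] { update_inventory(item); });
        d.enqueue([&] { compute_totals(o); });
    }

    // (2) Atomicity via ordering: order subtasks first by domain, then
    //     by position within the domain. If dom(A) < dom(B), no subtask
    //     of B commits between subtasks of A, so each domain appears
    //     to execute as a single atomic unit.
    struct TaskOrder {
        uint64_t domain;   // position of the enclosing domain
        uint64_t subtask;  // position of the subtask within its domain
    };

    bool commitsBefore(const TaskOrder& x, const TaskOrder& y) {
        return std::tie(x.domain, x.subtask) < std::tie(y.domain, y.subtask);
    }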

Tell me more about the hardware, aka the Swarm architecture. Conventional wisdom dictates that ordering limits parallelism, and ordered parallelism is indeed hard -- there have been several attempts at taming this beast, including Thread-Level Speculation (TLS), as alluded to in one of the comments.

In our prior work, Swarm, we demonstrated that it is possible to extract parallelism even in the presence of ordering constraints, with a more expressive execution model, and some nifty hardware tricks.

At a high level, the Swarm hardware executes tasks speculatively and elides ordering constraints; only true data dependences force an "ordered" execution. On the hardware side, Swarm has large task queues to manage tasks, plus a high-throughput commit protocol and speculation mechanisms that allow it to scale.

You can read more on Swarm at: http://bit.ly/swarmarch

Fractal relies on the Swarm hardware to enforce ordering guarantees while extracting parallelism. The key challenges were how to assign the subtasks an ordering that maintains the atomicity of their "domains", how to assign this ordering dynamically, and how to support "unbounded nesting" -- e.g., if your subtask makes a library call, that call could itself be parallelized (but must still remain atomic), and so on. Supporting unbounded nesting is essential for software composition of algorithms; see the sketch below.
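For a flavor of what composition looks like, here is a hypothetical sketch (fractal::forall appears in Table 1 of the paper; the exact signature and the surrounding code are illustrative guesses):

    #include <vector>

    // A library routine that is internally parallel: its iterations
    // form a child (nested) domain.
    void histogram(const std::vector<int>& data, std::vector<int>& counts) {
        fractal::forall(0, data.size(), [&](size_t i) {
            // Each iteration is a fine-grain speculative subtask;
            // conflicting updates to counts[] are detected and
            // resolved by the speculation mechanisms.
            counts[data[i] % counts.size()]++;
        });
    }

A speculative task can call histogram() without knowing that it is parallel inside; the nested domain keeps the whole call atomic from the caller's perspective, at any nesting depth.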

Is Fractal / Swarm just an accelerator? We believe that Fractal and Swarm are a more general solution to some of the hardest parallel computing problems. We have shown that the ideas presented in these works apply across a wide range of domains, from databases to discrete-event simulation to machine learning and graph analytics.

When will we see this in hardware? Currently, we evaluate our ideas in simulation. But our group is working on prototyping this on an FPGA in the coming year. So stay tuned!


Awesome! I have a couple of questions if you don't mind:

1) Does this require the programmer to rewrite software or make annotations? Even if that's the optimal way, can a compiler do this automatically, at least to some extent?

2) How general a workload can this accelerate? You mention speculative multithreading and HTM in both papers, both of which are fairly general-purpose techniques.


1. Currently the programmer needs to delineate code into tasks. A couple of points to note:

-- Often the programmer can start with the serial code and determine how to delineate tasks. In our experience this is often easier than writing more complex parallelized versions of the same algorithm. Compared to the serial code, our programs are just a few percent longer.

-- We have implemented high-level interfaces along the lines of OpenMP or OpenTM to make it easier to break straight-line serial code into fine-grain tasks. For example, if you have a for loop to parallelize, you can use the fractal::forall(...) construct (see the sketch below). Table 1 of the Fractal paper lists some of these high-level interface functions.
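For instance, converting a serial loop might look like this (the forall signature here is approximate; Table 1 of the paper has the actual interface):

    // Serial version:
    for (int i = 0; i < n; i++)
        process(v[i]);

    // With the high-level interface: each iteration becomes a
    // fine-grain speculative task, and the loop as a whole still
    // appears to execute atomically.
    fractal::forall(0, n, [&](int i) {
        process(v[i]);
    });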

We are currently working on a compiler that can take serial code and automatically annotate tasks and domains. Keep watching our webpage for updates on that.

2. Our execution model subsumes both Transactional Memory (TM) and Thread-Level Speculation (TLS). Any program that can be expressed in those paradigms can be expressed in Fractal. And as we show in the paper, Fractal can express parallelism that was not possible in TLS or TM, so we believe it is general purpose. Further, as indicated above, we have seen these ideas apply to applications in a wide range of domains.

That being said, we are always on the lookout for applications that reveal deficiencies in our execution model, to help improve it :)


1. Yes, I've seen that, and I think an OpenMP-esque annotation system specified by the programmer would be optimal, but I don't see any reason a "sufficiently smart compiler" [ ;) ] couldn't produce an approximation. Given the SpMT-like "thread" (task) squashing, it shouldn't cause any semantic issues.

2. Absolutely. SpMT has been proposed as a general-purpose technique, which is why I found it odd that the benchmarks weren't from a general-purpose suite like SPECint. Do you have any benchmarks on SPECint, or any plans to run them?


2: We are currently working on a compiler to automatically delineate tasks, assign domains etc. given sequential code. As part of that effort, we are looking at SPEC workloads, yes.


Is this a physical processor or scheduling algorithm?


This is a new execution model + hardware architecture. Currently we have evaluated it in simulation only. But stay tuned for hardware.


Complete n00b question -- this is very, very far outside my domain -- but by using much more of a chip in parallel, does this present any challenging thermal implications?


This concern is orthogonal to our work. There are commercial chips with a large number of cores / processing elements today -- e.g., NVIDIA's Volta has ~5120 "cores", and Intel's Xeon Phi has up to 72 cores -- and there is active research on handling thermal concerns in such large chips. Maybe someone more knowledgeable can shed light on those approaches. I do not think Fractal introduces any new thermal concerns.


Looks like a processor.


Did I get it right, does this CPU use the x86-64 ISA?

(see page 8, Table 2: Configuration of the 256-core system)


It's only simulated at this point.

There are fairly good cycle accurate open source x86 simulators, such as PTLsim or marss86

http://marss86.org/~marss86/index.php/Home http://caps.cs.binghamton.edu/tools.html


We use the x86-64 ISA for our cores, yes. But our ideas are applicable beyond the x86 ISA. As nn3 points out, currently our ideas are evaluated in simulation.


Hi Suvinay! Do you all have a website that keeps interested people updated about the upcoming Swarm hardware? I would like to follow this work. Thanks!


Thank you for your interest. We are planning to create a web page for this soon (likely http://swarm.mit.edu -- it's not live yet). Till then, you should be able to find our papers and some upcoming ones at my advisor, Daniel Sanchez's web page: http://people.csail.mit.edu/sanchez/.


Moore's law lives on?


IMO, everyone is getting more creative since scaling doesn't help performance as much anymore. Moore's Law allowed chip makers to lean on physics to make existing architectures faster. Now the plethora of accelerators (risky new architectures) is proof that physics isn't doing as much of the heavy lifting.


Back in 2013 or so, Hot Chips had a presentation from some dude from DARPA about chip development at and after the end of Moore's law scaling. It was probably the most prophetic tech talk I've ever seen. Your comment could essentially be a quote straight out of it.

Anyway, for me the rub comes in with the fact that historically it's taken a really long time for mainstream software to exploit new instruction sets or architectures.


To be fair, a lot of people had that prediction. But DARPA has definitely been ahead of the game in most research.

More importantly, Dennard scaling basically dropped dead around the ~40-28nm nodes, although we did get a "pseudo-continuation" in the form of FinFETs and friends.


Yeah, seems like it's just getting more discrete and less continuous.

That said, this is not an 88x speedup for general purpose computing, just for some specific workload of parallel compute.


If I read this correctly, this is effectively a better way to do thread level speculation (aka Speculative Multithreading), so in theory this could help in general purpose computing as well, just not 88x ;)


Your intuition on this improving Thread Level Speculation (TLS) is correct. I do want to clarify that this is not just a better implementation of TLS though -- our execution model is more expressive and avoids some of the pitfalls of TLS (see the Swarm paper http://bit.ly/swarmarch to learn more). Yes, you could use our implementation with a TLS-like execution model and improve on the TLS performance. But you are likely to reap more rewards by using the Fractal execution model + implementation.


The 88x is the maximum speedup we get, yes. We believe our ideas apply across a wide range of domains, from databases to discrete-event simulation to machine learning and graph analytics.


Very cool work nonetheless. It's easy for the software folks to feel completely hopeless these days, with transistors suddenly getting more expensive instead of cheaper.



