
Are there languages that have first-class support for representing/optimizing memory hierarchy characteristics? Optimizing C compilers, for example, may have extensions to force specific alignments:

https://software.intel.com/en-us/articles/coding-for-perform...

But I'm not aware of languages where e.g. declaring alignments is part of the base language. Awareness of L1, L2, L3 cache characteristics, plus NUMA nodes, is increasingly important to writing high performance code. Every language I've ever used exposes nothing of that hierarchy; it's all just "memory".
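For concreteness, here's the kind of compiler-extension alignment I mean — a minimal sketch assuming GCC/Clang (MSVC spells it __declspec(align(64))), with 64 bytes standing in for a typical x86 cache line:

```cpp
#include <cstdio>

// GCC/Clang extension: force each counter onto its own 64-byte cache line
// so two threads updating different counters don't false-share.
struct __attribute__((aligned(64))) PaddedCounter {
    long value;
};

int main() {
    PaddedCounter counters[2];
    std::printf("sizeof = %zu, alignof = %zu\n",
                sizeof(PaddedCounter), alignof(PaddedCounter));
    // Adjacent array elements land on different cache lines.
    std::printf("delta = %td bytes\n",
                reinterpret_cast<char*>(&counters[1]) -
                reinterpret_cast<char*>(&counters[0]));
}
```

That attribute is a vendor extension rather than part of the base language, which is exactly the kind of thing I'd like languages to model directly.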

People who write performance-critical code are already aware of these issues and can measure/hint/optimize to work with the memory hierarchy in existing languages. But I feel like this could be better if we had languages that modeled these issues up front instead of only exposing the abstracted view. Maybe such languages already exist and I just haven't encountered them yet.



Representing the memory hierarchy in a somewhat-high-level language was the goal of the programming language Sequoia, presented in the paper 'Sequoia: Programming the Memory Hierarchy' [1][2] from Stanford University.

Abstract:

"We present Sequoia, a programming language designed to facilitate the development of memory hierarchy aware parallel programs that remain portable across modern machines featuring different memory hierarchy configurations. Sequoia abstractly exposes hierarchical memory in the programming model and provides language mechanisms to describe communication vertically through the machine and to localize computation to particular memory locations within it. We have implemented a complete programming system, including a compiler and runtime systems for Cell processor-based blade systems and distributed memory clusters, and demonstrate efficient performance running Sequoia programs on both of these platforms."

[1] DOI: 10.1109/SC.2006.55

[2] http://graphics.stanford.edu/papers/sequoia/sequoia_sc06.pdf


(Disclaimer: I work on Legion)

Sequoia has been largely superseded by its spiritual successor Legion [1], another programming system by Alex Aiken. Legion doesn't focus so much on low-level memory hierarchies, but it does a very, very good job of scaling to very large machines and taking advantage of heterogeneous processors such as GPUs. It is also incomparably better at working with dynamic programs, whereas Sequoia needed to know basically everything about the program up front (including what machine it would run on).

If you're looking for a modern version of Sequoia that actually runs on modern hardware, I would strongly recommend looking at Legion. (Or the language component of Legion, Regent [2].)

[1]: http://legion.stanford.edu/

[2]: http://regent-lang.org/


I'm not sure if anyone is still reading this thread, but I wrote up some thoughts on the differences between Sequoia and Legion and some of the lessons we learned. Maybe this is a little too much "inside baseball", but hopefully some will enjoy it.

Sequoia was basically an answer to the question, "what could a compiler do for you if it knew everything about your program?" And I do mean everything: in addition to your source code, Sequoia had to be given, at compile time, the sizes of every input array and the exact layout of the machine it was to execute on. While these restrictions were lifted to some degree later on, Sequoia basically had to be able to statically compute every task that it would execute along with the exact sizes of all inputs and outputs. And with this information it was able to do some pretty [fantastic optimizations](http://theory.stanford.edu/~aiken/publications/papers/ppopp0...). I believe, though I can't find the reference at the moment, that the compiler was able to generate (from completely hardware-agnostic source code) implementations of BLAS routines competitive with Intel MKL. This was a pretty mind-blowing achievement, and to this day I'm not aware of any work that has really been able to equal it.

Requiring programs to be fully static also had a dramatic cost, one that turned out not to be so easy to lift. In what I believe was the [final paper to be published on Sequoia](http://theory.stanford.edu/~aiken/publications/papers/ppopp1...), Mike Bauer and others showed how to extend Sequoia to support limited forms of dynamic behavior. But the solution wasn't all that satisfying and the work to get there was fairly excruciating. Instead of continuing to work with the language, Mike ended up switching gears to start an entirely new project, Legion.

As an aside, one of the other lessons learned from the Sequoia project is that research projects tend to be tied to the Ph.D. students who work on them. In the Sequoia project, there was a gap in continuity between the first and second generations of students to work on the project. This meant that second-generation students like Mike basically ended up picking up a big legacy code base on their own. This turned out to be a huge drain on productivity and was among the reasons that Sequoia was eventually abandoned.

The biggest decision made about Legion up front was that it had to be dynamic. This was a huge risk because the analysis performed by Sequoia simply could not have been performed at runtime with reasonable cost. Sequoia relied on knowing every single task that would ever execute and the exact bounds of every array input and output. This meant that Sequoia could do perfect alias analysis on these input and output arrays. Because Legion would be a dynamic system, any analysis would have to be performed during the execution of the program. There was no way we were going to do this in Legion without fundamentally changing the abstractions of the programming model, and frankly it wasn't obvious up front if this would even be possible to make tractable.

One of the central insights of Legion was to [explicitly reify the notion of data partitioning](http://legion.stanford.edu/pdfs/oopsla2013.pdf) in the programming model. Sequoia didn't need this because it simply knew the inputs and outputs to every task. Legion couldn't afford to pay the $O(N \log N)$ cost to perform this analysis at runtime. By lifting partitioning into a first-class primitive, we were able to radically reduce the runtime cost of this analysis by leveraging what the user already knows about their data partitioning. Originally this required users to declare certain properties of their partitions, such as disjointness, in order to make things efficient. Eventually we discovered a [partitioning sublanguage](http://legion.stanford.edu/pdfs/dpl2016.pdf) which allowed us to statically discover most of these important properties, and to provide many static guarantees even in the presence of data-dependent behavior.

The other big insight of Legion was what we call index tasks. Index tasks are basically collections of tasks that can be described in $O(1)$ space. This turns out to be essential if you want to scale your system to very large distributed machines. Any representation which is $O(N)$ inevitably becomes a scalability bottleneck, either because of memory-capacity constraints or because of runtime complexity. By making the index task representation $O(1)$ we were able to design an [optimization to keep the runtime analysis complexity $O(1)$ as well](http://legion.stanford.edu/pdfs/cr2017.pdf).

One consequence of being a dynamic runtime system is that Legion itself isn't able to optimize the low-level code executed by the system. It seems to be a general principle that the closer you are to the hardware, the more static you need to be. Conceptually, Legion solves this problem by splitting it into two parts: the kernels, or low-level code that has to execute as efficiently as possible on the processor, and the orchestration of the parallelism that exists between those kernels. Legion focuses nearly entirely on the upper layer, leaving the lower layer to be handled by programmers. We've recovered some of this capability via the [Regent programming language](http://regent-lang.org/) which sits on top of Legion, but we haven't taken this nearly as far as Sequoia did.

Despite being a dynamic runtime system, we still put a lot of effort into making sure that it had a principled foundation. Legion is one of the few runtime systems I'm aware of that has its own [type system and operational semantics](http://legion.stanford.edu/pdfs/oopsla2013.pdf). This also made it relatively easy to develop Regent.

With Regent in some ways we've now come full circle, since we're back to having a programming language again. However, one key difference between Regent and Sequoia is what happens when the compilers run into behavior they can't analyze. In Sequoia, it's of the utmost importance that the compiler understand your program, because if the compiler can't prove that parallelism is safe, then it will serialize that part of the program. In contrast, if Regent can't determine that parallelism is safe, it will simply fall back to the runtime system. This means that in general Regent suffers from fewer performance cliffs, or places where performance can suddenly (and sometimes inexplicably) degrade.

There were some lessons we failed to learn from Sequoia. Sequoia made heavy use of [graph-based IRs](http://theory.stanford.edu/~aiken/publications/papers/ppopp0...) for its compiler. This was a design I copied when developing the [RDIR format](https://github.com/StanfordLegion/rdir) for Regent, a decision I now consider to be a mistake. Graph-based IRs are superficially appealing because they seem to represent the fundamental invariants that you want to reason about. That is, if two tasks are mutually unreachable in the IR, then they are provably safe to execute in parallel. However, working with the graph-based IR as a representation of the program turns out to be a huge pain. RDIR has been by far the biggest source of bugs in Regent and continues to be the largest maintenance burden. The other big motivation for a graph-based IR, that it makes certain aliasing properties of the program more explicit than a traditional CFG, is something that I now think could be done adequately well in more traditional formats.

One final note about hardware support: Sequoia made a big bet on the [Roadrunner supercomputer](https://en.wikipedia.org/wiki/IBM_Roadrunner). Roadrunner was an excellent match for Sequoia's programming model, because it used scratchpads (instead of caches) for the compute-heavy processing units. This meant that data (and even instructions!) had to be explicitly copied in and out of the memories of the individual processors, something Sequoia excelled at. Unfortunately, Roadrunner ended up not being so representative of the machines that would follow, and the effort required to retool all the infrastructure ended up being a big drain on the project. Though it would not surprise anyone if these ideas ended up resurfacing in the future, a project like this simply can't afford not to work on contemporary hardware.

In contrast, Legion made its bet on GPUs and heterogeneous architectures more generally. So far this appears to be turning out well and has positioned us to work well on architectures such as [Summit](https://en.wikipedia.org/wiki/Summit_(supercomputer)). However, investment in technologies such as LLVM mean that we're also in a better position to adapt if future technologies end up going in a different direction.


Thanks for this write-up. I can't wait to see what comes out of the Legion project next. It's one of the most exciting things going on right now, I think.


How does Legion differ from languages like Halide which separate scheduling from the algorithm?

How much of this could get handled by PGO (profile guided optimization) to observe locality and access patterns?

Do you think AI has a place in directing scheduling given enough semantic information so that "bad hypothesis" can get pruned?

Does it rely on a static description of machine (latencies and bandwidth) or does it evolve over time? Can it handle things that are dynamic but are thought of as static like memory bandwidth in a cloud environment?

How does Legion compare to cache oblivious techniques?

Are there changes at the processor level that could fuse operations given the existence of data in the cache, much like hyperthreading can do a context switch on a cache miss?

Do we need to modify what constitutes a Basic Block? Is modern hardware too low level?

Have you read the "Collapsing Towers of Interpreters" paper and do you think it applies to your work?


Wow, this is a tough set of questions! Some initial thoughts:

> How does Legion differ from languages like Halide which separate scheduling from the algorithm?

Yes, that's one of the similarities. Legion, Sequoia, and Halide all share the separation of the algorithm's specification from its mapping to any particular architecture.

The biggest difference between Legion and Halide, aside from the fact that Halide is much more domain-specific to image processing, is that Legion (being a dynamic runtime system) focuses more on higher-level orchestration of tasks executing in a distributed system.

> How much of this could get handled by PGO (profile guided optimization) to observe locality and access patterns?

Yes, this is called the Inspector/Executor technique. While it's an old technique, the most recent general-purpose implementation that I know of is this paper: [Code Generation for Parallel Execution of a Class of Irregular Loops on Distributed Memory Systems](http://web.cse.ohio-state.edu/~rountev.1/presto/pubs/sc12.pd...).

The main place where it struggles is memory capacity. It turns out that the profiles, when you're running on large distributed systems, can become very, very large. Keep in mind that modestly sized simulations these days run on hundreds to thousands of machines, and the largest run on tens of thousands of machines. So it's not hard to see why this can be problematic.

> Do you think AI has a place in directing scheduling given enough semantic information so that "bad hypothesis" can get pruned?

Yes. One nice property of Legion's approach to mapping is that no matter what mapping is chosen, the system will guarantee that the computation is still executed correctly. So the worst you can do is make yourself slower.

I think mapping, especially in the distributed setting, is one of the big unsolved problems. For domain-specific settings we have some preliminary answers. E.g. see this paper, which applies MCMC search to find optimal mapping strategies for DNN execution: [Beyond Data and Model Parallelism for Deep Neural Networks](https://arxiv.org/pdf/1807.05358.pdf)

> Does it rely on a static description of machine (latencies and bandwidth) or does it evolve over time? Can it handle things that are dynamic but are thought of as static like memory bandwidth in a cloud environment?

Legion's model of the machine happens to be static at the moment, because that's the most expedient to implement, but it was explicitly designed with dynamic behavior in mind. One of the biggest cases is where you lose a machine completely (e.g. because it dies, or your spot reservation gets revoked). I'm not sure if more fine-grained behavior can be exploited in a useful way; e.g. temperature fluctuations in a CPU might very well influence performance, but unless you can predict them I don't see what you can necessarily do about that. For many of these problems, being more asynchronous helps, because at least you can turn latency-limited problems into throughput-limited ones.

> How does Legion compare to cache oblivious techniques?

I'd say that cache oblivious techniques are particular algorithms (or perhaps mapping strategies) that you could implement in Legion, but which Legion is agnostic to. Legion provides mechanism but tries to avoid hard-coding any particular policy.
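To make "cache oblivious" concrete for other readers: the classic trick is to recurse until the working set fits in whatever cache happens to be there, without ever naming a cache size. A rough sketch of my own (not Legion code), a recursive out-of-place matrix transpose:

```cpp
#include <cstddef>
#include <vector>

// Cache-oblivious transpose: split the larger dimension recursively so that
// every level of the cache hierarchy gets well-sized blocks, without the
// code ever knowing L1/L2/L3 sizes.
void transpose(const double* src, double* dst,
               std::size_t n_rows, std::size_t n_cols,   // full matrix shape
               std::size_t r0, std::size_t c0,           // block origin
               std::size_t rows, std::size_t cols) {     // block shape
    if (rows <= 16 && cols <= 16) {
        for (std::size_t i = r0; i < r0 + rows; ++i)
            for (std::size_t j = c0; j < c0 + cols; ++j)
                dst[j * n_rows + i] = src[i * n_cols + j];
    } else if (rows >= cols) {
        transpose(src, dst, n_rows, n_cols, r0, c0, rows / 2, cols);
        transpose(src, dst, n_rows, n_cols, r0 + rows / 2, c0, rows - rows / 2, cols);
    } else {
        transpose(src, dst, n_rows, n_cols, r0, c0, rows, cols / 2);
        transpose(src, dst, n_rows, n_cols, r0, c0 + cols / 2, rows, cols - cols / 2);
    }
}

int main() {
    const std::size_t R = 1024, C = 768;
    std::vector<double> a(R * C, 1.0), b(C * R, 0.0);
    transpose(a.data(), b.data(), R, C, 0, 0, R, C);
}
```

In Legion terms, that whole recursion is just one possible kernel/mapping choice; the runtime doesn't care which one you pick.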

> Are there changes at the processor level that could fuse operations given the existence of data in the cache, much like hyperthreading can do a context switch on a cache miss.

We don't do this on quite such a low level, but we can do it at higher levels. Mappers can select the ordering of tasks on a processor, as well as the placement and layout of data, which is in theory sufficient to play with this. In the past Legion's overheads were too high to really think about this, but we've made some significant improvements in overhead recently which could allow us to start to think about things at this granularity.

> Do we need to modify what constitutes a Basic Block? Is modern hardware too low level?

I know there are people thinking about this, though I'm not one of them. I can't recall any references at the moment.

In general, I think systems like Legion make us much less dependent on caches. If the system manages data movement in and out of explicitly managed memories, that gives us a lot more freedom to play with the architectures. And increasingly I think we'll need this flexibility if we're going to keep driving performance improvements. As one of my friends once said, "x86 is a platform for making shitty code run fast", but perhaps we don't have that luxury any more.

> Have you read the "Collapsing Towers of Interpreters" paper and do you think it applies to your work?

Haven't read it yet, but it sounds like it shares some ideas with [Terra](http://terralang.org/), which we do use heavily. In fact, the Regent language is built entirely in Terra. In general, these sorts of techniques make it much faster to build new languages; while it's not really in my direct line of research, I'm very grateful people are working on these things.


Thank you for such a detailed response, I am grateful.

You mentioned Terra, which is an inspiration of mine. I really enjoy mixing high-level and low-level code in the same program, which is one of the major affordances of LuaJIT.

Hugo M. Gualandi and Roberto Ierusalimschy have released the core Lua team's version of Terra, Pallene [1][2], which applies the austere aesthetic of Lua to the ideas of Terra. Pallene looks to be the next incarnation of Titan [3].

[1] http://www.inf.puc-rio.br/~roberto/docs/pallene-sblp.pdf

[2] https://github.com/pallene-lang/pallene

[3] https://github.com/titan-lang/titan


P.S. Shameless plug: If you're interested in working on this sort of stuff, we're hiring for full-time staff scientists as well as interns. Drop me a line! (Contact in profile.)


3rd party vouch for shameless plug: I was an RA in a minor role on this team, and these are some of the friendliest and most approachable people I've had the pleasure to work with in academia. (Hi Elliot! \o/)


Thanks, Ludwig!


Sequoia is what I thought of at first (mentioned in sibling comment), but Scala has something related:

http://scala-miniboxing.org/ldl/

Basically you can write an algorithm that is independent of the exact data layout, and the compiler then has some flexibility in choosing that layout.

I haven't worked with it at all, just read the paper a while ago. But I'm pretty sure they did something with SoA/AoS transformation (structure of arrays / array of structures). At the very least, the compiler can make the data smaller, which fits better in cache. Not sure if it can do anything more advanced than that (i.e. scaling to multiple levels of memory hierarchy rather than compacting data).
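To spell out the SoA/AoS terminology, here is a hand-written C++ sketch of the two layouts (my own illustration, not what that compiler emits):

```cpp
#include <vector>

// Array of Structures: a particle's fields sit together, so a loop that
// reads only `x` still pulls y, z, and mass through the cache.
struct ParticleAoS { float x, y, z, mass; };
using ParticlesAoS = std::vector<ParticleAoS>;

// Structure of Arrays: each field is contiguous, so a loop over `x` streams
// densely packed data and vectorizes easily.
struct ParticlesSoA { std::vector<float> x, y, z, mass; };

float sum_x_aos(const ParticlesAoS& p) {
    float s = 0.0f;
    for (const auto& q : p) s += q.x;   // 16 bytes fetched per 4 bytes used
    return s;
}

float sum_x_soa(const ParticlesSoA& p) {
    float s = 0.0f;
    for (float v : p.x) s += v;         // every byte fetched is used
    return s;
}
```

The point of doing this in the compiler is that you write the first form and it is free to choose the second.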

My memory of Sequoia is that it's very very far semantically from "Python" (mentioned in the original article). Their example was matrix multiplication, and the language had much more of a functional flavor (perhaps unsurprisingly). I'll have to go back and check it out again.

You can also think of MapReduce as a programming model that is locality-aware, and there have been several papers about multicore (not distributed) MapReduce.

You partition your data into smaller sections, then a user-defined "mapper" runs on each segment. The shuffle is an efficient distributed sort which you don't have to write -- it can be specialized to different problem sizes. And then the user-defined reducer also operates on a (different) small partition of data.
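As a toy illustration of that shape (my own sketch, not from any of the multicore MapReduce papers): mappers run over partitions in parallel, and a merge stands in for the shuffle before the reduce.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <thread>
#include <vector>

// Toy multicore word count: one mapper thread per partition, then a
// shuffle/reduce that merges the per-partition counts by key.
std::map<std::string, int>
word_count(const std::vector<std::vector<std::string>>& partitions) {
    std::vector<std::map<std::string, int>> partial(partitions.size());
    std::vector<std::thread> workers;
    for (std::size_t p = 0; p < partitions.size(); ++p)
        workers.emplace_back([&, p] {
            for (const auto& w : partitions[p]) ++partial[p][w];   // mapper
        });
    for (auto& t : workers) t.join();

    std::map<std::string, int> result;                             // shuffle + reduce
    for (const auto& m : partial)
        for (const auto& [word, count] : m) result[word] += count;
    return result;
}
```

The locality comes from each mapper touching only its own partition, which is exactly the property a memory-hierarchy-aware runtime wants to exploit.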

I think the real problem is that the ISA doesn't really expose the memory hierarchy to the programming language, not necessarily that the programming language doesn't expose it to the programmer. GPGPU might change that but I'm not entirely familiar with it.


One of the differentiating features of Jonathan Blow's language, Jai, is supposed to be AoS/SoA transformation. I wasn't aware other languages were experimenting with this as well.

I thought it was something fairly specific to game programming, but I guess it shouldn't be surprising that other performance sensitive domains might also find it useful.


That was true long ago when he demo'd such a feature, but it seems to have since disappeared from the language. Jon claims there's "something better" that replaces it, a mysterious yet-to-be-revealed feature.


Welp, architectures don't really "expose" cache characteristics either in a first-order fashion; they also go along with the idea that it's all just "memory" like you say. Obviously you can query your cache characteristics but there's only so much you can do.

Some embedded systems had scratchpads and it is possible to 'wire down cache' on some architectures to similar effect. But it would be tricky in general at the language level.

I have a project half on the boil that's aiming to produce a language aimed at these kinds of usages. I'm figuring that the supposed insult hurled at C ("structured assembly") isn't really an insult at all but that someone should actually do that. At the moment C, with its giant raft of Undefined Behavior, isn't really that language any more.


On GPUs you can program the L1 cache, at least. I’ve never had much success with it though: on more recent architectures I found the default L1 and the schedulers do a good job of keeping occupancy high, and by wiring it you either don’t win anything or it often even gets slower - and there’s no way to know in advance, since the scheduler and cache architecture are largely a trade secret. In general, GPUs are optimized to use their high memory bandwidth through an extreme amount of hyperthreading, and the number of possible threads goes down with the resources your kernel occupies.


C++ has alignas since C++11 https://en.cppreference.com/w/cpp/language/alignas

and since C++17 has std::hardware_destructive_interference_size and std::hardware_constructive_interference_size https://en.cppreference.com/w/cpp/thread/hardware_destructiv...

which you can use together to either make sure that 2 variables are not in the same cache line (if you are making a lock-free list for example) or to make sure at compile-time that 2 variables are close enough to share a cache-line.
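A small sketch of both uses, assuming a standard library that actually ships the interference-size constants (not all C++17 implementations do yet):

```cpp
#include <atomic>
#include <new>  // std::hardware_{destructive,constructive}_interference_size

// Destructive interference: keep the two counters on separate cache lines so
// threads writing to different ones don't false-share.
struct Counters {
    alignas(std::hardware_destructive_interference_size) std::atomic<long> produced{0};
    alignas(std::hardware_destructive_interference_size) std::atomic<long> consumed{0};
};

// Constructive interference: keep a flag and the data it guards small enough
// that they tend to ride in on the same cache line.
struct Guarded {
    std::atomic<bool> ready{false};
    int payload{0};
};
static_assert(sizeof(Guarded) <= std::hardware_constructive_interference_size,
              "flag and payload should share a cache line");
```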


Thank you for pointing out that gem. Good to know!


The citations for Sequoia are also worth a look:

https://scholar.google.com/scholar?cites=1705642111681573037...

I see Halide and several interesting projects. Let me know if you find anything interesting :)

It does seem to confirm that locality aware programming models are probably going to look like "big data" programming models (i.e. MapReduce and family).



Edit: RcouF1 posted a more thorough response covering cache coherence as well earlier, which I hadn’t seen. Read that instead. https://news.ycombinator.com/item?id=18010256

Original post:

As does C++ as of C++11. [0] For heap allocations, one used the POSIX (not stdlib) posix_memalign until C++17/C11, when std::aligned_alloc became standard. [1] The other option was manually aligning heap memory, but it’s so much more convenient to use a transparent aligned allocator. (A quick sketch follows the links below.)

[0] https://en.cppreference.com/w/cpp/language/alignas

[1] https://en.cppreference.com/w/cpp/memory/c/aligned_alloc
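A minimal sketch of the heap-side options above, assuming a platform that provides std::aligned_alloc (MSVC notably doesn't):

```cpp
#include <cstdlib>   // std::aligned_alloc, std::free

// C++17 / C11: the requested size must be a multiple of the alignment,
// so round it up before allocating.
float* make_cacheline_aligned(std::size_t count) {
    constexpr std::size_t alignment = 64;   // typical cache line
    std::size_t bytes = ((count * sizeof(float) + alignment - 1) / alignment) * alignment;
    return static_cast<float*>(std::aligned_alloc(alignment, bytes));
}

int main() {
    float* buf = make_cacheline_aligned(1000);
    if (!buf) return 1;
    // ... use buf ...
    std::free(buf);   // aligned_alloc memory is released with free

    // Pre-C++17 POSIX route:
    // void* p = nullptr;
    // if (posix_memalign(&p, 64, 1000 * sizeof(float)) == 0) free(p);
}
```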


Distinguishing stack from heap memory accesses, and allowing processors to skip checks such as comparing stack reads against pending heap stores, could certainly have performance benefits in the memory system. The Reduceron does this, since Haskell doesn't have the sort of free pointers that make it infeasible in C:

https://www.cs.york.ac.uk/fp/reduceron/


Assembly typically has you declare your own alignments. There are some higher-level assemblers, with macros and things, but you're right: there's rarely an abstract language offering low-level memory specification.

I also dislike the tendency to only expose the abstract view, although I think that trend is due to the difficulties involved in designing a language and compiler whose job is to output code for multiple architectures. Well, at least I hope that's why, because I don't know why you'd choose to offer less control in your language otherwise.


Probably a dumb question, but if you exposed low-level memory manipulation, wouldn't you just end up with something like assembly?


I don't think it's a dumb question. And I think you're right: the closer you get to the way your target CPU likes its memory managed, the closer you are to assembly.

I have notions that there can be something 'between' assembly and, for instance, C, but that's a daydream of mine without any details worked out yet.


I’m not sure you’re actually heading down the path of assembly in terms of the language itself, but you’re definitely moving towards the architectural lock-in of assembly.

But maybe that's fine if the language can express architecture-specific decisions (like memory layout) and abstract them away in a more sane fashion than ifdefs, so the rest of the codebase can maintain the pretense of portability.


Low-level memory work, if you care about things like data locations and the exact layout of memory, is possible in practice with C. There are cases where you can run into aliasing problems, but controlling the actual memory layout itself is mostly a non-issue (with some limitations).


niftich, chubot: Thanks for the pointers to Sequoia and citations thereof. It looks very interesting. That is the sort of thing I had in mind when I asked my question. There are a lot of citations to follow up on.

the8472, RcouF1uZ4gsC, stochastic_monk: Thanks for highlighting the language-level alignment support in Rust and newer revisions of C and C++.

glangdale, regarding "I have a project half on the boil that's aiming to produce a language aimed at these kinds of usages. I'm figuring that the supposed insult hurled at C ("structured assembly") isn't really an insult at all but that someone should actually do that." -- That's great! I would like to see that language.


> But I'm not aware of languages where e.g. declaring alignments is part of the base language

It is part of the base language in Zig: https://ziglang.org/documentation/master/#Alignment


There's interesting work being done by the guys working on the Unity game engine. Their new data-oriented programming model (still based on C#) promises efficient packing of objects into arrays of properties which (supposedly) results in massive performance gains thanks to sequential processing. They also have a custom compiler optimized for this scenario. However, AFAIK there are no plans to provide this separately from the engine (on the other hand, they've been separating out a lot of functionality into optional modules so you can go really light).


XLA, the JIT compiler in TensorFlow, can hide this from you. It's a thing of beauty. It can tile and vectorize to match the memory hierarchies of CPUs, GPUs, and the family of TPUs.


It seems to me that it is generally better to keep architecture-specific memory commands in a platform-specific library that is imported depending on where the code is being compiled (separation of concerns). Imagine the number of inappropriate or unnecessary commands you would create if you moved the superset of all possible memory commands, across all possible architectures, into the language itself. There would be enough to make it ugly.


There are bit fields in C++.

https://en.cppreference.com/w/cpp/language/bit_field

If you don't need the full precision or range of a 64-bit integer, you can pack several smaller fields together so that multiple values are transferred in a single cache-line read. Compressing data in this manner also means you are more likely to get a cache hit.
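A quick sketch of that kind of packing (assuming the values actually fit in the narrower widths; exact layout is implementation-defined, but this holds on mainstream compilers):

```cpp
#include <cstdint>

// Four small fields packed into one 64-bit word instead of four separate
// 64-bit integers, so eight records fit in a single 64-byte cache line.
struct PackedRecord {
    std::uint64_t id       : 40;  // ~1.1e12 distinct ids
    std::uint64_t category : 10;  // up to 1024 categories
    std::uint64_t flags    : 8;
    std::uint64_t version  : 6;
};

static_assert(sizeof(PackedRecord) == 8, "all fields share one 64-bit word");
```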


That's important for security, as we have seen with Meltdown and Spectre.

The problem is that general-purpose architectures don't expose the cache hierarchy very well.


Wouldn't it be better if the slow RAM we have were replaced more and more by the fast RAM used in caches, before we do anything else?


Caches are significantly less dense than normal RAM. You wouldn't be able to fit gigabytes in the same area. They would probably be more expensive as well.


Cache sizes have been increasing for years. I'm not sure what you're suggesting, or how it would detract from this software-oriented approach.


The closest I can think of is shared memory in CUDA / local memory in OpenCL.



