Hacker Newsnew | past | comments | ask | show | jobs | submitlogin
Why Python, Ruby, and Javascript are Slow (speakerdeck.com)
367 points by jasonostrander on March 1, 2013 | hide | past | favorite | 200 comments


Speaking as a compiler guy, and having a hand in a few successful commercial JITs: The only reason he thinks they aren't slow is because they haven't yet reached the limits of making the JIT faster vs the program faster. Yes, it's true that the languages are not slow in the sense of being able to take care of most situations through better optimization strategies. As a compiler author, one can do things like profile types/trace/whatever, and deoptimize if you get it wrong. You can do a lot. You can recognize idioms, use different representations behind people's back, etc.

But all those things take time that is not spent running your program. On average, you can do pretty well. But it's still overhead. As you get farther along in your JIT, optimization algorithms get trickier and trickier, your heuristics, more complex. You will eventually hit the wall, and need to spend more time doing JIT'ing than doing real work to make optimizations to some code. This happens to every single JIT, of course. This is why they try to figure out which code to optimize. But even then, you may find there is too much of it.

Because of this, the languages are slower, it's just the overhead of better JIT algorithms, not slower code. In practice, you hope that you can optimize enough code well enough that nobody cares, because the ruby code takes 8ms, and the C code takes 5ms.

For example: Almost all of the allocations and copying can be optimized, but depending on the language, the algorithms to figure out what you can do safely may be N^3.

Also, PyPy is still pretty young in its life cycle (in this iteration of PyPy:P) for folks to say that they can make stuff much faster if they only had a few things. It really needs a very large set of production apps being rin by a very large set of folks for quite a while to see where the real bottlenecks still are. Past a certain point, you run out of optimization algorithm bullets. The way compilers get the last 20% is by tuning the algorithms for 10 years.

Of course, i'm not trying to slag on PyPy, I think they've done an amazing job of persevering through multiple rewrites to get somewhere that seems to be quite good now. I just am a little wary of a fairly young JIT saying that all big performance problems fall into a few categories.


Interesting point of view, the problem in compiler construction is well known ("Proebsting's law", though it says it's more like 18 years instead of 10.)

The issue with benchmarks is surely well known, also by the PyPy authors; I wonder what the biggest application is that they have benchmarked or that runs on PyPy.

Your point on the JIT compiler interrupting program execution is certainly valid, too, but not necessarily so. One could easily do the code generation in a separate background thread and let execution switch over only if necessary. But, as you have already said, a latency issue certainly exists. This is one of the cases where interpreters usually have a leg up, and there are promising ways of optimizing interpreters.


Yes, you could do a background thread, with some caveats:

1. On most current CPU's, this will cause really bad cache/memory thrashing, enough to probably impact the program.

2. This may actually cause significant slowdown, depending on how long it takes to optimize a given set of code (IE it may be better to spend 100ms paused optimizing than 5000ms in the background). This is, of course, a latency issue.

3. State of the art for most JIT's is still to use one thread. The number of folks doing actual parallel code generation is nil. So sadly, even if you had 4 cores, 3 empty, you'll still, at best, get to use one of them for the background thread doing the optimizing. There are parts that are trivial to parallelize if you've structured the JIT "right", but they aren't always the parts that are slow.


Background compilation in a separate thread actually works pretty well. IE9 has been shipping it with Chakra for a while, and Firefox is now getting it (and it improved the benchmarks a lot, especially on ARM).


Good to hear it's gotten better. Admittedly, I wasn't thinking about browser based JITs when I said that :)

I'm actually curious if you have any stats on how much of the time this is being done on actual busy machines where it's going to compete for L1/etc resources vs how often it's able to be offloaded onto an otherwise empty core.

IE i expect their to be a significant difference in the use cases for JIT's like PyPy, which are probably going to sit on shared servers that folks are trying to maximize utilization of, vs desktops where I imagine most browsing probably doesn't use all cores at 100%.


> Admittedly, I wasn't thinking about browser based JITs when I said that :)

Don't HotSpot and JRockit also do background (de)compilation & swapping of generated code?


Yes, but in hotspot's case I cannot remember if it is actually turned on in both "server" and "client"


Aren't server and client not now merged with tiered compilation in Hotspot?


No, AFAIK. "Tiered compilation, introduced in Java SE 7, brings client startup speeds to the server VM. ... Tiered compilation is now the default mode for the server VM. "

Again, AFAIK, the server VM still has a significantly different set of tuning than the client VM. In particular, it runs some significantly more complex opts that the client VM does not.


ad 1) Hm, this seems to be a good point, but what's with the following line of thinking: some thread A interprets a program P, while another thread B compiles P to native machine code (P'). Now, if another thread C would start executing P' (taking the data/snapshot from A), then C's caches should build up and remain accurate. Of course, if this happens too often, then the caching behavior will be shitty. I always wondered (based on my interest in interpretation), how much I-cache misses the instruction cache flushes after inline-caching in native machine code cause. (If you have some data on that, please let me know.)

ad 3) I am well aware of that. However, I remember that at PLDI'11 there was a talk from Univ. of Edinburgh chaps doing parallel trace-based dynamic binary translation. Obviously, DBT is less work than a high level, full-blown JIT, but at least it's not nil :)


I wonder what the biggest application is that they have benchmarked or that runs on PyPy.

speed.pypy.org has benchmark info on Django, Twisted and some other large, non-trivial codebases.


I know these from the site, and have looked at the Django benchmark that's listed there. I think it's a rather small benchmark that does exercise lots of Django internals (but that benchmark comes from Unladden Swallow). I don't know for the twisted ones, though.

What I actually wanted to know, what the biggest application is, i.e., a not benchmark.


We're in no shortage of production code. The problems are quite trivial - two people doing the same think might have different stuff in mind (so you cannot have a heuristic that works for everyone) and the fact that we're running on a shoestring budget compared to other JIT-for-dynamic-language projects. We don't even have 3 people full time.

As for the "single thing" - it's just the next thing on the infinite list of things that can be optimized better. Having a better assembler backend would be good (but smaller) win, etc. etc. For now and for quite a bit in the future, it's clear what to do to make X Y or Z faster.


I know it's for a simpler core language, but LuaJIT is quite a shoestring project too, with one main developer. Yet LuaJIT posts impressively fast results, even though Lua makes no distinction between objects and hash tables. In this and other ways, Lua can be compared to Javascript...but it does have a fast, high-quality JIT implementation, despite not a ton of money or even mindshare. I'm curious why you think this is (or if you think the facts are different).

P.S.: "hash_set" in the slides should be "hash_map."


I know, Mike is really good. However, this sort of approach does not scale toward larger teams hence the limits of what sort of language you can potentially implement. Lua is much much simpler than Javascript, which is again simpler than Python (Python is really vast). Also, can you find two Mikes?


Lua does not seem simpler than javascript in any way that would affect a JIT compiler.

Lua is a fairly simple language for the user, in the sense that it uses a few general mechanisms rather than a plethora of more specialized ones. However this "simplicity" can be misleading because these general mechanisms are very powerful, and can generally be used to implement most things other languages have specialized abstractions for. This sort of powerful and general mechanism is actually fairly hard to write an optimizing compiler for, because there's little stated explicitly about what's going to happen at runtime, and few obvious constraints about what's allowed to happen.

The reason LuaJIT (and most modern JIT compilers, although LuaJIT seems better than average) is so fast is because it uses very very local (in terms of both location and time) runtime context to know when to specialize operations that are conceptually much richer, e.g., "this number is really an integer" (Lua does not have a separate integer type), or "I know what function/operator will be called here (even though it conceptually dispatches through a table or metatable), so I can inline it or use a direct call."

I'm less sure about python (never used it much), but almost all the differences seem be user-facing ones -- large amounts of syntax-sugar for things that Lua offers sufficient mechanism for, but no built-in UI (in the programming-language sense). As it's the power/complexity of the mechanism that matters, not the UI, such user-facing richness doesn't really affect a JIT compilers.


I claim you're wrong. I've been working on a range of JIT compilers, but the PyPy is the biggest one. The problem is not "how hard the semantics are", but "how much of the semantics you need to support", especially that interaction between them is a headache. Lua does not have the equivalent of where statement in JS (albeit most JIT compilers just don't optimize it) for example.

I won't argue about the JS (although DOM interaction comes to mind as a big headache), but in Python syntax is trivial. It's all the semantics and not just how they're, but how much of it. descriptors, metaclassses, new/old style classes, tons of builtin types, tons of builtin modules, all of it the user will expect to seamlessly integrate with the JIT compiler.


Mike Pall + Lars Bak = JIT Dream Team :)


I develop a compiler for ActionScript 3. Though it’s not a great language, it does have a few distinct advantages over its cousin JavaScript. First, type annotations. Obviously the more knowledge you have about the structure of the program, the better you can help it run well. Having a class model instead of a prototype model also helps—much as I like working with prototypes—because you can easily eliminate late binding (“hash lookups”) and allocations, two of the performance killers mentioned in the slides. The operative word there is easily. You can analyse and JIT all you want, but you cannot feasibly solve at runtime the fundamental constraints imposed by the language.

Say a compiler author needs twice as much effort and cleverness to make programs in language X run fast than for language Y. That means that—all other things being equal—implementations of X will be twice as slow as Y for the same basic quality of implementation.


> For example: Almost all of the allocations and copying can be optimized, but depending on the language, the algorithms to figure out what you can do safely may be N^3.

... and this is pretty much his point: He can keep optimizing, but the moment you start passing complex objects around and copying them all over the place, instead of passing raw buffers around and operating on them in place, you've massively raised the bar in terms of the complexity of the necessary optimizations needed.


.) Just wondering, are there any languages/runtime-systems based on the idea of a fixed memory layout? No heap just "preallocated" buffers? I remember that was / probably still is quite common in the embedded world. I guess it is done easily with globals in C

.) Any profilers providing information reg. the heap-allocs as part of the execution costs?

.) Any Runtimes / VM actually optimizing the layout of those omni-present List/Array/Hashtable/Bag/Set/Dict of MyObjectTypes to have elements laid out as close as possible in memory? (The position of the actual objects that is not only the pointers to the objects within the containers)

Are we "safe" as heap allocs / gc / memory handling is really of no significant impact compared to other issues? Or is memory handling an large part of the "What Andy giveth, Bill taketh away." story?


No heap? Hmmm... As the slidedeck pointed out, much time is spent in both allocation and copying of memory, and in practice those things are most often strings. Given that 1) the languages under consideration declare strings as immutable, and 2) the strings enter the system at fixed points (e.g. source code or external inputs), and 3) exit at fixed points (e.g. sent out a socket).... Imagine a system that does not allow heap allocation or memory copying at all. Just doesn't allow them!

    var structure = "dog" + "house"
    print structure
does not need to allocate or move memory. That's how the script languages do it now, but should they? All the really need to do is print "dog" and then print "house", there's nothing that says those 8 letters need to be ever adjacent in memory.

Since it's so darn easy (and sloppy) to call malloc() and memcpy() the script-engine writers are going to keep doing so until they're given a MacrocosmicGod-like environment that won't allow malloc() or memcpy() and so will have to come up with a clever workaround. maybe it will be something like linked-lists of immutable blocks of memory? Something like ropes? Something smarter than that?

Anyway, script-engine writers, please forget that malloc() and memcpy() exist, and see what you come up with. It'll be wonderful. (P.S. I was in the script-engine business for a dozen years and didn't solve the problem, but that's just because I'm not smart enough.)


.) At our workplace, we use preallocation for almost everything. When we do dynamic allocations, typically it is from free-lists from pools that we pre-allocate.

Almost all of our allocation/freeing costs are cheap O(1). Pretty much everything is zero-copy. That is why I cringe a little when I hear that GC's are "faster than manual MM". Because manual MM done well can eliminate almost all of the costs of memory management.

I also cringe a little when I see "malloc" deep inside C functions.

I really like the basic convention in C that you should pass needed allocations in as arguments (and it is virtually always possible) allowing these allocations to be members of of members of other allocations, aggregating allocations so there are much fewer.

.) Information regarding heap-allocs is relatively easy to obtain with "oprofile", look up the time spent in "malloc" and "free" and the related functions underneath them.

.) C makes it very easy to use an array of the values you want, rather than an indirection. For Lists, I don't think that is a useful idea. For hash tables, given that many entries are empty, I think it is a better trade-off to have your hash array as an indirection, though the buckets could be explicitly put together. And so on.

I often encountered dynamic allocation overheads are noticeable or even huge chunks of my runtime when I had to work with code that used them. Reducing the dynamic allocation really helps. We're not safe, we need to work towards it explicitly.


So does the JIT optimization algorithms depend on popular conventions and patterns of how people write code with a language? Like in JS theres bad language features people avoid and certain patterns on how to code something. If such things changed would the optimizations start failing? I guess I'm wondering if speed is somewhat related to trends in that language.


Once you get past generic dataflow based optimizations that have some notion of optimality[1], yes, you start to write optimizations that target idioms.

For example: Most compilers started doing structure layout and reorg optimizations (transforming structures with arrays into arrays of structures, and vice versa) to tackle specific benchmarks. In some cases, they discovered it also was useful generally.

But whatever the benchmarks are, that's often what gets targeted. You can't optimize in a vacuum, you need to know what you are trying to develop optimization algorithms to do. This usually occurs by taking user complaints/programs/whatever, finding out why they are slow, seeing if there is a common solution, and developing an optimization to do it.

So yeah, if you are targeting certain patterns, and the patterns change, ..

[1] IE You've removed all possible redundant computations, made it so everything is only computed among paths it is actually used, the calculations occur in lifetime optimal fashion, etc.


To counter the overhead of the JIT algorithm, would it be possible to do that work in a background thread or cache the JIT optimizations to disk?


See above. It varies. In some cases, it may be worse to do this than to stop the program. It's not as simple as "always work in the background until ready" .


Related to this is the importance of deforestation. Some good links:

* http://en.wikipedia.org/wiki/Deforestation_%28computer_scien...

* http://www.haskell.org/haskellwiki/Short_cut_fusion

Deforestation is basically eliminating intermediate data structures, which is similar to what the "int(s.split("-", 1)[1])" versus "atoi(strchr(s, '-') + 1)" slides are about. If you consider strings as just lists of characters, then it's basically a deforestation problem: the goal is to eliminate all the intermediate lists of lists that are constructed. (It's something of a peculiar case though, because in order to transform into the C code you need to not only observe that indexing an rvalue via [1] and throwing the rest away means that the list doesn't have to be constructed at all, but you also need to allow strings to share underlying buffer space—the latter optimization isn't deforestation per se.)

I don't know if there's been much effort into deforestation optimizations for dynamic languages, but perhaps this is an area that compilers and research should be focusing on more.

On another minor note, I do think that the deck is a little too quick to dismiss garbage collection as an irrelevant problem. For most server apps I'm totally willing to believe that GC doesn't matter, but for interactive apps on the client (think touch-sensitive mobile apps and games) where you have to render each frame in under 16 ms, unpredictable latency starts to matter a lot.


Automatic deforestation can remove intermediate results in a pipeline of computation, but it can not rewrite a program that is based around the querying / updating of a fixed data structure to use an efficient imperative equivalent throughout.


Deforestation is easily done in lazy languages like Haskell.

As for GC, it would be nice to have good real time GCs in runtimes.


> As for GC, it would be nice to have good real time GCs in runtimes.

After decades of GC research, I think the conclusion is, "Yeah, that would be nice." Current state of the art gives us some very nice GCs that penalize either throughput or predictability. One of my favorite stories about GC is here:

http://samsaffron.com/archive/2011/10/28/in-managed-code-we-...


It has nothing to do with laziness. It has everything to do with the guaranteed absence of side effects.

Deforestation is /more useful/ in strict languages, because allocation of temporary structures costs more. So fusion on strict arrays is better than on lazy streams.

You just can't do it unless you can freely reorder multiple loops, and to do that you need a proof there are no side effects. Haskell just makes that trivial.


"Deforestation is easily done in lazy languages like Haskell."

You can also do it in stream-based or data-flow-based languages. Or in pretty much any DSL you decide to implement, if the semantics of the language itself is reasonable.


Try telling that to a dedicated C++ guy. Apparently the C calls are dangerous, dirty and just to be avoided.


Fully agree. Only when C goes away we have the possibility to have more secure software.


Yeesh.

When incompetent programmers go away perhaps. C in and of itself is not the issue. And neither is scaring people away from it with horror stories.

I've run into young engineers recently who thing that pretty much any 'C style' system call is necessarily dangerous, because OMG the developer has to remember to pass in the length of the buffer they're passing in as well as the buffer itself. No, you just have to not be a frickin' idiot.


Incompetent programmers are here to stay as they are cheaper to hire.

We need languages that are not designed to allow security exploits by accident, like Ada, Modula-2 or any other in the same school of thought.


A nice talk. The punchline for me was:

    Things that take time
     •Hash table lookups
     •Allocations
     •Copying
Interestingly, that's exactly how you write fast C++ code. His point is that languages like Python lack good API's for preallocating memory.


It's how you write fast algorithms in general, in any programming language. Minimize the number of reads and writes per iteration/recursion.

In higher-level programming languages, it's just a bit harder to control the number of reads and writes because you're working at several layers of abstraction above them, and are concerned with solving higher-level problems. Use the language that provides the appropriate level of abstraction for the problem you're trying to solve.


On the other hand, why should the abstraction layers prevent that? I mean, abstraction layers abstract away the [hopefully] unimportant low-level choices from me - but "copy or not copy" or "allocate once or allocate thrice" isn't a choice that I need to make anyway; the abstraction layer simply should make the 'non-copy' choice for me. Exactly the same way that the C abstraction layer right now makes the proper opcode-ordering choices for me as good (or better) than I can do manually in assembler.

The problem is that we haven't yet implemented those abstraction layers in this smart way - for example, Haskell can implement 'fusion' of multiple string operations so that they are merged together and executed without intermediate copies; and the abstraction layer for that is exactly as high-level as the Python examples in original poster's slides. Sure, it's objectively hard to change core Python like that - but it theoretically can be done, so it should&will be done.


its not always possible to go with the "non-copy" choice. for example, there are very good reasons for having immutable strings, and once you've made that choice at the language level every string function you write is going to copy at least once.

I think Alex Gaynor is correct and that basically what is wrong at the moment is that dynamic languages lack API's that have any sensitivity to performance concerns. There's always going to be a hard limit based on the nature of using a JIT vs. a static multi-pass compiler. There's always going to be a hard limit based on fundamental language choices (implementations of primitives, mutable vs. immutable strings, amount of overhead in object instantation, etc.) But we're nowhere near those limits right now.


Compilers can work around that choice - if you want to do something with immutable strings (like the Haskell example I mentioned, it does have immutable strings) then you will have to make some copy, but you don't need to make a copy per every function - if you're stringing three string functions in a row, the compiler can "fuse" the processing so that only a single, final copy is made, not the intermediate ones.

For any language the compiler may know which variables won't ever be used - for example in pseudocode

  b = a.lowercase()
  c = b.replace("x","y")
  d = a.lowercase.replace("x","y")
both 'b' and the intermediate result in 'd' are strings, but the compiler can flag these two 'throw-away' variables as mutable strings (while still maintaining the promise that all programmer-visible strings will be immutable); and you may have a special version of 'replace' standard function that does no-copy, in-place replacement in such cases. It means extra work in building API/stdlib, but brings better performance for the same programs.


Meh, MEH.

I'm almost never waiting on my python code. I'm waiting on network or disk or database or joe to check in his changes or etc.

I'm sure there are people who do wait. But that's why numpy, c extensions, all the pypy, psycho, and similar things exist.

Python and more broadly "scripting" languages are for speed of development. Something else can take on speed of execution faster than 90% of people need it to be.


It's not an unresolved question whether idiomatic Python is slower than idiomatic C/C++ for solving comparable problems. Python is much, much slower than C.


This is completely true. It is indeed well known and common wisdom.

However, I think the point the parent was trying to make is: Python is much slower than C and many other languages, however most of the time speed is unimportant. When it becomes important, there are many technologies to mitigate the problem in your "hot loops."

If speed is your primary concern don't use Python et. al. If it isn't your main concern go ahead it probably won't become an issue and if it does you probably will be able to get around it.


I think just saying don't use python where speed is important is missing the whole point of this slideshow. The author is saying, hey, if we thought about this and put in a few different features we could make it a lot faster without programmers having to do much other than use an alternate implementation of some data stuctures.

While I'd agree that for 99% of us we're not going to find python/ruby/php/javascript to be a bottleneck that can't be mitigated, that's no reason to say it's not worth trying to make them faster. If we can make changes to these languages that will make them more efficient, why not do it?


When it becomes important, there are many technologies to mitigate the problem in your "hot loops."

There is an implicit assumption there that most of the time in your program that could be saved is spent in a small number of hot spots. This will often be true, but unfortunately it is not necessarily so.

This is a particular problem in languages like Python, which are useful (among other things) for their support for rapid prototyping and their easily readable code. All of that is lost if you can’t perform local optimizations to reach an acceptable level of performance, leaving a ground-up rewrite in a faster language like C as the next most likely strategy.

The kinds of techniques mentioned in the linked slides could help to create a middle ground that would be very useful for performance-sensitive projects that currently find themselves between a rock and a hard place.


The deck is about the performance of the language.


@chadcf and @tptacek

I was responding to @tptacek criticism of the parent not the deck. The deck is great and it mirrors the wisdom I have picked up from optimizing my own code over the years. I personally find it really frustrating not being able to easily pre-alloc lists in Python. I think that having better APIs would go a long way.

As the deck says:

"Line for line these languages are fast!"

"We need better no-copy/preallocate APIs"

"Take care in data structures"


Forgive the naive question, but why not:

    l = [object()] * 100
Perhaps the difference is stack vs. heap?


That will create a list of 100 instances of the same object.

  object[0].x = 1
  print object[1].x
  > 1
Edit: On second read, it looks like you're asking something other than what I thought you were asking. Yes, you could create a list of 100 items and then replace its elements, but that's not idiomatic.


but it is idiomatic in C, which is the point of the slide. C was built around a performance focused idiom, which is to pre-allocate memory and then do in place writes and swaps to mutate the buffer to the state you need it to be. Python is built around an idiom of largely creating copies of objects and appending them to dynamically allocated lists. Its a much slower idiom.


Yes, the question was regarding this line in the original post: "I personally find it really frustrating not being able to easily pre-alloc lists in Python."

So my mind of course wandered in the direction of how to do that.


And this comment thread is about why some people don't care in some applications. HN meta!


This comment thread is isomorphic to a comment thread on Packrat parsing with comments about how "I don't parse anything I just use sexprs".


>Python is much slower than C and many other languages, however most of the time speed is unimportant. When it becomes important, there are many technologies to mitigate the problem in your "hot loops."

I'm not convinced by this "speed is unimportant".

Well, if you're writing shell scripts in Python/Ruby etc, OK, it might be. It might not even be important in web programming.

But for using any language as a generic programming language speed is very important.

The reason you cannot build full blown desktop apps like a browser or GUI libraries in Python? Lack of speed and memory control. And yes, you could offload the work to some extension. And that's a barrier.

Suddenly knowing Python is not enough. You got to also learn, e.g, C, and you have a segmented program structure, with some stuff here and some stuff there. Or you relegate Python to just the scripting layer for your program and do the real stuff in C/C++ (like Adobe Lightroom uses Lua).

I don't want to mitigate the problem in my "hot loops" with another language. I want to not have that problem in the first place. That would make me more productive.

One example: imagine NumPy in pure Python.

For one, it would be trivial to include in your project. Without building anything, it would work in all platforms.

Second, it would be far more accessible to people that don't know C/Fortran/et al to hack it.

Third, it would have been available for Python 3 or PyPy in a few months, not after several years.

Alright. Now, another way of achieving better speed is parallelism. But due to the bad support for it (GIL, lack of first class support) it's not easy to achieve this in CPython/MRI. Sure, you could use multiple processes but then you get all the issues of handling them and synchronising them with your own ad-hoc solution, and without first-class support from the language. Which is a barrier.

Yet another way to get more work done --for some kind of programs-- is evented code. So you have something like Node or Twisted. But Node doesn't have language support, so you get the "callback spaghetti" and Twister and co are external dependencies to the language, so they add another overhead.

Again, barriers.

People say "Speed doesn't matter" because they are trained by their language to only work on problems where speed doesn't matter. So it's more like a self-fulfilling prophecy.

Or course, if you constrain yourselves in "convenient" domains that your language supports fast enough, speed doesn't matter. But every step out of this and you are in need of clutches, from C extensions, to Cython, to Psyco, to Numpy, etc.


> It's not an unresolved question whether idiomatic Python is slower than idiomatic C/C++ for solving comparable problems. Python is much, much slower than C.

The real question is does it matter for a particular project.

If it is a desktop GUI. Does it matter if you write it in C++ and the time from button click to status update is 5usec or 1msec?

If you are receiving 10 messages per second, parsing out json and sending back a response or saving it to disk, does it always matter that it all happens in 10msec instead of 11msec. Maybe it does, I found it often doesn't.


Yes battery power and general speed. On the other hand, it's hard to sacrifice more/better programs over speed; but IMO it does matter.


Meh, when there's an io call or a network request in front of the computation you'll never know.

EDIT: removed an additional comment about scientific computing that is now relevant as someone replied to it.


[Edit: the parent originally had a sentence about not understanding why people like Python for Scientific Computing. This was my response to that. The parent has now removed the sentence.]

We (the people using Python for Scientific Computing) like Python for the following reasons:

1. Numpy+Scipy+matplotlib+cvxopt is a very speedy environment. Its only real competitor for what it provides is MatLab. I have a colleague who bench marked Python vs. Matlab for our workload. Python is faster. (often because some of the algorithms used are newer than the equivalents in Matlab.)

2. It is a very productive environment. We do a lot of evolutionary changes and prototyping. Doing in this in C would slow us down in dev. time. This is academic work and mostly the code isn't important the analysis is.

3. We generally know where the "hot loops" are. Which is what we focus on for optimization. This generally involves doing math on paper. Then implementing it. If you turn loops in to matrix multiplications and use a good matrix library you get a great speed up.


Sorry, I decided that it wasn't important before you replied. I am genuinely interested in why people use python for scientific computing, tho.

I have a colleague who bench marked Python vs. Matlab for our workload. Python is faster

Is it also faster than C? From my limited experience, it seems that people sometimes spend a lot of time on concurrency when faster code would have been easier.

This generally involves doing math on paper. Then implementing it.

Ah, yes, math always wins. This reinforces your point #2.

So, is #2 that much of a win? Do scientific programs spend more time in "development" than "production"?


> Is it also faster than C? From my limited experience, it seems that people sometimes spend a lot of time on concurrency when faster code would have been easier

It can reach FORTRAN speeds with the right tools. With Numba (http://numba.pydata.org/), your pure Python code gets compiled down to optimized machine code at call time, if your arguments are Numpy arrays. With NumbaPro (https://store.continuum.io/cshop/numbapro), we automatically parallelize for multi-core CPUs, and we emit CUDA/PTX for GPUs, and automatically exploit the parallelism in your data and algorithm.

The reason "higher level languages" can be faster than lower-level ones is because the compiler has more information about data parallelism. Typically "low level languages" are lower in that their type primitives are smaller, and hence the algorithms around those have turned vectorizable arrays into opaque for loops over arbitrary loop variables.

I certainly agree with you that many people now reach for distributed and parallel while leaving a lot of single-core and single-node performance on the table, mostly by ignoring the realities of memory bandwidth on modern CPUs. However, that level of efficiency is well within the reach of the Scientific Python stack. (See this blog post for how we're building a persistence format that respects memory hierarchy: continuum.io/blog/blz-format)


As a counterpoint; a last project I wrote in college was a machine learning algorithm. By a rough comparison it was on the order of 10000 times faster in C++ than the preexisting matlab implementation. The cause was that the performance bottleneck was not in large matrix operations; instead, there were lots of iterative updates until convergence; this meant small vectors; a C++ template-based matrix library such as Eigen ends up inlining almost all of it into one no-allocation dense bit of math the traditional optimizer can milk for every last bit.

And it's not just about static/dynamic language differences here: practically, JIT might even do better by specializing the algorithm for a particular dimensionality, whereas that's impractical in C++ since you don't know the dimensionality until runtime.

Now, sometimes you can reduce your algorithm to some large-scale eigenvalue decomposition or whatever, and then numpy or similar might provide reasonable performance. But it's not a very general solution because performance on small structures is terrible (and iterative simple updates are common in many algorithms). JITted code relying on some underlying native library (like numpy) could never extract reasonable performance from this type of code; it would be forced to make many, many function calls in the innermost loop.


Good points, certainly, but just to clarify: Numba is not particularly dependent on Numpy's built-in vectorized and matrix operations. Instead, it's using the datatype information to do JIT type inference over the functions being called with the matrix/array arguments, and building machine code for them. You can call Numba JITted functions from other Numba JITted functions, and the overhead is the same as C functions calling each other.


I've got no numba experience whatsoever, but if you're doing real function calls and memory allocations for simple things like multiplying small matrices, your code will be at least an order of magnitude slower than optimal, even in C. malloc's a big hit, and function calls often are too - not just because of the call itself (and the CPU cache hit that can involve), but no less significantly because they're opaque to the optimizer - and that means that the wrapping function is often optimized much less well.

It's not a fundamental issue, but I haven't seen a JIT do this particularly well, yet. All that inlining makes compiling slower, so to some extent the run-time nature of the JIT is an inherent limitation here.


There is no "production" in scientific programs. It runs once correctly to make the figure... more seriously, ontology is often a moving target, so the longer in takes to rewrite significant parts of the data structures, the less time there is to do science.

re: concurrency: I have a script that boots hundreds of IPython workers on hundreds of cores. I then make a client object (in antoher IPython shell), and map my 1e8 parameter configurations on to the cores, all in under a minute. This is much faster than rewritng in C.

I even implemented a special case of the brain simulator we've developed in Python (http://thevirtualbrain.org/) in C w/ unaliased pointer arithmetic etc. It's 50% faster but took more than 50% longer to write; on the other hand the PyCUDA implementation is 80x faster, and didn't take 80x, maybe 10x. Also a win because PyCUDA takes care of the uglier details.

so #2 is a big win


It's 50% faster but took more than 50% longer to write

At this point it's useful to know how long it takes to run, and how long to write. Is a run days long, months long, or years long? Or another way, is concurrency more expensive than a C re-programmer?

Also a win because PyCUDA takes care of the uglier details.

Is there not an analogous C++ library to take care of ugly details?

(I actually like python a lot, so there's a bit of devil's advocate going on. But, my longest running python programs take less than an hour.)


Typical simulations for us take between half a minute and several days, but this can depend because it's typically necessary to do a parameter sweep in several dimensions (leading in extreme cases to runtimes of several months on a cluster).

I believe Thrift (now shipped w/ CUDA SDK) makes things easier, but (since you know Python) nothing like NumPy exists in C++ and PyCUDA maps NumPy seamlessly into GPU computing, which is a big win.


If you are getting good results from CUDA and PyCUDA, you might want to take a look at Numba and NumbaPro: http://numba.pydata.org and http://continuum.io/numbapro. They are still in their early stages but work pretty well on a number of cases. Here is an example of what NumbaPro can do: http://docs.continuum.io/numbapro/generalizedufuncs.html#gen...

Numba is completely open source. NumbaPro is not open source, but it is free for academic users.


> Is there not an analogous C++ library to take care of ugly details?

No. In general, there isn't an analogous library at the more static-explicit languages (it doesn't matter much what library you choose). There are libs that people use when they have similar requisites, but they rarely are analogous.


Does that mean we can't discuss what makes languages or their implementations performant without having a detailed conversation about the relevance of performance?


No, I'm in complete agreement with the OP, but you said

Python is slower than idiomatic C/C++ for solving comparable problems

And when io and especially network is involved, that is not true. Your efficient C code can't make up for time lost elsewhere in the system. No one is clamoring for curl to be rewritten in assembly.


Is this really true? I have investigated a few performance / power problems caused by somebody using a bad networking API or using a networking API correctly, despite doing the same amount of I/O.

It is also more likely to be possible to use efficient platform-specific APIs for things like zero-copy I/O in C than in a scripting language.


Zero-copy is often not actually a win; it's also uncommon in C code, too. The parent comment is right; I/O bound programs tend to do just as well in slow languages as in fast. It's true that language performance often doesn't matter, just like it's true that parser designs don't matter if you just use sexprs for everything.


I disagree. I've measured it and it matters.


For your data set.


For lots of data sets, and that includes many so-called I/O-bound datasets. Don't forget that a lot of I/O is very fast nowadays: if you're reading 100's of MB from the disk a second, you can't spend much time on the CPU processing each byte before the CPU becomes the bottle - you have on the order of 10-100 of clock cycles a byte, no more. Simple things like decoding utf-8 and checksumming can take a significant portion of this time, and that's before you're parsing anything. (Hence stuff like protocol buffers...)

If you're trying to make a fast program you obviously avoid too much expensive I/O. You can't avoid some latency, but you can often avoid a lot, and cache a lot, and place the rest closer to the consumer.

When you say I/O dominated, are you sure you don't mean: interactive website? Because I think the real saving grace there is that's it's OK for websites to be very slow - from the perspective of a CPU. It's not that the I/O needs to take a lot of time, it's that you have 100ms (and that's before ajax and relatives, which can hide even more latency), and you just don't need a lot of optimization to get into that restriction. And once you have, the difference between 100ms and 1ns just doesn't matter nearly as much; so sure, then you start to accept very inefficient I/O setups even though much more efficient ones could be readily available.


True, but obvious.


I hate to feed someone on a troll, but you started this off with "Python is much, much slower than C." This comment thread is sponsored by obvious.


Speed in Python (or Ruby, or JS) isn't a big deal... until it is. When that happens, would you rather have to switch over to C and glue the resulting binary in (assuming you're not using JS, in which case you're just SOL), or would you rather have a high performance API at your fingertips for optimization when you need it?


Well, my usual answer there is to change the file extension to .pyx and see what Cython can do with a few type annotations. Usually the results are pretty good, and sometimes they're very good.


I think most of the gains you'd get from that would be orthogonal to the gains you would get from giving the JIT a little more information about allocations.


Does Cython give you much better results than PyPy? I would have thought that if just a few type annotations make a big difference tracing in PyPy could figure them out.


I don't know. We're using a bunch of C extensions that aren't trivially compatible with PyPy, so using it isn't really an option -- which is a pity, since PyPy sounds pretty amazing. Cython, on the other hand, integrates really well with CPython, and it can be as fast as C if you need it to be. I'm pretty happy with it.


If you're completely satisfied with Python's performance, that's wonderful. It means the talk wasn't aimed at you. Move along.


>Meh, MEH. I'm almost never waiting on my python code. I'm waiting on network or disk or database or joe to check in his changes or etc.

Meh, MEH. That's because you don't do anything involved with your Python code.

>I'm sure there are people who do wait. But that's why numpy, c extensions, all the pypy, psycho, and similar things exist.

That they HAVE to exist could also be considered a sad state of affairs though. With a faster language you would just use the language, not external extensions and tricks.


Anything involved with what? What kind of specific task do you actually mean by 'involved'?

If you don't mind leaving Python's advantages on the table then use C in good health. Odds are that other people will be waiting on you to produce the C code, so let's hope you actually needed to do that.


Alex's point is that Python on PyPy is trivially comparable to those C extensions in speed. So why give up Python, ever, if JITs are this good?


What if writing performant code on modern Python implementations is only incrementally easier than writing it in C to begin with?

With the right libraries, the hard parts of C probably turn out to be string processing with zero-copy string idioms, the requirement to lay out every data structure in fiddly detail, the requirement to track individual allocations, and the requirement to manage the memory lifecycle. What if performant Python only gives you an advantage on the last one of those?


I would say that the difference between fast Python code and C is still quite large.

- the syntax is less error-prone - ownership semantics are much clearer. You'll never segfault because you sent some memory into the wrong function - not as much detail is needed for memory layout, the JIT abstracts a lot of it away - there are high-level APIs handy - development and distribution are simpler with one less language - the barrier to optimising things is lower


I'm interested in this discussion. Which of those issues could you dispense with using more modern APIs and idioms in C? Look at Objective C (mentally wipe off all the object goo), particular NSMutableString and NSMutableArray and NSMutableData, for examples of what I'm thinking about.

The C syntax we're stuck with. But how big a deal is that syntax?

Segfaults are mitigated if you don't expose pointers, except to the extent that C programmers have to think about memory lifecycle (like I said, I think this is indisputably a win for high level languages). Look at NSMutableString for an example of a C-style idiom that removes whole classes of pointer operations.

I dispute that JITs abstract away details about storage; they may allow you to not think about those details for code that doesn't need to be performant, and they can help the language get out of the way when you need to care about the storage details, but the question I'm asking is limited to performant code. There is no question that nonperformant Python code is way easier to write than any kind of C code!

There are better APIs available in Python than are commonly available to C or even ObjC, but that's a solvable problem. Let's stipulate better APIs, to the limit of what the language would allow (in other words, it's totally fair to say that the design of C/C++ would prohibit certain kinds of easy APIs).

Development and deployment are easier in some cases for Python (for instance, building on OS X and deploying on Linux), but far easier for C in others (for instance, building code that will run in a kernel or as a plugin in the address space of another process).

I dispute that the barrier to optimization is lower in Python for obvious reasons: C programmers can optimize without working around the exposed wires and ductwork of the language runtime. C programmers generally have an easier time optimizing than Python programmers; that is probably the #1 reason any Python programmer ever writes C.

As a current Golang programmer I agree strongly with the commenter below that when you take this idea and apply it to a new language you wind up with something that looks a lot like Go, which does work great. But I'm not advocating Go here.


I would say that the things you mentioned that we shouldn't count already add up to a lot (syntax, memory management, segfaults, security vulnerabilities). Another big one is that if you have these additional constructs in Python you can smoothly migrate from slow to fast code. You don't have to create a C file, rewrite your whole algorithm, create a build process to compile the C file (which you don't need with Python), and get the C functions to be callable from Python. In contrast with the method proposed in this PyPy presentation you just change a couple of lines. If instead of advocating writing just the performance critical parts in C you are advocating writing everything in C, then in addition to the issues you mentioned then you're missing the high level features of Python for the code that isn't performance critical.

The woes of pointers (segfaults and security vulnerabilities) cannot be addressed in a library without a performance penalty. If you want a nice error message instead of a segfault or random memory overwrite you will have to pass around type information at run time. You could however have a production version of the stdlib that did not pass around type information, but that would only solve the issue at development time: the security vulnerabilities in production would still be there.

There is also an argument to be made that many of the optimizations mentioned in the presentation can be done automatically by the compiler/JIT. For example Javascript JITs already optimize small hash tables used as objects, since every Javascript object is a hash table. Load forwarding followed by code motion can remove unnecessary intermediate allocations. And the square example should have been written as:

    [i*i for in in xrange(n)]
This can allocate the result list of the right size at the start of the allocation.


I routinely program in Python and C, and syntax matters much to me.

My personal favorite feature of Python is simply the syntatic sugar that allows me to write stuff like "for element in array" without having to remember that an index exists. These little things add up fast when you're trying to focus on the problem at hand!


I actually was able to write a macro in C that, along with a certain paradigm for defining collections, allows foreach loops that are just as nice as Python.


The crux of the argument seems to come down to trading on optimised development time versus optimised execution time. C with the right set of APIs could nail both of those, which is no slight to Python. Look at BIND, they've gone mixed C++/Python because they know you use the right language when you need it, and don't get religious.


You can control the details without it having to be overly fiddly. Certainly you can do better than C. You may end up doing similar things, but your code will be way easier to write and read. Go does this quite well.


Back when I wanted to investigate the numeric performance of v8 I wrote a Runge-Kutta integrator + Lorenz attractor in C and in JavaScript as a simple-but-not-entirely-trivial benchmark. I was actually pretty impressed with how fast the v8 version was. On the downside, it's fairly non-idiomatic js and not that much nicer to look at than the C. Doing a million steps on my machine takes 0.65 seconds in node.js v0.8.4, 0.41 seconds in C compiled with gcc -O0, and 0.13 seconds with gcc -O3. Here is the code if anyone is interested. Note that it's not commented, not thread-safe, and doesn't free memory, so use at your own risk :)

https://gist.github.com/anonymous/5066486

    gcc strange.c rk4.c; ./a.out

    node strange.js


> Back when I wanted to investigate the numeric performance of v8

Straightforward numerical computations really isn't a good jit benchmark, because numerical computations are by far the easiest thing to JIT, and JITted perfs are going to be much closer to AOT than in the general case (unless the problem can be vectorized an the AOT compiler is vectorizing, I don't think JITs can usually vectorize)


Yup...I wanted to try to quickly measure just how good the JIT was for that kind of stuff, to see if we were getting to the point where it's feasible to do physics in the browser. Turns out it's "fast enough", and yet still 5x slower than equivalent stock C.


Great presentation, thank you for making me aware of an aspect of Python performance. One slide struck me as odd - the "basically pythonic" squares() function. I understand it's a chosen example to illustrate a point, I just hope people aren't writing loops like that. You inspired me to measure it

    $ cat squares.py
    def squares_append(n):
        sq = []
        for i in xrange(n):
            sq.append(i*i)
        return sq

    def squares_comprehension(n):
        return [i*i for i in xrange(n)]
    $ PYTHONPATH=. python -m timeit -s "from squares import squares_append" "squares_append(1000)"
    10000 loops, best of 3: 148 usec per loop
    $ PYTHONPATH=. python -m timeit -s "from squares import squares_comprehension" "squares_comprehension(1000)"
    10000 loops, best of 3: 74.1 usec per loop
    $ PYTHONPATH=. pypy -m timeit -s "from squares import squares_append" "squares_append(1000)"
    10000 loops, best of 3: 46.9 usec per loop
    $ PYTHONPATH=. pypy -m timeit -s "from squares import squares_comprehension" "squares_comprehension(1000)"
    100000 loops, best of 3: 8.67 usec per loop
I'm curious to know how many allocations/copies a list comprehension saves in CPython/PyPy. However I wouldn't begin to know how to measure it.


If you really want power, use NumPy:

    from numpy import arange

    def squares_numpy(n):
        a = arange(n)
        return a * a

    $ python -m timeit -s "from squares import squares_append" "squares_append(1000)"
    10000 loops, best of 3: 130 usec per loop
    $ python -m timeit -s "from squares import squares_comprehension" "squares_comprehension(1000)"
    10000 loops, best of 3: 95.4 usec per loop
    $ python -m timeit -s "from squares import squares_numpy" "squares_numpy(1000)"
    100000 loops, best of 3: 5.31 usec per loop


The creators of Common Lisp knew what Alex is talking about. Lisp is, of course just as dynamic as Ruby, Python or Javascript, but it exposes lower-level details about data structures and memory allocation iff the programmer wants them.

Features that come to mind include preallocated vectors (fixed-size or growable), non-consing versions of the standard list functions and the ability to bang on most any piece of data in place. There are fairly few situations in which a CL program can't come within a factor of 2 or 3 of the performance of C.


In the early 80's there was a time Lisp compilers could even beat FORTRAN for floating point computations.

http://www.cs.berkeley.edu/~fateman/papers/lispfloat.pdf


From the discussion of their most relevant benchmark (Singular value decomposition):

  The Allegro CL 4.1 times of 3.9 seconds beat the f77 time of 4.8 [seconds];
Nice!

  setting on optimization for f77 brought its time down to 0.45 seconds.

  Thus for this system, the [LISP] compiled code can have quite comparable speed to
  that of the corresponding unoptimized Fortran in this case as well.
Oh really.


It's been a few decades since I have read the paper, so it seems my memory is a bit fuzzy on that regard.

On the other hand, given that C always had issues to beat even unoptimized Fortran, due to the optimization restrictions before C99, it is quite commendable that Lisp achieves such results.


It's not obvious how optimized C in the '80s wouldn't have been as fast as unoptimized Fortran 77, even with whatever optimization restriction you might be thinking of.

The way the authors of that paper talk about unoptimized code in that paper gives the impression that they don't know what they're talking about. Your comments here begin to put you at risk of a similar appearance.


Great bit of slides. Straight and to the point. If you've ever ventured under the hood of Python you'd see this in the code. If you've ever had to optimize the bejeesus out of code in C++ or C, you'd know exactly the kinds of things he's talking about.


Author/speaker here:

I don't have time to read all the comments now (thanks for all the interest though!). I just want to say I think when the video comes out it'll answer a lot of questions people are having.


I'm looking forward to the video. I'm also interested in proof of the "lame myths" claims, or links to debunkings of those myths, etc. And also if you have rants about those If There's Time topics in your last slide, I'd like to read those too. Thanks!


Someone actually posting notes with slides! It's a miracle!


yes thank god. I almost skipped it when I saw it was a slide deck, until I saw the notes. I hate it when people link to a slide deck with no notes. It's almost completely useless.


Completely agree. APIs are so important for many optimizations to pull off.

I'd really like to use a lot more buffer()/memoryview() objects in Python. Unfortunately many APIs (e.g. sockets) won't work well with them (at least in Python 2.x. Not sure about 3.x).

So we ended up with tons of unnecessary allocation and copying all over the place. So sad.


As a C/C++ programmer I find these slides kind of amusing... These languages are popular because they make things simpler, and his suggestions may very well get a nicely jit'd language on par with C, but I suspect you'll then have the same problems C does (complexity).


I don't think that added complexity invalidates the usefulness of high performance APIs in high level languages. The point would not be to write all of your code to be highly performant (that would be premature optimization), but to optimize the hot spots.

Currently, if you want to really optimize a hot spot in, say, Python, your only real option is to write that part in C. Then you have all the additional complexity of gluing that into your Python program, along with portability concerns and a more complex build process. It would be so much easier if there were a way to sacrifice local simplicity and idiom for performance while still staying in the language.

And in the case of JS, I'm not sure you have much in the way of options at all for optimizing hot spots to reduce allocations. Maybe you could write it in C and compile it to JS via emscripten? I don't know if that would even help currently, but maybe if asm.js takes off. But once again, wouldn't you rather sacrifice a small amount of elegance for performance rather than switching languages?


Depending on the application, speed may matter only for 5% (or less) of your code. If you can write that 5% in Ruby or Python, same as you wrote the other 95% in, you save a lot of difficulty compared to calling out to a C library — even if the optimized part of the code is a bit complex.


There's a lot to be said for writing everything out in a simple manner for the first pass, then profiling and adding complexity to the places where there's a big benefit from it. And if something goes wrong what you get is bad performance, not a segfault.


One main thought on this topic -- languages like Haskell and lisp also have very poor support for direct memory control, but tend to be viewed (perhaps untruthfully?) as much closer in performance to C than Python/Ruby.


Haskell and languages in the ML family have a lot of opportunities for elaborate static analysis, which often allows the resulting programs to be quite clever about optimizing the resulting programs. As one example, the GHC Haskell compiler uses loop fusion to combine multiple passes over a list into a single pass with no intermediate copies of the list produced. Consequently, Haskell code like

    map f (map g (map h someList))
is going to involve allocating exactly one list of the same size as someList, while a direct translation into Python

    map(f, map(g, map(h, someList)))
is going to involve the creation of several intermediate lists.


Most of the Haskell optimizations are aimed at making high-level code faster, not absolute speed. Something like loop fusion is trivial to do by hand, so it doesn't confer any advantage on hand-optimized Haskell programs. And when you're comparing to C, you're in the realm of performance where hand-optimizing code at the expense of simplicity, maintainability or even correctness is necessary. Otherwise you wouldn't have considered C.

So high-level optimizations like loop fusion do matter, but only for the 95% of the code which is not entirely performance critical. It's great for letting you write pretty (and therefore easier to write and more maintainable) code but not great for squeezing out all the possible performance in an inner loop.

Considering that many GHC optimizations are of this nature, it's really impressive that GHC is still good at speed in an absolute sense as well. So if you desperately need to wring out an extra bit of performance, you can still do it in Haskell instead of having to use C. Sure, the highly optimized Haskell will be relatively ugly, but it's still better than C and much easier to integrate with the rest of your codebase.


In the Haskell case you could also do something like this to avoid needing that optimisation:

  map (f . g . h) someList
And in Python 3, map returns an iterator, not a list, so you aren't building the full list until you ask for it, and you never build intermediate lists in your example. You can do the same thing in Python 2 with the itertools.imap function.


`map` is the trivial case. There are plenty of loop compositions that are completely non-trivial to do by hand. That's why array fusion (see e.g. repa or vector) is a huge win.

    foldl g . scanl y . concatMap x . filter h . unfoldr k
Fuse that by hand.

This is why we have optimizing compilers. They do what you could have done, only more often, and without mistakes.


Using generators in Python will get similar laziness in Python:

    from itertools import imap
    map(f, imap(g, imap(h, someList)))
I think Python 3's map built-in is a generator so you no longer have to use the itertools module.

Unfortunately we don't have . or currying in Python so no pointfree python :(.

    from itertools import ifilter
    # ugly python function with a "Maybe dict" return type
    def query(data, date):
        """Return the first dict where date is > x['date'] or None"""
        is_greater = lambda x: date > x['date']
        return next(
           ifilter(
               is_greater,
               data
           ),
           None
        )


Have you tried underscore.py? It's not quite pointfree but it's a step in that direction.


No, Haskell really does have significantly better performance than Python or Ruby. Same with some Lisp variants as well--there was even a research whole-program optimizing compiler for Scheme that apparently sometimes beat hand-optimized C (Stalin Scheme).

So one way to improve performance is simply by having a good compiler. And GHC, at least, is a very good compiler.

Also, Haskell support for direct memory control is not that bad. In fact, in some ways, it's better than even Java--you can have your own unboxed data types which essentially act like structs, for example. This also means that you can have unboxed arrays of more than just primitive types.

Haskell also does some very clever things with both the heap and the stack, but I'm not familiar enough with its internals to comment. My understanding is that Haskell makes heap allocation much cheaper and has a GC optimized for handling lots of small allocations (as you would expect for a functional language).

Ultimately, the point is that the question is fairly nuanced and you won't be able to pin down a single language or implementation feature that uniquely determines performance.


Looking at CPython and the bytecode it uses, it's not very hard to see why it would be slow. It's basically designed as a reference implementation, with only very tame optimizations.


My own piece of feedback based on my experience. The slides were good. But like others, JIT is not all rosy. In V8 and Dart and .NET, code gets compiled to native code as soon as possible. I think that's the best case scenario in general. You then don't have to guess as much.

The author didn't mention method dispatching. I think it's an issue for many languages. In Dart, they tried to optimize it by the specification by mostly eliminating the need to change methods at runtime. In Ruby I watched a video by one of the core Ruby developers and he said that in Ruby method dispatching can be very complicated requiring up to 20 steps to resolve them.

As important as getting the best performance out of programs is to get the programs created in the first place. That's why I'm against shying away from larger codebases. I'm in favor of OO programming exactly because I think getting things done comes first, even if that could complicate the implementation of the toolset. And OO is all about layers of abstractions that bring more performance costs with them.

That said, I absolutely abhor type annotations. They make code hideous and decrease the opportunities for experimentations. Instead of reading a + b = c algorithms, you may need to parse A a + B b = C c source code.

In Dart we have Optional Types. But the core developers are fond of type annotations, so most samples they post come with them. I take relief in being able to omit type annotations while experimenting, researching and ultimately prototyping. Although in a way I feel like a rebel in the community for this disregard. Thankfully there is this chance to share a community with them.

Reading the part that you don't like adding heuristics to help programs to go faster reminded of adding types to them even if they are mostly disregarded as in Dart.

Then again, not all "dynamic languages" are the same. Some are truly dynamic with eval and runtime method changes. Others, not so much. Sometimes the tradeoffs allow for other kinds of gains that could come into play like when deploying. So there is a lot more to it than just getting the algorithms correct.


The example he gives for strings could be optimized to near the efficiency of the C version by a sufficiently smart compiler:

    int(s.split("-", 1)[1])
If the JIT knows that s is the builtin string type and the split() method has not been overridden [1], it can speed this up by using "pseudo-strings," where a pseudo-string is an index and length into another string. This would require only O(1) time and space.

Garbage-collecting pseudo-strings would be an interesting exercise, but I'm sure it's a solvable problem [2] [3].

[1] If the preconditions for your optimization don't hold, you can always fall back to interpreting it. As noted by the speaker, this sort of logic is already a critical part of many JIT's including Pypy.

[2] The problem is actually GC'ing the parent. When the parent string is gc'ed, you have to compact the orphan strings to reclaim the remaining space; otherwise it'll be possible to write user code that uses a small finite amount of memory in CPython but has an unbounded memory leak in your compiler.

[3] You can avoid the trickiness in [2] if the parent string can be proven to outlive its children, which is the case in this example. You could probably optimize a lot of real-world code, and have an easier time implementing the compiler, if you only used pseudo-strings when they could be proven to be shorter-lived than the parent. As a bonus, this partial GC would build some infrastructure that could be recycled in a general implementation.


Kind of a poorly-named deck. It's really about why programs use features of these languages that end up causing poor performance relative to C, rather than why the individual VMs themselves are slow. It's no surprise that trading the byte-precision of C for the convenience of a garbage collector and heap-allocated data structures results in a performance decrease.

Dynamically-typed languages are often easier to program in, but require more copying (and memory allocation) as a result. Hash tables are heap-allocated and have to be garbage collected, but they're flexible - something you don't get with structs. Allocating and freeing memory has a cost, and that can add up quickly. Your primary line of optimization in most of these languages is "avoid the GC", which really boils down to "don't allocate more than you need to", which is sound advice in every language, scripting or otherwise.


Did you read the deck? The GC isn't the problem; it's layout and management of allocations that's the problem, whether you use a garbage collector or explicit deallocation to clean up the resulting mess.

I think the idea that GC is what slows down dynamic languages has to be the most prevalent misconception about language performance.


Yeah. I dunno about "most prevalent", though. TFA contained two "lame excuses", both of which are things I believed to be causes of slowness in dynamic languages, and now I have to reconsider.

    dynamic typing prevents type-based optimization
    monkey patching prevents optimization
I think the most common complaint I hear about GC is not that it affects computational throughput, but that it affects _predictability_ of computational throughput. One maybe doesn't care in scientific computing, but game developers are always going on about how they can't use a GC language because a stall mid-frame will knock them over 16ms/frame or 33ms/frame, which for console certification is a project-killer.


It's worth noting that WoW uses Lua; at some point Blizzard switched from Lua 5.0 (with used a stop-the-world GC) to 5.1 (which introduced an incremental GC) specifically because of this problem. Before, UI code could generate excess tables, kick in the GC, and dramatically impact your framerate. The incremental GC significantly helped this, since it permits the GC to be run over multiple frames, reducing or even eliminating the perceptible impact on the user experience.


I did, and I think I might have mis-stated my point. GC thrash is a symptom of the problem, not the core problem itself. Manually managing allocations avoids the "resulting mess" that must be cleaned up from, which is where your big speed boost comes from. In general, the higher up you go, the further abstracted away from memory management you become. It's not that GC is inherently slow or anything, but simply that giving up control of where and how memory is allocated (in exchange for a more flexible language) is the reason for the speed difference.


It's not a given that GC is slower than manual memory management. See, for example, this work:

http://people.cs.umass.edu/~emery/pubs/gcvsmalloc.pdf

In a copy collector the GC time is proportional to the amount of live memory -- garbage is free. In FP-style programs (lots of short-lived allocations) GC can be essentially free.

The other main source of slowdown is to do with boxing and type checks. Accessing all data in a naive implementation of a dynamically typed language involves at least one type check and typically one pointer indirection to unbox the data. This can kill performance relative to the unboxed and unchecked equivalent. Consider, e.g., floating point operations -- they run in 1 cycle. If you add type check (2-3 cycles perhaps) and pointer indirection (10 cycles if it's in the cache) you can see how massive slowdowns easily arise.

Modern JS VMs will remove most of this cost. I doubt Python and Ruby do.

Basically, I would say it really depends on the interactions between the program and the language implementation.


> Modern JS VMs will remove most of this cost. I doubt Python and Ruby do.

In the python case, pypy removes this cost whenever it can. The less you use dynamic features like duck typing, the faster your code gets.


If the deck is to be believed, it's not the garbage collector or the heap that's causing the performance loss, it's that APIs and algorithms they enable are allocation-heavy compared to other "faster" languages.

Heap allocations are expensive even in non-GC languages.


The first example is lookup based on name, you can get away from that in python with slots but the common consensus is that it's not worth the performance improvement. That somehow translates into a dogma like, "Never use slots," but sometimes it is worth it.


> The first example is lookup based on name, you can get away from that in python with slots

You can get away from that with a sufficiently smart JIT turning a stable access to a known object into little more than a vtable dispatch. That's close to what V8 does with hidden classes on full cache hits (object map — its "hidden class" — + object "array" property both cached).

    point.x
will generate the assembly:

    cmp [ebx,<hidden class offset>],<cached hidden class>
    jne <inline cache miss>
    mov eax,[ebx, <cached x offset>]
(where ebx is the previously loaded `point`)


Which PyPy does, so __slots__ is entirely useless.


pypy is great, but I and others used swig and am stuck now with maintaining that stuff. There is also memory benefit to slots for when there are many instances. Sometimes you just have to do gross stuff so you can get on to the next thing, it's okay when everybody understands it enough.


The memory benefit also exists on PyPy without slots.

It may be possible to write a cffi backend for swig, but it's likely quite hard.


I'm certain you could have a language that is easy to program in, but doesn't use so much copying or use dictionaries so much. I'm not sure such a thing exists, though.


There's a whole continuum of stuff between Python/Ruby/JS and C, like Go, OCaml, and Rust. There's still a lot of refinement that could be done, though.


Interesting slides, and good point about having better APIs.

Perhaps I'm nitpicking, but with a function called `newlist_hint`, I struggle to see how anybody would adopt it. I had to go back to the slides maybe 3 times, and I still don't remember the name of this function... Those APIs must have the most obvious, logical and simple names.


I have a few comments about some of the slides, feel free to correct any misunderstandings.

Dictionary vs Object:

Lookups in both data structures is O(1), the difference being the hashing cost (and an additional memory lookup for heap) vs a single memory lookup on the stack (1 line of assembly).

Squares list:

> ... so every iteration through the list we have the potential need to size the list and copy all the data.

This is no different than stl::vector which has an amortized cost of O(1) for a push_back().

It's not going to be as fast as C, but I'd also argue for a generator version instead:

    def squares(n):
        return (i*i for i in xrange(n))
One of the main reasons people choose Python is for expressiveness and not manually managing memory, although pre-allocation does seem like a good idea.


You mean std::vector, from the STL. And yes, the amortized cost is O(1) per element and thus O(N) in total, but the constant factor and lower-order terms (the O(1) time to do the allocation and garbage-collect it later) do matter.


Its time to face it:

People start to create computer languages without carrying too much about the target processor opcodes (because in that time processor were just getting faster with time) and focus more on programmer convenience, and wild beasts like python and ruby were born..

C is fast because it was created with processor awareness in mind.. pretty simple...

these days kids are all about trying to create more and more crappy convenient sintax languages.. and they get worry when the languages dont scale? for what computer they design the language? from venus ?

nobody should be doing any serious software in python or ruby.. is such a waste of talent .. use it for education.. for fun.. or for the things they are best.. wich is not in the system/plumbing side of things


Please don't ever comment again. Thanks in advance.


Speak about Python and Ruby.

Javascript is insanely fast, with V8 and its ilk.

And I'm not talking about "toy benchmarks" either, I'm talking about envolved stuff written in plain JS (no C extensions), from the QT port to JS/Canvas, to the h264 encoder and such. Try doing those on Python and you'll see what you get. And of course all the toy benchmarks also agree.

Javascript with v8 is like a faster PyPy (with less performance deviation): 10 to 20 times faster than plain Python code.

Sure, you can extend Python with fast C code. But as the core languages are concerned, JS beats CPython hands down. (Oh, and you can also extend JS with fast C/C++ code if you need that. Node modules do it all the time).


Question:

    def squares(n):
        sq = []
        for i in xrange(n):
            sq.append(i*i)
        return sq

    A basically idiomatic version of the same in Python. No list
    pre-allocation, so every iteration through the list we have the
    potential to need to resize the list and copy all the data. That's
    inefficient.
Is that true? I'd expect .append() to change a pointer or two, not "resize and copy" the list. Even an .insert() should just move pointers at the C-level... no need to "defrag" it. I guess the key word is potential.


Python lists are contiguous blocks of memory (i.e., arrays), not linked lists, so append will sometimes have to resize and copy. It's true that amortized worst case time is O(1), but that's still going to take longer than the one allocation involved in pre-allocating an array of known size. I don't know whether (or, under what circumstances) the difference is large enough to be significant, though.


Being fairly new to C, is appending to / dynamically growing an array really just a matter of "a pointer or two"?

How can you take for granted the memory space past the end pointer is available?


On an array, you can't. This means that you can't on a Python list, either. mixmastamyk is mistaken about the implementation details.

But if you assume that "list" means "linked list", then you can just navigate to the correct part of the list, allocate enough space for one new cell, and stitch together a few pointers. Allocation and stitching is O(1). In general, navigating to part of the list is O(n), but if your list is a doubly-linked circular linked-list, or alternately if you keep a pointer to the end as a special case, then "navigate to the end of the list" becomes also O(1). I assume that all of this is what mixmastamyk was thinking Python was doing.


Thanks for the detailed response.

I was in fact taking it as almost a given that Python lists were backed by arrays under the hood.


I believe you're correct - Python lists should be O(1) for appending[0].

[0] http://wiki.python.org/moin/TimeComplexity


I think you're both right and wrong.

mixmastamyk's comment implies that (s)he believes that Python lists are, under the hood, linked lists. This is wrong. Python lists are ultimately backed by C arrays. This is why get() and set() are O(1), and insert() is O(n).

However, dynamically resizing an array to support append operations, if you're not stupid, takes amortized constant time. Individual operations may be O(n). Python implementers, happily, are not stupid.

However, insert() operations do require "defragmenting".

So, the primary question mixmastamyk asked about the cost of

    def squares(n):
        sq = []
        for i in xrange(n):
            sq.append(i*i)
        return sq
is totally correct, but a lot of the sub-reasoning is wrong.


Yes, I've always just assumed that internally a resizable Python list was a linked-list I learned about in C... fits perfectly.

I suppose using an array must improve performance in typical cases, while the resizing (a linked-list advantage) happens less often.


Resizing is not a linked-list advantage, though. Resizing is an ammortized-constant-time operation.

The only advantage I'm aware of for linked lists is insert and delete (but not append and pop), which are constant time in a linked list but O(n) in an array.

The thing is, though, that doing insert() on a linked-list usually requires a seek first. Which is O(n) on a linked-list (and may be O(n), O(logn), or O(1) on an array, depending on what you mean by "seek"). So in _practice_, usually inserts and deletes are O(n) in a linked-list as well. (But not always, because you could already have a pointer to the relevant thing, for example if you have multiple different pointed-based data structures with pointers into each other.)

Basically, as far as I can tell, linked-lists are almost useless, unless you're forced to write all your data structures from scratch and you have too little time to write yourself a proper library. The main exception is for really hairy shit where you have multiple linked-list views traversing the same data in different orders.


linked lists are the default data structure underlying almost every implementation of Stacks and Queues and the variants of those. they're far from useless.

they're also such a flexible data structure that it is literally the ONLY data structure necessary to implement any of the LISPs.


> linked lists are the default data structure underlying almost every implementation of Stacks and Queues

Citation needed. I doubt this.

Certainly if I had only ten minutes to implement a stack or a queue, without access to anything more than stdlib.h (or equivalent), a linked list is easy to get right in a hurry, and only takes a few dozen lines. But the auto-resizing array is only a little harder, and has better performance for nearly every operation, as I explained in the previous post.

> they're also such a flexible data structure that it is literally the ONLY data structure necessary to implement any of the LISPs.

Of course. So what? That's not a reason to use them anywhere other than a school assignment that requires you to use them.


> > linked lists are the default data structure underlying almost every implementation of Stacks and Queues

> Citation needed. I doubt this.

As do I. Since you're not going to reorder a stack or (in most cases) a queue, and since they contain fixed-size elements, what's the point of a linked list?


Mike Pall of luajit fame has an interesting take on it.

http://www.reddit.com/r/programming/comments/19gv4c/why_pyth...

<quote>

While I agree with the first part ("excuses"), the "hard" things mentioned in the second part are a) not that hard and b) solved issues (just not in PyPy).

Hash tables: Both v8 and LuaJIT manage to specialize hash table lookups and bring them to similar performance as C structs (1). Interestingly, with very different approaches. So there's little reason NOT to use objects, dictionaries, tables, maps or whatever it's called in your favorite language.

(1) If you really, really care about the last 10% or direct interoperability with C, LuaJIT offers native C structs via its FFI. And PyPy has inherited the FFI design, so they should be able to get the same performance someday. I'm sure v8 has something to offer for that, too.

Allocations: LuaJIT has allocation sinking, which is able to eliminate the mentioned temporary allocations. Incidentally, the link shows how that's done for a x,y,z point class! And it works the same for ALL cases: arrays {1,2,3} (on top of a generic table), hash tables {x=1,y=2,z=3} or FFI C structs.

String handling: Same as above -- a buffer is just a temporary allocation and can be sunk, too. Provided the stores (copies) are eliminated first. The extracted parts can be forwarded to the integer conversion from the original string. Then all copies and references are dead and the allocation itself can be eliminated. LuaJIT will get all of that string handling extravaganza with the v2.1 branch -- parts of the new buffer handling are already in the git repo. I'm sure the v8 guys have something up their sleeves, too.

I/O read buffer: Same reasoning. The read creates a temporary buffer which is lazily interned to a string, ditto for the lstrip. The interning is sunk, the copies are sunk, the buffer is sunk (the innermost buffer is reused). This turns it into something very similar to the C code.

Pre-sizing aggregates: The size info can be backpropagated to the aggreagate creation from scalar evolution analysis. SCEV is already in LuaJIT (for ABC elimination). I ditched the experimental backprop algorithm for 2.0, since I had to get the release out. Will be resurrected in 2.1.

Missing APIs: All of the above examples show you don't really need to define new APIs to get the desired performance. Yes, there's a case for when you need low-level data structures -- and that's why higher-level languages should have a good FFI. I don't think you need to burden the language itself with these issues.

Heuristics: Well, that's what those compiler textbooks don't tell you: VMs and compilers are 90% heuristics. Better deal with it rather than fight it.

tl;dr: The reason why X is slow, is because X's implementation is slow, unoptimized or untuned. Language design just influences how hard it is to make up for it. There are no excuses.

</quote>

Also interesting is his research on allocation sinking:

http://wiki.luajit.org/Allocation-Sinking-Optimization


Yeah i wish someone could crowd source a project so Mike could spare 20% of his time on RubyVM.


Interesting presentation, but it can't be the whole story. Even projects like SciPy which use the most rudimentary data structures (basically just a large array of floats) and algorithms (sometimes just looping through the elements in order a few times) see a considerable advantage when rewritten in C.

http://www.scipy.org/PerformancePython


Very interesting talk

Leads me to wonder - has anyone done a study of any large-scale program to check where the slow spots are? It's not that I don't trust the speaker, he makes excellent points and is obviously a great memeber of the community.

But it would be very interesting if he were able to say: "Using PyPy's secret 'hint' API, only in drop-dead obvious places, improved performance by a factor of 5".


    atoi(strchr(s, '-') + 1)
What does this do? Finds the first instance of a -, and converts the remainder of a string to an int. 0 allocations, 0 copies. Doing this with 0 copies is pretty much impossible in Python, and probably in ruby and Javascript too. </quote>

The copying could be avoided in non-idiomatic Python:

    int(buffer(s, s.find("-") + 1))


   +s.substr(s.indexOf('-') + 1)


It is almost time that people stop referring to Languages as Fast or Slow. It is an implementation that is fast or slow, not a language.


Have you considered reading the fucking presentation before snarking out on it?


What makes you think I didn't? The author seems to agree with me. He even concludes that we should improve the implementation rather than the language for optimisation.


I didn't get that from the presentation at all. What I read is a guy complaining that idiomatic Python is necessarily slow because it depends, for its clarity and terseness, on not controlling allocations or memory layout, and that Python- the- language abets the problem by having its standard library build on the same idiom.


No, the author concludes that languages based on hash table lookups and lacking copy avoidance mechanisms are inherently slower.


Which makes your comment useless and redundant. The very point of the presentation pretty much being "this declaration makes no sense"


It's true that implementation makes a bigger difference to performance than language features, but language features can indeed affect speed. One clear example is that a language that checks for integer overflow will be slower than a language that doesn't. There are a lot of situations where there's no way you could optimize out the extra instructions you'd need to check for overflow.

Some language features are easy to optimize away for some common cases (e.g. array bounds checks, so that your program throws an exception instead of segfaulting), but you can't optimize these if, for example, you're iterating over data you read from a file using indices you also read from that file.


While I agree that implementations can be fast or slow, it does not completely eliminate the effect of the language. At minimum, different languages require different levels of effort to reach their optimum. So, while a language can be hurt by a poor implementation, it does not follow that all languages have the same potential performance.

As an example, just consider the enormous amount of effort that has gone into the JVM, and Java is still generally considered to be ~2x slower that C.


This is one of my pet peeves, too.


I think its time people stop using shitty slide decks to get their point across.


Didn't it occur to you that these slides were for a presentation and that sharing them enable more people than just those that were at the presentation be informed of their content?

I, for one, am very grateful to speakers that make the extra effort required to share with a larger group what they have already shared (or are about to share) with a smaller group.


Much better to turn your notes from the talk into an article than to just throw up the slides.


But much harder. Slides (with notes!) are a VERY good compromise.

Remember: the author does not owe you anything.


He gave a talk yesterday at Waza, Heroku's conference. I believe Heroku is planning on uploading videos of the talks in the coming days.

I've been interested in this talk since I saw it announced on Twitter, but prefer to watch/listen rather than go through these slides.


If you want to learn more about what the Ruby VM has to do in order to execute your code, and some of the performance challenges for Ruby implementors (such as it's extremely flexible parameter parsing) I suggest this talk by Koichi Sasada: http://www.youtube.com/watch?v=lWIP4nsKIMU


I really wish this talk had been done by someone who spoke english, I found it rather painful to try and get through...

Any good articles that summarize this info?


I find this attitude a bit entitled. The speaker does speak English, his accent is just not very understandable.

However, opening the video and seeking to a random point, I must say that the phrase "Ruby release policy: Ruby level compatibility" isn't doing any Japanese speaker a favour.


You find his attitude entitled, I find your reply needlessly confrontational.

It's not like chadcf said "This talk is bullshit, that guy doesn't even speak English". chadcf said "I experienced this difficulty, I wish that the following thing existed, can anyone help me?" Maybe he could afford to have done

    s/speak English/speak more fluent English/


As a ruby lover, I'm interested in the ruby implementation the Author wrote and mentioned, topaz [1]. Has anyone here tried it?

"Topaz is a high performance implementation of the Ruby programming language, written in Python on top of RPython (the toolchain that powers PyPy)."

[1] http://docs.topazruby.com/en/latest/


I think the preallocate APIs sound like a cool idea. Perhaps there could also be some kind of 'my hashtable is an object' hint that could let the compiler do the same kind of optimizations on hashtables that it does on objects (assuming that your hash keys don't change much).


His suggestion for better preallocate APIs made me think of this ruby patch from Charles Nutter: http://www.ruby-forum.com/topic/173802

4 years later and they still discuss it, heh.


He mentions that he couldn't find a pure C hash table.

http://linux.die.net/man/3/hcreate


Back in the early 2000s it seemed like a lot people were using kazlib: http://www.kylheku.com/~kaz/kazlib.html


Not portable.


As a programmer that first learned c and still thinks like a C programmer in a lot of ways, this actually explains a lot to me.


i'm actually surprised noone ever talks about perl. isn't perl crazy fast compared to the other interpreted languages?


no. Perl is mildly faster, not crazy faster. same order of magnitude.


This is misleading and contains errors like calling C++ "C". Unless you have a great deal of knowledge about these things already, I urge you not to learn from this but read the slides purely for entertainment.

Question: The author claims to be a compiler author. After some digging I haven't found any information on what compilers he has written or are part of writing. Could someone point me to the compiler(s) Alex is involved with? Thanks.


He mentioned it directly in the talk: PyPy. Or are you being snarky and saying he doesn't know what he's talking about because PyPy isn't a "real" compiler. Alex has made huge contributions to several open source projects. I can't imagine too many people who know more about making Python go fast.


I've been a C/C++ programmer since ~1993 and apart from wondering why he had to "look up" a C hash table (there's one in K&R) I saw nothing to complain about.


Your reading comprehension is extremely poor here. His notes mention that he wanted to use a pure C version but couldn't find a C version by a quick google search. It did not say "here is a pure C version".


He's on this page: http://pypy.org/people.html

He works on the JIT, among other things.


Look up PyPy.


"This is misleading and contains errors like calling C++ "C"."

In what ways did that detract from his overall point?




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: