Where are the main blowouts in python performance? For example, is it compilatio...

PyComfy · on Feb 7, 2018

Naive interpretation of the bytecode (not even pre-decoded, just a switch statement). And almost everything is resolved in the dynamic environment. for example,

    a = foo.bar(b)

is actually

    ldict = locals()

    ldict['a'] = ldict['foo'].__getattribute__('bar')(ldict['b'])

tom_mellior · on Feb 7, 2018

This is a bit misleading. You suggest that local variables are looked up by name in a dictionary, which is not the case. They are looked up by indexing into a C array, with the index being a constant in the bytecode. That's quite a lot simpler. Here is the corresponding code (look above for the definition of the GETLOCAL macro): https://github.com/python/cpython/blob/fc1ce810f1da593648b4d...

So your code should be more like:

    locals[a_idx] = locals[foo_idx]->__getattribute__('bar')(locals[b_idx])

But this isn't a very good rendering of the thing because it doesn't show the many redundant reference count increment/decrement pairs every time you touch a variable.

(Also, interpreter dispatch uses computed GOTOs instead of the plain switch on C compilers that support it.)

PyComfy · on Feb 8, 2018

ah sorry. i did

    import dis
    dis.dis("a = foo.bar(b)")

which gave

    1           0 LOAD_NAME                0 (foo)
                2 LOAD_ATTR                1 (bar)
                4 LOAD_NAME                2 (b)
                6 CALL_FUNCTION            1
                8 STORE_NAME               3 (a)
               10 LOAD_CONST               0 (None)
               12 RETURN_VALUE

tom_mellior · on Feb 8, 2018

Ah, OK. Yes, those LOAD_NAMEs are slower than LOAD_FAST. If you put the code into a function, you get this:

    >>> def f(foo, b):
    ...     a = foo.bar(b)
    ... 
    >>> dis.dis(f)
      2           0 LOAD_FAST                0 (foo)
                  3 LOAD_ATTR                0 (bar)
                  6 LOAD_FAST                1 (b)
                  9 CALL_FUNCTION            1 (1 positional, 0 keyword pair)
                 12 STORE_FAST               2 (a)
                 15 LOAD_CONST               0 (None)
                 18 RETURN_VALUE

LOAD_FAST is the normal case for locals inside a function. Not sure off the top of my head where LOAD_NAME would be generated in normal usage (i.e. where you don't evaluate code from a string).

Edit: Also, I'm talking about Python 3. Maybe you aren't.

gergo_barany · on Feb 7, 2018

> Where are the main blowouts in python performance?

I did some research a few years ago that tried to quantify some of this. If you trust my methodology, the biggest problems (depending on application, of course) are: boxing of numbers; list/array indexing with boxed numbers and bounds checking; and late binding of method calls. Basically, doing arithmetic on lists of numbers in pure Python is about the worst thing you can do.

And it's not just due to dynamic typing: Even if you know that two numbers you want to add are floats, they are still floats stored in boxed form as objects in memory, and you have to go fetch them and allocate a new heap object for the result.

The basic idea of my study was as follows: Compile Python code to "faithful" machine code that preserves all the operations the interpreter has to do: dynamic lookups of all operations, unboxing of numbers, reference counting. Then also compile machine code that eliminates some of these operations by using type information or simple program analysis. Compare the execution time of the different versions; the difference should be a measure of the costs of the operations you optimized away. This is not optimal because there is no way to account for second-order effects due to caching and such. But it was a fun thing to do.

The paper, with data for a set of benchmarks, is here: http://www.complang.tuwien.ac.at/gergo/papers/dyla14.pdf

As for how to improve this, I think Stefan Brunthaler did the most, and the most successful, work on purely interpretative optimizations for Python. Here is one paper that claims speedups between 1.5x and 4x on some standard microbenchmarks: https://arxiv.org/abs/1310.2300

Basically, you can apply some standard interpreter/JIT optimization techniques like superinstructions or inline caching to Python. But these things are hard to do, they won't matter for most Python applications, and come with a lot of complications.

cturner · on Feb 10, 2018

I have not yet done detailed study, but your paper appears to be a fabulous resource. The context from your post is high-value also. Thanks.

laike9m · on Feb 12, 2018

Thank you so much for posting the paper!

shalabhc · on Feb 7, 2018

A good writeup is here: http://faster-cpython.readthedocs.io/notes_2017.html

SnowflakeOnIce · on Feb 7, 2018

tl;dr: Python's dynamic features add lots of overhead to every operation, and CPython's simple implementation means you pay the overhead even when you don't use the dynamic features.

A few things quickly come to mind, after having maintained a patched version of Python 2.7:

- The dot operator (e.g. `foo.x`) hides a /very complicated/ resolution process that can be /very expensive/. (The documentation about this process also deceptively makes you /think/ you understand how it all works, whereas you probably don't unless you're intimate with the C implementation.)

- Global variables are slower to access than local variables in CPython: the former require hash table operations, whereas the latter are array operations. Global variables can also be of pretty much any type, not just strings, which further complicates how globals are handled.

- `import` statements are idiomatically done at the top-level of a module, and often are used as qualified imports! E.g., `import os` followed by the use of `os.path.join(foo, bar)` later on. This hits the costs of both global variables and the dot operator.

- Other syntactically simple constructs, like indexing, relational operators, `len(foo)`, etc, all support overloading, increasing the complexity of the implementation of these operators.

- CPython has a simple implementation (bytecode interpreter, not really any optimizations), meaning the cost to support overloading and dynamism is /always paid/.