Hacker Newsnew | past | comments | ask | show | jobs | submitlogin
A Python interpreter rewritten in Rust, that can run pip (rustpython.github.io)
160 points by andrew-ld on Feb 4, 2021 | hide | past | favorite | 48 comments


How are interpreters designed in Rust? What parts can be safe and what can't be?

What is the internal representation of a Python object? In particular how do you mutate a Python object while there may be other references to it - is it UnsafeCell, is there some checking?

How does the GC work? The GC needs to access and perhaps mutate all Python objects. In C++ a common technique is to make a linked list of scopes, stored in native stack frames; I'm not sure how that can be done in safe Rust.


I was curious too, so I poked around! I had never looked at this codebase before this comment, so, I may have some things wrong.

> How are interpreters designed in Rust? What parts can be safe and what can't be?

Depends on the design. Different interpreters are different.

A quick glance at the source code seems to show unsafe in:

* A data structure called "boxvec"

* Some calls to unreachable_unchecked, get_unchecked, and the like. Wonder what the overheads were here.

* Some lock/mutex implementations

* A few unsafe functions in the jit. They use cranelift. Bet there's a bunch in there too. Usually this is the classic thing that needs unsafe in interpreters.

* The python object data structures; they do some stuff to have a smaller representation

* some FFI code in the standard library, calls directly to libc

> What is the internal representation of a Python object?

https://github.com/RustPython/RustPython/blob/master/vm/src/...

> How does the GC work?

Traditionally, Python uses refcounting + cycle detection. A quick glance at their issue tracker implies that as of last month, they had recounting but no cycle checks yet.

> In C++ a common technique is to make a linked list of scopes, stored in native stack frames; I'm not sure how that can be done in safe Rust.

Looks like they have a RefCell<Vec<FrameRef>>, instead.


This uses [lalrpop](https://github.com/lalrpop/lalrpop) under the hood I believe, which is a pretty awesome parser generator library for Rust.


Nice, long go the days (20 yrs ago) i had to make a JS transpiler with lex+yacc. The pain


had to? or got to? Sounds great, please tell us more.


i wanted to add lambdas to JS, this was back in 2001 i think, so i had this compiler from my compiler class, changed it to parse JS (+ my extra features like the ()=>{} syntax) and generate an AST in XML (instead of emitting bytecode), then i made a python script that took the XML and generated minified JS from it.


Neat! I did something similar, not quite as big, but I was on a version of Python (Jython) that didn't yet have generator expressions, so I made a generator expression evaluator, gg("x for x in it") that returned a running generator. I was about a week away from shipping an ETL tool to the customer and this allowed me to hit that deadline.

I love the AST in XML approach.


> LALRPOP in fact uses LR(1) by default (though you can opt for LALR(1)), and really I hope to eventually move to something general that can handle all CFGs (like GLL, GLR, LL(*), etc).

Seems overkill for a language whose grammar is LL(1)?


Although the CPython implementation contains an LL(1) parser (which is being deprecated for a new PEG parser), the grammar contains bits that are not context-free, which involved a pre-parsing step before being fed into the LL(1) parser. That structure isn’t particularly good and it’d be beneficial for a new implementation to use something else.


> whose grammar is LL(1)?

Just curious, how did you know this?


The Python 3.9 (most recent) release notes:

> Python 3.9 uses a new parser, based on PEG instead of LL(1). The new parser’s performance is roughly comparable to that of the old parser, but the PEG formalism is more flexible than LL(1) when it comes to designing new language features. We’ll start using this flexibility in Python 3.10 and later.

https://docs.python.org/3/whatsnew/3.9.html#new-parser


That Python can be described by an LL(1) grammar is frequently mentioned in discussions, but I don't know if it's formally documented anywhere.

More specifically, the grammar [0] is in EBNF form and is ELL(1). That means that if each EBNF production is converted to a set of BNF productions in an LL(1)-compliant way, the BNF grammar as a whole is LL(1). It seems that the Python tools themselves do not check this [1], but I have verified it myself.

However, as another commenter mentioned, the grammar doesn't exactly describe Python, but a superset of it. The compiler needs to perform further checks after parsing that "should" be part of the parser itself. A more advanced parser would allow these checks to be performed in the correct place - this is probably what RustPython does when using LR(1), and was one reason why CPython recently replaced its LL(1) parser with one based on PEG.

[0] https://docs.python.org/3.8/reference/grammar.html

[1] https://discuss.python.org/t/should-there-be-a-check-whether...


Parsing a superset of the language and then using extra checks to detect bad syntax accepted by the superset is a well known and very effective way of being error tolerant and implementing decent error recovery.


It is a great strategy. One of my favorite examples of this is Rust. When talking about what synatax "await" should use, we decided on a postfix syntax: "foo.await" . But there was concern that because JavaScript and other languages use "await foo", it would be confusing.

Solution?

  async fn foo() {
      
  }
  
  fn main() {
      let x = async { await foo() };
  }

  error: incorrect use of `await`
   --> src/main.rs:6:13
    |
  6 |       let x = async { await foo() };
    |                       ^^^^^^^^^^^ help: `await` is a postfix operation: `foo().await`
Do exactly that: parse the incorrect form and emit a good diagnostic.


Does it include a concurrent garbage collector?

And does it solve the infamous GIL (global interpreter lock) problem?


I suspect an interpreter that "solves" the GIL problem will break tons of modules that currently only work in concurrent contexts because the GIL exists.


Jython and IronPython don't have a GIL so you can see the effect there (it's not as bad as I thought it would be)


Jython and IronPython can work without a GIL more easily than CPython because they don’t support native (C-based) extension modules. Does RustPython intend to support existing Python extensions written in C?


OK, I'll bite: how?

They need some form of locking around Python objects accessed from multiple threads, since Python has no ownership model. Do they add a lock around each individual object?


> They need some form of locking around Python objects accessed from multiple threads, since Python has no ownership model.

"Ownership models" are not a given. C++ has no "ownership model" and it doesn't lock around objects. Multi-threading users are just given the synchronization primitives and are expected to manage their own solutions.

The same is true in Python: https://docs.python.org/3/library/threading.html#lock-object... - in fact, if you're using objects simultaneously from multiple threads without using some sort of locking scheme, you're probably already in trouble...

The main thing that's keeping the GIL around is the reference-counting garbage collector. If your python implementation uses its own GC mechanism, it doesn't have this problem.


FreeBSD went through a similar issue back in the late 90s, early 2000s when they added pervasive multithreading into the kernel.

The GIL isn't some halting problem level thing. As you allude to, Python could definitely decide to go multithreaded and add in the requisite locks. They haven't.

I really wish that the PSF would

* Make a spec

* Break out the stdlib into a portable library, and by portable, something that can be shared across PyPy, Graal, Jython, etc.

* ??? yes


Here's the detail for Jython: https://jython.readthedocs.io/en/latest/Concurrency/

Skip down to "thread safety"


It's not difficulty. It's been done experimentally - exactly as you say. They use locks around everything.

What's hard is doing it without dumpstering single thread performance.


Most languages with proper multithreading that I've used don't have an ownership model. Most the the standard library collection types are explicitly not thread safe, and then there's separate concurrent versions for if you're writing multithreaded code. And you're given locking/mutex primitives if you want to roll your own stuff.


I’m genuinely curious why people want to work around the GIL in languages like Python and Ruby. I very much do not want my simple interpreted language to inherit the complexity of true concurrency. Why not just use a different language?


If a company has 200+ developers already working on a 10-year-old codebase in the language, then the word "just" has a lot of heavy lifting to do in that question.


The GIL doesn't really let you escape the horrors of concurrency, at least in CPython. Threads can't execute at the same time, but they can interleave in inconvenient ways.


> Why not just use a different language?

I’ve asked myself this question (in a different context) and have concluded that, in general, developers are incredibly resistant to learning and using new languages.

For example: if they weren’t, then server-side JavaScript would never have become popular.


>I’ve asked myself this question (in a different context) and have concluded that, in general, developers are incredibly resistant to learning and using new languages.

You make it sound like a personal psychological failing.

In real life, using the same language has business benefits: re-use of the same libraries, no duplicated business logic that needs to be used on both client/server, no need for training to the new language, no need for mental context switch due to the language change when working on the client or server part (aside from the essential domain knowledge of client vs server), less/common tooling infrastructure, and so on...


this is a false dichotomy - witness the numerous parallel lisps (and schemes, etc)


This is awesome! As fan of Rust and Python as I am I can't be more excited for this... great work everyone!!!


This is really cool.


what are the latest benchmark numbers of rustpython?


.


“RustPython is a Python interpreter written in Rust. RustPython can be embedded into Rust programs to use Python as a scripting language for your application, or it can be compiled to WebAssembly in order to run Python in the browser.”


They really have a project going, with 2000 PRs merged it's looking good. (Edit: oh they use some code I've written too, that's humbling)


It is a reasonably complicated Python script, so is a nice milestone of progress.


Why not? Someone thought it might be an interesting challenge, and so they did.


A single period means is interpreted as "Why?"?


I think they edited their comment to replace whatever was there with "."


Source?


I was asking. That's how people in the thread appeared to interpret it.


No: in bash, '.' means the same thing as 'source'.


Slightly misleading title. "A Python interpreter written from scratch in rust can now run 'pip'" is more accurate. The standard pip does not use Rust.


Ok, we've changed the title from "Now pip runs on a Python interpreter rewritten from scratch in rust". Hopefully less misleading. Thanks!


Wow.

Looks like Rust is winning the catfight against Go, D, Julia, C++20.

Need to grab more popcorn.


There's no "catfight".

Julia targets a different audience.

Go, for the most part, too.

C++20 is not going anywhere, and C++ is used more than ever.

D hasn't been a contender for 10+ years. It could technology wise, but it never gained traction.

All of Rust, Go, Julia, and C++ do well.

(Heck, C even continues to do well. Rust is not really eating into it).

It remains to be seen how the Mozilla situation (abhorent leadership, firings of Rust team people) will affect Rust.


And why would that be?




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: