How are interpreters designed in Rust? What parts can be safe and what can't be?
What is the internal representation of a Python object? In particular how do you mutate a Python object while there may be other references to it - is it UnsafeCell, is there some checking?
How does the GC work? The GC needs to access and perhaps mutate all Python objects. In C++ a common technique is to make a linked list of scopes, stored in native stack frames; I'm not sure how that can be done in safe Rust.
I was curious too, so I poked around! I had never looked at this codebase before this comment, so, I may have some things wrong.
> How are interpreters designed in Rust? What parts can be safe and what can't be?
Depends on the design. Different interpreters are different.
A quick glance at the source code seems to show unsafe in:
* A data structure called "boxvec"
* Some calls to unreachable_unchecked, get_unchecked, and the like. Wonder what the overheads were here.
* Some lock/mutex implementations
* A few unsafe functions in the jit. They use cranelift. Bet there's a bunch in there too. Usually this is the classic thing that needs unsafe in interpreters.
* The python object data structures; they do some stuff to have a smaller representation
* some FFI code in the standard library, calls directly to libc
> What is the internal representation of a Python object?
Traditionally, Python uses refcounting + cycle detection. A quick glance at their issue tracker implies that as of last month, they had recounting but no cycle checks yet.
> In C++ a common technique is to make a linked list of scopes, stored in native stack frames; I'm not sure how that can be done in safe Rust.
Looks like they have a RefCell<Vec<FrameRef>>, instead.
i wanted to add lambdas to JS, this was back in 2001 i think, so i had this compiler from my compiler class, changed it to parse JS (+ my extra features like the ()=>{} syntax) and generate an AST in XML (instead of emitting bytecode), then i made a python script that took the XML and generated minified JS from it.
Neat! I did something similar, not quite as big, but I was on a version of Python (Jython) that didn't yet have generator expressions, so I made a generator expression evaluator, gg("x for x in it") that returned a running generator. I was about a week away from shipping an ETL tool to the customer and this allowed me to hit that deadline.
> LALRPOP in fact uses LR(1) by default (though you can opt for LALR(1)), and really I hope to eventually move to something general that can handle all CFGs (like GLL, GLR, LL(*), etc).
Seems overkill for a language whose grammar is LL(1)?
Although the CPython implementation contains an LL(1) parser (which is being deprecated for a new PEG parser), the grammar contains bits that are not context-free, which involved a pre-parsing step before being fed into the LL(1) parser. That structure isn’t particularly good and it’d be beneficial for a new implementation to use something else.
> Python 3.9 uses a new parser, based on PEG instead of LL(1). The new parser’s performance is roughly comparable to that of the old parser, but the PEG formalism is more flexible than LL(1) when it comes to designing new language features. We’ll start using this flexibility in Python 3.10 and later.
That Python can be described by an LL(1) grammar is frequently mentioned in discussions, but I don't know if it's formally documented anywhere.
More specifically, the grammar [0] is in EBNF form and is ELL(1). That means that if each EBNF production is converted to a set of BNF productions in an LL(1)-compliant way, the BNF grammar as a whole is LL(1). It seems that the Python tools themselves do not check this [1], but I have verified it myself.
However, as another commenter mentioned, the grammar doesn't exactly describe Python, but a superset of it. The compiler needs to perform further checks after parsing that "should" be part of the parser itself. A more advanced parser would allow these checks to be performed in the correct place - this is probably what RustPython does when using LR(1), and was one reason why CPython recently replaced its LL(1) parser with one based on PEG.
Parsing a superset of the language and then using extra checks to detect bad syntax accepted by the superset is a well known and very effective way of being error tolerant and implementing decent error recovery.
It is a great strategy. One of my favorite examples of this is Rust. When talking about what synatax "await" should use, we decided on a postfix syntax: "foo.await" . But there was concern that because JavaScript and other languages use "await foo", it would be confusing.
Solution?
async fn foo() {
}
fn main() {
let x = async { await foo() };
}
error: incorrect use of `await`
--> src/main.rs:6:13
|
6 | let x = async { await foo() };
| ^^^^^^^^^^^ help: `await` is a postfix operation: `foo().await`
Do exactly that: parse the incorrect form and emit a good diagnostic.
I suspect an interpreter that "solves" the GIL problem will break tons of modules that currently only work in concurrent contexts because the GIL exists.
Jython and IronPython can work without a GIL more easily than CPython because they don’t support native (C-based) extension modules. Does RustPython intend to support existing Python extensions written in C?
They need some form of locking around Python objects accessed from multiple threads, since Python has no ownership model. Do they add a lock around each individual object?
> They need some form of locking around Python objects accessed from multiple threads, since Python has no ownership model.
"Ownership models" are not a given. C++ has no "ownership model" and it doesn't lock around objects. Multi-threading users are just given the synchronization primitives and are expected to manage their own solutions.
The main thing that's keeping the GIL around is the reference-counting garbage collector. If your python implementation uses its own GC mechanism, it doesn't have this problem.
FreeBSD went through a similar issue back in the late 90s, early 2000s when they added pervasive multithreading into the kernel.
The GIL isn't some halting problem level thing. As you allude to, Python could definitely decide to go multithreaded and add in the requisite locks. They haven't.
I really wish that the PSF would
* Make a spec
* Break out the stdlib into a portable library, and by portable, something that can be shared across PyPy, Graal, Jython, etc.
Most languages with proper multithreading that I've used don't have an ownership model. Most the the standard library collection types are explicitly not thread safe, and then there's separate concurrent versions for if you're writing multithreaded code. And you're given locking/mutex primitives if you want to roll your own stuff.
I’m genuinely curious why people want to work around the GIL in languages like Python and Ruby. I very much do not want my simple interpreted language to inherit the complexity of true concurrency. Why not just use a different language?
If a company has 200+ developers already working on a 10-year-old codebase in the language, then the word "just" has a lot of heavy lifting to do in that question.
The GIL doesn't really let you escape the horrors of concurrency, at least in CPython. Threads can't execute at the same time, but they can interleave in inconvenient ways.
I’ve asked myself this question (in a different context) and have concluded that, in general, developers are incredibly resistant to learning and using new languages.
For example: if they weren’t, then server-side JavaScript would never have become popular.
>I’ve asked myself this question (in a different context) and have concluded that, in general, developers are incredibly resistant to learning and using new languages.
You make it sound like a personal psychological failing.
In real life, using the same language has business benefits: re-use of the same libraries, no duplicated business logic that needs to be used on both client/server, no need for training to the new language, no need for mental context switch due to the language change when working on the client or server part (aside from the essential domain knowledge of client vs server), less/common tooling infrastructure, and so on...
“RustPython is a Python interpreter written in Rust. RustPython can be embedded into Rust programs to use Python as a scripting language for your application, or it can be compiled to WebAssembly in order to run Python in the browser.”
What is the internal representation of a Python object? In particular how do you mutate a Python object while there may be other references to it - is it UnsafeCell, is there some checking?
How does the GC work? The GC needs to access and perhaps mutate all Python objects. In C++ a common technique is to make a linked list of scopes, stored in native stack frames; I'm not sure how that can be done in safe Rust.