I’ve seen multiple comments stating that namespacing solves (or at least mitigates) these kinds of attacks. Could someone kindly explain how? I’m sure it’s straightforward, but I don’t see it.
It doesn't, at least not directly. Namespaces make typosquatting more difficult, but they don't remove the other main incentives for spamming an index with inauthentic users and packages, i.e.:
1. Sneaking an inauthentic dependency into a tree somewhere;
2. Convincing less experienced users to install your package directly.
My understanding is that (2), in particular, is a growing problem in cryptocurrency and other communities: inexperienced users typically talk on Discord and other chats, and may not fully understand that `pip install foo` essentially means "allow a random person to run code on my machine."
Indeed it can, and in fact has to by necessity in many cases: `pip install` boils down to (a variant of) `python setup.py install` for many packages, where `setup.py` is package-controlled arbitrary Python code.
This is also true for subcommands like `pip download`, as well as `pip install --dry-run`.
(There's some subtlety here, since this only applies to source distributions and not wheels. But source distributions are still the "baseline" way to distribute Python packages.)
> and a package needs to be imported by a script before anything gets run.
Even with wheels, this part is unfortunately not true: a package can install a `.pth` file[1], which can be used to auto-load code on Python interpreter startup, even if the malicious package itself is never imported directly.
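To make the `.pth` point concrete, here is a minimal and deliberately harmless sketch. It assumes CPython's standard `site` module behavior, where a line in a `.pth` file that starts with `import` is executed rather than treated as a path. The file name `demo.pth`, the `PTH_DEMO_RAN` variable, and the `run_pth_demo` helper are made up for the demo, and `site.addsitedir()` stands in for what the interpreter does to site-packages at startup.

```python
import os
import site
import sys
import tempfile
from pathlib import Path
from typing import Optional


def run_pth_demo() -> Optional[str]:
    """Write a harmless .pth file to a temp dir and let `site` process it."""
    os.environ.pop("PTH_DEMO_RAN", None)
    with tempfile.TemporaryDirectory() as tmp:
        # A .pth line starting with "import " is executed, not added as a path.
        Path(tmp, "demo.pth").write_text(
            "import os; os.environ['PTH_DEMO_RAN'] = '1'\n", encoding="utf-8"
        )
        # Same processing the interpreter applies to site-packages at startup.
        site.addsitedir(tmp)
        # Clean up the sys.path entry that addsitedir() appended.
        sys.path[:] = [p for p in sys.path if p != os.path.abspath(tmp)]
    return os.environ.get("PTH_DEMO_RAN")


if __name__ == "__main__":
    print(run_pth_demo())
```

Nothing here imports a package named `demo`; the side effect happens purely because the `.pth` file was processed.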
There is a long-term goal to remove setup.py in favor of letting "a thousand flowers bloom".
In modern Python setups, the pyproject.toml contains an entry for the 'build-backend'. There is a fall-back to 'setuptools.build_meta:__legacy__' because a large number of projects still use setup.py.
The Python Packaging User Guide uses hatchling as its default example in https://packaging.python.org/en/latest/tutorials/packaging-p... ("You can choose from a number of backends; this tutorial uses Hatchling by default, but it will work identically with setuptools, Flit, PDM, and others that support the [project] table for metadata.")
These do not use setup.py.
I use setup.py because the new build backends don't seem to support C extensions all that well. My package also has Cython code, and some Python-based code generation for the C code.
Another package I know, RDKit, by default automatically downloads missing packages if not available. (These packages are provided as source from the respective authors.)
I think distribution people really don't want the build step to be able to run arbitrary code, but they face a long, up-hill battle.
Just as a note: build backends still run arbitrary code. You’re exchanging a `setup.py` for an arbitrary Python package that specifies the build backend, and packages can specify their own backend (including one embedded in their own source).
Python as an ecosystem will probably never be able to fully remove build-time code execution from sdists, since native extensions fundamentally rely on it. The best we can do is unify and streamline them so that as many people upload wheels as possible.
Thank you. I don't think I was clear enough that the desire really doesn't match reality, in part because I'm not clear about that myself. I don't have a deep understanding of the new way of doing things, having only read about them whilst trying to understand if/when I should migrate from setup.py.
> including one embedded in their own source
I didn't know about that! I thought it needed to be a pre-build dependency.
It's a lot easier to typo a package (e.g. requests -> request results in the wrong package) than it is to typo a namespace-package (@namespace/requests -> @namespace/request would result in an error).
Similarly, namespaces can build trust in a way that single packages can't.
With namespacing you still run the risk of @namespace/requests -> @namsepace/requests typos.
These may not be all that obvious. For example, the name of the popular Rust HTTP client "reqwest" is an intentional typo. tokio vs tokyo can also be a less than obvious typo.
On the other hand, this opens up two potential places to make a typo: the name and the namespace.
This happens to me often on GitHub: you keep browsing some code, only to realize later that you are in someone’s personal fork of the official project. Especially when the original author isn’t a big organization, you only have two equally arbitrary usernames as namespaces to compare against.
If the namespace is handled correctly, you won't accidentally download a malicious dependency if you get the namespace right. You'll get an error, but that's what you want for that kind of typo.
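Here is a toy model of that argument, not how any real index works. `ToyRegistry`, the user names, and the rule "the first publisher owns the namespace" are all assumptions made for illustration. It shows the three cases from this thread: squatting inside someone else's namespace is rejected, a name typo inside the right namespace is an error, and a typo in the namespace itself can still reach an attacker.

```python
class ToyRegistry:
    """Hypothetical index where each namespace has exactly one owner."""

    def __init__(self) -> None:
        self.namespace_owner: dict[str, str] = {}
        self.packages: dict[tuple[str, str], str] = {}

    def publish(self, user: str, namespace: str, name: str) -> None:
        # Assumption: whoever publishes first owns the namespace.
        owner = self.namespace_owner.setdefault(namespace, user)
        if owner != user:
            raise PermissionError(f"{user} does not own @{namespace}")
        self.packages[(namespace, name)] = user

    def resolve(self, namespace: str, name: str) -> str:
        key = (namespace, name)
        if key not in self.packages:
            raise LookupError(f"@{namespace}/{name} does not exist")
        return self.packages[key]


def demo() -> dict:
    reg = ToyRegistry()
    reg.publish("maintainer", "namespace", "requests")
    results = {}
    try:
        # The attacker cannot place a typo'd name inside the real namespace.
        reg.publish("attacker", "namespace", "request")
    except PermissionError:
        results["squat_in_namespace"] = "rejected"
    try:
        # So a name typo with the right namespace fails loudly.
        reg.resolve("namespace", "request")
    except LookupError:
        results["name_typo"] = "error"
    # But the attacker can register a look-alike namespace.
    reg.publish("attacker", "namsepace", "requests")
    results["namespace_typo"] = reg.resolve("namsepace", "requests")
    return results


if __name__ == "__main__":
    print(demo())
```

So the protection only holds for the half of the identifier you typed correctly.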
The cryptic username problem is indeed a bother, but popular projects often use readable names for their GitHub accounts. There's no real fix for that if you depend on a much smaller project. When you think you may be running that risk, you should probably ask yourself whether it's a good idea to depend on a project made by someone small enough not to be instantly recognizable within your specific programming niche.