Namespacing isn't the problem here: an ecosystem with two (or more) levels of na...

fiddlerwoaroof · on May 20, 2023

I think maven central has the right idea, but it could be generalized.

Require people submitting packages to prove they own a domain and use the domain as the namespace representing the user. We could then use SRV records or TXT records to specify where a domain’s packages are authoritatively hosted and then have conventions to find a signing key for a given domain that can be used to verify cached packages in non-authoritative hosts.
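
As a rough illustration of the TXT-record idea, here is a minimal sketch of how a client might read a discovery record once the TXT strings for a domain have been fetched. The record layout (`v=pkg1; host=...; key-sha256=...`) and the function name `parse_package_record` are assumptions made up for this example; no such convention exists today, and the DNS lookup itself is left out.

```python
# Hypothetical record layout, invented for illustration only:
#   "v=pkg1; host=https://pkgs.example.com; key-sha256=<hex digest>"
# The TXT strings are passed in; fetching them from DNS is out of scope here.

def parse_package_record(txt_records):
    """Return the fields of the first record declaring v=pkg1, or None."""
    for record in txt_records:
        fields = {}
        for part in record.split(";"):
            # Split each "name=value" pair on the first "=" only.
            name, sep, value = part.strip().partition("=")
            if sep:
                fields[name.strip()] = value.strip()
        # Ignore unrelated TXT records (SPF, site verification, ...).
        if fields.get("v") == "pkg1" and "host" in fields:
            return fields
    return None
```

A domain usually carries several unrelated TXT records, so the parser skips anything that does not declare the assumed `v=pkg1` version tag.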

woodruffw · on May 21, 2023

Again: namespacing isn't the problem (or a totalizing solution) here.

I admire Maven's approach on a conceptual level, but it leaves a lot to be desired on a technical level: it effectively gatekeeps the entire packaging ecosystem on the ability to purchase and maintain a domain name, which (1) is unnecessarily exclusionary, and (2) assumes trust in a namespace (DNS) that was never really designed to be authenticated or a source of stable, permanent identifiers (domains change hands all the time, including for hostile reasons).

A modern version of Maven's scheme would bootstrap on well-known URIs over HTTPS instead, but even this would still represent a significant imposition on PyPI uploaders.

fiddlerwoaroof · on May 21, 2023

I think the goal is not to have "pypi uploaders" but instead have a discovery system designed to enable safe caching of artifacts and automatic discovery of where an artifact is hosted.

The exclusionary aspects can be mitigated via somewhere like github offering package hosting for their users. $10/yr for a domain name also isn't all that much.

woodruffw · on May 21, 2023

I don't know how that's panned out for the Java ecosystem, but distributed hosting is something that PyPI moved away from (with a great deal of pain and heartburn) years ago. Previous versions of PyPI were a "pure" index, with distribution hosting being provided by individual projects. This made package installation exactly as reliable and secure as the least reliable and least secure host.

fiddlerwoaroof · on May 22, 2023

You can set up a system like DNS, where downstream systems can cache verifiable versions of the packages hosted at the authoritative source to help here. If the lock file for the project records signatures or similar identifiers for the project’s dependencies, the exact source of the bits matters a lot less, reducing the reliance on package hosts remaining up indefinitely.
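
To make the lock-file point concrete, here is a minimal sketch that assumes the lock file stores a SHA-256 digest per dependency. The helper names `verify_artifact` and `first_verified`, and the mirror names in the usage below, are hypothetical.

```python
import hashlib

def verify_artifact(data: bytes, locked_sha256: str) -> bool:
    """True if the downloaded bytes match the digest recorded in the lock file."""
    return hashlib.sha256(data).hexdigest() == locked_sha256.lower()

def first_verified(candidates, locked_sha256):
    """candidates: iterable of (source_name, bytes) pairs from any hosts or caches.

    Returns the name of the first source whose bytes match the locked digest,
    or None if every candidate fails verification.
    """
    for source, data in candidates:
        if verify_artifact(data, locked_sha256):
            return source
    return None
```

With a check like this, a cache or mirror only needs to be reachable, not trusted: a tampered copy fails and the client moves on to the next source. pip's hash-checking mode (`--require-hashes`) applies the same idea to requirements files.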

Symbiote · on May 21, 2023

Maven supports multiple package repositories, and it's not unusual to see small packages in a self-hosted repo. That could be a self-hosted HTTP server, S3 or similar, or GitHub's package repo.

The hostname of the repository needn't match the name space of the packages it hosts.

ryan29 · on May 21, 2023

> Again: namespacing isn't the problem (or a totalizing solution) here.

The current system works better for impersonators and squatters than it does for me as a legitimate participant, so, in my opinion, that's at least part of the problem. Domain validated namespaces go a long way towards solving that.

> it effectively gatekeeps the entire packaging ecosystem on the ability to purchase and maintain a domain name

There's no reason there can't be a default namespace similar to what Bluesky did. By default you get @example.bsky.social and can optionally switch to a custom domain. The main thing that Bluesky got wrong is they recycle your original handle when you switch to a custom domain and I think it should be (optionally) kept as an alias to prevent impersonation.

> assumes trust in a namespace (DNS) that was never really designed to be authenticated

It's trusted enough for MS365, Google Workspace, ACME (protocol), and basically everything on the internet at this point. PyPI.org is using a domain validated SSL certificate and (AFAIK) distributes unsigned packages. I view that as about the same amount of risk as using domain validation for namespaces.

> or a source of stable, permanent identifiers (domains change hands all the times, including for hostile reasons).

Domains changing ownership can be handled reasonably well by treating the domain as a mutable pointer to an immutable namespace. The immutable namespace would be the source of truth that would be trusted by clients and would point back to the domain so they reference each other. If either pointer changes the client can warn the user and let them decide how to handle it. By using the immutable namespace after following the mutable (domain) pointer once, you eliminate a lot of the value of stealing a domain to take over a namespace. If you flag domains that have recently started pointing to a different immutable namespace (ie: a new domain or owner), it acts as a cautionary warning so users know they shouldn't be blindly trusting the namespace.
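
A minimal sketch of the client-side check described above, assuming the client pins the immutable namespace ID the first time it resolves a domain. The `Pin` type, the `check_namespace` function, and the three return values are invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Pin:
    domain: str
    namespace_id: str  # immutable namespace ID recorded on first resolution

def check_namespace(pin: Optional[Pin], domain: str,
                    domain_points_to: str, namespace_points_to: str) -> str:
    """Return 'trust-on-first-use', 'ok', or 'warn'.

    domain_points_to:    namespace ID the domain currently advertises
    namespace_points_to: domain the immutable namespace claims in return
    """
    if namespace_points_to != domain:
        return "warn"  # the two pointers no longer reference each other
    if pin is None:
        return "trust-on-first-use"  # nothing pinned yet; the UI can flag it as new
    if pin.namespace_id != domain_points_to:
        return "warn"  # the domain now points at a different namespace
    return "ok"
```

After the first resolution the client relies on the pinned namespace ID, so taking over the domain later only produces a warning rather than silently redirecting installs.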

Dealing with banned domains would likely be a pain point, but I don't think banning an account changes much in terms of human factors involved. On the domain side of things I'd probably maintain a public list of permanently banned domains, at least to start with, and see what problems that creates before trying to find solutions. Non-negotiable, permanent banning might even be ok.

For me personally, domain validated namespaces would be a huge improvement. I'm sick of the gold rush and having to chase my handle on every platform. Even if I get it, I'm still indistinguishable from a bad actor on most platforms because I have very little activity and nothing popular.

At least with domain verified namespaces someone who doesn't know me can go from pypi.org/@example.com to github.com/@example.com or example.com and be nearly guaranteed it's me (ignoring domain hijacking). That makes it easier to make a judgment call on whether or not to trust me because most of my activity will show up at, or be linked to from, those places.

Plus, as a package becomes more popular, the recognizability of the owner's handle also grows. With a single, domain validated namespace, that's only one definitive thing for people to have to recognize which cuts down on the cognitive load of trying to figure out if mismatched namespaces are legitimate or someone trying to engineer an attack.

I think the current system for namespaces is a big chunk of the problem because it's non-authoritative in the context of knowing who owns a namespace. By switching to domain validated namespaces a lot of that ambiguity disappears because you know that example.com, pypi.org/@example.com, github.com/@example.com, twitter.com/@example.com, etc. are all the same person or organization. That means that as soon as you know you can trust them on one of those platforms, you can trust them everywhere else because, as long as the UI flags new domain pointers to combat domain hijacking, you've basically removed impersonation and squatting as a potential threat.

So, maybe it's not perfect, but, in my opinion, it adds enough value that it's worth considering as an option. I think it's great on Bluesky because I find it so much easier to determine whether or not people are legitimately who they say they are and not some prankster or impersonator. For example, I'm following @tailscale.com on Bluesky and I know it's them.

blibble · on May 21, 2023

> which (1) is unnecessarily exclusionary

meanwhile pypa spent months making it so the only trusted publisher is microsoft

woodruffw · on May 21, 2023

This is misleading: the trusted publisher is GitHub's OIDC IdP, which happens to be the single most common publishing source for PyPI (as well as the first IdP to give us all the features we needed). It has absolutely nothing to do with Microsoft; I don't think Microsoft (as a discrete corporate entity) was even aware of the work.

The feature was intentionally built to be extensible and to support additional publishers. I know this because I'm the one who made it extensible, and who's working on supporting them[1][2].

[1]: https://github.com/pypi/warehouse/issues/13551

[2]: https://github.com/pypi/warehouse/issues/13575

lolinder · on May 21, 2023

Does Maven have these problems? I haven't seen it if so.

woodruffw · on May 21, 2023

Maven doesn't have these problems, but not because of namespacing itself. The adjacent thread has the details on that.

lolinder · on May 21, 2023

Usually with something like this it's the whole system that makes it work: namespacing alone may not be enough to solve all our package manager woes, but can a package manager solve them without namespacing?

mike_hearn · on May 21, 2023

The Java ecosystem doesn't have these problems for several reasons, not just namespacing:

1. The Java ecosystem has no way to distribute executable programs. Maven is only for libraries. There is literally no equivalent of "npm install" or "pip install". This is not a strength! IMHO it's one of the things that pushes people away and towards scripting languages. Being able to distribute programs as well as libraries is a very useful feature. It does, however, mean that there's very little value to be had in pushing malware to Maven Central because the instructions for how to run anything from it would be more complicated than just distributing the JARs yourself. State actors might want to compromise widely used libraries, but that's a different kettle of fish.

2. Not only can you not distribute programs but Maven resolvers don't execute any code. Dependencies are statically declared and the "installation" process just involves walking the dependency tree, downloading the JARs to a file system cache and then computing a list of those JARs for the build system. There's no equivalent of setup.py.

You might be wondering, how then do you distribute libraries that require native code? Well, the Java ecosystem has no direct support for that either. You'll have to pre-compile your native libs for each OS and CPU arch that users might want, distribute the libraries as data resources in your JAR, extract them to somewhere in the user's home directory at runtime and load them from there. (There's also no standard mechanism for this so every library rolls its own).

3. Maven Central (but not the clients) requires that all uploads are PGP signed. So uploading bad releases of existing libraries requires stealing a signing key.

4. The JVM ecosystem is oriented around larger but fewer libraries. Although individuals do create and upload libraries of course, the huge standard library means you don't need to rely on them so often. There is no leftpad library for Java. Many dependencies will come from Google, Apache, JetBrains, etc. When you do rely on individuals they tend to have been around a long time, their libraries are relatively well known, have a lot of entries in the issue tracker, long lived mailing lists and other unforgeable evidence of real-ness.

5. All operating systems come with Python built in these days, even Windows. In contrast no OS comes with Java out of the box. So if you just want people to run malware Python is a better bet because it's just one command, users won't have to install Python itself.

I don't think having a central server is that big a difference in the grand scheme of things, though it does mean someone owns the problem of clearing bad packages were one to get in.
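
Point 2 can be sketched as a plain graph walk over declared metadata. This is a simplified illustration, not Maven's actual resolver: it ignores version conflict resolution, scopes, and exclusions, and the `declared` mapping stands in for parsed POM files.

```python
from collections import deque

def resolve(root, declared):
    """Walk statically declared dependencies breadth-first.

    declared maps an artifact name to the list of artifacts it depends on.
    Only this data is read; no code from any package is executed.
    Returns every reachable artifact exactly once, in visit order.
    """
    seen = {root}
    order = []
    queue = deque([root])
    while queue:
        artifact = queue.popleft()
        order.append(artifact)
        for dep in declared.get(artifact, []):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return order
```

The resulting list is what gets downloaded into the cache and handed to the build system. Nothing in the walk needs to run code from the packages, which is the contrast with setup.py.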

lolinder · on May 21, 2023

Points 1-4 all sound like substantial strengths to me.

1. Who decided that the same package manager needed to be in charge of downloading libraries for CI pipelines and finished executables? Combining the two muddies the water and makes it unclear if a given package is going to provide a library or execute code or both. Bad actors thrive in confused spaces.

2. This comes down to the same point: I don't see why the package manager for coordinating dependencies in a software project ought to be the same thing as the package manager for installing applications in an OS. They're different use cases, usually different sets of users, and, as you note, different sets of requirements. Nothing good comes from a library being able to execute code as it is downloaded, so Maven lacks that feature because that's not the use case it's targeting.

3. Sounds fabulous. When is this feature coming to PyPI?

4. This doesn't seem like an incidental difference, it's the natural result of the design of the system, and a big part of the reason for the big-package culture is Maven Central. If the Python and JavaScript ecosystems made package ownership more explicit, the culture of thousands of code dependencies would be more obviously a culture of thousands of human dependencies, and I think people would be much more freaked out with the status quo.

5. See points 1 and 2 above: the Python and JS ecosystems make a muddled mess of the distinction between development and distribution, and the result is a dangerous hybrid package manager that is frustrating for either use case.

mike_hearn · on May 22, 2023

1. Agree that having a clear separation at the UI level is of value, but a program is a set of modules that depend on each other, no different to a library. Where you start needing very different things is when distributing software to end users who aren't developers, but there's a lot of small utility programs, demos etc that would benefit from being distributed the same way as libraries. There's a reason that so many dev frameworks these days start by telling you to install a framework-specific CLI tool to help you with things.

2. I'm thinking here of cases where the users aren't really different. Apps for developers, for example. Should have made that clear, sorry!

3. PyPI already tried to use PGP signatures and gave up. It doesn't add much security over just having a good authentication system on a central server. Note that nothing checks the signatures except, I think, Maven Central itself, so it doesn't help in case of server compromise.

We're in agreement that the situation with Python/JS (and to a lesser extent Ruby?) is a real mess, but I don't see it as due to any fundamental design choices. More like a set of fuzzy accidents of history. If someone added a new command to mvn or gradle tomorrow, for example, (1) might stop being true.