Pip and cargo are not the same (williammanley.net)
81 points by pabs3 on Jan 26, 2023 | 101 comments


Cargo has the benefit of being much later to the game, and has thoughtfully incorporated lessons from the past. It's easier to work with 'downstream', and seems like a good model for new languages/runtimes to start with for their own packaging ecosystems.


Python has had multiple points in its history where it could have fixed its env and package problems. For reasons I don't understand, the PSF has never taken a hard stance on a solid way forward; it's always been pushed back as a community issue.

The problem is lack of leadership on this more than anything


It has.

So many things were fixed in python packaging.

   - "-m" was introduced.
   - Wheels replaced eggs.
   - The manylinux target now exists.
   - PyPI was scaled up, a new API was introduced, and 2FA was added.
   - setup.py got replaced by setup.cfg, and now pyproject.toml is gaining speed.
   - The py launcher is a thing.
   - Import priority changed.
   - importlib changed.
   - Zipapps were added.
   - venv ships with Python.
   - Pip replaced easy_install, then ensurepip was created.
   - The new dependency solver was funded.
   - Distutils has been sunsetted; setuptools is on its way out.
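
A quick, illustrative sketch of a few of these in action (the project and module names are made up):

    # isolated environment with the bundled venv module
    python -m venv .venv
    . .venv/bin/activate

    # ensurepip guarantees pip is present; "-m pip" avoids PATH confusion
    python -m ensurepip --upgrade
    python -m pip install requests

    # bundle a directory of code into a single runnable zipapp
    python -m zipapp myapp -m "myapp.cli:main"
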
This doesn't include the dozens of third-party projects aimed at improving packaging that came out in the last 2 decades. Pip-tools by itself has changed the game.

A lot has been done, and the situation is 10 times better than it used to be. It is still bad, but not because nothing was done, simply because there is still a lot to do.

People don't realize the sheer scale of the task, and the inertia that comes with a language as successful and old as Python.

You have a language that is 4 years older than Java, highly dynamic, yet uses fortran/c/assembly code in very popular libs (scipy/numpy) across all popular OSes, plus webassembly and arm, in 32 and 64 bits. People routinely install multiple versions of Python on their machine, linux repos freeze the upgrades and split the packages, the mac and windows stores ship broken versions, and half the Python userbase is composed of people who are not professional coders and can't use a terminal. Of course companies will complain if you change anything that breaks their 5-year-old server, and devs will complain if they don't get the latest great feature.

It's a really, really hard problem, dealt with by a FOSS community that 10 years ago was still running mostly on volunteers and the budget of a small start-up. And all that just to get heat from comments on social media and no thanks for the effort, as if you owe total strangers your free work, flawless on top of that.

Not to mention, as you can imagine, packaging is not the only thing the core devs have to work on.


> This doesn't include the dozens of third-party projects aimed at improving packaging that came out in the last 2 decades.

One would think that two decades of package improvement initiatives should have been a strong enough signal to PSF to prioritize building an official standard solution.


It has. It's just very slow.

A third party can ignore legacy. It can break stuff. It can avoid supporting some platforms. It can have a big bug and push a fix the day after. It can skip red tape. It can document after the fact. It can drop features the next year. It can require pypi dependencies when the stdlib is not sufficient. It can be by a single author who doesn't have to ask anybody what they think before creating something. It can skip security issues and focus on practicality.

CPython cannot.

So any change to the packaging story will always take years, even for the smallest thing.

That's the same for removing the GIL: touching the C API is a huge deal for the scientific and machine learning stack.

That's why requests has never been included in python despite being vendored with pip: you can never update it fast enough once it's in the stdlib.

That's why we don't have a node_modules equivalent yet: autoloading code from the current directory would be a big security risk for a language that is included in many OSes by default, so we need a good design.

Don't assume the team is incompetent or deaf. Given what has already been achieved, that would be a total mismatch.


> Don't assume the team is incompetent or deaf. Given what has already been achieved, that would be a total mismatch.

I am not making any such assumptions (and I don't think most sane people are either); people are just unhappy with the lack of an official standard solution given that it was clear quite a while ago that one was needed. I also don't understand the comparison with the GIL or the C API - changes to those could be breaking changes, whereas a fresh officially recommended solution would by definition be a new solution, i.e. not a breaking change. It can be introduced and it can go through iterations while people take their time to migrate from the myriad of third-party packages (or not, if they are happy with what they are using).

What people who don't want to deal with choosing between third-party tools, or putting their faith in their longevity, are looking for is a solution that is official, part of the standard library, and managed by the PSF. That way the decision is basically made for them, e.g. like it is done in Rust via cargo.


Fresh new solutions are what put us in this situation: distutils2 was fresh, setuptools was fresh, pip was fresh, wheels were fresh... So now you have a lot of complexity that comes from having a lot of solutions in parallel that are slow to sunset.

But the dirty secret of python packaging is that most of the problems don't come from packaging, but from bootstrapping Python. And that is a huge can of worms I haven't even mentioned.

Also, plenty of things require the participation of other communities, like debian splitting pip out of the main package or anaconda not being compatible with pip.

All in all, the way you hand-wave the problem away is typical of critics who have a very narrow view of the situation.

If it was that easy, it would have been done.


> Don't assume the team is incompetent or deaf. Given what has already been achieved, that would be a total mismatch

To be perfectly clear, I don't think that. I just think it's a missed opportunity over the years to build better ways of doing environment and package management. For instance, on the folder auto-loading code: make it a feature of virtual environments only. I know Python can detect that it's in one. Or engage with distros so their Python builds don't allow it. They already ship custom builds (theirs is always missing something from a standard Python installation, after all).


Node also has madness that Python does not, like support for circular dependencies. I would definitely not point to Node as a positive example like Cargo.


Cargo is the gold standard, but reaching that level is going to take 10 years at least. I'm not kidding; that's what will be needed to clear all the roadblocks.

The first thing we need is a whole new way of installing Python, because half of the problems stem from that. That's just a huge endeavor.


It certainly seems that way. Guido wasn't interested in solving it, and it was probably one of the things he should have made a core issue because of its centrality, the way he did with gradual typing (which he was interested in).

Curiously, I wonder whether current leadership are unwilling to acknowledge that the current problems are essentially a leadership problem. See for example my short conversation with Pradyun on Mastodon (https://mas.to/@maegul/109726564552419983) where I’m not sure they were being open and rational (though they clearly know more than me).

This thread was in response to their blog post on Python packaging: https://pradyunsg.me/blog/2023/01/21/thoughts-on-python-pack...


Not disputing that there's a lack of leadership on the matter, but there's also conflicting use-cases: for example, the requirements for the "Python is just a tool that comes with my OS, and if I need any additional modules I'll install them with my OS package manager" mindset, and the "I'm developing a web application with Python and I want an isolated development environment with controlled dependencies" mindset, are quite different.


> "Python is just a tool that comes with my OS, and if I need any additional modules I'll install them with my OS package manager"

Imo this isn't really viable because you will eventually run into version conflicts in the transitive dependencies of the Python applications you're using/developing on your system, on most operating systems.

The version(s) that ships with an OS should only be used for shipping applications that are themselves part of the OS/distro.


They included virtual envs in Python 3, perhaps they should have done the same with some of the packaging solutions.


> perhaps they should have done the same with some of the packaging solutions.

The packaging solution is pip? (although for some reason it doesn't come with the now-standard wheel, so it's less useful than it should be)


Looking at this from the Ruby world, cargo appears to be mostly a clone of the bundler model (with a sprinkle of rake folded in) which has existed for a looooong while ;)

It's quite funny to see languages fumbling on dependency management and somehow "refusing" to look at how bundler operates, then stumbling on a very close solution, and then the community rejoices at their package management being the best thing since sliced bread.


Mozilla literally contracted the Bundler authors to create Cargo, so the similarity is intentional. :P


I'm quite sure it is!

I did not know the bundler authors were actually contracted though, thanks for that piece of info!


How is the approach in Ruby different/better? Asking out of ignorance.


I'm pretty sure that Rust hired the people who worked on bundler to build cargo - so it would make sense that they're pretty similar.


It's not perfect but pip (and npm...) could have learned a thing or two from Maven.

You know, prior art that appeared about 5 years earlier and had huge adoption.

Insularity is a software ecosystem disease.


Please describe what is better in Maven. Or please send a source with a comparison or some explanation. Thanks!


Maven uses groupIds and artifactIds to group dependencies.

GroupIds are reverse DNS notation based on a subdomain you control. They both group dependencies (duh!) and avoid the top-level squatting problem. It's much harder to mix up http-utils like in npm and pypi, since there's no http-utils. There's com.google:http-utils and there's org.apache:http-utils (all the names are made up for example purposes).

Maven has a local centralized artifact repository/cache where artifacts are downloaded and then referred to by every project. In Java there isn't even a need for symlinks, since Java has a dynamic classpath aka the places where the libraries are searched. Though Python could do the same thing with the PYTHONPATH.

Maven is plugin based and it's reasonably easy to build your own, so it's been extended like crazy to do all sorts of wonderful and loony things like build C++. It's actually expected to be extended, the tool itself implements Java builds through plugins, the core doesn't do that.

Oh, from the start it had tools to cache/proxy your own packages. So you'd be able to have a mini-centralized repo for your company with just the stuff you need, in case the main repo is down. Artifactory, Nexus, there are others. And not just cache/mirror, but proxy: you'd call your local repo, and if it wouldn't find the thing, it would download it from internet repos you configure.

Another thing, the package format and repo structure are simple and straightforward. So Gradle, sbt, other fancier Java build tools just use Maven repos.

There are a million other things like that, it's a very robust tool with a very robust ecosystem.

The main pain point is the early-2000s style XML configuration format :-|

That makes a lot of people avoid it due to verbosity. But you know what? The format is stable. It's quite readable. There are a gazillion tools to manage it, IDEs have advanced autocompletion for it, etc. And I'm not saying that other folks should copy the config file format, just the rest of the ideas. Heck, even Maven has Polyglot Maven, to use different config file formats (not super adopted, but it's there).


It failed to learn from binary repos used in C and C++ builds, which are the main reason why C++ with all its issues can still manage to compile faster in many cases.


Ecosystems where binary redistribution at the language level is the norm tend to have really gnarly downstream packaging issues, to the point that dependency management by a general-purpose/distro package manager is basically an unsupported use case (.NET), or that for large projects there may not be a known, working, fully from-source build (Java).

I would much rather have binary redistribution left up to downstreams.


No, Cargo does support this, via sccache. Binaries just aren't provided by crates.io itself, for reasons of economics.


No it doesn't, as that is a 3rd party dependency.


Cargo deliberately supports configurable caches in order to let users provide caching layers.


ccache is not a suitable replacement for a stable ABI that allows binaries to be reused across systems.


I'm referring to sccache, not ccache. The former is a distributed ccache that allows artifacts to be reused across systems.


These are some good differences. Personally, the one thing I do miss from the modern package managers is the Java-style namespacing that piggy-backs on DNS. I wonder if there is some project out there that has decided to piggy-back on ENS namespacing. And then you can have eth.renewiltord and have your packages under that, and then when your projects are all quite established, your ENS domain will have a public value that matches the exploit surface if someone were to get it. And then you can sell it and reap the rewards of EEEEVIL!


As much as it can be annoying to type out those imports in Java, I agree.

All the "first one who asks for the name gets it" package managers suffer from problems (some of them security-related) that anything that talks to Maven Central does not.

If you want an account on Sonatype's OSS repository (which allows you to push to Maven Central), you need to prove that you have control of the domain you want to use (in reversed form) as the "group ID" you publish libraries under (IIRC, by setting a TXT record in DNS). Meanwhile, to publish to crates.io, you just link your GitHub account (which is required, gross), and then you can publish to any package name that hasn't yet been taken, and then it's yours, forever. That seems terrible, somehow.

(Granted, that just covers the group ID for depending on the library; you can put Java classes in that library wherever you want, even in "other people's" namespaces. In practice, though, I don't think this ends up being a big issue.)

(Also granted, if you let your domain registration lapse, someone else could snap it up and possibly hijack your namespace. But I wouldn't be surprised if the Sonatype folks would try to get in touch with you if someone else tried to claim your namespace, even if they had control over the domain.)


I worked for many years in the enterprise java world, and when I went to do other things I was surprised how bad or incomplete package management is in other language ecosystems. The maven packaging system (as distinct from the Maven tool itself) makes managing dependencies for building and shipping java applications much more sane.


To reiterate the point on the Maven packaging system vs. the Maven tool itself:

All the post-Maven build automation tools in the JVM world, including Gradle, Apache Ivy, and sbt (the Scala build tool) use Maven repositories.

Good package index sites give you the syntax to grab the package in your tool of choice, for example (click the tabs):

https://mvnrepository.com/artifact/org.apache.commons/common...


Gatekeeping people from publishing crates behind owning a domain will never work in the Rust ecosystem.


Plenty of com.github.xyz -packages in Java.

c.f. https://jitpack.io/


They don't need to "own" anything. They can just use Github pages or any free subdomain in this entire universe.

If you can't be bothered to set up a free account in 5 minutes somewhere, how valuable or trustworthy is your library, really?


> Personally, the one thing I do miss from the modern package managers is the Java-style namespacing that piggy-backs on DNS.

Though it's not directly apparent, I would say that cargo (and probably some other package managers) actually does that. It just also does a really good job of hiding it in the default case where all your dependencies come from crates.io.

For every dependency, the source registry is also recorded, so the full package id is a concatenation of roughly "source registry URL + package name + package version". So if one were really interested in having bigger reliance on DNS to establish namespacing, one could utilize many different registries.
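
As a rough sketch of what that looks like (the registry name and URL here are invented for illustration), an extra registry is declared in .cargo/config.toml and then referenced per dependency in Cargo.toml:

    # .cargo/config.toml
    [registries.corp]
    index = "https://crates.corp.example/git/index"

    # Cargo.toml
    [dependencies]
    internal-utils = { version = "1.2", registry = "corp" }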

In practice, there are however good reasons related to supply chain security why you probably don't want to do that. E.g. domain names changing ownership adds a whole new attack surface that you need to secure against.


You can (almost) always set up different registries, but this is not what GP is talking about. Java packages are named according to (reversed) domain names, e.g. Apache's libraries live under org.apache.*, and Java registries allow/require you to prove you own the corresponding domain (Let's Encrypt-style) using a DNS record before publishing a package.


Right, but suppose I depend on the org.example.foo package today, and next month someone else buys the example.org domain specifically so that they can insert their malicious code into the foo package?

Domain names simply do not eliminate supply–chain problems, they only make your packaging system dependent on DNS.


That's still a lot harder and more visible than just taking over an orphaned package or many of the other attacks we see in these other ecosystems.

Especially since DNS gives you real visibility into ownership when auditing/selecting packages.

org.apache is a lot more trustworthy than tk.helicopter.


Never understood why people didn't steal this idea to eliminate typosquatting and "all the cool names" squatting.


>modern package managers is the Java-style namespacing that piggy-backs on DNS.

Doesn't Go kind of do this by having package paths be git repo addresses? It's not hierarchical, but it is tied to something "real world".


Yes. I kinda like this solution but I am wary of tying a language ecosystem to any particular version–control system, even one so nice and ubiquitous as Git.


Couldn't we just use git::github.com/foo/bar or so?


I do wish more attention was paid to the ballooning disk space requirements for both.

I think Cargo is doing better, but I've yet to get a shared directory working for every build on a system. I shouldn't have to spend 2G per Rust project.


Yeah it seems super weird to me that cargo doesn't cache built artifacts based on crate name + version + enabled features + rustc version (and anything else that might make it unique). It seems silly that I have many built copies of the same dependency scattered across different projects on my machine.

Then again, maybe it's actually fairly rare that the things that make a crate build unique are all the same all that often, and the gains vs. complexities of caching just aren't worth it.


sccache works really well and there are only two steps to install it and enable it globally; it speeds up compilation time a lot as well:

https://github.com/mozilla/sccache
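
For reference, the two steps are roughly this (assuming you install it with cargo; it can also be enabled per project via build.rustc-wrapper in .cargo/config.toml):

    cargo install sccache
    export RUSTC_WRAPPER=sccache    # put this in your shell profile to enable it globally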


sccache includes the absolute path for each compilation, so it doesn't help with caching the same dependency across different projects.


Oh yeah, can fix that though with:

    mkdir -p ~/cargo/target
    export CARGO_TARGET_DIR="$HOME/cargo/target"   # note: "~" would not expand inside the quotes


If it's shared, you don't know what project pulled in which dependencies so you can't clean it up. I guess this is a trade off - 2G that's easy to clean up vs ?G that isn't. The only way around would be some sort of pinning and GC like in Nix.

That said, if it's not an option, I think it'd be reasonable for people to be able to make that choice themselves...


Could be symlinked (or hardlinked?) from a central cache, though.



Which part? The caveat?


It's absolutely ridiculous how little these package manager devs looked at Maven.

It predates almost all of them by at least half a decade and holds a lot of hard-won knowledge, but "Java is icky, Java is enterprise, bleh" - and I can't even count how many mistakes Maven fixed back in 2005 that I still see in pip, npm, cargo, etc. in 2023.


My only experience of Maven was fighting it to get a single jar that included everything, and having it redownload every dependency with every single build.

I don't think downloading everything is a better solution.


Perhaps the privilege of getting sued by Oracle has something to do with it. I seem to remember Sun (then Oracle) blowing a fair bit of money on lawsuits surrounding Java.


I really doubt that a project by the Apache Software Foundation scared anyone away from anything.

It would have taken these package manager devs a few days of gratis and lawsuit free research on their laptops to save millions of end users years of pain.


> I've yet to get a shared directory working for every build on a system

Same here. CARGO_TARGET_DIR exists, but that means every time I change a feature in some crate and rebuild, suddenly it doesn't match the version that all my other projects want, and I need to rebuild again when I return to those. But it is the closest thing we have.
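
For what it's worth, the same thing can also be set persistently in ~/.cargo/config.toml rather than through the environment (the path here is just an example):

    # ~/.cargo/config.toml
    [build]
    target-dir = "/home/me/.cache/cargo-target"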


Node.js solved all this years ago; now we have super fast shared package managers like pnpm, yarn, etc., not to mention deno. Why can't other languages adopt the same system?


I assume that it is harder for Rust, because it needs to deal with `features`, with cross-compilation, with the developer using `nightly` for some projects, `stable` for others, pinned versions for yet others, etc.

That being said, harder doesn't mean unsolvable.


Maven and CPAN did it first, and in many ways better than node.


Node supports circular dependencies and arbitrary pre/post-install scripts, which are both disasters for downstream packaging. Node is not a model here.

The CAS that gets symlinked into vendorized directories à la pnpm is the Nix model, which is indeed an improvement that other package managers which support vendorization should incorporate.


It's a lot more than that if you're doing cross-platform (many) builds. It's a real pain.


Same with vcpkg for C++ development... Many of my students still have laptops with 128G of storage total; good luck making a whole $world fit in that when C:\Windows already uses half of it and they have their own documents & stuff.


I use rdfind to deal with this: https://github.com/pauldreik/rdfind


That tool looks fairly old considering it’s talking about sorting inodes and benchmarking on spinning disk. Is there any tool that replaces with a COW copy instead of creating a hard link? COW is strictly a safer way to do this compression and honestly it seems like something filesystems should do automatically, perhaps with FastCDC deduplication.


Yes, that would be safer when available (although generally files within library dependencies are not modified I think?). It looks like fclones implements this, is faster and is written in Rust https://github.com/pkolaczk/fclones (the last is the most important point of course /s).


Fantastic. Now that’s a tool I’d run as it seems actually safe.

rdfind doesn’t limit itself to library dependencies so I’d be worried about it fucking up on files I’d rather not have be hardlinked just because a copy exists (eg 2 checked out git clones where modifying the code in one suddenly starts modifying files in the other)


> There is a 1 to 1 relationship between cargo crate names and what gets used in the rust file. With pip your pypi package moo can include Python package foo, or whatever else it likes.

This is false: by default, the crate name (i.e., the name of the lib target) will be the package name with hyphens replaced by underscores, but it can be overridden to any other name [0].

[0] https://doc.rust-lang.org/cargo/reference/cargo-targets.html...


That's still 1 to 1.


No, multiple packages can (and do) declare the same crate name. For instance, both the "md5" package [0] and "md-5" package [1] declare the "md5" crate. You can even add both of them as a dependency at the same time: you'll only get a compile error when you try to refer to "md5" from the code. So a Cargo package may declare at most one crate name, but this name is not unique.

[0] https://crates.io/crates/md5

[1] https://crates.io/crates/md-5


I suppose, but one can also creatively define 'what gets used in the rust file'. If you somehow need to interface with both of those packages, you can say `md_5 = { package = "md-5" }`, and now in your code it will be renamed `md_5`.
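
Spelled out in a Cargo.toml (the versions are only illustrative), the two packages can then coexist under distinct crate names:

    [dependencies]
    md5 = "0.7"                                      # used as `md5` in code
    md_5 = { package = "md-5", version = "0.10" }    # used as `md_5` in code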


> Similarly you can’t use something in your rust code that you haven’t asked for in your Cargo.toml - transitive dependencies of your dependencies (mostly) don’t affect you

If they land on the final executable they certainly affect me.


Only indirectly. They don’t add additional types, methods, functions, etc that you must deal with in your code. The interface provided by the crate you directly depend on will, but nothing from the crates _it_ depends on will bleed through accidentally.


A naive statement from someone who has never done security assessments, ported code across multiple platforms, done license checks,...


This appears to be deliberately missing the author's point.


This misses why dependency analysis matters.


They do add extra filesize though.


Making dependency management easy means you end up with a large tree of dependencies. It is one of the main concerns I have with modern package managers.

In my opinion, what is lacking is a kind of trust management. Packages from the same team/author could belong to the same trust group. When a package gets updated, we would only have to audit dependencies from new trust groups. This could create a culture of reduced and audited third-party dependencies.


This is unfortunately the reality, and the primary reason why I prefer ecosystems with a higher barrier to entry for publishing your library or program, like Linux distributions. Libraries simply don't get in there unless there's a program using them that's in demand by the users of the distribution. Similarly, it makes it less attractive for application developers to rely on a library that's not already in the repository, as this increases packaging and install friction.


I have been thinking for years of a different trust mechanism for packages, one that is closer to capabilities, i.e. what Android is doing these days for applications.

Both approaches could certainly coexist!


Capabilities are great also! In particular at runtime.


I have just informally pitched both your idea and mine to the Rust security team :) We'll see what they think about it!

If it somehow works, feel free to use this message as the proof that I got the idea from you :)


There is a similar idea being explored with https://github.com/crev-dev/cargo-crev - you trust a reviewer who reviews crates for trustworthiness, as well as other reviewers.


Thanks for the reference! Definitely worth looking into.


Nice! Is there any place where the discussion is going?


The Python ecosystem is too old to be fixed without breaking everything else. That said, use pipenv to get lockfiles and almost-sane dep management. And use global.require-virtualenv=true to prevent global package pollution.
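
For reference, that option can be set through pip itself or via an environment variable (a minimal sketch):

    pip config set global.require-virtualenv true
    # or:
    export PIP_REQUIRE_VIRTUALENV=true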


Poetry installs from a lockfile, which is different for every environment. So it takes a while to resolve...

Cargo says "here is the package I need, at this version, go grab it", so it's quick.


Cargo does semantic-versioning-based dependency resolution, similar to Poetry; it's just a lot faster because it's written in Rust. I don't think there's a big difference in the work that happens.


Last time I used Poetry, I got frustrated at how long updating the lock file (i.e. resolving dependencies) took. Adding the debug flag showed it was spending most of its time combing through each version of the setuptools package, oldest to newest, to pick the most recent compatible version.

I was surprised it didn't do any caching, or possibly try binary search, though the latter could be rather imprecise.

Sure, cargo being written in Rust might give the dependency resolver a nice baseline speed boost, but optimisations matter too.


pdm is almost a drop-in replacement for Poetry, while being much faster on solving dependencies; look for "A Review: Pipenv vs. Poetry vs. PDM" if interested, or try it on your project.


That, really, demonstrates the cultural problem Python packaging has. With Rust, if you have a packaging improvement you make a Cargo plugin or you write an RFC & get your changes added to Cargo itself. So the Rust package manager is Cargo, and you don't have an endless string of alternatives to pick among. Just keep using Cargo, and it keeps getting better.



I don't know anything about Rust, but probably there is a huge difference, because Poetry needs to download and install all packages just to get their versions, and only after that can it resolve dependencies.


Yes and no: it needs to download it to discover its dependencies. The version is encoded in the wheel file name.
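
For example, a filename such as

    requests-2.28.1-py3-none-any.whl

already encodes the distribution name, version, Python tag, ABI tag, and platform tag; the dependency list, however, lives in the METADATA file inside the archive, so historically you had to fetch the wheel to see it.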


I don't understand - if it didn't know what version until after download/install - which version would it download/install?


If I'm not mistaken, it needs to download a package to know its dependencies and version constraints. So when you have many packages with many dependencies, which themselves have dependencies etc. it can take a while for poetry to assemble the full dependency graph and determine whether there are any unsolvable constraints (e.g. package foo depends on package bar with version >= 2, but package baz depends on package bar with version < 2).

Not sure how other package managers avoid that. Maybe the central package repositories can expose the dependencies metadata without needing to download the actual package?


> If I'm not mistaken, it needs to download a package to know its dependencies and version constraints.

It's even worse than that. It needs to execute a python script (setup.py?) per package to get a list of its dependencies and constraints. As that script may contain arbitrary platform-dependent logic (and in the case of ML-related packages, it often does), it can be impossible to resolve dependencies for other platforms.
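
As a minimal, made-up illustration of the problem (the package and dependency names are hypothetical):

    # setup.py -- install_requires is computed at build time, per platform
    import sys
    from setuptools import setup

    deps = ["numpy"]
    if sys.platform == "win32":
        deps.append("pywin32")   # only pulled in on Windows

    setup(
        name="example-pkg",
        version="1.0",
        install_requires=deps,
    )

A resolver running on Linux cannot know what this will evaluate to on Windows without executing it there.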

> Not sure how other package managers avoid that. Maybe the central package repositories can expose the dependencies metadata without needing to download the actual package?

Yes exactly.

For dependency resolution, cargo uses only a git-based index [0], which is optimized to contain only the information required for dependency resolution (omitting other package metadata such as authors). So it syncs the git repository, and after that it is just lookups in local files of the index.

Only after dependency resolution does it need to consult an external server for retrieval of the actual package contents.

[0]: https://github.com/rust-lang/crates.io-index
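
Roughly, each file in that index holds one JSON object per published version, something like this (trimmed and illustrative, not an exact entry):

    {"name":"example","vers":"1.0.0","deps":[{"name":"libc","req":"^0.2","optional":false,"default_features":true,"kind":"normal"}],"cksum":"...","features":{},"yanked":false}

So resolving a dependency graph is just local file reads plus semver matching; no package archives are touched until the final download step.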


> It needs to execute a python script (setup.py?) per package to get a list of its dependencies and constraints.

Only for packages that use setup.py (which is still heavily used; not sure whether it's still a majority). It is slowly being replaced by setup.cfg [0] and, more recently, pyproject.toml [1], which both contain dependencies in a declarative format.

[0] https://setuptools.pypa.io/en/latest/userguide/declarative_c...

[1] https://peps.python.org/pep-0631/


Oh sorry, I was talking about dependency resolution, OP was talking about install :/



