
I guess I don't really understand the need for something like Cargo to be up to date to the second, or even the minute. My assumption is that you build your code with certain package versions in mind and release that to testing. Unless it's a security update, it won't matter if you're 24 hours behind.

Say that version 1.2.1 of a library is released right as you do your build; it won't go into production within that 24-hour window anyway. If it is a security fix, then, like Debian, you pull that from another repository, which is under tighter control.




This thread is glossing over some important details. A package repo has two distinct storage concerns: the index (a list of which versions of each package have ever been released) and the actual packages themselves. It's convenient to have the index centralized, for maximum consistency. But the packages themselves can be stored however you want, and if a client hits a stale mirror that doesn't yet have the most recent version of a package, it should have the option of trying a different mirror or accepting the old version.
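A minimal sketch of that split, in Rust (all the names and types here are made up for illustration): the central index is the authority on what versions exist, while mirrors hold the actual files and may lag behind.

    use std::collections::BTreeMap;

    // Made-up stand-ins: the index records every version ever released,
    // while each mirror stores actual package files and may be stale.
    type Index = BTreeMap<String, Vec<String>>;        // name -> released versions
    type Mirror = BTreeMap<(String, String), Vec<u8>>; // (name, version) -> bytes

    // Resolve the newest version against the authoritative index, then try
    // mirrors in order, skipping stale mirrors that don't have it yet.
    fn download(name: &str, index: &Index, mirrors: &[Mirror]) -> Option<Vec<u8>> {
        let latest = index.get(name)?.last()?.clone();
        mirrors
            .iter()
            .find_map(|m| m.get(&(name.to_string(), latest.clone())).cloned())
    }

    fn main() {
        let mut index = Index::new();
        index.insert("foo".into(), vec!["1.2.0".into(), "1.2.1".into()]);
        let mut fresh = Mirror::new();
        fresh.insert(("foo".into(), "1.2.1".into()), b"tarball bytes".to_vec());
        // The first (stale, empty) mirror is skipped; the fresh one serves 1.2.1.
        assert!(download("foo", &index, &[Mirror::new(), fresh]).is_some());
    }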

For crates.io specifically, the packages are stored in S3, whereas the index is currently stored as a bog-standard GitHub repo (not as a GitHub Package), and in the near future the index will also move to crates.io itself (https://blog.rust-lang.org/inside-rust/2023/01/30/cargo-spar...).
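To make the index layout concrete, here's a rough sketch of how a crate name maps to a file in that repo, as I understand it (treat the exact sharding rules as my assumption, not an official spec): short names get dedicated top-level directories, and longer names are sharded on their first four characters so no single directory gets huge.

    // Sketch of the crates.io index path scheme (my understanding of it):
    // one file per crate, sharded by name prefix.
    fn index_path(name: &str) -> String {
        let n = name.to_lowercase();
        match n.len() {
            1 => format!("1/{n}"),
            2 => format!("2/{n}"),
            3 => format!("3/{}/{}", &n[..1], n),
            _ => format!("{}/{}/{}", &n[..2], &n[2..4], n),
        }
    }

    fn main() {
        // Each file then holds one JSON line per published version of that crate.
        println!("{}", index_path("serde")); // se/rd/serde
        println!("{}", index_path("syn"));   // 3/s/syn
    }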


Thanks, that wasn't clear to me. Why not just dump the index on the same storage as the packages? If text files are insufficient, then use an SQLite database.


There are sound technical reasons to give the index special treatment.

First, the index is very large, and it only ever gets larger over time. I just cloned and compressed the crates.io index (https://github.com/rust-lang/crates.io-index), which resulted in a 58 MB archive (note that I did remember to delete the .git directory).

Second, the index changes very often. Every time anyone ever publishes a new version of a package, that changes the index. For crates.io, this happens hundreds or thousands of times per day.

Third, the index is append-only.

Fourth, the index is extremely frequently requested. Any time the user manually asks for an update, or any time the user adds a new dependency, the local copy of the index needs to be updated.

Putting it all together: since the index is constantly changing and users are constantly asking for the latest version, it would be very inefficient to serve the whole thing each time, so a fine-grained solution is needed. In the early days of crates.io, this problem was solved by storing the index in a git repo and letting git take care of fetching new diffs to the index (and the problem of "who pays for hosting" was solved by using GitHub). Now that the crates.io index is outgrowing this solution, it's moving to a more involved protocol where clients will not keep local copies of the full index, but will instead lazily fetch individual index entries as needed, which is much faster (especially for fresh installs, including every CI run!).
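For a feel of what that lazy fetch looks like, here's a hedged sketch: it pulls the single index file for one crate over plain HTTP. The index.crates.io host and the prefix-sharded path are my understanding of the sparse protocol, and the reqwest crate (with its "blocking" feature) is assumed as the HTTP client.

    // Sketch of the sparse-index idea: fetch only the index file for the crate
    // you actually need, instead of cloning/updating the whole index.
    // Assumes reqwest = { version = "0.11", features = ["blocking"] } in Cargo.toml.
    fn main() -> Result<(), Box<dyn std::error::Error>> {
        let body = reqwest::blocking::get("https://index.crates.io/se/rd/serde")?.text()?;
        // Each line of the response is a JSON object describing one published version.
        println!("serde entries in the index: {}", body.lines().count());
        Ok(())
    }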


I think in the case of Debian, packages are vetted and approved by repository maintainers before being hosted (the repository is curated). Most application dependency repositories, I think, let anyone in, and the onus is on the author and user to determine a package's legitimacy.

I imagine it's easier to get people to mirror curated, signed packages than, effectively, random code.


I definitely push stuff to npm and then pull it in as a dep on a different project seconds later. Mostly because I'm too lazy to eff around with local package resolution, which has bitten me before, and it also implies you're linking against live code instead of a specific snapshot.



