Hacker News

> you're working in data science, and your main concern is probably making sure you can install the packages you need in the environments you're going to use them

Honest question from a web developer who sometimes has to work with Python — don't containers solve exactly this?




Unfortunately no. The problem here is that you're probably going to need a lot of compiled extensions, and some of those extensions are going to be running on your GPU (especially if you're in the ML world, but also more generally if you want to take advantage of e.g. your lab's HPC cluster). PyPI can manage some of this with the wheels system (i.e. OS, architecture, Python ABI), but there's no metadata to indicate, for example, which GPU you have available. In most cases projects can work around this by precompiling all the relevant variants and letting people download the right one, or sometimes by letting people compile everything themselves, but there are still situations where neither of those is a good option.
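To make the "no GPU metadata" point concrete: a wheel filename encodes the Python version, ABI, and platform that pip matches against, and nothing else. Here's a minimal sketch that splits a filename into its standard tags (assuming the common five-part form without a build tag; the `torch` filename is just an illustrative example):

```python
# Wheel filenames follow the pattern:
#   {name}-{version}-{python tag}-{abi tag}-{platform tag}.whl
# Note there is no slot for an accelerator (CUDA version, ROCm, etc.).

def parse_wheel_filename(filename: str) -> dict:
    """Split a wheel filename into its standard compatibility tags.

    Simplified: assumes the common 5-part form (no optional build tag).
    """
    stem = filename.removesuffix(".whl")
    name, version, python_tag, abi_tag, platform_tag = stem.split("-")
    return {
        "name": name,
        "version": version,
        "python_tag": python_tag,      # e.g. cp311 = CPython 3.11
        "abi_tag": abi_tag,            # e.g. cp311 = CPython 3.11 ABI
        "platform_tag": platform_tag,  # e.g. manylinux2014_x86_64
    }

tags = parse_wheel_filename("torch-2.1.0-cp311-cp311-manylinux2014_x86_64.whl")
print(tags["platform_tag"])  # OS + CPU architecture, but nothing about the GPU
```

The platform tag can say "Linux on x86_64", but whether you need a CUDA 11 build, a CUDA 12 build, a ROCm build, or a CPU-only build has to be communicated out of band.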

This is why PyTorch is famously complicated to install via newer package managers such as Poetry: it requires something slightly more complicated than the existing wheel setup, and most package managers aren't designed for that. (Pip isn't designed for that either, but PyTorch has come up with workarounds for pip already.)
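For illustration, one common workaround in recent Poetry versions is to declare PyTorch's own index as an explicit package source. The CUDA tag here (`cu121`) is just an example; you have to pick the one matching your hardware and driver yourself, which is exactly the metadata wheels can't express:

```toml
# pyproject.toml -- point Poetry at PyTorch's index for a specific CUDA build.
# "cu121" is an example; match it to your GPU driver and toolkit.
[[tool.poetry.source]]
name = "pytorch-cu121"
url = "https://download.pytorch.org/whl/cu121"
priority = "explicit"

[tool.poetry.dependencies]
torch = { version = "^2.1", source = "pytorch-cu121" }
```

The equivalent pip workaround is passing the same index via `--index-url` at install time.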

Containers can't solve this problem because they're tied to the hardware of the machine they're running on; they can't abstract that away. So even if your code is running in a container, it still needs to know which architecture, OS, hardware resources, etc. it has access to.


Not exactly, no: https://stackoverflow.com/questions/63960319/does-it-matter-...

You need to be running a GPU driver on the host that supports the container cuda version.

So in theory yes; in practice, weird issues sometimes occur that really suck to debug. For example: why do I get NaN loss after spending 8 days on 128 GPUs with this specific combination of host drivers and CUDA container? (Don't hold it that way; use a matching CUDA version...)
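The "matching CUDA version" constraint boils down to a minimum host driver version per CUDA toolkit release. A minimal sketch of that check, with illustrative minimums for Linux (verify against NVIDIA's current compatibility matrix before relying on any of these numbers):

```python
# Minimum Linux driver version for each CUDA toolkit line.
# Illustrative values -- always check NVIDIA's compatibility matrix.
MIN_DRIVER_FOR_CUDA = {
    (11, 8): (450, 80, 2),    # via CUDA 11.x minor-version compatibility
    (12, 1): (525, 60, 13),
}

def driver_supports(driver: tuple, cuda: tuple) -> bool:
    """True if the host driver is new enough for a container's CUDA toolkit."""
    minimum = MIN_DRIVER_FOR_CUDA.get(cuda)
    if minimum is None:
        raise ValueError(f"no compatibility entry for CUDA {cuda}")
    return driver >= minimum  # tuple comparison is lexicographic

# A host on a 470-series driver can run a CUDA 11.8 container...
print(driver_supports((470, 129, 6), (11, 8)))  # True
# ...but not a CUDA 12.1 one, no matter what's inside the image.
print(driver_supports((470, 129, 6), (12, 1)))  # False
```

The point is that this check happens against the *host*, outside the container, which is why shipping a CUDA image doesn't make the problem go away.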

Also, a lot of data scientists HATE sysadmin tasks, and Docker falls squarely into that category for many people.


The problem is that people doing data science are not developers, so instead of just using whatever is already there, they end up reinventing a terrible version of package management.





