Ask HN: How do your ML teams version datasets and models?
69 points by skadamat on Sept 28, 2023 | 33 comments
Git worked until we hit a few gigabytes. S3 scales super well, but version control, documentation, and change management aren't great (we just did lots of "v1" or "vsep28_2023" names).

The team felt DVC was very clunky (now I need git AND S3 AND DVC).

What best practices and patterns have you seen work or have you implemented yourself?




We have been working on an open source tool called "Oxen" that aims to tackle this problem! Would love for you to kick the tires and see if it works for your use case. We have a free version of the CLI, Python library, and server on GitHub, and a free hosted version you can try out at Oxen.ai.

Website: https://oxen.ai

Dev Docs: https://docs.oxen.ai

GitHub: https://github.com/Oxen-AI/oxen-release

Feel free to reach out on the repo issues if you run into anything!


I really like your README.


I think a decent solution is coming up with a system for storing the models, datasets, checkpoints, etc. in S3, and storing the metadata, references, etc. in a well-structured Postgres table (schema versioning, audit logs, etc., with snapshots). Also embed the metadata in the model/dataset itself, in such a way that you could always reconstruct the database from the artifacts (in Arrow and Parquet files, you can embed arbitrary metadata at both the file level and the field level).
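
As a rough sketch of the Parquet side (the key names here are made up, not any standard), pyarrow lets you attach arbitrary key/value metadata to the schema before writing:

    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({"feature": [1.0, 2.0], "label": [0, 1]})

    # Merge custom keys into the schema's existing metadata so nothing is lost.
    existing = table.schema.metadata or {}
    table = table.replace_schema_metadata({
        **existing,
        b"dataset_version": b"v3",
        b"source_commit": b"0123abc",   # e.g. the git SHA of the generating code
    })
    pq.write_table(table, "train.parquet")

    # Later, recover the metadata without reading the data itself.
    print(pq.read_schema("train.parquet").metadata[b"dataset_version"])

Field-level metadata works the same way by passing metadata= to pa.field().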

But perhaps the best solution is to just use something like MLflow or W&B that handles this for you, if you use the API correctly!


You included data lineage tracking in the first part, which probably needs to be piped in from an orchestrator.

At this point it’s a build or buy type deal for models using a model registry service.

Data versioning still feels unsolved to me.


Models that actually get deployed get a random GUID. Our docs tell us which is which (release date, intended use, etc.)

Models are then stored in an S3 bucket. But since the IDs are unique, they can be exchanged and cached and copied with next to no risk of confusion.
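
In practice this doesn't need to be much more than something like the following (the bucket name and key layout are illustrative, not our actual setup):

    import uuid
    import boto3

    model_id = uuid.uuid4().hex      # the model's permanent, collision-free ID
    boto3.client("s3").upload_file(
        "model.pt",
        "my-model-bucket",                  # illustrative bucket name
        f"models/{model_id}/model.pt",
    )
    # Record model_id, release date, and intended use in the docs by hand.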


Is the bucket versioned?


No, the only versioning is “old models still exist in the bucket and their IDs stay in the docs”.

We might need to add a database with more structured attributes for each model, but until now it hasn’t been a real problem.


We have a task name, major version, description, and commit hash, so the model name will be something like my_task_v852_pairwise_refactor_0123ab. Ugly, but it works.

Don’t store your data in git; store your training code there and your data in S3. You can add metadata to the bucket so you know what’s in there and how it was generated.
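
For example, with boto3 you can attach metadata to the object at upload time (the bucket, key, and metadata keys below are illustrative):

    import boto3

    boto3.client("s3").upload_file(
        "train.parquet",
        "my-ml-datasets",                                    # illustrative bucket
        "my_task/v852_pairwise_refactor_0123ab/train.parquet",
        ExtraArgs={"Metadata": {
            "git-commit": "0123ab",
            "generated-by": "build_dataset.py",              # illustrative script name
        }},
    )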


Process, git and S3.

We trained the whole team to:

- version the analysis/code with git

- save the data to the bucket s3://<project_name>/<commit_id>

- we wrote a small helper that gets the commit id, builds this path, and uses boto3 to both read from it and save to it
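
Roughly, the helper looks something like this (a simplified sketch, not the exact code):

    import subprocess
    import boto3

    def commit_prefix() -> str:
        # Current git commit id, used as the S3 prefix for this run's data.
        return subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip()

    def save(project_bucket: str, filename: str, local_path: str) -> None:
        # s3://<project_name>/<commit_id>/<filename>
        boto3.client("s3").upload_file(local_path, project_bucket, f"{commit_prefix()}/{filename}")

    def load(project_bucket: str, filename: str, local_path: str) -> None:
        boto3.client("s3").download_file(project_bucket, f"{commit_prefix()}/{filename}", local_path)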

We normally work with zipped Parquet files and model binaries, and we try to keep them together under the path mentioned.

It's super easy and simple, with very few dependencies, and it allows for rerunning the code with the data. If someone deviates from this standard, we will always request a change to keep it tidy.

Keeping track of data is the same as keeping a clean git tree: it requires practice, a standard, and constant supervision from everyone.

This has saved my butt many times, such as when I had to rerun an analysis done over a year ago, or take over for a colleague who got sick.


I really like the idea of using the commit ID as the bucket prefix for the associated files.



DVC is slow because it stores and writes data twice, and the default of dozens of concurrent downloads causes resource starvation. They finally improved uploads in 3.0, but downloads and storage are still much worse than a simple "aws s3 cp". You can improve pull performance somewhat by passing a reasonable value for --jobs. Storage can be improved by nuking .dvc/cache. There's no way to skip writing all data twice though.

Look for something with good algorithms. XetHub worked very well for me, and Oxen looks like a good alternative. git-xet has a very nice feature that allows you to mount a repo over the network [0].

[0] https://about.xethub.com/blog/mount-part-1


Clarification on file duplication: DVC tries to use reflinks if the filesystem supports it, and falls back on copying the files. It can be configured to use hardlinks instead for filesystems like ext4 [0]. This improves performance significantly.

[0] https://dvc.org/doc/user-guide/project-structure/configurati...


For a side project on image classification, I use a simple folder system where the images and metadata are both files, with a hash of the image acting as the key/filename, e.g. 123.img and 123.metadata. This gives file independence. Then, as needed, I compile a CSV of all the image-to-metadata mappings and version that. This works because I view the images as immutable, which is not true for some datasets. On a local SSD, it has scaled to >300K images.

Professionally, I've used something similar but with S3 storage for the images and a Postgres database for the metadata. This scales better beyond a single physical machine for team interaction, of course.

I'd be curious how others have handled data costs as the datasets grow. The professional dataset got into the terabytes of S3 storage, and it gets a bit more frustrating when you want to move data but are looking at thousands of dollars in projected egress costs... and that's with S3, let alone a more expensive service. In many ways S3 is so much better than a hard drive, but it's hard not to compare to the relative cost of local storage when the gap gets big enough.
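
A minimal sketch of the hash-keyed layout described above (the folder name and function are illustrative):

    import hashlib
    import json
    from pathlib import Path

    DATA_DIR = Path("dataset")   # illustrative local folder
    DATA_DIR.mkdir(exist_ok=True)

    def add_image(image_path: str, metadata: dict) -> str:
        # The content hash acts as the key, so identical images map to the same entry.
        data = Path(image_path).read_bytes()
        key = hashlib.sha256(data).hexdigest()
        (DATA_DIR / f"{key}.img").write_bytes(data)
        (DATA_DIR / f"{key}.metadata").write_text(json.dumps(metadata))
        return key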


I've used DVC in the past and generally liked its approach. That said, I wholeheartedly agree that it's clunky. It does a lot of things implicitly, which can make it hard to reason about. It was also extremely slow for medium-sized datasets (low 10s of GBs).

In response, I created a command-line tool that addresses these issues[0]. To reduce the comparison to an analogy: Dud : DVC :: Flask : Django. I have a longer comparison in the README[1].

[0]: https://github.com/kevin-hanselman/dud

[1]: https://github.com/kevin-hanselman/dud/blob/main/README.md#m...



MLflow solves most of these issues for models. I haven't used it in relation to data versioning, but it solves most model versioning and deployment management things I can think of.


Not an answer to your question, but here’s a talk by prof. Sussman (of SICP fame, among other things) outlining a vision for such a software system: https://youtu.be/EbzQg7R2pYU?si=OEqDCe3_i7KLnaq9


Here is a tutorial on how to use Git LFS with Azure DevOps for game dev, but the same principle applies to ML. It's about versioning large data. DevOps does not charge for storage, yet.

https://www.anchorpoint.app/blog/version-control-using-git-a...


A CSV file in git with paths to all of the files, all the training settings, and the path to the training artifacts (snapshots, loss stats, etc.). The training artifacts get filled in by CI when you commit. The files can be anywhere; for us it was a NAS, due to PII in the data we were training on, so "someone else's computer", a.k.a. the cloud, wasn't an option.
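
As a sketch, appending a run to that CSV could look like this (the columns and paths are illustrative, not our exact schema):

    import csv
    import os

    FIELDS = ["data_manifest", "learning_rate", "epochs", "artifacts_path"]  # illustrative columns

    write_header = not os.path.exists("experiments.csv")
    with open("experiments.csv", "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if write_header:
            writer.writeheader()
        writer.writerow({
            "data_manifest": "//nas/datasets/2023-09-28/",  # hypothetical NAS path
            "learning_rate": 3e-4,
            "epochs": 20,
            "artifacts_path": "",                           # left blank; CI fills it in after training
        })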


Why would having PII rule out cloud?


Most cloud providers are "secure" in the sense that they lock up your data and leave the key in the door so you can access it easily. A salesman will swear, hand on heart, that they'd never abuse this. An auditor has also certified that they meet the highest standard of the check clearing.

This is enough to meet the legal requirements, as I understand things.

Some people are not credulous enough to take the salesman at his word.


It depends on how sensitive the PII is. Medical data and ID documents are two that I know which are particularly sensitive.


I'm guessing you're looking more for a dev tool, but I co-founded a company that deals with this very thing (among others) from a governance perspective. https://www.monitaur.ai/



Git LFS


I put the metadata in a JSON file and then store the datasets as a zip archive on an Nginx server.


Have you used git or Git LFS to store the large files?


Yes, Git LFS works better than most people think. You can also use Azure DevOps, because they don't charge for storage. We use Anchorpoint as a Git client, because it's optimized for LFS.


Haphazardly, with commit# + timestamp of training


W&B artifacts for days


I use five version tags, after that I just rename the dataset.

v1

v2

v2_<iso-date>

v3_final

FINAL_final


MLflow



