While I'm sure this will help some people use git for a use case that was previously impossible with it, I can't help but feel that it is a bad step overall for the git ecosystem.
It appears to centralize a distributed version control system, with no option to continue using it in a distributed fashion. What would be wrong with fixing/enhancing the existing git protocols to enable shallow fetching of a commit ("I want commit A, but without objects B and C, which are huge")? Git already fully supports working from a shallow clone (not the full history), so it wouldn't be too much of a stretch to make it work with shallow trees (I didn't fetch all of the objects).
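For reference, shallow history is already just a flag away; the gap is that there is no analogous way to leave out individual large blobs (the repo URL below is a placeholder):

# shallow history: already supported
git clone --depth 1 https://example.com/big-repo.git
git fetch --depth 50 origin master   # deepen the history later if needed
git fetch --unshallow                # or recover the full history
# there is no equivalent of "give me commit A but skip huge blobs B and C" --
# that would be the "shallow tree" counterpart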
I'm sure git LFS was the quickest way for github to support a use case, but I'm not sure it is the best thing for git.
You could extend the git-lfs "pointer" file to support secure distributed storage using Convergent Encryption [1]. Right now, it's 3 lines:
version https://git-lfs.github.com/spec/v1
oid sha256:4d7a214614ab2935c943f9e0ff69d22eadbb8f32b1258daaa5e2ca24d17e2393
size 12345
By adding an extra line containing the CHK-SHA256 (Content Hash Key), you could use a distributed p2p network like Freenet to store the large files, while keeping the data secure from other users (who don't have the OID).
version https://git-lfs.github.com/spec/v2proposed
oid sha256:4d7a214614ab2935c943f9e0ff69d22eadbb8f32b1258daaa5e2ca24d17e2393
chk sha256:8f32b12a5943f9e0ff658daa9d22eae2ca24d17e23934d7a214614ab2935cdbb
size 12345
That's basically how Freenet / Tahoe-LAFS / GNUnet work.
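For anyone unfamiliar with convergent encryption, here's a rough sketch of the idea using openssl (file names are placeholders, and this isn't the exact key derivation Freenet or Tahoe-LAFS use):

# the encryption key is derived from the content itself -- that's the "convergent" part
key=$(sha256sum bigfile.bin | cut -d ' ' -f1)
# a fixed IV is deliberate here: identical plaintexts must yield identical ciphertexts
openssl enc -aes-256-cbc -K "$key" -iv 00000000000000000000000000000000 -in bigfile.bin -out bigfile.enc
# the CHK is a hash of the *ciphertext*: enough to locate/dedupe the blob on a p2p
# network, but useless for decryption unless you already hold the oid/key
sha256sum bigfile.enc

Since the key depends only on the content, two people storing the same file end up with the same ciphertext and the same CHK, so the network can deduplicate blobs it cannot read.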
Mercurial marks their Largefiles [0] support as a "feature of last resort", i.e. enabling it breaks the core concept of what a DVCS is, as you now have a central authority you need to talk to. But at the same time, many people who use Git and Hg use them with a central authoritative repo.
++! When I was in the games industry, it was extremely important to have this feature (and yes, it was a last resort!). That's why we chose Mercurial over Git at the time.
Unfortunately there was a lot of wackiness, and far too often assets got out of sync. We ended up regressing and put large assets (artwork) into a Subversion repo instead.
I wish there were a better option, such as truncating the history of largefiles, but that seems to break the concept of Git/Mercurial even more than the current "fix".
Indeed, that was about 5 years ago. The problems were generally around assets getting out of sync, and occasional corruption when uploading to the large-file storage server.
Problems generally occurred when a client timed out in the middle of an upload or download.
These were troublesome (and often silent) failures, which made it unusable for production.
I hope they've fixed it since; it was a great concept, and well ahead of Git in attempting to solve this problem!
git-lfs (and similar systems) splits the storage of objects into the regular git object store (for small files) and a separate large-file store. This allows you to configure how you fetch and push large files independently of how you fetch and push regular objects.
A shallow clone gives you some `n` commits of history (and the objects that they point to). Using LFS allows you to have some `m` commits worth of large files.
If you want a completely distributed workflow, and have infinite local storage and infinite bandwidth, you can fetch all the large files when you do a normal `git fetch`. However, most people don't, so you can tweak this to fetch only the parts of the large-file history that you're interested in.
Indeed, this is a trade-off that requires some centralization, but so does your proposed solution of a shallow clone. LFS just adds some subtlety and configurability around where that line is drawn.
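Concretely, the client-side knobs look roughly like this (the paths are made up, and the option names assume a reasonably recent git-lfs client):

# only download LFS objects for the paths you care about
git config lfs.fetchinclude "assets/current"
git config lfs.fetchexclude "assets/archive"
# limit how much large-file history `git lfs fetch --recent` pulls in
git config lfs.fetchrecentcommitsdays 7
# or, if you really do want the fully distributed behaviour, grab everything
git lfs fetch --all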
Perhaps this could be done by adding a .git_sections file which keeps track of different sets of files you might want to check out but don't need to. You could define different targets (and a default), so that on a large video game you could keep one repository for everything but have "artists", "programmers", and "full" targets: artists keep their huge assets together with the rest of the repo, while programmers do shallow pulls and aren't constantly fetching asset files that may or may not be relevant to what they're working on.
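Purely to illustrate, such a file might look something like this (the file and its format are hypothetical; nothing reads .git_sections today):

# .git_sections (hypothetical)
[target "programmers"]
    exclude = assets/
[target "artists"]
    include = assets/ source/
[target "full"]
    include = *
[defaults]
    target = programmers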
Neat! Is there a way for me to serve those files so people can then use my repository as the authoritative source? I see that they still have the concept of remotes, so maybe things are getting there?
git lfs push should work for pushing files that are new to the repo, but I'm not sure it works for files that already exist in the repo but are new to the server (because it is a new remote).
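If it does, I'd expect it to look something like this (the remote name is a placeholder, and it assumes the objects are in your local LFS store, e.g. after a `git lfs fetch --all`):

git remote add mymirror git@myserver:project.git
git push --all mymirror
# push every locally stored LFS object, not just the ones referenced by new commits
git lfs fetch --all origin
git lfs push --all mymirror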
To us, things like git-lfs are an encouragement to make sure there is an open source package that supports it, to keep the D in DVCS, so we'll add support to our Community Edition too.
GitLab giving me an alternative is awesome and commendable. If I'm able to clone a repo from GitHub, including all of the LFS objects, and then push it all to GitLab, that would be better than nothing! Is GitHub contributing to your effort to have an LFS server that is open and free?
Let me give an example of one way it hurts the existing git ecosystem. Someone decides to include their project's external source dependencies as tarballs using LFS (which is probably dumb and not the use case LFS is trying to support, but people will do it nonetheless). Now I want to mirror that repository inside my company's firewall, which hosts its git repositories using just git over ssh. Without LFS, I would just do `git clone --mirror && git push --mirror`, and I'd internally have a mirror that is easy to keep up to date, is dependable, supports extending the source in a way that is easy to contribute back, etc.
Now what options do I have with LFS (outside of GitLab)? Create a tarball with all the source plus the LFS objects in it? Create a new repository that doesn't use LFS and commit the files to that? Each of these is less than ideal and makes contributing back to the project harder.
Imagine instead a world where this happened: GitHub.com announces that they are adding large file support to git. These large files will use an alternative object store, but the existing git, http, and ssh protocols will be extended to support fetching these new objects. When support lands in the git mainline repository, suddenly everyone will be able to take advantage of it, regardless of how they choose to host their repositories!
I admire GitLab for creating an open source server implementation. I just wish GitHub had done it in a different way that would have been better for the overall git community (not just GitHub users).