
That’s a good question… There seem to be two problems.

The definition of open source depends on a license existing in a repo. Without a license it’s not legal to copy and distribute.

Public vs. private repo is a platform issue, not the code maintainer’s.

If a public repo does not have a license, it does not mean it is free to copy and distribute.

If a private repo has an open source license like MIT, then the crawler has a right to copy and distribute that repo, regardless of whether it has authorization to access the repo.




> Without a license it’s not legal to copy and distribute.

Yes it is, due to both the terms you agree to when you use GitHub and the general implied license that covers everything public on the internet.

https://en.wikipedia.org/wiki/Field_v._Google,_Inc.


Looking at that ruling, it seems the case you linked to hinged on a fact that does not apply to The Stack:

>Field had actual knowledge of the Googlebot. He also was aware of the ways to prevent Google from either listing his site at all or listing it but not providing a link to the cached version. Instead of opting out, however, he chose to allow Google to both index and provide a link to the cached version.

For the AI dataset, (A) did the person know their work was being collected by this group and for this purpose, and (B) did they know of a way to prevent that collection?


It is not clear to me if they are _only_ using GitHub as a source. The Stack explicitly mentions they are using Software Heritage as a source, and Software Heritage definitely sources from repositories that are NOT stored on GitHub (and never have been).


I don’t think that “implied license” you’re referring to holds up in the courts.


Hopefully the crawler is smart enough to properly handle edge cases...

E.g. if the repo has some sort of /used-licenses/ folder where the licenses for packages and the like are included, it could make a bad decision.
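
A minimal sketch of one way a crawler could sidestep that edge case, assuming it only trusts license files at the repo root. The function name and the list of file stems are hypothetical, not anything The Stack is known to use:

```python
from pathlib import PurePosixPath

# Hypothetical set of file stems treated as the repo's own license.
LICENSE_STEMS = {"license", "licence", "copying", "unlicense"}


def root_license_files(paths):
    """Return only the license-looking files that sit at the repo root."""
    found = []
    for raw in paths:
        p = PurePosixPath(raw)
        # Files in sub-folders (e.g. used-licenses/) usually describe
        # dependencies rather than the repo itself, so skip them.
        if len(p.parts) != 1:
            continue
        stem = p.name.split(".", 1)[0].lower()
        if stem in LICENSE_STEMS:
            found.append(raw)
    return found


repo_paths = ["LICENSE", "used-licenses/MIT.txt", "src/main.py", "COPYING.md"]
candidates = root_license_files(repo_paths)
```

This is only a filename filter. It says nothing about what the license text actually permits, and it would miss names like LICENSE-MIT.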


> Without a license it’s not legal to copy and distribute.

Is this true? When you post anything publicly, from sticking a poster on the street to making artwork like Banksy, isn’t the default set to “it’s legal to copy, unless explicitly stated otherwise”?


The default in the majority of the world is that most creative works (including software code) are by-default copyrighted by the author, and the author must explicitly license away those rights. Some jurisdictions (e.g. France) put limits on what rights the author is allowed to give up. I.e., the default is it is illegal to copy (subject to exemptions like “fair use”).


Note that this archive project is French.


Banksy apparently runs a licensing program. Their artwork is most definitely under copyright, and they rely on trademark protection as well.

There is also the practical issue that a lot of content is posted publicly without the consent of the copyright owner. It's simply not true that just because someone else committed a copyright violation first, you can commit further violations with impunity based on that first violation.


> If a public repo does not have a license, it does not mean it is free to copy and distribute.

Whether or not it is free to copy and distribute, it should be free to copy and distribute. (My opinion is that copyright is no good; if the file is public then you should be allowed to copy and distribute it.)

> If a private repo has an open source license like MIT, then the crawler has a right to copy and distribute that repo, regardless of whether it has authorization to access the repo.

I should not think so. The license would only apply if you have a copy of it anyways. If you are not authorized to access it because it is private, then you would have to get a copy from somewhere else, and if nobody else is providing a copy, that shouldn't give you the right to unauthorized access. However, if it has been done, then it is done, so now there is a copy, and the license (if it is a license that allows copying it in this way) would authorize you to continue to use and distribute the copy that you have.


I’m not saying I agree or care about any of it. A sane company would never allow the use of source code from a third party without a license.

If a repo is forked and the license is deleted, the source code would need to be hashed to verify it’s the exact version of an open source repo. Mainly they don’t want a copyleft or “malicious” license infecting their IP.

If the hashes don’t match then it’s not technically the same code, so a company can’t safely use it without a license.
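
A rough sketch of that hash check, assuming the files are already loaded as a path-to-bytes mapping and that the license file is the only thing allowed to differ. The function name and the ignore list are made up for illustration:

```python
import hashlib


def tree_digest(files, ignore=("LICENSE",)):
    """Hash a {relative_path: bytes} mapping into one SHA-256 hex digest.

    Paths listed in `ignore` are skipped, so a fork whose license file
    was deleted can still match its upstream.
    """
    h = hashlib.sha256()
    for path in sorted(files):
        if path in ignore:
            continue
        # Length-prefix each field so that different path/content splits
        # cannot produce the same byte stream.
        for field in (path.encode("utf-8"), files[path]):
            h.update(len(field).to_bytes(8, "big"))
            h.update(field)
    return h.hexdigest()


upstream = {"LICENSE": b"MIT ...", "main.py": b"print('hi')\n"}
fork = {"main.py": b"print('hi')\n"}
patched = {"main.py": b"print('hello')\n"}

same = tree_digest(upstream) == tree_digest(fork)
different = tree_digest(upstream) != tree_digest(patched)
```

A match only shows the bytes are identical to a known licensed version. Whether that is enough for a legal team is a separate question.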



