For something to show up verbatim in the output of a textual AI model, it needs to appear in the training input many times.
I wonder if the problem is not Copilot, but many people using this person's code without license or credit, and Copilot being trained on those copies as well. Copilot may just be exposing a problem rather than creating one.
I don't know much about AI, and I don't use copilot.
Microsoft has a public statement that they don't use proprietary code, only public code with public licenses. They have a lot of companies as customers who use GitHub, and they also use a lot of third-party code in their own products.
Even BSD et al. have attribution requirements, so the amount of code that can be used without any must be vanishingly small. Methinks the people who run GitHub (who have apparently decided to abandon the core business for the latest fun project) aren't being entirely upfront.
With the amount of resources that Microsoft has, how hard can it be for them to exclude proprietary code that other people have stolen? I’d bet it is easy for them, but they won’t do it. Because they don’t care, because who is gonna take them on?
Will they “accidentally” include proprietary code from, say, Oracle? Nope. They’ll make sure of it. But Joe Random? Sure.
Because Microsoft is known to be extremely protective of their code. There is just no way they would expose their internal code to being straight up decoded from the model when they can just train the model on the huge public data on GitHub.
As a thought experiment: what do we all suppose would be the impact on Microsoft if they deliberately made public the proprietary source code for all of their publicly available commercial products and efforts (including licensed software and services, excluding private contracts and research), while the rest of their intellectual property and trade secrets remained private?
Since I’m posing the question, here’s my guess:
- Their stock would take at least a short term hit because it’s an unconventional and uncharacteristic move
- The code would reveal more about their strategic interests to competitors than they’d like, but probably nothing revelatory
- It might confirm or reinforce some negative perceptions of their business practices
- It might dispel some too
- It may reduce some competitive advantage amongst enormous businesses, and may elevate some very large firms to potential competitors
- It would provide little to no new advantage to smaller players who aren’t already in reach of competing with them and/or don’t have the resources to capitalize on access to the code
- It would probably significantly improve public perception of the company and its future intentions, at least among developers and the broader tech community
In other words, a wash. Overall business impact would be roughly neutral. The code has more strategic than technical value, and few could turn that technical value into any kind of revenue center with growth potential. Any disadvantage would be negated by the public goodwill it generated.
Maybe my take is naive though! Maybe it would really hurt Microsoft long term if suddenly everyone could fork Windows 11, or steal ideas for their idiosyncratic office suite, or get really clever about how to get funded to go head to head with Azure armed with code everyone else can access too.
I think Microsoft, their ISVs, and everyone would benefit a lot if Windows were "open source" in the narrow sense - viewable source code, with a license to compile and use only to the extent that you already own the requisite Windows license(s).
Pirating Windows is already utterly trivial with KMS activation, so it's not like they'd lose anything there.
If they open-sourced their software, I wouldn’t have to wait two months till they finally release the PDBs for the kernel after every 2XH1 / 2XH2 update.
It’s so annoying that they are sooooo slow at this and we have to keep our users from upgrading after every release.
Maybe open-source licenses need to be revised to disallow this sort of thing, e.g. by saying that anything trained on GNU data must also carry that license.