DeVault: GitHub Copilot and open source laundering

submitted by
Style Pass
2022-06-24 09:00:08

GitHub’s Copilot is trained on software governed by these terms, and it fails to uphold them, and enables customers to accidentally fail to uphold these terms themselves. Some argue about the risks of a “copyleft surprise”, wherein someone incorporates a GPL licensed work into their product and is surprised to find that they are obligated to release their product under the terms of the GPL as well. Copilot institutionalizes this risk and any user who wishes to use it to develop non-free software would be well-advised not to do so, else they may find themselves legally liable to uphold these terms, perhaps ultimately being required to release their works under the terms of a license which is undesirable for their goals.

Chances are that many people will disagree with DeVault's reasoning, but this is an issue that merits some discussion still.

Posted Jun 23, 2022 16:26 UTC (Thu) by mpldr (subscriber, #154861)

I would personally love for an independent legal team to take a look at this issue… I doubt that either the FSF's or GitHub's legal teams can be trusted to be impartial on these topics.

Posted Jun 23, 2022 18:31 UTC (Thu) by NYKevin (subscriber, #129325)

Emphasis on "legal" team, not engineering team. A lot of engineers have very strange ideas about how the GPL works. For example, some engineers (presumably not including DeVault) think that just reading GPL'd code causes it to magically "infect" any code you write after that point, which is not even close to being true.*

The GPL attaches to derivative works. "Derivative work" is a legal term of art, and the license does not attempt to offer a definition for it, because it's defined by the underlying copyright law. That's why you need a team of lawyers, not a team of engineers, to evaluate something like this.

In particular, I think DeVault's emphasis on the model itself being a derivative work may be an unnecessary distraction, because:

1. It's not clear to me whether this claim is actually correct. A model is ultimately "just" a big bag of statistical information, and I honestly don't know whether (US) copyright law attaches to such things in the first place, but I'm skeptical (see e.g. Feist v. Rural).

2. It's not relevant. What matters is whether the output of the model is a derivative work of the original, which is a completely different legal question. Derivative works are not subject to some sort of magical "transitive property" that requires the model to also be a derivative work; you can argue that the output is derivative while taking no position on the status of the model itself. Similarly, you could argue that the output is *not* derivative, again taking no position on the model.
The status of the model is not relevant to the question, unless you're going to allege an AGPL** violation.

* The kernel of truth here is that, in practice, clean-room engineering is often a good idea for the avoidance of legal risk. But there's nothing in either the GPL or the copyright statute that says you have to do it. Because that would be stupid. Imagine if novelists couldn't read books without running into copyright issues.

** The AGPL is the only widely-used license whose obligations attach on creation of a derivative work, rather than on distribution of that work. As far as I know, GitHub has no intention of distributing the model itself to anyone, so if you want to sue GitHub just for creating the model, you'd have to claim an AGPL violation specifically.

Posted Jun 23, 2022 19:26 UTC (Thu) by ballombe (subscriber, #9523)

The legal issue will probably depend on the specific data you can extract from Copilot. Obviously GitHub will not help you find out, and lawyers might not be sufficient. Maybe there is some specific request that leads Copilot to return the whole body of some GPL files, found for example by looking for certain patterns that occur in a single piece of software. That would strengthen the case.

Posted Jun 23, 2022 19:52 UTC (Thu) by Gaelan (subscriber, #145108)

Someone got Copilot to generate the fast inverse square root function from Quake III (which is GPL'd), "what the fuck?" comment and all: https://twitter.com/mitsuhiko/status/1410886329924194309

Amusingly, it also autocompleted a BSD license onto that code.

Posted Jun 23, 2022 20:28 UTC (Thu) by mrugiero (guest, #153040)

And in the case of the FSF their role is to actively try to be partial in favor of free software. By intent, not by accident. If they can get a win for FLOSS then they have accomplished their stated mission.
Impartiality is for the judge, not for the litigants.

Posted Jun 23, 2022 20:58 UTC (Thu) by mpldr (subscriber, #154861)

> [the FSF's] role is to actively try to be partial in favor of free software

There's nothing wrong with that, quite the opposite, but it's not helpful if what you want is a legal review. You may get some interesting points from them, sure; but it's not exactly helpful when trying to find out what is actually law (let alone that this is a court's job).

Posted Jun 24, 2022 2:33 UTC (Fri) by scientes (subscriber, #83068)

In the common law system the courts' job is to write law.

/Almost not sarcastic

Posted Jun 23, 2022 18:21 UTC (Thu) by bluca (subscriber, #118303)

The EU Copyright Directive allows Text and Data Mining, even for commercial purposes, regardless of what the copyright status and/or license of the text body is. I don't see how any terms of the original license (if there even is one) apply. If you have legal access to a body of proprietary text, you can still legally do TDM on it and train a model.

Posted Jun 23, 2022 18:43 UTC (Thu) by engla (subscriber, #47454)

It doesn't give you a blanket reason to publish parts of the corpus, though. Maybe it did not anticipate that the "analysis" (outcome) of the data mining might be a derivative of the data itself.

Posted Jun 23, 2022 19:38 UTC (Thu) by bluca (subscriber, #118303)

Where is that clause defined?

Posted Jun 23, 2022 18:26 UTC (Thu) by Polynka (subscriber, #129183)

> And, my advice to free software maintainers who are pissed that their licenses are being ignored. First, don’t use GitHub and your code will not make it into the model (for now).

This would be a good recommendation, but unfortunately…

The way software freedom is defined, it means that everyone is free to redistribute the source code as they wish, _including uploading it verbatim to G*tHub_ even if it doesn’t come from there. You cannot forbid it in a licence, because it would make the software non-free (and potentially introduce a licence incompatibility). Sure, you can just ask people not to, but not everybody listens, and when you notice that somebody uploaded your software to G*tHub against your wishes, it may already be too late and the code may already be stolen and incorporated into the machine learning dataset.

The only winning move would be not to play and not publish the source code anywhere nor show it to anyone. Which will obviously make your software proprietary, which may not be something that you would want.

> Instead, I would update your licenses to clarify that incorporating the code into a machine learning model is considered a form of derived work, and that your license terms apply to the model and any works produced with that model.

…which would also just be disregarded by G*tHub. Again, the only way to win is not to play.

Posted Jun 23, 2022 18:49 UTC (Thu) by NYKevin (subscriber, #129325)

>> Instead, I would update your licenses to clarify that incorporating the code into a machine learning model is considered a form of derived work, and that your license terms apply to the model and any works produced with that model.

Aha, I missed that line in DeVault's post.

This is not a thing you can do. The judge decides what counts as a derivative work. You don't. The license cannot override the applicable copyright law. If copyright law does not say that a license is required in order to create the model in the first place, then no provision of the license can prohibit it.
Not even "All rights reserved, do not redistribute."

OTOH, if the applicable copyright law says that a license is required to create such a model, then GitHub has to comply with the terms of the license, and it doesn't matter whether the license explicitly calls this out or not.

Posted Jun 23, 2022 23:07 UTC (Thu) by Wol (subscriber, #4433)

The problem with your approach is that INTENT MATTERS.

If your licence makes it clear that you consider putting your source into an ML algorithm to create a derivative work, then the Judge is likely to agree with you.

If your licence doesn't make it clear, then the Judge will almost certainly side against you on the basis that "copyright law is ambiguous". Fair enough. But if your licence does make it clear, the Judge needs a reason to say "your licence is invalid" (and he'd rather not).

Cheers,
Wol

Posted Jun 24, 2022 1:13 UTC (Fri) by NYKevin (subscriber, #129325)

Either the model is a derivative work of your code, or it is not. Licenses can't change that, only legislatures can. The intent of the person writing the code, or even of the person creating the model, has no bearing on any of this. If it so happens that the model is not a derivative work of the original in your jurisdiction, then you have to write your legislature and get them to change the law. You can't "fix" it by putting an extra term in the license; the license is not a law, and does not entitle you to impose arbitrary rights and restrictions on other people.

> If your licence doesn't make it clear, then the Judge will almost certainly side against you on the basis that "copyright law is ambiguous".

That is egregiously wrong. Ambiguous law does not automatically favor the defendant in any jurisdiction I've ever heard of. At best, the defendant might be able to raise the defense of "innocent infringement" in some jurisdictions.
But under US law, that does not relieve the defendant of liability, it merely reduces the monetary amount of their damages, which can still be quite substantial if many copies were made. Also, a valid copyright notice often defeats or greatly weakens this defense (see e.g. 17 USC 401(d)), depending on jurisdiction.

Seriously, if anyone in this thread is contemplating acting on this suggestion, I would strongly urge that person to consult an attorney who specializes in copyright law. This is not how the law works at all. You cannot go before a judge and say "the law was ambiguous so I just did it anyway," and expect to automatically win.

> But if your licence does make it clear, the Judge needs a reason to say "your licence is invalid" (and he'd rather not).

You have it backwards. Either the defendant is arguing that no license is required (and so it doesn't matter whether the license is valid or invalid), or the defendant is arguing that the license is valid and its actions fall within the scope of the license. Arguing that the license is invalid is something the plaintiff might do in order to defeat the latter defense; it never makes sense for the defendant to raise such an argument.

Posted Jun 24, 2022 2:30 UTC (Fri) by Wol (subscriber, #4433)

> Either the model is a derivative work of your code, or it is not. Licenses can't change that, only legislatures can.

And the legislature will not examine YOUR code, and decide YOUR case ... It is the Judge who *decides* whether it is a derivative work or not.

> That is egregiously wrong. Ambiguous law does not automatically favor the defendant in any jurisdiction I've ever heard of.

Who was talking about the *law*? I was talking about the *licence*. If the licence makes it absolutely clear that the licensor considers ML to create derivative code, then the licensee cannot claim an innocent mistake.
The licensee MUST claim that a licence is not required and copyright does not apply.

At the end of the day, it's down to the Judge to *apply* the law. And if it is clear to the Judge that the defendant "knew or should have known" that they were acting against the wishes of the licensor, then there is no defence of estoppel, or "innocent infringement", or "but I thought it was okay". And, faced with the choice of siding with the plaintiff and saying to the defendant "you knew the plaintiff did not permit that", or CREATING NEW LAW by explicitly defining ML into the Public Domain or whatever, which do you think a Judge is going to choose?

At the end of the day, putting this stuff into your licence does not change the law. But it makes it a damn sight more likely that the Judge is going to side with your interpretation of the law.

Cheers,
Wol

Posted Jun 24, 2022 2:55 UTC (Fri) by NYKevin (subscriber, #129325)

> And the legislature will not examine YOUR code, and decide YOUR case ... It is the Judge who *decides* whether it is a derivative work or not.

That is exactly my point. The judge will make this decision, based on the facts of the case and what the legislature wrote in the statute. Not based on the license. The license has zero to do with what is or is not a derivative work.

Posted Jun 24, 2022 8:04 UTC (Fri) by Wol (subscriber, #4433)

And you're completely missing my point. If the Judge has to decide whether ML is a derivative work or not (and create new law in the process!!!), then if the licensor made it clear that he considered it DID make a derivative work, the Judge will be inclined to side with the licensor.

If the license says "I consider this to be a derivative work", the Judge will not want to create new law by disagreeing - you're effectively twisting the Judge's arm.
How far he lets you twist it is down to him :-)

Remember PJ - Judges try to upset the apple-cart as little as possible. If you give the Judge an out, he will take it ...

Cheers,
Wol

Posted Jun 23, 2022 22:26 UTC (Thu) by jmspeex (subscriber, #51639)

Actually, this is not just about the GPL and copyleft licenses. Even a license as permissive as the BSD still has a few requirements (including the copyright notice) that are easy to break with ML code generation. So if the code your ML model spits out is BSD-licensed, you could still get in trouble over it.

Posted Jun 24, 2022 1:07 UTC (Fri) by developer122 (subscriber, #152928)

This is an important point. There were a raft of people arguing that all code from Copilot should be GPL due to ingested GPL code, to which I reply: What gives the GPL priority?

There is code on GitHub under lots of licences, e.g. the CDDL, which is a copyleft free software licence but incompatible with the GPL. As there is significant CDDL code on GitHub, under the same reasoning perhaps all the output should be CDDL?

And then there's all the code on GitHub which specifies no licence at all. Just because it is publicly available does not mean any and all rights have been granted to you. Plenty of this code is proprietary.

The end result is an unlicensable pile of legal mush that no sane lawyer should be going anywhere near.

Posted Jun 24, 2022 8:16 UTC (Fri) by bluca (subscriber, #118303)

Training of the model is not done under the terms of any license. It's done under the copyright directive laws, which give an exception to copyright rules when doing text and data mining, for any purpose. I don't see how there's any 'legal mush'.
