OpenAI to reveal secret training data in copyright case – for lawyers' eyes only

Submitted by Style Pass, 2024-10-09 20:00:03

OpenAI has agreed to reveal the data used to train its generative AI models to attorneys pursuing copyright claims against the developer on behalf of several authors.

The authors – among them Paul Tremblay, Sarah Silverman, Michael Chabon, David Henry Hwang, and Ta-Nehisi Coates – sued OpenAI and its affiliates last year, arguing its AI models have been trained on their books and reproduce their words in violation of US copyright law and California's unfair competition rules. The writers' actions have been consolidated into a single claim [PDF].

On Tuesday, US magistrate judge Robert Illman issued an order [PDF] specifying the protocols and conditions under which the authors' attorneys will be granted access to OpenAI's training data.

The terms of access are strict, and treat the training data set as the equivalent of sensitive source code, a proprietary business process, or a secret formula. Even so, the models used for ChatGPT (GPT-3.5, GPT-4, etc.) presumably relied heavily on publicly accessible, widely known data, as was the case with GPT-2, for which a list of the domains whose content was scraped is on GitHub (The Register is on the list).

"Training data shall be made available by OpenAI in a secure room on a secured computer without internet access or network access to other unauthorized computers or devices," the judge's order states.
