Editor’s note: This analysis is part of The Atlantic’s investigation into the OpenSubtitles data set. You can access the search tool directly here

There’s No Longer Any Doubt That Hollywood Writing Is Powering AI

submited by
Style Pass
2024-11-19 14:00:02

Editor’s note: This analysis is part of The Atlantic’s investigation into the OpenSubtitles data set. You can access the search tool directly here. Find The Atlantic's search tool for books used to train AI here.

For as long as generative-AI chatbots have been on the internet, Hollywood writers have wondered if their work has been used to train them. The chatbots are remarkably fluent with movie references, and companies seem to be training them on all available sources. One screenwriter recently told me he’s seen generative AI reproduce close imitations of The Godfather and the 1980s TV show Alf, but he had no way to prove that a program had been trained on such material.

I can now say with absolute confidence that many AI systems have been trained on TV and film writers’ work. Not just on The Godfather and Alf, but on more than 53,000 other movies and 85,000 other TV episodes: Dialogue from all of it is included in an AI-training data set that has been used by Apple, Anthropic, Meta, Nvidia, Salesforce, Bloomberg, and other companies. I recently downloaded this data set, which I saw referenced in papers about the development of various large language models (or LLMs). It includes writing from every film nominated for Best Picture from 1950 to 2016, at least 616 episodes of The Simpsons, 170 episodes of Seinfeld, 45 episodes of Twin Peaks, and every episode of The Wire, The Sopranos, and Breaking Bad. It even includes prewritten “live” dialogue from Golden Globes and Academy Awards broadcasts. If a chatbot can mimic a crime-show mobster or a sitcom alien—or, more pressingly, if it can piece together whole shows that might otherwise require a room of writers—data like this are part of the reason why.

The files within this data set are not scripts, exactly. Rather, they are subtitles taken from a website called OpenSubtitles.org. Users of the site typically extract subtitles from DVDs, Blu-ray discs, and internet streams using optical-character-recognition (OCR) software. Then they upload the results to OpenSubtitles.org, which now hosts more than 9 million subtitle files in more than 100 languages and dialects. Though this may seem like a strange source for AI-training data, subtitles are valuable because they’re a raw form of written dialogue. They contain the rhythms and styles of spoken conversation and allow tech companies to expand generative AI’s repertoire beyond academic texts, journalism, and novels, all of which have also been used to train these programs. Well-written speech is a rare commodity in the world of AI-training data, and it may be especially valuable for training chatbots to “speak” naturally.

Leave a Comment