
Simon Willison’s Weblog

2023-06-04

One of the most common concerns I see about large language models regards their training data. People are worried that anything they say to ChatGPT could be memorized by it and spat out to other users. People are concerned that anything they store in a private repository on GitHub might be used as training data for future versions of Copilot.

When someone asked Google Bard how it was trained back in March, it told them its training data included internal Gmail! This turned out to be a complete fabrication—a hallucination by the model itself—and Google issued firm denials, but it’s easy to see why that freaked people out.

I’ve been wanting to write something reassuring about this issue for a while now. The problem is... I can’t do it. I don’t have the information I need to credibly declare these concerns unfounded, and the more I look into this the murkier it seems to get.

The fundamental issue here is one of transparency. The builders of the big closed models—GPT-3, GPT-4, Google’s PaLM and PaLM 2, Anthropic’s Claude—refuse to tell us what’s in their training data.
