
DRINK ME: (Ab)Using a LLM to compress text


Large language models are trained on huge datasets of text to learn the relationships and contexts of words within larger documents. These relationships are what allow the model to generate text.

Recently I've read concerns about LLMs being trained on copyrighted text and reproducing it. This got me thinking: can training text be extracted from an LLM? The answer, of course, is yes, and this isn't a new (or open) question. That led me to wonder what it would take to extract entire books, or to have an LLM reproduce text it has never directly been trained on. I figured that many texts contain sections that naturally align with the language relationships the model has learned. If that's the case, then perhaps I could use the model to infer those sections and correct its course whenever it deviates.
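The post doesn't show the implementation here, but a minimal sketch of this "predict, and correct on deviation" idea might look like the following, assuming a Hugging Face causal LM with greedy (argmax) decoding. The model name, and the scheme of storing a placeholder for correctly predicted tokens and a literal token id otherwise, are illustrative assumptions rather than the author's actual format:

```python
# Sketch: LLM-as-compressor. Tokens the model predicts correctly are
# replaced with a placeholder; only deviations are stored literally.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

MODEL_NAME = "gpt2"  # stand-in; any deterministic causal LM works
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def compress(text: str) -> list:
    ids = tokenizer.encode(text)
    out = [ids[0]]  # first token has no context, store it literally
    for i in range(1, len(ids)):
        context = torch.tensor([ids[:i]])
        with torch.no_grad():
            logits = model(context).logits[0, -1]
        predicted = int(torch.argmax(logits))
        # None marks "the model guessed this token correctly";
        # otherwise store the literal token id as a correction.
        out.append(None if predicted == ids[i] else ids[i])
    return out

def decompress(stream: list) -> str:
    ids = [stream[0]]
    for item in stream[1:]:
        if item is None:
            # Re-run the same deterministic prediction to recover
            # the token the compressor elided.
            context = torch.tensor([ids])
            with torch.no_grad():
                logits = model(context).logits[0, -1]
            ids.append(int(torch.argmax(logits)))
        else:
            ids.append(item)
    return tokenizer.decode(ids)
```

Because decompression replays the exact same deterministic predictions, the placeholders reconstruct losslessly; only the corrections take up meaningful space. That is why text that aligns well with the model's learned relationships should compress very well.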

I used two texts for testing. For the first, I chose the first chapter of "Alice's Adventures in Wonderland", as I assumed it would be in the model's training data. As expected, I got very good compression.
