- Introduction
- The Problem
- Other Approaches
- Idea and Key Insights
  - Token Stability and Learning
  - Trade-off Between Training and Inference Costs
- Training CryptGPT
  - Step 1: Encrypt the Dataset
  - Step 2: Train the Tokenizer
  - Step 3: Train the Model
  - Model Training
  - Training Logs and Model Artifacts
- Results
- Limitations of This Approach
  - Model and Tokenizer Tied to the Key
  - Susceptibility to Frequency Analysis Attacks
  - Model Weights Leakage
- Addressing Those Limitations
  - Decoupling the Model and Tokenizer from the Key
  - Using a Stronger Encryption Algorithm
  - Mitigating Model Weights Leakage
- Potential Applications
  - Example
- Future Work
- Summary and Future Directions
- A Challenge for Cryptanalysts and LLM Researchers
Language models like GPT-4 are remarkably capable: they can generate text, answer questions, and help with all sorts of tasks. But as they become more widely used, privacy concerns are growing. How can we ensure that the data used to train these models, and the text they generate, stays private?