2024 was my year of training Transformer Language Models (LMs). The hours I spent on building, training and debugging my models!
I’ve been building and using NLP models based on LMs for a while, but this year I spent much more time training them from scratch, trying many variations and learning the hard way how brittle these systems are and how difficult they are to comprehend.
Wisely, this sentence is a one-sided conditional. It does not guarantee that you understand what you create. My understanding of the models I train is certainly imperfect. Even when they work, when I’ve created them successfully, they remain opaque.
One can also question to what extent I created my language models. The data, the PyTorch framework, and the CUDA library all have different authors. While writing the first version of the program, choosing architectural details, and selecting the data sources, it is easy to believe one is in control. But, at least for me, this feeling was rapidly punctured by the failure of the first version. And then the failure of the second version, and the third, and the fourth…
Debugging a transformer model is hard work, and it is often exceedingly difficult to tell whether one has made a coding error or just selected the wrong hyperparameters (learning rate, batch size, etc.). I have certainly done both at the same time, making it nearly impossible to tell why the model underperforms. I went through the code again and again and re-read the relevant passages in two papers again and again.