We build a BERT-like transformer model to predict the correct article (der/die/das) of German nouns. We try different model sizes to determine the best size/accuracy tradeoff.
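To make the setup concrete, here is a minimal sketch of what such a model can look like: a tiny BERT-style encoder that reads a noun character by character and classifies it as one of the three articles. The character vocabulary, hyperparameters, and mean-pooling head are assumptions for illustration only and do not reflect the exact configurations we trained.

```python
# Minimal sketch (not the actual training code): a tiny BERT-style encoder
# that classifies a German noun into one of the three definite articles.
# Vocabulary, sizes, and the pooling head are illustrative assumptions.
import torch
import torch.nn as nn

ARTICLES = ["der", "die", "das"]
CHARS = "abcdefghijklmnopqrstuvwxyzäöüß"   # assumed character vocabulary
PAD, VOCAB = 0, len(CHARS) + 1             # index 0 reserved for padding
MAX_LEN = 24

class ArticleClassifier(nn.Module):
    def __init__(self, d_model=64, n_heads=4, n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, d_model, padding_idx=PAD)
        self.pos = nn.Parameter(torch.zeros(MAX_LEN, d_model))  # learned positions
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, len(ARTICLES))

    def forward(self, tokens):  # tokens: (batch, MAX_LEN) of character ids
        x = self.embed(tokens) + self.pos
        x = self.encoder(x, src_key_padding_mask=tokens == PAD)
        return self.head(x.mean(dim=1))  # simple mean-pool, then classify

def encode(noun: str) -> torch.Tensor:
    ids = [CHARS.index(c) + 1 for c in noun.lower() if c in CHARS][:MAX_LEN]
    return torch.tensor(ids + [PAD] * (MAX_LEN - len(ids)))

model = ArticleClassifier()
logits = model(encode("Haus").unsqueeze(0))  # e.g. "das Haus"
print(ARTICLES[logits.argmax(dim=-1).item()])
```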
Our best model1 achieves an accuracy of 84%. We tested many different model configurations, but they all converged to approximately the same accuracy, just with widely different runtimes. The smallest model we tested2 has only 406,659 parameters and still reaches an accuracy of 72%. Most models were trained on 82,825 examples and validated on 9,203. Small-scale experiments with the train and validation splits swapped suggest that the same performance can be reached with far fewer training examples.
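Reproducing the kind of size comparison described above mostly comes down to counting trainable parameters for a few candidate configurations. The snippet below reuses the `ArticleClassifier` sketch from the previous block; the widths and depths are placeholders, not the configurations reported above.

```python
# Count trainable parameters for a few illustrative model sizes.
# Assumes the ArticleClassifier sketch defined above is in scope.
import torch.nn as nn

def count_params(model: nn.Module) -> int:
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

for d_model, n_layers in [(32, 2), (64, 2), (128, 4)]:  # placeholder sizes
    m = ArticleClassifier(d_model=d_model, n_layers=n_layers)
    print(f"d_model={d_model}, layers={n_layers}: {count_params(m):,} parameters")
```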
A paper will be published as soon as there is a Devin-like system that simulates a grad student who can do all the LaTeX/documentation/writing work that I don't really feel like doing.