dafont.com is a wonderful website that contains a large collection of fonts. It’s more comprehensive and esoteric than Google Fonts. One of its features is a forum where users can ask for help identifying fonts – check out this poor fellow who’s been waiting for over two years and bumped his thread. I thought it would be interesting to see if an LLM could do this task, so I scraped the forum and set up a benchmark.
I implemented this as a live benchmark. By this I mean that I only ask the LLMs to identify fonts that haven't yet been identified by the community. I evaluate each LLM's prediction by comparing it to the community's answer once the latter has been made. This way, I'm sure that I'm asking the LLM to work on images it has never seen before. Indeed, there are many examples of LLM evaluations being inflated by models that are simply too good at memorizing images and text seen during training.
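To make the protocol concrete, here is a minimal sketch of the loop, assuming hypothetical helpers for scraping open threads, querying an LLM, and re-checking threads later; none of these names come from my actual code, they just illustrate the two phases: record predictions now, score them once the community answers.

```python
import json
import time
from pathlib import Path

PREDICTIONS = Path("predictions.jsonl")

def collect_predictions(scrape_open_threads, ask_llm):
    """Ask the LLM about fonts the community hasn't identified yet."""
    with PREDICTIONS.open("a") as f:
        for thread in scrape_open_threads():       # hypothetical: threads with no accepted answer
            guess = ask_llm(thread["image_url"])    # hypothetical: LLM's font guess for the sample image
            f.write(json.dumps({
                "thread_id": thread["id"],
                "predicted_font": guess,
                "timestamp": time.time(),
            }) + "\n")

def score_predictions(fetch_community_answer, fonts_match):
    """Later, compare stored guesses to whatever the community eventually decided."""
    correct = total = 0
    for line in PREDICTIONS.open():
        record = json.loads(line)
        answer = fetch_community_answer(record["thread_id"])  # hypothetical re-scrape
        if answer is None:          # still unidentified, skip for now
            continue
        total += 1
        correct += fonts_match(record["predicted_font"], answer)
    return correct / total if total else None
```

The key design choice is that a prediction is written down, timestamped, before any ground truth exists, so there is no way for the model to have seen the answer.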
Benchmark contamination is a real issue in LLM evaluation, so I think it's important to set up benchmarks this way when possible. There have been some great efforts in this direction, including LiveBench and the Konwinski Prize.