The Urgent Need for an Open Source PDF to HTML/XML Converter in the Preprint Ecosystem

submitted by

Style Pass

2024-04-24 09:00:12

The world of scholarly communication is rapidly evolving, with preprints and preprint servers gaining significant traction. As we enter the second wave of this evolution, preprint review communities are emerging, raising important questions about how to position a document alongside its reviews and what to do about metadata.

The key problem lies in the file formats used by preprint servers, which rely heavily on PDF, making it extremely challenging to reconstitute structured content or extract metadata. PDF is a terrible format, and reconstructing structured information from a PDF is, as Peter Murray-Rust has famously said, like "constructing a hamburger into a cow". The scholarly community has known about the problems of PDF for a long time, as attested by the beautifully named 'Beyond the PDF' symposium in 2011 (later to become Force11), which sought to explore richer and more dynamic ways of presenting scholarly content than PDF.

So why is it that, from where I sit, one of the most important evolutions of scholarly communications in the last decade, the rise of preprint servers and the preprint review/curate ecosystem, is dedicated to PDF? Ideally, preprint servers would change the submission format, but I fear we are too late to achieve that across the entire ecosystem. At least, I don't see that change coming quickly. So what are we to do?