This video series from R&D features team members describing their roles, processes and the specific technical challenges they encounter while building and shipping projects. Along with each episode, we’ll share relevant background, resources, references and advice for anyone interested in creating something similar or learning more. If you have any questions, you can email us at firstname.lastname@example.org.
In this episode, R&D Intern Lasse Nordahl explains the process of converting over 10 million scanned images of articles from The Times’s archive into machine-readable text. Improvements in optical character recognition (O.C.R.) and computer vision classification models have made it possible to extract text from large image datasets with greater accuracy than ever before, but the newspaper format presents a unique set of challenges. While most readers intuitively understand how to read a printed article, a computer struggles to distinguish between individual components like headlines, captions and multiple columns of text across many pages—especially as formatting conventions change from decade to decade. We built an automated pipeline for breaking apart and categorizing the different components of an article so we can piece them back together in a structure that mimics articles on the web today. Accurate text transcriptions of our archives could allow us to put more historic content online for readers to explore, improve our article search and introduce new datasets for research and experimentation.
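The episode doesn't detail the pipeline's implementation, but the core idea of labeling page regions and stitching them back into reading order can be illustrated with a toy sketch. Everything below is hypothetical: real systems use trained layout-classification models rather than the crude geometric thresholds shown here, and the `Block`, `classify` and `reassemble` names are inventions for this example.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """An O.C.R.'d region of a scanned page (pixel coordinates)."""
    x: int      # left edge
    y: int      # top edge
    w: int      # width
    h: int      # height
    text: str


def classify(block: Block, page_width: int) -> str:
    """Label a block with a crude geometric heuristic.

    Wide blocks near the top of the page are treated as headlines;
    everything else is body text. (Illustrative thresholds only --
    a production pipeline would use a learned classifier.)
    """
    if block.w > 0.6 * page_width and block.y < 150:
        return "headline"
    return "body"


def reading_order(blocks: list[Block], column_width: int = 250) -> list[Block]:
    """Sort body blocks into newspaper reading order:
    left column first, top to bottom, then the next column."""
    return sorted(blocks, key=lambda b: (b.x // column_width, b.y))


def reassemble(blocks: list[Block], page_width: int) -> str:
    """Piece the labeled blocks back into a linear, web-style article."""
    labeled = [(classify(b, page_width), b) for b in blocks]
    headline = " ".join(b.text for tag, b in labeled if tag == "headline")
    body_blocks = [b for tag, b in labeled if tag == "body"]
    body = "\n".join(b.text for b in reading_order(body_blocks))
    return f"{headline}\n\n{body}"


# Tiny demo: a headline spanning two columns of body text,
# supplied out of order as an O.C.R. engine might emit them.
page = [
    Block(x=260, y=200, w=200, h=600, text="Column two continues the story."),
    Block(x=10, y=100, w=480, h=60, text="BIG NEWS TODAY"),
    Block(x=10, y=200, w=200, h=600, text="Column one opens the story."),
]
print(reassemble(page, page_width=500))
```

The demo prints the headline first, followed by the two body columns in left-to-right reading order. The hard part in practice, as the episode notes, is that the thresholds that work for one decade's layout conventions fail for another's, which is why the real pipeline relies on computer-vision classification rather than fixed rules.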