Aryn’s goal is to provide analytics over unstructured documents, typically PDFs. We believe that integral to that goal is the ability to break down

We improved table extraction in DocParse with a new AI model

submited by

Style Pass

2025-01-23 21:30:02

Aryn’s goal is to provide analytics over unstructured documents, typically PDFs. We believe that integral to that goal is the ability to break down these documents into a more structured form that we can use to extract more information than you might be able to by simply stripping out all of the text and embedding it with a language model. Since documents are typically intended for human consumption, and humans use eyeballs to ingest information from documents, we believe that the first step of computer document processing is fundamentally a computer vision problem. To that end, we created a computer vision model to segment documents into a collection of physical, typed elements representing how a real person would likely parse the document.

With many documents, a large portion of customer questions can be answered by a table or set of tables within the document. If I have Boeing’s 10-K form from 2024 and I want to know how many airplanes they sold, the answer is in a table. If I have the municipal budget report of Wilmington, NC, and I want to know what percent of their budget they spent on policing, the answer is in a table. Like with documents, the tables themselves are laid out to optimize for human consumption - that is, visually. I can’t ask a PDF for an indexable data structure representing a particular table on a particular page because all the PDF sees is a collection of rectangles with text inside them. For Aryn to query these tables, we first need a way to turn a picture of a table into such a data structure.

We improved table extraction in DocParse with a new AI model

Leave a Comment

Related Posts

Recent Posts

DSON: A delta-state CRDT for resilient peer-to-peer communication

Crib Sheet: A Conventional Boy - Charlie's Diary

How the Kyiv Independent reached 20,000 paying members — with no paywall | Nieman Journalism Lab

Zillow Scraper - Agent & Property Export

Microsoft, anybody home?

The Efficiency Trap - Anthony Batt

Sketchers lets parents hide an AirTag in kids shoes

Math Is Quietly in Crisis over NSF Funding Cuts

Liberals are less willing to buy Teslas than other electric vehicles, moderated by perceptions of Elon Musk

Vintage base ball: the charm of a sport where the 1860s never ended

My Heart Of Hearts - by Scott Alexander - Astral Codex Ten

Search code, repositories, users, issues, pull requests...

Take control of your Kubernetes resources

I’m Obsessed with Other People’s Spotify Playlists

Performative ‘elections’ expose a sad lack of vision

The world's creative intelligence at your fingertips

Ryan Petersen: Building the Hidden Engine of Global Trade

‘We Are Just Here for Work’: Indian Scammer Explains Why He Moved to Laos

Search code, repositories, users, issues, pull requests...

Union-Backed Boston Ordinance Would Require Drivers in Driverless Waymos