RD-TableBench is an open benchmark to help teams evaluate extraction performance for complex tables. The benchmark includes a variety of challenging scenarios including scanned tables, handwriting, language detection, merged cells, and more.

We also benchmarked the extraction performance of various models on RD-TableBench. All result data points are available in the RD-TableBench Demo.
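To make "extraction performance" concrete, here is a minimal sketch of cell-level scoring against a ground-truth annotation. The grid representation and exact-match criterion are assumptions for illustration only, not RD-TableBench's published metric.

```python
# Hypothetical sketch: score one extracted table against its ground-truth grid.
# Exact-match, position-for-position comparison is an assumption for illustration.
from typing import List

Grid = List[List[str]]  # rows of cell text


def cell_accuracy(predicted: Grid, truth: Grid) -> float:
    """Fraction of ground-truth cells reproduced exactly in the same position."""
    total = sum(len(row) for row in truth)
    if total == 0:
        return 1.0
    correct = 0
    for r, truth_row in enumerate(truth):
        pred_row = predicted[r] if r < len(predicted) else []
        for c, cell in enumerate(truth_row):
            pred_cell = pred_row[c] if c < len(pred_row) else ""
            if pred_cell.strip() == cell.strip():
                correct += 1
    return correct / total


# Example: a merged header cell that an extractor split incorrectly
# (representing the merged span as a trailing empty cell is also an assumption).
truth = [["Revenue 2023", ""], ["Q1", "Q2"]]
pred = [["Revenue", "2023"], ["Q1", "Q2"]]
print(f"cell accuracy: {cell_accuracy(pred, truth):.2f}")  # 0.50 under these assumptions
```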

Reducto employed a team of PhD-level human labelers who manually annotated 1000 complex table images from a diverse set of publicly available documents. While the other approaches we benchmarked may have been trained on some of this data, it was unseen by Reducto's models during both training and validation.

The dataset was selected to include examples spanning a range of table structures, text densities, and languages. The following graphs break down the tables by number of cells and by language.
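A breakdown like this can be tallied directly from annotation metadata. The sketch below assumes per-table JSON annotations with "cells" and "language" fields; that directory layout and schema are hypothetical, not the published format.

```python
# Hypothetical sketch: tally table size (cell count) and language across the dataset.
import json
from collections import Counter
from pathlib import Path

size_bins = Counter()
languages = Counter()

# "rd_tablebench/annotations/*.json" is an assumed layout for illustration.
for path in Path("rd_tablebench/annotations").glob("*.json"):
    ann = json.loads(path.read_text())
    n_cells = len(ann["cells"])          # assumed field: list of annotated cells
    lo = (n_cells // 50) * 50            # bucket tables into 50-cell bins
    size_bins[f"{lo}-{lo + 49} cells"] += 1
    languages[ann.get("language", "unknown")] += 1  # assumed field

print("table size:", size_bins.most_common())
print("language:", languages.most_common())
```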

For the initial release, we evaluated the following tools/methods: Reducto, Azure Document Intelligence, AWS Textract Tables, GPT-4o, Google Cloud Document AI, Unstructured, and Chunkr.
