Rossum Publishes World’s Largest Research Dataset and Benchmark to Accelerate Scientific Progress in Intelligent Document Processing (IDP)

March 1, 2023

Rossum, the pioneer in cloud-native Intelligent Document Processing (IDP), announced that it has published the world’s largest research dataset to accelerate scientific progress in business document information extraction (IE). Large datasets are crucial for improving AI models and measuring how they perform, which is why the DocILE (Document Information Localization and Extraction) benchmark is so important. It is the world’s largest collection of business documents for the most practical information extraction tasks in IDP. Rossum’s R&D efforts continue to focus on delivering faster and more accurate document information extraction methods, so customers can minimize slow, tedious, and error-prone manual document processing.

“This is an important milestone because it advances IDP research as a whole, where everyone can now develop and test more advanced algorithms on a benchmark of challenging and highly practical tasks,” said Milan Šulc, Ph.D., Head of Rossum’s AI Labs. “The new dataset will increase accuracy levels in document information extraction by accelerating research in areas such as novel machine learning architectures and training objectives. This will ultimately lead to global optimization of business communication and workflows, further increasing the amount of the time saved for our customers.”

Datasets and benchmarks for business document IE are very rare because such documents often contain sensitive information and are legally protected. DocILE addresses this issue with a benchmark composed of documents from two public data sources: the UCSF Industry Documents Library and Public Inspection Files (PIF). The dataset consists of more than a hundred thousand documents with labels for practical IE tasks: 6,700 annotated real business documents and 100,000 synthetically generated documents. It also comes with a large set of approximately a million unlabeled documents that can be used for unsupervised learning.

The DocILE benchmark was created through a collaboration among researchers from Rossum, Czech Technical University in Prague, University of La Rochelle, and the Autonomous University of Barcelona. It follows the peer-reviewed position paper Business Document Information Extraction: Towards Practical Benchmarks, presented by Rossum’s AI Labs at the recent CLEF 2022 conference.

The benchmark is hosted as a competition at ICDAR 2023, the largest research conference on document analysis, and as a CLEF 2023 lab; see the lab teaser (arXiv preprint, accepted to ECIR 2023). Rossum sponsors the competition with a prize pool of $9,000 to attract open-source contributions. To find out more about the dataset, download the detailed dataset paper (arXiv preprint). By working with real-world business documents, the research community can focus on advances that will have a large impact on how businesses operate globally.

While Rossum continues to lead the IDP market with its AI and machine learning capabilities, this technology is rapidly evolving. It is paramount that any company focused on AI consistently research its next technological advances. The new dataset will enable ongoing innovation within the IDP field.

About Rossum

Rossum is a market-leading Intelligent Document Processing (IDP) solution combining the industry’s most advanced data extraction capabilities with a complete low-code platform that automates significant amounts of manual work across a company’s document processing workflow. Each month, Rossum saves its clients more than 50,000 hours on manual document processing. Hundreds of organizations across a wide range of sizes and industries, including Bosch, HelloFresh, Morton Salt, and The Master Trust Bank of Japan, use Rossum to reduce manual effort, improve turnaround times, and eliminate errors. Learn more at

