RealKIE Releases Five New Datasets for Enterprise Key Information Extraction

RealKIE presents five datasets designed to benchmark key information extraction on enterprise documents.

The collection covers SEC S1 filings, non-disclosure agreements, charity reports, FCC invoices, and resource contracts. The datasets target tasks such as investment analysis and legal document analysis, and are meant to reflect the complexity of real-world documents. The accompanying description covers the annotation process, document processing techniques, and baseline modeling approaches, giving researchers a starting point for building natural language processing tools that hold up in practical applications. The datasets and OCR results are openly available to researchers, and the authors promise to release the code for the baseline models in the near future.

Read more: arXiv