New publication in LREV journal

1 October 2019, by Reinhard Zierke

A new publication from LT group member Gregor Wiedemann has appeared in "Language Resources and Evaluation":

Wiedemann, G.; Heyer, G. (2019): Multi-modal page stream segmentation with convolutional neural networks, In: Language Resources and Evaluation (LREV), Online first: 27.09.2019.

The paper introduces an approach that combines image and text information for document flow separation and evaluates it on two datasets: an in-house German dataset and a public English dataset. The paper is joint work with the ASV at Leipzig University.

Abstract

In recent years, (retro-)digitizing paper-based files became a major undertaking for private and public archives as well as an important task in electronic mailroom applications. As first steps, the workflow usually involves batch scanning and optical character recognition (OCR) of documents. In the case of multi-page documents, the preservation of document contexts is a major requirement. To facilitate workflows involving very large amounts of paper scans, page stream segmentation (PSS) is the task to automatically separate a stream of scanned images into coherent multi-page documents. In a digitization project together with a German federal archive, we developed a novel approach for PSS based on convolutional neural networks (CNN). As a first project, we combine visual information from scanned images with semantic information from OCR-ed texts for this task. The multi-modal combination of features in a single classification architecture allows for major improvements towards optimal document separation. Further to multimodality, our PSS approach profits from transfer-learning and sequential page modeling. We achieve accuracy up to 95% on multi-page documents on our in-house dataset and up to 93% on a publicly available dataset.
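
To illustrate the segmentation task described in the abstract, here is a minimal sketch in Python. It assumes that PSS is framed as a per-page decision on whether a page starts a new document; the abstract does not spell out this framing, and the function name `segment_page_stream` and its parameters are hypothetical. The decisions are passed in as plain booleans, standing in for the output of a classifier such as the CNN described in the paper, which is not reproduced here.

```python
from typing import List, Sequence


def segment_page_stream(
    pages: Sequence[str],
    starts_new_document: Sequence[bool],
) -> List[List[str]]:
    """Group a stream of pages into documents.

    `pages` holds page identifiers in scan order. `starts_new_document`
    holds one boolean per page (assumed to come from some classifier):
    True means the page is predicted to begin a new document.
    """
    if len(pages) != len(starts_new_document):
        raise ValueError("need exactly one decision per page")

    documents: List[List[str]] = []
    for page, is_start in zip(pages, starts_new_document):
        # The very first page always opens a document, whatever the prediction.
        if is_start or not documents:
            documents.append([page])
        else:
            documents[-1].append(page)
    return documents


if __name__ == "__main__":
    stream = ["scan_001", "scan_002", "scan_003", "scan_004", "scan_005"]
    decisions = [True, False, False, True, False]
    for number, document in enumerate(segment_page_stream(stream, decisions), 1):
        print(number, document)
```

Under this framing, a single wrong decision either splits one document in two or merges two neighbouring documents, which is why per-page accuracy matters for the quality of the resulting separation.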
