When Tesseract Does It Alone: Optical Character Recognition of Medieval Texts

Authors

NOVOTNÝ Vít

Year of publication 2020
Type Article in Proceedings
Conference Proceedings of the Fourteenth Workshop on Recent Advances in Slavonic Natural Language Processing, RASLAN 2020
MU Faculty or unit

Faculty of Informatics

Keywords Optical character recognition; OCR; Historical texts
Description

Optical character recognition of scanned images for contemporary printed texts is widely considered a solved problem. However, the optical character recognition of early printed books and reprints of Medieval texts remains an open challenge.

In our work, we present a dataset of 19th- and 20th-century letterpress reprints of documents from the Hussite era (1419–1436) and perform a quantitative and qualitative evaluation of the speed and accuracy of six existing OCR algorithms.

We conclude that the Tesseract family of OCR algorithms is the fastest and the most accurate on our dataset, and we suggest improvements to our dataset.
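
As an illustration of how speed and accuracy might be measured for one OCR algorithm, the following is a minimal sketch. It assumes accuracy is expressed as a character error rate (edit distance divided by reference length); the description above does not state which metric the paper uses, so this is an assumption. The names `edit_distance`, `character_error_rate`, and `evaluate` are hypothetical helpers, and `ocr` stands for any callable that maps a page image to recognized text.

```python
import time


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance normalized by the length of the reference text."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def evaluate(ocr, pages):
    """Return (character error rate, seconds spent in OCR) over all pages.

    `ocr` is a hypothetical callable: image -> recognized text.
    `pages` is an iterable of (image, reference_text) pairs.
    """
    errors = 0
    chars = 0
    seconds = 0.0
    for image, reference in pages:
        start = time.perf_counter()
        hypothesis = ocr(image)
        seconds += time.perf_counter() - start  # time only the OCR call
        errors += edit_distance(reference, hypothesis)
        chars += len(reference)
    return errors / chars, seconds
```

Running `evaluate` once per algorithm on the same pages would give directly comparable error rates and timings; a word-level error rate could be obtained the same way by comparing lists of tokens instead of strings.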

