When Tesseract Brings Friends: Layout Analysis, Language Identification, and Super-Resolution in the Optical Character Recognition of Medieval Texts

Novotný, Vít; Seidlová, Kristýna; Vrabcová, Tereza; Horák, Aleš

Warning

This publication does not belong to the Faculty of Arts but to the Faculty of Informatics. The official publication page is on the muni.cz website.

Authors	NOVOTNÝ Vít; SEIDLOVÁ Kristýna; VRABCOVÁ Tereza; HORÁK Aleš
Year of publication	2021
Type	Article in proceedings
Conference	Recent Advances in Slavonic Natural Language Processing (RASLAN 2021)
MU Faculty or unit	Faculty of Informatics
Keywords	Optical character recognition; Layout analysis; Language identification; Image super-resolution; Medieval texts
Description	The aim of the AHISTO project is to make documents from the Hussite era (1419–1436) available to the general public through a web-hosted searchable database. Scanned images of letterpress reprints from the 19th and 20th centuries are available, but extracting searchable text from them requires accurate optical character recognition (OCR) algorithms. In our previous article [15], we showed that the Tesseract 4 OCR algorithm was the second fastest and the most accurate of five OCR algorithms. In this article, we investigate the impact of three preprocessing techniques on the accuracy of Tesseract 4. Additionally, we compare Tesseract 4 with three other OCR algorithms on the language identification task. Furthermore, we publish an open dataset [16] of scanned images and OCR texts with human annotations for layout analysis, OCR evaluation, and language identification. In Section 2, we describe the related work in OCR preprocessing. In Section 3, we describe our three preprocessing techniques and our two evaluation tasks. In Section 4, we discuss the results of our evaluation. In Section 5, we offer concluding remarks and ideas for future work in the OCR of medieval texts.
Related projects:	LINDAT/CLARIAH-CZ: Digital Research Infrastructure for Language Technologies, Arts and Humanities