Are We There Yet? A Thorough Evaluation of POS Tagging on Czech

Authors

OHLÍDALOVÁ Vlasta, JAKUBÍČEK Miloš, RYCHLÝ Pavel

Year of publication 2025
Type Article in Proceedings
Conference Text, Speech, and Dialogue, 28th International Conference, TSD 2025
MU Faculty or unit

Faculty of Informatics

Citation
web Conference proceedings
DOI https://doi.org/10.1007/978-3-032-02551-7_23
Keywords morphological analysis; evaluation; POS tagging
Description With recent advances in natural language processing, part-of-speech (POS) tagging is one of the areas that has seen significant improvements. Contemporary state-of-the-art tools report accuracies approaching 100% even for morphologically rich languages such as Czech, which used to pose a challenge. In this study, we investigate whether such accuracy is reproducible on real-world data, as previous research has demonstrated substantial discrepancies between evaluations conducted on gold-standard corpora and those based on text typically occurring on the web. To address this issue, we selected a set of widely used and well-established POS taggers and applied them to a random sample of documents from the csTenTen23 web corpus. Tokens for which the taggers produced differing outputs were then manually annotated. Our results indicate that the ability of modern POS taggers to handle real-world data, including a broad range of genres and topics, has improved significantly compared with the earlier statistically based POS taggers. Furthermore, we observe a shift in the most problematic tagging category: whereas case assignment was previously a major source of errors, the best current models struggle more with POS category distinctions. We argue that this shift may reflect ambiguities inherent in the POS category itself, where even human annotators may not fully agree.
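The disagreement-based selection described above can be illustrated with a minimal sketch. It assumes that all taggers share one tokenization and emit one tag per token. The function names, tagger names, tags, and the example sentence are hypothetical illustrations and are not taken from the paper or from any specific tagger.

```python
from typing import Dict, List


def disagreement_indices(outputs: Dict[str, List[str]]) -> List[int]:
    """Return token positions where the taggers did not all assign the same tag."""
    lengths = {len(tags) for tags in outputs.values()}
    if len(lengths) != 1:
        raise ValueError("all taggers must tag the same tokenization")
    # zip(*...) yields, for each token position, the tuple of tags from all taggers.
    columns = zip(*outputs.values())
    return [i for i, column in enumerate(columns) if len(set(column)) > 1]


def accuracy_on_disputed(outputs: Dict[str, List[str]],
                         gold: Dict[int, str]) -> Dict[str, float]:
    """Score each tagger only on the manually annotated (disputed) positions."""
    if not gold:
        raise ValueError("no manually annotated tokens to score against")
    return {
        name: sum(tags[i] == tag for i, tag in gold.items()) / len(gold)
        for name, tags in outputs.items()
    }


# Hypothetical data: an ambiguous Czech sentence and invented tagger outputs.
tokens = ["Ženu", "holí", "stroj", "."]
outputs = {
    "tagger_a": ["NOUN", "VERB", "NOUN", "PUNCT"],
    "tagger_b": ["VERB", "NOUN", "NOUN", "PUNCT"],
    "tagger_c": ["NOUN", "VERB", "NOUN", "PUNCT"],
}

disputed = disagreement_indices(outputs)
# Hypothetical manual annotation for the disputed positions only.
gold = {0: "NOUN", 1: "VERB"}
scores = accuracy_on_disputed(outputs, gold)

for i in disputed:
    print(tokens[i], {name: tags[i] for name, tags in outputs.items()})
print(scores)
```

Note that accuracy computed only on disputed tokens says nothing about tokens where every tagger agreed; whether and how the study accounts for such tokens is not stated in this description.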