Lemmatization of Czech and Croatian Noun Clusters for Terminology Extraction.

Authors

BLAHUŠ Marek, PETREKOVÁ Katarína

Year of publication 2025
Type Article in Proceedings
Conference Recent Advances in Slavonic Natural Language Processing, RASLAN 2025
MU Faculty or unit

Faculty of Informatics

Citation
Proceedings of the Nineteenth Workshop on Recent Advances in Slavonic Natural Languages Processing, RASLAN 2025.
Keywords noun clusters; lemmatization; terminology extraction
Description During terminology extraction, terms discovered in corpora are presented in their canonical form. Lemmatization of multi-word terms consisting of noun clusters can be ambiguous due to the lack of information on their internal structure. In this paper, we show that grammatical case alone is often not sufficient for the construction of canonical forms of noun clusters. We focus on two-noun clusters in the genitive, which are the most frequent type with ambiguous parsing. Based on corpus research, we design rules that make use of multiple morphological categories to improve the lemmatization of noun clusters found in Czech and Croatian corpora. In addition to case, we also take note of gender, animacy, and whether the noun is a proper noun. The improvements lead to more accurate and more unified forms of the terms produced during terminology extraction for these two languages in Sketch Engine.
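To illustrate why case alone may not settle the canonical form, the following is a minimal Python sketch and not the authors' actual rule set. It assumes a two-noun cluster in which both nouns are in the genitive, and it uses one hypothetical heuristic: if the second noun is a proper noun that agrees with the first in gender and animacy, the pair is treated as an apposition and both nouns are put in the nominative; otherwise the second noun is treated as a genitive modifier and keeps its form. The `Noun` class, its fields, and the function `lemmatize_genitive_pair` are invented for this example, and the Czech word forms are supplied by hand rather than taken from any tagger.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Noun:
    """One noun of a cluster, with hand-supplied (hypothetical) annotations."""
    form: str      # surface form as found in the corpus (genitive here)
    lemma: str     # nominative singular
    gender: str    # "M", "F" or "N"
    animate: bool  # animacy
    proper: bool   # True for proper nouns


def lemmatize_genitive_pair(first: Noun, second: Noun) -> str:
    """Build a canonical form for a genitive + genitive noun pair.

    Illustrative heuristic only (an assumption, not the published rules):
    a proper noun agreeing with the first noun in gender and animacy is
    read as an apposition, so both nouns go to the nominative. Otherwise
    the second noun is read as a genitive modifier and is left unchanged.
    """
    agrees = first.gender == second.gender and first.animate == second.animate
    if second.proper and agrees:
        return f"{first.lemma} {second.lemma}"
    return f"{first.lemma} {second.form}"


# "prezidenta Havla": apposition, both nouns change
print(lemmatize_genitive_pair(
    Noun("prezidenta", "prezident", "M", True, False),
    Noun("Havla", "Havel", "M", True, True),
))

# "rozvoje města": head noun plus genitive modifier, only the head changes
print(lemmatize_genitive_pair(
    Noun("rozvoje", "rozvoj", "M", False, False),
    Noun("města", "město", "N", False, False),
))
```

Both inputs look identical if only case is inspected (genitive followed by genitive), yet under this heuristic the first yields "prezident Havel" and the second "rozvoj města". A real rule set would need corpus evidence to decide which combinations of categories are reliable.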