Lost in Morphology - Enhancing Bilingual Lexicon Induction with Lemmatisation

Authors

DENISOVÁ Michaela, RYCHLÝ Pavel

Year of publication 2025
Type Article in Proceedings
Conference Proceedings of the Nineteenth Workshop on Recent Advances in Slavonic Natural Languages Processing
MU Faculty or unit

Faculty of Informatics

Keywords Bilingual lexicon induction; Morphology; Lemmatisation; Evaluation
Description Bilingual Lexicon Induction (BLI) is a fundamental task in cross-lingual word embedding (CWE) evaluation, aimed at retrieving word translations from monolingual corpora in two languages. However, morphological complexity poses an intractable challenge: translations deemed incorrect tend to be morphological variants of the correct ones. This study explores the role of lemmatisation in mitigating this issue by comparing two integration strategies: (1) pre-alignment lemmatisation, applied before training monolingual word embeddings (MWEs), and (2) post-retrieval lemmatisation, applied to retrieved target words. We conduct experiments using three state-of-the-art CWEs across a wide range of language pairs with varying morphological complexity, comparing Slavonic and other language families. Our findings reveal notable differences between the two approaches. Post-retrieval lemmatisation proves more beneficial for less morphologically complex language pairs, while pre-alignment lemmatisation performs well for those with moderate complexity. For highly inflected languages, the choice of approach has minimal impact.
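To make the post-retrieval strategy concrete, the following is a minimal sketch, not the authors' implementation. It assumes that BLI is scored with precision@1 against a gold dictionary stored in lemma form, and it uses a hypothetical toy lemma table (`TOY_LEMMAS`) with invented Czech examples in place of a real lemmatiser and real retrieval output.

```python
from typing import Callable, Dict, List, Set


def precision_at_1(
    retrieved: Dict[str, List[str]],
    gold: Dict[str, Set[str]],
    normalise: Callable[[str], str] = lambda w: w,
) -> float:
    """Share of source words whose top-ranked candidate, after
    normalisation, appears among the gold translations."""
    hits = 0
    for src, refs in gold.items():
        candidates = retrieved.get(src, [])
        if candidates and normalise(candidates[0]) in refs:
            hits += 1
    return hits / len(gold) if gold else 0.0


# Hypothetical lemma table standing in for a real lemmatiser (assumption).
TOY_LEMMAS = {"psa": "pes", "domy": "dům", "kočky": "kočka"}


def toy_lemmatise(word: str) -> str:
    """Map an inflected form to its lemma; unknown words pass through."""
    return TOY_LEMMAS.get(word, word)


# Invented gold dictionary (lemmas) and ranked retrieval output (assumption).
gold = {"dog": {"pes"}, "house": {"dům"}, "cat": {"kočka"}, "tree": {"strom"}}
retrieved = {
    "dog": ["psa", "pes"],   # inflected form ranked first
    "house": ["domy"],       # plural form of the correct lemma
    "cat": ["kočka"],        # already a lemma
    "tree": ["les"],         # genuinely wrong translation
}

raw_p1 = precision_at_1(retrieved, gold)
lemma_p1 = precision_at_1(retrieved, gold, normalise=toy_lemmatise)

print(f"P@1 without lemmatisation: {raw_p1:.2f}")
print(f"P@1 with post-retrieval lemmatisation: {lemma_p1:.2f}")
```

In this toy setting, lemmatising the retrieved words recovers the two candidates that were only morphological variants of the gold translation, while the genuinely wrong candidate stays wrong. Pre-alignment lemmatisation would instead lemmatise the corpora before the monolingual embeddings are trained, which this sketch does not cover.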