Lost in Morphology - Enhancing Bilingual Lexicon Induction with Lemmatisation
| Authors | |
|---|---|
| Year of publication | 2025 |
| Type | Article in Proceedings |
| Conference | Proceedings of the Nineteenth Workshop on Recent Advances in Slavonic Natural Language Processing |
| MU Faculty or unit | |
| Citation | |
| web | |
| Keywords | Bilingual lexicon induction; Morphology; Lemmatisation; Evaluation |
| Description | Bilingual Lexicon Induction (BLI) is a fundamental task in cross-lingual word embedding (CWE) evaluation, aimed at retrieving word translations from monolingual corpora in two languages. However, morphological complexity poses a considerable challenge: translations deemed incorrect are often morphological variants of the correct ones. This study explores the role of lemmatisation in mitigating this issue by comparing two integration strategies: (1) pre-alignment lemmatisation, applied before training monolingual word embeddings (MWEs), and (2) post-retrieval lemmatisation, applied to retrieved target words. We conduct experiments using three state-of-the-art CWEs across a wide range of language pairs, covering Slavonic and other language families with varying morphological complexity. Our findings reveal notable differences between the two approaches: post-retrieval lemmatisation proves more beneficial for less morphologically complex language pairs, while pre-alignment lemmatisation performs well for those with moderate complexity. For highly inflected languages, the choice of approach has minimal impact. |
| Related projects: |
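
The post-retrieval strategy described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and not the authors' implementation: the function name `precision_at_1`, the use of precision at rank 1 as the metric, the toy English-Czech data, and the dictionary-based lemmatiser (a real experiment would presumably use a proper morphological tool).

```python
from typing import Callable, Dict, List, Optional, Set


def precision_at_1(
    retrieved: Dict[str, List[str]],
    gold: Dict[str, Set[str]],
    lemmatise: Optional[Callable[[str], str]] = None,
) -> float:
    """Share of source words whose top retrieved target is a gold translation.

    If `lemmatise` is given, it is applied to the retrieved target word
    before comparison (hypothetical post-retrieval lemmatisation step).
    """
    if not retrieved:
        return 0.0
    hits = 0
    for source, candidates in retrieved.items():
        if not candidates:
            continue  # nothing retrieved counts as a miss
        top = candidates[0]
        if lemmatise is not None:
            top = lemmatise(top)
        if top in gold.get(source, set()):
            hits += 1
    return hits / len(retrieved)


# Toy lemma table (assumption): maps inflected Czech forms to lemmas.
TOY_LEMMAS = {"psa": "pes", "ženy": "žena"}


def toy_lemmatise(word: str) -> str:
    # Unknown words are returned unchanged.
    return TOY_LEMMAS.get(word, word)


# Toy retrieval output: two of three top candidates are inflected forms.
retrieved = {"dog": ["psa", "pes"], "house": ["dům"], "woman": ["ženy"]}
gold = {"dog": {"pes"}, "house": {"dům"}, "woman": {"žena"}}

plain_score = precision_at_1(retrieved, gold)
lemmatised_score = precision_at_1(retrieved, gold, lemmatise=toy_lemmatise)
print(f"P@1 without lemmatisation: {plain_score:.2f}")
print(f"P@1 with post-retrieval lemmatisation: {lemmatised_score:.2f}")
```

In this toy setting, the inflected forms `psa` and `ženy` count as errors under exact matching but as correct once mapped to their lemmas, which is the kind of morphological mismatch the abstract describes. Pre-alignment lemmatisation would instead lemmatise the corpora before the embeddings are trained, which this sketch does not cover.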