Discriminating Between Similar Languages Using Large Web Corpora

Authors

SUCHOMEL Vít

Year of publication 2019
Type Article in Proceedings
Conference Proceedings of the Thirteenth Workshop on Recent Advances in Slavonic Natural Language Processing, RASLAN 2019
MU Faculty or unit

Faculty of Informatics

Keywords language identification; discriminating similar languages; building web corpora
Description This paper presents a method for discriminating similar languages based on wordlists from large web corpora. The main benefits of the approach are language independence, a measure of confidence of the classification, and an easy-to-maintain implementation. The method is evaluated on the VarDial 2014 workshop data set. The resulting accuracy is comparable to that of other methods that performed successfully at the workshop. A tool implementing the method in Python can be obtained from the web site http://corpus.tools/.
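
The general idea of wordlist-based discrimination with a confidence measure can be illustrated with a minimal sketch. This is not the tool distributed at corpus.tools and not necessarily the scoring used in the paper: the function names, the log-frequency scoring, the floor value for unseen words, and the confidence formula are all assumptions chosen for illustration, and the tiny Czech and Slovak "corpora" are toy data standing in for large web corpora.

```python
import math
from collections import Counter


def build_wordlist(corpus_text):
    """Map each word of a corpus to its relative frequency (hypothetical helper)."""
    counts = Counter(corpus_text.lower().split())
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


def score(tokens, wordlist, floor=1e-6):
    """Sum of log relative frequencies; unseen words get an assumed floor value."""
    return sum(math.log(wordlist.get(t, floor)) for t in tokens)


def classify(text, wordlists):
    """Return (best_language, confidence) for a text.

    Confidence here is an assumed measure: the per-token score gap between
    the best and the second-best language. Needs at least two wordlists.
    """
    tokens = text.lower().split()
    scores = {lang: score(tokens, wl) for lang, wl in wordlists.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    best, runner_up = ranked[0], ranked[1]
    confidence = (scores[best] - scores[runner_up]) / max(len(tokens), 1)
    return best, confidence


# Toy data only: real wordlists would come from large web corpora.
WORDLISTS = {
    "cs": build_wordlist("já jsem byl v praze a co jsi dělal ty"),
    "sk": build_wordlist("ja som bol v prahe a čo si robil ty"),
}

if __name__ == "__main__":
    print(classify("co jsi dělal", WORDLISTS))
```

Because the classifier only needs one wordlist per language, adding a language in this sketch means adding one more entry to `WORDLISTS`. A small gap between the top two scores signals a text that the wordlists cannot separate reliably.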