Discriminating Between Similar Languages Using Large Web Corpora

Authors	SUCHOMEL Vít
Year of publication	2019
Type	Article in Proceedings
Conference	Proceedings of the Thirteenth Workshop on Recent Advances in Slavonic Natural Languages Processing, RASLAN 2019
MU Faculty or unit	Faculty of Informatics
web	https://nlp.fi.muni.cz/raslan/2019/paper12-suchomel.pdf
Keywords	language identification; discriminating similar languages; building web corpora
Description	This paper presents a method for discriminating similar languages based on wordlists from large web corpora. The main benefits of the approach are language independence, a measure of confidence of the classification, and an easy-to-maintain implementation. The method is evaluated on the VarDial 2014 workshop data set. The resulting accuracy is comparable to that of other methods that performed successfully at the workshop. A tool implementing the method in Python can be obtained from the web site http://corpus.tools/.
Related projects:	LINDAT-Clarin project - Building and operating the Czech node of the pan-European research infrastructure
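
The description mentions classification based on per-language wordlists together with a confidence measure. The following is a minimal sketch of that general idea, not the paper's actual algorithm or the tool's API. The function names (`build_wordlist`, `score`, `classify`), the log relative-frequency scoring, the `floor` value for unknown words, and the use of the score margin as a confidence value are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def build_wordlist(corpus_text):
    """Map each word of a corpus to its relative frequency (hypothetical helper)."""
    tokens = re.findall(r"\w+", corpus_text.lower())
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def score(text, wordlist, floor=1e-6):
    """Sum the log relative frequencies of the text's words.

    Words missing from the wordlist get the small `floor` frequency
    (an assumed smoothing choice, not taken from the paper).
    """
    tokens = re.findall(r"\w+", text.lower())
    return sum(math.log(wordlist.get(token, floor)) for token in tokens)


def classify(text, wordlists):
    """Return (best_language, margin) for a text.

    The margin is the score difference between the best and the second
    best language; it serves here as an assumed stand-in for a
    confidence measure (0 means a tie).
    """
    if len(wordlists) < 2:
        raise ValueError("need wordlists for at least two languages")
    scores = {lang: score(text, wl) for lang, wl in wordlists.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    best, runner_up = ranked[0], ranked[1]
    return best, scores[best] - scores[runner_up]


# Toy usage with two tiny made-up "corpora"; real wordlists would come
# from large web corpora.
wordlists = {
    "cs": build_wordlist("to je velmi dobrý den a já jsem rád že jsi tady"),
    "sk": build_wordlist("to je veľmi dobrý deň a ja som rád že si tu"),
}
language, margin = classify("ja som tu", wordlists)
```

With these toy wordlists the sample text is assigned to "sk", and a larger margin indicates a clearer separation from the runner-up language. A small margin could be used to flag texts that the wordlists cannot reliably tell apart.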