Intrinsic Methods for Comparison of Corpora

This publication doesn't include Faculty of Arts. It includes Faculty of Informatics. Official publication website can be found on muni.cz.

Authors

BAISA Vít SUCHOMEL Vít

Year of publication 2013
Type Article in Proceedings
Conference RASLAN 2013 Recent Advances in Slavonic Natural Language Processing
MU Faculty or unit

Faculty of Informatics

Citation
Field Informatics
Keywords text corpus; corpora comparison
Description Since there are only very few techniques for quantitative and systematic comparison of text corpora we proposed and implemented several novel methods. The procedures were applied to comparing two very large web based Czech text corpora: czTenTen12 and Hector with more than 4.47 and 2.65 billion words, respectively. All methods are fully automatic and some of them are even language independent. We released some of them so they can be used instantly for comparison of other corpora.
Related projects: