Utilizing Linguistic Resources: Theory and Practical Experience

Authors

NĚMČÍK Václav

Year of publication 2010
Type Article in Proceedings
Conference Proceedings of Recent Advances in Slavonic Natural Language Processing 2010
MU Faculty or unit

Faculty of Informatics

Web https://nlp.fi.muni.cz/raslan/2010/paper04.pdf
Field Informatics
Keywords linguistic resources; corpora; theory; practice
Description The Prague Dependency Treebank (henceforth PDT) is a large collection of Czech texts. It contains several layers of rich annotation, ranging from morphology to deep syntax. It is unique in its size and theoretical background, especially for Czech, which can be considered a small language in terms of its number of speakers. In this article, we use PDT 2.0 to demonstrate that, within real NLP systems, complex annotations may cut both ways. We present several issues that might pose problems when extracting data from PDT, and from complex structures in general, and hint at possible solutions.