Words’ Burstiness in Language Models


Authors RYCHLÝ Pavel

Year of publication 2011
Type Article in Proceedings
Conference Proceedings of Recent Advances in Slavonic Natural Language Processing, RASLAN 2011
MU Faculty or unit Faculty of Informatics

Field Linguistics
Keywords Burstiness; Language models; Words' probability
Description Good estimation of the probability of a single word is a crucial part of language modelling. The estimate is based on the raw frequency of the word in a training corpus. Such a computation gives a good estimate for function words and most very frequent words, but a poor one for most content words, because words tend to occur in clusters. This paper analyses words' burstiness and proposes a new unigram language model that handles bursty words much better. Evaluation of the model on two data sets shows consistently lower perplexity and cross-entropy for the new model.
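
For orientation, the frequency-based baseline and the two evaluation measures named in the description can be sketched as follows. This is a minimal illustration, not the model proposed in the paper: the function names, the add-one smoothing, and the single reserved slot for unseen words are all assumptions made here so that the example runs on toy data.

```python
import math
from collections import Counter


def train_unigram(tokens):
    """Count raw word frequencies in a training corpus."""
    counts = Counter(tokens)
    total = len(tokens)
    # Assumption: reserve one extra vocabulary slot for all unseen words.
    vocab_size = len(counts) + 1
    return counts, total, vocab_size


def unigram_prob(word, counts, total, vocab_size):
    """Frequency-based estimate with add-one smoothing (an assumption,
    used only so that unseen words receive a non-zero probability)."""
    return (counts[word] + 1) / (total + vocab_size)


def cross_entropy(test_tokens, counts, total, vocab_size):
    """Average negative log2 probability per token, in bits."""
    log_sum = sum(
        math.log2(unigram_prob(w, counts, total, vocab_size))
        for w in test_tokens
    )
    return -log_sum / len(test_tokens)


def perplexity(entropy_bits):
    """Perplexity is 2 raised to the cross-entropy in bits."""
    return 2 ** entropy_bits


# Toy usage: train on six tokens, evaluate on a short test sequence.
counts, total, vocab_size = train_unigram("the cat sat on the mat".split())
h = cross_entropy("the cat ran".split(), counts, total, vocab_size)
print(f"cross-entropy: {h:.3f} bits, perplexity: {perplexity(h):.3f}")
```

A model of this kind assigns each word the same probability at every position, regardless of whether the word has already appeared nearby. That is the property the description identifies as a poor fit for content words that occur in clusters.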