Detecting Co-Derivative Documents in Large Text Collections

Warning

This publication doesn't include Faculty of Arts. It includes Faculty of Informatics. Official publication website can be found on muni.cz.
Authors

POMIKÁLEK Jan RYCHLÝ Pavel

Year of publication 2008
Type Article in Proceedings
Conference Proceedings of the Sixth International Language Resources and Evaluation (LREC'08)
MU Faculty or unit

Faculty of Informatics

Citation
Web http://www.lrec-conf.org/lrec2008/
Field Informatics
Keywords Detecting; Large Text Collections
Description We have analyzed the SPEX algorithm by Bernstein and Zobel (2004) for detecting co-derivative documents using duplicate n-grams. Although we totally agree with the claim that not using unique n-grams can greatly increase the efficiency and scalability of the process of detecting co-derivative documents, we have found serious bottlenecks in the way SPEX finds the duplicate n-grams. While the memory requirements for computing co-derivative documents can be reduced to up to 1% by only using duplicate n-grams, SPEX needs about 40 times more memory for computing the list of duplicate n-grams itself. Therefore the memory requirements of the whole process are not reduced enough to make the algorithm practical for very large collections. We propose a solution for this problem using an external sort with the suffix array in-memory sorting and temporary file compression. The proposed algorithm for computing duplicate n-grams uses a fixed amount of memory for any input size.
Related projects:

You are running an old browser version. We recommend updating your browser to its latest version.