Detecting Co-Derivative Documents in Large Text Collections

Pomikálek,  Jan; Rychlý,  Pavel

Detecting Co-Derivative Documents in Large Text Collections

Warning

This publication doesn't include Faculty of Arts. It includes Faculty of Informatics. Official publication website can be found on muni.cz.

Authors	POMIKÁLEK Jan RYCHLÝ Pavel
Year of publication	2008
Type	Article in Proceedings
Conference	Proceedings of the Sixth International Language Resources and Evaluation (LREC'08)
MU Faculty or unit	Faculty of Informatics
Citation
web	http://www.lrec-conf.org/lrec2008/
Field	Informatics
Keywords	Detecting; Large Text Collections
Description	We have analyzed the SPEX algorithm by Bernstein and Zobel (2004) for detecting co-derivative documents using duplicate n-grams. Although we totally agree with the claim that not using unique n-grams can greatly increase the efficiency and scalability of the process of detecting co-derivative documents, we have found serious bottlenecks in the way SPEX finds the duplicate n-grams. While the memory requirements for computing co-derivative documents can be reduced to up to 1% by only using duplicate n-grams, SPEX needs about 40 times more memory for computing the list of duplicate n-grams itself. Therefore the memory requirements of the whole process are not reduced enough to make the algorithm practical for very large collections. We propose a solution for this problem using an external sort with the suffix array in-memory sorting and temporary file compression. The proposed algorithm for computing duplicate n-grams uses a fixed amount of memory for any input size.
Related projects:	Intelligent Models, Algorithms, Methods and Tools for the Semantic Web (realization) Centrum komputační lingvistiky Prostředky tvorby komplexní báze znalostí pro komunikaci se sémantickým webem v přirozeném jazyce