Document Engineering for a Digital Library: PDF recompression using JBIG2 and other optimization of PDF documents

Warning

This publication doesn't include Faculty of Arts. It includes Faculty of Informatics. Official publication website can be found on muni.cz.
Authors

SOJKA Petr HATLAPATKA Radim

Year of publication 2010
Type Article in Proceedings
Conference Proceedings of MEMICS 2010 conference
MU Faculty or unit

Faculty of Informatics

Citation
Field Informatics
Keywords Authoring tools and systems; Categorization; Classification; Document presentation; Representations/Standards; Character recognition; Digital mathematical library; Digitisation workflow
Description Several innovative document transformations and tools developed in the process of building the Digital Mathematical Library DML-CZ http://dml.cz are described. The main result is our new PDF re-compression tool, developed using a enhanced jbig2enc library. Together with pdfsizeopt.py by Péter Szabó, we have managed to decrease PDF storage size and transmission needs by 62%: using both programs we reduced the size of the original already compressed PDFs to 38%. We briefly describe workflow and tools developed for creating the digital library. The batch digital signature stamper, the document similarity metrics which uses four different methods, a [meta]data validation process and math OCR tools represent some of the main [by]products. Such document engineering, together with Google Scholar indexing optimization, have led to the success of serving digitized and born-digital scientific math documents to the public in DML-CZ, and are being employed also in The European Digital Mathematics Library, EuDML.
Related projects:

You are running an old browser version. We recommend updating your browser to its latest version.