Structured Information Extraction from Pharmaceutical Records

Authors	BAMBUROVÁ Michaela NEVĚŘILOVÁ Zuzana
Year of publication	2019
Type	Article in Proceedings
Conference	Proceedings of the Thirteenth Workshop on Recent Advances in Slavonic Natural Language Processing, RASLAN 2019
MU Faculty or unit	Faculty of Informatics
Citation
web	https://nlp.fi.muni.cz/raslan/2019/paper09-bamburova.pdf
Keywords	structured information extraction; table understanding; entity recognition
Description	The paper presents an iterative approach to understanding semi-structured or unstructured tabular data with pharmaceutical records. The task is to split records with entities such as drug name, dosage strength, dosage form, and package size into the appropriate columns. The data is provided by many suppliers, and so it is very diverse in terms of structure. Some of the records are easy to parse using regular expressions; others are difficult and need advanced methods. We used regular expressions for the easy-to-parse data and conditional random fields for the more complex records. We iteratively extend the training data set using the above methods together with manual corrections. Currently, the F1 score for correct classification into 5 classes is 95%.
Related projects:	LINDAT-Clarin project - Establishment and operation of the Czech node of the pan-European research infrastructure LINDAT/CLARIN - Research Infrastructure for Language Technologies
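
The regular-expression step mentioned in the description can be illustrated with a minimal sketch. It is not the authors' implementation: the pattern, the unit list, the function name parse_record, and the sample records are all assumptions made for illustration, and only four of the entity types named in the description are covered. The sketch splits one hypothetical "easy" record layout (name, strength, form, package size) into columns and returns None for anything else.

```python
import re
from typing import Optional

# Hypothetical pattern for ONE assumed "easy" layout:
#   <drug name> <dosage strength> <dosage form> <package size>
# The unit list and field order are illustrative assumptions,
# not taken from the paper.
RECORD_RE = re.compile(
    r"""^\s*
    (?P<name>.+?)\s+                                        # drug name (lazy)
    (?P<strength>\d+(?:[.,]\d+)?\s*(?:mg|g|mcg|ml|iu|%))\s+ # e.g. 400 mg, 0,5 g
    (?P<form>[a-z][a-z .]*?)\s+                             # e.g. tablets, caps.
    (?P<package>\d+\s*(?:x\s*\d+)?)\s*$                     # e.g. 30, 2x10
    """,
    re.IGNORECASE | re.VERBOSE,
)


def parse_record(record: str) -> Optional[dict]:
    """Split a record into columns, or return None if the layout does not match."""
    match = RECORD_RE.match(record)
    if match is None:
        return None
    return {field: value.strip() for field, value in match.groupdict().items()}


if __name__ == "__main__":
    for line in ["Ibuprofen 400 mg tablets 30", "ibuprofen tablets"]:
        print(line, "->", parse_record(line))
```

In a pipeline of the kind the description outlines, records for which such a function returns None would be the candidates passed to a statistical sequence labeller such as conditional random fields.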