SemEval-2015 Task 15: A CPA dictionary-entry-building task

Baisa, Vít; Bradbury, Jane; Cinková, Silvie; El Maarouf, Ismail; Kilgarriff, Adam; Popescu, Octavian

Authors	BAISA Vít BRADBURY Jane CINKOVÁ Silvie EL MAAROUF Ismail KILGARRIFF Adam POPESCU Octavian
Year of publication	2015
Type	Article in Proceedings
Conference	Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015)
MU Faculty or unit	Faculty of Informatics
Citation
web	http://www.aclweb.org/anthology/S15-2053
Field	Informatics
Keywords	semeval; corpus pattern analysis; concordance clustering; semantic evaluation
Description	This paper describes the first SemEval task to explore the use of Natural Language Processing systems for building dictionary entries within the framework of Corpus Pattern Analysis (CPA). CPA is a corpus-driven technique that provides tools and resources to identify and represent unambiguously the main semantic patterns in which words are used. Task 15 draws on the Pattern Dictionary of English Verbs (www.pdev.org.uk) for the targeted lexical entries and on the British National Corpus for the input text. Dictionary entry building is split into three subtasks, all of which start from the same concordance sample: 1) CPA parsing, in which arguments and their syntactic and semantic categories have to be identified; 2) CPA clustering, in which sentences with similar patterns have to be clustered; and 3) CPA automatic lexicography, in which the structure of patterns has to be constructed automatically. Subtask 1 attracted 3 teams, though none could beat the baseline (a rule-based system). Subtask 2 attracted 2 teams, one of which beat the baseline (a majority-class classifier). Subtask 3 did not attract any participants. The task has produced a major semantic multi-dataset resource, which includes data for 121 verbs and about 17,000 annotated sentences and is freely accessible.
Related projects:	LINDAT-Clarin Project - Establishment and operation of the Czech node of the pan-European infrastructure for research; Harvesting big text data for under-resourced languages