Constructing Datasets from Dialogue Data

Warning

This publication does not belong to the Faculty of Arts but to the Faculty of Informatics. The official publication page is on the muni.cz website.

Title in Czech	Sestavování datových souborů z dialogových dat
Authors	SOTOLÁŘ Ondřej, PLHÁK Jaromír, TKACZYK Michal, LEBEDÍKOVÁ Michaela, ŠMAHEL David
Year of publication	2022
Type	Article in proceedings
Conference	Proceedings of the 16th Workshop on Recent Advances in Slavonic Natural Languages Processing, RASLAN 2022
MU Faculty or unit	Faculty of Informatics
Web	Full Text PDF; Workshop homepage
Keywords	Dialogue Dataset; Dataset Split; Online Conversations
Attached files	RASLAN_2022_sotolar.pdf
Description	We present methods for transforming raw dialogue data into a dataset suitable for processing with statistical NLP models. We reveal potential pitfalls in processing this type of data, which concern the representativeness of the sample, the generalization ability of models, and the definition of the local context of utterances. We use novel methods to solve these problems and demonstrate their effectiveness on an utterance classification problem. As a result, this paper provides guidelines for generating valuable datasets from dialogue data.
Related projects:	Modelling the future: Understanding the impact of technology on adolescent’s well-being