Constructing Datasets from Dialogue Data

Authors

SOTOLÁŘ Ondřej, PLHÁK Jaromír, TKACZYK Michal, LEBEDÍKOVÁ Michaela, ŠMAHEL David

Year of publication 2022
Type Article in Proceedings
Conference Proceedings of the 16th Workshop on Recent Advances in Slavonic Natural Language Processing, RASLAN 2022
MU Faculty or unit

Faculty of Informatics

Keywords Dialogue Dataset; Dataset Split; Online Conversations
Description We present methods for transforming raw dialogue data into a dataset suitable for processing with statistical NLP models. We identify potential pitfalls in processing this type of data, such as ensuring the representativeness of the sample, preserving the generalization ability of models, and defining the local context of utterances. We use novel methods to address these problems and demonstrate their effectiveness on an utterance classification problem. As a result, this paper provides guidelines for generating valuable datasets from dialogue data.