Design and compilation of a parallel corpus of Czech-English legal texts



Year of publication 2014
Type Appeared in Conference without Proceedings
MU Faculty or unit Faculty of Arts

Description The aim of this paper is to describe the design and compilation of a corpus of English legal texts translated from Czech by Czech translators, i.e. non-native speakers of English. First, the need for a systematic study of this type of text is explained through a short description of the specifics of the Czech translation market. Second, the types of text collected for the corpus are defined. A brief survey among translation agencies and courts that outsource translations to freelance translators showed that the largest volumes of legal translation fall within the following genres: court decisions and contracts. The paper then discusses the feasibility of obtaining a sufficient amount of corpus material for each genre. Third, the paper describes the compilation of a corpus of translations of 180 decisions of the Czech Constitutional Court published and translated between 1992 and 2013. These decisions were translated into English to make the Court's opinions and reasoning available to an international audience. The corpus contains approximately 2 million tokens in each of its sub-corpora. The corpus was compiled with a tool developed at the Faculty of Arts of Masaryk University in Brno, Czech Republic. The tool is highly customizable, which was necessary for dealing with this type of text. The translations did not always render the source exactly: some stretches, typically details of little interest to a foreign reader, were omitted. The texts were first aligned at the paragraph level, which gave a more global view of each text and made it possible to identify omitted or jumbled passages. Subsequently, the two sub-corpora were aligned at the sentence level by an automatic aligner, followed by manual correction. Since the segmentation rules had originally been developed for general texts, they had to be adjusted to take into account the peculiarities of legal texts.