Project information
Pattern Recognition-based Statistically Enhanced MT (PRESEMT)

Information

This project does not involve the Faculty of Arts; it is run at the Faculty of Informatics. The official project website can be found on muni.cz.

Project Identification

248307

Project Period

1/2010 - 12/2012

Investor / Programme / Project type

European Union

7th Specific RTD Programme
Cooperation

MU Faculty or unit

Faculty of Informatics

Cooperating Organization

Institute for Language and Speech Processing

Responsible person: George Tambouratzis

Gesellschaft zur Förderung angewandter Informatik
Norwegian University of Science and Technology
National Technical University of Athens
Lexical Computing Ltd.

This proposal describes PRESEMT, a flexible and adaptable MT system based on a language-independent method whose principles ensure easy portability to new language pairs. The method attempts to overcome well-known problems of other MT approaches, e.g. the compilation of bilingual corpora or the creation of new rules for each language pair. PRESEMT will address the issue of effectively managing multilingual content and is expected to suggest a language-independent, machine-learning-based methodology. The key aspects of PRESEMT involve syntactic phrase-based modelling, pattern recognition approaches (such as extended clustering or neural networks) or game theory techniques towards the development of a language-independent analysis, and evolutionary algorithms for system optimisation. The system is intended to be of a hybrid nature, combining linguistic processing with the positive aspects of corpus-based approaches such as SMT and EBMT.

Publications

Total number of publications: 14

2012

Building a 70 billion word corpus of English from ClueWeb

POMIKÁLEK Jan RYCHLÝ Pavel JAKUBÍČEK Miloš

Paper in proceedings

Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), year: 2012

Detecting Spam in Web Corpora

BAISA Vít SUCHOMEL Vít

Paper in proceedings

6th Workshop on Recent Advances in Slavonic Natural Language Processing, year: 2012

Finding Multiwords of More Than Two Words

KILGARRIFF Adam RYCHLÝ Pavel KOVÁŘ Vojtěch BAISA Vít

Paper in proceedings

Proceedings of the 15th EURALEX International Congress, year: 2012

Linguistic Logical Analysis of Direct Speech

HORÁK Aleš JAKUBÍČEK Miloš KOVÁŘ Vojtěch

Paper in proceedings

RASLAN 2012 Recent Advances in Slavonic Natural Language Processing, year: 2012

2011

Analyzing Time-Related Clauses in Transparent Intensional Logic

HORÁK Aleš JAKUBÍČEK Miloš KOVÁŘ Vojtěch

Paper in proceedings

Proceedings of Recent Advances in Slavonic Natural Language Processing 2011, year: 2011

Corpus-based Disambiguation for Machine Translation

BAISA Vít

Paper in proceedings

Recent Advances in Slavonic Natural Language Processing, year: 2011

Effective Parsing Using Competing CFG Rules

JAKUBÍČEK Miloš

Paper in proceedings

Proceedings of Text, Speech and Dialogue 2011, year: 2011

chared: Character Encoding Detection with a Known Language

POMIKÁLEK Jan SUCHOMEL Vít

Paper in proceedings

RASLAN 2011, year: 2011

Japanese Word Sketches: Advances and Problems

SRDANOVIĆ Irena IDA Naomi SHIGEMORI BUČAR Chikako KILGARRIFF Adam KOVÁŘ Vojtěch

Peer-reviewed scientific article

Acta Linguistica Asiatica, year: 2011, volume: 1/2011, edition: 2

Practical Web Crawling for Text Corpora

SUCHOMEL Vít POMIKÁLEK Jan

Paper in proceedings

Proceedings of Recent Advances in Slavonic Natural Language Processing, RASLAN 2011, year: 2011