Doctoral defense (Disputation) of Mirela-Stefania Duma on 1 December 2021, 2:00 p.m.
1 December 2021, by Reinhard Zierke
Invitation to the doctoral defense, open to members of the university, as part of the doctoral examination procedure of
Ms. Mirela-Stefania Duma
Title of the dissertation:
Data Selection for Statistical Machine Translations
Abstract:
Machine Translation (MT) is a current topic in the Computational Linguistics (CL) community. An MT model trained on one domain and applied to another does not reach the expected performance because of the syntactic and semantic differences between the two domains. Domain adaptation is therefore necessary. Data selection, the topic of this thesis, is a corpus-driven domain adaptation method. Given a general-domain corpus and an in-domain corpus, each sentence from the general-domain corpus is scored according to its similarity to the in-domain corpus. The sentences most similar to the in-domain corpus are selected as pseudo in-domain data and later used to train domain-focused MT systems.
Two challenges arise with data selection: which method to use to determine the general-domain sentences most similar to a given in-domain corpus, and how many general-domain sentences to select as pseudo in-domain data. This thesis presents data selection methods that address both challenges. I developed several scoring methods and compared them with a method I developed that automatically determines the ratio of sentences to select.
Data selection is crucial for MT systems that aim to translate domain-specific texts. The data selection SMT models presented in this thesis were trained faster than models trained on the full general-domain data, were smaller, and performed on a par with or better than the models trained on the full training data.
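The selection procedure outlined in the abstract (score every general-domain sentence against the in-domain corpus, then keep the top-scoring share) can be illustrated with a minimal sketch. It is not the method from the thesis: the bag-of-words cosine similarity, the function names, and the fixed `ratio` parameter are assumptions chosen only for illustration, and the example sentences are made up.

```python
import math
from collections import Counter


def tokenize(sentence):
    # Naive whitespace tokenizer (assumption; a real system would use a proper tokenizer).
    return sentence.lower().split()


def cosine(a, b):
    # Cosine similarity between two token-count vectors stored as Counters.
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_pseudo_in_domain(general, in_domain, ratio):
    # Hypothetical helper: score each general-domain sentence against the
    # aggregated in-domain token counts and keep the top `ratio` share.
    profile = Counter(tok for s in in_domain for tok in tokenize(s))
    scored = [(cosine(Counter(tokenize(s)), profile), s) for s in general]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    keep = math.ceil(ratio * len(general))
    return [sentence for _, sentence in scored[:keep]]


in_domain = [
    "the patient received a dose of the drug",
    "the drug reduced the fever",
]
general = [
    "the parliament adopted the resolution",
    "the patient took the drug",
    "stock prices fell sharply",
    "a dose of the drug reduced the fever",
]
selected = select_pseudo_in_domain(general, in_domain, ratio=0.5)
print(selected)
```

Here `ratio` is set by hand. The abstract states that the thesis also presents a method that determines this ratio automatically, which this sketch does not attempt to reproduce.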
Date and time: Wednesday, 1 December 2021, at 2:00 p.m.
Location: by video conference on Zoom
Supervisors: Prof. Dr.-Ing. Wolfgang Menzel, Prof. Dr. Walther von Hahn, Dr. Cristina Vertan
Prof. Dr. Matthias Rarey
Chair of the Informatics Doctoral Committee (Fachpromotionsausschuss Informatik)