- LORELEI: Transfer Learning Across Language Similarity Networks
State-of-the-art human language technology is currently available for only a few of the approximately 6,000 languages in the world. The LORELEI project focuses on developing language processing technology for low-resource languages, i.e. languages for which large amounts of training data are difficult to obtain. Our main objective is to utilize information about language relatedness that can be expressed as graphs, and to leverage graph-based transfer learning algorithms to infer models for low-resource languages. Transfer learning will be applied at various linguistic levels, including the acoustic/phonological, morphological, and lexical levels. The project will create a number of resources, including bilingual lexicons, translation models, and networks of language relationships covering all major language families in the world.
- Submodularity for Speech and Language Applications
This project explores the use of submodularity for optimizing applications in speech and language processing, in particular machine translation. Submodular function optimization is used to select the best possible training sets for different types of translation models, tuning sets, language model data, and sparse feature sets.
- BOLT (Broad Operational Language Translation)
The Broad Operational Language Translation (BOLT) project is developing two-way speech-based translation systems for English and Iraqi Arabic. Within this context we are looking at improving machine translation from English into Iraqi Arabic with the specific goal of correctly predicting morphological forms.
- TransPhorm: Improving Access to Multi-Lingual Health Information through Machine Translation
The TransPhorm project is aimed at facilitating the production of multilingual health and safety information materials for individuals with limited English proficiency. To this end we have explored translation workflow processes in public health departments in Washington State, and we have created a human-computer collaborative translation management system involving machine translation. Our initial results show that machine translation combined with human post-editing can produce multilingual materials that are as accurate as human-only translations while greatly reducing the turn-around time and cost of translation. Other efforts within this project have focused on developing quantitative methods for studying user preferences in machine translation, and on domain adaptation techniques for translation models.
- A Submodularity Framework For Data Subset Selection
The amount of training data available to speech and language processing systems is currently larger than ever, and it is increasing at an exponential rate, requiring dedicated software and hardware infrastructure. However, much of this data is noisy, irrelevant, and/or redundant with existing data. We are currently exploring data subset selection techniques based on submodular function optimization to select data that is relevant and non-redundant in an application-specific way. Unlike most previous techniques, submodular data selection can be shown to have theoretical performance guarantees.
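As a rough illustration of the idea, the sketch below greedily maximizes a facility-location function, one common monotone submodular objective, over a toy similarity matrix. The objective, the Gaussian similarity, the function names (`facility_location_gain`, `greedy_select`), and the toy data are all assumptions made for this example and are not taken from the project itself. For monotone submodular objectives under a cardinality budget, the greedy procedure is known to reach at least a (1 - 1/e) fraction of the optimal value, which is the kind of guarantee referred to above.

```python
import numpy as np

def facility_location_gain(sim, covered, j):
    """Marginal gain of adding item j to the current selection.

    covered[i] holds the best similarity between item i and any item
    selected so far (hypothetical helper for this sketch).
    """
    return np.maximum(sim[:, j], covered).sum() - covered.sum()

def greedy_select(sim, budget):
    """Greedily pick up to `budget` items maximizing facility location."""
    n = sim.shape[0]
    covered = np.zeros(n)
    selected = []
    for _ in range(min(budget, n)):
        candidates = [j for j in range(n) if j not in selected]
        gains = [facility_location_gain(sim, covered, j) for j in candidates]
        best = candidates[int(np.argmax(gains))]
        selected.append(best)
        # Update how well every item is represented by the selection.
        covered = np.maximum(covered, sim[:, best])
    return selected

# Toy data: two clusters of near-duplicate points (illustrative only).
points = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0], [5.1, 5.0]])
dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
sim = np.exp(-dists ** 2)  # Gaussian similarity, an assumption of this sketch

chosen = greedy_select(sim, budget=2)
print(chosen)
```

With a budget of two, the selection takes one representative from each cluster rather than two near-duplicates from the larger one, which is the non-redundancy property described above.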
- Graph-Based Learning for Speech Processing
Current methods for acoustic modeling in automatic speech processing focus on training complex classifiers from a training set, followed by adaptation to the test data. Semi-supervised learning methods, on the other hand, take unlabeled data into account during the initial training process. Thus, the classifier is guided to take into account the underlying global properties of the test data. In this project we investigate semi-supervised graph-based learning algorithms and their application to acoustic modeling in speech processing. Our primary interest is in developing high-performing, scalable classification methods that are suited to the stochastic nature of speech signals and are applicable to large data sets.
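To make the graph-based setting concrete, here is a minimal sketch of iterative label propagation with clamped labels, a standard graph-based semi-supervised method. It is not necessarily the algorithm used in the project. The function name `label_propagation`, the six-node toy graph, and the edge weights are assumptions for illustration; in an acoustic modeling setting the nodes would plausibly be speech frames and the edge weights a similarity between their feature vectors.

```python
import numpy as np

def label_propagation(W, labels, n_classes, n_iter=100):
    """Propagate labels over a weighted graph.

    W        : (n, n) symmetric non-negative affinity matrix
    labels   : length-n integer array, -1 marks an unlabeled node
    n_classes: number of classes
    """
    n = W.shape[0]
    labeled = labels >= 0
    # One-hot targets for labeled nodes, zeros elsewhere.
    Y = np.zeros((n, n_classes))
    Y[labeled, labels[labeled]] = 1.0
    # Row-normalize affinities into transition probabilities.
    P = W / W.sum(axis=1, keepdims=True)
    F = Y.copy()
    for _ in range(n_iter):
        F = P @ F                # each node averages its neighbors' scores
        F[labeled] = Y[labeled]  # clamp the known labels
    return F.argmax(axis=1)

# Toy graph: two tightly connected chains joined by one weak edge (2-3).
W = np.zeros((6, 6))
for a, b, w in [(0, 1, 1.0), (1, 2, 1.0), (2, 3, 0.1), (3, 4, 1.0), (4, 5, 1.0)]:
    W[a, b] = W[b, a] = w

labels = np.array([0, -1, -1, -1, -1, 1])  # only the two end nodes are labeled
predicted = label_propagation(W, labels, n_classes=2)
print(predicted)
```

Only two nodes carry labels, yet the unlabeled nodes inherit the label of the chain they are strongly connected to, which shows how the structure of the unlabeled data shapes the decision.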