A Submodularity Framework for Data Subset Selection


Present-day applications in automatic speech and language processing can draw on unprecedented amounts of training data. Often, however, the data is noisy and inherently redundant. As a result, the performance of large-scale systems improves only sublinearly with the amount of additional data, a phenomenon known as diminishing returns. This project explores the framework of submodular function optimization for selecting subsets of large data sets that retain the most relevant and non-redundant information. The novel algorithms are applied to data subset selection in machine translation and automatic speech recognition.
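
As a rough illustration of the idea, the sketch below greedily selects a subset under a size budget by maximizing a facility-location function, a standard example of a monotone submodular objective. It is not the project's algorithm: the objective, the function names (facility_location, greedy_select), and the toy similarity matrix are all assumptions made for this example. For monotone submodular objectives under a cardinality constraint, greedy selection of this kind is known to reach at least a (1 - 1/e) fraction of the optimal value.

```python
# Illustrative sketch only: the objective and all names here are assumptions,
# not the project's actual method.

def facility_location(sim, subset):
    """f(S) = sum over all items i of the highest similarity between i and any member of S."""
    if not subset:
        return 0.0
    return sum(max(row[j] for j in subset) for row in sim)


def greedy_select(sim, budget):
    """Pick up to `budget` items, each time adding the item with the largest marginal gain."""
    selected = []
    remaining = set(range(len(sim)))
    current = 0.0
    while remaining and len(selected) < budget:
        best_item, best_gain = None, -1.0
        for j in sorted(remaining):
            # Marginal gain of adding j to the current selection.
            gain = facility_location(sim, selected + [j]) - current
            if gain > best_gain:
                best_item, best_gain = j, gain
        selected.append(best_item)
        remaining.remove(best_item)
        current += best_gain
    return selected


# Toy symmetric similarity matrix (made-up values):
# items 0 and 1 are similar to each other, as are items 2 and 3.
sim = [
    [1.0, 0.75, 0.25, 0.0],
    [0.75, 1.0, 0.25, 0.25],
    [0.25, 0.25, 1.0, 0.5],
    [0.0, 0.25, 0.5, 1.0],
]

subset = greedy_select(sim, budget=2)
print(subset)
```

On this toy input the second pick comes from the pair {2, 3} rather than from item 0, because item 0 is already well represented by item 1: its marginal gain drops from 2.0 on the empty set to 0.25 once item 1 has been chosen. That drop is the diminishing-returns property that makes the objective submodular.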

Current and Past Team Members

Related Publications