Graphical Models Research in Audio, Speech, and Language Processing
Jeff Bilmes <bilmes AT ee DOT washington DOT edu>
SSLI Laboratory, University of Washington, Dept. of Electrical Engineering
This page provides the talk slides for, and a list of references to accompany, the tutorial titled above, presented at the 2003 Uncertainty in Artificial Intelligence (UAI'03) conference. The list is far from complete; it is intended only to provide a useful starting point and a bit of historical perspective.
I will be updating the list of references over time. I apologize in advance if I have missed a particular reference; if you email it to me at the address above, I will gladly add it to the list.
Graphical models (GMs) provide a general statistical abstraction that can describe, and help solve, tasks in a variety of domains. Recently, much attention has been devoted to their application to audio, speech, and language. These time series have structure that creates unique and interesting research problems, most of which are still open. GMs offer a general and mathematically principled way to move beyond the ubiquitously used hidden Markov model for such signals.
In this tutorial, we will survey the various ways in which GMs are being used to represent and solve problems associated with these signals. After briefly reviewing relevant GM concepts and notation, we will discuss several representational frameworks at different levels, including: 1) representing and learning GMs over vector observation sequences, serialized and switched by hidden chains; 2) representing mid-level hidden aspects of speech, such as articulatory and pronunciation networks, where the goal is to provide anatomically inspired sequences of vocal-tract configurations; and 3) language representations, including those where words are factored into feature bundles, switching mixture-based language models, and graphs that themselves represent smoothing.
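As a concrete instance of framework (1), here is a minimal sketch, assuming NumPy, of a single hidden chain over vector observations whose Gaussian emission is selected ("switched") by the hidden state. The names, shapes, and parameterization are mine, not the tutorial's; an HMM written this way is just the simplest dynamic graphical model of this family.

```python
import numpy as np

def log_gauss(x, mean, var):
    # Log density of a diagonal-covariance Gaussian at vector x.
    return -0.5 * np.sum(np.log(2.0 * np.pi * var) + (x - mean) ** 2 / var)

def forward_loglik(obs, init, trans, means, variances):
    """Log-likelihood of obs under a single hidden chain (an HMM viewed as a DBN).

    obs:       (T, D) array of D-dimensional observation vectors
    init:      (S,)   initial hidden-state distribution
    trans:     (S, S) transition matrix, trans[i, j] = P(s_t = j | s_{t-1} = i)
    means:     (S, D) per-state emission means
    variances: (S, D) per-state diagonal emission variances
    """
    T, S = obs.shape[0], init.shape[0]
    # Emission log-probabilities: the hidden state switches the active Gaussian.
    log_b = np.array([[log_gauss(obs[t], means[s], variances[s])
                       for s in range(S)] for t in range(T)])
    alpha = np.log(init) + log_b[0]
    for t in range(1, T):
        m = alpha.max()  # logsumexp over the previous state, for stability
        alpha = m + np.log(np.exp(alpha - m) @ trans) + log_b[t]
    m = alpha.max()
    return m + np.log(np.exp(alpha - m).sum())
```

Richer members of this family replace the single chain with several interacting chains, or factor the observation vector itself, without changing the inferential machinery.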
In each case, it will be emphasized how the graph and its associated inferential machinery alone can portray quite dissimilar sets of models and operations. It will also be emphasized how "deterministic dependencies" and the concept of a "switching" network (or multi-net) greatly simplify the specification of these representations (a small sketch of both ideas appears below). It will further be stressed that the structure-learning problem should be tailored to the particular task; that is, a discriminative task should give rise to a discriminatively structured network.

Building real-world systems, however, requires a software system whose description language is that of dynamic graphs. This tutorial will therefore also present, and use for descriptive purposes, the Graphical Models Toolkit (GMTK), a publicly available toolkit for developing GM-based audio, speech, and language systems. The tutorial will cover many of GMTK's novel features, particularly those that facilitate the use of the GM framework for these signal types. Moreover, many of these features are likely to be applicable to most time-series and dynamic problems.
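The following sketch, in plain Python rather than GMTK's actual description language, illustrates the two ideas just named; all function names here are hypothetical. A switching parent's value selects which conditional distribution (and hence which edge set) is active, while a deterministic dependency is simply a conditional distribution that puts all of its mass on one value.

```python
def switching_cpd(switch_value, cpds):
    # The switching parent's value picks the active conditional distribution.
    return cpds[switch_value]

def deterministic_cpd(f):
    # Deterministic dependency: child = f(parents) with probability one.
    return lambda *parents: {f(*parents): 1.0}

# Example: a phone-position counter, as in pronunciation networks, that
# advances deterministically when a binary "transition" variable fires.
advance = deterministic_cpd(lambda pos, transition: pos + 1 if transition else pos)
assert advance(2, 1) == {3: 1.0}

# Example: a hidden switch choosing between two word distributions.
cpds = {0: {"the": 0.6, "a": 0.4}, 1: {"cat": 0.5, "dog": 0.5}}
assert switching_cpd(1, cpds) == {"cat": 0.5, "dog": 0.5}
```

Because a deterministic dependency concentrates all probability mass on a single value, inference can exploit its sparsity, and a switching parent prunes away the edges of every inactive sub-network; both effects make large specifications compact and tractable.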
Lastly, the tutorial will include an overview of current challenges, including the complexity of exact inference. It will therefore describe recent work on the triangulation of dynamic models, and how such problems differ from standard (non-dynamic) triangulation.
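For readers unfamiliar with triangulation, the following is a minimal sketch of greedy min-fill triangulation in the standard (non-dynamic) setting. Roughly speaking, in the dynamic case the graph is a repeating template, and the elimination ordering must additionally respect the interface between chunks so that one triangulation can be reused as the model is unrolled over time; this sketch ignores that constraint and is illustrative only.

```python
from itertools import combinations

def min_fill_triangulate(adj):
    """Greedy min-fill triangulation of an undirected graph.

    adj maps each node to the set of its neighbors; returns the list of
    fill-in edges added so that every elimination neighborhood is a clique.
    """
    adj = {v: set(ns) for v, ns in adj.items()}   # work on a copy
    fill, remaining = [], set(adj)
    while remaining:
        def n_fill(v):  # edges needed to make v's remaining neighbors a clique
            nbrs = adj[v] & remaining
            return sum(1 for a, b in combinations(nbrs, 2) if b not in adj[a])
        v = min(remaining, key=n_fill)            # eliminate the cheapest node
        for a, b in combinations(adj[v] & remaining, 2):
            if b not in adj[a]:
                adj[a].add(b); adj[b].add(a)
                fill.append((a, b))
        remaining.remove(v)
    return fill

# A 4-cycle a-b-c-d needs exactly one chord:
cycle = {'a': {'b', 'd'}, 'b': {'a', 'c'}, 'c': {'b', 'd'}, 'd': {'a', 'c'}}
print(min_fill_triangulate(cycle))   # one chord, e.g. [('b', 'd')]
```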
KEYWORDS: graphical models, Bayesian networks, dynamic Bayesian networks, DBN, automatic speech recognition, ASR, speech processing, language modeling, language processing, audio processing, hidden Markov models, HMMs, discriminative structure learning