Low Power Speech Recognition

Project members:

Overview

The overall goal of this project is to perform speech recognition on hand-held devices with limited computational and storage capabilities and, more importantly, with power-consumption constraints. To that end, we have been carrying out research on application-level algorithms, software techniques, and architectural support, organized into three projects.

The first involves writing a state-of-the-art speech recognition system to serve as a basis for analyzing and benchmarking CPU and memory usage patterns, investigating performance/energy-consumption trade-offs, and identifying bottlenecks that could benefit from architectural support. The second involves investigating a number of new algorithms for scalar and vector parameter quantization in an attempt to reduce memory use, power consumption, and computation. The third involves the design and simulation of low-precision variable representations, and the implementation of a low-resource speech recognizer using ROM look-ups.

  1. A Speech Recognition System for Resource Benchmarking

    The first project involves writing (in C++) a new speech recognition engine for the purpose of benchmarking the CPU utilization, memory usage, and power consumption of an application we plan to run on chip. The system has the following features:

    1. Low resource use and high accuracy

      The speech recognizer uses continuous-observation hidden Markov models (CHMMs), which are among the most successful recognition technologies and are widely used in desktop applications. For hand-held recognition, however, the common approach has been to use discrete-observation HMMs, because they are faster than continuous models and perform almost as well for small vocabularies; for large vocabularies, storage requirements quickly become overwhelming. We chose continuous models because memory usage contributes significantly to power consumption and we envision a moderately sized vocabulary that is easily extendible. Moreover, several techniques can make continuous models almost as inexpensive computationally as discrete ones. For example, variable quantization techniques and hybrid models such as semi-continuous HMMs allow us to control the trade-off between memory usage and speed.
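
      For concreteness, the following is a minimal sketch (in C++, with illustrative names and data layouts rather than the engine's actual interfaces) of the per-state observation likelihood a CHMM back-end evaluates, assuming diagonal-covariance Gaussian mixtures:

        #include <cmath>
        #include <limits>
        #include <vector>

        // One diagonal-covariance Gaussian mixture component (illustrative layout).
        struct Component {
            double logWeight;               // log mixture weight
            std::vector<double> mean;       // per-dimension mean
            std::vector<double> invVar;     // per-dimension 1/variance
            double logConst;                // precomputed -0.5*(D*log(2*pi) + sum(log var))
        };

        // log p(x | state) = log sum_k w_k N(x; mu_k, Sigma_k), computed in the log domain.
        double stateLogLikelihood(const std::vector<Component>& mixture,
                                  const std::vector<double>& x) {
            double best = -std::numeric_limits<double>::infinity();
            std::vector<double> terms;
            terms.reserve(mixture.size());
            for (const Component& c : mixture) {
                double e = 0.0;
                for (std::size_t d = 0; d < x.size(); ++d) {
                    double diff = x[d] - c.mean[d];
                    e += diff * diff * c.invVar[d];
                }
                double t = c.logWeight + c.logConst - 0.5 * e;
                terms.push_back(t);
                if (t > best) best = t;
            }
            // Log-sum-exp over the components; a common cheaper variant (the max-component
            // "Gaussian approximation" mentioned later) would simply return `best`.
            double sum = 0.0;
            for (double t : terms) sum += std::exp(t - best);
            return best + std::log(sum);
        }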

    2. High flexibility and speaker-independence

      We embed phone-based CHMMs into word-based models according to a user-defined vocabulary. In other words, the recognizer is not confined to a fixed set of voice commands, but can accept an arbitrary command list. In addition, the system is expected to work regardless of the speaker. The phone-based CHMMs are obtained by training them on a large number of sentences; the larger and more diverse the training set, the better the decoding performance usually is. We have trained the system on two phonetically rich databases, PhoneBook and TIMIT, whose substantial amount of training data makes the system more robust to speaker variation.
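
      As an illustration of this embedding, the sketch below (hypothetical types and names, not the engine's actual code) concatenates trained phone models into a left-to-right word model according to a user-supplied pronunciation:

        #include <map>
        #include <string>
        #include <vector>

        // Illustrative: a phone HMM is represented here simply as an ordered list of state ids.
        using PhoneHMM = std::vector<int>;

        // Concatenate trained phone models into one word model, following a
        // user-supplied pronunciation (a sequence of phone names).
        std::vector<int> buildWordModel(const std::map<std::string, PhoneHMM>& phoneSet,
                                        const std::vector<std::string>& pronunciation) {
            std::vector<int> wordStates;
            for (const std::string& phone : pronunciation) {
                const PhoneHMM& hmm = phoneSet.at(phone);   // throws if the phone is unknown
                wordStates.insert(wordStates.end(), hmm.begin(), hmm.end());
            }
            return wordStates;   // e.g. "cat" -> states of /k/ + /ae/ + /t/
        }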

    The recognition engine has two main parts. The first, called the front-end, processes the microphone input and converts it into features that capture the main properties of the speech signal in a compact form. The second component, the so-called back-end, is based on the CHMMs; it classifies (or decodes) the features into the best-matching phonemes, which together form words. A more detailed description of these two components follows.

    1. Speech detection and front-end

      Because the speech recognizer is intended to run continuously waiting for user input (while hopefully consuming very few resources in the process), an important part of the front-end is the speech detection routine. On the one hand, we do not want to miss the beginning of the user's commands; on the other hand, we want to avoid decoding utterances that are mostly silence or, even worse, only silence. The speech detector we are using is one of the simplest: it detects speech when the energy level of the speech signal rises above a certain threshold. This design uses minimal extra resources while still accurately detecting speech when the noise conditions are stationary.
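
      A minimal sketch of such an energy-threshold detector is shown below (C++, with illustrative margins and smoothing constants rather than the values used in our system):

        #include <cmath>
        #include <cstddef>

        // Flags a frame as speech when its log energy exceeds the estimated
        // noise floor by a fixed margin; the floor is tracked during silence.
        class EnergyDetector {
        public:
            explicit EnergyDetector(double marginDb = 10.0) : marginDb_(marginDb) {}

            bool isSpeech(const float* frame, std::size_t n) {
                double energy = 1e-10;                     // avoid log(0)
                for (std::size_t i = 0; i < n; ++i) energy += double(frame[i]) * frame[i];
                double db = 10.0 * std::log10(energy);
                if (db < noiseFloorDb_ + marginDb_) {
                    // Slowly adapt the noise floor while no speech is detected.
                    noiseFloorDb_ = 0.95 * noiseFloorDb_ + 0.05 * db;
                    return false;
                }
                return true;
            }

        private:
            double marginDb_;
            double noiseFloorDb_ = -60.0;                  // initial guess for the silence level
        };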

      Converting the speech signal into features is done using standard MFCC processing, which roughly computes the energy in a set of frequency bands spaced according to the Mel-frequency scale, which is in turn inspired by the sensitivity of human hearing at different frequencies.
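
      The sketch below shows the last stages of this computation (log Mel-filterbank energies followed by a DCT), assuming the per-frame power spectrum and the triangular Mel filter weights have already been computed; it is illustrative rather than our front-end's actual code:

        #include <cmath>
        #include <cstddef>
        #include <vector>

        // Each melFilters[m] holds one filter's weights over the power-spectrum bins
        // (same length as powerSpectrum).
        std::vector<double> mfccFromPowerSpectrum(
                const std::vector<double>& powerSpectrum,
                const std::vector<std::vector<double>>& melFilters,
                std::size_t numCepstra) {
            const double kPi = 3.14159265358979323846;
            // 1. Energy in each Mel-spaced band, compressed with a log.
            std::vector<double> logMel(melFilters.size());
            for (std::size_t m = 0; m < melFilters.size(); ++m) {
                double e = 1e-10;
                for (std::size_t k = 0; k < powerSpectrum.size(); ++k)
                    e += melFilters[m][k] * powerSpectrum[k];
                logMel[m] = std::log(e);
            }
            // 2. DCT-II of the log energies gives the cepstral coefficients.
            std::vector<double> mfcc(numCepstra);
            const double M = double(logMel.size());
            for (std::size_t i = 0; i < numCepstra; ++i) {
                double s = 0.0;
                for (std::size_t m = 0; m < logMel.size(); ++m)
                    s += logMel[m] * std::cos(kPi * double(i) * (double(m) + 0.5) / M);
                mfcc[i] = s;
            }
            return mfcc;
        }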

    2. Back-end and fast decoding techniques

      The back-end of the speech recognizer takes the feature vectors as input and runs the Viterbi decoding algorithm to find the sequence of phonemes/words with the highest probability of having generated them. The likelihood evaluation and Viterbi decoding in the back-end turn out to be the bottlenecks of the whole system. We apply a Gaussian approximation to the likelihood evaluation and several speed-up techniques in the Viterbi search, such as beam pruning, to achieve savings in computation at a negligible loss in accuracy.
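
      The following is a minimal sketch of one Viterbi time step with beam pruning (C++, illustrative interfaces; backpointers and word-level bookkeeping are omitted):

        #include <algorithm>
        #include <cstddef>
        #include <limits>
        #include <vector>

        constexpr double kLogZero = -std::numeric_limits<double>::infinity();

        // prev[s] is the best log score of state s at the previous frame,
        // logTrans[s2][s1] the log transition probability s1 -> s2, and
        // obsLogLik[s2] the observation log likelihood of the current frame in s2.
        std::vector<double> viterbiStep(const std::vector<double>& prev,
                                        const std::vector<std::vector<double>>& logTrans,
                                        const std::vector<double>& obsLogLik,
                                        double beam) {
            const std::size_t S = prev.size();
            std::vector<double> cur(S, kLogZero);
            double best = kLogZero;
            for (std::size_t s2 = 0; s2 < S; ++s2) {
                double bestPred = kLogZero;
                for (std::size_t s1 = 0; s1 < S; ++s1)
                    bestPred = std::max(bestPred, prev[s1] + logTrans[s2][s1]);
                if (bestPred == kLogZero) continue;        // no active predecessor
                cur[s2] = bestPred + obsLogLik[s2];
                best = std::max(best, cur[s2]);
            }
            // Beam pruning: deactivate states that fall too far below the frame's best
            // score, so later frames only expand hypotheses inside the beam.
            for (double& v : cur)
                if (v < best - beam) v = kLogZero;
            return cur;
        }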

  2. ASR Parameter Quantization

    In this project, we investigate a number of novel parameter quantization algorithms for speech recognition systems. In general, it is important to produce automatic speech recognition (ASR) systems that use as few computational and memory resources as possible, especially in low-memory/low-power environments such as personal digital assistants. One way to achieve this is through parameter quantization. In this work, we compare a variety of novel subvector clustering procedures for ASR system parameter quantization. Specifically, we look at systematic, data-driven subvector selection techniques based on entropy minimization, and compare performance primarily on an isolated-word speech recognition task.

    Unfortunately, optimal entropy-minimizing quantization is computationally intractable, so we have developed several novel heuristic schemes for approximating the optimal solution. These include disjoint scalar quantization, normalized joint scalar quantization, normalized joint pairwise quantization, simple scalar quantization with separate codebooks per parameter, and joint scalar quantization with normalization. We have also developed a number of vector and cluster-based quantization schemes, where entropy is the driving criterion behind forming the clusters. We have tested these schemes on a state-of-the-art speech database, one that we also plan to use to help develop the on-chip speech recognition system (see Part 1 above). In general, on this database, all of these procedures approximate the optimal clustering well and produce ASR systems that require significantly fewer parameters.
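
    As a simple illustration of the flavor of these methods (not the entropy-driven subvector selection itself), the sketch below builds a scalar codebook for one parameter dimension with a few 1-D Lloyd/k-means iterations; each parameter is then stored as a small index into such a codebook:

      #include <algorithm>
      #include <cmath>
      #include <cstddef>
      #include <vector>

      // Build a small scalar codebook for one parameter dimension.
      // Assumes `values` is non-empty and `size` >= 2.
      std::vector<double> scalarCodebook(std::vector<double> values, std::size_t size,
                                         int iterations = 20) {
          std::sort(values.begin(), values.end());
          std::vector<double> code(size);
          for (std::size_t c = 0; c < size; ++c)               // spread initial codewords
              code[c] = values[c * (values.size() - 1) / (size - 1)];
          for (int it = 0; it < iterations; ++it) {
              std::vector<double> sum(size, 0.0);
              std::vector<std::size_t> cnt(size, 0);
              for (double v : values) {                        // assign each value to nearest codeword
                  std::size_t bestC = 0;
                  double bestD = std::abs(v - code[0]);
                  for (std::size_t c = 1; c < size; ++c) {
                      double d = std::abs(v - code[c]);
                      if (d < bestD) { bestD = d; bestC = c; }
                  }
                  sum[bestC] += v;
                  ++cnt[bestC];
              }
              for (std::size_t c = 0; c < size; ++c)           // re-estimate codewords
                  if (cnt[c] > 0) code[c] = sum[c] / cnt[c];
          }
          return code;
      }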

  3. A ROM Look-up Based Back-end

    We have realized an ultra-low-resource system back-end using only ROM look-ups. Most state-of-the-art low-power, low-memory recognizers are built upon fixed-point processors.

    Although fixed-point arithmetic provides significant savings over floating point, accuracy problems can arise in back-end processing. Additionally, some arithmetic operations remain relatively expensive, even if cheaper than their floating-point counterparts. Instead of using continuous-valued fixed-point variables and their corresponding arithmetic operations, we design a system back-end in which all parameters and variables are quantized into discrete codewords and then represented by integer indices, and all operations are replaced by ROM look-ups. In this way, every variable in the back-end can be stored in no more than 8 bits, while maintaining the same performance achieved with a 32-bit IEEE floating-point representation.
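
    The following sketch illustrates the idea for a single operation: a table built offline from the codebooks replaces the run-time arithmetic on two 8-bit codewords (the codebooks, table size, and operation are illustrative, not the ones used in our back-end):

      #include <array>
      #include <cmath>
      #include <cstddef>
      #include <cstdint>
      #include <vector>

      // All back-end variables are 8-bit codeword indices; an operation such as
      // "accumulate two log scores" becomes a single lookup in a 64 KB ROM.
      // Assumes cbA and cbB each have at least 256 entries and cbOut has 1..256 entries.
      struct LogAccumulateRom {
          std::array<std::uint8_t, 256 * 256> table;

          // Built offline: map every pair of input codewords to the output codeword
          // whose real value is closest to the exact result.
          void build(const std::vector<double>& cbA,
                     const std::vector<double>& cbB,
                     const std::vector<double>& cbOut) {
              for (int a = 0; a < 256; ++a)
                  for (int b = 0; b < 256; ++b) {
                      double v = cbA[a] + cbB[b];          // the real-valued operation
                      std::size_t best = 0;
                      for (std::size_t c = 1; c < cbOut.size(); ++c)
                          if (std::abs(cbOut[c] - v) < std::abs(cbOut[best] - v)) best = c;
                      table[a * 256 + b] = std::uint8_t(best);
                  }
          }

          // At run time there is no arithmetic at all: one indexed load per operation.
          std::uint8_t accumulate(std::uint8_t a, std::uint8_t b) const {
              return table[std::size_t(a) * 256 + b];
          }
      };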

    We have made modifications to the decoding mechanism in order to maximally reduce the variable bit-widths. We have implemented tree-structured quantization methods for accumulative variables with a bounded dynamic range, and we have applied normalization techniques to the forward probabilities in Viterbi decoding, which constrain all variables within a constant dynamic range. Based on these techniques, we have obtained a full picture of the lowest bit-width to which every variable can be quantized without degrading system performance.
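
    As a small illustration of the normalization step (simplified relative to our actual decoder), each frame's scores can be shifted by the frame's best score and clamped, so that every value lies in a fixed, bounded range before it is quantized:

      #include <algorithm>
      #include <vector>

      // Subtract the frame's best log score from all active scores so every value
      // falls in [-range, 0]; a small fixed codebook then suffices for these variables.
      // Assumes `scores` is non-empty.
      void normalizeFrameScores(std::vector<double>& scores, double range) {
          double best = *std::max_element(scores.begin(), scores.end());
          for (double& s : scores) {
              s -= best;                       // the best score becomes 0
              if (s < -range) s = -range;      // clamp the tail to the bounded range
          }
      }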

    Since finding the optimal combination of variable bit-widths is intractable, we have looked for effective greedy algorithms to optimize a fully quantized back-end. We have tried several optimization schemes based on a combined metric of performance and cost. The best algorithm matches baseline performance with only 50 KB of ROM.
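
    The sketch below shows one simple greedy scheme of this general kind (a simplified stand-in for our combined performance/cost optimization; the evaluation hook and stopping rule are hypothetical):

      #include <cstddef>
      #include <functional>
      #include <vector>

      // `evaluate` is a hypothetical hook that re-runs the recognizer with the given
      // per-variable bit-widths and returns its accuracy.
      std::vector<int> greedyBitWidths(std::size_t numVars, int startBits, double minAccuracy,
                                       const std::function<double(const std::vector<int>&)>& evaluate) {
          std::vector<int> bits(numVars, startBits);
          for (;;) {
              int bestVar = -1;
              double bestAcc = -1.0;
              // Try shrinking each variable by one bit and keep the least harmful change.
              for (std::size_t v = 0; v < numVars; ++v) {
                  if (bits[v] <= 1) continue;
                  std::vector<int> trial = bits;
                  --trial[v];
                  double acc = evaluate(trial);
                  if (acc >= minAccuracy && acc > bestAcc) { bestAcc = acc; bestVar = int(v); }
              }
              if (bestVar < 0) break;          // nothing can shrink without falling below the floor
              --bits[bestVar];
          }
          return bits;
      }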

    With several candidates for an optimal design in hand, we will begin power simulations. To that end, we have modified an existing microprocessor power simulator to support our quantized operations. Using it, we will compare the power consumption, rather than just the computational complexity, of the different back-end algorithms. We are currently arranging meetings with other teams in order to determine the parameters necessary to simulate the proposed implementations.

References:
  1. Karim Filali, Xiao Li, and Jeff Bilmes, "Algorithms for Data-Driven ASR Parameter Quantization," submitted to IEEE Transactions on Speech and Audio Processing, 2002.

  2. Karim Filali, Xiao Li, and Jeff Bilmes, "Data-Driven Vector Clustering for Low-Memory Footprint ASR," International Conference on Spoken Language Processing (ICSLP), Denver, Colorado, 2002.

  3. Chia-Ping Chen, Karim Filali, and Jeff Bilmes, "Frontend Post-Processing and Backend Model Enhancement on the Aurora 2.0/3.0 Databases," International Conference on Spoken Language Processing (ICSLP), Denver, Colorado, 2002.

  4. Xiao Li and Jeff Bilmes, "Selectively Computing Dynamic Features in the Likelihood Computation of ASR Systems," Technical Report UWEETR-2003-0009, University of Washington, 2003.
