
An Overview of ASRU 2013

Tara N. Sainath and Jan (Honza) Cernocky

SLTC Newsletter, February 2014

The Automatic Speech Recognition and Understanding Workshop (ASRU) was recently held in Olomouc, Czech Republic, from December 8-12, 2013. Each day of the workshop focused on a specific theme that is currently popular among ASR researchers. Below, we highlight each of the four days of the workshop and touch on some of the more interesting papers in detail.

Day 1 – Neural Networks

In the past few years, deep learning has become the de facto approach for acoustic modeling in ASR, showing tremendous improvements of 10-30% relative over alternative acoustic modeling approaches across a variety of LVCSR tasks [1].
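As background, here is a minimal sketch of the hybrid DNN-HMM setup this theme revolves around (in PyTorch, with hypothetical layer sizes): a feedforward network maps a stacked window of acoustic frames to posteriors over context-dependent HMM states, which replace the GMM likelihoods in the decoder. This is an illustration of the general recipe, not any particular paper's system.

```python
import torch
import torch.nn as nn

# Hypothetical dimensions: 40-dim log-mel features, +/-5 frames of context,
# and 6000 context-dependent HMM states (tied triphone targets).
FEAT_DIM, CONTEXT, NUM_STATES = 40, 5, 6000
INPUT_DIM = FEAT_DIM * (2 * CONTEXT + 1)

# A plain feedforward DNN of the kind used in hybrid DNN-HMM systems:
# several fully connected layers followed by a softmax over CD states.
model = nn.Sequential(
    nn.Linear(INPUT_DIM, 2048), nn.Sigmoid(),
    nn.Linear(2048, 2048), nn.Sigmoid(),
    nn.Linear(2048, 2048), nn.Sigmoid(),
    nn.Linear(2048, NUM_STATES),   # logits; softmax applied below
)

frames = torch.randn(16, INPUT_DIM)            # a minibatch of stacked frames
state_posteriors = model(frames).log_softmax(dim=-1)
# At decode time these posteriors are divided by state priors to obtain
# scaled likelihoods for the HMM decoder.
print(state_posteriors.shape)                  # torch.Size([16, 6000])
```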

The stage was set by a keynote on the Physiological Basis of Speech Processing: From Rodents to Humans by Christoph Schreiner (UC San Francisco), followed by invited talks on Multilayer perceptrons for speech recognition: There and Back Again by Nelson Morgan (International Computer Science Institute), and Context-Dependent Deep Neural Networks for Large Vocabulary Speech Recognition: From Discovery to Practical Systems by Frank Seide (Microsoft Research Asia).

There were numerous interesting papers in this session as well. For example, [2] explored using speaker-identity vectors (i-vectors) with DNNs, obtaining impressive results on Switchboard. In addition, [3] introduced a semi-supervised training strategy for DNNs. On the surprise language task (Vietnamese) from the BABEL project, the authors were able to obtain a 2.2% absolute improvement in WER compared to a system built only on fully transcribed data. Finally, [4] looked at porting the benefits of DNNs back into GMM modeling techniques, including making models deep (multiple layers) and wide, and sharing model parameters. With these changes, the authors found that the performance of GMMs comes closer to that of DNNs on TIMIT.
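As a rough sketch of the speaker-adaptation idea in [2], the snippet below appends a per-speaker i-vector to every frame of the acoustic input, so the network can learn to normalize out speaker variability. All dimensions here are hypothetical, and the code is an illustration rather than the authors' implementation.

```python
import torch

# Hypothetical sizes: 440-dim stacked acoustic input, 100-dim i-vector.
acoustic = torch.randn(16, 440)        # minibatch of stacked frames
ivector = torch.randn(100)             # one i-vector for this speaker

# The same speaker-level i-vector is tiled across all frames and
# concatenated with the frame-level acoustic features.
augmented = torch.cat([acoustic, ivector.expand(16, -1)], dim=-1)
print(augmented.shape)                 # torch.Size([16, 540])
# 'augmented' then feeds the DNN in place of the plain acoustic input.
```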

Day 2 – Limited Resources

The second day focused on limited resources for ASR.

Invited talks were given on Building Speech Recognition Systems with Low Resources (by Tanja Schultz, Karlsruhe Institute of Technology) and Unsupervised Acoustic Model Training with Limited Linguistic Resources (by Lori Lamel, CNRS-LIMSI). Mary Harper (IARPA) delivered a keynote on The Babel Program and Low Resource Speech Technology, which was followed by invited talks on Zero to One Hour of Resources: A Self-organizing Unit Approach to Training Speech Recognizers (by Herb Gish, Raytheon BBN Technologies) and Recent Progress in Unsupervised Speech Processing (by Jim Glass, Massachusetts Institute of Technology).

The work in [5] looked at automatically learning a pronunciation lexicon, starting from a small seed lexicon and then inferring pronunciations for new words from speech transcribed at the word level. Experiments on a Switchboard task show that the proposed lexicon learning method achieves a WER similar to using a fully handcrafted lexicon. In addition, [6] proposes a framework which discovers acoustic units by clustering together context-dependent grapheme models, and then generates an associated pronunciation lexicon from the initial grapheme-based recognition system. Results on WSJ show the proposed approaches allow for a 13% reduction in WER, and have many implications for low-resourced languages such as those in the Babel program.
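To make the loop in [5] concrete, here is a toy sketch of its candidate-selection step: out-of-lexicon words receive scored candidate pronunciations (in the paper, from constrained phone decoding of word-transcribed speech, alternating with acoustic model retraining), and the best-scoring candidate enters the lexicon. The data and function below are hypothetical placeholders, not the authors' code.

```python
# Tiny seed lexicon mapping words to phone strings.
SEED_LEXICON = {'one': 'w ah n', 'two': 't uw'}

# Pretend phone-level decodings of word-transcribed speech: for each
# out-of-lexicon word, candidate pronunciations with acoustic scores.
CANDIDATES = {
    'three': [('th r iy', -12.3), ('t r iy', -15.1)],
    'four':  [('f ao r', -9.8), ('f ow r', -11.2)],
}

def learn_lexicon(seed, candidates):
    """Grow the lexicon by keeping the best-scoring pronunciation per word."""
    lexicon = dict(seed)
    for word, hyps in candidates.items():
        best_pron, _ = max(hyps, key=lambda h: h[1])
        lexicon[word] = best_pron
    return lexicon

print(learn_lexicon(SEED_LEXICON, CANDIDATES))
```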

Day 3 – ASR in Applications

The third day focused on the impact ASR is making in various applications.

A keynote was delivered on Utilization of ASRU technology - present and future (Joseph Olive), while invited talks were given on the topics of Augmenting conversations with a speech understanding anticipatory search engine (Marsal Gavalda, Expect Labs), Calibration of binary and multiclass probabilistic classifiers in automatic speaker and language recognition (Niko Brummer, Agnitio), Speech technologies for data mining, voice analytics and voice biometry (Petr Schwarz, Phonexia and Brno University of Technology), From the Lab to the Living Room: The Challenges of Building Speech-Driven Applications for Children (Brian Langner, ToyTalk), and The growing role of speech in Google products (Pedro Moreno, Google).

One interesting paper dealing with spoken language understanding was [7], which looked at a joint model for intent detection and slot filling based on convolutional neural networks (CNNs). The proposed architecture shows promising results on a variety of real-world ASR applications. In addition, [8] looked at using linguistic knowledge for query understanding, extracting a set of syntactic structural features and semantic dependency features from query parse trees to enhance inference model learning. Experiments on real natural language queries indicate that using additional linguistic knowledge can improve query understanding results across various real-world tasks.
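As an illustration of the joint modeling idea in [7], the sketch below shares a convolutional encoder between a per-token slot-filling head and a pooled intent-detection head. It omits the triangular CRF layer of the actual paper, and all sizes and names are hypothetical.

```python
import torch
import torch.nn as nn

class JointIntentSlotCNN(nn.Module):
    def __init__(self, vocab=5000, emb=128, hidden=256, slots=64, intents=20):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        # 1-D convolution over the token sequence; padding keeps the length.
        self.conv = nn.Conv1d(emb, hidden, kernel_size=3, padding=1)
        self.slot_head = nn.Linear(hidden, slots)      # per-token slot tags
        self.intent_head = nn.Linear(hidden, intents)  # one label per query

    def forward(self, tokens):                 # tokens: (batch, seq_len)
        x = self.embed(tokens).transpose(1, 2) # (batch, emb, seq_len)
        h = torch.relu(self.conv(x))           # shared representation
        h = h.transpose(1, 2)                  # (batch, seq_len, hidden)
        slot_logits = self.slot_head(h)        # a slot tag per token
        intent_logits = self.intent_head(h.max(dim=1).values)  # pooled
        return slot_logits, intent_logits

model = JointIntentSlotCNN()
slots, intent = model(torch.randint(0, 5000, (2, 12)))
print(slots.shape, intent.shape)  # torch.Size([2, 12, 64]) torch.Size([2, 20])
```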

Day 4 – What’s Wrong with ASR?

Finally, Jordan Cohen and Steve Wegmann jointly delivered a keynote discussing the incorrect assumptions our current modeling approaches make, and the research directions that could be pursued to improve ASR performance. For example, HMMs have been around for 40 years, but make poor assumptions such as frame independence. Wegmann also had an interesting paper related to the theme of the day. In this paper [9], Wegmann applies a diagnostic analysis to actual term-weighted value (ATWV), the performance metric used in the Babel task. His analysis examines the large ATWV gains that system combination often produces by increasing the number of true hits, and questions whether similar gains can be obtained without the huge expense of system combination.
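For context, ATWV rewards correct keyword detections and heavily penalizes false alarms, averaged over keywords. Below is a minimal sketch of the standard computation (following the usual NIST-style definition with beta = 999.9); the variable names and toy numbers are ours.

```python
def atwv(keywords, total_speech_seconds, beta=999.9):
    """Actual Term-Weighted Value over a list of keyword result dicts.

    Each dict holds: 'n_true' - reference occurrences of the keyword,
                     'n_hit'  - correctly detected occurrences,
                     'n_fa'   - false alarms.
    A perfect system scores 1.0; detecting nothing scores 0.0.
    """
    loss = 0.0
    for kw in keywords:
        p_miss = 1.0 - kw['n_hit'] / kw['n_true']
        # The false-alarm rate is normalized by the non-target trials,
        # approximated as one trial per second of speech.
        p_fa = kw['n_fa'] / (total_speech_seconds - kw['n_true'])
        loss += p_miss + beta * p_fa
    return 1.0 - loss / len(keywords)

# Toy example: two keywords over one hour of speech.
results = [{'n_true': 10, 'n_hit': 8, 'n_fa': 1},
           {'n_true': 4,  'n_hit': 4, 'n_fa': 0}]
print(round(atwv(results, total_speech_seconds=3600.0), 3))
```

The large beta is what makes the metric so sensitive to false alarms, and also why adding true hits through system combination moves ATWV so much.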

Also, for the first time at ASRU, posters were left hanging throughout the workshop and presented in dedicated "authors-at-their-posters" sessions. This led to many in-depth discussions over coffee, during lunch, or in the evening, since a poster was always at hand to refresh one's memory or clarify an idea.

References

[1] G. Hinton, L. Deng, D. Yu, G. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, and B. Kingsbury, “Deep Neural Networks for Acoustic Modeling in Speech Recognition,” IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82–97, 2012.

[2] G. Saon, H. Soltau, M. Picheny, and D. Nahamoo, “Speaker Adaptation of Neural Network Acoustic Models using I-Vectors,” in Proc. ASRU, 2013.

[3] K. Vesely, M. Hannemann, and L. Burget, “Semi-supervised Training of Deep Neural Networks,” in Proc. ASRU, 2013.

[4] K. Demuynck and F. Triefenbach, “Porting Concepts from DNNs back to GMMs,” in Proc. ASRU, 2013.

[5] L. Lu, A. Ghoshal and S. Renals, “Acoustic Data-Driven Pronunciation Lexicon for Large Vocabulary Speech Recognition,” in Proc. ASRU, 2013.

[6] W. Hartmann, A. Roy, L. Lamel and J.L. Gauvain, “Acoustic Unit Discovery and Pronunciation Generation from a Grapheme-Based Lexicon,” in Proc. ASRU, 2013.

[7] P. Xu and R. Sarikaya, “Convolutional Neural Network Based Triangular CRF for Joint Intent Detection and Slot Filling,” in Proc. ASRU, 2013.

[8] J. Liu, P. Pasupat, Y. Wang, S. Cyphers, and J. Glass, "Query Understanding Enhanced by Hierarchical Parsing Structures," in Proc. ASRU, 2013.

[9] S. Wegmann, A. Faria, A. Janin, K. Riedhammer, and N. Morgan, “The TAO of ATWV: Probing the Mysteries of Keyword Search Performance,” in Proc. ASRU, 2013.