
An Overview of the Base Period of the Babel Program

Tara N. Sainath, Brian Kingsbury, Florian Metze, Nelson Morgan, Stavros Tsakalidis

SLTC Newsletter, November 2013

Program Overview

The goal of the Babel program is to rapidly develop speech recognition capability for keyword search in previously unstudied languages, working with speech recorded in a variety of conditions with limited amounts of transcription. Several issues and observations frame the challenges driving the Babel program. The speech recognition community has spent years improving the performance of English automatic speech recognition (ASR) systems. However, applying techniques commonly used for English ASR to other languages has often resulted in large performance gaps for those languages. In addition, there is an increasing number of languages for which there is a vital need for speech recognition technology but few existing training resources [1]. It is easy to envision a situation in which a large amount of recorded data in a language contains important information, but few people are available to analyze the language and no speech recognition technology exists for it. Having keyword search in that language to pick out important phrases would be extremely beneficial.

The languages addressed in the Babel program are drawn from a variety of different language families (e.g., Afro-Asiatic, Niger-Congo, Sino-Tibetan, Austronesian, Dravidian, and Altaic). Consistent with the differences among language families, the languages have different phonotactic, phonological, tonal, morphological, and syntactic properties.

The program is divided into two phases, Phase I and Phase II, each of which is divided into two periods. In the 28-month Phase I, 75-100% of the data is transcribed, while in the 24-month Phase II, only 50% of the data is transcribed. The first 16-month period of Phase I focuses on telephone speech, while the next 12-month period uses both telephone and non-telephone speech. These channel conditions continue in Phase II.

During each program period, researchers work with a set of development languages to develop new methods. Between four and seven development languages are provided per period, with the number of languages increasing (and the development time decreasing) as the program progresses. At the end of each period, researchers are evaluated on an unseen surprise language, with constraints on both system build time and the amount of available transcribed data. These constraints help to put the focus on developing methods that are robust across different languages, rather than tailored to specific languages. For this reason, the development languages are not identified until the kickoff meetings for each program period, and the surprise languages are revealed only at the beginning of each evaluation exercise.

In addition to challenges associated with limited transcriptions and build time, technical challenges include methods that are effective across languages, robustness to speech recorded in noisy conditions with channel diversity, effective keyword search algorithms for speech, and analysis of factors contributing to system performance.

Evaluation

All research teams are evaluated on a keyword search (KWS) task for both the development and surprise languages. The goal of the KWS task is to find all of the occurrences of a "keyword", i.e., a sequence of one or more words in a language's original orthography, in an audio corpus of unsegmented speech data [1,2].

April 2013 marked the completion of the base period of the Babel program, and teams were evaluated on their KWS performance for the Vietnamese surprise language task. In the larger condition, the training set consisted of 100 hours of speech, with language resources limited to the supplied language pack (BaseLR) and no test audio reuse (NTAR); this condition is known as FullLP + BaseLR + NTAR. In the smaller condition, the training set consisted of 10 hours of speech, again under the BaseLR and NTAR constraints; this condition is known as LimitedLP + BaseLR + NTAR [2].

The performance of the FullLP and LimitedLP systems is measured in terms of Actual Term Weighted Value (ATWV) for keyword search and Token Error Rate (% TER) for transcription accuracy.
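
To make the metrics concrete, below is a minimal sketch of how ATWV and TER are computed, following the standard NIST definitions used in OpenKWS (beta = 999.9, with one non-target trial per second of speech); the function names and toy counts are illustrative, not part of the official scoring tools.

```python
# Minimal sketch of ATWV and TER scoring under the standard NIST
# definitions used in OpenKWS.  Per-keyword hit counts are assumed to
# come from aligning system detections against the reference.

BETA = 999.9  # cost/value ratio from the OpenKWS evaluation plan

def atwv(per_keyword_counts, speech_seconds):
    """per_keyword_counts: list of (n_true, n_correct, n_false_alarm)
    tuples, one per keyword.  Keywords with no true occurrences are
    excluded from the average, as in the official metric."""
    terms = []
    for n_true, n_correct, n_fa in per_keyword_counts:
        if n_true == 0:
            continue  # keyword absent from the reference: not scored
        p_miss = 1.0 - n_correct / n_true
        # non-target trials: roughly one per second of speech, minus targets
        p_fa = n_fa / (speech_seconds - n_true)
        terms.append(1.0 - (p_miss + BETA * p_fa))
    return sum(terms) / len(terms) if terms else 0.0

def token_error_rate(n_sub, n_del, n_ins, n_ref_tokens):
    """TER (%): substitutions + deletions + insertions over reference tokens."""
    return 100.0 * (n_sub + n_del + n_ins) / n_ref_tokens

# Toy usage: two keywords scored over 10 hours (36,000 s) of speech.
print(atwv([(5, 4, 2), (12, 9, 0)], speech_seconds=36000.0))
```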

Below we highlight the system architectures and results of the four teams participating in the program.

Team Babelon

The Babelon team consists of BBN (lead), Brno University of Technology, Johns Hopkins University, LIMSI/Vocapia, Massachusetts Institute of Technology, and North-West University. In addition to improving fundamental ASR, the principal focus of the Babelon team is to design and implement the core KWS technology, which goes well beyond ASR technology. The most critical areas for this effort thus far are:

  • Robust acoustic features using Neural Network (NN) based feature extraction [3], which improved all TER and KWS results by 8-10 points absolute;

  • A "white listing" method [4] modified for the unknown keyword condition to guarantee very high recall with minimal increase in computation and memory;

  • Score normalization techniques [5] to make scores consistent across keywords and to optimize performance, resulting in large gains over the unnormalized basic KWS system (an illustrative sketch appears after this list);

  • Semi-supervised training methods for use with 10 hours or less of transcribed audio data, which derive significant gains from the untranscribed audio for both the acoustic and language models [6], as well as further improvements in the acoustic feature transforms [7];

  • Deep Neural Network (DNN) acoustic models [8] that give large improvements within a Hidden Markov Model (HMM) framework, both separately and in combination with more traditional Gaussian Mixture Model (GMM) based acoustic models.
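
To give a rough feel for the score normalization idea, the sketch below applies keyword-specific sum-to-one normalization, a common recipe in the KWS literature. It is only an illustrative variant: the techniques Babelon actually used are described in [4,5], and the names and numbers here are placeholders.

```python
# Illustrative keyword-specific score normalization (sum-to-one with an
# optional exponent).  Without normalization, a single global detection
# threshold treats rare and frequent keywords very differently.

def normalize_scores(hits, gamma=1.0):
    """hits: dict mapping keyword -> list of (time, raw_score) detections.
    Returns the same structure with scores normalized per keyword so a
    single global threshold behaves consistently across keywords."""
    normalized = {}
    for kw, detections in hits.items():
        total = sum(score ** gamma for _, score in detections)
        normalized[kw] = [(t, (score ** gamma) / total if total > 0 else 0.0)
                          for t, score in detections]
    return normalized

# Toy usage: the rare keyword's single strong hit keeps a high normalized
# score, while the frequent keyword's many weak hits are damped.
hits = {"ha noi": [(12.3, 0.9)],
        "va": [(1.0, 0.2), (7.5, 0.3), (9.1, 0.25)]}
print(normalize_scores(hits))
```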

The primary KWS system is a combination of different recognizers using different acoustic or language models across different sites within the Babelon team, including (1) HMM systems from BBN, (2) DNN and HMM systems from Brno University of Technology, and (3) HMM systems from LIMSI/Vocapia. All three systems use the robust NN-based features, and the scores of each system output are normalized before combination. The team also developed single systems, and the best single-system result was never more than 10% behind the combined result.

Team LORELEI

LORELEI is an IBM-led consortium participating in the Babel program, and includes researchers from Cambridge University, Columbia University, CUNY Queens College, New York University, RWTH Aachen, and the University of Southern California. The approach to the Babel problem taken by LORELEI has emphasized combining search results from multiple indexes produced by a diverse collection of speech recognition systems. The team has focused on system combination for three reasons: it can produce the best performance in the face of limited training data and acoustically challenging material; it allows a wide range of tradeoffs between computational requirements and keyword search performance by varying the number and complexity of the component speech recognition systems; and it provides a good environment in which to implement and test new ideas. In addition to fundamental work on speech recognition and keyword search technologies, the consortium is also pursuing work in automatic morphology, prosody, modeling of discourse structure, and machine learning, all with the aim of improving keyword search on new languages.

The primary entries from LORELEI in the surprise language evaluation on Vietnamese used the same general architecture for indexing and search. First, all conversation sides were decoded with a speaker-independent acoustic model, and the transcription output was post-processed to produce a segmentation of the evaluation data. Next, a set of speech transcription systems, most of which were multi-pass systems using speaker adaptation, were run to produce word-level lattices. Then, as the final step in indexing, the lattices from each transcription system were post-processed to produce lexical and phonetic indexes for keyword search. All indexes were structured as weighted finite-state transducers.

The LORELEI primary full language pack evaluation system combined search results from six different speech recognition systems: four using neural-network acoustic models and two using GMM acoustic models with neural-network features. One of the neural-network acoustic models was speaker-independent, while the other five models were speaker-adapted. Three of the models performed explicit modeling of tone and used pitch features, while the other three did not.

Likewise, the LORELEI primary limited language pack evaluation system combined search results from six different speech recognition systems: one conventional GMM system, two GMM systems using neural-network features, and three systems using neural-network acoustic models. One of the neural-network acoustic models was speaker-independent, while the other five models were speaker-adapted. A notable feature of the limited language pack system is that one of the neural-network feature systems used a recurrent neural network for feature extraction.

Team RADICAL

The RADICAL consortium is the only university-led consortium in Babel, consisting of Carnegie Mellon University (lead and integrator, Pittsburgh and Silicon Valley campuses), The Johns Hopkins University, Karlsruhe Institute of Technology, Saarland University, and Mobile Technologies. Systems are developed using the Janus and Kaldi toolkits, which are benchmarked internally and combined at suitable points of the pipeline.

The overall system architecture of the RADICAL submissions to the OpenKWS 2013 evaluation [2] is best described as follows:

  • A fast HMM-based segmentation and initial decoding pass using a BNF-GMM system trained on the most restricted "LimitedLP-BaseLR" condition, which is then used by all further processing;

  • A number of individual Kaldi- and Janus-based systems, designed to be complementary to each other;

  • Confusion network combination and ROVER system combination steps to optimize overall TER;

  • CombMNZ-based system combination to optimize overall ATWV (see the sketch after this list).
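
The CombMNZ step can be illustrated with a short sketch. CombMNZ, a classic metasearch fusion rule, sums the scores the component systems assign to a hit and multiplies by the number of systems that found it, boosting detections the systems agree on. The time-alignment tolerance and data layout below are illustrative assumptions, not RADICAL's actual implementation.

```python
# Illustrative CombMNZ fusion of keyword hits from several systems.
# Hits from different systems that fall within `tolerance` seconds of
# each other are treated as the same occurrence.

def combine_mnz(system_hits, tolerance=0.5):
    """system_hits: list (one entry per system) of dicts mapping
    keyword -> list of (time, score).  Returns fused hit lists."""
    combined = {}  # keyword -> list of [time, score_sum, n_systems]
    for hits in system_hits:
        for kw, detections in hits.items():
            merged = combined.setdefault(kw, [])
            for t, s in detections:
                for entry in merged:
                    if abs(entry[0] - t) <= tolerance:
                        entry[1] += s      # accumulate the score sum
                        entry[2] += 1      # count agreeing systems
                        break
                else:
                    merged.append([t, s, 1])
    # CombMNZ: (sum of scores) x (number of systems with a nonzero score)
    return {kw: [(t, score_sum * n) for t, score_sum, n in entries]
            for kw, entries in combined.items()}

# Toy usage: two systems agree on a hit near t=12.3, so its fused score
# is (0.8 + 0.7) * 2; the singleton at t=40.0 is left at 0.3 * 1.
sys_a = {"ha noi": [(12.3, 0.8)]}
sys_b = {"ha noi": [(12.4, 0.7), (40.0, 0.3)]}
print(combine_mnz([sys_a, sys_b]))
```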

Janus-based systems use retrieval based on confusion networks, while Kaldi-based systems use OpenFst-based retrieval. System combination was found to be beneficial on all languages, both development and surprise. While development focused on techniques useful for the LimitedLP conditions, the 2013 evaluation systems were tuned for the primary FullLP condition first and foremost. The evaluation systems used a number of interesting techniques, such as:

  • A selection of tonal features for Vietnamese (as well as Cantonese, post-evaluation), for example Fundamental Frequency Variation features and two different pitch-based schemes; cross-adaptation and combination between different tonal features provide small additional gains;

  • Techniques to retrieve and verify hits based on acoustic similarity alone, i.e., using "zero resources", which could slightly improve performance;

  • Techniques to exploit the observation that keywords tend to (co-)occur in "bursts";

  • A number of Deep Neural Network acoustic models, using bottleneck (BNF) features and GMMs, hybrid models, or combinations thereof.

Team Swordfish

Swordfish is a relatively small team, consisting of ICSI (the lead and system developer), University of Washington, Northwestern University, Ohio State University, and Columbia University.

Given the team size, most of the effort was focused on improving single systems. Swordfish developed two systems that shared many components: one based on HTK, the other on Kaldi. In each case the front end incorporated hierarchical bottleneck neural networks whose inputs were vocal tract length normalization (VTLN)-warped mel-frequency cepstral coefficients (MFCCs) together with novel pitch and probability-of-voicing features generated by a neural network that took critical band autocorrelations as input. Speech/nonspeech segmentation was implemented with an MLP-FST approach. The HTK-based system used a cross-word triphone acoustic model with 16 mixtures/state for the FullLP case and an average of 12 mixtures/state for the LimitedLP case. The Kaldi-based system incorporated subspace Gaussian mixture model (SGMM) acoustic models.
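
The bottleneck idea behind this front end can be sketched briefly: a neural network with a deliberately narrow hidden layer is trained on phonetic targets, and the narrow layer's activations, rather than the classifier outputs, are used as features for the downstream GMM or SGMM system. The layer sizes and random weights below are placeholders, not the Swordfish configuration.

```python
import numpy as np

# Sketch of bottleneck feature extraction: run frames through a trained
# MLP and read features off the narrow hidden layer.  Random weights
# stand in for a network trained on phonetic targets.

rng = np.random.default_rng(0)
layer_sizes = [40, 1024, 42, 1024, 120]   # 42-unit bottleneck layer
weights = [rng.standard_normal((m, n)) * 0.01
           for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]
BOTTLENECK = 2  # take activations after the second weight matrix

def bottleneck_features(frames):
    """frames: (n_frames, 40) acoustic features, e.g. VTLN-warped MFCCs
    plus pitch/voicing.  Returns (n_frames, 42) bottleneck features."""
    h = frames
    for i, w in enumerate(weights, start=1):
        h = np.tanh(h @ w)
        if i == BOTTLENECK:
            return h  # later layers are only needed during training
    return h

print(bottleneck_features(rng.standard_normal((5, 40))).shape)  # (5, 42)
```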

For both systems, the primary LM was a standard Kneser-Ney smoothed trigram, but the team also experimented (for the LimitedLP) with sparse plus low-rank language modeling and in some cases obtained small improvements. The HTK-based system learned multiwords from the highest-weight non-zero entries in the sparse matrix. For Vietnamese, some pronunciation variants were collapsed across dialects. Swordfish's keyword search has thus far focused primarily on a word-based index (except for Cantonese, where character and word posting lists were merged), discarding occurrences where the time gap between adjacent words is more than 0.5 seconds.
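
A minimal sketch of the word-based search just described: a multi-word keyword is matched against time-stamped posting lists, and candidate occurrences are discarded when the gap between adjacent words exceeds 0.5 seconds. The index layout and scores here are illustrative placeholders.

```python
# Illustrative word-based keyword search over time-stamped posting lists.
# A multi-word keyword matches only when its words occur in order with
# gaps of at most MAX_GAP seconds between adjacent words.

MAX_GAP = 0.5  # maximum gap (seconds) between adjacent words of a hit

def search(index, keyword_words):
    """index: dict word -> list of (start, end, score) postings.
    Returns (start_time, combined_score) hits for the keyword phrase."""
    candidates = list(index.get(keyword_words[0], []))
    for word in keyword_words[1:]:
        extended = []
        for start, prev_end, score in candidates:
            for s, e, sc in index.get(word, []):
                if 0.0 <= s - prev_end <= MAX_GAP:
                    extended.append((start, e, score * sc))
        candidates = extended
    return [(start, score) for start, _, score in candidates]

# Toy usage: "ha noi" matches at t=3.1; the occurrence near t=20 is
# discarded because the gap between "ha" and "noi" exceeds 0.5 s.
index = {"ha":  [(3.1, 3.3, 0.9),  (20.0, 20.2, 0.8)],
         "noi": [(3.4, 3.7, 0.85), (21.5, 21.8, 0.9)]}
print(search(index, ["ha", "noi"]))
```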

Results

The performance of the surprise language FullLP and LimitedLP primary systems, measured in ATWV and TER as defined above, is summarized in Table 1:

Team       |  FullLP          |  LimitedLP
           |  TER (%)   ATWV  |  TER (%)   ATWV
-----------+------------------+-----------------
Babelon    |  45.0      0.625 |  55.9      0.434
LORELEI    |  52.1      0.545 |  66.1      0.300
RADICAL    |  51.0      0.452 |  65.9      0.223
Swordfish  |  55.9*     0.332*|  71.0*     0.120*

Table 1: Official NTAR condition surprise language results for the Base period. (* indicates single-system results; all other results are based on system combination.)

Acknowledgements

Thank you to Mary Harper and Ronner Silber of IARPA for their guidance and support in helping to prepare this article.

References

[1] "IARPA broad agency announcement IARPA-BAA-11-02," 2011, https://www.fbo.gov/utils/view?id= ba991564e4d781d75fd7ed54c9933599.

[2] "OpenKWS13 Keyword Search Evaluation Plan," March 2013, www.nist.gov/itl/iad/mig/upload/OpenKWS13-EvalPlan.pdf.

[3] M. Karafiat, F. Grezl, M. Hannemann, K. Vesely, and J. H. Cernocky, "BUT BABEL System for Spontaneous Cantonese," in INTERSPEECH, 2013.

[4] B. Zhang, R. Schwartz, S. Tsakalidis, L. Nguyen, and S. Matsoukas, "White listing and score normalization for keyword spotting of noisy speech," in INTERSPEECH, 2012.

[5] D. Karakos et al., "Score Normalization and System Combination for Improved Keyword Spotting," in ASRU, 2013.

[6] R. Hsiao et al., "Discriminative Semi-supervised Training for Keyword Search in Low Resource Languages," in ASRU, 2013.

[7] F. Grezl and M. Karafiat, "Semi-supervised Bootstrapping Approach for Neural Network Feature Extractor Training," in ASRU, 2013.

[8] K. Vesely, A. Ghoshal, L. Burget, and D. Povey, "Sequence-discriminative Training of Deep Neural Networks," in INTERSPEECH, 2013.

If you have comments, corrections, or additions to this article, please contact the author: Tara Sainath, tsainath [at] us [dot] ibm [dot] com.

Tara Sainath is a Research Staff Member at IBM T.J. Watson Research Center in New York. Her research interests are mainly in acoustic modeling. Email: tsainath@us.ibm.com