
Speaker Identification: Screaming, Stress and Non-Neutral Speech, is there speaker content?

John H.L. Hansen, Navid Shokouhi

SLTC Newsletter, November 2013

The field of speaker recognition has evolved significantly over the past twenty years, with great efforts worldwide from many groups, laboratories, and universities, especially those participating in the biennial U.S. NIST Speaker Recognition Evaluation (SRE) [1]. Recently, there has been great interest in the ability to perform effective speaker identification when speech is not produced under "neutral" conditions. Effective speaker recognition requires knowledge and careful signal processing/modeling strategies to address any mismatch that could exist between training and testing conditions. This article considers some past and recent efforts, as well as suggested directions, for speaker recognition when subjects move from a "neutral" speaking style through increased vocal effort to pure "screaming". In the United States, there has recently been discussion in the news regarding the ability to accurately perform speaker recognition when the audio stream consists of a subject screaming. Here, we illustrate a probe experiment, but first provide some background on speech under non-neutral conditions.

Some could argue that speech processing, and specifically speech recognition in non-neutral speaking conditions, began with a number of strategic studies in the mid-to-late 1980s in the area that became known as "speech under stress". A number of these studies focused on evaluating small-vocabulary speech recognition algorithms across multiple "speaking styles" or levels of "stress" [2-5]. This included a number of advancements from researchers at MIT Lincoln Laboratory in an approach termed "multi-style training", where training speech was captured in a range of speaking styles and models were trained on these simulated speaking conditions in order to anticipate variations seen in actual input speech data [3,4,5]. Some of the earliest forms of cepstral feature compensation, including cepstral mean compensation, were first formulated in this domain by Chen [6], Hansen et al. [7-10], and others to address speech under stress. Detection of speech under stress [21,22] has also motivated interest in exploring non-neutral speech processing.
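
For readers unfamiliar with the idea, the following is a minimal sketch of cepstral mean normalization, the simplest descendant of the cepstral compensation methods cited above; the library choice (librosa), file name, and parameters are illustrative assumptions, not details from those studies.

```python
# Minimal sketch of cepstral mean normalization (CMN), a descendant of the
# cepstral compensation ideas cited above. File name and parameters are
# hypothetical; librosa is assumed purely for convenience.
import numpy as np
import librosa

def cepstral_mean_normalize(cepstra):
    """Subtract the per-coefficient mean across frames, reducing stationary
    convolutive effects (channel, some style-induced spectral tilt)."""
    return cepstra - cepstra.mean(axis=1, keepdims=True)

signal, sr = librosa.load("utterance.wav", sr=8000)       # hypothetical file
mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13)   # shape: (13, n_frames)
mfcc_cmn = cepstral_mean_normalize(mfcc)
```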

Photo: The Great American Scream Machine, located at Six Flags Over Georgia (Atlanta, GA, USA)

As part of the effort to advance speech under stress research, the SUSAS ("Speech Under Simulated and Actual Stress") corpus was developed [7,11,12]. Part of the data collection for the SUSAS corpus included speech collected on two rides at an amusement park (Six Flags Over Georgia; Atlanta, GA): a free-fall ride named "Free Fall" and a roller coaster named "Scream Machine". This portion of the SUSAS corpus was collected in the mid-1980s, and while the research that followed focused on speech under stress, one component of the corpus remained unexplored: all subjects who produced the required set of words during rides on the coaster also produced many uncontrollable screams. Research studies in this area concentrated on formulating stress compensation methods for recognition of speech under stress, which included speaking styles, emotion, Lombard effect, and task stress [8,9,10,13]. Further work on the "interoperability of speech technology under stress" was addressed by the NATO Research Study Group RSG.10 [14]. A study using the SUSAS corpus, reported in [14], considered the impact of stress, Lombard effect, and emotion on speaker recognition systems; while only a limited amount of speech was available for testing, that study did illustrate how speaker identification under "non-neutral" speaking conditions was impacted by stress, emotion, or Lombard effect (results from [14] were shown in a plot comparing matched vs. mismatched speaker ID conditions using default neutral-trained speaker ID models).

Work continued in this domain with the goal of improved modeling of speech under stress, augmenting a traditional acoustic microphone with a physiological microphone (again, speech was collected on roller-coaster rides! - see the subject with a close-talk microphone and a p-mic positioned at his larynx with a black Velcro strap around his neck) [15,16]. Again, the focus was on analysis of the speech motor characteristics of speech under stress, illustrating how g-force and motion, as well as physical/cognitive stress, impact speech production.

Through these studies, it is clear that speech produced under stress results in significant changes to speech parameters, which directly impact speech technology, including speech recognition, speaker ID, speech coding, and other systems. Across these studies, however, one question remains: what about "screams"? Subjects who scream are producing an audio stream, but is there really "speaker content" in this data?

Can We Identify Speaker Content While Screaming?

Recently, the controversial question of whether speaker identification technology is mature enough to be used on screaming audio recordings has come into the spotlight [17], along with subsequent forensic analysis of screams for use in recent courtroom testimony [18]. While many changes in production have been documented in earlier studies on speech under stress, unique abnormalities come into the picture when dealing with "screaming speech". In this part of the letter, we intend to point out the difficulty of recognizing speaker identity when the test speech consists of instances of the speaker screaming. A number of subjects from the Center for Robust Speech Systems (CRSS) at the University of Texas at Dallas participated in a small "screaming" data collection. Both spontaneous and text-dependent neutral speech was recorded for each subject. The subjects were also asked to scream during the recording sessions.

Issues with collecting scream data:

One of the challenges faced in capturing audio data for this domain is that there is no unique way to define a scream. Even when subjects are asked to imagine they are in a particular situation described to them in detail, it is sometimes difficult for them to scream naturally. The "flavors" of a scream could include (not comprehensive!):

  • A scream when someone is watching a horror film at the movies (i.e., related to fear)
  • A scream when a person is riding on a roller-coaster (i.e., perhaps related to fear, anxiety or inability to stop what is happening, etc.)
  • A scream when you see someone you have not seen for a long time or when meeting a famous person (i.e., perhaps related to surprise, happiness, etc.)
  • The actual sound which is made when someone screams - there is no exact definition as to what this should contain or include.

Another issue with screams and speech technology is that screams are normally of short duration. In order to focus on just one source of variability (i.e., the screaming aspect of speech production), it is important that we set aside the problem of short duration, which, unfortunately, imposes the unrealistic assumption that long-duration screaming samples are available. The last obstacle in capturing screams, which is also a factor in collecting stressed speech and high vocal effort data, is clipping, since in controlled settings an automatic gain control may not exist at the input to the A/D converter. For the study conducted by CRSS-UTDallas, this was managed by adjusting the microphone gain so that the recorded signal was not too weak for neutral speech while the waveform was still represented without any clipping when the subjects screamed, which also places restrictions on the data. A simple check of this kind is sketched below.
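
As an illustration of such a check, a simple post-hoc clipping detector might look like the following sketch; the library (soundfile), file name, and tolerance are assumptions for illustration only.

```python
# Minimal sketch of a post-hoc clipping check. The file name and the
# thresholds below are hypothetical, chosen only for illustration.
import numpy as np
import soundfile as sf  # assumption: pysoundfile is available

def clipping_ratio(samples, full_scale=0.999):
    """Fraction of samples at or beyond full scale; runs of such samples
    are the classic symptom of a clipped A/D input."""
    return float(np.mean(np.abs(samples) >= full_scale))

audio, sr = sf.read("scream_take1.wav")  # float samples in [-1, 1]
if clipping_ratio(audio) > 1e-3:         # illustrative tolerance
    print("Likely clipped: lower the input gain and re-record.")
```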

Prior studies:

As we have noted, changes in speech production have been extensively investigated for stress, Lombard effect, and different emotions [7,19], as has their impact on speech recognition [2-10]. Their effects on speaker recognition have also been measured experimentally [20]. However, to the best of our knowledge, no detailed study is available on the detrimental effect of screaming on speaker recognition. It is reasonable to predict that the speech production changes in screams harm both human-based and automatic speaker identification, as they do under different stress and emotion conditions [13,14,10].

In this article, we do not intend to propose a technique to compensate for the mismatch between training and testing when the available test data consists of screaming, nor do we attempt to improve speaker identification for screaming data. The goal here is merely to raise the question of speaker identification from screams and to suggest that further research on the issue is needed. A related challenge was conducted in collaboration with the U.S. Federal Law Enforcement Training Center (FLETC), which included data recorded while trainees completed a simulated hostage scenario [16]. In that case, law enforcement trainees were put in a simulated but highly stressful situation, which consequently included a large amount of screaming [16,20]. In that study, speaker recognition accuracy dropped from 91.7% for low stress levels to 70% for high stress levels.

Figure 1: (a) Neutral Speech, and (b) Scream from the same speaker

Experiments and results:

In the present CRSS-UTDallas study, a number of subjects participated in data collection. To illustrate the significant change in speech structure, Fig. 1 shows sample spectrograms of the same speaker producing neutral speech (Fig. 1(a)) and screaming (Fig. 1(b)). Figure 2 shows the CRSS-UTDallas website, which includes sample audio clips of subjects screaming for interested readers to listen to.
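
For readers who wish to produce a comparison like Fig. 1 from their own recordings, a minimal spectrogram sketch follows; the file names and analysis parameters are illustrative assumptions, not those used for the figure.

```python
# Minimal sketch for plotting a neutral vs. scream spectrogram pair.
# File names are hypothetical; STFT parameters are illustrative.
import numpy as np
import matplotlib.pyplot as plt
import librosa
import librosa.display

def plot_spectrogram(ax, path, title):
    y, sr = librosa.load(path, sr=None)
    S_db = librosa.amplitude_to_db(np.abs(librosa.stft(y, n_fft=512)),
                                   ref=np.max)
    librosa.display.specshow(S_db, sr=sr, x_axis="time", y_axis="hz", ax=ax)
    ax.set_title(title)

fig, (ax_a, ax_b) = plt.subplots(2, 1, figsize=(8, 6))
plot_spectrogram(ax_a, "neutral.wav", "(a) Neutral speech")
plot_spectrogram(ax_b, "scream.wav", "(b) Scream")
fig.tight_layout()
plt.show()
```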

With respect to the question of speaker recognition under screaming, a sample probe experiment was performed by CRSS-UTDallas. In these experiments, trials consisted of train/test pairs in a speaker verification task. A subset of trials was randomly selected to be evaluated by human subjects (care was taken to ensure that the listeners had no prior exposure to the specific recordings of the speakers under neutral/screaming conditions). The recognition system employed was based on maximum a posteriori (MAP) adaptation of Gaussian mixture models (GMMs) trained for each model speaker (more robust speaker ID systems are available, but for the purposes of this exercise we wanted to simply explore the performance of a traditional baseline system). The test files in the trials were scored against the GMMs, and the resulting scores were used to obtain overall system accuracy. Performance was evaluated by computing the equal error rate (EER) over the ensemble of trials. Again, it is noted that special care was taken to omit all possible mismatches between development and enrollment data (such as channel, session, and microphone), so that any performance deviation would be due only to the presence or absence of screaming. For the automatic speaker recognition system, speaker ID performance for screaming test files was in the range of 40-45% EER, depending on which speaker was being evaluated. The condition also affects human listeners: the highest accuracy obtained by CRSS-UTDallas lab members performing the random listener evaluation was about 25%, even though the listeners knew the subjects personally. Sample audio is available at http://crss.utdallas.edu/Projects/SID_Scream/
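
To make this kind of baseline concrete, the sketch below shows mean-only MAP adaptation of a GMM toward a universal background model (UBM), log-likelihood-ratio scoring, and EER computation, in the spirit of classic GMM-UBM systems; it uses scikit-learn with hypothetical data shapes and names, and is not the CRSS implementation.

```python
# Minimal GMM-UBM sketch: UBM trained on pooled features, mean-only MAP
# adaptation per enrolled speaker, log-likelihood-ratio trial scoring,
# and EER over all trials. All names/shapes are illustrative assumptions.
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.metrics import roc_curve

def train_ubm(features, n_components=64):
    """Fit the universal background model on pooled (n_frames, n_dims) data."""
    ubm = GaussianMixture(n_components=n_components, covariance_type="diag")
    ubm.fit(features)
    return ubm

def map_adapt_means(ubm, features, relevance=16.0):
    """Classic relevance-MAP adaptation of the component means only."""
    post = ubm.predict_proba(features)          # (n_frames, n_components)
    n_k = post.sum(axis=0)                      # soft counts per component
    ex_k = post.T @ features / np.maximum(n_k[:, None], 1e-8)  # E[x | k]
    alpha = (n_k / (n_k + relevance))[:, None]
    speaker = GaussianMixture(n_components=ubm.n_components,
                              covariance_type="diag")
    # Reuse UBM weights/covariances; only the means are adapted.
    speaker.weights_, speaker.covariances_ = ubm.weights_, ubm.covariances_
    speaker.precisions_cholesky_ = ubm.precisions_cholesky_
    speaker.means_ = alpha * ex_k + (1.0 - alpha) * ubm.means_
    return speaker

def llr_score(speaker, ubm, test_features):
    """Average per-frame log-likelihood ratio of speaker model vs. UBM."""
    return speaker.score(test_features) - ubm.score(test_features)

def equal_error_rate(labels, scores):
    """EER: the operating point where false-accept = false-reject rate."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))
    return (fpr[idx] + fnr[idx]) / 2.0
```

In use, one model is adapted per enrolled speaker from the shared UBM, each train/test trial is scored with llr_score, and the EER is computed once over all trial scores together with their target/impostor labels.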

Figure 2: CRSS-UTDallas website with sample audio files of subjects under "scream" conditions.

The results here suggest that, with neutral-trained models, speaker recognition on "screaming" audio is simply not effective or reliable. While further research could yield more effective solutions, the current technology suggests that sufficient speaker identity information is not contained within a scream audio stream, either for automatic speaker ID systems or for human listeners. One final note to all speech researchers wanting to explore this question of speaker ID and screaming (particularly if you are on a university campus!): be sure to notify your neighboring labs as well as your campus police beforehand if you plan on collecting such data, just in case someone hears people screaming down the hallways at your institution!


References

[1] NIST SRE - Speaker Recognition Evaluation: http://www.nist.gov/itl/iad/mig/sre.cfm

[2] P.K. Rajasekaran, G. Doddington, J. Picone, "Recognition of speech under stress and in noise," IEEE ICASSP-1986, pp. 733-736, Tokyo, Japan, April 1986.

[3] R. Lippmann, E. Martin, D. Paul, "Multi-style training for robust isolated-word speech recognition," IEEE ICASSP-1987, pp. 705-708, Dallas, TX, April 1987.

[4] D. Paul, "A speaker-stress resistant HMM isolated word recognizer," IEEE ICASSP-1987, pp. 713-716, Dallas, TX, April 1987.

[5] D. Paul, E. Martin, "Speaker stress-resistant continuous speech recognition," IEEE ICASSP-1988, pp. 283-286, New York, NY, April 1988.

[6] Y. Chen, "Cepstral domain talker stress compensation for robust speech recognition," IEEE Trans. Acoustics, Speech and Signal Processing, pp. 433-439, April 1988.

[7] J.H.L. Hansen, "Analysis and Compensation of Stressed and Noisy Speech with Application to Robust Automatic Recognition," Ph.D. Thesis, 429 pgs., School of Electrical Engineering, Georgia Institute of Technology, July 1988.

[8] J.H.L. Hansen, M. Clements, "Stress Compensation and Noise Reduction Algorithms for Robust Speech Recognition," IEEE ICASSP-1989, pp. 266-269, Glasgow, Scotland, May 1989.

[9] J.H.L. Hansen, "Adaptive Source Generator Compensation and Enhancement for Speech Recognition in Noisy Stressful Environments," IEEE ICASSP-1993, pp. 95-98, Minneapolis, Minnesota, April 1993.

[10] J.H.L. Hansen, "Morphological Constrained Enhancement with Adaptive Cepstral Compensation (MCE-ACC) for Speech Recognition in Noise and Lombard Effect," IEEE Trans. Speech & Audio Processing, SPECIAL ISSUE: Robust Speech Recognition, pp. 598-614, Oct. 1994.

[11] LDC - Linguistics Data Consortium: the SUSAS Speech Under Simulated and Actual Stress database: http://catalog.ldc.upenn.edu/LDC99S78

[12] J.H.L. Hansen, S. Bou-Ghazale, "Getting Started with SUSAS: A Speech Under Simulated and Actual Stress Database," EUROSPEECH-97, vol. 4, pp. 1743-1746, Rhodes, Greece, Sept. 1997.

[13] J.H.L. Hansen, "Analysis and Compensation of Speech under Stress and Noise for Environmental Robustness in Speech Recognition," Speech Communication, Special Issue on Speech Under Stress, pp. 151-170, Nov. 1996.

[14] J.H.L. Hansen, C. Swail, A.J. South, R.K. Moore, H. Steeneken, E.J. Cupples, T. Anderson, C.R.A. Vloeberghs, I. Trancoso, P. Verlinde, "The Impact of Speech Under 'Stress' on Military Speech Technology," NATO Research & Technology Organization RTO-TR-10, AC/323(IST)TP/5 IST/TG-01, March 2000 (ISBN: 92-837-1027-4).

[15] D.S. Finan, J.H.L. Hansen, "Toward a Meaningful Model of Speech Under Stress," 12th Conference on Motor Speech (Speech Motor Control Track), Albuquerque, NM, March 2004.

[16] E. Ruzanski, J.H.L. Hansen, D. Finan, J. Meyerhoff, "Improved 'TEO' Feature-based Automatic Stress Detection Using Physiological and Acoustic Speech Sensors," ISCA INTERSPEECH-2005, pp. 2653-2656, Lisbon, Portugal, Sept. 2005.

[17] C. Ross. (2013, June 6). Fla. judge to decide on whether 911 scream analysis is admissible in Zimmerman trial [The Daily Caller]. Available: http://dailycaller.com/2013/06/06/fla-judge-to-decide-on-whether-911-scream-analysis-is-admissible-in-zimmerman-trial/

[18] S. Skurka. (2013, July 2). Day 6 of the Zimmerman Trial: Murder or Self-Defence? - The FBI Audio Voice Analyst [The Huffington Post-Canada]. Available: http://www.huffingtonpost.ca/steven-skurka/zimmerman-trial_b_3532604.html

[19] J.H.L. Hansen, "Evaluation of Acoustic Correlates of Speech Under Stress for Robust Speech Recognition," IEEE Proc. 15th Annual Northeast Bioengineering Conference, pp. 31-32, Boston, Mass., March 1989.

[20] J.H.L. Hansen, E. Ruzanski, H. Boril, J. Meyerhoff, "TEO-based speaker stress assessment using hybrid classification and tracking schemes," Inter. Journal of Speech Technology (Springer), vol. 15, issue 3, pp. 295-311, Sept. 2012.

[21] B. Womack, J.H.L. Hansen, "N-channel hidden Markov models for combined stress speech classification and recognition," IEEE Trans. Speech and Audio Processing, vol. 7, pp. 668-677, 1999.

[22] G. Zhou, J.H.L. Hansen, J.F. Kaiser, "Nonlinear Feature Based Classification of Speech under Stress," IEEE Trans. Speech and Audio Processing, vol. 9, no. 2, pp. 201-216, March 2001.

John H.L. Hansen serves as Associate Dean for Research in the Erik Jonsson School of Engineering & Computer Science, as well as Professor of Electrical Engineering and in the School of Behavioral and Brain Sciences, at The University of Texas at Dallas (UTDallas). At UTDallas, he leads the Center for Robust Speech Systems (CRSS). His research interests are in speech processing, speaker modeling, and human-machine interaction.

Navid Shokouhi is a PhD candidate in the Department of Electrical Engineering at the University of Texas at Dallas, Erik Jonsson School of Engineering. He works under the supervision of Dr. John Hansen in the Center for Robust Speech Systems. His research interests are speech and speaker recognition in co-channel speech signals.