
Speech Synthesis Perfects Everyone’s Singing

Minghui Dong, Nancy Chen, Haizhou Li

SLTC Newsletter, February 2014

Singing is more expressive than speaking. While singing is popular, singing well is nontrivial, especially for songs that demand advanced vocal skills. Among other challenges, a singer must sing in tune and keep the correct rhythm. Even professional singers need intensive practice to perfect their vocal skills and to master particular singing techniques, such as vibrato and resonance tuning. Recently, the Institute for Infocomm Research (I2R) in Singapore developed a technology called Speech2Singing, which converts the singing voice of non-professional singers (or even spoken utterances) into singing with the correct pitch and timing.

The human voice includes three essential elements: content, prosody and timbre. Content is the literal meaning of the language conveyed by the voice. Prosody consists of the pitch, duration (timing) and loudness of the voice, and characterizes its emotion and expressiveness. In singing, prosody is often referred to as melody (a combination of pitch and rhythm). Timbre, on the other hand, determines the identity of a person’s voice. I2R’s Speech2Singing technology keeps the content and timbre of the voice unchanged, but modifies the prosody to follow the correct melody, thereby perfecting the user’s singing voice.

I2R’s Speech2Singing works as follows. Singing voices of professional singers are recorded as model voice templates and stored in a database. When a user subsequently sings a song or reads its lyrics, the recording of each line is compared with the corresponding line in the database. The user’s vocal signal is first decomposed into feature sequences, including pitch. To obtain the correct timing, speech recognition technology identifies the phonetic units in both the model voice and the user’s voice, and the timing of the user’s voice is adjusted to match that of the model voice by means of dynamic time warping [1]. The correct pitch information is taken directly from the model singing voice. An enhanced singing voice is then synthesized from the adjusted feature sequences, which now carry the correct pitch and timing. Finally, the reconstructed, time-synchronized singing voice is overlaid with background music.
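The timing and pitch steps above can be illustrated with a minimal sketch. It is not I2R’s implementation: the one-number-per-frame features, the function names `dtw_path` and `retime_to_model`, and the absolute-difference frame distance are all assumptions chosen for brevity. A real system would align phonetic units or spectral feature vectors and would resynthesize audio from the adjusted features.

```python
def dtw_path(user, model):
    """Align two 1-D feature sequences with classic dynamic time warping.

    Returns a list of (user_index, model_index) pairs from start to end.
    """
    n, m = len(user), len(model)
    inf = float("inf")
    # cost[i][j] = cheapest alignment of user[:i] with model[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dist = abs(user[i - 1] - model[j - 1])  # toy frame distance (assumption)
            cost[i][j] = dist + min(
                cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1]
            )

    # Backtrack from the end, always stepping to the cheapest predecessor.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    path.reverse()
    return path


def retime_to_model(user_feats, user_timbre, model_feats, model_pitch):
    """Re-time the user's frames to the model's timeline and adopt its pitch.

    Each output frame keeps the user's timbre but takes the model's pitch.
    """
    # For every model frame, remember one user frame aligned to it
    # (later pairs on the path overwrite earlier ones).
    user_for_model = {}
    for ui, mi in dtw_path(user_feats, model_feats):
        user_for_model[mi] = ui
    return [
        (user_timbre[user_for_model[mi]], model_pitch[mi])
        for mi in range(len(model_feats))
    ]


# Hypothetical data: per-frame phone-like features, timbre labels, pitch in Hz.
user_feats = [1, 1, 2, 3, 3, 3]
user_timbre = ["u0", "u1", "u2", "u3", "u4", "u5"]
model_feats = [1, 2, 2, 2, 3]
model_pitch = [220.0, 247.0, 262.0, 247.0, 220.0]

frames = retime_to_model(user_feats, user_timbre, model_feats, model_pitch)
```

In this toy example the user’s single frame of unit 2 is stretched across three model frames, and the three frames of unit 3 are compressed into one, so the output has the model’s length and pitch while every frame still carries the user’s own timbre label.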

Two existing technologies related to Speech2Singing are Auto-Tune [2] and score-based conversion [3]. The popular Auto-Tune method shifts the user’s pitch to the closest note in a pre-defined scale. While it generally works well for singing where the melody is not far off tune, it is not suitable for converting spoken voice. In contrast to Auto-Tune, I2R’s method uses a reference melody to guide the conversion, so that corrections can still be made even when the melody is completely off tune.
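The difference between the two correction strategies can be sketched in a few lines. This is a generic nearest-note quantizer, not the algorithm of the commercial Auto-Tune product; the function names `snap_to_scale` and `follow_reference`, the C major scale, and the example frequencies are assumptions for illustration. The code uses the standard equal-temperament mapping between frequency and MIDI note number (A4 = 440 Hz = note 69).

```python
import math

C_MAJOR = {0, 2, 4, 5, 7, 9, 11}  # pitch classes of the assumed scale


def hz_to_midi(freq_hz):
    """Convert a frequency in Hz to a (fractional) MIDI note number."""
    return 69 + 12 * math.log2(freq_hz / 440.0)


def midi_to_hz(note):
    """Convert a MIDI note number back to a frequency in Hz."""
    return 440.0 * 2 ** ((note - 69) / 12)


def snap_to_scale(freq_hz, scale=C_MAJOR):
    """Nearest-note correction: move a pitch to the closest note of the scale."""
    midi = hz_to_midi(freq_hz)
    center = round(midi)
    # Search one octave either side; any non-empty scale has a note in range.
    candidates = [n for n in range(center - 12, center + 13) if n % 12 in scale]
    return midi_to_hz(min(candidates, key=lambda n: abs(n - midi)))


def follow_reference(freq_hz, reference_hz):
    """Reference-guided correction: the target comes from the model melody,
    so the input pitch does not influence the result."""
    return reference_hz


slightly_flat = snap_to_scale(430.0)      # sung a little below A4
spoken = snap_to_scale(200.0)             # spoken pitch, far from the melody
guided = follow_reference(200.0, 440.0)   # melody note supplied by a reference
```

With an intended note of A4, the slightly flat 430 Hz input is pulled to A4, but the 200 Hz spoken input is pulled to the nearest scale note near 196 Hz rather than to the melody. Only the reference-guided version recovers the intended note in that case.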

Although the score-based conversion method in [3] uses musical scores as a reference for changing the melody (pitch and timing), the reference melody is generated with mathematical models, which makes the synthesized singing voice less natural than one based on a human reference melody. I2R’s approach uses a professional singer’s melody as the reference model, so that every detail of the melody, such as the pitch envelope and vibrato (a regular, pulsating change in pitch), is preserved and imposed onto the synthesized singing voice. In addition, since the user’s voice can be mapped to the professional singer’s voice, I2R’s method allows the timing of each syllable of the user’s voice to be changed to match that of the professional singer’s voice.
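To make the contrast concrete, the sketch below generates a pitch contour for one score note using a purely mathematical vibrato model. The sinusoidal form, the function name `vibrato_contour`, and the 6 Hz rate, 50-cent depth and 100 frames-per-second resolution are illustrative assumptions, not the model used in [3].

```python
import math


def vibrato_contour(note_hz, duration_s, frame_rate=100,
                    rate_hz=6.0, depth_cents=50.0):
    """Return a per-frame pitch contour (Hz) for one note with sinusoidal vibrato.

    The pitch oscillates around note_hz by +/- depth_cents at rate_hz.
    """
    n_frames = int(round(duration_s * frame_rate))
    contour = []
    for k in range(n_frames):
        t = k / frame_rate
        # Deviation in cents, converted to a frequency ratio (1200 cents = 1 octave).
        cents = depth_cents * math.sin(2 * math.pi * rate_hz * t)
        contour.append(note_hz * 2 ** (cents / 1200))
    return contour


model_generated = vibrato_contour(440.0, 1.0)
```

A contour produced this way repeats with perfect regularity, whereas a contour taken from a professional singer’s recording carries the details of a real performance, which is the motivation for using a human reference.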

I2R has implemented Speech2Singing on mobile devices such as smartphones and tablets. This is the first software that automatically changes a user’s speech into a natural singing voice. The technology has been showcased on various occasions, such as I2R’s annual TechFest [4] and A*STAR’s MediaExploit [5], and has drawn attention from local and international media, including AFP [6], C-Net [7], and MediaCorp [8]. ‘Sing for Singapore’, the first public release of Speech2Singing, was launched for Singapore’s National Day in 2013 (Figure 1) [9,10], with an iOS version in the App Store [11] and an Android version on Google Play [12].

In 1961, an IBM 7094 became the first computer to sing (the song was “Daisy Bell”). Singing synthesis technology has progressed tremendously since then. Much as Photoshop perfects graphics, Speech2Singing helps perfect singing vocals.

Figure 1: Screenshots of the NDP 2013 App.


[1] L. Cen, M. Dong and P. Chan, "Template-based Personalized Singing Voice Synthesis," Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2012.


[3] T. Saitou, M. Goto, M. Unoki and M. Akagi, "Speech-to-Singing Synthesis: Vocal conversion from speaking voices to singing voices by controlling acoustic features unique to singing voices," Proc. 2007 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA2007), pp. 215–218, 2007.










Minghui Dong is a Scientist in the Department of Human Language Technology at the Institute for Infocomm Research, Singapore. His research interests include speech synthesis, singing voice synthesis, and voice conversion.

Nancy Chen is a Scientist in the Department of Human Language Technology at the Institute for Infocomm Research, Singapore. Her research interests include keyword search, pronunciation modeling, speech summarization, and computer-assisted language learning.

Haizhou Li is the Head of the Department of Human Language Technology at the Institute for Infocomm Research, Singapore. He is also a Conjoint Professor at the University of New South Wales, Australia.