
Speech and Audio Highlights from MediaEval 2013

Gareth J. F. Jones, Martha Larson

SLTC Newsletter, February 2014

MediaEval is a benchmarking initiative dedicated to evaluating new algorithms for multimedia access and retrieval. While it emphasizes the 'multi' in multimedia and focuses on the human and social aspects of multimedia tasks, speech and audio processing is a key component of several MediaEval tasks each year. This article gives an overview of the tasks with significant speech and audio elements from the MediaEval 2013 multimedia evaluation benchmark and looks ahead to the MediaEval 2014 campaign [1].

MediaEval 2013 featured a total of 12 tasks exploring aspects of multimedia indexing, search and interaction, five of which involved significant elements of speech and audio processing.

The “Spoken Web Search” (SWS) Task aimed to perform audio search in multiple languages and acoustic conditions for which very few resources are available to develop a solution for each individual language. The operational setting of the task was to imagine building a simple speech recognition system, or at least a spoken term detection (STD) or keyword spotting (KWS) system, for a new dialect, language or acoustic condition for which only a small number of audio examples are available. The research explored whether it is possible to do something useful (e.g. identify the topic of a query) using only these very limited resources.

The task involved searching for audio content within audio content using an audio query, and the dataset contained audio in nine different languages. Participants were required to build a language-independent audio search system that, given an audio query, could find the appropriate audio file(s) and the exact location(s) of the query term within these file(s). Evaluation was performed using standard NIST metrics together with some additional indicators. The SWS task at MediaEval 2013 expanded the size of the test dataset and the number of languages over the similar tasks held in 2011 and 2012. In addition, a baseline system was offered to first-time participants as a virtual kitchen application [2].
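A common zero-resource approach to this kind of query-by-example search is to compare frame-level acoustic features of the query and the utterance directly, typically with subsequence dynamic time warping (DTW). The following minimal sketch illustrates the idea in Python using numpy, with random feature matrices standing in for real MFCCs extracted from audio; it is an illustrative baseline under these assumptions, not a description of any particular participant's system.

    import numpy as np

    def subsequence_dtw(query, utterance):
        """Find the region of `utterance` that best matches `query`.
        Both arguments are (frames x dims) feature matrices, e.g. MFCCs.
        Returns (length-normalised cost, end frame of the best match)."""
        Q, U = len(query), len(utterance)
        # Frame-pair cosine distances between query and utterance frames.
        qn = query / np.linalg.norm(query, axis=1, keepdims=True)
        un = utterance / np.linalg.norm(utterance, axis=1, keepdims=True)
        dist = 1.0 - qn @ un.T                    # shape (Q, U)
        # Accumulated cost; the match may start at any utterance frame,
        # so the first query row carries no start penalty.
        acc = np.full((Q, U), np.inf)
        acc[0] = dist[0]
        for i in range(1, Q):
            acc[i, 0] = dist[i, 0] + acc[i - 1, 0]
            for j in range(1, U):
                acc[i, j] = dist[i, j] + min(acc[i - 1, j],      # stretch query
                                             acc[i, j - 1],      # stretch utterance
                                             acc[i - 1, j - 1])  # diagonal step
        end = int(np.argmin(acc[-1]))
        return acc[-1, end] / Q, end

    # Toy demo: plant a noisy copy of frames 80-110 as the "query".
    rng = np.random.default_rng(0)
    utterance = rng.normal(size=(200, 13))
    query = utterance[80:110] + 0.05 * rng.normal(size=(30, 13))
    cost, end = subsequence_dtw(query, utterance)
    print(f"best match ends near frame {end} with cost {cost:.3f}")

Many systems evaluated in this setting replace raw spectral features with phone posteriorgrams from a multilingual phone recognizer, which copes better with speaker and channel variation than the plain cosine distance used above.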

In 2014, the task will continue under the new name QUESST (Query by Example Search on Speech).

The Search and Hyperlinking (S&H) Task consisted of two sub-tasks: (i) answering known-item queries from a collection of broadcast TV material, and (ii) automatically linking anchors within the known item to other parts of the video collection. The S&H Task envisioned the following scenario: a user is searching for a segment of video that they know to be contained in a video collection; having found the segment, they may wish to see further information about some aspect of it. This use scenario is a refinement of the previous S&H task at MediaEval 2012.

The dataset for both sub-tasks was a collection of 1,260 hours of video provided by the BBC. The average length of a video was roughly 30 minutes, and most videos were in English. The collection was used both for training and testing of systems. Known items and the queries to locate them were created by volunteer subjects in sessions at the BBC offices, and relevant links were identified using crowdsourcing with Amazon Mechanical Turk. The BBC kindly provided human-generated textual metadata and manual transcripts for each video. Participants were also provided with the output of two automatic speech recognition (ASR) systems and with features created using automatic visual analysis.
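A natural text-retrieval baseline for the known-item sub-task is to cut the ASR transcripts into short segments and rank them against the query with TF-IDF weighting and cosine similarity. The sketch below shows this shape using scikit-learn; the programme identifiers, timestamps and transcript snippets are invented, and participant systems were generally more sophisticated, e.g. combining transcripts with metadata and visual features.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    # Hypothetical transcript segments: (video_id, start_seconds, text).
    segments = [
        ("prog_01", 0,   "welcome to the gardening programme today we plant roses"),
        ("prog_01", 120, "pruning roses in late winter encourages new growth"),
        ("prog_02", 0,   "tonight's news covers the local election results"),
        ("prog_03", 60,  "the chef prepares a traditional roast dinner"),
    ]

    vectorizer = TfidfVectorizer(stop_words="english")
    seg_matrix = vectorizer.fit_transform(text for _, _, text in segments)

    def search(query, top_k=3):
        """Rank transcript segments against a known-item query."""
        scores = cosine_similarity(vectorizer.transform([query]), seg_matrix).ravel()
        best = scores.argsort()[::-1][:top_k]
        return [(segments[i][0], segments[i][1], scores[i]) for i in best]

    for video, start, score in search("how do I prune my rose bushes"):
        print(f"{video} @ {start}s  score={score:.2f}")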

The Similar Segments in Social Speech Task was a new task at MediaEval 2013. The task involved finding segments similar to a query segment in a multimedia collection of informal, unstructured dialogues among members of a small community. The task was motivated by the following scenario. With users’ growing willingness to share personal activity information, the eventual expansion of social media to include social multimedia, such as video and audio recordings of casual interactions, seems inevitable. To unlock the potential value of this material, new methods need to be developed for searching such records, which in turn requires reliable models of the similarity between pairs of dialogue regions.

The specific motivating task was as follows. A new member has joined an organization or social group that has a small archive of conversations among members. He starts to listen, looking for any information that can help him better understand, participate in, enjoy, find friends in, and succeed in this group. As he listens to the archive (perhaps at random, perhaps based on some social tags, perhaps based on an initial keyword search), he finds something of interest. He marks this region of interest and requests “more like this”. The system returns a set of “jump-in” points, places in the archive to which he could jump and start listening/watching with the expectation of finding something similar.
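One simple way to realise the “more like this” operation, assuming transcripts of the dialogue regions are available, is to rank all regions by lexical similarity to the marked region and return their start times as jump-in points. The pure-Python sketch below does this with bag-of-words cosine similarity over invented conversation snippets; real systems could additionally exploit prosodic, temporal and speaker information.

    import math
    import re
    from collections import Counter

    # Hypothetical dialogue regions: (conversation_id, start_seconds, transcript).
    regions = [
        ("conv_01", 30,  "so how do new members usually join the hiking trips"),
        ("conv_01", 300, "the budget meeting ran long and nobody agreed on a venue"),
        ("conv_02", 45,  "if you want to join a hike just sign up by friday"),
        ("conv_02", 210, "last weekend's trail was muddy but the views were great"),
    ]

    def bow(text):
        """Bag-of-words vector for a transcript snippet."""
        return Counter(re.findall(r"[a-z']+", text.lower()))

    def cosine(a, b):
        dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
        norm = math.sqrt(sum(v * v for v in a.values())) \
             * math.sqrt(sum(v * v for v in b.values()))
        return dot / norm if norm else 0.0

    def more_like_this(marked_text, top_k=2):
        """Return jump-in points (conversation, start time) for the
        regions most similar to the user's marked region of interest."""
        q = bow(marked_text)
        ranked = sorted(regions, key=lambda r: cosine(q, bow(r[2])), reverse=True)
        return [(conv, start) for conv, start, _ in ranked[:top_k]]

    print(more_like_this("I'd like to hear more about joining the hiking group"))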

One dimension of MediaEval’s interest in audio processing is tasks relating to music. MusiClef was a new task at MediaEval 2013, having previously formed part of the CLEF evaluation benchmark [3]. The MusiClef 2013: Soundtrack Selection for Commercials Task aimed at analyzing music usage in TV commercials and determining music that fits a given commercial video. This task is usually carried out by music consultants, who select a song to advertise a particular brand or a given product. By contrast, the MusiClef 2013 task aimed at automating this process by taking into account both context- and content-based information about the video, the brand, and the music.

Music is composed to be emotionally expressive. The Emotion in Music Task was another new task at MediaEval 2013. It sought to develop tools for navigating today’s vast digital music libraries, based on the assumption that emotional associations provide an especially natural domain for indexing and recommendation. Predicting perceived emotion from audio poses a myriad of challenges, and a considerable amount of work has accordingly been dedicated to the development of automatic music emotion recognition (MER) systems. The corpus used for this task employed Creative Commons (CC) licensed music from the Free Music Archive (FMA), which enabled the content to be redistributed to the participants, with annotations created via crowdsourcing using Amazon Mechanical Turk.
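MER is commonly cast as a regression problem: summarise each clip with audio features (e.g. timbral and rhythmic descriptors) and predict the crowdsourced arousal and valence ratings. The sketch below illustrates only this framing, with synthetic features and ratings standing in for real data and a simple ridge-regression baseline from scikit-learn; it is not the task's official baseline.

    import numpy as np
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import cross_val_score

    # Synthetic stand-ins: one feature vector and one arousal rating per clip.
    rng = np.random.default_rng(1)
    n_clips, n_feats = 400, 40
    X = rng.normal(size=(n_clips, n_feats))          # clip-level audio features
    w = rng.normal(size=n_feats)                     # hidden "true" relationship
    arousal = np.tanh(X @ w / np.sqrt(n_feats)) + 0.1 * rng.normal(size=n_clips)

    # Ridge regression is a simple, common baseline for this kind of task;
    # a second model would be trained the same way for valence.
    scores = cross_val_score(Ridge(alpha=1.0), X, arousal, cv=5, scoring="r2")
    print(f"cross-validated R^2: {scores.mean():.2f} +/- {scores.std():.2f}")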

Other tasks at MediaEval 2013 included the Placing Task, which seeks to place images on a world map, that is, to automatically estimate the latitude/longitude coordinates at which a photograph was taken. The main Placing Task has featured at MediaEval for several years; a newly introduced secondary task for 2013, Placeability Prediction, asked participants to estimate the error of their predicted locations. Annotating images with this kind of geographical location tag, or geotag, has a number of applications in personalization, recommendation, crisis management and archiving, and currently the vast majority of images online are not labelled with this kind of data. The data for this task was drawn from Flickr. In comparison to previous editions of the task, the test set not only increased drastically in size, but was also derived according to different assumptions in order to model a more realistic use-case scenario.
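A classic text-based baseline for placing is nearest-neighbour matching on user-supplied tags: estimate a photo's coordinates from the training photo whose tags are most similar to its own. The sketch below shows this 1-NN idea in pure Python on a tiny invented training set; competitive systems combine tag statistics over geographic cells with visual features.

    # Hypothetical training photos: user-supplied tags plus known coordinates.
    train = [
        ({"eiffel", "tower", "paris", "night"},   (48.858, 2.294)),
        ({"louvre", "paris", "museum"},           (48.861, 2.336)),
        ({"brooklyn", "bridge", "nyc"},           (40.706, -73.997)),
        ({"statue", "liberty", "nyc", "harbour"}, (40.689, -74.045)),
    ]

    def jaccard(a, b):
        """Tag-set overlap: |intersection| / |union|."""
        return len(a & b) / len(a | b) if a | b else 0.0

    def place(tags):
        """1-nearest-neighbour baseline: return the coordinates of the
        training photo whose tag set best overlaps the query's."""
        return max(train, key=lambda t: jaccard(tags, t[0]))[1]

    print(place({"paris", "tower", "sunset"}))   # -> near the Eiffel Tower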

The Violent Scenes Detection Task, which ran for the third time at MediaEval 2013, derives directly from a Technicolor use case which aims at easing a user’s selection process from a movie database. The task was to automatically analyse movie content with the objective of identifying violent actions. Another returning task was the Social Event Detection (SED) Task, which required participants to discover social events and organize the related media items in event-specific clusters within a collection of Web multimedia content. Social events are defined as events that are planned by people, attended by people, and for which the social multimedia are also captured by people. The Visual Privacy Task (VPT) aimed at exploring how image processing, computer vision and scrambling techniques can deliver technological solutions to some visual privacy problems. The evaluation was performed using both video analytics algorithms and user studies, so as to provide both objective and subjective evaluation of privacy protection techniques.

The MediaEval 2013 campaign culminated in a very energetic and successful two-day workshop in Barcelona, Spain, in October 2013, attended by 100 task organisers and participants.

The tasks for each MediaEval campaign are chosen following an open call, based on the results of a public questionnaire exploring the research community's interest in them. The questionnaire for MediaEval 2014 has recently concluded, and the selection and details of the tasks to be offered are currently being finalised. Task registration will open in March 2014; details will be available from the MediaEval website [1].


Gareth J. F. Jones and Martha Larson are coordinators of the MediaEval Benchmarking Initiative for Multimedia Evaluation.

Full proceedings of MediaEval 2013 are available from: http://ceur-ws.org/Vol-1043/. More information can be found at http://www.multimediaeval.org/.