Visual search, which aims to retrieve information about a specific object, person or place depicted in a photo, has progressed dramatically in the past few years. Thanks to modern techniques for large-scale image description, indexing and matching, such image-based information retrieval can be conducted either in a structured image database dedicated to a given topic (e.g., paintings, book covers or monuments) or in an unstructured image database that is only weakly labeled (e.g., via user-input tags or surrounding text, including captions).
This Grand Challenge aims to explore tools that push this search paradigm further: how can unstructured multimedia databases be searched with audio-visual queries? This problem is already encountered in professional environments where large, semi-structured audio-visual assets, such as cinema and TV archives, are operationally managed. In these settings, relying on trained professionals such as archivists remains the rule, both to annotate part of the database beforehand and to conduct searches. Unfortunately, this workflow does not apply to the wildly unstructured repositories accessible on-line, especially for non-professional usage.
The challenge is thus to automatically extract as much information as possible from the audio-visual query (compact low-level audio-visual signatures, detection and recognition of text present in the images, detection and recognition of speech in the audio track, and more general semantic analysis of the audio-visual content) and to use it to search the intertwined textual, audio and visual components of the database.
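Purely as an illustrative sketch, and not as part of the challenge specification, the pipeline above could be organized as follows. The data structures, function names and the keyword-overlap search are placeholder assumptions standing in for real descriptor extraction, OCR, ASR and indexing engines:

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class QuerySignature:
    """Hypothetical container for the information extracted from an A/V query."""
    visual_descriptors: List[List[float]] = field(default_factory=list)  # compact low-level signatures
    ocr_text: List[str] = field(default_factory=list)                    # text detected in the frames
    transcript: List[str] = field(default_factory=list)                  # speech recognized in the audio track
    concepts: List[str] = field(default_factory=list)                    # higher-level semantic labels

def extract_signature(clip_path: str) -> QuerySignature:
    """Stand-in for the extraction stage (signatures, OCR, ASR, semantic analysis)."""
    # A real system would plug its own descriptor, OCR and ASR engines in here.
    return QuerySignature()

def search_database(signature: QuerySignature, database: Dict[str, str]) -> List[str]:
    """Stand-in for the multimodal search stage over the textual side of the database."""
    keywords = set(signature.ocr_text) | set(signature.transcript) | set(signature.concepts)
    return [doc_id for doc_id, text in database.items()
            if keywords & set(text.lower().split())]

if __name__ == "__main__":
    db = {"clip_001": "press conference after the summit, paris",
          "clip_002": "football final, stadium crowd"}
    sig = extract_signature("query.mp4")       # empty here: no real engines attached
    sig.transcript = ["press", "conference"]   # pretend ASR output, for the demo only
    print(search_database(sig, db))            # matches clip_001
```

A competitive entry would of course replace the keyword overlap with proper visual matching and text indexing; the sketch only fixes plausible interfaces between the extraction and search stages.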
In a scenario where the query video covers a public event and the database consists of one or several on-line repositories, it is reasonable to assume that the audio-visual content of interest has already benefited from some form of textual annotation (ranging from transcripts and synopses for archived TV news to simple user-generated tags). Automatically retrieving, sanitizing and exploiting this information should allow a system to produce useful captions for audio-visual queries.
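As a toy illustration of what "sanitizing" retrieved annotations could mean in practice, the function below merges noisy tags and synopsis fragments into a crude bag-of-words caption by keeping the most frequent informative tokens. The tokenization and filtering rules are assumptions made for the example, not a prescribed method:

```python
import re
from collections import Counter
from typing import List

STOPWORDS = {"the", "and", "for", "with", "after", "day"}  # tiny placeholder list

def sanitize_and_caption(annotations: List[str], top_k: int = 8) -> str:
    """Toy aggregation of noisy textual annotations into a single caption."""
    tokens: List[str] = []
    for text in annotations:
        # keep word-like tokens, drop very short ones and a few stopwords
        tokens += [w for w in re.findall(r"[\w'-]+", text.lower())
                   if len(w) > 2 and w not in STOPWORDS]
    most_common = [w for w, _ in Counter(tokens).most_common(top_k)]
    return " ".join(most_common)

# Example with made-up annotations retrieved for a public event
print(sanitize_and_caption([
    "Obama inauguration Washington 2013",
    "president obama swearing-in ceremony washington dc",
    "inauguration day crowd national mall",
]))
```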
This is exactly the scenario we propose to explore in this Grand Challenge: given a short video sequence with audio, stemming from the coverage of a public event, the system should produce precise textual information about it. The sought information is not an annotation at the category level (though such an annotation could serve as a powerful intermediate analysis to semantically expand the query), but a description at the level of event identity:
- Which event is it?
- When and where did it take place?
- What is its context?
- What is precisely happening in the audio-visual scene?
- In particular, who are the persons in the scene?
- Where are they in the image?
- What are they doing or saying?
Participants may tackle only part of this complete scenario, though systems addressing the whole pipeline while jointly exploiting visual and audio information are preferred. Regarding test-beds, participants should feel free to report results on any real data of their choice. In addition, Technicolor will provide, on demand, a selection of queries (short audio-visual clips with no additional information) and will qualitatively assess the relevance of the captions produced by participants’ systems for these queries.