Precise event recognition and description from video excerpts
Visual search that aims at retrieving information on a specific object, person or place in a photo has recently seen a surge of activity in both the academic and industrial worlds. It is now possible, for instance, to get precise information on a painting, a monument or a book by photographing it with a mobile phone. Such impressive systems rely on mature image description and matching technologies coupled with application-dependent databases of annotated images. However, they require drastically circumscribing the visual world to be queried (e.g., only paintings) and organizing a comprehensive information database accordingly.
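The matching step behind such systems can be caricatured as nearest-neighbour search over feature vectors linked to annotations. The sketch below is purely illustrative: the vectors, annotations and function names are assumptions, and a real system would use learned or engineered image descriptors rather than hand-written toy vectors.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def best_match(query_vec, database):
    """Return the annotation of the most similar database entry."""
    return max(database,
               key=lambda e: cosine_similarity(query_vec, e["vector"]))["annotation"]

# Toy annotated database; in practice the vectors would come from an
# image descriptor and the annotations from a curated, domain-specific index.
database = [
    {"vector": [0.9, 0.1, 0.0], "annotation": "Mona Lisa, Louvre, Paris"},
    {"vector": [0.1, 0.8, 0.3], "annotation": "Eiffel Tower, Paris"},
]

print(best_match([0.85, 0.15, 0.05], database))  # → Mona Lisa, Louvre, Paris
```

The need for such a curated, per-domain database is precisely the restriction this challenge seeks to relax.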
This Grand Challenge aims at exploring tools to extend this search paradigm in different directions: (1) The query is an excerpt from a public event’s video, that is, a small video and/or audio chunk from a longer coverage; (2) There is no strong contextual prior, thus ruling out the use of a specialized database designed for the purpose; (3) The sought output is a precise textual description of the audiovisual query “scene”.
Inspiration to attack this challenge could stem from studies on the “reverse” scenario: a number of recent works indeed propose to exploit text around online pictures or videos (captions, scripts, webpage texts) to retrieve visual material based on textual queries. Searching visual contents this way is interesting in its own right, but it can also serve as a tool to harvest visual examples of a given concept (e.g., an object or an action category) for training purposes. Once trained on these examples, visual classifiers in turn allow automatic semantic annotation of images and videos for enhanced content-based search by keywords.
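The harvest-then-train loop mentioned here can be sketched with a deliberately minimal classifier. Everything below is a hypothetical illustration: a nearest-centroid model standing in for a real visual classifier, with toy two-dimensional features in place of genuine image descriptors.

```python
def train_centroids(examples):
    """examples: {concept: [feature vectors harvested via text queries]}.
    Returns one mean vector (centroid) per concept."""
    centroids = {}
    for concept, vecs in examples.items():
        dim = len(vecs[0])
        centroids[concept] = [sum(v[i] for v in vecs) / len(vecs)
                              for i in range(dim)]
    return centroids

def annotate(vec, centroids):
    """Label a new image feature with the nearest concept centroid."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda c: dist2(vec, centroids[c]))

# Toy training sets, as if harvested from text-annotated web images.
harvested = {
    "car":    [[1.0, 0.1], [0.9, 0.2]],
    "person": [[0.1, 1.0], [0.2, 0.8]],
}
centroids = train_centroids(harvested)
print(annotate([0.8, 0.3], centroids))  # → car
```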
One key idea is thus to exploit the vast amount of intertwined textual and audio-visual data available online. In a number of cases, it is reasonable to assume that the piece of audio-visual content of interest has already benefited from one form or another of textual annotation by other people (ranging from comprehensive scripts for movies and TV series, to brief synopses for archived TV news and, at the other extreme, to simple tags attached by people sharing contents). The challenge consists in retrieving, sanitizing and exploiting this information to produce useful “captions” for audio-visual queries. This could be seen as building a (semi-)automated audiovisual archivist that exploits indirect crowdsourcing and, more generally, as an effort towards strengthening our collective audiovisual memory.
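The sanitization step could, in its simplest form, amount to cross-contributor agreement: keep only annotations repeated by several independent sources. The snippet below is a minimal sketch under that assumption; the tag data and the `min_support` threshold are illustrative.

```python
from collections import Counter

def sanitize_tags(tag_lists, min_support=2):
    """Aggregate noisy tag sets from different contributors: normalize
    case/whitespace, then keep tags seen in at least `min_support` sets."""
    counts = Counter(tag.strip().lower()
                     for tags in tag_lists for tag in set(tags))
    # Sort by decreasing support, then alphabetically, for a stable output.
    return sorted((t for t, n in counts.items() if n >= min_support),
                  key=lambda t: (-counts[t], t))

# Toy example: three users tag the same clip, with some noise.
contributions = [
    ["Obama", "inauguration", "lol"],
    ["obama ", "Inauguration", "Washington"],
    ["inauguration", "washington"],
]
print(sanitize_tags(contributions))  # → ['inauguration', 'obama', 'washington']
```

A real system would of course face far messier annotations (scripts, synopses, free text), but the principle of corroborating independent sources carries over.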
As a prototypical scenario, we propose the following task: given a short excerpt (with or without audio track) from the video coverage of a public event, but no side information, the system should produce precise textual information on it. The sought information is not a generic labeling at the category level (though this could serve as a powerful intermediate analysis to expand the query semantically), but a description at the event-identity level: Which event is it? When and where did it take place? What is its context? What is precisely happening in the audio-visual scene? In particular, who are the persons in the scene? Where are they in the image and what are they doing or saying?
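The questions above suggest a structured output format. One hypothetical way to organize a system’s answer is sketched below; the field names and the example values are assumptions, not part of the challenge specification.

```python
from dataclasses import dataclass, field

@dataclass
class EventCaption:
    """Illustrative container for an event-identity-level description."""
    event: str                                   # which event is it?
    date: str                                    # when did it take place?
    location: str                                # where did it take place?
    context: str = ""                            # surrounding context
    persons: list = field(default_factory=list)  # who is in the scene?
    action: str = ""                             # what is happening or said?

# Hypothetical answer for some query clip.
caption = EventCaption(
    event="Presidential inauguration (example)",
    date="2009-01-20",
    location="Washington, D.C.",
    persons=["Barack Obama"],
    action="taking the oath of office",
)
print(caption.location)  # → Washington, D.C.
```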
Participants could report results on any real data of their choice. In addition, Technicolor will provide them with a small selection of queries (snapshots and AV clips with no additional information) and will qualitatively assess the captions produced by participants’ systems.
Appendix: Examples of queries and answers.
More examples can be provided on request.