Call for Participation in the Multimedia Grand Challenge 2011

March 2011


Authors: Gerald Friedland, Yohan Jin

By Gerald Friedland

Deadline: August 6th, 2011.

What problems do Yahoo, HP, Nokia, Technicolor, 3DLife, and other companies see in the future of multimedia?

The Multimedia Grand Challenge is a set of problems proposed by industry leaders, geared to engage the Multimedia research community in solving relevant, interesting, and challenging questions about the industry's 2-5 year horizon for multimedia. The Grand Challenge was first presented as part of ACM Multimedia 2009 and has established itself as a prestigious competition in the multimedia community.

This year's conference will continue the tradition with both ongoing and brand-new challenges, including:

  • HP Challenge: High Impact Visual Communication

    Images can serve as a powerful communications vehicle, conveying a wealth of information as well as emotional impact. The color, composition, content, lighting and sharpness of an image all contribute to a viewer's response to that image, and the relative placement, scaling and orientation of a group of images in a collage add a further layer of richness and meaning to a page. These characteristics are used extensively by professionals on web sites, magazine covers and printed advertisements to draw attention, communicate a message and leave a lasting emotional impression.

    Because images hold such power, people like to use their photos to tell their own stories. The end result often falls short, however, because many people lack the skill and intuitive understanding needed to create a coherent visual story from their photos. A picture is also said to be worth a thousand words: how do we create a high-impact picture that conveys information across cultural boundaries, and how do we find the thousand words that best describe it? This grand challenge is to find a solution that creates a collage and generates a textual description telling the story of a set of photos.

    The system starts with a digital photo collection, such as photos taken during a vacation. It then analyzes the collection automatically using information from multiple sources such as image analysis, internet data sources, and EXIF tags. The result of the analysis is used to create the most appealing collage picture that best represents the original collection. In addition, it is also used to generate a description of the collage picture.

  • Technicolor Challenge: Precise Event Recognition and Description from Video Excerpts

    Visual search that aims at retrieving information on a specific object, person or place in a photo has recently witnessed a surge of activity in both the academic and industrial worlds. It is now possible, for instance, to get precise information on a painting, a monument or a book by photographing it with a mobile phone. Such impressive systems rely on mature image description and matching technologies coupled with application-dependent databases of annotated images. They require, however, drastically circumscribing the visual world to be queried (e.g., only paintings) and organizing a comprehensive information database accordingly.

    This Grand Challenge aims at exploring tools to extend this search paradigm in different directions: (1) The query is an excerpt from a video of a public event, that is, a short video and/or audio chunk from longer coverage; (2) There is no strong contextual prior, which rules out the use of a purpose-built specialized database; (3) The sought output is a precise textual description of the audiovisual query "scene".

  • Nokia Challenge: Visual Landmark Recognition

    Mobile devices provide ubiquitous access to the internet, and so to almost unlimited amounts of data. Finding the information relevant to you can be a time-consuming task in itself. In modern fast-paced life, when you are on the go, coming up with suitable query terms and typing them on a virtual touch keyboard is simply too slow. Image recognition has been widely recognized as a potential novel way of accessing data relevant to your immediate surroundings: snap a picture of something and the system tells you about it.

    Nokia and NAVTEQ together have created a dataset of street view data where individual buildings are identified. The dataset consists of 150k panoramic images aligned with a 3D city model consisting of 14k buildings obtained from footprint and elevation data. The images were labeled by projecting the 3D model into the panoramic images, computing visibility, and recording the identities of visible buildings.

  • Yahoo! Video Challenge: Robust Automatic Segmentation of Video According to Narrative Themes

    Video search today relies mostly on textual metadata that is associated with the video in terms of title, tags or surrounding page-text. This approach falls severely short by ignoring the richness of information within the video medium; an engine should ideally use this information to help a user search and navigate content. As video content explodes and user attention spans shrink, a next-generation video search engine needs to provide users with the ability to search for sections within a video; allow users to consume bits and pieces of a video that would be of interest to them; and let users kill time during lunch breaks in creative ways. In addition, instead of offering just one thumbnail as representation for a whole video, it would be great to be able to partition a video into its constituent narrative themes and allow users to navigate through a video on a more granular level with better video surrogates.

    The challenge to researchers in the multimedia community is to develop methods, techniques, and algorithms to automatically generate narrative themes for a given video, as well as present the content in an easy-to-consume manner to end-users in a search engine experience. Naturally, the themes that emerge depend entirely on the video itself, so the methods and algorithms have to be generic. Still, approaches could be developed for certain types and genres of videos. For instance, one approach could be employed for sitcoms, another for sports content, another for educational content, etc.

  • Yahoo! Image Challenge: Novel Image Understanding

    There are over 200 billion images on the internet today and this collection continues to grow by leaps and bounds. Image search engines often only surface a portion of those images and often rely on the text surrounding an image on a webpage, or the image file's name. With the growing number of images on the Internet it is important to have the ability to organize and surface the images in the most efficient, meaningful way possible so that better images can be shown to searchers.

    We want to move beyond simple image classification. Textual tags associated with an image often tell us that there is a tiger in an image. Not all images are labeled this way, but there are more than enough on any one subject to fill a search-result page.

    People come to image-search engines for many reasons. Users type an average of 2.2 words, but their underlying request is much more subtle, often representing an information or entertainment need that would normally require a much longer and deeper query.

    We need novel and useful ways to organize and structure image content. Can we sort celebrity pictures by their subject's age when the photo was taken? Or by their hair style? Can we discover how a logo has evolved over time? Can we organize pictures by their geographic location or the type of object? There are many ways to organize photos. What are the ways that are not obvious? What can we do better than we can do today?

    Users would like a better fit between their information and entertainment requests and the content returned by a search. How can we better organize multimedia content to fit users' needs and desires?

  • 3DLife Challenge: Realistic Interaction in Online Virtual Environments

    This challenge calls for demonstrations of technologies that support real-time realistic interaction between humans in online virtual environments. This includes approaches for 3D signal processing, computer graphics, human computer interaction and human factors. To this end, we propose a scenario for online interaction and provide a data set around this to support investigation and demonstrations of various technical components.

    Consider an online dance class given by an expert Salsa dance teacher and delivered via the web. The teacher will perform the class with all movements captured by a state-of-the-art optical motion capture system. The resulting motion data will be used to animate a realistic avatar of the teacher in an online virtual ballet studio. Students attending the online master class will do so by manifesting their own individual avatar in the virtual dance studio. The real-time animation of each student's avatar will be driven by whatever 3D capture technology is available to him/her. This could be captured via visual sensing techniques using a single camera, a camera network, wearable inertial motion sensing, or recent gaming controllers such as the Nintendo Wii or the Microsoft Kinect. The animation of the student's avatar in the virtual space will be real-time and realistically rendered, subject to the granularity of representation and interaction available from each capture mechanism.

We therefore call for submissions of contributions to the ACM Multimedia 2011 Grand Challenge track.

The submissions should:

  • Significantly address one of the challenges posted on the Grand Challenge web site.
  • Depict working, presentable systems or demos.
  • Describe why the system presents a novel and interesting solution.

Preference is given to results that are reproducible by the research community, e.g. where the data and the source code are made publicly available.

The submissions (4 pages) should be formatted according to ACM MM formatting guidelines. Based on the submissions, the finalists will be selected by a committee consisting of academia and industry representatives. Finalist submissions will be published in the proceedings and presented in a special event during the ACM Multimedia 2011 conference in Scottsdale, AZ (USA). At the conference, the finalists will briefly introduce their contributions to the audience and take difficult questions from the judges. A team of judges and the attending crowd will select the top contributor and declare the winner of the Grand Challenge 2011. Special awards might be given to contributions that show an outstanding approach to the integration of multiple media or are based on a novel theoretical framework.

For more information visit the ACM Multimedia 2011 Grand Challenge website:
