Yahoo! Video Challenge

Robust Automatic Segmentation of Video According to Narrative Themes

Video search today relies mostly on textual metadata associated with the video, such as its title, tags, or surrounding page text. This approach falls severely short by ignoring the richness of information within the video medium; an engine should ideally use this information to help a user search and navigate content. As video content explodes and user attention spans shrink, a next-generation video search engine needs to provide users with the ability to search for sections within a video; allow users to consume the bits and pieces of a video that interest them; and let users kill time during lunch breaks in creative ways. In addition, instead of offering just one thumbnail as the representation for a whole video, it would be great to be able to partition a video into its constituent narrative themes and allow users to navigate through a video at a more granular level with better video surrogates.

The challenge to researchers in the multimedia community is to develop methods, techniques, and algorithms to automatically generate narrative themes for a given video, as well as to present the content in an easy-to-consume manner to end users in a search engine experience. Naturally, the themes that emerge depend entirely on the video itself, so the methods and algorithms have to be generic. Still, approaches could be developed for certain types and genres of videos: for instance, one approach could be employed for sitcoms, another for sports content, another for educational content, and so on.
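
One way to accommodate such genre-specific approaches while keeping the overall pipeline generic is a simple dispatch over per-genre extractors. The Python sketch below is purely illustrative; the genre labels, function names, and placeholder bodies are our own assumptions, not part of the challenge.

    from typing import Callable, Dict, List, Tuple

    # Each extractor maps a video (here just a file path) to a list of
    # (theme_label, start_sec, end_sec) tuples. The bodies are placeholders;
    # a real system would substitute its own analysis.
    ThemeList = List[Tuple[str, float, float]]
    Extractor = Callable[[str], ThemeList]

    def extract_sitcom_themes(video_path: str) -> ThemeList:
        # e.g. cluster scenes by recurring faces/speakers (placeholder)
        return []

    def extract_sports_themes(video_path: str) -> ThemeList:
        # e.g. detect crowd-noise spikes and scoreboard changes (placeholder)
        return []

    def extract_generic_themes(video_path: str) -> ThemeList:
        # fallback: shot segmentation plus transcript topic clustering (placeholder)
        return []

    EXTRACTORS: Dict[str, Extractor] = {
        "sitcom": extract_sitcom_themes,
        "sports": extract_sports_themes,
    }

    def extract_themes(video_path: str, genre: str) -> ThemeList:
        """Pick a genre-specific extractor, falling back to a generic one."""
        return EXTRACTORS.get(genre, extract_generic_themes)(video_path)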

Input/output

To use a pop-culture example, imagine the input is an episode of Seinfeld (an NBC TV show popular during the 1990s). The output would ideally be 3 or 4 narrative themes around each character, along with the corresponding video start and end ranges. The themes could overlap, but they do not have to. A way to navigate (a user interface) through these narrative themes should also be presented. This output, of course, should be searchable, in that it provides a better representation of the content to search engines as well. For example, if the output included the character name “Kramer” and a user entered just “Kramer” as a search term, this Seinfeld episode should surface in the results, with the corresponding narrative themes surrounding Kramer also presented to the user so they can browse and click.
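
To make the expected output a bit more concrete, here is a minimal sketch in Python of one possible representation: theme segments with a label, a time range, and searchable terms. The class name, fields, and example times are illustrative assumptions, not a format prescribed by the challenge.

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class ThemeSegment:
        """One narrative theme and the video range it spans (hypothetical record)."""
        label: str                                          # e.g. a character name such as "Kramer"
        start_sec: float                                    # segment start within the video
        end_sec: float                                      # segment end within the video
        keywords: List[str] = field(default_factory=list)   # searchable terms

    # Example output for one episode (labels and times are made up):
    themes = [
        ThemeSegment("Kramer", 120.0, 310.5, keywords=["Kramer", "apartment"]),
        ThemeSegment("George", 310.5, 540.0, keywords=["George", "job interview"]),
        ThemeSegment("Elaine", 540.0, 780.0, keywords=["Elaine"]),
    ]

    # A search for "Kramer" can then surface both the episode and the matching segments:
    query = "kramer"
    hits = [t for t in themes
            if query.lower() == t.label.lower()
            or any(query.lower() in kw.lower() for kw in t.keywords)]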

In other examples, if the input were a financial news video talking about the economy, a bailout package, etc., then the emerging themes could be the company names, executives, and government officials mentioned. If the input were a sports game, then the output themes could be the major points in the game: for baseball, perhaps home runs, hits, walks, strikeouts, inning changes, etc.

Yahoo! can provide a few sample videos in various domains where it holds copyright permissions for this research.

Metrics/Evaluation

There will be 3 criteria for evaluation:

  • Relevance of narrative themes
  • Innovative presentation & navigation of sub-themes for a video
  • Efficiency of the underlying algorithm

The key criterion for evaluation will be the relevance of the themes extracted from a particular video. When evaluating such services at Yahoo!, we would have human judges (usually editors or product managers) rate the relevance of the narrative themes to a given video.
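
As a rough sketch of how such judgments could be aggregated (the 1-5 rating scale and the simple averaging are our own assumptions; the challenge does not specify a protocol), one could average per-judge relevance ratings for each extracted theme:

    from statistics import mean
    from typing import Dict, List

    def theme_relevance(ratings: Dict[str, List[int]]) -> Dict[str, float]:
        """Average per-judge relevance ratings (e.g. on a 1-5 scale) for each theme."""
        return {theme: mean(scores) for theme, scores in ratings.items()}

    # Hypothetical ratings from three judges for themes extracted from one video:
    ratings = {"Kramer": [5, 4, 5], "George": [3, 4, 3], "coffee shop": [2, 1, 2]}
    print(theme_relevance(ratings))
    # approximately {'Kramer': 4.67, 'George': 3.33, 'coffee shop': 1.67}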

The second important criterion is the creativity with which the sub-themes of a video are presented, allowing for ease of browsing. We are looking for solutions that will increase both findability and engagement with content found via search engines or while browsing.

Lastly, the elegance of the solution should be evaluated by its ease of integration into a search engine's pipeline and by the efficiency with which it can process a video and output the narrative themes; the latter refers to processing speed. If one technology takes a day to chew through a 20-minute video and spit out the narrative themes, while another can process the video in real time or faster (processing time no longer than the 20-minute video length), then the latter is clearly much more attractive. New approaches and algorithms that reduce or optimize computation may be required.
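
To illustrate the processing-speed comparison above, a simple ratio of processing time to video duration captures the idea; the function below is just a sketch of that arithmetic.

    def real_time_factor(processing_seconds: float, video_seconds: float) -> float:
        """Ratio of processing time to video duration; <= 1.0 means real time or faster."""
        return processing_seconds / video_seconds

    # A full day to process a 20-minute video:
    slow = real_time_factor(24 * 3600, 20 * 60)   # 72.0
    # Real-time processing of the same video:
    fast = real_time_factor(20 * 60, 20 * 60)     # 1.0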

ACM Multimedia 2011

Nov 28th - Dec 1st, 2011, Scottsdale, Arizona, USA
