By Cees G.M. Snoek and Arnold W.M. Smeulders
University of Amsterdam, Netherlands
cgmsnoek@uva.nl
ArnoldSmeulders@uva.nl
ABSTRACT
In this tutorial, we focus on the challenges in internet video search, present methods to achieve state-of-the-art performance while maintaining efficient execution, and indicate how to obtain improvements in the near future. Moreover, we give an overview of the latest developments and future trends in the field on the basis of the TRECVID competition, the leading competition for video search engines run by NIST, where we have achieved consistent top performance over the past years, including the 2008, 2009 and 2010 editions.
Categories and Subject Descriptors: H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval
General Terms: Algorithms, Experimentation, Performance
Keywords: Visual categorization, video retrieval, information visualization
TUTORIAL DESCRIPTION
The scientific topic of video search is dominated by five major challenges:
(a) the sensory gap between an object and its many appearances due to the accidental sensing conditions;
(b) the semantic gap between a visual concept and its lingual representation;
(c) the model gap between the number of notions in the world and the capacity to learn them;
(d) the query-context gap between the information need and the possible retrieval solutions;
(e) the interface gap between the tiny window the screen offers and the amount of data.
The semantic gap is bridged by forming a dictionary of visual detectors for concepts and events. The largest dictionaries to date consist of hundreds of concepts, which rules out concept-tailored algorithms: designing one for every concept would simply take too long. Instead, we come closer to the ideal of one computer vision algorithm tailored automatically to the purpose at hand by employing example data to learn from. We discuss the advantages and limitations of a machine learning approach from examples. We show for what type of semantics the approach is likely to succeed or fail. In compensation for the absence of concept-specific (geometric or appearance) models, we emphasize the importance of good feature sets. They form the basis of the observational model: all possible color, shape, texture, or structure invariant features help to characterize the concept or event at hand. Apart from good features, the other essential component is state-of-the-art machine learning in order to get the most out of the learning data.
We integrate the features and machine learning aspects into a complete internet video search engine, which has successfully competed in TRECVID. The multimedia system includes computer vision, machine learning, information retrieval, and human-computer interaction. We follow the video data as they flow through the efficient computational processes. We start from fundamental visual features, covering local shape, texture, color, motion, and the crucial need for invariance. Then, we explain how invariant features can be used in concert with kernel-based supervised learning methods to arrive at an event or concept detector. We discuss the important role of fusion on a feature, classifier, and semantic level to improve the robustness and general applicability of detectors. We end our component-wise decomposition of internet video search engines by explaining the complexities involved in delivering a limited set of uncertain concept detectors to an impatient online user. For each of the components we review state-of-the-art solutions in the literature, each having different characteristics and merits.
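As a rough illustration of two ingredients named above, a kernel on feature histograms and fusion at the classifier level, the following Python sketch may help. It is not the system described in this tutorial: the exponential chi-square kernel, the score-averaging fusion rule, the `gamma` parameter, and all function names are assumptions chosen for illustration only.

```python
import math


def normalize(hist):
    """Scale a non-negative histogram so that its bins sum to 1."""
    total = sum(hist)
    if total == 0:
        return [0.0] * len(hist)
    return [h / total for h in hist]


def chi2_distance(p, q):
    """Chi-square distance between two histograms of equal length."""
    if len(p) != len(q):
        raise ValueError("histograms must have the same length")
    dist = 0.0
    for a, b in zip(p, q):
        if a + b > 0:  # skip bins that are empty in both histograms
            dist += (a - b) ** 2 / (a + b)
    return 0.5 * dist


def chi2_kernel(p, q, gamma=1.0):
    """Exponential chi-square kernel; gamma is a hypothetical bandwidth."""
    return math.exp(-gamma * chi2_distance(normalize(p), normalize(q)))


def late_fusion(score_lists):
    """Classifier-level fusion (assumed rule): average scores per shot."""
    if not score_lists:
        raise ValueError("need at least one list of scores")
    n = len(score_lists[0])
    if any(len(scores) != n for scores in score_lists):
        raise ValueError("all detectors must score the same shots")
    return [sum(column) / len(score_lists) for column in zip(*score_lists)]
```

Under these assumptions, `chi2_kernel([2, 4, 6], [1, 2, 3])` evaluates to 1.0, because both histograms normalize to the same distribution; this is a very simple form of invariance to a global scaling of the bin counts. A kernel of this kind could be passed to a kernel-based classifier, and `late_fusion` could then combine the per-shot scores of detectors trained on different features.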
Comparative evaluation of methods and systems is imperative to appreciate progress. We discuss the data, tasks, and results of TRECVID, the leading benchmark. In addition, we discuss the many derived community initiatives in creating annotations, baselines, and software for repeatable experiments. We conclude the course with our perspective on the many challenges and opportunities ahead for the multimedia retrieval community.
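To make comparative evaluation concrete, the sketch below scores ranked result lists with average precision. The text above does not name a measure, so the choice of average precision is an assumption here, as are the function names and the simplification that every relevant shot appears somewhere in the ranked list.

```python
def average_precision(ranked_relevance):
    """Average precision of one ranked list.

    ranked_relevance holds one boolean per returned shot, best rank first.
    Assumes all relevant shots occur in the list (a simplification).
    """
    hits = 0
    precision_sum = 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(runs):
    """Mean of the average precision over several queries or concepts."""
    if not runs:
        raise ValueError("need at least one ranked list")
    return sum(average_precision(run) for run in runs) / len(runs)
```

For example, `average_precision([True, False, True])` is (1/1 + 2/3) / 2, roughly 0.83: the relevant shot at rank 3 is penalized for the irrelevant shot ranked above it.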
Acknowledgments
This tutorial is supported by STW SEARCHER, FES COMMIT, and IARPA via Department of Interior National Business Center contract number D11PC20067. Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DoI/NBC, or the U.S. Government.
Copyright is held by the author/owner(s).
MM’11, December 25–29, 2010, Scottsdale, AZ, USA.
ACM 978-1-60558-933-6/10/10.