A 3-hour tutorial at ACM Multimedia 2011 for beginners in audio processing and for multimedia students and researchers at an intermediate level.
By Gerald Friedland
International Computer Science Institute, USA
fractor@icsi.berkeley.edu
http://www.icsi.berkeley.edu/~fractor
Motivation
Today's computers have enough computational power and memory to process large amounts of data in different sensory modalities. This makes it possible to improve the robustness of current content analysis approaches and to attack problems that cannot be solved using a single modality alone. Just as a human analyst uses multiple sources of information to determine the content of a video, it seems obvious that for video content analysis, investigating clues across different sensor modalities and combining them can lead to better results than investigating only one stream of sensor input. This is especially true for the analysis of consumer-produced, “unconstrained” videos, such as YouTube uploads or Flickr content.
Computer science curricula usually include basic image processing and computer vision classes but only rarely classes on acoustic content analysis, and where they do, acoustic analysis is often reduced to speech recognition. As a result, multimedia content analysis is mostly image- and vision-centric. While visual information is a very important part of a video, acoustic information often complements it. Moreover, with multimodal integration and fusion still being active research topics, many multimedia researchers lack in-depth knowledge of how to combine and integrate modalities efficiently and effectively.
Objectives
The objective of the tutorial is to introduce interested multimedia students and researchers who are not specialized in audio to the world of acoustic processing research, with a focus on multimodal content analysis. For example: What accuracy can one expect from speech recognition, and what open-source options are there? Can indoor/outdoor detection be done with audio? How does an acoustic event detector work? Which toolkits are available? The goal is to enable participants to include acoustic processing in their research, as an addition to image, text, and visual video processing, to enhance their multimedia content analysis results, especially on large-scale video corpora. Because a 3-hour tutorial can replace neither a lecture series spanning several semesters nor a degree in electrical engineering, the main purpose of this tutorial is to introduce basic concepts and technical terms, along with practical software toolkits and references to key literature. I hope to foster a high degree of cross-media fertilization that will benefit the multimedia community.
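To make these questions concrete, consider the kind of low-level processing the tutorial starts from: computing short-term features such as energy and zero-crossing rate, on top of which simple detectors (speech/non-speech, indoor/outdoor) can be built. The following is a minimal sketch in Python, assuming a 16-bit mono WAV file and NumPy; the file name and window parameters are illustrative placeholders, not part of the tutorial materials.

    # Minimal sketch of short-term audio feature extraction.
    # Assumes a 16-bit mono WAV file and NumPy; file name and window
    # parameters are illustrative placeholders, not tutorial code.
    import wave
    import numpy as np

    def short_term_features(path, win=1024, hop=512):
        """Compute per-frame energy and zero-crossing rate for a mono WAV file."""
        with wave.open(path, "rb") as w:
            assert w.getsampwidth() == 2 and w.getnchannels() == 1
            samples = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)
        x = samples.astype(np.float64) / 32768.0  # normalize to [-1, 1)
        feats = []
        for start in range(0, len(x) - win, hop):
            frame = x[start:start + win]
            energy = np.mean(frame ** 2)                        # short-term energy
            zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2  # zero-crossing rate
            feats.append((energy, zcr))
        return np.array(feats)

    if __name__ == "__main__":
        feats = short_term_features("example.wav")  # hypothetical input file
        print(feats.shape, feats.mean(axis=0))

Frames with high energy and low zero-crossing rate, for instance, are a classic heuristic cue for voiced speech, while unvoiced sounds tend to show the opposite pattern.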
Outline
The following is a list of topics that the tutorial will cover:
- Useful and common filters
- Features for audio analysis
- Typical machine learning methods used for acoustic processing (see the sketch after this list)
- Evaluation methods of acoustic processing
- Example tasks of acoustic analysis
- Toolkits for acoustic analysis
- Multimodal integration
- Multimodal fusion
- Experimental setup for conducting multimodal experiments
- Acoustic and multimodal research challenges
- Discussion
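As an illustration of the machine learning topic above, the sketch below shows Gaussian mixture models (GMMs), a staple technique in acoustic processing, used as a two-class classifier: one mixture is fit per class, and a test feature vector is assigned to the class whose model gives it the higher log-likelihood. scikit-learn and the synthetic feature vectors are assumptions made for the sake of a runnable example, not tutorial code.

    # Minimal sketch of GMM-based classification, a staple of acoustic
    # processing (e.g., speaker or environment modeling). scikit-learn
    # and the synthetic data are illustrative assumptions.
    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)
    # Stand-in feature vectors for two acoustic classes (e.g., indoor/outdoor).
    indoor = rng.normal(loc=0.0, scale=1.0, size=(500, 2))
    outdoor = rng.normal(loc=3.0, scale=1.5, size=(500, 2))

    # Fit one mixture model per class on its training features.
    gmm_in = GaussianMixture(n_components=4, random_state=0).fit(indoor)
    gmm_out = GaussianMixture(n_components=4, random_state=0).fit(outdoor)

    # Classify a test frame by comparing per-class average log-likelihoods.
    test = rng.normal(loc=3.0, scale=1.5, size=(1, 2))
    label = "indoor" if gmm_in.score(test) > gmm_out.score(test) else "outdoor"
    print(label)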
Materials
The materials will be based on a class taught on the same topic at UC Berkeley in the fall semester of 2011. Participants will receive the slides of the presentations. In addition, attendees of the tutorial will have early access to materials from the textbook “Introduction to Multimedia Computing” by G. Friedland and R. Jain, which will be published by Cambridge University Press soon after the tutorial. The textbook complements the tutorial not only with additional explanations but also with pseudocode (for reference) and exercises (to deepen the presented material).
About the Presenter
Dr. Gerald Friedland is a senior research scientist at the International Computer Science Institute (ICSI), a private lab affiliated with the University of California, Berkeley, where he leads multimedia content analysis research, mostly focusing on acoustic techniques. Projects he is involved in include acoustic methods for the TRECVid MED 2011 video concept detection task, multimodal location estimation for consumer videos, and multimodal grounded perception for robots. Dr. Friedland has published more than 100 peer-reviewed articles in conferences, journals, and books. He is an associate editor for ACM Transactions on Multimedia Computing, Communications, and Applications, serves on the organizing committee of ACM Multimedia 2011, and is a TPC co-chair of IEEE ICME 2012. Dr. Friedland has received several research and industry recognitions, among them the European Academic Software Award and the Multimedia Entrepreneur Award of the German Federal Department of Economics. Most recently, he led the team that won the ACM Multimedia Grand Challenge in 2009.
Despite being mainly a researcher, Dr. Friedland is a passionate teacher. He teaches a class on the same topic in the fall semester at UC Berkeley, and he is currently authoring a new textbook, “Introduction to Multimedia Computing”, together with Dr. Ramesh Jain, to be published by Cambridge University Press. He is also a proud founder, program director, and instructor of the IEEE International Summer School on Semantic Computing at UC Berkeley, which fosters cross-disciplinary computer science research on content extraction.