IMuR '22: Proceedings of the 2nd International Workshop on Interactive Multimedia Retrieval

SESSION: Oral Presentations

Combining Semantic and Visual Image Graphs for Efficient Search and Exploration of Large Dynamic Image Collections

  • Kai Uwe Barthel
  • Nico Hezel
  • Konstantin Schall
  • Klaus Jung

Image collections today often consist of millions of images, making it impossible to get an overview of the entire content. In recent years, we have presented several demonstrators of graph-based systems that allow image search and visual exploration of a collection. Meanwhile, very powerful visual and joint visual-textual feature vectors have been developed that are well suited to finding images similar to a query image or matching a textual description. A drawback of these feature vectors is their high dimensionality, which leads to long search times, especially for large image collections. In this paper, we show how the search time can be reduced significantly even for high-dimensional feature vectors, improving the efficiency of the search system. By combining two different image graphs, we achieve, on the one hand, an extremely fast approximate nearest neighbor search; experimental results show that the proposed method outperforms state-of-the-art methods. On the other hand, the entire image collection can be explored visually in real time using a standard web browser. Unlike other graph-based search systems, the proposed image graphs can adapt dynamically to the insertion and removal of images from the collection.
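
For orientation, a minimal sketch of the general principle behind graph-based approximate nearest neighbor search (best-first traversal of a neighborhood graph). This is illustrative Python under generic assumptions, not the authors' combined-graph algorithm:

    import heapq
    import numpy as np

    def greedy_graph_search(graph, vectors, query, entry, k=10, ef=64):
        """Best-first search on a neighborhood graph (generic sketch).
        `graph` maps node id -> list of neighbor ids; `vectors` holds
        the image feature vectors; `entry` is the start node."""
        d0 = np.linalg.norm(vectors[entry] - query)
        visited = {entry}
        candidates = [(d0, entry)]      # min-heap of nodes to expand
        results = [(-d0, entry)]        # max-heap of best nodes found
        while candidates:
            dist, node = heapq.heappop(candidates)
            if len(results) >= ef and dist > -results[0][0]:
                break                   # no closer nodes reachable
            for nb in graph[node]:
                if nb in visited:
                    continue
                visited.add(nb)
                d = np.linalg.norm(vectors[nb] - query)
                heapq.heappush(candidates, (d, nb))
                heapq.heappush(results, (-d, nb))
                if len(results) > ef:
                    heapq.heappop(results)  # drop current worst
        return sorted((-d, n) for d, n in results)[:k]

The walk follows graph edges toward the query, so only a small fraction of the collection is ever compared against the high-dimensional query vector.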

NeoCube: Graph-Based Implementation of the M3 Data Model

  • Nikolaj Mertz
  • Björn Þór Jónsson
  • Aaron Duane

In this work, we consider metadata-based exploration of media collections using the M3 data model to support multimedia analytics applications. We propose a new metadata-server implementation based on the Neo4j graph database system and compare it to the existing, heavily optimised server based on a relational database system. We show that the graph-based implementation performs well for interactive metadata-space retrieval, albeit not as well as the optimised relational implementation. However, the graph-based implementation also allows very efficient updates to the metadata collection, which are practically impossible in the optimised relational implementation.
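
As a hedged illustration, a metadata-space query against a Neo4j-backed server might look as follows, using the official neo4j Python driver; the node labels, relationship type, and property names are assumptions for illustration, since the abstract does not specify the M3 schema:

    from neo4j import GraphDatabase

    # Hypothetical schema: (:Object)-[:TAGGED_WITH]->(:Tag)
    driver = GraphDatabase.driver("bolt://localhost:7687",
                                  auth=("neo4j", "password"))

    def objects_with_tag(tag_name):
        """Return ids of media objects carrying a given tag."""
        with driver.session() as session:
            result = session.run(
                "MATCH (o:Object)-[:TAGGED_WITH]->(t:Tag {name: $name}) "
                "RETURN o.id AS id",
                name=tag_name,
            )
            return [record["id"] for record in result]

Updates in this setting are single `CREATE`/`DELETE` statements on nodes and relationships, which is why insertions are cheap compared to a relational schema optimised around precomputed join structures.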

Influence of Late Fusion of High-Level Features on User Relevance Feedback for Videos

  • Omar Shahbaz Khan
  • Jan Zahálka
  • Björn Þór Jónsson

Content-based media retrieval relies on multimodal data representations. For videos, these representations mainly cover the textual, visual, and audio modalities. While the modality representations can be used individually, combining their information can improve the overall retrieval experience. For video collections, retrieval focuses on finding either a full-length video or specific segment(s) from one or more videos. For the former, the textual metadata along with broad descriptions of the contents are useful. For the latter, visual and audio modality representations are preferable, as they represent the contents of specific segments in videos. Interactive learning approaches, such as user relevance feedback, have shown promising results when solving exploration and search tasks in larger collections. When combining modality representations in user relevance feedback, some form of late modality fusion is often applied. While this generally tends to improve retrieval, its performance for video collections with multiple modality representations of high-level features is not well known. In this study, we analyse the effects of late fusion using high-level features such as semantic concepts, actions, scenes, and audio. From our experiments on three video datasets, V3C1, Charades, and VGG-Sound, we show that fusion works well, but that, depending on the task or dataset, excluding one or more modalities can improve results. When it is clear that a modality is better suited for a task, setting a preference that enhances that modality's influence in the fusion process can also be greatly beneficial. Furthermore, we show that mixing fusion results with results from individual modalities can be better than performing fusion alone.
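
A minimal sketch of weighted late fusion over per-modality relevance scores, of the kind discussed above; the min-max normalisation and linear weighting here are generic choices for illustration, not necessarily the paper's exact scheme:

    import numpy as np

    def late_fusion(scores, weights=None):
        """Combine per-modality relevance scores for the same items.
        `scores` maps modality name -> array of scores over all items;
        `weights` lets a preferred modality dominate the fusion."""
        if weights is None:
            weights = {m: 1.0 for m in scores}
        fused = np.zeros(len(next(iter(scores.values()))), dtype=float)
        for m, s in scores.items():
            s = np.asarray(s, dtype=float)
            rng = s.max() - s.min()
            # Min-max normalise each modality so scales are comparable.
            s = (s - s.min()) / rng if rng > 0 else np.zeros_like(s)
            fused += weights[m] * s
        return fused / sum(weights.values())

Setting, say, a larger weight for the visual modality on a visually oriented task mirrors the "preference" mechanism the abstract reports as beneficial.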

Impact of Blind Image Quality Assessment on the Retrieval of Lifelog Images

  • Ricardo Ribeiro
  • Alina Trifan
  • António J. R. Neves

Personal lifelogs can help improve quality of life, serving as tools for memory augmentation or as support for people with memory issues. In visual lifelogs, data are captured by cameras in the form of images or videos. However, a considerable number of these images or videos are affected by different types of distortion or noise due to the uncontrolled acquisition process. This article addresses the use of Blind Image Quality Assessment algorithms as a pre-processing step in the retrieval of lifelog images. As the amount of lifelog data has grown over the last few years, it is fundamental to find solutions to filter images in a lifelog data collection. We evaluate the impact of a Blind Image Quality Assessment algorithm by performing different retrieval experiments with a lifelogging system named MEMORIA. The results are promising and show that our approach can reduce the number of images to process and retrieve in a lifelog data collection without losing valuable information, while providing the user with the most valuable images. By excluding a considerable number of images in the pre-processing stage of a lifelogging system, its performance can be increased by saving time and resources.
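
A toy sketch of the pre-processing idea: use a blind (no-reference) quality score to drop low-quality captures before indexing. Here `score_fn` is a hypothetical stand-in for any BIQA model (e.g. BRISQUE-style or CNN-based) that returns higher scores for better images:

    def filter_lifelog_images(paths, score_fn, threshold=0.5):
        """Split image paths into those worth indexing and those to
        discard, based on a no-reference quality score."""
        kept, dropped = [], []
        for p in paths:
            (kept if score_fn(p) >= threshold else dropped).append(p)
        return kept, dropped

Everything downstream (feature extraction, indexing, retrieval) then runs only on `kept`, which is where the reported savings in time and resources come from.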

An Asynchronous Scheme for the Distributed Evaluation of Interactive Multimedia Retrieval

  • Loris Sauter
  • Ralph Gasser
  • Abraham Bernstein
  • Heiko Schuldt
  • Luca Rossetto

Evaluation campaigns for interactive multimedia retrieval, such as the Video Browser Showdown (VBS) or the Lifelog Search Challenge (LSC), have so far imposed constraints on both the simultaneity and the locality of all participants, requiring them to solve the same tasks in the same place, at the same time, and under the same conditions. These constraints stand in contrast to other evaluation campaigns that do not focus on interactivity, where participants can process the tasks in any place at any time. The recent travel restrictions necessitated relaxing the locality constraint of interactive campaigns, enabling participants to take part from an arbitrary location. Born out of necessity, this relaxation turned out to be a boon, since it greatly simplified the evaluation process and enabled the organisation of ad-hoc evaluations outside of the large campaigns. However, it also introduced an additional complication in cases where participants were spread over several time zones. In this paper, we introduce an evaluation scheme for interactive retrieval evaluation that relaxes both the simultaneity and the locality constraints, enabling participation from any place at any time within a predefined time frame. This scheme, as implemented in the Distributed Retrieval Evaluation Server (DRES), enables novel ways of conducting interactive retrieval evaluation and bridges the gap between interactive campaigns and non-interactive ones.
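
A toy model of the relaxed constraints (each team may start at any time within a predefined campaign window and then gets the same fixed task duration); this is an illustrative sketch, not DRES's actual API:

    from datetime import datetime, timezone

    class AsyncTaskWindow:
        """Asynchronous evaluation window: teams start whenever they
        like inside [opens, closes] and all get `duration_s` seconds."""

        def __init__(self, opens, closes, duration_s):
            self.opens, self.closes = opens, closes
            self.duration_s = duration_s
            self.started = {}   # team -> start time

        def start(self, team, now=None):
            now = now or datetime.now(timezone.utc)
            if not (self.opens <= now <= self.closes):
                raise ValueError("campaign window is closed")
            self.started.setdefault(team, now)

        def accepts(self, team, now=None):
            """Is a submission from this team still valid?"""
            now = now or datetime.now(timezone.utc)
            t0 = self.started.get(team)
            return t0 is not None and \
                (now - t0).total_seconds() <= self.duration_s

Because validity is judged per team relative to its own start time, neither simultaneity nor locality matters; only the shared outer window does.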

VILT: Video Instructions Linking for Complex Tasks

  • Sophie Fischer
  • Carlos Gemmell
  • Iain Mackie
  • Jeffrey Dalton

This work addresses challenges in developing conversational assistants that support rich multimodal video interactions for accomplishing real-world tasks interactively. We introduce the task of automatically linking instructional videos to task steps as "Video Instructions Linking for Complex Tasks" (VILT). Specifically, we focus on the domain of cooking, empowering users to cook meals interactively with a video-enabled Alexa skill. We create a reusable benchmark with 61 queries from recipe tasks and curate a collection of 2,133 instructional "How-To" cooking videos. Studying VILT with state-of-the-art retrieval methods, we find that dense retrieval with ANCE is the most effective, achieving an NDCG@3 of 0.566 and a P@1 of 0.644. We also conduct a user study measuring the effect of incorporating videos in a real-world task setting, where 10 participants perform several cooking tasks under varying multimodal experimental conditions using a state-of-the-art Alexa TaskBot system. Users interacting with manually linked videos said they learned something new 64% of the time, a 9 percentage point increase over automatically linked videos (55%), indicating that linked video relevance is important for task learning.
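
For reference, the two reported metrics can be computed as follows. This is the standard textbook formulation (with the ideal ranking taken over the retrieved list's judgments, a common simplification), not code from the paper:

    import math

    def ndcg_at_k(relevances, k=3):
        """`relevances`: graded relevance of ranked results, best first."""
        dcg = sum(r / math.log2(i + 2)
                  for i, r in enumerate(relevances[:k]))
        ideal = sorted(relevances, reverse=True)
        idcg = sum(r / math.log2(i + 2)
                   for i, r in enumerate(ideal[:k]))
        return dcg / idcg if idcg > 0 else 0.0

    def precision_at_1(relevances):
        """1.0 if the top-ranked result is relevant, else 0.0."""
        return 1.0 if relevances and relevances[0] > 0 else 0.0

An NDCG@3 of 0.566 thus means the top three linked videos achieve roughly 57% of the discounted gain an ideal ordering would, while P@1 of 0.644 means the first video is relevant for about two thirds of the queries.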