NarSUM '23: Proceedings of the 2nd Workshop on User-centric Narrative Summarization of Long Videos

SESSION: Invited Talk 1

How People Watch Videos? Viewer Behavior Analysis for Video Archive Summarization

  • Shin'ichi Satoh

If viewers' behavior while watching videos can be observed, many clues useful for video summarization and other applications can be obtained: for example, which videos (or parts of videos) draw the attention of more viewers, which content is watched most, and what kind of video editing operation is most successful in attracting viewers. This talk will present our attempt to jointly analyze a large-scale TV video archive along with viewer rating information in order to mine this kind of information. In particular, it will be shown that summarizing one year of news topics is possible.

SESSION: Spotlight and Poster Session

An Empirical Study of Multilingual Scene-Text Visual Question Answering

  • Lin Li
  • Haohan Zhang
  • Zeqin Fang

In recent years, the focus on multilingual modeling has intensified, driven by the necessity to enable cross-lingual Text-based Visual Question Answering (TextVQA), which requires the understanding of questions and answers across diverse languages. Existing research predominantly revolves around the fusion of multimodal information and the processing of OCR data. This paper undertakes an empirical investigation into multilingual scene-text visual question answering, addressing both cross-lingual (English <-> Chinese) and monolingual (English <-> English and Chinese <-> Chinese) tasks, with a primary emphasis on accuracy-based metrics. Our study not only elucidates the impact of different OCR feature extractors and visual feature extractors on a selection of state-of-the-art models, but also delves into the broader landscape of multilingual TextVQA. The experimental outcomes underscore the capability of multilingual pretrained models to effectively handle text-based questions, and highlight the importance of leveraging visual features from OCR data and images to enhance answering performance.

A New Approach for Evaluating Movie Summarization

  • George Awad
  • Keith Curtis

An important need in many situations involving video collections (archive video search/reuse, personal video organization/search, movies, TV shows, etc.) is to summarize the video in order to reduce its size and concentrate the high-value information in the video track. The Movie Summarization (MSUM) track in the annual TRECVID (TREC Video Retrieval Evaluation) benchmark was proposed to evaluate automatic systems working in the movie summarization domain. This track used a licensed movie dataset from Kinolorberedu, in which the goal is to summarize the storylines and roles of specific characters across a full movie. The goal of this track is twofold: to efficiently capture important facts about certain persons during their role in the movie storyline, and to assess how video summarization and textual summarization compare in this domain.

A Method of Image Dehazing Based on Atmospheric Veil Prediction by ResNet

  • Jie Zhang
  • Fan Li
  • Mengfei Kang
  • Xiongbiao Luo
  • Jing Zhao
  • Chuan Xiao
  • Haipeng Du
  • Huaijun Wang

Image defogging is an important prerequisite for video summarization. Existing defogging methods have weaknesses such as the long computation time required to estimate parameters like the transmission map and the atmospheric veil. We propose a new method that is independent of the transmission map: a deep ResNet is trained to learn and predict the atmospheric veil. Since no additional parameters need to be estimated for the defogging algorithm, the quality of scene radiance recovery is improved and parameter estimation is sped up. The experimental results on the dataset show that the training time required for parameter estimation is greatly reduced, while the defogging performance of the proposed method is improved and the distortion of depth integrity is minimal.
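
For context, the sketch below shows how a predicted atmospheric veil is typically plugged into the standard atmospheric scattering model to recover the scene radiance; the function and variable names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def recover_radiance(hazy, veil, airlight, eps=1e-3):
    """Recover scene radiance J from a hazy image I and a predicted veil V.

    Assumes the standard scattering model I = J * t + A * (1 - t), where the
    atmospheric veil is V = A * (1 - t); hence t = 1 - V / A and
    J = (I - V) / max(t, eps).

    hazy:     H x W x 3 image with values in [0, 1]
    veil:     H x W veil map predicted by the network (e.g., a ResNet)
    airlight: scalar estimate of the atmospheric light A
    """
    transmission = np.clip(1.0 - veil / airlight, eps, 1.0)        # t(x)
    radiance = (hazy - veil[..., None]) / transmission[..., None]  # J(x)
    return np.clip(radiance, 0.0, 1.0)
```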

Sequential Action Retrieval for Generating Narratives from Long Videos

  • Satoshi Yamazaki
  • Jianquan Liu
  • Mohan Kankanhalli

In this paper, we propose a novel event retrieval method called Sequential Action Retrieval, a work in progress towards generating video and text narratives of long-term events from long videos. Summarizing events of user interest from long videos is a challenging topic. Our proposed method aims at detecting long-term human activities defined by a sequence of action elements. By searching for action elements in a semantic video graph that structures the objects appearing in the videos and their relationships, our method is able to recognize complex action events, such as object-ownership changes involving two or more people. We conducted an initial evaluation of event-related person detection on the Narrative dataset. We introduce a new evaluation metric, KP-IDF1, to evaluate the accuracy of the appearances of related persons. Our proposed method achieves a KP-IDF1 of 76.4% for the case of bicycle theft.
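
KP-IDF1 itself is defined in the paper; purely as a reference point, the sketch below computes the standard IDF1 identification F-score that the name suggests it builds on. How the key-person restriction is applied on top of this is not specified here and is left to the paper.

```python
def idf1(idtp, idfp, idfn):
    """Standard IDF1 (Ristani et al.): the harmonic mean of identification
    precision and recall, computed from identity-level true positives,
    false positives, and false negatives."""
    return 2 * idtp / (2 * idtp + idfp + idfn)
```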

Narrative Graph for Narrative Generation from Long Videos

  • Rishabh Sheoran
  • Yongkang Wong
  • Jianquan Liu
  • Mohan Kankanhalli

Advancements in camera technology and cloud storage have led to a surge in video content creation, making videos more accessible. However, consuming raw, unprocessed, and lengthy videos can be unengaging. While videos with human-authored narratives (such as videos on YouTube) are captivating, creating such videos requires a tremendous amount of effort and skill, and scalability remains a bottleneck. To address this, we propose an algorithmic narrator that generates topic-specific narratives in real time from raw videos, inspired by ChatGPT's natural language processing capabilities. Specifically, we propose a novel narrative graph structure that captures narrative-worthy and semantically enriched factual information and establishes temporal and causal links between narrative segments. The narrative graph is then fed to the algorithmic narrator to generate a textual narrative summary. Our comprehensive empirical study demonstrates the potential of algorithmic narrators and narrative graphs in creating engaging and coherent narratives, offering insights for the future of video content consumption.
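
As a rough illustration of the idea (the paper defines its own schema; the node and edge fields below are hypothetical), a narrative graph can be thought of as video segments carrying factual annotations, linked by temporal and causal relations:

```python
import networkx as nx

# Hypothetical narrative graph: nodes are video segments annotated with
# narrative-worthy facts; parallel edges carry temporal or causal relations.
G = nx.MultiDiGraph()
G.add_node("seg_01", time=(0, 15), fact="a cyclist locks a bicycle at the rack")
G.add_node("seg_02", time=(90, 110), fact="another person walks away with the bicycle")
G.add_edge("seg_01", "seg_02", relation="temporal")  # seg_01 happens before seg_02
G.add_edge("seg_01", "seg_02", relation="causal")    # hypothesized causal link

# A topic-specific narrative could then be generated by linearizing a relevant
# subgraph and passing the ordered facts to a language model as context.
```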

A Study on the Use of Attention for Explaining Video Summarization

  • Evlampios Apostolidis
  • Vasileios Mezaris
  • Ioannis Patras

In this paper we present our study on the use of attention for explaining video summarization. We build on a recent work that formulates the task, called XAI-SUM, and we extend it by: a) taking into account two additional network architectures and b) introducing two novel explanation signals that relate to the entropy and diversity of attention weights. In total, we examine the effectiveness of seven types of explanation, using three state-of-the-art attention-based network architectures (CA-SUM, VASNet, SUM-GDA) and two datasets (SumMe, TVSum) for video summarization. The conducted evaluations show that the inherent attention weights are more suitable for explaining network architectures which integrate mechanisms for estimating attentive diversity (SUM-GDA) and uniqueness (CA-SUM). The explanation of simpler architectures (VASNet) can benefit from taking into account estimates about the strength of the input vectors, while another option is to consider the entropy of attention weights.
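
To make the entropy-based explanation signal concrete, the sketch below computes the entropy of each frame's attention distribution; this is an illustrative reading of "entropy of attention weights", not the paper's exact formulation.

```python
import numpy as np

def attention_entropy(weights, eps=1e-12):
    """Shannon entropy of one frame's attention distribution over all frames.
    The row is renormalized in case it does not sum exactly to one."""
    p = weights / (weights.sum() + eps)
    return float(-(p * np.log(p + eps)).sum())

def entropy_explanation(attn_matrix):
    """One possible explanation signal: low-entropy rows mark frames whose
    attention is concentrated on a few frames, while high-entropy rows mark
    frames that attend broadly (a sketch, not the paper's definition)."""
    return np.array([attention_entropy(row) for row in attn_matrix])
```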

Multimodal Video Captioning using Object-Auditory Information Fusion with Transformers

  • Berkay Selbes
  • Mustafa Sert

Video captioning aims to generate natural language sentences describing an input video. Generating coherent natural language sentences is a challenging task due to the complex nature of video content, which requires object and scene understanding, extraction of object- and event-specific auditory information, and acquisition of the relationships among objects. In this study, we address the problem of efficiently modeling object interactions in scenes, as they carry crucial information about the events in the visual scene. To this end, we propose to use object features along with auditory information to better model the audio-visual scene appearing in the video. Specifically, we extract object features with Faster R-CNN and auditory features with VGGish, and design a transformer encoder-decoder architecture in a multimodal setup. Experiments on MSR-VTT show encouraging results: combined with the auditory information, the object features model object interactions better than ResNet features.
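
A minimal sketch of how such an object-auditory fusion could be wired into a transformer encoder-decoder is shown below; feature dimensions, layer counts, and module names are assumptions for illustration, not the paper's configuration (positional encodings are omitted for brevity).

```python
import torch
import torch.nn as nn

class AVCaptioner(nn.Module):
    """Fuse Faster R-CNN object features (2048-d) and VGGish audio features
    (128-d) by projecting both to a shared space and concatenating them as
    the encoder input of a transformer encoder-decoder captioner."""
    def __init__(self, vocab_size, d_model=512):
        super().__init__()
        self.obj_proj = nn.Linear(2048, d_model)  # Faster R-CNN region features
        self.aud_proj = nn.Linear(128, d_model)   # VGGish clip embeddings
        self.word_emb = nn.Embedding(vocab_size, d_model)
        self.transformer = nn.Transformer(d_model=d_model, nhead=8,
                                          num_encoder_layers=4,
                                          num_decoder_layers=4,
                                          batch_first=True)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, obj_feats, aud_feats, tokens):
        # obj_feats: (B, N_obj, 2048); aud_feats: (B, N_aud, 128); tokens: (B, T)
        src = torch.cat([self.obj_proj(obj_feats),
                         self.aud_proj(aud_feats)], dim=1)  # early fusion
        tgt = self.word_emb(tokens)
        tgt_mask = self.transformer.generate_square_subsequent_mask(
            tgt.size(1)).to(tgt.device)                     # causal decoding mask
        dec = self.transformer(src, tgt, tgt_mask=tgt_mask)
        return self.out(dec)                                # (B, T, vocab_size)
```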

Story-to-Images Translation: Leveraging Diffusion Models and Large Language Models for Sequence Image Generation

  • Haruka Kumagai
  • Ryosuke Yamaki
  • Hiroki Naganuma

Diffusion models are catalyzing breakthroughs in creative fields, with a notable impact on text-to-image generation. This study centers on the transformation of textual narratives into coherent sequences of images - a process currently hampered by issues of consistency and contextual fidelity. To address these challenges, we propose a method utilizing a large language model, with an emphasis on context and character information. Empirical evaluations, carried out using Hollywood movie scripts, clearly indicate that our approach improves both the consistency and contextual fidelity of the resulting image sequences.
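
One way such a pipeline could look in practice is sketched below, assuming a Stable Diffusion checkpoint loaded via the diffusers library; the LLM step is represented by a hypothetical helper `scene_prompts_from_llm`, since the prompting strategy (carrying character and context information across scenes) is specific to the paper.

```python
from diffusers import StableDiffusionPipeline

def scene_prompts_from_llm(script_text):
    """Hypothetical helper: ask a large language model to split the script
    into scenes and return one prompt per scene, restating recurring
    character and context details so the rendered images stay consistent."""
    raise NotImplementedError

def story_to_images(script_text):
    # Model ID is an assumption; any text-to-image diffusion checkpoint works.
    pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
    return [pipe(prompt).images[0] for prompt in scene_prompts_from_llm(script_text)]
```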

A Systematic Study on Video Summarization: Approaches, Challenges, and Future Directions

  • Kajal Kansal
  • Nikita Kansal
  • Sreevaatsav Bavana
  • Bodla Krishna Vamshi
  • Nidhi Goyal

With the exponential growth of user-generated videos, video summarization has become a prominent research field for quickly understanding the essence of video content. The goal is to automate the task of acquiring key segments from the video while retaining its contextual semantics, and combining them to generate a summary. The major challenge is to identify important frames or segments corresponding to human perception, which varies from one genre to another. To this end, this survey furnishes a thorough panorama of the diverse categories of video summarization. In this work, we investigate, compare, and offer valuable insights into the progress and effectiveness of video summarization techniques. We discuss an end-to-end general pipeline to convey the complexity of the video summarization task, and we also discuss several benchmark datasets used to evaluate the performance of video summarization algorithms. Furthermore, this study explores various challenges specific to video summarization and potential future directions for further research, encouraging researchers to explore new avenues in video summarization.

SESSION: Invited Talk 2

Video Summarization at TRECVID - Past Efforts and What's Next

  • George Awad

In recent years, the exponential growth of multimedia content, particularly movies and videos, has posed significant challenges for content consumption and comprehension. The vast amount of available audiovisual data necessitates efficient and effective methods to extract and present concise yet informative summaries. Video and movie summarization, a sub-field of natural language processing (NLP) and computer vision, aims to address this challenge by generating brief synopses or highlights that capture the essence of the video or movie's plot, themes, and key moments. Advances in deep learning, natural language processing, and computer vision techniques have led to substantial progress in automatic video and movie summarization technology. Such technologies have found applications in various domains, including video indexing, recommendation systems, and content understanding. However, the effectiveness and quality of these automated summarization methods vary significantly depending on the algorithms, features, and datasets used for evaluation. While numerous video summarization algorithms have been proposed, the lack of standardized evaluation benchmarks and metrics hinders fair and meaningful comparisons among these methods. Existing evaluation approaches often lack comprehensiveness, making it challenging to discern the strengths and limitations of each technique. Consequently, this research gap emphasizes the need to establish a comprehensive benchmark that facilitates rigorous evaluation, comparison, and advancement of video and movie summarization technology. In this talk, I will give a brief history of TRECVID, which has been running for more than 20 years evaluating content-based video retrieval tasks. I will then specifically talk about the different video summarization tracks evaluated at TRECVID, starting from summarizing BBC Rushes and moving to BBC Eastenders TV episodes. Finally, I will present the latest movie summarization track and end with the deep video understanding challenge. For each presented task, I will talk about the dataset used, task goals, metrics, and high-level evaluation outputs of participating teams.