ICDAR '23: 4th Workshop on Intelligent Cross-Data Analysis and Retrieval
SESSION: Session 1: AI Applied to Medical Domain
Mask-conditioned latent diffusion for generating gastrointestinal polyp images
- Roman Macháček
- Leila Mozaffari
- Zahra Sepasdar
- Sravanthi Parasa
- Pål Halvorsen
- Michael A. Riegler
- Vajira Thambawita
To take advantage of artificial intelligence (AI) solutions in endoscopy diagnostics, we must overcome the issue of limited annotations. These limitations stem from the high privacy concerns in the medical field and from the need for expert involvement in the time-consuming and costly annotation of medical data. In computer vision, image synthesis has made significant progress in recent years thanks to advances in generative adversarial networks (GANs) and diffusion probabilistic models (DPMs), and novel DPMs have outperformed GANs in text, image, and video generation tasks. This study therefore proposes a conditional DPM framework that generates synthetic gastrointestinal (GI) polyp images conditioned on given generated segmentation masks. Our experimental results show that the system can generate an unlimited number of high-fidelity synthetic polyp images together with the corresponding ground-truth polyp masks. To test the usefulness of the generated data, we trained binary image segmentation models and studied the effect of adding synthetic data. The best micro-imagewise intersection over union (IoU) of 0.7751 was achieved by DeepLabv3+ when the training data consisted of both real and synthetic data. However, the results also show that achieving good segmentation performance with synthetic data depends heavily on the model architecture.
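For readers unfamiliar with mask conditioning, the following minimal PyTorch sketch illustrates the general idea behind such conditional DPMs; it is our own illustration under stated assumptions, not the authors' code. The key point is that the segmentation mask is concatenated to the noisy image as extra input channels, so the predicted noise, and hence the generated image, depends on the mask. The tiny convolutional stack stands in for a full UNet, and the noise schedule and timestep embedding are omitted for brevity.

```python
import torch
import torch.nn as nn

class MaskConditionedDenoiser(nn.Module):
    """Stand-in for a mask-conditioned UNet denoiser (illustrative only)."""

    def __init__(self, image_channels=3, mask_channels=1, hidden=64):
        super().__init__()
        # The mask enters as extra input channels, so every prediction
        # is conditioned on the polyp segmentation mask.
        self.net = nn.Sequential(
            nn.Conv2d(image_channels + mask_channels, hidden, 3, padding=1),
            nn.SiLU(),
            nn.Conv2d(hidden, image_channels, 3, padding=1),
        )

    def forward(self, noisy_image, mask):
        x = torch.cat([noisy_image, mask], dim=1)
        return self.net(x)  # predicted noise

# One toy training step: predict the noise that was added to the image.
model = MaskConditionedDenoiser()
image = torch.randn(2, 3, 64, 64)                    # toy "real" images
mask = torch.randint(0, 2, (2, 1, 64, 64)).float()   # ground-truth masks
noise = torch.randn_like(image)
noisy = image + noise                                # noise schedule omitted
loss = nn.functional.mse_loss(model(noisy, mask), noise)
loss.backward()
```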
SESSION: Session 2: AI for Transportation
Procedural Driving Skill Coaching from More Skilled Drivers to Safer Drivers: A Survey
- Wenbin Gan
- Minh-Son Dao
- Koji Zettsu
Improving driver behavior through driving education and coaching is widely recognized as necessary and effective for safe driving and for reducing traffic accidents, since these interventions address the human factors that account for most crash involvements. Driver education programs are currently widely employed in many countries to impart the necessary procedural driving skills and competencies and thereby produce more skilled drivers. However, making people more skilled drivers does not make them safer ones: the effectiveness of driving education is greatly restricted by the limited amount of actual supervised driving involved and the absence of individualized feedback. With recent technological developments in intelligent vehicles and transportation, driving coaching has emerged as a more practical alternative for developing safe driving, proactively providing feedback to enhance skills and cultivate corrective behaviors. This paper presents a systematic review of existing studies, examining the empirical evidence on the various coaching approaches for developing drivers' procedural driving skills. In particular, we propose a taxonomy that classifies existing driving coaching into four categories, and we explore three questions: what types of coaching exist, and when and how the different kinds of driving coaching are provided and delivered. Finally, challenges and future directions are presented from three aspects.
Towards Multimodal Spatio-Temporal Transformer-based Models for Traffic Congestion Prediction
- Huy Quang Ung
- Yutaro Mishima
- Hao Niu
- Shinya Wada
Robust predictions of traffic congestion play a crucial role in intelligent transportation systems. Recently, multimodal data such as rainfall, social network posts, and incident reports have been used to improve the performance of traffic congestion prediction models. In this paper, we use both dynamic people-flow and rainfall data together with a transformer-based prediction model for traffic congestion prediction, and we experiment with an early fusion method for combining the multimodal data. Our experiments are conducted on two private datasets containing congestion and people-flow data, alongside a corresponding public dataset that provides rainfall data. The results indicate that incorporating the people-flow data into our prediction model enhances its performance.
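As a rough illustration of early fusion in this setting, the sketch below (our own simplified example; all dimensions and layer sizes are assumptions, not taken from the paper) concatenates per-timestep congestion, people-flow, and rainfall features before a standard transformer encoder:

```python
import torch
import torch.nn as nn

class EarlyFusionCongestionModel(nn.Module):
    def __init__(self, d_congestion=8, d_flow=8, d_rain=2, d_model=64):
        super().__init__()
        self.proj = nn.Linear(d_congestion + d_flow + d_rain, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, 1)  # congestion level at the next step

    def forward(self, congestion, flow, rain):
        # Early fusion: concatenate modalities along the feature axis
        # before any sequence modelling takes place.
        x = torch.cat([congestion, flow, rain], dim=-1)
        h = self.encoder(self.proj(x))     # (batch, time, d_model)
        return self.head(h[:, -1])         # predict from the last timestep

model = EarlyFusionCongestionModel()
pred = model(torch.randn(4, 12, 8),   # congestion history
             torch.randn(4, 12, 8),   # people-flow history
             torch.randn(4, 12, 2))   # rainfall history
```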
SESSION: Session 3: AI Applied to Texts and Images
CG-GNN: A Novel Compiled Graphs-based Feature Extraction Method for Enterprise Social Networks
- Tatsuya Konishi
- Shuichiro Haruta
- Mori Kurokawa
- Kenta Tsukatsune
- Yuto Mizutani
- Tomoaki Saito
- Hideki Asoh
- Chihiro Ono
In this paper, we propose CG-GNN, a novel compiled graphs-based feature extraction method for Enterprise Social Networks (ESNs). For an ESN provider, extracting features from a given social graph is essential. However, since the amount of data available for a single enterprise is often limited, it is necessary to utilize data from other enterprises. We hypothesize that each enterprise has its own enterprise-specific features, while a general structure underlies all enterprises. To reflect this hypothesis, our approach introduces "compiled graphs" that capture enterprise-specific features by mapping each enterprise's data through functions dedicated to that enterprise. The compiled graphs are then handled by Graph Neural Networks (GNNs) that are shared across all enterprises to extract general structural information. The representations obtained by CG-GNN are therefore balanced between enterprise-specific and enterprise-generic characteristics. Through experiments on private and publicly available datasets, we show that CG-GNN outperforms baselines by a large margin. In a practical scenario, we compute the ideal input of the proposed method for the purpose of ESN revitalization; this experiment also demonstrates the method's feasibility, and we believe the results are useful for many ESN providers.
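The division of labor between enterprise-specific mappings and a shared GNN can be sketched as follows; this is a simplified illustration of the stated design, with all names and dimensions being our own assumptions and a basic mean-aggregation layer standing in for whatever GNN architecture the paper actually uses:

```python
import torch
import torch.nn as nn

class SharedGNNLayer(nn.Module):
    """One shared propagation layer: mean over neighbours, then transform."""

    def __init__(self, d_in, d_out):
        super().__init__()
        self.lin = nn.Linear(d_in, d_out)

    def forward(self, x, adj):
        deg = adj.sum(dim=-1, keepdim=True).clamp(min=1)
        return torch.relu(self.lin(adj @ x / deg))

class CGGNNSketch(nn.Module):
    def __init__(self, n_enterprises, d_in=16, d_hidden=32):
        super().__init__()
        # Enterprise-specific mappings produce the "compiled graph" features.
        self.enterprise_maps = nn.ModuleList(
            nn.Linear(d_in, d_hidden) for _ in range(n_enterprises)
        )
        # A single GNN shared by all enterprises captures generic structure.
        self.shared_gnn = SharedGNNLayer(d_hidden, d_hidden)

    def forward(self, x, adj, enterprise_id):
        x = self.enterprise_maps[enterprise_id](x)
        return self.shared_gnn(x, adj)

model = CGGNNSketch(n_enterprises=3)
x = torch.randn(10, 16)                   # node features for one enterprise
adj = (torch.rand(10, 10) > 0.7).float()  # toy adjacency matrix
emb = model(x, adj, enterprise_id=0)      # balanced node representations
```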
A Joint Scene Text Recognition and Visual Appearance Model for Protest Issue Classification
- Xuan Luo
- Sota Kato
- Asahi Obata
- Budrul Ahsan
- Ryotaro Okada
- Takafumi Nakanishi
Social movements have been a crucial means of political participation in democratic and non-democratic societies. In recent years, with the Arab Spring in 2011 as a notable example, such movements have aggressively used social media as a platform for organizing protesters. Visual images and texts spread across the internet have played vital roles in bonding and attracting citizens to the movements. Political scientists and sociologists have attempted to exploit the vast amount of information on the internet to analyze social movements, but their use of artificial intelligence techniques has been limited. In this paper, we introduce a new joint scene-text recognition and visual appearance model. Specifically, our model employs character-level scene text detection and an n-gram character embedding model. It can extract semantic information from scene texts, including handwritten words on signs and placards, which past automated studies of social movements by social scientists have not yet utilized. By employing such semantic information, our model can, in contrast to past research, classify protest images into semantic categories such as types of protest issues (e.g., gender, race, green). To evaluate the model, we created the new Protest Issue Image Dataset consisting of 2,859 images across five protest issue categories. Our model provided a significant boost in F1 score over the conventional visual model used by social scientists. It can be applied to various types of social movement studies, including recent research that examines how different types of protest issues relate to other political and social factors.
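As a rough sketch of how scene-text semantics and visual appearance might be fused for issue classification (our own minimal example; the detection and recognition stages are out of scope, and the recognized words are assumed to arrive as token indices):

```python
import torch
import torch.nn as nn

class TextVisualProtestClassifier(nn.Module):
    def __init__(self, vocab_size=5000, d_text=64, d_visual=128, n_issues=5):
        super().__init__()
        self.text_emb = nn.EmbeddingBag(vocab_size, d_text)  # bag of n-grams
        self.visual = nn.Sequential(                         # tiny CNN stand-in
            nn.Conv2d(3, 16, 3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(16, d_visual),
        )
        self.classifier = nn.Linear(d_text + d_visual, n_issues)

    def forward(self, image, text_tokens, offsets):
        t = self.text_emb(text_tokens, offsets)  # semantics of sign text
        v = self.visual(image)                   # visual appearance signal
        return self.classifier(torch.cat([t, v], dim=-1))

model = TextVisualProtestClassifier()
image = torch.randn(2, 3, 224, 224)
tokens = torch.randint(0, 5000, (12,))   # words read off signs and placards
offsets = torch.tensor([0, 7])           # split tokens between the 2 images
logits = model(image, tokens, offsets)   # scores over 5 issue categories
```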
SESSION: Session 4: AI for Recommendation Systems
User-irrelevant Cross-domain Association Analysis for Cross-domain Recommendation with Transfer Learning
- Hao Niu
- Duc Nguyen
- Kei Yonekawa
- Mori Kurokawa
- Chihiro Ono
- Daichi Amagata
- Takuya Maekawa
- Takahiro Hara
Cross-domain recommendation (CDR) is an effective approach to boost user experience and expand business. Traditional CDR methods generally rely on sharing user-relevant data between domains (e.g., user-item interaction data or user-overlap information). However, this approach is unrealistic in many practical applications due to user data policies. Some works attempt to circumvent this limitation by leveraging other forms of overlapping data (e.g., item, content, or tag overlap). However, such a solution is not always possible, especially when these forms of overlap are either non-existent or unknown. Until now, few studies have focused on this intractable CDR task, which prohibits both sharing user-relevant data and using other types of overlap information.
Transfer learning techniques can be used to mine and exploit explicit or implicit relationships between domains and then transfer knowledge learned from one domain to another. Such techniques can be effective in addressing the aforementioned intractable CDR task. In this work, we first propose a user-irrelevant cross-domain (CD) association analysis method that mines the CD co-occurrence relationships among the items of different domains. Then, using the mined CD co-occurring items, we generate transferable item embeddings across domains and realize CDR. We evaluate our proposal on public datasets and confirm its effectiveness.
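A minimal sketch of what user-irrelevant co-occurrence mining could look like, assuming only anonymized event logs of (timestamp, domain, item) are available: items from different domains observed within the same time window count as co-occurring, and no user identifiers are involved. The window length and log format are our assumptions, not the paper's:

```python
from collections import Counter

def mine_cd_cooccurrence(events, window=3600):
    """events: list of (timestamp_seconds, domain, item) tuples."""
    events = sorted(events)  # order by timestamp
    pairs = Counter()
    for i, (t_i, dom_i, item_i) in enumerate(events):
        for t_j, dom_j, item_j in events[i + 1:]:
            if t_j - t_i > window:
                break  # events are sorted, so later ones are out of window
            if dom_i != dom_j:  # only cross-domain pairs are of interest
                pairs[(item_i, item_j)] += 1
    return pairs

events = [
    (0, "movies", "m1"), (100, "books", "b7"),
    (200, "books", "b2"), (5000, "movies", "m3"),
]
# Frequent cross-domain pairs would then seed transferable item embeddings.
print(mine_cd_cooccurrence(events).most_common(3))
```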
MTSS: Movie Trailers Surveillance System using Social Media Analytics and Public Mood
- Ioannis Prokopiou
- Pantelis Vikatos
- Christos Chatzis
- Christos Christodoulou
The movie industry is characterized by high levels of uncertainty because sales and income depend on many complex factors and are therefore difficult to predict. Because of the industry's significant upfront costs, investors must base their decisions on accurate methodologies for estimating the success or returns of their investments. With social networks' widespread use in daily life, there is an unprecedented amount of publicly available posts containing emotional signals that can be analyzed. In this paper, we showcase our movie trailer campaign surveillance system, which uses social media analytics and public mood. The system provides users with an interactive dashboard and the capability for further processing, analysis, and visualization of data in customized reports. It also offers a configurable environment for gathering metrics from heterogeneous social media platforms and extracting advanced analytics through dedicated AI tools for sentiment and emotion extraction. This solution enables continuous surveillance of a movie trailer's popularity and supports the success of marketing campaigns.
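As an illustration of the kind of sentiment extraction such a dashboard might run over trailer-related posts, here is a minimal example using the Hugging Face transformers pipeline; the actual AI tools behind MTSS are not described at this level of detail, so this is purely a sketch:

```python
from transformers import pipeline

# Loads a default pretrained sentiment model on first use.
sentiment = pipeline("sentiment-analysis")

posts = [
    "That trailer gave me chills, day-one ticket for sure!",
    "Looks like yet another forgettable reboot.",
]
for post, result in zip(posts, sentiment(posts)):
    # Each result carries a label (POSITIVE/NEGATIVE) and a confidence score.
    print(f"{result['label']:>8} ({result['score']:.2f}): {post}")
```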
SESSION: Session 5: AI Applied to Cheapfakes Detection
Leveraging Cross-Modals for Cheapfakes Detection
- Thanh-Son Nguyen
- Vinh Dang
- Minh-Triet Tran
- Duc-Tien Dang-Nguyen
Detecting cheapfakes, particularly out-of-context images, is crucial for maintaining the integrity of information and preserving trust in multimedia content. This study proposes a new cross-modal approach to cheapfakes detection that blends Natural Language Processing and Computer Vision techniques, including Image Captioning, Named Entity Recognition, and Natural Language Inference. Our approach enhances the robustness of cross-modal methods by considering both the meaning and the context of the image captions and by using additional information such as named entities. In our experiments, the method achieved an accuracy that improves on previous approaches. This paper highlights the potential of combining Natural Language Processing and Computer Vision techniques to tackle real-world problems, making it a significant advancement in cheapfakes detection.
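The Natural Language Inference component can be illustrated with an off-the-shelf MNLI model: given two textual descriptions of the same image, a contradiction verdict hints at an out-of-context pair. This is a generic sketch, not the authors' pipeline; the captioning and named-entity stages are omitted:

```python
from transformers import pipeline

# An off-the-shelf NLI model; the text-classification pipeline accepts
# a premise/hypothesis pair via the text/text_pair keys.
nli = pipeline("text-classification", model="roberta-large-mnli")

caption_a = "Protesters march through downtown Oslo in 2020."
caption_b = "Tourists enjoy a summer festival in Rome."
result = nli({"text": caption_a, "text_pair": caption_b})
print(result)  # a CONTRADICTION label suggests an out-of-context pair
```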
Detecting Cheapfakes using Self-Query Adaptive-Context Learning
- Kha-Luan Pham
- Manh-Thien Nguyen
- Anh-Duy Tran
- Minh-Son Dao
- Duc-Tien Dang-Nguyen
Detecting cheapfakes often requires identifying contextual changes in media resulting from misleading captions. Cheapfake media can be created either by manipulating content with image or video editing software, or by altering the context of an image or video through misleading claims, without relying on any software at all. While previous research has shown promising results, these approaches are limited to the data used during training. To overcome this limitation, we propose a Self-Query Adaptive-Context Learning method that is flexible and capable of adapting to new contexts at inference time by enriching its knowledge with image search engine queries. By verifying the context of captions against the collected information, our approach extends its knowledge in a flexible manner. While it achieves an experimental accuracy of 59.70% on the IEEE ICME 2023 Cheapfakes Challenge dataset, our work opens up new avenues for detecting out-of-context misuse.
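The self-query idea can be sketched as follows. Here reverse_image_search and the nli_model callable are hypothetical placeholders (the paper's actual retrieval backend and inference model are not specified here); the point is the control flow of enriching context at inference time and voting over the retrieved evidence:

```python
def reverse_image_search(image_path):
    """Hypothetical stand-in: return captions of visually similar web images."""
    raise NotImplementedError("plug in a real reverse image search API here")

def verify_caption(image_path, caption, nli_model):
    # Collect external evidence about the image at inference time, then
    # test the suspect caption against each retrieved description.
    evidence = reverse_image_search(image_path)
    verdicts = [nli_model(premise=e, hypothesis=caption) for e in evidence]
    # Flag the caption as out-of-context if the evidence mostly contradicts it.
    contradictions = sum(v == "contradiction" for v in verdicts)
    return contradictions > len(verdicts) / 2
```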