# MM '20: Proceedings of the 28th ACM International Conference on Multimedia

## SESSION: Poster Session A2: Deep Learning for Multimedia

### Drum Synthesis and Rhythmic Transformation with Adversarial Autoencoders

• Maciej Tomczak
• Masataka Goto
• Jason Hockman

Creative rhythmic transformations of musical audio refer to automated methods for manipulation of temporally-relevant sounds in time. This paper presents a method for joint synthesis and rhythm transformation of drum sounds through the use of adversarial autoencoders (AAE). Users may navigate both the timbre and rhythm of drum patterns in audio recordings through expressive control over a low-dimensional latent space. The model is based on an AAE with Gaussian mixture latent distributions that introduce rhythmic pattern conditioning to represent a wide variety of drum performances. The AAE is trained on a dataset of bar-length segments of percussion recordings, along with their clustered rhythmic pattern labels. The decoder is conditioned during adversarial training for mixing of data-driven rhythmic and timbral properties. The system is trained with over 500000 bars from 5418 tracks in popular datasets covering various musical genres. In an evaluation using real percussion recordings, the reconstruction accuracy and latent space interpolation between drum performances are investigated for audio generation conditioned by target rhythmic patterns.

### MMNet: Multi-Stage and Multi-Scale Fusion Network for RGB-D Salient Object Detection

• Guibiao Liao
• Wei Gao
• Qiuping Jiang
• Ronggang Wang
• Ge Li

Most existing RGB-D salient object detection (SOD) methods directly extract and fuse raw features from RGB and depth backbones. Such methods can be easily restricted by low-quality depth maps and redundant cross-modal features. To effectively capture multi-scale cross-modal fusion features, this paper proposes a novel Multi-stage and Multi-Scale Fusion Network (MMNet), which consists of a cross-modal multi-stage fusion module (CMFM) and a bi-directional multi-scale decoder (BMD). Similar to the mechanism of the visual color stage doctrine in the human visual system, the proposed CMFM aims to explore the useful and important feature representations in the feature response stage, and effectively integrate them into available cross-modal fusion features in the adversarial combination stage. Moreover, the proposed BMD learns the combination of cross-modal fusion features from multiple levels to capture both local and global information of salient objects and further boost the performance of the proposed method. Comprehensive experiments demonstrate that the proposed method achieves consistently superior performance over 14 other state-of-the-art methods on six popular RGB-D datasets when evaluated by 8 different metrics.

### Stable Video Style Transfer Based on Partial Convolution with Depth-Aware Supervision

• Songhua Liu
• Hao Wu
• Shoutong Luo
• Zhengxing Sun

As an important research issue in digital media art, neural learning based video style transfer has attracted increasing attention. Many recent works import optical flow methods into the original image style transfer framework to preserve frame coherency and prevent flicker. However, these methods rely heavily on paired video datasets of content video and stylized video, which are often difficult to obtain. Another limitation of existing methods is that, while maintaining inter-frame coherency, they introduce strong ghosting artifacts. To address these problems, this paper makes the following contributions: (1) it presents a novel training framework for video style transfer that does not depend on a video dataset of the target style; (2) it is the first to focus on the ghosting problem present in most previous works, using a partial convolution-based strategy to exploit inter-frame context and correlation, together with an additional depth loss as a constraint on the generated frames, to suppress ghosting artifacts and preserve stability at the same time. Extensive experiments demonstrate that our method can produce natural and stable video frames with target style. Qualitative and quantitative comparisons also show that the proposed approach outperforms previous works in terms of overall image quality and inter-frame stability. To facilitate future research, we publish our experiment code at https://github.com/Huage001/Artistic-Video-Partial-Conv-Depth-Loss.

### Video Synthesis via Transform-Based Tensor Neural Network

• Yimeng Zhang
• Xiao-Yang Liu
• Bo Wu
• Anwar Walid

Video frame synthesis is an important task in computer vision and has drawn great interest across a wide range of applications. However, existing neural network methods do not explicitly impose tensor low-rankness of videos to capture the spatiotemporal correlations in a high-dimensional space, while existing iterative algorithms require hand-crafted parameters and take relatively long running time. In this paper, we propose a novel multi-phase deep neural network, Transform-Based Tensor-Net, that exploits the low-rank structure of video data in a learned transform domain and unfolds an Iterative Shrinkage-Thresholding Algorithm (ISTA) for tensor signal recovery. Our design is based on two observations: (i) both linear and nonlinear transforms can be implemented by a neural network layer, and (ii) the soft-thresholding operator corresponds to an activation function. Further, such an unfolding design is able to achieve nearly real-time performance at the cost of training time and enjoys an interpretable nature as a byproduct. Experimental results on the KTH and UCF-101 datasets show that compared with the state-of-the-art methods, i.e., DVF and Super SloMo, the proposed scheme improves Peak Signal-to-Noise Ratio (PSNR) of video interpolation and prediction by 4.13 dB and 4.26 dB, respectively.
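To make observation (ii) concrete, the sketch below shows the classical vector form of ISTA for a sparse least-squares problem. It is not the paper's tensor network: the function names `soft_threshold` and `ista`, the list-of-lists matrix layout, and the fixed step size are all assumptions chosen for illustration. Each loop iteration is a linear step followed by a shrinkage nonlinearity, which is the structure an unfolded network would replace with learned layers.

```python
def soft_threshold(x, tau):
    """Shrink x toward zero by tau; values inside [-tau, tau] become 0."""
    if x > tau:
        return x - tau
    if x < -tau:
        return x + tau
    return 0.0


def ista(A, y, lam, step, n_iters=200):
    """Minimise 0.5 * ||A x - y||^2 + lam * ||x||_1 (illustrative vector ISTA).

    A is a list of rows, y a list of floats. `step` should not exceed
    1 / L, where L is the largest eigenvalue of A^T A.
    """
    m, n = len(A), len(A[0])
    x = [0.0] * n
    for _ in range(n_iters):
        # Residual r = A x - y
        r = [sum(A[i][j] * x[j] for j in range(n)) - y[i] for i in range(m)]
        # Gradient of the quadratic term: g = A^T r
        g = [sum(A[i][j] * r[i] for i in range(m)) for j in range(n)]
        # Gradient step (linear layer) followed by shrinkage (activation)
        x = [soft_threshold(x[j] - step * g[j], step * lam) for j in range(n)]
    return x
```

With an identity matrix for `A`, the iteration reduces to a single soft-thresholding of `y`, which is a convenient sanity check.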

### Cluster Attention Contrast for Video Anomaly Detection

• Ziming Wang
• Yuexian Zou
• Zeming Zhang

Anomaly detection in videos is commonly referred to as the discrimination of events that do not conform to expected behaviors. Most existing methods formulate video anomaly detection as an outlier detection task and establish a normal concept by minimizing reconstruction loss or prediction loss on training data. However, the performance of these methods drops when they cannot guarantee either higher reconstruction errors for abnormal events or lower prediction errors for normal events. To avoid these problems, we introduce a novel contrastive representation learning task, Cluster Attention Contrast, to establish subcategories of normality as clusters. Specifically, we employ multi-parallel projection layers to project snippet-level video features into multiple discriminative feature spaces. Each of these feature spaces corresponds to a cluster that captures a distinct subcategory of normality. To acquire reliable subcategories, we propose the Cluster Attention Module to draw the cluster attention representation of each snippet, then maximize the agreement of the representations from the same snippet under random data augmentations via momentum contrast. In this manner, we establish a robust normal concept without any prior assumptions on reconstruction errors or prediction errors. Experiments show our approach achieves state-of-the-art performance on benchmark datasets.
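For readers unfamiliar with momentum contrast, the following minimal sketch shows its two generic ingredients: an InfoNCE-style loss that rewards agreement between two views of the same snippet, and a slowly updated key encoder. It illustrates the general technique only, not the Cluster Attention Module; the function names, the plain-list vectors, and the default `temperature` and `m` values are assumptions.

```python
import math


def info_nce(query, positive_key, negative_keys, temperature=0.07):
    """Contrastive loss: -log softmax score of the positive key.

    `query` and `positive_key` stand for two augmented views of one snippet;
    `negative_keys` stand for representations of other snippets.
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    logits = [dot(query, positive_key) / temperature]
    logits += [dot(query, k) / temperature for k in negative_keys]
    # Numerically stable log-sum-exp
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_sum - logits[0]


def momentum_update(key_params, query_params, m=0.999):
    """Move key-encoder parameters slowly toward the query encoder."""
    return [m * k + (1.0 - m) * q for k, q in zip(key_params, query_params)]
```

The loss is small when the query is closer to its positive key than to any negative, and the momentum update keeps the key representations consistent across training steps.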

### Automatic Interest Recognition from Posture and Behaviour

• Wolmer Bigi
• Claudio Baecchi
• Alberto Del Bimbo

In recent years, the clothing industry has attracted a lot of interest from researchers. Increasing research efforts have been devoted to giving the buyer a way to improve the shopping experience by suggesting meaningful items to purchase. These efforts result in works aiming at suggesting good matches for clothes, but seem to lack one important aspect: understanding the user's interest. In fact, to suggest something it is first necessary to collect the user's personal interests, or something about his or her previous purchases. Without this information, no personalized suggestion can be made. User interest understanding makes it possible to recognize whether a user is showing interest in a product he or she is looking at, acquiring precious information that can be later leveraged. Usually user interest is associated with facial expressions, but these are known to be easily falsifiable. Moreover, when privacy is a concern, faces are often impossible to exploit. To address all these aspects, we propose an automatic system that aims to recognize the user's interest towards a garment by just looking at body posture and behaviour. To train and evaluate our system we create a body pose interest dataset, named BodyInterest, which consists of 30 users looking at garments for a total of approximately 6 hours of videos. Extensive evaluations show the effectiveness of our proposed method.

### Referenceless Rate-Distortion Modeling with Learning from Bitstream and Pixel Features

• Yangfan Sun
• Li Li
• Zhu Li
• Shan Liu

Generally, adaptive bitrates for variable Internet bandwidths can be obtained through multi-pass coding. Referenceless prediction-based methods show practical benefits compared with multi-pass coding because they avoid excessive computational resource consumption, especially in low-latency circumstances. However, most of them fail to predict precisely due to the complex inner structure of modern codecs. Therefore, to improve the fidelity of prediction, we propose a referenceless prediction-based R-QP modeling (PmR-QP) method to estimate bitrate by leveraging a deep learning algorithm with only one-pass coding. It refines the global rate-control paradigm in modern codecs in terms of flexibility and applicability with as few adjustments as possible. By exploring the potential of bitstream and pixel features from the prerequisite of one-pass coding, it can reach the expectation of bitrate estimation in terms of precision. To be more specific, we first describe the R-QP relationship curve as a robust quadratic R-QP modeling function derived from the Cauchy-based distribution. Second, we simplify the modeling function by fixing one operational point of the relationship curve received from the coding process. Third, we learn the model parameters from bitstream and pixel features, which we name hybrid referenceless features, comprising texture information, hierarchical coding structure, and selected modes in intra-prediction. Extensive experiments demonstrate the proposed method significantly decreases the proportion of samples' bitrate estimation error within 10% by 24.60% on average over the state-of-the-art.

### MS2L: Multi-Task Self-Supervised Learning for Skeleton Based Action Recognition

• Lilang Lin
• Sijie Song
• Wenhan Yang
• Jiaying Liu

In this paper, we address self-supervised representation learning from human skeletons for action recognition. Previous methods, which usually learn feature representations from a single reconstruction task, may suffer from overfitting, and the features are not generalizable for action recognition. Instead, we propose to integrate multiple tasks to learn more general representations in a self-supervised manner. To realize this goal, we integrate motion prediction, jigsaw puzzle recognition, and contrastive learning to learn skeleton features from different aspects. Skeleton dynamics can be modeled through motion prediction by predicting the future sequence. Temporal patterns, which are critical for action recognition, are learned through solving jigsaw puzzles. We further regularize the feature space by contrastive learning. Besides, we explore different training strategies to utilize the knowledge from self-supervised tasks for action recognition. We evaluate our multi-task self-supervised learning approach with action classifiers trained under different configurations, including unsupervised, semi-supervised and fully-supervised settings. Our experiments on the NW-UCLA, NTU RGB+D, and PKUMMD datasets show remarkable performance for action recognition, demonstrating the superiority of our method in learning more discriminative and general features. Our project website is available at https://langlandslin.github.io/projects/MSL/.

### Domain-Adaptive Object Detection via Uncertainty-Aware Distribution Alignment

• Dang-Khoa Nguyen
• Wei-Lun Tseng
• Hong-Han Shuai

Domain adaptation aims to transfer knowledge from the source data with annotations to scarcely-labeled data in the target domain, which has attracted a lot of attention in recent years and facilitated many multimedia applications. Recent approaches have shown the effectiveness of using adversarial learning to reduce the distribution discrepancy between the source and target images by aligning their distributions at both image and instance levels. However, this remains challenging since two domains may have distinct background scenes and different objects. Moreover, complex combinations of objects and a variety of image styles deteriorate the unsupervised cross-domain distribution alignment. To address these challenges, in this paper, we design an end-to-end approach for unsupervised domain adaptation of object detectors. Specifically, we propose a Multi-level Entropy Attention Alignment (MEAA) method that consists of two main components: (1) a Local Uncertainty Attentional Alignment (LUAA) module that helps the model better perceive structure-invariant objects of interest by utilizing information theory to measure the uncertainty of each local region via the entropy of the pixel-wise domain classifier and (2) a Multi-level Uncertainty-Aware Context Alignment (MUCA) module that enriches domain-invariant information of relevant objects based on the entropy of multi-level domain classifiers. The proposed MEAA is evaluated in four domain-shift object detection scenarios. Experiment results demonstrate state-of-the-art performance on three challenging scenarios and competitive performance on one benchmark dataset.
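As a small illustration of measuring per-region uncertainty with entropy, the sketch below computes the normalized binary entropy of a domain classifier's output at each spatial location. It covers only the entropy computation; how MEAA turns such a map into attention is not described in the abstract, and the names `binary_entropy` and `uncertainty_map`, as well as the assumption of a binary source-versus-target probability per location, are hypothetical.

```python
import math


def binary_entropy(p, eps=1e-12):
    """Entropy (in nats) of a Bernoulli(p) prediction, clipped for stability."""
    p = min(max(p, eps), 1.0 - eps)
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


def uncertainty_map(domain_probs):
    """Map per-location domain probabilities to uncertainty scores in [0, 1].

    `domain_probs` is a 2D list where each entry is the (assumed) probability
    that the location comes from the source domain. A score near 1 means the
    classifier cannot tell the domains apart at that location.
    """
    max_entropy = math.log(2.0)
    return [[binary_entropy(p) / max_entropy for p in row] for row in domain_probs]
```

A probability of 0.5 gives the maximum score of 1, while a confident prediction near 0 or 1 gives a score near 0.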

### MM-Hand: 3D-Aware Multi-Modal Guided Hand Generation for 3D Hand Pose Synthesis

• Zhenyu Wu
• Duc Hoang
• Shih-Yao Lin
• Yusheng Xie
• Liangjian Chen
• Yen-Yu Lin
• Zhangyang Wang
• Wei Fan

Estimating the 3D hand pose from a monocular RGB image is important but challenging. A solution is training on large-scale RGB hand images with accurate 3D hand keypoint annotations. However, it is too expensive in practice. Instead, we develop a learning-based approach to synthesize realistic, diverse, and 3D pose-preserving hand images under the guidance of 3D pose information. We propose a 3D-aware multi-modal guided hand generative network (MM-Hand), together with a novel geometry-based curriculum learning strategy. Our extensive experimental results demonstrate that the 3D-annotated images generated by MM-Hand qualitatively and quantitatively outperform existing options. Moreover, the augmented data can consistently improve the quantitative performance of the state-of-the-art 3D hand pose estimators on two benchmark datasets. The code will be available at https://github.com/ScottHoang/mm-hand.

### Joint Self-Attention and Scale-Aggregation for Self-Calibrated Deraining Network

• Cong Wang
• Yutong Wu
• Zhixun Su
• Junyang Chen

In the field of multimedia, single image deraining is a basic pre-processing task that can greatly improve the results of subsequent high-level tasks in rainy conditions. In this paper, we propose an effective algorithm, called JDNet, to solve the single image deraining problem and conduct the segmentation and detection task for applications. Specifically, considering the important information on multi-scale features, we propose a Scale-Aggregation module to learn the features with different scales. Simultaneously, a Self-Attention module, which can match or outperform its convolutional counterparts, is introduced to allow the feature aggregation to adapt to each channel. Furthermore, to improve the basic convolutional feature transformation process of Convolutional Neural Networks (CNNs), Self-Calibrated convolution is applied to build long-range spatial and inter-channel dependencies around each spatial location, which explicitly expand the field-of-view of each convolutional layer through internal communications and hence enrich the output features. By carefully combining the Scale-Aggregation and Self-Attention modules with Self-Calibrated convolution, the proposed model achieves better deraining results on both real-world and synthetic datasets. Extensive experiments are conducted to demonstrate the superiority of our method compared with state-of-the-art methods. The source code will be available at https://supercong94.wixsite.com/supercong94.

### Hybrid Dynamic-static Context-aware Attention Network for Action Assessment in Long Videos

• Ling-An Zeng
• Fa-Ting Hong
• Wei-Shi Zheng
• Qi-Zhi Yu
• Wei Zeng
• Yao-Wei Wang
• Jian-Huang Lai

The objective of action quality assessment is to score sports videos. However, most existing works focus only on video dynamic information (i.e., motion information) but ignore the specific postures that an athlete is performing in a video, which is important for action assessment in long videos. In this work, we present a novel hybrid dynAmic-static Context-aware attenTION NETwork (ACTION-NET) for action assessment in long videos. To learn more discriminative representations for videos, we not only learn the video dynamic information but also focus on the static postures of the detected athletes in specific frames, which represent the action quality at certain moments, along with the help of the proposed hybrid dynamic-static architecture. Moreover, we leverage a context-aware attention module consisting of a temporal instance-wise graph convolutional network unit and an attention unit for both streams to extract more robust stream features, where the former is for exploring the relations between instances and the latter for assigning a proper weight to each instance. Finally, we combine the features of the two streams to regress the final video score, supervised by ground-truth scores given by experts. Additionally, we have collected and annotated the new Rhythmic Gymnastics dataset, which contains videos of four different types of gymnastics routines, for evaluation of action quality assessment in long videos. Extensive experimental results validate the efficacy of our proposed method, which outperforms related approaches.

### F2GAN: Fusing-and-Filling GAN for Few-shot Image Generation

• Yan Hong
• Li Niu
• Jianfu Zhang
• Weijie Zhao
• Chen Fu
• Liqing Zhang

In order to generate images for a given category, existing deep generative models generally rely on abundant training images. However, extensive data acquisition is expensive, and the ability to learn quickly from limited data is required in real-world applications. Also, these existing methods are not well-suited for fast adaptation to a new category. Few-shot image generation, aiming to generate images from only a few images for a new category, has attracted some research interest. In this paper, we propose a Fusing-and-Filling Generative Adversarial Network (F2GAN) to generate realistic and diverse images for a new category with only a few images. In our F2GAN, a fusion generator is designed to fuse the high-level features of conditional images with random interpolation coefficients, and then fill in attended low-level details with a non-local attention module to produce a new image. Moreover, our discriminator can ensure the diversity of generated images by a mode seeking loss and an interpolation regression loss. Extensive experiments on five datasets demonstrate the effectiveness of our proposed method for few-shot image generation.
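The fusing step can be pictured as a convex combination of feature vectors. The sketch below is a minimal illustration under stated assumptions: the abstract does not say how the interpolation coefficients are sampled, so normalizing uniform random draws is a guess, and the names `random_interpolation_coefficients` and `fuse_features` are hypothetical.

```python
import random


def random_interpolation_coefficients(k, rng=random):
    """Draw k positive coefficients that sum to 1 (assumed sampling scheme)."""
    raw = [rng.random() + 1e-9 for _ in range(k)]  # small offset avoids a zero sum
    total = sum(raw)
    return [r / total for r in raw]


def fuse_features(features, coeffs):
    """Weighted sum of k feature vectors, one coefficient per conditional image."""
    dim = len(features[0])
    return [sum(c * f[d] for c, f in zip(coeffs, features)) for d in range(dim)]
```

For example, `fuse_features([[0.0, 4.0], [4.0, 0.0]], [0.25, 0.75])` blends two feature vectors with weights 0.25 and 0.75; drawing fresh coefficients for each generated sample yields a different blend each time.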

### JAFPro: Joint Appearance Fusion and Propagation for Human Video Motion Transfer from Multiple Reference Images

• Xianggang Yu
• Haolin Liu
• Xiaoguang Han
• Zhen Li
• Zixiang Xiong
• Shuguang Cui

We present a novel framework for human video motion transfer. Deviating from recent studies that use only a single source image, we propose to allow users to supply multiple source images by simply imitating some poses in the desired target video. To aggregate the appearance from multiple input images, we propose a JAFPro framework that incorporates two modules: an appearance fusion module that adaptively fuses the information in the supplied images and an appearance propagation module that propagates textures through flow-based warping to further improve the result. An attractive feature of JAFPro is that the quality of its results progressively improves as more imitating images are supplied. Furthermore, we build a new dataset containing a large variety of dancing videos in the wild. Extensive experiments conducted on this dataset demonstrate that JAFPro outperforms state-of-the-art methods both qualitatively and quantitatively. We will release our code and dataset upon publication of this work.

## SESSION: Poster Session B2: Deep Learning for Multimedia & Emerging Multimedia Applications

### A W2VV++ Case Study with Automated and Interactive Text-to-Video Retrieval

• Jakub Lokoč
• Tomáš Souček
• Patrik Veselý
• František Mejzlík
• Jiaqi Ji
• Chaoxi Xu
• Xirong Li

As reported by respected evaluation campaigns focusing both on automated and interactive video search approaches, deep learning has started to dominate the video retrieval area. However, the results are still not satisfactory for many types of search tasks focusing on high recall. To report on this challenging problem, we present two orthogonal task-based performance studies centered around the state-of-the-art W2VV++ query representation learning model for video retrieval. First, an ablation study is presented to investigate which components of the model are effective in two types of benchmark tasks focusing on high recall. Second, interactive search scenarios from the Video Browser Showdown are analyzed for two winning prototype systems implementing a selected variant of the model and providing additional querying and visualization components. The analysis of collected logs demonstrates that even with the state-of-the-art text search video retrieval model, it is still auspicious to integrate users into the search process for task types where high recall is essential.

### Attention Cube Network for Image Restoration

• Yucheng Hang
• Qingmin Liao
• Wenming Yang
• Yupeng Chen
• Jie Zhou

Recently, deep convolutional neural networks (CNNs) have been widely used in image restoration and have obtained great success. However, most existing methods are limited to a local receptive field and equal treatment of different types of information. Besides, existing methods always use a multi-supervised method to aggregate different feature maps, which cannot effectively aggregate hierarchical feature information. To address these issues, we propose an attention cube network (A-CubeNet) for image restoration for more powerful feature expression and feature correlation learning. Specifically, we design a novel attention mechanism from three dimensions, namely spatial dimension, channel-wise dimension and hierarchical dimension. The adaptive spatial attention branch (ASAB) and the adaptive channel attention branch (ACAB) constitute the adaptive dual attention module (ADAM), which can capture the long-range spatial and channel-wise contextual information to expand the receptive field and distinguish different types of information for more effective feature representations. Furthermore, the adaptive hierarchical attention module (AHAM) can capture the long-range hierarchical contextual information to flexibly aggregate different feature maps by weights depending on the global context. The ADAM and AHAM cooperate to form an 'attention in attention' structure, which means AHAM's inputs are enhanced by ASAB and ACAB. Experiments demonstrate the superiority of our method over state-of-the-art image restoration methods in both quantitative comparison and visual analysis.

### CRNet: A Center-aware Representation for Detecting Text of Arbitrary Shapes

• Yu Zhou
• Hongtao Xie
• Shancheng Fang
• Yan Li
• Yongdong Zhang

Existing scene text detection methods achieve state-of-the-art performance by designing elaborate anchors or complex post-processing. Nonetheless, most methods still face the dilemma of detecting adjacent texts as one instance and long text with large character spacing as multiple fragments. To tackle these problems, we propose an anchor-free scene text detector, namely CRNet, that leverages a Center-aware Representation to achieve accurate arbitrary-shaped scene text detection. Firstly, we propose a center-aware location algorithm to explicitly learn center regions and center points of text instances, which is able to separate adjacent text instances effectively. Then, a multi-scale context extraction module capable of extracting local context, long-range dependencies and global context adaptively is designed to effectively perceive long text with large character spacing. Finally, a low-level feature enhancement block is introduced to enhance the geometric information of text. Extensive experiments conducted on several benchmarks including SCUT-CTW1500, Total-Text, ICDAR2015, ICDAR2017 MLT, and MSRA-TD500 demonstrate the effectiveness of our method. Specifically, without any anchor and complicated post-processing, our CRNet achieves 84.2% and 85.1% on CTW1500 and MSRA-TD500 in F-measure, outperforming all state-of-the-art anchor-based and anchor-free methods.

### Expressional Region Retrieval

• Xiaoqian Guo
• Xiangyang Li
• Shuqiang Jiang

Image retrieval is a long-standing topic in the multimedia community due to its various applications, e.g., product search and artwork retrieval in museums. The regions in images contain a wealth of information. Users may be interested in the objects presented in the image regions or the relationships between them. However, previous retrieval methods are either limited to a single object in an image or focus on the entire visual scene. In this paper, we introduce a new task called expressional region retrieval, in which the query is formulated as an image region with an associated description. The goal is to find images containing content similar to the query and to localize the regions within them. As far as we know, this task has not been explored yet. We propose a framework to address this issue. The region proposals are first generated based on region detectors and language features are extracted. Then the Gated Residual Network (GRN) takes language information as a gate to control the transformation of visual features. In this way, the combined visual and language representation is more specific and discriminative for expressional region retrieval. We evaluate our method on a newly established benchmark constructed from the Visual Genome dataset. Experimental results demonstrate that our model effectively utilizes both visual and language information, outperforming the baseline methods.

### ATRW: A Benchmark for Amur Tiger Re-identification in the Wild

• Shuyuan Li
• Jianguo Li
• Hanlin Tang
• Rui Qian
• Weiyao Lin

Monitoring the population and movements of endangered species is an important task for wildlife conservation. Traditional tagging methods do not scale to large populations, while applying computer vision methods to camera sensor data requires re-identification (re-ID) algorithms to obtain accurate counts and movement trajectories of wildlife. However, existing re-ID methods are largely targeted at persons and cars, which have limited pose variations and constrained capture environments. This paper tries to fill the gap by introducing a novel large-scale dataset, the Amur Tiger Re-identification in the Wild (ATRW) dataset. ATRW contains over 8,000 video clips from 92 Amur tigers, with bounding box, pose keypoint, and tiger identity annotations. In contrast to typical re-ID datasets, the tigers are captured in a diverse set of unconstrained poses and lighting conditions. We demonstrate with a set of baseline algorithms that ATRW is a challenging dataset for re-ID. Lastly, we propose a novel method for tiger re-identification, which introduces precise pose parts modeling in deep neural networks to handle large pose variation of tigers, and reaches notable performance improvement over existing re-ID methods. The ATRW dataset is publicly available at https://cvwc2019.github.io/challenge.html

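### VideoIC: A Video Interactive Comments Dataset and Multimodal Multitask Learning for Comments Generation

• Weiying Wang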
• Jieting Chen
• Qin Jin

Live video interactive commenting, a.k.a. danmaku, is an emerging social feature on online video sites, which involves rich multimodal information interaction among viewers. In order to support various related research, we build a large-scale video interactive comments dataset called VideoIC, which consists of 4951 videos spanning 557 hours and 5 million comments. Videos are collected from popular categories on the 'Bilibili' video streaming website. Compared to other existing danmaku datasets, our VideoIC contains richer and denser comment information, with 1077 comments per video on average. High comment density and diverse video types make VideoIC a challenging corpus for various research such as automatic video comment generation. We also propose a novel model based on multimodal multitask learning for comment generation (MML-CG), which integrates multiple modalities to achieve effective comment generation and temporal relation prediction. A multitask loss function is designed to train both tasks jointly in an end-to-end manner. We conduct extensive experiments on both VideoIC and Livebot datasets. The results demonstrate the effectiveness of our model and reveal some features of danmaku.

### Human Identification and Interaction Detection in Cross-View Multi-Person Videos with Wearable Cameras

• Jiewen Zhao
• Ruize Han
• Yiyang Gan
• Liang Wan
• Wei Feng
• Song Wang

Compared to a single fixed camera, multiple moving cameras, e.g., those worn by people, can better capture the human interactive and group activities in a scene, by providing multiple, flexible and possibly complementary views of the involved people. In this setting the actual promotion of activity detection is highly dependent on the effective correlation and collaborative analysis of multiple videos taken by different wearable cameras, which is highly challenging given the time-varying view differences across different cameras and mutual occlusion of people in each video. By focusing on two wearable cameras and the interactive activities that involve only two people, in this paper we develop a new approach that can simultaneously: (i) identify the same persons across the two videos, (ii) detect the interactive activities of interest, including their occurrence intervals and involved people, and (iii) recognize the category of each interactive activity. Specifically, we represent each video by a graph, with detected persons as nodes, and propose a unified Graph Neural Network (GNN) based framework to jointly solve the above three problems. A graph matching network is developed for identifying the same persons across the two videos and a graph inference network is then used for detecting the human interactions. We also build a new video dataset, which provides a benchmark for this study, and conduct extensive experiments to validate the effectiveness and superiority of the proposed method.

### Surface Reconstruction with Unconnected Normal Maps: An Efficient Mesh-based Approach

• Miaohui Wang
• Wuyuan Xie
• Maolin Cui

Normal integration is a key step in dense 3D reconstruction methods such as shape-from-shading and photometric stereo. However, normal integration cannot be guaranteed between spatially unconnected normal maps, which can ultimately cause a shape deformation in surface-from-normals (SfN). For the first time, this paper presents an efficient approach to address the fundamental problem of surface reconstruction from unconnected normal maps (denoted as "SfN+") using discrete geometry. We first design a normal piece pairing metric to measure the virtual pairing quality between two unconnected normal fragments, which is used as a new constraint for the boundary vertices during mesh deformation. We then adopt a normal connecting significance indicator to adjust the influence of virtually connected vertices, which further improves the overall shape deformation. Finally, we model the shape reconstruction of unconnected normal maps as a light-weight energy optimization framework by jointly considering the relaxation of connecting constraints and overall reconstruction error. Experiments show that the proposed SfN+ achieves robust and efficient performance on dense 3D surface reconstruction.

### MOR-UAV: A Benchmark Dataset and Baselines for Moving Object Recognition in UAV Videos

• Murari Mandal
• Lav Kush Kumar
• Santosh Kumar Vipparthi

Visual data collected from Unmanned Aerial Vehicles (UAVs) has opened a new frontier of computer vision that requires automated analysis of aerial images/videos. However, the existing UAV datasets primarily focus on object detection. An object detector does not differentiate between moving and non-moving objects. Given a real-time UAV video stream, how can we both localize and classify the moving objects, i.e., perform moving object recognition (MOR)? MOR is one of the essential tasks to support various UAV vision-based applications including aerial surveillance, search and rescue, event recognition, and urban and rural scene understanding. To the best of our knowledge, no labeled dataset is available for MOR evaluation in UAV videos. Therefore, in this paper, we introduce MOR-UAV, a large-scale video dataset for MOR in aerial videos. We achieve this by labeling axis-aligned bounding boxes for moving objects, which requires less computational resources than producing pixel-level estimates. We annotate 89,783 moving object instances collected from 30 UAV videos, consisting of 10,948 frames in various scenarios such as weather conditions, occlusion, changing flying altitude and multiple camera views. We assigned the labels for two categories of vehicles (car and heavy vehicle). Furthermore, we propose a deep unified framework MOR-UAVNet for MOR in UAV videos. Since this is a first attempt for MOR in UAV videos, we present 16 baseline results based on the proposed framework over the MOR-UAV dataset through quantitative and qualitative experiments. We also analyze the motion-salient regions in the network through multiple layer visualizations. The MOR-UAVNet works online at inference as it requires only a few past frames. Moreover, it doesn't require predefined target initialization from the user. Experiments also demonstrate that the MOR-UAV dataset is quite challenging.
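
To illustrate why an axis-aligned box is a cheaper label than a pixel-level estimate, the sketch below reduces a binary mask to the four numbers of its bounding box. The function name `mask_to_bbox` and the nested-list mask layout are assumptions made for illustration; they are not taken from the MOR-UAV annotation tooling.

```python
def mask_to_bbox(mask):
    """Return (x_min, y_min, x_max, y_max) of the nonzero pixels, or None.

    `mask` is a list of rows, each a list of 0/1 values (hypothetical layout).
    """
    # Column indices of every nonzero pixel.
    xs = [x for row in mask for x, value in enumerate(row) if value]
    # Row indices of every row that contains at least one nonzero pixel.
    ys = [y for y, row in enumerate(mask) if any(row)]
    if not xs:
        return None  # empty mask: no object to box
    return (min(xs), min(ys), max(xs), max(ys))


mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 0],
]
bbox = mask_to_bbox(mask)
```

The mask stores one value per pixel, while the box stores four integers per object regardless of object size.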

### Learning Tuple Compatibility for Conditional Outfit Recommendation

• Xuewen Yang
• Dongliang Xie
• Xin Wang
• Jiangbo Yuan
• Wanying Ding
• Pengyun Yan

Outfit recommendation requires the answers of some challenging outfit compatibility questions such as 'Which pair of boots and school bag go well with my jeans and sweater?'. It is more complicated than conventional similarity search, and needs to consider not only visual aesthetics but also the intrinsic fine-grained and multi-category nature of fashion items. Some existing approaches solve the problem through sequential models or learning pair-wise distances between items. However, most of them only consider coarse category information in defining fashion compatibility while neglecting the fine-grained category information often desired in practical applications. To better define the fashion compatibility and more flexibly meet different needs, we propose a novel problem of learning compatibility among multiple tuples (each consisting of an item and category pair), and recommending fashion items following the category choices from customers. Our contributions include: 1) Designing a Mixed Category Attention Net (MCAN) which integrates both fine-grained and coarse category information into recommendation and learns the compatibility among fashion tuples. MCAN can explicitly and effectively generate diverse and controllable recommendations based on need. 2) Contributing a new dataset IQON, which follows eastern culture and can be used to test the generalization of recommendation systems. Our extensive experiments on a reference dataset Polyvore and our dataset IQON demonstrate that our method significantly outperforms state-of-the-art recommendation methods.

### Efficient Crowd Counting via Structured Knowledge Transfer

• Lingbo Liu
• Jiaqi Chen
• Hefeng Wu
• Tianshui Chen
• Guanbin Li
• Liang Lin

Crowd counting is an application-oriented task and its inference efficiency is crucial for real-world applications. However, most previous works relied on heavy backbone networks and required prohibitive run-time consumption, which would seriously restrict their deployment scopes and cause poor scalability. To liberate these crowd counting models, we propose a novel Structured Knowledge Transfer (SKT) framework, which fully exploits the structured knowledge of a well-trained teacher network to generate a lightweight but still highly effective student network. Specifically, it is integrated with two complementary transfer modules: an Intra-Layer Pattern Transfer, which sequentially distills the knowledge embedded in layer-wise features of the teacher network to guide feature learning of the student network, and an Inter-Layer Relation Transfer, which densely distills the cross-layer correlation knowledge of the teacher to regularize the student's feature evolution. Consequently, our student network can derive the layer-wise and cross-layer knowledge from the teacher network to learn compact yet effective features. Extensive evaluations on three benchmarks demonstrate the effectiveness of our SKT for extensive crowd counting models. In particular, using only around 6% of the parameters and computation cost of the original models, our distilled VGG-based models obtain at least 6.5× speed-up on an Nvidia 1080 GPU and even achieve state-of-the-art performance. Our code and models are available at https://github.com/HCPLab-SYSU/SKT.
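
The two transfer terms can be pictured with a toy sketch. This is not the SKT implementation: it assumes each layer's feature map has already been flattened to a non-zero vector of the same length for teacher and student, and it uses cosine similarity as a stand-in for whatever similarity the authors use. All function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def intra_layer_loss(teacher_feats, student_feats):
    """Mean (1 - cosine) between teacher and student features, layer by layer."""
    pairs = list(zip(teacher_feats, student_feats))
    return sum(1.0 - cosine(t, s) for t, s in pairs) / len(pairs)


def relation_matrix(feats):
    """Cosine similarity between every pair of layers of one network."""
    n = len(feats)
    return [[cosine(feats[i], feats[j]) for j in range(n)] for i in range(n)]


def inter_layer_loss(teacher_feats, student_feats):
    """Mean squared difference between the two cross-layer relation matrices."""
    rel_t = relation_matrix(teacher_feats)
    rel_s = relation_matrix(student_feats)
    n = len(rel_t)
    total = sum((rel_t[i][j] - rel_s[i][j]) ** 2
                for i in range(n) for j in range(n))
    return total / (n * n)
```

In this sketch the first term pushes each student layer toward the matching teacher layer, while the second only constrains how the student's layers relate to one another.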

### DeSmoothGAN: Recovering Details of Smoothed Images via Spatial Feature-wise Transformation and Full Attention

• Yifei Huang
• Chenhui Li
• Xiaohu Guo
• Jing Liao
• Chenxu Zhang
• Changbo Wang

Recently, generative adversarial networks (GAN) have been widely used to solve image-to-image translation problems such as edges to photos, labels to scenes, and colorizing grayscale images. However, how to recover details of smoothed images is still unexplored. Naively training a GAN such as pix2pix yields unsatisfactory results because it ignores two main characteristics of this problem: spatial variability and spatial correlation. In this work, we propose DeSmoothGAN to exploit both characteristics explicitly. Spatial variability means that the details of different areas of smoothed images are distinct and should be recovered differently. Therefore, we propose to perform spatial feature-wise transformation to recover individual areas differently. Spatial correlation means that the details of different areas are related to each other. Thus, we propose to apply full attention to consider the relations between them. The proposed method generates satisfying results on several real-world datasets. We have conducted quantitative experiments including smooth consistency and image similarity to demonstrate the effectiveness of DeSmoothGAN. Furthermore, ablation studies are performed to illustrate the usefulness of our proposed feature-wise transformation and full attention.

### PatchMatch based Multiview Stereo with Local Quadric Window

• Hyewon Song
• Jaeseong Park
• Suwoong Heo
• Jiwoo Kang
• Sanghoon Lee

Although various stereo matching methods have been studied for many years, accurate high-fidelity 3D reconstruction from multiview stereo is still challenging due to the surface inconsistency caused by various factors such as specular illumination. In this paper, we propose an accurate PatchMatch based multiview stereo matching method with a quadric support window that efficiently captures the surface of a complex structured object. Our method makes three novel contributions. Firstly, delicate surface configurations are used for representing the complex structure of an object. By using a general 3D quadric function, the structured object surfaces can be estimated more accurately. In addition, an illumination robust framework is proposed, where the patch dissimilarities are precisely measured with disentangled representation. The matching cost is defined based on disentangled measurements of the object photometric and geometric properties, balancing the pixel intensities between images robust to illumination. Lastly, a multiview propagation method is proposed to confirm shape consistency among views. Through the disparity refinement to unify plane parameters of the views, the object surface is estimated from a global perspective. Consequently, the dense and smooth 3D shape of the object is reconstructed accurately. We evaluate our proposed method on the Middlebury stereo set and conduct comprehensive experiments on facial images. Both quantitative and qualitative results demonstrate that the proposed method shows significant improvements over state-of-the-art methods.

### Expert Performance in the Examination of Interior Surfaces in an Automobile: Virtual Reality vs. Reality

• Alexander Tesch
• Ralf Dörner

For evaluating the appearance and design language of car interiors, the surface quality and shapes are inspected by highly trained professionals. At the same time, virtual reality (VR) is making major progress, pushing the boundaries of this technology. In this paper, we evaluate the applicability of VR using head mounted displays (HMDs) in an experiment where we had experts examine the design quality of an interior in VR and compared the results with the examination on a powerwall as well as in reality. Our goal is to find out to what extent current VR hardware can be used in the automotive industry and which advantages and disadvantages occur. Our results show that the experts are able to detect an amount of flaws comparable to reality, with the powerwall being the superior medium in terms of flaws identified. Additionally, symptoms of cybersickness and a reduced lack of confidence were measured in the subjects using an HMD.

## SESSION: Poster Session C2: Emerging Multimedia Applications

### Uncertainty-based Traffic Accident Anticipation with Spatio-Temporal Relational Learning

• Wentao Bao
• Qi Yu
• Yu Kong

Traffic accident anticipation aims to predict accidents from dashcam videos as early as possible, which is critical to safety-guaranteed self-driving systems. With cluttered traffic scenes and limited visual cues, it is very challenging to predict how soon an accident will occur from early observed frames. Most existing approaches are developed to learn features of accident-relevant agents for accident anticipation, while ignoring the features of their spatial and temporal relations. Besides, current deterministic deep neural networks could be overconfident in false predictions, leading to high risk of traffic accidents caused by self-driving systems. In this paper, we propose an uncertainty-based accident anticipation model with spatio-temporal relational learning. It sequentially predicts the probability of traffic accident occurrence with dashcam videos. Specifically, we propose to take advantage of graph convolution and recurrent networks for relational feature learning, and leverage Bayesian neural networks to address the intrinsic variability of latent relational representations. The derived uncertainty-based ranking loss is found to significantly boost model performance by improving the quality of relational features. In addition, we collect a new Car Crash Dataset (CCD) for traffic accident anticipation which contains annotations of environmental attributes and accident reasons. Experimental results on both public and the newly compiled datasets show state-of-the-art performance of our model. Our code and CCD dataset are available at https://github.com/Cogito2012/UString.

### A Tightly-coupled Semantic SLAM System with Visual, Inertial and Surround-view Sensors for Autonomous Indoor Parking

• Xuan Shao
• Lin Zhang
• Tianjun Zhang
• Ying Shen
• Hongyu Li
• Yicong Zhou

The semantic SLAM (simultaneous localization and mapping) system is an indispensable module for autonomous indoor parking. Monocular and binocular visual cameras constitute the basic configuration to build such a system. Features used in existing SLAM systems are often dynamically movable, blurred and repetitively textured. By contrast, semantic features on the ground are more stable and consistent in the indoor parking environment. Due to their inabilities to perceive salient features on the ground, existing SLAM systems are prone to tracking loss during navigation. Therefore, a surround-view camera system capturing images from a top-down viewpoint is necessarily called for. To this end, this paper proposes a novel tightly-coupled semantic SLAM system by integrating Visual, Inertial, and Surround-view sensors, VIS SLAM for short, for autonomous indoor parking. In VIS SLAM, apart from low-level visual features and IMU (inertial measurement unit) motion data, parking-slots in surround-view images are also detected and geometrically associated, forming semantic constraints. Specifically, each parking-slot can impose a surround-view constraint that can be split into an adjacency term and a registration term. The former pre-defines the position of each individual parking-slot subject to whether it has an adjacent neighbor. The latter further constrains by registering between each observed parking-slot and its position in the world coordinate system. To validate the effectiveness and efficiency of VIS SLAM, a large-scale dataset composed of synchronous multi-sensor data collected from typical indoor parking sites is established, which is the first of its kind. The collected dataset has been made publicly available at https://cslinzhang.github.io/VISSLAM/.

### Searching Privately by Imperceptible Lying: A Novel Private Hashing Method with Differential Privacy

• Yimu Wang
• Shiyin Lu
• Lijun Zhang

In the big data era, with the increasing amount of multi-media data, approximate nearest neighbor (ANN) search has been an important but challenging problem. As a widely applied large-scale ANN search method, hashing has made great progress, and achieved sub-linear search time with low memory space. However, the advances in hashing are based on the availability of large and representative datasets, which often contain sensitive information. Typically, the privacy of this individually sensitive information is compromised. In this paper, we tackle this valuable yet challenging problem and formulate a task termed as private hashing, which takes into account both searching performance and privacy protection. Specifically, we propose a novel noise mechanism, i.e., Random Flipping, and two private hashing algorithms, i.e., PHashing and PITQ, with the refined analysis within the framework of differential privacy, since differential privacy is a well-established technique to measure the privacy leakage of an algorithm. Random Flipping targets binary scenarios and leverages the "Imperceptible Lying" idea to guarantee ε-differential privacy by flipping each datum of the binary matrix (noise addition). To preserve ε-differential privacy, PHashing perturbs and adds noise to the hash codes learned by non-private hashing algorithms using Random Flipping. However, the noise addition for privacy in PHashing will cause severe performance drops. To alleviate this problem, PITQ leverages the power of alternative learning to distribute the noise generated by Random Flipping into each iteration while preserving ε-differential privacy. Furthermore, to empirically evaluate our algorithms, we conduct comprehensive experiments on the image search task and demonstrate that proposed algorithms achieve equal performance compared with non-private hashing methods.
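
The abstract does not give the flipping probability used by Random Flipping. As a hedged illustration of the general idea, the sketch below applies classical randomized response, a stand-in technique in which each bit is flipped independently with probability 1 / (1 + e^ε), which gives ε-differential privacy for a single bit. The function names are hypothetical and this is not the authors' mechanism.

```python
import math
import random


def flip_probability(epsilon):
    """Per-bit flip probability of classical randomized response."""
    return 1.0 / (1.0 + math.exp(epsilon))


def random_flip(bits, epsilon, rng=None):
    """Flip each entry of a binary matrix independently.

    `bits` is a list of rows of 0/1 values. A smaller epsilon means more
    flips (stronger privacy, noisier codes); a larger epsilon means fewer.
    """
    rng = rng or random.Random()
    p = flip_probability(epsilon)
    return [[1 - b if rng.random() < p else b for b in row] for row in bits]


codes = [[0, 1, 1, 0], [1, 1, 0, 0]]
noisy = random_flip(codes, epsilon=1.0, rng=random.Random(0))
```

At ε = 0 every bit is flipped with probability 0.5, so the output carries no information about the input; as ε grows, the noisy codes approach the original ones.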

### Leverage Social Media for Personalized Stress Detection

• Xin Wang
• Huijun Zhang
• Lei Cao
• Ling Feng

Timely detection of stress is desirable to address the increasingly serious stress problem. Thanks to the rich linguistic expressions and complete historical records on social media, achieving personalized stress detection through social media is feasible and promising. We construct a three-level framework for personalized stress detection based on social media. The framework learns the personalized stress representations through increasingly detailed processing, i.e., from the generic mass level, to the group level, to the final individual level. The first, mass level focuses on mining the generic stress representations from people's linguistic and visual posts with a two-layer attention mechanism. The second, group level adopts a graph neural network to learn the group-wise characteristics of the group to which an individual belongs. The third, individual level analyzes and incorporates the individual's personality traits into stress detection. The performance study on 2,059 microblog users shows that our proposed method can achieve over 90% detection accuracy. Furthermore, the extended experiment on a harder personalized sub-dataset demonstrates that our method works better in distinguishing personalized expressions with different latent meanings.

### Arbitrary Style Transfer via Multi-Adaptation Network

• Yingying Deng
• Fan Tang
• Weiming Dong
• Wen Sun
• Feiyue Huang
• Changsheng Xu

Arbitrary style transfer is a significant topic with research value and application prospect. A desired style transfer, given a content image and referenced style painting, would render the content image with the color tone and vivid stroke patterns of the style painting while synchronously maintaining the detailed content structure information. Style transfer approaches would initially learn content and style representations of the content and style references and then generate the stylized images guided by these representations. In this paper, we propose the multi-adaptation network which involves two self-adaptation (SA) modules and one co-adaptation (CA) module: the SA modules adaptively disentangle the content and style representations, i.e., content SA module uses position-wise self-attention to enhance content representation and style SA module uses channel-wise self-attention to enhance style representation; the CA module rearranges the distribution of style representation based on content representation distribution by calculating the local similarity between the disentangled content and style features in a non-local fashion. Moreover, a new disentanglement loss function enables our network to extract main style patterns and exact content structures to adapt to various input images, respectively. Various qualitative and quantitative experiments demonstrate that the proposed multi-adaptation network leads to better results than the state-of-the-art style transfer methods.

### Dual-view Attention Networks for Single Image Super-Resolution

• Jingcai Guo
• Shiheng Ma
• Jie Zhang
• Qihua Zhou
• Song Guo

One non-negligible flaw of the convolutional neural networks (CNNs) based single image super-resolution (SISR) models is that most of them are not able to restore high-resolution (HR) images containing sufficient high-frequency information. Worse still, as the depth of CNNs increases, the training easily suffers from the vanishing gradients. These problems hinder the effectiveness of CNNs in SISR. In this paper, we propose the Dual-view Attention Networks to alleviate these problems for SISR. Specifically, we propose the local aware (LA) and global aware (GA) attentions to deal with LR features in unequal manners, which can highlight the high-frequency components and discriminate each feature from LR images in the local and global views, respectively. Furthermore, the local attentive residual-dense (LARD) block that combines the LA attention with multiple residual and dense connections is proposed to fit a deeper yet easy to train architecture. The experimental results verified the effectiveness of our model compared with other state-of-the-art methods.

### MRI Measurement Matrix Learning via Correlation Reweighting

• Zhongnian Li
• Tao Zhang
• Ruoyu Chen
• Daoqiang Zhang

In Compressive Sensing MRI (CS-MRI), measurement matrix learning has been developed as a promising method for measurement matrix design. Research on the MRI measurement task suggests that the Relative 2-Norm Error (RLNE) of measurement images is imbalanced. However, current learning-based investigations do not probe this imbalance in measurement matrix learning. In this paper, we propose a novel Measurement Matrix Learning via Correlation Reweighting (MML-CR) approach for exploring and solving this problem by optimizing a reweighted model. Specifically, we introduce a reweighting expected minimization model to obtain an essential measurement matrix in k-space. Besides, we propose an example correlation regularizer to prevent trivial solutions when learning weights. Furthermore, we present an alternating solution and perform convergence analysis for the optimization. We also demonstrate quantitative and qualitative experimental results which show that our algorithm outperforms several state-of-the-art measurement methods. Compared with conventional methods, MML-CR achieves better performance on universal task.
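
For readers unfamiliar with the metric, RLNE is commonly defined in the CS-MRI literature as the 2-norm of the reconstruction error divided by the 2-norm of the reference image. The sketch below assumes that common definition (the abstract itself does not spell it out) and treats images as flat lists of real or complex pixel values; the function name `rlne` is hypothetical.

```python
import math


def rlne(reconstructed, reference):
    """Relative 2-norm error: ||reconstructed - reference||_2 / ||reference||_2.

    Both arguments are equal-length flat sequences of real or complex values;
    `reference` must not be all zeros.
    """
    error = math.sqrt(sum(abs(r - x) ** 2
                          for r, x in zip(reconstructed, reference)))
    scale = math.sqrt(sum(abs(x) ** 2 for x in reference))
    return error / scale
```

Because the error is normalized per image, two images with very different RLNE values contribute unequally to an unweighted training objective, which is the kind of imbalance that a per-example reweighting scheme targets.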

### Complementary-View Co-Interest Person Detection

• Ruize Han
• Jiewen Zhao
• Wei Feng
• Yiyang Gan
• Liang Wan
• Song Wang

Fast and accurate identification of the co-interest persons, who draw joint interest of the surrounding people, plays an important role in social scene understanding and surveillance. Previous study mainly focuses on detecting co-interest persons from a single-view video. In this paper, we study a much more realistic and challenging problem, namely co-interest person (CIP) detection from multiple temporally-synchronized videos taken by the complementary and time-varying views. Specifically, we use a top-view camera, mounted on a flying drone at a high altitude to obtain a global view of the whole scene and all subjects on the ground, and multiple horizontal-view cameras, worn by selected subjects, to obtain a local view of their nearby persons and environment details. We present an efficient top- and horizontal-view data fusion strategy to map multiple horizontal views into the global top view. We then propose a spatial-temporal CIP potential energy function that jointly considers both intra-frame confidence and inter-frame consistency, thus leading to an effective Conditional Random Field (CRF) formulation. We also construct a complementary-view video dataset, which provides a benchmark for the study of multi-view co-interest person detection. Extensive experiments validate the effectiveness and superiority of the proposed method.

### Multimodal Dialogue Systems via Capturing Context-aware Dependencies of Semantic Elements

• Weidong He
• Zhi Li
• Dongcai Lu
• Enhong Chen
• Tong Xu
• Baoxing Huai
• Jing Yuan

Recently, multimodal dialogue systems have attracted increasing attention in several domains such as retail and travel. In spite of the promising performance of pioneering works, existing studies usually focus on utterance-level semantic representations with hierarchical structures, which ignore the context-aware dependencies of multimodal semantic elements, i.e., words and images. Moreover, when integrating the visual content, they only consider images of the current turn, leaving out those of previous turns as well as their ordinal information. To address these issues, we propose Multimodal diAlogue systems with semanTic Elements, MATE for short. Specifically, we unfold the multimodal inputs and devise a Multimodal Element-level Encoder to obtain the semantic representation at the element level. Besides, we take into consideration all images that might be relevant to the current turn and inject the sequential characteristics of images through position encoding. Finally, we conduct comprehensive experiments on a public multimodal dialogue dataset in the retail domain, and improve the BLEU-4 score by 9.49 and the NIST score by 1.8469 compared with state-of-the-art methods.
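
The abstract does not say which form of position encoding is used. As one plausible illustration, the sketch below uses the fixed sinusoidal encoding from the Transformer literature and adds it to a sequence of image feature vectors so that the order of the images becomes visible to the model. The function names and the feature layout are assumptions.

```python
import math


def positional_encoding(position, dim):
    """Sinusoidal encoding: sin on even indices, cos on odd indices."""
    encoding = []
    for i in range(dim):
        # Each (even, odd) index pair shares one frequency.
        angle = position / (10000 ** ((2 * (i // 2)) / dim))
        encoding.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return encoding


def add_positions(image_features):
    """Add the encoding of each image's index in the dialogue to its features."""
    dim = len(image_features[0])
    return [
        [f + p for f, p in zip(features, positional_encoding(pos, dim))]
        for pos, features in enumerate(image_features)
    ]
```

Two identical images at different turns then receive different representations, which is what lets a downstream encoder use their ordinal information.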

### EyeShopper: Estimating Shoppers' Gaze using CCTV Cameras

• Carlos Bermejo
• Dimitris Chatzopoulos
• Pan Hui

Recent advances in machine and deep learning allow for enhanced retail analytics by applying object detection techniques. However, existing approaches either require laborious installation processes to function or lack precision when the customers turn their backs to the installed cameras. In this paper, we present EyeShopper, an innovative system that tracks the gaze of shoppers when facing away from the camera and provides insights about their behavior in physical stores. EyeShopper is readily deployable in existing surveillance systems and robust against low-resolution video inputs. At the same time, its accuracy is comparable to state-of-the-art gaze estimation frameworks that require high-resolution and continuous video inputs to function. Furthermore, EyeShopper is more robust than state-of-the-art gaze tracking techniques for back head images. Extensive evaluation with different real video datasets and a synthetic dataset we produced shows that EyeShopper estimates with high accuracy the gaze of customers.

### Exploiting Active Learning in Novel Refractive Error Detection with Smartphones

• Eugene Yujun Fu
• Zhongqi Yang
• Hong Va Leong
• Grace Ngai
• Chi-wai Do
• Lily Chan

Refractive errors, such as myopia and astigmatism, can lead to severe visual impairment if not detected and corrected in time. Traditional methods of refractive error diagnosis rely on well-trained optometrists operating expensive and non-portable devices, constraining the vision screening process. Advances in smartphone cameras have enabled novel low-cost ubiquitous vision screening to detect refractive error or ametropia through eye image processing, based on the principle of photorefraction. However, contemporary smartphone-based methods rely heavily on hand-crafted features and sufficiency of well-labeled data. To address these challenges, this paper exploits active learning methods with a set of Convolutional Neural Network features encoding information of human eyes from a pre-trained gaze estimation model. This enables more effective training of refractive error detection models with less labeled data. Our experimental results demonstrate the encouraging effectiveness of our active learning approach. The new set of features is able to attain screening accuracy of more than 80% with mean absolute error less than 0.66, meeting the expectation of optometrists for 0.5 to 1. The proposed active learning also requires significantly fewer training samples (18%) to achieve satisfactory performance.
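
The abstract does not state which query strategy the active learner uses. As a hedged example of how active learning can cut labeling effort, the sketch below implements uncertainty sampling for a binary classifier, a common strategy that is swapped in here purely for illustration: the samples whose predicted probability is closest to 0.5 are sent for labeling first. The function name is hypothetical.

```python
def select_most_uncertain(probabilities, k):
    """Return indices of the k unlabeled samples closest to probability 0.5.

    `probabilities` holds the current model's predicted probability of the
    positive class for each unlabeled sample.
    """
    ranked = sorted(range(len(probabilities)),
                    key=lambda i: abs(probabilities[i] - 0.5))
    return ranked[:k]


# Four unlabeled eye images with hypothetical model outputs.
to_label = select_most_uncertain([0.95, 0.52, 0.10, 0.45], k=2)
```

Samples the model is already confident about (0.95 and 0.10 here) are skipped, so each labeling round is spent where an extra label is likely to change the model most.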

### Price Suggestion for Online Second-hand Items with Texts and Images

• Liang Han
• Zhaozheng Yin
• Zhurong Xia
• Minqian Tang
• Rong Jin

This paper presents an intelligent price suggestion system for online second-hand listings based on their uploaded images and text descriptions. The goal of price prediction is to help sellers set effective and reasonable prices for their second-hand items with the images and text descriptions uploaded to the online platforms. Specifically, we design a multi-modal price suggestion system which takes as input the extracted visual and textual features along with some statistical item features collected from the second-hand item shopping platform to determine whether the image and text of an uploaded second-hand item are qualified for reasonable price suggestion with a binary classification model, and provide price suggestions for second-hand items with qualified images and text descriptions with a regression model. To satisfy different demands, two different constraints are added into the joint training of the classification model and the regression model. Moreover, a customized loss function is designed for optimizing the regression model to provide price suggestions for second-hand items, which can not only maximize the gain of the sellers but also facilitate the online transaction. We also derive a set of metrics to better evaluate the proposed price suggestion system. Extensive experiments on a large real-world dataset demonstrate the effectiveness of the proposed multi-modal price suggestion system.

### An Advanced LiDAR Point Cloud Sequence Coding Scheme for Autonomous Driving

• Xuebin Sun
• Sukai Wang
• Miaohui Wang
• Shing Shin Cheng
• Ming Liu

Due to the huge volume of point cloud data, storing or transmitting it is currently difficult and expensive in autonomous driving. Learning from the high efficiency video coding (HEVC) coding framework, we propose an advanced coding scheme for large-scale LiDAR point cloud sequences, in which several techniques have been developed to remove the spatial and temporal redundancy. The proposed strategy consists mainly of intra-coding and inter-coding. For intra-coding, we utilize a cluster-based prediction method to remove the spatial redundancy. For inter-coding, a predictive recurrent network is designed, which is capable of generating future frames according to the previously encoded frames. By calculating the residual error between the predicted and real point cloud data, the temporal redundancy can be removed. Finally, the residual data is quantized and encoded by lossless coding schemes. Experiments are conducted on the KITTI data set with four different scenes to verify the effectiveness and efficiency of the proposed method. Our approach can deal with multiple types of point cloud data from the simple to more complex, and yields better performance in terms of compression ratio compared with octree, Google Draco, MPEG TMC13 and other recently proposed methods.
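
The inter-coding step described above (predict, take the residual, quantize, then losslessly encode) can be sketched in a few lines. The uniform quantizer, the step size, and the function names below are assumptions for illustration; the predictive recurrent network is replaced by a given list of predicted values, and the lossless entropy coder is omitted.

```python
def encode_residual(actual, predicted, step):
    """Quantize the prediction residual to integer symbols (uniform quantizer)."""
    return [round((a - p) / step) for a, p in zip(actual, predicted)]


def decode_residual(symbols, predicted, step):
    """Reconstruct values from the prediction plus the dequantized residual."""
    return [p + q * step for q, p in zip(symbols, predicted)]


actual = [10.0, 10.4, 9.7]      # e.g. measured ranges in the current frame
predicted = [10.0, 10.0, 10.0]  # stand-in for the network's prediction
symbols = encode_residual(actual, predicted, step=0.5)
decoded = decode_residual(symbols, predicted, step=0.5)
```

The better the prediction, the more symbols are zero or near zero, which is what makes the subsequent lossless coding stage effective; the reconstruction error of this quantizer is bounded by half the step size.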

### Learning Optimization-based Adversarial Perturbations for Attacking Sequential Recognition Models

• Xing Xu
• Jiefu Chen
• Jinhui Xiao
• Zheng Wang
• Yang Yang
• Heng Tao Shen

A large number of recent studies on adversarial attack have verified that a Deep Neural Network (DNN) model designed for non-sequential recognition (NSR) tasks (e.g., classification, detection and segmentation) can be easily fooled by adversarial examples. However, only a few studies pay attention to the adversarial attack on sequential recognition (SR). They either apply the attack methods proposed for NSR to SR by neglecting the sequential dependencies, or focus on attacking specific SR models without considering the generality. In this paper, we study the adversarial attack on the general and popular DNN structure of CNN+RNN, i.e., the combination of convolutional neural network (CNN) and recurrent neural network (RNN), which has been widely used in various SR tasks. We take scene text recognition (STR) and image captioning (IC) as case studies, derive the objective function for attacking CNN+RNN based models with targeted and untargeted attack modes, and then develop an optimization-based algorithm to learn adversarial perturbations from the derived gradients of each character (or word) in the sequence by incorporating the sequential dependencies. Extensive experiments show that our proposed method can effectively fool several state-of-the-art models, including four STR models and two IC models, with a higher success rate and less time consumption compared to the three latest attack methods.

## SESSION: Poster Session D2: Emerging Multimedia Applications & Emotional and Social Signals in Multimedia

### Emotions Don't Lie: An Audio-Visual Deepfake Detection Method using Affective Cues

• Trisha Mittal
• Uttaran Bhattacharya
• Rohan Chandra
• Aniket Bera
• Dinesh Manocha

We present a learning-based method for detecting real and fake deepfake multimedia content. To maximize information for learning, we extract and analyze the similarity between the two audio and visual modalities from within the same video. Additionally, we extract and compare affective cues corresponding to perceived emotion from the two modalities within a video to infer whether the input video is "real" or "fake". We propose a deep learning network, inspired by the Siamese network architecture and the triplet loss. To validate our model, we report the AUC metric on two large-scale deepfake detection datasets, DeepFake-TIMIT Dataset and DFDC. We compare our approach with several SOTA deepfake detection methods and report per-video AUC of 84.4% on the DFDC and 96.6% on the DF-TIMIT datasets, respectively. To the best of our knowledge, ours is the first approach that simultaneously exploits audio and video modalities and also perceived emotions from the two modalities for deepfake detection.
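
For reference, the standard triplet loss that the network is said to be inspired by can be written as below. This is the textbook form with Euclidean distance, not the authors' exact objective; which embeddings serve as anchor, positive, and negative in their model is not stated in the abstract, and the margin value is an arbitrary placeholder.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Zero once the negative is at least `margin` farther than the positive."""
    return max(0.0,
               euclidean(anchor, positive)
               - euclidean(anchor, negative)
               + margin)
```

Minimizing this loss pulls matching embeddings together and pushes mismatched ones apart, so a large distance between a video's audio and visual embeddings can then serve as a cue that the content has been manipulated.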

### Deep Disturbance-Disentangled Learning for Facial Expression Recognition

• Delian Ruan
• Yan Yan
• Si Chen
• Jing-Hao Xue
• Hanzi Wang

To achieve effective facial expression recognition (FER), it is of great importance to address various disturbing factors, including pose, illumination, identity, and so on. However, a number of FER databases merely provide the labels of facial expression, identity, and pose, but lack the label information for other disturbing factors. As a result, many methods are only able to cope with one or two disturbing factors, ignoring the heavy entanglement between facial expression and multiple disturbing factors. In this paper, we propose a novel Deep Disturbance-disentangled Learning (DDL) method for FER. DDL is capable of simultaneously and explicitly disentangling multiple disturbing factors by taking advantage of multi-task learning and adversarial transfer learning. The training of DDL involves two stages. First, a Disturbance Feature Extraction Model (DFEM) is pre-trained to perform multi-task learning for classifying multiple disturbing factors on the large-scale face database (which has the label information for various disturbing factors). Second, a Disturbance-Disentangled Model (DDM), which contains a global shared sub-network and two task-specific (i.e., expression and disturbance) sub-networks, is learned to encode the disturbance-disentangled information for expression recognition. The expression sub-network adopts a multi-level attention mechanism to extract expression-specific features, while the disturbance sub-network leverages adversarial transfer learning to extract disturbance-specific features based on the pre-trained DFEM. Experimental results on both the in-the-lab FER databases (including CK+, MMI, and Oulu-CASIA) and the in-the-wild FER databases (including RAF-DB and SFEW) demonstrate the superiority of our proposed method compared with several state-of-the-art methods.

### Unsupervised Learning Facial Parameter Regressor for Action Unit Intensity Estimation via Differentiable Renderer

• Xinhui Song
• Tianyang Shi
• Zunlei Feng
• Mingli Song
• Jackie Lin
• Chuanjie Lin
• Changjie Fan
• Yi Yuan

Facial action unit (AU) intensity is an index to describe all visually discernible facial movements. Most existing methods learn an intensity estimator from limited AU data and therefore lack the ability to generalize beyond the dataset. In this paper, we present a framework to predict the facial parameters (including identity parameters and AU parameters) based on a bone-driven face model (BDFM) under different views. The proposed framework consists of a feature extractor, a generator, and a facial parameter regressor. The regressor can fit the physically meaningful parameters of the BDFM from a single face image with the help of the generator, which maps the facial parameters to game-face images as a differentiable renderer. In addition, identity loss, loopback loss, and adversarial loss improve the regression results. Quantitative evaluations are performed on two public databases, BP4D and DISFA, and demonstrate that the proposed method achieves comparable or better performance than state-of-the-art methods. Moreover, the qualitative results demonstrate the validity of our method in the wild.

### Semi-supervised Multi-modal Emotion Recognition with Cross-Modal Distribution Matching

• Jingjun Liang
• Ruichen Li
• Qin Jin

Automatic emotion recognition is an active research topic with a wide range of applications. Due to the high manual annotation cost and inevitable label ambiguity, the development of emotion recognition datasets is limited in both scale and quality. Therefore, one of the key challenges is how to build effective models with limited data resources. Previous works have explored different approaches to tackle this challenge, including data enhancement, transfer learning, and semi-supervised learning. However, these existing approaches suffer from weaknesses such as training instability, large performance loss during transfer, or marginal improvement. In this work, we propose a novel semi-supervised multi-modal emotion recognition model based on cross-modality distribution matching, which leverages abundant unlabeled data to enhance model training under the assumption that the inner emotional status is consistent at the utterance level across modalities. We conduct extensive experiments to evaluate the proposed model on two benchmark datasets, IEMOCAP and MELD. The experimental results show that the proposed semi-supervised learning model can effectively utilize unlabeled data and combine multiple modalities to boost emotion recognition performance, outperforming other state-of-the-art approaches under the same conditions. The proposed model also achieves competitive performance compared with existing approaches that take advantage of additional auxiliary information such as speaker and interaction context.

### PersonalitySensing: A Multi-View Multi-Task Learning Approach for Personality Detection based on Smartphone Usage

• Songcheng Gao
• Wenzhong Li
• Lynda J. Song
• Xiao Zhang
• Mingkai Lin
• Sanglu Lu

Assessing an individual's personality traits has important implications in psychology, sociology, and economics. Conventional personality measurement methods are questionnaire-based, which makes them time-consuming and labor-intensive. With the pervasive deployment of mobile communication applications, smartphone usage data has been found to relate to people's social, behavioral, and psychological aspects. In this paper, we propose a deep learning approach to infer people's Big Five personality traits based on smartphone data. Specifically, we collect smartphone usage snapshots with an Android App, and extract features from the collected data. We propose a multi-view multi-task learning approach with a deep neural network model to fuse the extracted features and learn the Big Five personality traits jointly. Extensive experiments based on real-world smartphone data collected from university volunteers show that the proposed approach significantly outperforms the state-of-the-art algorithms in personality prediction.

### AU-assisted Graph Attention Convolutional Network for Micro-Expression Recognition

• Hong-Xia Xie
• Ling Lo
• Hong-Han Shuai
• Wen-Huang Cheng

Micro-expressions (MEs) are important clues for reflecting the real feelings of humans, and micro-expression recognition (MER) can thus be applied in various real-world applications. However, it is difficult to perceive and interpret MEs correctly. With the advance of deep learning technologies, the accuracy of micro-expression recognition is improved but still limited by the lack of large-scale datasets. In this paper, we propose a novel micro-expression recognition approach by combining Action Units (AUs) and emotion category labels. Specifically, based on facial muscle movements, we model different AUs based on relational information and integrate the AUs recognition task with MER. Besides, to overcome the shortcomings of limited and imbalanced training samples, we propose a data augmentation method that can generate nearly indistinguishable image sequences with AU intensity of real-world micro-expression images, which effectively improve the performance and are compatible with other micro-expression recognition methods. Experimental results on three mainstream micro-expression datasets, i.e., CASME II, SAMM, and SMIC, manifest that our approach outperforms other state-of-the-art methods on both single database and cross-database micro-expression recognition.

### DFEW: A Large-Scale Database for Recognizing Dynamic Facial Expressions in the Wild

• Xingxun Jiang
• Yuan Zong
• Wenming Zheng
• Chuangao Tang
• Wanchuang Xia
• Cheng Lu
• Jiateng Liu

Recently, facial expression recognition (FER) in the wild has gained a lot of researchers' attention because it is a valuable topic to enable the FER techniques to move from the laboratory to the real applications. In this paper, we focus on this challenging but interesting topic and make contributions from three aspects. First, we present a new large-scale 'in-the-wild' dynamic facial expression database, DFEW (Dynamic Facial Expression in the Wild), consisting of over 16,000 video clips from thousands of movies. These video clips contain various challenging interferences in practical scenarios such as extreme illumination, occlusions, and capricious pose changes. Second, we propose a novel method called Expression-Clustered Spatiotemporal Feature Learning (EC-STFL) framework to deal with dynamic FER in the wild. Third, we conduct extensive benchmark experiments on DFEW using a lot of spatiotemporal deep feature learning methods as well as our proposed EC-STFL. Experimental results show that DFEW is a well-designed and challenging database, and the proposed EC-STFL can promisingly improve the performance of existing spatiotemporal deep neural networks in coping with the problem of dynamic FER in the wild. Our DFEW database is publicly available and can be freely downloaded from https://dfew-dataset.github.io/.

### Region of Interest Based Graph Convolution: A Heatmap Regression Approach for Action Unit Detection

• Zheng Zhang
• Taoyue Wang
• Lijun Yin

Machine vision of human facial expressions has been studied for decades, from prototypical expressions to Action Units (AUs), from hand-crafted to deep features, from multi-class to multi-label classifications. Since the widely adopted deep networks lack interpretation on learnt representations, human prior knowledge cannot be effectively imposed and examined. On the other hand, AU is a human defined concept. In order to align with this idea, a finer level of network design is desired. In this paper, we first extend the heatmaps to ROI maps, encoding the location of both positive and negative occurred AUs, then employ a well-designed backbone network to regress it. In this way, AU detection is performed in two stages, key regions localization and occurrence classification. To prompt the spatial dependency among ROIs, we utilize graph convolution for feature refinement. The decomposition of similarity matrix is supervised by AU labels. This novel framework is evaluated on two benchmark databases (BP4D and DISFA) for AU detection. The experimental results are superior to the state-of-the-art algorithms and baseline models, demonstrating the effectiveness of our proposed method.

### IExpressNet: Facial Expression Recognition with Incremental Classes

• Junjie Zhu
• Bingjun Luo
• Sicheng Zhao
• Shihui Ying
• Xibin Zhao
• Yue Gao

Existing methods on facial expression recognition (FER) are mainly trained in the setting when all expression classes are fixed in advance. However, in real applications, expression classes are becoming increasingly fine-grained and incremental. To deal with sequential expression classes, we can fine-tune or re-train these models, but this often results in poor performance or large computing resources consumption. To address these problems, we develop an Incremental Facial Expression Recognition Network (IExpressNet), which can learn a competitive multi-class classifier at any time with a lower requirement of computing resources. Specifically, IExpressNet consists of two novel components. First, we construct an exemplar set by dynamically selecting representative samples from old expression classes. Then, the exemplar set and new expression classes samples constitute the training set. Second, we design a novel center-expression-distilled loss. As for facial expression in the wild, center-expression-distilled loss enhances the discriminative power of the deeply learned features and prevents catastrophic forgetting. Extensive experiments are conducted on two large-scale FER datasets in the wild, RAF-DB and AffectNet. The results demonstrate the superiority of the proposed method as compared to state-of-the-art incremental learning approaches.
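
One simple way to picture the exemplar-set idea is to keep, for each old class, the samples whose features lie closest to the class mean. The sketch below does exactly that. It is an assumption-laden illustration: the nearest-to-mean criterion, the function name `select_exemplars`, and the toy features are hypothetical, and the abstract does not describe the selection rule IExpressNet actually uses.

```python
def select_exemplars(features, k):
    """Return indices of the k feature vectors closest to the class mean."""
    dim = len(features[0])
    mean = [sum(f[d] for f in features) / len(features) for d in range(dim)]

    def sq_dist(f):
        # Squared Euclidean distance to the class mean.
        return sum((x - m) ** 2 for x, m in zip(f, mean))

    ranked = sorted(range(len(features)), key=lambda i: sq_dist(features[i]))
    return ranked[:k]


# Hypothetical 2-D features of one old expression class.
old_class_features = [[0.0, 0.0], [1.0, 1.0], [1.2, 0.9], [5.0, 5.0]]
exemplar_ids = select_exemplars(old_class_features, k=2)
```

The selected exemplars would then be mixed with samples of the new classes to form the training set for the next incremental step, as the abstract describes.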

### SST-EmotionNet: Spatial-Spectral-Temporal based Attention 3D Dense Network for EEG Emotion Recognition

• Ziyu Jia
• Youfang Lin
• Xiyang Cai
• Haobin Chen
• Haijun Gou
• Jing Wang

Multimedia stimulation of brain activities has not only become an emerging field for intensive research, but also achieves important progress in the electroencephalogram (EEG) emotion classification based on brain activities. However, how to make full use of different EEG features and the discriminative local patterns among the features for different emotions is challenging. Existing models ignore the complementarity among the spatial-spectral-temporal features and discriminative local patterns in all features, which limits the classification ability of the models to a certain extent. In this paper, we propose a novel spatial-spectral-temporal based attention 3D dense network, named SST-EmotionNet, for EEG emotion recognition. The main advantage of the SST-EmotionNet is the simultaneous integration of spatial-spectral-temporal features in a unified network framework. Meanwhile, a 3D attention mechanism is designed to adaptively explore discriminative local patterns. Extensive experiments on two real-world datasets demonstrate that the SST-EmotionNet outperforms the state-of-the-art baselines.

### Language Models as Emotional Classifiers for Textual Conversation

• Connor T. Heaton
• David M. Schwartz

Emotions play a critical role in our everyday lives by altering how we perceive, process and respond to our environment. Affective computing aims to instill in computers the ability to detect and act on the emotions of users. A core aspect of any affective computing system is the classification of a user's emotion. In this study we present a novel methodology for classifying emotion in a conversation. At the backbone of our proposed methodology is a pre-trained Language Model (LM), which is supplemented by a Graph Convolutional Network (GCN) that propagates information over the predicate-argument structure identified in an utterance. We apply our proposed methodology on the IEMOCAP and Friends data sets, achieving state-of-the-art performance on the former and a higher accuracy on certain emotional labels on the latter. Furthermore, we examine the role context plays in our methodology by altering how much of the preceding conversation the model has access to when making a classification.

### Occluded Facial Expression Recognition with Step-Wise Assistance from Unpaired Non-Occluded Images

• Bin Xia
• Shangfei Wang

Although facial expression recognition has improved in recent years, it is still very challenging to recognize expressions from occluded facial images in the wild. Due to the lack of large-scale facial expression datasets with diversity of the type and position of occlusions, it is very difficult to learn robust occluded expression classifier directly from limited occluded images. Considering facial images without occlusions usually provide more information for facial expression recognition compared to occluded facial images, we propose a step-wise learning strategy for occluded facial expression recognition that utilizes unpaired non-occluded images as guidance in the feature and label space. Specifically, we first measure the complexity of non-occluded data using distribution density in a feature space and split data into three subsets. In this way, the occluded expression classifier can be guided by basic samples first, and subsequently leverage more meaningful and discriminative samples. Complementary adversarial learning techniques are applied in the global-level and local-level feature space throughout, forcing the distribution of the occluded features to be close to the distribution of the non-occluded features. We also take the variability of the different images' transferability into account via adaptive classification loss. Loss inequality regularization is imposed in the label space to calibrate the output values of the occluded network. Experimental results show that our method improves performance on both synthesized occluded databases and realistic occluded databases.

### Learning from Macro-expression: a Micro-expression Recognition Framework

• Bin Xia
• Weikang Wang
• Shangfei Wang
• Enhong Chen

As one of the most important forms of psychological behavior, micro-expressions can reveal real emotions. However, the existing labeled micro-expression samples are too limited to train a high-performance micro-expression classifier. Since micro-expressions and macro-expressions share some similarities in facial muscle movements and texture changes, in this paper we propose a micro-expression recognition framework that leverages macro-expression samples as guidance. Specifically, we first introduce two Expression-Identity Disentangle Networks, named MicroNet and MacroNet, as feature extractors to disentangle expression-related features for micro- and macro-expression samples. Then MacroNet is fixed and used to guide the fine-tuning of MicroNet in both the label and feature spaces. An adversarial learning strategy and a triplet loss are added at the feature level between MicroNet and MacroNet, so that MicroNet can efficiently capture the shared features of micro-expression and macro-expression samples. Loss inequality regularization is imposed on the label space to make the output of MicroNet converge to that of MacroNet. Comprehensive experiments on three public spontaneous micro-expression databases, i.e., SMIC, CASME2 and SAMM, demonstrate the superiority of the proposed method.

### Emotion-Based End-to-End Matching Between Image and Music in Valence-Arousal Space

• Sicheng Zhao
• Yaxian Li
• Xingxu Yao
• Weizhi Nie
• Pengfei Xu
• Jufeng Yang
• Kurt Keutzer

Both images and music can convey rich semantics and are widely used to induce specific emotions. Matching images and music with similar emotions might help to make emotion perceptions more vivid and stronger. Existing emotion-based image and music matching methods either employ limited categorical emotion states which cannot well reflect the complexity and subtlety of emotions, or train the matching model using an impractical multi-stage pipeline. In this paper, we study end-to-end matching between image and music based on emotions in the continuous valence-arousal (VA) space. First, we construct a large-scale dataset, termed Image-Music-Emotion-Matching-Net (IMEMNet), with over 140K image-music pairs. Second, we propose cross-modal deep continuous metric learning (CDCML) to learn a shared latent embedding space which preserves the cross-modal similarity relationship in the continuous matching space. Finally, we refine the embedding space by further preserving the single-modal emotion relationship in the VA spaces of both images and music. The metric learning in the embedding space and task regression in the label space are jointly optimized for both cross-modal matching and single-modal VA prediction. The extensive experiments conducted on IMEMNet demonstrate the superiority of CDCML for emotion-based image and music matching as compared to the state-of-the-art approaches.

## SESSION: Poster Session E2: Emotional and Social Signals in Multimedia & Media Interpretation

### Exploiting Multi-Emotion Relations at Feature and Label Levels for Emotion Tagging

• Zhiwei Xu
• Shangfei Wang
• Can Wang

The dependence among emotions is crucial to boost emotion tagging. In this paper, we propose a novel emotion tagging method, that thoroughly explores emotion relations from both the feature and label levels. Specifically, a graph convolutional network is introduced to inject local dependence among emotions into the model at the feature level, while an adversarial learning strategy is applied to constrain the joint distribution of multiple emotions at the label level. In addition, a new balanced loss function that mitigates the adverse effects of intra-class and inter-class imbalance is introduced to deal with the imbalance of emotion labels. Experimental results on several benchmark databases demonstrate the superiority of the proposed method compared to state-of-the-art works.

### Uncertainty-aware Cross-dataset Facial Expression Recognition via Regularized Conditional Alignment

• Linyi Zhou
• Xijian Fan
• Yingjie Ma
• Qiaolin Ye

Cross-dataset facial expression recognition (FER) has remained a challenging problem due to the obvious biases caused by diverse subjects and various collection conditions. To this end, domain adaptation can be adopted as an effective solution by learning invariant representations across domains (datasets). However, FER requires special consideration of its specific problems, e.g., uncertainties caused by ambiguous facial images, and diverse inter- and intra-class relationships. Such uncertainties already exist in single-dataset FER, and could be significantly aggravated by enlarged class-wise discrepancies under cross-dataset scenarios. To mitigate this problem, this paper proposes an unsupervised domain adaptation method via regularized conditional alignment for FER, which adversarially reduces domain- and class-wise discrepancies while explicitly dealing with uncertainties within and across domains. Specifically, the proposed method effectively suppresses uncertainties in FER transfer tasks via: 1) a semantics-preserving adaptation framework which enforces both domain-invariant learning and class-level semantic consistency between source and target expression data, where discriminative cluster structures are simultaneously retained; 2) auxiliary uncertainty regularization which further constrains the ambiguity of cluster boundaries to guarantee the transferring reliability, thus discouraging the negative transfer brought by divergent facial images. Evaluation experiments on publicly available datasets demonstrate that the proposed method significantly outperforms the current state-of-the-art methods.

### Fonts Like This but Happier: A New Way to Discover Fonts

• Tugba Kulahcioglu
• Gerard de Melo

Fonts carry strong emotional and social signals, and can affect user engagement in significant ways. Hence, selecting the right font is a very important step in the design of a multimodal artifact with text. Currently, font exploration is frequently carried out via associated social tags. Users are expected to browse through thousands of fonts tagged with certain concepts to find the one that works best for their use case. In this study, we propose a new multimodal font discovery method in which users provide a reference font together with the changes they wish to obtain in order to get closer to their ideal font. This allows for efficient and goal-driven navigation of the font space, and discovery of fonts that would otherwise likely be missed. We achieve this by learning cross-modal vector representations that connect fonts and query words.

### Adaptive Multimodal Fusion for Facial Action Units Recognition

• Huiyuan Yang
• Taoyue Wang
• Lijun Yin

Multimodal facial action unit (AU) recognition aims to build models that are capable of processing, correlating, and integrating information from multiple modalities (i.e., 2D images from a visual sensor, 3D geometry from 3D imaging, and thermal images from an infrared sensor). Although multimodal data can provide rich information, there are two challenges that have to be addressed when learning from it: 1) the model must capture the complex cross-modal interactions in order to utilize the additional and mutual information effectively; 2) the model must be robust to unexpected data corruption during testing, such as a modality being missing or noisy. In this paper, we propose a novel Adaptive Multimodal Fusion method (AMF) for AU detection, which learns to select the most relevant feature representations from different modalities by a re-sampling procedure conditioned on a feature scoring module. The feature scoring module is designed to evaluate the quality of features learned from multiple modalities. As a result, AMF is able to adaptively select more discriminative features, thus increasing the robustness to missing or corrupted modalities. In addition, to alleviate the over-fitting problem and make the model generalize better on the testing data, a cut-switch multimodal data augmentation method is designed, by which a random block is cut and switched across multiple modalities. We have conducted a thorough investigation on two public multimodal AU datasets, BP4D and BP4D+, and the results demonstrate the effectiveness of the proposed method. Ablation studies under various circumstances also show that our method remains robust to missing or noisy modalities during tests.
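
The cut-switch augmentation described here can be pictured as swapping one randomly placed rectangular block between spatially aligned inputs. The sketch below is a minimal version under several assumptions that do not come from the abstract: only two modalities are used, both are same-sized 2D grids stored as nested lists, and the block size is fixed by the caller. The function name `cut_switch` is hypothetical.

```python
import copy
import random


def cut_switch(mod_a, mod_b, block_h, block_w, rng=random):
    """Swap one randomly placed block_h x block_w block between two aligned 2D maps."""
    height, width = len(mod_a), len(mod_a[0])
    top = rng.randint(0, height - block_h)    # inclusive bounds
    left = rng.randint(0, width - block_w)
    out_a, out_b = copy.deepcopy(mod_a), copy.deepcopy(mod_b)
    for r in range(top, top + block_h):
        for c in range(left, left + block_w):
            # Each output takes the block from the other modality.
            out_a[r][c], out_b[r][c] = mod_b[r][c], mod_a[r][c]
    return out_a, out_b


rng = random.Random(0)                      # seeded for repeatability
visual = [[1] * 4 for _ in range(4)]        # stand-in for a visual frame
thermal = [[2] * 4 for _ in range(4)]       # stand-in for a thermal frame
aug_visual, aug_thermal = cut_switch(visual, thermal, 2, 2, rng)
```

The inputs are copied rather than modified in place, so the original samples remain available for further augmentation passes.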

### Exploiting Self-Supervised and Semi-Supervised Learning for Facial Landmark Tracking with Unlabeled Data

• Shi Yin
• Shangfei Wang
• Xiaoping Chen
• Enhong Chen

Current work of facial landmark tracking usually requires large amounts of fully annotated facial videos to train a landmark tracker. To relieve the burden of manual annotations, we propose a novel facial landmark tracking method that makes full use of unlabeled facial videos by exploiting both self-supervised and semi-supervised learning mechanisms. First, self-supervised learning is adopted for representation learning from unlabeled facial videos. Specifically, a facial video and its shuffled version are fed into a feature encoder and a classifier. The feature encoder is used to learn visual representations, and the classifier distinguishes the input videos as the original or the shuffled ones. The feature encoder and the classifier are trained jointly. Through self-supervised learning, the spatial and temporal patterns of a facial video are captured at representation level. After that, the facial landmark tracker, consisting of the pre-trained feature encoder and a regressor, is trained semi-supervisedly. The consistencies among the tracking results of the original, the inverse and the disturbed facial sequences are exploited as the constraints on the unlabeled facial videos, and the supervised loss is adopted for the labeled videos. Through semi-supervised end-to-end training, the tracker captures sequential patterns inherent in facial videos despite small amount of manual annotations. Experiments on two benchmark datasets show that the proposed framework outperforms state-of-the-art semi-supervised facial landmark tracking methods, and also achieves advanced performance compared to fully supervised facial landmark tracking methods.

### Cross Corpus Physiological-based Emotion Recognition Using a Learnable Visual Semantic Graph Convolutional Network

• Woan-Shiuan Chien
• Hao-Chun Yang
• Chi-Chun Lee

Affective media videos have been used as stimuli to investigate an individual's affective-physio responses. In this study, we aim to develop a network learning strategy for robust cross-corpus emotion recognition using physiological features jointly with affective video content. Specifically, we present a novel framework, the Visual Semantic Graph Learning Convolutional Network (VGLCN), for individual emotional state recognition using physiology on transfer learning tasks. The content of the video stimuli is integrated into a learnable graph structure to weight the importance of physiology on the two emotion dimensions, valence and arousal. Furthermore, we evaluate our proposed framework on two public emotion databases with a rigorous cross-validation method, and our model achieves the best unweighted average recall (UAR), which is 67.9%, 56.9% for arousal and 79.8%, 70.4% for valence on the cross-dataset recognition experiments, respectively. Further analyses reveal that 1) VGLCN is especially effective on the transfer valence binary task, 2) the physiological features (ECG, EDA) are very informative for emotion recognition and 3) the affective media videos are an important constraint to include in the framework to stabilize performance.
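
For readers unfamiliar with the metric, unweighted average recall (UAR) is the mean of the per-class recalls, so each class counts equally regardless of how many samples it has. The sketch below computes it in plain Python; the function name and the toy labels are illustrative assumptions and are unrelated to the paper's data.

```python
from collections import defaultdict


def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recalls; every class counts equally regardless of size."""
    total = defaultdict(int)
    correct = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


# Hypothetical binary arousal labels with class imbalance.
y_true = ["high", "high", "high", "high", "low"]
y_pred = ["high", "high", "high", "low", "low"]
uar = unweighted_average_recall(y_true, y_pred)  # (3/4 + 1/1) / 2
```

On this toy example plain accuracy is 0.8 while UAR is 0.875, which shows how the two metrics diverge when classes are imbalanced.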

### Few-Shot Ensemble Learning for Video Classification with SlowFast Memory Networks

• Mengshi Qi
• Jie Qin
• Xiantong Zhen
• Di Huang
• Yi Yang
• Jiebo Luo

In the era of big data, few-shot learning has recently received much attention in multimedia analysis and computer vision due to its appealing ability of learning from scarce labeled data. However, it has been largely underdeveloped in the video domain, which is even more challenging due to the huge spatial-temporal variability of video data. In this paper, we address few-shot video classification by learning an ensemble of SlowFast networks augmented with memory units. Specifically, we introduce a family of few-shot learners based on SlowFast networks which are used to extract informative features at multiple rates, and we incorporate a memory unit into each network to enable encoding and retrieving crucial information instantly. Furthermore, we propose a choice controller network to leverage the diversity of few-shot learners by learning to adaptively assign a confidence score to each SlowFast memory network, leading to a strong classifier for enhanced prediction. Experimental results on two widely-adopted video datasets demonstrate the effectiveness of the proposed method, as well as its superior performance over the state-of-the-art approaches.
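
The idea of assigning a confidence score to each learner and combining their predictions can be sketched as a softmax-weighted average of class probabilities. This is only an assumed form of the fusion: the abstract does not state how the choice controller's scores are normalized or applied, and the names `softmax` and `ensemble_predict` as well as the numbers below are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of raw scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_predict(learner_probs, controller_scores):
    """Fuse per-learner class probabilities using softmax-normalized confidences."""
    weights = softmax(controller_scores)
    num_classes = len(learner_probs[0])
    fused = [
        sum(w * probs[c] for w, probs in zip(weights, learner_probs))
        for c in range(num_classes)
    ]
    return fused.index(max(fused)), fused


# Two hypothetical learners that disagree; the controller trusts the second more.
learner_probs = [[0.6, 0.4], [0.2, 0.8]]
controller_scores = [0.0, 2.0]
label, fused = ensemble_predict(learner_probs, controller_scores)
```

Because the weights sum to one, the fused vector remains a valid probability distribution whenever each learner's output is one.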

• Chenyu Li
• Shiming Ge
• Daichi Zhang
• Jia Li

Many real-world applications today like video surveillance and urban governance need to address the recognition of masked faces, where content replacement by diverse masks often brings in incomplete appearance and ambiguous representation, leading to a sharp drop in accuracy. Inspired by recent progress on amodal perception, we propose to migrate the mechanism of amodal completion for the task of masked face recognition with an end-to-end de-occlusion distillation framework, which consists of two modules. The de-occlusion module applies a generative adversarial network to perform face completion, which recovers the content under the mask and eliminates appearance ambiguity. The distillation module takes a pre-trained general face recognition model as the teacher and transfers its knowledge to train a student for completed faces using massive online synthesized face pairs. Especially, the teacher knowledge is represented with structural relations among instances in multiple orders, which serves as a posterior regularization to enable the adaptation. In this way, the knowledge can be fully distilled and transferred to identify masked faces. Experiments on synthetic and realistic datasets show the efficacy of the proposed approach.

### Privacy-sensitive Objects Pixelation for Live Video Streaming

• Jizhe Zhou
• Chi-Man Pun
• Yu Tong

With the prevalence of live video streaming, establishing an online pixelation method for privacy-sensitive objects has become urgent. Because detection of privacy-sensitive objects is inaccurate, simply migrating the tracking-by-detection structure used in offline pixelation to the online setting incurs problems in target initialization, drifting, and over-pixelation. To cope with this inevitable but impactful detection issue, we propose a novel Privacy-sensitive Objects Pixelation (PsOP) framework for automatic personal privacy filtering during live video streaming. Leveraging pre-trained detection networks, our PsOP is extendable to the pixelation of any potential privacy-sensitive objects. Employing embedding networks and the proposed Positioned Incremental Affinity Propagation (PIAP) clustering algorithm as the backbone, our PsOP unifies the pixelation of discriminating and indiscriminating pixelation objects through trajectory generation. In addition to boosting pixelation accuracy, experimental results on the streaming video data we built show that the proposed PsOP can significantly reduce the over-pixelation ratio in privacy-sensitive object pixelation.

### Deep Local Binary Coding for Person Re-Identification by Delving into the Details

• Jiaxin Chen
• Jie Qin
• Yichao Yan
• Lei Huang
• Li Liu
• Fan Zhu
• Ling Shao

Person re-identification (ReID) has recently received extensive research interests due to its diverse applications in multimedia analysis and computer vision. However, the majority of existing works focus on improving matching accuracy, while ignoring matching efficiency. In this work, we present a novel binary representation learning framework for efficient person ReID, namely Deep Local Binary Coding (DLBC). Different from existing deep binary ReID approaches, DLBC attempts to learn discriminative binary codes by explicitly interacting with local visual details. Specifically, DLBC first extracts a set of local features from spatially salient regions of pedestrian images. Subsequently, DLBC formulates a new binary-local semantic mutual information (BSMI) maximization term, based on which a self-lifting (SL) block is built to further exploit the semantic importance of local features. The BSMI term together with the SL block simultaneously enhances the dependency of binary codes on selected local features as well as their robustness to cross-view visual inconsistency. In addition, an efficient optimizing method is developed to train the proposed deep models with orthogonal and binary constraints. Extensive experiments reveal that DLBC significantly minimizes the accuracy gap between binary ReID methods and the state-of-the-art real-valued ones, whilst remarkably reducing query time and memory cost.
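The efficiency argument for binary codes can be illustrated with a minimal sketch: once features are binarized, matching reduces to XOR and bit counting. This is not the DLBC model itself; the helper names and the toy 8-bit codes below are assumptions made for illustration.

```python
def pack_bits(bits):
    """Pack a sequence of 0/1 values into a single Python int."""
    code = 0
    for b in bits:
        code = (code << 1) | (1 if b else 0)
    return code


def hamming(a, b):
    """Number of differing bits between two packed codes."""
    return bin(a ^ b).count("1")


def rank_gallery(query, gallery):
    """Return gallery indices sorted by Hamming distance to the query."""
    return sorted(range(len(gallery)), key=lambda i: hamming(query, gallery[i]))


query = pack_bits([1, 0, 1, 1, 0, 0, 1, 0])
gallery = [
    pack_bits([0, 1, 0, 0, 1, 1, 0, 1]),  # all 8 bits differ
    pack_bits([1, 0, 1, 1, 0, 0, 1, 1]),  # 1 bit differs
    pack_bits([1, 0, 0, 1, 0, 1, 1, 0]),  # 2 bits differ
]
print(rank_gallery(query, gallery))  # [1, 2, 0]
```

Each gallery entry costs one integer per code rather than a vector of floats, which is where the memory and query-time savings of binary ReID come from.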

### March on Data Imperfections: Domain Division and Domain Generalization for Semantic Segmentation

• Hai Xu
• Hongtao Xie
• Zheng-Jun Zha
• Sun-ao Liu
• Yongdong Zhang

Significant progress has been made in semantic segmentation by deep neural networks, most of which concentrate on discriminative representation learning. However, model performances suffer from deterioration when the training process is optimized without awareness of data imperfections (e.g., data imbalance and label noise). In contrast to previous works, we present a novel model-agnostic training optimization algorithm which has two prominent components: Domain Division and Domain Generalization. Rather than sampling all pixels uniformly, an uncertainty-based Domain Division method is proposed to deal with data imbalance, which dynamically decomposes the pixels into meta-train and meta-test domains according to whether they lie near the classification boundary. The meta-train domain corresponds to highly-uncertain but more informative pixels and determines the current main update direction. Furthermore, to alleviate the degradation caused by label noise, we propose a Domain Generalization technique with a meta-optimization objective which ensures that update on the meta-train domain should generalize to the meta-test domain. Comprehensive experimental results on three public benchmarks across multi-modalities show that the proposed optimization algorithm is superior to other segmentation optimization methods and significantly outperforms conventional methods without introducing additional model parameters.
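The division step can be sketched as follows, assuming (hypothetically) that uncertainty is measured by the entropy of each pixel's predicted class distribution and that a fixed threshold separates the two domains. The abstract only states that the split depends on proximity to the classification boundary, so both choices are illustrative.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one pixel's class probabilities."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def divide_domains(pixel_probs, threshold):
    """Split pixel indices into (meta_train, meta_test) by uncertainty.

    Highly uncertain pixels go to meta-train; confident ones to meta-test.
    The threshold is a hypothetical hyperparameter.
    """
    meta_train, meta_test = [], []
    for idx, probs in enumerate(pixel_probs):
        if entropy(probs) >= threshold:
            meta_train.append(idx)
        else:
            meta_test.append(idx)
    return meta_train, meta_test


pixel_probs = [
    [0.98, 0.01, 0.01],  # confident
    [0.40, 0.35, 0.25],  # near the decision boundary
    [0.05, 0.90, 0.05],  # confident
    [0.50, 0.49, 0.01],  # near the decision boundary
]
train_idx, test_idx = divide_domains(pixel_probs, threshold=0.5)
print(train_idx, test_idx)  # [1, 3] [0, 2]
```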

### Gait Recognition with Multiple-Temporal-Scale 3D Convolutional Neural Network

• Beibei Lin
• Shunli Zhang
• Feng Bao

Gait recognition, one of the most important and effective biometric technologies, has a significant advantage in long-distance recognition systems. For existing gait recognition methods, the template-based approaches may lose temporal information, while the sequence-based methods cannot fully exploit the temporal relations among the sequence. To address the above issues, we propose a novel multiple-temporal-scale gait recognition framework which integrates the temporal information in multiple temporal scales, making use of both the frame and interval fusion information. Moreover, the interval-level representation is realized by a local transformation module. Concretely, a 3D convolutional neural network (3D CNN) is applied in both the small and the large temporal scales to extract the spatial-temporal information. Moreover, a frame pooling method is developed to address the mismatch between the input of the 3D network and the video frames, and a novel 3D basic network block is designed to improve efficiency. Experiments demonstrate that the multiple-temporal-scale 3D CNN based gait recognition method can achieve better performance than most recent state-of-the-art methods on the CASIA-B dataset. The proposed method obtains a rank-1 accuracy of 96.7% under normal conditions, and outperforms other methods on average accuracy by at least 5.8% and 11.1%, respectively, in complex scenarios.

### SRHEN: Stepwise-Refining Homography Estimation Network via Parsing Geometric Correspondences in Deep Latent Space

• Yi Li
• Wenjie Pei
• Zhenyu He

The crux of homography estimation is that the homography is characterized by the geometric correspondences between two related images rather than appearance features, which differs from typical image recognition tasks. Existing methods either decompose the task of homography estimation into several individual sub-problems and optimize them sequentially, or attempt to tackle it in an end-to-end manner by delegating the whole task to deep convolutional networks (CNNs). However, it is quite arduous for CNNs to learn the mapping function from appearance features of related images to the homography directly. In this paper, we propose to parse the geometric correspondences between related images explicitly to bridge the gap between deep appearance features and the homography. Furthermore, we propose a coarse-to-fine estimation framework to capture different scale of homography transformations and thus predict the homography in a stepwise-refining manner. Additionally, we propose a pyramidal supervision scheme to leverage an important prior concerning the homography estimation. Extensive experiments on two large-scale datasets demonstrate that our model advances the state-of-the-art performance significantly.
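To make the stepwise-refining idea concrete, the sketch below composes a coarse homography with a small residual refinement and applies the result to a point. The matrices are made-up translations, and composing stages by matrix multiplication is an assumption made for illustration; the abstract does not state how SRHEN combines its stages.

```python
def apply_homography(H, point):
    """Map a 2D point through a 3x3 homography using homogeneous coordinates."""
    x, y = point
    u = H[0][0] * x + H[0][1] * y + H[0][2]
    v = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return (u / w, v / w)


def compose(H2, H1):
    """Return the matrix product H2 @ H1 (apply H1 first, then H2)."""
    return [
        [sum(H2[i][k] * H1[k][j] for k in range(3)) for j in range(3)]
        for i in range(3)
    ]


# Hypothetical coarse estimate (large shift) and fine residual (small shift).
coarse = [[1.0, 0.0, 10.0], [0.0, 1.0, 5.0], [0.0, 0.0, 1.0]]
refine = [[1.0, 0.0, 0.5], [0.0, 1.0, -0.25], [0.0, 0.0, 1.0]]

total = compose(refine, coarse)
print(apply_homography(total, (2.0, 3.0)))  # (12.5, 7.75)
```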

### Tactile Sketch Saliency

• Jianbo Jiao
• Ying Cao
• Manfred Lau
• Rynson Lau

In this paper, we aim to understand the functionality of 2D sketches by predicting how humans would interact with the objects depicted by sketches in real life. Given a 2D sketch, we learn to predict a tactile saliency map for it, which represents where humans would grasp, press, or touch the object depicted by the sketch. We hypothesize that understanding 3D structure and category of the sketched object would help such tactile saliency reasoning. We thus propose to jointly predict the tactile saliency, depth map and semantic category of a sketch in an end-to-end learning-based framework. To train our model, we propose to synthesize training data by leveraging a collection of 3D shapes with 3D tactile saliency information. Experiments show that our model can predict accurate and plausible tactile saliency maps for both synthetic and real sketches. In addition, we also demonstrate that our predicted tactile saliency is beneficial to sketch recognition and sketch-based 3D shape retrieval, and enables us to establish part-based functional correspondences among sketches.

## SESSION: Poster Session F2: Media Interpretation & Mobile Multimedia

### Towards Clustering-friendly Representations: Subspace Clustering via Graph Filtering

• Zhengrui Ma
• Zhao Kang
• Guangchun Luo
• Ling Tian
• Wenyu Chen

Finding a suitable data representation for a specific task has been shown to be crucial in many applications. The success of subspace clustering depends on the assumption that the data can be separated into different subspaces. However, this simple assumption does not always hold since the raw data might not be separable into subspaces. To recover the "clustering-friendly" representation and facilitate the subsequent clustering, we propose a graph filtering approach by which a smooth representation is achieved. Specifically, it injects graph similarity into data features by applying a low-pass filter to extract useful data representations for clustering. Extensive experiments on image and document clustering datasets demonstrate that our method improves upon state-of-the-art subspace clustering techniques. Especially, its comparable performance with deep learning methods emphasizes the effectiveness of the simple graph filtering scheme for many real-world applications. An ablation study shows that graph filtering can remove noise, preserve structure in the image, and increase the separability of classes.
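A minimal sketch of low-pass graph filtering is given below. It assumes the filter takes the form (I - L/2)^k, with L the symmetric normalized Laplacian of a similarity graph W; this specific form, the order `k`, and the toy graph are assumptions for illustration rather than details given in the abstract.

```python
def low_pass_filter(X, W, k=1):
    """Smooth feature rows X over graph W by applying (I - L/2)^k.

    X: list of n feature rows; W: n x n symmetric non-negative adjacency.
    L is the symmetric normalized Laplacian, so I - L/2 equals
    (I + D^{-1/2} W D^{-1/2}) / 2.
    """
    n = len(W)
    deg = [sum(row) for row in W]
    G = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if deg[i] > 0 and deg[j] > 0:
                norm = W[i][j] / ((deg[i] * deg[j]) ** 0.5)
            else:
                norm = 0.0
            G[i][j] = 0.5 * ((1.0 if i == j else 0.0) + norm)
    dims = len(X[0])
    for _ in range(k):
        X = [
            [sum(G[i][m] * X[m][c] for m in range(n)) for c in range(dims)]
            for i in range(n)
        ]
    return X


# Two disconnected pairs of nodes: {0, 1} and {2, 3}.
W = [
    [0, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 1, 0],
]
X = [[1.0], [0.0], [10.0], [12.0]]
print(low_pass_filter(X, W, k=1))  # [[0.5], [0.5], [11.0], [11.0]]
```

Features of connected nodes are pulled together while the two groups stay apart, which is the sense in which the filtered representation is more clustering-friendly.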

### One-shot Scene Graph Generation

• Yuyu Guo
• Jingkuan Song
• Lianli Gao
• Heng Tao Shen

As a structured representation of the image content, the visual scene graph (visual relationship) acts as a bridge between computer vision and natural language processing. Existing models on the scene graph generation task notoriously require tens or hundreds of labeled samples. By contrast, human beings can learn visual relationships from a few or even one example. Inspired by this, we design a task named One-Shot Scene Graph Generation, where each relationship triplet (e.g., "dog-has-head") comes from only one labeled example. The key insight is that rather than learning from scratch, one can utilize rich prior knowledge. In this paper, we propose Multiple Structured Knowledge (Relational Knowledge and Commonsense Knowledge) for the one-shot scene graph generation task. Specifically, the Relational Knowledge represents the prior knowledge of relationships between entities extracted from the visual content, e.g., the visual relationships "standing in", "sitting in", and "lying in" may exist between "dog" and "yard", while the Commonsense Knowledge encodes "sense-making" knowledge like "dog can guard yard". By organizing these two kinds of knowledge in a graph structure, Graph Convolution Networks (GCNs) are used to extract knowledge-embedded semantic features of the entities. Besides, instead of extracting isolated visual features from each entity generated by Faster R-CNN, we utilize an Instance Relation Transformer encoder to fully explore their context information. Based on a constructed one-shot dataset, the experimental results show that our method outperforms existing state-of-the-art methods by a large margin. Ablation studies also verify the effectiveness of the Instance Relation Transformer encoder and the Multiple Structured Knowledge.

### Cross-Granularity Learning for Multi-Domain Image-to-Image Translation

• Huiyuan Fu
• Ting Yu
• Xin Wang

Image translation across diverse domains has attracted more and more attention. Existing multi-domain image-to-image translation algorithms only learn the features of the complete image without considering specific features of local instances. To ensure the important instance to be more realistically translated, we propose a cross-granularity learning model for multi-domain image-to-image translation. We provide detailed procedures to capture the features of instances during the learning process, and specifically learn the relationship between style of the global image and the style of an instance on the image through the enforcing of the cross-granularity consistency. In our design, we only need one generator to perform the instance-aware multi-domain image translation. Our extensive experiments on several multi-domain image-to-image translation datasets show that our proposed method can achieve superior performance compared with the state-of-the-art approaches.

### Enhancing Self-supervised Monocular Depth Estimation via Incorporating Robust Constraints

• Rui Li
• Xiantuo He
• Yu Zhu
• Xianjun Li
• Jinqiu Sun
• Yanning Zhang

Self-supervised depth estimation has shown great prospects in inferring 3D structures using purely unannotated images. However, its performance usually drops when trained on images with changing brightness and moving objects. In this paper, we address this issue by enhancing the robustness of the self-supervised paradigm using a set of image-based and geometry-based constraints. Our contributions are threefold: 1) we propose a gradient-based robust photometric loss which restrains the false supervisory signals caused by brightness changes, 2) we propose to filter out the unreliable areas that violate the rigid assumption by a novel combined selective mask, which is computed on the forward pass of the network by leveraging the inter-loss consistency and the loss-gradient consistency, and 3) we constrain the motion estimation network to generate across-frame consistent motions via a triplet-based cycle consistency constraint. Extensive experiments conducted on the KITTI, Cityscapes and Make3D datasets demonstrate that the proposed method can effectively handle complex scenes with changing brightness and object motions. Both qualitative and quantitative results show that the proposed method outperforms the state-of-the-art methods.
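The intuition behind comparing image gradients rather than raw intensities can be shown on a one-dimensional toy signal: a uniform brightness offset changes every intensity but leaves the gradients untouched. This is only a sketch of that general idea under an additive-brightness assumption; the paper's actual loss is not specified in the abstract, and the function names here are hypothetical.

```python
def gradients(row):
    """Forward differences of a 1D intensity signal."""
    return [row[i + 1] - row[i] for i in range(len(row) - 1)]


def intensity_loss(a, b):
    """Mean absolute difference of raw intensities."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def gradient_loss(a, b):
    """Mean absolute difference of signal gradients."""
    ga, gb = gradients(a), gradients(b)
    return sum(abs(x - y) for x, y in zip(ga, gb)) / len(ga)


target = [10, 12, 15, 15, 11]
brighter = [v + 20 for v in target]  # same scene, uniformly brighter

print(intensity_loss(target, brighter))  # 20.0
print(gradient_loss(target, brighter))   # 0.0
```

An intensity-based loss would penalize the brighter view as if the geometry were wrong, whereas the gradient-based comparison reports no error.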

### A Novel Object Re-Track Framework for 3D Point Clouds

• Tuo Feng
• Licheng Jiao
• Hao Zhu
• Long Sun

3D point cloud data is an important data source for autonomous vehicles to perceive the surroundings. Achieving accurate object tracking of 3D point clouds has become a challenging task. In this paper, we propose a 3D object two-stage re-track framework directly utilizing point clouds as the input, without using the ground truth as the reference box. The framework consists of a coarse stage and a fine stage. By tracking back the previous T frames and expanding the search space for each frame, we add the fine stage to re-track the lost objects of the coarse stage. Moreover, we design a dense AutoEncoder to enhance the discrimination in the latent space and improve shape completion performance, thus improving tracking performance. A Sample Update Strategy is also proposed to aggregate similar model shape samples in different frames, which improves the quality of the model shape. In terms of motion models for the proposed re-track framework, we further compare Kalman Filter with PointLSTM and do an extensive analysis. Finally, we test the re-track framework on the KITTI tracking dataset and outperform the public benchmark by 17.1%/15.5% in Success and Precision, respectively. Our code and model are available at https://github.com/FengZicai/Re-Track.

### Video Relation Detection via Multiple Hypothesis Association

• Zixuan Su
• Xindi Shang
• Jingjing Chen
• Yu-Gang Jiang
• Zhiyong Qiu
• Tat-Seng Chua

Video visual relation detection (VidVRD) aims at obtaining not only the trajectories of objects but also the dynamic visual relations between them. It provides abundant information for video understanding and can serve as a bridge between vision and language. Compared with visual relation detection on images, VidVRD requires one additional final step, called visual relation association, which associates relation segments across the time dimension into video relations. This step plays an important role in the task but is less studied. It is also a difficult step, as the association process is easily affected by inaccurate tracklet detection and relation prediction in the former steps. In this paper, we propose a novel relation association method called Multiple Hypothesis Association (MHA). It maintains multiple possible relation hypotheses during the association process in order to tolerate and handle inaccurate or missing results from the former steps and generate more accurate video relations. Our experiments on the benchmark datasets (ImageNet-VidVRD and VidOR) show that our method outperforms the state-of-the-art methods.

### HOT-Net: Non-Autoregressive Transformer for 3D Hand-Object Pose Estimation

• Lin Huang
• Jianchao Tan
• Jingjing Meng
• Ji Liu
• Junsong Yuan

As we use our hands frequently in daily activities, the analysis of hand-object interactions plays a critical role in many multimedia understanding and interaction applications. Different from conventional 3D hand-only and object-only pose estimation, estimating 3D hand-object pose is more challenging due to the mutual occlusions between hand and object, as well as the physical constraints between them. To overcome these issues, we propose to fully utilize the structural correlations among hand joints and object corners in order to obtain more reliable poses. Our work is inspired by structured output learning models in the sequence transduction field, such as the Transformer encoder-decoder framework. Besides modeling inherent dependencies from extracted 2D hand-object pose, our proposed Hand-Object Transformer Network (HOT-Net) also captures the structural correlations among 3D hand joints and object corners. Similar to Transformer's autoregressive decoder, by considering structured output patterns, this helps better constrain the output space and leads to more robust pose estimation. However, different from Transformer's sequential modeling mechanism, HOT-Net adopts a novel non-autoregressive decoding strategy for 3D hand-object pose estimation. Specifically, our model removes the Transformer's dependence on previously generated results and explicitly feeds a reference 3D hand-object pose into the decoding process to provide equivalent target pose patterns for localizing each 3D keypoint in parallel. To further improve the physical validity of the estimated hand pose, besides anatomical constraints, we propose a cooperative pose constraint, which enables the hand pose to cooperate with the hand shape to generate a hand mesh. We demonstrate real-time speed and state-of-the-art performance on benchmark hand-object datasets for both 3D hand and object poses.

### Multi-Features Fusion and Decomposition for Age-Invariant Face Recognition

• Lixuan Meng
• Chenggang Yan
• Jun Li
• Jian Yin
• Wu Liu
• Hongtao Xie
• Liang Li

Although General Face Recognition (GFR) research has achieved great success, Age-Invariant Face Recognition (AIFR) is still a challenging problem since facial appearance changing over time brings significant intra-class variations. The existing discriminative methods for the AIFR task mostly focus on decomposing the facial feature from a single image into an age-related feature and an age-independent feature for recognition, and suffer from the loss of facial identity information. To address this issue, in this work we propose a novel Multi-Features Fusion and Decomposition (MFFD) framework to learn more discriminative feature representations and alleviate the intra-class variations for AIFR. Specifically, we first sample multiple face images of different ages with the same identity as a face time series. Next, we combine feature decomposition with fusion based on the face time series to ensure that the final age-independent features effectively represent the identity information of the face and have stronger robustness against aging. Moreover, we also present two feature fusion methods and several different training strategies to explore the impact on the model. Extensive experiments on several cross-age datasets (CACD, CACD-VS) demonstrate the effectiveness of our proposed method. Besides, our method also shows comparable generalization performance on the well-known LFW dataset.

### Part-Aware Interactive Learning for Scene Graph Generation

• Hongshuo Tian
• Ning Xu
• An-An Liu
• Yongdong Zhang

Generating scene graphs to describe the whereabouts and interactions of objects in an image has attracted increasing attention from researchers. Most existing methods explore object-level visual context or bodypart-object cooperation with the message passing structure, which cannot capture the part-aware interaction nature of scene graphs. Normally, a subject interacts with an object through crucial parts of each. Besides, the correlation among parts within the same object can also help predict objects and their relationships. Hence, both subject and object parts and their intra- and inter-object correlations should be fully considered for scene graph generation. In this paper, we propose a part-aware interactive learning method, which is divided into the intra-object and inter-object scenarios. First, we detect objects from an image and further decompose each one into a set of parts. Second, the part-aware graph attention module is proposed to refine part features via the intra-object message passing, and the refined features are incorporated for object inference. Third, the visual mutual attention module is designed to discover part-aware correlated visual cues precisely for predicate inference. It can highlight the subject-related object parts and the object-related subject parts during inter-object interactive learning. We demonstrate the superiority of our method over the state of the art on Visual Genome. Ablation studies and visualization further validate its effectiveness.

### Retrieval Guided Unsupervised Multi-domain Image to Image Translation

• Raul Gomez
• Yahui Liu
• Dimosthenis Karatzas
• Bruno Lepri
• Nicu Sebe

Image to image translation aims to learn a mapping that transforms an image from one visual domain to another. Recent works assume that image descriptors can be disentangled into a domain-invariant content representation and a domain-specific style representation. Thus, translation models seek to preserve the content of source images while changing the style to a target visual domain. However, synthesizing new images is extremely challenging, especially in multi-domain translations, as the network has to compose content and style to generate reliable and diverse images in multiple domains. In this paper we propose the use of an image retrieval system to assist the image-to-image translation task. First, we train an image-to-image translation model to map images to multiple domains. Then, we train an image retrieval model using real and generated images to find images similar to a query one in content but in a different domain. Finally, we exploit the image retrieval system to fine-tune the image-to-image translation model and generate higher quality images. Our experiments show the effectiveness of the proposed solution and highlight the contribution of the retrieval network, which can benefit from additional unlabeled data and help image-to-image translation models in the presence of scarce data.

### GangSweep: Sweep out Neural Backdoors by GAN

• Liuwan Zhu
• Rui Ning
• Cong Wang
• Chunsheng Xin
• Hongyi Wu

This work proposes GangSweep, a new backdoor detection framework that leverages the super reconstructive power of Generative Adversarial Networks (GAN) to detect and "sweep out" neural backdoors. It is motivated by a series of intriguing empirical investigations, revealing that the perturbation masks generated by GAN are persistent and exhibit interesting statistical properties with low shifting variance and large shifting distance in feature space. Compared with the previous solutions, the proposed approach eliminates the reliance on the access to training data, and shows a high degree of robustness and efficiency for detecting and mitigating a wide range of backdoored models with various settings. Moreover, this is the first work that successfully leverages generative networks to defend against advanced neural backdoors with multiple triggers and their polymorphic forms.

### Iterative Back Modification for Faster Image Captioning

• Zhengcong Fei

Current state-of-the-art image captioning systems generally produce a sentence from left to right, and every step is conditioned on the given image and previously generated words. Nevertheless, such autoregressive nature makes the inference process difficult to parallelize and leads to high captioning latency. In this paper, we propose a non-autoregressive approach for faster image caption generation. Technically, low-dimension continuous latent variables are shaped to capture semantic information and word dependencies from extracted image features before sentence decoding. Moreover, we develop an iterative back modification inference algorithm, which continuously refines the latent variables with a look back mechanism and parallelly generates the whole sentence based on the updated latent variables in a constant number of steps. Extensive experiments demonstrate that our method achieves competitive performance compared to prevalent autoregressive captioning models while significantly reducing the decoding time on average.

### VIMES: A Wearable Memory Assistance System for Automatic Information Retrieval

• Carlos Bermejo
• Tristan Braud
• Ji Yang
• Shayan Mirjafari
• Bowen Shi
• Yu Xiao
• Pan Hui

The advancement of artificial intelligence and wearable computing triggers the radical innovation of cognitive applications. In this work, we propose VIMES, an augmented reality-based memory assistance system that helps recall declarative memory, such as whom the user meets and what they chat about. Through a collaborative method with 20 participants, we design VIMES, a system that runs on smartglasses, takes the first-person audio and video as input, and extracts personal profiles and event information to display on the embedded display or a smartphone. We perform an extensive evaluation with 50 participants to show the effectiveness of VIMES for memory recall. VIMES outperforms (90% memory accuracy) other traditional methods such as self-recall (34%) while offering the best memory experience (Vividness, Coherence, and Visual Perspective all score over 4/5). The user study results show that most participants find VIMES useful (3.75/5) and easy to use (3.46/5).

## SESSION: Poster Session G2: Multimedia -- Art and Entertainment, Cloud and Edge Computing, Data Systems, & HCI

### Neutral Face Game Character Auto-Creation via PokerFace-GAN

• Tianyang Shi
• Zhengxia Zou
• Xinhui Song
• Zheng Song
• Changjian Gu
• Changjie Fan
• Yi Yuan

Game character customization is one of the core features of many recent Role-Playing Games (RPGs), where players can edit the appearance of their in-game characters according to their preferences. This paper studies the problem of automatically creating in-game characters from a single photo. In recent literature on this topic, neural networks are introduced to make the game engine differentiable, and self-supervised learning is used to predict facial customization parameters. However, in previous methods, the expression parameters and facial identity parameters are highly coupled with each other, making it difficult to model the intrinsic facial features of the character. Besides, the neural network based renderer used in previous methods is also difficult to extend to multi-view rendering cases. In this paper, considering the above problems, we propose a novel method named "PokerFace-GAN" for neutral face game character auto-creation. We first build a differentiable character renderer which is more flexible than the previous methods in multi-view rendering cases. We then take advantage of adversarial training to effectively disentangle the expression parameters from the identity parameters and thus generate player-preferred neutral face (expression-less) characters. Since all components of our method are differentiable, our method can be easily trained under a multi-task self-supervised learning paradigm. Experimental results show that our method can generate vivid neutral face game characters that are highly similar to the input photos. The effectiveness of our method is verified by comparison results and ablation studies.

### Gray2ColorNet: Transfer More Colors from Reference Image

• Peng Lu
• Jinbei Yu
• Xujun Peng
• Zhaoran Zhao
• Xiaojie Wang

Image colorization is an effective approach to provide plausible colors for grayscale images, which can achieve better and more pleasing visual quality. Although exemplar-based colorization approaches provide promising results, they rely on either semantic colors or global colors only from the reference images. In the former case, when the correspondence between the input grayscale image and the reference image is not established, the colors of the reference image cannot be transferred to the input grayscale image successfully. In the latter case, because only global colors are considered, it is hard to produce a color image whose objects have the same color as those in the reference image when they are semantically related. Thus, an end-to-end colorization network, Gray2ColorNet, is proposed in this work, in which a color fusion network based on an attention gating mechanism is designed to accomplish the colorization task. With the proposed method, the semantic colors and global color distribution from the reference image are fused effectively and transferred to the final color images along with the prior knowledge of colors contained in the training data. The experimental results demonstrate the superior colorization performance of the proposed method compared to other state-of-the-art approaches.

### Crossing You in Style: Cross-modal Style Transfer from Music to Visual Arts

• Cheng-Che Lee
• Wan-Yi Lin
• Yen-Ting Shih
• Pei-Yi (Patricia) Kuo
• Li Su

Music-to-visual style transfer is a challenging yet important cross-modal learning problem in the practice of creativity. Its major difference from the traditional image style transfer problem is that the style information is provided by music rather than images. Assuming that musical features can be properly mapped to visual contents through semantic links between the two domains, we solve the music-to-visual style transfer problem in two steps: music visualization and style transfer. The music visualization network utilizes an encoder-generator architecture with a conditional generative adversarial network to generate image-based music representations from music data. This network is integrated with an image style transfer method to accomplish the style transfer process. Experiments are conducted on WikiArt-IMSLP, a newly compiled dataset including Western music recordings and paintings listed by decades. By utilizing such a label to learn the semantic connection between paintings and music, we demonstrate that the proposed framework can generate diverse image style representations from a music piece, and these representations can unveil certain art forms of the same era. Subjective testing results also emphasize the role of the era label in improving the perceptual quality on the compatibility between music and visual content.

### Modeling Caricature Expressions by 3D Blendshape and Dynamic Texture

• Keyu Chen
• Jianmin Zheng
• Jianfei Cai
• Juyong Zhang

The problem of deforming an artist-drawn caricature according to a given normal face expression is of interest in applications such as social media, animation and entertainment. This paper presents a solution to the problem, with an emphasis on enhancing the ability to create desired expressions while preserving the identity exaggeration style of the caricature, which poses challenges due to the complicated nature of caricatures. The key to our solution is a novel method to model caricature expression, which extends the traditional 3DMM representation to the caricature domain. The method consists of shape modelling and texture generation for caricatures. Geometric optimization is developed to create identity-preserving blendshapes for reconstructing accurate and stable geometric shape, and a conditional generative adversarial network (cGAN) is designed for generating dynamic textures under target expressions. Combining the shape and texture components allows the non-trivial expressions of a caricature to be effectively defined by the extension of the popular 3DMM representation, so that a caricature can be flexibly deformed into arbitrary expressions with visually good results in both shape and color spaces. The experiments demonstrate the effectiveness of the proposed method.

### SketchMan: Learning to Create Professional Sketches

• Jia Li
• Nan Gao
• Tong Shen
• Wei Zhang
• Tao Mei
• Hui Ren

Human free-hand sketches have been studied in various fields including sketch recognition, synthesis and sketch-based image retrieval. We propose a new challenging task sketch enhancement (SE) defined in an ill-posed space, i.e. enhancing a non-professional sketch (NPS) to a professional sketch (PS), which is a creative generation task different from sketch abstraction, sketch completion and sketch variation. For the first time we release a database of NPS with PS for anime characters. We cast sketch enhancement as an image-to-image translation problem by exploiting the relationship to corresponding intensive or sparse pixel domains for sketch domain. Specifically, we explore three different routines based on conditional generative adversarial network (cGAN), i.e. Sketch-Sketch (SS), Sketch-Colorization-Sketch (SCS) and Sketch-Abstraction-Sketch (SAS). SS is a one-stage model that directly maps NPS to PS, while SCS and SAS are two-stage models where auxiliary inputs, grayscale parsing and shape parsing, are involved. Multiple metrics are used to evaluate the performance of the models in both the sketch domain and other low-level feature domains. With quantitative and qualitative analysis of the experiments, we have established solid baselines, which, we hope, could encourage more research conducted on this task. Our dataset is publicly available via https://github.com/LCXCUC/SketchMan2020.

### Anisotropic Stroke Control for Multiple Artists Style Transfer

• Xuanhong Chen
• Xirui Yan
• Naiyuan Liu
• Ting Qiu
• Bingbing Ni

Though significant progress has been made in artistic style transfer, most existing methods struggle to preserve semantic information in a fine-grained, locally consistent manner, especially when multiple artists' styles must be transferred within a single model. To circumvent this issue, we propose a Stroke Control Multi-Artist Style Transfer framework. On the one hand, we design an Anisotropic Stroke Module (ASM) which realizes the dynamic adjustment of style-stroke between the non-trivial and the trivial regions. ASM endows the network with the ability of adaptive semantic-consistency among various styles. On the other hand, we present a novel Multi-Scale Projection Discriminator to realize the texture-level conditional generation. In contrast to the single-scale conditional discriminator, our discriminator is able to capture multi-scale texture clues to effectively distinguish a wide range of artistic styles. Extensive experimental results demonstrate the feasibility and effectiveness of our approach. Our framework can transform a photograph into oil paintings of different artistic styles using only ONE single model. Furthermore, the results exhibit distinctive artistic styles and retain the anisotropic semantic information.

### A Multi-update Deep Reinforcement Learning Algorithm for Edge Computing Service Offloading

• Hao Hao
• Changqiao Xu
• Lujie Zhong
• Gabriel-Miro Muntean

By pushing computing functionalities to network edges, backhaul network bandwidth is saved and various latency requirements are met, providing support for diverse computation-intensive and delay-sensitive multimedia services. Due to the limited capabilities of edge nodes, it is very important to decide which services should be provided locally. This paper investigates the cloud-edge service offloading problem. Different from prior works which only give the proportion of computation offloading under a computing capacity constraint, we also take the storage space into account and determine the computing status of each service. We formulate the problem as a Markov decision process whose goal is to maximize the long-term average reduction of delay. The problem is hard to solve with traditional methods because of the extremely large action space and the lack of information about transition probabilities. Instead, this paper proposes an innovative deep reinforcement learning method to solve it. The proposed multi-update reinforcement learning algorithm introduces a novel exploration strategy and update method, which dramatically reduce the size of the action space. Extensive simulation-based testing shows that the proposed algorithm converges quickly and improves the system performance more than three alternative solutions do.

### Identity-Aware Attribute Recognition via Real-Time Distributed Inference in Mobile Edge Clouds

• Zichuan Xu
• Jiangkai Wu
• Qiufen Xia
• Pan Zhou
• Jiankang Ren
• Huizhi Liang

With the development of deep learning technologies, attribute recognition and person re-identification (re-ID) have attracted extensive attention and achieved continuous improvement via executing computing-intensive deep neural networks in cloud datacenters. However, the datacenter deployment cannot meet the real-time requirement of attribute recognition and person re-ID, due to the prohibitive delay of backhaul networks and large data transmissions from cameras to datacenters. A feasible solution thus is to employ mobile edge clouds (MEC) within the proximity of cameras and enable distributed inference.

In this paper, we design novel models for pedestrian attribute recognition with re-ID in an MEC-enabled camera monitoring system. We also investigate the problem of distributed inference in the MEC-enabled camera network. To this end, we first propose a novel inference framework with a set of distributed modules, by jointly considering the attribute recognition and person re-ID. We then devise a learning-based algorithm for the distributions of the modules of the proposed distributed inference framework, considering the dynamic MEC-enabled camera network with uncertainties. We finally evaluate the performance of the proposed algorithm by both simulations with real datasets and system implementation in a real testbed. Evaluation results show that the performance of the proposed algorithm with distributed inference framework is promising, by reaching the accuracies of attribute recognition and person identification up to 92.9% and 96.6% respectively, and significantly reducing the inference delay by at least 40.6% compared with existing methods.

### Deep Unsupervised Hybrid-similarity Hadamard Hashing

• Wanqian Zhang
• Dayan Wu
• Yu Zhou
• Bo Li
• Weiping Wang
• Dan Meng

Hashing has become increasingly important for large-scale image retrieval. Recently, deep supervised hashing has shown promising performance, yet little work has been done under the more realistic unsupervised setting. The most challenging problem in unsupervised hashing methods is the lack of supervised information. Besides, existing methods fail to distinguish image pairs with different similarity degrees, which leads to a suboptimal construction of similarity matrix. In this paper, we propose a simple yet effective unsupervised hashing method, dubbed Deep Unsupervised Hybrid-similarity Hadamard Hashing (DU3H), which tackles these issues in an end-to-end deep hashing framework. DU3H employs orthogonal Hadamard codes to provide auxiliary supervised information in unsupervised setting, which can maximally satisfy the independence and balance properties of hash codes. Moreover, DU3H utilizes both highly and normally confident image pairs to jointly construct a hybrid-similarity matrix, which can magnify the impacts of different pairs to better preserve the semantic relations between images. Extensive experiments conducted on three widely used benchmarks validate the superiority of DU3H.
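
The independence and balance properties mentioned above can be checked on a small example. The following is a minimal sketch, assuming the classical Sylvester construction of a Hadamard matrix; the function names are hypothetical, and the abstract does not say how DU3H builds or assigns its codes, so this illustrates only the properties of the codes themselves.

```python
def sylvester_hadamard(n):
    """Return an n x n Hadamard matrix with entries +1/-1 (n must be a power of two)."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    h = [[1]]
    while len(h) < n:
        # Sylvester doubling step: [[H, H], [H, -H]]
        h = [row + row for row in h] + [row + [-x for x in row] for row in h]
    return h


def dot(u, v):
    """Inner product of two equal-length code words."""
    return sum(a * b for a, b in zip(u, v))


H = sylvester_hadamard(8)

# Assumption for illustration: use every row except the all-ones first row
# as a candidate target code, since the first row is not balanced.
candidate_codes = H[1:]

# Independence: distinct rows are mutually orthogonal.
orthogonal = all(dot(H[i], H[j]) == 0
                 for i in range(8) for j in range(8) if i != j)

# Balance: each candidate code has as many +1 bits as -1 bits.
balanced = all(sum(code) == 0 for code in candidate_codes)
```

Orthogonal rows correspond to uncorrelated (independent) bits across codes, and a zero row sum corresponds to an equal number of +1 and -1 bits, which is the balance property.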

### Incomplete Cross-modal Retrieval with Dual-Aligned Variational Autoencoders

• Mengmeng Jing
• Jingjing Li
• Lei Zhu
• Ke Lu
• Yang Yang
• Zi Huang

Learning the relationship between multi-modal data, e.g., texts, images and videos, is a classic task in the multimedia community. Cross-modal retrieval (CMR) is a typical example where the query and the corresponding results are in different modalities. Yet, a majority of existing works investigate CMR with an ideal assumption that the training samples in every modality are sufficient and complete. In real-world applications, however, this assumption does not always hold. Mismatch is common in multi-modal datasets. There is a high chance that samples in some modalities are either missing or corrupted. As a result, incomplete CMR has become a challenging issue. In this paper, we propose Dual-Aligned Variational Autoencoders (DAVAE) to address the incomplete CMR problem. Specifically, we propose to learn modality-invariant representations for different modalities and use the learned representations for retrieval. We train multiple autoencoders, one for each modality, to learn the latent factors among different modalities. These latent representations are further dual-aligned at the distribution level and the semantic level to alleviate the modality gaps and enhance the discriminability of representations. For missing instances, we leverage generative models to synthesize latent representations for them. Notably, we test our method with different ratios of random incompleteness. Extensive experiments on three datasets verify that our method can consistently outperform state-of-the-art methods.

### MRS-Net: Multi-Scale Recurrent Scalable Network for Face Quality Enhancement of Compressed Videos

• Tie Liu
• Mai Xu
• Shengxi Li
• Rui Ding
• Huaida Liu

The past decade has witnessed the explosive growth of faces in video multimedia systems, e.g., videoconferencing and live shows. However, these videos are normally compressed at low bit-rates because of limited bandwidth, leading to heavy quality degradation on face regions. This paper addresses the problem of face quality enhancement in compressed videos. Specifically, we establish a compressed face video (CFV) database, which includes 87,607 faces in 113 raw video sequences and their corresponding 904 compressed sequences. We find that the faces of compressed videos exhibit tremendous scale variation and quality fluctuation. Motivated by scalable video coding, we propose a multi-scale recurrent scalable network (MRS-Net) to enhance the quality of multi-scale faces in compressed videos. The MRS-Net comprises one base and two refined enhancement levels, corresponding to the quality enhancement of small-, medium- and large-scale faces, respectively. In the multi-level architecture of our MRS-Net, small-/medium-scale face quality enhancement serves as the basis for facilitating the quality enhancement of medium-/large-scale faces. Finally, experimental results show that our MRS-Net method is effective in enhancing the quality of multi-scale faces for compressed videos, significantly outperforming other state-of-the-art methods.

### Panoptic Image Annotation with a Collaborative Assistant

• Jasper R.R. Uijlings
• Mykhaylo Andriluka
• Vittorio Ferrari

This paper aims to reduce the time to annotate images for panoptic segmentation, which requires annotating segmentation masks and class labels for all object instances and stuff regions. We formulate our approach as a collaborative process between an annotator and an automated assistant who take turns to jointly annotate an image using a predefined pool of segments. Actions performed by the annotator serve as a strong contextual signal. The assistant intelligently reacts to this signal by annotating other parts of the image on its own, which reduces the amount of work required by the annotator. We perform thorough experiments on the COCO panoptic dataset, both in simulation and with human annotators. These demonstrate that our approach is significantly faster than the recent machine-assisted interface of [Andriluka 18 ACMMM], and $2.4\times$ to $5\times$ faster than manual polygon drawing. Finally, we show on ADE20k that our method can be used to efficiently annotate new datasets, bootstrapping from a very small amount of annotated data.

### Blind Natural Video Quality Prediction via Statistical Temporal Features and Deep Spatial Features

• Jari Korhonen
• Yicheng Su
• Junyong You

Due to the wide range of different natural temporal and spatial distortions appearing in user generated video content, blind assessment of natural video quality is a challenging research problem. In this study, we combine the hand-crafted statistical temporal features used in a state-of-the-art video quality model with spatial features obtained via transfer learning from a convolutional neural network trained for image quality assessment. Experimental results on two recently published natural video quality databases show that the proposed model can predict subjective video quality more accurately than the publicly available video quality models representing the state-of-the-art. The proposed model is also competitive in terms of computational complexity.

## SESSION: Session H2: Multimedia HCI, Multimedia Scalability and Management, & Multimedia Search and Recommendation

### Aesthetic-Aware Image Style Transfer

• Zhiyuan Hu
• Jia Jia
• Bei Liu
• Yaohua Bu
• Jianlong Fu

Style transfer aims to synthesize an image which inherits the content of one image while preserving a similar style of the other one. The "style" of an image usually refers to its unique feeling conveyed from visual features, which is highly related to the aesthetic effect of the image. Aesthetic effect can be mainly decomposed into two factors: colour and texture. Previous methods like Neural Style Transfer and Colour Transfer have shown strong abilities in transferring colour and texture features. However, such approaches neglect to further disentangle colour and texture, which makes some unique aesthetic effects designed by human artists hard to express. In this paper, we propose a novel problem called the Aesthetic-Aware Image Style Transfer task, which aims to transfer colour and texture separately and independently to manipulate the aesthetic effect of an image. We propose a novel Aesthetic-Aware Model-Optimisation-Based Style Transfer (AAMOBST) model to solve this problem. Specifically, AAMOBST is a multi-reference, two-path model. It uses different reference images to decide desired colour and texture features. It can segregate colour and texture into two distinct paths and transfer them independently. Qualitative and quantitative experiments show that our model can decide colour and texture features separately and is able to keep one of them fixed while changing the other one, which previous methods cannot do. Furthermore, on tasks that previous methods can handle (such as style transfer, colour-preserved transfer and colour-only transfer), our model shows comparable abilities with other baseline methods.

### Building Movie Map - A Tool for Exploring Areas in a City - and its Evaluations

• Naoki Sugimoto
• Yoshihito Ebine
• Kiyoharu Aizawa

We propose a new Movie Map, which will enable users to explore a given city area using omnidirectional videos. Only one Movie Map prototype was developed in the 1980s; it was developed with analog video technology. Later, Google Street View (GSV) provided interactive panoramas from positions along streets around the world in Google Maps. Despite the wide use of GSV, it provides sparse images of streets, which often confuses users and lowers user satisfaction. Movie Map's use of videos instead of sparse images dramatically improves the user experience. Thus, we improve the Movie Map using state-of-the-art technology. We propose a new Movie Map system, with an interface for exploring cities. The system consists of four stages: acquisition, analysis, management, and interaction. In the acquisition stage, omnidirectional videos are taken along streets in target areas. Frames of the video are localized on the map, intersections are detected, and videos are segmented. Turning views at intersections are subsequently generated. By connecting the video segments following the specified movement in an area, we can view the streets better. The interface allows for easy exploration of a target area, and it can show virtual billboards of stores in the view. We conducted user studies to compare our system to GSV in a scenario where users could freely move and explore to find a landmark. The experiment showed that our system provided a better user experience than GSV.

### A Probabilistic Graphical Model for Analyzing the Subjective Visual Quality Assessment Data from Crowdsourcing

• Jing Li
• Suiyi Ling
• Junle Wang
• Patrick Le Callet

The swift development of multimedia technology has dramatically raised users' expectations of the quality of experience. To obtain the ground-truth perceptual quality for model training, subjective assessment is necessary. Crowdsourcing platforms provide a convenient and feasible way to run large-scale experiments. However, the obtained perceptual quality labels are generally noisy. In this paper, we propose a probabilistic graphical annotation model to infer the underlying ground truth and discover the annotator's behavior. In the proposed model, the ground truth quality label is considered to follow a categorical distribution rather than being a unique number, i.e., different reliable opinions on the perceptual quality are allowed. In addition, different annotators' behaviors in crowdsourcing are modeled, which allows us to identify the possibility that an annotator produces noisy labels during the test. The proposed model has been tested on both simulated data and real-world data, where it always shows superior performance to the other state-of-the-art models in terms of accuracy and robustness.

### DroidCloud: Scalable High Density Android™ Cloud Rendering

• Linsheng Li
• Bin Yang
• Cathy Bao
• Shuo Liu
• Randy Xu
• Yong Yao
• Jerry W. Hu
• Shoumeng Yan
• Zhengwei Qi

Cloud rendering is an emerging technology in which rendering-heavy applications run on the cloud server and then stream the rendered contents to the end-user device. High density and high scalability of the cloud rendering services are crucial to support millions of users concurrently and cost-effectively. However, it is still challenging to run Android OS in the cloud smoothly with high density and high scalability without compromising user experience. This paper presents DroidCloud, to the best of our knowledge the first open-source Android cloud rendering solution focusing on scalable design and density optimization. (Android is a trademark of Google LLC.) To cloudify Android OS, DroidCloud utilizes the vHAL technology in order to support remote devices while remaining transparent to Android applications. A flexible rendering scheduling policy is introduced to break the boundary of GPU physical locations. Thus, both remote GPUs and local GPUs can accommodate render tasks by forwarding rendering tasks, making it possible to support multiple Android OSes with GPU acceleration. Besides, to further improve the density, DroidCloud optimizes the resource cost both in a single instance and across instances. We show that DroidCloud can run hundreds of Android OSes on a single Intel Xeon server with GPU acceleration simultaneously, increasing the density by an order of magnitude compared to current cloud gaming systems. Further experimental results demonstrate that DroidCloud can transparently run Android applications at native speed with lower CPU, memory, and storage utilization.

### Interpretable Embedding for Ad-Hoc Video Search

• Jiaxin Wu
• Chong-Wah Ngo

Answering queries with semantic concepts has long been the mainstream approach for video search. Only recently has its performance been surpassed by the concept-free approach, which embeds queries in the same joint space as videos. Nevertheless, the embedded features as well as search results are not interpretable, hindering subsequent steps in video browsing and query reformulation. This paper integrates feature embedding and concept interpretation into a neural network for unified dual-task learning. In this way, an embedding is associated with a list of semantic concepts as an interpretation of video content. This paper empirically demonstrates that, by using either the embedding features or concepts, considerable search improvement is attainable on TRECVid benchmarked datasets. Concepts are not only effective in pruning false positive videos, but also highly complementary to concept-free search, leading to a large margin of improvement compared to state-of-the-art approaches.

### Joint Attribute Manipulation and Modality Alignment Learning for Composing Text and Image to Image Retrieval

• Feifei Zhang
• Mingliang Xu
• Qirong Mao
• Changsheng Xu

Cross-modal retrieval has attracted much attention in recent years due to its wide applications. Conventional approaches usually take one modality as the query to retrieve relevant data of another modality. In this paper, we focus on an emerging task in cross-modal retrieval, Composing Text and Image to Image Retrieval (CTI-IR), which aims at retrieving images relevant to a query image with text describing desired modifications to the query image. Compared with conventional cross-modal retrieval, the new task is particularly useful when the query image does not perfectly match the user's expectations. Generally, CTI-IR involves two underlying problems: how to manipulate visual features of the query image as specified by the text, and how to model the modality gap between the query and target. Most previous methods focus on solving the second problem. In this paper, we aim to deal with both problems simultaneously in a unified model. Specifically, the proposed method is based on a graph attention network and an adversarial learning network, which offers several merits. First, the query image and the modification text are constructed in a relation graph for learning text-adaptive representations. Second, semantic contents from the text are injected into the visual features through graph attention. Third, an adversarial loss is incorporated into the conventional cross-modal retrieval loss to learn more discriminative modality-invariant representations for CTI-IR. Extensive experiments on three benchmark datasets demonstrate that the proposed method performs favorably against state-of-the-art methods.

### Semi-supervised Online Multi-Task Metric Learning for Visual Recognition and Retrieval

• Yangxi Li
• Han Hu
• Jin Li
• Yong Luo
• Yonggang Wen

Distance metric learning (DML) is critical in many multimedia application tasks. However, it is hard to learn a satisfactory distance metric given only a few labeled samples for each task. In this paper, we propose a novel semi-supervised online multi-task DML method termed SOMTML, which enables the models describing different tasks to help each other during the metric learning procedure and thus improve their respective performance. Besides, unlabeled data are leveraged to further alleviate the data deficiency issue in different tasks by designing a novel regularization term, which also allows some prior information to be incorporated. More importantly, an efficient algorithm is developed to update the metrics of all tasks adaptively. The proposed SOMTML is experimentally validated in two popular visual analytic-based applications: handwritten digit recognition and face retrieval. We compare the proposed method with competitive single-task and multi-task metric learning approaches. Extensive experimental results demonstrate the effectiveness and efficiency of the proposed SOMTML.

### Supervised Hierarchical Deep Hashing for Cross-Modal Retrieval

• Yu-Wei Zhan
• Xin Luo
• Yongxin Wang
• Xin-Shun Xu

Cross-modal hashing has attracted much attention in the large-scale multimedia search area. In many real applications, labels of samples have hierarchical structure which also contains much useful information for learning. However, most existing methods are originally designed for non-hierarchical labeled data and thus fail to exploit the rich information of the label hierarchy. In this paper, we propose an effective cross-modal hashing method, named Supervised Hierarchical Deep Cross-modal Hashing, SHDCH for short, to learn hash codes by explicitly delving into the hierarchical labels. Specifically, both the similarity at each layer of the label hierarchy and the relatedness across different layers are implanted into the hash-code learning. Besides, an iterative optimization algorithm is proposed to directly learn the discrete hash codes instead of relaxing the binary constraints. We conducted extensive experiments on two real-world datasets and the experimental results show the superior performance of SHDCH over several state-of-the-art methods.

### Multi-graph Convolutional Network for Unsupervised 3D Shape Retrieval

• Weizhi Nie
• Yue Zhao
• An-An Liu
• Zan Gao
• Yuting Su

3D shape retrieval has attracted much research attention due to its wide applications in the fields of computer vision and multimedia. Various approaches have been proposed in recent years for learning 3D shape descriptors from different modalities. The existing works have the following disadvantages: 1) the vast majority of methods rely on large-scale training data with clear category information; 2) many approaches focus on the fusion of multi-modal information but ignore the guidance of correlations among different modalities for shape representation learning; 3) many methods pay attention to the structural feature learning of 3D shape but ignore the guidance of structural similarity between every two shapes. To solve these problems, we propose a novel multi-graph network (MGN) for unsupervised 3D shape retrieval, which utilizes the correlations among modalities and structural similarity between two models to guide the shape representation learning process without category information. More specifically, we propose two novel loss functions: auto-correlation loss and cross-correlation loss. The auto-correlation loss utilizes information from different modalities to increase the discrimination of the shape descriptor. The cross-correlation loss utilizes the structural similarity between two models to strengthen the intra-class similarity and increase the inter-class distinction. Finally, an effective similarity measurement is designed for the shape retrieval task. To validate the effectiveness of our proposed method, we conduct experiments on the ModelNet dataset. Experimental results demonstrate the effectiveness of our proposed method, and significant improvements have been achieved compared with state-of-the-art methods.

### Bottom-Up Foreground-Aware Feature Fusion for Person Search

• Wenjie Yang
• Dangwei Li
• Xiaotang Chen
• Kaiqi Huang

The key to efficient person search is jointly localizing pedestrians and learning discriminative representation for person re-identification (re-ID). Some recently developed task-joint models are built with separate detection and re-ID branches on top of shared region feature extraction networks, where the large receptive field of neurons leads to background information redundancy for the following re-ID task. Our diagnostic analysis indicates that the task-joint model suffers from a considerable performance drop when the background is replaced or removed. In this work, we propose a subnet to fuse the bounding box features pooled from multiple ConvNet stages in a bottom-up manner, termed the bottom-up fusion (BUF) network. With few parameters introduced, BUF leverages the multi-level features with different sizes of receptive fields to mitigate the background-bias problem. Moreover, the newly introduced segmentation head generates a foreground probability map as guidance for the network to focus on the foreground regions. The resulting foreground attention module (FAM) enhances the foreground features. Extensive experiments on PRW and CUHK-SYSU validate the effectiveness of the proposals. Our Bottom-Up Foreground-Aware Feature Fusion (BUFF) network achieves considerable gains over the state-of-the-art on PRW and competitive performance on CUHK-SYSU.

### Rethinking Generative Zero-Shot Learning: An Ensemble Learning Perspective for Recognising Visual Patches

• Zhi Chen
• Sen Wang
• Jingjing Li
• Zi Huang

Zero-shot learning (ZSL) is commonly used to address the very pervasive problem of predicting unseen classes in fine-grained image classification and other tasks. One family of solutions is to learn synthesised unseen visual samples produced by generative models from auxiliary semantic information, such as natural language descriptions. However, for most of these models, performance suffers from noise in the form of irrelevant image backgrounds. Further, most methods do not allocate a calculated weight to each semantic patch. Yet, in the real world, the discriminative power of features can be quantified and directly leveraged to improve accuracy and reduce computational complexity. To address these issues, we propose a novel framework called multi-patch generative adversarial nets (MPGAN) that synthesises local patch features and labels unseen classes with a novel weighted voting strategy. The process begins by generating discriminative visual features from noisy text descriptions for a set of predefined local patches using multiple specialist generative models. The features synthesised from each patch for unseen classes are then used to construct an ensemble of diverse supervised classifiers, each corresponding to one local patch. A voting strategy averages the probability distributions output from the classifiers and, given that some patches are more discriminative than others, a discrimination-based attention mechanism helps to weight each patch accordingly. Extensive experiments show that MPGAN has significantly greater accuracy than state-of-the-art methods.
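
The weighted voting step described above can be sketched as follows. This is a minimal illustration, assuming each per-patch classifier outputs a probability distribution over the same classes and that a per-patch weight is already available; the function name and the numbers are hypothetical, and the abstract does not specify how MPGAN's attention mechanism computes the weights.

```python
def weighted_vote(patch_probs, patch_weights):
    """Fuse per-patch class distributions with a weighted average.

    patch_probs:   one probability distribution per local patch
    patch_weights: one non-negative weight per patch (normalised here)
    """
    total = sum(patch_weights)
    fused = [0.0] * len(patch_probs[0])
    for probs, weight in zip(patch_probs, patch_weights):
        for cls, p in enumerate(probs):
            fused[cls] += (weight / total) * p
    return fused


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


# Hypothetical outputs of three patch classifiers over three classes.
patch_probs = [
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.1, 0.2, 0.7],
]

# Plain averaging treats every patch as equally discriminative.
uniform = weighted_vote(patch_probs, [1.0, 1.0, 1.0])

# Hypothetical weights that trust the first patch more than the others.
weighted = weighted_vote(patch_probs, [0.5, 0.25, 0.25])

uniform_prediction = argmax(uniform)
weighted_prediction = argmax(weighted)
```

With these made-up numbers, plain averaging and weighted averaging select different classes, which shows why per-patch weighting can matter when some patches are more discriminative than others.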

### Surpassing Real-World Source Training Data: Random 3D Characters for Generalizable Person Re-Identification

• Yanan Wang
• Shengcai Liao
• Ling Shao

Person re-identification has seen significant advancement in recent years. However, the ability of learned models to generalize to unknown target domains still remains limited. One possible reason for this is the lack of large-scale and diverse source training data, since manually labeling such a dataset is very expensive and privacy sensitive. To address this, we propose to automatically synthesize a large-scale person re-identification dataset following a set-up similar to real surveillance but with virtual environments, and then use the synthesized person images to train a generalizable person re-identification model. Specifically, we design a method to generate a large number of random UV texture maps and use them to create different 3D clothing models. Then, an automatic code is developed to randomly generate various different 3D characters with diverse clothes, races and attributes. Next, we simulate a number of different virtual environments using Unity3D, with customized camera networks similar to real surveillance systems, and import multiple 3D characters at the same time, with various movements and interactions along different paths through the camera networks. As a result, we obtain a virtual dataset, called RandPerson, with 1,801,816 person images of 8,000 identities. By training person re-identification models on these synthesized person images, we demonstrate, for the first time, that models trained on virtual data can generalize well to unseen target images, surpassing the models trained on various real-world datasets, including CUHK03, Market-1501, DukeMTMC-reID, and almost MSMT17. The RandPerson dataset is available at https://github.com/VideoObjectSearch/RandPerson.

### Zero-Shot Multi-View Indoor Localization via Graph Location Networks

• Meng-Jiun Chiou
• Zhenguang Liu
• Yifang Yin
• An-An Liu
• Roger Zimmermann

Indoor localization is a fundamental problem in location-based applications. Current approaches to this problem typically rely on Radio Frequency technology, which requires not only supporting infrastructure but also human effort to measure and calibrate the signal. Moreover, existing methods require data collection for all locations, which in turn hinders their large-scale deployment. In this paper, we propose a novel neural-network-based architecture, Graph Location Networks (GLN), to perform infrastructure-free, multi-view image-based indoor localization. GLN makes location predictions based on robust location representations extracted from images through message-passing networks. Furthermore, we introduce a novel zero-shot indoor localization setting and tackle it by extending the proposed GLN to a dedicated zero-shot version, which exploits a novel mechanism, Map2Vec, to train location-aware embeddings and make predictions on novel unseen locations. Our extensive experiments show that the proposed approach outperforms state-of-the-art methods in the standard setting, and achieves promising accuracy even in the zero-shot setting where data for half of the locations are not available. The source code and datasets are publicly available.

### Hierarchical Gumbel Attention Network for Text-based Person Search

• Kecheng Zheng
• Wu Liu
• Jiawei Liu
• Zheng-Jun Zha
• Tao Mei

Text-based person search aims to retrieve the pedestrian images that best match a given textual description from gallery images. Previous methods use a soft-attention mechanism to infer the semantic alignments between image regions and the corresponding words in the sentence. However, these methods may fuse irrelevant multi-modal features together, which causes a matching redundancy problem. In this work, we propose a novel hierarchical Gumbel attention network for text-based person search via a Gumbel top-k re-parameterization algorithm. Specifically, it adaptively selects the strongly semantically relevant image regions and words/phrases from images and texts for precise alignment and similarity calculation. This hard selection strategy fuses the strongly relevant multi-modal features to alleviate the problem of matching redundancy. Meanwhile, a Gumbel top-k re-parameterization algorithm is designed as a low-variance, unbiased gradient estimator to handle the discreteness of the hard attention mechanism in an end-to-end manner. Moreover, the model employs a hierarchical adaptive matching strategy at three granularities, i.e., word level, phrase level, and sentence level, for fine-grained matching. Extensive experimental results demonstrate state-of-the-art performance. Compared with the best existing method, we achieve relative improvements of 8.24% in Rank-1 and 7.6% in mAP on the text-to-image retrieval task, and 5.58% in Rank-1 and 6.3% in mAP on the image-to-text retrieval task, on the CUHK-PEDES dataset.
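The hard selection step can be illustrated with the basic Gumbel top-k sampling trick: add independent Gumbel noise to each relevance score and keep the k largest perturbed scores. The sketch below is a minimal, standard-library illustration of that sampling step only. It is not the authors' differentiable re-parameterization or their network, and the function name `gumbel_top_k` and the example scores are hypothetical.

```python
import math
import random


def gumbel_top_k(logits, k, rng=None):
    """Pick k distinct indices by perturbing each logit with Gumbel(0, 1) noise.

    Illustrative sketch only: the name and interface are assumptions,
    not taken from the paper.
    """
    rng = rng or random.Random()
    perturbed = []
    for index, logit in enumerate(logits):
        # Clamp the uniform sample away from 0 and 1 so both logs are finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel_noise = -math.log(-math.log(u))
        perturbed.append((logit + gumbel_noise, index))
    # The k largest perturbed scores form a sample without replacement.
    perturbed.sort(reverse=True)
    return [index for _, index in perturbed[:k]]


if __name__ == "__main__":
    # Hypothetical relevance scores for five image regions.
    region_scores = [2.0, 0.1, 1.5, -0.3, 0.8]
    print(gumbel_top_k(region_scores, k=2, rng=random.Random(0)))
```

Regions with higher scores are selected more often, but the noise keeps the choice stochastic, which is what makes a relaxed version of this step usable for gradient estimation.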

## SESSION: Poster Session A3: Multimedia Search and Recommendation & Multimedia System and Middleware

### Dual Context-Aware Refinement Network for Person Search

• Jiawei Liu
• Zheng-Jun Zha
• Richang Hong
• Meng Wang
• Yongdong Zhang

Person search has recently gained increasing attention as the novel task of localizing and identifying a target pedestrian from a gallery of non-cropped scene images. Its performance depends on accurate person detection and re-identification simultaneously by learning effective representations. In this work, we propose a novel dual context-aware refinement network (DCRNet) for person search, which jointly explores two kinds of contexts including intra-instance context and inter-instance context to learn discriminative representation. Specifically, an intra-instance context module is designed to refine the representation for the bounding box of a pedestrian by leveraging its surrounding regions covering the same pedestrian and its accessories, which contain abundant complementary visual appearance of pedestrians. Moreover, an inter-instance context module is proposed to expand the instance-level feature for the bounding box of a pedestrian, by utilizing the rich scene contexts of neighboring co-travelers across images. These two modules are built on top of a joint detection and feature learning framework, i.e., Faster R-CNN. Extensive experimental results on two challenging datasets have demonstrated the effectiveness of DCRNet with significant performance improvements over state-of-the-art methods.

### Heterogeneous Fusion of Semantic and Collaborative Information for Visually-Aware Food Recommendation

• Lei Meng
• Fuli Feng
• Xiangnan He
• Xiaoyan Gao
• Tat-Seng Chua

Visually-aware food recommendation recommends food items based on their visual features. Existing methods typically use the pre-extracted visual features from food classification models, which mainly encode the visual content with limited semantic information, such as the classes and ingredients. Therefore, such features may not cover the personalized visual preferences of users, termed collaborative information, e.g. users may attend to different colors and textures of food based on their preferred ingredients and cooking methods. To address this problem, this paper presents a heterogeneous multi-task learning framework, termed privileged-channel infused network (PiNet). It learns the visual features that contain both the semantic and collaborative information by training the image encoder to simultaneously fulfill the ingredient prediction and food recommendation tasks. However, the heterogeneity between the two tasks may lead to different visual information in need and different directions in model parameter optimization. To handle these challenges, PiNet first employs a dual-gating module (DGM) to enable the encoding and passing of different visual information from the image encoder to individual tasks. Secondly, PiNet adopts a two-phase training strategy and two prior knowledge incorporation methods to ensure an effective model training. Experimental results from two real-world datasets show that the visual features generated by PiNet better attend to the informative image regions, yielding superior performance.

### How to Learn Item Representation for Cold-Start Multimedia Recommendation?

• Xiaoyu Du
• Xiang Wang
• Xiangnan He
• Zechao Li
• Jinhui Tang
• Tat-Seng Chua

The ability to recommend cold items (those with no behavior history) is a core strength of multimedia recommendation compared with behavior-only collaborative filtering. To learn effective item representation, a key challenge lies in the discrepancy between training and testing, since the cold items only exist in the testing data. This means that the signal used to represent an item varies between training and testing: in the training stage, we can represent an item with both collaborative embedding and content embedding, whereas in the testing stage, we represent a cold item with content embedding only. Nevertheless, existing learning frameworks omit this critical discrepancy, resulting in suboptimal item representation for multimedia recommendation.

In this work, we pay special attention to cold items in multimedia recommender training. To address the discrepancy, we first represent an item with a dual representation, i.e., two vectors where one follows the traditional way of combining collaborative embedding and content embedding, and the other assumes that the item is cold by replacing the collaborative embedding with a zero vector. We then propose a Multi-Task Pairwise Ranking (MTPR) framework for model training, which enforces that observed interactions rank higher than unobserved ones even if the item is assumed to be cold. As a general learning framework, our MTPR is agnostic to the choice of the collaborative (and/or content) encoder. We demonstrate it on VBPR, a representative multimedia recommendation model based on matrix factorization. Extensive experiments on three datasets of diverse domains validate MTPR, which leads to better representation for both cold and non-cold items in the testing stage, thus improving the overall performance of multimedia recommendation.
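One plausible reading of the dual representation and the pairwise ranking objective is sketched below. It assumes concatenation as the way to combine embeddings, a dot-product scorer, and a sum of BPR-style losses over the four normal/cold pairings of the positive and negative item. These choices, and all function names, are illustrative assumptions rather than details stated in the abstract.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def dual_representation(collab, content):
    """Return (normal, cold) vectors; the cold one zeroes the collaborative part."""
    normal = list(collab) + list(content)
    cold = [0.0] * len(collab) + list(content)
    return normal, cold


def pairwise_loss(pos_score, neg_score):
    """BPR-style loss -log(sigmoid(pos - neg)), computed in a numerically stable way."""
    diff = pos_score - neg_score
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))


def multi_task_ranking_loss(user, pos_item, neg_item):
    """Sum the pairwise loss over every normal/cold pairing (an assumed design)."""
    pos_views = dual_representation(*pos_item)
    neg_views = dual_representation(*neg_item)
    return sum(
        pairwise_loss(dot(user, pos), dot(user, neg))
        for pos in pos_views
        for neg in neg_views
    )


if __name__ == "__main__":
    user = [1.0, 1.0]            # hypothetical user embedding
    observed = ([1.0], [1.0])    # (collaborative, content) of an interacted item
    unobserved = ([0.0], [0.0])  # (collaborative, content) of a sampled negative
    print(multi_task_ranking_loss(user, observed, unobserved))
```

Because the cold view is trained alongside the normal view, the content part of the vector has to carry enough signal to rank items on its own, which matches the test-time situation for cold items.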

### Personalized Item Recommendation for Second-hand Trading Platform

• Xuzheng Yu
• Tian Gan
• Yinwei Wei
• Zhiyong Cheng
• Liqiang Nie

With rising awareness of environmental protection and recycling, second-hand trading platforms have attracted increasing attention in recent years. The interaction data on second-hand trading platforms, consisting of sufficient interactions per user but rare interactions per item, differs from that on traditional platforms. Therefore, building successful recommendation systems for second-hand trading platforms requires balancing the modeling of items' and users' preferences and mitigating the adverse effects of sparsity, which makes recommendation especially challenging. Accordingly, we propose a method to simultaneously learn representations of items and users from coarse-grained and fine-grained features, and design a multi-task learning strategy to address the issue of data sparsity. Experiments conducted on a real-world second-hand trading platform dataset demonstrate the effectiveness of our proposed model.

### What Aspect Do You Like: Multi-scale Time-aware User Interest Modeling for Micro-video Recommendation

• Hao Jiang
• Wenjie Wang
• Yinwei Wei
• Zan Gao
• Yinglong Wang
• Liqiang Nie

Online micro-video recommender systems aim to address the information explosion of micro-videos and make personalized recommendations for users. However, existing methods still have limitations in learning representative user interests, since multi-scale time effects, user interest group modeling, and false-positive interactions are not taken into consideration. In view of this, we propose an end-to-end Multi-scale Time-aware user Interest modeling Network (MTIN). In particular, we first present an interest group routing algorithm to generate fine-grained user interest groups based on the user's interaction sequence. Afterwards, to explore multi-scale time effects on user interests, we design a time-aware mask network and distill multiple kinds of temporal information with several parallel temporal masks. Then, an interest mask network is introduced to aggregate fine-grained interest groups and generate the final user interest representation. Finally, in the prediction unit, the user representation and micro-video candidates are fed into a deep neural network (DNN) for predictions. To demonstrate the effectiveness of our method, we conduct experiments on two publicly available datasets, and the experimental results demonstrate that our proposed model achieves substantial gains over the state-of-the-art methods.

### Domain-Specific Alignment Network for Multi-Domain Image-Based 3D Object Retrieval

• Yuting Su
• Yuqian Li
• Dan Song
• Zhendong Mao
• Xuanya Li
• An-An Liu

2D image-based 3D object retrieval is a very important task in computer vision and big data management. Conventional image-based 3D object retrieval usually assumes that the images are from one single domain. However, for real applications, 2D images may be from multiple domains (e.g., real image, sketch, and quick draw). It raises significant challenges for this task since these 2D images have a great domain gap with each other as well as a great modality gap with 3D objects. To address these issues, we propose an unsupervised Domain-Specific Alignment Network (DSAN) for multi-domain image-based 3D object retrieval. The proposed method aims to reduce domain discrepancy by domain-specific alignment network with multi-level moment matching, including first-order moment and second-order moment. Based on the observation that for any given sample, different domain classifiers should output the same label, we design a domain-specific classifier alignment module. To our knowledge, the proposed method is the first unsupervised work to align multiple-domain 2D images with 3D objects in an end-to-end manner. The multi-domain dataset MDI3D is utilized to advocate the research on this task, and the extensive experimental results demonstrate the superiority of the proposed method.
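Moment matching between two domains can be illustrated with a small discrepancy measure: the squared distance between feature means (first-order moment) and the squared Frobenius distance between feature covariances (second-order moment). The abstract does not give the exact loss used by DSAN, so the sketch below is a generic illustration under that assumption, with hypothetical function names and toy features.

```python
def mean_vector(features):
    count = len(features)
    dim = len(features[0])
    return [sum(row[j] for row in features) / count for j in range(dim)]


def covariance_matrix(features):
    count = len(features)
    mu = mean_vector(features)
    dim = len(mu)
    return [
        [
            sum((row[i] - mu[i]) * (row[j] - mu[j]) for row in features) / count
            for j in range(dim)
        ]
        for i in range(dim)
    ]


def moment_discrepancy(source, target):
    """Return (first_order, second_order) gaps between two feature sets."""
    mu_s, mu_t = mean_vector(source), mean_vector(target)
    first_order = sum((a - b) ** 2 for a, b in zip(mu_s, mu_t))
    cov_s, cov_t = covariance_matrix(source), covariance_matrix(target)
    dim = len(mu_s)
    second_order = sum(
        (cov_s[i][j] - cov_t[i][j]) ** 2 for i in range(dim) for j in range(dim)
    )
    return first_order, second_order


if __name__ == "__main__":
    # Hypothetical 2-D features from two image domains.
    sketch_features = [[0.0, 0.0], [2.0, 2.0]]
    photo_features = [[1.0, 1.0], [3.0, 3.0]]
    print(moment_discrepancy(sketch_features, photo_features))
```

In this toy case the two sets differ only by a shift, so the first-order gap is positive while the second-order gap is zero. Minimizing both terms pulls the domains together in location and in spread.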

### Multi-modal Attentive Graph Pooling Model for Community Question Answer Matching

• Jun Hu
• Quan Fang
• Shengsheng Qian
• Changsheng Xu

Nowadays, millions of users use community question answering (CQA) systems to share valuable knowledge. An essential function of CQA systems is the accurate matching of answers w.r.t a given question. Recent research exhibits the superior advantages of graph neural networks (GNNs) on modeling content semantics for CQA matching. However, existing GNN-based approaches are insufficient to deal with the multi-modal and redundant properties of CQA systems. In this paper, we propose a multi-modal attentive graph pooling approach (MMAGP) to model the multi-modal content of questions and answers with GNNs in a unified framework, which explores the multi-modal and redundant properties of CQA systems. Our model converts each question/answer into a multi-modal content graph, which can preserve the relational information within multi-modal content. Specifically, to exploit the visual information, we propose an unsupervised meta-path link prediction approach to extract labels from visual content and model them into the multi-modal graph. An attentive graph pooling network is proposed to select vertices in the multi-modal content graph that are significant for the matching adaptively, and generate a pooled graph via aggregating context information for selected vertices. An interaction pooling network is designed to infer the final matching score based on the interactions between the pooled graphs of the input question and answer. Experimental results on two real-world datasets demonstrate the superior performance of MMAGP compared with other state-of-the-art CQA matching models.

### Task-distribution-aware Meta-learning for Cold-start CTR Prediction

• Tianwei Cao
• Qianqian Xu
• Zhiyong Yang
• Qingming Huang

### CFVMNet: A Multi-branch Network for Vehicle Re-identification Based on Common Field of View

• Ziruo Sun
• Xiushan Nie
• Xiaoming Xi
• Yilong Yin

Vehicle re-identification (re-ID) aims to retrieve the image of the same vehicles across multiple cameras. It has attracted wide attention in the field of computer vision owing to the deployment of surveillance system. However, some unfavorable factors restrict the retrieval accuracy of re-ID; minor inter-class difference and orientation variation are two main issues. In this study, we proposed a multi-branch network based on common field of view (CFVMNet) to address these issues. In the proposed method, we extracted and fused the global and local detail features using four branches and the Batch DropBlock (BDB) strategy to accentuate inter-class difference. We also considered some other attributes (i.e., color, type, and model) in the feature extraction process to make the final features more recognizable. For the issue of orientation variation that could lead to large intra-class difference, we learned two different metrics according to whether there is common field of view of two vehicle images, respectively, which can enable the proposed CFVMNet to focus on different regions. Extensive experiments on two public datasets, VeRi-776 and VehicleID, show that the proposed method outperformed the state-of-the-art approaches to vehicle re-ID.

### Exploiting Heterogeneous Artist and Listener Preference Graph for Music Genre Classification

• Chunyuan Yuan
• Qianwen Ma
• Junyang Chen
• Wei Zhou
• Xiaodan Zhang
• Xuehai Tang
• Jizhong Han
• Songlin Hu

Music genres are useful for indexing, organizing, searching, and recommending songs and albums. Therefore, the automatic classification of music genres is an essential part of almost all kinds of music applications. Recent works focus on exploiting text, audio, or multi-modal information for genre classification, without considering the influence of the artists' and listeners' preference. However, intuitively, artists have their composing preferences, and listeners also have their music tastes. Both of them provide helpful hints to the music genre from different views, which are crucial to improve classification performance.

In this paper, we make use of both artist-music and listener-music preference relations to construct a heterogeneous preference graph. Then, we propose a novel graph-based neural network to automatically encode the global preference relations of the heterogeneous graph into artist and listener representations. We construct a graph to capture the correlations among genres and apply a graph convolutional network to learn genre representation from the correlation graph. Finally, we combine artist, listener, and genre representations for multi-label genre classification. Experimental results show that our model significantly outperforms the state-of-the-art methods on two public music genre classification datasets.

### Graph-Refined Convolutional Network for Multimedia Recommendation with Implicit Feedback

• Yinwei Wei
• Xiang Wang
• Liqiang Nie
• Xiangnan He
• Tat-Seng Chua

Reorganizing implicit feedback of users as a user-item interaction graph facilitates the applications of graph convolutional networks (GCNs) in recommendation tasks. In the interaction graph, edges between user and item nodes function as the main element of GCNs to perform information propagation and generate informative representations. Nevertheless, an underlying challenge lies in the quality of interaction graph, since observed interactions with less-interested items occur in implicit feedback (say, a user views micro-videos accidentally). This means that the neighborhoods involved with such false-positive edges will be influenced negatively and the signal on user preference can be severely contaminated. However, existing GCN-based recommender models leave such challenge under-explored, resulting in suboptimal representations and performance.

In this work, we focus on adaptively refining the structure of the interaction graph to discover and prune potential false-positive edges. To this end, we devise a new GCN-based recommender model, Graph-Refined Convolutional Network (GRCN), which adjusts the structure of the interaction graph adaptively based on the status of model training, instead of keeping the structure fixed. In particular, a graph refining layer is designed to identify noisy edges that have a high confidence of being false-positive interactions, and consequently prune them in a soft manner. We then apply a graph convolutional layer on the refined graph to distill informative signals on user preference. Through extensive experiments on three datasets for micro-video recommendation, we validate the rationality and effectiveness of our GRCN. Further in-depth analysis shows how the refined graph benefits the GCN-based recommender model.

### Visually Precise Query

• Riddhiman Dasgupta
• Francis Tom
• Sudhir Kumar
• Mithun Das Gupta
• Yokesh Kumar
• Vinay P. Namboodiri

We present the problem of Visually Precise Query (VPQ) generation, which enables a more intuitive match between a user's information need and an e-commerce site's product description. Given an image of a fashion item, what is the optimal search query that will retrieve the exact same or closely related product(s) with high probability? In this paper we introduce the task of VPQ generation, which takes a product image and its title as input and provides a word-level extractive summary of the title, containing a list of salient attributes, which can then be used as a query to search for similar products. We collect a large dataset of fashion images and their titles and merge it with an existing research dataset which was created for a different task. Given the image and title pair, the VPQ problem is posed as identifying a non-contiguous collection of spans within the title. We provide a dataset of around 400K image, title and corresponding VPQ entries and release it to the research community. We provide a detailed description of the data collection process and discuss future directions of research for the problem introduced in this work. We provide standard text and visual domain baseline comparisons, as well as multi-modal baseline models, to analyze the task introduced in this work. Finally, we propose a hybrid fusion model which promises to be the direction of research in the multi-modal community.

### All-in-depth via Cross-baseline Light Field Camera

• Dingjian Jin
• Anke Zhang
• Jiamin Wu
• Gaochang Wu
• Haoqian Wang
• Lu Fang

Light-field (LF) cameras hold great promise for passive, general depth estimation thanks to their high angular resolution, yet they suffer from a small baseline for distant regions. While a stereo solution with a large baseline handles distant scenes better, its limited angular resolution becomes a problem for near objects. Aiming for an all-in-depth solution, we propose a cross-baseline LF camera using a commercial LF camera and a monocular camera, which naturally form a 'stereo camera' that provides a compensating baseline for the LF camera. The idea is simple yet non-trivial, due to the significant angular resolution gap and baseline gap between LF and stereo cameras.

Fusing two depth maps from the LF and stereo modules in the spatial domain is unreliable, since it relies on imprecisely predicted depth to distinguish near from distant ranges and to determine the fusion weights. Alternatively, taking a unified representation for both the LF and monocular sub-aperture views in the epipolar plane image (EPI) domain, we show that for each pixel, the minimum variance along different shearing degrees in the EPI domain estimates its depth with the highest fidelity. By minimizing the minimum variance, the depth error is minimized accordingly. The insight is that the minimum variance calculated in the EPI domain has higher fidelity than the depth predicted in the spatial domain. Extensive experiments demonstrate the superiority of our cross-baseline LF camera in providing high-quality all-in-depth maps from 0.2 m to 100 m.
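The minimum-variance idea can be illustrated in one dimension. Each view sees the scene shifted in proportion to its angular offset, so sampling the views along the correct shear lines up the same scene point and the variance across views drops to (near) zero. The sketch below assumes integer disparities, a 1D scanline with wrap-around indexing, and noise-free views. It is a toy illustration with hypothetical names, not the authors' cross-baseline pipeline.

```python
import random


def variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def make_views(signal, disparity, view_offsets):
    """Simulate sub-aperture views: view u is the scanline shifted by disparity * u."""
    n = len(signal)
    return {
        u: [signal[(x - disparity * u) % n] for x in range(n)]
        for u in view_offsets
    }


def estimate_disparity(views, x, candidates):
    """Return the candidate shear (disparity) with minimum variance across views at pixel x."""
    n = len(next(iter(views.values())))
    best, best_var = None, float("inf")
    for d in candidates:
        # Follow the sheared line through the stack of views.
        samples = [view[(x + d * u) % n] for u, view in views.items()]
        var = variance(samples)
        if var < best_var:
            best, best_var = d, var
    return best, best_var


if __name__ == "__main__":
    rng = random.Random(0)
    scanline = [rng.random() for _ in range(64)]  # hypothetical textured scanline
    views = make_views(scanline, disparity=2, view_offsets=[-2, -1, 0, 1, 2])
    print(estimate_disparity(views, x=10, candidates=range(-3, 4)))
```

The value of the minimum variance also acts as a confidence signal: a small minimum means the views agree along that shear, which is the property the abstract relies on in place of fusing depth maps in the spatial domain.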

### Revealing True Identity: Detecting Makeup Attacks in Face-based Biometric Systems

• Mohamed Hussein
• Wael Abd-Almageed
• Mohamed Hefeeda

Face-based authentication systems are among the most commonly used biometric systems, because of the ease of capturing face images at a distance and in a non-intrusive way. These systems are, however, susceptible to various presentation attacks, including printed faces, artificial masks, and makeup attacks. In this paper, we propose a novel solution to address makeup attacks, which are the hardest to detect in such systems because makeup can substantially alter the facial features of a person, including making them appear older/younger by adding/hiding wrinkles, modifying the shape of eyebrows, beard, and moustache, and changing the color of lips and cheeks. In our solution, we design a generative adversarial network for removing the makeup from face images while retaining their essential facial features and then compare the face images before and after removing makeup. We collect a large dataset of various types of makeup, especially malicious makeup that can be used to break into remote unattended security systems. This dataset is quite different from existing makeup datasets that mostly focus on cosmetic aspects. We conduct an extensive experimental study to evaluate our method and compare it against the state of the art using standard objective metrics commonly used in biometric systems as well as subjective metrics collected through a user study. Our results show that the proposed solution produces high accuracy and substantially outperforms the closest works in the literature.

## SESSION: Poster Session B3: Multimedia System and Middleware & Multimedia Telepresence and Virtual/Augmented Reality

### Relevance-Based Compression of Cataract Surgery Videos Using Convolutional Neural Networks

• Negin Ghamsarian
• Christian Timmerer
• Mario Taschwer
• Klaus Schöffmann

Recorded cataract surgery videos play a prominent role in training and investigating the surgery, and enhancing the surgical outcomes. Due to storage limitations in hospitals, however, the recorded cataract surgeries are deleted after a short time and this precious source of information cannot be fully utilized. Lowering the quality to reduce the required storage space is not advisable since the degraded visual quality results in the loss of relevant information that limits the usage of these videos. To address this problem, we propose a relevance-based compression technique consisting of two modules: (i) relevance detection, which uses neural networks for semantic segmentation and classification of the videos to detect relevant spatio-temporal information, and (ii) content-adaptive compression, which restricts the amount of distortion applied to the relevant content while allocating less bitrate to irrelevant content. The proposed relevance-based compression framework is implemented considering five scenarios based on the definition of relevant information from the target audience's perspective. Experimental results demonstrate the capability of the proposed approach in relevance detection. We further show that the proposed approach can achieve high compression efficiency by abstracting substantial redundant information while retaining the high quality of the relevant content.

### A Modular Approach for Synchronized Wireless Multimodal Multisensor Data Acquisition in Highly Dynamic Social Settings

• Chirag Raman
• Stephanie Tan
• Hayley Hung

Existing data acquisition literature for human behavior research provides wired solutions, mainly for controlled laboratory setups. In uncontrolled free-standing conversation settings, where participants are free to walk around, these solutions are unsuitable. While wireless solutions are employed in the broadcasting industry, they can be prohibitively expensive. In this work, we propose a modular and cost-effective wireless approach for synchronized multisensor data acquisition of social human behavior. Our core idea involves a cost-accuracy trade-off by using Network Time Protocol (NTP) as a source reference for all sensors. While commonly used as a reference in ubiquitous computing, NTP is widely considered to be insufficiently accurate as a reference for video applications, where Precision Time Protocol (PTP) or Global Positioning System (GPS) based references are preferred. We argue and show, however, that the latency introduced by using NTP as a source reference is adequate for human behavior research, and the subsequent cost and modularity benefits are a desirable trade-off for applications in this domain. We also describe one instantiation of the approach deployed in a real-world experiment to demonstrate the practicality of our setup in-the-wild.
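For context on what an NTP reference provides, the standard NTP on-wire calculation estimates a client's clock offset and the round-trip delay from four timestamps: client send (t1), server receive (t2), server send (t3), and client receive (t4). The sketch below shows only that textbook calculation. The abstract does not describe how the authors' sensors query or apply NTP, and the sample timestamps are hypothetical.

```python
def ntp_offset_and_delay(t1, t2, t3, t4):
    """Standard NTP estimate of clock offset and round-trip delay.

    t1: client transmit time (client clock)
    t2: server receive time (server clock)
    t3: server transmit time (server clock)
    t4: client receive time (client clock)
    """
    offset = ((t2 - t1) + (t3 - t4)) / 2.0
    delay = (t4 - t1) - (t3 - t2)
    return offset, delay


if __name__ == "__main__":
    # Hypothetical exchange: the client clock runs 0.25 s behind the server,
    # and each network leg takes 0.05 s.
    offset, delay = ntp_offset_and_delay(10.00, 10.30, 10.31, 10.11)
    print(f"offset={offset:.3f} s, delay={delay:.3f} s")
```

The offset estimate is exact only when the two network legs take equal time. Asymmetric wireless paths introduce an error bounded by half the round-trip delay, which is one reason NTP is usually considered coarser than PTP or GPS references.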

### SphericRTC: A System for Content-Adaptive Real-Time 360-Degree Video Communication

• Shuoqian Wang
• Xiaoyang Zhang
• Mengbai Xiao
• Kenneth Chiu
• Yao Liu

We present the SphericRTC system for real-time 360-degree video communication. 360-degree video allows the viewer to observe the environment in any direction from the camera location. This more immersive streaming experience allows users to exchange information more efficiently and can be beneficial in the real-time setting. Our system applies a novel approach to select representations of 360-degree frames to allow efficient, content-adaptive delivery. The system performs joint content and bitrate adaptation in real time by offloading expensive transformation operations to the GPU via CUDA. The system demonstrates that the multiple sub-components (viewport feedback, representation selection, and joint content and bitrate adaptation) can be effectively integrated within a single framework. Compared to a baseline implementation, views in SphericRTC have consistently higher visual quality. The median Viewport-PSNR of such views is 2.25 dB higher than views in the baseline system.

### Single Image Shape-from-Silhouettes

• Yawen Lu
• Yuxing Wang
• Guoyu Lu

Recovering a 3D shape representation from one single image input has been attempted in recent years. Most of the works obtain 3D models from multiple images at different perspectives or ground truth CAD models. However, multiple images from different perspectives or 3D CAD models are not always available in real applications. In this work, we present a novel shape-from-silhouette method based on just a single image, which is an end-to-end learning framework relying on view synthesis and shape-from-silhouette methodology to reconstruct a 3D shape. The reconstructed 3D mesh can approach the real shape of target objects by constraining the silhouettes from both horizontal and vertical directions, especially for those objects with occlusions. Our proposed method achieves state-of-the-art performance on the ShapeNet dataset compared with other recent approaches targeting 3D reconstruction from a single image. Without requiring labor-intensive and time-consuming human annotations, the work has a broad potential to be applied in real-world applications.

### VVSec: Securing Volumetric Video Streaming via Benign Use of Adversarial Perturbation

• Zhongze Tang
• Xianglong Feng
• Yi Xie
• Huy Phan
• Tian Guo
• Bo Yuan
• Sheng Wei

Volumetric video (VV) streaming has drawn an increasing amount of interest recently with the rapid advancements in consumer VR/AR devices and the relevant multimedia and graphics research. While the resource and performance challenges in volumetric video streaming have been actively investigated by the multimedia community, the potential security and privacy concerns with this new type of multimedia have not been studied. We for the first time identify an effective threat model that extracts 3D face models from volumetric videos and compromises face ID-based authentication. To defend against such an attack, we develop a novel volumetric video security mechanism, namely VVSec, which makes benign use of adversarial perturbations to obfuscate the security- and privacy-sensitive 3D face models. Such obfuscation ensures that the 3D models cannot be exploited to bypass deep learning-based face authentication. Meanwhile, the injected perturbations are not perceivable by the end-users, maintaining the original quality of experience in volumetric video streaming. We evaluate VVSec using two datasets, including a set of frames extracted from an empirical volumetric video and a public RGB-D face image dataset. Our evaluation results demonstrate the effectiveness of both the proposed attack and defense mechanisms in volumetric video streaming.

### Bitrate Requirements of Non-Panoramic VR Remote Rendering

• Viktor Kelkkanen
• Markus Fiedler
• David Lindero

This paper shows the impact of bitrate settings on objective quality measures when streaming non-panoramic remote-rendered Virtual Reality (VR) images. Non-panoramic here means that the images rendered and sent across the network cover only the viewport of each eye.

To determine the required bitrate of remote rendering for VR, we use a server that renders a 3D-scene, encodes the resulting images using the NVENC H.264 codec and transmits them to the client across a network. The client decodes the images and displays them in the VR headset. Objective full-reference quality measures are taken by comparing the image before encoding on the server to the same image after it has been decoded on the client. By altering the average bitrate setting of the encoder, we obtain objective quality scores as functions of bitrates. Furthermore, we study the impact of headset rotation speeds, since this will also have a large effect on image quality.
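A full-reference measure compares the pre-encoding frame against the decoded frame pixel by pixel. PSNR is one common example and is sketched below for 8-bit grayscale frames stored as nested lists. The abstract does not say which objective measures were used, so PSNR here is an illustrative stand-in, and the function name and tiny frames are hypothetical.

```python
import math


def psnr(reference, decoded, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized frames."""
    ref_pixels = [p for row in reference for p in row]
    dec_pixels = [p for row in decoded for p in row]
    if len(ref_pixels) != len(dec_pixels):
        raise ValueError("frames must have the same number of pixels")
    mse = sum((a - b) ** 2 for a, b in zip(ref_pixels, dec_pixels)) / len(ref_pixels)
    if mse == 0:
        return math.inf  # identical frames
    return 10.0 * math.log10(max_value ** 2 / mse)


if __name__ == "__main__":
    rendered = [[100, 100], [100, 100]]  # frame before encoding (server side)
    received = [[110, 110], [110, 110]]  # same frame after decoding (client side)
    print(f"{psnr(rendered, received):.2f} dB")
```

Repeating such a comparison at several encoder bitrate settings yields a quality-versus-bitrate curve of the kind the paragraph describes.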

We determine an upper and lower bitrate limit based on headset rotation speeds. The lower limit is based on a speed close to the average human peak head-movement speed, 360°/s. The upper limit is based on maximal peaks of 1080°/s. Depending on the expected rotation speeds of the specific application, we determine that a total of 20–38 Mbps should be used at a resolution of 2160×1200 at 90 fps, and 22–42 Mbps at 2560×1440 at 60 fps. The recommendations are given with the assumption that the image is split in two and streamed in parallel, since this is how the tested prototype operates.

### Kalman Filter-based Head Motion Prediction for Cloud-based Mixed Reality

• Serhan Gül
• Sebastian Bosse
• Dimitri Podborski
• Thomas Schierl
• Cornelius Hellge

Volumetric video allows viewers to experience highly-realistic 3D content with six degrees of freedom in mixed reality (MR) environments. Rendering complex volumetric videos can require a prohibitively high amount of computational power for mobile devices. A promising technique to reduce the computational burden on mobile devices is to perform the rendering at a cloud server. However, cloud-based rendering systems suffer from an increased interaction (motion-to-photon) latency that may cause registration errors in MR environments. One way of reducing the effective latency is to predict the viewer's head pose and render the corresponding view from the volumetric video in advance.

In this paper, we design a Kalman filter for head motion prediction in our cloud-based volumetric video streaming system. We analyze the performance of our approach using recorded head motion traces and compare its performance to an autoregression model for different prediction intervals (look-ahead times). Our results show that the Kalman filter can predict head orientations 0.5 degrees more accurately than the autoregression model for a look-ahead time of 60 ms.
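To make the prediction step concrete, the following is a minimal sketch of a Kalman filter that tracks one orientation angle and extrapolates it over a look-ahead time. It is not the authors' filter: the constant-velocity motion model, the restriction to a single angle (for example yaw), the class name `ConstantVelocityKalman`, and the noise values `q` and `r` are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class ConstantVelocityKalman:
    """1-D Kalman filter over the state [angle_deg, angular_velocity_deg_per_s]."""

    q: float = 50.0     # process-noise intensity (hypothetical tuning value)
    r: float = 0.01     # measurement-noise variance in deg^2 (hypothetical)
    angle: float = 0.0
    velocity: float = 0.0
    # Covariance entries: p00 = var(angle), p01 = cov(angle, velocity), p11 = var(velocity)
    p00: float = 1.0
    p01: float = 0.0
    p11: float = 1000.0

    def step(self, measured_angle: float, dt: float) -> None:
        """Run one predict/update cycle with a new angle measurement."""
        # Predict with F = [[1, dt], [0, 1]] and white-noise-acceleration Q.
        angle = self.angle + dt * self.velocity
        p00 = self.p00 + 2 * dt * self.p01 + dt * dt * self.p11 + self.q * dt ** 3 / 3
        p01 = self.p01 + dt * self.p11 + self.q * dt ** 2 / 2
        p11 = self.p11 + self.q * dt
        # Update with H = [1, 0]: only the angle is observed.
        s = p00 + self.r
        k0, k1 = p00 / s, p01 / s
        residual = measured_angle - angle
        self.angle = angle + k0 * residual
        self.velocity = self.velocity + k1 * residual
        self.p00 = (1 - k0) * p00
        self.p01 = (1 - k0) * p01
        self.p11 = p11 - k1 * p01

    def predict_ahead(self, look_ahead_s: float) -> float:
        """Extrapolate the filtered angle by the look-ahead time."""
        return self.angle + look_ahead_s * self.velocity
```

In use, `step` would be called once per head-pose sample, and `predict_ahead(0.06)` would give the angle expected 60 ms later, which a renderer could use to prepare the view in advance. A full 3D orientation would need a quaternion or per-axis treatment, which this sketch omits.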

### Perception-Lossless Codec of Haptic Data with Low Delay

• Chaoyang Zeng
• Tiesong Zhao
• Qian Liu
• Yiwen Xu
• Kai Wang

In multimedia services, the introduction of haptic signals provides a more immersive user experience beyond conventional audio-visual perception. To support synchronous streaming and display of this information, it is imperative to efficiently compress and store the haptic signals, which promotes the development and optimization of haptic codecs. In this paper, we propose an end-to-end haptic codec for high-efficiency, low-delay and perception-lossless compression of kinesthetic signals, one of the two major components of haptic signals. The proposed encoder consists of an amplifier, DCT, quantizer, run-length encoder and entropy encoder, while the decoder includes the counterpart modules of the encoder. In particular, all parameters of these modules are deliberately calibrated to achieve a high compression efficiency for kinesthetic information. We allow a maximal DCT length of 8 samples in order to guarantee a maximal encoding delay of 7 ms for a popular haptic simulator operating at 1000 Hz. By incorporating the model of the perception deadband, the proposed codec is capable of producing a perception-lossless kinesthetic bitstream. Finally, we examine the proposed codec on the standard database of the IEEE P1918.1.1 Haptic Codecs Task Group. Comprehensive experiments reveal that our codec outperforms its rivals with a 50% bit rate reduction, improved perception quality and a negligible encoder delay.
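The delay figure follows from block buffering: with an 8-sample DCT at 1000 Hz, the first sample of a block waits for the 7 samples that follow it, i.e. 7 ms. The sketch below illustrates that arithmetic together with a generic orthonormal DCT-II and a uniform quantizer. It is an illustration only: the function names and the quantization step are hypothetical, and the amplifier, run-length, entropy-coding and deadband stages of the codec are not modelled.

```python
import math


def buffering_delay_ms(block_len: int, sample_rate_hz: float) -> float:
    """Delay until a block is complete: the first sample waits for block_len - 1 more."""
    return (block_len - 1) * 1000.0 / sample_rate_hz


def _scale(k: int, n: int) -> float:
    # Orthonormal DCT-II scaling factor for coefficient k of an n-point block.
    return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)


def dct2(block):
    """Orthonormal DCT-II of a short block of samples."""
    n = len(block)
    return [
        _scale(k, n) * sum(x * math.cos(math.pi * (i + 0.5) * k / n)
                           for i, x in enumerate(block))
        for k in range(n)
    ]


def idct2(coeffs):
    """Inverse of dct2 (orthonormal DCT-III)."""
    n = len(coeffs)
    return [
        sum(_scale(k, n) * c * math.cos(math.pi * (i + 0.5) * k / n)
            for k, c in enumerate(coeffs))
        for i in range(n)
    ]


def quantize(coeffs, step):
    """Uniform quantizer; `step` is a hypothetical parameter, not a value from the paper."""
    return [round(c / step) for c in coeffs]


def dequantize(levels, step):
    """Map integer quantization levels back to coefficient values."""
    return [level * step for level in levels]
```

For a slowly varying kinesthetic signal, most quantized coefficients in a block are zero, which is what makes a subsequent run-length stage worthwhile.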

### Neural3D: Light-weight Neural Portrait Scanning via Context-aware Correspondence Learning

• Xin Suo
• Minye Wu
• Yanshun Zhang
• Yingliang Zhang
• Lan Xu
• Qiang Hu
• Jingyi Yu

Reconstructing a human portrait in a realistic and convenient manner is critical for human modeling and understanding. Aiming at light-weight and realistic human portrait reconstruction, in this paper we propose Neural3D: a novel neural human portrait scanning system using only a single RGB camera. In our system, to enable accurate pose estimation, we propose a context-aware correspondence learning approach which jointly models the appearance, spatial and motion information between feature pairs. To enable realistic reconstruction and suppress the geometry error, we further adopt a point-based neural rendering scheme to generate realistic and immersive portrait visualization from arbitrary virtual viewpoints. By introducing these learning-based technical components into the pure RGB-based human modeling framework, we can achieve both accurate camera pose estimation and realistic free-viewpoint rendering of the reconstructed human portrait. Extensive experiments on a variety of challenging capture scenarios demonstrate the robustness and effectiveness of our approach.

### Presence, Embodied Interaction and Motivation: Distinct Learning Phenomena in an Immersive Virtual Environment

• Jack Ratcliffe
• Laurissa Tokarchuk

The use of immersive virtual environments (IVEs) for educational purposes has increased in recent years, but the mechanisms through which they contribute to learning are still unclear. Popular explanations for the learning benefits brought by IVEs come from motivation, presence and embodied perspectives, either as individual benefits or through mediation effects on each other. This paper describes an experiment designed to interrogate these approaches, and provides evidence that embodied controls and presence encourage learning in immersive virtual environments, but for distinct, non-interacting reasons, which are also not explained by motivational benefits.

### User Centered Adaptive Streaming of Dynamic Point Clouds with Low Complexity Tiling

• Shishir Subramanyam
• Irene Viola
• Alan Hanjalic
• Pablo Cesar

### Leveraging QoE Heterogeneity for Large-Scale Livecast Scheduling

• Rui-Xiao Zhang
• Ming Ma
• Tianchi Huang
• Hanyu Li
• Jiangchuan Liu
• Lifeng Sun

Livecast streaming has achieved great success in recent years. Although many prior efforts have suggested that dynamic viewer scheduling according to the quality of service (QoS) can improve user engagement, they may be inefficient because they ignore viewer heterogeneity in how QoS impacts quality of experience (QoE).

In this paper, we conduct measurement studies over large-scale data provided by a top livecast platform in China. We observe that QoE is influenced by a lot of QoS and non-QoS factors, and most importantly, the QoE sensitivity to QoS metrics can vary significantly among viewers. Inspired by the above insights, we propose HeteroCast, a novel livecast scheduling framework for intelligent viewer scheduling based on viewer heterogeneity. In detail, HeteroCast addresses this concern by solving two sub-problems. For the first sub-problem (i.e., the QoE modeling problem), we use the deep factorization machine (DeepFM) based method to precisely map complicated factors (QoS and non-QoS factors) to QoE and build the QoE model. For the second sub-problem (i.e., the QoE-aware scheduling problem), we use a graph-matching method to generate the best viewer allocation policy for each CDN provider. Specifically, by using some pruning techniques, HeteroCast only introduces slight overhead and can well adapt to the large-scale livecast scenario. Through extensive evaluation on real-world traces, HeteroCast is demonstrated to increase the average QoE by 8.87%-10.09%.

### Towards Viewport-dependent 6DoF 360 Video Tiled Streaming for Virtual Reality Systems

• Jong-Beom Jeong
• Soonbin Lee
• Il-Woong Ryu
• Tuan Thanh Le
• Eun-Seok Ryu

Previous studies of 360-degree video streaming for virtual reality allowed users to move their heads freely, while their position remained fixed at the camera's location in virtual reality. One approach to overcoming this limitation is to transmit multiview video to provide six degrees of freedom (6DoF). However, implementing a 6DoF streaming system is challenging because streaming multiple high-quality videos requires several decoders and a high bandwidth. Therefore, this paper proposes a viewport-dependent, high-efficiency video coding (HEVC)-compliant tiled streaming system built on the test model for immersive video (TMIV), the MPEG-Immersive multiview compression reference software. This paper proposes a 6DoF viewport tile selector (VTS) for tiled streaming of multiple 360-degree videos. Furthermore, this paper introduces a viewport-dependent multiple-tile extractor. The proposed system detects the user's head movement, selects the tile sets that correspond to the user's viewport, extracts tile bitstreams, and generates a single bitstream. The extracted bitstream is transmitted and decoded to render the user's viewport. The proposed viewport-dependent streaming method can reduce the decoding time as well as the bandwidth. Experimental results demonstrated a 12.04% Bjøntegaard delta rate (BD-rate) saving for the luma peak signal-to-noise ratio (PSNR) compared to the TMIV anchor without tiled encoding, and a 55.51% decoding time saving compared to the TMIV anchor with the existing tiled streaming method.

## SESSION: Poster Session C3: Multimedia Transport and Delivery & Multimedia Analysis and Description

### Low-latency FoV-adaptive Coding and Streaming for Interactive 360° Video Streaming

• Yixiang Mao
• Liyang Sun
• Yong Liu
• Yao Wang

Virtual Reality (VR) and Augmented Reality (AR) technologies have become popular in recent years. Encoding and transmitting omni-directional or 360° video is critical and challenging for those applications. 360° video requires much higher bandwidth than traditional planar video. A premium-quality 360° video with 120 frames per second (fps) and 24K resolution can easily consume bandwidth in the range of gigabits per second. On the other hand, at any given time, a user only watches a small portion of the 360° scope within her Field-of-View (FoV). An effective way to reduce the bandwidth requirement of 360° video is through FoV-adaptive streaming, which codes and delivers the predicted FoV region at higher quality, and discards or codes at lower quality the remaining regions. Such a strategy has been quite extensively studied for video-on-demand and live video streaming applications. Interactive applications, such as conferencing, gaming, and remote collaboration, can also benefit from 360° video by creating an immersive environment for participants to interact with each other. However, real-time coding and streaming of 360° video with extremely low latency, required for interactive applications, has not been sufficiently addressed. This work focuses on developing low-latency and FoV-adaptive coding and streaming strategies for interactive 360° video streaming. We assume the sender and the receiver are connected by a network path with dynamically varying throughput and no short-latency guarantee. The sender is either the video source or a proxy server relaying the source video. The receiver is either the end-user device that directly renders the video, or a local edge server that renders the video and transmits it to the end user.

### Towards Modality Transferable Visual Information Representation with Optimal Model Compression

• Rongqun Lin
• Linwei Zhu
• Shiqi Wang
• Sam Kwong

Compactly representing the visual signals is of fundamental importance in various image/video-centered applications. Although numerous approaches were developed for improving the image and video coding performance by removing the redundancies within visual signals, much less work has been dedicated to the transformation of the visual signals to another well-established modality for better representation capability. In this paper, we propose a new scheme for visual signal representation that leverages the philosophy of transferable modality. In particular, the deep learning model, which characterizes and absorbs the statistics of the input scene with online training, could be efficiently represented in the sense of rate-utility optimization to serve as the enhancement layer in the bitstream. As such, the overall performance can be further guaranteed by optimizing the new modality incorporated. The proposed framework is implemented on the state-of-the-art video coding standard (i.e., versatile video coding), and significantly better representation capability has been observed based on extensive evaluations.

• Chao Zhou
• Shuoqian Wang
• Mengbai Xiao
• Sheng Wei
• Yao Liu

360-degree video is an emerging medium that presents an immersive view of the environment to the user. Despite its potential to provide an immersive watching experience, 360-degree video has not achieved widespread popularity. A significant cause of this slow adoption is the high-bandwidth requirements of the format. The primary source of bandwidth inefficiency in 360-degree video streaming, un-addressed in popular transmission methods, is the discrepancy between the pixels sent over the network (typically the full omnidirectional view) and the pixels displayed in the head-mounted display's field of view. At worst, roughly 88% of transmitted pixels remain unviewed.

In this work, we motivate a user-adaptive approach to address inefficiencies in 360-degree streaming through an analysis of user-viewing traces. We design a greedy algorithm to generate projections of the spherical surface that allow the user-view trajectories to be efficiently transmitted. We further demonstrate that our approach can be applied to many popular 360-degree projection layouts. In BD-rate experiments, we show that the adaptive versions of the rotated spherical projection (RSP) and equi-angular cubemap (EAC) can save 26.2% and 24.0% bitrates on average, respectively, while achieving the same visual quality of rendered views compared to their non-adaptive counterparts in a realistic scenario. These adaptive projections can also achieve 53.1% bandwidth savings over the equirectangular projection.

### Tile Rate Allocation for 360-Degree Tiled Adaptive Video Streaming

• Wei Tsang Ooi

360-degree video streaming commonly encodes and transmits the video as independently decodable tiles to conserve the bandwidth spent on regions outside the viewer's field of view (FoV). The bitrate, however, can vary significantly across tiles, complicating the choice of which representation to download for each tile in each segment when adapting to bandwidth dynamics. In this paper, we model the tile rate allocation problem as a multiclass knapsack problem with a dynamic profit function that depends on the FoV and the buffer occupancy. Experiments show that our approach can reduce bandwidth wastage by up to 41%, the number of stalls by up to 31%, stall durations by up to 26.5%, and switches in quality by up to 20%, without sacrificing the quality of the tiles within the FoV, even when there are significant head movements and changes in FoV during streaming.
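The knapsack formulation can be illustrated with a small dynamic program in which each tile is a class, each representation is an item with a bitrate and a profit, and exactly one item is chosen per class under a bandwidth budget. This is a generic multiple-choice knapsack sketch rather than the paper's algorithm: the function name `allocate_tile_rates`, the integer bitrate units, and the fixed profit values are assumptions, whereas the paper derives profits dynamically from the FoV and buffer occupancy.

```python
def allocate_tile_rates(tiles, budget):
    """Pick one representation per tile to maximise total profit within a bitrate budget.

    tiles  : list (one entry per tile) of lists of (bitrate, profit) pairs,
             with integer bitrates in a common unit (hypothetical input format).
    budget : integer bitrate budget in the same unit.
    Returns (total_profit, chosen_index_per_tile), or None if no choice fits.
    """
    neg_inf = float("-inf")
    # best[b]: highest profit over the tiles seen so far with total bitrate exactly b.
    best = [neg_inf] * (budget + 1)
    best[0] = 0.0
    picks = []  # picks[t][b]: representation index chosen for tile t at total bitrate b
    for reps in tiles:
        new_best = [neg_inf] * (budget + 1)
        pick = [None] * (budget + 1)
        for b in range(budget + 1):
            if best[b] == neg_inf:
                continue
            for i, (rate, profit) in enumerate(reps):
                nb = b + rate
                if nb <= budget and best[b] + profit > new_best[nb]:
                    new_best[nb] = best[b] + profit
                    pick[nb] = i
        best = new_best
        picks.append(pick)
    b = max(range(budget + 1), key=lambda x: best[x])
    if best[b] == neg_inf:
        return None
    total = best[b]
    # Walk the back-pointers from the last tile to the first.
    choice = []
    for reps, pick in zip(reversed(tiles), reversed(picks)):
        i = pick[b]
        choice.append(i)
        b -= reps[i][0]
    choice.reverse()
    return total, choice
```

With two tiles, where the first is inside the FoV and therefore carries higher profits, a call such as `allocate_tile_rates([[(1, 1.0), (3, 5.0)], [(1, 0.5), (3, 1.0)]], 4)` spends the budget on the first tile and picks the low-rate representation for the second.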

• Lianli Gao
• Junchen Zhu
• Jingkuan Song
• Feng Zheng
• Heng Tao Shen

Lab2Pix refers to the task of generating photo-realistic images from labels, e.g., semantic labels or sketch labels. Although it derives from image-to-image translation, Lab2Pix has its own characteristics due to the differences between labels and general images. This prevents general image-to-image translation models from being applied directly to the Lab2Pix task. Therefore, we propose an unsupervised framework named Lab2Pix to adaptively synthesize images from labels by considering the particular properties of the label-to-image synthesis task. Specifically, since the labels contain much less information than the images, we design our generator in a cumulative style which gradually renders synthesized images by fusing features at different levels. Accordingly, the verification process feeds the generated images to a segmentation component and compares the results to the original input label. Furthermore, we propose a sharp enhancement loss, an image consistency loss and a foreground enhancement mask to encourage the network to synthesize photo-realistic images. Experiments conducted on the Cityscapes, Facades, Edge2shoes and Edge2handbags datasets demonstrate that our Lab2Pix significantly outperforms existing state-of-the-art unsupervised methods and is even comparable to supervised methods. The source code is available at https://github.com/RoseRollZhu/Lab2Pix.

• Zhou Yu
• Yuhao Cui
• Jun Yu
• Meng Wang
• Dacheng Tao
• Qi Tian

### DIMC-net: Deep Incomplete Multi-view Clustering Network

• Jie Wen
• Zheng Zhang
• Zhao Zhang
• Zhihao Wu
• Lunke Fei
• Yong Xu
• Bob Zhang

In this paper, a new deep incomplete multi-view clustering network, called DIMC-net, is proposed to address the challenge of multi-view clustering on missing views. In particular, DIMC-net designs several view-specific encoders to extract the high-level information of multiple views and introduces a fusion graph based constraint to explore the local geometric information of data. To reduce the negative influence of missing views, a weighted fusion layer is introduced to obtain the consensus representation shared by all views. Moreover, a clustering layer is introduced to guarantee that the obtained consensus representation is the best one for the clustering task. Compared with the existing deep learning based approaches, DIMC-net is more flexible and efficient since it can handle all kinds of incomplete cases and directly produce the clustering results. Experimental results show that DIMC-net achieves significant improvement over state-of-the-art incomplete multi-view clustering methods.
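One simple way to picture a weighted fusion over incomplete views is a masked average, in which a missing view receives weight zero and contributes nothing to the consensus representation. The sketch below shows only that idea: the function name `fuse_views` and the binary availability weights are assumptions, and the abstract does not specify the exact weighting DIMC-net uses.

```python
def fuse_views(embeddings, available):
    """Average per-view embeddings, skipping views marked as missing.

    embeddings : list of equal-length vectors, one per view (placeholder
                 values are allowed for missing views).
    available  : list of 0/1 flags, 1 if the view was observed for this sample.
    """
    count = sum(available)
    if count == 0:
        raise ValueError("at least one view must be available")
    dim = len(embeddings[0])
    return [
        sum(flag * vec[d] for flag, vec in zip(available, embeddings)) / count
        for d in range(dim)
    ]
```

Because the normalisation uses the number of observed views rather than the total number of views, samples with different missing patterns yield consensus vectors on a comparable scale.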

### Cross-domain Cross-modal Food Transfer

• Bin Zhu
• Chong-Wah Ngo
• Jing-jing Chen

Recent works in cross-modal image-to-recipe retrieval pave a new way to scale up food recognition. By learning the joint space between food images and recipes, food recognition is reduced to a retrieval problem that evaluates the similarity of embedded features. The major drawback, nevertheless, is the difficulty in applying an already-trained model to recognize dishes from cuisines unknown to the model. In general, model updating with new training examples, in the form of image-recipe pairs, is required to adapt a model to new cooking styles in a cuisine. In practice, however, acquiring a sufficient number of image-recipe pairs for model transfer can be time-consuming. This paper addresses the challenge of resource scarcity in the scenario where only partial data, instead of a complete view of the data, is accessible for model transfer. Partial data refers to missing information, such as the absence of the image modality or the cooking instructions from an image-recipe pair. To cope with partial data, a novel generic model, equipped with various loss functions including cross-modal metric learning, recipe residual loss, semantic regularization and adversarial learning, is proposed for cross-domain transfer learning. Experiments are conducted on three different cuisines (Chuan, Yue and Washoku) to provide insights on scaling up food recognition across domains with limited training resources.

### Texture Semantically Aligned with Visibility-aware for Partial Person Re-identification

• Lishuai Gao
• Hua Zhang
• Zan Gao
• Weili Guan
• Zhiyong Cheng
• Meng Wang

In real person re-identification (ReID) tasks, pedestrians are often obscured by other pedestrians or objects; moreover, changes in pose or observation perspective also commonly occur in partial person ReID. To the best of our knowledge, few works simultaneously focus on these two issues. In this work, we propose a novel visibility-aware texture semantic alignment (TSA) approach for the partial person ReID task, in which the occlusion issue and changes in pose are simultaneously explored in an end-to-end unified framework. Specifically, we first employ a texture alignment scheme with the semantic visibility of a person's image to address changes in pose, which enhances the alignment and generalization capability of the models. Second, we design a human pose-based partial region alignment scheme to address the occlusion problem, which makes the TSA method emphasize the shared body parts. Finally, these two networks jointly learn these aspects. Extensive experimental results demonstrate that our proposed TSA method is effective and robust in simultaneously handling occlusion and changes in pose. It outperforms state-of-the-art approaches by a large margin, achieving improvements of 5% and 6.4% in rank-1 accuracy over the visibility-aware part model (VPM) method (published in CVPR 2019) on the Partial ReID and Partial-iLIDS datasets, respectively.

### KTN: Knowledge Transfer Network for Multi-person DensePose Estimation

• Xuanhan Wang
• Lianli Gao
• Jingkuan Song
• Heng Tao Shen

In this paper, we address the multi-person densepose estimation problem, which aims at learning dense correspondences between 2D pixels of human body and 3D surface. It still poses several challenges due to real-world scenes with scale variations, occlusion and insufficient annotations. In particular, we address two main problems: 1) how to design a simple yet effective pipeline for densepose estimation; and 2) how to equip this pipeline with the ability of handling the issues of limited annotations and class-imbalanced labels. To tackle these problems, we develop a novel densepose estimation framework based on a two-stage pipeline, called Knowledge Transfer Network (KTN). Unlike existing works which directly propagate the pyramidal base features of regions, we enhance their representation power by a multi-instance decoder (MID). MID can well distinguish the target instance from other interference instances and background. Then, we introduce a knowledge transfer machine (KTM), which improves densepose estimation by utilizing the external commonsense knowledge. Notably, with the help of our knowledge transfer machine (KTM), current densepose estimation systems (either based on RCNN or fully-convolutional frameworks) can be improved in terms of the accuracy of human densepose estimation. Solid experiments on densepose estimation benchmarks demonstrate the superiority and generalizability of our approach. Our code and models will be publicly available.

### Activity-driven Weakly-Supervised Spatio-Temporal Grounding from Untrimmed Videos

• Junwen Chen
• Wentao Bao
• Yu Kong

In this paper, we study the problem of weakly-supervised spatio-temporal grounding from raw untrimmed video streams. Given a video and its descriptive sentence, spatio-temporal grounding aims at predicting the temporal occurrence and spatial locations of each query object across frames. Our goal is to learn a grounding model in a weakly-supervised fashion, without the supervision of either spatial bounding boxes or temporal occurrences during training. Existing methods address trimmed videos, but their reliance on object tracking easily fails due to the frequent camera shot cuts in untrimmed videos. To this end, we propose a novel spatio-temporal multiple instance learning (MIL) framework for untrimmed video grounding. Spatial MIL and temporal MIL are mutually guided to ground each query to specific spatial regions and the occurring frames of a video. Furthermore, the activity described in the sentence is captured to provide informative contextual cues for region proposal refinement and text representation. We conduct extensive evaluation on the YouCookII and RoboWatch datasets, and demonstrate that our method outperforms state-of-the-art methods.

### Modeling Temporal Concept Receptive Field Dynamically for Untrimmed Video Analysis

• Zhaobo Qi
• Shuhui Wang
• Chi Su
• Li Su
• Weigang Zhang
• Qingming Huang

Event analysis in untrimmed videos has attracted increasing attention due to the application of cutting-edge techniques such as CNNs. As a well-studied property of CNN-based models, the receptive field measures the spatial range covered by a single feature response, which is crucial in improving image categorization accuracy. In the video domain, video event semantics are described by complex interactions among different concepts, while their behaviors vary drastically from one video to another, leading to difficulty in concept-based analytics for accurate event categorization. To model the concept behavior, we study the temporal concept receptive field of concept-based event representation, which encodes the temporal occurrence pattern of different mid-level concepts. Accordingly, we introduce temporal dynamic convolution (TDC) to give stronger flexibility to concept-based event analytics. TDC can adjust the temporal concept receptive field size dynamically according to different inputs. Notably, a set of coefficients is learned to fuse the results of multiple convolutions with different kernel widths that provide various temporal concept receptive field sizes. Different coefficients can generate an appropriate and accurate temporal concept receptive field size according to the input video and highlight crucial concepts. Based on TDC, we propose the temporal dynamic concept modeling network (TDCMN) to learn an accurate and complete concept representation for efficient untrimmed video analysis. Experimental results on FCVID and ActivityNet show that TDCMN demonstrates adaptive event recognition ability conditioned on different inputs, and improves the event recognition performance of concept-based methods by a large margin. Code is available at https://github.com/qzhb/TDCMN.
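The fusion of several convolutions with different kernel widths can be sketched as follows: each branch convolves the same temporal signal with its own kernel, and a softmax over per-branch scores gives the coefficients that blend the branch outputs. This is a simplified illustration, not the TDCMN implementation (available at the URL above): the function names are hypothetical, the signal is a single 1-D sequence, and the scores are passed in directly rather than predicted from the input by a learned module.

```python
import math


def temporal_conv(signal, kernel):
    """Same-length 1-D correlation with zero padding; kernel length must be odd."""
    half = len(kernel) // 2
    out = []
    for t in range(len(signal)):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = t + j - half
            if 0 <= idx < len(signal):
                acc += w * signal[idx]
        out.append(acc)
    return out


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dynamic_temporal_conv(signal, kernels, scores):
    """Blend the outputs of several kernel widths using softmax coefficients."""
    coeffs = softmax(scores)
    branches = [temporal_conv(signal, k) for k in kernels]
    return [
        sum(c * branch[t] for c, branch in zip(coeffs, branches))
        for t in range(len(signal))
    ]
```

When one score dominates, the output approaches that single branch, so shifting the scores effectively widens or narrows the temporal receptive field for a given input.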

### Relational Graph Learning for Grounded Video Description Generation

• Wenqiao Zhang
• Xin Eric Wang
• Siliang Tang
• Haizhou Shi
• Haochen Shi
• Jun Xiao
• Yueting Zhuang
• William Yang Wang

Grounded video description (GVD) encourages captioning models to attend to appropriate video regions (e.g., objects) dynamically and generate a description. Such a setting can help explain the decisions of captioning models and prevents the model from hallucinating object words in its description. However, such a design mainly focuses on object word generation and thus may ignore fine-grained information and suffer from missing visual concepts. Moreover, relational words (e.g., 'jump left or right') are usually the results of spatio-temporal inference, i.e., these words cannot be grounded on certain spatial regions. To tackle the above limitations, we design a novel relational graph learning framework for GVD, in which a language-refined scene graph representation is designed to explore fine-grained visual concepts. Furthermore, the refined graph can be regarded as relational inductive knowledge to assist captioning models in selecting the relevant information they need to generate correct words. We validate the effectiveness of our model through automatic metrics and human evaluation, and the results indicate that our approach can generate more fine-grained and accurate descriptions, and that it solves the problem of object hallucination to some extent.

## SESSION: Poster Session D3: Multimedia Analysis and Description & Multimedia Fusion and Embedding

### Finding Achilles' Heel: Adversarial Attack on Multi-modal Action Recognition

• Deepak Kumar
• Chetan Kumar
• Chun Wei Seah
• Siyu Xia
• Ming Shao

Neural network-based models are notorious for their adversarial vulnerability. Recent adversarial machine learning has mainly focused on images, where a small perturbation can simply be added to fool the learning model. Very recently, this practice has been explored in human action video attacks by adding perturbations to key frames. Unfortunately, frame selection is usually computationally expensive at run-time, and adding noise to all frames is unrealistic as well. In this paper, we present a novel yet efficient approach to address this issue. Multi-modal video data such as RGB, depth and skeleton data have been widely used for human action modeling, and they have been shown to achieve superior performance over a single modality. Interestingly, we observed that the skeleton data is more "vulnerable" under adversarial attack, and we propose to leverage this "Achilles' Heel" to attack multi-modal video data. In particular, first, an adversarial learning paradigm is designed to perturb skeleton data for a specific action under a black box setting, which highlights how body joints and key segments in videos are subject to attack. Second, we propose a graph attention model to explore the semantics between segments from different modalities and within a modality. Third, the attack is launched at run-time on all modalities through the learned semantics. The proposed method has been extensively evaluated on multi-modal visual action datasets, including PKU-MMD and NTU-RGB+D, to validate its effectiveness.

### Online Multi-view Subspace Learning with Mixed Noise

• Jinxing Li
• Hongwei Yong
• Feng Wu
• Mu Li

Multi-view learning reveals the latent correlation between different input modalities and has achieved outstanding performance in many fields. Recent approaches aim to find a low-dimensional subspace to reconstruct each view, in which the gross residual or noise follows either a Gaussian or a Laplacian distribution. However, the noise distribution is often more complex in practical applications, and a deterministic distribution assumption is incapable of modeling it. Additionally, for time-varying data, e.g., videos, the noise is temporally smooth, which prevents us from processing the data with the whole input at once, as is generally done in many existing multi-view learning methods. To tackle these problems, a novel online multi-view subspace learning method is proposed in this paper. In particular, our proposed method not only estimates a transformation for each view to extract the correlation among various views, but also introduces a Mixture of Gaussians (MoG) model into the multi-view data, exploiting a number of Gaussian distributions to adaptively fit a wider range of complex noise. Furthermore, we design a novel online Expectation Maximization (EM) algorithm capable of efficiently processing the dynamic data. Experimental results substantiate the effectiveness and superiority of our approach.

### LSOTB-TIR: A Large-Scale High-Diversity Thermal Infrared Object Tracking Benchmark

• Qiao Liu
• Xin Li
• Zhenyu He
• Chenglong Li
• Jun Li
• Zikun Zhou
• Di Yuan
• Jing Li
• Kai Yang
• Nana Fan
• Feng Zheng

In this paper, we present a Large-Scale and high-diversity general Thermal InfraRed (TIR) Object Tracking Benchmark, called LSOTB-TIR, which consists of an evaluation dataset and a training dataset with a total of 1,400 TIR sequences and more than 600K frames. We annotate the bounding box of objects in every frame of all sequences and generate over 730K bounding boxes in total. To the best of our knowledge, LSOTB-TIR is the largest and most diverse TIR object tracking benchmark to date. To evaluate a tracker on different attributes, we define 4 scenario attributes and 12 challenge attributes in the evaluation dataset. By releasing LSOTB-TIR, we encourage the community to develop deep learning based TIR trackers and evaluate them fairly and comprehensively. We evaluate and analyze more than 30 trackers on LSOTB-TIR to provide a series of baselines, and the results show that deep trackers achieve promising performance. Furthermore, we re-train several representative deep trackers on LSOTB-TIR, and their results demonstrate that the proposed training dataset significantly improves the performance of deep TIR trackers. Codes and dataset are available at https://github.com/QiaoLiuHit/LSOTB-TIR.

### Towards More Explainability: Concept Knowledge Mining Network for Event Recognition

• Zhaobo Qi
• Shuhui Wang
• Chi Su
• Li Su
• Qingming Huang
• Qi Tian

Event recognition of untrimmed video is a challenging task due to the big gap between low level visual features and event semantics. Beyond feature learning via deep neural networks, some recent works focus on analyzing event videos using concept-based representation. However, these methods simply aggregate the concept representation vectors of frames or segments, which inevitably introduces information loss on video-level concept knowledge. Moreover, the diversified relation between different concept domains (e.g., scene, object and action) has not been fully explored. To address the above issues, we propose a concept knowledge mining network (CKMN) for event recognition. CKMN is composed of an intra-domain concept knowledge mining subnetwork (IaCKM) and an inter-domain concept knowledge mining subnetwork (IrCKM). IaCKM aims to obtain a complete concept representation by mining the existing pattern of each concept at different time granularities with dilated temporal pyramid convolution and temporal self-attention, while IrCKM explores the interaction between different types of concepts with co-attention style learning. We evaluate our method on FCVID and ActivityNet datasets. Experimental results show the effectiveness and better interpretability of our model on event analytics. Code is available at https://github.com/qzhb/CKMN.

### Simultaneous Semantic Alignment Network for Heterogeneous Domain Adaptation

• Shuang Li
• Binhui Xie
• Jiashu Wu
• Ying Zhao
• Chi Harold Liu
• Zhengming Ding

Heterogeneous domain adaptation (HDA) transfers knowledge across source and target domains that present heterogeneities, e.g., distinct domain distributions and differences in feature type or dimension. Most previous HDA methods tackle this problem by learning a domain-invariant feature subspace to reduce the discrepancy between domains. However, the intrinsic semantic properties contained in the data are under-explored in such alignment strategies, although they are also indispensable for achieving promising adaptability. In this paper, we propose a Simultaneous Semantic Alignment Network (SSAN) to simultaneously exploit correlations among categories and align the centroids for each category across domains. In particular, we propose an implicit semantic correlation loss to transfer the correlation knowledge of source categorical prediction distributions to the target domain. Meanwhile, by leveraging target pseudo-labels, a robust triplet-centroid alignment mechanism is explicitly applied to align feature representations for each category. Notably, a pseudo-label refinement procedure that involves geometric similarity is introduced to enhance the target pseudo-label assignment accuracy. Comprehensive experiments on various HDA tasks across text-to-image, image-to-image and text-to-text successfully validate the superiority of our SSAN against state-of-the-art HDA methods. The code is publicly available at https://github.com/BIT-DA/SSAN.

### Diverter-Guider Recurrent Network for Diverse Poems Generation from Image

• Liang Li
• Shijie Yang
• Li Su
• Shuhui Wang
• Chenggang Yan
• Zheng-jun Zha
• Qingming Huang

Poem generation from image aims to automatically generate poetic sentences that present the image content or overtone. Previous works focused on 1-to-1 image-poem generation with the demands of poeticness and content relevance. This paper proposes the paradigm of multiple poems generation from one image, which is closer to human poetizing but more challenging. Its key problem is to simultaneously guarantee the diversity of multiple poems with poeticness and relevance. To this end, we propose an end-to-end probabilistic Diverter-Guider Recurrent Network (DG-Net), which is a context-based encoder-decoder generative model with hierarchical stochastic variables. Specifically, the diverter-variable represents the decoding-context inferred from the input image to diversify the poem themes; the guider-variable is introduced as an attribute decoder to restrict the word-choice with supervised information. Extensive experiments on automatic evaluations and human judgments demonstrate the superior performance of DG-Net over existing poem generation methods. A qualitative study shows that our model can generate diverse poems with poeticness and relevance.

### Look, Listen, and Attend: Co-Attention Network for Self-Supervised Audio-Visual Representation Learning

• Ying Cheng
• Ruize Wang
• Zhihao Pan
• Rui Feng
• Yuejie Zhang

When watching videos, the occurrence of a visual event is often accompanied by an audio event, e.g., the voice of lip motion, the music of playing instruments. There is an underlying correlation between audio and visual events, which can be utilized as free supervised information to train a neural network by solving the pretext task of audio-visual synchronization. In this paper, we propose a novel self-supervised framework with co-attention mechanism to learn generic cross-modal representations from unlabelled videos in the wild, and further benefit downstream tasks. Specifically, we explore three different co-attention modules to focus on discriminative visual regions correlated to the sounds and introduce the interactions between them. Experiments show that our model achieves state-of-the-art performance on the pretext task while having fewer parameters compared with existing methods. To further evaluate the generalizability and transferability of our approach, we apply the pre-trained model on two downstream tasks, i.e., sound source localization and action recognition. Extensive experiments demonstrate that our model provides competitive results with other self-supervised methods, and also indicate that our approach can tackle the challenging scenes which contain multiple sound sources.

### Cross-Modal Relation-Aware Networks for Audio-Visual Event Localization

• Haoming Xu
• Runhao Zeng
• Qingyao Wu
• Mingkui Tan
• Chuang Gan

We address the challenging task of event localization, which requires the machine to localize an event and recognize its category in unconstrained videos. Most existing methods leverage only the visual information of a video while neglecting its audio information, which, however, can be very helpful and important for event localization. For example, humans often recognize an event by reasoning with the visual and audio content simultaneously. Moreover, the audio information can guide the model to pay more attention to the informative regions of visual scenes, which can help to reduce the interference brought by the background. Motivated by these observations, in this paper, we propose a relation-aware network to leverage both audio and visual information for accurate event localization. Specifically, to reduce the interference brought by the background, we propose an audio-guided spatial-channel attention module to guide the model to focus on event-relevant visual regions. Besides, we propose to build connections between visual and audio modalities with a relation-aware module. In particular, we learn the representations of video and/or audio segments by aggregating information from the other modality according to the cross-modal relations. Last, relying on the relation-aware representations, we conduct event localization by predicting the event-relevant score and classification score. Extensive experimental results demonstrate that our method significantly outperforms the state-of-the-art methods in both supervised and weakly-supervised AVE settings.

### Learning Deep Multimodal Feature Representation with Asymmetric Multi-layer Fusion

• Yikai Wang
• Fuchun Sun
• Ming Lu
• Anbang Yao

We propose a compact and effective framework to fuse multimodal features at multiple layers in a single network. The framework consists of two innovative fusion schemes. Firstly, unlike existing multimodal methods that necessitate individual encoders for different modalities, we verify that multimodal features can be learnt within a shared single network by merely maintaining modality-specific batch normalization layers in the encoder, which also enables implicit fusion via joint feature representation learning. Secondly, we propose a bidirectional multi-layer fusion scheme, where multimodal features can be exploited progressively. To take advantage of such scheme, we introduce two asymmetric fusion operations including channel shuffle and pixel shift, which learn different fused features with respect to different fusion directions. These two operations are parameter-free and strengthen the multimodal feature interactions across channels as well as enhance the spatial feature discrimination within channels. We conduct extensive experiments on semantic segmentation and image translation tasks, based on three publicly available datasets covering diverse modalities. Results indicate that our proposed framework is general, compact and is superior to state-of-the-art fusion frameworks.

### Look, Listen and Infer

• Ruijian Jia
• Xinsheng Wang
• Shanmin Pang
• Jihua Zhu
• Jianru Xue

Inspired by the ability of human beings to recognize the relations between visual scenes and sounds, many cross-modal learning methods have been developed for modeling images or videos and associated sounds. In this work, for the first time, a Look, Listen and Infer Network (LLINet) is proposed to learn a zero-shot model that can infer the relations of visual scenes and sounds from novel categories that have never appeared before. LLINet is mainly designed for two tasks, i.e., image-audio cross-modal retrieval and sound localization in each image. Towards this end, it is designed as a two-branch encoding network that builds a common space for images and audios. Besides, a cross-modal attention mechanism is proposed in LLINet to localize sound objects. To evaluate LLINet, a new data set, named INSTRUMENT-32CLASS, is collected in this work. Besides zero-shot cross-modal retrieval and sound localization, a zero-shot image recognition task based on sounds is also conducted on this database. All experimental results on these tasks demonstrate the effectiveness of LLINet, indicating that zero-shot learning for visual scenes and sounds is feasible. The project page for LLINet is available at https://llinet.github.io/.

### DCNet: Dense Correspondence Neural Network for 6DoF Object Pose Estimation in Occluded Scenes

• Zhi Chen
• Wei Yang
• Zhenbo Xu
• Xike Xie
• Liusheng Huang

6DoF object pose estimation is essential for many real-world applications. Although great progress has been made, challenges still remain in estimating 6D pose for occluded objects. Current RGB-D approaches predict 6DoF pose directly, which is sensitive to occlusion in cluttered scenes. In this work, we propose DCNet, an end-to-end framework for estimating 6DoF object poses. DCNet first converts pixels in the image plane to point clouds in the camera coordinate system and then establishes dense correspondences between the camera coordinate system and the object coordinate system. Based on these two systems, we fuse 2D appearance and 3D geometric features by pixel-wise concatenation to construct dense correspondences, from which the pose is calculated through the least-squares fitting algorithm. Dense correspondences guarantee enough point pairs for a robust 6DoF pose estimation, even if the occlusion is heavy. Experimental results demonstrate that DCNet outperforms the state-of-the-art methods on LINEMOD, Occlusion LINEMOD and YCB-Video datasets, especially in terms of the robustness to occlusion scenes.

## SESSION: Poster Session E3: Multimedia Fusion and Embedding & Music, Speech and Audio & Summarization, Analytics and Storytelling

### Transferrable Referring Expression Grounding with Concept Transfer and Context Inheritance

• Xuejing Liu
• Liang Li
• Shuhui Wang
• Zheng-Jun Zha
• Dechao Meng
• Qingming Huang

Referring Expression Grounding (REG) aims at localizing a particular object in an image according to a language expression. Recent REG methods have achieved promising performance, but most of them are constrained to limited object categories due to the scale of current REG datasets. In this paper, we explore REG in a new scenario, where the REG model can ground novel objects outside the REG training data. With this motivation, we propose a Concept-Context Disentangled network (CCD), which transfers concepts from auxiliary classification data with new categories while inheriting context from REG data to ground new objects. Specifically, we design a subject encoder to learn a cross-modal common semantic space, which can bridge the semantic and domain gap between auxiliary classification data and REG data. This common space guarantees that CCD can transfer and recognize novel categories. Further, we learn the correspondence between image proposal and referring expression upon location and relationship. Benefiting from the disentangled structure, the context is relatively independent of the subject, so it can be better inherited from the REG training data. Finally, a language attention is learned to adaptively assign different importance to subject and context for grounding target objects. Experiments on four REG datasets show our method outperforms the compared approach on the new-category test datasets.

### Deep Multi-modality Soft-decoding of Very Low Bit-rate Face Videos

• Yanhui Guo
• Xi Zhang
• Xiaolin Wu

We propose a novel deep multi-modality neural network for restoring very low bit rate videos of talking heads. Such video contents are very common in social media, teleconferencing, distance education, tele-medicine, etc., and often need to be transmitted with limited bandwidth. The proposed CNN method exploits the correlations among three modalities, video, audio and emotion state of the speaker, to remove the video compression artifacts caused by spatial down sampling and quantization. The deep learning approach turns out to be ideally suited for the video restoration task, as the complex non-linear cross-modality correlations are very difficult to model analytically and explicitly. The new method is a video post processor that can significantly boost the perceptual quality of aggressively compressed talking head videos, while being fully compatible with all existing video compression standards.

### Multi-modal Multi-relational Feature Aggregation Network for Medical Knowledge Representation Learning

• Yingying Zhang
• Quan Fang
• Shengsheng Qian
• Changsheng Xu

Representation learning of medical Knowledge Graph (KG) is an important task and forms the fundamental process for intelligent medical applications such as disease diagnosis and healthcare question answering. Many embedding models have therefore been proposed to learn vector representations for entities and relations, but they ignore three important properties of medical KG: multi-modal, unbalanced and heterogeneous. Entities in the medical KG can carry unstructured multi-modal content, such as image and text. At the same time, the knowledge graph consists of multiple types of entities and relations, and each entity has a varying number of neighbors. In this paper, we propose a Multi-modal Multi-Relational Feature Aggregation Network (MMRFAN) for medical knowledge representation learning. To deal with the multi-modal content of the entity, we propose an adversarial feature learning model to map the textual and image information of the entity into the same vector space and learn the multi-modal common representation. To better capture the complex structure and rich semantics, we design a sampling mechanism and aggregate the neighbors with intra and inter-relation attention. We evaluate our model on three knowledge graphs, including FB15k-237, IMDb and Symptoms-in-Chinese with link prediction and node classification tasks. Experimental results show that our approach outperforms state-of-the-art methods.

• Wenqiao Zhang
• Siliang Tang
• Yanpeng Cao
• Jun Xiao
• Shiliang Pu
• Fei Wu
• Yueting Zhuang

Understanding and reasoning over partially observed visual clues are often regarded as a challenging real-world problem even for human beings. In this paper, we present a new visual question answering (VQA) task -- Photo Stream QA, which aims to answer open-ended questions about a narrative photo stream. Photo Stream QA is more challenging and interesting than the existing VQA tasks, since the temporal and visual variance among photos in the stream is huge and hard to observe. Therefore, instead of learning simple vision-text mappings, the AI algorithms must fill these variance gaps with more recollection, reasoning, and even knowledge from our daily experiences. To tackle the problems in Photo Stream QA, we propose an end-to-end baseline (E-TAA) with a novel Experienced Unit (E-unit) and Three-stage Alternating Attention (TAA). E-unit yields a better visual representation which captures the temporal semantic relation among visual clues in the photo stream, while TAA creates three levels of attention that gradually refine visual features by using the textual representation from the question as guidance. Experimental results on our developed dataset demonstrate that, as the first attempt at the Photo Stream QA task, E-TAA provides promising results, outperforming all the other baseline methods.

### Generalized Zero-shot Learning with Multi-source Semantic Embeddings for Scene Recognition

• Xinhang Song
• Haitao Zeng
• Sixian Zhang
• Luis Herranz
• Shuqiang Jiang

Recognizing visual categories from semantic descriptions is a promising way to extend the capability of a visual classifier beyond the concepts represented in the training data (i.e. seen categories). This problem is addressed by (generalized) zero-shot learning methods (GZSL), which leverage semantic descriptions that connect them to seen categories (e.g. label embedding, attributes). Conventional GZSL are designed mostly for object recognition. In this paper we focus on zero-shot scene recognition, a more challenging setting with hundreds of categories where their differences can be subtle and often localized in certain objects or regions. Conventional GZSL representations are not rich enough to capture these local discriminative differences. Addressing these limitations, we propose a feature generation framework with two novel components: 1) multiple sources of semantic information (i.e. attributes, word embeddings and descriptions), 2) region descriptions that can enhance scene discrimination. To generate synthetic visual features we propose a two-step generative approach, where local descriptions are sampled and used as conditions to generate visual features. The generated features are then aggregated and used together with real features to train a joint classifier. In order to evaluate the proposed method, we introduce a new dataset for zero-shot scene recognition with multi-semantic annotations. Experimental results on the proposed dataset and SUN Attribute dataset illustrate the effectiveness of the proposed method.

• Xia Du
• Chi-Man Pun
• Zheng Zhang

### Emerging Topic Detection on the Meta-data of Images from Fashion Social Media

• Kunihiro Miyazaki
• Takayuki Uchiba
• Scarlett Young
• Yuichi Sasaki
• Kenji Tanaka

In the fashion industry, where social media has a growing presence, it is increasingly important to detect the emergence of people's new tastes at an early stage based on the photos posted there. However, the number of photos posted on fashion social media is so large that it is almost impossible for people to examine them manually. Also, previous studies on image analysis in social media focus only on individual items for trend detection. Therefore, in this research, we propose a novel framework for capturing changes in people's tastes in terms of coordination rather than individual items. In the framework, we apply Emerging Topic Detection (ETD) to multiple meta-data of images automatically extracted by deep learning. In ETD, new topics which did not exist previously are detected by comparing multiple time windows. To better capture the nature of fashion topics, we employ the clustering method MULIC as a topic detection method, which is density-based, centroid-based, and designed for categorical data. Our experiments with real-world data, in terms of method stability, qualitative evaluation of the output, and expert review, confirmed that the Emerging Topics were properly captured.

### Deep Concept-wise Temporal Convolutional Networks for Action Localization

• Xin Li
• Tianwei Lin
• Xiao Liu
• Wangmeng Zuo
• Chao Li
• Xiang Long
• Dongliang He
• Fu Li
• Shilei Wen
• Chuang Gan

Existing action localization approaches adopt shallow temporal convolutional networks (i.e., TCN) on 1D feature maps extracted from video frames. In this paper, we empirically find that stacking more conventional temporal convolution layers actually deteriorates action classification performance, possibly because all channels of the 1D feature map, which generally are highly abstract and can be regarded as latent concepts, are excessively recombined in temporal convolution. To address this issue, we introduce a novel concept-wise temporal convolution (CTC) layer as an alternative to the conventional temporal convolution layer for training deeper action localization networks. Instead of recombining latent concepts, the CTC layer deploys a number of temporal filters to each concept separately with shared filter parameters across concepts. It can thus capture common temporal patterns of different concepts and significantly enrich representation ability. Via stacking CTC layers, we propose a deep concept-wise temporal convolutional network (C-TCN), which boosts the state-of-the-art action localization performance on THUMOS'14 from 42.8 to 52.1 in terms of mAP(%), achieving a relative improvement of 21.7%. Favorable results are also obtained on ActivityNet.
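
The idea of applying shared temporal filters to each concept separately can be sketched in plain Python as follows. This is an illustrative assumption of how such a layer might operate, not the authors' implementation: the function names, the toy feature map, and the filter values are hypothetical, and padding, bias, and nonlinearities are omitted. The last line reproduces the relative improvement quoted in the abstract.

```python
def temporal_conv(signal, kernel):
    """Valid 1-D cross-correlation of one channel with one temporal filter."""
    k = len(kernel)
    return [sum(signal[t + i] * kernel[i] for i in range(k))
            for t in range(len(signal) - k + 1)]

def concept_wise_temporal_conv(feature_map, filters):
    """Apply the same bank of temporal filters to every concept (channel).

    feature_map: list of C channels, each a list of T values.
    filters: list of F kernels shared across all channels.
    Returns a C x F x T' nested list; channels are never mixed.
    """
    return [[temporal_conv(channel, kernel) for kernel in filters]
            for channel in feature_map]

# Toy input (assumed values): two concepts over four time steps.
features = [[0.0, 1.0, 2.0, 3.0], [1.0, 1.0, 1.0, 1.0]]
# Two shared filters: a temporal difference and a moving average.
filters = [[1.0, -1.0], [0.5, 0.5]]
out = concept_wise_temporal_conv(features, filters)

# Relative improvement from 42.8 to 52.1 mAP, as stated in the abstract.
relative_gain = (52.1 - 42.8) / 42.8 * 100
print(out, round(relative_gain, 1))
```

In contrast, a conventional temporal convolution would sum over all input channels for every output channel, which is the recombination of concepts the abstract identifies as harmful in deeper stacks.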

### Who You Are Decides How You Tell

• Shuang Wu
• Shaojing Fan
• Zhiqi Shen
• Mohan Kankanhalli
• Anthony K.H. Tung

Image captioning is gaining significance in multiple applications such as content-based visual search and chat-bots. Much of the recent progress in this field embraces a data-driven approach without deep consideration of human behavioural characteristics. In this paper, we focus on human-centered automatic image captioning. Our study is based on the intuition that different people will generate a variety of image captions for the same scene, as their knowledge and opinion about the scene may differ. In particular, we first perform a series of human studies to investigate what influences human description of a visual scene. We identify three main factors: a person's knowledge level of the scene, opinion on the scene, and gender. Based on our human study findings, we propose a novel human-centered algorithm that is able to generate human-like image captions. We evaluate the proposed model through traditional evaluation metrics, diversity metrics, and human-based evaluation. Experimental results demonstrate the superiority of our proposed model on generating diverse human-like image captions.

### Query Twice: Dual Mixture Attention Meta Learning for Video Summarization

• Junyan Wang
• Yang Bai
• Yang Long
• Bingzhang Hu
• Zhenhua Chai
• Yu Guan
• Xiaolin Wei

Video summarization aims to select representative frames to retain high-level information, which is usually solved by predicting the segment-wise importance score via a softmax function. However, the softmax function struggles to retain high-rank representations for complex visual or sequential information, which is known as the Softmax Bottleneck problem. In this paper, we propose a novel framework named Dual Mixture Attention (DMASum) model with Meta Learning for video summarization that tackles the softmax bottleneck problem. The Mixture of Attention layer (MoA) effectively increases the model capacity by employing self-query attention twice, which can capture second-order changes in addition to the initial query-key attention, and a novel Single Frame Meta Learning rule is then introduced to achieve better generalization to small datasets with limited training sources. Furthermore, DMASum exploits both visual and sequential attention, connecting local key-frame and global attention in an accumulative way. We adopt the new evaluation protocol on two public datasets, SumMe and TVSum. Both qualitative and quantitative experiments show significant improvements over the state-of-the-art methods.

### Textual Dependency Embedding for Person Search by Language

• Kai Niu
• Yan Huang
• Liang Wang

Person search by language aims to associate the pedestrian images with free-form natural language descriptions. Although great efforts have been made to align images with sentences, most researchers neglect the difficulty of long-distance dependency modeling in textual encoding, which is very important for solving this problem because the description sentences are always long and have complex structures for distinguishing different pedestrians. In this work, we focus on the long-distance dependencies in a sentence for better textual encoding, and accordingly propose the Textual Dependency Embedding (TDE) method. We first employ the sentence analysis tools to figure out the long-distance syntactic dependencies from a dependent to its governor in a sentence. Then we embed the dependent representations to their governor adaptively in our Governor-guided Dependent Attention Module (GDAM) to model these long-distance relations. After that, we further consider the dependency types, which also tell the importance of different dependents semantically, and embed them together with the dependents' features to clarify their inequivalent contributions to their governor. Extensive experiments and analysis on person search by language and image-text matching have validated the effectiveness of our method, and we have obtained the state-of-the-art performance on the CUHK-PEDES and Flickr30K datasets.

### Visual-Semantic Graph Matching for Visual Grounding

• Chenchen Jing
• Yuwei Wu
• Mingtao Pei
• Yao Hu
• Yunde Jia
• Qi Wu

Visual Grounding is the task of associating entities in a natural language sentence with objects in an image. In this paper, we formulate visual grounding as a graph matching problem to find node correspondences between a visual scene graph and a language scene graph. These two graphs are heterogeneous, representing structure layouts of the image and sentence, respectively. We learn unified contextual node representations of the two graphs by using a cross-modal graph convolutional network to reduce their discrepancy. The graph matching is thus relaxed as a linear assignment problem because the learned node representations characterize both node information and structure information. A permutation loss and a semantic cycle-consistency loss are further introduced to solve the linear assignment problem with or without ground-truth correspondences. Experimental results on two visual grounding tasks, i.e., referring expression comprehension and phrase localization, demonstrate the effectiveness of our method.
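
As a rough illustration of the linear assignment view, the sketch below scores every pair of language and visual nodes by the dot product of their embeddings and picks the one-to-one matching with the highest total score. The embeddings, the dot-product similarity, and the brute-force search are all assumptions made for a tiny example; the paper learns the representations and trains with dedicated losses, and a practical solver would use a polynomial-time method such as the Hungarian algorithm.

```python
from itertools import permutations

def dot(u, v):
    """Dot-product similarity between two node embeddings."""
    return sum(a * b for a, b in zip(u, v))

def match_nodes(language_nodes, visual_nodes):
    """Return the assignment maximizing total similarity.

    result[i] is the index of the visual node matched to language node i.
    Brute force over permutations, so only suitable for very small graphs.
    """
    n = len(language_nodes)
    scores = [[dot(l, v) for v in visual_nodes] for l in language_nodes]
    return max(permutations(range(len(visual_nodes)), n),
               key=lambda perm: sum(scores[i][perm[i]] for i in range(n)))

# Hypothetical 2-D node embeddings for three phrases and three image regions.
language_nodes = [[1, 0], [0, 1], [1, 1]]
visual_nodes = [[0, 2], [2, 2], [3, 0]]
assignment = match_nodes(language_nodes, visual_nodes)
print(assignment)
```

Because the objective is a sum of independent pairwise scores, any structural information has to be carried by the node embeddings themselves, which is why the abstract stresses contextual node representations.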

### LAL: Linguistically Aware Learning for Scene Text Recognition

• Yi Zheng
• Wenda Qin
• Derry Wijaya
• Margrit Betke

Scene text recognition is the task of recognizing character sequences in images of natural scenes. The considerable diversity in the appearance of text in a scene image and potentially highly complex backgrounds make text recognition challenging. Previous approaches employ character sequence generators to analyze text regions and, subsequently, compare the candidate character sequences against a language model. In this work, we propose a bimodal framework that simultaneously utilizes visual and linguistic information to enhance recognition performance. Our linguistically aware learning (LAL) method effectively learns visual embeddings using a rectifier, encoder, and attention decoder approach, and linguistic embeddings, using a deep next-character prediction model. We present an innovative way of combining these two embeddings effectively. Our experiments on eight standard benchmarks show that our method outperforms previous methods by large margins, particularly on rotated, foreshortened, and curved text. We show that the bimodal approach has a statistically significant impact. We also contribute a new dataset, and show robust performance when LAL is combined with a text detector in a pipelined text spotting framework.

## SESSION: Poster Session F3: Vision and Language

• Fen Liu
• Guanghui Xu
• Qi Wu
• Qing Du
• Wei Jia
• Mingkui Tan

We study the problem of text-based visual question answering (T-VQA) in this paper. Unlike general visual question answering (VQA) which only builds connections between questions and visual contents, T-VQA requires reading and reasoning over both texts and visual concepts that appear in images. Challenges in T-VQA mainly lie in three aspects: 1) It is difficult to understand the complex logic in questions and extract specific useful information from rich image contents to answer them; 2) The text-related questions are also related to visual concepts, but it is difficult to capture cross-modal relationships between the texts and the visual concepts; 3) If the OCR (optical character recognition) system fails to detect the target text, the training will be very difficult. To address these issues, we propose a novel Cascade Reasoning Network (CRN) that consists of a progressive attention module (PAM) and a multimodal reasoning graph (MRG) module. Specifically, the PAM regards the multimodal information fusion operation as a stepwise encoding process and uses the previous attention results to guide the next fusion process. The MRG aims to explicitly model the connections and interactions between texts and visual concepts. To alleviate the dependence on the OCR system, we introduce an auxiliary task to train the model with accurate supervision signals, thereby enhancing the reasoning ability of the model in question answering. Extensive experiments on three popular T-VQA datasets demonstrate the effectiveness of our method compared with SOTA methods. The source code is available at https://github.com/guanghuixu/CRN_tvqa.

### Jointly Cross- and Self-Modal Graph Attention Network for Query-Based Moment Localization

• Daizong Liu
• Xiaoye Qu
• Xiao-Yang Liu
• Jianfeng Dong
• Pan Zhou
• Zichuan Xu

Query-based moment localization is a new task that localizes the best matched segment in an untrimmed video according to a given sentence query. In this localization task, one should pay more attention to thoroughly mining visual and linguistic information. To this end, we propose a novel Cross- and Self-Modal Graph Attention Network (CSMGAN) that recasts this task as a process of iterative message passing over a joint graph. Specifically, the joint graph consists of a Cross-Modal interaction Graph (CMG) and a Self-Modal relation Graph (SMG), where frames and words are represented as nodes, and the relations between cross- and self-modal node pairs are described by an attention mechanism. Through parametric message passing, CMG highlights relevant instances across video and sentence, and then SMG models the pairwise relation inside each modality for frame (word) correlating. With multiple layers of such a joint graph, our CSMGAN is able to effectively capture high-order interactions between two modalities, thus enabling more precise localization. Besides, to better comprehend the contextual details in the query, we develop a hierarchical sentence encoder to enhance the query understanding. Extensive experiments on four public datasets demonstrate the effectiveness of our proposed model, and CSMGAN significantly outperforms the state-of-the-art methods.

### Text-Guided Image Inpainting

• Zijian Zhang
• Zhou Zhao
• Zhu Zhang
• Baoxing Huai
• Jing Yuan

Given a partially masked image, image inpainting aims to complete the missing region and output a plausible image. Most existing image inpainting methods complete the missing region by expanding or borrowing information from the surrounding source region, which works well when the original content in the missing region is similar to the surrounding source region. Unsatisfactory results are generated when the source region does not offer sufficient contextual information to reference. Besides, the inpainting results should be diverse, and this diversity should be controllable. Based on these observations, we propose a new inpainting problem that introduces text as guidance to direct and control the inpainting process. The main difference between this problem and previous works is that we need to ensure that the result is consistent with not only the source region but also the textual guidance during inpainting. In this way, we aim to avoid unreasonable completions while making the process controllable. We propose a progressive coarse-to-fine cross-modal generative network and adopt a text-image-text training schema to generate visually consistent and semantically coherent images. Extensive quantitative and qualitative experiments on two public datasets with captions demonstrate the effectiveness of our method.

### RT-VENet: A Convolutional Network for Real-time Video Enhancement

• Mohan Zhang
• Qiqi Gao
• Jinglu Wang
• Henrik Turbell
• David Zhao
• Jinhui Yu
• Yan Lu

Real-time video enhancement is in great demand due to the extensive usage of live video applications, but existing approaches are far from satisfying the strict requirements of speed and stability. We present a novel convolutional network that can perform high-quality enhancement on 1080p videos at 45 FPS with a single CPU, which has high potential for real-world deployment. The proposed network is designed based on a light-weight image network and further consolidated for temporal consistency with a temporal feature aggregation (TFA) module. Unlike most image translation networks that use decoders to generate target images, our network discards decoders and employs only an encoder and a small head. The network predicts color mapping functions instead of pixel values in a grid-like container that fits the CNN structure well and also makes the enhancement scalable to any video resolution. Furthermore, the temporal consistency of the output is enforced by the TFA module, which utilizes the learned temporal coherence of semantics across frames. We also demonstrate that the mapping representation is general to various enhancement tasks, such as relighting, retouching and dehazing, on benchmark datasets. Our approach achieves state-of-the-art performance and performs about 10 times faster than the current real-time method on high-resolution videos.

### Regularized Two-Branch Proposal Networks for Weakly-Supervised Moment Retrieval in Videos

• Zhu Zhang
• Zhijie Lin
• Zhou Zhao
• Jieming Zhu
• Xiuqiang He

Video moment retrieval aims to localize the target moment in a video according to the given sentence. The weakly-supervised setting only provides video-level sentence annotations during training. Most existing weakly-supervised methods apply a MIL-based framework to develop inter-sample confrontment, but ignore the intra-sample confrontment between moments with semantically similar contents. Thus, these methods fail to distinguish the target moment from plausible negative moments. In this paper, we propose a novel Regularized Two-Branch Proposal Network to simultaneously consider the inter-sample and intra-sample confrontments. Concretely, we first devise a language-aware filter to generate an enhanced video stream and a suppressed video stream. We then design the sharable two-branch proposal module to generate positive proposals from the enhanced stream and plausible negative proposals from the suppressed one for sufficient confrontment. Further, we apply proposal regularization to stabilize the training process and improve model performance. Extensive experiments show the effectiveness of our method. Our code has been released.

### Feature Reintegration over Differential Treatment: A Top-down and Adaptive Fusion Network for RGB-D Salient Object Detection

• Miao Zhang
• Yu Zhang
• Yongri Piao
• Beiqi Hu
• Huchuan Lu

Most methods for RGB-D salient object detection (SOD) utilize the same fusion strategy to explore the cross-modal complementary information at each level. However, this may ignore the different contributions that features from the two modalities make to the prediction at different levels. In this paper, we propose a novel top-down multi-level fusion structure where different fusion strategies are utilized to effectively explore the low-level and high-level features. This is achieved by designing the interweave fusion module (IFM) to effectively integrate the global information and designing the gated select fusion module (GSFM) to discriminatively select useful local information by filtering out unnecessary information from RGB and depth data. Moreover, we propose an adaptive fusion module (AFM) to reintegrate the fused cross-modal features of each level to predict a more accurate result. Comprehensive experiments on 7 challenging benchmark datasets demonstrate that our method achieves competitive performance against 14 state-of-the-art RGB-D alternative methods.

### Dual Path Interaction Network for Video Moment Localization

• Hao Wang
• Zheng-Jun Zha
• Xuejin Chen
• Zhiwei Xiong
• Jiebo Luo

Video moment localization aims to localize a specific moment in a video by a natural language query. Previous works either use alignment information to find the best-matching candidate (i.e., top-down approach) or use discrimination information to predict the temporal boundaries of the match (i.e., bottom-up approach). Little research has taken both the candidate-level alignment information and frame-level boundary information together and considered the complementarity between them. In this paper, we propose a unified top-down and bottom-up approach called Dual Path Interaction Network (DPIN), where the alignment and discrimination information are closely connected to jointly make the prediction. Our model includes a boundary prediction pathway encoding the frame-level representation and an alignment pathway extracting the candidate-level representation. The two branches of our network predict two complementary but different representations for moment localization. To enforce the consistency and strengthen the connection between the two representations, we propose a semantically conditioned interaction module. The experimental results on three popular benchmarks (i.e., TACoS, Charades-STA, and Activity-Caption) demonstrate that the proposed approach effectively localizes the relevant moment and outperforms the state-of-the-art approaches.

### Cap2Seg: Inferring Semantic and Spatial Context from Captions for Zero-Shot Image Segmentation

• Guiyu Tian
• Shuai Wang
• Jie Feng
• Li Zhou

Zero-shot image segmentation refers to the task of segmenting pixels from a specific unseen semantic class. Previous methods mainly rely on historic segmentation tasks, such as using semantic embedding or word embedding of class names to infer a new segmentation model. In this work we describe Cap2Seg, a novel solution for zero-shot image segmentation that harnesses accompanying image captions to infer spatial and semantic context for the zero-shot image segmentation task. Our main insight is that image captions often implicitly entail the occurrence of a new class in an image and its most-confident spatial distribution. We define a contextual entailment question (CEQ) that tailors BERT-like text models. Specifically, the proposed network for inferring unseen classes consists of three branches (global / local / semi-global), which infer labels of unseen classes at the image level, image-stripe level or pixel level respectively. Comprehensive experiments and ablation studies are conducted on two image benchmarks, COCO-stuff and Pascal VOC. All clearly demonstrate the effectiveness of the proposed Cap2Seg, including on a set of the hardest unseen classes (i.e., image captions do not literally contain the class names and direct matching for inference fails).

### Spatial-Temporal Knowledge Integration: Robust Self-Supervised Facial Landmark Tracking

• Congcong Zhu
• Xiaoqiang Li
• Jide Li
• Guangtai Ding
• Weiqin Tong

The diversity of training data significantly affects the tracking robustness of a model in unconstrained environments. However, existing labeled datasets for facial landmark tracking tend to be large but not diverse, and manually annotating massive numbers of clips from new, diverse videos is extremely expensive. To address these problems, we propose a Spatial-Temporal Knowledge Integration (STKI) approach. Unlike most existing methods, which rely heavily on labeled data, STKI exploits supervision from unlabeled data. Specifically, STKI integrates spatial-temporal knowledge from massive unlabeled videos, which are several orders of magnitude more diverse than existing labeled video data, for robust tracking. Our framework includes a self-supervised tracker and an image-based detector for tracking initialization. To avoid the distortion of facial shape, the tracker leverages adversarial learning to introduce a facial structure prior and temporal knowledge into cycle-consistency tracking. Meanwhile, we design a graph-based knowledge distillation method, which distills the knowledge from tracking and detection results, to improve the generalization of the detector. The fine-tuned detector can provide the tracker with high-quality tracking initialization on unconstrained videos. Extensive experimental results show that the proposed method achieves state-of-the-art performance on comprehensive evaluation datasets.

### Weakly Supervised 3D Object Detection from Point Clouds

• Zengyi Qin
• Jinglu Wang
• Yan Lu

A crucial task in scene understanding is 3D object detection, which aims to detect and localize the 3D bounding boxes of objects belonging to specific classes. Existing 3D object detectors heavily rely on annotated 3D bounding boxes during training, while these annotations could be expensive to obtain and only accessible in limited scenarios. Weakly supervised learning is a promising approach to reducing the annotation requirement, but existing weakly supervised object detectors are mostly for 2D detection rather than 3D. In this work, we propose VS3D, a framework for weakly supervised 3D object detection from point clouds without using any ground truth 3D bounding box for training. First, we introduce an unsupervised 3D proposal module that generates object proposals by leveraging normalized point cloud densities. Second, we present a cross-modal knowledge distillation strategy, where a convolutional neural network learns to predict the final results from the 3D object proposals by querying a teacher network pretrained on image datasets. Comprehensive experiments on the challenging KITTI dataset demonstrate the superior performance of our VS3D in diverse evaluation settings. The source code and pretrained models are publicly available at https://github.com/Zengyi-Qin/Weakly-Supervised-3D-Object-Detection.

### Bridging the Gap between Vision and Language Domains for Improved Image Captioning

• Fenglin Liu
• Xian Wu
• Shen Ge
• Xiaoyu Zhang
• Wei Fan
• Yuexian Zou

Image captioning has attracted extensive research interest in recent years. Due to the great disparities between vision and language, an important goal of image captioning is to link the information in the visual domain to the textual domain. However, many approaches conduct this process only in the decoder, making it hard to understand the images and generate captions effectively. In this paper, we propose to bridge the gap between the vision and language domains in the encoder, by enriching visual information with textual concepts, to achieve deep image understanding. To this end, we propose to explore textual-enriched image features. Specifically, we introduce two modules, namely the Textual Distilling Module and the Textual Association Module. The former distills relevant textual concepts from image features, while the latter further associates the extracted concepts according to their semantics. In this manner, we acquire textual-enriched image features, which provide clear textual representations of the image under no explicit supervision. The proposed approach can be used as a plugin and easily embedded into a wide range of existing image captioning systems. We conduct extensive experiments on two benchmark image captioning datasets, i.e., MSCOCO and Flickr30k. The experimental results and analysis show that, by incorporating the proposed approach, all baseline models receive consistent improvements over all metrics, with the most significant improvements reaching 10% and 9% in terms of the task-specific metrics CIDEr and SPICE, respectively. The results demonstrate that our approach is effective and generalizes well to a wide range of models for image captioning.

### STRONG: Spatio-Temporal Reinforcement Learning for Cross-Modal Video Moment Localization

• Da Cao
• Yawen Zeng
• Meng Liu
• Xiangnan He
• Meng Wang
• Zheng Qin

In this article, we tackle the cross-modal video moment localization issue, namely, localizing the most relevant video moment in an untrimmed video given a sentence as the query. The majority of existing methods focus on generating video moment candidates with the help of multi-scale sliding window segmentation. They hence inevitably suffer from numerous candidates, which results in a less effective retrieval process. In addition, spatial scene tracking is crucial for realizing the video moment localization process, but it is rarely considered in traditional techniques. To this end, we contribute a spatial-temporal reinforcement learning framework. Specifically, we first exploit temporal-level reinforcement learning to dynamically adjust the boundary of the localized video moment instead of using the traditional window segmentation strategy, which accelerates the localization process. Thereafter, spatial-level reinforcement learning is proposed to track the scene on consecutive image frames, thereby filtering out less relevant information. Lastly, an alternative optimization strategy is proposed to jointly optimize the temporal- and spatial-level reinforcement learning, so that the two tasks of temporal boundary localization and spatial scene tracking are mutually reinforced. By experimenting on two real-world datasets, we demonstrate the effectiveness and rationality of our proposed solution.

### Language-Aware Fine-Grained Object Representation for Referring Expression Comprehension

• Heqian Qiu
• Hongliang Li
• Qingbo Wu
• Fanman Meng
• Hengcan Shi
• Taijin Zhao
• King Ngi Ngan

Referring expression comprehension aims to accurately locate an object described by a language expression, which requires precise language-aware visual object representations. However, existing methods usually use rectangular object representations, such as object proposal regions and grid regions. They ignore some fine-grained object information like shapes and poses, which are often described in language expressions and are important for localizing objects. Additionally, rectangular object regions usually contain background contents and irrelevant foreground features, which also decrease the localization performance. To address these problems, we propose a language-aware deformable convolution model (LDC) to learn language-aware fine-grained object representations. Rather than extracting rectangular object representations, LDC adaptively samples a set of key points based on the image and language to represent objects. This type of object representation can capture more fine-grained object information (e.g., shapes and poses) and suppress noise in accordance with language, and thus boost the object localization performance. Based on the language-aware fine-grained object representation, we next design a bidirectional interaction model (BIM) that leverages a modified co-attention mechanism to build cross-modal bidirectional interactions to further improve the language and object representations. Furthermore, we propose a hierarchical fine-grained representation network (HFRN) to learn language-aware fine-grained object representations and cross-modal bidirectional interactions at the local word level and the global sentence level, respectively. Our proposed method outperforms the state-of-the-art methods on the RefCOCO, RefCOCO+ and RefCOCOg datasets.

### Hierarchical Scene Graph Encoder-Decoder for Image Paragraph Captioning

• Xu Yang
• Chongyang Gao
• Hanwang Zhang
• Jianfei Cai

When we humans tell a long paragraph about an image, we usually first implicitly compose a mental "script" and then comply with it to generate the paragraph. Inspired by this, we endow the modern encoder-decoder based image paragraph captioning model with such an ability by proposing the Hierarchical Scene Graph Encoder-Decoder (HSGED) for generating coherent and distinctive paragraphs. In particular, we use the image scene graph as the "script" to incorporate rich semantic knowledge and, more importantly, the hierarchical constraints into the model. Specifically, we design a sentence scene graph RNN (SSG-RNN) to generate sub-graph level topics, which constrain the word scene graph RNN (WSG-RNN) to generate the corresponding sentences. We propose irredundant attention in SSG-RNN to improve the possibility of abstracting topics from rarely described sub-graphs and inheriting attention in WSG-RNN to generate more grounded sentences with the abstracted topics, both of which give rise to more distinctive paragraphs. An efficient sentence-level loss is also proposed for encouraging the sequence of generated sentences to be similar to that of the ground-truth paragraphs. We validate HSGED on the Stanford image paragraph dataset and show that it not only achieves a new state-of-the-art 36.02 CIDEr-D, but also generates more coherent and distinctive paragraphs under various metrics.

## SESSION: Poster Session G3: Vision and Language

### Improving Intra- and Inter-Modality Visual Relation for Image Captioning

• Yong Wang
• WenKai Zhang
• Qing Liu
• Zhengyuan Zhang
• Xin Gao
• Xian Sun

It is widely accepted that capturing relationships among multi-modality features is helpful for representing and ultimately describing an image. In this paper, we present a novel Intra- and Inter-modality visual Relation Transformer, termed I2RT, to improve connections among visual features. Firstly, we propose a Relation Enhanced Transformer Block (RETB) for image feature learning, which strengthens intra-modality visual relations among objects. Moreover, to bridge the gap between inter-modality feature representations, we align them explicitly via a Visual Guided Alignment (VGA) module. Finally, an end-to-end formulation is adopted to train the whole model jointly. Experiments on the MS-COCO dataset show the effectiveness of our model, leading to improvements on all commonly used metrics on the "Karpathy" test split. Extensive ablation experiments are conducted for a comprehensive analysis of the proposed method.

### Exploring Language Prior for Mode-Sensitive Visual Attention Modeling

• Xiaoshuai Sun
• Xuying Zhang
• Liujuan Cao
• Yongjian Wu
• Feiyue Huang
• Rongrong Ji

Modeling the human visual attention mechanism is a fundamental problem for the understanding of human vision, and such modeling has also been demonstrated to be an important module for various multimedia applications such as image captioning and visual question answering. In this paper, we propose a new probabilistic framework for attention, and introduce the concept of mode to model the flexibility and adaptability of attention modulation in complex environments. We characterize the correlations between the visual input, the activated mode, the saliency and the spatial allocation of attention via a graphical model representation, based on which we explore the lingual guidance from captioning data for the implementation of a mode-sensitive attention (MSA) model. The proposed framework explicitly justifies the usage of center bias for fixation prediction and can convert an arbitrary learning-based backbone attention model to a more robust multi-mode version. Experimental results on the York120, MIT1003 and PASCAL datasets demonstrate the effectiveness of the proposed method.

### Topic Adaptation and Prototype Encoding for Few-Shot Visual Storytelling

• Jiacheng Li
• Siliang Tang
• Juncheng Li
• Jun Xiao
• Fei Wu
• Shiliang Pu
• Yueting Zhuang

Visual Storytelling (VIST) is the task of telling a narrative story about a certain topic according to a given photo stream. Existing studies focus on designing complex models, which rely on a huge amount of human-annotated data. However, the annotation of VIST is extremely costly and many topics cannot be covered in the training dataset due to the long-tail topic distribution. In this paper, we focus on enhancing the generalization ability of the VIST model by considering the few-shot setting. Inspired by the way humans tell a story, we propose a topic adaptive storyteller to model the ability of inter-topic generalization. In practice, we apply a gradient-based meta-learning algorithm to multi-modal seq2seq models to endow the model with the ability to adapt quickly from topic to topic. Besides, we further propose a prototype encoding structure to model the ability of intra-topic derivation. Specifically, we encode and restore the few training story texts to serve as a reference to guide the generation at inference time. Experimental results show that topic adaptation and the prototype encoding structure mutually benefit the few-shot model on the BLEU and METEOR metrics. A further case study shows that the stories generated after few-shot adaptation are more relevant and expressive.

### ICECAP: Information Concentrated Entity-aware Image Captioning

• Anwen Hu
• Shizhe Chen
• Qin Jin

Most current image captioning systems focus on describing general image content, and lack the background knowledge needed to deeply understand the image, such as exact named entities or concrete events. In this work, we focus on the entity-aware news image captioning task, which aims to generate informative captions by leveraging the associated news articles to provide background knowledge about the target image. However, due to the length of news articles, previous works only employ news articles at the coarse article or sentence level, which is not fine-grained enough to refine relevant events and choose named entities accurately. To overcome these limitations, we propose an Information Concentrated Entity-aware news image CAPtioning (ICECAP) model, which progressively concentrates on relevant textual information within the corresponding news article from the sentence level to the word level. Our model first creates coarse concentration on relevant sentences using a cross-modality retrieval model and then generates captions by further concentrating on relevant words within the sentences. Extensive experiments on both the BreakingNews and GoodNews datasets demonstrate the effectiveness of our proposed method, which outperforms other state-of-the-art methods.

### Attacking Image Captioning Towards Accuracy-Preserving Target Words Removal

• Jiayi Ji
• Xiaoshuai Sun
• Yiyi Zhou
• Rongrong Ji
• Fuhai Chen
• Jianzhuang Liu
• Qi Tian

In this paper, we investigate the fragility of deep image captioning models against adversarial attacks. Different from existing works that generate common words and concepts, we focus on adversarial attacks towards controllable image captioning, i.e., removing target words from captions by imposing adversarial noise on images while maintaining the captioning accuracy for the remaining visual content. We name this new task Masked Image Captioning (MIC), which is expected to be training- and labeling-free for end-to-end captioning models. Meanwhile, we propose a novel adversarial learning approach for this new task, termed Show, Mask, and Tell (SMT), which crafts adversarial examples to mask the target concepts via minimizing an objective loss while training the noise generator. Concretely, three novel designs are introduced in this loss, i.e., word removal regularization, captioning accuracy regularization, and noise filtering regularization. For quantitative validation, we propose a benchmark dataset for MIC based on the MS COCO dataset, together with a new evaluation metric called Attack Quality. Experimental results show that the proposed approach achieves successful attacks by removing 93.8% and 91.9% of target words while maintaining 97.3% and 97.4% accuracies on two cutting-edge captioning models, respectively.

### ConsNet: Learning Consistency Graph for Zero-Shot Human-Object Interaction Detection

• Ye Liu
• Junsong Yuan
• Chang Wen Chen

We consider the problem of Human-Object Interaction (HOI) Detection, which aims to locate and recognize HOI instances in the form of <human, action, object> in images. Most existing works treat HOIs as individual interaction categories, and thus cannot handle the problems of long-tail distribution and polysemy of action labels. We argue that multi-level consistencies among objects, actions and interactions are strong cues for generating semantic representations of rare or previously unseen HOIs. Leveraging the compositional and relational peculiarities of HOI labels, we propose ConsNet, a knowledge-aware framework that explicitly encodes the relations among objects, actions and interactions into an undirected graph called a consistency graph, and exploits Graph Attention Networks (GATs) to propagate knowledge among HOI categories as well as their constituents. Our model takes visual features of candidate human-object pairs and word embeddings of HOI labels as inputs, maps them into a visual-semantic joint embedding space and obtains detection results by measuring their similarities. We extensively evaluate our model on the challenging V-COCO and HICO-DET datasets, and the results validate that our approach outperforms state-of-the-art methods under both fully-supervised and zero-shot settings.
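
The final matching step, scoring a candidate human-object pair against HOI label embeddings in a joint space, can be sketched as follows. The abstract only says that similarities are measured; the use of cosine similarity, the function names, and the toy two-dimensional embeddings are assumptions made for illustration, and the embedding networks themselves are not shown.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_hoi_labels(pair_embedding, label_embeddings):
    """Rank HOI labels by similarity to one human-object pair embedding.

    pair_embedding: vector assumed to already live in the joint space.
    label_embeddings: dict mapping a label string to its joint-space vector.
    Returns (label, score) tuples sorted from most to least similar.
    """
    scores = {
        label: cosine(pair_embedding, emb)
        for label, emb in label_embeddings.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Because labels are compared through embeddings rather than fixed classifier outputs, a label unseen during training can still be ranked as long as an embedding is available for it, which is the property the zero-shot setting relies on.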

### ChefGAN: Food Image Generation from Recipes

• Siyuan Pan
• Ling Dai
• Xuhong Hou
• Huating Li
• Bin Sheng

Although significant progress has been made in generating images from text using generative adversarial networks (GANs), it is still challenging to deal with long text that contains complex semantic information, such as recipes. This paper focuses on generating images with high visual realism and semantic consistency from the complex text of recipes. To achieve this, we propose a GAN-based method termed ChefGAN. The critical concept of ChefGAN is that a joint image-recipe embedding model is used before the generation task to provide high-quality representations of recipes, and it acts as an extra regularization during the generation to improve semantic consistency. Two modules are designed for this: an image-text embedding module (ITEM) and a cascaded image generation module (CIGM). The generation process is carried out in 3 steps: (1) Two encoders in ITEM are trained simultaneously to generate similar representations for each image-recipe pair. (2) CIGM generates images according to the representations from ITEM's text encoder. (3) The generated image is fed into ITEM's image encoder to calculate the similarity with the given recipe. This process provides an additional regularization effect beyond that of a discriminator. To facilitate convergence, we apply a two-stage training strategy, which generates an image with low resolution and then one with high resolution in the CIGM module. Compared with other representative state-of-the-art methods, ChefGAN demonstrates better performance in both visual realism and semantic consistency.

### Dual Hierarchical Temporal Convolutional Network with QA-Aware Dynamic Normalization for Video Story Question Answering

• Fei Liu
• Jing Liu
• Xinxin Zhu
• Richang Hong
• Hanqing Lu

Video story question answering (video story QA) is a challenging problem, as it requires a joint understanding of diverse data sources (i.e., video, subtitle, question, and answer choices). Existing approaches for video story QA have several common defects: (1) single temporal scale; (2) static and rough multimodal interaction; and (3) insufficient (or shallow) exploitation of both question and answer choices. In this paper, we propose a novel framework named Dual Hierarchical Temporal Convolutional Network (DHTCN) to address the aforementioned defects together. The proposed DHTCN explores multiple temporal scales by building a hierarchical temporal convolutional network. In each temporal convolutional layer, two key components, namely AttLSTM and QA-Aware Dynamic Normalization, are introduced to capture the temporal dependency and the multimodal interaction in a dynamic and fine-grained manner. To enable sufficient exploitation of both question and answer choices, we increase the depth of QA pairs with a stack of non-linear layers, and exploit QA pairs in each layer of the network. Extensive experiments are conducted on two widely used datasets, TVQA and MovieQA, demonstrating the effectiveness of DHTCN. Our model obtains state-of-the-art results on both datasets.

### Generalized Zero-Shot Learning using Generated Proxy Unseen Samples and Entropy Separation

• Omkar Gune
• Biplab Banerjee
• Subhasis Chaudhuri
• Fabio Cuzzolin

Recent generative model-driven Generalized Zero-shot Learning (GZSL) techniques overcome the prevailing issue of model bias towards the seen classes by synthesizing visual samples of the unseen classes from the corresponding semantic prototypes. Although such approaches significantly improve GZSL performance through data augmentation, they violate the principal assumption of GZSL regarding the unavailability of semantic information of unseen classes during training. In this work, we propose to use a generative model (GAN) for synthesizing visual proxy samples while strictly adhering to the standard assumptions of GZSL. The aforementioned proxy samples are generated by exploring the early training regime of the GAN. We hypothesize that such proxy samples can effectively be used to characterize the average entropy of the label distribution of the samples from the unseen classes. Further, we train a classifier on the visual samples from the seen classes and the proxy samples using an entropy separation criterion, such that the average entropy of the label distribution is low for the visual samples from the seen classes and high for the proxy samples. This entropy separation criterion generalizes well during testing, where the samples from the unseen classes exhibit higher entropy than the samples from the seen classes. Subsequently, low and high entropy samples are classified using supervised learning and ZSL, respectively, rather than GZSL. We show the superiority of the proposed method by experimenting on the AWA1, CUB, HMDB51, and UCF101 datasets.
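
The test-time routing implied by entropy separation can be sketched as below: compute the Shannon entropy of a sample's predicted label distribution and send low-entropy samples to the supervised (seen-class) classifier and high-entropy samples to the ZSL classifier. The function names, the softmax over raw logits, and the fixed `threshold` argument are illustrative assumptions; the abstract does not state how the decision boundary is chosen.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def route(logits, threshold):
    """Return "seen" for low-entropy predictions, "unseen" otherwise.

    threshold is a hypothetical cut-off; in practice it would have to be
    selected on held-out data.
    """
    return "seen" if entropy(softmax(logits)) < threshold else "unseen"
```

A confident prediction such as `route([10.0, 0.0, 0.0, 0.0], threshold=0.5)` is routed to the seen-class branch, while a flat prediction such as `route([0.0, 0.0, 0.0, 0.0], threshold=0.5)`, whose entropy equals ln 4, is routed to the unseen-class branch.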

### Answer-Driven Visual State Estimator for Goal-Oriented Visual Dialogue

• Zipeng Xu
• Fangxiang Feng
• Xiaojie Wang
• Yushu Yang
• Huixing Jiang
• Zhongyuan Wang

### Fine-grained Iterative Attention Network for Temporal Language Localization in Videos

• Xiaoye Qu
• Pengwei Tang
• Zhikang Zou
• Yu Cheng
• Jianfeng Dong
• Pan Zhou
• Zichuan Xu

Temporal language localization in videos aims to ground one video segment in an untrimmed video based on a given sentence query. To tackle this task, designing an effective model to extract grounding information from both visual and textual modalities is crucial. However, most previous attempts in this field only focus on unidirectional interactions from video to query, which emphasize which words to listen to and attend to sentence information via vanilla soft attention, while clues from query-by-video interactions implying where to look are not taken into consideration. In this paper, we propose a Fine-grained Iterative Attention Network (FIAN) that consists of an iterative attention module for bilateral query-video information extraction. Specifically, in the iterative attention module, each word in the query is first enhanced by attending to each frame in the video through fine-grained attention, then the video iteratively attends to the integrated query. Finally, both video and query information is utilized to provide a robust cross-modal representation for further moment localization. In addition, to better predict the target segment, we propose a content-oriented localization strategy instead of applying recent anchor-based localization. We evaluate the proposed method on three challenging public benchmarks: ActivityNet Captions, TACoS, and Charades-STA. FIAN significantly outperforms the state-of-the-art approaches.

### Hierarchical Bi-Directional Feature Perception Network for Person Re-Identification

• Zhipu Liu
• Lei Zhang
• Yang Yang

Previous Person Re-Identification (Re-ID) models aim to focus on the most discriminative region of an image, but their performance may be compromised when that region is missing because of camera viewpoint changes or occlusion. To solve this issue, we propose a novel model named Hierarchical Bi-directional Feature Perception Network (HBFP-Net), in which multi-level information is correlated so that the levels reinforce each other. First, the correlation maps of cross-level feature pairs are modeled via low-rank bilinear pooling. Then, based on the correlation maps, a Bi-directional Feature Perception (BFP) module is employed to enrich the attention regions of the high-level feature and to learn abstract and specific information in the low-level feature. Next, we propose a novel end-to-end hierarchical network which integrates multi-level augmented features and feeds the augmented low- and middle-level features to the following layers to retrain a new powerful network. In addition, we propose a novel trainable generalized pooling, which can dynamically select any value of all locations in feature maps to be activated. Extensive experiments on mainstream evaluation datasets, including Market-1501, CUHK03 and DukeMTMC-ReID, show that our method outperforms recent SOTA Re-ID models.

### Hard Negative Samples Emphasis Tracker without Anchors

• Zhongzhou Zhang
• Lei Zhang

Trackers based on Siamese networks have shown tremendous success because of their balance between accuracy and speed. Nevertheless, as tracking scenarios become more and more sophisticated, most existing Siamese-based approaches ignore the problem of distinguishing the tracking target from hard negative samples in the tracking phase. The features learned by these networks lack discrimination, which significantly weakens the robustness of Siamese-based trackers and leads to suboptimal performance. To address this issue, we propose a simple yet efficient hard negative samples emphasis method, which constrains the Siamese network to learn features that are aware of hard negative samples and enhances the discrimination of the embedding features. Through a distance constraint, we shorten the distance between the exemplar vector and positive vectors while enlarging the distance between the exemplar vector and hard negative vectors. Furthermore, we explore a novel anchor-free tracking framework in a per-pixel prediction fashion, which can significantly reduce the number of hyper-parameters and simplify the tracking process by taking full advantage of the representation of the convolutional neural network. Extensive experiments on six standard benchmark datasets demonstrate that the proposed method performs favorably against state-of-the-art approaches.
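
The distance constraint mentioned above resembles a margin-based (triplet-style) objective. The sketch below shows one such objective, assuming Euclidean distance and a hinge with a hypothetical `margin`. The abstract does not state the exact form used by the authors, so this is an illustration only.

```python
import math

def euclidean(a, b):
    # Euclidean distance between two embedding vectors of equal length.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def hard_negative_loss(exemplar, positives, hard_negatives, margin=1.0):
    # Mean distance from the exemplar to the positive embeddings.
    pos = sum(euclidean(exemplar, p) for p in positives) / len(positives)
    # Mean distance from the exemplar to the hard negative embeddings.
    neg = sum(euclidean(exemplar, n) for n in hard_negatives) / len(hard_negatives)
    # Zero once the negatives are at least `margin` farther than the positives.
    return max(0.0, pos - neg + margin)
```

Minimizing this quantity pulls positive embeddings towards the exemplar and pushes hard negatives away until the margin is satisfied, after which the term contributes no gradient.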

### JointFontGAN: Joint Geometry-Content GAN for Font Generation via Few-Shot Learning

• Yankun Xi
• Guoli Yan
• Jing Hua
• Zichun Zhong

Automatic generation of font and text design in the wild is a challenging task since font and text in real world exhibit various visual effects. In this paper, we propose a novel model, JointFontGAN, to derive fonts, including both geometric structures and shape contents in correctness and consistency with very few font samples available. Specifically, we design an end-to-end deep learning based approach for font generation through the new multi-stream extended conditional generative adversarial network (XcGAN) models, which jointly learn and generate both font skeleton and glyph representations simultaneously. It can adapt to the geometric variability and content scalability at the neural network level. Then, we apply it, along with the developed efficient and effective one-stage model, to text generations in letters and sentences / paragraphs with both standard and artistic / handwriting styles. The extensive experiments and comparisons demonstrate that our approach outperforms the state-of-the-art methods on the collected datasets including 20K fonts (letters and punctuations) with different styles.

## SESSION: Poster Session H3: Vision and Language

### DeepRhythm: Exposing DeepFakes with Attentional Visual Heartbeat Rhythms

• Hua Qi
• Qing Guo
• Felix Juefei-Xu
• Xiaofei Xie
• Lei Ma
• Wei Feng
• Yang Liu
• Jianjun Zhao

As GAN-based face image and video generation techniques, widely known as DeepFakes, have become increasingly mature and realistic, there is a pressing demand for effective DeepFake detectors. Motivated by the fact that remote visual photoplethysmography (PPG) is made possible by monitoring the minuscule periodic changes of skin color due to blood pumping through the face, we conjecture that normal heartbeat rhythms found in real face videos will be disrupted or even entirely broken in a DeepFake video, making them a potentially powerful indicator for DeepFake detection. In this work, we propose DeepRhythm, a DeepFake detection technique that exposes DeepFakes by monitoring the heartbeat rhythms. DeepRhythm utilizes dual-spatial-temporal attention to adapt to dynamically changing face and fake types. Extensive experiments on the FaceForensics++ and DFDC-preview datasets have confirmed our conjecture and demonstrated not only the effectiveness, but also the generalization capability of DeepRhythm over different datasets produced by various DeepFake generation techniques and multifarious challenging degradations.

• Jinglin Liu
• Yi Ren
• Zhou Zhao
• Chen Zhang
• Baoxing Huai
• Jing Yuan

### Multimodal Attention with Image Text Spatial Relationship for OCR-Based Image Captioning

• Jing Wang
• Jinhui Tang
• Jiebo Luo

OCR-based image captioning is the task of automatically describing images based on reading and understanding written text contained in images. Compared to conventional image captioning, this task is more challenging, especially when the image contains multiple text tokens and visual objects. The difficulties originate from how to make full use of the knowledge contained in the textual entities to facilitate sentence generation and how to predict a text token based on the limited information provided by the image. Such problems are not yet fully investigated in existing research. In this paper, we present a novel design - Multimodal Attention Captioner with OCR Spatial Relationship (dubbed as MMA-SR) architecture, which manages information from different modalities with a multimodal attention network and explores spatial relationships between text tokens for OCR-based image captioning. Specifically, the representations of text tokens and objects are fed into a three-layer LSTM captioner. Different attention scores for text tokens and objects are exploited through the multimodal attention network. Based on the attended features and the LSTM states, words are selected from the common vocabulary or from the image text by incorporating the learned spatial relationships between text tokens. Extensive experiments conducted on the TextCaps dataset verify the effectiveness of the proposed MMA-SR method. More remarkably, our MMA-SR increases CIDEr-D score from 93.7% to 98.0%.

• Yi Zhang
• Jitao Sang

### Learning Semantic Concepts and Temporal Alignment for Narrated Video Procedural Captioning

• Botian Shi
• Lei Ji
• Zhendong Niu
• Nan Duan
• Ming Zhou
• Xilin Chen

Video captioning is a fundamental task for visual understanding. Previous works employ end-to-end networks to learn from low-level vision features and generate descriptive captions; these networks struggle to recognize fine-grained objects and lack an understanding of crucial semantic concepts. According to DPC [19], these concepts are generally present in the narrative transcripts of instructional videos, and incorporating the transcript with the video can improve captioning performance. However, DPC directly concatenates the embedding of the transcript with video features, which cannot fuse language and vision features effectively and leads to temporal misalignment between transcript and video. This motivates us to 1) learn the semantic concepts explicitly and 2) design a temporal alignment mechanism to better align the video and transcript for the captioning task. In this paper, we start with an encoder-decoder backbone using transformer models. Firstly, we design a semantic concept prediction module as a multi-task objective to train the encoder in a supervised way. Then, we develop an attention-based cross-modality temporal alignment method that combines the sequential video frames and transcript sentences. Finally, we adopt a copy mechanism to enable the decoder (generation) module to copy important concepts directly from the source transcript. Extensive experimental results demonstrate the effectiveness of our model, which achieves state-of-the-art results on the YouCookII dataset.

### LGNN: A Context-aware Line Segment Detector

• Quan Meng
• Jiakai Zhang
• Qiang Hu
• Xuming He
• Jingyi Yu

We present a novel real-time line segment detection scheme called Line Graph Neural Network (LGNN). Existing approaches require a computationally expensive verification or postprocessing step. Our LGNN employs a deep convolutional neural network (DCNN) for proposing line segments directly, with a graph neural network (GNN) module for reasoning about their connectivity. Specifically, LGNN exploits a new quadruplet representation for each segment, where the GNN module takes the predicted candidates as vertices and constructs a sparse graph to enforce structural context. Compared with the state of the art, LGNN achieves near real-time performance without compromising accuracy. LGNN further enables time-sensitive 3D applications. When a 3D point cloud is accessible, we present a multi-modal line segment classification technique for extracting a 3D wireframe of the environment robustly and efficiently.

### DeVLBert: Learning Deconfounded Visio-Linguistic Representations

• Shengyu Zhang
• Tan Jiang
• Tan Wang
• Kun Kuang
• Zhou Zhao
• Jianke Zhu
• Jin Yu
• Hongxia Yang
• Fei Wu

In this paper, we propose to investigate the problem of out-of-domain visio-linguistic pretraining, where the pretraining data distribution differs from that of the downstream data on which the pretrained model will be fine-tuned. Existing methods for this problem are purely likelihood-based, which leads to spurious correlations and hurts generalization ability when the model is transferred to out-of-domain downstream tasks. By spurious correlation, we mean that the conditional probability of one token (object or word) given another one can be high (due to dataset biases) without a robust (causal) relationship between them. To mitigate such dataset biases, we propose a Deconfounded Visio-Linguistic Bert framework, abbreviated as DeVLBert, to perform intervention-based learning. We borrow the idea of the backdoor adjustment from the research field of causality and propose several neural-network based architectures for Bert-style out-of-domain pretraining. The quantitative results on three downstream tasks, Image Retrieval (IR), Zero-shot IR, and Visual Question Answering, show the effectiveness of DeVLBert in boosting generalization ability.
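
The backdoor adjustment referred to above replaces the observational quantity P(y | x) = Σ_z P(y | x, z) P(z | x) with the interventional quantity P(y | do(x)) = Σ_z P(y | x, z) P(z), where z ranges over the values of a confounder. The toy sketch below computes both for a discrete confounder. The function names and any numbers passed to them are illustrative assumptions, and the sketch does not describe DeVLBert's neural architectures.

```python
def observational(p_y_given_x_z, p_z_given_x):
    # P(y | x): the confounder is weighted by how often it co-occurs with x,
    # so dataset bias in P(z | x) leaks into the estimate.
    return sum(p_y_given_x_z[z] * p_z_given_x[z] for z in p_z_given_x)

def backdoor_adjust(p_y_given_x_z, p_z):
    # P(y | do(x)): the confounder is weighted by its marginal P(z),
    # which removes the dependence of z on x.
    return sum(p_y_given_x_z[z] * p_z[z] for z in p_z)
```

For example, with two hypothetical contexts where P(y | x, z) is 0.8 and 0.2, a biased P(z | x) of 0.9 and 0.1 gives an observational estimate of 0.74, whereas a uniform marginal P(z) gives an adjusted estimate of 0.5.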

### Sequential Attention GAN for Interactive Image Editing

• Yu Cheng
• Zhe Gan
• Yitong Li
• Jingjing Liu
• Jianfeng Gao

Most existing text-to-image synthesis tasks are static single-turn generation, based on pre-defined textual descriptions of images. To explore more practical and interactive real-life applications, we introduce a new task - Interactive Image Editing, where users can guide an agent to edit images via multi-turn textual commands on-the-fly. In each session, the agent takes a natural language description from the user as the input, and modifies the image generated in previous turn to a new design, following the user description. The main challenges in this sequential and interactive image generation task are two-fold: 1) contextual consistency between a generated image and the provided textual description; 2) step-by-step region-level modification to maintain visual consistency across the generated image sequence in each session. To address these challenges, we propose a novel Sequential Attention Generative Adversarial Network (SeqAttnGAN), which applies a neural state tracker to encode the previous image and the textual description in each turn of the sequence, and uses a GAN framework to generate a modified version of the image that is consistent with the preceding images and coherent with the description. To achieve better region-specific refinement, we also introduce a sequential attention mechanism into the model. To benchmark on the new task, we introduce two new datasets, Zap-Seq and DeepFashion-Seq, which contain multi-turn sessions with image-description sequences in the fashion domain. Experiments on both datasets show that the proposed SeqAttnGAN model outperforms state-of-the-art approaches on the interactive image editing task across all evaluation metrics including visual quality, image sequence coherence and text-image consistency.

## SESSION: Interactive Art Session

### Portraits of No One: An Internet Artwork

• Tiago Martins
• João Correia
• Sérgio Rebelo
• João Bicker

Portraits of No One is an internet artwork that generates and displays artificial photo-realistic portraits of human faces. This artwork assumes the form of a web page that synthesises new portraits by automatically recombining the facial features of the users who interacted with it. The generated portraits invoke the capabilities of Artificial Intelligence to generate visual content that makes people question themselves about the veracity of what they are seeing.

### MaLiang: An Emotion-driven Chinese Calligraphy Artwork Composition System

• Ruixue Liu
• Shaozu Yuan
• Meng Chen
• Baoyang Chen
• Zhijie Qiu
• Xiaodong He

We present a novel Chinese calligraphy artwork composition system (MaLiang) which can generate aesthetic, stylistic and diverse calligraphy images based on the emotional status of the input text. Different from previous research, it is the first work to endow calligraphy synthesis with the ability to express fickle emotions and to compose a whole piece of discourse-level calligraphy artwork instead of single character images. The system consists of three modules: emotion detection, character image generation, and layout prediction. As a creative form of interactive art, MaLiang has been exhibited in several famous international art festivals.

### First Impression: AI Understands Personality

• Xiaohui Wang
• Xia Liang
• Miao Lu
• Jingyan Qin

When you first encounter a person, a mental image of that person is formed. First impression, an interactive art, is proposed to let AI understand human personality at first glance. The mental image is demonstrated by Beijing opera facial makeups, which shows the character personality with a combination of realism and symbolism. We build Beijing opera facial makeup dataset and semantic dataset of facial features to establish relationships among real faces, personalities and facial makeups. First impression detects faces, recognizes personality from facial appearance and finds the matching Beijing opera facial makeup. Finally, the morphing process from real face to facial makeup is shown to let users enjoy the process of AI understanding personality.

### Draw Portraits by Music: A Music based Image Style Transformation

• Siyu Jin
• Jingyan Qin
• Wenfa Li

"Draw portraits by music", an interactive work of art. Compared with music visualization and image style conversion, it's AI's imitation of human synaesthetic. New portraits gradually appear on the screen and are synchronized with music in real-time. Users select music and images as the main interactive contents, the parameters of the music are used as the dynamic expression of human emotions, and the new pixel generation process of the image is regarded as the result of emotions affecting humans.

### Little World: Virtual Humans Accompany Children on Dramatic Performance

• Xiaohui Wang
• Xiaoxue Ding
• Jinke Li
• Jingyan Qin

Every child is the leading actor in her/his unique world. To help them achieve performance, an interactive art called 'little world' is proposed to let virtual humans accompany children on drama performance. Theatrical adaptation rewrites the novel to an interactive drama suitable for children. Little world builds drama scenes and virtual humans for characters and lets children interact with them by speech and actions.

### Keep Running - AI Paintings of Horse Figure and Portrait

• James She
• Carmen Ng

"Keep Running" is a collection of human and machine generated paintings using a generative adversarial network technology. The horse artworks are produced during the lockdown period in the Middle East due to the Covid-19. Many recent AI artworks are either generated in photo-realistic style, or abstract style with distorted faces, fragmented figures and a combination of unknown objects. Besides the cultural and historic symbols that horses represent in this region, what's unique with our work is showing the possibility of using AI to create horse paintings with distinguishable features and forms, while still rendering different aesthetic and even sentimental expressions in the horse paintings. Our first artwork is a series of storytelling-like paintings of an evolving horse figure in motion with changing backgrounds. Another one is a set of different horse portrait paintings that are presented in a grid with each of them evolved and generated stylishly from the same yet repeated machine processes. Our AI artworks are not just artistic and meaningful, but also paying a salute to the early works of machine-assisted art by Eadweard Muybridge and Andy Warhol, for their influences to the art world today.

### AI Mirror: Visualize AI's Self-knowledge

• Siyu Hu
• Bo Shui
• Siyu Jin
• Xiaohui Wang

"AI mirror", an interactive art, tends to visualize the self-knowledge mechanism from the AI's perspective, and arouses people's reflection on artificial intelligence. In the first stage of the unconscious imitation, the visual neurons perceive environmental information and mirror neurons imitate human behavior. Then, the language and consciousness are generated from the long term of imitation, denoted as poet and coordinates in an affective space. In the final stage of conscious behavior, an affinity analysis is generated, and the mirror neurons will behave more harmoniously with the user or have the autonomous movements on its own, which evokes the user's reflection on its undiscovered traits.

## SESSION: Brave New Ideas Session

### Image Sentiment Transfer

• Tianlang Chen
• Wei Xiong
• Haitian Zheng
• Jiebo Luo

In this work, we introduce an important but still unexplored research task -- image sentiment transfer. Compared with other related tasks that have been well-studied, such as image-to-image translation and image style transfer, transferring the sentiment of an image is more challenging. Given an input image, the rule to transfer the sentiment of each contained object can be completely different, making existing approaches that perform global image transfer by a single reference image inadequate to achieve satisfactory performance. In this paper, we propose an effective and flexible framework that performs image sentiment transfer at the object level. It first detects the objects and extracts their pixel-level masks, and then performs object-level sentiment transfer guided by multiple reference images for the corresponding objects. For the core object-level sentiment transfer, we propose a novel Sentiment-aware GAN (SentiGAN). Both global image-level and local object-level supervisions are imposed to train SentiGAN. More importantly, an effective content disentanglement loss cooperating with a content alignment step is applied to better disentangle the residual sentiment-related information of the input image. Extensive quantitative and qualitative experiments are performed on the object-oriented VSO dataset we create, demonstrating the effectiveness of the proposed framework.

### Personal Food Model

• Ali Rostami
• Vaibhav Pandey
• Nitish Nag
• Vesper Wang
• Ramesh Jain

Food is central to life. Food provides us with energy and foundational building blocks for our body and is also a major source of joy and new experiences. A significant part of the overall economy is related to food. Food science, distribution, processing, and consumption have been addressed by different communities using silos of computational approaches. In this paper, we adopt a person-centric multimedia and multimodal perspective on food computing and show how multimedia and food computing are synergistic and complementary. Enjoying food is a truly multimedia experience involving sight, taste, smell, and even sound, that can be captured using a multimedia food logger. The biological response to food can be captured using multimodal data streams using available wearable devices. Central to this approach is the Personal Food Model. Personal Food Model is the digitized representation of the food-related characteristics of an individual. It is designed to be used in food recommendation systems to provide eating-related recommendations that improve the user's quality of life. To model the food-related characteristics of each person, it is essential to capture their food-related enjoyment using a Preferential Personal Food Model and their biological response to food using their Biological Personal Food Model. Inspired by the power of 3-dimensional color models for visual processing, we introduce a 6-dimensional taste-space for capturing culinary characteristics as well as personal preferences. We use event mining approaches to relate food with other life and biological events to build a predictive model that could also be used effectively in emerging food recommendation systems.

### Helping Users Tackle Algorithmic Threats on Social Media: A Multimedia Research Agenda

• Christian von der Weth
• Ashraf Abdul
• Shaojing Fan
• Mohan Kankanhalli

Participation on social media platforms has many benefits but also poses substantial threats. Users often face an unintended loss of privacy, are bombarded with mis-/disinformation, or are trapped in filter bubbles due to over-personalized content. These threats are further exacerbated by the rise of hidden AI-driven algorithms working behind the scenes to shape users' thoughts, attitudes, and behaviour. We investigate how multimedia researchers can help tackle these problems to level the playing field for social media users. We perform a comprehensive survey of algorithmic threats on social media and use it as a lens to set a challenging but important research agenda for effective and real-time user nudging. We further implement a conceptual prototype and evaluate it with experts to supplement our research agenda. This paper calls for solutions that combat the algorithmic threats on social media by utilizing machine learning and multimedia content analysis techniques but in a transparent manner and for the benefit of the users.

## SESSION: Reproducibility Session

### Reproducibility Companion Paper: Instance of Interest Detection

• Fan Yu
• Dandan Wang
• Haonan Wang
• Tongwei Ren
• Jinhui Tang
• Gangshan Wu
• Jingjing Chen
• Michael Riegler

To support the replication of "Instance of Interest Detection", which was presented at MM'19, this companion paper provides the details of the artifacts. Instance of Interest Detection (IOID) aims to provide instance-level user interest modeling for image semantic description. In this paper, we explain the file structure of the source code and publish the details of our IOID dataset, which can be used to retrain the model with custom parameters. We also provide a program for component analysis to help other researchers to do experiments with alternative models that are not included in our experiments. Moreover, we provide a demo program for using our model easily.

### Reproducibility Companion Paper: Outfit Compatibility Prediction and Diagnosis with Multi-Layered Comparison Network

• Xin Wang
• Bo Wu
• Yueqi Zhong
• Wei Hu
• Jan Zahálka

This companion paper supports the experimental replication of the paper "Outfit Compatibility Prediction and Diagnosis with Multi-Layered Comparison Network", which was presented at ACM Multimedia 2019. We provide the software package for replicating the implementation of Multi-Layered Comparison Network (MCN), as well as the Polyvore-T dataset and baseline methods compared in the original paper. This paper contains guides to reproduce the experimental results, including outfit compatibility prediction, outfit diagnosis and automatic outfit revision.

### Reproducibility Companion Paper: Visual Sentiment Analysis for Review Images with Item-Oriented and User-Oriented CNN

• Quoc-Tuan Truong
• Martin Aumüller
• Naoko Nitta

We revisit our contributions on visual sentiment analysis for online review images published at ACM Multimedia 2017, where we develop item-oriented and user-oriented convolutional neural networks that better capture the interaction of image features with specific expressions of users or items. In this work, we outline the experimental claims as well as describe the procedures to reproduce the results therein. In addition, we provide artifacts including data sets and code to replicate the experiments.

### Reproducibility Companion Paper: Selective Deep Convolutional Features for Image Retrieval

• Tuan Hoang
• Thanh-Toan Do
• Ngai-Man Cheung
• Michael Riegler
• Jan Zahálka

In this companion paper, we first briefly summarize the contributions of our main manuscript, Selective Deep Convolutional Features for Image Retrieval, published in ACM MultiMedia 2017. In addition, we provide detailed instructions together with pre-configured MATLAB scripts, which allow the experiments to be executed and the results reported in our main manuscript to be reproduced effortlessly. The source code is available at https://github.com/hnanhtuan/selectiveConvFeatures_ACMMM_reproducibility.

## SESSION: Open Source Software

### MLModelCI: An Automatic Cloud Platform for Efficient MLaaS

• Huaizheng Zhang
• Yuanming Li
• Yizheng Huang
• Yonggang Wen
• Jianxiong Yin
• Kyle Guan

MLModelCI provides multimedia researchers and developers with a one-stop platform for efficient machine learning (ML) services. The system leverages DevOps techniques to optimize, test, and manage models. It also containerizes and deploys these optimized and validated models as cloud services (MLaaS). In essence, MLModelCI serves as a housekeeper to help users publish models. The models are first automatically converted to optimized formats for production purposes and then profiled under different settings (e.g., batch size and hardware). The profiling information can be used as a guideline for balancing the trade-off between performance and cost of MLaaS. Finally, the system dockerizes the models for ease of deployment to cloud environments. A key feature of MLModelCI is the implementation of a controller, which allows elastic evaluation that utilizes only idle workers while maintaining online service quality. Our system bridges the gap between current ML training and serving systems and thus frees developers from the manual and tedious work often associated with service deployment. We release the platform as an open-source project on GitHub under the Apache 2.0 license, with the aim that it will facilitate and streamline more large-scale ML applications and research projects.

### Hysia: Serving DNN-Based Video-to-Retail Applications in Cloud

• Huaizheng Zhang
• Yuanming Li
• Qiming Ai
• Yong Luo
• Yonggang Wen
• Yichao Jin
• Nguyen Binh Duong Ta

Combining video streaming and online retailing (V2R) has been a growing trend recently. In this paper, we provide practitioners and researchers in multimedia with a cloud-based platform named Hysia for easy development and deployment of V2R applications. The system consists of: 1) a back-end infrastructure providing optimized V2R related services including data engine, model repository, model serving and content matching; and 2) an application layer which enables rapid V2R application prototyping. Hysia addresses industry and academic needs in large-scale multimedia by: 1) seamlessly integrating state-of-the-art libraries including NVIDIA video SDK, Facebook faiss, and gRPC; 2) efficiently utilizing GPU computation; and 3) allowing developers to bind new models easily to meet the rapidly changing deep learning (DL) techniques. On top of that, we implement an orchestrator for further optimizing DL model serving performance. Hysia has been released as an open source project on GitHub, and attracted considerable attention. We have published Hysia to DockerHub as an official image for seamless integration and deployment in current cloud environments.

### PyRetri: A PyTorch-based Library for Unsupervised Image Retrieval by Deep Convolutional Neural Networks

• Benyi Hu
• Ren-Jie Song
• Xiu-Shen Wei
• Yazhou Yao
• Xian-Sheng Hua
• Yuehu Liu

Despite significant progress in applying deep learning methods to the field of content-based image retrieval, there has not been a software library that covers these methods in a unified manner. In order to fill this gap, we introduce PyRetri, an open source library for deep learning based unsupervised image retrieval. The library encapsulates the retrieval process in several stages and provides functionality that covers various prominent methods for each stage. The idea underlying its design is to provide a unified platform for deep learning based image retrieval research, with high usability and extensibility. The project source code, usage examples, sample data and pre-trained models are available at https://github.com/PyRetri/.

### Cottontail DB: An Open Source Database System for Multimedia Retrieval and Analysis

• Ralph Gasser
• Luca Rossetto
• Silvan Heller
• Heiko Schuldt

Multimedia retrieval and analysis are two important areas in "Big data" research. They have in common that they work with feature vectors as proxies for the media objects themselves. Together with metadata such as textual descriptions or numbers, these vectors describe a media object in its entirety, and must therefore be considered jointly for both storage and retrieval.

In this paper we introduce Cottontail DB, an open source database management system that integrates support for scalar and vector attributes in a unified data and query model that allows for both Boolean retrieval and nearest neighbour search. We demonstrate that Cottontail DB scales well to large collection sizes and vector dimensions and provide insights into how it proved to be a valuable tool in various use cases ranging from the analysis of MRI data to realizing retrieval solutions in the cultural heritage domain.
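
To make the combination of Boolean retrieval and nearest neighbour search concrete, the sketch below filters records on a scalar predicate and then ranks the survivors by vector distance. It is a brute-force, in-memory illustration. The record layout, the `knn_with_filter` name and the sample data are hypothetical and do not reflect Cottontail DB's actual data model, query language, or API.

```python
import heapq
import math

def knn_with_filter(records, query, k, predicate):
    # Boolean retrieval step: keep only records whose scalar attributes
    # satisfy the predicate.
    candidates = (r for r in records if predicate(r))
    # Nearest neighbour step: the k remaining records closest to the
    # query vector by Euclidean distance.
    return heapq.nsmallest(
        k, candidates, key=lambda r: math.dist(r["vector"], query)
    )

# Hypothetical media objects: scalar metadata plus a feature vector.
records = [
    {"id": "a", "year": 1999, "vector": [0.0, 0.0]},
    {"id": "b", "year": 2005, "vector": [1.0, 0.0]},
    {"id": "c", "year": 2010, "vector": [0.0, 3.0]},
    {"id": "d", "year": 2020, "vector": [0.5, 0.5]},
]

hits = knn_with_filter(records, [0.0, 0.0], k=2,
                       predicate=lambda r: r["year"] >= 2000)
print([r["id"] for r in hits])  # prints ['d', 'b']
```

Record `a` is the closest vector but is excluded by the scalar predicate. This is why the abstract argues that vector and scalar attributes should be handled in a single query model rather than in separate systems.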

### BMXNet 2: An Open Source Framework for Low-bit Networks - Reproducing, Understanding, Designing and Showcasing

• Joseph Bethge
• Christian Bartz
• Haojin Yang
• Christoph Meinel

Binary and quantized neural networks are a promising technique to run convolutional neural networks on mobile or embedded devices. BMXNet 2 is an open-source framework that provides a broad basis for academia and industry. It provides a modern implementation of binary and quantized layers with a wide array of implemented state-of-the-art models. Our implementation fosters reproducibility of other works and our own work through publishing model code, hyperparameters, detailed model graphs, and training logs. Furthermore, we implement several applications for BNNs, including demo applications, which can run on a smartphone or a Raspberry Pi. The code can be found online: https://github.com/hpi-xnor/BMXNet-v2

### PyAnomaly: A Pytorch-based Toolkit for Video Anomaly Detection

• Yuhao Cheng
• Wu Liu
• Pengrui Duan
• Jingen Liu
• Tao Mei

Video anomaly detection is an essential task in computer vision that attracts considerable attention from academia and industry. Existing approaches are implemented in diverse deep learning frameworks and settings, making it difficult to reproduce the results published by the original authors. This is detrimental to the development of video anomaly detection and to communication within the community. In this paper, we present PyAnomaly, a PyTorch-based video anomaly detection toolbox that contains highly modular and extensible components, comprehensive and impartial evaluation platforms, a friendly and manageable system configuration, and abundant engineering deployment functions. To make it easy to use and easy to extend, we implement the architecture with hook and register functionality. Remarkably, we have reproduced experimental results comparable to those published by the original authors for six representative methods, and we will release these pre-trained models with richer configurations. To the best of our knowledge, PyAnomaly is the first open-source tool for video anomaly detection; it is available at https://github.com/YuhaoCheng/PyAnomaly.
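
A hook and register architecture of the kind mentioned above might look like the following minimal sketch. It does not reproduce PyAnomaly's API: `Registry`, `MODELS`, `FrameDiffModel` and `Runner` are hypothetical names, and the frame-difference score is a toy stand-in for a real anomaly detector.

```python
class Registry:
    """Maps string names to classes so components can be built from a config."""

    def __init__(self):
        self._items = {}

    def register(self, name):
        def decorator(cls):
            self._items[name] = cls
            return cls
        return decorator

    def build(self, name, **kwargs):
        return self._items[name](**kwargs)

MODELS = Registry()  # hypothetical registry, not PyAnomaly's own

@MODELS.register("frame_diff")
class FrameDiffModel:
    """Toy scorer: mean absolute difference between consecutive frames."""

    def score(self, prev_frame, frame):
        return sum(abs(a - b) for a, b in zip(prev_frame, frame)) / len(frame)

class Runner:
    """Runs a model over frames and calls every registered hook with each score."""

    def __init__(self, model):
        self.model = model
        self.hooks = []

    def add_hook(self, fn):
        self.hooks.append(fn)

    def run(self, frames):
        for prev_frame, frame in zip(frames, frames[1:]):
            score = self.model.score(prev_frame, frame)
            for hook in self.hooks:
                hook(score)

scores = []
runner = Runner(MODELS.build("frame_diff"))
runner.add_hook(scores.append)  # a hook that logs scores
runner.run([[0, 0], [1, 1], [1, 1]])
print(scores)
```

New models are added by registering another class, and new behaviour such as logging or evaluation is added by attaching another hook, without editing the runner.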

### TAPAS-360°: A Tool for the Design and Experimental Evaluation of 360° Video Streaming Systems

• Giuseppe Ribezzo
• Luca De Cicco
• Vittorio Palmisano
• Saverio Mascolo

Video streaming platforms are required to innovate their delivery pipeline to allow new and more immersive video content to be supported. In particular, Omnidirectional videos enable the user to explore a 360° scene by moving their heads using Head Mounted Display devices. Viewport adaptive streaming allows changing dynamically the quality of the video falling in the user's field of view. In this paper, we present TAPAS-360°, an open-source tool that enables designing and experimenting all the components required to build omnidirectional video streaming systems. The tool can be used by researchers focusing on the design of viewport-adaptive algorithms and also to produce video streams to be employed for subjective and objective Quality of Experience evaluations.
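
As an illustration of viewport-adaptive quality selection, the sketch below assigns a higher quality level to the tiles that overlap the user's field of view. The function name, the equal-width yaw tiling and the two-level quality scheme are assumptions made for this example and are not taken from TAPAS-360.

```python
def tile_qualities(viewport_center_deg, fov_deg, num_tiles, high, low):
    """Assign `high` to tiles overlapping the viewport and `low` to the rest.

    The 360 degrees of yaw are split into `num_tiles` equally wide tiles.
    """
    tile_width = 360.0 / num_tiles
    qualities = []
    for i in range(num_tiles):
        tile_center = (i + 0.5) * tile_width
        # Smallest angular distance between tile centre and viewport centre.
        diff = abs((tile_center - viewport_center_deg + 180.0) % 360.0 - 180.0)
        # Two intervals overlap when their centres are closer than the sum
        # of their half-widths.
        in_view = diff < (fov_deg + tile_width) / 2.0
        qualities.append(high if in_view else low)
    return qualities

print(tile_qualities(0.0, 90.0, 8, "hi", "lo"))
```

With the viewport centred at 0° the high-quality tiles wrap around the 0°/360° seam, which is why the angular distance is computed modulo 360.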

### SOMHunter: Lightweight Video Search System with SOM-Guided Relevance Feedback

• Miroslav Kratochvil
• František Mejzlík
• Patrik Veselý
• Tomáš Souček
• Jakub Lokoč

In the last decade, the Video Browser Showdown (VBS) became a comparative platform for various interactive video search tools competing in selected video retrieval tasks. However, participation of new teams with their own novel tool is prohibitively time-demanding because of the large number and complexity of components required for constructing a video search system from scratch. To partially alleviate this difficulty, we provide an open-source version of the lightweight known-item search system SOMHunter that competed successfully at VBS 2020. The system combines several features for text-based search initialization and browsing of large result sets; in particular, a variant of the W2VV++ model for text search, temporal queries for targeting sequences of frames, several types of displays including the eponymous self-organizing map view, and a feedback-based approach for maintaining the relevance scores inspired by PICHunter. The minimalistic, easily extensible implementation of SOMHunter should serve as a solid basis for constructing new search systems, thus facilitating easier exploration of new video retrieval ideas.

## SESSION: Demo Session I

### Text-to-Image Synthesis via Aesthetic Layout

• Samah Saeed Baraheem
• Trung-Nghia Le
• Tam V. Nguyen

In this work, we introduce a practical system which synthesizes an appealing image from natural language descriptions such that the generated image should maintain the aesthetic level of photographs. Our proposed method takes the text from the end-users via a user-friendly interface and produces a set of different label maps via the primary generator PG. Then, choosing a subset from the label maps set is performed through the primary aesthetic appreciation PAA. Next, our subset of label maps is fed into the accessory generator AG, which is the state-of-the-art image-to-image translation. Last but not least, our subset of generated images is ranked via the accessory aesthetic appreciation AAA, and the most appealing image is produced.

### Progressive Domain Adaptation for Robot Vision Person Re-identification

• Zijun Sha
• Zelong Zeng
• Zheng Wang
• Yoichi Natori
• Yasuhiro Taniguchi
• Shin'ichi Satoh

Person re-identification has received much attention in the last few years, as it enhances the retrieval effectiveness in the video surveillance networks and video archive management. In this paper, we demonstrate a guiding robot with person followers system, which recognizes the follower using a person re-identification technology. It first adopts existing face recognition and person tracking methods to generate person tracklets with different IDs. Then, a classic person re-identification model, pre-trained on the surveillance dataset, is adapted to the new robot vision condition incrementally. The demonstration showcases the quality of robot follower focusing.

### Semantic Storytelling Automation: A Context-Aware and Metadata-Driven Approach

• Paula Viana
• Pedro Carvalho
• Pieter P. Jonker
• Vasileios Papanikolaou
• Inês N. Teixeira
• Luis Vilaça
• José P. Pinto
• Tiago Costa

Multimedia content production is nowadays widespread due to technological advances, namely supported by smartphones and social media. Although the massive amount of media content brings new opportunities to the industry, it also obfuscates the relevance of marketing content, meant to maintain and lure new audiences. This leads to an emergent necessity of producing these kinds of contents as quickly and engagingly as possible. Creating these automatically would decrease both the production costs and time, particularly by using static media for the creation of short storytelling animated clips. We propose an innovative approach that uses context and content information to transform a still photo into an appealing context-aware video clip. Thus, our solution presents a contribution to the state-of-the-art in computer vision and multimedia technologies and assists content creators with a value-added service to automatically build rich contextualized multimedia stories from single photographs.

### ADHD Intelligent Auxiliary Diagnosis System Based on Multimodal Information Fusion

• Yanyi Zhang
• Ming Kong
• Tianqi Zhao
• Wenchen Hong
• Qiang Zhu
• Fei Wu

The traditional medical diagnosis methods of ADHD mainly rely on scale evaluation and interview observation. The diagnosis conclusion is subjective and extremely dependent on the doctor's experience level. There is an urgent need to improve diagnosis efficiency and improve the diagnosis standard through other technical means in the clinical process. We have designed and developed the ADHD intelligent auxiliary diagnosis system with software and hardware cooperation. The system performs a set of functional test tasks, uses a camera module to capture multimodal information such as facial expressions, eye movements, limb movements, language expressions and reaction abilities of children during task completion, and uses computer vision technology to automatically extract measurable characteristics. Finally, deep learning technology is used to detect children's specific behaviors in the video, which is complementary to the existing doctor's diagnosis basis. This system was deployed in the Department of Psychology of Children's Hospital of Zhejiang University in July 2019 and has been used in actual clinical diagnosis to date. It has completed the testing and evaluation of hundreds of ADHD children.

### Video 360 Content Navigation for Mobile HMD Devices

• Jounsup Park
• Mingyuan Wu
• Eric Lee
• Klara Nahrstedt
• Yash Shah
• Arielle Rosenthal
• John Murray
• Kevin Spiteri
• Michael Zink
• Ramesh Sitaraman

We demonstrate a video 360 navigation and streaming system for mobile HMD devices. The Navigation Graph (NG) concept is used to predict future views with a graph model that captures both the temporal and spatial viewing behavior of prior viewers. Visualization of video 360 content navigation and view prediction algorithms is used to assess Quality of Experience (QoE) and to evaluate the accuracy of the NG-based view prediction algorithm.
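
One simple way to predict future views from the behavior of prior viewers is to count view-to-view transitions per video segment and pick the most frequent successor. The sketch below follows that idea under stated assumptions: the graph layout, the function names and the view labels are hypothetical, and the actual Navigation Graph may be structured differently.

```python
from collections import Counter, defaultdict

def build_navigation_graph(sessions):
    """Count view-to-view transitions per segment index across prior viewers.

    Each session is the sequence of views one viewer watched, one per segment.
    """
    graph = defaultdict(Counter)
    for views in sessions:
        for t in range(len(views) - 1):
            # Node: (segment index, view); edge weight: number of viewers
            # who moved to the given view in the next segment.
            graph[(t, views[t])][views[t + 1]] += 1
    return graph

def predict_next_view(graph, t, current_view):
    """Return the most frequent successor of `current_view` at segment `t`."""
    successors = graph.get((t, current_view))
    if not successors:
        return current_view  # fallback: assume the viewer does not move
    return successors.most_common(1)[0][0]

sessions = [
    ["A", "B", "C"],
    ["A", "B", "B"],
    ["A", "C", "C"],
    ["A", "B", "C"],
]
graph = build_navigation_graph(sessions)
print(predict_next_view(graph, 0, "A"), predict_next_view(graph, 1, "B"))
```

Keying nodes by segment index captures the temporal side of viewing behavior, while the edges between views capture the spatial side.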

• Yuanfeng Song
• Di Jiang
• Xiaoling Huang
• Yawen Li
• Qian Xu
• Raymond Chi-Wing Wong
• Qiang Yang

Existing Automatic Speech Recognition (ASR) systems usually generate the N-best hypotheses list first, and then rescore them with the language model score and the acoustic model score to find the best one. This procedure is essentially analogous to the working mechanism of modern Information Retrieval (IR) systems, which retrieve a relatively large amount of relevant candidates first, re-rank them, and output the top-N list. Exploiting their commonality, this demonstration proposes a novel system named GoldenRetriever that marries IR with ASR. GoldenRetriever transforms the problem of N-best hypotheses rescoring as a Learning-to-Rescore (L2RS) problem and utilizes a wide range of features beyond the language model score and the acoustic model score. In this demonstration, the audience can experience the great potential of marrying IR with ASR for the first time. GoldenRetriever should inspire more research on transferring the state-of-the-art IR techniques to ASR.
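
The rescoring step described above can be sketched as ranking N-best hypotheses by a weighted sum of features. This is an illustration only, not GoldenRetriever's implementation: the `topic_match` feature, all feature values and the hand-set weights are invented, and a learning-to-rescore system would learn its scoring function from data.

```python
def rescore(hypotheses, weights):
    """Sort N-best hypotheses by a weighted sum of their feature values."""
    def score(hypothesis):
        return sum(weights[name] * value
                   for name, value in hypothesis["features"].items())
    return sorted(hypotheses, key=score, reverse=True)

# Invented N-best list with acoustic model, language model and extra features.
n_best = [
    {"text": "wreck a nice beach",
     "features": {"acoustic": -3.5, "language": -6.0, "topic_match": 0.1}},
    {"text": "recognize speech",
     "features": {"acoustic": -4.0, "language": -3.0, "topic_match": 0.8}},
]

# Hand-set weights for illustration; a real system would learn them.
weights = {"acoustic": 1.0, "language": 1.0, "topic_match": 2.0}
print(rescore(n_best, weights)[0]["text"])
```

The hypothesis with the better acoustic score loses once the language model and the additional feature are taken into account.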

### Integrating Event Camera Sensor Emulator

• Andrew C. Freeman
• Ketan Mayer-Patel

Event cameras are biologically-inspired sensors that upend the framed, synchronous nature of traditional cameras. Singh et al. proposed a novel sensor design wherein incident light values may be measured directly through continuous integration, with individual pixels' light sensitivity being adjustable in real time, allowing for extremely high frame rate and high dynamic range video capture. Arguing the potential usefulness of this sensor, this paper introduces a system for simulating the sensor's event outputs and pixel firing rate control from 3D-rendered input images.
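
The idea of measuring light through integration with an adjustable per-pixel sensitivity can be illustrated with a simplified integrate-and-fire pixel. This sketch is an assumption-laden toy: the function name, the discrete time steps and the threshold-subtraction rule are hypothetical, and it does not reproduce the sensor design by Singh et al. or the authors' emulator.

```python
def simulate_pixel(intensities, threshold):
    """Return the time steps at which a single integrating pixel fires.

    The pixel accumulates incoming light and emits an event every time the
    accumulated value reaches `threshold`. A lower threshold makes the pixel
    more sensitive and raises its firing rate.
    """
    accumulated = 0.0
    events = []
    for t, value in enumerate(intensities):
        accumulated += value
        # Bright input may cross the threshold several times in one step.
        while accumulated >= threshold:
            events.append(t)
            accumulated -= threshold
    return events

light = [1, 1, 1, 1, 4]
print(simulate_pixel(light, threshold=2))
print(simulate_pixel(light, threshold=4))
```

In this toy model the number of events times the threshold approximates the total incident light, so brighter pixels fire more often.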

### Scene-segmented Video Information Annotation System V2.0

• Alex Lee
• Chang-Uk Kwak
• Jeong-Woo Son
• Gyeong-June Hahm
• Min-Ho Han
• Sun-Joong Kim

We have built a scene-segmented video information annotation system and upgraded it to version 2.0. The system imports a video selected by the user and splits it into scene units. Each scene clip is annotated by integrating visual features derived with state-of-the-art deep learning techniques. The proposed system uses a multiview deep convolutional neural network for video segmentation and a supervised movie caption model for video annotation. The two functionalities are installed in two different sub-systems and connected through the web interface. The web interface allows connecting to external content providers in order to expand the capability of the system.

### SmartShots: Enabling Automatic Generation of Videos with Data Visualizations Embedded

• Tan Tang
• Junxiu Tang
• Jiewen Lai
• Lu Ying
• Peiran Ren
• Lingyun Yu
• Yingcai Wu

Videos have become a prevalent way for storytellers to inspire viewers' interest. To further enhance narration, visualizations are integrated into videos to present data-driven insights. However, manually crafting such data-driven videos is difficult and time-consuming. Thus, we present SmartShots, a system that facilitates the automatic integration of in-video visualizations. Specifically, we propose a computational framework that integrates non-verbal video clips, images, a melody, and a data table to create a video with data visualizations embedded. The system automatically translates the multimedia material into shots and then combines the shots into a compelling video. In addition, we develop a set of post-editing interactions to incorporate users' design knowledge and help them re-edit the automatically generated videos.

## SESSION: Demo Session II

### A Smart-Site-Survey System using Image-based 3D Metric Reconstruction and Interactive Panorama Visualization

• Sha Yu
• Kevin McGuinness
• Patricia Moore
• David Azcona
• Noel O'Connor

This work presents a so-called Smart Site Survey (SSS) system that provides an efficient, web-based platform for virtual inspection of remote sites with absolute 3D metrics. Traditional manual surveying requires sending surveyors and specialised measuring tools to the targeted scene, which takes time, requires significant human resources, and is often subject to human error. The proposed system provides an automated site survey tool. Sample indoor scenes including offices, storage rooms, and laboratories are used for testing purposes, and highly precise virtual scenes are restored with a measurement accuracy of 1%, i.e. an error of ±1.5 cm over a 150 cm length. This is comparable or superior to existing works and commercial products.

### AI-SAS: Automated In-match Soccer Analysis System

• Ning Zhang
• Tong Shen
• Yue Chen
• Wei Zhang
• Dan Zeng
• Jingen Liu
• Tao Mei

Real-time in-match soccer statistics provide continuous tracking of soccer ball and player positions and speeds, enabling advanced analytics. Currently, only elite soccer leagues have the luxury of tracking in-match soccer statistics operated with a large number of trained personnel. In this work, we present an Automated In-match Soccer Analysis System (AI-SAS), using a domain-knowledge-based multi-view global tracking. This system tracks player team, position, and speed automatically, providing real-time in-match team- and individual-level statistics and analyses. In comparison with the latest soccer analysis systems, AI-SAS is more scalable in streaming multiple video sources for real-time process and more flexible in hosting plug-and-play deep-learning-based tracking-by-detection algorithms. The global multi-view tracking also overcomes the single-view limitation and improves the tracking accuracy.

### Detecting Urban Issues With the Object Detection Kit

• Maarten Sukel
• Stevan Rudinac
• Marcel Worring

This technical demo will present the Object Detection Kit, a system capable of collecting, analyzing and distributing street level imagery in real-time. It provides civil servants with the actionable intelligence about issues on city streets and, at the same time, equips the multimedia research community with a framework and data facilitating easy deployment and testing of algorithms in a challenging urban setting. The system is available as open source. In the Object Detection Kit demo we will demonstrate how the framework can be used to detect urban issues and showcase the capabilities of the system.

### Visual-speech Synthesis of Exaggerated Corrective Feedback

• Yaohua Bu
• Weijun Li
• Tianyi Ma
• Shengqi Chen
• Jia Jia
• Kun Li
• Xiaobo Lu

To provide more discriminative feedback for second language (L2) learners to better identify their mispronunciation, we propose a method for exaggerated visual-speech feedback in computer-assisted pronunciation training (CAPT). The speech exaggeration is realized by an emphatic speech generation neural network based on Tacotron, while the visual exaggeration is accomplished by ADC Viseme Blending, namely increasing the Amplitude of movement, extending the phone's Duration and enhancing the color Contrast. User studies show that exaggerated feedback outperforms the non-exaggerated version in helping learners with pronunciation identification and pronunciation improvement.

### TindART: A Personal Visual Arts Recommender

• Gjorgji Strezoski
• Lucas Fijen
• Jonathan Mitnik
• Dániel László
• Pieter de Marez Oyens
• Yoni Schirris
• Marcel Worring

We present TindART - a comprehensive visual arts recommender system. TindART leverages real-time user input to build a user-centric preference model based on content and demographic features. Our system is coupled with visual analytics controls that allow users to gain a deeper understanding of their art taste and further refine their personal recommendation model. The content-based features in TindART are extracted using a multi-task learning deep neural network which accounts for a link between multiple descriptive attributes and the content they represent. Our demographic engine is powered by social media integrations such as Google, Facebook and Twitter profiles that users can log in with. Both the content and demographics power a recommender system whose decision-making process is visualized through our web t-SNE implementation. TindART is live and available at: https://tindart.net/.

### Fashionist: Personalising Outfit Recommendation for Cold-Start Scenarios

• Dhruv Verma
• Kshitij Gulati
• Vasu Goel
• Rajiv Ratn Shah

With the proliferation of the online fashion industry, there have been increased efforts towards building cutting-edge solutions for personalising fashion recommendation. Despite this, the technology is still limited by its poor performance on new entities, i.e. the cold-start problem. We attempt to address the cold-start problem for new users, by leveraging a novel visual preference modelling approach on a small set of input images. Additionally, we describe our proposed strategy to incorporate the modelled preference in occasion-oriented outfit recommendation. Finally, we propose Fashionist: a real-time web application to demonstrate our approach enabling personalised and diverse outfit recommendation for cold-start scenarios. Check out https://youtu.be/kuKgPCkoPy0 for demonstration.

### EmotionTracker: A Mobile Real-time Facial Expression Tracking System with the Assistant of Public AI-as-a-Service

• Xuncheng Liu
• Jingyi Wang
• Weizhan Zhang
• Qinghua Zheng
• Xuanya Li

### AvatarMeeting: An Augmented Reality Remote Interaction System With Personalized Avatars

• Xuanyu Wang
• Yang Wang
• Yan Shi
• Weizhan Zhang
• Qinghua Zheng

To further enhance the sense of immersion in remote interaction, avatars can be used with Head Mounted Display (HMD) based Augmented Reality (AR). In our demonstration, we present AvatarMeeting, an avatar-based remote interaction system that enables users to meet with remote peers through interactive personalized avatars, just as if face to face. Specifically, we propose a novel framework including a consumer-grade set-up, a complete transmission scheme and a processing pipeline, which consists of prescan modeling, pose detection and action reconstruction. In addition, an angle-based reconstruction approach is introduced to enable the AR avatars to smoothly perform the same actions as each remote person does in real time while keeping a good avatar shape.

### An Interactive Design for Visualizable Person Re-Identification

• Haolin Ren
• Zheng Wang
• Zhixiang Wang
• Lixiong Chen
• Shin'ichi Satoh
• Daning Hu

Although person Re-Identification (ReID) is widely applied in a variety of multimedia systems, most of its essentially multifaceted output is evaluated and visualized inflexibly using only a list of images ranked by the similarity of image content, while the correlations between samples of different IDs and the spatial-temporal features of the images are underinvestigated. As system operators need convenient access to these important elements, we introduce an interactive design of a person ReID system to visualize these quantities. We demonstrate that a system offering these visual representations can effectively expedite and improve person re-identification analysis and make it a much more user-friendly experience.

## SESSION: Demo Session III

### Image and Video Restoration and Compression Artefact Removal Using a NoGAN Approach

• Filippo Mameli
• Marco Bertini
• Leonardo Galteri
• Alberto Del Bimbo

Lossy image and video compression algorithms introduce several types of visual artefacts that reduce the visual quality of the compressed media. In this work, we report results obtained using the NoGAN training approach and adapting the popular DeOldify architecture used for colorization, for image and video compression artefact removal and restoration.

### Beautify As You Like

• Wentao Jiang
• Si Liu
• Chen Gao
• Ran He
• Bo Li
• Shuicheng Yan

Customizable makeup transfer, which aims to transfer the makeup from an arbitrary reference face to a source face, is widely demanded in many applications such as short video platforms and online meeting applications. However, existing methods are neither user-friendly nor sufficiently fast. In this demo, we present the first fast makeup transfer system named as Fast Pose and expression robust Spatial-Aware GAN (FPSGAN). With a novel Attentive Makeup Morphing (AMM) module, FPSGAN is robust to face pose and expression. Moreover, it can achieve shade-controllable and partial makeup, improving the system's user-friendliness. In addition, FPSGAN is light-weighted and fast. To sum up, FPSGAN is the first fast customizable makeup transfer system to enable users to beautify themselves as they like.

• Jiawei Zuo
• Yue Chen
• Linfang Wang
• Yingwei Pan
• Ting Yao
• Ke Wang
• Tao Mei

### Multimedia Food Logger

• Ali Rostami
• Bihao Xu
• Ramesh Jain

Logging what we eat is important for individuals, and the aggregated information in these logs is important for businesses as well as public health. Food logging has received very little attention and has mostly been limited to the recognition of food items, ignoring context, situation, and health variables completely. In this demo, we let the audience interact with our multimedia food logger system. We also describe how this system captures the major food-related information that could be used by all stakeholders in the food ecosystem. We will demonstrate the complete functionality of the system in this demo.

### A Cross-modality and Progressive Person Search System

• Xiaodong Chen
• Wu Liu
• Xinchen Liu
• Yongdong Zhang
• Tao Mei

This demonstration presents an instant and progressive cross-modality person search system, called 'CMPS'. Through the system, users can instantly find lost children or elderly persons by simply describing their appearance through speech. Unlike most existing person search applications, which spend much time finding probe images, CMPS saves valuable time in the early stage after a person is lost. The proposed CMPS is one of the first attempts towards instant and progressive person search leveraging the audio, text, and visual modalities together. In detail, the system first takes the speech that describes the appearance of a person as the input to obtain a textual description by speech-to-text conversion. Then the cross-modal search is performed by matching the textual embedding with the visual representations of images in the learned latent space. The retrieved images can be used as candidates for query expansion. If the candidates are not correct, the user can quickly adjust their description through speech. Once a correct image is found, the user can directly click it as a new query. Finally, the system gives the complete track of the lost person with one click. On the built CUHK-PEDES-AUDIOS dataset, the system achieves 82.46% rank-1 accuracy at real-time speed. Our code of CMPS is available at https://github.com/SheldongChen/Search-People-With-Audio.
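
Matching a textual embedding against image representations in a shared latent space can be sketched as a similarity ranking. The example below assumes cosine similarity and uses tiny hand-written vectors; the similarity measure, the function names and the gallery are assumptions for illustration and are not taken from the CMPS code.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def search(text_embedding, gallery, top_k):
    """Return the names of the `top_k` gallery images most similar to the text."""
    ranked = sorted(gallery,
                    key=lambda item: cosine(text_embedding, item[1]),
                    reverse=True)
    return [name for name, _ in ranked[:top_k]]

# Hypothetical image embeddings in a two-dimensional latent space.
gallery = [
    ("img_a", [1.0, 0.0]),
    ("img_b", [0.6, 0.8]),
    ("img_c", [0.0, 1.0]),
]
print(search([0.7, 0.7], gallery, top_k=1))
```

In a progressive search, the embedding of a clicked image could replace the text embedding as the next query against the same gallery.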

### Binocular Multi-CNN System for Real-Time 3D Pose Estimation

• Teo T. Niemirepo
• Marko Viitanen
• Jarno Vanne

The current practical approaches for depth-aware pose estimation convert a human pose from a monocular 2D image into 3D space with a single computationally intensive convolutional neural network (CNN). This paper introduces the first open-source algorithm for binocular 3D pose estimation. It uses two separate lightweight CNNs to estimate disparity/depth information from a stereoscopic camera input. This multi-CNN fusion scheme makes it possible to perform full-depth sensing in real time on a consumer-grade laptop even if parts of the human body are invisible or occluded. Our real-time system is validated with a proof-of-concept demonstrator that is composed of two Logitech C930e webcams and a laptop equipped with Nvidia GTX1650 MaxQ GPU and Intel i7-9750H CPU. The demonstrator is able to process the input camera feeds at 30 fps and the output can be visually analyzed with a dedicated 3D pose visualizer.
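
For a rectified stereo pair, depth follows from disparity through the standard pinhole relation Z = f · B / d. The sketch below applies that textbook formula; the function name and the example camera parameters are assumptions and do not describe the authors' CNN-based pipeline or their webcam calibration.

```python
def depth_from_disparity(focal_px, baseline_m, disparity_px):
    """Depth in metres of a point seen by a rectified stereo pair.

    focal_px:     focal length in pixels
    baseline_m:   distance between the two camera centres in metres
    disparity_px: horizontal shift of the point between the two images
    """
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a finite depth")
    return focal_px * baseline_m / disparity_px

# Hypothetical set-up: 800 px focal length, cameras 12.5 cm apart.
print(depth_from_disparity(800.0, 0.125, 50.0))
print(depth_from_disparity(800.0, 0.125, 25.0))
```

Halving the disparity doubles the estimated depth, so distant joints are more sensitive to disparity errors than nearby ones.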

### An Interaction-based Video Viewing Support System using Geographical Relationships

• Itsuki Hashimoto
• Yuanyuan Wang
• Yukiko Kawai
• Kazutoshi Sumiya

With the spread of internet television, many studies have been conducted on recommending relevant information for TV programs. For example, NHK Hybridcast is a new TV service that provides relevant information on the same screen during a TV program broadcast. However, current services cannot recommend supplementary information for TV programs based on user viewing behavior. Therefore, in this paper, we propose a video viewing support system to recommend supplementary information using geographical relationships based on user interaction.

### Infinity Battle: A Glance at How Blockchain Techniques Serve in a Serverless Gaming System

• Feijie Wu
• Ho Yin Yuen
• Henry C.B. Chan
• Victor C.M. Leung
• Wei Cai

The blockchain technology provides a data authentication and permanent storage solution to the data volatility issue in peer-to-peer games. In this work, we present the Infinity Battle, a serverless turn-based strategy game supported by a novel Proof-of-Play consensus model. Comprising three major phases: matchmaking, gaming session and global synchronization, the proposed demo game generates a blockchain through distributed storage and processing.

### ConfFlow: A Tool to Encourage New Diverse Collaborations

• Ekin Gedik
• Hayley Hung

ConfFlow is an interactive web application that allows conference participants to inspect other attendees through a visualized similarity space. The construction of the similarity space is done in a similar manner to the well-known Toronto Paper Matching System (TPMS) and based on the publicly available former publications of the attendees, obtained by crawling through the Web. ConfFlow aims to help attendees initiate new connections and collaborations with participants that have similar and/or complementary research interests. It has multiple functionalities that allow users to customize their experience and identify the perfect connection for their next collaboration.

## SESSION: Grand Challenge: SMP Challenge

### HyFea: Winning Solution to Social Media Popularity Prediction for Multimedia Grand Challenge 2020

• Xin Lai
• Yihong Zhang
• Wei Zhang

Social Media Popularity (SMP) prediction focuses on predicting the social impact of a given post from a specific user in social media, which is crucial for online advertising, social recommendation, and demand prediction. In this paper, we present HyFea, our winning solution to the Social Media Prediction (SMP) Challenge for multimedia grand challenge of ACM Multimedia 2020. To address the multi-modality and personality issues of this challenge, HyFea carefully considers multiple feature types and adopts a tree-based ensembling method, i.e., CatBoost, which is shown to perform well in prediction. Specifically, HyFea involves the features related to Image, Category, Space-Time, User Profile, Tag, and Others. We conduct several experiments on the Social Media Prediction Dataset (SMPD), verifying the positive contributions of each type of features.

### A Feature Generalization Framework for Social Media Popularity Prediction

• Kai Wang
• Penghui Wang
• Xin Chen
• Qiushi Huang
• Zhendong Mao
• Yongdong Zhang

Social media is an indispensable part of modern life, and social media popularity prediction can be applied to many aspects of social life. In this paper, we propose a novel combined framework for social media popularity prediction, which accomplishes feature generalization and temporal modeling based on multi-modal feature extraction. On the one hand, in order to address the generalization problem caused by massive missing data, we train two CatBoost models with different datasets and integrate their outputs with a linear combination. On the other hand, a sliding window average is employed to mine potential short-term dependency in each user's post sequence. Extensive experiments show that our proposed framework is superior in both feature generalization and temporal modeling. In addition, our approach achieved 1st place on the leaderboard of the SMP Challenge in 2020, which demonstrates the effectiveness of our proposed framework.
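
The two mechanisms named above, a linear combination of two models' outputs and a sliding window average over a user's post sequence, can be sketched as follows. The function names, the mixing weight `alpha`, the causal window definition and all numbers are assumptions for illustration; the plain lists stand in for predictions that the paper obtains from CatBoost models.

```python
def blend(pred_a, pred_b, alpha):
    """Linear combination of two models' predictions with weight `alpha`."""
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(pred_a, pred_b)]

def sliding_window_average(values, window):
    """Average each value with up to `window - 1` values that precede it."""
    averaged = []
    for i in range(len(values)):
        start = max(0, i - window + 1)
        chunk = values[start:i + 1]
        averaged.append(sum(chunk) / len(chunk))
    return averaged

# Stand-ins for the outputs of two models trained on different datasets.
model_a = [1.0, 3.0]
model_b = [3.0, 5.0]
print(blend(model_a, model_b, alpha=0.5))

# Popularity scores of one user's posts in chronological order.
print(sliding_window_average([2.0, 4.0, 6.0, 8.0], window=2))
```

The windowed averages could serve as an extra per-post feature that summarizes how a user's recent posts performed.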

### Curriculum Learning for Wide Multimedia-Based Transformer with Graph Target Detection

• Weilong Chen
• Feng Hong
• Chenghao Huang
• Shaoliang Zhang
• Rui Wang
• Ruobing Xie
• Feng Xia
• Leyu Lin
• Yanru Zhang
• Yan Wang

The social media prediction task aims at predicting the popularity of content, which includes social multimedia data such as photos, videos, and news. The task can not only help make better decisions for recommendation, but also reveal public attention from evolutionary social systems. In this paper, we propose a novel approach named curriculum learning for wide multimedia-based transformer with graph target detection (CL-WMTG). The curriculum learning is designed for the transformer to improve the efficiency of model convergence. The mechanism of the wide multimedia-based transformer is to make the model capable of learning cross information from text, pictures and other features (e.g. categories, location). Moreover, the graph target detection part can extract different features in the picture with a pretrained model and reconstruct the features with a homogeneous graph network. We achieved third place in the SMP Challenge 2020.

### Multimodal Deep Learning for Social Media Popularity Prediction With Attention Mechanism

• Kele Xu
• Zhimin Lin
• Jianqiao Zhao
• Peicang Shi
• Wei Deng
• Huaimin Wang

Social media popularity estimation refers to predicting a post's popularity using multimodal content. The prediction performance relies heavily on the feature extraction part, and fully leveraging multimodal heterogeneous data is a great challenge in practical settings. Although remarkable progress has been made, most previous attempts are restrained by the inherently limited nature of the single modality they employ. Inspired by the recent success of multimodal learning, we propose a novel multimodal deep learning framework for the popularity prediction task, which aims to leverage the complementary knowledge from different modalities. Moreover, an attention mechanism is introduced in our framework, with the goal of assigning large weights to specified modalities during the training and inference phases. To empirically investigate the effectiveness and robustness of the proposed approach, we conduct extensive experiments on the 2020 SMP challenge. The obtained results show that the proposed framework outperforms related approaches.

### Rethinking Relation between Model Stacking and Recurrent Neural Networks for Social Media Prediction

• Chih-Chung Hsu
• Wen-Hai Tseng
• Hao-Ting Yang
• Chia-Hsiang Lin
• Chi-Hung Kao

Popularity prediction of social posts is one of the most critical issues for social media analysis and understanding. In this paper, we discover a more dominant feature representation of text information and propose a single ensemble learning model to obtain the popularity scores for the social media prediction challenge. Most social media prediction techniques focus on predicting the popularity score of social posts based on a single model, such as deep learning-based or ensemble learning-based approaches. However, it is well known that the model stacking strategy is a more effective way to boost performance on various regression tasks. In this paper, we also show that model stacking can be modeled as a simple recurrent neural network problem with comparable performance on predicting popularity scores. First, a single strong baseline is proposed based on a deep neural network with a prediction branch. Then, the partial feature maps of the last layer of our strong baseline are used to establish a new branch with an isolated predictor. Multiple predictions are easily obtained by repeating the above two steps. These preliminary predicted scores then form the input of the recurrent unit to learn the final predicted scores, called the Recurrent Stacking Model (RSM). Our experiments show that the proposed ensemble learning approach outperforms other state-of-the-art methods. Furthermore, the proposed RSM is superior to our ensemble learning approach, verifying that the model stacking problem can be transformed into the training problem of a recurrent neural network.

## SESSION: Grand Challenge: Video Relation understanding & Pre-training for Video Captions Challenge

### Video Relation Detection with Trajectory-aware Multi-modal Features

• Wentao Xie
• Guanghui Ren
• Si Liu

Video relation detection refers to detecting the relationships between different objects in videos, such as spatial relationships and action relationships. In this paper, we present video relation detection with trajectory-aware multi-modal features to solve this task. Considering the complexity of visual relation detection in videos, we decompose the task into three sub-tasks: object detection, trajectory proposal and relation prediction. We use a state-of-the-art object detection method to ensure the accuracy of object trajectory detection, and a multi-modal feature representation to help predict the relations between objects. Our method won first place in the video relation detection task of the Video Relation Understanding Grand Challenge at ACM Multimedia 2020 with 11.74% mAP, surpassing other methods by a large margin.

### A Strong Baseline for Multiple Object Tracking on VidOR Dataset

• Zhipeng Luo
• Zhiguang Zhang
• Yuehan Yao

This paper explores a simple and efficient baseline for multi-class, multiple object tracking on the VidOR dataset. The task is to build a robust object tracker that not only localizes objects with bounding boxes in every video frame but also links the bounding boxes that indicate the same object entity into a trajectory. The task's challenges are the low resolution and imbalance of the data and the long-term disappearance of objects. In view of these characteristics, we design a robust detection model, propose a new deep metric learning method, and explore some useful tracking algorithms to help complete the video object detection task.

### XlanV Model with Adaptively Multi-Modality Feature Fusing for Video Captioning

• Yiqing Huang
• Qiuyu Cai
• Siyu Xu
• Jiansheng Chen

The dynamic feature extracted by a 3D convolutional network and the static feature extracted by a CNN have proved beneficial for video captioning. We adaptively fuse these two kinds of features in the X-Linear Attention Network Video and propose the XlanV model for video captioning. However, we notice that the dynamic feature is not compatible with vision-language pre-training techniques when the frame length distribution and average pixel difference differ between the training and test videos. Consequently, we directly train the XlanV model on the MSR-VTT dataset without pre-training on the GIF dataset in this challenge. The proposed XlanV model reaches 1st place in the pre-training for video captioning challenge, which shows that substantially exploiting the dynamic feature is more effective than vision-language pre-training in this challenge.

### VideoTRM: Pre-training for Video Captioning Challenge 2020

• Jingwen Chen
• Hongyang Chao

The Pre-training for Video Captioning Challenge 2020 mainly focuses on developing video captioning systems by pre-training on the newly released large-scale Auto-captions on GIF dataset and then transferring the pre-trained model to the MSR-VTT benchmark. As part of our submission to this challenge, we propose a Transformer-based framework named VideoTRM, which consists of four modules: a textual encoder for encoding the linguistic relationships among words in the input sentence, a visual encoder for capturing the temporal dynamics in the input video, a cross-modal encoder for modeling the interactions between the two modalities (i.e., textual and visual), and a decoder for sentence generation conditioned on the input video and the previously generated words. Additionally, during fine-tuning we extend the decoder in VideoTRM with mesh-like connections and a gate fusion mechanism in multi-head attention, to take advantage of multi-level visual features and to bypass less informative attention results, respectively. In the evaluation on the test server, VideoTRM achieves superior performance and ranks second on the leaderboard.

### Multi-stage Tag Guidance Network in Video Caption

• Lanxiao Wang
• Chao Shang
• Heqian Qiu
• Taijin Zhao
• Benliu Qiu
• Hongliang Li

Video captioning has recently come to play an important role in computer vision. We participate in the Pre-training for Video Captioning Challenge, which aims to produce at least one sentence for each challenge video based on pre-trained models. In this work, we propose a tag guidance module to learn a representation that better builds the cross-modal interaction between visual content and textual sentences. First, we use three types of feature extraction networks to fully capture 2D, 3D and object information. Second, to address overfitting and time issues, the training process is divided into two stages. The first stage trains on all data, and the second stage introduces random dropout. Furthermore, we train a CNN-based network to pick out the best candidate results. In summary, we ranked third in the Pre-training for Video Captioning Challenge, which demonstrates the effectiveness of our model.

## SESSION: Grand Challenge: Human Centric Analysis I

### Dense Scene Multiple Object Tracking with Box-Plane Matching

• Jinlong Peng
• Yueyang Gu
• Yabiao Wang
• Chengjie Wang
• Jilin Li
• Feiyue Huang

Multiple Object Tracking (MOT) is an important task in computer vision. MOT is still challenging due to the occlusion problem, especially in dense scenes. Following the tracking-by-detection framework, we propose the Box-Plane Matching (BPM) method to improve MOT performance in dense scenes. First, we design the Layer-wise Aggregation Discriminative Model (LADM) to filter out noisy detections. Then, to associate the remaining detections correctly, we introduce the Global Attention Feature Model (GAFM) to extract appearance features and use them to calculate the appearance similarity between history tracklets and current detections. Finally, we propose the Box-Plane Matching strategy to achieve data association according to the motion similarity and appearance similarity between tracklets and detections. With the effectiveness of the three modules, our team achieves 1st place on the Track-1 leaderboard in the ACM MM Grand Challenge HiEve 2020.

### Transductive Multi-Object Tracking in Complex Events by Interactive Self-Training

• Ancong Wu
• Chengzhi Lin
• Bogao Chen
• Weihao Huang
• Zeyu Huang
• Wei-Shi Zheng

Recently, multi-object tracking (MOT) for estimating trajectories of pedestrians has undergone fast development and played an important role in human-centric video analysis. However, video analysis in complex events (e.g., scenes in the HiEve dataset) is still under-explored. In complex real-world scenarios, the domain gap in unseen testing scenes and the severe occlusion problem that disconnects tracks are challenging for existing online MOT methods without domain adaptation. To alleviate the domain gap, we study the problem in a transductive learning setting, which assumes that unlabeled testing data is available for learning offline tracking. We propose a transductive interactive self-training method to adapt the tracking model to unseen crowded scenes with unlabeled testing data by means of teacher-student interactive learning. To reduce prediction variance in an unseen domain, we train two different models and teach one model with pseudo labels of unlabeled data predicted by the other model interactively. To improve robustness against occlusions during self-training, we exploit disconnected track interpolation (DTI) to refine the predicted pseudo labels. Our method achieved a MOTA of 60.23 on the HiEve dataset and won first place in Multi-person Motion Tracking in Complex Events (with Private Detection) in the ACM MM Grand Challenge on Large-scale Human-centric Video Analysis in Complex Events.
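
One plausible reading of disconnected track interpolation is to fill short gaps in a track by linearly interpolating the bounding boxes on either side of the gap. The sketch below does exactly that and nothing more; the track representation, the `max_gap` limit, and the use of linear interpolation are assumptions, since the abstract does not describe how DTI is implemented.

```python
def interpolate_track(track, max_gap=20):
    """Fill missing frames in a single track by linear interpolation.

    track:   dict mapping frame index -> (x1, y1, x2, y2) box (assumption).
    max_gap: largest frame gap that is bridged (hypothetical parameter).
    Returns a new dict that also contains the interpolated frames.
    """
    frames = sorted(track)
    filled = dict(track)
    for prev, nxt in zip(frames, frames[1:]):
        gap = nxt - prev
        if 1 < gap <= max_gap:
            for f in range(prev + 1, nxt):
                t = (f - prev) / gap  # fraction of the way from prev to nxt
                filled[f] = tuple(
                    a + t * (b - a) for a, b in zip(track[prev], track[nxt])
                )
    return filled
```

For a track observed at frames 0 and 2, the box at frame 1 becomes the midpoint of the two observed boxes; gaps longer than `max_gap` are left untouched.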

### Application of Multi-Object Tracking with Siamese Track-RCNN to the Human in Events Dataset

• Bing Shuai
• Andrew Berneshawi
• Manchen Wang
• Chunhui Liu
• Davide Modolo
• Xinyu Li
• Joseph Tighe

Multi-object tracking systems often consist of a combination of a detector, a short-term linker, a re-identification feature extractor and a solver that takes the output from these separate components and makes a final prediction. In contrast, this work aims to unify all of these in a single tracking system. To this end, we propose Siamese Track-RCNN, a two-stage detect-and-track framework which consists of three functional branches: (1) the detection branch localizes object instances; (2) the Siamese-based track branch estimates the object motion; and (3) the object re-identification branch re-activates previously terminated tracks when they re-emerge. We apply this design to the Human in Events dataset.

### Towards Accurate Human Pose Estimation in Videos of Crowded Scenes

• Shuning Chang
• Li Yuan
• Xuecheng Nie
• Ziyuan Huang
• Yichen Zhou
• Yupeng Chen
• Jiashi Feng
• Shuicheng Yan

Video-based human pose estimation in crowded scenes is a challenging problem due to occlusion, motion blur, scale variation, viewpoint change, etc. Prior approaches typically fail to deal with this problem because of (1) a lack of temporal information usage and (2) a lack of training data in crowded scenes. In this paper, we focus on improving human pose estimation in videos of crowded scenes from the perspectives of exploiting temporal context and collecting new data. In particular, we first follow the top-down strategy to detect persons and perform single-person pose estimation for each frame. Then, we refine the frame-based pose estimation with temporal contexts derived from optical flow. Specifically, for one frame, we propagate the historical poses forward from the previous frames and the future poses backward from the subsequent frames to the current frame, leading to stable and accurate human pose estimation in videos. In addition, we mine new data of scenes similar to the HIE dataset from the Internet to improve the diversity of the training set. In this way, our model achieves the best performance on 7 out of 13 videos and an average wAP of 56.33 on the test dataset of the HIE challenge.

### Combined Distillation Pose

• Lei Yuan
• Shu Zhang
• Feng Fubiao
• Naike Wei

Human keypoint detection is a challenging task, especially under blurry and crowded conditions. However, existing networks for human keypoint detection have become increasingly deep. During backpropagation, the final supervision information of the network often cannot effectively guide the training of the entire network. Therefore, how to guide a deep network to train effectively is a subject worth discussing. In this paper, knowledge distillation is used to make the network's prediction results act as supervision information, and a multi-stage supervised training framework is designed from shallow to deep layers. Besides, to further improve the feature expression ability and enlarge the receptive field of the network, we also design a new convolution module, which can model the channel and spatial features separately. Finally, our method improves the AP from 49 to 55 on the HiEve human keypoint detection dataset[1], which demonstrates its superior performance and effectiveness.

## SESSION: Grand Challenge: Deep Video Understanding & BioMedia

### Deep Relationship Analysis in Video with Multimodal Feature Fusion

• Fan Yu
• DanDan Wang
• Beibei Zhang
• Tongwei Ren

In this paper, we propose a novel multimodal feature fusion method based on scene segmentation to detect the relationships between entities in a long duration video. Specifically, a long video is split into some scenes and entities in the scenes are tracked. Text, audio and visual features in a scene are extracted to predict relationships between different entities in the scene. The relationships between entities construct a knowledge graph of the video and can be used to answer some queries about the video. The experimental results show that our method performs well for deep video understanding on the HLVU dataset.

### Towards Using Semantic-Web Technologies for Multi-Modal Knowledge Graph Construction

• Matthias Baumgartner
• Luca Rossetto
• Abraham Bernstein

While a multitude of approaches for extracting semantic information from multimedia documents has emerged in recent years, isolating any form of holistic semantic representation from a larger type of document, such as a movie, is not yet feasible. In this paper we present the approaches we used in the first instance of the Deep Video Understanding Challenge, combining several multi-modal detectors with an integration scheme informed by methods from the semantic web context, in order to determine the capabilities and limitations of currently available methods for extracting semantic relations between the characters and locations relevant to the narrative of a movie.

### Story Semantic Relationships from Multimodal Cognitions

• Vishal Anand
• Raksha Ramesh
• Ziyin Wang
• Yijing Feng
• Jiana Feng
• Wenfeng Lyu
• Tianle Zhu
• Serena Yuan
• Ching-Yung Lin

We consider the problem of building semantic relationships of unseen entities from free-form multi-modal sources. This intelligent agent understands semantic properties by (1) creating logical segments from sources, (2) finding interacting objects, and (3) inferring their interaction actions using (4) extracted textual, auditory, visual, and tonal information. The conversational dialogue discourses are automatically mapped to interacting co-located objects, and fused with their Kinetic action embeddings at each scene of occurrence. This generates a combined probability distribution representation for interacting entities spanning every semantic relation class. Using these probabilities, we create knowledge graphs capable of answering semantic queries and inferring missing properties in a given context.

### ACM Multimedia BioMedia 2020 Grand Challenge Overview

• Steven A. Hicks
• Vajira Thambawita
• Hugo L. Hammer
• Trine B. Haugen
• Jorunn M. Andersen
• Oliwia Witczak
• Pål Halvorsen
• Michael A. Riegler

The BioMedia 2020 ACM Multimedia Grand Challenge is the second in a series of competitions focusing on the use of multimedia for different medical use-cases. In this year's challenge, participants are asked to develop algorithms that automatically predict the quality of a given human semen sample using a combination of visual, patient-related, and laboratory-analysis-related data. Compared to last year's challenge, participants are provided with a fully multimodal dataset (videos, analysis data, study participant data) from the field of assisted human reproduction. The tasks encourage the use of the different modalities contained within the dataset and finding smart ways of combining them to further improve prediction accuracy, for example by using only video data or by combining video data and patient-related data. The ground truth was developed through a preliminary analysis done by medical experts following the World Health Organization's standard for semen quality assessment. The task lays the basis for automatic, real-time support systems for artificial reproduction. We hope that this challenge motivates multimedia researchers to explore more medical-related applications and use their vast knowledge to make a real impact on people's lives.

### A Quantitative Comparison of Different Machine Learning Approaches for Human Spermatozoa Quality Prediction Using Multimodal Datasets

• Ming Feng
• Kele Xu
• Yin Wang

Despite remarkable advances in medical data analysis, existing approaches are severely restricted by the limitations of the single modality they employ, usually medical imaging data. However, other modalities (such as patient-related information) should also be taken into account in clinical decision-making. How to fully exploit a multi-modal dataset is still under-explored. In this paper, we make a quantitative comparison of different machine learning approaches for the human spermatozoa quality prediction task, leveraging a dataset with multiple modalities. To empirically investigate the advantages and disadvantages of different machine learning approaches, we perform extensive experiments. Leveraging different features, we achieve state-of-the-art performance on most of the tasks. The obtained results show that simple models can provide better performance, which emphasizes the importance of avoiding overfitting. For the sake of reproducibility, we have released our code to the research community.

## SESSION: Grand Challenge: CitySCENE

### Enhancing Anomaly Detection in Surveillance Videos with Transfer Learning from Action Recognition

• Kun Liu
• Minzhi Zhu
• Huiyuan Fu
• Tat-Seng Chua

Anomaly detection in surveillance videos, as a special case of video-based action recognition, has been of increasing interest to the multimedia community and to public security. Action recognition in videos faces challenges such as cluttered backgrounds and varying illumination conditions. Besides these difficulties, detecting anomalies in surveillance videos poses several unique problems. For example, the lack of sufficient training samples is one of the main challenges. In this paper, we propose to utilize transfer learning to leverage the good results from action recognition for anomaly detection in surveillance videos. More specifically, we explore techniques based on action recognition models from the following aspects: training samples, temporal modules for action recognition, and network backbones. We draw some conclusions. First, more training samples from surveillance videos lead to higher classification accuracy. Second, stronger temporal modules designed for recognizing actions and deeper networks do not achieve better results. This conclusion is reasonable since deeper networks tend to overfit, especially on a small-scale training set. Besides, to distinguish hard examples from normal activities, we separately train a neural network to classify the hard category and normal events. We then fuse this binary network and the previous network to generate the final prediction for general anomaly detection. On the CitySCENE benchmarks, our framework achieves promising performance and obtains the first prize for general anomaly detection and the second prize for specific anomaly detection.

### Modularized Framework with Category-Sensitive Abnormal Filter for City Anomaly Detection

• Jie Wu
• Yingying Li
• Wei Zhang
• Yi Wu
• Xiao Tan
• Hongwu Zhang
• Shilei Wen
• Errui Ding
• Guanbin Li

Anomaly detection in the city scenario is a fundamental computer vision task and plays a critical role in city management and public safety. Although it has attracted intense attention in recent years, it remains a very challenging problem due to the complexity of the city environment, the serious imbalance between normal and abnormal samples, and the ambiguity of the concept of abnormal behavior. In this paper, we propose a modularized framework to perform general and specific anomaly detection. A video segment extraction module is first employed to obtain the candidate video segments. Then an anomaly classification network is introduced to predict the abnormal score for each category. A category-sensitive abnormal filter is concatenated after the classification model to filter the abnormal event from the candidate video clips. It is helpful to alleviate the impact of the imbalance of abnormal categories in the test phase and obtain more accurate localization results. The experimental results reveal that our framework obtains a 66.41 MF1 in the test set of the CitySCENE Challenge 2020, which ranks first in the specific anomaly detection task.

### Large Scale Hierarchical Anomaly Detection and Temporal Localization

• Soumil Kanwal
• Vineet Mehta
• Abhinav Dhall

Abnormal event detection is a non-trivial task in machine learning. The primary reason behind this is that the abnormal class occurs sparsely, and its temporal location may not be available. In this paper, we propose a multiple feature-based approach for CitySCENE challenge-based anomaly detection. For motion and context information, Res3D and Res101 architectures are used. Object-level information is extracted by object detection feature-based pooling. Fusion of three channels above gives relatively high performance on the challenge Test set for the general anomaly task. We also show how our method can be used for temporal localisation of the abnormal activity event in a video.

### Global Information Guided Video Anomaly Detection

• Hui Lv
• Chunyan Xu
• Zhen Cui

Video anomaly detection (VAD) is currently a challenging task due to the complexity of "anomaly" as well as the lack of labor-intensive temporal annotations. In this paper, we propose an end-to-end Global Information Guided (GIG) framework for anomaly detection using video-level annotations (i.e., weak labels). We propose to first mine the global pattern cues by leveraging the weak labels in a GIG module. Then we build a spatial reasoning module to measure the relevance between vectors in the spatial domain and the global cue vectors, and select the most related feature vectors for temporal anomaly detection. The experimental results on the CitySCENE challenge demonstrate the effectiveness of our model.

## SESSION: Grand Challenge: Human Centric Analysis II

### A Simple Baseline for Pose Tracking in Videos of Crowed Scenes

• Li Yuan
• Shuning Chang
• Ziyuan Huang
• Yichen Zhou
• Yupeng Chen
• Xuecheng Nie
• Francis E.H. Tay
• Jiashi Feng
• Shuicheng Yan

This paper presents our solution to the ACM MM challenge Large-scale Human-centric Video Analysis in Complex Events[13]; specifically, we focus on Track3: Crowd Pose Tracking in Complex Events. Remarkable progress has been made in multi-pose training in recent years. However, how to track human poses in crowded and complex environments has not been well addressed. We decompose the problem into several subproblems. First, we use a multi-object tracking method to assign a human ID to each bounding box generated by the detection model. After that, a pose is generated for each bounding box with an ID. Finally, optical flow is used to take advantage of the temporal information in the videos and generate the final pose tracking result.

### HiEve ACM MM Grand Challenge 2020: Pose Tracking in Crowded Scenes

• Lumin Xu
• Ruihan Xu
• Sheng Jin

This paper tackles the challenging problem of multi-person articulated tracking in crowded scenes. We propose a simple yet effective top-down crowd pose tracking algorithm. The proposed method applies Cascade-RCNN for human detection and HRNet for pose estimation. Then IOU tracking and pose distance tracking are applied successively for pose tracking. We conduct extensive ablation studies on the recently released HiEve crowd pose tracking benchmark. Our final model achieves 56.98 Multi-Object Tracking Accuracy (MOTA) without model ensembling on the HiEve test set. Our team SimpleTrack won the 3rd place in the ACM MM'2020 HiEve Challenge.
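
The IoU tracking step can be illustrated with a greedy matcher that links each detection in the current frame to the existing track whose last box overlaps it most. This is a generic sketch, not the authors' implementation: the function names, the data layout, and the 0.5 threshold are assumptions, and the pose-distance stage mentioned above is not modeled.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def match_by_iou(tracks, detections, threshold=0.5):
    """Greedily assign detections to tracks in order of decreasing IoU.

    tracks:     dict mapping track id -> last known box (assumption).
    detections: list of boxes in the current frame.
    Returns a dict mapping detection index -> track id.
    """
    pairs = [
        (iou(box, det), tid, j)
        for tid, box in tracks.items()
        for j, det in enumerate(detections)
    ]
    pairs.sort(key=lambda p: p[0], reverse=True)
    assigned, used_tracks = {}, set()
    for score, tid, j in pairs:
        if score < threshold:
            break  # remaining pairs overlap too little to be linked
        if tid in used_tracks or j in assigned:
            continue
        assigned[j] = tid
        used_tracks.add(tid)
    return assigned
```

Detections left unassigned would typically start new tracks, and tracks left unmatched would be candidates for a second matching stage or termination.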

### Toward Accurate Person-level Action Recognition in Videos of Crowed Scenes

• Li Yuan
• Yichen Zhou
• Shuning Chang
• Ziyuan Huang
• Yupeng Chen
• Xuecheng Nie
• Tao Wang
• Jiashi Feng
• Shuicheng Yan

Detecting and recognizing human actions in videos of crowded scenes is a challenging problem due to the complex environment and the diversity of events. Prior works typically fail to deal with this problem in two respects: (1) they do not utilize scene information; (2) they lack training data from crowded and complex scenes. In this paper, we focus on improving spatio-temporal action recognition by fully utilizing scene information and collecting new data. A top-down strategy is used to overcome these limitations. Specifically, we adopt a strong human detector to detect the spatial location of persons in each frame. We then apply action recognition models to learn spatio-temporal information from video frames on both the HIE dataset and new data with diverse scenes from the Internet, which improves the generalization ability of our model. Besides, scene information is extracted by a semantic segmentation model to assist the process. As a result, our method achieved an average 26.05 wf\_mAP (ranking 1st place in the ACM MM grand challenge 2020: Human in Events).

### Person-level Action Recognition in Complex Events via TSD-TSM Networks

• Yanbin Hao
• Zi-Niu Liu
• Hao Zhang
• Bin Zhu
• Jingjing Chen
• Yu-Gang Jiang
• Chong-Wah Ngo

The task of person-level action recognition in complex events aims to densely detect pedestrians and individually predict their actions from surveillance videos. In this paper, we present a simple yet efficient pipeline for this task, referred to as TSD-TSM networks. Firstly, we adopt the TSD detector for the pedestrian localization on each single keyframe. Secondly, we generate the sequential ROIs for a person proposal by replicating the adjusted bounding box coordinates around the keyframe. Particularly, we propose to conduct straddling expansion and region squaring on the original bounding box of a person proposal to widen the potential space of motion and interaction and lead to a square box for ROI detection. Finally, we adapt the TSM classifier on the generated ROI sequences to perform action classification and further adopt late fusion to promote the prediction. Our proposed pipeline achieved the 3rd place in the ACM-MM 2020 grand challenge, i.e., Large-scale Human-centric Video Analysis in Complex Events (Track-4), obtaining final 15.31% wf-mAP@avg and 20.63% f-mAP@avg on the testing set.
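
A minimal sketch of the box adjustment is given below, assuming that expansion means enlarging the box around its center by a fixed ratio and that squaring means taking the longer side as the square's side length. Both interpretations, the function name, and the `expand` default are assumptions; the abstract does not define straddling expansion precisely.

```python
def expand_and_square(box, expand=1.2, img_w=None, img_h=None):
    """Enlarge a person box around its center and make it square.

    box:    (x1, y1, x2, y2) in pixels.
    expand: hypothetical enlargement ratio applied to the longer side.
    img_w, img_h: optional image size used to clip the result.
    """
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    half = max(x2 - x1, y2 - y1) * expand / 2.0
    sq = [cx - half, cy - half, cx + half, cy + half]
    if img_w is not None and img_h is not None:
        # Clipping may make the box non-square again at image borders.
        sq = [max(0.0, sq[0]), max(0.0, sq[1]),
              min(float(img_w), sq[2]), min(float(img_h), sq[3])]
    return tuple(sq)
```

The same adjusted box would then be replicated across the frames around the keyframe to form the ROI sequence passed to the classifier.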

### Group-Skeleton-Based Human Action Recognition in Complex Events

• Tingtian Li
• Zixun Sun
• Xiao Chen

Human action recognition as an important application of computer vision has been studied for decades. Among various approaches, skeleton-based methods recently attract increasing attention due to their robust and superior performance. However, existing skeleton-based methods ignore the potential action relationships between different persons, while the action of a person is highly likely to be impacted by another person especially in complex events. In this paper, we propose a novel group-skeleton-based human action recognition method in complex events. This method first utilizes multi-scale spatial-temporal graph convolutional networks (MS-G3Ds) to extract skeleton features from multiple persons. In addition to the traditional key point coordinates, we also input the key point speed values to the networks for better performance. Then we use multilayer perceptrons (MLPs) to embed the distance values between the reference person and other persons into the extracted features. Lastly, all the features are fed into another MS-G3D for feature fusion and classification. For avoiding class imbalance problems, the networks are trained with a focal loss. The proposed algorithm is also our solution for the Large-scale Human-centric Video Analysis in Complex Events Challenge. Results on the HiEve dataset show that our method can give superior performance compared to other state-of-the-art methods.
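
For reference, the focal loss for a single sample down-weights easy examples by a factor of (1 - p)^gamma, where p is the probability the model assigns to the true class. The sketch below implements that standard per-sample form; the default `gamma` and `alpha` values are common choices, not values reported in the abstract.

```python
import math


def focal_loss(p_true, gamma=2.0, alpha=1.0):
    """Focal loss for one sample.

    p_true: predicted probability of the ground-truth class, in (0, 1].
    gamma:  focusing parameter; gamma = 0 recovers cross-entropy.
    alpha:  class weighting factor (assumed constant here).
    """
    return -alpha * (1.0 - p_true) ** gamma * math.log(p_true)
```

A confidently correct sample contributes far less than under cross-entropy, so rare classes with harder samples have relatively more influence on training.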

## SESSION: Grand Challenge: AI Meets Beauty

### Attention Based Beauty Product Retrieval Using Global and Local Descriptors

• Jun Yu
• Guochen Xie
• Mengyan Li
• Haonian Xie
• Xinlong Hao
• Fang Gao
• Feng Shuang

Beauty product retrieval has drawn more and more attention for its wide application outlook and enormous economic benefits. However, this task is always challenging due to the variation of products, especially the disturbance of cluttered backgrounds. In this paper, we first introduce an attention mechanism into a global image descriptor, i.e., Maximum Activation of Convolutions (MAC), and propose Attention-based MAC (AMAC). With this enhancement, we can suppress the negative effect of the background and highlight the foreground in an unsupervised manner. Then, AMAC and local descriptors are ensembled to complementarily increase the performance. Furthermore, we fine-tune multiple retrieval methods on different datasets and adopt a query expansion strategy to obtain further improvements. Extensive experiments conducted on a dataset containing more than half a million beauty products (Perfect-500K) demonstrate the effectiveness of the proposed method. Finally, our team (USTC-NELSLIP) won first place on the leaderboard of the 'AI Meets Beauty' Grand Challenge of ACM Multimedia 2020. The code is available at: https://github.com/gniknoil/Perfect500K-Beauty-Product-Retrieval-Challenge.
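
To make the descriptors concrete: MAC takes the maximum activation of each channel over all spatial positions of a convolutional feature map. The sketch below also shows one hypothetical way to add attention, by scaling each position with a saliency weight derived from the channel-summed activations before taking the maximum. The attention formula is an assumption for illustration and may differ from the authors' AMAC; activations are assumed non-negative (post-ReLU).

```python
def mac(feature_map):
    """MAC descriptor: per-channel maximum over spatial positions.

    feature_map: list of channels, each a flat list of activations.
    """
    return [max(channel) for channel in feature_map]


def attention_mac(feature_map):
    """MAC with a simple, hypothetical spatial attention map.

    Each position is weighted by its channel-summed activation,
    normalized by the largest such sum, before the per-channel max.
    """
    n = len(feature_map[0])
    saliency = [sum(ch[i] for ch in feature_map) for i in range(n)]
    peak = max(saliency)
    attn = [s / peak if peak > 0 else 1.0 for s in saliency]
    return [max(v * a for v, a in zip(ch, attn)) for ch in feature_map]
```

Positions with little overall activation, which often correspond to background, are damped, so a strong response from a single channel in the background contributes less to the descriptor.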

### Multi-Feature Fusion Method Based on Salient Object Detection for Beauty Product Retrieval

• Runming Yan
• Yongchun Lin
• Zhichao Deng
• Liang Lei
• Chudong Xu

Beauty and personal care product retrieval has attracted more and more attention due to its wide application value. However, due to the diversity of the data and the complexity of image backgrounds, this task is very challenging. In this paper, we propose a multi-feature fusion method based on salient object detection to improve retrieval performance. The key to our method is to extract the foreground objects of the query set using a salient object detection network, so as to eliminate background interference. Then the foreground target images and the dataset are fed into multiple classification networks to extract multiple fusion features for retrieval. We use the Perfect-500K dataset for experiments, and the results show that our method is effective. Our method ranked 2nd in the Grand Challenge of AI Meets Beauty in ACM Multimedia 2020 with a MAP score of 0.43729. We released our code on GitHub: github.com/R-M-Yan/ACMMM2020AIMeetBeauty.

### Attention-driven Unsupervised Image Retrieval for Beauty Products with Visual and Textual Clues

• Jingwen Hou
• Sijie Ji
• Annan Wang

Beauty and personal care product retrieval (BPCR) aims to match a query image of an item to examples of the same item in a large database. The task is extremely challenging because a small number of ground-truth examples have to be found in a large search space. Previous works mostly search only with visual representations and have not made full use of the product descriptions. Many noisy examples have only subtle visual differences compared to the ground-truth examples (e.g., similar packaging but different brands), and those differences (e.g., product brands) are especially hard to capture with visual features alone, so methods based merely on visual feature similarity can easily mistake those noisy examples for the item in the query image. We notice that product descriptions are good sources for capturing those subtle differences. Therefore, we propose a search method utilizing both images and product descriptions. Before searching, we prepare not only attention-based visual features for each database image but also a textual index (TI) that matches each database example to other examples with similar product descriptions. During searching, the visual feature of the query image is first searched in the whole database and then searched in a subset obtained by looking up the TI. Finally, the second result is used to refine the initial result. Since the subset examples usually have similar properties (e.g., brand and type), the noisy examples in the initial result can be effectively replaced. We have experimentally demonstrated the effectiveness of the proposed method on the validation set of the Perfect-500K dataset. Our team (NTU-Beauty) achieved 3rd place on the leaderboard of the Grand Challenge of AI Meets Beauty in ACM Multimedia 2020. Our code is available at: https://github.com/jingwenh/2020-ai-meets-beauty_ntubeauty.git.
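
The two-pass search can be sketched as follows, under several assumptions: features are compared with cosine similarity, the textual index is looked up with the top hit of the first pass, and refinement means placing the re-ranked subset ahead of the remaining initial results. All names and these three choices are hypothetical; the abstract does not give the exact refinement rule.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def rank(query, feats, candidates):
    """Order candidate ids by decreasing similarity to the query."""
    return sorted(candidates, key=lambda i: cosine(query, feats[i]), reverse=True)


def search_with_text_index(query, feats, text_index, top_k=3):
    """Visual search refined by a textual index (TI).

    feats:      dict id -> visual feature vector.
    text_index: dict id -> ids with similar descriptions (assumption).
    """
    initial = rank(query, feats, list(feats))            # pass 1: whole database
    subset = text_index.get(initial[0], [initial[0]])    # TI lookup on the top hit
    refined = rank(query, feats, subset)                 # pass 2: TI subset only
    merged = refined + [i for i in initial if i not in refined]
    return merged[:top_k]
```

In this toy rule, an item that shares a description with the top hit is promoted over a visually closer item from an unrelated product, which mirrors the motivation of replacing visually similar but wrong-brand results.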

### Learning Visual Features from Product Title for Image Retrieval

• Fangxiang Feng
• Tianrui Niu
• Ruifan Li
• Xiaojie Wang
• Huixing Jiang

There is a huge market demand for searching for products by images on e-commerce sites. Visual features play the most important role in solving this content-based image retrieval task. Most existing methods leverage models pre-trained on other large-scale datasets with well-annotated labels, e.g. the ImageNet dataset, to extract visual features. However, due to the large difference between product images and the images in ImageNet, a feature extractor trained on ImageNet is not effective at extracting the visual features of product images. Retraining the feature extractor on the product images, in turn, faces the problem of lacking annotated labels. In this paper, we utilize easily accessible text information, namely the product title, as a supervision signal to learn the features of the product image. Specifically, we use the n-grams extracted from the product title as the labels of the product image to construct a dataset for image classification. This dataset is then used to fine-tune a pre-trained model. Finally, the basic max-pooling activation of convolutions (MAC) feature is extracted from the fine-tuned model. As a result, we achieve fourth place in the Grand Challenge of AI Meets Beauty in 2020 ACM Multimedia by using only a single ResNet-50 model without any human annotations or pre-processing or post-processing tricks. Our code is available at: https://github.com/FangxiangFeng/AI-Meets-Beauty-2020.
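The two building blocks named above, n-gram labels from a title and the MAC feature, can be sketched as follows. This is an illustrative sketch only: the helper names, the whitespace tokenizer, and the `max_n` parameter are assumptions, and the fine-tuning step itself is omitted. MAC is taken here in its usual sense of a global maximum over spatial positions for each channel.

```python
def title_ngrams(title, max_n=2):
    """Return all word n-grams (n = 1..max_n) of a product title.

    Lowercasing and whitespace tokenization are assumptions of this sketch.
    """
    tokens = title.lower().split()
    grams = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def mac_feature(feature_maps):
    """Max-pooling activation of convolutions.

    feature_maps is a list of C channels, each an H x W list of lists;
    the result has one value per channel (the spatial maximum).
    """
    return [max(max(row) for row in channel) for channel in feature_maps]


labels = title_ngrams("Rose Hand Cream")
print(labels)

# Two channels of a 2 x 2 convolutional feature map.
maps = [
    [[0.1, 0.5], [0.3, 0.2]],
    [[0.0, 0.0], [0.9, 0.4]],
]
print(mac_feature(maps))
```

In practice the MAC vector is typically L2-normalized before similarity search; that step is left out here for brevity.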

### Learning to Remember Beauty Products

• Toan H. Vu
• An Dang
• Jia-Ching Wang

This paper develops a deep learning model for the beauty product image retrieval problem. The proposed model has two main components: an encoder and a memory. The encoder extracts and aggregates features from a deep convolutional neural network at multiple scales to get feature embeddings. With the use of an attention mechanism and a data augmentation method, it learns to focus on foreground objects and neglect the background in images, so it can extract more relevant features. The memory consists of representative states of all database images as its stacks, and it can be updated during the training process. Based on the memory, we introduce a distance loss to regularize embedding vectors from the encoder to be more discriminative. Our model is fully end-to-end and requires no manual feature aggregation or post-processing. Experimental results on the Perfect-500K dataset demonstrate the effectiveness of the proposed model with a significant retrieval accuracy.

### Multi-Scale Generalized Attention-Based Regional Maximum Activation of Convolutions for Beauty Product Retrieval

• Kele Xu
• Yuzhong Liu
• Ming Feng
• Jianqiao Zhao
• Huaimin Wang
• Hengxing Cai

Beauty and personal-care product retrieval has evident applications in our daily life, and it has attracted increasing research interest during the last decade. However, the retrieval task suffers from image variations and complicated backgrounds. Recent works have demonstrated that the Generalized-attention Regional Maximal Activation of Convolutions (GRMAC) descriptor can provide state-of-the-art performance for the retrieval task. However, the GRMAC descriptor is inherently limited because it employs features from a single layer. Features from a single layer are not robust enough to scale variations, shape deformation, and heavy occlusion. In this paper, we propose a novel descriptor, named Multi-Scale Generalized Attention-Based Regional Maximum Activation of Convolutions (MS-GRMAC). This method introduces a multi-scale generalized attention mechanism to reduce the influence of scale variations and can thus boost the performance of the retrieval task. To empirically investigate the effectiveness of the proposed approach, we conduct extensive experiments on a dataset containing more than half a million personal-care products (Perfect-500K) and obtain satisfactory results without ensembling.

## SESSION: Doctoral Symposium

### Low-level Optimizations for Faster Mobile Deep Learning Inference Frameworks

• Mathieu Febvay

Over the last ten years, we have seen strong progress in smartphone technology. Each new generation acquires capabilities that significantly increase performance. At the same time, several deep learning tools are now offered by the internet giants for mobile devices, embedded devices and IoT. These libraries allow machine learning inference on the device with low latency. They provide pre-trained models, but one can also use one's own models and run them on mobile, embedded or microcontroller devices. Lack of privacy, poor Internet connectivity and the high cost of cloud platforms have made on-device inference popular among app developers, but significant challenges remain, especially for real-time tasks like augmented reality or autonomous driving. This PhD research aims at providing a path for developers to help them choose the best methods and tools to do real-time inference on mobile devices. In this paper, we present a performance benchmark of four popular open-source deep learning inference frameworks used on mobile devices, on three different convolutional neural network models. We focus our work on image classification, and particularly on the validation image set of the ImageNet 2012 dataset. We try to answer three questions: How does a framework influence model prediction and latency? Why are some frameworks better in terms of latency/accuracy than others with the same model? And what are the difficulties in implementing these frameworks inside a mobile application? Our first findings demonstrate that the low-level software implementations chosen in frameworks, the model conversion steps and the parameters set in the framework have a big impact on performance and accuracy.

### Deep Neural Networks for Predicting Affective Responses from Movies

• Ha Thi Phuong Thao

In this work, we develop deep neural networks for predicting affective responses from movies taking both audio and video streams into account. This study also tackles the issue of how to build a representation of video and audio in order to predict emotions that movies elicit in viewers. Besides, we analyse and identify helpful features extracted from video and audio streams that are important for the design of a good emotion prediction model. Fusion techniques are also taken into account with the aim to obtain the highest prediction accuracy.

### Learning Self-Supervised Multimodal Representations of Human Behaviour

• Abhinav Shukla

Self-supervised learning of representations has important potential applications in human behaviour understanding. The ability to learn useful representations from large unlabeled datasets by modeling intrinsic properties of the data has been successfully employed in various fields of machine learning, often outperforming transfer learning or fully supervised training. My research interests lie in applying these ideas to multimodal human-centric data. In this extended abstract, I present the direction of research that I have followed during the first half of my PhD, along with ideas and work in progress for the second half. My completed research so far demonstrates the potential of cross-modal self-supervision for audio representation learning, especially on small downstream datasets. I want to explore similar ideas for visual and multimodal representation learning, and apply them to speech and emotion recognition and multimodal question answering.

### Multi-person Pose Estimation in Complex Physical Interactions

• Wen Guo

Recent literature addressed the monocular multi-person 3D human pose estimation task very satisfactorily. In these studies, different persons are usually treated as independent pose instances to estimate. However, in many every-day situations, people are interacting, and different pose instances should be considered jointly since the pose of an individual depends on the pose of his/her interactees. This work aims to develop machine learning techniques for human pose estimation of persons involved in complex interactions, using the interaction information to improve the performance.

In this article, we will first describe the global problem and introduce the three main challenges and the related work. Then we will introduce the methods, experiments and results obtained with the person interaction network, PI-Net, which has been accepted at IEEE WACV 2021. In this work we input the initial pose along with those of its interactees into a recurrent network to refine the pose of the person-of-interest, and we demonstrate the effectiveness of our method on the MuPoTS dataset, setting a new state of the art on it. Finally, our ongoing work on constructing the person interaction dataset, PI dataset, and other future challenges will be discussed.

## SESSION: Workshop Summaries

### AI4TV 2020: 2nd International Workshop on AI for Smart TV Content Production, Access and Delivery

• Raphaël Troncy
• Jorma Laaksonen
• Hamed R. Tavakoli
• Lyndon Nixon
• Vasileios Mezaris

Technological developments in comprehensive video understanding - detecting and identifying visual elements of a scene, combined with audio understanding (music, speech), as well as aligned with textual information such as captions, subtitles, etc. and background knowledge - have been undergoing a significant revolution during recent years. The workshop brings together experts from academia and industry in order to discuss the latest progress in artificial intelligence research in topics related to multimodal information analysis, and in particular, semantic analysis of video, audio, and textual information for smart digital TV content production, access and delivery.

### ATQAM/MAST'20: Joint Workshop on Aesthetic and Technical Quality Assessment of Multimedia and Media Analytics for Societal Trends

• Tanaya Guha
• Dietmar Saupe
• Bastian Goldlücke
• Naveen Kumar
• Weisi Lin
• Victor Martinez
• Krishna Somandepalli
• Shrikanth Narayanan
• Wen-Huang Cheng
• Kree McLaughlin
• John See
• Lai-Kuan Wong

The Joint Workshop on Aesthetic and Technical Quality Assessment of Multimedia and Media Analytics for Societal Trends (ATQAM/MAST) aims to bring together researchers and professionals working in fields ranging from computer vision, multimedia computing, multimodal signal processing to psychology and social sciences. It is divided into two tracks: ATQAM and MAST.

ATQAM track: Visual quality assessment techniques can be divided into image and video technical quality assessment (IQA and VQA, or broadly TQA) and aesthetics quality assessment (AQA). While TQA is a long-standing field, having its roots in media compression, AQA is relatively young. Both have received increased attention with developments in deep learning. The topics have mostly been studied separately, even though they deal with similar aspects of the underlying subjective experience of media. The aim is to bring together individuals in the two fields of TQA and AQA for the sharing of ideas and discussions on current trends, developments, issues, and future directions.

MAST track: The research area of media content analytics has been traditionally used to refer to applications involving inference of higher-level semantics from multimedia content. However, multimedia is typically created for human consumption, and we believe it is necessary to adopt a human-centered approach to this analysis, which would not only enable a better understanding of how viewers engage with content but also how they impact each other in the process.

### FATE/MM 20: 2nd International Workshop on Fairness, Accountability, Transparency and Ethics in MultiMedia

• Xavier Alameda-Pineda
• Miriam Redi
• Jahna Otterbacher
• Nicu Sebe
• Shih-Fu Chang

The series of FAT/FAccT events aims at bringing together researchers and practitioners interested in fairness, accountability, transparency and ethics of computational methods. The FATE/MM workshop focuses on addressing these issues in the Multimedia field. Multimedia computing technologies operate today at an unprecedented scale, with a growing community of scientists interested in multimedia models, tools and applications. Such continued growth has great implications not only for the scientific community, but also for society as a whole. Typical risks of large-scale computational models include model bias and algorithmic discrimination. These risks become particularly prominent in the multimedia field, which historically has focused on user-centered technologies. To ensure a healthy and constructive development of the best multimedia technologies, this workshop offers a space to discuss how to develop ethical, fair, unbiased, representative, and transparent multimedia models, bringing together researchers from different areas to present computational solutions to these issues.

### HUMA'20: 1st International Workshop on Human-Centric Multimedia Analysis

• Wu Liu
• Chuang Gan
• Jingkuan Song
• Dingwen Zhang
• Wenbing Huang
• John Smith

The First International Workshop on Human-Centric Multimedia Analysis concentrates on the tasks of human-centric analysis with multimedia and multimodal information. It is one of the fundamental and challenging problems of multimedia understanding. Human-centric multimedia analysis involves multiple tasks such as face detection and recognition, human body pattern analysis, person re-identification, human action detection, person tracking, human-object interaction, and so on. Today, multiple multimedia sensing technologies and large-scale computing infrastructures are producing at a rapid velocity a wide variety of big multi-modality data for human-centric analysis, which provides rich knowledge to help tackle these challenges. Researchers have strived to push the limits of human-centric multimedia analysis in a wide variety of applications, such as intelligent surveillance, retailing, fashion design, and services. Therefore, this workshop aims to provide a platform to bridge the gap between the communities of human analysis and multimedia.

### MMSports'20: 3rd International Workshop on Multimedia Content Analysis in Sports

• Rainer Lienhart
• Thomas B. Moeslund
• Hideo Saito

The third ACM International Workshop on Multimedia Content Analysis in Sports (ACM MMSports'20) is part of the ACM International Conference on Multimedia 2020 (ACM Multimedia 2020). Exceptionally, due to the corona pandemic, the workshop is held virtually. The goal of this workshop is to bring together researchers and practitioners from academia and industry to address challenges and report progress in mining, analyzing, understanding and visualizing the multimedia/multimodal data in sports. The combination of sports and modern technology offers a novel and intriguing field of research with promising approaches for visual broadcast augmentation, understanding, statistical analysis and evaluation, and sensor fusion. There is a lack of research communities focusing on the fusion of multiple modalities. We are helping to close this research gap with this workshop series on multimedia content analysis in sports.

### MuCAI'20: 1st International Workshop on Multimodal Conversational AI

• Alex Hauptmann
• Joao Magalhaes
• Ricardo G. Sousa
• Joao Paulo Costeira

Recently, conversational systems have seen a significant rise in demand due to modern commercial applications using systems such as Amazon's Alexa, Apple's Siri, Microsoft's Cortana and Google Assistant. Research on multimodal chatbots, where users and the conversational agent communicate by natural language and visual data, is a largely underexplored area. Conversational agents are now becoming a commodity as a number of companies push for this technology. The wide use of these conversational agents exposes the many challenges in achieving more natural, human-like, and engaging conversational agents. The research community is actively addressing several of these challenges: How are visual and text data related in user utterances? How to interpret the user intent? How to encode multimodal dialog status? What are the ethical and legal aspects of conversational AI? The Multimodal Conversational AI workshop will be a forum where researchers and practitioners share their experiences and brainstorm about successes and failures in the topic. It will also promote collaboration to strengthen the conversational AI community at ACM Multimedia.

### Summary of MuSe 2020: Multimodal Sentiment Analysis, Emotion-target Engagement and Trustworthiness Detection in Real-life Media

• Lukas Stappen
• Björn Schuller
• Iulia Lefter
• Erik Cambria
• Ioannis Kompatsiaris

The first Multimodal Sentiment Analysis in Real-life Media (MuSe) 2020 was a Challenge-based Workshop held in conjunction with ACM Multimedia'20. It addresses three distinct 'in-the-wild' Sub-challenges: sentiment/emotion recognition (MuSe-Wild), emotion-target engagement (MuSe-Target) and trustworthiness detection (MuSe-Trust). A large multimedia dataset, MuSe-CaR, was used, which was specifically designed with the intention of improving machine understanding approaches of how sentiment (e.g. emotion) is linked to a topic in emotional, user-generated reviews. In this summary, we describe the motivation, the first-of-its-kind 'in-the-wild' database, the challenge conditions and the participation, and give an overview of the state-of-the-art techniques utilised.

### QoEVMA'20: 1st Workshop on Quality of Experience (QoE) in Visual Multimedia Applications

• Xinbo Gao
• Patrick Le Callet
• Jing Li
• Zhi Li
• Wen Lu
• Jiachen Yang

Nowadays, people spend dramatically more time watching videos on different devices. Advanced hardware and network technology make it possible to meet users' increasing demands on the viewing experience. Thus, enhancing the Quality of Experience of end-users in advanced multimedia is the ultimate goal of service providers, as good services would attract more consumers. Quality assessment is thus important. The first workshop on "Quality of Experience (QoE) in visual multimedia applications" (QoEVMA'20) focuses on the QoE assessment of any visual multimedia applications both subjectively and objectively. The topics include 1) QoE assessment on different visual multimedia applications, including VoD for movies, dramas, variety shows, UGC on social networks, live streaming videos for gaming/shopping/social, etc. 2) QoE assessment for different video formats in multimedia services, including 2D, stereoscopic 3D, High Dynamic Range (HDR), Augmented Reality (AR), Virtual Reality (VR), 360, Free-Viewpoint Video (FVV), etc. 3) Key performance indicators (KPI) analysis for QoE. This summary gives a brief overview of the workshop, which took place on October 16, 2020 in Seattle (U.S.), as a half-day workshop.

### SUMAC 2020: The 2nd Workshop on Structuring and Understanding of Multimedia heritAge Contents

• Valérie Gouet-Brunet
• Margarita Khokhlova
• Ronak Kosti
• Liming Chen
• Xu-Cheng Yin

SUMAC 2020 is the second edition of the workshop on Structuring and Understanding of Multimedia heritAge Contents. It is held in Seattle, USA on October 12th, 2020 and is co-located with the 28th ACM International Conference on Multimedia; this year, due to the sanitary crisis, it is organized virtually. Its objective is to present and discuss the latest and most significant trends and challenges in the analysis, structuring and understanding of multimedia contents dedicated to the valorization of heritage, with the emphasis on the unlocking of and access to the big data of the past. A representative scope of Computer Science methodologies dedicated to the processing of multimedia heritage contents and their exploitation is covered by the works presented, with the ambition of advancing and raising awareness about this fully developing research field.

## SESSION: Tutorials

### Multimedia Intelligence: When Multimedia Meets Artificial Intelligence

• Xin Wang
• Wenwu Zhu
• Yonghong Tian
• Wen Gao

Owing to the rich multimedia applications and services that have emerged in the past decade, a very large amount of multimedia data has been produced for the purpose of advanced research in multimedia. Furthermore, multimedia research has made great progress on image/video content analysis, multimedia search and recommendation, multimedia streaming, multimedia content delivery, etc. At the same time, Artificial Intelligence (AI) has undergone a "new" wave of development since being officially regarded as an academic discipline in the 1950s, which owes much to the extreme success of deep learning. Thus, one question naturally arises: What happens when multimedia meets Artificial Intelligence?

To answer this question, this tutorial disseminates and promotes the concept of Multimedia Intelligence by discussing the mutual influence between multimedia and Artificial Intelligence, which is an exciting and fast-growing research direction in both multimedia and machine learning. We will advocate novel, high-quality research findings and innovative solutions to multimedia intelligence by exploring the mutual influences between multimedia and Artificial Intelligence from two aspects: i) multimedia drives Artificial Intelligence to experience a paradigm shift towards more explainability and ii) Artificial Intelligence in turn injects new ways of thinking into multimedia research. As such, these two aspects form a loop in which multimedia and Artificial Intelligence interactively enhance each other. In this tutorial, we discuss what efforts have been made in the literature and how, and share our insights on research directions that deserve further study to produce potentially profound impact on multimedia intelligence.

This topic is at the core of the scope of ACM Multimedia, and is attractive to the MM audience from both academia and industry.

### Deep Learning for Privacy in Multimedia

• Andrea Cavallaro

We discuss the design and evaluation of machine learning algorithms that provide users with more control on the multimedia information they share. We introduce privacy threats for multimedia data and key features of privacy protection. We cover privacy threats and mitigating actions for images, videos, and motion-sensor data from mobile and wearable devices, and their protection from unwanted, automatic inferences. The tutorial offers theoretical explanations followed by examples with software developed by the presenters and distributed as open source.

### Reproducibility and Experimental Design for Machine Learning on Audio and Multimedia Data

• Gerald Friedland

This tutorial provides an actionable perspective on experimental design for machine learning experiments on multimedia data. The tutorial consists of lectures and hands-on exercises. The lectures provide an engineering introduction to machine learning design. By understanding the information flow and quantities in the scientific process, machine learners can be designed to be more efficient and their limits can be more easily understood. The thought framework presented is derived from the traditional experimental sciences, which require published results to be self-contained with regard to reproducibility. In the practical exercises, we will work on calculating and measuring quantities like Memory Equivalent Capacity or generalization ratio for different machine learners and data sets and discuss how these quantities relate to reproducible experimental design.

### Food Computing for Multimedia

• Shuqiang Jiang
• Weiqing Min

Food computing applies computational approaches for acquiring and analyzing heterogeneous food data from disparate sources for perception, recognition, retrieval, recommendation, prediction and monitoring of food to address food-related issues in multimedia and beyond. It has received increasing attention from both academia and industry as an emerging interdisciplinary field with various applications, such as improving human health and understanding culinary culture. Recently, there have been more studies on food computing in multimedia, such as food recognition and multimodal recipe analysis. This tutorial will provide a basic understanding of food computing, and discuss its use in various multimedia tasks, ranging from food recognition, retrieval, recommendation and recipe analysis to cooking behavior understanding. Specifically, we will first introduce food computing, including its methods, tasks and applications. Then we will discuss several typical tasks of food computing in multimedia, including food image recognition, food retrieval and recommendation, multimodal recipe analysis and cooking action anticipation. Finally, we will point out future research directions on food computing in multimedia.

### Active Learning for Multimedia Computing: Survey, Recent Trends and Applications

• Shayok Chakraborty

The widespread emergence and deployment of inexpensive sensors has resulted in the generation of enormous amounts of digital data in today's world. While this has expanded the possibilities of solving real world problems using computational learning frameworks, selecting the salient data samples from such huge collections of data has proved to be a significant and practical challenge. Further, to train a reliable classification model, it is important to have a large quantity of labeled training data. Manual annotation of large amounts of data is an expensive process in terms of time, labor and human expertise. This has set the stage for research in the field of active learning. Active learning algorithms automatically select the salient and exemplar instances from large quantities of unlabeled data and thereby tremendously reduce human annotation effort in training an effective classifier. It can be applied across all existing classification / regression methods and with any kind of data, thus making it a very generalizable approach. The success of active learning in several applications (such as image retrieval, image recognition) has resulted in the extension of the framework to problem settings beyond regular classification / regression. Active learning concepts have been extended to newer problem settings (such as feature selection, video summarization, matrix completion) and have also been combined with other learning paradigms such as deep learning and transfer learning. This tutorial will seek to present a comprehensive overview of active learning with a focus on multimedia computing applications, including historical perspectives, theoretical analysis and novel paradigms. The novelty of this tutorial lies in its focus on the emerging trends, algorithms and applications of active learning. It will aim at introducing concepts and open perspectives that motivate further work in this domain, ranging from fundamentals to applications and systems.
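As a concrete illustration of how an active learning algorithm can pick instances for annotation, the sketch below uses entropy-based uncertainty sampling. This is one common selection strategy among many and is not named in the abstract; the function names and the toy probabilities are assumptions of the sketch.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a class-probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_queries(unlabeled_probs, budget):
    """Pick the `budget` unlabeled ids whose predictions are most uncertain.

    unlabeled_probs maps an instance id to the current model's predicted
    class probabilities for that instance.
    """
    ranked = sorted(unlabeled_probs,
                    key=lambda i: entropy(unlabeled_probs[i]),
                    reverse=True)
    return ranked[:budget]


# Hypothetical predictions of a binary classifier on three unlabeled samples.
pool = {
    "x1": [0.5, 0.5],   # maximally uncertain
    "x2": [0.9, 0.1],   # confident
    "x3": [0.6, 0.4],   # fairly uncertain
}
print(select_queries(pool, budget=2))
```

The selected instances would then be sent to a human annotator, added to the labeled set, and the model retrained before the next selection round.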

### Immersive Imaging Technologies: From Capture to Display

• Martin Alain
• Emin Zerman
• Cagri Ozcinar

New immersive imaging technologies enable creating multimedia systems that would increase the viewer presence and provide an immersive experience. This half-day tutorial aims to give an overview of these new immersive imaging systems and help the participants understand the content creation and delivery pipeline for the immersive imaging technologies. The tutorial will go over the full imaging pipeline, from camera setup for content capture, through content compression / streaming, to content display and related perceptual studies.

### Effective and Efficient: Toward Open-world Instance Re-identification

• Zheng Wang
• Wu Liu
• Yusuke Matsui
• Shin'ichi Satoh

Instance Re-identification (ReID) systems facilitate various applications that would otherwise require painful and boring video watching. Their efficiency and effectiveness accelerate the process of video analysis. In this tutorial, we summarize ReID technologies and provide an overview. We'll introduce fundamental technologies, existing challenges, trends, etc. This tutorial would be useful for multimedia content analysis and system-level multimedia retrieval, especially for an effective and efficient open-world ReID system for the practical, large-scale, and open-set domain.

### Deep Bayesian Multimedia Learning

• Jen-Tzung Chien

Deep learning has been successfully developed as a complicated learning process from source inputs to target outputs in presence of multimedia environments. The inference or optimization is performed over an assumed deterministic model with deep structure. A wide range of temporal and spatial data in language and vision are treated as the inputs or outputs to build such a domain mapping for multimedia applications. A systematic and elaborate transfer is required to meet the mapping between source and target domains. Also, the semantic structure in natural language and computer vision may not be well represented or trained in mathematical logic or computer programs. The distribution function in discrete or continuous latent variable model for words, sentences, images or videos may not be properly decomposed or estimated. The system robustness to heterogeneous environments may not be assured. This tutorial addresses the fundamentals and advances in statistical models and neural networks for domain mapping, and presents a series of deep Bayesian solutions including variational Bayes, sampling method, Bayesian neural network, variational auto-encoder (VAE), stochastic recurrent neural network, sequence-to-sequence model, attention mechanism, end-to-end network, stochastic temporal convolutional network, temporal difference VAE, normalizing flow and neural ordinary differential equation. Enhancing the prior/posterior representation is addressed in different latent variable models. We illustrate how these models are connected and why they work for a variety of applications on complex patterns in language and vision. The word, sentence and image embeddings are merged with semantic constraint or structural information. Bayesian learning is formulated in the optimization procedure where the posterior collapse is tackled. An informative latent space is trained to incorporate deep Bayesian learning in various information systems.

## SESSION: Panels

### Coping with Pandemics: Opportunities and Challenges for AI Multimedia in the "New Normal"

• Jiaying Liu
• Wen-Huang Cheng
• Klara Nahrstedt
• Ramesh Jain
• Elisa Ricci
• Hyeran Byun

The world is welcoming the new normal - the coronavirus pandemic has significantly changed the way people live, work, communicate and learn. Almost everyone now is wearing a face mask when they go out in public. People are working from home, some taking care of children at the same time. Bars and restaurants are limited to carry-out and delivery only. Meetings and conferences go online. Schools are closed and educators are instead holding video conference classes regularly. All these have become the new normal in our ways of life. The panel thus provides a valuable opportunity for people from a variety of backgrounds to exchange views on opportunities and challenges for AI multimedia in the current and post-pandemic era.

### The World has Changed - The World Needs to Change. What Multimedia has to Offer for Our Common Digital Future

• Susanne Boll
• Hari Sundram
• Svetha Venkatesh
• Martha Larson
• Mohan Kankanhalli

It is not only the current coronavirus that has the world holding its breath. Beyond this current health crisis, the world is facing several global challenges, from climate change and environmental damage to access to clean water and food and socio-economic inequalities, to name a few. The United Nations has framed these global challenges very well in its 17 Sustainability Goals for a future in prosperity and equal opportunities for all, to be achieved by 2030. There is no one simple solution, no one easy cure in sight to address these pressing challenges of our days. Rather, a collective approach by all of us is needed, which in sum will contribute to these goals. Obviously, the field of multimedia has contributed many tools and applications that are so much in demand these days to stay connected while keeping our distance. But there is much more we can offer to our common digital future. Our future health system, global access to education, decent work, and reducing inequalities are just some of the goals to which our field can contribute. In this panel we will discuss which path we could follow.

## SESSION: Keynote Talks

### 360-Video Navigation for 360-Multimedia Delivery Systems: Research Challenges and Opportunities

• Klara Nahrstedt

With the emergence of new 360-degree cameras, ambisonic microphones, and VR/AR display devices, more diverse multi-modal content has become available, and with it has come the demand for streaming 360-degree videos to enhance users' 360-multimedia experience on mobile devices such as mobile phones and head-mounted displays (HMDs). The big issue for mobile 360-multimedia delivery systems is the huge resource demand placed on the underlying networks and devices to deliver 360-multimedia content with high quality of experience. In this talk, we will discuss the research challenges of 360-degree video delivery systems, such as large bandwidth, low latency, users' disorientation, and cyber-sickness, and opportunities to solve these challenges, including rate adaptation algorithms for tiled videos, view prediction algorithms, content navigation, enhancement of DASH streaming for 360-videos, and control of Quality of Experience (QoE) [1]. We will briefly dive into the concept of navigation graphs for 360-degree videos and present the opportunity to use navigation graphs to organize 360-video content in ways that can help with viewing navigation, caching, and improvements of QoE [2]. We will show how navigation graphs serve as models for viewing behaviors in the temporal and spatial domains, and can assist with view prediction and with bandwidth and latency control. Our experimental results are encouraging [3] and support the intuition that if we can encapsulate viewing patterns of 360-degree videos into navigation graphs at multiple levels of contextual detail, we will be able to stream "need-to-see" 360-content to wireless HMD devices in a timely manner within bandwidth-constrained environments, and enhance the viewing quality of experience of 360-degree videos in augmented reality applications.
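To make the idea of a navigation graph concrete, here is a minimal sketch, assuming that each viewing trace is a sequence of hypothetical view labels (one per video segment) and that the graph simply counts observed transitions between views. The function names, the labels, and the most-frequent-successor prediction rule are illustrative assumptions and are not taken from the talk or its references.

```python
from collections import defaultdict

def build_navigation_graph(traces):
    """Count view-to-view transitions across all users' viewing traces.

    Each trace is a list of view labels, one per video segment (assumed format).
    Returns graph[current_view][next_view] = number of observed transitions.
    """
    graph = defaultdict(lambda: defaultdict(int))
    for trace in traces:
        for current, nxt in zip(trace, trace[1:]):
            graph[current][nxt] += 1
    return graph

def predict_next_view(graph, current):
    """Predict the next view as the most frequently observed successor."""
    successors = graph.get(current)
    if not successors:
        # No history for this view: assume the viewer keeps looking the same way.
        return current
    return max(successors, key=successors.get)

# Hypothetical traces from three viewers of the same video.
traces = [
    ["front", "front", "left"],
    ["front", "left", "left"],
    ["front", "left", "back"],
]
graph = build_navigation_graph(traces)
predicted = predict_next_view(graph, "front")
unseen = predict_next_view(graph, "back")
```

A streaming client could use such a prediction to prefetch the tiles for the most likely next view at high quality, which is one way a graph of this kind might assist bandwidth and latency control.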

### Cloud Drive Apps - Closing the Gap Between AI Research to Practice

• Itamar Friedman

In the past few years, Cloud Drive Apps have attracted increasing interest from end-users and enterprise customers. During this period, numerous artificial intelligence based features were introduced, such as functions enabling users to intelligently organize, search, share, edit, and recreate content with their images and videos. In this talk, I will introduce our latest work on highly efficient image understanding, which aims to bring various novel methods (such as neural architecture search [1,2] and advanced training techniques [3,4]) into practice in Cloud Drive App use cases. I will discuss use cases such as image search through free-text queries, focusing on difficult real-world problems and suggested solutions. I will also demonstrate the usefulness of the proposed techniques when applied to public competitions.

### Building Digital Human

• Dong Yu

Digital humans find applications in areas such as virtual companions, virtual reporters, and virtual narrators. As the global trend of digitalization continues, the value of digital humans continues to increase. For example, a virtual teacher may mimic human teachers to deliver personalized education to students spread all over the world at a lower cost. There are many technical difficulties yet to be solved to make digital humans truly valuable. In this talk, I report our recent progress on addressing two of these difficulties: multi-modal text-to-speech synthesis and multi-modal voice separation and recognition. To address the multi-modal text-to-speech synthesis problem, we developed the duration informed attention network (DurIAN) [1]. DurIAN enhances the attention-based alignment in state-of-the-art (SOTA) end-to-end speech synthesis systems such as Tacotron2 [2] with duration information estimated from the rich text input. This technology, while generating high-quality natural speech, avoids common pitfalls of pure end-to-end systems such as repeated and missing words. More importantly, the system can easily align the facial representation and synthesized speech through the duration model. To drive the facial expression and mouth movement more robustly, we developed a 3D-model guided framework for multi-modal synthesis. To solve the multi-modal voice separation and recognition problem, which is needed in many scenarios such as a virtual receptionist, we developed an all deep learning beamformer [3] which integrates the conventional minimum variance distortionless response (MVDR) beamformer, the recurrent neural network-based statistics estimator, and the visual cue guided speaker tracing and diarization system [4]. Our novel approach significantly improved the quality of the separated speech.
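For readers unfamiliar with the conventional MVDR beamformer mentioned above, its standard textbook form is sketched below. The notation is an assumption introduced here for illustration and is not taken from the talk: $\mathbf{y}(t,f)$ is the multi-channel microphone signal, $\mathbf{d}(f)$ the steering vector of the target speaker, and $\boldsymbol{\Phi}_{nn}(f)$ the noise spatial covariance matrix.

```latex
% MVDR: minimize output noise power subject to a distortionless target response
\mathbf{w}_{\mathrm{MVDR}}(f)
  = \arg\min_{\mathbf{w}} \; \mathbf{w}^{H} \boldsymbol{\Phi}_{nn}(f)\, \mathbf{w}
  \quad \text{s.t.} \quad \mathbf{w}^{H} \mathbf{d}(f) = 1

% Closed-form solution and the resulting beamformed output
\mathbf{w}_{\mathrm{MVDR}}(f)
  = \frac{\boldsymbol{\Phi}_{nn}^{-1}(f)\, \mathbf{d}(f)}
         {\mathbf{d}^{H}(f)\, \boldsymbol{\Phi}_{nn}^{-1}(f)\, \mathbf{d}(f)},
\qquad
\hat{s}(t,f) = \mathbf{w}_{\mathrm{MVDR}}^{H}(f)\, \mathbf{y}(t,f)
```

In this formulation the quality of the output depends on how well the covariance statistics are estimated, which is presumably where a learned statistics estimator such as the one described in the abstract would come in.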

### Neural Network Design for Multimedia: Bio-inspired and Hardware-friendly

• Shuicheng Yan

Neural network architecture design has played a central role in the recent rapid development of multimedia technology. In this talk, I mainly introduce research and development efforts in designing neural networks along two orthogonal lines: 1) how these neural network models are bio-inspired, e.g., the 1x1 convolution simulates the function of the cell body of a neuron, and 2) how these models are made more hardware-friendly or motivate the next generation of AI chips, e.g., the selective convolution calls for new hardware designs. These two lines of effort jointly enhance the overall efficiency of multimedia systems.
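As a small illustration of the 1x1 convolution mentioned above, the sketch below shows that it amounts to an independent weighted sum over the input channels at every pixel, which is loosely comparable to a neuron integrating its inputs. The function name, data layout, and example values are assumptions made for illustration; the code is plain Python rather than any particular framework's API.

```python
from typing import List

Image = List[List[List[float]]]  # indexed as [height][width][channel]

def conv1x1(image: Image, weights: List[List[float]], bias: List[float]) -> Image:
    """Apply a 1x1 convolution: a per-pixel linear map across channels.

    weights has shape [out_channels][in_channels]; bias has shape [out_channels].
    No spatial neighbors are mixed, only the channels of the same pixel.
    """
    output = []
    for row in image:
        out_row = []
        for pixel in row:
            out_row.append([
                sum(w * x for w, x in zip(w_row, pixel)) + b
                for w_row, b in zip(weights, bias)
            ])
        output.append(out_row)
    return output

# A 1x2 image with 2 input channels, mapped to 3 output channels.
image = [[[1.0, 2.0], [3.0, 4.0]]]
weights = [[1.0, 1.0], [0.5, -0.5], [0.0, 2.0]]
bias = [0.0, 1.0, -1.0]
result = conv1x1(image, weights, bias)
```

Because each output value depends only on the channels of one pixel, the operation changes the channel dimension (here from 2 to 3) while leaving the spatial size untouched.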