MM '20: Proceedings of the 28th ACM International Conference on Multimedia - Part 1

SESSION: Oral Session A1: Deep Learning for Multimedia

Image Inpainting Based on Multi-frequency Probabilistic Inference Model

  • Jin Wang
  • Chen Wang
  • Qingming Huang
  • Yunhui Shi
  • Jian-Feng Cai
  • Qing Zhu
  • Baocai Yin

Image inpainting methods usually fail to reconstruct reasonable structure and fine-grained texture simultaneously. This paper handles this problem from a novel perspective of predicting low-frequency semantic structural contents and high-frequency detailed textures separately, and proposes a multi-frequency probabilistic inference model (MPI model) to predict the multi-frequency information of missing regions by estimating the parametric distribution of multi-frequency features over the corresponding latent spaces. First, in order to extract the information of different frequencies without interference, a wavelet transform is used to decompose the input image into a low-frequency subband and high-frequency subbands. Second, an MPI model is designed to estimate the underlying multi-frequency distribution of input images. With this model, a closer approximation to the true posterior distribution can be constrained and the maximum-likelihood assignment can be approximated. Finally, based on the proposed MPI model, a two-path network consisting of an inference network (InferenceNet) and a generation network (GenerationNet) is trained in parallel to enforce the consistency of global structure and local texture between the generated image and the ground truth. We qualitatively and quantitatively compare our method with other state-of-the-art methods on the Paris StreetView, CelebA, CelebAMask-HQ and Places2 datasets. The results show the superior performance of our method, especially in terms of realistic texture details and semantic structural consistency.
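As a rough illustration of the wavelet step described above, the following sketch applies a one-level 2D Haar transform to split an image into one low-frequency subband and three high-frequency subbands. The choice of the Haar wavelet, the function name `haar_dwt2`, and the list-of-lists image format are assumptions made for this example; the abstract does not state which wavelet the authors use.

```python
def haar_dwt2(image):
    """One-level orthonormal 2D Haar transform (illustrative sketch).

    `image` is a list of rows of numbers with even height and width.
    Returns (low, (horizontal, vertical, diagonal)), each half the size.
    """
    rows, cols = len(image), len(image[0])
    if rows % 2 or cols % 2:
        raise ValueError("height and width must be even")

    low, horiz, vert, diag = [], [], [], []
    for r in range(0, rows, 2):
        low_row, h_row, v_row, d_row = [], [], [], []
        for c in range(0, cols, 2):
            # The 2x2 block:  a b
            #                 d e
            a, b = image[r][c], image[r][c + 1]
            d, e = image[r + 1][c], image[r + 1][c + 1]
            low_row.append((a + b + d + e) / 2.0)  # local average (low frequency)
            h_row.append((a - b + d - e) / 2.0)    # change along each row
            v_row.append((a + b - d - e) / 2.0)    # change along each column
            d_row.append((a - b - d + e) / 2.0)    # diagonal change
        low.append(low_row)
        horiz.append(h_row)
        vert.append(v_row)
        diag.append(d_row)
    return low, (horiz, vert, diag)


if __name__ == "__main__":
    low, (horiz, vert, diag) = haar_dwt2([[1, 2], [3, 4]])
    print(low, horiz, vert, diag)
```

A flat region produces zeros in all three high-frequency subbands, which is why structure and texture can be handled on separate paths after such a decomposition.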

Dual Adversarial Network for Unsupervised Ground/Satellite-to-Aerial Scene Adaptation

  • Jianzhe Lin
  • Lichao Mou
  • Tianze Yu
  • Xiaoxiang Zhu
  • Z. Jane Wang

Recent domain adaptation work tends to obtain a unified representation in an adversarial manner through joint learning of the domain discriminator and feature generator. However, this domain adversarial approach could yield sub-optimal performance for two potential reasons: First, it might fail to consider the task at hand when matching the distributions between the domains. Second, it generally treats the source and target domain data in the same way. In our opinion, the source domain data which serves the feature adaptation purpose should be supplementary, whereas the target domain data mainly needs to consider the task-specific classifier. Motivated by this, we propose a dual adversarial network for domain adaptation, where two adversarial learning processes are conducted iteratively, in correspondence with the feature adaptation and the classification task respectively. The efficacy of the proposed method is first demonstrated on the Visual Domain Adaptation (VisDA) 2017 challenge, and then on two newly proposed Ground/Satellite-to-Aerial Scene adaptation tasks. For the proposed tasks, the data for the same scene is collected not only by a traditional camera on the ground, but also by satellite from outer space and by unmanned aerial vehicle (UAV) at high altitude. Since the semantic gap between the ground/satellite scene and the aerial scene is much larger than that between ground scenes, the newly proposed tasks are more challenging than traditional domain adaptation tasks. The datasets and code can be found at

Adversarial Bipartite Graph Learning for Video Domain Adaptation

  • Yadan Luo
  • Zi Huang
  • Zijian Wang
  • Zheng Zhang
  • Mahsa Baktashmotlagh

Domain adaptation techniques, which focus on adapting models between distributionally different domains, are rarely explored in the video recognition area due to the significant spatial and temporal shifts across the source (i.e., training) and target (i.e., test) domains. As such, recent works on visual domain adaptation which leverage adversarial learning to unify the source and target video representations and strengthen the feature transferability are not highly effective on videos. To overcome this limitation, in this paper, we learn a domain-agnostic video classifier instead of learning domain-invariant representations, and propose an Adversarial Bipartite Graph (ABG) learning framework which directly models the source-target interactions with a network topology of the bipartite graph. Specifically, the source and target frames are sampled as heterogeneous vertices while the edges connecting the two types of nodes measure the affinity among them. Through message-passing, each vertex aggregates the features from its heterogeneous neighbors, forcing the features coming from the same class to be mixed evenly. Explicitly exposing the video classifier to such cross-domain representations at the training and test stages makes our model less biased to the labeled source data, which in turn results in better generalization on the target domain. The proposed framework is agnostic to the choice of frame aggregation, and therefore, four different aggregation functions are investigated for capturing appearance and temporal dynamics. To further enhance the model capacity and test the robustness of the proposed architecture on difficult transfer tasks, we extend our model to work in a semi-supervised setting using an additional video-level bipartite graph. Extensive experiments conducted on four benchmark datasets demonstrate the effectiveness of the proposed approach over the state-of-the-art methods on the task of video recognition.
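The message-passing idea above can be sketched as follows: every source vertex aggregates features from target vertices, and vice versa, weighted by an affinity score. This is only a minimal sketch under stated assumptions. The dot-product affinity, the softmax normalization, the `mix` coefficient, and all function names are hypothetical choices for illustration, not details given in the abstract, and the adversarial component is omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def aggregate_from_other_side(own, other):
    """For each vertex in `own`, return the affinity-weighted mean of `other`."""
    dim = len(other[0])
    messages = []
    for u in own:
        # Edge weights: assumed dot-product affinity, normalized per vertex.
        weights = softmax([dot(u, v) for v in other])
        messages.append(
            [sum(w * v[k] for w, v in zip(weights, other)) for k in range(dim)]
        )
    return messages


def bipartite_message_pass(source, target, mix=0.5):
    """One round of message passing across the source-target bipartite graph."""
    src_msg = aggregate_from_other_side(source, target)
    tgt_msg = aggregate_from_other_side(target, source)
    new_source = [
        [(1 - mix) * a + mix * m for a, m in zip(u, msg)]
        for u, msg in zip(source, src_msg)
    ]
    new_target = [
        [(1 - mix) * a + mix * m for a, m in zip(v, msg)]
        for v, msg in zip(target, tgt_msg)
    ]
    return new_source, new_target


if __name__ == "__main__":
    print(bipartite_message_pass([[1.0, 0.0]], [[0.0, 1.0]]))
```

Because edges only connect vertices of different domains, each updated feature is a blend of source and target information, which is the cross-domain mixing the abstract describes.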

Give Me Something to Eat: Referring Expression Comprehension with Commonsense Knowledge

  • Peng Wang
  • Dongyang Liu
  • Hui Li
  • Qi Wu

Conventional referring expression comprehension (REF) assumes that people query something from an image by describing its visual appearance and spatial location, but in practice, we often ask for an object by describing its affordance or other non-visual attributes, especially when we do not have a precise target. For example, sometimes we say 'Give me something to eat'. In this case, we need to use commonsense knowledge to identify the objects in the image. Unfortunately, there is no existing referring expression dataset reflecting this requirement, not to mention a model to tackle this challenge. In this paper, we collect a new referring expression dataset, called KB-Ref, containing 43k expressions on 16k images. In KB-Ref, to answer each expression (detect the target object referred to by the expression), at least one piece of commonsense knowledge is required. We then test state-of-the-art (SoTA) REF models on KB-Ref, finding that all of them present a large drop compared to their outstanding performance on general REF datasets. We also present an expression conditioned image and fact attention (ECIFA) network that extracts information from correlated image regions and commonsense knowledge facts. Our method leads to a significant improvement over SoTA REF models, although there is still a gap between this strong baseline and human performance. The dataset and baseline models are available at:

Single Image De-noising via Staged Memory Network

  • Weijiang Yu
  • Jian Liang
  • Lu Li
  • Nong Xiao

Single image de-noising is an important yet under-explored task to estimate the underlying clean image from its noisy observation. It poses great challenges in balancing over-de-noising (e.g., mistakenly removing texture details in noise-free regions) and under-de-noising (e.g., leaving noisy points). Existing works solely treat the removal of noise from images as a process of pixel-wise regression and fail to preserve image details. In this paper, we first propose a Staged Memory Network (SMNet) consisting of a noise memory stage and an image memory stage for explicitly exploring the staged memories of our network in single image de-noising with different noise levels. Specifically, the noise memory stage reveals noise characteristics by using local-global spatial dependencies via an encoder-decoder sub-network composed of dense blocks and noise-aware blocks. Taking the residual between the noisy input image and the prediction of the noise memory stage as input, the image memory stage then produces a noise-free and well-reconstructed output image via a contextual fusion sub-network with contextual blocks and a fusion block. Solid and comprehensive experiments on three tasks (i.e., synthetic and real data, and blind de-noising) demonstrate that our SMNet achieves significantly better performance than state-of-the-art methods by cleaning noisy images with various densities, scales and intensities while keeping the image details of noise-free regions well-preserved. Moreover, interpretability analysis is added to further demonstrate the ability of our composed memory stages.

Self-supervised Dance Video Synthesis Conditioned on Music

  • Xuanchi Ren
  • Haoran Li
  • Zijian Huang
  • Qifeng Chen

We present a self-supervised approach with pose perceptual loss for automatic dance video generation. Our method can produce a realistic dance video that conforms to the beats and rhymes of given music. To achieve this, we first generate a human skeleton sequence from music and then apply the learned pose-to-appearance mapping to generate the final video. In the stage of generating skeleton sequences, we utilize two discriminators to capture different aspects of the sequence and propose a novel pose perceptual loss to produce natural dances. We also provide a new cross-modal evaluation metric to evaluate the dance quality, which is able to estimate the similarity between two modalities (music and dance). Finally, our qualitative and quantitative experimental results demonstrate that our dance video synthesis approach produces realistic and diverse results. Our source code and data are available at

SESSION: Oral Session B1: Deep Learning for Multimedia

Dynamic GCN: Context-enriched Topology Learning for Skeleton-based Action Recognition

  • Fanfan Ye
  • Shiliang Pu
  • Qiaoyong Zhong
  • Chao Li
  • Di Xie
  • Huiming Tang

Graph Convolutional Networks (GCNs) have attracted increasing interest for the task of skeleton-based action recognition. The key lies in the design of the graph structure, which encodes skeleton topology information. In this paper, we propose Dynamic GCN, in which a novel convolutional neural network named Context-encoding Network (CeN) is introduced to learn skeleton topology automatically. In particular, when learning the dependency between two joints, contextual features from the remaining joints are incorporated in a global manner. CeN is extremely lightweight yet effective, and can be embedded into a graph convolutional layer. By stacking multiple CeN-enabled graph convolutional layers, we build Dynamic GCN. Notably, as a merit of CeN, dynamic graph topologies are constructed for different input samples as well as graph convolutional layers of various depths. Besides, three alternative context modeling architectures are explored, which may serve as a guideline for future research on graph topology learning. CeN brings only ~7% extra FLOPs for the baseline model, and Dynamic GCN achieves better performance with 2x to 4x fewer FLOPs than existing methods. By further combining static physical body connections and motion modalities, we achieve state-of-the-art performance on three large-scale benchmarks, namely NTU-RGB+D, NTU-RGB+D 120 and Skeleton-Kinetics.

Meta Parsing Networks: Towards Generalized Few-shot Scene Parsing with Adaptive Metric Learning

  • Peike Li
  • Yunchao Wei
  • Yi Yang

Recent progress in few-shot segmentation usually aims at performing novel object segmentation using a few annotated examples as guidance. In this work, we advance this few-shot segmentation paradigm towards a more challenging yet general scenario, i.e., Generalized Few-shot Scene Parsing (GFSP). In this task, we take a fully annotated image as guidance to segment all pixels in a query image. Our goal is to study a generalizable and robust segmentation network from the meta-learning perspective so that both seen and unseen categories can be correctly recognized. Different from previous practices, this task performs segmentation on a joint label space consisting of both previously seen and novel categories. Moreover, pixels from these multiple categories need to be simultaneously taken into account, which has not been well explored before. Accordingly, we present Meta Parsing Networks (MPNet) to better exploit the guidance information in the support set. Our MPNet contains two basic modules, i.e., the Adaptive Deep Metric Learning (ADML) module and the Contrastive Inter-class Distraction (CID) module. Specifically, the ADML takes the annotated pixels from the support image as the guidance and adaptively produces high-quality prototypes for learning a deep comparison metric. In addition, MPNet introduces the CID module, which learns to enlarge the feature discrepancy of different categories in the embedding space, leading MPNet to generate more discriminative feature embeddings. We conduct experiments on two newly constructed benchmarks, i.e., GFSP-Cityscapes and GFSP-Pascal-Context. Extensive ablation studies demonstrate the effectiveness and generalization ability of our MPNet.

CODAN: Counting-driven Attention Network for Vehicle Detection in Congested Scenes

  • Wei Li
  • Zhenting Wang
  • Xiao Wu
  • Ji Zhang
  • Qiang Peng
  • Hongliang Li

Although recent object detectors have shown excellent performance for vehicle detection, they perform poorly in scenarios with a relatively large number of vehicles. In this paper, we explore dense vehicle detection given the number of vehicles. Existing crowd counting methods cannot be directly applied to dense vehicle detection due to the insufficient description provided by the density map and the lack of an effective constraint for mining the spatial awareness of dense vehicles. Inspired by these observations, a conceptually simple yet efficient framework, called CODAN, is proposed for dense vehicle detection. The proposed approach is composed of three major components: (i) an efficient strategy for generating multi-scale density maps (MDM) is designed to represent the vehicle counting, which can capture the global semantics and spatial information of dense vehicles; (ii) a multi-branch attention module (MAM) is proposed to bridge the gap between the object counting and vehicle detection frameworks; (iii) with the well-designed density maps as explicit supervision, an effective counting-awareness loss (C-Loss) is employed to guide the attention learning by building a pixel-level constraint. Extensive experiments conducted on four benchmark datasets demonstrate that the proposed method outperforms the state-of-the-art methods. The impressive results indicate that vehicle detection and counting can be mutually supportive, which is an important and meaningful finding.

Webly Supervised Image Classification with Metadata: Automatic Noisy Label Correction via Visual-Semantic Graph

  • Jingkang Yang
  • Weirong Chen
  • Litong Feng
  • Xiaopeng Yan
  • Huabin Zheng
  • Wayne Zhang

Webly supervised learning has recently become attractive for its efficiency in data expansion without expensive human labeling. However, adopting search queries or hashtags as web labels of images for training brings massive noise that degrades the performance of DNNs. In particular, due to the semantic confusion of query words, the images retrieved by one query may contain a large number of images belonging to other concepts. For example, searching 'tiger cat' on Flickr will return predominantly tiger images rather than cat images. These realistic noisy samples usually have clear visual semantic clusters in the visual space that mislead DNNs from learning accurate semantic labels. To correct real-world noisy labels, expensive human annotations seem indispensable. Fortunately, we find that metadata can provide extra knowledge to discover clean web labels in a labor-free fashion, making it feasible to automatically provide correct semantic guidance among the massive label-noisy web data. In this paper, we propose an automatic label corrector, VSGraph-LC, based on the visual-semantic graph. VSGraph-LC starts from anchor selection referring to the semantic similarity between metadata and correct label concepts, and then propagates correct labels from anchors on a visual graph using a graph neural network (GNN). Experiments on the realistic webly supervised learning datasets Webvision-1000 and NUS-81-Web show the effectiveness and robustness of VSGraph-LC. Moreover, VSGraph-LC reveals its advantage on the open-set validation set.

CRSSC: Salvage Reusable Samples from Noisy Data for Robust Learning

  • Zeren Sun
  • Xian-Sheng Hua
  • Yazhou Yao
  • Xiu-Shen Wei
  • Guosheng Hu
  • Jian Zhang

Due to the existence of label noise in web images and the high memorization capacity of deep neural networks, training deep fine-grained (FG) models directly on web images tends to yield inferior recognition ability. In the literature, to alleviate this issue, loss correction methods try to estimate the noise transition matrix, but the inevitable false correction would cause severe accumulated errors. Sample selection methods identify clean ("easy") samples by their small losses, which can alleviate the accumulated errors. However, "hard" and mislabeled examples that can both boost the robustness of FG models are also dropped. To this end, we propose a certainty-based reusable sample selection and correction approach, termed CRSSC, for coping with label noise in training deep FG models with web images. Our key idea is to additionally identify and correct reusable samples, and then leverage them together with clean examples to update the networks. We demonstrate the superiority of the proposed approach from both theoretical and experimental perspectives.
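For context, the small-loss sample selection baseline that the abstract contrasts itself with can be sketched as below. This is not the CRSSC method; it only illustrates the generic idea of keeping the lowest-loss fraction of a batch as presumed-clean samples. The function name `select_small_loss` and the `keep_ratio` parameter are hypothetical.

```python
def select_small_loss(losses, keep_ratio):
    """Split sample indices into presumed-clean and dropped sets.

    Keeps the `keep_ratio` fraction of samples with the smallest loss.
    Returns (kept_indices, dropped_indices), both sorted ascending.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    n_keep = max(1, int(len(losses) * keep_ratio))
    # Indices ordered from smallest to largest per-sample loss.
    order = sorted(range(len(losses)), key=lambda i: losses[i])
    kept = sorted(order[:n_keep])
    dropped = sorted(order[n_keep:])
    return kept, dropped


if __name__ == "__main__":
    kept, dropped = select_small_loss([0.1, 2.3, 0.4, 1.7], keep_ratio=0.5)
    print("kept:", kept, "dropped:", dropped)
```

The dropped set is where the abstract's concern lies: it mixes truly mislabeled samples with hard but useful ones, which is the motivation for identifying reusable samples instead of discarding them all.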

Learning From Music to Visual Storytelling of Shots: A Deep Interactive Learning Mechanism

  • Jen-Chun Lin
  • Wen-Li Wei
  • Yen-Yu Lin
  • Tyng-Luh Liu
  • Hong-Yuan Mark Liao

Learning from music to visual storytelling of shots is an interesting and emerging task. It produces a coherent visual story in the form of a shot type sequence, which not only expands the storytelling potential for a song but also facilitates the automatic concert video mashup process and storyboard generation. In this study, we present a deep interactive learning (DIL) mechanism for building a compact yet accurate sequence-to-sequence model to accomplish the task. Different from the one-way transfer between a pre-trained teacher network (or ensemble network) and a student network in knowledge distillation (KD), the proposed method enables collaborative learning between an ensemble teacher network and a student network. Namely, the student network also teaches. Specifically, our method first learns a teacher network that is composed of several assistant networks to generate a shot type sequence and produce the soft target (shot types) distribution accordingly through KD. It then constructs the student network that learns from both the ground truth label (hard target) and the soft target distribution to alleviate the difficulty of optimization and improve generalization capability. As the student network gradually advances, it in turn feeds knowledge back to the assistant networks, thereby improving the teacher network in each iteration. Owing to such interactive designs, the DIL mechanism bridges the gap between the teacher and student networks and yields stronger capability for both networks. Objective and subjective experimental results demonstrate that both the teacher and student networks can generate more attractive shot sequences from music, thereby enhancing the viewing and listening experience.
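The student-side objective described above, learning from both a hard target and a soft target distribution, resembles the standard knowledge distillation loss, sketched below. This is a minimal sketch under stated assumptions: averaging the assistant networks' softened probabilities to form the ensemble soft target, the `temperature` and `alpha` parameters, and all function names are illustrative, and the feedback from student to assistants is not modeled.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled, numerically stable softmax."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_soft_target(assistant_logits, temperature):
    """Average the softened class distributions of several assistant networks."""
    probs = [softmax(logits, temperature) for logits in assistant_logits]
    n = len(probs)
    return [sum(p[k] for p in probs) / n for k in range(len(probs[0]))]


def student_loss(student_logits, hard_label, soft_target, temperature=2.0, alpha=0.5):
    """Weighted sum of hard-target and soft-target cross-entropy."""
    # Cross-entropy against the ground-truth class index (hard target).
    hard_ce = -math.log(softmax(student_logits)[hard_label])
    # Cross-entropy against the teacher's softened distribution (soft target).
    student_soft = softmax(student_logits, temperature)
    soft_ce = -sum(t * math.log(s) for t, s in zip(soft_target, student_soft))
    # The temperature**2 factor is the usual gradient-scale correction in KD.
    return alpha * hard_ce + (1 - alpha) * temperature ** 2 * soft_ce


if __name__ == "__main__":
    target = ensemble_soft_target([[2.0, 0.5, 0.1], [1.5, 1.0, 0.2]], temperature=2.0)
    print(student_loss([1.0, 0.8, 0.3], hard_label=0, soft_target=target))
```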

SESSION: Oral Session C1: Deep Learning for Multimedia

TextRay: Contour-based Geometric Modeling for Arbitrary-shaped Scene Text Detection

  • Fangfang Wang
  • Yifeng Chen
  • Fei Wu
  • Xi Li

Arbitrary-shaped text detection is a challenging task due to the complex geometric layouts of texts such as large aspect ratios, various scales, random rotations and curved shapes. Most state-of-the-art methods solve this problem from bottom-up perspectives, seeking to model a text instance of complex geometric layouts with simple local units (e.g., local boxes or pixels) and generate detections with heuristic post-processing. In this work, we propose an arbitrary-shaped text detection method, namely TextRay, which conducts top-down contour-based geometric modeling and geometric parameter learning within a single-shot anchor-free framework. The geometric modeling is carried out in a polar coordinate system with a bidirectional mapping scheme between shape space and parameter space, encoding complex geometric layouts into unified representations. For effective learning of the representations, we design a central-weighted training strategy and a content loss which builds propagation paths between geometric encodings and visual content. TextRay outputs simple polygon detections in one pass with only one NMS post-processing step. Experiments on several benchmark datasets demonstrate the effectiveness of the proposed approach. The code is available at
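To make the idea of a bidirectional mapping between shape space and parameter space in polar coordinates more concrete, here is a simplified sketch: a contour is represented by a center and the distances along K evenly spaced rays, and can be mapped back to polygon vertices. The abstract does not give TextRay's actual parameterization, so the evenly spaced rays and the names `decode_contour` and `encode_contour` are assumptions for illustration only.

```python
import math


def decode_contour(center, radii):
    """Parameter space -> shape space: ray lengths to polygon vertices."""
    cx, cy = center
    k = len(radii)
    points = []
    for i, r in enumerate(radii):
        angle = 2 * math.pi * i / k  # rays are evenly spaced around the center
        points.append((cx + r * math.cos(angle), cy + r * math.sin(angle)))
    return points


def encode_contour(center, points):
    """Shape space -> parameter space.

    Assumes `points` lie on the same evenly spaced rays used by
    decode_contour, so each ray length is the distance to the center.
    """
    cx, cy = center
    return [math.hypot(x - cx, y - cy) for x, y in points]


if __name__ == "__main__":
    polygon = decode_contour((5.0, 5.0), [1.0, 2.0, 3.0, 4.0])
    print(polygon)
    print(encode_contour((5.0, 5.0), polygon))
```

A fixed-length vector of ray distances gives every text instance the same representation size regardless of its shape, which is what allows a single-shot network to regress it directly.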

Weakly Supervised Real-time Image Cropping based on Aesthetic Distributions

  • Peng Lu
  • Jiahui Liu
  • Xujun Peng
  • Xiaojie Wang

Image cropping is an effective tool to edit and manipulate images to achieve better aesthetic quality. Most existing cropping approaches rely on a two-step paradigm where multiple candidate cropping areas are proposed initially and the optimal cropping window is determined afterwards based on some quality criteria for these candidates. The obvious disadvantage of this mechanism is its low efficiency due to the huge search space of candidate crops. To tackle this problem, a weakly supervised cropping framework is proposed, where the distribution dissimilarity between high quality images and cropped images is used to guide the coordinate predictor's training, so that ground-truth cropping windows are not required by the proposed method. Meanwhile, to improve the cropping performance, a saliency loss is also designed in the proposed framework to force the neural network to focus more on the objects of interest in the image. Under this framework, images can be cropped effectively by the trained coordinate predictor in a one-pass fashion without multiple candidate proposals, which ensures the high efficiency of the proposed system. Also, based on the proposed framework, many existing distribution dissimilarity measurements can be applied to train the image cropping system with high flexibility, such as the likelihood-based and divergence-based distribution dissimilarity measures proposed in this work. The experiments on public databases show that the proposed cropping method achieves state-of-the-art accuracy at a high computational efficiency of up to 285 FPS.

Towards Unsupervised Crowd Counting via Regression-Detection Bi-knowledge Transfer

  • Yuting Liu
  • Zheng Wang
  • Miaojing Shi
  • Shin'ichi Satoh
  • Qijun Zhao
  • Hongyu Yang

Unsupervised crowd counting is a challenging yet largely unexplored task. In this paper, we explore it in a transfer learning setting where we learn to detect and count persons in an unlabeled target set by transferring bi-knowledge learnt from regression- and detection-based models in a labeled source set. The dual source knowledge of the two models is heterogeneous and complementary as they capture different modalities of the crowd distribution. We formulate the mutual transformations between the outputs of regression- and detection-based models as two scene-agnostic transformers which enable knowledge distillation between the two models. Given the regression- and detection-based models and their mutual transformers learnt in the source, we introduce an iterative self-supervised learning scheme with regression-detection bi-knowledge transfer in the target. Extensive experiments on standard crowd counting benchmarks, ShanghaiTech, UCF_CC_50, and UCF_QNRF, demonstrate a substantial improvement of our method over other state-of-the-art methods in the transfer learning setting.

Occluded Prohibited Items Detection: An X-ray Security Inspection Benchmark and De-occlusion Attention Module

  • Yanlu Wei
  • Renshuai Tao
  • Zhangjie Wu
  • Yuqing Ma
  • Libo Zhang
  • Xianglong Liu

Security inspection often deals with a piece of baggage or suitcase where objects heavily overlap each other, resulting in unsatisfactory performance for prohibited item detection in X-ray images. In the literature, few studies and datasets have touched on this important topic. In this work, we contribute the first high-quality object detection dataset for security inspection, named the Occluded Prohibited Items X-ray (OPIXray) image benchmark. OPIXray focuses on the widely occurring prohibited item "cutter", annotated manually by professional inspectors from an international airport. The test set is further divided into three occlusion levels to better understand the performance of detectors. Furthermore, to deal with occlusion in X-ray image detection, we propose the De-occlusion Attention Module (DOAM), a plug-and-play module that can be easily inserted into, and thus improve, most popular detectors. Despite the heavy occlusion in X-ray imaging, the shape appearance of objects can be preserved well, and meanwhile different materials visually appear with different colors and textures. Motivated by these observations, our DOAM simultaneously leverages the different appearance information of the prohibited item to generate the attention map, which helps refine feature maps for general detectors. We comprehensively evaluate our module on the OPIXray dataset, and demonstrate that our module can consistently improve the performance of state-of-the-art detection methods such as SSD and FCOS, and significantly outperforms several widely used attention mechanisms. In particular, the advantages of DOAM are more significant in scenarios with higher levels of occlusion, which demonstrates its potential application in real-world inspections. The OPIXray benchmark and our model are released at

Temporally Guided Music-to-Body-Movement Generation

  • Hsuan-Kai Kao
  • Li Su

This paper presents a neural network model to generate a virtual violinist's 3-D skeleton movements from music audio. Improving on the conventional recurrent neural network models for generating 2-D skeleton data in previous works, the proposed model incorporates an encoder-decoder architecture, as well as the self-attention mechanism, to model the complicated dynamics in body movement sequences. To facilitate the optimization of the self-attention model, beat tracking is applied to determine effective sizes and boundaries of the training examples. The decoder is accompanied by a refining network and a bowing attack inference mechanism to emphasize the right-hand behavior and bowing attack timing. Both objective and subjective evaluations reveal that the proposed model outperforms the state-of-the-art methods. To the best of our knowledge, this work represents the first attempt to generate 3-D violinists' body movements considering key features in musical body movement.

Compositional Few-Shot Recognition with Primitive Discovery and Enhancing

  • Yixiong Zou
  • Shanghang Zhang
  • Ke Chen
  • Yonghong Tian
  • Yaowei Wang
  • José M. F. Moura

Few-shot learning (FSL) aims at recognizing novel classes given only a few training samples, which still remains a great challenge for deep learning. However, humans can easily recognize novel classes with only a few samples. A key component of such ability is the compositional recognition that humans can perform, which has been well studied in cognitive science but is not well explored in FSL. Inspired by this capability, and to imitate humans' ability to learn visual primitives and compose primitives to recognize novel classes, we propose an approach to FSL that learns a feature representation composed of important primitives, which is jointly trained with two parts, i.e., primitive discovery and primitive enhancing. In primitive discovery, we focus on learning primitives related to object parts by self-supervision from the order of image splits, avoiding extra laborious annotations and alleviating the effect of semantic gaps. In primitive enhancing, inspired by current studies on the interpretability of deep networks, we provide our composition view for the FSL baseline model. To modify this model for effective composition, inspired by both mathematical deduction and biological studies (the Hebbian Learning rule and the Winner-Take-All mechanism), we propose a soft composition mechanism that enlarges the activation of important primitives while reducing that of others, so as to enhance the influence of important primitives and better utilize these primitives to compose novel classes. Extensive experiments on public benchmarks are conducted on both few-shot image classification and video recognition tasks. Our method achieves state-of-the-art performance on all these datasets and shows better interpretability.

SESSION: Oral Session D1: Deep Learning for Multimedia

InteractGAN: Learning to Generate Human-Object Interaction

  • Chen Gao
  • Si Liu
  • Defa Zhu
  • Quan Liu
  • Jie Cao
  • Haoqian He
  • Ran He
  • Shuicheng Yan

Compared with the widely studied Human-Object Interaction Detection (HOI-DET), to the best of our knowledge no effort has been devoted to its inverse problem, i.e., generating an HOI scene image according to a given relationship triplet <human, predicate, object>. We term this new task "Human-Object Interaction Image Generation" (HOI-IG). HOI-IG is a research-worthy task with great application prospects, such as online shopping, film production and interactive entertainment. In this work, we introduce an Interact-GAN to solve this challenging task. Our method is composed of two stages: (1) manipulating the posture of a given human image conditioned on a predicate; (2) merging the transformed human image and the object image into one realistic scene image while satisfying their expected relative position and ratio. Besides, to address the large spatial misalignment issue caused by fusing the content of two images with a reasonable spatial layout, we propose a Relation-based Spatial Transformer Network (RSTN) to adaptively process the images conditioned on their interaction. Extensive experiments on two challenging datasets demonstrate the effectiveness and superiority of our approach. We advocate for the image generation community to draw more attention to the new Human-Object Interaction Image Generation problem. To facilitate future research, our project will be released at:

Category-specific Semantic Coherency Learning for Fine-grained Image Recognition

  • Shijie Wang
  • Zhihui Wang
  • Haojie Li
  • Wanli Ouyang

Existing deep learning based weakly supervised fine-grained image recognition (WFGIR) methods usually pick out the discriminative regions from the high-level feature (HLF) maps directly. However, as HLF maps are derived from spatial aggregation by convolution, which is basically a pattern matching process that applies fixed filters, they are ineffective at modeling visual contents of the same semantics but varying posture or perspective. We argue that this causes the selected discriminative regions of the same sub-category to be semantically inconsistent and thus degrades the WFGIR performance. To address this issue, we propose an end-to-end Category-specific Semantic Coherency Network (CSC-Net) to semantically align the discriminative regions of the same subcategory. Specifically, CSC-Net consists of: 1) the Local-to-Attribute Projecting Module (LPM), which automatically learns a set of latent attributes by collecting the category-specific semantic details while eliminating the varying spatial distributions from the local regions; 2) Latent Attribute Aligning (LAA), which aligns the latent attributes to specific semantics via graph convolution based on their discriminability, to achieve category-specific semantic coherency; 3) the Attribute-to-Local Resuming Module (ARM), which resumes the original Euclidean space of latent attributes and constructs latent attribute aligned feature maps by a location-embedding graph unpooling operation. Finally, the new feature maps, which implicitly apply the category-specific semantic coherency, are used for more accurate localization of discriminative regions. Extensive experiments verify that CSC-Net yields the best performance under the same settings as the most competitive approaches on the CUB Bird, Stanford-Cars, and FGVC Aircraft datasets.

Scene-Aware Context Reasoning for Unsupervised Abnormal Event Detection in Videos

  • Che Sun
  • Yunde Jia
  • Yao Hu
  • Yuwei Wu

In this paper, we propose a scene-aware context reasoning method that exploits context information from visual features for unsupervised abnormal event detection in videos, which bridges the semantic gap between visual context and the meaning of abnormal events. In particular, we build a spatio-temporal context graph to model visual context information including appearances of objects, spatio-temporal relationships among objects and scene types. The context information is encoded into the nodes and edges of the graph, and their states are iteratively updated by using multiple RNNs with message passing for context reasoning. To infer the spatio-temporal context graph in various scenes, we develop a graph-based deep Gaussian mixture model for scene clustering in an unsupervised manner. We then compute frame-level anomaly scores based on the context information to discriminate abnormal events in various scenes. Evaluations on three challenging datasets, including the UCF-Crime, Avenue, and ShanghaiTech datasets, demonstrate the effectiveness of our method.

Light Field Super-resolution via Attention-Guided Fusion of Hybrid Lenses

  • Jing Jin
  • Junhui Hou
  • Jie Chen
  • Sam Kwong
  • Jingyi Yu

This paper explores the problem of reconstructing high-resolution light field (LF) images from hybrid lenses, including a high-resolution camera surrounded by multiple low-resolution cameras. To tackle this challenge, we propose a novel end-to-end learning-based approach, which can comprehensively utilize the specific characteristics of the input from two complementary and parallel perspectives. Specifically, one module regresses a spatially consistent intermediate estimation by learning a deep multidimensional and cross-domain feature representation; the other one constructs another intermediate estimation, which maintains the high-frequency textures, by propagating the information of the high-resolution view. We finally leverage the advantages of the two intermediate estimations via the learned attention maps, leading to the final high-resolution LF image. Extensive experiments demonstrate the significant superiority of our approach over state-of-the-art ones. That is, our method not only improves the PSNR by more than 2 dB, but also preserves the LF structure much better. To the best of our knowledge, this is the first end-to-end deep learning method for reconstructing a high-resolution LF image with a hybrid input. We believe our framework could potentially decrease the cost of high-resolution LF data acquisition and also be beneficial to LF data storage and transmission. The code is available at

Trajectory Prediction in Heterogeneous Environment via Attended Ecology Embedding

  • Wei-Cheng Lai
  • Zi-Xiang Xia
  • Hao-Siang Lin
  • Lien-Feng Hsu
  • Hong-Han Shuai
  • I-Hong Jhuo
  • Wen-Huang Cheng

Trajectory prediction is a highly desirable feature for safe navigation of autonomous vehicles in complex traffic. In this paper, we consider the practical setting of predicting trajectories in a heterogeneous traffic ecology. The proposed method has various applications in trajectory prediction problems and also in applied fields beyond tracking. One challenge stands out in trajectory prediction: the heterogeneous environment. In particular, many factors must be considered in such environments, e.g., multiple types of road-agents, social interactions and terrains. The information is so complicated and large that it may result in inaccurate trajectory prediction. We propose two social and visual enforced attention modules to circumvent the problem and a variant of an Info-GAN structure to predict the trajectory with multi-modal behaviors. Experimental results show that the proposed method significantly outperforms state-of-the-art methods in both heterogeneous and homogeneous real environments.

Text-Embedded Bilinear Model for Fine-Grained Visual Recognition

  • Liang Sun
  • Xiang Guan
  • Yang Yang
  • Lei Zhang

Fine-grained visual recognition, which aims to identify subcategories of the same base-level category, is a challenging task because of its large intra-class variances and small inter-class variances. Human beings can perform object recognition tasks based on not only the visual appearance but also the knowledge from texts, as texts can point out the discriminative parts or characteristics which are always the key to distinguishing different subcategories. This is an involuntary transfer from human textual attention to visual attention, suggesting that texts are able to assist fine-grained recognition. In this paper, we propose a Text-Embedded Bilinear (TEB) model which incorporates texts as extra guidance for fine-grained recognition. Specifically, we first construct a text-embedded network to embed the text feature into the discriminative image feature learning to get an embedded feature. In addition, since the cross-layer part feature interaction and fine-grained feature learning are mutually correlated and can reinforce each other, we also extract a candidate feature from the text encoder and embed it into the inter-layer feature of the image encoder to get an embedded candidate feature. Finally, we utilize a cross-layer bilinear network to fuse the two embedded features. Compared with state-of-the-art methods on the widely used CUB-200-2011 and Oxford Flowers-102 datasets for fine-grained image recognition, the experimental results demonstrate that our TEB model achieves the best performance.

SESSION: Oral Session E1: Deep Learning for Multimedia

Learning Scales from Points: A Scale-aware Probabilistic Model for Crowd Counting

  • Zhiheng Ma
  • Xing Wei
  • Xiaopeng Hong
  • Yihong Gong

Counting people automatically through computer vision technology is a challenging task. Recently, convolutional neural network (CNN) based methods have made significant progress. Nonetheless, large scale variations of instances caused by, for example, perspective effects remain unsolved. Moreover, it is problematic to estimate scales with only point annotations. In this paper, we propose a scale-aware probabilistic model to handle this problem. Unlike previous methods that generate a single density map where instances of various scales are processed indiscriminately, we propose a density pyramid network (DPN), where each pyramid level handles instances within a particular scale range. Furthermore, we propose a scale distribution estimator (SDE) to learn scales of people from input data, under the weak supervision of point annotations. Finally, we adopt an instance-level probabilistic scale-aware model (IPSM) to guide the multi-scale training of DPN explicitly. Qualitative and quantitative experimental results demonstrate the effectiveness of the proposed method, which achieves competitive results on four widely used benchmarks.

Learning Global Structure Consistency for Robust Object Tracking

  • Bi Li
  • Chengquan Zhang
  • Zhibin Hong
  • Xu Tang
  • Jingtuo Liu
  • Junyu Han
  • Errui Ding
  • Wenyu Liu

Fast appearance variations and the distractions of similar objects are two of the most challenging problems in visual object tracking. Unlike many existing trackers that focus on modeling only the target, in this work, we consider the transient variations of the whole scene. The key insight is that the object correspondence and spatial layout of the whole scene are consistent (i.e., global structure consistency) in consecutive frames, which helps to disambiguate the target from distractors. Moreover, modeling transient variations makes it possible to localize the target under fast variations. Specifically, we propose an effective and efficient short-term model that learns to exploit the global structure consistency in a short time and thus can handle fast variations and distractors. Since short-term modeling falls short of handling occlusion and out-of-view targets, we adopt the long-short term paradigm and use a long-term model that corrects the short-term model when it drifts away from the target or the target is not present. These two components are carefully combined to achieve the balance of stability and plasticity during tracking. We empirically verify that the proposed tracker can tackle the two challenging scenarios and validate it on large scale benchmarks. Remarkably, our tracker improves state-of-the-art performance on VOT2018 from 0.440 to 0.460, GOT-10k from 0.611 to 0.640, and NFS from 0.619 to 0.629.

Campus3D: A Photogrammetry Point Cloud Benchmark for Hierarchical Understanding of Outdoor Scene

  • Xinke Li
  • Chongshou Li
  • Zekun Tong
  • Andrew Lim
  • Junsong Yuan
  • Yuwei Wu
  • Jing Tang
  • Raymond Huang

Learning on 3D scene-based point clouds has received extensive attention owing to its promising applications in many fields, and well-annotated and multisource datasets can catalyze the development of those data-driven approaches. To facilitate the research of this area, we present a richly-annotated 3D point cloud dataset for multiple outdoor scene understanding tasks and also an effective learning framework for its hierarchical segmentation task. The dataset was generated via the photogrammetric processing of unmanned aerial vehicle (UAV) images of the National University of Singapore (NUS) campus, and has been annotated point-wise with both hierarchical and instance-based labels. Based on it, we formulate a hierarchical learning problem for 3D point cloud segmentation and propose a measurement evaluating consistency across various hierarchies. To solve this problem, a two-stage method including multi-task (MT) learning and hierarchical ensemble (HE) with consistency consideration is proposed. Experimental results demonstrate the superiority of the proposed method and potential advantages of our hierarchical annotations. In addition, we benchmark results of semantic and instance segmentation, which is accessible online at with the dataset and all source code.

Instability of Successive Deep Image Compression

  • Jun-Hyuk Kim
  • Soobeom Jang
  • Jun-Ho Choi
  • Jong-Seok Lee

Successive image compression refers to the process of repeated encoding and decoding of an image. It frequently occurs during sharing, manipulation, and re-distribution of images. While deep learning-based methods have made significant progress for single-step compression, thorough analysis of their performance under successive compression has not been conducted. In this paper, we conduct comprehensive analysis of successive deep image compression. First, we introduce a new observation, instability of successive deep image compression, which is not observed in JPEG, and discuss causes of the instability. Then, we conduct a successive image compression benchmark for the state-of-the-art deep learning-based methods, and analyze the factors that affect the instability in a comparative manner. Finally, we propose a new loss function for training deep compression models, called feature identity loss, to mitigate the instability of successive deep image compression.
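
The notion of successive compression defined above (repeated encoding and decoding) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and not taken from the paper: `quantize_codec` is a toy idempotent codec, `blur_codec` is a toy stand-in for a codec whose decode-then-encode round trip is not a fixed point, and `successive_compress` and `psnr` are hypothetical helper names. It does not reproduce the benchmark or the feature identity loss.

```python
import math


def psnr(reference, test, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length signals."""
    mse = sum((r - t) ** 2 for r, t in zip(reference, test)) / len(reference)
    return math.inf if mse == 0 else 10.0 * math.log10(peak ** 2 / mse)


def quantize_codec(signal, step=16):
    """Toy lossy codec: snap each sample to the nearest multiple of `step`.

    Re-encoding its own output changes nothing, so it is idempotent.
    """
    return [float(round(s / step) * step) for s in signal]


def blur_codec(signal):
    """Toy non-idempotent codec: 3-tap moving average with edge replication.

    Each additional pass alters the signal again.
    """
    n = len(signal)
    return [
        (signal[max(i - 1, 0)] + signal[i] + signal[min(i + 1, n - 1)]) / 3.0
        for i in range(n)
    ]


def successive_compress(codec, signal, passes):
    """Apply `codec` repeatedly and return the output of every pass."""
    generations = []
    current = list(signal)
    for _ in range(passes):
        current = codec(current)
        generations.append(current)
    return generations


if __name__ == "__main__":
    original = [0.0, 10.0, 200.0, 90.0, 33.0, 120.0]
    for codec in (quantize_codec, blur_codec):
        outputs = successive_compress(codec, original, passes=5)
        print(codec.__name__, [round(psnr(original, g), 2) for g in outputs])
```

With the idempotent toy codec, every pass after the first returns the same signal, so the per-pass PSNR stays constant; the second toy codec keeps modifying the signal on each pass, which is the kind of behaviour a successive-compression benchmark is designed to expose.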

ALANET: Adaptive Latent Attention Network for Joint Video Deblurring and Interpolation

  • Akash Gupta
  • Abhishek Aich
  • Amit K. Roy-Chowdhury

Existing works address the problem of generating high frame-rate sharp videos by separately learning the frame deblurring and frame interpolation modules. Most of these approaches have a strong prior assumption that all the input frames are blurry whereas in a real-world setting, the quality of frames varies. Moreover, such approaches are trained to perform either of the two tasks - deblurring or interpolation - in isolation, while many practical situations call for both. Different from these works, we address a more realistic problem of high frame-rate sharp video synthesis with no prior assumption that the input is always blurry. We introduce a novel architecture, Adaptive Latent Attention Network (ALANET), which synthesizes sharp high frame-rate videos with no prior knowledge of whether the input frames are blurry, thereby performing the task of both deblurring and interpolation. We hypothesize that information from the latent representation of the consecutive frames can be utilized to generate optimized representations for both frame deblurring and frame interpolation. Specifically, we employ a combination of self-attention and cross-attention modules between consecutive frames in the latent space to generate an optimized representation for each frame. The optimized representation learnt using these attention modules helps the model to generate and interpolate sharp frames. Extensive experiments on standard datasets demonstrate that our method performs favorably against various state-of-the-art approaches, even though we tackle a much more difficult problem. The project page is available at

PCPL: Predicate-Correlation Perception Learning for Unbiased Scene Graph Generation

  • Shaotian Yan
  • Chen Shen
  • Zhongming Jin
  • Jianqiang Huang
  • Rongxin Jiang
  • Yaowu Chen
  • Xian-Sheng Hua

Today's scene graph generation (SGG) task is largely limited in realistic scenarios, mainly due to the extremely long-tailed bias of predicate annotation distribution. Thus, tackling the class imbalance trouble of SGG is critical and challenging. In this paper, we first discover that when predicate labels have strong correlation with each other, prevalent re-balancing strategies (e.g., re-sampling and re-weighting) will give rise to either over-fitting the tail data (e.g., bench sitting on sidewalk rather than on), or still suffering the adverse effect from the original uneven distribution (e.g., aggregating varied parked on/standing on/sitting on into on). We argue the principal reason is that re-balancing strategies are sensitive to the frequencies of predicates yet blind to their relatedness, which may play a more important role to promote the learning of predicate features. Therefore, we propose a novel Predicate-Correlation Perception Learning (PCPL for short) scheme to adaptively seek out appropriate loss weights by directly perceiving and utilizing the correlation among predicate classes. Moreover, our PCPL framework is further equipped with a graph encoder module to better extract context features. Extensive experiments on the benchmark VG150 dataset show that the proposed PCPL performs markedly better on tail classes while well-preserving the performance on head ones, which significantly outperforms previous state-of-the-art methods.

SESSION: Oral Session F1: Deep Learning for Multimedia

Discriminative Spatial Feature Learning for Person Re-Identification

  • Peixi Peng
  • Yonghong Tian
  • Yangru Huang
  • Xiangqian Wang
  • Huilong An

Person re-identification (ReID) aims to match detected pedestrian images from multiple non-overlapping cameras. Most existing methods employ a backbone CNN to extract a vectorized feature representation by performing some global pooling operations (such as global average pooling and global max pooling) on the 3D feature map (i.e., the output of the backbone CNN). Although simple and effective in some situations, the global pooling operation only focuses on the statistical properties and ignores the spatial distribution of the feature map. Hence, it cannot distinguish two feature maps when they have similar response values located in totally different positions. To handle this challenge, a novel method is proposed to learn the discriminative spatial features. Firstly, a self-constrained spatial transformer network (SC-STN) is introduced to handle the misalignments caused by detection errors. Then, based on the prior knowledge that the spatial structure of a pedestrian often remains stable along the vertical orientation of images, a novel vertical convolution network (VCN) is proposed to extract the spatial feature in the vertical direction. Extensive experimental evaluations on several benchmarks demonstrate that the proposed method achieves state-of-the-art performance while introducing only a few parameters to the backbone.
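
The limitation of global pooling described above can be checked on a toy case. This is a minimal sketch with made-up single-channel feature maps; `vertical_profile` is a hypothetical helper that simply averages each row to keep vertical position information, and is not the paper's VCN or SC-STN.

```python
def global_average_pool(feature_map):
    """Mean over all spatial positions of a 2D (single-channel) feature map."""
    values = [v for row in feature_map for v in row]
    return sum(values) / len(values)


def global_max_pool(feature_map):
    """Maximum over all spatial positions of a 2D feature map."""
    return max(v for row in feature_map for v in row)


def vertical_profile(feature_map):
    """Hypothetical descriptor: one mean per row, so the vertical position
    of each response is preserved."""
    return [sum(row) / len(row) for row in feature_map]


# Two maps with identical response values located in different positions.
response_at_top = [[9.0, 9.0], [0.0, 0.0], [0.0, 0.0]]
response_at_bottom = [[0.0, 0.0], [0.0, 0.0], [9.0, 9.0]]

if __name__ == "__main__":
    for name, fmap in (("top", response_at_top), ("bottom", response_at_bottom)):
        print(name, global_average_pool(fmap), global_max_pool(fmap),
              vertical_profile(fmap))
```

Both global pooling operators return the same value for the two maps, while the row-wise profile differs, which is the motivation for keeping spatial structure along the vertical axis.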

AdaHGNN: Adaptive Hypergraph Neural Networks for Multi-Label Image Classification

  • Xiangping Wu
  • Qingcai Chen
  • Wei Li
  • Yulun Xiao
  • Baotian Hu

Multi-label image classification is an important and challenging task in computer vision and multimedia fields. Most of the recent works only capture the pair-wise dependencies among multiple labels through statistical co-occurrence information, which cannot model the high-order semantic relations automatically. In this paper, we propose a high-order semantic learning model based on adaptive hypergraph neural networks (AdaHGNN) to boost multi-label classification performance. Firstly, an adaptive hypergraph is constructed by using label embeddings automatically. Secondly, image features are decoupled into feature vectors corresponding to each label, and hypergraph neural networks (HGNN) are employed to correlate these vectors and explore the high-order semantic interactions. In addition, multi-scale learning is used to reduce sensitivity to object size inconsistencies. Experiments are conducted on four benchmarks: MS-COCO, NUS-WIDE, Visual Genome, and Pascal VOC 2007, which cover large, medium, and small-scale categories. State-of-the-art performances are achieved on three of them. Results and analysis demonstrate that the proposed method has the ability to capture high-order semantic dependencies.

Reinforced Similarity Learning: Siamese Relation Networks for Robust Object Tracking

  • Dawei Zhang
  • Zhonglong Zheng
  • Minglu Li
  • Xiaowei He
  • Tianxiang Wang
  • Liyuan Chen
  • Riheng Jia
  • Feilong Lin

Recently, Siamese network based tracking algorithms have shown favorable performance. The latest work focuses on better feature embedding and target state estimation, which greatly improves the accuracy. Nevertheless, the simple cross-correlation operation of the features between a fixed template and the search region limits their robustness and discrimination capability. In this paper, we pay more attention to learning a strong similarity measure for robust tracking. We propose a novel relation network that can be integrated on top of previous trackers without any need for further training of the Siamese networks, and which achieves superior discriminative ability. During online inference, we utilize the feedback from high-confidence tracking results to obtain an additional template and update it, which improves the robustness and generalization. We implement two versions of the proposed approach with the SiamFC-based tracker and SiamRPN-based tracker to validate the strong compatibility of our algorithm. Extensive experimental results on several tracking benchmarks indicate that the proposed method can effectively improve the performance and robustness of the underlying trackers without reducing speed too much, and performs favorably against the state-of-the-art trackers.

Deep Structural Contour Detection

  • Ruoxi Deng
  • Shengjun Liu

Object contour detection is a fundamental preprocessing step for multimedia applications such as icon generation, object segmentation, and tracking. The quality of contour prediction is of great importance in these applications since it affects the subsequent process. In this work, we aim to develop a high-performance contour detection system. We first propose a novel yet very effective loss function for contour detection. The proposed loss function is capable of penalizing the distance of contour-structure similarity between each pair of prediction and ground-truth. Moreover, to better distinguish object contours from background textures, we introduce a novel convolutional encoder-decoder network. Within the network, we present a hyper module that captures dense connections among high-level features and produces effective semantic information. Then the information is progressively propagated and fused with low-level features. We conduct extensive experiments on the BSDS500 and Multi-Cue datasets; the results show significant improvement over the state-of-the-art competitors. We further demonstrate the benefit of our DSCD method for crowd counting.

Cross-modal Non-linear Guided Attention and Temporal Coherence in Multi-modal Deep Video Models

  • Saurabh Sahu
  • Palash Goyal
  • Shalini Ghosh
  • Chul Lee

Videos have data in multiple modalities, e.g., audio, video, text (captions). Understanding and modeling the interaction between different modalities is key for video analysis tasks like categorization, object detection, activity recognition, etc. However, data modalities are not always correlated, so learning when modalities are correlated and using that to guide the influence of one modality on the other is crucial. Another salient feature of videos is the coherence between successive frames due to continuity of video and audio, a property that we refer to as temporal coherence. We show how using non-linear guided cross-modal signals and temporal coherence can improve the performance of multi-modal machine learning (ML) models for video analysis tasks like categorization. Our experiments on the large-scale YouTube-8M dataset show how our approach significantly outperforms state-of-the-art multi-modal ML models for video categorization. The model trained on the YouTube-8M dataset also showed good performance on an internal dataset of video segments from actual Samsung TV Plus channels without retraining or fine-tuning, showing the generalization capabilities of our model.

IR-GAN: Image Manipulation with Linguistic Instruction by Increment Reasoning

  • Zhenhuan Liu
  • Jincan Deng
  • Liang Li
  • Shaofei Cai
  • Qianqian Xu
  • Shuhui Wang
  • Qingming Huang

Conditional image generation is an active research topic that includes text2image and image translation. Recently, image manipulation with linguistic instruction has brought new challenges for multimodal conditional generation. However, traditional conditional image generation models mainly focus on generating high-quality and visually realistic images, and fall short in resolving the partial consistency between image and instruction. To address this issue, we propose an Increment Reasoning Generative Adversarial Network (IR-GAN), which aims to reason about the consistency between the visual increment in images and the semantic increment in instructions. First, we introduce word-level and instruction-level instruction encoders to learn the user's intention from history-correlated instructions as the semantic increment. Second, we embed the representation of the semantic increment into that of the source image to generate the target image, where the source image plays the role of a referring auxiliary. Finally, we propose a reasoning discriminator to measure the consistency between the visual increment and the semantic increment, which purifies the user's intention and guarantees the logical soundness of the generated target image. Extensive experiments and visualizations conducted on two datasets show the effectiveness of IR-GAN.

SESSION: Oral Session G1: Deep Learning for Multimedia

Fine-Grained Similarity Measurement between Educational Videos and Exercises

  • Xin Wang
  • Wei Huang
  • Qi Liu
  • Yu Yin
  • Zhenya Huang
  • Le Wu
  • Jianhui Ma
  • Xue Wang

In online learning systems, measuring the similarity between educational videos and exercises is a fundamental task with great application potential. In this paper, we explore measuring the fine-grained similarity by leveraging multimodal information. The problem remains largely open due to several domain-specific characteristics. First, unlike general videos, educational videos contain not only graphics but also text and formulas, which have a fixed reading order. Both spatial and temporal information embedded in the frames should be modeled. Second, there are semantic associations between adjacent video segments. The semantic associations affect the similarity, and different exercises usually focus on related context of different ranges. Third, the fine-grained labeled data for training the model is scarce and costly. To tackle the aforementioned challenges, we propose VENet to measure the similarity at both video-level and segment-level by exploiting only the video-level labeled data. Extensive experimental results on real-world data demonstrate the effectiveness of VENet.

One-shot Text Field labeling using Attention and Belief Propagation for Structure Information Extraction

  • Mengli Cheng
  • Minghui Qiu
  • Xing Shi
  • Jun Huang
  • Wei Lin

Structured information extraction from document images usually consists of three steps: text detection, text recognition, and text field labeling. While text detection and text recognition have been heavily studied and improved a lot in the literature, text field labeling is less explored and still faces many challenges. Existing learning based methods for the text labeling task usually require a large amount of labeled examples to train a specific model for each type of document. However, collecting large amounts of document images and labeling them is difficult and sometimes impossible due to privacy issues. Deploying separate models for each type of document also consumes a lot of resources. Facing these challenges, we explore one-shot learning for the text field labeling task. Existing one-shot learning methods for the task are mostly rule-based and have difficulty in labeling fields in crowded regions with few landmarks and fields consisting of multiple separate text regions. To alleviate these problems, we propose a novel deep end-to-end trainable approach for one-shot text field labeling, which makes use of an attention mechanism to transfer the layout information between document images. We further apply a conditional random field on the transferred layout information for the refinement of field labeling. We collected and annotated a real-world one-shot field labeling dataset with a large variety of document types and conducted extensive experiments to examine the effectiveness of the proposed model. To stimulate research in this direction, the collected dataset and the one-shot model will be released.

Grad: Learning for Overhead-aware Adaptive Video Streaming with Scalable Video Coding

  • Yunzhuo Liu
  • Bo Jiang
  • Tian Guo
  • Ramesh K. Sitaraman
  • Don Towsley
  • Xinbing Wang

Video streaming commonly uses Dynamic Adaptive Streaming over HTTP (DASH) to deliver good Quality of Experience (QoE) to users. Videos used in DASH are predominantly encoded by single-layered video coding such as H.264/AVC. In comparison, multi-layered video coding such as H.264/SVC provides more flexibility for upgrading the quality of buffered video segments and has the potential to further improve QoE. However, there are two challenges for using SVC in DASH: (i) the complexity in designing ABR algorithms; and (ii) the negative impact of SVC's coding overhead. In this work, we propose a deep reinforcement learning method called Grad for designing ABR algorithms that take advantage of the quality upgrade mechanism of SVC. Additionally, we quantify the impact of coding overhead on the achievable QoE of SVC in DASH, and propose jump-enabled hybrid coding (HYBJ) to mitigate the impact. Through emulation, we demonstrate that Grad-HYBJ, an ABR algorithm for HYBJ learned by Grad, outperforms the best performing state-of-the-art ABR algorithm by 17% in QoE.

Efficient Adaptation of Neural Network Filter for Video Compression

  • Yat-Hong Lam
  • Alireza Zare
  • Francesco Cricri
  • Jani Lainema
  • Miska M. Hannuksela

We present an efficient finetuning methodology for neural-network filters which are applied as a postprocessing artifact-removal step in video coding pipelines. The finetuning is performed at the encoder side to adapt the neural network to the specific content that is being encoded. In order to maximize the PSNR gain and minimize the bitrate overhead, we propose to finetune only the convolutional layers' biases. The proposed method achieves convergence much faster than conventional finetuning approaches, making it suitable for practical applications. The weight update can be included in the video bitstream generated by the existing video codecs. We show that our method achieves up to 9.7% average BD-rate gain when compared to the state-of-the-art Versatile Video Coding (VVC) standard codec on 7 test sequences.
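
The idea of finetuning only the biases can be sketched on a deliberately tiny model. This is an illustrative assumption, not the paper's network or training setup: the "filter" is a single scalar weight `w` and bias `b` applied per sample, `w` stays frozen, and `finetune_bias` (a hypothetical name) runs plain gradient descent on the mean squared error with respect to `b` alone.

```python
def apply_filter(w, b, samples):
    """Toy post-processing filter: the same affine map for every sample."""
    return [w * x + b for x in samples]


def finetune_bias(w, b, decoded, reference, lr=0.1, steps=200):
    """Adapt only the bias `b` to the content; the weight `w` is frozen.

    Minimises mean((w * x + b - t) ** 2) over decoded samples x and
    reference samples t with gradient descent on b.
    """
    for _ in range(steps):
        residuals = [w * x + b - t for x, t in zip(decoded, reference)]
        grad_b = 2.0 * sum(residuals) / len(residuals)  # d(MSE)/db
        b -= lr * grad_b
    return b


if __name__ == "__main__":
    w, b0 = 2.0, 0.0                      # pretrained parameters (made up)
    decoded = [1.0, 2.0, 3.0]             # decoder output (made up)
    reference = [2.5, 4.5, 6.5]           # original content (made up)
    b_new = finetune_bias(w, b0, decoded, reference)
    print("bias update to signal:", b_new - b0)
    print("filtered:", apply_filter(w, b_new, decoded))
```

In this toy setting only the bias change would need to be transmitted alongside the content, which mirrors the motivation of keeping the weight update small, since biases are typically a small fraction of a convolutional network's parameters.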

SonoSpace: Visual Feedback of Timbre with Unsupervised Learning

  • Naoki Kimura
  • Keisuke Shiro
  • Yota Takakura
  • Hiromi Nakamura
  • Jun Rekimoto

One of the most difficult things in practicing musical instruments is improving timbre. Unlike pitch and rhythm, timbre is a high-dimensional and sensuous concept, and learners cannot evaluate their timbre by themselves. To efficiently improve their timbre control, learners generally need a teacher to provide feedback about timbre. However, hiring teachers is often expensive and sometimes difficult. Our goal is to develop a low-cost learning system that substitutes for the teacher. We found that a variational autoencoder (VAE), which is an unsupervised neural network model, provides a 2-dimensional user-friendly mapping of timbre. Our system, SonoSpace, maps the learner's timbre into a 2D latent space extracted from an advanced player's performance. Seeing this 2D latent space, the learner can visually grasp the relative distance between their timbre and that of the advanced player. Although our system was evaluated mainly with an alto saxophone, SonoSpace could also be applied to other instruments, such as trumpets, flutes, and drums.

Single Image Deraining via Scale-space Invariant Attention Neural Network

  • Bo Pang
  • Deming Zhai
  • Junjun Jiang
  • Xianming Liu

Image enhancement from degradation of rainy artifacts plays a critical role in outdoor visual computing systems. In this paper, we tackle the notion of scale that deals with visual changes in the appearance of rain streaks with respect to the camera. Specifically, we revisit multi-scale representation by scale-space theory, and propose to represent the multi-scale correlation in the convolutional feature domain, which is more compact and robust than that in the pixel domain. Moreover, to improve the modeling ability of the network, we do not treat the extracted multi-scale features equally, but design a novel scale-space invariant attention mechanism to help the network focus on parts of the features. In this way, we summarize the most activated presence of feature maps as the salient features. Extensive experimental results on synthetic and real rainy scenes demonstrate the superior performance of our scheme over state-of-the-art methods. The source code of our method can be found in:

SESSION: Oral Session H1: Emerging Multimedia Applications

Every Moment Matters: Detail-Aware Networks to Bring a Blurry Image Alive

  • Kaihao Zhang
  • Wenhan Luo
  • Björn Stenger
  • Wenqi Ren
  • Lin Ma
  • Hongdong Li

Motion-blurred images are the result of light accumulation over the period of camera exposure time, during which the camera and objects in the scene are in relative motion to each other. The inverse process of extracting an image sequence from a single motion-blurred image is an ill-posed vision problem. One key challenge is that the motions across frames are subtle, which makes it difficult for the generating networks to capture them, and thus the recovered sequences lack motion details. In order to alleviate this problem, we propose a detail-aware network with three consecutive stages to improve the reconstruction quality by addressing specific aspects in the recovery process. The detail-aware network first models the dynamics using a cycle flow loss, resolving the temporal ambiguity of the reconstruction in the first stage. Then, a GramNet is proposed in the second stage to refine subtle motion between continuous frames using Gram matrices as motion representation. Finally, we introduce a HeptaGAN in the third stage to bridge the continuous and discrete nature of exposure time and recovered frames, respectively, in order to maintain rich detail. Experiments show that the proposed detail-aware networks produce sharp image sequences with rich details and subtle motion, outperforming the state-of-the-art methods.
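
For readers unfamiliar with Gram matrices, the general definition can be sketched as follows. This shows only the standard construction (inner products between flattened feature channels); how GramNet builds or uses its Gram matrices as a motion representation is not specified in the abstract, and the `1/N` normalisation and the function name `gram_matrix` are assumptions made here.

```python
def gram_matrix(features):
    """Gram matrix of a feature stack.

    `features` is a list of C channels, each a flat list of N activations.
    Entry (i, j) is the inner product of channels i and j divided by N
    (the normalisation is an illustrative choice).
    """
    n = len(features[0])
    return [
        [sum(a * b for a, b in zip(f_i, f_j)) / n for f_j in features]
        for f_i in features
    ]


if __name__ == "__main__":
    # Two made-up channels with four spatial positions each.
    feats = [[1.0, 0.0, 2.0, 0.0],
             [0.0, 1.0, 0.0, 2.0]]
    for row in gram_matrix(feats):
        print(row)
```

The result is a symmetric C x C matrix that summarises channel co-activation and discards the spatial arrangement of the activations.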

ISIA Food-500: A Dataset for Large-Scale Food Recognition via Stacked Global-Local Attention Network

  • Weiqing Min
  • Linhu Liu
  • Zhiling Wang
  • Zhengdong Luo
  • Xiaoming Wei
  • Xiaolin Wei
  • Shuqiang Jiang

Food recognition has received more and more attention in the multimedia community for its various real-world applications, such as diet management and self-service restaurants. A large-scale ontology of food images is urgently needed for developing advanced large-scale food recognition algorithms, as well as for providing the benchmark dataset for such algorithms. To encourage further progress in food recognition, we introduce the dataset ISIA Food-500 with 500 categories from the list in Wikipedia and 399,726 images, a more comprehensive food dataset that surpasses existing popular benchmark datasets in category coverage and data volume. Furthermore, we propose a stacked global-local attention network, which consists of two sub-networks for food recognition. One sub-network first utilizes hybrid spatial-channel attention to extract more discriminative features, and then aggregates these multi-scale discriminative features from multiple layers into a global-level representation (e.g., texture and shape information about food). The other one generates attentional regions (e.g., ingredient-relevant regions) from different regions via cascaded spatial transformers, and further aggregates these multi-scale regional features from different layers into a local-level representation. These two types of features are finally fused as a comprehensive representation for food recognition. Extensive experiments on ISIA Food-500 and two other popular benchmark datasets demonstrate the effectiveness of our proposed method, which can thus be considered a strong baseline. The dataset, code and models can be found at

An Egocentric Action Anticipation Framework via Fusing Intuition and Analysis

  • Tianyu Zhang
  • Weiqing Min
  • Ying Zhu
  • Yong Rui
  • Shuqiang Jiang

In this paper, we focus on egocentric action anticipation from videos, which enables various applications, such as helping intelligent wearable assistants understand users' needs and enhance their capabilities in the interaction process. It requires intelligent systems to observe from the perspective of the first person and predict an action before it occurs. Owing to the uncertainty of the future, it is insufficient to perform action anticipation relying on visual information alone, especially when there is a salient visual difference between past and future. In order to alleviate this problem, which we call the visual gap in this paper, we propose a novel Intuition-Analysis Integrated (IAI) framework inspired by psychological research, which mainly consists of three parts: Intuition-based Prediction Network (IPN), Analysis-based Prediction Network (APN) and Adaptive Fusion Network (AFN). To imitate the implicit intuitive thinking process, we model IPN as an encoder-decoder structure and introduce a procedural instruction learning strategy implemented by textual pre-training. On the other hand, we allow APN to process information under designed rules to imitate explicit analytical thinking, which is divided into three steps: recognition, transitions and combination. Both the procedural instruction learning strategy in IPN and the transition step of APN are crucial to improving the anticipation performance by mitigating the visual gap problem. Considering the complementarity of intuition and analysis, AFN adopts attention fusion to adaptively integrate predictions from IPN and APN to produce the final anticipation results. We conduct experiments on the largest egocentric video dataset. Qualitative and quantitative evaluation results validate the effectiveness of our IAI framework, and demonstrate the advantage of bridging the visual gap by utilizing multi-modal information, including both visual features of observed segments and sequential instructions of actions.

Multi-Person Action Recognition in Microwave Sensors

  • Diangang Li
  • Jianquan Liu
  • Shoji Nishimura
  • Yuka Hayashi
  • Jun Suzuki
  • Yihong Gong

The use of surveillance cameras for video understanding has recently raised concerns about privacy intrusion. This motivates the research community to seek potential alternatives to cameras for emerging multimedia applications. Toward this goal, a few researchers have explored the use of Wi-Fi or Bluetooth sensors for action recognition. However, the practical ability of these sensors is limited by their frequency band and by deployment inconvenience due to the separate transmitter/receiver architecture. Motivated by the same purpose of reducing privacy issues, we introduce a recent microwave sensor for multi-person action recognition in this paper. The microwave sensor works in the 77GHz ~ 80GHz band and integrates both transmitter and receiver, so it can be easily deployed for action recognition. Despite these advantages, two main challenges remain. One is the difficulty of labelling the invisible signal data with the embedded actions. The other is the difficulty of cancelling the environment noise for highly accurate action recognition. To address these challenges, we propose a novel learning framework with original loss functions designed around weakly-supervised multi-label learning and an attention mechanism to improve the accuracy of action recognition. We build a new microwave sensor data set, and conduct comprehensive experiments to evaluate the recognition accuracy of our proposed framework and the effectiveness of the parameters in each component. The experimental results show that our framework outperforms the state-of-the-art methods by up to 14% in terms of mAP.

Coupling Deep Textural and Shape Features for Sketch Recognition

  • Qi Jia
  • Xin Fan
  • Meiyu Yu
  • Yuqing Liu
  • Dingrong Wang
  • Longin Jan Latecki

Recognizing freehand sketches with high arbitrariness is such a great challenge that the automatic recognition rate has reached a ceiling in recent years. In this paper, we explicitly explore the shape properties of sketches, which has almost been neglected before in the context of deep learning, and propose a sequential dual learning strategy that combines both shape and texture features. We devise a two-stage recurrent neural network to balance these two types of features. Our architecture also considers stroke orders of sketches to reduce the intra-class variations of input features. Extensive experiments on the TU-Berlin benchmark set show that our method achieves over 90% recognition rate for the first time on this task, outperforming both humans and state-of-the-art algorithms by over 19 and 7.5 percentage points, respectively. Especially, our approach can distinguish the sketches with similar textures but different shapes more effectively than recent deep networks. Based on the proposed method, we develop an on-line sketch retrieval and imitation application to teach children or adults to draw. The application is available as Sketch.Draw.

Look, Read and Feel: Benchmarking Ads Understanding with Multimodal Multitask Learning

  • Huaizheng Zhang
  • Yong Luo
  • Qiming Ai
  • Yonggang Wen
  • Han Hu

Given the massive market of advertising and the sharply increasing online multimedia content (such as videos), it is now fashionable to promote advertisements (ads) together with the multimedia content. However, manually finding relevant ads to match the provided content is labor-intensive, and hence some automatic advertising techniques have been developed. Since ads are usually hard to understand from their visual appearance alone due to the contained visual metaphor, some other modalities, such as the contained texts, should be exploited for understanding. To further improve user experience, it is necessary to understand both the ads' topic and sentiment. This motivates us to develop a novel deep multimodal multitask framework that integrates multiple modalities to achieve effective topic and sentiment prediction simultaneously for ads understanding. In particular, in our framework termed Deep$M^2$Ad, we first extract multimodal information from ads and learn high-level and comparable representations. The visual metaphor of the ad is decoded in an unsupervised manner. The obtained representations are then fed into the proposed hierarchical multimodal attention modules to learn task-specific representations for final prediction. A multitask loss function is also designed to jointly train both the topic and sentiment prediction models in an end-to-end manner, where bottom-layer parameters are shared to alleviate over-fitting. We conduct extensive experiments on a large-scale advertisement dataset and achieve state-of-the-art performance for both prediction tasks. The obtained results could be utilized as a benchmark for ads understanding.

SESSION: Oral Session A2: Emerging Multimedia Applications

Not made for each other- Audio-Visual Dissonance-based Deepfake Detection and Localization

  • Komal Chugh
  • Parul Gupta
  • Abhinav Dhall
  • Ramanathan Subramanian

We propose detection of deepfake videos based on the dissimilarity between the audio and visual modalities, termed as the Modality Dissonance Score (MDS). We hypothesize that manipulation of either modality will lead to dis-harmony between the two modalities, e.g., loss of lip-sync, unnatural facial and lip movements, etc. MDS is computed as the mean aggregate of dissimilarity scores between audio and visual segments in a video. Discriminative features are learnt for the audio and visual channels in a chunk-wise manner, employing the cross-entropy loss for individual modalities, and a contrastive loss that models inter-modality similarity. Extensive experiments on the DFDC and DeepFake-TIMIT Datasets show that our approach outperforms the state-of-the-art by up to 7%. We also demonstrate temporal forgery localization, and show how our technique identifies the manipulated video segments.
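
The aggregation step described above can be illustrated with a minimal sketch. It assumes per-chunk audio and visual feature vectors are already available and uses Euclidean distance as the dissimilarity; the function names, the distance choice, and the threshold are illustrative assumptions, not details taken from the paper.

```python
import math


def chunk_dissonance(audio_vec, visual_vec):
    """Euclidean distance between one audio-chunk and one visual-chunk feature.

    The distance metric is an assumption made for illustration.
    """
    if len(audio_vec) != len(visual_vec):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((a - v) ** 2 for a, v in zip(audio_vec, visual_vec)))


def modality_dissonance_score(audio_chunks, visual_chunks):
    """Return (mean dissimilarity over all chunks, list of per-chunk scores)."""
    if not audio_chunks or len(audio_chunks) != len(visual_chunks):
        raise ValueError("need the same non-zero number of audio and visual chunks")
    scores = [chunk_dissonance(a, v) for a, v in zip(audio_chunks, visual_chunks)]
    return sum(scores) / len(scores), scores


def flag_chunks(scores, threshold):
    """Indices of chunks whose dissimilarity exceeds a hypothetical threshold."""
    return [i for i, s in enumerate(scores) if s > threshold]


if __name__ == "__main__":
    audio = [[0.0, 0.0], [1.0, 1.0], [0.0, 0.0]]
    visual = [[0.0, 0.0], [1.0, 1.0], [3.0, 4.0]]
    mds, per_chunk = modality_dissonance_score(audio, visual)
    print(mds, per_chunk, flag_chunks(per_chunk, threshold=1.0))
```

Keeping the per-chunk scores alongside the video-level mean is what makes a simple form of temporal localization possible: chunks with unusually high dissimilarity are candidates for manipulated segments.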

Hearing like Seeing: Improving Voice-Face Interactions and Associations via Adversarial Deep Semantic Matching Network

  • Kai Cheng
  • Xin Liu
  • Yiu-ming Cheung
  • Rui Wang
  • Xing Xu
  • Bineng Zhong

Many cognitive studies have shown that humans may 'see voices' or 'hear faces', and such an ability can potentially be emulated by machine vision and intelligence. However, this research is still at an early stage. In this paper, we present a novel adversarial deep semantic matching network for efficient voice-face interactions and associations, which can learn the correspondence between voices and faces for various cross-modal matching and retrieval tasks. Within the proposed framework, we exploit a simple and efficient adversarial learning architecture to learn the cross-modal embeddings between faces and voices, which consists of two subnetworks, a generator and a discriminator. The former subnetwork is designed to adaptively discriminate the high-level semantic features between voices and faces, in which the triplet loss and multi-modal center loss are utilized in tandem to explicitly regularize the correspondences among them. The latter subnetwork is further leveraged to maximally bridge the semantic gap between the representations of voice and face data, focusing on maintaining semantic consistency. Through the joint exploitation of the above, the proposed framework can pull representations of voice-face data from the same person closer while pushing those of different persons apart. Extensive experiments empirically show that the proposed approach involves fewer parameters and calculations, adapts to various cross-modal matching tasks for voice-face data and brings substantial improvements over the state-of-the-art methods.

Multimodal Multi-Task Financial Risk Forecasting

  • Ramit Sawhney
  • Puneet Mathur
  • Ayush Mangal
  • Piyush Khanna
  • Rajiv Ratn Shah
  • Roger Zimmermann

Stock price movement and volatility prediction aim to predict stocks' future trends to help investors make sound investment decisions and model financial risk. Companies' earnings calls are a rich, underexplored source of multimodal information for financial forecasting. However, existing fintech solutions are not optimized towards harnessing the interplay between the multimodal verbal and vocal cues in earnings calls. In this work, we present a multi-task solution that utilizes domain-specialized textual features and audio attentive alignment for predictive financial risk and price modeling. Our method advances existing solutions in two aspects: 1) tailoring a deep multimodal text-audio attention model, and 2) optimizing volatility and price movement prediction in a multi-task ensemble formulation. Through quantitative and qualitative analyses, we show the effectiveness of our deep multimodal approach.

Down to the Last Detail: Virtual Try-on with Fine-grained Details

  • Jiahang Wang
  • Tong Sha
  • Wei Zhang
  • Zhoujun Li
  • Tao Mei

Virtual try-on has attracted lots of research attention due to its potential applications in e-commerce, virtual reality and fashion design. However, existing methods can hardly preserve the fine-grained details (e.g., clothing texture, facial identity, hair style, skin tone) during generation, due to the non-rigid body deformation and multi-scale details. In this work, we propose a multi-stage framework to synthesize person images, where fine-grained details can be well preserved. To address the long-range translation and rich-details generation, we propose a Tree-Block (tree dilated fusion block) to replace standard ResNet-block where applicable. Notably, multi-scale feature maps can be smoothly fused for fine-grained detail generation, by incorporating larger spatial context at multiple scales. With a delicate end-to-end training scheme, our whole framework can be jointly optimized for results with significantly better visual fidelity and richer details. Moreover, we also explore the potential application in video-based virtual try-on. By harnessing the well-trained image generator and an extra video-level adaptor, a model photo can be well animated with a driving pose sequence. Extensive evaluations on standard datasets and user study demonstrate that our proposed framework achieves the state-of-the-art results, especially in preserving visual details in clothing texture and facial identity. Our implementation is publicly available via

Temporal Denoising Mask Synthesis Network for Learning Blind Video Temporal Consistency

  • Yifeng Zhou
  • Xing Xu
  • Fumin Shen
  • Lianli Gao
  • Huimin Lu
  • Heng Tao Shen

Recently, developing temporally consistent video-based processing techniques has drawn increasing attention due to the limited extensibility of existing image-based processing algorithms (e.g., filtering, enhancement, colorization, etc.). Applying these image-based algorithms independently to each video frame typically leads to temporal flickering due to the global instability of these algorithms. In this paper, we treat enforcing temporal consistency in a video as a temporal denoising problem that removes the flickering effect in given unstable pre-processed frames. Specifically, we propose a novel model termed Temporal Denoising Mask Synthesis Network (TDMS-Net) that jointly predicts the motion mask, soft optical flow and the refining mask to synthesize temporally consistent frames. The temporal consistency is learned from the original video and the learned temporal features are applied to reprocess the output frames that are agnostic (blind) to specific image-based processing algorithms. Experimental results on two datasets for 16 different applications demonstrate that the proposed TDMS-Net significantly outperforms two state-of-the-art blind temporal consistency approaches.

A Lip Sync Expert Is All You Need for Speech to Lip Generation In the Wild

  • K R Prajwal
  • Rudrabha Mukhopadhyay
  • Vinay P. Namboodiri
  • C.V. Jawahar

In this work, we investigate the problem of lip-syncing a talking face video of an arbitrary identity to match a target speech segment. Current works excel at producing accurate lip movements on a static image or videos of specific people seen during the training phase. However, they fail to accurately morph the lip movements of arbitrary identities in dynamic, unconstrained talking face videos, resulting in significant parts of the video being out-of-sync with the new audio. We identify key reasons pertaining to this and hence resolve them by learning from a powerful lip-sync discriminator. Next, we propose new, rigorous evaluation benchmarks and metrics to accurately measure lip synchronization in unconstrained videos. Extensive quantitative evaluations on our challenging benchmarks show that the lip-sync accuracy of the videos generated by our Wav2Lip model is almost as good as real synced videos. We provide a demo video clearly showing the substantial impact of our Wav2Lip model, and also publicly release the code, models, and evaluation benchmarks on our website.

SESSION: Oral Session B2: Emotional and Social Signals in Multimedia

MEmoR: A Dataset for Multimodal Emotion Reasoning in Videos

  • Guangyao Shen
  • Xin Wang
  • Xuguang Duan
  • Hongzhi Li
  • Wenwu Zhu

Humans can perceive subtle emotions from various cues and contexts, even without hearing or seeing others. However, existing video datasets mainly focus on recognizing the emotions of the speakers from complete modalities. In this work, we present the task of multimodal emotion reasoning in videos. Beyond directly recognizing emotions from multimodal signals of target persons, this task requires a machine capable of reasoning about human emotions from the contexts and surrounding world. To facilitate the study of this task, we introduce a new dataset, MEmoR, that provides fine-grained emotion annotations for both speakers and non-speakers. The videos in MEmoR are collected from TV shows that closely resemble real-life scenarios. In these videos, while speakers may be non-visually described, non-speakers always deliver no audio-textual signals and are often visually inconspicuous. This modality-missing characteristic makes MEmoR a more practical yet challenging testbed for multimodal emotion reasoning. In support of various reasoning behaviors, the proposed MEmoR dataset provides both short-term contexts and external knowledge. We further propose an attention-based reasoning approach to model the intra-personal emotion contexts, inter-personal emotion propagation, and the personalities of different individuals. Experimental results demonstrate that our proposed approach outperforms related baselines significantly. We isolate and analyze the validity of different reasoning modules across various emotions of speakers and non-speakers. Finally, we draw forth several future research directions for multimodal emotion reasoning with MEmoR, aiming to empower high Emotional Quotient (EQ) in modern artificial intelligence systems. The code and dataset are released on

Modeling both Intra- and Inter-modal Influence for Real-Time Emotion Detection in Conversations

  • Dong Zhang
  • Weisheng Zhang
  • Shoushan Li
  • Qiaoming Zhu
  • Guodong Zhou

Over the past decade, emotion analysis in conversations has mainly been conducted in the textual scenario. Nowadays, with the popularization of speech and video communication, academia and industry have gradually become aware of the need for the multimodal scenario. Therefore, emotion detection in conversations is becoming increasingly popular not only in the natural language processing (NLP) community but also in the multimodal analysis community. Although previous studies normally argue that the emotion of the current utterance in a conversation is strongly influenced by the content of historical utterances, their speakers and emotions, they model the influence derived from the history to the current utterance at the same granularity (Intra-modal influence). Intuitively, the clues for emotion detection may not exist in the history of the same modality as the current utterance, but in the history of other modalities (Inter-modal influence). Besides, previous studies normally model the information propagation following the conversation flow. Intuitively, bidirectional modeling of information propagation in conversations provides rich clues for emotion detection. Therefore, this paper proposes a bidirectional dynamic dual influence network for real-time emotion detection in conversations, which can simultaneously model both intra- and inter-modal influence with bidirectional information propagation for the current utterance and its historical utterances. Detailed experiments demonstrate that our approach substantially advances the state-of-the-art.

Transformer-based Label Set Generation for Multi-modal Multi-label Emotion Detection

  • Xincheng Ju
  • Dong Zhang
  • Junhui Li
  • Guodong Zhou

Multi-modal utterance-level emotion detection has been a hot research topic in both the multi-modal analysis and natural language processing communities. Different from traditional single-label multi-modal sentiment analysis, typical multi-modal emotion detection is naturally a multi-label problem where an utterance often contains multiple emotions. Existing studies normally focus on multi-modal fusion only and transform multi-label emotion classification into multiple independent binary classification problems. As a result, existing studies largely ignore two kinds of important dependency information: (1) Modality-to-label dependency, where different emotions can be inferred from different modalities, that is, different modalities contribute differently to each potential emotion. (2) Label-to-label dependency, where some emotions are more likely to coexist than conflicting emotions. To simultaneously model the above two kinds of dependency, we propose a unified approach, namely the multi-modal emotion set generation network (MESGN), to generate an emotion set for an utterance. Specifically, we first employ a cross-modal transformer encoder to capture cross-modal interactions among different modalities, and a standard transformer encoder to capture temporal information for each modality-specific sequence given previous interactions. Then, we design a transformer-based discriminative decoding module equipped with modality-to-label attention to handle the modality-to-label dependency. Meanwhile, we employ a reinforced decoding algorithm with self-critic learning to handle the label-to-label dependency. Finally, we validate the proposed MESGN architecture on a word-level aligned and unaligned multi-modal dataset. Detailed experimentation shows that our proposed MESGN architecture can effectively improve the performance of multi-modal multi-label emotion detection.

CM-BERT: Cross-Modal BERT for Text-Audio Sentiment Analysis

  • Kaicheng Yang
  • Hua Xu
  • Kai Gao

Multimodal sentiment analysis is an emerging research field that aims to enable machines to recognize, interpret, and express emotion. Through cross-modal interaction, we can obtain more comprehensive emotional characteristics of the speaker. Bidirectional Encoder Representations from Transformers (BERT) is an efficient pre-trained language representation model. Fine-tuning it has obtained new state-of-the-art results on eleven natural language processing tasks such as question answering and natural language inference. However, most previous works fine-tune BERT based only on text data; how to learn a better representation by introducing multimodal information is still worth exploring. In this paper, we propose the Cross-Modal BERT (CM-BERT), which relies on the interaction of the text and audio modalities to fine-tune the pre-trained BERT model. As the core unit of CM-BERT, masked multimodal attention is designed to dynamically adjust the weight of words by combining the information of the text and audio modalities. We evaluate our method on the public multimodal sentiment analysis datasets CMU-MOSI and CMU-MOSEI. The experimental results show that it significantly improves the performance on all the metrics over previous baselines and text-only fine-tuning of BERT. Besides, we visualize the masked multimodal attention and show that it can reasonably adjust the weight of words by introducing audio modality information.

AffectI: A Game for Diverse, Reliable, and Efficient Affective Image Annotation

  • Xingkun Zuo
  • Jiyi Li
  • Qili Zhou
  • Jianjun Li
  • Xiaoyang Mao

An important application of affective image annotation is affective image content analysis, which aims to automatically understand the emotion evoked in viewers by image contents. The so-called subjective perception issue, i.e., different viewers may have different emotional responses to the same image, makes it difficult to link image features with the expected perceived emotion. Due to their ability to learn features, recent deep learning technologies have opened a new window on affective image content analysis, which has led to a growing demand for affective image annotation technologies to build large reliable training datasets. This paper proposes a novel affective image annotation technique, AffectI, for efficiently collecting diverse and reliable emotional labels with the estimated emotion distribution for images based on the concept of Game With a Purpose (GWAP). AffectI features three novel mechanisms: a selection mechanism for ensuring that all emotion words are fairly evaluated, so as to collect diverse and reliable labels; an estimation mechanism for estimating the emotion distribution by aggregating partial pairwise comparisons of the emotion words, so as to collect the labels effectively and efficiently; and an incentive mechanism that shows the comparison between the current player and her opponents as well as all past players to promote the interest of players, which also contributes to reliability and diversity. Our experimental results demonstrate that AffectI is superior to existing methods in terms of being able to collect more diverse and reliable labels. The advantage of using GWAP for reducing the frustration of evaluators was also confirmed through subjective evaluation.

Attentive One-Dimensional Heatmap Regression for Facial Landmark Detection and Tracking

  • Shi Yin
  • Shangfei Wang
  • Xiaoping Chen
  • Enhong Chen
  • Cong Liang

Although heatmap regression is considered a state-of-the-art method to locate facial landmarks, it suffers from huge spatial complexity and is prone to quantization error. To address this, we propose a novel attentive one-dimensional heatmap regression method for facial landmark localization. First, we predict two groups of 1D heatmaps to represent the marginal distributions of the x and y coordinates. These 1D heatmaps reduce spatial complexity significantly compared to current heatmap regression methods, which use 2D heatmaps to represent the joint distributions of x and y coordinates. With much lower spatial complexity, the proposed method can output high-resolution 1D heatmaps despite limited GPU memory, significantly alleviating the quantization error. Second, a co-attention mechanism is adopted to model the inherent spatial patterns existing in x and y coordinates, and therefore the joint distributions on the x and y axes are also captured. Third, based on the 1D heatmap structures, we propose a facial landmark detector capturing spatial patterns for landmark detection on an image; and a tracker further capturing temporal patterns with a temporal refinement mechanism for landmark tracking. Experimental results on four benchmark databases demonstrate the superiority of our method.
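
The memory argument above can be made concrete with a small sketch. It is illustrative only: the function names, the 68-landmark and 256-pixel figures, and the argmax decoding are assumptions for the example, and the co-attention mechanism of the paper is not modeled here.

```python
def marginals(heatmap_2d):
    """Collapse a 2D heatmap (rows indexed by y, columns by x) into two 1D heatmaps."""
    heatmap_x = [sum(column) for column in zip(*heatmap_2d)]  # sum over y for each x
    heatmap_y = [sum(row) for row in heatmap_2d]              # sum over x for each y
    return heatmap_x, heatmap_y


def decode_landmark(heatmap_x, heatmap_y):
    """Return the (x, y) indices at which each 1D heatmap peaks."""
    x = max(range(len(heatmap_x)), key=heatmap_x.__getitem__)
    y = max(range(len(heatmap_y)), key=heatmap_y.__getitem__)
    return x, y


def heatmap_cells(num_landmarks, resolution):
    """Output values needed for 2D heatmaps versus pairs of 1D heatmaps."""
    cells_2d = num_landmarks * resolution * resolution
    cells_1d = num_landmarks * 2 * resolution
    return cells_2d, cells_1d


if __name__ == "__main__":
    joint = [[0.0, 0.125],
             [0.25, 0.625]]
    hx, hy = marginals(joint)
    print(decode_landmark(hx, hy))
    print(heatmap_cells(num_landmarks=68, resolution=256))
```

Because the 1D representation grows linearly rather than quadratically with resolution, the output resolution can be raised substantially within the same memory budget, which is what reduces quantization error.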

SESSION: Oral Session C2: Media Interpretation

Domain Adaptive Person Re-Identification via Coupling Optimization

  • Xiaobin Liu
  • Shiliang Zhang

Domain adaptive person Re-Identification (ReID) is challenging owing to the domain gap and shortage of annotations on target scenarios. To handle those two challenges, this paper proposes a coupling optimization method including the Domain-Invariant Mapping (DIM) method and the Global-Local distance Optimization (GLO), respectively. Different from previous methods that transfer knowledge in two stages, the DIM achieves a more efficient one-stage knowledge transfer by mapping images in labeled and unlabeled datasets to a shared feature space. GLO is designed to train the ReID model in an unsupervised setting on the target domain. Instead of relying on existing optimization strategies designed for supervised training, GLO involves more images in distance optimization, and achieves better robustness to noisy label prediction. GLO also integrates distance optimizations in both the global dataset and local training batch, and thus exhibits better training efficiency. Extensive experiments on three large-scale datasets, i.e., Market-1501, DukeMTMC-reID, and MSMT17, show that our coupling optimization outperforms state-of-the-art methods by a large margin. Our method also works well in unsupervised training, and even outperforms several recent domain adaptive methods.

Dual-Structure Disentangling Variational Generation for Data-Limited Face Parsing

  • Peipei Li
  • Yinglu Liu
  • Hailin Shi
  • Xiang Wu
  • Yibo Hu
  • Ran He
  • Zhenan Sun

Deep learning based face parsing methods have attained state-of-the-art performance in recent years. Their superior performance heavily depends on the large-scale annotated training data. However, it is expensive and time-consuming to construct a large-scale pixel-level manually annotated dataset for face parsing. To alleviate this issue, we propose a novel Dual-Structure Disentangling Variational Generation (D2VG) network. Benefiting from the interpretable factorized latent disentanglement in VAE, D2VG can learn a joint structural distribution of facial image and its corresponding parsing map. Owing to these, it can synthesize large-scale paired face images and parsing maps from a standard Gaussian distribution. Then, we adopt both manually annotated and synthesized data to train a face parsing model in a supervised way. Since there are inaccurate pixel-level labels in synthesized parsing maps, we introduce a coarseness-tolerant learning algorithm, to effectively handle these noisy or uncertain labels. In this way, we can significantly boost the performance of face parsing. Extensive quantitative and qualitative results on HELEN, CelebAMask-HQ and LaPa demonstrate the superiority of our methods.

Accurate UAV Tracking with Distance-Injected Overlap Maximization

  • Chunhui Zhang
  • Shiming Ge
  • Kangkai Zhang
  • Dan Zeng

UAV tracking is usually challenged by the dual-dynamic disturbances that arise from not only diverse moving targets but also the moving camera, leading to a more serious model drift issue than traditional visual tracking. In this work, we propose to alleviate this issue with distance-injected overlap maximization. Our idea is to improve the accuracy of target localization by deriving a conceptually simple target localization loss and a global feature recalibration scheme in a mutually reinforced way. In particular, the target localization loss is designed by simply incorporating the normalized distance of target offset and generic semantic IoU loss, resulting in the distance-injected semantic IoU loss, and its minimal solution can alleviate the drift problem caused by camera motion. Moreover, the deep feature extractor is reconstructed and alternated with a feature recalibration network, which can leverage the global information to recalibrate significant features and suppress negligible features. Followed by multi-scale feature concatenation, the proposed tracker can improve the discriminative capability of feature representation for UAV targets on the fly. Extensive experimental results on four benchmarks, i.e., UAV123, UAVDT, DTB70, and VisDrone, demonstrate the superiority of the proposed tracker over existing state-of-the-art methods on UAV tracking.
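
To give a feel for how a normalized distance term can be combined with an overlap loss, here is a minimal sketch. It is not the paper's loss: it uses the plain geometric IoU in place of the semantic IoU, and normalizes the squared center offset by the squared diagonal of the smallest enclosing box, a common choice that is assumed here purely for illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def distance_injected_iou_loss(pred, target):
    """1 - IoU plus a normalized squared center distance (assumed form)."""
    # Squared distance between the two box centers.
    pcx, pcy = (pred[0] + pred[2]) / 2.0, (pred[1] + pred[3]) / 2.0
    tcx, tcy = (target[0] + target[2]) / 2.0, (target[1] + target[3]) / 2.0
    center_dist_sq = (pcx - tcx) ** 2 + (pcy - tcy) ** 2
    # Squared diagonal of the smallest box enclosing both boxes.
    ex1, ey1 = min(pred[0], target[0]), min(pred[1], target[1])
    ex2, ey2 = max(pred[2], target[2]), max(pred[3], target[3])
    diag_sq = (ex2 - ex1) ** 2 + (ey2 - ey1) ** 2
    penalty = center_dist_sq / diag_sq if diag_sq > 0 else 0.0
    return 1.0 - iou(pred, target) + penalty


if __name__ == "__main__":
    print(distance_injected_iou_loss((0, 0, 2, 2), (1, 1, 3, 3)))
    print(distance_injected_iou_loss((0, 0, 1, 1), (2, 2, 3, 3)))
```

The distance term keeps the loss informative even when the predicted and target boxes do not overlap, a situation in which the IoU term alone is constant; this is one plausible reason such a term helps against drift under camera motion.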

PiRhDy: Learning Pitch-, Rhythm-, and Dynamics-aware Embeddings for Symbolic Music

  • Hongru Liang
  • Wenqiang Lei
  • Paul Yaozhu Chan
  • Zhenglu Yang
  • Maosong Sun
  • Tat-Seng Chua

Definitive embeddings remain a fundamental challenge of computational musicology for symbolic music in deep learning today. Analogous to natural language, music can be modeled as a sequence of tokens. This motivates the majority of existing solutions to explore the utilization of word embedding models to build music embeddings. However, music differs from natural languages in two key aspects: (1) a musical token is multi-faceted -- it comprises pitch, rhythm and dynamics information; and (2) musical context is two-dimensional -- each musical token is dependent on both melodic and harmonic contexts. In this work, we provide a comprehensive solution by proposing a novel framework named PiRhDy that integrates pitch, rhythm, and dynamics information seamlessly. PiRhDy adopts a hierarchical strategy which can be decomposed into two steps: (1) token (i.e., note event) modeling, which separately represents pitch, rhythm, and dynamics and integrates them into a single token embedding; and (2) context modeling, which utilizes melodic and harmonic knowledge to train the token embedding. A thorough study was made on each component and sub-strategy of PiRhDy. We further validate our embeddings in three downstream tasks -- melody completion, accompaniment suggestion, and genre classification. Results indicate a significant advancement of the neural approach towards symbolic music as well as PiRhDy's potential as a pretrained tool for a broad range of symbolic music applications.

Cloze Test Helps: Effective Video Anomaly Detection via Learning to Complete Video Events

  • Guang Yu
  • Siqi Wang
  • Zhiping Cai
  • En Zhu
  • Chuanfu Xu
  • Jianping Yin
  • Marius Kloft

As a vital topic in media content interpretation, video anomaly detection (VAD) has made fruitful progress via deep neural networks (DNNs). However, existing methods usually follow a reconstruction or frame prediction routine. They suffer from two gaps: (1) They cannot localize video activities in a both precise and comprehensive manner. (2) They lack sufficient abilities to utilize high-level semantics and temporal context information. Inspired by the frequently-used cloze test in language study, we propose a brand-new VAD solution named Video Event Completion (VEC) to bridge the gaps above: First, we propose a novel pipeline to achieve both precise and comprehensive enclosure of video activities. Appearance and motion are exploited as mutually complementary cues to localize regions of interest (RoIs). A normalized spatio-temporal cube (STC) is built from each RoI as a video event, which lays the foundation of VEC and serves as a basic processing unit. Second, we encourage the DNN to capture high-level semantics by solving a visual cloze test. To build such a visual cloze test, a certain patch of the STC is erased to yield an incomplete event (IE). The DNN learns to restore the original video event from the IE by inferring the missing patch. Third, to incorporate richer motion dynamics, another DNN is trained to infer the erased patches' optical flow. Finally, two ensemble strategies using different types of IE and modalities are proposed to boost VAD performance, so as to fully exploit the temporal context and modality information for VAD. VEC can consistently outperform state-of-the-art methods by a notable margin (typically 1.5%-5% AUROC) on commonly-used VAD benchmarks. Our codes and results can be verified at
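
The visual cloze test described above can be pictured with a small data-preparation sketch in Python. It assumes, purely for illustration, that an STC is stored as a temporal list of equally sized patches and that "erasing" means removing one patch and keeping it as the completion target; the function name and data layout are hypothetical and not taken from the authors' code.

```python
import copy


def make_incomplete_event(stc, erase_index):
    """Split an STC into (incomplete event, erased patch).

    stc:         list of patches ordered in time, each patch a 2-D list.
    erase_index: position of the patch to erase (hypothetical parameter).
    """
    if not 0 <= erase_index < len(stc):
        raise IndexError("erase_index is outside the STC")
    # The erased patch becomes the target the network must infer.
    target = copy.deepcopy(stc[erase_index])
    # The remaining patches form the incomplete event (IE) fed to the network.
    incomplete = [copy.deepcopy(p) for i, p in enumerate(stc) if i != erase_index]
    return incomplete, target
```

Varying `erase_index` yields different types of IE from the same event, which is one plausible reading of how several IE types could feed an ensemble.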

Pose-native Network Architecture Search for Multi-person Human Pose Estimation

  • Qian Bao
  • Wu Liu
  • Jun Hong
  • Lingyu Duan
  • Tao Mei

Multi-person pose estimation has achieved great progress in recent years, yet the precise prediction of occluded and invisible hard keypoints remains challenging. Most human pose estimation networks are equipped with an image classification-based pose encoder for feature extraction and a handcrafted pose decoder for high-resolution representations. However, the pose encoder might be sub-optimal because of the gap between image classification and pose estimation. The widely used multi-scale feature fusion in the pose decoder is still coarse and cannot provide sufficient high-resolution details for hard keypoints. Neural Architecture Search (NAS) has shown great potential in many visual tasks to automatically search efficient networks. In this work, we present the Pose-native Network Architecture Search (PoseNAS) to simultaneously design a better pose encoder and pose decoder for pose estimation. Specifically, we directly search a data-oriented pose encoder with stacked searchable cells, which can provide an optimum feature extractor for the pose-specific task. In the pose decoder, we exploit scale-adaptive fusion cells to promote rich information exchange across the multi-scale feature maps. Meanwhile, the pose decoder adopts a Fusion-and-Enhancement manner to progressively boost the high-resolution representations that are non-trivial for the precise prediction of hard keypoints. With the exquisitely designed search space and search strategy, PoseNAS can simultaneously search all modules in an end-to-end manner. PoseNAS achieves state-of-the-art performance on three public datasets, MPII, COCO, and PoseTrack, with small-scale parameters compared with the existing methods. Our best model obtains 76.7% mAP and 75.9% mAP on the COCO validation set and test set with only 33.6M parameters. Code and implementation are available at

SESSION: Oral Session D2: Media Interpretation

Beyond the Attention: Distinguish the Discriminative and Confusable Features For Fine-grained Image Classification

  • Xiruo Shi
  • Liutong Xu
  • Pengfei Wang
  • Yuanyuan Gao
  • Haifang Jian
  • Wu Liu

Learning subtle discriminative features plays a significant role in fine-grained image classification. Existing methods usually extract the distinguishable parts through the attention module for classification. Although these learned distinguishable parts contain valuable features that are beneficial for classification, some irrelevant features are also preserved, which may prevent the model from making a correct classification, especially for fine-grained tasks due to their similarities. How to keep the discriminative features while removing confusable features from the distinguishable parts is an interesting yet challenging task. In this paper, we introduce a novel classification approach, named Logical-based Feature Extraction Model (LAFE for short), to address this issue. The main advantage of LAFE lies in the fact that it can explicitly add the significance of discriminative features and subtract the confusable features. Specifically, LAFE utilizes the region attention modules and channel attention modules to extract discriminative features and confusable features respectively. Based on this, two novel loss functions are designed to automatically induce attention over these features for fine-grained image classification. Our approach demonstrates its robustness, efficiency, and state-of-the-art performance on three benchmark datasets.

BlockMix: Meta Regularization and Self-Calibrated Inference for Metric-Based Meta-Learning

  • Hao Tang
  • Zechao Li
  • Zhimao Peng
  • Jinhui Tang

Most metric-based meta-learning methods learn only a sophisticated similarity metric for few-shot classification, which may lead to feature deterioration and unreliable prediction. To this end, we propose new mechanisms to learn generalized and discriminative feature embeddings as well as improve the robustness of classifiers against prediction corruptions for meta-learning. For this purpose, a new generation operator BlockMix is proposed by integrating interpolation on the images and labels within metric learning. Based on BlockMix, we propose a novel regularization method, Meta Regularization, as an auxiliary task branch with its own classifier to better constrain the feature embedding module and stabilize the meta-learning process. Furthermore, a novel inference scheme, Self-Calibrated Inference, is proposed to alleviate the unreliable prediction problem by calibrating the prototype of each category with the confidence-weighted average of the support and generated samples. The proposed mechanisms can be used as supplementary techniques alongside standard metric-based meta-learning algorithms without any pre-training. Experimental results demonstrate the insights and the efficiency of the proposed mechanisms respectively, compared with the state-of-the-art methods on the prevalent few-shot benchmarks.
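
The prototype calibration step mentioned above, a confidence-weighted average of sample embeddings, can be sketched in Python as follows. This is a minimal illustration that assumes embeddings are plain lists of floats and that each support or generated sample already has a non-negative confidence score; how the confidences are produced is not specified in the abstract and is left out here.

```python
def calibrated_prototype(embeddings, confidences):
    """Confidence-weighted mean of the embeddings belonging to one category.

    embeddings:  list of equal-length feature vectors (support and generated).
    confidences: one non-negative weight per embedding (assumed to be given).
    """
    if len(embeddings) != len(confidences):
        raise ValueError("need exactly one confidence per embedding")
    total = sum(confidences)
    if total <= 0:
        raise ValueError("confidences must sum to a positive value")
    dim = len(embeddings[0])
    # Weighted sum per dimension, normalized by the total confidence.
    return [
        sum(w * e[k] for e, w in zip(embeddings, confidences)) / total
        for k in range(dim)
    ]
```

With equal confidences this reduces to the ordinary class mean used as a prototype in standard metric-based few-shot classifiers, so low-confidence samples simply pull the prototype less.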

Fine-grained Feature Alignment with Part Perspective Transformation for Vehicle ReID

  • Dechao Meng
  • Liang Li
  • Shuhui Wang
  • Xingyu Gao
  • Zheng-Jun Zha
  • Qingming Huang

Given a query image, vehicle Re-Identification (ReID) aims to search for the same vehicle in multi-camera scenarios, a task that has attracted much attention in recent years. However, vehicle ReID severely suffers from the perspective variation problem. For different vehicles with similar color and type that are captured from different perspectives, all visual patterns are misaligned and warped, which makes it hard for the model to find the exact discriminative regions. In this paper, we propose a part perspective transformation (PPT) module to map the different parts of a vehicle into a unified perspective respectively. The PPT disentangles the vehicle features of different perspectives and then aligns them at a fine-grained level. Further, we propose a dynamically batch hard triplet loss to select the common visible regions of the compared vehicles. Our approach helps the model to generate perspective-invariant features and find the exact distinguishable regions for vehicle ReID. Extensive experiments on three standard vehicle ReID datasets show the effectiveness of our method.

Compact Bilinear Augmented Query Structured Attention for Sport Highlights Classification

  • Yanbin Hao
  • Hao Zhang
  • Chong-Wah Ngo
  • Qiang Liu
  • Xiaojun Hu

Understanding fine-grained activities, such as sport highlights, is an overlooked problem that receives considerably less research attention. Potential reasons include the absence of specific fine-grained action benchmark datasets, research preferences for general super-categorical activity classification, and the challenge of large visual similarities between fine-grained actions. To tackle these, we collect and manually annotate two sport highlights datasets, i.e., Basketball-8 & Soccer-10, for fine-grained action classification. Sample clips in the datasets are annotated with professional sub-categorical actions like "dunk", "goalkeeping", etc. We also propose a Compact Bilinear Augmented Query Structured Attention (CBA-QSA) module and stack it on top of general three-dimensional neural networks in a plug-and-play manner to emphasize important spatio-temporal clues in highlight clips. Specifically, we adapt the hierarchical attention neural networks, which contain a learnable query scheme, to video in order to identify discriminative spatial/temporal visual clues within highlight clips. We name this altered attention, which separately learns a query for spatial/temporal features, query structured attention (QSA). Furthermore, we inflate bilinear mapping, which is a mature technique to represent local pairwise interactions for image-level fine-grained classification, for video understanding. In detail, we extend its compact version (i.e., compact bilinear mapping (CBM) based on TensorSketch) to deal with the three-dimensional video signal for modeling local pairwise motion information. We eventually incorporate CBM and QSA together to form CBA-QSA neural networks for fine-grained sport highlights classification. Experimental results demonstrate that CBA-QSA improves on the general state of the art on the Basketball-8 and Soccer-10 datasets.
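
For readers unfamiliar with TensorSketch, on which compact bilinear mapping is based, the Python sketch below shows the core identity on two plain vectors: the count sketch of an outer product equals the circular convolution of the two individual count sketches. It illustrates the generic technique only, not the authors' three-dimensional video extension; all function names are hypothetical, and the convolution is written as a direct double loop for clarity, whereas practical implementations typically compute it with an FFT.

```python
import random


def make_hash(n, d, rng):
    """Random bucket indices in [0, d) and random signs in {-1, +1} for n inputs."""
    buckets = [rng.randrange(d) for _ in range(n)]
    signs = [rng.choice((-1.0, 1.0)) for _ in range(n)]
    return buckets, signs


def count_sketch(x, buckets, signs, d):
    """Project x into d dimensions by signed accumulation into hashed buckets."""
    out = [0.0] * d
    for i, value in enumerate(x):
        out[buckets[i]] += signs[i] * value
    return out


def circular_convolution(a, b):
    """Direct O(d^2) circular convolution of two length-d vectors."""
    d = len(a)
    out = [0.0] * d
    for i in range(d):
        for j in range(d):
            out[(i + j) % d] += a[i] * b[j]
    return out


def tensor_sketch(x, y, h1, s1, h2, s2, d):
    """Compact sketch of the outer product of x and y, without forming it."""
    return circular_convolution(count_sketch(x, h1, s1, d),
                                count_sketch(y, h2, s2, d))


def outer_product_sketch(x, y, h1, s1, h2, s2, d):
    """Reference: count sketch of the explicit outer product, for comparison."""
    out = [0.0] * d
    for i, xi in enumerate(x):
        for j, yj in enumerate(y):
            out[(h1[i] + h2[j]) % d] += s1[i] * s2[j] * xi * yj
    return out
```

The point of the construction is cost: the explicit outer product of two n-dimensional features has n^2 entries, while the sketch stays at a chosen dimension d.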

Semantic Image Analogy with a Conditional Single-Image GAN

  • Jiacheng Li
  • Zhiwei Xiong
  • Dong Liu
  • Xuejin Chen
  • Zheng-Jun Zha

Recent image-specific Generative Adversarial Networks (GANs) provide a way to learn generative models from a single image instead of a large dataset. However, the semantic meaning of patches inside a single image is less explored. In this work, we first define the task of Semantic Image Analogy: given a source image and its segmentation map, along with another target segmentation map, synthesizing a new image that matches the appearance of the source image as well as the semantic layout of the target segmentation. To accomplish this task, we propose a novel method to model the patch-level correspondence between semantic layout and appearance of a single image by training a single-image GAN that takes semantic labels as conditional input. Once trained, a controllable redistribution of patches from the training image can be obtained by providing the expected semantic layout as spatial guidance. The proposed method contains three essential parts: 1) a self-supervised training framework, with a progressive data augmentation strategy and an alternating optimization procedure; 2) a semantic feature translation module that predicts transformation parameters in the image domain from the segmentation domain; and 3) a semantics-aware patch-wise loss that explicitly measures the similarity of two images in terms of patch distribution. Compared with existing solutions, our method generates much more realistic results given arbitrary semantic labels as conditional input.

A Structured Graph Attention Network for Vehicle Re-Identification

  • Yangchun Zhu
  • Zheng-Jun Zha
  • Tianzhu Zhang
  • Jiawei Liu
  • Jiebo Luo

Vehicle re-identification aims to identify the same vehicle across different surveillance cameras and plays an important role in public security. Existing approaches mainly focus on exploring informative regions or learning an appropriate distance metric. However, they not only neglect the inherent structured relationship between discriminative regions within an image, but also ignore the extrinsic structured relationship among images. The inherent and extrinsic structured relationships are crucial to learning effective vehicle representation. In this paper, we propose a Structured Graph ATtention network (SGAT) to fully exploit these relationships and allow the message propagation to update the features of graph nodes. SGAT creates two graphs for one probe image. One is an inherent structured graph based on the geometric relationship between the landmarks that can use features of their neighbors to enhance themselves. The other is an extrinsic structured graph guided by the attribute similarity to update image representations. Experimental results on two public vehicle re-identification datasets including VeRi-776 and VehicleID have shown that our proposed method achieves significant improvements over the state-of-the-art methods.

SESSION: Oral Session E2: Media Interpretation

Contextual Multi-Scale Feature Learning for Person Re-Identification

  • Baoyu Fan
  • Li Wang
  • Runze Zhang
  • Zhenhua Guo
  • Yaqian Zhao
  • Rengang Li
  • Weifeng Gong

Representing features at multiple scales is significant for person re-identification (Re-ID). Most existing methods learn the multi-scale features by stacking streams and convolutions without considering the cooperation of multiple scales at a granular level. However, most scales are more discriminative only when they integrate other scales as contextual information. We term this contextual multi-scale. In this paper, we propose a novel architecture, namely the contextual multi-scale network (CMSNet), for learning common and contextual multi-scale representations simultaneously. The building block of CMSNet obtains contextual multi-scale representations through bidirectional hierarchical connection groups: the forward hierarchical connection group for stepwise inter-scale information fusion and the backward hierarchical connection group for leap-frogging inter-scale information fusion. Overly rich scale features without selection will confuse the discrimination. Additionally, we introduce a new channel-wise scale selection module to dynamically select scale features for the corresponding input image. To the best of our knowledge, CMSNet is the most lightweight model for person Re-ID and it achieves state-of-the-art performance on four commonly used Re-ID datasets, surpassing most large-scale models.

Space-Time Video Super-Resolution Using Temporal Profiles

  • Zeyu Xiao
  • Zhiwei Xiong
  • Xueyang Fu
  • Dong Liu
  • Zheng-Jun Zha

In this paper, we propose a novel space-time video super-resolution method, which aims to recover a high-frame-rate and high-resolution video from its low-frame-rate and low-resolution observation. Existing solutions seldom consider the spatial-temporal correlation and the long-term temporal context simultaneously and thus are limited in the restoration performance. Inspired by the epipolar-plane image used in multi-view computer vision tasks, we first propose the concept of temporal-profile super-resolution to directly exploit the spatial-temporal correlation in the long-term temporal context. Then, we specifically design a feature shuffling module for spatial retargeting and spatial-temporal information fusion, which is followed by a refining module for artifacts alleviation and detail enhancement. Different from existing solutions, our method does not require any explicit or implicit motion estimation, making it lightweight and flexible to handle any number of input frames. Comprehensive experimental results demonstrate that our method not only generates superior space-time video super-resolution results but also retains competitive implementation efficiency.
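
To make the notion of a temporal profile more concrete, the Python sketch below extracts one from a video by fixing a spatial row and stacking it over time, by analogy with how an epipolar-plane image is sliced from a stack of views. The abstract does not define the profile precisely, so the `video[t][y][x]` layout and the choice of a horizontal slice are assumptions made for illustration.

```python
def temporal_profile(video, row):
    """Return the 2-D slice profile[t][x] = video[t][row][x].

    video: nested lists indexed as video[t][y][x] (assumed layout).
    row:   the fixed spatial row y at which the video is sliced.
    """
    return [list(frame[row]) for frame in video]
```

Under this reading, the profile has shape (frames, width), so increasing the frame rate and the spatial resolution of the video both appear as upscaling of a 2-D image, one axis each.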

Black Re-ID: A Head-shoulder Descriptor for the Challenging Problem of Person Re-Identification

  • Boqiang Xu
  • Lingxiao He
  • Xingyu Liao
  • Wu Liu
  • Zhenan Sun
  • Tao Mei

Person re-identification (Re-ID) aims at retrieving an input person image from a set of images captured by multiple cameras. Although recent Re-ID methods have achieved great success, most of them extract features in terms of the attributes of clothing (e.g., color, texture). However, it is common for people to wear black clothes or be captured by surveillance systems under low-light illumination, in which cases the attributes of the clothing are severely missing. We call this problem the Black Re-ID problem. To solve this problem, rather than relying on the clothing information, we propose to exploit head-shoulder features to assist person Re-ID. The head-shoulder adaptive attention network (HAA) is proposed to learn the head-shoulder feature and an innovative ensemble method is designed to enhance the generalization of our model. Given the input person image, the ensemble method focuses on the head-shoulder feature by assigning it a larger weight if the individual inside the image is in black clothing. Due to the lack of a suitable benchmark dataset for studying the Black Re-ID problem, we also contribute the first Black-reID dataset, which contains 1274 identities in the training set. Extensive evaluations on the Black-reID, Market1501 and DukeMTMC-reID datasets show that our model achieves the best result compared with the state-of-the-art Re-ID methods on both Black and conventional Re-ID problems. Furthermore, our method also proves to be effective in dealing with person Re-ID in similar clothing. Our code and dataset are available on

SalGCN: Saliency Prediction for 360-Degree Images Based on Spherical Graph Convolutional Networks

  • Haoran Lv
  • Qin Yang
  • Chenglin Li
  • Wenrui Dai
  • Junni Zou
  • Hongkai Xiong

The non-Euclidean geometry characteristic poses a challenge to saliency prediction for 360-degree images. Since spherical data cannot be projected onto a single plane without distortion, existing saliency prediction methods based on traditional CNNs are inefficient. In this paper, we propose a saliency prediction framework for 360-degree images based on graph convolutional networks (SalGCN), which directly applies to spherical graph signals. Specifically, we adopt the Geodesic ICOsahedral Pixelation (GICOPix) to construct a spherical graph signal from a spherical image in equirectangular projection (ERP) format. We then propose a graph saliency prediction network to directly extract the spherical features and generate the spherical graph saliency map, where we design an unpooling method suitable for spherical graph signals based on linear interpolation. The network training process is realized by modeling the node regression problem of the input and output spherical graph signals, where we further design a Kullback-Leibler (KL) divergence loss with sparse consistency to make the sparseness of the saliency map closer to the ground truth. Finally, to obtain the ERP format saliency map for evaluation, we further propose a spherical crown-based (SCB) interpolation method to convert the output spherical graph saliency map into a saliency map in ERP format. Experiments show that our SalGCN can achieve comparable or even better saliency prediction performance both subjectively and objectively, with a much lower computational complexity.
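
The KL divergence term mentioned above can be illustrated with a short Python sketch that compares two saliency maps defined over the same graph nodes. It covers only the standard KL divergence between the normalized maps; the sparse-consistency component of the authors' loss is not described in the abstract and is not modeled here. The function name and the `eps` stabilizer are illustrative choices.

```python
import math


def kl_divergence(ground_truth, predicted, eps=1e-12):
    """KL(P || Q) between two non-negative per-node saliency maps.

    Both maps are normalized to sum to 1 before comparison; eps guards
    against division by zero and log of zero.
    """
    if len(ground_truth) != len(predicted):
        raise ValueError("maps must cover the same nodes")
    gt_sum, pred_sum = sum(ground_truth), sum(predicted)
    if gt_sum <= 0 or pred_sum <= 0:
        raise ValueError("maps must have positive total saliency")
    p = [g / gt_sum for g in ground_truth]
    q = [v / pred_sum for v in predicted]
    # Nodes with zero ground-truth saliency contribute nothing to the sum.
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)
```

Because each node is treated as one sample of a distribution, the same function applies whether the nodes come from a pixel grid or from a spherical graph.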

LIGHTEN: Learning Interactions with Graph and Hierarchical TEmporal Networks for HOI in videos

  • Sai Praneeth Reddy Sunkesula
  • Rishabh Dabral
  • Ganesh Ramakrishnan

Analyzing the interactions between humans and objects from a video includes identification of the relationships between humans and the objects present in the video. It can be thought of as a specialized version of Visual Relationship Detection, wherein one of the objects must be a human. While traditional methods formulate the problem as inference on a sequence of video segments, we present a hierarchical approach, LIGHTEN, to learn visual features to effectively capture spatio-temporal cues at multiple granularities in a video. Unlike current approaches, LIGHTEN avoids using ground truth data like depth maps or 3D human pose, thus increasing generalization across non-RGBD datasets as well. Furthermore, we achieve the same using only the visual features, instead of the commonly used hand-crafted spatial features. We achieve state-of-the-art results in human-object interaction detection (88.9% and 92.6%) and anticipation tasks of CAD-120 and competitive results on image based HOI detection in V-COCO dataset, setting a new benchmark for visual features based approaches. Code for LIGHTEN is available at

Concept-based Explanation for Fine-grained Images and Its Application in Infectious Keratitis Classification

  • Zhengqing Fang
  • Kun Kuang
  • Yuxiao Lin
  • Fei Wu
  • Yu-Feng Yao

Interpretability has become an essential topic as deep learning is widely applied in professional fields (e.g., medical image processing) where a high level of accountability is required. Existing methods for explanation mainly focus on computing the importance of low-level pixels or segments, rather than high-level concepts. Concepts are of paramount importance for humans to understand and make decisions, especially for fine-grained tasks. In this paper, we focus on the real application problem of classification of infectious keratitis and propose a visual concept mining (VCM) method to explain the fine-grained infectious keratitis images. Based on our discovered explainable visual concepts, we further propose a visual concept enhanced framework for infectious keratitis classification. Extensive empirical experiments demonstrate that (i) our discovered visual concepts are highly coherent with the physicians' understanding and interpretation, and (ii) our visual concept enhanced model achieves significant improvement on the performance of infectious keratitis classification.

SESSION: Oral Session F2: Mobile Multimedia & Multimedia HCI and Quality of Experience

Guided Attention Network for Object Detection and Counting on Drones

  • Cai YuanQiang
  • Dawei Du
  • Libo Zhang
  • Longyin Wen
  • Weiqiang Wang
  • Yanjun Wu
  • Siwei Lyu

Object detection and counting are related but challenging problems, especially for drone-based scenes with small objects and cluttered backgrounds. In this paper, we propose a new Guided Attention network (GAnet) to deal with both object detection and counting tasks based on the feature pyramid. Different from previous methods relying on unsupervised attention modules, we fuse different scales of feature maps by using the proposed weakly-supervised Background Attention (BA) between the background and objects for more semantic feature representation. Then, the Foreground Attention (FA) module is developed to consider both global and local appearance of the object to facilitate accurate localization. Moreover, a new data augmentation strategy is designed to train a robust model in drone-based scenes with various illumination conditions. Extensive experiments on three challenging benchmarks (i.e., UAVDT, CARPK and PUCPR+) show the state-of-the-art detection and counting performance of the proposed method compared with existing methods. Code can be found at

PIDNet: An Efficient Network for Dynamic Pedestrian Intrusion Detection

  • Jingchen Sun
  • Jiming Chen
  • Tao Chen
  • Jiayuan Fan
  • Shibo He

Vision-based dynamic pedestrian intrusion detection (PID), judging whether pedestrians intrude an area-of-interest (AoI) by a moving camera, is an important task in mobile surveillance. The dynamically changing AoIs and a number of pedestrians in video frames increase the difficulty and computational complexity of determining whether pedestrians intrude the AoI, which makes previous algorithms incapable of this task. In this paper, we propose a novel and efficient multi-task deep neural network, PIDNet, to solve this problem. PIDNet is mainly designed by considering two factors: accurately segmenting the dynamically changing AoIs from a video frame captured by the moving camera and quickly detecting pedestrians from the generated AoI-contained areas. Three efficient network designs are proposed and incorporated into PIDNet to reduce the computational complexity: 1) a special PID task backbone for feature sharing, 2) a feature cropping module for feature cropping, and 3) a lighter detection branch network for feature compression. In addition, considering there are no public datasets and benchmarks in this field, we establish a benchmark dataset to evaluate the proposed network and give the corresponding evaluation metrics for the first time. Experimental results show that PIDNet can achieve 67.1% PID accuracy and 9.6 fps inference speed on the proposed dataset, which serves as a good baseline for the future vision-based dynamic PID study.
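
The final intrusion decision, given a segmented AoI and a detected pedestrian, can be pictured with the Python sketch below. The abstract does not state the decision rule, so the criterion used here (the bottom-center "foot point" of the bounding box falling inside the binary AoI mask) is a simple assumption for illustration, as are the function name and data layout.

```python
def intrudes(aoi_mask, box):
    """Return True if the pedestrian's foot point lies inside the AoI.

    aoi_mask: 2-D list indexed as aoi_mask[y][x], 1 inside the AoI, 0 outside.
    box:      pedestrian bounding box (x1, y1, x2, y2) in pixel coordinates.
    """
    x1, _, x2, y2 = box
    # Assumed criterion: test the bottom-center point of the box.
    foot_x = int((x1 + x2) / 2)
    foot_y = int(y2)
    height, width = len(aoi_mask), len(aoi_mask[0])
    if not (0 <= foot_x < width and 0 <= foot_y < height):
        return False
    return aoi_mask[foot_y][foot_x] == 1
```

Because the camera moves, the mask would have to be re-segmented for every frame before this check is applied, which is why the abstract stresses the cost of AoI segmentation alongside detection.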

VONAS: Network Design in Visual Odometry using Neural Architecture Search

  • Xing Cai
  • Lanqing Zhang
  • Chengyuan Li
  • Ge Li
  • Thomas H. Li

End-to-end VO (visual odometry) is a complicated task with a high temporal dependency, but the design of its deep networks lacks thorough investigation. Meanwhile, NAS (neural architecture search) has been widely researched and applied in many computer vision fields due to its advantage in automatic network design. However, most of the existing NAS frameworks only consider single-image tasks such as image classification, lacking the consideration of video (multi-frame) tasks such as VO. Therefore, this paper explores the network design for the VO task and proposes a more general single-path-based one-shot NAS, named VONAS, which can model sequential information for video-related tasks. Extensive experiments prove that the network architecture is significant for (un)supervised VO. The models obtained by VONAS are lightweight and achieve SOTA performance with good generalization.

Learning from the Past: Meta-Continual Learning with Knowledge Embedding for Jointly Sketch, Cartoon, and Caricature Face Recognition

  • Wenbo Zheng
  • Lan Yan
  • Fei-Yue Wang
  • Chao Gou

This paper deals with the challenging task of learning from different modalities by tackling the difficult problem of joint face recognition between abstract-like sketches, cartoons, caricatures and real-life photographs. Due to the significant variations in the abstract faces, building vision models for recognizing data from these modalities is extremely challenging. We propose a novel framework termed Meta-Continual Learning with Knowledge Embedding to address the task of joint sketch, cartoon, and caricature face recognition. In particular, we firstly present a deep relational network to capture and memorize the relation among different samples. Secondly, we present the construction of our knowledge graph that relates images with labels as the guidance of our meta-learner. We then design a knowledge embedding mechanism to incorporate the knowledge representation into our network. Thirdly, to mitigate catastrophic forgetting, we use a meta-continual model that updates our ensemble model and improves its prediction accuracy. With this meta-continual model, our network can learn from its past. The final classification is derived from our network by learning to compare the features of samples. Experimental results demonstrate that our approach achieves significantly higher performance compared with other state-of-the-art approaches.

ChoreoNet: Towards Music to Dance Synthesis with Choreographic Action Unit

  • Zijie Ye
  • Haozhe Wu
  • Jia Jia
  • Yaohua Bu
  • Wei Chen
  • Fanbo Meng
  • Yanfeng Wang

Dance and music are two highly correlated artistic forms. Synthesizing dance motions has attracted much attention recently. Most previous works conduct music-to-dance synthesis by directly mapping music to human skeleton keypoints. Meanwhile, human choreographers design dance motions from music in a two-stage manner: they firstly devise multiple choreographic action units (CAUs), each with a series of dance motions, and then arrange the CAU sequence according to the rhythm, melody and emotion of the music. Inspired by this, we systematically study this two-stage choreography approach and construct a dataset to incorporate such choreography knowledge. Based on the constructed dataset, we design a two-stage music-to-dance synthesis framework, ChoreoNet, to imitate the human choreography procedure. Our framework firstly devises a CAU prediction model to learn the mapping relationship between music and CAU sequences. Afterwards, we devise a spatial-temporal inpainting model to convert the CAU sequence into continuous dance motions. Experimental results demonstrate that the proposed ChoreoNet outperforms baseline methods (0.622 in terms of CAU BLEU score and 1.59 in terms of user study score).

InvisibleFL: Federated Learning over Non-Informative Intermediate Updates against Multimedia Privacy Leakages

  • Qiushi Li
  • Wenwu Zhu
  • Chao Wu
  • Xinglin Pan
  • Fan Yang
  • Yuezhi Zhou
  • Yaoxue Zhang

In cloud and edge networks, federated learning involves training statistical models over decentralized data, where servers aggregate models through intermediate updates trained by clients. By utilizing private and local data, it improves the quality of personalized services and reduces users' concerns about privacy. However, federated learning still leaks multimedia features through trained intermediate updates and thereby is not privacy-preserving for multimedia. Existing techniques borrowed from the security community attempt to avoid multimedia feature leakage in federated learning, yet they cannot address the privacy issues. In this paper, we propose a privacy-preserving solution that avoids multimedia privacy leakages in federated learning. Firstly, we devise a novel encryption scheme called Non-Informative Transformation (NIT) for federated aggregation to eliminate residual multimedia features in intermediate updates. Based on the scheme, we then propose the Just-Learn-over-Ciphertext (JLoC) mechanism for federated learning, which includes three stages in each model iteration. The Encrypt stage encrypts intermediate updates at clients so that they follow a non-informative distribution. The Aggregate stage performs model aggregation without decryption at servers. Specifically, this stage just computes over ciphertext, and its aggregation output also remains non-informative. The Decrypt stage converts the non-informative aggregation outputs into usable parameters for the next iteration at clients. Moreover, we implement a prototype and conduct experiments to evaluate its privacy and performance on real devices. The experimental results demonstrate that our methods can defend against potential attacks for multimedia privacy leakages without accuracy loss in commercial off-the-shelf products.

Asymmetric Deep Hashing for Efficient Hash Code Compression

  • Shu Zhao
  • Dayan Wu
  • Wanqian Zhang
  • Yu Zhou
  • Bo Li
  • Weiping Wang

Benefiting from recent advances in deep learning, deep hashing methods have achieved promising performance in large-scale image retrieval. To improve storage and computational efficiency, existing hash codes need to be compressed accordingly. However, previous deep hashing methods have to retrain their models and then regenerate the whole database codes using the new models when code length changes, which is time consuming especially for large image databases. In this paper, we propose a novel deep hashing method, called Code Compression oriented Deep Hashing (CCDH), for efficiently compressing hash codes. CCDH learns deep hash functions for query images, while learning a one-hidden-layer Variational Autoencoder (VAE) from existing hash codes. With such asymmetric design, CCDH can efficiently compress database codes only using the learned encoder of VAE. Furthermore, CCDH is flexible enough to be used with a variety of deep hashing methods. Extensive experiments on three widely used image retrieval benchmarks demonstrate that CCDH can significantly reduce the cost for compressing database codes when code length changes while keeping the state-of-the-art retrieval accuracy.

SESSION: Oral Session G2: Multimedia HCI and Quality of Experience

A Human-Computer Duet System for Music Performance

  • Yuen-Jen Lin
  • Hsuan-Kai Kao
  • Yih-Chih Tseng
  • Ming Tsai
  • Li Su

Virtual musicians have become a remarkable phenomenon in the contemporary multimedia arts. However, most of the virtual musicians nowadays have not been endowed with abilities to create their own behaviors, or to perform music with human musicians. In this paper, we firstly create a virtual violinist, who can collaborate with a human pianist to perform chamber music automatically without any intervention. The system incorporates the techniques from various fields, including real-time music tracking, pose estimation, and body movement generation. In our system, the virtual musician's behavior is generated based on the given music audio alone, and such a system results in a low-cost, efficient and scalable way to produce human and virtual musicians' co-performance. The proposed system has been validated in public concerts. Objective quality assessment approaches and possible ways to systematically improve the system are also discussed.

Photo Stand-Out: Photography with Virtual Character

  • Yujia Wang
  • Sifan Hou
  • Bing Ning
  • Wei Liang

In this paper, we propose a novel optimization framework to synthesize an aesthetic pose for a virtual character with respect to the pose of the user being photographed. Our approach applies aesthetic evaluation that exploits fully connected neural networks trained on example images. The aesthetic pose of the virtual character is obtained by optimizing a cost function that guides the rotation of each body joint. In our experiments, we demonstrate that the proposed approach can synthesize poses for virtual characters according to user pose inputs. We also conducted objective and subjective experiments on the synthesized results to validate the efficacy of our approach.

Norm-in-Norm Loss with Faster Convergence and Better Performance for Image Quality Assessment

  • Dingquan Li
  • Tingting Jiang
  • Ming Jiang

Currently, most image quality assessment (IQA) models are supervised by the MAE or MSE loss with empirically slow convergence. It is well-known that normalization can facilitate fast convergence. Therefore, we explore normalization in the design of loss functions for IQA. Specifically, we first normalize the predicted quality scores and the corresponding subjective quality scores. Then, the loss is defined based on the norm of the differences between these normalized values. The resulting "Norm-in-Norm" loss encourages the IQA model to make linear predictions with respect to subjective quality scores. After training, the least squares regression is applied to determine the linear mapping from the predicted quality to the subjective quality. It is shown that the new loss is closely connected with two common IQA performance criteria (PLCC and RMSE). Through theoretical analysis, it is proved that the embedded normalization makes the gradients of the loss function more stable and more predictable, which is conducive to the faster convergence of the IQA model. Furthermore, to experimentally verify the effectiveness of the proposed loss, it is applied to solve a challenging problem: quality assessment of in-the-wild images. Experiments on two relevant datasets (KonIQ-10k and CLIVE) show that, compared to MAE or MSE loss, the new loss enables the IQA model to converge about 10 times faster and the final model achieves better performance. The proposed model also achieves state-of-the-art prediction performance on this challenging problem. For reproducible scientific research, our code is publicly available.
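The two-step construction described in this abstract (normalize both score vectors, then take a norm of their difference) can be illustrated with a minimal sketch. The function names, the choice of mean-centering with p-norm scaling, and the default exponents p = 2 and q = 1 are assumptions made for illustration; the abstract does not fix these details, and this is not the authors' released code.

```python
def normalize(scores, p=2.0, eps=1e-8):
    """Center the scores, then scale them to (approximately) unit p-norm."""
    mean = sum(scores) / len(scores)
    centered = [s - mean for s in scores]
    norm = sum(abs(c) ** p for c in centered) ** (1.0 / p)
    return [c / (norm + eps) for c in centered]


def norm_in_norm_loss(predicted, subjective, p=2.0, q=1.0):
    """q-norm of the difference between the two normalized score vectors.

    p and q are illustrative defaults (assumptions), not values taken
    from the abstract.
    """
    pred_n = normalize(predicted, p)
    subj_n = normalize(subjective, p)
    return sum(abs(a - b) ** q for a, b in zip(pred_n, subj_n)) ** (1.0 / q)


# Predictions that are a positive linear function of the subjective scores
# give a loss close to zero; reversed predictions give a large loss.
subjective = [1.0, 2.0, 3.0, 4.0]
aligned = norm_in_norm_loss([10.0, 20.0, 30.0, 40.0], subjective)
reversed_order = norm_in_norm_loss([4.0, 3.0, 2.0, 1.0], subjective)
```

Because both vectors are centered and rescaled before comparison, any positive linear rescaling of the predictions leaves the loss unchanged, which matches the abstract's remark that the loss encourages linear predictions with respect to subjective scores.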

Context-aware Attention Network for Predicting Image Aesthetic Subjectivity

  • Munan Xu
  • Jia-Xing Zhong
  • Yurui Ren
  • Shan Liu
  • Ge Li

Image aesthetic assessment involves both fine-grained details and the holistic layout of images. However, most current approaches learn the local and the holistic information separately, which risks losing contextual information. Additionally, learning-based methods mainly cast image aesthetic assessment as a binary classification or a regression problem, which cannot sufficiently delineate the potential diversity of human aesthetic experience. To address these limitations, we attempt to render the contextual information and model the varieties of aesthetic experience. Specifically, we explore a context-aware attention module in two dimensions: hierarchical and spatial. The hierarchical context is introduced to attend to multi-level aesthetic details, while the spatial context serves to yield the long-range perception of images. Based on the attention model, we predict the distribution of human aesthetic ratings of images, which reflects the diversity and similarity of human subjective opinions. We conduct extensive experiments on the prevailing AVA dataset to validate the effectiveness of our approach. Experimental results demonstrate that our approach achieves state-of-the-art results.

Scoring High: Analysis and Prediction of Viewer Behavior and Engagement in the Context of 2018 FIFA WC Live Streaming

  • Nikolas Wehner
  • Michael Seufert
  • Sebastian Egger-Lampl
  • Bruno Gardlo
  • Pedro Casas
  • Raimund Schatz

Large-scale events pose severe challenges to live video streaming service providers, who need to cope with high, peaking viewer numbers and the resulting fluctuating resource demands while keeping high levels of Quality of Experience (QoE) to avoid end-user frustration and churn. In this paper, we analyze a unique dataset consisting of more than a million 2018 FIFA World Cup mobile live streaming sessions, collected at a large national public broadcaster. Different from previous work, we analyze QoE and user engagement as well as their interaction in relation to specific soccer match events, which have the potential to trigger flash crowds during a match. Flash crowds are a particular challenge to video service providers, since they cause sudden load peaks and consequently increase the likelihood of quality problems. We further exploit the data to model viewer engagement over the course of a soccer match, and show that client counts follow very similar patterns of change across all matches. We believe that the analysis as well as the resulting models are valuable sources of insight for service providers, equipping them with tools for customer-centric resource and capacity management.

Object-level Attention for Aesthetic Rating Distribution Prediction

  • Jingwen Hou
  • Sheng Yang
  • Weisi Lin

We study the problem of image aesthetic assessment (IAA) and aim to automatically predict image aesthetic quality in the form of a discrete distribution, which is particularly important in IAA because aesthetic judgments tend to vary widely among raters. Previous works show the effectiveness of utilizing object-agnostic attention mechanisms to selectively concentrate on more contributive regions for IAA, e.g., attention is learned to weight pixels of input images when inferring aesthetic values. However, as suggested by some neuropsychology studies, the basic units of human attention are visual objects, i.e., the trace of human attention follows a series of objects. This inspires us to predict contributions of different regions at the object level for better aesthetics evaluation. With our framework, regions of interest (RoIs) are proposed by an object detector, and each RoI is associated with a regional feature vector. Then the contribution of each regional feature to the aesthetics prediction is adaptively determined. To the best of our knowledge, this is the first work modeling object-level attention for IAA, and experimental results confirm the superiority of our framework over previous relevant methods.

ARSketch: Sketch-Based User Interface for Augmented Reality Glasses

  • Zhaohui Zhang
  • Haichao Zhu
  • Qian Zhang

Hand gesture interaction is a key component in Augmented Reality (AR) / Mixed Reality (MR). Users usually interact with AR/MR devices, e.g., Microsoft HoloLens, etc., via hand gestures to express their intentions and the devices will recognize the gestures and respond accordingly to users. However, the use of such technique so far is limited to only a few less-expressive hand gestures, which, unfortunately, are insufficient or inadequate to input complex information.

To tackle this problem, we introduce a sketch-based neural network-driven user interface for AR/MR glasses, called ARSketch, which enables drawing sketches freely in air to interact with the devices. ARSketch combines: (1) hand pose estimation that estimates the egocentric hand poses in an energy-efficient way, (2) sketch generation that generates sketches using key point positions of hand poses, and (3) sketch-photo retrieval that takes sketches as inputs to retrieve relevant photos. The evaluation results on our collected sketch dataset demonstrate the efficacy of ARSketch for user interaction.

SESSION: Oral Session H2: Multimedia HCI and Quality of Experience & Multimedia Search and Recommendation

RIRNet: Recurrent-In-Recurrent Network for Video Quality Assessment

  • Pengfei Chen
  • Leida Li
  • Lei Ma
  • Jinjian Wu
  • Guangming Shi

Video quality assessment (VQA), which is capable of automatically predicting the perceptual quality of source videos especially when reference information is not available, has become a major concern for video service providers due to the growing demand for video quality of experience (QoE) by end users. While significant advances have been achieved by recent deep learning techniques, they often lead to misleading results in VQA tasks given their limitations in describing 3D spatio-temporal regularities using only a fixed temporal frequency. Partially inspired by psychophysical and vision science studies revealing the speed tuning property of neurons in the visual cortex when performing motion perception (i.e., sensitivity to different temporal frequencies), we propose a novel no-reference (NR) VQA framework named Recurrent-In-Recurrent Network (RIRNet) that incorporates this characteristic to promote an accurate representation of motion perception in the VQA task. By fusing motion information derived from different temporal frequencies in a more efficient way, the resulting temporal modeling scheme is formulated to quantify the temporal motion effect via a hierarchical distortion description. It is found that the proposed framework is in closer agreement with quality perception of the distorted videos since it integrates concepts from motion perception in the human visual system (HVS), which is manifested in the designed network structure composed of low- and high-level processing. A holistic validation of our methods on four challenging video quality databases demonstrates superior performance over the state-of-the-art methods.

Cognitive Representation Learning of Self-Media Online Article Quality

  • Yiru Wang
  • Shen Huang
  • Gongfu Li
  • Qiang Deng
  • Dongliang Liao
  • Pengda Si
  • Yujiu Yang
  • Jin Xu

The automatic quality assessment of self-media online articles is an urgent and new issue, which is of great value to the online recommendation and search. Different from traditional and well-formed articles, self-media online articles are mainly created by users, which have the appearance characteristics of different text levels and multi-modal hybrid editing, along with the potential characteristics of diverse content, different styles, large semantic spans and good interactive experience requirements. To solve these challenges, we establish a joint model CoQAN in combination with the layout organization, writing characteristics and text semantics, designing different representation learning subnetworks, especially for the feature learning process and interactive reading habits on mobile terminals. It is more consistent with the cognitive style of expressing an expert's evaluation of articles. We have also constructed a large scale real-world assessment dataset. Extensive experimental results show that the proposed framework significantly outperforms state-of-the-art methods, and effectively learns and integrates different factors of the online article quality assessment.

Describing Subjective Experiment Consistency by p-Value P–P Plot

  • Jakub Nawala
  • Lucjan Janowski
  • Bogdan Cmiel
  • Krzysztof Rusek

There are phenomena that cannot be measured without subjective testing. However, subjective testing is a complex issue with many influencing factors. These interplay to yield either precise or incorrect results. Researchers require a tool to classify the results of a subjective experiment as either consistent or inconsistent. This is necessary in order to decide whether to treat the gathered scores as quality ground truth data. Knowing if subjective scores can be trusted is key to drawing valid conclusions and building functional tools based on those scores (e.g., algorithms assessing the perceived quality of multimedia materials). We provide a tool to classify a subjective experiment (and all its results) as either consistent or inconsistent. Additionally, the tool identifies stimuli having an irregular score distribution. The approach is based on treating subjective scores as a random variable coming from the discrete Generalized Score Distribution (GSD). The GSD, in combination with a bootstrapped G-test of goodness-of-fit, makes it possible to construct a p-value P–P plot that visualizes an experiment's consistency. The tool safeguards researchers from using inconsistent subjective data. In this way, it makes sure that the conclusions they draw and the tools they build are more precise and trustworthy. The proposed approach works in line with expectations drawn solely from the experiment design descriptions of 21 real-life multimedia quality subjective experiments.
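As background for the G-test of goodness-of-fit mentioned in this abstract, the standard G statistic is G = 2 Σ O_i ln(O_i / E_i), where O_i are observed counts and E_i are the counts expected under the fitted distribution. The sketch below computes only this statistic. The function name and the example probabilities are hypothetical, and neither the GSD fitting nor the bootstrap procedure used by the authors is reproduced here.

```python
import math


def g_statistic(observed_counts, expected_probs):
    """Standard G statistic: 2 * sum(O_i * ln(O_i / E_i)).

    observed_counts: number of votes per score category (e.g. 1..5).
    expected_probs: category probabilities under the hypothesized model
    (hypothetical values here; the paper fits a GSD instead).
    Categories with zero observations contribute nothing to the sum.
    """
    n = sum(observed_counts)
    total = 0.0
    for observed, prob in zip(observed_counts, expected_probs):
        if observed > 0:
            total += observed * math.log(observed / (n * prob))
    return 2.0 * total


# Hypothetical five-point scale: votes that match the model give G near 0,
# votes piled on the two extreme categories give a large G.
model = [0.1, 0.2, 0.4, 0.2, 0.1]
g_match = g_statistic([10, 20, 40, 20, 10], model)
g_split = g_statistic([50, 0, 0, 0, 50], model)
```

A larger G indicates a worse fit between the observed score distribution and the model. Turning G into a p-value requires a reference distribution, which the abstract says is obtained by bootstrapping.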

Increasing Video Perceptual Quality with GANs and Semantic Coding

  • Leonardo Galteri
  • Marco Bertini
  • Lorenzo Seidenari
  • Tiberio Uricchio
  • Alberto Del Bimbo

We have seen a rise in video-based user communication in the last year, unfortunately fueled by the spread of the COVID-19 disease. Efficient low-latency transmission of video is a challenging problem, which must also deal with the segmented nature of network infrastructure that does not always allow high throughput. Lossy video compression is a basic requirement for enabling such technology widely. While this may compromise the quality of the streamed video, there are recent deep learning based solutions to restore the quality of a lossy compressed video.

Considering the very nature of video conferencing, bitrate allocation in video streaming could be driven semantically, differentiating quality between the talking subjects and the background. Currently, there has not been any work studying the restoration of semantically coded video using deep learning. In this work, we show how such videos can be efficiently generated by shifting bitrate with masks derived via computer vision and how a deep generative adversarial network can be trained to restore video quality. Our study shows that the combination of semantic coding and learning-based video restoration can provide superior results.

Label Embedding Online Hashing for Cross-Modal Retrieval

  • Yongxin Wang
  • Xin Luo
  • Xin-Shun Xu

Supervised cross-modal hashing has gained a lot of attention recently. However, most existing methods learn binary codes or hash functions in a batch-based scheme, which is inefficient in an online scenario, i.e., data points come in a streaming fashion. Online hashing is a promising solution; however, there still exist several challenges, e.g., how to effectively exploit semantic information, how to discretely solve the binary optimization problem, how to efficiently update hash codes and hash functions. To address these issues, in this paper, we propose a novel supervised online cross-modal hashing method, i.e., Label EMbedding ONline hashing, LEMON for short. It builds a label embedding framework including label similarity preserving and label reconstructing, which may generate discriminative binary codes and reduce the computational complexity. Furthermore, it not only preserves the pairwise similarity of incoming data, but also establishes a connection between newly coming data and existing data by the inner product minimization on a block similarity matrix. In the light of this, it can exploit more similarity information and make the optimization less sensitive to incoming data, leading to effective binary codes. In addition, we design a discrete optimization algorithm to solve the binary optimization problem without relaxation. Therefore, the quantization error can be reduced. Moreover, its computational complexity is only relevant to the size of incoming data, making it very efficient and scalable to large-scale datasets. Extensive experimental results on three benchmark datasets demonstrate that LEMON outperforms some state-of-the-art offline and online cross-modal hashing methods in terms of accuracy and efficiency.

Quaternion-Based Knowledge Graph Network for Recommendation

  • Zhaopeng Li
  • Qianqian Xu
  • Yangbangyan Jiang
  • Xiaochun Cao
  • Qingming Huang

Recently, to alleviate the data sparsity and cold start problems, many research efforts have been devoted to the usage of knowledge graphs (KGs) in recommender systems. It is common for most existing KG-based models to represent users and items using real-valued embeddings. However, compared with complex or hypercomplex numbers, these real-valued vectors have less representation capacity and no intrinsic asymmetrical properties, and thus may limit the modeling of interactions between entities and relations in a KG. In this paper, we propose the Quaternion-based Knowledge Graph Network (QKGN) for recommendation, which represents users and items with quaternion embeddings in hypercomplex space, so that the latent inter-dependencies between entities and relations can be captured effectively. At the core of our model, a semantic matching principle based on the Hamilton product is applied to learn expressive quaternion representations from the unified user-item KG. On top of this, those embeddings are attentively updated by a customized preference propagation mechanism that takes structure information into account. Finally, we apply the proposed QKGN to three real-world datasets of music, movies and books, and experimental results show the validity of our method.
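The Hamilton product named in this abstract is the standard multiplication rule for quaternions a + bi + cj + dk. The sketch below shows the rule on plain 4-tuples and checks that it is non-commutative, which is the kind of asymmetry the abstract contrasts with real-valued embeddings. The function name and the tuple layout are illustrative choices; how QKGN applies the product to embeddings is not specified in the abstract and is not shown here.

```python
def hamilton_product(q1, q2):
    """Hamilton product of two quaternions given as (a, b, c, d) tuples,
    each standing for a + b*i + c*j + d*k."""
    a1, b1, c1, d1 = q1
    a2, b2, c2, d2 = q2
    return (
        a1 * a2 - b1 * b2 - c1 * c2 - d1 * d2,  # real part
        a1 * b2 + b1 * a2 + c1 * d2 - d1 * c2,  # i component
        a1 * c2 - b1 * d2 + c1 * a2 + d1 * b2,  # j component
        a1 * d2 + b1 * c2 - c1 * b2 + d1 * a2,  # k component
    )


# Basis elements i and j: i*j = k, but j*i = -k, so the order matters.
i_unit = (0, 1, 0, 0)
j_unit = (0, 0, 1, 0)
ij = hamilton_product(i_unit, j_unit)  # (0, 0, 0, 1)
ji = hamilton_product(j_unit, i_unit)  # (0, 0, 0, -1)
```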

SESSION: Oral Session A3: Multimedia Search and Recommendation

Class-Aware Modality Mix and Center-Guided Metric Learning for Visible-Thermal Person Re-Identification

  • Yongguo Ling
  • Zhun Zhong
  • Zhiming Luo
  • Paolo Rota
  • Shaozi Li
  • Nicu Sebe

Visible thermal person re-identification (VT-REID) is an important and challenging task in that 1) weak lighting environments are inevitably encountered in real-world settings and 2) the inter-modality discrepancy is serious. Most existing methods either aim at reducing the cross-modality gap in pixel- and feature-level or optimizing cross-modality network by metric learning techniques. However, few works have jointly considered these two aspects and studied their mutual benefits. In this paper, we design a novel framework to jointly bridge the modality gap in pixel- and feature-level without additional parameters, as well as reduce the inter- and intra-modalities variations by a center-guided metric learning constraint. Specifically, we introduce the Class-aware Modality Mix (CMM) to generate internal information of the two modalities for reducing the modality gap in pixel-level. In addition, we exploit the KL-divergence to further align modality distributions on feature-level. On the other hand, we propose an efficient Center-guided Metric Learning (CML) method for decreasing the discrepancy within the inter- and intra-modalities, by enforcing constraints on class centers and instances. Extensive experiments on two datasets show the mutual advantage of the proposed components and demonstrate the superiority of our method over the state of the art.

Adversarial Video Moment Retrieval by Jointly Modeling Ranking and Localization

  • Da Cao
  • Yawen Zeng
  • Xiaochi Wei
  • Liqiang Nie
  • Richang Hong
  • Zheng Qin

Retrieving video moments from an untrimmed video given a natural language query is a challenging task in both academia and industry. Although much effort has been made to address this issue, traditional video moment ranking methods are unable to generate reasonable video moment candidates, and video moment localization approaches are not applicable to large-scale retrieval scenarios. How to combine ranking and localization into a unified framework to overcome their drawbacks and reinforce each other is rarely considered. Toward this end, we contribute a novel solution to thoroughly investigate the video moment retrieval issue under the adversarial learning paradigm. The key to our solution is to formulate the video moment retrieval task as an adversarial learning problem with two tightly connected components. Specifically, a reinforcement learning model is employed as a generator to produce a set of possible video moments. Meanwhile, a pairwise ranking model is utilized as a discriminator to rank the generated video moments and the ground truth. Finally, the generator and the discriminator are mutually reinforced in the adversarial learning framework, which is able to jointly optimize the performance of both video moment ranking and video moment localization. Extensive experiments on two well-known datasets have verified the effectiveness and rationality of our proposed solution.

Beyond the Parts: Learning Multi-view Cross-part Correlation for Vehicle Re-identification

  • Xinchen Liu
  • Wu Liu
  • Jinkai Zheng
  • Chenggang Yan
  • Tao Mei

Vehicle re-identification (Re-Id) is a challenging task due to the inter-class similarity, the intra-class difference, and the cross-view misalignment of vehicle parts. Although recent methods achieve great improvement by learning detailed features from keypoints or bounding boxes of parts, vehicle Re-Id is still far from being solved. Different from existing methods, we propose a Parsing-guided Cross-part Reasoning Network, named as PCRNet, for vehicle Re-Id. The PCRNet explores vehicle parsing to learn discriminative part-level features, model the correlation among vehicle parts, and achieve precise part alignment for vehicle Re-Id. To accurately segment vehicle parts, we first build a large-scale Multi-grained Vehicle Parsing (MVP) dataset from surveillance images. With the parsed parts, we extract regional features for each part and build a part-neighboring graph to explicitly model the correlation among parts. Then, the graph convolutional networks (GCNs) are adopted to propagate local information among parts, which can discover the most effective local features of varied viewpoints. Moreover, we propose a self-supervised part prediction loss to make the GCNs generate features of invisible parts from visible parts under different viewpoints. By this means, the same vehicle from different viewpoints can be matched with the well-aligned and robust feature representations. Through extensive experiments, our PCRNet significantly outperforms the state-of-the-art methods on three large-scale vehicle Re-Id datasets.

Weakly-Supervised Image Hashing through Masked Visual-Semantic Graph-based Reasoning

  • Lu Jin
  • Zechao Li
  • Yonghua Pan
  • Jinhui Tang

With the popularization of social websites, many methods have been proposed to explore the noisy tags for weakly-supervised image hashing. The main challenge lies in learning appropriate and sufficient information from those noisy tags. To address this issue, this work proposes a novel Masked visual-semantic Graph-based Reasoning Network, termed MGRN, to learn joint visual-semantic representations for image hashing. Specifically, for each image, MGRN constructs a relation graph to capture the interactions among its associated tags and performs reasoning with Graph Attention Networks (GAT). MGRN randomly masks out one tag and then makes the GAT predict this masked tag. This forces the GAT model to capture the dependence between the image and its associated tags, which can well address the problem of noisy tags. Thus it can capture key tags and visual structures from images to learn well-aligned visual-semantic representations. Finally, an auto-encoder is leveraged to learn hash codes that can preserve the local structure of the joint space. Meanwhile, the joint visual-semantic representations are reconstructed from those hash codes by using a decoder. Experimental results on two widely-used benchmark datasets demonstrate the superiority of the proposed method for image retrieval compared with several state-of-the-art methods.

Semantic Consistency Guided Instance Feature Alignment for 2D Image-Based 3D Shape Retrieval

  • Heyu Zhou
  • Weizhi Nie
  • Dan Song
  • Nian Hu
  • Xuanya Li
  • An-An Liu

2D image-based 3D shape retrieval (2D-to-3D) investigates the problem of matching the relevant 3D shapes from a gallery dataset when given a query image. Recently, adversarial training and environmental style transfer learning have been successfully applied to this task and achieved state-of-the-art performance. However, there still exist two problems. First, previous works only concentrate on the connection between the label and representation, where the unique visual characteristics of each instance are paid less attention. Second, the confused features or the transformed images can only cheat the discriminator but cannot guarantee the semantic consistency. In other words, features of a 2D desk may be mapped near the features of a 3D chair. In this paper, we propose a novel semantic consistency guided instance feature alignment network (SC-IFA) to address these limitations. SC-IFA mainly consists of two parts, instance visual feature extraction and cross-domain instance feature adaptation. For the first module, unlike previous methods, which merely employ a 2D CNN to extract the feature, we additionally maximize the mutual information between the input and feature to enhance the capability of feature representation for each instance. For the second module, we first introduce the margin disparity discrepancy model to mix up the cross-domain features in an adversarial training way. Then, we design two feature translators to transform the feature from one domain to another domain, and impose the translation loss and correlation loss on the transformed features to preserve the semantic consistency. Extensive experimental results on two benchmarks, MI3DOR and MI3DOR-2, verify that SC-IFA is superior to the state-of-the-art methods.

RGB2LIDAR: Towards Solving Large-Scale Cross-Modal Visual Localization

  • Niluthpol Chowdhury Mithun
  • Karan Sikka
  • Han-Pang Chiu
  • Supun Samarasekera
  • Rakesh Kumar

We study an important, yet largely unexplored problem of large-scale cross-modal visual localization by matching ground RGB images to a geo-referenced aerial LIDAR 3D point cloud (rendered as depth images). Prior works were demonstrated on small datasets and did not lend themselves to scaling up for large-scale applications. To enable large-scale evaluation, we introduce a new dataset containing over 550K pairs (covering a 143 km² area) of RGB and aerial LIDAR depth images. We propose a novel joint embedding based method that effectively combines the appearance and semantic cues from both modalities to handle drastic cross-modal variations. Experiments on the proposed dataset show that our model achieves a strong result of a median rank of 5 in matching across a large test set of 50K location pairs collected from a 14 km² area. This represents a significant advancement over prior works in performance and scale. We conclude with qualitative results to highlight the challenging nature of this task and the benefits of the proposed model. Our work provides a foundation for further research in cross-modal visual localization.

SESSION: Oral Session B3: Multimedia Systems and Middleware & Media Transport and Delivery

Performance Optimization of Federated Person Re-identification via Benchmark Analysis

  • Weiming Zhuang
  • Yonggang Wen
  • Xuesen Zhang
  • Xin Gan
  • Daiying Yin
  • Dongzhan Zhou
  • Shuai Zhang
  • Shuai Yi

Federated learning is a privacy-preserving machine learning technique that learns a shared model across decentralized clients. It can alleviate privacy concerns of person re-identification, an important computer vision task. In this work, we apply federated learning to person re-identification (FedReID) and optimize its performance under the statistical heterogeneity of real-world scenarios. We first construct a new benchmark to investigate the performance of FedReID. This benchmark consists of (1) nine datasets with different volumes sourced from different domains to simulate the heterogeneous situation in reality, (2) two federated scenarios, and (3) an enhanced federated algorithm for FedReID. The benchmark analysis shows that the client-edge-cloud architecture, represented by the federated-by-dataset scenario, has better performance than the client-server architecture in FedReID. It also reveals the bottlenecks of FedReID under the real-world scenario, including poor performance of large datasets caused by unbalanced weights in model aggregation and challenges in convergence. Then we propose two optimization methods: (1) To address the unbalanced weight problem, we propose a new method to dynamically change the weights according to the scale of model changes in clients in each training round; (2) To facilitate convergence, we adopt knowledge distillation to refine the server model with knowledge generated from client models on a public dataset. Experimental results demonstrate that our strategies can achieve much better convergence with superior performance on all datasets. We believe that our work will inspire the community to further explore the implementation of federated learning on more computer vision tasks in real-world scenarios.
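The first optimization in this abstract, weighting clients by the scale of their model changes, can be sketched as follows. This is only one plausible reading: the abstract does not say how the scale of a change is measured, so the L2 distance between a client's parameters and the previous global parameters is an assumption, as are the function name and the flat-list parameter layout.

```python
import math


def aggregate_by_change(global_params, client_params):
    """Weighted average of client parameters, with each client's weight
    proportional to how far its parameters moved from the global model.

    Assumption: the 'scale of model changes' is the L2 distance between
    the client's parameters and the previous global parameters.
    """
    changes = [
        math.sqrt(sum((c - g) ** 2 for c, g in zip(client, global_params)))
        for client in client_params
    ]
    total = sum(changes)
    if total == 0.0:
        # No client moved, so keep the global model unchanged.
        return list(global_params)
    weights = [change / total for change in changes]
    return [
        sum(w * client[i] for w, client in zip(weights, client_params))
        for i in range(len(global_params))
    ]


# Two hypothetical clients: the second one changed three times as much,
# so it receives weight 0.75 and the first receives weight 0.25.
new_global = aggregate_by_change([0.0, 0.0], [[1.0, 0.0], [0.0, 3.0]])
```

In contrast to weighting by dataset size, this scheme recomputes the weights in every training round, which is the property the abstract highlights.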

Traffic-Aware Multi-Camera Tracking of Vehicles Based on ReID and Camera Link Model

  • Hung-Min Hsu
  • Yizhou Wang
  • Jenq-Neng Hwang

Multi-target multi-camera tracking (MTMCT), i.e., tracking multiple targets across multiple cameras, is a crucial technique for smart city applications. In this paper, we propose an effective and reliable MTMCT framework for vehicles, which consists of a traffic-aware single camera tracking (TSCT) algorithm, a trajectory-based camera link model (CLM) for vehicle re-identification (ReID), and a hierarchical clustering algorithm to obtain the cross camera vehicle trajectories. First, the TSCT, which jointly considers vehicle appearance, geometric features, and some common traffic scenarios, is proposed to track the vehicles in each camera separately. Second, the trajectory-based CLM is adopted to facilitate the relationship between each pair of adjacently connected cameras and add spatio-temporal constraints for the subsequent vehicle ReID with temporal attention. Third, the hierarchical clustering algorithm is used to merge the vehicle trajectories among all the cameras to obtain the final MTMCT results. Our proposed MTMCT is evaluated on the CityFlow dataset and achieves a new state-of-the-art performance with IDF1 of 74.93%.

Active Object Search

  • Jie Wu
  • Tianshui Chen
  • Lishan Huang
  • Hefeng Wu
  • Guanbin Li
  • Ling Tian
  • Liang Lin

In this work, we investigate an Active Object Search (AOS) task that is not explicitly addressed in the literature. It aims to actively perform as few action steps as possible to search and locate the target object in a 3D indoor scene. Different from classic object detection that passively receives visual information, this task encourages an intelligent agent to perform active search via reasonable action planning; thus it can better recall the target objects, especially for the challenging situations that the target is far from the agent, blocked by an obstacle and out of view. To handle this AOS task, we formulate a reinforcement learning framework that consists of a 3D object detector, a state controller and a cross-modal action planner to work cooperatively to find out the target object with minimal action steps. During training, we design a novel cost-sensitive active search reward that penalizes inaccurate object search and redundant action steps. To evaluate this novel task, we construct an Active Object Search (AOS) benchmark that contains 5,845 samples from 30 diverse indoor scenes. We conduct extensive qualitative and quantitative evaluations on this benchmark to demonstrate the effectiveness of the proposed approach and analyze the key factors that contribute more to address this task.

An Analysis of Delay in Live 360° Video Streaming Systems

  • Jun Yi
  • Md Reazul Islam
  • Shivang Aggarwal
  • Dimitrios Koutsonikolas
  • Y. Charlie Hu
  • Zhisheng Yan

While live 360° video streaming provides an enriched viewing experience, it is challenging to guarantee the user experience against the negative effects introduced by start-up delay, event-to-eye delay, and low frame rate. It is therefore imperative to understand how different computing tasks of a live 360° streaming system contribute to these three delay metrics. Although prior works have studied commercial live 360° video streaming systems, none of them has dug into the end-to-end pipeline and explored how the task-level time consumption affects the user experience. In this paper, we conduct the first in-depth measurement study of task-level time consumption for five system components in live 360° video streaming. We first identify the subtle relationship between the time consumption breakdown across the system pipeline and the three delay metrics. We then build a prototype Zeus to measure this relationship. Our findings indicate the importance of CPU-GPU transfer at the camera and the server initialization as well as the negligible effect of 360° video stitching on the delay metrics. We finally validate that our results are representative of real world systems by comparing them with those obtained with a commercial system.

DeepFacePencil: Creating Face Images from Freehand Sketches

  • Yuhang Li
  • Xuejin Chen
  • Binxin Yang
  • Zihan Chen
  • Zhihua Cheng
  • Zheng-Jun Zha

In this paper, we explore the task of generating photo-realistic face images from hand-drawn sketches. Existing image-to-image translation methods require a large-scale dataset of paired sketches and images for supervision. They typically utilize synthesized edge maps of face images as training data. However, these synthesized edge maps strictly align with the edges of the corresponding face images, which limits their generalization ability to real hand-drawn sketches with vast stroke diversity. To address this problem, we propose DeepFacePencil, an effective tool that is able to generate photo-realistic face images from hand-drawn sketches, based on a novel dual generator image translation network during training. A novel spatial attention pooling (SAP) module is designed to adaptively handle spatially varying stroke distortions, supporting various stroke styles and different levels of detail. We conduct extensive experiments and the results demonstrate the superiority of our model over existing methods on both image quality and model generalization to hand-drawn sketches.

When Bitstream Prior Meets Deep Prior: Compressed Video Super-resolution with Learning from Decoding

  • Peilin Chen
  • Wenhan Yang
  • Long Sun
  • Shiqi Wang

The standard paradigm of video super-resolution (SR) is to generate the spatial-temporal coherent high-resolution (HR) sequence from the corresponding low-resolution (LR) version which has already been decoded from the bitstream. However, a highly practical while relatively under-studied way is enabling the built-in SR functionality in the decoder, in the sense that almost all videos are compactly represented. In this paper, we systematically investigate the SR of compressed LR videos by leveraging the interactivity between decoding prior and deep prior. By fully exploiting the compact video stream information, the proposed bitstream prior embedded SR framework achieves compressed video SR and quality enhancement simultaneously in a single feed-forward process. More specifically, we propose a motion vector guided multi-scale local attention module that explicitly exploits the temporal dependency and suppresses coding artifacts with substantially economized computational complexity. Moreover, a scale-wise deep residual-in-residual network is learned to reconstruct the SR frames from the multi-scale fused features. To facilitate the research of compressed video SR, we also build a large-scale dataset with compressed videos of diverse content, including ready-made diversified kinds of side information extracted from the bitstream. Both quantitative and qualitative evaluations show that our model achieves superior performance for compressed video SR, and offers competitive performance compared to the sequential combinations of the state-of-the-art methods for compressed video artifacts removal and SR.

RL-Bélády: A Unified Learning Framework for Content Caching

  • Gang Yan
  • Jian Li

Content streaming is the dominant application in today's Internet, which is typically distributed via content delivery networks (CDNs). CDNs usually use caching as a means to reduce user access latency so as to enable faster content downloads. Typical analysis of caching systems either focuses on content admission, which decides whether to cache a content, or content eviction to decide which content to evict when the cache is full. This paper instead proposes a novel framework that can simultaneously learn both content admission and content eviction for caching in CDNs. To attain this goal, we first put forward a lightweight architecture for content next request time prediction. We then leverage reinforcement learning (RL) along with the prediction to learn the time-varying content popularities for content admission, and develop a simple threshold-based model for content eviction. We call this new algorithm RL-Bélády (RLB). In addition, we address several key challenges to design learning-based caching algorithms, including how to guarantee lightweight training and prediction with both content eviction and admission in consideration, limit memory overhead, reduce randomness and improve robustness in RL stochastic optimization. Our evaluation results using 3 production CDN datasets show that RLB can consistently outperform state-of-the-art methods with dramatically reduced running time and modest overhead.
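
To make the role of next-request-time prediction concrete, the following is a minimal sketch of a Bélády-style cache driven by predicted next request times. It is not the RLB algorithm: the learned RL admission policy is replaced here by a fixed threshold rule, and the names `access` and `admit_threshold` are hypothetical.

```python
def access(cache, capacity, key, predicted_next, admit_threshold):
    """Process one request and return True on a cache hit.

    cache maps each cached key to the predicted time of its next request.
    """
    if key in cache:
        # Hit: refresh the prediction for this object.
        cache[key] = predicted_next
        return True
    # Admission (assumed rule): skip objects not expected back soon.
    if predicted_next > admit_threshold:
        return False
    if len(cache) >= capacity:
        # Belady-style eviction: drop the object whose next request
        # is predicted to be furthest in the future.
        victim = max(cache, key=cache.get)
        del cache[victim]
    cache[key] = predicted_next
    return False
```

Bélády's offline algorithm uses the true next request times; any online variant has to substitute predictions, which is why prediction quality matters for both admission and eviction.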

SESSION: Oral Session C3: Multimodal Analysis and Description & Summarization, Analytics, and Storytelling

ShapeCaptioner: Generative Caption Network for 3D Shapes by Learning a Mapping from Parts Detected in Multiple Views to Sentences

  • Zhizhong Han
  • Chao Chen
  • Yu-Shen Liu
  • Matthias Zwicker

3D shape captioning is a challenging application in 3D shape understanding. Captions from recent multi-view based methods reveal that they cannot capture part-level characteristics of 3D shapes. This leads to a lack of detailed part-level description in captions, which humans tend to focus on. To resolve this issue, we propose ShapeCaptioner, a generative caption network, to perform 3D shape captioning from semantic parts detected in multiple views. Our novelty lies in learning the knowledge of part detection in multiple views from 3D shape segmentations and transferring this knowledge to facilitate learning the mapping from 3D shapes to sentences. Specifically, ShapeCaptioner aggregates the parts detected in multiple colored views using our novel part class specific aggregation to represent a 3D shape, and then employs a sequence-to-sequence model to generate the caption. Our results show that ShapeCaptioner can learn 3D shape features with more detailed part characteristics to facilitate better 3D shape captioning than previous work.

Co-Attentive Lifting for Infrared-Visible Person Re-Identification

  • Xing Wei
  • Diangang Li
  • Xiaopeng Hong
  • Wei Ke
  • Yihong Gong

Infrared-visible cross-modality person re-identification (IV-ReID) has attracted much attention with the popularity of dual-mode video surveillance systems, where the RGB mode works in the daytime and automatically switches to the infrared mode at night. Despite its significant application value, IV-ReID remains a difficult problem mainly due to two great challenges. First, it is difficult to identify persons in the infrared image, which lacks color and texture clues. Second, there is a significant gap between the infrared and visible modalities where appearances of the same person vary considerably. This paper proposes a novel attention-based approach to handle the two difficulties in a unified framework. 1) We propose an attention lifting mechanism to learn discriminative features in each modality. 2) We propose a co-attentive learning mechanism to bridge the gap between the two modalities. Our method only makes slight modifications of a given backbone network and requires small computation overhead while improving the performance significantly. We conduct extensive experiments to demonstrate the superiority of our proposed method.

Multimodal Representation with Embedded Visual Guiding Objects for Named Entity Recognition in Social Media Posts

  • Zhiwei Wu
  • Changmeng Zheng
  • Yi Cai
  • Junying Chen
  • Ho-fung Leung
  • Qing Li

Visual contexts often help to recognize named entities more precisely in short texts such as tweets or Snapchat posts. For example, one can identify "Charlie" as the name of a dog according to the user posts. Previous works on multimodal named entity recognition ignore the corresponding relations of visual objects and entities. Visual objects are considered as fine-grained image representations. For a sentence with multiple entity types, objects of the relevant image can be utilized to capture different entity information. In this paper, we propose a neural network which combines object-level image information and character-level text information to predict entities. Vision and language are bridged by leveraging object labels as embeddings, and a dense co-attention mechanism is introduced for fine-grained interactions. Experimental results on a Twitter dataset demonstrate that our method outperforms the state-of-the-art methods.

Context-Aware Multi-View Summarization Network for Image-Text Matching

  • Leigang Qu
  • Meng Liu
  • Da Cao
  • Liqiang Nie
  • Qi Tian

Image-text matching is a vital yet challenging task in the field of multimedia analysis. Over the past decades, great efforts have been made to bridge the semantic gap between the visual and textual modalities. Despite the significance and value, most prior work is still confronted with a multi-view description challenge, i.e., how to align an image to multiple textual descriptions with semantic diversity. Toward this end, we present a novel context-aware multi-view summarization network to summarize context-enhanced visual region information from multiple views. To be more specific, we design an adaptive gating self-attention module to extract representations of visual regions and words. By controlling the internal information flow, we are able to adaptively capture context information. Afterwards, we introduce a summarization module with a diversity regularization to aggregate region-level features into image-level ones from different perspectives. Ultimately, we devise a multi-view matching scheme to match multi-view image features with corresponding text ones. To justify our work, we have conducted extensive experiments on two benchmark datasets, i.e., Flickr30K and MS-COCO, which demonstrates the superiority of our model as compared to several state-of-the-art baselines.

Performance over Random: A Robust Evaluation Protocol for Video Summarization Methods

  • Evlampios Apostolidis
  • Eleni Adamantidou
  • Alexandros I. Metsai
  • Vasileios Mezaris
  • Ioannis Patras

This paper proposes a new evaluation approach for video summarization algorithms. We start by studying the currently established evaluation protocol; this protocol, defined over the ground-truth annotations of the SumMe and TVSum datasets, quantifies the agreement between the user-defined and the automatically-created summaries with F-Score, and reports the average performance on a few different training/testing splits of the used dataset. We evaluate five publicly-available summarization algorithms under a large-scale experimental setting with 50 randomly-created data splits. We show that the results reported in the papers are not always congruent with their performance on the large-scale experiment, and that the F-Score cannot be used for comparing algorithms evaluated on different splits. We also show that the above shortcomings of the established evaluation protocol are due to the significantly varying levels of difficulty among the utilized splits, that affect the outcomes of the evaluations. Further analysis of these findings indicates a noticeable performance correlation among all algorithms and a random summarizer. To mitigate these shortcomings we propose an evaluation protocol that makes estimates about the difficulty of each used data split and utilizes this information during the evaluation process. Experiments involving different evaluation settings demonstrate the increased representativeness of performance results when using the proposed evaluation approach, and the increased reliability of comparisons when the examined methods have been evaluated on different data splits.
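
One simple way to account for split difficulty, sketched below under stated assumptions, is to express a method's F-Score on each split relative to the F-Score that a random summarizer obtains on the same split, and then average over splits. The function name and the exact ratio-based formula are illustrative assumptions; the abstract does not give the protocol's formula.

```python
def performance_over_random(method_f, random_f):
    """Mean per-split ratio (in percent) of method F-Score to random F-Score.

    method_f and random_f are equal-length lists of F-Scores, one per split.
    """
    ratios = [100.0 * m / r for m, r in zip(method_f, random_f)]
    return sum(ratios) / len(ratios)
```

For example, F-Scores of 44 and 40 against a random baseline of 40 on both splits give ratios of 110 and 100, so the sketch reports 105; a value of 100 means the method is no better than random on average.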

Concept Drift Detection for Multivariate Data Streams and Temporal Segmentation of Daylong Egocentric Videos

  • Pravin Nagar
  • Mansi Khemka
  • Chetan Arora

The long and unconstrained nature of egocentric videos makes it imperative to use temporal segmentation as an important pre-processing step for many higher-level inference tasks. Activities of the wearer in an egocentric video typically span over hours and are often separated by slow, gradual changes. Furthermore, the change of camera viewpoint due to the wearer's head motion causes frequent and extreme but spurious scene changes. The continuous nature of boundaries makes it difficult to apply traditional Markov Random Field (MRF) pipelines relying on temporal discontinuity, whereas deep Long Short Term Memory (LSTM) networks gather context only up to a few hundred frames, rendering them ineffective for egocentric videos. In this paper, we present a novel unsupervised temporal segmentation technique especially suited for day-long egocentric videos. We formulate the problem as detecting concept drift in a time-varying, non-i.i.d. sequence of frames. Statistically bounded thresholds are calculated to detect concept drift between two temporally adjacent multivariate data segments with different underlying distributions while establishing guarantees on false positives. Since the derived threshold indicates confidence in the prediction, it can also be used to control the granularity of the output segmentation. Using our technique, we report significantly improved state of the art f-measure for daylong egocentric video datasets, as well as photostream datasets derived from them: HUJI (73.01%, 59.44%), UTEgo (58.41%, 60.61%) and Disney (67.63%, 68.83%).

Distributed Multi-agent Video Fast-forwarding

  • Shuyue Lan
  • Zhilu Wang
  • Amit K. Roy-Chowdhury
  • Ermin Wei
  • Qi Zhu

In many intelligent systems, a network of agents collaboratively perceives the environment for better and more efficient situation awareness. As these agents often have limited resources, it could be greatly beneficial to identify the content overlapping among camera views from different agents and leverage it for reducing the processing, transmission and storage of redundant/unimportant video frames. This paper presents a consensus-based distributed multi-agent video fast-forwarding framework, named DMVF, that fast-forwards multi-view video streams collaboratively and adaptively. In our framework, each camera view is addressed by a reinforcement learning based fast-forwarding agent, which periodically chooses from multiple strategies to selectively process video frames and transmits the selected frames at adjustable paces. During every adaptation period, each agent communicates with a number of neighboring agents, evaluates the importance of the selected frames from itself and those from its neighbors, refines such evaluation together with other agents via a system-wide consensus algorithm, and uses such evaluation to decide their strategy for the next period. Compared with approaches in the literature on a real-world surveillance video dataset VideoWeb, our method significantly improves the coverage of important frames and also reduces the number of frames processed in the system.

SESSION: Oral Session D3: Multimodal Fusion and Embedding

Controllable Video Captioning with an Exemplar Sentence

  • Yitian Yuan
  • Lin Ma
  • Jingwen Wang
  • Wenwu Zhu

In this paper, we investigate a novel and challenging task, namely controllable video captioning with an exemplar sentence. Formally, given a video and a syntactically valid exemplar sentence, the task aims to generate one caption which not only describes the semantic contents of the video, but also follows the syntactic form of the given exemplar sentence. In order to tackle such an exemplar-based video captioning task, we propose a novel Syntax Modulated Caption Generator (SMCG) incorporated in an encoder-decoder-reconstructor architecture. The proposed SMCG takes video semantic representation as an input, and conditionally modulates the gates and cells of long short-term memory network with respect to the encoded syntactic information of the given exemplar sentence. Therefore, SMCG is able to control the states for word prediction and achieve the syntax customized caption generation. We conduct experiments by collecting auxiliary exemplar sentences for two public video captioning datasets. Extensive experimental results demonstrate the effectiveness of our approach on generating syntax controllable and semantic preserved video captions. By providing different exemplar sentences, our approach is capable of producing different captions with various syntactic structures, thus indicating a promising way to strengthen the diversity of video captioning. Code for this paper is available at

MMFL: Multimodal Fusion Learning for Text-Guided Image Inpainting

  • Qing Lin
  • Bo Yan
  • Jichun Li
  • Weimin Tan

Painters can successfully recover severely damaged objects, yet current inpainting algorithms still cannot achieve this ability. Generally, painters will have a conjecture about the seriously missing image before restoring it, which can be expressed in a text description. This paper imitates the process of painters' conjecture, and proposes to introduce the text description into the image inpainting task for the first time, which provides abundant guidance information for image restoration through the fusion of multimodal features. We propose a multimodal fusion learning method for image inpainting (MMFL). To make better use of text features, we construct an image-adaptive word demand module to reasonably filter the effective text features. We introduce a text guided attention loss and a text-image matching loss to make the network pay more attention to the entities in the text description. Extensive experiments prove that our method can better predict the semantics of objects in the missing regions and generate fine-grained textures.

Vision Meets Wireless Positioning: Effective Person Re-identification with Recurrent Context Propagation

  • Yiheng Liu
  • Wengang Zhou
  • Mao Xi
  • Sanjing Shen
  • Houqiang Li

Existing person re-identification methods rely on the visual sensor to capture the pedestrians. The image or video data from visual sensor inevitably suffers the occlusion and dramatic variations of pedestrian postures, which degrades the re-identification performance and further limits its application to the open environment. On the other hand, for most people, one of the most important carry-on items is the mobile phone, which can be sensed by WiFi and cellular networks in the form of a wireless positioning signal. Such signal is robust to the pedestrian occlusion and visual appearance change, but suffers some positioning error. In this work, we approach person re-identification with the sensing data from both vision and wireless positioning. To take advantage of such cross-modality cues, we propose a novel recurrent context propagation module that enables information to propagate between visual data and wireless positioning data and finally improves the matching accuracy. To evaluate our approach, we contribute a new Wireless Positioning Person Re-identification (WP-ReID) dataset. Extensive experiments are conducted and demonstrate the effectiveness of the proposed algorithm. Code will be released at

Structural Semantic Adversarial Active Learning for Image Captioning

  • Beichen Zhang
  • Liang Li
  • Li Su
  • Shuhui Wang
  • Jincan Deng
  • Zheng-Jun Zha
  • Qingming Huang

Most image captioning models achieve superior performances with the help of large-scale supervised training data, but it is prohibitively costly to label the image captions. To solve this problem, we propose a structural semantic adversarial active learning (SSAAL) model that leverages both visual and textual information for deriving the most representative samples while maximizing the image captioning performance. SSAAL consists of a semantic constructor, a snapshot & caption (SC) supervisor, and a labeled/unlabeled state discriminator. The constructor is designed to generate a structural semantic representation describing the objects, attributes and object relationships in the image. The SC supervisor is proposed to supervise this representation at the word-level and sentence-level in a multi-task learning manner, which directly relates the representation to ground-truth captions and updates it in the caption generating process. Finally, we introduce a state discriminator to predict the sample state and select images with sufficient semantic and fine-grained diversity. Extensive experiments on a standard captioning dataset show that our model outperforms other active learning methods and achieves a competitive performance even when selecting only a small number of samples.

MISA: Modality-Invariant and -Specific Representations for Multimodal Sentiment Analysis

  • Devamanyu Hazarika
  • Roger Zimmermann
  • Soujanya Poria

Multimodal Sentiment Analysis is an active area of research that leverages multimodal signals for affective understanding of user-generated videos. The predominant approach, addressing this task, has been to develop sophisticated fusion techniques. However, the heterogeneous nature of the signals creates distributional modality gaps that pose significant challenges. In this paper, we aim to learn effective modality representations to aid the process of fusion. We propose a novel framework, MISA, which projects each modality to two distinct subspaces. The first subspace is modality-invariant, where the representations across modalities learn their commonalities and reduce the modality gap. The second subspace is modality-specific, which is private to each modality and captures their characteristic features. These representations provide a holistic view of the multimodal data, which is used for fusion that leads to task predictions. Our experiments on popular sentiment analysis benchmarks, MOSI and MOSEI, demonstrate significant gains over state-of-the-art models. We also consider the task of Multimodal Humor Detection and experiment on the recently proposed UR_FUNNY dataset. Here too, our model fares better than strong baselines, establishing MISA as a useful multimodal framework.

Multi-modal Cooking Workflow Construction for Food Recipes

  • Liang-Ming Pan
  • Jingjing Chen
  • Jianlong Wu
  • Shaoteng Liu
  • Chong-Wah Ngo
  • Min-Yen Kan
  • Yugang Jiang
  • Tat-Seng Chua

Understanding a food recipe requires anticipating the implicit causal effects of cooking actions, such that the recipe can be converted into a graph describing the temporal workflow of the recipe. This is a non-trivial task that involves common-sense reasoning. However, existing efforts rely on hand-crafted features to extract the workflow graph from recipes due to the lack of large-scale labeled datasets. Moreover, they fail to utilize the cooking images, which constitute an important part of food recipes. In this paper, we build MM-ReS, the first large-scale dataset for cooking workflow construction, consisting of 9,850 recipes with human-labeled workflow graphs. Cooking steps are multi-modal, featuring both text instructions and cooking images. We then propose a neural encoder-decoder model that utilizes both visual and textual information to construct the cooking workflow, which achieves a performance gain of over 20% over existing hand-crafted baselines.

Depth Guided Adaptive Meta-Fusion Network for Few-shot Video Recognition

  • Yuqian Fu
  • Li Zhang
  • Junke Wang
  • Yanwei Fu
  • Yu-Gang Jiang

Humans can easily recognize actions with only a few examples given, while the existing video recognition models still heavily rely on large-scale labeled data inputs. This observation has motivated an increasing interest in few-shot video action recognition, which aims at learning new actions with only very few labeled samples. In this paper, we propose a depth guided Adaptive Meta-Fusion Network for few-shot video recognition, termed AMeFu-Net. Concretely, we tackle the few-shot recognition problem from three aspects: firstly, we alleviate this extremely data-scarce problem by introducing depth information as a carrier of the scene, which will bring extra visual information to our model; secondly, we fuse the representation of original RGB clips with multiple non-strictly corresponding depth clips sampled by our temporal asynchronization augmentation mechanism, which synthesizes new instances at feature-level; thirdly, a novel Depth Guided Adaptive Instance Normalization (DGAdaIN) fusion module is proposed to fuse the two-stream modalities efficiently. Additionally, to better mimic the few-shot recognition process, our model is trained in a meta-learning manner. Extensive experiments on several action recognition benchmarks demonstrate the effectiveness of our model.
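
As background for the DGAdaIN module, the sketch below shows plain adaptive instance normalization (AdaIN) on flat feature vectors: the first input is normalized with its own statistics and then rescaled with the statistics of the second. This is the generic AdaIN operation, not the paper's module; how DGAdaIN derives its scale and shift from depth features is not described in the abstract, and the names `mean_std` and `adain` are illustrative.

```python
import math

def mean_std(xs, eps=1e-5):
    """Mean and standard deviation of a feature vector (eps avoids division by zero)."""
    m = sum(xs) / len(xs)
    var = sum((x - m) ** 2 for x in xs) / len(xs)
    return m, math.sqrt(var + eps)

def adain(content, guide):
    """Normalize `content`, then apply the mean and std of `guide`."""
    mc, sc = mean_std(content)
    mg, sg = mean_std(guide)
    return [sg * (x - mc) / sc + mg for x in content]
```

In a two-stream setting one could, as an assumption, pass RGB features as `content` and depth features as `guide`, so that the depth stream steers the statistics of the fused representation.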

Adaptive Temporal Triplet-loss for Cross-modal Embedding Learning

  • David Semedo
  • João Magalhães

There are many domains where the temporal dimension is critical to unveil how different modalities, such as images and texts, are correlated. Notably, in the social media domain, information is constantly evolving over time according to the events that take place in the real world. In this work, we seek highly expressive loss functions that allow the encoding of data temporal traits into cross-modal embedding spaces. To achieve this goal, we propose to steer the learning procedure of such embedding through a set of adaptively enforced temporal constraints. In particular, we propose a new formulation of the triplet loss function, where the traditional static margin is superseded by a novel temporally adaptive maximum margin function. This novel redesign of the static margin formulation allows the embedding to effectively capture not only the semantic correlations across data modalities, but also data's fine-grained temporal correlations. Our experiments confirm the effectiveness of our model in structuring different modalities, while organizing data according to temporal correlations. Moreover, we experimentally highlight how these embeddings can be used for multimedia understanding.
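
The contrast between a static and a temporally adaptive margin can be sketched as follows. The triplet loss itself is the standard hinge form; the margin function `temporal_margin`, which grows with the time gap between anchor and negative and saturates at `max_margin`, is a hypothetical choice made for illustration and is not the formulation proposed in the paper.

```python
import math

def sq_dist(u, v):
    """Squared Euclidean distance between two embeddings."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def temporal_margin(t_anchor, t_negative, max_margin=1.0, tau=30.0):
    """Assumed margin: 0 for simultaneous items, approaching max_margin as the gap grows."""
    return max_margin * (1.0 - math.exp(-abs(t_anchor - t_negative) / tau))

def triplet_loss(anchor, positive, negative, margin):
    """Standard hinge triplet loss with an explicit margin argument."""
    return max(0.0, sq_dist(anchor, positive) - sq_dist(anchor, negative) + margin)
```

Passing `temporal_margin(t_a, t_n)` as `margin` pushes temporally distant negatives further from the anchor than temporally close ones, whereas a static margin treats all negatives alike.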

SESSION: Oral Session E3: Music, Speech and Audio Processing in Multimedia & Social Media

Scene-Aware Background Music Synthesis

  • Yujia Wang
  • Wei Liang
  • Wanwan Li
  • Dingzeyu Li
  • Lap-Fai Yu

In this paper, we introduce an interactive background music synthesis algorithm guided by visual content. We leverage a cascading strategy to synthesize background music in two stages: Scene Visual Analysis and Background Music Synthesis. First, seeking a deep learning-based solution, we leverage neural networks to analyze the sentiment of the input scene. Second, real-time background music is synthesized by optimizing a cost function that guides the selection and transition of music clips to maximize the emotion consistency between visual and auditory criteria, and music continuity. In our experiments, we demonstrate the proposed approach can synthesize dynamic background music for different types of scenarios. We also conducted quantitative and qualitative analysis on the synthesized results of multiple example scenes to validate the efficacy of our approach.

Deep-Modal: Real-Time Impact Sound Synthesis for Arbitrary Shapes

  • Xutong Jin
  • Sheng Li
  • Tianshu Qu
  • Dinesh Manocha
  • Guoping Wang

Modal sound synthesis is a physically-based sound synthesis method used to generate audio content in games and virtual worlds. We present a novel learning-based impact sound synthesis algorithm called Deep-Modal. Our approach can handle sound synthesis for common arbitrary objects, especially dynamically generated objects, in real-time. We present a new compact strategy to represent the mode data, corresponding to frequency and amplitude, as fixed-length vectors. This is combined with a new network architecture that can convert shape features of 3D objects into mode data. Our network is based on an encoder-decoder architecture with the contact positions of objects and external forces embedded. Our method can synthesize interactive sounds related to objects of various shapes at any contact position, as well as objects of different materials and sizes. The synthesis process only takes ~0.01s on a GTX 1080 Ti GPU. We show the effectiveness of Deep-Modal through extensive evaluation using different metrics, including recall and precision of prediction, sound spectrogram, and a user study.
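
For readers unfamiliar with modal synthesis, the sketch below renders an impact sound as a sum of exponentially damped sinusoids, one per mode. The abstract mentions only frequency and amplitude as mode data; the per-mode damping coefficient, the function name `synthesize`, and the example values are assumptions added for illustration, and the network that predicts the modes is not modeled here.

```python
import math

def synthesize(modes, duration=0.5, sample_rate=44100):
    """Render modes as damped sinusoids.

    modes is a list of (frequency_hz, amplitude, damping_per_second) tuples.
    """
    n = int(duration * sample_rate)
    samples = []
    for k in range(n):
        t = k / sample_rate
        samples.append(sum(
            a * math.exp(-d * t) * math.sin(2.0 * math.pi * f * t)
            for f, a, d in modes
        ))
    return samples
```

A call such as `synthesize([(440.0, 0.6, 8.0), (1320.0, 0.3, 15.0)])` produces a short decaying tone with two partials; once the modes are known, rendering is a cheap closed-form sum.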

Pop Music Transformer: Beat-based Modeling and Generation of Expressive Pop Piano Compositions

  • Yu-Siang Huang
  • Yi-Hsuan Yang

A great number of deep learning based models have been recently proposed for automatic music composition. Among these models, the Transformer stands out as a prominent approach for generating expressive classical piano performance with a coherent structure of up to one minute. The model is powerful in that it learns abstractions of data on its own, without much human-imposed domain knowledge or constraints. In contrast with this general approach, this paper shows that Transformers can do even better for music modeling, when we improve the way a musical score is converted into the data fed to a Transformer model. In particular, we seek to impose a metrical structure in the input data, so that Transformers can be more easily aware of the beat-bar-phrase hierarchical structure in music. The new data representation maintains the flexibility of local tempo changes, and provides handles to control the rhythmic and harmonic structure of music. With this approach, we build a Pop Music Transformer that composes Pop piano music with better rhythmic structure than existing Transformer models.

Make Your Favorite Music Curative: Music Style Transfer for Anxiety Reduction

  • Zhejing Hu
  • Yan Liu
  • Gong Chen
  • Sheng-hua Zhong
  • Aiwei Zhang

Anxiety is the most common mental problem, affecting nearly 300 million individuals worldwide, and the situation has worsened recently. In clinical practice, music therapy has been used for more than forty years because of its effectiveness and few side effects in emotion regulation. This paper proposes a novel style transfer model to generate therapeutic music according to the user's preference. It is widely recognized that favorite music greatly increases the engagement of the user and hence results in much better curative effects. But in general, users can provide only one or several favorite songs, which are insufficient for the customization of therapeutic music. To address this difficulty, a new domain adaptation algorithm is designed that transfers the learning result for music genre classification to music personalization. Targeting the joint minimization of the loss functions, three convolutional neural networks are utilized to generate the therapeutic music with only one labelled favorite song. The experiment on anxiety sufferers shows that the customized therapeutic music has achieved better and stable performance in anxiety reduction.

PopMAG: Pop Music Accompaniment Generation

  • Yi Ren
  • Jinzheng He
  • Xu Tan
  • Tao Qin
  • Zhou Zhao
  • Tie-Yan Liu

In pop music, accompaniments are usually played by multiple instruments (tracks) such as drum, bass, string and guitar, and can make a song more expressive and contagious by arranging together with its melody. Previous works usually generate multiple tracks separately and the music notes from different tracks do not explicitly depend on each other, which hurts the harmony modeling. To improve harmony, in this paper, we propose a novel MUlti-track MIDI representation (MuMIDI), which enables simultaneous multi-track generation in a single sequence and explicitly models the dependency of the notes from different tracks. While this greatly improves harmony, unfortunately, it enlarges the sequence length and brings the new challenge of long-term music modeling. We further introduce two new techniques to address this challenge: 1) We model multiple note attributes (e.g., pitch, duration, velocity) of a musical note in one step instead of multiple steps, which can shorten the length of a MuMIDI sequence. 2) We introduce extra long-context as memory to capture long-term dependency in music. We call our system for pop music accompaniment generation PopMAG. We evaluate PopMAG on multiple datasets (LMD, FreeMidi and CPMD, a private dataset of Chinese pop songs) with both subjective and objective metrics. The results demonstrate the effectiveness of PopMAG for multi-track harmony modeling and long-term context modeling. Specifically, PopMAG wins 42%/38%/40% of the votes when compared with ground truth musical pieces on the LMD, FreeMidi and CPMD datasets respectively, and largely outperforms other state-of-the-art music accompaniment generation models and multi-track MIDI representations in terms of subjective and objective metrics.
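
The sequence-length effect of modeling several note attributes in one step can be seen with a toy comparison. This is only a sketch: the `Note` class, the two encoding functions and the token layout are hypothetical, and the symbols a full multi-track representation would need (for example bar, position or track markers) are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Note:
    pitch: int
    duration: int
    velocity: int


def one_token_per_attribute(notes):
    """Emit three sequence steps per note, one for each attribute."""
    tokens = []
    for n in notes:
        tokens.extend([("pitch", n.pitch), ("duration", n.duration), ("velocity", n.velocity)])
    return tokens


def one_step_per_note(notes):
    """Emit a single compound step per note that carries all attributes."""
    return [(n.pitch, n.duration, n.velocity) for n in notes]


# Hypothetical notes, for illustration only.
notes = [Note(60, 4, 90), Note(64, 4, 85), Note(67, 8, 80)]
flat = one_token_per_attribute(notes)
compound = one_step_per_note(notes)
```

With three attributes per note, the compound encoding needs one third as many steps for the note events, which is the kind of shortening the abstract refers to.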

DeepSonar: Towards Effective and Robust Detection of AI-Synthesized Fake Voices

  • Run Wang
  • Felix Juefei-Xu
  • Yihao Huang
  • Qing Guo
  • Xiaofei Xie
  • Lei Ma
  • Yang Liu

With the recent advances in voice synthesis, AI-synthesized fake voices are indistinguishable to human ears and are widely applied to produce realistic and natural DeepFakes, posing real threats to our society. However, effective and robust detectors for synthesized fake voices are still in their infancy and are not ready to fully tackle this emerging threat. In this paper, we devise a novel approach, named DeepSonar, based on monitoring neuron behaviors of a speaker recognition (SR) system, i.e., a deep neural network (DNN), to discern AI-synthesized fake voices. Layer-wise neuron behaviors provide an important insight to meticulously catch the differences among inputs, and are widely employed for building safe, robust, and interpretable DNNs. In this work, we leverage the power of layer-wise neuron activation patterns with a conjecture that they can capture the subtle differences between real and AI-synthesized fake voices, providing a cleaner signal to classifiers than raw inputs. Experiments are conducted on three datasets (including commercial products from Google, Baidu, etc.) containing both English and Chinese languages to corroborate the high detection rates (98.1% average accuracy) and low false alarm rates (about 2% error rate) of DeepSonar in discerning fake voices. Furthermore, extensive experimental results also demonstrate its robustness against manipulation attacks (e.g., voice conversion and additive real-world noises). Our work further poses a new insight into adopting neuron behaviors for effective and robust AI aided multimedia fakes forensics as an inside-out approach instead of being motivated and swayed by various artifacts introduced in synthesizing fakes.

FakePolisher: Making DeepFakes More Detection-Evasive by Shallow Reconstruction

  • Yihao Huang
  • Felix Juefei-Xu
  • Run Wang
  • Qing Guo
  • Lei Ma
  • Xiaofei Xie
  • Jianwen Li
  • Weikai Miao
  • Yang Liu
  • Geguang Pu

At present, GAN-based image generation methods are still imperfect: their upsampling design leaves certain artifact patterns in the synthesized image. Such artifact patterns can be easily exploited (by recent methods) to distinguish real images from GAN-synthesized ones. However, the existing detection methods put much emphasis on the artifact patterns, and can become futile if such artifact patterns are reduced.

Towards reducing the artifacts in the synthesized images, in this paper, we devise a simple yet powerful approach termed FakePolisher that performs shallow reconstruction of fake images through a learned linear dictionary, intending to effectively and efficiently reduce the artifacts introduced during image synthesis. In particular, we first train a dictionary model to capture the patterns of real images. Based on this dictionary, we seek the representation of DeepFake images in a low dimensional subspace through linear projection or sparse coding. Then, we are able to perform shallow reconstruction of the 'fake-free' version of the DeepFake image, which largely reduces the artifact patterns DeepFake introduces. The comprehensive evaluation on 3 state-of-the-art DeepFake detection methods and fake images generated by 16 popular GAN-based fake image generation techniques, demonstrates the effectiveness of our technique. Overall, through reducing artifact patterns, our technique significantly reduces the accuracy of the 3 state-of-the-art fake image detection methods, i.e., 47% on average and up to 93% in the worst case.

Our results confirm the limitation of current fake detection methods and call the attention of DeepFake researchers and practitioners to the need for more general-purpose fake detection techniques.

SESSION: Oral Session F3: Vision and Language

Boosting Visual Question Answering with Context-aware Knowledge Aggregation

  • Guohao Li
  • Xin Wang
  • Wenwu Zhu

Given an image and a natural language question, Visual Question Answering (VQA) aims at answering the textual question correctly. Most VQA approaches in the literature aim to find answers based solely on analyzing the given images and questions. Other works that try to incorporate external knowledge into VQA adopt a query-based search on knowledge graphs to obtain the answer. However, these works suffer from the following problem: the model training process relies heavily on the ground-truth knowledge facts, which serve as supervised information, so missing these ground-truth knowledge facts during training will lead to failures in producing the correct answers. To solve this challenging issue, we propose a Knowledge Graph Augmented (KG-Aug) model which conducts context-aware knowledge aggregation on external knowledge graphs, requiring no ground-truth knowledge facts for extra supervision. The proposed KG-Aug model is capable of retrieving context-aware knowledge subgraphs given visual images and textual questions, and learning to aggregate the useful image- and question-dependent knowledge which is then utilized to boost the accuracy in answering visual questions. We carry out extensive experiments to validate the effectiveness of our proposed KG-Aug models against several baseline approaches on various datasets.

Memory-Augmented Relation Network for Few-Shot Learning

  • Jun He
  • Richang Hong
  • Xueliang Liu
  • Mingliang Xu
  • Zheng-Jun Zha
  • Meng Wang

Metric-based few-shot learning methods concentrate on learning transferable feature embeddings which generalize well from seen categories to unseen categories under limited supervision. However, most of the methods treat each individual instance separately without considering its relationships with the others in the working context. We investigate a new metric-learning method to explicitly exploit these relationships. In particular, for an instance, we choose the samples that are visually similar from the working context, and perform weighted information propagation to attentively aggregate helpful information from the chosen samples to enhance its representation. We further formulate the distance metric as a learnable relation module which learns to compare for similarity measurement, and equip the working context with memory slots, both contributing to generality. We empirically demonstrate that the proposed method yields significant improvement over its ancestor and achieves competitive or even better performance when compared with other few-shot learning approaches on the two major benchmark datasets, Imagenet and tiered Imagenet.

K-armed Bandit based Multi-Modal Network Architecture Search for Visual Question Answering

  • Yiyi Zhou
  • Rongrong Ji
  • Xiaoshuai Sun
  • Gen Luo
  • Xiaopeng Hong
  • Jinsong Su
  • Xinghao Ding
  • Ling Shao

In this paper, we propose a cross-modal network architecture search (NAS) algorithm for VQA, termed as k-Armed Bandit based NAS (KAB-NAS). KAB-NAS regards the design of each layer as a k-armed bandit problem and updates the preference of each candidate via numerous samplings in a single-shot search framework. To establish an effective search space, we further propose a new architecture termed Automatic Graph Attention Network (AGAN), and extend the popular self-attention layer with three graph structures, denoted as dense-graph, co-graph and separate-graph. These graph layers are used to form the direction of information propagation in the graph network, and their optimal combinations are searched by KAB-NAS. To evaluate KAB-NAS and AGAN, we conduct extensive experiments on two VQA benchmark datasets, i.e., VQA2.0 and GQA, and also test AGAN with the popular BERT-style pre-training. The experimental results show that with the help of KAB-NAS, AGAN can achieve the state-of-the-art performance on both benchmark datasets with much fewer parameters and computations.
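
To make the k-armed bandit framing concrete, the sketch below runs a generic bandit with Boltzmann (softmax) exploration over running reward estimates: each arm stands for one candidate operation in a layer, and its preference is nudged toward the reward observed each time it is sampled. This is a textbook-style illustration, not the KAB-NAS update rule, which the abstract does not specify. The reward function, step size and all names are assumptions.

```python
import math
import random


def softmax(prefs):
    """Numerically stable softmax over a list of preferences."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def run_bandit(reward_fn, k, steps, step_size=0.1, seed=0):
    """Sample arms by softmax preference and track a running reward estimate per arm."""
    rng = random.Random(seed)
    prefs = [0.0] * k
    for _ in range(steps):
        probs = softmax(prefs)
        arm = rng.choices(range(k), weights=probs)[0]
        reward = reward_fn(arm, rng)
        # Move the sampled arm's preference toward the observed reward.
        prefs[arm] += step_size * (reward - prefs[arm])
    return prefs


def toy_reward(arm, rng):
    """Hypothetical stand-in for the validation reward of a sampled candidate."""
    means = [0.2, 0.5, 0.8]
    return means[arm] + rng.gauss(0.0, 0.05)


prefs = run_bandit(toy_reward, k=3, steps=2000)
best = max(range(3), key=lambda a: prefs[a])
```

In an architecture search setting, one such bandit per layer would be a natural reading of the abstract, with the highest-preference candidate kept at the end of the search.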

Adversarial Graph Representation Adaptation for Cross-Domain Facial Expression Recognition

  • Yuan Xie
  • Tianshui Chen
  • Tao Pu
  • Hefeng Wu
  • Liang Lin

Data inconsistency and bias are inevitable among different facial expression recognition (FER) datasets due to the subjective annotation process and different collection conditions. Recent works resort to adversarial mechanisms that learn domain-invariant features to mitigate domain shift. However, most of these works focus on holistic feature adaptation, and they ignore local features that are more transferable across different datasets. Moreover, local features carry more detailed and discriminative content for expression recognition, and thus integrating local features may enable fine-grained adaptation. In this work, we propose a novel Adversarial Graph Representation Adaptation (AGRA) framework that unifies graph representation propagation with adversarial learning for cross-domain holistic-local feature co-adaptation. To achieve this, we first build a graph to correlate holistic and local regions within each domain and another graph to correlate these regions across different domains. Then, we learn the per-class statistical distribution of each domain and extract holistic-local features from the input image to initialize the corresponding graph nodes. Finally, we introduce two stacked graph convolution networks to propagate holistic-local features within each domain to explore their interaction and across different domains for holistic-local feature co-adaptation. In this way, the AGRA framework can adaptively learn fine-grained domain-invariant features and thus facilitate cross-domain expression recognition. We conduct extensive and fair experiments on several popular benchmarks and show that the proposed AGRA framework achieves superior performance over previous state-of-the-art methods.

KBGN: Knowledge-Bridge Graph Network for Adaptive Vision-Text Reasoning in Visual Dialogue

  • Xiaoze Jiang
  • Siyi Du
  • Zengchang Qin
  • Yajing Sun
  • Jing Yu

Visual dialogue is a challenging task that requires extracting implicit information from both visual (image) and textual (dialogue history) contexts. Classical approaches pay more attention to the integration of the current question, vision knowledge and text knowledge, disregarding the heterogeneous semantic gaps between the cross-modal information. Meanwhile, the concatenation operation has become the de facto standard for cross-modal information fusion, but it has a limited ability in information retrieval. In this paper, we propose a novel Knowledge-Bridge Graph Network (KBGN) model that uses graphs to bridge the cross-modal semantic relations between vision and text knowledge at fine granularity, and retrieves the required knowledge via an adaptive information selection mode. Moreover, the reasoning clues for visual dialogue can be clearly drawn from intra-modal entities and inter-modal bridges. Experimental results on VisDial v1.0 and VisDial-Q datasets demonstrate that our model outperforms existing models with state-of-the-art results.

Cascade Grouped Attention Network for Referring Expression Segmentation

  • Gen Luo
  • Yiyi Zhou
  • Rongrong Ji
  • Xiaoshuai Sun
  • Jinsong Su
  • Chia-Wen Lin
  • Qi Tian

Referring expression segmentation (RES) aims to segment the target instance in a given image according to a natural language expression. Its main challenge lies in how to quickly and accurately align the text expression to the referred visual instances. In this paper, we focus on addressing this issue by proposing a Cascade Grouped Attention Network (CGAN) with two innovative designs: Cascade Grouped Attention (CGA) and Instance-level Attention (ILA) loss. Specifically, CGA is used to perform step-wise reasoning over the entire image to perceive the differences between instances accurately yet efficiently, so as to identify the referent. ILA loss is further embedded into each step of CGA to directly supervise the attention modeling, which improves the alignments between the text expression and the visual instances. Through these two novel designs, CGAN can achieve the high efficiency of one-stage RES while possessing a strong reasoning ability comparable to the two-stage methods. To validate our model, we conduct extensive experiments on three RES benchmark datasets and achieve significant performance gains over existing one-stage and multi-stage models.

SESSION: Oral Session G3: Vision and Language

Reinforcement Learning for Weakly Supervised Temporal Grounding of Natural Language in Untrimmed Videos

  • Jie Wu
  • Guanbin Li
  • Xiaoguang Han
  • Liang Lin

Temporal grounding of natural language in untrimmed videos is a fundamental yet challenging multimedia task facilitating cross-media visual content retrieval. We focus on the weakly supervised setting of this task, which only has access to coarse video-level language description annotations without temporal boundaries; this is more consistent with reality, as such weak labels are more readily available in practice. In this paper, we propose a Boundary Adaptive Refinement (BAR) framework that resorts to reinforcement learning (RL) to guide the process of progressively refining the temporal boundary. To the best of our knowledge, we offer the first attempt to extend RL to the temporal localization task with weak supervision. As it is non-trivial to obtain a straightforward reward function in the absence of pairwise granular boundary-query annotations, a cross-modal alignment evaluator is crafted to measure the alignment degree of each segment-query pair and provide tailor-designed rewards. This refinement scheme completely abandons the traditional sliding window based solution pattern and contributes to acquiring more efficient, boundary-flexible and content-aware grounding results. Extensive experiments on two public benchmarks, Charades-STA and ActivityNet, demonstrate that BAR outperforms the state-of-the-art weakly-supervised method and even beats some competitive fully-supervised ones.

Poet: Product-oriented Video Captioner for E-commerce

  • Shengyu Zhang
  • Ziqi Tan
  • Jin Yu
  • Zhou Zhao
  • Kun Kuang
  • Jie Liu
  • Jingren Zhou
  • Hongxia Yang
  • Fei Wu

In e-commerce, a growing number of user-generated videos are used for product promotion. Generating video descriptions that narrate the user-preferred product characteristics depicted in the video is vital for successful promotion. Traditional video captioning methods, which focus on routinely describing what exists and happens in a video, are not amenable to product-oriented video captioning. To address this problem, we propose a product-oriented video captioner framework, abbreviated as Poet. Poet first represents the videos as product-oriented spatial-temporal graphs. Then, based on the aspects of the video-associated product, we perform knowledge-enhanced spatial-temporal inference on those graphs for capturing the dynamic change of fine-grained product-part characteristics. The knowledge leveraging module in Poet differs from the traditional design by performing knowledge filtering and dynamic memory modeling. We show that Poet achieves consistent performance improvement over previous methods concerning generation quality, product aspects capturing, and lexical diversity. Experiments are performed on two product-oriented video captioning datasets, buyer-generated fashion video dataset (BFVD) and fan-generated fashion video dataset (FFVD), collected from Mobile Taobao. We will release the desensitized datasets to promote further investigations on both video captioning and general video analysis problems.

Text-Guided Neural Image Inpainting

  • Lisai Zhang
  • Qingcai Chen
  • Baotian Hu
  • Shuoran Jiang

The image inpainting task requires filling the corrupted image with content coherent with the context. This research field has achieved promising progress by using neural image inpainting methods. Nevertheless, there is still a critical challenge in guessing the missing content with only the context pixels. The goal of this paper is to fill the semantic information in corrupted images according to the provided descriptive text. Different from existing text-guided image generation works, the inpainting models are required to compare the semantic content of the given text and the remaining part of the image, then find out the semantic content that should be filled in for the missing part. To fulfill such a task, we propose a novel inpainting model named Text-Guided Dual Attention Inpainting Network (TDANet). Firstly, a dual multimodal attention mechanism is designed to extract the explicit semantic information about the corrupted regions, which is done by comparing the descriptive text and complementary image areas through reciprocal attention. Secondly, an image-text matching loss is applied to maximize the semantic similarity of the generated image and the text. Experiments are conducted on two open datasets. Results show that the proposed TDANet model reaches new state-of-the-art on both quantitative and qualitative measures. Result analysis suggests that the generated images are consistent with the guidance text, enabling the generation of various results by providing different descriptions. Codes are available at

Single-Shot Two-Pronged Detector with Rectified IoU Loss

  • Keyang Wang
  • Lei Zhang

In CNN-based object detectors, feature pyramids are widely exploited to alleviate the problem of scale variation across object instances. These object detectors, which strengthen features via a top-down pathway and lateral connections, mainly enrich the semantic information of low-level features but ignore the enhancement of high-level features. This can lead to an imbalance between different levels of features, in particular a serious lack of detailed information in the high-level features, which makes it difficult to get accurate bounding boxes. In this paper, we introduce a novel two-pronged transductive idea to explore the relationship among different layers in both backward and forward directions, which can enrich the semantic information of low-level features and detailed information of high-level features at the same time. Under the guidance of the two-pronged idea, we propose a Two-Pronged Network (TPNet) to achieve bidirectional transfer between high-level features and low-level features, which is useful for accurately detecting objects at different scales. Furthermore, due to the distribution imbalance between the hard and easy samples in single-stage detectors, the gradient of localization loss is always dominated by the hard examples that have poor localization accuracy. This biases the model toward the hard samples. So in our TPNet, an adaptive IoU based localization loss, named Rectified IoU (RIoU) loss, is proposed to rectify the gradients of each kind of sample. The Rectified IoU loss increases the gradients of examples with high IoU while suppressing the gradients of examples with low IoU, which can improve the overall localization accuracy of the model. Extensive experiments demonstrate the superiority of our TPNet and RIoU loss.
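
For readers unfamiliar with IoU-based localization losses, the sketch below computes the standard intersection-over-union of two axis-aligned boxes and the plain `1 - IoU` loss that such methods start from. It is a baseline illustration only: the abstract does not give the functional form of the Rectified IoU reweighting, so it is not reproduced here. The `(x1, y1, x2, y2)` box format and the function names are assumptions.

```python
def box_iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap, clamped at zero for disjoint boxes.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def iou_loss(pred_box, target_box):
    """Plain IoU loss: zero for a perfect match, one for no overlap."""
    return 1.0 - box_iou(pred_box, target_box)
```

For example, boxes `(0, 0, 2, 2)` and `(1, 1, 3, 3)` overlap in a unit square, giving an IoU of 1/7. A rectified variant would then rescale how strongly examples at different IoU levels contribute to the gradient.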

Dynamic Context-guided Capsule Network for Multimodal Machine Translation

  • Huan Lin
  • Fandong Meng
  • Jinsong Su
  • Yongjing Yin
  • Zhengyuan Yang
  • Yubin Ge
  • Jie Zhou
  • Jiebo Luo

Multimodal machine translation (MMT), which mainly focuses on enhancing text-only translation with visual features, has attracted considerable attention from both computer vision and natural language processing communities. Most current MMT models resort to the attention mechanism, global context modeling or multimodal joint representation learning to utilize visual features. However, the attention mechanism lacks sufficient semantic interactions between modalities while the other two provide fixed visual context, which is unsuitable for modeling the observed variability when generating translation. To address the above issues, in this paper, we propose a novel Dynamic Context-guided Capsule Network (DCCN) for MMT. Specifically, at each timestep of decoding, we first employ the conventional source-target attention to produce a timestep-specific source-side context vector. Next, DCCN takes this vector as input and uses it to guide the iterative extraction of related visual features via a context-guided dynamic routing mechanism. In particular, since we represent the input image with global and regional visual features, we introduce two parallel DCCNs to model multimodal context vectors with visual features at different granularities. Finally, we obtain two multimodal context vectors, which are fused and incorporated into the decoder for the prediction of the target word. Experimental results on the Multi30K dataset of English-to-German and English-to-French translation demonstrate the superiority of DCCN. Our code is available on

Differentiable Manifold Reconstruction for Point Cloud Denoising

  • Shitong Luo
  • Wei Hu

3D point clouds are often perturbed by noise due to the inherent limitations of acquisition equipment, which obstructs downstream tasks such as surface reconstruction and rendering. Previous works mostly infer the displacement of noisy points from the underlying surface; however, they are not designed to recover the surface explicitly and may lead to sub-optimal denoising results. To this end, we propose to learn the underlying manifold of a noisy point cloud from differentiably subsampled points with trivial noise perturbation and their embedded neighborhood feature, aiming to capture intrinsic structures in point clouds. Specifically, we present an autoencoder-like neural network. The encoder learns both local and non-local feature representations of each point, and then samples points with low noise via an adaptive differentiable pooling operation. Afterwards, the decoder infers the underlying manifold by transforming each sampled point along with the embedded feature of its neighborhood to a local surface centered around the point. By resampling on the reconstructed manifold, we obtain a denoised point cloud. Further, we design an unsupervised training loss, so that our network can be trained in either an unsupervised or supervised fashion. Experiments show that our method significantly outperforms state-of-the-art denoising methods under both synthetic noise and real world noise. The code and data are available at

SESSION: Oral Session H3: Vision and Language

BS-MCVR: Binary-sensing based Mobile-cloud Visual Recognition

  • Hongyi Zheng
  • Wangmeng Zuo
  • Lei Zhang

The mobile-cloud based visual recognition (MCVR) system, in which low-end mobile sensors are deployed to persistently collect and transmit visual data to the cloud for analysis and recognition, is important for visual monitoring applications such as wildfire detection and wildlife monitoring. However, current MCVR systems are mostly human-perception-oriented: they consume many computational resources and much energy for data sensing, as well as much bandwidth for data transmission, which limits their large-scale deployment. In this work, we present a machine-perception-oriented MCVR system, called BS-MCVR, where the mobile end is designed to efficiently sense highly compact and discriminative features directly from the scene, and the sensed features are analyzed on the cloud for recognition. Particularly, the mobile end is designed to operate with completely binary operations and generate fixed-point feature maps. Experiments on benchmark datasets show that our system only needs to transmit 1/200 the amount of original image data without much degradation in recognition accuracy, while it consumes minimal computational cost in the data sensing process. BS-MCVR provides a highly cost-effective solution for deploying MCVR systems at large scale.

Learning Modality-Invariant Latent Representations for Generalized Zero-shot Learning

  • Jingjing Li
  • Mengmeng Jing
  • Lei Zhu
  • Zhengming Ding
  • Ke Lu
  • Yang Yang

Recently, feature generating methods have been successfully applied to zero-shot learning (ZSL). However, most previous approaches only generate visual representations for zero-shot recognition. In fact, typical ZSL is a classic multi-modal learning protocol which consists of a visual space and a semantic space. In this paper, therefore, we present a new method which can simultaneously generate both visual representations and semantic representations so that the essential multi-modal information associated with unseen classes can be captured. Specifically, we address the most challenging issue in such a paradigm, i.e., how to handle the domain shift and thus guarantee that the learned representations are modality-invariant. To this end, we propose two strategies: 1) leveraging the mutual information between the latent visual representations and the semantic representations; 2) maximizing the entropy of the joint distribution of the two latent representations. By leveraging the two strategies, we argue that the two modalities can be well aligned. Finally, extensive experiments on five widely used datasets verify that the proposed method is able to significantly outperform the previous state-of-the-art methods.

Describe What to Change: A Text-guided Unsupervised Image-to-image Translation Approach

  • Yahui Liu
  • Marco De Nadai
  • Deng Cai
  • Huayang Li
  • Xavier Alameda-Pineda
  • Nicu Sebe
  • Bruno Lepri

Manipulating visual attributes of images through human-written text is a very challenging task. On the one hand, models have to learn the manipulation without the ground truth of the desired output. On the other hand, models have to deal with the inherent ambiguity of natural language. Previous research usually requires either the user to describe all the characteristics of the desired image or to use richly-annotated image captioning datasets. In this work, we propose a novel unsupervised approach, based on image-to-image translation, that alters the attributes of a given image through a command-like sentence such as "change the hair color to black". In contrast to state-of-the-art approaches, our model requires neither a human-annotated dataset nor a textual description of all the attributes of the desired image, but only of those that have to be modified. Our proposed model disentangles the image content from the visual attributes, and it learns to modify the latter using the textual description, before generating a new image from the content and the modified attribute representation. Because text might be inherently ambiguous (blond hair may refer to different shades of blond, e.g. golden, icy, sandy), our method generates multiple stochastic versions of the same translation. Experiments show that the proposed model achieves promising performances on two large-scale public datasets: CelebA and CUB. We believe our approach will pave the way to new avenues of research combining textual and speech commands with visual attributes.

INCLUDE: A Large Scale Dataset for Indian Sign Language Recognition

  • Advaith Sridhar
  • Rohith Gandhi Ganesan
  • Pratyush Kumar
  • Mitesh Khapra

Indian Sign Language (ISL) is a complete language with its own grammar, syntax, vocabulary and several unique linguistic attributes. It is used by over 5 million deaf people in India. Currently, there is no publicly available dataset on ISL to evaluate Sign Language Recognition (SLR) approaches. In this work, we present the Indian Lexicon Sign Language Dataset - INCLUDE - an ISL dataset that contains 0.27 million frames across 4,287 videos over 263 word signs from 15 different word categories. INCLUDE is recorded with the help of experienced signers to provide close resemblance to natural conditions. A subset of 50 word signs is chosen across word categories to define INCLUDE-50 for rapid evaluation of SLR methods with hyperparameter tuning. As the first large scale study of SLR on ISL, we evaluate several deep neural networks combining different methods for augmentation, feature extraction, encoding and decoding. The best performing model achieves an accuracy of 94.5% on the INCLUDE-50 dataset and 85.6% on the INCLUDE dataset. This model uses a pre-trained feature extractor and encoder and only trains a decoder. We further explore generalisation by fine-tuning the decoder for an American Sign Language dataset. On the ASLLVD with 48 classes, our model has an accuracy of 92.1%; improving on existing results and providing an efficient method to support SLR for multiple languages.

Amora: Black-box Adversarial Morphing Attack

  • Run Wang
  • Felix Juefei-Xu
  • Qing Guo
  • Yihao Huang
  • Xiaofei Xie
  • Lei Ma
  • Yang Liu

Nowadays, digital facial content manipulation has become ubiquitous and realistic with the success of generative adversarial networks (GANs), making face recognition (FR) systems suffer from unprecedented security concerns. In this paper, we investigate and introduce a new type of adversarial attack to evade FR systems by manipulating facial content, called adversarial morphing attack (a.k.a. Amora). In contrast to adversarial noise attacks, which perturb pixel intensity values by adding human-imperceptible noise, our proposed adversarial morphing attack works at the semantic level and perturbs pixels spatially in a coherent manner. To tackle the black-box attack problem, we devise a simple yet effective joint dictionary learning pipeline to obtain a proprietary optical flow field for each attack. Our extensive evaluation on two popular FR systems demonstrates the effectiveness of our adversarial morphing attack at various levels of morphing intensity with smiling facial expression manipulations. Both open-set and closed-set experimental results indicate that a novel black-box adversarial attack based on local deformation is possible, and is vastly different from additive noise attacks. The findings of this work potentially pave a new research direction towards a more thorough understanding and investigation of image-based adversarial attacks and defenses.

Visual Relation of Interest Detection

  • Fan Yu
  • Haonan Wang
  • Tongwei Ren
  • Jinhui Tang
  • Gangshan Wu

In this paper, we propose a novel Visual Relation of Interest Detection (VROID) task, which aims to detect visual relations that are important for conveying the main content of an image. It is motivated by the intuition that not all correctly detected relations are really "interesting" in semantics and only a fraction of them really make sense for representing the main content of the image. Such relations are named Visual Relations of Interest (VROIs). VROID can be deemed an evolution of the traditional Visual Relation Detection (VRD) task, which tries to discover all visual relations in an image. We construct a new dataset to facilitate research on this new task, named ViROI, which contains 30,120 images, each with VROIs annotated. Furthermore, we develop an Interest Propagation Network (IPNet) to solve VROID. IPNet contains a Panoptic Object Detection (POD) module, a Pair Interest Prediction (PaIP) module and a Predicate Interest Prediction (PrIP) module. The POD module extracts instances from the input image and also generates corresponding instance features and union features. The PaIP module then predicts the interest score of each instance pair while the PrIP module predicts that of each predicate for each instance pair. Then the interest scores of instance pairs are combined with those of the corresponding predicates as the final interest scores. All VROI candidates are sorted by final interest scores and the highest ones are taken as final results. We conduct extensive experiments to test the effectiveness of our method, and the results show that IPNet achieves the best performance compared with the baselines on visual relation detection, scene graph generation and image captioning.
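
The final scoring step described above (combine pair and predicate interest, sort, keep the highest) can be sketched as follows. This is an illustration only, not IPNet itself: the abstract does not say how the two scores are combined, so the product used here is an assumption, and `rank_relations` and its dictionary inputs are hypothetical names.

```python
def rank_relations(pair_interest, predicate_interest, top_k=3):
    """Rank (subject, predicate, object) candidates by a combined interest score.

    pair_interest:      {(subject, object): score in [0, 1]}
    predicate_interest: {(subject, object): {predicate: score in [0, 1]}}
    The product of the two scores is an assumed combination rule.
    """
    candidates = []
    for pair, pair_score in pair_interest.items():
        subject, obj = pair
        for predicate, pred_score in predicate_interest.get(pair, {}).items():
            candidates.append((pair_score * pred_score, subject, predicate, obj))
    # Highest combined interest first; keep only the top_k candidates.
    candidates.sort(key=lambda c: c[0], reverse=True)
    return candidates[:top_k]


pairs = {("person", "horse"): 0.9, ("tree", "sky"): 0.2}
predicates = {
    ("person", "horse"): {"ride": 0.8, "near": 0.1},
    ("tree", "sky"): {"under": 0.9},
}
top = rank_relations(pairs, predicates, top_k=2)
```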

SESSION: Poster Session A1: Deep Learning for Multimedia

University-1652: A Multi-view Multi-source Benchmark for Drone-based Geo-localization

  • Zhedong Zheng
  • Yunchao Wei
  • Yi Yang

We consider the problem of cross-view geo-localization. The primary challenge is to learn features that are robust to large viewpoint changes. Existing benchmarks can help, but are limited in the number of viewpoints. Image pairs, containing two viewpoints, e.g., satellite and ground, are usually provided, which may compromise the feature learning. Besides phone cameras and satellites, in this paper, we argue that drones could serve as the third platform to deal with the geo-localization problem. In contrast to traditional ground-view images, drone-view images meet fewer obstacles, e.g., trees, and provide a comprehensive view when flying around the target place. To verify the effectiveness of the drone platform, we introduce a new multi-view multi-source benchmark for drone-based geo-localization, named University-1652. University-1652 contains data from three platforms, i.e., synthetic drones, satellites and ground cameras of 1,652 university buildings around the world. To our knowledge, University-1652 is the first drone-based geo-localization dataset and enables two new tasks, i.e., drone-view target localization and drone navigation. As the name implies, drone-view target localization intends to predict the location of the target place via drone-view images. On the other hand, given a satellite-view query image, drone navigation is to drive the drone to the area of interest in the query. We use this dataset to analyze a variety of off-the-shelf CNN features and propose a strong CNN baseline on this challenging dataset. The experiments show that University-1652 helps the model to learn viewpoint-invariant features and also has good generalization ability in real-world scenarios.

DIPDefend: Deep Image Prior Driven Defense against Adversarial Examples

  • Tao Dai
  • Yan Feng
  • Dongxian Wu
  • Bin Chen
  • Jian Lu
  • Yong Jiang
  • Shu-Tao Xia

Deep neural networks (DNNs) have shown serious vulnerability to adversarial examples with imperceptible perturbation to clean images. Most existing input-transformation based defense methods (e.g., ComDefend) rely heavily on the learned external priors from an external large training dataset, while neglecting the rich image internal priors of the input itself, thus limiting the generalization of the defense models against the adversarial examples with biased image statistics from the external training dataset. Motivated by deep image prior that can capture rich image statistics from a single image, we propose an effective Deep Image Prior Driven Defense (DIPDefend) method against adversarial examples. With a DIP generator to fit the target/adversarial input, we find that our image reconstruction exhibits a quite interesting learning preference from a feature learning perspective, i.e., the early stage primarily learns the robust features resistant to adversarial perturbation, followed by learning non-robust features that are sensitive to adversarial perturbation. Besides, we develop an adaptive stopping strategy that adapts our method to diverse images. In this way, the proposed model obtains a unique defender for each individual adversarial input, thus being robust to various attackers. Experimental results demonstrate the superiority of our method over the state-of-the-art defense methods against white-box and black-box adversarial attacks.

TRIE: End-to-End Text Reading and Information Extraction for Document Understanding

  • Peng Zhang
  • Yunlu Xu
  • Zhanzhan Cheng
  • Shiliang Pu
  • Jing Lu
  • Liang Qiao
  • Yi Niu
  • Fei Wu

Since real-world ubiquitous documents (e.g., invoices, tickets, resumes and leaflets) contain rich information, automatic document image understanding has become a hot topic. Most existing works decouple the problem into two separate tasks, (1) text reading for detecting and recognizing texts in images and (2) information extraction for analyzing and extracting key elements from previously extracted plain text. However, they mainly focus on improving the information extraction task, while neglecting the fact that text reading and information extraction are mutually correlated. In this paper, we propose a unified end-to-end text reading and information extraction network, where the two tasks can reinforce each other. Specifically, the multimodal visual and textual features of text reading are fused for information extraction and in turn, the semantics in information extraction contribute to the optimization of text reading. On three real-world datasets with diverse document images (from fixed layout to variable layout, from structured text to semi-structured text), our proposed method significantly outperforms the state-of-the-art methods in both efficiency and accuracy.

Adversarial Privacy-preserving Filter

  • Jiaming Zhang
  • Jitao Sang
  • Xian Zhao
  • Xiaowen Huang
  • Yanfeng Sun
  • Yongli Hu

While widely adopted in practical applications, face recognition has been critically discussed regarding the malicious use of face images and the potential privacy problems, e.g., deceiving payment systems and causing personal sabotage. Online photo sharing services unintentionally act as the main repository for malicious crawlers and face recognition applications. This work aims to develop a privacy-preserving solution, called Adversarial Privacy-preserving Filter (APF), to protect the online shared face images from being maliciously used. We propose an end-cloud collaborated adversarial attack solution to satisfy requirements of privacy, utility and non-accessibility. Specifically, the solution consists of three modules: (1) image-specific gradient generation, to extract image-specific gradient in the user end with a compressed probe model; (2) adversarial gradient transfer, to fine-tune the image-specific gradient in the server cloud; and (3) universal adversarial perturbation enhancement, to append image-independent perturbation to derive the final adversarial noise. Extensive experiments on three datasets validate the effectiveness and efficiency of the proposed solution. A prototype application is also released for further evaluation. We hope the end-cloud collaborated attack framework could shed light on addressing privacy preservation in online multimedia sharing from the user side.

Mix Dimension in Poincaré Geometry for 3D Skeleton-based Action Recognition

  • Wei Peng
  • Jingang Shi
  • Zhaoqiang Xia
  • Guoying Zhao

Graph Convolutional Networks (GCNs) have already demonstrated their powerful ability to model irregular data, e.g., skeletal data in human action recognition, providing an exciting new way to fuse rich structural information for nodes residing in different parts of a graph. In human action recognition, current works introduce a dynamic graph generation mechanism to better capture the underlying semantic skeleton connections and thus improve the performance. In this paper, we provide an orthogonal way to explore the underlying connections. Instead of introducing an expensive dynamic graph generation paradigm, we build a more efficient GCN on a Riemann manifold, which we think is a more suitable space to model the graph data, to make the extracted representations fit the embedding matrix. Specifically, we present a novel spatial-temporal GCN (ST-GCN) architecture which is defined via the Poincaré geometry such that it is able to better model the latent anatomy of the structure data. To further explore the optimal projection dimension in the Riemann space, we mix different dimensions on the manifold and provide an efficient way to explore the dimension for each ST-GCN layer. With the resulting architecture, we evaluate our method on the two largest-scale 3D datasets currently available, i.e., NTU RGB+D and NTU RGB+D 120. The comparison results show that the model achieves superior performance under every given evaluation metric with only 40% of the model size of the previous best GCN method, which demonstrates the effectiveness of our model.
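
For readers unfamiliar with Poincaré geometry, the standard distance between two points of the open unit ball can be sketched as below. This is only the textbook Poincaré-ball metric, not the authors' ST-GCN layer; the function name `poincare_distance` is hypothetical.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points inside the open unit (Poincare) ball.

    d(u, v) = arcosh(1 + 2 * |u - v|^2 / ((1 - |u|^2) * (1 - |v|^2)))
    """
    sq_u = sum(x * x for x in u)
    sq_v = sum(x * x for x in v)
    if sq_u >= 1.0 or sq_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_u) * (1.0 - sq_v)))


# Distances grow quickly near the boundary, which is why hyperbolic space
# is often considered a good fit for tree-like or hierarchical structures.
near_origin = poincare_distance([0.0, 0.0], [0.5, 0.0])
near_boundary = poincare_distance([0.45, 0.0], [0.95, 0.0])
```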

Dynamic Extension Nets for Few-shot Semantic Segmentation

  • Lizhao Liu
  • Junyi Cao
  • Minqian Liu
  • Yong Guo
  • Qi Chen
  • Mingkui Tan

Semantic segmentation requires a large amount of densely annotated data for training and may generalize poorly to novel categories. In real-world applications, we have an urgent need for few-shot semantic segmentation which aims to empower a model to handle unseen object categories with limited data. This task is non-trivial due to several challenges. First, it is difficult to extract the class-relevant information to handle the novel class as only a few samples are available. Second, since the image content can be very complex, the novel class information may be suppressed by the base categories due to limited data. Third, one may easily learn promising base classifiers based on a large amount of training data, but it is non-trivial to exploit the knowledge to train the novel classifiers. More critically, once a novel classifier is built, the output probability space will change. How to maintain the base classifiers and dynamically include the novel classifiers remains an open question. To address the above issues, we propose a Dynamic Extension Network (DENet) in which we dynamically construct and maintain a classifier for the novel class by leveraging the knowledge from the base classes and the information from novel data. More importantly, to overcome the information suppression issue, we design a Guided Attention Module (GAM), which can be plugged into any framework to help learn class-relevant features. Last, rather than directly training the model with limited data, we propose a dynamic extension training algorithm to predict the weights of novel classifiers, which is able to exploit the knowledge of base classifiers by dynamically extending classes during training. Extensive experiments show that our proposed method achieves state-of-the-art performance on the PASCAL-5i and COCO-20i datasets. The source code is available at

Fast Enhancement for Non-Uniform Illumination Images using Light-weight CNNs

  • Feifan Lv
  • Bo Liu
  • Feng Lu

This paper proposes a new light-weight convolutional neural network (~5k params) for non-uniform illumination image enhancement to handle color, exposure, contrast, noise and artifacts, etc., simultaneously and effectively. More concretely, the input image is first enhanced using Retinex model from dual different aspects (enhancing under-exposure and suppressing over-exposure), respectively. Then, these two enhanced results and the original image are fused to obtain an image with satisfactory brightness, contrast and details. Finally, the extra noise and compression artifacts are removed to get the final result. To train this network, we propose a semi-supervised retouching solution and construct a new dataset (~82k images) that contains various scenes and light conditions. Our model can enhance 0.5 mega-pixel (like 600×800) images in real-time (~50 fps), which is faster than existing enhancement methods. Extensive experiments show that our solution is fast and effective to deal with non-uniform illumination images.

Animating Through Warping: An Efficient Method for High-Quality Facial Expression Animation

  • Zili Yi
  • Qiang Tang
  • Vishnu Sanjay Ramiya Srinivasan
  • Zhan Xu

Advances in deep neural networks have considerably improved the art of animating a still image without operating in the 3D domain. However, prior methods can only animate small images (typically no larger than 512x512) due to memory limitations, difficulty of training and lack of high-resolution (HD) training datasets, which significantly reduces their potential for applications in movie production and interactive systems. Motivated by the idea that HD images can be generated by adding high-frequency residuals to low-resolution results produced by a neural network, we propose a novel framework known as Animating Through Warping (ATW) to enable efficient animation of HD images.

Specifically, the proposed framework consists of two modules: a novel two-stage neural-network generator and a novel post-processing module known as ResWarp. It only requires the generator to be trained on small images and can do inference on an image of any size. During inference, an HD input image is decomposed into a low-resolution component (128x128) and its corresponding high-frequency residuals. The generator predicts the low-resolution result as well as the motion field that warps the input face to the desired status (e.g., expression categories or action units). Finally, the ResWarp module warps the residuals based on the motion field and adds the warped residuals to the naively up-sampled low-resolution results to generate the final HD results. Experiments show the effectiveness and efficiency of our method in generating high-resolution animations. Our proposed framework successfully animates a 4K facial image, which has never been achieved by prior neural models. In addition, our method generally guarantees the temporal coherency of the generated animations. Source codes will be made publicly available.
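
The decompose, warp and recombine idea can be illustrated with a toy sketch. Everything here is an assumption made for illustration: images are plain lists of rows, down-sampling is block averaging, the "motion field" is a uniform horizontal shift with wrap-around, and the generator is replaced by shifting the low-resolution image. All function names are hypothetical; this is not the authors' implementation.

```python
def downsample(img, s):
    """Block-average an H x W image (list of rows) by an integer factor s."""
    h, w = len(img), len(img[0])
    return [[sum(img[y * s + dy][x * s + dx] for dy in range(s) for dx in range(s)) / (s * s)
             for x in range(w // s)] for y in range(h // s)]


def upsample(img, s):
    """Naive nearest-neighbour up-sampling by an integer factor s."""
    return [[img[y // s][x // s] for x in range(len(img[0]) * s)]
            for y in range(len(img) * s)]


def shift_right(img, dx):
    """Toy motion field: move every pixel dx columns to the right (wraps around)."""
    w = len(img[0])
    return [[row[(x - dx) % w] for x in range(w)] for row in img]


def split_frequencies(hd, s):
    """Decompose an HD image into a low-resolution part and high-frequency residuals."""
    low = downsample(hd, s)
    up = upsample(low, s)
    residual = [[hd[y][x] - up[y][x] for x in range(len(hd[0]))] for y in range(len(hd))]
    return low, residual


def res_warp(low_result, residual, dx_hd, s):
    """Warp the residuals with the motion and add them to the up-sampled low-res result."""
    up = upsample(low_result, s)
    warped = shift_right(residual, dx_hd)
    return [[up[y][x] + warped[y][x] for x in range(len(up[0]))] for y in range(len(up))]


hd = [[(3 * y + 5 * x) % 7 for x in range(8)] for y in range(4)]
low, residual = split_frequencies(hd, 2)
# Stand-in for the generator: the low-res result is the low-res input moved by
# one low-res pixel, i.e. two HD pixels.
animated = res_warp(shift_right(low, 1), residual, dx_hd=2, s=2)
```

In this toy setting the output coincides with the HD input shifted by two pixels, which shows why the generator itself never needs to see the full-resolution image.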

Exploiting Better Feature Aggregation for Video Object Detection

  • Liang Han
  • Pichao Wang
  • Zhaozheng Yin
  • Fan Wang
  • Hao Li

Video object detection (VOD) has been a rising topic in recent years due to challenges such as occlusion and motion blur. To deal with these challenges, feature aggregation from local or global support frames has been shown to be effective. To exploit better feature aggregation, in this paper, we propose two improvements over previous works: a class-constrained spatial-temporal relation network and a correlation-based feature alignment module. The class-constrained spatial-temporal relation network operates on object region proposals and learns two kinds of relations: (1) the dependencies among region proposals of the same object class from support frames sampled in a long time range or even the whole sequence, and (2) spatial relations among proposals of different objects in the target frame. The homogeneity constraint in the spatial-temporal relation network not only filters out many defective proposals but also implicitly embeds the traditional post-processing strategies (e.g., Seq-NMS), leading to a unified end-to-end trainable network. In the feature alignment module, we propose a correlation-based feature alignment method to align the support and target frames for feature aggregation in the temporal domain. Our experiments show that the proposed method improves the accuracy of single-frame detectors significantly, and outperforms previous temporal or spatial relation networks. Without bells or whistles, the proposed method achieves state-of-the-art performance on the ImageNet VID dataset (84.80% with ResNet-101) without any post-processing methods.

NuI-Go: Recursive Non-Local Encoder-Decoder Network for Retinal Image Non-Uniform Illumination Removal

  • Chongyi Li
  • Huazhu Fu
  • Runmin Cong
  • Zechao Li
  • Qianqian Xu

Retinal images have been widely used by clinicians for early diagnosis of ocular diseases. However, the quality of retinal images is often clinically unsatisfactory due to eye lesions and imperfect imaging process. One of the most challenging quality degradation issues in retinal images is non-uniform illumination, which hinders the pathological information and further impairs the diagnosis of ophthalmologists and computer-aided analysis. To address this issue, we propose a non-uniform illumination removal network for retinal images, called NuI-Go, which consists of three Recursive Non-local Encoder-Decoder Residual Blocks (NEDRBs) for enhancing the degraded retinal images in a progressive manner. Each NEDRB contains a feature encoder module that captures the hierarchical feature representations, a non-local context module that models the context information, and a feature decoder module that recovers the details and spatial dimension. Additionally, the symmetric skip-connections between the encoder module and the decoder module provide long-range information compensation and reuse. Extensive experiments demonstrate that the proposed method can effectively remove the non-uniform illumination on retinal images while well preserving the image details and color. We further demonstrate the advantages of the proposed method for improving the accuracy of retinal vessel segmentation.

Online Filtering Training Samples for Robust Visual Tracking

  • Jie Zhao
  • Kenan Dai
  • Dong Wang
  • Huchuan Lu
  • Xiaoyun Yang

In recent years, discriminative trackers have shown great tracking performance, mainly due to online updating with samples collected during tracking. After updating, the model can adapt well to appearance changes of objects and the background. But these trackers have a serious disadvantage: wrong samples may cause severe model degradation. Most of the training samples in the tracking phase are obtained according to the tracking result of the current frame. Wrong training samples will be collected when the tracking result is inaccurate, seriously affecting the discrimination ability of the model. Besides, partial occlusion also leads to the same problem. In this paper, we propose an optimization module named MetricNet for online filtering of training samples. It applies a matching network containing the classification and distance branches, and uses multiple metric methods for different types of samples. MetricNet optimizes the training sample set by recognizing wrong and redundant samples, thereby improving the tracking performance. The proposed MetricNet can be regarded as an independent optimization module and integrated into all discriminative trackers updated online. Extensive experiments on three tracking datasets show its effectiveness and generalization ability. After applying MetricNet to MDNet, the tracking result improves by 5.3% in terms of the success plot on the LaSOT dataset. Our project is available at

Boosting Continuous Sign Language Recognition via Cross Modality Augmentation

  • Junfu Pu
  • Wengang Zhou
  • Hezhen Hu
  • Houqiang Li

Continuous sign language recognition (SLR) deals with unaligned video-text pairs and uses the word error rate (WER), i.e., edit distance, as the main evaluation metric. Since it is not differentiable, we usually instead optimize the learning model with the connectionist temporal classification (CTC) objective loss, which maximizes the posterior probability over the sequential alignment. Due to the optimization gap, the predicted sentence with the highest decoding probability may not be the best choice under the WER metric. To tackle this issue, we propose a novel architecture with cross modality augmentation. Specifically, we first augment cross-modal data by simulating the calculation procedure of WER, i.e., substitution, deletion and insertion on both text label and its corresponding video. With these real and generated pseudo video-text pairs, we propose multiple loss terms to minimize the cross modality distance between the video and ground truth label, and make the network distinguish the difference between real and pseudo modalities. The proposed framework can be easily extended to other existing CTC based continuous SLR architectures. Extensive experiments on two continuous SLR benchmarks, i.e., RWTH-PHOENIX-Weather and CSL, validate the effectiveness of our proposed method.
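
The WER metric mentioned above can be sketched as a word-level edit distance (substitutions, deletions and insertions) divided by the reference length. This is a generic illustration of the metric rather than the authors' evaluation script; the function names and the example gloss sequences are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))  # distance from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting the first i reference tokens
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate of a hypothesis sentence against a reference sentence."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hyp) / len(ref)


# One substitution (RAIN -> SNOW) in a three-gloss reference.
score = wer("TOMORROW RAIN NORTH", "TOMORROW SNOW NORTH")
```

Because the minimum over discrete edit operations has no useful gradient, a quantity like this cannot be optimized directly, which is the gap the abstract refers to.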

ThumbNet: One Thumbnail Image Contains All You Need for Recognition

  • Chen Zhao
  • Bernard Ghanem

Although deep convolutional neural networks (CNNs) have achieved great success in computer vision tasks, their real-world application is still impeded by their voracious demand for computational resources. Current works mostly seek to compress the network by reducing its parameters or parameter-incurred computation, neglecting the influence of the input image on the system complexity. Based on the fact that input images of a CNN contain substantial redundancy, in this paper, we propose a unified framework, dubbed ThumbNet, to simultaneously accelerate and compress CNN models by enabling them to infer on one thumbnail image. We provide three effective strategies to train ThumbNet. In doing so, ThumbNet learns an inference network that performs equally well on small images as the original-input network on large images. With ThumbNet, not only do we obtain the thumbnail-input inference network that can drastically reduce computation and memory requirements, but we also obtain an image downscaler that can generate thumbnail images for generic classification tasks. Extensive experiments show the effectiveness of ThumbNet, and demonstrate that the thumbnail-input inference network learned by ThumbNet can adequately retain the accuracy of the original-input network even when the input images are downscaled 16 times.

Dual Temporal Memory Network for Efficient Video Object Segmentation

  • Kaihua Zhang
  • Long Wang
  • Dong Liu
  • Bo Liu
  • Qingshan Liu
  • Zhu Li

Video Object Segmentation (VOS) is typically formulated in a semi-supervised setting. Given the ground-truth segmentation mask on the first frame, the task of VOS is to track and segment the single or multiple objects of interest in the remaining frames of the video at the pixel level. One of the fundamental challenges in VOS is how to make the most use of the temporal information to boost the performance. We present an end-to-end network which stores short- and long-term video sequence information preceding the current frame as the temporal memories to address the temporal modeling in VOS. Our network consists of two temporal sub-networks including a short-term memory sub-network and a long-term memory sub-network. The short-term memory sub-network models the fine-grained spatial-temporal interactions between local regions across neighboring frames in video via a graph-based learning framework, which can well preserve the visual consistency of local regions over time. The long-term memory sub-network models the long-range evolution of objects via a Simplified-Gated Recurrent Unit (S-GRU), making the segmentation robust against occlusions and drift errors. In our experiments, we show that our proposed method achieves a favorable and competitive performance on three frequently-used VOS datasets, including DAVIS 2016, DAVIS 2017 and Youtube-VOS in terms of both speed and accuracy.

SESSION: Poster Session B1: Deep Learning for Multimedia

Cooperative Bi-path Metric for Few-shot Learning

  • Zeyuan Wang
  • Yifan Zhao
  • Jia Li
  • Yonghong Tian

Given base classes with sufficient labeled samples, the target of few-shot classification is to recognize unlabeled samples of novel classes with only a few labeled samples. Most existing methods only pay attention to the relationship between labeled and unlabeled samples of novel classes, which do not make full use of information within base classes. In this paper, we make two contributions to investigate the few-shot classification problem. First, we report a simple and effective baseline trained on base classes in the way of traditional supervised learning, which can achieve comparable results to the state of the art. Second, based on the baseline, we propose a cooperative bi-path metric for classification, which leverages the correlations between base classes and novel classes to further improve the accuracy. Experiments on two widely used benchmarks show that our method is a simple and effective framework, and a new state of the art is established in the few-shot classification field.

From Design Draft to Real Attire: Unaligned Fashion Image Translation

  • Yu Han
  • Shuai Yang
  • Wenjing Wang
  • Jiaying Liu

Fashion manipulation has attracted growing interest due to its great application value, which has inspired much research on fashion images. However, little attention has been paid to fashion design drafts. In this paper, we study a new unaligned translation problem between design drafts and real fashion items, whose main challenge lies in the huge misalignment between the two modalities. We first collect paired design drafts and real fashion item images without pixel-wise alignment. To solve the misalignment problem, our main idea is to train a sampling network to adaptively adjust the input to an intermediate state with structure alignment to the output. Moreover, built upon the sampling network, we present a design draft to real fashion item translation network (D2RNet), where two separate translation streams that focus on texture and shape, respectively, are combined tactfully to get both benefits. D2RNet is able to generate realistic garments with both texture and shape consistency to their design drafts. We show that this idea can be effectively applied to the reverse translation problem and present R2DNet accordingly. Extensive experiments on unaligned fashion design translation demonstrate the superiority of our method over state-of-the-art methods. Our project website is available at:

Siamese Attentive Graph Tracking

  • Fei Zhao
  • Ting Zhang
  • Chao Ma
  • Ming Tang
  • Jinqiao Wang
  • Xiaobo Wang

Recently, deep Siamese matching networks have attracted increasing attention for visual tracking. Despite the demonstrated successes, Siamese trackers do not take full advantage of the structural information of target objects. They tend to drift in the presence of non-rigid deformation or partial occlusion. In this paper, we propose to advance Siamese trackers with graph convolutional networks, which pay more attention to the structural layout of target objects, to learn features robust to large appearance changes over time. Specifically, we divide the target object into several sub-parts and design an attentive graph convolutional network to model the relationship between parts. We incrementally update the attention coefficients of the graph with the attention scheme at each frame in an end-to-end manner. To further improve localization accuracy, we propose a learnable cascade regression algorithm based on deep reinforcement learning to refine the predicted bounding boxes. Extensive experiments on seven challenging benchmark datasets, i.e., OTB-100, TC-128, VOT2018, VOT2019, TrackingNet, GOT-10k and LaSOT, demonstrate that the proposed tracking method performs favorably against state-of-the-art approaches.

HiFaceGAN: Face Renovation via Collaborative Suppression and Replenishment

  • Lingbo Yang
  • Shanshe Wang
  • Siwei Ma
  • Wen Gao
  • Chang Liu
  • Pan Wang
  • Peiran Ren

Existing face restoration research typically relies on either the image degradation prior or explicit guidance labels for training, which often leads to limited generalization ability over real-world images with heterogeneous degradation and rich background contents. In this paper, we investigate a more challenging and practical "dual-blind" version of the problem by lifting the requirements on both types of prior, termed "Face Renovation" (FR). Specifically, we formulate FR as a semantic-guided generation problem and tackle it with a collaborative suppression and replenishment (CSR) approach. This leads to HiFaceGAN, a multi-stage framework containing several nested CSR units that progressively replenish facial details based on the hierarchical semantic guidance extracted from the front-end content-adaptive suppression modules. Extensive experiments on both synthetic and real face images have verified the superior performance of our HiFaceGAN over a wide range of challenging restoration subtasks, demonstrating its versatility, robustness and generalization ability towards real-world face processing applications. Code is available at

Discernible Image Compression

  • Zhaohui Yang
  • Yunhe Wang
  • Chang Xu
  • Peng Du
  • Chao Xu
  • Chunjing Xu
  • Qi Tian

Image compression, as one of the fundamental low-level image processing tasks, is essential for computer vision. Tremendous computing and storage resources can be saved with a trivial amount of visual information. Conventional image compression methods tend to obtain compressed images by minimizing their appearance discrepancy with the corresponding original images, but pay little attention to their efficacy in downstream perception tasks, e.g., image recognition and object detection. Thus, some of the compressed images could be recognized with bias. In contrast, this paper aims to produce compressed images by pursuing both appearance and perceptual consistency. Based on the encoder-decoder framework, we propose using a pre-trained CNN to extract features of the original and compressed images, and making them similar. Thus the compressed images are discernible to subsequent tasks, and we name our method Discernible Image Compression (DIC). In addition, the maximum mean discrepancy (MMD) is employed to minimize the difference between feature distributions. The resulting compression network can generate images with high image quality and preserve the consistent perception in the feature domain, so that these images can be well recognized by pre-trained machine learning models. Experiments on benchmarks demonstrate that images compressed by using the proposed method can also be well recognized by subsequent visual recognition and detection models. For instance, the mAP on images compressed by DIC is about 0.6% higher than that on images compressed by conventional methods.
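
As a rough illustration of the MMD term, the sketch below computes a biased estimate of the squared maximum mean discrepancy between two small sets of feature vectors. The Gaussian (RBF) kernel, the bandwidth and the biased estimator are assumptions chosen for brevity; the abstract does not state which kernel or estimator DIC uses, and the function names are hypothetical.

```python
import math


def rbf_kernel(a, b, bandwidth=1.0):
    """Gaussian kernel between two feature vectors (assumed kernel choice)."""
    sq_dist = sum((p - q) ** 2 for p, q in zip(a, b))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased estimate of squared MMD between two samples of feature vectors.

    MMD^2 = E[k(x, x')] + E[k(y, y')] - 2 * E[k(x, y)]
    """
    def mean_kernel(us, vs):
        total = sum(rbf_kernel(u, v, bandwidth) for u in us for v in vs)
        return total / (len(us) * len(vs))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


# Hypothetical features of original images and of two sets of compressed images.
original = [[0.0, 0.0], [1.0, 0.0]]
close_features = [[0.1, 0.0], [1.1, 0.0]]
far_features = [[5.0, 5.0], [6.0, 5.0]]
small_gap = mmd_squared(original, close_features)
large_gap = mmd_squared(original, far_features)
```

A training loss built on such a term would push the feature distribution of compressed images toward that of the originals; here the value is smaller for the set whose features stay close to the originals.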

Forest R-CNN: Large-Vocabulary Long-Tailed Object Detection and Instance Segmentation

  • Jialian Wu
  • Liangchen Song
  • Tiancai Wang
  • Qian Zhang
  • Junsong Yuan

Despite the previous success of object analysis, detecting and segmenting a large number of object categories with a long-tailed data distribution remains a challenging problem and is less investigated. For a large-vocabulary classifier, the chance of obtaining noisy logits is much higher, which can easily lead to a wrong recognition. In this paper, we exploit prior knowledge of the relations among object categories to cluster fine-grained classes into coarser parent classes, and construct a classification tree that is responsible for parsing an object instance into a fine-grained category via its parent class. In the classification tree, as the number of parent class nodes is significantly smaller, their logits are less noisy and can be utilized to suppress the wrong/noisy logits existing in the fine-grained class nodes. As the way to construct the parent class is not unique, we further build multiple trees to form a classification forest where each tree contributes its vote to the fine-grained classification. To alleviate the imbalanced learning caused by the long-tail phenomenon, we propose a simple yet effective resampling method, NMS Resampling, to re-balance the data distribution. Our method, termed Forest R-CNN, can serve as a plug-and-play module that can be applied to most object recognition models for recognizing more than 1000 categories. Extensive experiments are performed on the large vocabulary dataset LVIS. Compared with the Mask R-CNN baseline, Forest R-CNN significantly boosts the performance with 11.5% and 3.9% AP improvements on the rare categories and overall categories, respectively. Moreover, we achieve state-of-the-art results on the LVIS dataset. Code is available at
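
The idea of using less noisy parent-class logits to suppress noisy fine-grained logits, and then letting several trees vote, can be sketched as follows. This is only an illustration under stated assumptions: scaling each fine-grained probability by its parent's probability and averaging across trees are hypothetical choices, and the names `tree_scores` and `forest_scores` do not come from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def tree_scores(fine_logits, parent_logits, parent_of):
    """Scale each fine-grained probability by its parent's probability.

    parent_of[i] is the index of the parent class of fine class i.
    """
    fine_probs = softmax(fine_logits)
    parent_probs = softmax(parent_logits)
    return [p * parent_probs[parent_of[i]] for i, p in enumerate(fine_probs)]


def forest_scores(fine_logits, trees):
    """Average the per-tree scores; trees is a list of (parent_logits, parent_of)."""
    per_tree = [tree_scores(fine_logits, pl, po) for pl, po in trees]
    return [sum(s[i] for s in per_tree) / len(per_tree)
            for i in range(len(fine_logits))]


# Fine class 1 has a slightly higher (noisy) logit, but its parent is unlikely.
scores = tree_scores([2.0, 2.2], [3.0, 0.0], [0, 1])
print(scores.index(max(scores)))
```

In this toy case the confident parent prediction overrides the small, noisy advantage of the wrong fine-grained class.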

Adv-watermark: A Novel Watermark Perturbation for Adversarial Examples

  • Xiaojun Jia
  • Xingxing Wei
  • Xiaochun Cao
  • Xiaoguang Han

Recent research has demonstrated that adding some imperceptible perturbations to original images can fool deep learning models. However, current adversarial perturbations are usually shown in the form of noise, and thus have no practical meaning. Image watermarking is a technique widely used for copyright protection. We can regard an image watermark as a kind of meaningful noise: adding it to the original image does not affect people's understanding of the image content and does not arouse suspicion. Therefore, it is interesting to generate adversarial examples using watermarks. In this paper, we propose a novel watermark perturbation for adversarial examples (Adv-watermark), which combines image watermarking techniques and adversarial example algorithms. Adding a meaningful watermark to clean images can attack DNN models. Specifically, we propose a novel optimization algorithm, called Basin Hopping Evolution (BHE), to generate adversarial watermarks in the black-box attack mode. Thanks to BHE, Adv-watermark requires only a few queries to the threat models to finish the attacks. A series of experiments conducted on the ImageNet and CASIA-WebFace datasets show that the proposed method can efficiently generate adversarial examples and outperforms the state-of-the-art attack methods. Moreover, Adv-watermark is more robust against image transformation defense methods.

Dual In-painting Model for Unsupervised Gaze Correction and Animation in the Wild

  • Jichao Zhang
  • Jingjing Chen
  • Hao Tang
  • Wei Wang
  • Yan Yan
  • Enver Sangineto
  • Nicu Sebe

We address the problem of unsupervised gaze correction in the wild, presenting a solution that works without the need for precise annotations of the gaze angle and the head pose. We created a new dataset called CelebAGaze consisting of two domains X, Y, where the eyes are either staring at the camera or somewhere else. Our method consists of three novel modules: the Gaze Correction module (GCM), the Gaze Animation module (GAM), and the Pretrained Autoencoder module (PAM). Specifically, GCM and GAM separately train a dual in-painting network using data from domain X for gaze correction and data from domain Y for gaze animation. Additionally, a Synthesis-As-Training method is proposed when training GAM to encourage the features encoded from the eye region to be correlated with the angle information, resulting in gaze animation achieved by interpolation in the latent space. To further preserve identity information, e.g., eye shape and iris color, we propose the PAM with an Autoencoder, which is based on Self-Supervised mirror learning where the bottleneck features are angle-invariant and which works as an extra input to the dual in-painting models. Extensive experiments validate the effectiveness of the proposed method for gaze correction and gaze animation in the wild and demonstrate the superiority of our approach in producing more compelling results than state-of-the-art baselines. Our code, the pretrained models and supplementary results are available at:
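
Gaze animation by interpolation in the latent space can be pictured with a minimal sketch. Linear interpolation between two latent codes is assumed here for illustration; the abstract does not state the interpolation scheme, and the function names are hypothetical.

```python
def interpolate_latent(z_start, z_end, alpha):
    """Linear interpolation between two latent codes, with alpha in [0, 1]."""
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(z_start, z_end)]


def gaze_trajectory(z_start, z_end, steps):
    """Evenly spaced latent codes from z_start to z_end (inclusive)."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    return [interpolate_latent(z_start, z_end, i / (steps - 1))
            for i in range(steps)]


# Hypothetical 2-D eye-region codes for two gaze directions.
for code in gaze_trajectory([0.0, 2.0], [1.0, 0.0], 3):
    print(code)
```

Each intermediate code would be decoded by the generator into one frame of the animation.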

Learning Hierarchical Graph for Occluded Pedestrian Detection

  • Gang Li
  • Jian Li
  • Shanshan Zhang
  • Jian Yang

Although pedestrian detection has made significant progress with the help of deep convolutional neural networks, it is still a challenging problem to detect occluded pedestrians, since occluded pedestrians cannot provide sufficient information for classification and regression. In this paper, we propose a novel Hierarchical Graph Pedestrian Detector (HGPD), which integrates semantic and spatial relation information to construct two graphs named the intra-proposal graph and the inter-proposal graph, without relying on extra cues w.r.t. visible regions. In order to capture the occlusion patterns and enhance features from visible regions, the intra-proposal graph considers body parts as nodes and assigns corresponding edge weights based on semantic relations between body parts. On the other hand, the inter-proposal graph adopts spatial relations between neighbouring proposals to provide additional proposal-wise context information for each proposal, which alleviates the lack of information caused by occlusion. We conduct extensive experiments on the standard benchmarks CityPersons and Caltech to demonstrate the effectiveness of our method. On CityPersons, our approach outperforms the baseline method by a large margin of 5.24pp on the heavy occlusion set, and surpasses all previous methods; on Caltech, we establish a new state of the art of 3.78% MR. Code is available at
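
One way to picture the inter-proposal graph is to connect proposals that overlap spatially and pass features along those edges. The sketch below is an assumption-laden illustration: using IoU as the spatial relation, the `min_iou` cutoff, and the weighted-average aggregation are hypothetical choices not taken from the abstract.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x0, y0, x1, y1)."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    iw = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    ih = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = iw * ih
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0


def inter_proposal_context(boxes, features, min_iou=0.1):
    """For each proposal, average neighbour features weighted by IoU."""
    contexts = []
    for i, box in enumerate(boxes):
        weights = [iou(box, other) if j != i else 0.0
                   for j, other in enumerate(boxes)]
        # Drop weak edges so distant proposals do not contribute.
        weights = [w if w >= min_iou else 0.0 for w in weights]
        total = sum(weights)
        dim = len(features[i])
        if total == 0.0:
            contexts.append([0.0] * dim)
            continue
        contexts.append([
            sum(w * features[j][d] for j, w in enumerate(weights)) / total
            for d in range(dim)
        ])
    return contexts


boxes = [(0, 0, 10, 10), (5, 0, 15, 10), (100, 100, 110, 110)]
features = [[1.0], [3.0], [7.0]]
print(inter_proposal_context(boxes, features))
```

An isolated proposal receives a zero context vector here, while overlapping proposals borrow information from each other.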

Adaptively-Accumulated Knowledge Transfer for Partial Domain Adaptation

  • Taotao Jing
  • Haifeng Xia
  • Zhengming Ding

Partial domain adaptation (PDA) has attracted considerable attention as it deals with a realistic and challenging problem in which the source domain label space subsumes the target domain label space. Most conventional domain adaptation (DA) efforts concentrate on learning domain-invariant features to mitigate the distribution disparity across domains. However, for PDA it is crucial to explicitly alleviate the negative influence caused by the irrelevant source domain categories. In this work, we propose an Adaptively-Accumulated Knowledge Transfer framework (A$^2$KT) to align the relevant categories across two domains for effective domain adaptation. Specifically, an adaptively-accumulated mechanism is explored to gradually filter out the most confident target samples and their corresponding source categories, promoting positive transfer with more knowledge across two domains. Moreover, a dual distinct classifier architecture consisting of a prototype classifier and a multilayer perceptron classifier is built to capture intrinsic data distribution knowledge across domains from various perspectives. By maximizing the inter-class center-wise discrepancy and minimizing the intra-class sample-wise compactness, the proposed model is able to obtain more domain-invariant and task-specific discriminative representations of the shared categories data. Comprehensive experiments on several partial domain adaptation benchmarks demonstrate the effectiveness of our proposed model, compared with the state-of-the-art PDA methods.

Box Guided Convolution for Pedestrian Detection

  • Jinpeng Li
  • Shengcai Liao
  • Hangzhi Jiang
  • Ling Shao

Occlusions, scale variation and numerous false positives still represent fundamental challenges in pedestrian detection. Intuitively, different sizes of receptive fields and more attention to the visible parts are required for detecting pedestrians with various scales and occlusion levels, respectively. However, these challenges have not been addressed well by existing pedestrian detectors. This paper presents a novel convolutional network, denoted as box guided convolution network (BGCNet), to tackle these challenges simultaneously in a unified framework. In particular, we propose a box guided convolution (BGC) that can dynamically adjust the sizes of convolution kernels guided by the predicted bounding boxes. In this way, BGCNet provides position-aware receptive fields to address the challenge of large variations of scales. In addition, for the issue of heavy occlusion, the kernel parameters of BGC are spatially localized around the salient and mostly visible key points of a pedestrian, such as the head and foot, to effectively capture high-level semantic features to help detection. Furthermore, a local maximum (LM) loss is introduced to depress false positives and highlight true positives by forcing positives, rather than negatives, to be local maxima, without any additional inference burden. We evaluate BGCNet on popular pedestrian detection benchmarks and achieve state-of-the-art results, with significant performance improvements on heavily occluded and small-scale pedestrians.

Stronger, Faster and More Explainable: A Graph Convolutional Baseline for Skeleton-based Action Recognition

  • Yi-Fan Song
  • Zhang Zhang
  • Caifeng Shan
  • Liang Wang

One essential problem in skeleton-based action recognition is how to extract discriminative features over all skeleton joints. However, State-Of-The-Art (SOTA) models for this task tend to be exceedingly sophisticated and over-parameterized, and the resulting low efficiency in model training and inference has obstructed development in the field, especially for large-scale action datasets. In this work, we propose an efficient but strong baseline based on Graph Convolutional Network (GCN), where three main improvements are aggregated, i.e., early fused Multiple Input Branches (MIB), Residual GCN (ResGCN) with bottleneck structure and Part-wise Attention (PartAtt) block. Firstly, an MIB is designed to enrich informative skeleton features and retain compact representations at an early fusion stage. Then, inspired by the success of the ResNet architecture in Convolutional Neural Networks (CNNs), a ResGCN module is introduced in GCN to alleviate computational costs and reduce learning difficulties in model training while maintaining model accuracy. Finally, a PartAtt block is proposed to discover the most essential body parts over a whole action sequence and obtain more explainable representations for different skeleton action sequences. Extensive experiments on two large-scale datasets, i.e., NTU RGB+D 60 and 120, validate that the proposed baseline slightly outperforms other SOTA models while requiring far fewer parameters during training and inference, e.g., up to 34 times fewer than DGNN, which is one of the best SOTA methods.

Adversarial Image Attacks Using Multi-Sample and Most-Likely Ensemble Methods

  • Xia Du
  • Chi-Man Pun

Many studies on deep neural networks have shown very promising results for most image recognition tasks. However, these networks can often be fooled by adversarial examples that simply add small but powerful distortions to the original input. Recent works have demonstrated the vulnerability of deep learning systems to adversarial examples, but most such works directly manipulate and attack digital images for a specific classifier only, and cannot attack physical images in the real world. In this paper, we propose the multi-sample ensemble method (MSEM) and most-likely ensemble method (MLEM) to generate adversarial attacks that successfully fool the classifier for images in both the digital and real worlds. The proposed adaptive norm algorithm can craft perturbations that are smaller, and craft them faster, than other state-of-the-art attack methods. Besides, the proposed MLEM extended with a weighted objective function can generate robust adversarial attacks that can mislead multiple classifiers (Inception-v3, Inception-v4, Resnet-v2, Ince-res-v2) simultaneously for physical images in the real world. Compared with other methods, experiments show that our adversarial attack methods not only achieve higher success rates but also survive the multi-model defense tests.

DCSFN: Deep Cross-scale Fusion Network for Single Image Rain Removal

  • Cong Wang
  • Xiaoying Xing
  • Yutong Wu
  • Zhixun Su
  • Junyang Chen

Rain removal is an important but challenging computer vision task, as rain streaks can severely degrade the visibility of images and may cause other vision or multimedia tasks to fail. Previous works mainly focused on feature extraction and processing or on neural network structure. Although current rain removal methods can already achieve remarkable results, training based on a single network structure without considering the cross-scale relationship may cause information drop-out. In this paper, we explore cross-scale fusion between networks and an inner-scale fusion operation to solve the image rain removal task. Specifically, to learn features at different scales, we propose a multi-sub-network structure, where the sub-networks are fused in a cross-scale manner by Gated Recurrent Units to inner-learn and make full use of information at different scales. Further, we design an inner-scale connection block that utilizes multi-scale information and fuses features between different scales to improve rain representation ability, and we introduce dense blocks with skip connections to inner-connect these blocks. Experimental results on both synthetic and real-world datasets have demonstrated the superiority of our proposed method, which outperforms the state-of-the-art methods. The source code will be available at

SESSION: Poster Session C1: Deep Learning for Multimedia

Self-Paced Video Data Augmentation by Generative Adversarial Networks with Insufficient Samples

  • Yumeng Zhang
  • Gaoguo Jia
  • Li Chen
  • Mingrui Zhang
  • Junhai Yong

An effective video classification method by means of a small number of samples is urgently needed. The deficiency of samples could be alleviated by generating samples through generative adversarial networks (GANs). However, the generation of videos in a typical category remains underexplored because the complex actions and the changeable viewpoints are difficult to simulate. Thus, applying GANs to perform video augmentation is difficult. In this study, we propose a generative data augmentation method for video classification using dynamic images. The dynamic image compresses the motion information of a video into a still image, removing the interference factors such as the background. Thus, utilizing the GANs to augment dynamic images can keep the categorical motion information and save memory compared with generating videos. To deal with the uneven quality of generated images, we propose a self-paced selection method to automatically select high-quality generated samples for training. These selected dynamic images are used to enhance the features, attain regularization, and finally achieve video augmentation. Our method is verified on two benchmark datasets, namely, HMDB51 and UCF101. Experimental results show that the method remarkably improves the accuracy of video classification under the circumstance of sample insufficiency and sample imbalance.
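
The two ingredients named in this abstract, compressing a video into a dynamic image and selecting generated samples in a self-paced way, can be sketched as below. Both functions rest on assumptions: the linear frame weights `2t - T - 1` are a simplified approximation to rank pooling, and the hard loss threshold is the textbook self-paced rule; neither is confirmed by the abstract as the authors' exact formulation.

```python
def dynamic_image(frames):
    """Collapse T grayscale frames (2-D lists) into one weighted-sum image.

    Later frames get larger weights, so the result encodes temporal order.
    """
    num_frames = len(frames)
    height, width = len(frames[0]), len(frames[0][0])
    out = [[0.0] * width for _ in range(height)]
    for t, frame in enumerate(frames, start=1):
        alpha = 2 * t - num_frames - 1  # assumed simplified rank-pooling weight
        for y in range(height):
            for x in range(width):
                out[y][x] += alpha * frame[y][x]
    return out


def self_paced_select(samples, losses, threshold):
    """Keep only 'easy' generated samples whose loss is below the threshold."""
    return [s for s, loss in zip(samples, losses) if loss < threshold]


# A pixel that brightens over time yields a positive response.
print(dynamic_image([[[0.0]], [[1.0]], [[2.0]]]))
print(self_paced_select(["a", "b", "c"], [0.2, 0.9, 0.4], 0.5))
```

Because the weights sum to zero, a static video maps to an all-zero image, which matches the intuition that background content is suppressed while motion is retained.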

CF-SIS: Semantic-Instance Segmentation of 3D Point Clouds by Context Fusion with Self-Attention

  • Xin Wen
  • Zhizhong Han
  • Geunhyuk Youk
  • Yu-Shen Liu

3D Semantic-Instance Segmentation (SIS) is a newly emerging research direction that aims to understand the visual information of 3D scenes at both the semantic and instance levels. The main difficulty lies in how to coordinate the paradox between mutual aid and the sub-optimal problem. Previous methods usually address the mutual aid between instances and semantics by direct feature fusion or hand-crafted constraints to share the common knowledge of the two tasks. However, they neglect the abundant common knowledge of feature context in the feature space. Moreover, direct feature fusion can raise the sub-optimal problem, since a false prediction of an instance object can interfere with the prediction of the semantic segmentation and vice versa. To address the above two issues, we propose a novel network of feature context fusion for the SIS task, named CF-SIS. The idea is to associatively learn semantic and instance segmentation of 3D point clouds by context fusion with attention in the feature space. Our main contributions are two context fusion modules. First, we propose a novel inter-task context fusion module to take full advantage of mutual aid and relieve the sub-optimal problem. It extracts the context in feature space from one task with attention, and selectively fuses the context into the other task using a gate fusion mechanism. Then, in order to enhance the mutual aid effect, the intra-task context fusion module is designed to further integrate the fused context by selectively merging similar features through the self-attention mechanism. We conduct experiments on the S3DIS and ShapeNet datasets and show that CF-SIS outperforms the state-of-the-art methods on semantic and instance segmentation tasks.
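
A gate fusion mechanism of the kind described in this abstract can be illustrated with a per-channel sigmoid gate that decides how much context from one task flows into the other. The gate parameterization below (one weight per channel plus a shared bias) and the residual form are assumptions for illustration only; the abstract does not give the layer definition.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gate_fuse(target_feat, context_feat, gate_weights, gate_bias=0.0):
    """Add gated context from the other task to the target feature.

    A gate near 0 blocks the context; a gate near 1 lets it through.
    """
    fused = []
    for t, c, w in zip(target_feat, context_feat, gate_weights):
        gate = sigmoid(w * c + gate_bias)
        fused.append(t + gate * c)
    return fused


# Hypothetical one-channel semantic feature and instance context.
print(gate_fuse([1.0], [2.0], [0.0]))
```

With a strongly negative bias the gate closes and the target feature passes through almost unchanged, which is how such a gate can limit interference from a wrong prediction in the other task.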

Hybrid Resolution Network Using Edge Guided Region Mutual Information Loss for Human Parsing

  • Yunan Liu
  • Liang Zhao
  • Shanshan Zhang
  • Jian Yang

In this paper, we propose a new method for human parsing, which effectively maintains high-resolution representations and leverages body edge details to improve the performance. First, we propose a hybrid resolution network (HyRN) for human parsing and body edge detection. In our HyRN, we adopt the deconvolution operation and auxiliary supervision to increase the discrimination ability of features from each scale. Second, considering the close relationship between human parsing and body edge detection, we propose a dual-task cascaded framework (DTCF), which implicitly integrates parsing and edge features to progressively refine the parsing results. Third, we develop an edge guided region mutual information loss, which uses the edge detection results to explicitly maintain the high-order consistency between the parsing prediction and the ground truth around body edge pixels. When evaluated on standard benchmarks, our proposed HyRN achieves competitive accuracy compared with state-of-the-art human parsing methods. Moreover, our DTCF further improves the performance and outperforms the established baseline approach by 3.42 points w.r.t. mIoU on the LIP dataset.

Meta-RCNN: Meta Learning for Few-Shot Object Detection

  • Xiongwei Wu
  • Doyen Sahoo
  • Steven Hoi

Despite significant advances in deep learning based object detection in recent years, training effective detectors in a small data regime remains an open challenge. This is very important since labelling training data for object detection is often very expensive and time-consuming. In this paper, we investigate the problem of few-shot object detection, where a detector has access to only limited amounts of annotated data. Based on the meta-learning principle, we propose a new meta-learning framework for object detection named "Meta-RCNN", which learns the ability to perform few-shot detection via meta-learning. Specifically, Meta-RCNN learns an object detector in an episodic learning paradigm on the (meta) training data. This learning scheme helps acquire a prior which enables Meta-RCNN to do few-shot detection on novel tasks. Built on top of the popular Faster RCNN detector, in Meta-RCNN, both the Region Proposal Network (RPN) and the object classification branch are meta-learned. The meta-trained RPN learns to provide class-specific proposals, while the object classifier learns to do few-shot classification. The novel loss objectives and learning strategy of Meta-RCNN can be trained in an end-to-end manner. We demonstrate the effectiveness of Meta-RCNN in few-shot detection on three datasets (Pascal-VOC, ImageNet-LOC and MSCOCO) with promising results.
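
The episodic learning paradigm mentioned in this abstract trains on many small tasks, each with a few labelled examples per class. The sketch below samples one generic N-way K-shot episode; the data layout, the function name `sample_episode`, and the support/query split are illustrative assumptions and not Meta-RCNN's actual data pipeline.

```python
import random


def sample_episode(items_by_class, n_way, k_shot, n_query, rng):
    """Sample one episode: k_shot support and n_query query items per class."""
    classes = rng.sample(sorted(items_by_class), n_way)
    support, query = [], []
    for cls in classes:
        picked = rng.sample(items_by_class[cls], k_shot + n_query)
        support += [(item, cls) for item in picked[:k_shot]]
        query += [(item, cls) for item in picked[k_shot:]]
    return support, query


# Hypothetical annotation ids grouped by category.
data = {
    "cat": list(range(0, 10)),
    "dog": list(range(10, 20)),
    "bird": list(range(20, 30)),
}
support, query = sample_episode(data, n_way=2, k_shot=3, n_query=2,
                                rng=random.Random(0))
print(len(support), len(query))
```

The detector adapts on the support set and is evaluated on the query set, so each episode mimics the few-shot condition it will face on novel classes.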

Objectness Consistent Representation for Weakly Supervised Object Detection

  • Ke Yang
  • Peng Zhang
  • Peng Qiao
  • Zhiyuan Wang
  • Dongsheng Li
  • Yong Dou

Weakly supervised object detection aims at learning object detectors with only image-level category labels. Most existing methods tend to solve this problem by using a multiple instance learning detector, which is usually trapped into detecting only the discriminative parts of objects. In order to select high-quality proposals, recent works leverage objectness scores derived from weakly-supervised segmentation maps to rank the object proposals. Based on our observation, this kind of segmentation guided method always fails because it neglects the fact that the objectness of all proposals inside the ground-truth box should be consistent. In this paper, we propose a novel object representation named Objectness Consistent Representation (OCRepr) to meet the consistency criterion of objectness. Specifically, we project the segmentation confidence scores into two orthogonal directions, namely vertical and horizontal, to get the OCRepr. With the novel object representation, more high-quality proposals can be mined for learning a much stronger object detector. We obtain 54.6% and 51.1% mAP scores on the VOC 2007 and 2012 datasets, significantly outperforming the state of the art and demonstrating the superiority of OCRepr for weakly supervised object detection.
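
Projecting a 2-D segmentation confidence map onto the horizontal and vertical directions can be sketched in a few lines. Max-pooling along each axis is an assumption made for illustration (a sum or mean would be equally plausible); the abstract only states that scores are projected into two orthogonal directions.

```python
def project_scores(seg_map):
    """Project a 2-D confidence map onto two 1-D profiles.

    Returns (horizontal, vertical): one value per column and one per row.
    """
    horizontal = [max(column) for column in zip(*seg_map)]
    vertical = [max(row) for row in seg_map]
    return horizontal, vertical


# Hypothetical 3x3 segmentation confidence map for one category.
seg_map = [
    [0.1, 0.2, 0.1],
    [0.3, 0.9, 0.4],
    [0.1, 0.5, 0.2],
]
print(project_scores(seg_map))
```

Every location inside the object's extent along an axis then shares the profile of that axis, which is one way to make objectness consistent across the whole box rather than peaked on discriminative parts.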

Unpaired Image Enhancement with Quality-Attention Generative Adversarial Network

  • Zhangkai Ni
  • Wenhan Yang
  • Shiqi Wang
  • Lin Ma
  • Sam Kwong

In this work, we aim to learn an unpaired image enhancement model, which can enrich low-quality images with the characteristics of high-quality images provided by users. We propose a quality attention generative adversarial network (QAGAN) trained on unpaired data based on the bidirectional Generative Adversarial Network (GAN) embedded with a quality attention module (QAM). The key novelty of the proposed QAGAN lies in the QAM injected into the generator, such that it learns domain-relevant quality attention directly from the two domains. More specifically, the proposed QAM allows the generator to effectively select semantic-related characteristics in the spatial dimension and adaptively incorporate style-related attributes in the channel dimension. Therefore, in our proposed QAGAN, not only the discriminators but also the generator can directly access both domains, which significantly facilitates the generator in learning the mapping function. Extensive experimental results show that, compared with the state-of-the-art methods based on unpaired learning, our proposed method achieves better performance in both objective and subjective evaluations.

ASTA-Net: Adaptive Spatio-Temporal Attention Network for Person Re-Identification in Videos

  • Xierong Zhu
  • Jiawei Liu
  • Haoze Wu
  • Meng Wang
  • Zheng-Jun Zha

The attention mechanism has been widely applied to enhance pedestrian representation for person re-identification in videos. However, most existing methods learn the spatial and temporal attention separately, and thus ignore the correlation between them. In this work, we propose a novel Adaptive Spatio-Temporal Attention Network (ASTA-Net) to adaptively aggregate the spatial and temporal attention features into a discriminative pedestrian representation for person re-identification in videos. Specifically, multiple Adaptive Spatio-Temporal Fusion modules within ASTA-Net are designed for exploring precise spatio-temporal attention on multi-level feature maps. They first obtain the preliminary spatial and temporal attention features via the spatial semantic relations for each frame and temporal dependencies among non-consecutive frames, then adaptively aggregate the preliminary attention features on the basis of their correlation. Moreover, an Adjacent-Frame Motion module is designed to explicitly extract motion patterns according to the feature-level variation among adjacent frames. Extensive experiments on three widely used datasets, i.e., MARS, iLIDS-VID and PRID2011, have demonstrated the effectiveness of the proposed approach.

Talking Face Generation with Expression-Tailored Generative Adversarial Network

  • Dan Zeng
  • Han Liu
  • Hui Lin
  • Shiming Ge

A key to automatically generating vivid talking faces is to synthesize identity-preserving natural facial expressions beyond audio-lip synchronization, which usually requires disentangling the informative features from multiple modalities and then fusing them together. In this paper, we propose an end-to-end Expression-Tailored Generative Adversarial Network (ET-GAN) to generate an expression-enriched talking face video of arbitrary identity. Different from talking face generation based on an identity image and audio, an expressional video of arbitrary identity serves as the expression source in our approach. An expression encoder is proposed to disentangle the expression-tailored representation from the guiding expressional video, while an audio encoder disentangles the audio-lip representation. Instead of using a single image as identity input, a multi-image identity encoder is proposed that learns different views of faces and merges them into a unified representation. Multiple discriminators are exploited to keep both image-aware and video-aware realistic details, including a spatial-temporal discriminator for visual continuity of expression synthesis and facial movements. We conduct extensive experimental evaluations on quantitative metrics, expression retention quality and audio-visual synchronization. The results show the effectiveness of our ET-GAN in generating high-quality expressional talking face videos compared with existing state-of-the-art methods.

Cross-Modal Omni Interaction Modeling for Phrase Grounding

  • Tianyu Yu
  • Tianrui Hui
  • Zhihao Yu
  • Yue Liao
  • Sansi Yu
  • Faxi Zhang
  • Si Liu

Phrase grounding aims to localize the objects described by phrases in a natural language specification. Previous works model the interaction of inputs from the text modality and the visual modality only at the intra-modal global level and consequently lack the ability to capture precise and complete context information. In this paper, we propose a novel Cross-Modal Omni Interaction network (COI Net) composed of a neighboring interaction module, a global interaction module, a cross-modal interaction module and a multilevel alignment module. Our approach formulates the complex spatial and semantic relationship among image regions and phrases through multi-level multi-modal interaction. We capture the local relationship using the interaction among neighboring regions and then collect the global context through the interaction among all regions using a transformer encoder. We further use a co-attention module to apply the interaction between two modalities to gather the cross-modal context for all image regions and phrases. In addition to the omni interaction modeling, we also leverage a straightforward yet effective multilevel alignment regularization to formulate the dependencies among all grounding decisions. We extensively validate the effectiveness of our model. Experiments show that our approach outperforms existing state-of-the-art methods by large margins on two popular datasets in terms of accuracy: 6.15% on Flickr30K Entities (71.36% increased to 77.51%) and 21.25% on ReferItGame (44.91% increased to 66.16%). The code of our implementation is available at

Bridging the Web Data and Fine-Grained Visual Recognition via Alleviating Label Noise and Domain Mismatch

  • Yazhou Yao
  • Xiansheng Hua
  • Guanyu Gao
  • Zeren Sun
  • Zhibin Li
  • Jian Zhang

To distinguish the subtle differences among fine-grained categories, a large number of well-labeled images is typically required. However, manual annotation of fine-grained categories is an extremely difficult task, as it usually demands professional knowledge. To this end, we propose to directly leverage web images for fine-grained visual recognition. Our work mainly focuses on two critical issues in web images: "label noise" and "domain mismatch". Specifically, we propose an end-to-end deep denoising network (DDN) model to jointly solve these problems in the process of web image selection. To verify the effectiveness of our proposed approach, we first collect web images by using the labels in fine-grained datasets. Then we apply the proposed deep denoising network model for noise removal and domain mismatch alleviation. We leverage the selected web images as the training set for learning fine-grained categorization models. Extensive experiments and ablation studies demonstrate the state-of-the-art performance of our proposed approach, which, at the same time, delivers a new pipeline for fine-grained visual categorization that is likely to be highly effective for real-world applications.

Is Depth Really Necessary for Salient Object Detection?

  • Jiawei Zhao
  • Yifan Zhao
  • Jia Li
  • Xiaowu Chen

Salient object detection (SOD) is a crucial and preliminary task for many computer vision applications, and it has made progress with deep CNNs. Most existing methods mainly rely on RGB information to distinguish the salient objects, which faces difficulties in some complex scenarios. To solve this, many recent RGBD-based networks adopt the depth map as an independent input and fuse its features with RGB information. Taking the advantages of both RGB and RGBD methods, we propose a novel depth-aware salient object detection framework, which has the following superior designs: 1) It does not rely on depth data in the testing phase. 2) It comprehensively optimizes SOD features with multi-level depth-aware regularizations. 3) The depth information also serves as an error-weighted map to correct the segmentation process. With these insightful designs combined, we make the first attempt at realizing a unified depth-aware framework with only RGB information as input for inference, which not only surpasses the state-of-the-art performance on five public RGB SOD benchmarks, but also surpasses the RGBD-based methods on five benchmarks by a large margin, while using less information and a lightweight implementation.

Self-Play Reinforcement Learning for Fast Image Retargeting

  • Nobukatsu Kajiura
  • Satoshi Kosugi
  • Xueting Wang
  • Toshihiko Yamasaki

In this study, we address image retargeting, which is a task that adjusts input images to arbitrary sizes. In MULTIOP, one of the best-performing methods, multiple retargeting operators are combined and retargeted images are generated at each stage to find the optimal sequence of operators that minimizes the distance between the original and retargeted images. The limitation of this method is its tremendous processing time, which severely limits its practical use. Therefore, the purpose of this study is to find the optimal combination of operators within a reasonable processing time; we propose a method of predicting the optimal operator for each step using a reinforcement learning agent. The technical contributions of this study are as follows. Firstly, we propose a reward based on self-play, which is insensitive to the large variance in the content-dependent distance measured in MULTIOP. Secondly, we propose to dynamically change the loss weight for each action to prevent the algorithm from falling into a local optimum and from choosing only the most frequently used operator in its training. Our experiments showed that we achieved multi-operator image retargeting with processing time reduced by three orders of magnitude and the same quality as the original multi-operator-based method, which was the best-performing algorithm in retargeting tasks.
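
Why a self-play reward is insensitive to the scale of a content-dependent distance can be seen in a tiny sketch. The win/lose/draw reward below is an assumed formulation for illustration; the abstract does not give the exact reward definition, and the function name is hypothetical.

```python
def self_play_reward(agent_distance, opponent_distance):
    """Reward depends only on which player ends closer to the original image."""
    if agent_distance < opponent_distance:
        return 1.0
    if agent_distance > opponent_distance:
        return -1.0
    return 0.0


# The same relative outcome on an "easy" and a "hard" image gives the same reward,
# even though the raw distances differ by a factor of 100.
print(self_play_reward(0.02, 0.03))
print(self_play_reward(2.0, 3.0))
```

Since only the comparison between the two players matters, images whose distances are intrinsically large or small contribute rewards on the same scale.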

Brain-media: A Dual Conditioned and Lateralization Supported GAN (DCLS-GAN) towards Visualization of Image-evoked Brain Activities

  • Ahmed Fares
  • Sheng-hua Zhong
  • Jianmin Jiang

Essentially, the current concept of multimedia is limited to presenting what people see with their eyes. What people think inside their brains, however, remains a rich source of multimedia, such as imaginations of paradise and memories of the good old days. In this paper, we propose a dual conditioned and lateralization supported GAN (DCLS-GAN) framework to learn and visualize the brain thoughts evoked by stimulating images and hence enable multimedia to reflect not only what people see but also what people think. To reveal such a new world of multimedia inside human brains, we coin the term "brain-media" for this attempt. By examining the relevance between the visualized image and the stimulation image, we are able to measure the efficiency of our proposed deep framework regarding the quality of such visualization and also the feasibility of exploring the concept of "brain-media". To ensure that such extracted multimedia elements remain meaningful, we introduce a dually conditioned learning technique in the proposed deep framework, where one condition is analyzing EEGs through deep learning to extract a class-dependent and more compact brain feature space utilizing the distinctive characteristics of hemispheric lateralization and brain stimulation, and the other is to extract expressive visual features assisting our automated analysis of brain activities as well as their visualizations aided by artificial intelligence. To support the proposed GAN framework, we create a combined-conditional space by merging the brain feature space with the visual feature space provoked by the stimuli. Extensive experiments are carried out and the results show that our proposed deep framework significantly outperforms representative existing state-of-the-art methods under several settings, especially in terms of both visualization and classification of brain responses to the evoked images.
For the convenience of research dissemination, we make the source code openly accessible for downloading at GitHub.

Mesh Guided One-shot Face Reenactment Using Graph Convolutional Networks

  • Guangming Yao
  • Yi Yuan
  • Tianjia Shao
  • Kun Zhou

Face reenactment aims to animate a source face image to a different pose and expression provided by a driving image. Existing approaches are either designed for a specific identity, or suffer from the identity preservation problem in the one-shot or few-shot scenarios. In this paper, we introduce a method for one-shot face reenactment, which uses the reconstructed 3D meshes (i.e., the source mesh and driving mesh) as guidance to learn the optical flow needed for the reenacted face synthesis. Technically, we explicitly exclude the driving face's identity information in the reconstructed driving mesh. In this way, our network can focus on the motion estimation for the source face without the interference of driving face shape. We propose a motion net to learn the face motion, which is an asymmetric autoencoder. The encoder is a graph convolutional network (GCN) that learns a latent motion vector from the meshes, and the decoder serves to produce an optical flow image from the latent vector with CNNs. Compared to previous methods using sparse keypoints to guide the optical flow learning, our motion net learns the optical flow directly from 3D dense meshes, which provide the detailed shape and pose information for the optical flow, so it can achieve more accurate expression and pose on the reenacted face. Extensive experiments show that our method can generate high-quality results and outperforms state-of-the-art methods in both qualitative and quantitative comparisons.

SESSION: Poster Session D1: Deep Learning for Multimedia

Controllable Continuous Gaze Redirection

  • Weihao Xia
  • Yujiu Yang
  • Jing-Hao Xue
  • Wensen Feng

In this work, we present interpGaze, a novel framework for controllable gaze redirection that achieves both precise redirection and continuous interpolation. Given two gaze images with different attributes, our goal is to redirect the eye gaze of one person into any gaze direction depicted in the reference image or to generate continuous intermediate results. To accomplish this, we design a model including three cooperative components: an encoder, a controller and a decoder. The encoder maps images into a well-disentangled and hierarchically-organized latent space. The controller adjusts the magnitudes of latent vectors to the desired strength of corresponding attributes by altering a control vector. The decoder converts the desired representations from the attribute space to the image space. To facilitate covering the full space of gaze directions, we introduce a high-quality gaze image dataset with a large range of directions, which also benefits researchers in related areas. Extensive experimental validation and comparisons to several baseline methods show that the proposed interpGaze outperforms state-of-the-art methods in terms of image quality and redirection precision.

Preserving Global and Local Temporal Consistency for Arbitrary Video Style Transfer

  • Xinxiao Wu
  • Jialu Chen

Video style transfer is a challenging task that requires not only stylizing video frames but also preserving temporal consistency among them. Many existing methods resort to optical flow for maintaining the temporal consistency in stylized videos. However, optical flow is sensitive to occlusions and rapid motions, and its training and processing are quite slow, which makes it less practical in real-world applications. In this paper, we propose a novel fast method that explores both global and local temporal consistency for video style transfer without estimating optical flow. To preserve the temporal consistency of the entire video (i.e., global consistency), we use the structural similarity index instead of optical flow and propose a self-similarity loss to ensure the temporal structure similarity between the stylized video and the source video. Furthermore, to enhance the coherence between adjacent frames (i.e., local consistency), a self-attention mechanism is designed to attend to the previous stylized frame when synthesizing the current frame. Extensive experiments demonstrate that our method generally achieves better visual results and runs faster than the state-of-the-art methods, which validates the superiority of simultaneously preserving global and local temporal consistency for video style transfer.

Deep Shapely Portraits

  • Qinjie Xiao
  • Xiangjun Tang
  • You Wu
  • Leyang Jin
  • Yong-Liang Yang
  • Xiaogang Jin

We present deep shapely portraits, a novel method based on deep learning, to automatically reshape an input portrait to be better proportioned and more shapely while keeping personal facial characteristics. Different from existing methods that may suffer from irrational face artifacts when dealing with portraits with large pose variations or reshaping adjustments, we utilize dense 3D face information and constraints instead of sparse facial landmarks based on 3D morphable models, resulting in better reshaped faces lying in rational face space. To this end, we first estimate the best shapely degree for the input portrait using a convolutional neural network (CNN) trained on our newly developed ShapeFaceNet dataset. Then the best shapely degree is used as the control parameter to reshape the 3D face reconstructed from the input portrait image. After that, we render the reshaped 3D face back to 2D and generate a seamless portrait image using a fast image warping optimization. Our work can deal with pose and expression free (PE-Free) portrait images and generate plausible shapely faces without noticeable artifacts, which cannot be achieved by prior work. We validate the effectiveness, efficiency, and robustness of the proposed method by extensive experiments and user studies.

Depth Super-Resolution via Deep Controllable Slicing Network

  • Xinchen Ye
  • Baoli Sun
  • Zhihui Wang
  • Jingyu Yang
  • Rui Xu
  • Haojie Li
  • Baopu Li

Due to the imaging limitations of depth sensors, high-resolution (HR) depth maps are often difficult to acquire directly, so effective depth super-resolution (DSR) algorithms are needed to generate HR outputs from their low-resolution (LR) counterparts. Previous methods treat all depth regions equally without considering the different extents of degradation at the region level, and regard DSR under different scales as independent tasks without modeling the relationship between scales, both of which impede further performance improvement and practical use of DSR. To alleviate these problems, we propose a deep controllable slicing network from a novel perspective. Specifically, our model learns a set of slicing branches in a divide-and-conquer manner, parameterized by a distance-aware weighting scheme to adaptively aggregate different depths in an ensemble. Each branch that specifies a depth slice (e.g., the region in some depth range) tends to yield accurate depth recovery. Meanwhile, a scale-controllable module that extracts depth features under different scales is proposed and inserted at the front of the slicing network, enabling fine-grained control of the depth restoration results of the slicing network with a scale hyper-parameter. Extensive experiments on synthetic and real-world benchmark datasets demonstrate that our method achieves superior performance.

Efficient Joint Gradient Based Attack Against SOR Defense for 3D Point Cloud Classification

  • Chengcheng Ma
  • Weiliang Meng
  • Baoyuan Wu
  • Shibiao Xu
  • Xiaopeng Zhang

Deep learning based classifiers on 3D point cloud data have been shown vulnerable to adversarial examples, while a defense strategy named Statistical Outlier Removal (SOR) is widely adopted to defend adversarial examples successfully, by discarding outlier points in the point cloud.

In this paper, we propose a novel white-box attack method, Joint Gradient Based Attack (JGBA), aiming to break the SOR defense. Specifically, we generate adversarial examples by optimizing an objective function containing both the original point cloud and its SOR-processed version, for the purpose of pushing both of them towards the decision boundary of classifier at the same time. Since the SOR defense introduces a non-differentiable optimization problem, we overcome the problem by introducing a linear approximation of the SOR defense and successfully compute the joint gradient. Moreover, we impose constraints on perturbation norm for each component point in the point cloud instead of for the entire object, to further enhance the attack ability against the SOR defense. Our JGBA method can be directly extended to the semi white-box setting, where the values of hyper-parameters in the SOR defense are unknown to the attacker. Extensive experiments validate that our JGBA method achieves the highest performance to break both the SOR defense and the DUP-Net defense (a recently proposed defense which takes SOR as its core procedure), compared with state-of-the-art attacks on four victim classifiers, namely PointNet, PointNet++(SSG), PointNet++(MSG), and DGCNN.
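For readers unfamiliar with the defense being attacked, the following is a minimal sketch of statistical outlier removal in one common formulation: each point's mean distance to its k nearest neighbours is compared against the cloud-wide mean plus a multiple of the standard deviation. The abstract does not specify the rule or its hyper-parameters, so the threshold form, the function name `statistical_outlier_removal`, and the values of `k` and `alpha` are assumptions for illustration only, not the authors' implementation.

```python
import math
import statistics


def statistical_outlier_removal(points, k=2, alpha=1.0):
    """Keep points whose mean k-NN distance is at most mean + alpha * std.

    `k` and `alpha` are illustrative hyper-parameters (assumed values).
    """
    mean_knn = []
    for i, p in enumerate(points):
        # Distances from p to every other point, nearest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        mean_knn.append(sum(dists[:k]) / k)
    limit = statistics.fmean(mean_knn) + alpha * statistics.pstdev(mean_knn)
    return [p for p, d in zip(points, mean_knn) if d <= limit]


# Toy cloud: the eight corners of a unit cube plus one far-away point.
cube = [(x, y, z) for x in (0, 1) for y in (0, 1) for z in (0, 1)]
cloud = cube + [(10, 10, 10)]
filtered = statistical_outlier_removal(cloud)
```

The hard keep-or-discard decision in the last line of the function is what makes the defense non-differentiable, which is the obstacle the abstract says is handled with a linear approximation.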

Discrete Haze Level Dehazing Network

  • Xiaofeng Cong
  • Jie Gui
  • Kai-Chao Miao
  • Jun Zhang
  • Bing Wang
  • Peng Chen

In contrast to traditional dehazing methods, deep learning based single image dehazing (SID) algorithms have achieved better performance by learning a mapping function from hazy to haze-free images. Images taken from natural scenes usually have different haze levels, but deep SID algorithms process all hazy images as one group. This makes it difficult for deep SID algorithms to deal with image sets in which some images have a specific haze density. In this paper, a Discrete Haze Level Dehazing network (DHL-Dehaze), a very effective method to dehaze images with multiple different haze levels, is proposed. The proposed approach treats single image dehazing as a multi-domain image-to-image translation problem, instead of grouping all hazy images into the same domain. DHL-Dehaze provides a computational derivation to describe the role of different haze levels in image translation. To verify the proposed approach, we synthesize two large-scale datasets with images of multiple haze levels based on the NYU-Depth and DIML/CVL datasets. The experiments show that DHL-Dehaze can obtain excellent quantitative and qualitative dehazing results, especially when the haze concentration is high.

Deep Heterogeneous Multi-Task Metric Learning for Visual Recognition and Retrieval

  • Shikang Gan
  • Yong Luo
  • Yonggang Wen
  • Tongliang Liu
  • Han Hu

How to estimate the distance between data instances is a fundamental problem in many artificial intelligence algorithms, and critical in diverse multimedia applications. A major challenge in the estimation is how to find an appropriate distance function when labeled data are insufficient for a certain task. Multi-task metric learning (MTML) is able to alleviate such data deficiency issue by learning distance metrics for multiple tasks together and sharing information between the different tasks. Recently, heterogeneous MTML (HMTML) has attracted much attention since it can handle multiple tasks with varied data representations. A major drawback of the current HMTML approaches is that only linear transformations are learned to connect different domains. This is suboptimal since the correlations between different domains may be very complex and highly nonlinear. To overcome this drawback, we propose a deep heterogeneous MTML (DHMTML) method, in which a nonlinear mapping is learned for each task by using a deep neural network. The correlations of different domains are exploited by sharing some parameters at the top layers of different networks. More importantly, the auto-encoder scheme and the adversarial learning mechanism are integrated and incorporated to help exploit the feature correlations in and between different tasks and the specific properties are preserved by learning additional task-specific layers together with the common layers. Experiments demonstrated that the proposed method outperforms single-task deep metric learning algorithms and other HMTML approaches consistently on several benchmark datasets.

HOSE-Net: Higher Order Structure Embedded Network for Scene Graph Generation

  • Meng Wei
  • Chun Yuan
  • Xiaoyu Yue
  • Kuo Zhong

Scene graph generation aims to produce structured representations for images, which requires understanding the relations between objects. Due to the continuous nature of deep neural networks, the prediction of scene graphs is divided into object detection and relation classification. However, the independent relation classes cannot separate the visual features well. Although some methods organize the visual features into graph structures and use message passing to learn contextual information, they still suffer from drastic intra-class variations and unbalanced data distributions. One important factor is that they learn an unstructured output space that ignores the inherent structures of scene graphs. Accordingly, in this paper, we propose a Higher Order Structure Embedded Network (HOSE-Net) to mitigate this issue. First, we propose a novel structure-aware embedding-to-classifier (SEC) module to incorporate both local and global structural information of relationships into the output space. Specifically, a set of context embeddings is learned via local graph based message passing and then mapped to a global structure based classification space. Second, since learning too many context-specific classification subspaces can suffer from data sparsity issues, we propose a hierarchical semantic aggregation (HSA) module to reduce the number of subspaces by introducing higher order structural information. HSA is also a fast and flexible tool to automatically search a semantic object hierarchy based on relational knowledge graphs. Extensive experiments show that the proposed HOSE-Net achieves state-of-the-art performance on two popular benchmarks, Visual Genome and VRD.

Dual Semantic Fusion Network for Video Object Detection

  • Lijian Lin
  • Haosheng Chen
  • Honglun Zhang
  • Jun Liang
  • Yu Li
  • Ying Shan
  • Hanzi Wang

Video object detection is a tough task due to the deteriorated quality of video sequences captured under complex environments. Currently, this area is dominated by a series of feature enhancement based methods, which distill beneficial semantic information from multiple frames and generate enhanced features through fusing the distilled information. However, the distillation and fusion operations are usually performed at either frame level or instance level with external guidance using additional information, such as optical flow and feature memory. In this work, we propose a dual semantic fusion network (abbreviated as DSFNet) to fully exploit both frame-level and instance-level semantics in a unified fusion framework without external guidance. Moreover, we introduce a geometric similarity measure into the fusion process to alleviate the influence of information distortion caused by noise. As a result, the proposed DSFNet can generate more robust features through the multi-granularity fusion and avoid being affected by the instability of external guidance. To evaluate the proposed DSFNet, we conduct extensive experiments on the ImageNet VID dataset. Notably, the proposed dual semantic fusion network achieves, to the best of our knowledge, the best performance of 84.1% mAP among the current state-of-the-art video object detectors with ResNet-101 and 85.4% mAP with ResNeXt-101 without using any post-processing steps.

Sharp Multiple Instance Learning for DeepFake Video Detection

  • Xiaodan Li
  • Yining Lang
  • Yuefeng Chen
  • Xiaofeng Mao
  • Yuan He
  • Shuhui Wang
  • Hui Xue
  • Quan Lu

With the rapid development of facial manipulation techniques, face forgery has received considerable attention in the multimedia and computer vision communities due to security concerns. Existing methods are mostly designed for single-frame detection trained with precise image-level labels or for video-level prediction by only modeling the inter-frame inconsistency, leaving potential high risks for DeepFake attackers. In this paper, we introduce a new problem of partial face attack in DeepFake video, where only video-level labels are provided but not all the faces in the fake videos are manipulated. We address this problem with a multiple instance learning framework, treating faces and the input video as instances and bag, respectively. A sharp MIL (S-MIL) is proposed which builds a direct mapping from instance embeddings to bag prediction, rather than from instance embeddings to instance prediction and then to bag prediction as in traditional MIL. Theoretical analysis proves that the gradient vanishing in traditional MIL is relieved in S-MIL. To generate instances that can accurately incorporate the partially manipulated faces, a spatial-temporal encoded instance is designed to fully model the intra-frame and inter-frame inconsistency, which further helps to promote the detection performance. We also construct a new dataset FFPMS for partially attacked DeepFake video detection, which can benefit the evaluation of different methods at both frame and video levels. Experiments on FFPMS and the widely used DFDC dataset verify that S-MIL is superior to other counterparts for partially attacked DeepFake video detection. In addition, S-MIL can also be adapted to traditional DeepFake image detection tasks and achieve state-of-the-art performance on single-frame datasets.

Learning to Detect Specular Highlights from Real-world Images

  • Gang Fu
  • Qing Zhang
  • Qifeng Lin
  • Lei Zhu
  • Chunxia Xiao

Specular highlight detection is a challenging problem with many applications, such as shiny object detection and light source estimation. Although various highlight detection methods have been proposed, they fail to disambiguate bright material surfaces from highlights, and cannot handle non-white-balanced images. Moreover, at present, there is still no benchmark dataset for highlight detection. In this paper, we present a large-scale real-world highlight dataset containing a rich variety of material categories, with diverse highlight shapes and appearances, in which each image has an annotated ground-truth mask. Based on the dataset, we develop a deep learning-based specular highlight detection network (SHDNet) leveraging multi-scale context contrasted features to accurately detect specular highlights of varying scales. In addition, we design a binary cross-entropy (BCE) loss and an intersection-over-union edge (IoUE) loss for our network. Compared with existing highlight detection methods, our method can accurately detect highlights of different sizes, while effectively excluding the non-highlight regions, such as bright materials, non-specular as well as colored lighting, and even light sources.

Video Super-Resolution using Multi-scale Pyramid 3D Convolutional Networks

  • Jianping Luo
  • Shaofei Huang
  • Yuan Yuan

Video super-resolution (SR) aims at generating high-resolution (HR) frames from consecutive low-resolution (LR) frames. The challenge is how to make use of temporal coherence among neighbouring LR frames. Most previous works use motion estimation and compensation based models. However, their performance relies heavily on motion estimation accuracy. In this paper, we propose a multi-scale pyramid 3D convolutional (MP3D) network for video SR, where 3D convolution can explore temporal correlation directly without explicit motion compensation. Specifically, we first apply 3D convolution in a pyramid subnet to extract multi-scale spatial and temporal features simultaneously from the LR frames, such that it can handle motions of various sizes. We then feed the fused feature maps into an SR reconstruction subnet, where a 3D sub-pixel convolution layer is used for up-sampling. Finally, we append a detail refinement subnet based on the encoder-decoder structure to further enhance texture details of the reconstructed HR frames. Extensive experiments on benchmark datasets and real-world cases show that the proposed MP3D model outperforms state-of-the-art video SR methods in terms of PSNR/SSIM values, visual quality and temporal consistency.
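As background on sub-pixel up-sampling, the sketch below shows the channel-to-space rearrangement ("pixel shuffle") that typically follows the convolution in a sub-pixel layer. It is a plain 2D illustration using a common channel-ordering convention; the abstract does not describe the layout of the 3D layer in MP3D, so the function name `pixel_shuffle`, the nested-list representation, and the ordering are assumptions rather than the authors' design.

```python
def pixel_shuffle(x, r):
    """Rearrange a [C*r*r][H][W] nested list into [C][H*r][W*r].

    Assumed convention: input channel c*r*r + i*r + j supplies the output
    pixel at offset (i, j) inside each r-by-r block of output channel c.
    """
    channels_in = len(x)
    if channels_in % (r * r) != 0:
        raise ValueError("channel count must be divisible by r*r")
    c_out = channels_in // (r * r)
    h, w = len(x[0]), len(x[0][0])
    out = [[[0] * (w * r) for _ in range(h * r)] for _ in range(c_out)]
    for c in range(c_out):
        for i in range(r):
            for j in range(r):
                src = x[c * r * r + i * r + j]
                for row in range(h):
                    for col in range(w):
                        out[c][row * r + i][col * r + j] = src[row][col]
    return out


# Four 1x1 feature maps become one 2x2 map when the scale factor is 2.
low_res = [[[1]], [[2]], [[3]], [[4]]]
high_res = pixel_shuffle(low_res, 2)
```

Because the rearrangement only moves values, the up-sampling itself has no parameters; the learnable part is the convolution that produces the extra channels.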

PCA-SRGAN: Incremental Orthogonal Projection Discrimination for Face Super-resolution

  • Hao Dou
  • Chen Chen
  • Xiyuan Hu
  • Zuxing Xuan
  • Zhisen Hu
  • Silong Peng

Generative Adversarial Networks (GANs) have been employed for face super-resolution, but they easily introduce distorted facial details and are still weak at recovering realistic texture. To further improve the performance of GAN-based models on super-resolving face images, we propose PCA-SRGAN, which pays attention to the cumulative discrimination in the orthogonal projection space spanned by the PCA projection matrix of face data. By feeding the principal component projections ranging from structure to details into the discriminator, the discrimination difficulty is greatly alleviated and the generator can be enhanced to reconstruct clearer contours and finer texture, which helps to achieve high perceptual quality and low distortion. This incremental orthogonal projection discrimination ensures a precise optimization procedure from coarse to fine and avoids the dependence on perceptual regularization. We conduct experiments on the CelebA and FFHQ face datasets. The qualitative visual results and quantitative evaluation demonstrate the superior performance of our model over related works.

Exploring Font-independent Features for Scene Text Recognition

  • Yizhi Wang
  • Zhouhui Lian

Scene text recognition (STR) has been extensively studied in the last few years. Many recently proposed methods are specially designed to accommodate the arbitrary shape, layout and orientation of scene texts, but ignore that various font (or writing) styles also pose severe challenges to STR. These methods, where font features and content features of characters are tangled, perform poorly in text recognition on scene images with texts in novel font styles. To address this problem, we explore font-independent features of scene texts via attentional generation of glyphs in a large number of font styles. Specifically, we introduce trainable font embeddings to shape the font styles of generated glyphs, with the image feature of scene text only representing its essential patterns. The generation process is directed by the spatial attention mechanism, which effectively copes with irregular texts and generates higher-quality glyphs than existing image-to-image translation methods. Experiments conducted on several STR benchmarks demonstrate the superiority of our method compared to the state of the art.

SESSION: Poster Session E1: Deep Learning for Multimedia

Context-aware Feature Generation For Zero-shot Semantic Segmentation

  • Zhangxuan Gu
  • Siyuan Zhou
  • Li Niu
  • Zihan Zhao
  • Liqing Zhang

Existing semantic segmentation models heavily rely on dense pixel-wise annotations. To reduce the annotation pressure, we focus on a challenging task named zero-shot semantic segmentation, which aims to segment unseen objects with zero annotations. This task can be accomplished by transferring knowledge across categories via semantic word embeddings. In this paper, we propose a novel context-aware feature generation method for zero-shot segmentation named CaGNet. In particular, with the observation that a pixel-wise feature highly depends on its contextual information, we insert a contextual module in a segmentation network to capture the pixel-wise contextual information, which guides the process of generating more diverse and context-aware features from semantic word embeddings. Our method achieves state-of-the-art results on three benchmark datasets for zero-shot segmentation.

Defending Adversarial Examples via DNN Bottleneck Reinforcement

  • Wenqing Liu
  • Miaojing Shi
  • Teddy Furon
  • Li Li

This paper presents a DNN bottleneck reinforcement scheme to alleviate the vulnerability of Deep Neural Networks (DNN) against adversarial attacks. Typical DNN classifiers encode the input image into a compressed latent representation more suitable for inference. This information bottleneck makes a trade-off between the image-specific structure and class-specific information in an image. By reinforcing the former while maintaining the latter, any redundant information, be it adversarial or not, should be removed from the latent representation. Hence, this paper proposes to jointly train an auto-encoder (AE) sharing the same encoding weights with the visual classifier. In order to reinforce the information bottleneck, we introduce the multi-scale low-pass objective and multi-scale high-frequency communication for better frequency steering in the network. Unlike existing approaches, our scheme is the first reforming defense per se which keeps the classifier structure untouched without appending any pre-processing head and is trained with clean images only. Extensive experiments on MNIST, CIFAR-10 and ImageNet demonstrate the strong defense of our method against various adversarial attacks.

Weakly-Supervised Video Object Grounding by Exploring Spatio-Temporal Contexts

  • Xun Yang
  • Xueliang Liu
  • Meng Jian
  • Xinjian Gao
  • Meng Wang

Grounding objects in visual context from natural language queries is a crucial yet challenging vision-and-language task, which has gained increasing attention in recent years. Existing work has primarily investigated this task in the context of still images. Despite their effectiveness, these methods cannot be directly migrated into the video context, mainly due to 1) the complex spatio-temporal structure of videos and 2) the scarcity of fine-grained annotations of videos. To effectively ground objects in videos is profoundly more challenging and less explored.

To fill the research gap, this paper presents a weakly-supervised framework for linking objects mentioned in a sentence with the corresponding regions in videos. It mainly considers two types of video characteristics: 1) objects are dynamically distributed across multiple frames and have diverse temporal durations, and 2) object regions in videos are spatially correlated with each other. Specifically, we propose a weakly-supervised video object grounding approach which mainly consists of three modules: 1) a temporal localization module to model the latent relation between queried objects and frames with a temporal attention network, 2) a spatial interaction module to capture feature correlation among object regions for learning context-aware region representation, and 3) a hierarchical video multiple instance learning algorithm to estimate the sentence-segment grounding score for discriminative training. Extensive experiments demonstrate that our method can achieve consistent improvement over state-of-the-art methods.

S2SiamFC: Self-supervised Fully Convolutional Siamese Network for Visual Tracking

  • Chon Hou Sio
  • Yu-Jen Ma
  • Hong-Han Shuai
  • Jun-Cheng Chen
  • Wen-Huang Cheng

To exploit rich information from unlabeled data, in this work, we propose a novel self-supervised framework for visual tracking which can easily adapt state-of-the-art supervised Siamese-based trackers into unsupervised ones by utilizing the fact that an image and any cropped region of it can form a natural pair for self-training. Besides common geometric transformation-based data augmentation and hard negative mining, we also propose adversarial masking, which helps the tracker to learn other context information by adaptively blacking out salient regions of the target. The proposed approach can be trained offline using images only, without requiring manual annotations or temporal information from multiple consecutive frames. Thus, it can be used with any kind of unlabeled data, including images and video frames. For evaluation, we take SiamFC as the base tracker and name the proposed self-supervised method S2SiamFC. Extensive experiments and ablation studies on the challenging VOT2016 and VOT2018 datasets demonstrate the effectiveness of the proposed method, which achieves performance comparable to its supervised counterpart and to other unsupervised methods requiring multiple frames.

Learnable Optimal Sequential Grouping for Video Scene Detection

  • Daniel Rotman
  • Yevgeny Yaroker
  • Elad Amrani
  • Udi Barzelay
  • Rami Ben-Ari

Video scene detection is the task of dividing videos into temporal semantic chapters. This is an important preliminary step before attempting to analyze heterogeneous video content. Recently, Optimal Sequential Grouping (OSG) was proposed as a powerful unsupervised solution to solve a formulation of the video scene detection problem. In this work, we extend the capabilities of OSG to the learning regime. By giving the capability to both learn from examples and leverage a robust optimization formulation, we can boost performance and enhance the versatility of the technology. We present a comprehensive analysis of incorporating OSG into deep learning neural networks under various configurations. These configurations include learning an embedding in a straight-forward manner, a tailored loss designed to guide the solution of OSG, and an integrated model where the learning is performed through the OSG pipeline. With thorough evaluation and analysis, we assess the benefits and behavior of the various configurations, and show that our learnable OSG approach exhibits desirable behavior and enhanced performance compared to the state of the art.

NOH-NMS: Improving Pedestrian Detection by Nearby Objects Hallucination

  • Penghao Zhou
  • Chong Zhou
  • Pai Peng
  • Junlong Du
  • Xing Sun
  • Xiaowei Guo
  • Feiyue Huang

Greedy-NMS inherently raises a dilemma: a lower NMS threshold potentially leads to a lower recall rate, whereas a higher threshold introduces more false positives. This problem is more severe in pedestrian detection because the instance density varies more intensively. However, previous works on NMS either ignore or only vaguely consider the existence of nearby pedestrians. Thus, we propose Nearby Objects Hallucinator (NOH), which pinpoints the objects nearby each proposal with a Gaussian distribution, together with NOH-NMS, which dynamically eases the suppression for the space that might contain other objects with a high likelihood. Compared to Greedy-NMS, our method, as the state-of-the-art, improves by 3.9% AP, 5.1% Recall, and 0.8% MR^-2 on CrowdHuman to 89.0% AP, 92.9% Recall, and 43.9% MR^-2, respectively.
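The threshold dilemma of Greedy-NMS can be seen in a small sketch. This shows only the standard greedy baseline, not the proposed method; the boxes and scores are invented toy values, and the function names are chosen for this example.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, threshold):
    """Keep boxes in descending score order; drop any box whose IoU with an
    already kept box exceeds the threshold. Returns the kept indices."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= threshold for k in keep):
            keep.append(i)
    return keep


# Toy scene: boxes 0 and 1 are two different, overlapping pedestrians;
# box 2 is a near-duplicate detection of pedestrian 0.
boxes = [(0, 0, 10, 20), (4, 0, 14, 20), (0.5, 0, 10.5, 20)]
scores = [0.9, 0.8, 0.7]

low = greedy_nms(boxes, scores, 0.3)    # suppresses the second pedestrian
mid = greedy_nms(boxes, scores, 0.5)    # keeps both pedestrians, drops the duplicate
high = greedy_nms(boxes, scores, 0.95)  # keeps the duplicate as a false positive
```

In this toy case a single fixed threshold happens to work, but in crowded scenes the overlap between distinct pedestrians can exceed the overlap of duplicates, which is why a density-aware suppression rule is attractive.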

Dual-Gradients Localization Framework for Weakly Supervised Object Localization

  • Chuangchuang Tan
  • Guanghua Gu
  • Tao Ruan
  • Shikui Wei
  • Yao Zhao

Weakly Supervised Object Localization (WSOL) aims to learn object locations in a given image while only using image-level annotations. To highlight whole object regions instead of only the discriminative parts, previous works often attempt to train a classification model for both classification and localization tasks. However, it is hard to achieve a good tradeoff between the two tasks if only classification labels are employed for training a single classification model. In addition, recent works perform localization based only on the last convolutional layer of the classification model, ignoring the localization ability of other layers. In this work, we propose an offline framework, called the Dual-Gradients Localization (DGL) framework, to achieve precise localization on any convolutional layer of a classification model by exploiting two kinds of gradients. The DGL framework is built on two branches: 1) Pixel-level Class Selection, leveraging gradients of the target class to identify the correlation ratio of pixels to the target class within any convolutional feature maps, and 2) Class-aware Enhanced Maps, utilizing gradients of the classification loss function to mine entire target object regions without damaging classification performance. Extensive experiments on the public ILSVRC and CUB-200-2011 datasets show the effectiveness of the proposed DGL framework. In particular, our DGL obtains a new state-of-the-art Top-1 localization error of 43.55% on the ILSVRC benchmark.

DualLip: A System for Joint Lip Reading and Generation

  • Weicong Chen
  • Xu Tan
  • Yingce Xia
  • Tao Qin
  • Yu Wang
  • Tie-Yan Liu

Lip reading aims to recognize text from talking lip, while lip generation aims to synthesize talking lip according to text, which is a key component in talking face generation and is a dual task of lip reading. Both tasks require a large amount of paired lip video and text training data, and perform poorly in low-resource scenarios with limited paired training data. In this paper, we develop DualLip, a system that jointly improves lip reading and generation by leveraging the task duality and using unlabeled text and lip video data. The key ideas of the DualLip include: 1) Generate lip video from unlabeled text using a lip generation model, and use the pseudo data pairs to improve lip reading; 2) Generate text from unlabeled lip video using a lip reading model, and use the pseudo data pairs to improve lip generation. To leverage the benefit of DualLip on lip generation, we further extend DualLip to talking face generation with two additionally introduced components: lip to face generation and text to speech generation, which share the same duration for synchronization. Experiments on GRID and TCD-TIMIT datasets demonstrate the effectiveness of DualLip on improving lip reading, lip generation and talking face generation by utilizing unlabeled data, especially in low-resource scenarios. Specifically, on the GRID dataset, the lip generation model in our DualLip system trained with only 10% paired data and 90% unpaired data surpasses the performance of that trained with the whole paired data, and our lip reading model achieves 1.16% character error rate and 2.71% word error rate, outperforming the state-of-the-art models using the same amount of paired data.

Dual Attention GANs for Semantic Image Synthesis

  • Hao Tang
  • Song Bai
  • Nicu Sebe

In this paper, we focus on the semantic image synthesis task that aims at transferring semantic label maps to photo-realistic images. Existing methods lack effective semantic constraints to preserve the semantic information and ignore the structural correlations in both spatial and channel dimensions, leading to unsatisfactory blurry and artifact-prone results. To address these limitations, we propose a novel Dual Attention GAN (DAGAN) to synthesize photo-realistic and semantically-consistent images with fine details from the input layouts without imposing extra training overhead or modifying the network architectures of existing methods. We also propose two novel modules, i.e., position-wise Spatial Attention Module (SAM) and scale-wise Channel Attention Module (CAM), to capture semantic structure attention in spatial and channel dimensions, respectively. Specifically, SAM selectively correlates the pixels at each position by a spatial attention map, leading to pixels with the same semantic label being related to each other regardless of their spatial distances. Meanwhile, CAM selectively emphasizes the scale-wise features at each channel by a channel attention map, which integrates associated features among all channel maps regardless of their scales. We finally sum the outputs of SAM and CAM to further improve feature representation. Extensive experiments on four challenging datasets show that DAGAN achieves remarkably better results than state-of-the-art methods, while using fewer model parameters.

SimSwap: An Efficient Framework For High Fidelity Face Swapping

  • Renwang Chen
  • Xuanhong Chen
  • Bingbing Ni
  • Yanhao Ge

We propose an efficient framework, called Simple Swap (SimSwap), aiming for generalized and high fidelity face swapping. In contrast to previous approaches that either lack the ability to generalize to arbitrary identity or fail to preserve attributes like facial expression and gaze direction, our framework is capable of transferring the identity of an arbitrary source face into an arbitrary target face while preserving the attributes of the target face. We overcome the above defects in the following two ways. First, we present the ID Injection Module (IIM) which transfers the identity information of the source face into the target face at feature level. By using this module, we extend the architecture of an identity-specific face swapping algorithm to a framework for arbitrary face swapping. Second, we propose the Weak Feature Matching Loss which efficiently helps our framework to preserve the facial attributes in an implicit way. Extensive experiments on wild faces demonstrate that our SimSwap is able to achieve competitive identity performance while preserving attributes better than previous state-of-the-art methods.

Self-Mimic Learning for Small-scale Pedestrian Detection

  • Jialian Wu
  • Chunluan Zhou
  • Qian Zhang
  • Ming Yang
  • Junsong Yuan

Detecting small-scale pedestrians is one of the most challenging problems in pedestrian detection. Due to the lack of visual details, the representations of small-scale pedestrians tend to be weak to be distinguished from background clutters. In this paper, we conduct an in-depth analysis of the small-scale pedestrian detection problem, which reveals that weak representations of small-scale pedestrians are the main cause for a classifier to miss them. To address this issue, we propose a novel Self-Mimic Learning (SML) method to improve the detection performance on small-scale pedestrians. We enhance the representations of small-scale pedestrians by mimicking the rich representations from large-scale pedestrians. Specifically, we design a mimic loss to force the feature representations of small-scale pedestrians to approach those of large-scale pedestrians. The proposed SML is a general component that can be readily incorporated into both one-stage and two-stage detectors, with no additional network layers and incurring no extra computational cost during inference. Extensive experiments on both the CityPersons and Caltech datasets show that the detector trained with the mimic loss is significantly effective for small-scale pedestrian detection and achieves state-of-the-art results on CityPersons and Caltech, respectively.
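As a rough illustration of the mimicking idea, the sketch below pulls small-scale feature vectors toward the mean large-scale feature vector with a squared-error penalty. The exact form of the mimic loss is not given in the abstract, so this formulation, the function names, and the toy vectors are all assumptions.

```python
def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]


def mimic_loss(small_feats, large_feats):
    """Average squared Euclidean distance between each small-scale feature
    and the mean large-scale feature (an assumed form, for illustration)."""
    target = mean_vector(large_feats)
    total = 0.0
    for feat in small_feats:
        total += sum((x - t) ** 2 for x, t in zip(feat, target))
    return total / len(small_feats)


# Toy 2-D features: the large-scale instances average to [2.0, 2.0].
large = [[1.0, 1.0], [3.0, 3.0]]
loss_far = mimic_loss([[0.0, 0.0]], large)   # weak representation, far from target
loss_near = mimic_loss([[2.0, 2.0]], large)  # already matches the target
```

Because such a term acts only on training features, it is consistent with the abstract's statement that no extra layers or inference cost are introduced.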

Action2Motion: Conditioned Generation of 3D Human Motions

  • Chuan Guo
  • Xinxin Zuo
  • Sen Wang
  • Shihao Zou
  • Qingyao Sun
  • Annan Deng
  • Minglun Gong
  • Li Cheng

Action recognition is a relatively established task, where given an input sequence of human motion, the goal is to predict its action category. This paper, on the other hand, considers a relatively new problem, which could be thought of as an inverse of action recognition: given a prescribed action type, we aim to generate plausible human motion sequences in 3D. Importantly, the set of generated motions is expected to be diverse enough to explore the entire action-conditioned motion space; meanwhile, each sampled sequence should faithfully resemble natural human body articulation dynamics. Motivated by these objectives, we follow the physical laws of human kinematics by adopting Lie algebra theory to represent natural human motions; we also propose a temporal Variational Auto-Encoder (VAE) that encourages a diverse sampling of the motion space. A new 3D human motion dataset, HumanAct12, is also constructed. Empirical experiments over three distinct human motion datasets (including ours) demonstrate the effectiveness of our approach.

Skin Textural Generation via Blue-noise Gabor Filtering based Generative Adversarial Network

  • Hui Zhang
  • Chuan Wang
  • Nenglun Chen
  • Jue Wang
  • Wenping Wang

Facial skin texture synthesis is a fundamental problem in high-quality facial image generation and enhancement. The key challenge is how to effectively synthesize plausible textured noise for faces. With the development of CNNs and GANs, most works cast the problem as an image-to-image translation problem. However, these methods lack an explicit mechanism to simulate the facial noise pattern, so the generated images contain obvious artifacts. To this end, we propose a new facial noise generation method. Specifically, we utilize the properties of blue noise and Gabor filters to implicitly guide asymmetrical sampling of the face region in the form of a guidance map, on which non-uniform point sampling is conducted. We thus propose a novel Blue-Noise Gabor Module to produce a spatially variant noisy image. Our proposed two-branch framework combines facial identity enhancement with texture detail generation to jointly produce a high-quality facial image. Experimental results demonstrate the superiority of our method compared with the state-of-the-art, which enables the generation of high-quality facial texture based on a 2D image only, without the involvement of any 3D models.

A Slow-I-Fast-P Architecture for Compressed Video Action Recognition

  • Jiapeng Li
  • Ping Wei
  • Yongchi Zhang
  • Nanning Zheng

Compressed video action recognition has drawn growing attention for the storage and processing advantages of compressed videos over original raw videos. While the past few years have witnessed remarkable progress in this problem, most existing approaches rely on RGB frames from raw videos and require multi-step training. In this paper, we propose a novel Slow-I-Fast-P (SIFP) neural network model for compressed video action recognition. It consists of the slow I pathway receiving a sparse sampling I-frame clip and the fast P pathway receiving a dense sampling pseudo optical flow clip. An unsupervised estimation method and a new loss function are designed to generate pseudo optical flows in compressed videos. Our model eliminates the dependence on the traditional optical flows calculated from raw videos. The model is trained in an end-to-end way. The proposed method is evaluated on the challenging HMDB51 and UCF101 datasets. The extensive comparison results and ablation studies demonstrate the effectiveness and strength of the proposed method.

SESSION: Poster Session F1: Deep Learning for Multimedia

DMVOS: Discriminative Matching for Real-time Video Object Segmentation

  • Peisong Wen
  • Ruolin Yang
  • Qianqian Xu
  • Chen Qian
  • Qingming Huang
  • Runmin Cong
  • Jianlou Si

Though recent methods on semi-supervised video object segmentation (VOS) have achieved an appreciable improvement of segmentation accuracy, it is still hard to get an adequate speed-accuracy balance when facing real-world application scenarios. In this work, we propose Discriminative Matching for real-time Video Object Segmentation (DMVOS), a real-time VOS framework with high-accuracy to fill this gap. Based on the matching mechanism, our framework introduces discriminative information through the Isometric Correlation module and the Instance Center Offset module. Specifically, the isometric correlation module learns a pixel-level similarity map with semantic discriminability, and the instance center offset module is applied to exploit the instance-level spatial discriminability. Experiments on two benchmark datasets show that our model achieves state-of-the-art performance with extremely fast speed, for example, J&F of 87.8% on DAVIS-2016 validation set with 35 milliseconds per frame.

Multi-Group Multi-Attention: Towards Discriminative Spatiotemporal Representation

  • Zhensheng Shi
  • Liangjie Cao
  • Cheng Guan
  • Ju Liang
  • Qianqian Li
  • Zhaorui Gu
  • Haiyong Zheng
  • Bing Zheng

Learning spatiotemporal features is very effective but challenging for video understanding, especially action recognition. In this paper, we propose Multi-Group Multi-Attention, dubbed MGMA, which pays more attention to "where and when" the action happens, for learning discriminative spatiotemporal representations in videos. The contribution of MGMA is three-fold: First, by devising a new spatiotemporal separable attention mechanism, it can learn temporal attention and spatial attention separately for fine-grained spatiotemporal representation. Second, through a novel multi-group structure, it can better capture multi-attention rendered spatiotemporal features. Finally, our MGMA module is lightweight and flexible yet effective, so it can be easily embedded into any 3D Convolutional Neural Network (3D-CNN) architecture. We embed multiple MGMA modules into a 3D-CNN to train an end-to-end, RGB-only model and evaluate it on four popular benchmarks: UCF101, HMDB51, and Something-Something V1 and V2. Ablation studies and experimental comparisons demonstrate the strength of our MGMA, which achieves superior performance compared to state-of-the-art methods. Our code is available at

Vaccine-style-net: Point Cloud Completion in Implicit Continuous Function Space

  • Wei Yan
  • Ruonan Zhang
  • Jing Wang
  • Shan Liu
  • Thomas H. Li
  • Ge Li

Though recent advances in point cloud completion have shown exciting promise with learning-based methods, most of them still generate coarse point clouds with a fixed number of points (e.g. 2048). In this paper, we propose Vaccine-Style-Net, a new point cloud completion method that can produce high-resolution 3D shapes with complete, smooth surfaces. Vaccine-Style-Net performs point cloud completion in the function space of 3D surfaces, representing the 3D surface as a continuous decision boundary function. Meanwhile, a reinforcement learning agent is embedded to deduce the complete 3D geometry from the incomplete point cloud. In contrast to existing approaches, the completed 3D shapes produced by our method can be of any resolution without an excessive memory footprint. Moreover, to increase the diversity and adaptability of the method, we introduce two types of free-form masks to simulate various corrupted inputs, as well as a mask dataset called onion-peeling-mask (OPM). Finally, we discuss the limitations of existing evaluation metrics for shape completion tasks and explore a novel metric to supplement the existing ones. Experiments demonstrate that our method not only achieves competitive results qualitatively and quantitatively but can also produce a continuous 3D shape at any resolution.

Adaptive Wasserstein Hourglass for Weakly Supervised RGB 3D Hand Pose Estimation

  • Yumeng Zhang
  • Li Chen
  • Yufeng Liu
  • Wen Zheng
  • Junhai Yong

The deficiency of labeled training data is one of the bottlenecks in 3D hand pose estimation from monocular RGB images. Synthetic datasets have a large number of images with precise annotations, but their obvious difference with real-world datasets limits the generalization ability. Few efforts have been made to bridge the gap between the two domains in terms of their large differences. In this paper, we propose a domain adaptation method called Adaptive Wasserstein Hourglass for weakly-supervised 3D hand pose estimation to close the large gap between synthetic and real-world datasets flexibly. Adaptive Wasserstein Hourglass utilizes a feature similarity metric to identify the differences and explore the common features (e.g., hand structure) of the two datasets. Common features are drawn close adaptively during the training, whereas domain-specific features retain the differences. Learning common features helps the network in focusing on pose-related information, whereas maintaining domain-specific features reduces the optimization difficulty when closing the big gap between two domains. Extensive evaluations on two benchmark datasets demonstrate that our method succeeds in distinguishing different features and achieves optimal results when compared with state-of-the-art 3D pose estimation approaches and domain adaptation methods.

Weakly Supervised Segmentation with Maximum Bipartite Graph Matching

  • Weide Liu
  • Chi Zhang
  • Guosheng Lin
  • Tzu-Yi HUNG
  • Chunyan Miao

In the weakly supervised segmentation task with only image-level labels, a common step in many existing algorithms is first to locate the image regions corresponding to each existing class with the Class Activation Maps (CAMs), and then generate the pseudo ground truth masks based on the CAMs to train a segmentation network in the fully supervised manner. The quality of the CAMs has a crucial impact on the performance of the segmentation model. We propose to improve the CAMs from a novel graph perspective. We model paired images containing common classes with a bipartite graph and use the maximum matching algorithm to locate corresponding areas in two images. The matching areas are then used to refine the predicted object regions in the CAMs. The experiments on Pascal VOC 2012 dataset show that our network can effectively boost the performance of the baseline model and achieves new state-of-the-art performance.
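The maximum matching step can be sketched with the classic augmenting-path algorithm (Kuhn's algorithm). How the authors build the graph from image regions is not specified in the abstract; here the left and right vertices stand for hypothetical regions of two images, and an edge means two regions are assumed to be similar enough to correspond.

```python
def maximum_bipartite_matching(adj, n_right):
    """Augmenting-path maximum matching.
    adj[u] lists the right-side vertices adjacent to left vertex u.
    Returns (matching size, match_right), where match_right[v] is the left
    vertex matched to right vertex v, or -1 if v is unmatched."""
    match_right = [-1] * n_right

    def try_augment(u, visited):
        for v in adj[u]:
            if v in visited:
                continue
            visited.add(v)
            # Use v if it is free, or if its current partner can be re-routed.
            if match_right[v] == -1 or try_augment(match_right[v], visited):
                match_right[v] = u
                return True
        return False

    size = 0
    for u in range(len(adj)):
        if try_augment(u, set()):
            size += 1
    return size, match_right


# Hypothetical similarity graph: region u of image A -> candidate regions of image B.
adj = [[0, 1], [0], [1, 2]]
size, match_right = maximum_bipartite_matching(adj, 3)
```

Region 0 is first paired with candidate 0 but is re-routed to candidate 1 so that region 1, which has only one candidate, can also be matched.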

Recognizing Camera Wearer from Hand Gestures in Egocentric Videos:

  • Daksh Thapar
  • Aditya Nigam
  • Chetan Arora

Wearable egocentric cameras are typically harnessed to a wearer's head, giving them the unique advantage of capturing the wearer's point of view. Hoshen and Peleg have shown that egocentric cameras indirectly capture the wearer's gait, which can be used to identify a wearer based on their egocentric videos. The authors have shown a wearer recognition accuracy of up to 77% over 32 subjects. However, an important limitation of their work is that such gait features can be extracted only from walking sequences of a wearer. In this work, we take the privacy threat a notch higher and show that even the wearer's hand gestures, as seen through an egocentric video, leak the wearer's identity. We have designed a model to extract and match hand gesture signatures from egocentric videos. We demonstrate the threat on the EPIC kitchen dataset containing 55 hours of egocentric videos acquired from 32 subjects doing various activities. We show that: (1) Our model can recognize a wearer with an accuracy of up to 73% based on the same activity, i.e., the model has seen a 'cut' activity by a wearer in the train set, and recognizes the wearer based on another 'cut' activity by him/her while testing. (2) The hand gesture signatures transfer across activities, i.e., even if our model does not see the 'cut' activity of a wearer at train time, but sees other activities such as 'wash', 'mix' etc., the model can still recognize a wearer with an accuracy of up to 60%, by matching hand gesture signatures of 'cut' at test time with train time signatures of 'wash' or 'mix'. (3) The hand gesture features even transfer across subjects, i.e., even if the model has not seen any activity by some subject, one can still verify a wearer (open-set), and predict that the same wearer has performed both activities with an Equal Error Rate of 15.21%. The code and trained models are available at

Prototype-Matching Graph Network for Heterogeneous Domain Adaptation

  • Zijian Wang
  • Yadan Luo
  • Zi Huang
  • Mahsa Baktashmotlagh

Even though the multimedia data is ubiquitous on the web, the scarcity of the annotated data and variety of data modalities hinder their usage by multimedia applications. Heterogeneous domain adaptation (HDA) has therefore arisen to address such limitations by facilitating the knowledge transfer between heterogeneous domains. Existing HDA methods only focus on aligning the cross-domain feature distributions and ignore the importance of maximizing the margin among different classes, which may lead to a sub-optimal classification performance. To tackle this problem, in this paper, we propose the Prototype-Matching Graph Network (PMGN), which gradually explores the domain-invariant class prototype representations. Specifically, we build an end-to-end Graph Prototypical Network, which computes the class prototypes through multiple layers of edge learning, node aggregation, and discrepancy minimization. Our framework utilizes the Swap training strategy to provide adequate supervision for training the edge learning component. Moreover, the proposed PMGN can be equipped with the clustering module that utilises the KL-divergence as a distance metric to reduce the distribution difference between the source and target data. Extensive experiments on three HDA tasks (i.e. object recognition, text-to-image classification, and text categorization) demonstrate the superiority of our approach over the state-of-the-art HDA methods.

Towards Lighter and Faster: Learning Wavelets Progressively for Image Super-Resolution

  • Huanrong Zhang
  • Zhi Jin
  • Xiaojun Tan
  • Xiying Li

Due to the significant development of deep learning (DL) techniques, recent advances in the super-resolution (SR) field have achieved great performance. In the pursuit of better performance, recently proposed networks tend to be deeper and heavier, which limits the application of SR algorithms on resource-constrained devices. Some approaches rely on recurrent/recursive learning to reduce the number of network parameters; however, they ignore the resulting long inference time, since the more recurrences/recursions are involved, the longer the inference time the network needs. To address this trade-off between reconstruction performance, the number of network parameters, and inference time, we propose a lightweight and fast network (WSR) that learns wavelet coefficients of the target image progressively for single image super-resolution. More specifically, the network comprises two main branches. One is used for predicting the second-level low-frequency wavelet coefficients, and the other is designed in a recurrent way for predicting the remaining wavelet coefficients at the first and second levels. Finally, an inverse wavelet transformation is adopted to reconstruct the SR images from these coefficients. In addition, we propose a deformable convolution kernel (side window) to construct the side-information multi-distillation block (S-IMDB), which is the basic unit of the recurrent blocks (RBs). We train the WSR with loss constraints in the wavelet and spatial domains. Comprehensive experiments demonstrate that our WSR achieves a better trade-off than most of the state-of-the-art approaches. Code is available at
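The role of the two wavelet levels and the final inverse transform can be sketched with a plain Haar transform. This is a simplification under stated assumptions: it uses a 1-D signal instead of a 2-D image, the Haar basis is chosen only for brevity (the abstract does not name the wavelet), and no network is involved; the coefficients a network would predict are simply computed from the signal.

```python
import math

SQRT2 = math.sqrt(2.0)


def haar_step(signal):
    """One Haar analysis level: returns (low-frequency, high-frequency) halves.
    The signal length must be even."""
    approx = [(signal[i] + signal[i + 1]) / SQRT2 for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / SQRT2 for i in range(0, len(signal), 2)]
    return approx, detail


def haar_inverse_step(approx, detail):
    """Invert one Haar level by interleaving the reconstructed sample pairs."""
    out = []
    for a, d in zip(approx, detail):
        out.append((a + d) / SQRT2)
        out.append((a - d) / SQRT2)
    return out


def decompose(signal, levels=2):
    """Returns the coarsest low-frequency band and the detail bands,
    ordered from the first (finest) level to the last (coarsest)."""
    details = []
    approx = list(signal)
    for _ in range(levels):
        approx, detail = haar_step(approx)
        details.append(detail)
    return approx, details


def reconstruct(approx, details):
    """Inverse transform: apply the detail bands from coarsest to finest."""
    for detail in reversed(details):
        approx = haar_inverse_step(approx, detail)
    return approx


signal = [4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0]
low2, details = decompose(signal, levels=2)  # second-level low band + two detail bands
restored = reconstruct(low2, details)
```

The second-level low-frequency band has a quarter of the original length, which suggests why predicting it separately from the detail bands can keep the corresponding branch small.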

Spatio-Temporal Inception Graph Convolutional Networks for Skeleton-Based Action Recognition

  • Zhen Huang
  • Xu Shen
  • Xinmei Tian
  • Houqiang Li
  • Jianqiang Huang
  • Xian-Sheng Hua

Skeleton-based human action recognition has attracted much attention with the prevalence of accessible depth sensors. Recently, graph convolutional networks (GCNs) have been widely used for this task due to their powerful capability to model graph data. The topology of the adjacency graph is a key factor for modeling the correlations of the input skeletons. Thus, previous methods mainly focus on the design/learning of the graph topology. But once the topology is learned, only a single-scale feature and one transformation exist in each layer of the networks. Many insights, such as multi-scale information and multiple sets of transformations, that have been proven to be very effective in convolutional neural networks (CNNs), have not been investigated in GCNs. The reason is that, due to the gap between graph-structured skeleton data and conventional image/video data, it is very challenging to embed these insights into GCNs. To overcome this gap, we reinvent the split-transform-merge strategy in GCNs for skeleton sequence processing. Specifically, we design a simple and highly modularized graph convolutional network architecture for skeleton-based action recognition. Our network is constructed by repeating a building block that aggregates multi-granularity information from both the spatial and temporal paths. Extensive experiments demonstrate that our network outperforms state-of-the-art methods by a significant margin with only 1/5 of the parameters and 1/10 of the FLOPs.

Dynamic Future Net: Diversified Human Motion Generation

  • Wenheng Chen
  • He Wang
  • Yi Yuan
  • Tianjia Shao
  • Kun Zhou

Human motion modelling is crucial in many areas such as computer graphics, vision and virtual reality. Acquiring high-quality skeletal motions is difficult due to the need for specialized equipment and laborious manual post-processing, which necessitates maximizing the use of existing data to synthesize new data. However, this is a challenge due to the intrinsic stochasticity of human motion dynamics, manifested in the short and long terms. In the short term, there is strong randomness within a couple of frames, with one frame followed by multiple possible frames leading to different motion styles; while in the long term, there are non-deterministic action transitions. In this paper, we present Dynamic Future Net, a new deep learning model where we explicitly focus on the aforementioned motion stochasticity by constructing a generative model with non-trivial modelling capacity in temporal stochasticity. Given limited amounts of data, our model can generate a large number of high-quality motions with arbitrary duration, and visually-convincing variations in both space and time. We evaluate our model on a wide range of motions and compare it with the state-of-the-art methods. Both qualitative and quantitative results show the superiority of our method, for its robustness, versatility and high quality.

ATF: Towards Robust Face Alignment via Leveraging Similarity and Diversity across Different Datasets

  • Xing Lan
  • Qinghao Hu
  • Fangzhou Xiong
  • Cong Leng
  • Jian Cheng

Face alignment is an important task in the field of multi-media. Together with the impressive progress of algorithms, various benchmark datasets have been released in recent years. Intuitively, it is meaningful to integrate multiple labeled datasets with different annotations to achieve higher performance on a target landmark detector. Although numerous efforts have been made in joint usage, three shortcomings remain in recent works: additional computation, limitations of the markup scheme, and limited support for the regression method. To address the above problems, we propose a novel Alternating Training Framework (ATF), which leverages similarity and diversity across multi-media sources for a more robust detector. Our framework mainly contains two sub-modules: Alternating Training with Decreasing Proportions (ATDP) and Mixed Branch Loss ($\mathcal{L}_{MB}$). In particular, ATDP trains multiple datasets simultaneously to take advantage of the diversity between them, while $\mathcal{L}_{MB}$ utilizes similar landmark pairs to constrain different branches of corresponding datasets. Extensive experiments on various benchmarks show the effectiveness of our framework, and ATF is feasible for both heatmap-based networks and direct coordinate regression. Specifically, the mean error reaches 3.17 in the experiment on 300W leveraging WFLW, which significantly outperforms state-of-the-art methods. In both an ordinary convolutional network (OCN) and HRNET, ATF achieves up to 9.96% relative improvement. Our source codes are made publicly available at

Dual Gaussian-based Variational Subspace Disentanglement for Visible-Infrared Person Re-Identification

  • Nan Pu
  • Wei Chen
  • Yu Liu
  • Erwin M. Bakker
  • Michael S. Lew

Visible-infrared person re-identification (VI-ReID) is a challenging and essential task in night-time intelligent surveillance systems. In addition to the intra-modality variance that RGB-RGB person re-identification mainly overcomes, VI-ReID suffers from additional inter-modality variance caused by the inherent heterogeneous gap. To solve the problem, we present a carefully designed dual Gaussian-based variational auto-encoder (DG-VAE), which disentangles an identity-discriminable and an identity-ambiguous cross-modality feature subspace, following a mixture-of-Gaussians (MoG) prior and a standard Gaussian distribution prior, respectively. Disentangling cross-modality identity-discriminable features leads to more robust retrieval for VI-ReID. To achieve efficient optimization like a conventional VAE, we theoretically derive two variational inference terms for the MoG prior under the supervised setting, which not only restrict the identity-discriminable subspace so that the model explicitly handles the cross-modality intra-identity variance, but also enable the MoG distribution to avoid posterior collapse. Furthermore, we propose a triplet swap reconstruction (TSR) strategy to promote the above disentangling process. Extensive experiments demonstrate that our method outperforms state-of-the-art methods on two VI-ReID datasets. Codes will be available at

Attention Based Dual Branches Fingertip Detection Network and Virtual Key System

  • Chong Mou
  • Xin Zhang

Gestures and fingertips are becoming more and more important mediums for human-computer interaction (HCI). Therefore, algorithms for gesture recognition and fingertip detection have been extensively investigated. However, problems remain in how to balance speed and accuracy, and how to deal with complex interaction environments. To address these problems, this paper proposes an attention-based dual-branch network that can efficiently fulfill both fingertip detection and gesture recognition tasks. In order to deal with complex interaction environments, we combine both channel-wise attention and spatial-wise attention in the fingertip detection model. Extensive experiments demonstrate that our novel model is both effective and efficient. In the experiments, our proposed model achieves an average fingertip detection error of around 2.8 pixels in 640×480 video frames, and the average recognition accuracy among eight gestures reaches 99%. Moreover, the average forward time is about 8 ms. Due to the light-weight design, this model can also achieve high-efficiency performance on CPU. In addition, we design a virtual key system based on our proposed model, which allows users to complete the "clicking" operation naturally in a virtual environment. Our proposed system performs well with a single normal RGB camera without any pre-processing (e.g., image segmentation or contour extraction), which can significantly reduce the complexity of the interaction system.

Action Completeness Modeling with Background Aware Networks for Weakly-Supervised Temporal Action Localization

  • Md Moniruzzaman
  • Zhaozheng Yin
  • Zhihai He
  • Ruwen Qin
  • Ming C. Leu

Fully-supervised methods for temporal action localization from untrimmed videos have achieved impressive results. Yet results remain unsatisfactory for weakly-supervised temporal action localization, where only video-level action labels are given without timestamp annotations of when the actions occur. The main reason is that weakly-supervised networks focus only on the highly discriminative frames, while there are ambiguous frames in both the background and action classes. The ambiguous frames in the background class are very similar to real actions, so they may be treated as target actions and result in false positives. On the other hand, the ambiguous frames in the action class, which possibly contain action instances, are prone to become false negatives of the weakly-supervised networks and result in coarse localization. To solve these problems, we introduce a novel weakly-supervised Action Completeness Modeling with Background Aware Networks (ACM-BANets). Our Background Aware Network (BANet) contains a weight-sharing two-branch architecture, with an action guided Background aware Temporal Attention Module (B-TAM) and an asymmetrical training strategy, to suppress both highly discriminative and ambiguous background frames and remove the false positives. Our action completeness modeling contains multiple BANets, and the BANets are forced to discover different but complementary action instances to completely localize the action instances in both highly discriminative and ambiguous action frames. In the i-th iteration, the i-th BANet discovers the discriminative features, which are then erased from the feature map. The partially-erased feature map is fed into the (i+1)-th BANet of the next iteration to force this BANet to discover discriminative features different from those of the i-th BANet. Evaluated on two challenging untrimmed video datasets, THUMOS14 and ActivityNet1.3, our approach outperforms all the current weakly-supervised methods for temporal action localization.
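
The discover-then-erase loop described in this abstract can be illustrated with a toy sketch. It is not the authors' implementation: each stage here is a hypothetical stand-in for a BANet that simply responds to the frames whose score is within a fraction of the current peak, and the names `iterative_erasing`, `frame_scores` and `threshold` are assumptions made for illustration.

```python
def iterative_erasing(frame_scores, num_stages, threshold):
    """Toy discover-then-erase loop over per-frame scores.

    Assumption: a stage "discovers" every frame whose remaining score is at
    least `threshold` times the current peak. Discovered frames are erased
    (set to zero) so the next stage must look at different frames.
    """
    remaining = list(frame_scores)
    discovered = []
    if not remaining:
        return discovered
    for _ in range(num_stages):
        peak = max(remaining)
        if peak <= 0:
            break  # nothing left to discover
        hits = [t for t, s in enumerate(remaining) if s >= threshold * peak]
        discovered.append(hits)
        for t in hits:
            remaining[t] = 0.0  # erase what this stage already found
    return discovered


stages = iterative_erasing([0.1, 0.9, 0.8, 0.4, 0.35, 0.05], num_stages=2, threshold=0.8)
print(stages)
```

In this toy run the first stage picks the two most discriminative frames and the second stage, after erasure, picks the weaker frames that the first stage ignored.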

SESSION: Poster Session G1: Deep Learning for Multimedia

Adversarial Knowledge Transfer from Unlabeled Data

  • Akash Gupta
  • Rameswar Panda
  • Sujoy Paul
  • Jianming Zhang
  • Amit K. Roy-Chowdhury

While machine learning approaches to visual recognition offer great promise, most of the existing methods rely heavily on the availability of large quantities of labeled training data. However, in the vast majority of real-world settings, manually collecting such large labeled datasets is infeasible due to the cost of labeling data or the paucity of data in a given domain. In this paper, we present a novel Adversarial Knowledge Transfer (AKT) framework for transferring knowledge from internet-scale unlabeled data to improve the performance of a classifier on a given visual recognition task. The proposed adversarial learning framework aligns the feature space of the unlabeled source data with the labeled target data such that the target classifier can be used to predict pseudo labels on the source data. An important novel aspect of our method is that the unlabeled source data can be of different classes from those of the labeled target data, and there is no need to define a separate pretext task, unlike some existing approaches. Extensive experiments well demonstrate that models learned using our approach hold a lot of promise across a variety of visual recognition tasks on multiple standard datasets. Project page is at

Task Decoupled Knowledge Distillation For Lightweight Face Detectors

  • Xiaoqing Liang
  • Xu Zhao
  • Chaoyang Zhao
  • Nanfei Jiang
  • Ming Tang
  • Jinqiao Wang

Face detection is a hot topic in computer vision. Face detection methods usually consist of two subtasks, i.e. the classification subtask and the regression subtask, which are trained with different samples. However, current face detection knowledge distillation methods usually couple the two subtasks, and use the same set of samples in the distillation task. In this paper, we propose a task decoupled knowledge distillation method, which decouples the detection distillation task into two subtasks and uses different samples in distilling the features of different subtasks. We first propose a feature decoupling method to decouple the classification features and the regression features, without introducing any extra calculations at inference time. Specifically, we generate the corresponding features by adding task-specific convolutions in the teacher network and adding adaption convolutions on the feature maps of the student network. Then we select different samples for different subtasks to imitate. Moreover, we also propose an effective probability distillation method to jointly boost the accuracy of the student network. We apply our distillation method on a lightweight face detector, EagleEye. Experimental results show that the proposed method effectively improves the student detector's accuracy by 5.1%, 5.1%, and 2.8% AP in Easy, Medium, Hard subsets respectively.

Self-supervised Video Representation Learning Using Inter-intra Contrastive Framework

  • Li Tao
  • Xueting Wang
  • Toshihiko Yamasaki

We propose a self-supervised method to learn feature representations from videos. A standard approach in traditional self-supervised methods uses positive-negative data pairs to train with a contrastive learning strategy. In such a case, different modalities of the same video are treated as positives and video clips from a different video are treated as negatives. Because spatio-temporal information is important for video representation, we extend the negative samples by introducing intra-negative samples, which are transformed from the same anchor video by breaking temporal relations in video clips. With the proposed Inter-Intra Contrastive (IIC) framework, we can train spatio-temporal convolutional networks to learn video representations. There are many flexible options in our IIC framework and we conduct experiments using several different configurations. Evaluations are conducted on video retrieval and video recognition tasks using the learned video representation. Our proposed IIC outperforms current state-of-the-art results by a large margin, with improvements of 16.7 and 9.5 percentage points in top-1 accuracy on the UCF101 and HMDB51 datasets for video retrieval, respectively. For video recognition, improvements can also be obtained on these two benchmark datasets.
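
As a rough illustration of an intra-negative sample, the sketch below breaks the temporal order of a clip by shuffling its frames. Frame shuffling is only one possible way to break temporal relations and is assumed here for illustration; the function name `make_intra_negative` and the representation of a clip as a list of frames are likewise hypothetical.

```python
import random


def make_intra_negative(clip, rng):
    """Return the frames of `clip` in a shuffled, non-original order.

    The clip content is unchanged; only the temporal order is broken, so the
    result can serve as a negative drawn from the same anchor video.
    """
    order = list(range(len(clip)))
    if len(order) < 2:
        raise ValueError("need at least two frames to break temporal order")
    identity = list(order)
    while order == identity:  # reshuffle if the order happens to be unchanged
        rng.shuffle(order)
    return [clip[i] for i in order]


anchor = list(range(16))  # stand-in for 16 frames of one video clip
negative = make_intra_negative(anchor, random.Random(0))
print(negative)
```

Shuffling indices rather than frames keeps the loop safe even when several frames are identical.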

Memory Recursive Network for Single Image Super-Resolution

  • Jie Liu
  • Minqiang Zou
  • Jie Tang
  • Gangshan Wu

Recently, extensive works based on convolutional neural network (CNN) have shown great success in single image super-resolution (SISR). In order to improve the SISR performance while reducing the number of model parameters, some methods adopt multiple recursive layers to enhance the intermediate features. However, in the recursive process, these methods only use the output features of current stage as the input of the next stage and neglect the output features of historical stages, which degrades the performance of the recursive blocks. The long-term dependencies can only be learned implicitly during the recursive processes. To address these issues, we propose the memory recursive network (MRNet) to make full use of the output features at each stage. The proposed MRNet utilizes a memory recursive module (MRM) to generate features for each recursive stage, and then these features are fused by our proposed ShuffleConv block. Specifically, MRM adopts a memory updater block to explicitly model the long-term dependencies between the output features of historical recursive stages. The output features from the memory updater will be used as the input of the next recursive stage and will be continuously updated during the recursions. To reduce the number of parameters and ease the training difficulty, we introduce a ShuffleConv module to fuse the features from different recursive stages, which is much more effective than using plain convolutional combinations. Comprehensive experiments demonstrate that the proposed MRNet achieves state-of-the-art SISR performance while using much fewer parameters.

Scale-aware Progressive Optimization Network

  • Ying Chen
  • Lifeng Huang
  • Chengying Gao
  • Ning Liu

Crowd counting has attracted increasing attention due to its wide application prospects. One of the most essential challenges in this domain is large scale variation, which impacts the accuracy of density estimation. To this end, we propose a scale-aware progressive optimization network (SPO-Net) for crowd counting, which trains a scale adaptive network to achieve high-quality density map estimation and overcome the variable scale dilemma in highly congested scenes. Concretely, the first phase of SPO-Net, the band-pass stage, mainly concentrates on preprocessing the input image and fusing both high-level semantic information and low-level spatial information from separated multi-layer features. The second phase of SPO-Net, the rolling guidance stage, aims to learn a scale-adapted network using multi-scale features and a rolling training manner. To better learn the local correlation of multi-size regions and reduce redundant calculations, we introduce a progressive optimization strategy. Extensive experiments on three challenging crowd counting datasets not only demonstrate the efficacy of each part in SPO-Net, but also suggest the superiority of our proposed method compared with the state-of-the-art approaches.

Resource Efficient Domain Adaptation

  • Junguang Jiang
  • Ximei Wang
  • Mingsheng Long
  • Jianmin Wang

Domain Adaptation (DA) aims at transferring knowledge from a labeled source domain to an unlabeled target domain. While remarkable advances have been witnessed recently, the power of DA methods still heavily depends on the network depth, especially when the domain discrepancy is large, posing an unprecedented challenge to DA in low-resource scenarios where fast and adaptive inference is required. How to bridge transferability and resource-efficient inference in DA becomes an important problem. In this paper, we propose Resource Efficient Domain Adaptation (REDA), a general framework that can adaptively adjust computation resources across 'easier' and 'harder' inputs. Based on existing multi-exit architectures, REDA has two novel designs: 1) Transferable distillation to distill the transferability of top classifier into the early exits; 2) Consistency weighting to control the distillation degree via prediction consistency. As a general method, REDA can be easily applied with a variety of DA methods. Empirical results and analyses justify that REDA can substantially improve the accuracy and accelerate the inference under domain shift and low resource.
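
To make the idea of spending less computation on 'easier' inputs concrete, here is a minimal sketch of inference in a generic multi-exit model. The confidence-threshold stopping rule is a common choice for such architectures and is assumed here; the abstract does not specify REDA's exit criterion, and `early_exit_predict` and the toy exit functions are hypothetical.

```python
def early_exit_predict(x, exits, threshold=0.9):
    """Evaluate exits from shallow to deep and stop at the first confident one.

    `exits` is a list of callables, each mapping an input to a list of class
    probabilities. Returns (predicted_label, number_of_exits_evaluated).
    """
    for depth, exit_fn in enumerate(exits, start=1):
        probs = exit_fn(x)
        label = max(range(len(probs)), key=probs.__getitem__)
        # Stop when confident enough, or when the final (top) exit is reached.
        if probs[label] >= threshold or depth == len(exits):
            return label, depth
    raise ValueError("exits must not be empty")


# Toy exits: the shallow one is unsure, the deep one is confident.
toy_exits = [lambda x: [0.6, 0.4], lambda x: [0.05, 0.95]]
print(early_exit_predict(None, toy_exits))                 # uses both exits
print(early_exit_predict(None, toy_exits, threshold=0.5))  # stops at the first exit
```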

MGAAttack: Toward More Query-efficient Black-box Attack by Microbial Genetic Algorithm

  • Lina Wang
  • Kang Yang
  • Wenqi Wang
  • Run Wang
  • Aoshuang Ye

Recent studies have shown that deep neural networks (DNNs) are susceptible to adversarial attacks even in black-box settings. However, previous studies create black-box adversarial examples by merely solving the traditional continuous problem, which suffers from query efficiency issues. To address the efficiency of querying in black-box attacks, we propose a novel attack, called MGAAttack, which is a query-efficient and gradient-free black-box attack that requires no knowledge of the target model. In our approach, we leverage the advantages of both transfer-based and score-based methods, two typical techniques in black-box attacks, and solve a discretized problem by using a simple yet effective microbial genetic algorithm (MGA). Experimental results show that our approach dramatically reduces the number of queries on CIFAR-10 and ImageNet and significantly outperforms previous work. In the untargeted attack, we can attack a VGG19 classifier with only 16 queries and achieve an attack success rate of more than 99.90% on ImageNet. Our code is available at
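
For readers unfamiliar with the microbial genetic algorithm, the sketch below shows the generic tournament scheme on a toy bit-string problem: two individuals are compared, and the loser is partly overwritten by the winner and then mutated. It is not the attack itself; the fitness function, parameter values and the name `microbial_ga` are assumptions for illustration. In a black-box attack setting each fitness evaluation would correspond to a query to the target model.

```python
import random


def microbial_ga(fitness, n_genes, pop_size=20, tournaments=500,
                 infect_rate=0.5, mutate_rate=0.05, seed=0):
    """Generic microbial GA over bit strings; returns the fittest individual."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_genes)] for _ in range(pop_size)]
    for _ in range(tournaments):
        a, b = rng.sample(range(pop_size), 2)
        winner, loser = (a, b) if fitness(pop[a]) >= fitness(pop[b]) else (b, a)
        for g in range(n_genes):
            if rng.random() < infect_rate:  # loser is "infected" by the winner
                pop[loser][g] = pop[winner][g]
            if rng.random() < mutate_rate:  # then mutated
                pop[loser][g] ^= 1
    return max(pop, key=fitness)


# Toy objective (assumption): maximise the number of ones in a 16-bit string.
best = microbial_ga(sum, n_genes=16)
print(best, sum(best))
```

Because the winner of each tournament is never modified, the best fitness in the population cannot decrease from one tournament to the next.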

A Novel Graph-TCN with a Graph Structured Representation for Micro-expression Recognition

  • Ling Lei
  • Jianfeng Li
  • Tong Chen
  • Shigang Li

Recognition of facial micro-expressions (MEs) has attracted much attention recently. However, because MEs are spontaneous, subtle and transient, recognizing MEs is a challenging task. In this paper, first, we use transfer learning to apply learning-based video motion magnification to magnify MEs and extract the shape information, aiming to solve the problem of the low muscle movement intensity of MEs. Then, we design a novel graph-temporal convolutional network (Graph-TCN) to extract the features of the local muscle movements of MEs. First, we define a graph structure based on the facial landmarks. Second, the Graph-TCN deals with the graph structure in dual channels with a TCN block. One channel is for node feature extraction, and the other one is for edge feature extraction. Last, the edges and nodes are fused for classification. The Graph-TCN can automatically train the graph representation to distinguish MEs without using a hand-crafted graph representation. To the best of our knowledge, we are the first to use the learning-based video motion magnification method to extract the features of shape representations from the intermediate layer while magnifying MEs. Furthermore, we are also the first to use deep learning to automatically train the graph representation for MEs.

Masked Face Recognition with Generative Data Augmentation and Domain Constrained Ranking

  • Mengyue Geng
  • Peixi Peng
  • Yangru Huang
  • Yonghong Tian

Masked face recognition (MFR) aims to match a masked face with its corresponding full face, which is an important task especially during the global outbreak of COVID-19. However, most existing face recognition models generalize poorly in this case, and it is hard to train a robust MFR model for two main reasons: 1) the absence of large scale training data as well as ground truth testing data, and 2) the presence of large intra-class variation between masked faces and full faces. To address the first challenge, this paper first contributes a new dataset denoted as MFSR, which consists of two parts. The first part contains 9,742 masked face images with mask region segmentation annotation. The second part contains 11,615 images of 1,004 identities, and each identity has masked and full face images with various orientations, lighting conditions and mask types. However, it is still not enough for training MFR models with deep learning. To obtain sufficient training data, based on the MFSR, we introduce a novel Identity Aware Mask GAN (IAMGAN) with a segmentation guided multi-level identity preserve module to generate synthetic masked face images from the full face images. In addition, to tackle the second challenge, a Domain Constrained Ranking (DCR) loss is proposed by adopting a center-based cross-domain ranking strategy. For each identity, two centers are designed which correspond to the full face images and the masked face images respectively. The DCR forces the features of masked faces to get closer to their corresponding full face center and vice versa. Experimental results on the MFSR dataset demonstrate the effectiveness of the proposed approaches.

Occlusion Detection for Automatic Video Editing

  • Junhua Liao
  • Haihan Duan
  • Xin Li
  • Haoran Xu
  • Yanbing Yang
  • Wei Cai
  • Yanru Chen
  • Liangyin Chen

Videos have become preferred over images in recent years. However, during the recording of videos, the cameras are inevitably occluded by objects or persons that pass in front of them, which greatly increases the workload of video editors who must search out such occlusions. In this paper, to relieve the burden on video editors, a frame-level video occlusion detection method is proposed, which is a fundamental component of automatic video editing. The proposed method enhances the extraction of spatial-temporal information based on C3D while using only around half the number of parameters, with an occlusion correction algorithm for correcting the prediction results. In addition, a novel loss function is proposed to better extract the characterization of occlusion and improve the detection performance. For performance evaluation, this paper builds a new large scale dataset, containing 1,000 video segments from seven different real-world scenarios, which could be available at: All occlusions in video segments are annotated frame by frame with bounding-boxes so that the dataset could be utilized in both frame-level occlusion detection and precise occlusion location. The experimental results illustrate that the proposed method could achieve good performance on video occlusion detection compared with the state-of-the-art approaches. To the best of our knowledge, this is the first study which focuses on occlusion detection for automatic video editing.

Cartoon Face Recognition: A Benchmark Dataset

  • Yi Zheng
  • Yifan Zhao
  • Mengyuan Ren
  • He Yan
  • Xiangju Lu
  • Junhui Liu
  • Jia Li

Recent years have witnessed increasing attention to cartoon media, powered by the strong demands of industrial applications. As the first step to understand this medium, cartoon face recognition is a crucial but less-explored task with few datasets proposed. In this work, we first present a new challenging benchmark dataset, consisting of 389,678 images of 5,013 cartoon characters annotated with identity, bounding box, pose, and other auxiliary attributes. The dataset, named iCartoonFace, is currently the largest-scale, high-quality, richly annotated dataset in the field of image recognition, and spans multiple occurrences, including near-duplications, occlusions, and appearance changes. In addition, we provide two types of annotations for cartoon media, i.e., face recognition, and face detection, with the help of a semi-automatic labeling algorithm. To further investigate this challenging dataset, we propose a multi-task domain adaptation approach that jointly utilizes the human and cartoon domain knowledge with three discriminative regularizations. We hence perform a benchmark analysis of the proposed dataset and verify the superiority of the proposed approach in the cartoon face recognition task. The dataset is available at

Reversible Watermarking in Deep Convolutional Neural Networks for Integrity Authentication

  • Xiquan Guan
  • Huamin Feng
  • Weiming Zhang
  • Hang Zhou
  • Jie Zhang
  • Nenghai Yu

Deep convolutional neural networks have made outstanding contributions in many fields such as computer vision in the past few years, and many researchers have published well-trained networks for download. However, recent studies have raised serious concerns about integrity due to model-reuse attacks and backdoor attacks. In order to protect these open-source networks, many algorithms have been proposed, such as watermarking. However, these existing algorithms modify the contents of the network permanently and are not suitable for integrity authentication. In this paper, we propose a reversible watermarking algorithm for integrity authentication. Specifically, we present the reversible watermarking problem of deep convolutional neural networks and utilize the pruning theory of model compression technology to construct a host sequence used for embedding watermarking information by histogram shift. As shown in the experiments, the influence of embedding reversible watermarking on the classification performance is less than ±0.5% and the parameters of the model can be fully recovered after extracting the watermarking. At the same time, the integrity of the model can be verified by applying the reversible watermarking: if the model is modified illegally, the authentication information generated by the original model will be absolutely different from the extracted watermarking information.
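
The reversibility of histogram-shift embedding can be shown with a small sketch on a plain integer sequence. This is the classic textbook scheme, not the paper's procedure: how the host sequence is built from network parameters via pruning is not reproduced here, and the function names `embed` and `extract` are assumptions.

```python
from collections import Counter


def embed(host, bits):
    """Embed `bits` into integer sequence `host` by histogram shifting."""
    counts = Counter(host)
    peak = max(counts, key=counts.get)  # most frequent value carries the bits
    zero = peak + 1
    while counts.get(zero, 0) > 0:      # nearest empty bin above the peak
        zero += 1
    if len(bits) > counts[peak]:
        raise ValueError("payload exceeds embedding capacity")
    payload = iter(bits)
    marked = []
    for v in host:
        if peak < v < zero:
            marked.append(v + 1)                 # shift to free the bin peak + 1
        elif v == peak:
            marked.append(v + next(payload, 0))  # bit 1 moves the value to peak + 1
        else:
            marked.append(v)
    return marked, peak, zero


def extract(marked, peak, zero, n_bits):
    """Recover the embedded bits and restore the original host sequence."""
    bits, restored = [], []
    for v in marked:
        if v in (peak, peak + 1):
            if len(bits) < n_bits:
                bits.append(v - peak)
            restored.append(peak)
        elif peak + 1 < v <= zero:
            restored.append(v - 1)               # undo the shift
        else:
            restored.append(v)
    return bits, restored


host = [3, 3, 3, 4, 5, 3, 7, 2]
marked, peak, zero = embed(host, [1, 0, 1])
bits, restored = extract(marked, peak, zero, n_bits=3)
print(marked, bits, restored == host)
```

The extraction step returns both the payload and the untouched host values, which is the property that makes the watermark usable for integrity checks without permanently altering the model.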

Masked Face Recognition with Latent Part Detection

  • Feifei Ding
  • Peixi Peng
  • Yangru Huang
  • Mengyue Geng
  • Yonghong Tian

This paper focuses on a novel task named masked faces recognition (MFR), which aims to match masked faces with common faces and is important especially during the global outbreak of COVID-19. It is challenging to identify masked faces for two main reasons. Firstly, there is no large-scale training data and test data with ground truth for MFR. Collecting and annotating millions of masked faces is labor-consuming. Secondly, since most facial cues are occluded by mask, it is necessary to learn representations which are both discriminative and robust to mask wearing. To handle the first challenge, this paper collects two datasets designed for MFR: MFV with 400 pairs of 200 identities for verification, and MFI which contains 4,916 images of 669 identities for identification. As is known, a robust face recognition model needs images of millions of identities to train, and hundreds of identities is far from enough. Hence, MFV and MFI are only considered as test datasets to evaluate algorithms. Besides, a data augmentation method for training data is introduced to automatically generate synthetic masked face images from existing common face datasets. In addition, a novel latent part detection (LPD) model is proposed to locate the latent facial part which is robust to mask wearing, and the latent part is further used to extract discriminative features. The proposed LPD model is trained in an end-to-end manner and only utilizes the original and synthetic training data. Experimental results on MFV, MFI and synthetic masked LFW demonstrate that LPD model generalizes well on both realistic and synthetic masked data and outperforms other methods by a large margin.

PanelNet: A Novel Deep Neural Network for Predicting Collective Diagnostic Ratings by a Panel of Radiologists for Pulmonary Nodules

  • Chunyan Zhang
  • Songhua Xu
  • Zongfang Li

Reducing the misdiagnosis rate is a central concern in modern medicine. In clinical practice, group-based collective diagnosis is frequently exercised to curb the misdiagnosis rate. However, little effort has been dedicated to emulating the collective intelligence behind the group-based decision making practice in computer-aided diagnosis research to this day. To fill the overlooked gap, this study introduces a novel deep neural network, titled PanelNet, that is able to computationally model and reproduce the aforesaid collective diagnosis capability demonstrated by a group of medical experts. To experimentally explore the validity of the new solution, we apply the proposed PanelNet to one of the key tasks in radiology: assessing malignant ratings of pulmonary nodules. For each nodule and a given panel, PanelNet is able to predict the statistical distribution of malignant ratings collectively judged by the panel of radiologists. Extensive experimental results consistently demonstrate that PanelNet outperforms multiple state-of-the-art computer-aided diagnosis methods applicable to the collective diagnostic task. To the best of our knowledge, no other collective computer-aided diagnosis method grounded on modern machine learning technologies has been previously proposed. By its design, PanelNet can also be easily applied to model collective diagnosis processes employed for other diseases.

SESSION: Poster Session H1: Deep Learning for Multimedia

Privacy-Preserving Visual Content Tagging using Graph Transformer Networks

  • Xuan-Son Vu
  • Duc-Trong Le
  • Christoffer Edlund
  • Lili Jiang
  • Hoang D. Nguyen

With the rapid growth of Internet media, content tagging has become an important topic with many multimedia understanding applications, including efficient organisation and search. Nevertheless, existing visual tagging approaches are susceptible to inherent privacy risks in which private information may be exposed unintentionally. The use of anonymisation and privacy-protection methods is desirable, but comes at the expense of task performance. Therefore, this paper proposes an end-to-end framework (SGTN) using Graph Transformer and Convolutional Networks to significantly improve classification and privacy preservation of visual data. In particular, we employ several mechanisms such as differential privacy based graph construction and noise-induced graph transformation to protect the privacy of knowledge graphs. Our approach sets a new state of the art on the MS-COCO dataset in various semi-supervised settings. In addition, we showcase a real experiment in the education domain to address the automation of sensitive document tagging. Experimental results show that our approach achieves an excellent balance of model accuracy and privacy preservation on both public and private datasets.

Rotationally-Consistent Novel View Synthesis for Humans

  • Youngjoong Kwon
  • Stefano Petrangeli
  • Dahun Kim
  • Haoliang Wang
  • Henry Fuchs
  • Viswanathan Swaminathan

Human novel view synthesis aims to synthesize target views of a human subject given input images taken from one or more reference viewpoints. Despite significant advances in model-free novel view synthesis, existing methods present two major limitations when applied to complex shapes like humans. First, these methods mainly focus on simple and symmetric objects, e.g., cars and chairs, limiting their performances to fine-grained and asymmetric shapes. Second, existing methods cannot guarantee visual consistency across different adjacent views of the same object. To solve these problems, we present in this paper a learning framework for the novel view synthesis of human subjects, which explicitly enforces consistency across different generated views of the subject. Specifically, we introduce a novel multi-view supervision and an explicit rotational loss during the learning process, enabling the model to preserve detailed body parts and to achieve consistency between adjacent synthesized views. To show the superior performance of our approach, we present qualitative and quantitative results on the Multi-View Human Action (MVHA) dataset we collected (consisting of 3D human models animated with different Mocap sequences and captured from 54 different viewpoints), the Pose-Varying Human Model (PVHM) dataset, and ShapeNet. The qualitative and quantitative results demonstrate that our approach outperforms the state-of-the-art baselines in both per-view synthesis quality, and in preserving rotational consistency and complex shapes (e.g. fine-grained details, challenging poses) across multiple adjacent views in a variety of scenarios, for both humans and rigid objects.

Integrating Semantic Segmentation and Retinex Model for Low-Light Image Enhancement

  • Minhao Fan
  • Wenjing Wang
  • Wenhan Yang
  • Jiaying Liu

Retinex model is widely adopted in various low-light image enhancement tasks. The basic idea of the Retinex theory is to decompose images into reflectance and illumination. The ill-posed decomposition is usually handled by hand-crafted constraints and priors. With the recently emerging deep-learning based approaches as tools, in this paper, we integrate the idea of Retinex decomposition and semantic information awareness. Based on the observation that various objects and backgrounds have different material, reflection and perspective attributes, regions of a single low-light image may require different adjustment and enhancement regarding contrast, illumination and noise. We propose an enhancement pipeline with three parts that effectively utilize the semantic layer information. Specifically, we extract the segmentation, reflectance as well as illumination layers, and concurrently enhance every separate region, i.e. sky, ground and objects for outdoor scenes. Extensive experiments on both synthetic data and real world images demonstrate the superiority of our method over current state-of-the-art low-light enhancement algorithms.
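
The Retinex decomposition mentioned above, image = reflectance × illumination, can be illustrated on a single RGB pixel. The sketch uses a simple hand-crafted estimate (illumination as the maximum colour channel), which is an assumption chosen for brevity and not the decomposition used in the paper; `retinex_decompose` is a hypothetical name.

```python
def retinex_decompose(pixel, eps=1e-6):
    """Split an RGB pixel (values in [0, 1]) into reflectance and illumination.

    Assumption: illumination is estimated as the largest channel value, and
    reflectance is the pixel divided by that estimate.
    """
    illumination = max(max(pixel), eps)  # eps avoids division by zero
    reflectance = tuple(c / illumination for c in pixel)
    return reflectance, illumination


dark_pixel = (0.2, 0.1, 0.05)
reflectance, illumination = retinex_decompose(dark_pixel)
# Brightening only the illumination keeps the reflectance (colour ratios) intact.
enhanced = tuple(min(1.0, r * 0.8) for r in reflectance)
print(reflectance, illumination, enhanced)
```

Multiplying the two components back together recovers the original pixel, which is why enhancement can act on the illumination alone.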

Alleviating Human-level Shift: A Robust Domain Adaptation Method for Multi-person Pose Estimation

  • Xixia Xu
  • Qi Zou
  • Xue Lin

Human pose estimation has been widely studied with much focus on supervised learning requiring sufficient annotations. However, in real applications, a pretrained pose estimation model usually needs to be adapted to a novel domain with no labels or sparse labels. Such domain adaptation for 2D pose estimation has not been explored. The main reason is that a pose, by nature, has a typical topological structure and needs fine-grained features at local keypoints. Existing adaptation methods, however, do not consider the topological structure of the object of interest and align whole images coarsely. Therefore, we propose a novel domain adaptation method for multi-person pose estimation to conduct human-level topological structure alignment and fine-grained feature alignment. Our method consists of three modules: Cross-Attentive Feature Alignment (CAFA), Intra-domain Structure Adaptation (ISA) and Inter-domain Human-Topology Alignment (IHTA) module. The CAFA adopts a bidirectional spatial attention module (BSAM) that focuses on fine-grained local feature correlation between two humans to adaptively aggregate consistent features for adaptation. We adopt ISA only in semi-supervised domain adaptation (SSDA) to exploit the corresponding keypoint semantic relationship for reducing the intra-domain bias. Most importantly, we propose an IHTA to learn more domain-invariant human topological representation for reducing the inter-domain discrepancy. We model the human topological structure via a graph convolution network (GCN), on which message passing allows high-order relations to be considered. This structure preserving alignment based on GCN is beneficial to occluded or extreme pose inference. Extensive experiments are conducted on two popular benchmarks and results demonstrate the competency of our method compared with existing supervised approaches.

SpatialGAN: Progressive Image Generation Based on Spatial Recursive Adversarial Expansion

  • Lei Zhao
  • Sihuan Lin
  • Ailin Li
  • Huaizhong Lin
  • Wei Xing
  • Dongming Lu

Image generation models based on generative adversarial networks have recently received significant attention and can produce diverse, sharp, and realistic images. However, generating high-resolution images has long been a challenge. In this paper, we propose a progressive spatial recursive adversarial expansion model (called SpatialGAN) capable of producing high-quality samples of natural images. Our approach uses a cascade of convolutional networks to progressively generate images in a part-to-whole fashion. At each level of spatial expansion, a separate image-to-image spatial adversarial expansion network (conditional GAN) is recursively trained based on the context image generated by the previous GAN or CGAN. Unlike other coarse-to-fine generative methods that constrain the generative process either by multi-scale resolution or by hierarchical features, the SpatialGAN decomposes the image space into multiple subspaces and gradually resolves uncertainties in the local-to-whole generative process. The SpatialGAN greatly stabilizes and speeds up the training, which allows us to produce images of high quality. Based on visual Inception Score and Fréchet Inception Distance, we demonstrate that the quality of images generated by SpatialGAN on several typical datasets is better than that of images generated by GANs without cascading and comparable with the state-of-the-art methods with cascading.

Medical Visual Question Answering via Conditional Reasoning

  • Li-Ming Zhan
  • Bo Liu
  • Lu Fan
  • Jiaxin Chen
  • Xiao-Ming Wu

Medical visual question answering (Med-VQA) aims to accurately answer a clinical question presented with a medical image. Despite its enormous potential in healthcare industry and services, the technology is still in its infancy and is far from practical use. Med-VQA tasks are highly challenging due to the massive diversity of clinical questions and the disparity of required visual reasoning skills for different types of questions. In this paper, we propose a novel conditional reasoning framework for Med-VQA, aiming to automatically learn effective reasoning skills for various Med-VQA tasks. Particularly, we develop a question-conditioned reasoning module to guide the importance selection over multimodal fusion features. Considering the different nature of closed-ended and open-ended Med-VQA tasks, we further propose a type-conditioned reasoning module to learn a different set of reasoning skills for the two types of tasks separately. Our conditional reasoning framework can be easily applied to existing Med-VQA systems to bring performance gains. In the experiments, we build our system on top of a recent state-of-the-art Med-VQA model and evaluate it on the VQA-RAD benchmark [23]. Remarkably, our system achieves significantly increased accuracy in predicting answers to both closed-ended and open-ended questions, especially for open-ended questions, where a 10.8% increase in absolute accuracy is obtained. The source code can be downloaded from

Nighttime Dehazing with a Synthetic Benchmark

  • Jing Zhang
  • Yang Cao
  • Zheng-Jun Zha
  • Dacheng Tao

Increasing the visibility of nighttime hazy images is challenging because of uneven illumination from active artificial light sources and absorption and scattering by haze. The absence of large-scale benchmark datasets hampers progress in this area. To address this issue, we propose a novel synthetic method called 3R to simulate nighttime hazy images from daytime clear images, which first reconstructs the scene geometry, then simulates the light rays and object reflectance, and finally renders the haze effects. Based on it, we generate realistic nighttime hazy images by sampling real-world light colors from a prior empirical distribution. Experiments on the synthetic benchmark show that the degrading factors jointly reduce the image quality. To handle these factors, we propose an optimal-scale maximum reflectance prior that disentangles color correction from haze removal and addresses them sequentially. In addition, we devise a simple but effective learning-based baseline which has an encoder-decoder structure based on the MobileNet-v2 backbone. Experimental results demonstrate their superiority over state-of-the-art methods in terms of both image quality and runtime. Both the dataset and source code will be available at

Pay Attention Selectively and Comprehensively: Pyramid Gating Network for Human Pose Estimation without Pre-training

  • Chenru Jiang
  • Kaizhu Huang
  • Shufei Zhang
  • Xinheng Wang
  • Jimin Xiao

Deep neural networks with multi-scale feature fusion have achieved great success in human pose estimation. However, drawbacks still exist in these methods: 1) they consider multi-scale features equally, which may over-emphasize redundant features; 2) preferring deeper structures, they can learn features with strong semantic representation, but tend to lose natural discriminative information; 3) to attain good performance, they rely heavily on pre-training, which is time-consuming or even unavailable in practice. To mitigate these problems, we propose a novel comprehensive recalibration model called Pyramid GAting Network (PGA-Net) that is capable of distilling, selecting, and fusing the discriminative and attention-aware features at different scales and different levels (i.e., both semantic and natural levels). Meanwhile, by focusing on fusing features both selectively and comprehensively, PGA-Net demonstrates remarkable stability and encouraging performance even without pre-training, so that the model can be trained truly from scratch. We demonstrate the effectiveness of PGA-Net by validating it on the COCO and MPII benchmarks, attaining new state-of-the-art performance.

Data-driven Meta-set Based Fine-Grained Visual Recognition

  • Chuanyi Zhang
  • Yazhou Yao
  • Xiangbo Shu
  • Zechao Li
  • Zhenmin Tang
  • Qi Wu

Constructing fine-grained image datasets typically requires domain-specific expert knowledge, which is not always available for crowd-sourcing platform annotators. Accordingly, learning directly from web images becomes an alternative method for fine-grained visual recognition. However, label noise in the web training set can severely degrade the model performance. To this end, we propose a data-driven meta-set based approach to deal with noisy web images for fine-grained recognition. Specifically, guided by a small clean meta-set, we train a selection net in a meta-learning manner to distinguish in-distribution from out-of-distribution noisy images. To further boost the robustness of the model, we also learn a labeling net to correct the labels of in-distribution noisy data. In this way, our proposed method can alleviate the harmful effects caused by out-of-distribution noise and properly exploit the in-distribution noisy samples for training. Extensive experiments on three commonly used fine-grained datasets demonstrate that our approach is much superior to state-of-the-art noise-robust methods.

WildDeepfake: A Challenging Real-World Dataset for Deepfake Detection

  • Bojia Zi
  • Minghao Chang
  • Jingjing Chen
  • Xingjun Ma
  • Yu-Gang Jiang

In recent years, the abuse of a face swap technique called deepfake has raised enormous public concern. So far, a large number of deepfake videos (known as "deepfakes") have been crafted and uploaded to the internet, calling for effective countermeasures. One promising countermeasure against deepfakes is deepfake detection. Several deepfake datasets have been released to support the training and testing of deepfake detectors, such as DeepfakeDetection [1] and FaceForensics++ [23]. While this has greatly advanced deepfake detection, most of the real videos in these datasets are filmed with a few volunteer actors in limited scenes, and the fake videos are crafted by researchers using a few popular deepfake software tools. Detectors developed on these datasets may become less effective against real-world deepfakes on the internet. To better support detection against real-world deepfakes, in this paper, we introduce a new dataset, WildDeepfake, which consists of 7,314 face sequences extracted from 707 deepfake videos collected completely from the internet. WildDeepfake is a small dataset that can be used, in addition to existing datasets, to develop and test the effectiveness of deepfake detectors against real-world deepfakes. We conduct a systematic evaluation of a set of baseline detection networks on both existing datasets and our WildDeepfake dataset, and show that WildDeepfake is indeed a more challenging dataset, on which detection performance can decrease drastically. We also propose two (i.e., 2D and 3D) Attention-based Deepfake Detection Networks (ADDNets) to leverage the attention masks on real/fake faces for improved detection. We empirically verify the effectiveness of ADDNets on both existing datasets and WildDeepfake. The dataset is available at:

LodoNet: A Deep Neural Network with 2D Keypoint Matching for 3D LiDAR Odometry Estimation

  • Ce Zheng
  • Yecheng Lyu
  • Ming Li
  • Ziming Zhang

Deep learning based LiDAR odometry (LO) estimation is attracting increasing research interest in the fields of autonomous driving and robotics. Existing works feed consecutive LiDAR frames into neural networks as point clouds and match pairs in the learned feature space. In contrast, motivated by the success of image based feature extractors, we propose to transfer the LiDAR frames to image space and reformulate the problem as image feature extraction. With the help of the scale-invariant feature transform (SIFT) for feature extraction, we are able to generate matched keypoint pairs (MKPs) that can be precisely returned to the 3D space. A convolutional neural network pipeline is designed for LiDAR odometry estimation from the extracted MKPs. The proposed scheme, namely LodoNet, is then evaluated on the KITTI odometry estimation benchmark, achieving results on par with or even better than the state-of-the-art.
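
The idea of moving LiDAR points into image space while keeping an exact route back to 3D can be illustrated with a minimal sketch. The abstract does not say which projection is used, so the spherical projection below, the image size, and the field-of-view values are all assumptions chosen for illustration, and `project_point` and `build_pixel_index` are hypothetical helper names rather than the authors' pipeline.

```python
import math

def project_point(x, y, z, width=1800, height=64,
                  fov_up_deg=3.0, fov_down_deg=-25.0):
    """Map a 3D point to (row, col, range) with a spherical projection.

    All parameters are illustrative assumptions. The point must not be
    the sensor origin, since its range would be zero.
    """
    rng = math.sqrt(x * x + y * y + z * z)
    yaw = math.atan2(y, x)        # horizontal angle in [-pi, pi]
    pitch = math.asin(z / rng)    # vertical angle
    fov_up = math.radians(fov_up_deg)
    fov_down = math.radians(fov_down_deg)
    col = int(0.5 * (1.0 - yaw / math.pi) * width)
    row = int((fov_up - pitch) / (fov_up - fov_down) * height)
    # Clamp to the image bounds.
    col = min(max(col, 0), width - 1)
    row = min(max(row, 0), height - 1)
    return row, col, rng

def build_pixel_index(points):
    """Store, for each pixel, the nearest original 3D point that fell on it.

    Keeping the untouched point means a 2D keypoint detected at a pixel
    can be mapped back to 3D without any re-projection error.
    """
    index = {}
    for point in points:
        row, col, rng = project_point(*point)
        if (row, col) not in index or rng < index[(row, col)][0]:
            index[(row, col)] = (rng, point)
    return index

points = [(10.0, 0.0, 0.0), (0.0, 10.0, 0.0), (5.0, 5.0, -1.0)]
pixel_index = build_pixel_index(points)
row, col, _ = project_point(5.0, 5.0, -1.0)
recovered = pixel_index[(row, col)][1]  # the original 3D point at that pixel
```

In this sketch a keypoint matcher such as SIFT would operate on an image built from the per-pixel ranges, and each matched pixel would be looked up in `pixel_index` to recover its 3D coordinates.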

Memory-Based Network for Scene Graph with Unbalanced Relations

  • Weitao Wang
  • Ruyang Liu
  • Meng Wang
  • Sen Wang
  • Xiaojun Chang
  • Yang Chen

A scene graph, which can be represented by a set of visual triples, is composed of objects and the relations between object pairs. It is vital for image captioning, visual question answering, and many other applications. However, scene graph datasets follow a long-tail distribution, and tail relations cannot be accurately identified due to the lack of training samples. Nonstandard labels and feature overlap in scene graphs also hinder the extraction of discriminative features and exacerbate the effect of data imbalance on the model. For these reasons, we propose a novel scene graph generation model that can effectively improve the detection of low-frequency relations. We use memory features to transfer high-frequency relation features to low-frequency relation features. Extensive experiments on scene graph datasets show that our model significantly improves performance on two evaluation metrics, R@K and mR@K, compared with state-of-the-art baselines.
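
To make the two metrics concrete, here is a minimal sketch of how R@K and mR@K are commonly defined: R@K is the fraction of ground-truth triples found among the top-K predictions, and mR@K averages the recall computed separately for each relation class. The single-image setup and the function names are simplifying assumptions for illustration; benchmark implementations aggregate over a whole dataset and are not taken from this paper.

```python
from collections import defaultdict

def recall_at_k(gt_triples, ranked_predictions, k):
    """Fraction of ground-truth triples that appear in the top-k predictions."""
    top_k = set(ranked_predictions[:k])
    hits = sum(1 for triple in gt_triples if triple in top_k)
    return hits / len(gt_triples)

def mean_recall_at_k(gt_triples, ranked_predictions, k):
    """Average of per-relation recalls, so rare relations weigh as much as common ones."""
    top_k = set(ranked_predictions[:k])
    hits = defaultdict(int)
    totals = defaultdict(int)
    for subj, rel, obj in gt_triples:
        totals[rel] += 1
        if (subj, rel, obj) in top_k:
            hits[rel] += 1
    return sum(hits[rel] / totals[rel] for rel in totals) / len(totals)

# Toy example: "on" is frequent, "riding" is a rare (tail) relation.
gt = [("man", "on", "horse"), ("horse", "on", "grass"), ("dog", "on", "grass"),
      ("man", "wearing", "hat"), ("man", "riding", "horse")]
pred = [("man", "on", "horse"), ("horse", "on", "grass"), ("dog", "on", "grass"),
        ("man", "wearing", "hat"), ("man", "near", "horse")]

r_at_5 = recall_at_k(gt, pred, 5)        # 4 of 5 triples found
mr_at_5 = mean_recall_at_k(gt, pred, 5)  # mean of 3/3, 1/1 and 0/1
```

Missing the single "riding" triple lowers R@5 only to 0.8 but pulls mR@5 down to about 0.67, which is why mR@K is the more sensitive measure for tail relations.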

Pairwise Similarity Regularization for Adversarial Domain Adaptation

  • Haotian Wang
  • Wenjing Yang
  • Ji Wang
  • Ruxin Wang
  • Long Lan
  • Mingyang Geng

Domain adaptation aims at learning a predictive model that can generalize to a new target domain different from the source (training) domain. To mitigate the domain gap, adversarial training has been developed to learn domain-invariant representations. State-of-the-art methods further make use of pseudo labels generated by the source domain classifier to match conditional feature distributions between the source and target domains. However, if the target domain is more complex than the source domain, the pseudo labels are too unreliable to characterize the class-conditional structure of the target domain data, undermining prediction performance. To resolve this issue, we propose a Pairwise Similarity Regularization (PSR) approach that exploits cluster structures of the target domain data and minimizes the divergence between the pairwise similarity of the clustering partition and that of the pseudo predictions. Therefore, PSR guarantees that two target instances in the same cluster have the same class prediction and thus eliminates the negative effect of unreliable pseudo labels. Extensive experimental results show that our PSR method boosts current adversarial domain adaptation methods by a large margin on four visual benchmarks. In particular, PSR achieves a remarkable improvement of more than 5% over the state-of-the-art on several hard-to-transfer tasks.
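
The regularizer described above can be sketched in a few lines. The abstract does not specify how the two pairwise similarities or their divergence are computed, so this sketch assumes a 0/1 same-cluster indicator for the clustering partition, the inner product of softmax outputs for the pseudo predictions, and binary cross-entropy as the divergence. All function names are hypothetical.

```python
import math

def cluster_similarity(cluster_ids):
    """1.0 if two instances share a cluster, else 0.0 (assumed target similarity)."""
    n = len(cluster_ids)
    return [[1.0 if cluster_ids[i] == cluster_ids[j] else 0.0 for j in range(n)]
            for i in range(n)]

def prediction_similarity(probs):
    """Inner product of class-probability vectors: the chance that two
    instances are assigned the same class (assumed prediction similarity)."""
    return [[sum(a * b for a, b in zip(p, q)) for q in probs] for p in probs]

def pairwise_divergence(cluster_ids, probs, eps=1e-7):
    """Mean binary cross-entropy between the two similarity matrices,
    taken over distinct pairs. Requires at least two instances."""
    s = cluster_similarity(cluster_ids)
    p = prediction_similarity(probs)
    n = len(cluster_ids)
    total, count = 0.0, 0
    for i in range(n):
        for j in range(i + 1, n):
            q = min(max(p[i][j], eps), 1.0 - eps)  # avoid log(0)
            total -= s[i][j] * math.log(q) + (1.0 - s[i][j]) * math.log(1.0 - q)
            count += 1
    return total / count

clusters = [0, 0, 1, 1]
agreeing = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
conflicting = [[0.9, 0.1], [0.1, 0.9], [0.1, 0.9], [0.9, 0.1]]

low = pairwise_divergence(clusters, agreeing)      # predictions follow the clusters
high = pairwise_divergence(clusters, conflicting)  # predictions split the clusters
```

Under these assumptions the penalty is small when instances in the same cluster receive the same class prediction and large when they do not, which is the behavior the abstract attributes to PSR.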

Generalized Zero-Shot Video Classification via Generative Adversarial Networks

  • Mingyao Hong
  • Guorong Li
  • Xinfeng Zhang
  • Qingming Huang

Zero-shot learning (ZSL) classifies images, according to detailed attribute annotations, into new categories that are unseen during the training stage. Generalized zero-shot learning (GZSL) adds seen categories to the test samples. Since the learned classifier has an inherent bias toward seen categories, GZSL is more challenging than traditional ZSL. However, at present, there is no detailed attribute description dataset for video classification. Therefore, current zero-shot video classification relies on generative adversarial networks trained on seen-class features to synthesize unseen-class features for ZSL classification. To address the lack of description data, we propose a description text dataset based on the UCF101 action recognition dataset. To the best of our knowledge, this is the first work to add descriptions of the classes to zero-shot video classification. We propose a new loss function that combines visual features with textual features. We extract text features from the proposed text dataset, and constrain the process of generating synthetic features based on the principle that videos with similar text types should be similar. Our method reapplies the traditional zero-shot learning idea to video classification. Experimentally, our proposed dataset and method have a positive impact on generalized zero-shot video classification.