QoEVMA '22: Proceedings of the 2nd Workshop on Quality of Experience in Visual Multimedia Applications
SESSION: Keynote Talks
Estimating the Quality of Experience of Immersive Contents
- Mylène Farias
Recent technology advancements have driven the production of plenoptic devices that capture and display visual contents, not only as texture information (as in 2D images) but also as 3D texture-geometric information. These devices represent visual information using an approximation of the plenoptic illumination function, which can describe visible objects from any point in 3D space. Depending on the capturing device, this approximation can correspond to the hologram, light field, or point cloud imaging formats. Naturally, the success of immersive applications depends on the acceptability of these formats by the final user, which ultimately depends on the perceived quality of experience. Several subjective experiments have been performed with the goal of understanding how humans perceive immersive media in 6 Degrees-of-Freedom (6DoF) environments and what the impacts of different rendering and compression techniques are on perceived visual quality. In this context, an open area of research is the design of objective methods that estimate the quality of this type of content. In this talk, I describe a set of objective methods designed to estimate the quality of immersive visual contents -- an important aspect of the overall user quality of experience. The methods use different techniques, from texture operators to convolutional neural networks, and estimate quality while taking into consideration the specificities of the different formats. Finally, I discuss some of the exciting research challenges in the area of realistic immersive multimedia applications.
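For reference, the full plenoptic function that these formats approximate is commonly written, following Adelson and Bergen, as a seven-dimensional function of viewing direction, wavelength, time, and viewpoint; holograms, light fields, and point clouds can be seen as progressively constrained approximations of it. A standard formulation:

```latex
% 7D plenoptic function (Adelson & Bergen): the light intensity observed
% from viewpoint (V_x, V_y, V_z), in direction (\theta, \phi), at
% wavelength \lambda and time t.
P = P(\theta, \phi, \lambda, t, V_x, V_y, V_z)
```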
SESSION: Session 1: Quality Assessment on 2D Images
Adversarial Attacks Against Blind Image Quality Assessment Models
- Jari Korhonen
- Junyong You
Several deep models for blind image quality assessment (BIQA) have been proposed during the past few years, with promising results on standard image quality datasets. However, generalization of BIQA models beyond the standard content remains a challenge. In this paper, we study basic adversarial attack techniques to assess the robustness of representative deep BIQA models. Our results show that adversarial images created for a simple substitute BIQA model (i.e., the white-box scenario) are transferable as such, deceiving several other, more complex BIQA models (i.e., the black-box scenario). We also investigate some basic defense mechanisms. Our results indicate that re-training BIQA models on a dataset augmented with adversarial images improves the robustness of several models, but at the cost of decreased quality prediction accuracy on genuine images.
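As a point of reference, the sketch below shows the general shape of a one-step white-box attack (FGSM-style) against a quality model, assuming a differentiable PyTorch BIQA model that maps an image tensor to a scalar score. The function name, the attack direction (inflating the predicted score), and the epsilon value are illustrative assumptions, not the paper's exact setup.

```python
import torch

def fgsm_attack_biqa(model, image, epsilon=2 / 255):
    """One-step FGSM sketch: perturb the input so the BIQA model's
    predicted quality score rises, while the visual change stays small."""
    image = image.clone().detach().requires_grad_(True)
    score = model(image)             # scalar (or (B, 1)) quality prediction
    (-score.mean()).backward()       # ascend on the predicted score
    adversarial = image + epsilon * image.grad.sign()
    return adversarial.clamp(0.0, 1.0).detach()
```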
Simulating Visual Mechanisms by Sequential Spatial-Channel Attention for Image Quality Assessment
- Junyong You
- Jari Korhonen
As a subjective concept, image quality assessment (IQA) is significantly affected by perceptual mechanisms. Two mutually influencing mechanisms, namely spatial attention and contrast sensitivity, are particularly important for IQA. This paper explores a transformer-based deep learning approach to modeling these two mechanisms. By converting contrast sensitivity into an attention representation, a unified multi-head attention module is applied to the spatial and channel features in the transformer encoder to simulate the two mechanisms in IQA. Sequential spatial-channel self-attention is proposed to avoid the expensive computation of the classical Transformer model. In addition, as image rescaling can potentially affect perceived quality, zero-padding and masking with specially assigned attention weights are used to handle arbitrary image resolutions without rescaling. Evaluation results on publicly available large-scale IQA databases demonstrate the outstanding performance and generalization of the proposed IQA model.
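A rough sketch of the sequential spatial-channel idea in PyTorch is shown below: self-attention is first applied over spatial tokens (each position attends across positions), then over channel tokens. The module layout, head counts, and the fixed spatial token count are illustrative assumptions; the paper instead handles arbitrary resolutions via zero-padding and masking.

```python
import torch
import torch.nn as nn

class SequentialSpatialChannelAttention(nn.Module):
    """Illustrative sketch: attention over spatial tokens, then channels.
    Assumes `channels` is divisible by `heads` and H*W == spatial_tokens."""
    def __init__(self, channels, spatial_tokens, heads=4):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.channel_attn = nn.MultiheadAttention(spatial_tokens, 1, batch_first=True)

    def forward(self, x):                    # x: (B, C, H, W)
        b, c, h, w = x.shape
        s = x.flatten(2).transpose(1, 2)     # (B, H*W, C): spatial tokens
        s, _ = self.spatial_attn(s, s, s)    # attend across positions
        ch = s.transpose(1, 2)               # (B, C, H*W): channel tokens
        ch, _ = self.channel_attn(ch, ch, ch)  # attend across channels
        return ch.view(b, c, h, w)
```

Applying the two attentions sequentially keeps the cost linear in each axis, rather than attending jointly over all spatial-channel pairs as a classical Transformer block would.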
From Just Noticeable Differences to Image Quality
- Ali Ak
- Andreas Pastor
- Patrick Le Callet
Distortions can occur due to several processing steps in the imaging chain of a wide range of multimedia content. The visibility of distortions is highly correlated with the overall perceived quality of a given multimedia content. Subjective quality evaluation of images relies mainly on mean opinion scores (MOS) to provide ground truth for measuring image quality on a continuous scale. Alternatively, the just noticeable difference (JND) defines the visibility of distortions as a binary measurement relative to an anchor point. Using the pristine reference as the anchor, the first JND point can be determined; this first JND point provides an intrinsic quantification of the visible distortions within the multimedia content. Therefore, it is intuitively appealing to develop a quality assessment model that uses the JND information as its fundamental cornerstone. In this work, we use the first JND point to train a Siamese Convolutional Neural Network to predict image quality scores on a continuous scale. To ensure generalization, we incorporate a white-box optical retinal pathway model to acquire achromatic responses. The proposed model, D-JNDQ, displays competitive performance in a cross-dataset evaluation conducted on the TID2013 database, demonstrating the generalization of the model to unseen distortion types and supra-threshold distortion levels.
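A minimal sketch of a Siamese arrangement of the kind described, assuming single-channel (achromatic) inputs as produced by a retinal pathway model. The layer sizes, the difference-based regression head, and all names are assumptions for illustration, not the D-JNDQ architecture.

```python
import torch
import torch.nn as nn

class SiameseJNDQuality(nn.Module):
    """Shared-weight encoder applied to reference and distorted images;
    the head maps the feature difference to a continuous quality score."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )
        self.head = nn.Linear(feat_dim, 1)

    def forward(self, reference, distorted):
        # Same encoder (shared weights) on both branches; score the gap.
        return self.head(self.encoder(reference) - self.encoder(distorted))
```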
SESSION: Session 2: QoE on Immersive Multimedia
Impact of Content on Subjective Quality of Experience Assessment for 3D Video
- Dawid Juszka
- Zdzislaw Papir
Ongoing improvements in the field of visual entertainment may incline users to display 3D video content on various terminal devices, provided a satisfying Quality of Experience (QoE) is delivered. This study aims to determine whether the cognitive features of appealing yet uncommon 3D content may obfuscate subjective QoE measurements performed at different bitrates. To test this hypothesis, two 3D video databases are compared in terms of perceived QoE under an innovative scenario. The reference database is GroTruQoE-3D (VQEG), which includes short artificial clips. Our 3D video database, DJ3D, contains longer clips from feature films with a proven substantial level of cognitive features. The 3D video content features are operationalised by three cognitive attributes (attractiveness, interestingness, and 3D effect experience). Gradation of video quality is introduced by streaming at four bitrate levels. The collected subjects' scores are statistically analysed with a stochastic dominance test adjusted to a 5-point Likert scale. The obtained results show that quality assessment scores depend on the intensity of the cognitive attributes of the content: sequences commonly used in subjective QoE experiments are more vulnerable to the intensity of subjective content attributes (visual attractiveness, interestingness, and 3D effect experience) than sequences from commercial feature films and documentaries. Moreover, it is shown that the test material commonly used in research is rated higher at lower bitrates. In view of these key results, QoE researchers should consider using test material originating from commercially available content to minimize content impact on the QoE assessment scores collected during subjective experiments. The research contributes to QoE best practices by drawing attention to 3D cognitive attributes that may obfuscate subjective scores. An innovative scenario for comparing video databases with a stochastic dominance test adjusted to the ordinal scale is proposed; the approach may also be useful in a broader context, when an emerging service operator wants to ascertain whether subjective QoE tests are substantially biased by the service novelty.
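To make the dominance relation concrete, the sketch below checks first-order stochastic dominance between two sets of 5-point Likert scores by comparing their empirical CDFs. This illustrates only the dominance relation itself; the paper's test additionally involves statistical inference adjusted to the ordinal scale, which is omitted here.

```python
import numpy as np

def stochastically_dominates(scores_a, scores_b, categories=(1, 2, 3, 4, 5)):
    """First-order stochastic dominance on ordinal scores: A dominates B
    if A's empirical CDF is at or below B's at every category (A puts no
    more mass on low scores), and strictly below at least once."""
    cdf_a = np.array([np.mean(np.asarray(scores_a) <= c) for c in categories])
    cdf_b = np.array([np.mean(np.asarray(scores_b) <= c) for c in categories])
    return bool(np.all(cdf_a <= cdf_b) and np.any(cdf_a < cdf_b))

# Example: scores_hi leans toward 4-5, scores_lo toward 2-3.
scores_hi = [4, 5, 4, 5, 3, 4, 5]
scores_lo = [2, 3, 3, 2, 4, 3, 2]
print(stochastically_dominates(scores_hi, scores_lo))  # True
```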
No-Reference Quality Assessment of Stereoscopic Video Based on Deep Frequency Perception
- Shuai Xiao
- Jiabao Wen
- Jiachen Yang
- Yanshuang Zhou
The purpose of stereoscopic video quality assessment (SVQA) is to measure the quality of stereo video easily and quickly, striving for consistency with human visual perception. Stereo video contains more perceptual information and involves more visual perception theory than 2D images/video, making SVQA more challenging. Targeting the effect of distortion on the frequency-domain characteristics of stereo video, an SVQA method based on deep frequency perception is proposed. Specifically, the frequency domain is exploited while minimizing changes to the existing network structure, enabling in-depth exploration of frequency-domain characteristics without altering the original frame size of the stereo video. Experiments are carried out on three public stereo video databases, namely the NAMA3DS1-COSPAD1 database, the WaterlooIVC 3D Video Phase I database, and the QI-SVQA database. The experimental results show that the proposed method has good quality prediction ability, especially on asymmetrically compressed stereo video databases.
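A small sketch of the kind of frequency-domain input such a method might compute, assuming PyTorch and a grayscale frame tensor. This is a generic log-magnitude spectrum computed at the original resolution, not the paper's exact transform.

```python
import torch

def frequency_features(frame):
    """Log-magnitude 2D FFT of a (H, W) grayscale frame, computed at the
    original resolution, so no resizing of the input is required."""
    spectrum = torch.fft.fft2(frame)
    # Shift the zero-frequency component to the center; compress dynamic range.
    return torch.log1p(torch.abs(torch.fft.fftshift(spectrum)))
```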
On Objective and Subjective Quality of 6DoF Synthesized Live Immersive Videos
- Yuan-Chun Sun
- Sheng-Ming Tang
- Ching-Ting Wang
- Cheng-Hsin Hsu
We address the problem of quantifying the perceived quality of 6DoF (six Degrees-of-Freedom) live immersive video in two steps. First, we develop a set of tools to generate (or collect) datasets in a photorealistic simulator, AirSim. Using these tools, we can vary diverse settings of live immersive videos, such as scenes, trajectories, camera placements, and encoding parameters. Second, we develop objective and subjective evaluation procedures and carry out evaluations on a sample immersive video codec, MPEG MIV, using our own dataset. Our experiments yield several insights: (1) the two synthesizers in TMIV produce comparable target view quality, but RVS runs 2 times faster; (2) the Quantization Parameter (QP) is a good control knob for trading off target view quality and bitrate, but camera placements (or trajectories) also have significant impacts; and (3) overall subjective quality has a strong linear/rank correlation with subjective similarity, sharpness, and color. These findings shed light on future research problems in the development of emerging applications relying on immersive interactions.
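The linear/rank correlations mentioned in insight (3) are conventionally the Pearson (PLCC) and Spearman (SRCC) coefficients; a minimal sketch of how such a report is computed with SciPy is shown below. The function name and the dictionary output are illustrative, not the paper's evaluation code.

```python
from scipy.stats import pearsonr, spearmanr

def correlation_report(objective_scores, subjective_mos):
    """Linear (PLCC) and rank (SRCC) correlation between objective
    predictions and subjective mean opinion scores."""
    plcc, _ = pearsonr(objective_scores, subjective_mos)
    srcc, _ = spearmanr(objective_scores, subjective_mos)
    return {"PLCC": plcc, "SRCC": srcc}
```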
No-reference Point Clouds Quality Assessment using Transformer and Visual Saliency
- Salima Bourbia
- Ayoub Karine
- Aladine Chetouani
- Mohammed El Hassouni
- Maher Jridi
Quality estimation of 3D objects/scenes represented by point clouds is a crucial and challenging task in computer vision. In real-world applications, reference data is not always available, which motivates the development of new point cloud quality assessment (PCQA) metrics that do not require the original 3D point cloud (3DPC). This family of methods is called no-reference or blind PCQA. In this context, we propose a deep-learning-based approach that benefits from the self-attention mechanism of transformers to accurately predict the perceptual quality score of each degraded 3DPC. Additionally, we introduce the use of saliency maps to reflect the behavior of the human visual system, which is attracted to some regions more than others during evaluation. To this end, we first render 2D projections (i.e., views) of a 3DPC from different viewpoints. Then, we weight the obtained projected images with their corresponding saliency maps. After that, we discard the majority of the background information by extracting salient sub-images, which are fed as a sequential input to the vision transformer in order to extract global contextual information and predict the quality scores of the sub-images. Finally, we average the scores of all the salient sub-images to obtain the perceptual quality score of the 3DPC. We evaluate the performance of our model on the ICIP2020 and SJTU point cloud quality assessment benchmarks. Experimental results show that our model achieves promising performance compared to state-of-the-art point cloud quality assessment metrics.
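A sketch of the saliency-weighting and sub-image extraction steps, assuming a rendered view as an (H, W, 3) array and a saliency map as an (H, W) array in [0, 1]. The non-overlapping grid, patch size, and top-k selection rule are assumptions for illustration, not the paper's exact procedure.

```python
import numpy as np

def salient_subimages(view, saliency, patch=224, top_k=4):
    """Weight a rendered 2D view of the point cloud by its saliency map,
    then keep the top-k patches with the highest average saliency."""
    weighted = view * saliency[..., None]          # broadcast over RGB
    h, w = saliency.shape
    patches, scores = [], []
    for y in range(0, h - patch + 1, patch):
        for x in range(0, w - patch + 1, patch):
            patches.append(weighted[y:y + patch, x:x + patch])
            scores.append(saliency[y:y + patch, x:x + patch].mean())
    order = np.argsort(scores)[::-1][:top_k]       # most salient first
    return [patches[i] for i in order]
```

The retained sub-images would then form the token sequence consumed by the vision transformer, with the final quality score obtained by averaging the per-sub-image predictions.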
Point Cloud Quality Assessment Using Cross-correlation of Deep Features
- Marouane Tliba
- Aladine Chetouani
- Giuseppe Valenzise
- Frederic Dufaux
3D point clouds have emerged as a preferred format for recent immersive communication systems, due to the six degrees of freedom they offer. The huge data size of point clouds, which consist of both geometry and color information, has recently motivated the development of efficient compression schemes. To support the optimization of these algorithms, adequate and efficient perceptual quality metrics are needed. In this paper, we propose a novel end-to-end deep full-reference framework for 3D point cloud quality assessment, considering both geometry and color information. We use two identical neural networks, based on a residual permutation-invariant architecture, to extract local features from a sparse set of patches extracted from the point cloud. Afterwards, we measure the cross-correlation between the embeddings of the pristine and distorted point clouds to quantify the global shift in the features due to visual distortion. The proposed scheme achieves results comparable to state-of-the-art metrics even when a small number of centroids is used, reducing the computational complexity.
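As a rough illustration of the cross-correlation step, the sketch below computes a mean per-dimension Pearson correlation between two (N patches x D dims) embedding matrices produced by the shared feature extractor. The normalization details and the reduction to a single score are assumptions; the paper's exact correlation formulation may differ.

```python
import torch

def feature_cross_correlation(feats_ref, feats_dist, eps=1e-8):
    """Mean per-dimension correlation between the patch embeddings of the
    pristine and distorted point clouds; values near 1 indicate little
    feature shift (low distortion), lower values a larger global shift."""
    a = (feats_ref - feats_ref.mean(0)) / (feats_ref.std(0) + eps)
    b = (feats_dist - feats_dist.mean(0)) / (feats_dist.std(0) + eps)
    return (a * b).mean(0).mean()   # average over dims, in [-1, 1]
```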