IH&MMSec '23: Proceedings of the 2023 ACM Workshop on Information Hiding and Multimedia Security
SESSION: Keynote Talks
Photoshop Fantasies
- Walter Scheirer
The possibility of an altered photo revising history in a convincing way highlights a salient threat of imaging technology. After all, seeing is believing. Or is it? The examples history has preserved make it clear that the observer is more often than not meant to understand that something has changed. Surprisingly, the objectives of photographic manipulation have remained largely the same since the camera first appeared in the 19th century. The old battle-worn techniques have simply evolved to keep pace with technological developments. In this talk, we will learn about the history of photographic manipulation, from the invention of the camera to the present day. Importantly, we will consider the reception of photo editing and its relationship to the notion of reality, which is more significant than the technologies themselves. Surprisingly, we will discover that creative mythmaking has found a new medium to embed itself in.
Steganography on Mobile Apps
- Jennifer L. Newman
Steganography is an ancient communication technique that hides a message inside a common object so that the message escapes scrutiny - today, we use digital files like photos and videos to hide and pass messages. Code to hide messages inside digital photos was developed (on computers) in the 1990s. However, code to hide messages inside photos in a smartphone app appeared only in the last ten years. These mobile stego apps expanded steganography's reach from computer experts to the general public, where few technological skills are necessary.
How much of a presence does mobile stego have in our day-to-day lives? Beyond collecting download statistics, it is hard to say, as there are no known software tools capable of reliable detection of mobile stego images. Why not? A simple observation is that there exist no training pairs of cover-stego images to develop software solutions. Of course, the answer is more complex. In response to this lack of data, our research team designed and created a steganography database populated with mobile phone images, including stego images created from mobile phone apps. The story of creating StegoAppDB is remarkably involved and far more complex than simply acquiring smartphones, taking pictures, and creating stego images from apps. In this talk, I present our adventure of creating a steganography app database with a diverse team of people, skills, and ideas. This has led to various discoveries connecting steganography with mobile apps, identifying several interesting and difficult open problems.
On the Detection, Localization, and Reverse Engineering of Diverse Image Manipulations
- Xiaoming Liu
With the abundance of imagery data captured in our daily life, an increasing variety of manipulations is being applied to imagery, including generative model based image generation and manipulation, adversarial attacks, classic image editing such as splicing, etc. As imagery plays an important role in our society, it is necessary to understand whether, where, and how a given image has been manipulated. From the perspective of a defender, this talk will introduce our recent efforts on detecting these manipulations individually and jointly, as well as reverse engineering the various information regarding the manipulation process. We will also share some of the topics that warrant future research.
SESSION: Media Security and Privacy
TMCIH: Perceptual Robust Image Hashing with Transformer-based Multi-layer Constraints
- Yaodong Fang
- Yuanding Zhou
- Xinran Li
- Ping Kong
- Chuan Qin
In recent decades, many perceptual image hashing schemes for content authentication have been proposed. However, existing algorithms cannot provide satisfactory robustness and discrimination in the face of complex manipulations in real scenarios. In this work, we propose a novel perceptual robust image hashing scheme with transformer-based multi-layer constraints. Specifically, we first introduce the Transformer structure into the field of perceptual image hashing, and an integrated loss function is designed to optimize the training of the model. In addition, to address the overly simple content-preserving manipulations used in previous datasets, we construct a more challenging image dataset based on various manipulations, which can deal with complex image authentication scenarios. Experimental results demonstrate that our scheme achieves competitive results compared with existing schemes.
Perceptual Robust Hashing for Video Copy Detection with Unsupervised Learning
- Gejian Zhao
- Chuan Qin
- Xiangyang Luo
- Xinpeng Zhang
- Chin-Chen Chang
In this paper, we propose an end-to-end perceptual robust hashing scheme for video copy detection based on unsupervised learning. First, the spatio-temporal information in videos is effectively fused and condensed into high-dimensional features through a 3D self-attention, multi-scale feature fusion model based on a 3D-CNN, in which an Inception block and a 3D self-attention mechanism are integrated. Then, we calculate the correlation distances between the extracted features to differentiate perceptual contents. Based on this similarity relationship, we dynamically generate pseudo-labels and exploit them to further guide the model training for video hash generation. In addition, we design dual constraints so that the hash codes achieve satisfactory robustness and discrimination. Extensive experiments demonstrate that the proposed scheme achieves superior copy-detection performance compared with existing schemes and performs well even in the case of untrained manipulations.
Video Frame Interpolation via Multi-scale Expandable Deformable Convolution
- Dengyong Zhang
- Pu Huang
- Xiangling Ding
- Feng Li
- Gaobo Yang
Video frame interpolation is a challenging task in the video processing field. Benefiting from the development of deep learning, many video frame interpolation methods have been proposed, which focus on sampling pixels with useful information to synthesize each output pixel using their own sampling operations. However, these works suffer from data redundancy and fail to sample the correct pixels for complex motions. To solve these problems, we propose a new warping-based sampling framework called multi-scale expandable deformable convolution (MSEConv), which employs a deep fully convolutional neural network to estimate multiple small-scale kernel weights with different expansion degrees and adaptive weight allocation for each pixel synthesis. MSEConv covers most prevailing research methods as special cases, so it can also be transferred to existing works to improve their performance. To further improve the robustness of the whole network to occlusion, we also introduce a data preprocessing method for mask occlusion in video frame interpolation. Quantitative and qualitative experiments show that our method performs comparably to or even better than the state-of-the-art methods. Our source code and visual comparison results are available at https://github.com/Pumpkin123709/MSEConv.
SESSION: Steganography and Steganalysis
Compatibility and Timing Attacks for JPEG Steganalysis
- Etienne Levecque
- Patrick Bas
- Jan Butora
This paper introduces a novel compatibility attack to detect a steganographic message embedded in the DCT domain of a JPEG image at high quality factors (close to 100). Because JPEG compression is not a surjective function, i.e., not every DCT block can be obtained from a pixel block, embedding a message in the DCT domain can create incompatible blocks. We propose a method to find such blocks, which directly proves that a block has been modified during the embedding. This theoretical method provides many advantages, such as being completely independent of cover-source mismatch, having good detection power, and offering perfect reliability, since false alarms are impossible as soon as incompatible blocks are found. We show that finding an incompatible block is equivalent to proving the infeasibility of an Integer Linear Programming problem. However, solving such a problem requires considerable computational power and has not yet been achieved for 8x8 blocks. Instead, a timing-attack approach is presented that performs steganalysis potentially without any false alarms, given large computing power.
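As a rough illustration of the feasibility problem mentioned above (our own notation, not taken from the paper, and ignoring clipping and the exact tie-breaking of the rounding), a quantized DCT block is compatible if and only if the following integer program admits a solution:

```latex
% D: 8x8 DCT matrix, Q: quantization table, c: observed quantized DCT block
\begin{align*}
\text{find } & x \in \{0, 1, \dots, 255\}^{8 \times 8} \\
\text{s.t. } & \left| \big( D\,(x - 128)\,D^{\top} \big)_{ij} - Q_{ij}\, c_{ij} \right| \le \tfrac{Q_{ij}}{2}, \qquad 1 \le i, j \le 8 .
\end{align*}
```

If no such pixel block x exists, the observed block c cannot have been produced by the JPEG compressor and must therefore have been modified after compression.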
On Comparing Ad Hoc Detectors with Statistical Hypothesis Tests
- Eli Dworetzky
- Edgar Kaziakhmedov
- Jessica Fridrich
This paper addresses how to fairly compare the ROCs of ad hoc (or data-driven) detectors with tests derived from statistical models of digital media. We argue that the ways ROCs are typically drawn for each detector type correspond to different hypothesis testing problems with different optimality criteria, making the ROCs incomparable. To understand the problem and why it occurs, we model a source of natural images as a mixture of scene oracles and derive optimal detectors for the task of image steganalysis. Our goal is to guarantee that, when the data follows the statistical model adopted for the hypothesis test, the ROC of the optimal detector bounds the ROC of the ad hoc detector. While the results are applicable beyond the field of image steganalysis, we use this setup to point out possible inconsistencies when comparing both types of detectors and to explain guidelines for their proper comparison. Experiments on an artificial cover source with a known model, using real steganographic algorithms and deep-learning detectors, confirm our claims.
Progressive JPEGs in the Wild: Implications for Information Hiding and Forensics
- Nora Hofer
- Rainer Boehme
JPEG images stored in progressive mode have become more prevalent recently. An estimated 30% of all JPEG images on the most popular websites use progressive mode. Presumably, this surge is caused by the adoption of MozJPEG, an open-source library designed for web publishers. So far, the optimizations used by MozJPEG have not been considered by the multimedia security community, although they are highly relevant. The goal of this paper is to document these optimizations and make them accessible to the research community. Most notably, we find that Trellis optimization in MozJPEG modifies quantized DCT coefficients in order to improve the rate-distortion tradeoff using a perceptual model based on PSNR-HVS. This may compromise the reliability of known methods in steganography, steganalysis, and image forensics when dealing with images compressed with MozJPEG. We also find that the type and order of scans in progressive mode, which MozJPEG adjusts to the image, offer novel cues that can aid forensic source identification.
Analysis and Mitigation of the False Alarms of the Reverse JPEG Compatibility Attack
- Jan Butora
- Patrick Bas
- Rémi Cogranne
The Reverse JPEG Compatibility Attack can be used for steganalysis of JPEG images compressed with Quality Factor 100 by detecting the increased variance of decompression rounding errors. In this work, we point out the dangers associated with this attack by showing that, in an uncontrolled environment, the variance can be elevated simply by using a different JPEG compressor. If not careful, the steganalyst can misclassify cover images as stego. In order to deal with the diversity associated with the devices or software generating JPEGs, we propose to build a deep learning detector trained on a huge dataset of downloaded images. Experimental evaluation shows that such a detector can provide operational false-alarm rates as small as 10^-4, while still correctly classifying 90% of stego images. Furthermore, it is shown that this performance carries over directly to other image datasets. As a by-product, we indicate that the attack is not applicable to images developed with a specific JPEG compressor based on the trunc quantization function.
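As a hedged sketch of the statistic behind the attack (our own illustration, not the authors' code): given the quantized luminance DCT blocks of a QF-100 JPEG (quantization steps equal to 1, so quantized and dequantized coefficients coincide), the decompression rounding errors and their variance can be computed as below; the coefficients could be read with any library that exposes them (e.g., jpegio).

```python
import numpy as np
from scipy.fftpack import idct

def decompress_block(dct_block):
    """Inverse 2-D DCT (orthonormal) of one 8x8 coefficient block, level shift undone."""
    spatial = idct(idct(dct_block, axis=0, norm="ortho"), axis=1, norm="ortho")
    return spatial + 128.0

def rounding_error_variance(dct_blocks):
    """Variance of decompression rounding errors over all blocks of one image."""
    errors = []
    for block in dct_blocks:                       # iterable of 8x8 quantized DCT blocks
        y = decompress_block(block.astype(np.float64))
        errors.append(y - np.round(y))             # error discarded when pixels are cast to 8 bits
    return np.concatenate([e.ravel() for e in errors]).var()

# A detector thresholds this variance: DCT-domain embedding at QF 100 elevates it
# relative to cover images, but (as this paper shows) so can an atypical compressor.
```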
Advancing the JPEG Compatibility Attack: Theory, Performance, Robustness, and Practice
- Eli Dworetzky
- Edgar Kaziakhmedov
- Jessica Fridrich
The JPEG compatibility attack is a steganalysis method for detecting messages embedded in the spatial representation of an image under the assumption that the cover image was a decompressed JPEG. This paper addresses a number of open problems in previous art, namely the lack of theoretical insight into how and why the attack works, low detection accuracy for high JPEG qualities, robustness to the JPEG compressor and DCT coefficient quantizer, and real-life performance evaluation. To explain the main mechanism responsible for detection and to understand the trends exhibited by heuristic detectors, we adopt a model of quantization errors of DCT coefficients in the recompressed image, and within a simplified setup, we analyze the behavior of the most powerful detector. Empowered by our analysis, we resolve the performance deficiencies using an SRNet trained on a two-channel input consisting of the image and its SQ error. This detector is compared with the previous state of the art on four content-adaptive stego methods and for a wide range of payloads and quality factors. The last sections of this paper are devoted to studying the robustness of this detector with respect to JPEG compressors, quantizers, and errors in estimating the JPEG quantization table. Finally, to demonstrate the practical usability of this attack, we test our detector on stego images produced by real steganographic tools available on the Internet.
Limits of Data Driven Steganography Detectors
- Edgar Kaziakhmedov
- Eli Dworetzky
- Jessica Fridrich
While deep learning has revolutionized image steganalysis in terms of performance, little is known about how much modern data driven detectors can still be improved. In this paper, we approach this difficult and currently wide open question by working with artificial but realistic-looking images with a known statistical model that allows us to compute the detectability of modern content-adaptive algorithms with respect to the most powerful detectors. Multiple artificial image datasets are crafted with different levels of content complexity and noise power to assess their influence on the gap between both types of detectors. Experiments with SRNet as the heuristic detector indicate that independent noise contributes less to the performance gap than content of the same MSE. While this loss is rather small for smooth images, it can be quite large for textured images. A network trained on many realizations of a fixed textured scene will, however, recover most of the loss, suggesting that networks have the capacity to approximately learn the parameters of a cover source narrowed to a fixed scene.
Calibration-based Steganalysis for Neural Network Steganography
- Na Zhao
- Kejiang Chen
- Chuan Qin
- Yi Yin
- Weiming Zhang
- Nenghai Yu
Recent research has shown that neural network models can be used to steal sensitive data or embed malware. Therefore, steganalysis for neural networks is urgently needed. However, existing neural network steganalysis methods do not perform well under small embedding rates. In addition, because neural networks have a large number of parameters, steganography at a small embedding rate can still embed enough information into the model for malicious purposes. To address this problem, this paper proposes a calibration-based steganalysis method, which fine-tunes the original neural network model without implicit constraints to obtain a reference model, then extracts and fuses statistical moments from the parameter distributions of the original model and its reference model, and finally trains a logistic regressor for detection. Extensive experiments show that the proposed method has superior performance in detecting steganographic neural network models under small embedding rates.
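A minimal sketch of the calibration idea described above, with all names and the choice of moments hypothetical (this is our own illustration, not the authors' implementation): fine-tune a copy of the suspect model to obtain a reference, summarize both parameter distributions by statistical moments, and feed the fused features to a logistic regressor.

```python
import copy
import numpy as np
import torch
from scipy import stats
from sklearn.linear_model import LogisticRegression

def moment_features(model: torch.nn.Module) -> np.ndarray:
    """Mean, variance, skewness, and kurtosis of each layer's weights."""
    feats = []
    for p in model.parameters():
        w = p.detach().cpu().numpy().ravel()
        feats += [w.mean(), w.var(), stats.skew(w), stats.kurtosis(w)]
    return np.array(feats)

def calibrated_features(model, finetune_fn) -> np.ndarray:
    """Fuse moments of the suspect model and of its fine-tuned reference."""
    reference = finetune_fn(copy.deepcopy(model))  # e.g., a few epochs on clean data
    f_orig, f_ref = moment_features(model), moment_features(reference)
    return np.concatenate([f_orig, f_ref, f_orig - f_ref])

def train_detector(models, labels, finetune_fn):
    """Fit a logistic-regression detector on calibrated features of labelled models."""
    X = np.stack([calibrated_features(m, finetune_fn) for m in models])
    return LogisticRegression(max_iter=1000).fit(X, labels)
```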
SCL-Stega: Exploring Advanced Objective in Linguistic Steganalysis using Contrastive Learning
- Juan Wen
- Liting Gao
- Guangying Fan
- Ziwei Zhang
- Jianghao Jia
- Yiming Xue
Text steganography is becoming increasingly secure by eliminating the distribution discrepancy between normal and stego text. On the other hand, existing cross-entropy-based steganalysis models struggle to distinguish subtle distribution differences and lack robustness on confusable samples. To enhance steganalysis accuracy on hard-to-detect samples, this paper draws on contrastive learning to design a text steganalysis framework that incorporates a supervised contrastive loss into the training process. This framework improves feature representation by pushing apart embeddings from different classes while pulling closer embeddings from the same class. The experimental results show that our method achieves remarkable improvements over the four baseline models. Additionally, as the embedding rate increases, our method's advantages become increasingly apparent, with maximum improvements of 13.98%, 12.47%, and 13.65% over the baseline methods across three common linguistic steganalysis datasets: Twitter, IMDB, and News, respectively. Our code is available at https://github.com/Katelin-glt/SCL-Stega.
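A generic supervised contrastive (SupCon-style) objective of the kind described above might look like the sketch below; this is our own illustration rather than the authors' released code (see the repository above for that), and the temperature and batching details are assumptions.

```python
import torch
import torch.nn.functional as F

def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    """Pull same-class embeddings together and push different-class ones apart."""
    z = F.normalize(embeddings, dim=1)                    # (N, d), unit-norm
    sim = z @ z.t() / temperature                         # (N, N) similarity logits
    n = z.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=z.device)
    pos_mask = labels.unsqueeze(0).eq(labels.unsqueeze(1)) & ~self_mask

    sim = sim.masked_fill(self_mask, float("-inf"))       # exclude self-pairs
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    log_prob = log_prob.masked_fill(self_mask, 0.0)       # avoid -inf * 0 on the diagonal

    pos_counts = pos_mask.sum(dim=1).clamp(min=1)
    loss = -(log_prob * pos_mask).sum(dim=1) / pos_counts  # mean log-prob of positives
    return loss[pos_mask.any(dim=1)].mean()                # skip anchors without positives
```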
SESSION: Watermarking and Security
An Improved Reversible Database Watermarking Method based on Histogram Shifting
- Cheng Li
- Xinhui Han
- Wenfa Qi
- Zongming Guo
Database watermarking is typically employed to address the issues of data theft, illegal replication, and copyright infringement that may arise during the sharing of databases. Unfortunately, existing methods often cause permanent distortion to the original data, and it is challenging to strike a balance between watermark embedding capacity and data distortion. Therefore, this paper proposes a reversible database watermarking method based on histogram shifting, rhombus prediction, and double embedding with high capacity and low distortion, called RPDE-HSW. Using rhombus prediction, we construct two prediction-error histograms in each subgroup and expand the watermark capacity by adopting double-layer embedding and embedding 2 bits in a single bin. A scrambling algorithm is used to make the attribute value distribution more dispersed, resulting in a sparser database histogram. Subsequently, we optimize the selection rules for the watermark embedding carrier, effectively eliminating the redundant distortion caused by histogram shifting. Experimental results demonstrate that the proposed method achieves smaller data distortion and higher watermark embedding capacity than other state-of-the-art works, and does not affect classification results or data mining.
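To make the underlying mechanism concrete, here is a minimal, illustrative sketch of single-layer prediction-error histogram shifting with a rhombus predictor on a 2-D integer array; it is not the paper's RPDE-HSW (which uses double-layer embedding, two histograms per subgroup, and 2-bit bins), and the peak choice and layer parity are assumptions.

```python
import numpy as np

def rhombus_predict(x, i, j):
    """Predict cell (i, j) as the floor-average of its four rhombus neighbours."""
    return int((int(x[i - 1, j]) + int(x[i + 1, j]) + int(x[i, j - 1]) + int(x[i, j + 1])) // 4)

def embed_bits(x, bits, peak=0):
    """Embed bits in prediction errors equal to `peak`; shift larger errors by 1.

    Only one checkerboard layer is modified, so the predictor (which uses the
    other layer) sees identical values during embedding and extraction."""
    y = x.copy()
    k = 0
    for i in range(1, x.shape[0] - 1):
        for j in range(1, x.shape[1] - 1):
            if (i + j) % 2:                      # skip the untouched layer
                continue
            e = int(x[i, j]) - rhombus_predict(x, i, j)
            if e == peak and k < len(bits):
                y[i, j] = x[i, j] + bits[k]      # expand the peak bin: carries one bit
                k += 1
            elif e > peak:
                y[i, j] = x[i, j] + 1            # histogram shift keeps the mapping invertible
    return y, k                                  # k = number of bits actually embedded
```

Extraction re-runs the same scan on the watermarked array: an error of `peak` or `peak + 1` yields bit 0 or 1, and subtracting the bit (or the shift) restores the original value, which is what makes the scheme reversible.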
Applying a Zero-Knowledge Watermarking Protocol to Secure Elections
- Scott Craver
- Nicholas Rosbrook
Zero-knowledge protocols for digital watermarking allow a copyright holder to prove that a signal is hidden in a cover without revealing the signal or where it is. This same approach can be applied outside of watermarking to other problems where anonymity is required, and in particular to secure elections. We outline steganographic zero-knowledge protocols---proving in zero knowledge the existence of a hidden object in a cover---and adapt them to prove that selected ballot values must exist within a volume of anonymized voter receipts. We also improve the space efficiency of such protocols by employing camouflaged elliptic curve cryptography.
SESSION: Biometrics
On the Feasibility of Post-Mortem Hand-Based Vascular Biometric Recognition
- Simon Kirchgasser
- Christof Kauba
- Bernhard Prommegger
- Fabio Monticelli
- Andreas Uhl
Recently, there has been growing interest in employing biometrics in post-mortem forensics, mainly to replace cost-intensive, radiology-based imaging devices. While it has been shown that post-mortem biometric recognition is feasible for fingerprints, face, and iris, no studies regarding post-mortem vasculature pattern recognition have been published. Based on the first reported post-mortem hand- and finger-vein dataset, the hypothesis that hand vasculature can be used as a post-mortem biometric modality is falsified. Using an indirect proof, it is shown that no usable vascular features are present in the small amount of sample data collected, by visual inspection as well as by applying several biometric quality metrics, which confirm that hand-based vasculature biometrics cannot be used as a post-mortem biometric modality.
Differentially Private Adversarial Auto-Encoder to Protect Gender in Voice Biometrics
- Oubaïda Chouchane
- Michele Panariello
- Oualid Zari
- Ismet Kerenciler
- Imen Chihaoui
- Massimiliano Todisco
- Melek Önen
Over the last decade, the use of Automatic Speaker Verification (ASV) systems has become increasingly widespread in response to the growing need for secure and efficient identity verification methods. The voice data encompasses a wealth of personal information, which includes but is not limited to gender, age, health condition, stress levels, and geographical and socio-cultural origins. These attributes, known as soft biometrics, are private and the user may wish to keep them confidential. However, with the advancement of machine learning algorithms, soft biometrics can be inferred automatically, creating the potential for unauthorized use. As such, it is crucial to ensure the protection of these personal data that are inherent within the voice while retaining the utility of identity recognition. In this paper, we present an adversarial Auto-Encoder-based approach to hide gender-related information in speaker embeddings, while preserving their effectiveness for speaker verification. We use an adversarial procedure against a gender classifier and incorporate a layer based on the Laplace mechanism into the Auto-Encoder architecture. This layer adds Laplace noise for more robust gender concealment and ensures differential privacy guarantees during inference for the output speaker embeddings. Experiments conducted on the VoxCeleb dataset demonstrate that speaker verification tasks can be effectively carried out while concealing speaker gender and ensuring differential privacy guarantees; moreover, the intensity of the Laplace noise can be tuned to select the desired trade-off between privacy and utility.
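As a hedged illustration of the Laplace-mechanism layer described above (our own sketch: the class name, sensitivity handling, and defaults are assumptions rather than the authors' implementation; in practice the embeddings would be clipped or normalized so that the L1 sensitivity is actually bounded):

```python
import torch
import torch.nn as nn

class LaplaceNoiseLayer(nn.Module):
    """Adds Laplace noise with scale b = sensitivity / epsilon to its input,
    i.e., the classic Laplace mechanism applied to each output embedding."""

    def __init__(self, sensitivity: float = 1.0, epsilon: float = 1.0):
        super().__init__()
        self.scale = sensitivity / epsilon

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        noise = torch.distributions.Laplace(0.0, self.scale).sample(x.shape)
        return x + noise.to(x.device)
```

Tuning `epsilon` trades privacy against utility: smaller values add more noise (stronger gender concealment) at some cost to speaker-verification accuracy.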
Hand Vein Spoof GANs: Pitfalls in the Assessment of Synthetic Presentation Attack Artefacts
- Andreas Vorderleitner
- Jutta Hämmerle-Uhl
- Andreas Uhl
I2I translation techniques for unpaired data are used for the creation of biometric presentation attack artefact samples. For the assessment of these synthetic samples, we analyse their behaviour when attacking hand vein recognition systems, comparing the results to those obtained from actually crafted presentation attack samples. We observe that although visual appearance and sample set correspondence are surprisingly good, assessing the behaviour of the data in a conducted attack is more difficult. Even though for some recognition schemes we find good accordance in terms of IAPMR (for others we do not), the attack score distributions turn out to be highly dissimilar. More work is needed for a reliable assessment of such data, to be able to correctly interpret the corresponding results with respect to their usefulness in attack simulation.
First Learning Steps to Recognize Faces in the Noise
- Lukas Lamminger
- Heinz Hofbauer
- Andreas Uhl
A UNet-type encoder-decoder inpainting network is applied to weaken the protection strength of selectively encrypted face samples. Based on visual assessment, FaceQNet quality, and ArcFace recognition accuracy, the strategy is shown to be successful, albeit to a different extent depending on the original protection strength. For near-cryptographic protection strength, inpainting does not cause a practically relevant weakening, while for lower original protection strength it removes the protection almost entirely.
SESSION: Trends and Challenges in DeepFake Creation, Application, and Forensics (Special Session)
Comprehensive Dataset of Synthetic and Manipulated Overhead Imagery for Development and Evaluation of Forensic Tools
- Brandon B. May
- Kirill Trapeznikov
- Shengbang Fang
- Matthew Stamm
We present a first-of-its-kind dataset of overhead imagery for the development and evaluation of forensic tools. Our dataset consists of real, fully synthetic, and partially manipulated overhead imagery generated from a custom diffusion model trained on two sets of different zoom levels and on two sources of pristine data. We developed our model to support controllable generation of multiple manipulation categories, including fully synthetic imagery conditioned on real and generated base maps and on location. We also support partially inpainted imagery with the same conditioning options and with several types of manipulated content. The data consist of raw images and ground truth annotations describing the manipulation parameters. We also report benchmark performance on several tasks supported by our dataset, including detection of fully and partially manipulated imagery, manipulation localization, and classification.
MetaFake: Few-shot Face Forgery Detection with Meta Learning
- Nanqing Xu
- Weiwei Feng
With the remarkable progress achieved by facial forgery technologies, their potential security risks cause serious concern to society, since they can easily fool face recognition systems and even human beings. Current forgery detection methods achieve excellent performance when trained with a large-scale database. However, they usually fail to give correct predictions in real applications where only a few fake samples created by unseen forgery methods are available. In this paper, we propose a novel method, dubbed MetaFake, to boost the performance of identifying samples generated by unseen techniques, requiring only a few fake samples. MetaFake exploits part features located by meta forgery prototypes that are created adaptively for each task. The local-aggregated module helps to integrate these part features for the final prediction. Besides, we establish a large database of about 0.6 million images to verify the proposed method, including fake samples synthesized by 18 forgery techniques. Extensive experiments demonstrate the superior performance of the proposed method.
Synthesized Speech Attribution Using The Patchout Spectrogram Attribution Transformer
- Kratika Bhagtani
- Emily R. Bartusiak
- Amit Kumar Singh Yadav
- Paolo Bestagini
- Edward J. Delp
The malicious use of synthetic speech has increased with the recent availability of speech generation tools. It is important to determine whether a speech signal is authentic (spoken by a human) or synthesized, and to determine the generation method used to create it. Identifying the synthesis method is known as synthetic speech attribution. In this paper, we propose a transformer deep learning method that analyzes mel-spectrograms for synthetic speech attribution. Our method, known as the Patchout Spectrogram Attribution Transformer (PSAT), can distinguish new, unseen speech generation methods from those seen during training. PSAT demonstrates high performance in attributing synthetic speech signals. Evaluation on the DARPA SemaFor Audio Attribution Dataset and the ASVSpoof2019 Dataset shows that our method achieves more than 95% accuracy in synthetic speech attribution and performs better than existing deep learning approaches.
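As a hedged sketch of the mel-spectrogram front end that a transformer-based attribution model of this kind would consume (the parameter values are illustrative defaults, not those reported in the paper):

```python
import librosa
import numpy as np

def mel_spectrogram(path, sr=16000, n_mels=128, n_fft=1024, hop_length=256):
    """Load an audio file and return its log-mel spectrogram, shape (n_mels, frames)."""
    y, _ = librosa.load(path, sr=sr, mono=True)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels
    )
    return librosa.power_to_db(mel, ref=np.max)
```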
Extracting Efficient Spectrograms From MP3 Compressed Speech Signals for Synthetic Speech Detection
- Ziyue Xiang
- Amit Kumar Singh Yadav
- Stefano Tubaro
- Paolo Bestagini
- Edward J. Delp
Many speech signals are compressed with MP3 to reduce the data rate. Many synthetic speech detection methods use the spectrogram of the speech signal, which usually requires the speech signal to be fully decompressed. We show that the design of MP3 compression allows one to approximate the spectrogram of the MP3-compressed speech efficiently without fully decoding the compressed speech. We denote the spectrograms obtained using our proposed approach as Efficient Spectrograms (E-Specs). E-Spec can reduce the complexity of spectrogram computation by ~77.60 percentage points (p.p.) and save ~37.87 p.p. of MP3 decoding time. E-Spec bypasses the reconstruction artifacts introduced by the MP3 synthesis filterbank, which makes it useful in speech forensics tasks. We tested E-Spec on synthetic speech detection, where a detector is asked to determine whether a speech signal is synthesized or recorded from a human. We examined 4 different neural network architectures to evaluate the performance of E-Spec compared to speech features extracted from the fully decoded speech signal. E-Spec achieved the best synthetic speech detection performance for 3 architectures; it also achieved the best overall detection performance across architectures. The computation of E-Spec is an approximation of the Short-Time Fourier Transform (STFT). E-Spec can be extended to other audio compression methods.
Exposing Deepfakes using Dual-Channel Network with Multi-Axis Attention and Frequency Analysis
- Yue Zhou
- Bing Fan
- Pradeep K. Atrey
- Feng Ding
This paper proposes a dual-channel network for DeepFake detection. The network comprises two channels: one using stacked MaxViT blocks to process the downsampled original images, and the other using stacked ResNet basic blocks to capture features from the discrete cosine transform spectra of the images. The features extracted from the two channels are concatenated and passed through a linear layer to train the entire model for exposing DeepFakes. Experimental results demonstrate that the proposed method achieves satisfactory forensic performance. Moreover, cross-dataset evaluations show that it also generalizes well.
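A minimal sketch of the dual-channel layout described above (our own illustration: the backbones are placeholders for the MaxViT and ResNet stacks, and the log-magnitude DCT preprocessing is an assumption):

```python
import numpy as np
import torch
import torch.nn as nn
from scipy.fft import dctn

def dct_channel(image: np.ndarray) -> np.ndarray:
    """Per-channel 2-D DCT of an HxWxC image, log-compressed for dynamic range."""
    coeffs = np.stack(
        [dctn(image[..., c], norm="ortho") for c in range(image.shape[-1])], axis=-1
    )
    return np.log1p(np.abs(coeffs)).astype(np.float32)

class DualChannelDetector(nn.Module):
    def __init__(self, spatial_backbone: nn.Module, freq_backbone: nn.Module, feat_dim: int):
        super().__init__()
        self.spatial_backbone = spatial_backbone   # placeholder for a MaxViT-style stack
        self.freq_backbone = freq_backbone         # placeholder for a ResNet-style stack
        self.classifier = nn.Linear(2 * feat_dim, 2)  # real vs. fake

    def forward(self, x_spatial, x_freq):
        f = torch.cat([self.spatial_backbone(x_spatial), self.freq_backbone(x_freq)], dim=1)
        return self.classifier(f)
```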
Fooling State-of-the-art Deepfake Detection with High-quality Deepfakes
- Arian Beckmann
- Anna Hilsmann
- Peter Eisert
Due to the rising threat of deepfakes to security and privacy, it is of utmost importance to develop robust and reliable detectors. In this paper, we examine the need for high-quality samples in the training datasets of such detectors. Accordingly, we show that deepfake detectors proven to generalize well on multiple research datasets still struggle in real-world scenarios with well-crafted fakes. First, we propose a novel autoencoder for face swapping alongside an advanced face blending technique, which we utilize to generate 90 high-quality deepfakes. Second, we feed those fakes to a state-of-the-art detector, causing its performance to decrease drastically. Moreover, we fine-tune the detector on our fakes and demonstrate that they contain useful clues for the detection of manipulations. Overall, our results provide insights into the generalization of deepfake detectors and suggest that their training datasets should be complemented with high-quality fakes, since training on mere research data is insufficient.