DDAM '22: Proceedings of the 1st International Workshop on Deepfake Detection for Audio Multimedia

SESSION: Keynote Talks

Lessons Learned from ASVspoof and Remaining Challenges

  • Junichi Yamagishi

Although speech technology reproducing an individual's voice is expected to bring new value to entertainment, it may cause security problems in speaker recognition systems if misused. In addition, there is a possibility of this technology being used for telephone fraud and information manipulation. Recognizing the importance of this issue, we have been working on speech anti-spoofing countermeasures since 2010, including building large-scale speech databases and organizing a series of ASVspoof challenges to evaluate the detectors on the shared database. This presentation will summarize the essential findings and lessons we have learned recently [1] and present the remaining challenges we are currently facing and the results we have achieved to date [2-4]. Examples of the lessons include a) sensitivity to hyper-parameters and features in deep learning-based countermeasure models and the importance of designing a network structure and learning loss that are stable even under different conditions, and b) effectiveness of ensemble learning of multiple models trained on different types of acoustic features and ineffectiveness of ensemble learning of different network structures using similar acoustic features. The ongoing research topics include 1) front-end features that are robust to domain and channel mismatches [2], 2) how to automatically expand the countermeasure database in a situation where new speech synthesis methods are being invented regularly [3], and 3) detection of partial synthetic regions to provide evidence for XAI anti-spoofing countermeasures [4]. Through these new attempts, the importance of studying the issue of speech anti-spoofing countermeasures from various angles, in addition to reducing EERs, will be illustrated.

SESSION: Session 1: Deepfake Audio Detection

Detection of Synthetic Speech Based on Spectrum Defects

  • JiaCheng Deng
  • Terui Mao
  • Diqun Yan
  • Li Dong
  • Mingyu Dong

Synthetic spoofing speech has become a threat to online communication and to automatic speaker verification (ASV) systems based on deep learning, since synthesis models can now produce anyone's voice. The first Audio Deep Synthesis Detection Challenge (ADD 2022) was launched to spur researchers around the world to build innovative new technologies that can further accelerate and foster research on detecting deep synthesis and manipulated speech. This paper presents a spoofing detection system submitted to the detection task (FG-D) of ADD 2022 Track 3.2. The system consists of two parts. First, Mel-frequency cepstral coefficients (MFCCs), linear frequency cepstral coefficients (LFCCs), delta coefficients, and delta-delta coefficients derived from the speech spectrogram are fed into DenseNet to build the DenseNet detection system (DDS). Then, Mute Segment Classifier (MSC), High-Frequency Classifier (HFC), and Block Spectrogram Classifier (BSC) algorithms are designed to target the defects of synthetic speech in the spectrogram, forming the spectrum defect detection system (SPECT). The fusion system composed of SPECT and DDS achieves an EER of 8.5% in ADD FG-D, and our final submission ranks 6th in the evaluation phase of ADD FG-D.
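
As a minimal sketch of the DDS front-end described above (cepstral coefficients plus their dynamics stacked into a feature map), the following uses librosa; the exact frame and filterbank settings are assumptions, and LFCCs would be computed analogously with a linear filterbank.

```python
# Sketch: delta-augmented cepstral features as an input map for a DenseNet classifier.
# The coefficient count and sample rate are illustrative, not the authors' exact settings.
import librosa
import numpy as np

def cepstral_features(path, n_coeff=20):
    y, sr = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_coeff)   # (n_coeff, frames)
    delta = librosa.feature.delta(mfcc)                       # first-order dynamics
    delta2 = librosa.feature.delta(mfcc, order=2)              # second-order dynamics
    # LFCCs would follow the same pattern with a linear (instead of mel) filterbank.
    return np.vstack([mfcc, delta, delta2])                    # stacked feature map
```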

Low-quality Fake Audio Detection through Frequency Feature Masking

  • Il-Youp Kwak
  • Sunmook Choi
  • Jonghoon Yang
  • Yerin Lee
  • Soyul Han
  • Seungsang Oh

The first Audio Deep Synthesis Detection Challenge (ADD 2022) was held, covering audio deepfake detection, audio deep synthesis, the audio fake game, and adversarial attacks. Our team participated in track 1, classifying bona fide and fake utterances in noisy environments. Through exploratory data analysis, we found that noisy signals appear in similar frequency bands for given voice samples. If a model is trained to rely heavily on information in frequency bands where noise exists, performance will be poor. In this paper, we propose a data augmentation method, Frequency Feature Masking (FFM), that randomly masks frequency bands. FFM makes a model robust by preventing it from relying on specific frequency bands and mitigates overfitting (see the sketch below). We applied FFM and mixup augmentation to five spectrogram-based deep neural network architectures that performed well for spoofing detection using mel-spectrogram and constant Q transform (CQT) features. Our best submission achieved an EER of 23.8% and ranked 3rd on track 1. To demonstrate the usefulness of the proposed FFM augmentation, we further experimented with it on the ASVspoof 2019 Logical Access (LA) dataset.
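
A minimal sketch of the frequency-masking idea described above: contiguous frequency bands of a spectrogram are zeroed at random during training. The band count and maximum width are illustrative choices, not the authors' settings.

```python
# Frequency Feature Masking (FFM) sketch: randomly mask frequency bands of a spectrogram.
import numpy as np

def frequency_feature_masking(spec, max_bands=2, max_width=20):
    """spec: (freq_bins, time_frames) spectrogram; returns an augmented copy."""
    spec = spec.copy()
    n_freq = spec.shape[0]
    for _ in range(np.random.randint(1, max_bands + 1)):
        width = np.random.randint(1, max_width + 1)
        start = np.random.randint(0, max(1, n_freq - width))
        spec[start:start + width, :] = 0.0        # zero out the selected frequency band
    return spec
```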

Audio Deepfake Detection Based on a Combination of F0 Information and Real Plus Imaginary Spectrogram Features

  • Jun Xue
  • Cunhang Fan
  • Zhao Lv
  • Jianhua Tao
  • Jiangyan Yi
  • Chengshi Zheng
  • Zhengqi Wen
  • Minmin Yuan
  • Shegang Shao

Recently, pioneering research works have proposed a large number of acoustic features (log power spectrogram, linear frequency cepstral coefficients, constant Q cepstral coefficients, etc.) for audio deepfake detection, obtaining good performance and showing that different subbands contribute differently to detection. However, these works lack an explanation of the specific information carried by each subband, and such features also discard information such as phase. In speech synthesis, fundamental frequency (F0) information is used to improve the quality of synthetic speech, yet the F0 of synthetic speech remains overly smooth and differs significantly from that of real speech. F0 is therefore expected to be important for discriminating between bona fide and fake speech, but it cannot be used directly due to its irregular distribution. Instead, the frequency band containing most of the F0 is selected as the input feature. Meanwhile, to make full use of the phase and full-band information, we also propose to use real and imaginary spectrogram features as complementary inputs and to model the disjoint subbands separately. Finally, the results of the F0 and the real and imaginary spectrogram features are fused. Experimental results on the ASVspoof 2019 LA dataset show that our proposed system is very effective for the audio deepfake detection task, achieving an equal error rate (EER) of 0.43%, which surpasses almost all existing systems.
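
An illustrative sketch of the two complementary inputs described above: a low-frequency sub-band expected to contain most of the F0 energy, and real/imaginary spectrogram channels that retain phase. The 400 Hz cutoff and STFT settings are assumptions for illustration only.

```python
# Sketch: F0 sub-band and real+imaginary spectrogram features from one utterance.
import numpy as np
import librosa

def f0_band_and_ri_spectrogram(y, sr=16000, n_fft=512, f0_cutoff_hz=400):
    stft = librosa.stft(y, n_fft=n_fft)                     # complex spectrogram
    real, imag = stft.real, stft.imag                        # phase-aware channels
    freqs = librosa.fft_frequencies(sr=sr, n_fft=n_fft)
    f0_band = np.abs(stft[freqs <= f0_cutoff_hz, :])         # low band covering typical F0 range
    return f0_band, np.stack([real, imag])                   # separate branches, fused later
```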

Fully Automated End-to-End Fake Audio Detection

  • Chenglong Wang
  • Jiangyan Yi
  • Jianhua Tao
  • Haiyang Sun
  • Xun Chen
  • Zhengkun Tian
  • Haoxin Ma
  • Cunhang Fan
  • Ruibo Fu

Existing fake audio detection systems often rely on expert experience to design acoustic features or to manually set the hyperparameters of the network structure. However, such manual tuning has a noticeable influence on the results, and it is almost impossible to find the best set of parameters by hand. This paper therefore proposes a fully automated end-to-end fake audio detection method. We first use a pre-trained wav2vec model to obtain a high-level representation of the speech. For the network structure, we use a modified version of differentiable architecture search (DARTS) named light-DARTS. It learns deep speech representations while automatically learning and optimizing complex neural structures consisting of convolutional operations and residual blocks. Experimental results on the ASVspoof 2019 LA dataset show that our proposed system achieves an equal error rate (EER) of 1.08%, which outperforms the state-of-the-art single systems.
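
As a hedged sketch of the first stage described above, the snippet below extracts high-level wav2vec representations using torchaudio's pretrained bundle; the exact checkpoint and layer selection used by the authors may differ.

```python
# Sketch: wav2vec 2.0 front-end features for a downstream (searched) classifier.
import torch
import torchaudio

bundle = torchaudio.pipelines.WAV2VEC2_BASE
model = bundle.get_model().eval()

waveform, sr = torchaudio.load("utterance.wav")               # path is a placeholder
waveform = torchaudio.functional.resample(waveform, sr, bundle.sample_rate)
with torch.no_grad():
    features, _ = model.extract_features(waveform)            # list of per-layer representations
# features[-1] (or a learned combination of layers) would feed the back-end classifier.
```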

A Comparative Study on Physical and Perceptual Features for Deepfake Audio Detection

  • Menglu Li
  • Yasaman Ahmadiadli
  • Xiao-Ping Zhang

With the development of deep learning techniques, audio content synthesis has entered a new era and poses a serious threat to daily life. The ASVspoof Challenge and the ADD Challenge have been launched to motivate the development of Deepfake audio detection algorithms. Current detection models, which consist of front-end feature extractors and back-end classifiers, mainly utilize physical features rather than perceptual features related to natural emotion or breathiness. Therefore, we provide a comprehensive study of 16 physical and perceptual features and evaluate their effectiveness on both Track 1 and Track 2 of the ADD Challenge. Based on the results, PLP, a perceptual feature, outperforms the other features on Track 1, while CQCC performs best on Track 2. Our experiments demonstrate the significance of perceptual features in detecting Deepfake audio. We also seek to explore the underlying characteristics that allow a feature to distinguish Deepfake audio from real audio. We perform statistical analysis on each feature to show its distribution differences between real and synthesized audio. This paper provides a potential direction for selecting appropriate feature extraction methods in future detection models.
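
A minimal sketch of the kind of per-feature distribution comparison mentioned above, using a two-sample Kolmogorov-Smirnov test. The arrays `real_feats` and `fake_feats` are hypothetical per-utterance feature values, not data from the paper.

```python
# Sketch: test whether a feature's distribution differs between real and synthesized audio.
import numpy as np
from scipy.stats import ks_2samp

def compare_feature_distributions(real_feats, fake_feats):
    stat, p_value = ks_2samp(real_feats, fake_feats)
    return stat, p_value   # small p-value: the feature separates the two classes

# Example with synthetic placeholder data:
rng = np.random.default_rng(0)
print(compare_feature_distributions(rng.normal(0, 1, 500), rng.normal(0.3, 1, 500)))
```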

Deepfake Detection System for the ADD Challenge Track 3.2 Based on Score Fusion

  • Yuxiang Zhang
  • Jingze Lu
  • Xingming Wang
  • Zhuo Li
  • Runqiu Xiao
  • Wenchao Wang
  • Ming Li
  • Pengyuan Zhang

This paper describes the deepfake audio detection system submitted to the Audio Deep Synthesis Detection (ADD) Challenge Track 3.2 and gives an analysis of score fusion. The proposed system is a score-level fusion of several light convolutional neural network (LCNN) based models. Various front-ends are used as input features, including the low-frequency short-time Fourier transform and the constant Q transform. Due to the complex noise and rich synthesis algorithms, it is difficult to obtain the desired performance using the training set directly; online data augmentation methods effectively improve the robustness of the fake audio detection systems. In particular, the reasons for the limited gains from score fusion are explored by visualizing the score distributions and comparing them with the score distributions on another dataset. Overfitting to the training set leads to extreme score values and low correlation between score distributions, which makes score fusion difficult. Fusion with a partially fake audio detection system further improves performance. The submission on track 3.2 obtained a weighted equal error rate (WEER) of 11.04%, which is among the best performing systems in the challenge.
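
A brief sketch of score-level fusion and EER computation as discussed above. The fusion weights and the threshold sweep are illustrative assumptions; the submitted system's weights are not given here.

```python
# Sketch: weighted score-level fusion of subsystem outputs and a simple EER estimate.
import numpy as np

def fuse_scores(score_matrix, weights):
    """score_matrix: (n_systems, n_trials); returns fused per-trial scores."""
    weights = np.asarray(weights, dtype=float) / np.sum(weights)
    return weights @ score_matrix

def equal_error_rate(scores, labels):
    """labels: 1 = bona fide, 0 = fake; higher score means more bona fide."""
    candidates = []
    for t in np.sort(np.unique(scores)):
        far = np.mean(scores[labels == 0] >= t)   # false acceptance rate
        frr = np.mean(scores[labels == 1] < t)    # false rejection rate
        candidates.append((abs(far - frr), (far + frr) / 2))
    return min(candidates)[1]                     # EER at the crossover threshold
```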

SESSION: Session 2: Deepfake Audio Generation and Evaluation

Singing-Tacotron: Global Duration Control Attention and Dynamic Filter for End-to-end Singing Voice Synthesis

  • Tao Wang
  • Ruibo Fu
  • Jiangyan Yi
  • Zhengqi Wen
  • Jianhua Tao

End-to-end singing voice synthesis (SVS) is attractive because it avoids pre-aligned data. However, it is difficult for the automatically learned alignment between the singing voice and the lyrics to match the duration information in the musical score, which can lead to model instability or even failure to synthesize the voice. To learn accurate alignment information automatically, this paper proposes an end-to-end SVS framework named Singing-Tacotron. The main difference from Tacotron is that the generated speech can be controlled significantly by the musical score's duration information. Firstly, we propose a global duration control attention mechanism for the SVS model, which can control each phoneme's duration. Secondly, a duration encoder is proposed to learn a set of global transition tokens from the musical score. These transition tokens help the attention mechanism decide whether to move to the next phoneme or stay at each decoding step. Thirdly, to further improve the model's stability, a dynamic filter is designed to help the model overcome noise interference and pay more attention to local context information. Subjective and objective evaluations (examples are available at https://hairuo55.github.io/SingingTacotron) verify the effectiveness of the method. Furthermore, the role of global transition tokens and the effect of duration control are explored.
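
The following is a highly simplified sketch of the intent of a transition token: at each decoder step, a scalar gate decides whether the attention distribution stays on the current phoneme or advances to the next one. This illustrates the idea only and is not the paper's exact formulation.

```python
# Sketch: transition-token-gated attention update (illustrative, not the authors' model).
import torch

def duration_controlled_attention_step(prev_alignment, transition_token):
    """prev_alignment: (batch, n_phonemes) attention weights from the previous step.
    transition_token: (batch, 1) in [0, 1], derived from musical-score durations."""
    shifted = torch.roll(prev_alignment, shifts=1, dims=1)  # mass moved one phoneme forward
    shifted[:, 0] = 0.0                                     # no wrap-around to the first phoneme
    alignment = (1 - transition_token) * prev_alignment + transition_token * shifted
    return alignment / alignment.sum(dim=1, keepdim=True)   # renormalize the distribution
```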

An Initial Investigation for Detecting Vocoder Fingerprints of Fake Audio

  • Xinrui Yan
  • Jiangyan Yi
  • Jianhua Tao
  • Chenglong Wang
  • Haoxin Ma
  • Tao Wang
  • Shiming Wang
  • Ruibo Fu

Many effective attempts have been made at fake audio detection. However, they can only provide detection results and offer no countermeasures to curb the harm. For many practical applications, it is also necessary to know which model or algorithm generated the fake audio. Therefore, we propose a new task: detecting the vocoder fingerprints of fake audio. Experiments are conducted on datasets synthesized by eight state-of-the-art vocoders. We have preliminarily explored the features and model architectures. The t-SNE visualization shows that different vocoders leave distinct vocoder fingerprints.
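
A small sketch of the t-SNE visualization described above: utterance-level embeddings are projected to 2D and colored by the vocoder that generated each utterance. The `embeddings` and `vocoder_ids` arrays are hypothetical placeholders.

```python
# Sketch: visualize whether embeddings cluster by generating vocoder.
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_vocoder_fingerprints(embeddings, vocoder_ids):
    points = TSNE(n_components=2, init="pca", random_state=0).fit_transform(embeddings)
    plt.scatter(points[:, 0], points[:, 1], c=vocoder_ids, cmap="tab10", s=5)
    plt.title("t-SNE of embeddings grouped by vocoder")
    plt.show()
```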

Deep Spectro-temporal Artifacts for Detecting Synthesized Speech

  • Xiaohui Liu
  • Meng Liu
  • Lin Zhang
  • Linjuan Zhang
  • Chang Zeng
  • Kai Li
  • Nan Li
  • Kong Aik Lee
  • Longbiao Wang
  • Jianwu Dang

The Audio Deep Synthesis Detection (ADD) Challenge has been held to detect generated human-like speech. With our submitted system, this paper provides an overall assessment of track 1 (Low-quality Fake Audio Detection) and track 2 (Partially Fake Audio Detection). Spectro-temporal artifacts were detected using raw temporal signals, spectral features, and deep embedding features. To address track 1, low-quality data augmentation, domain adaptation via fine-tuning, and fusion of various complementary feature information were combined in our system. Furthermore, we analyzed the clustering characteristics of subsystems with different features using visualization methods and explained the effectiveness of our proposed greedy fusion strategy. For track 2, frame transitions and smoothing were detected using a self-supervised learning structure to capture the manipulations of partially fake (PF) attacks in the time domain. We ranked 4th and 5th in track 1 and track 2, respectively.
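
A hedged sketch of a greedy fusion strategy of the kind named above: subsystems are added one at a time, each step choosing the candidate that most reduces EER on a development set, and stopping when no candidate helps. Equal-weight score averaging is an assumption for illustration.

```python
# Sketch: greedy selection of subsystems for score-level fusion.
import numpy as np

def greedy_fusion(subsystem_scores, labels, eer_fn):
    """subsystem_scores: dict name -> (n_trials,) scores; eer_fn(scores, labels) -> EER."""
    selected, best_eer = [], float("inf")
    while True:
        candidates = {
            name: eer_fn(np.mean([subsystem_scores[n] for n in selected + [name]], axis=0), labels)
            for name in subsystem_scores if name not in selected
        }
        if not candidates:
            break
        name, eer = min(candidates.items(), key=lambda kv: kv[1])
        if eer >= best_eer:          # stop when adding any remaining subsystem no longer helps
            break
        selected, best_eer = selected + [name], eer
    return selected, best_eer
```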

Acoustic or Pattern? Speech Spoofing Countermeasure based on Image Pre-training Models

  • Jingze Lu
  • Zhuo Li
  • Yuxiang Zhang
  • Wenchao Wang
  • Pengyuan Zhang

Traditional speech spoofing countermeasures (CM) typically contain a front-end that extracts a two-dimensional feature from the waveform and a convolutional neural network (CNN) based back-end classifier. This pipeline is, to some degree, similar to an image classification task. Pre-training is a widely used paradigm in many fields, and self-supervised pre-trained front-ends such as Wav2Vec 2.0 have shown substantial improvements in the speech spoofing detection task. However, these pre-trained models are trained only on bona fide utterances. Moreover, acoustic pre-trained front-ends can also be used in text-to-speech (TTS) and voice conversion (VC) tasks, which suggests that they learn commonalities of speech rather than discriminative information between real and fake data. Since the speech spoofing detection task and the image classification task share the same pipeline, and based on the hypothesis that CNNs follow the same pattern in capturing artefacts in both tasks, we counterintuitively apply image pre-trained CNN models to detect spoofed utterances. To supplement the model with potentially missing acoustic information, we concatenate Jitter and Shimmer features to the output embedding. Our proposed CM achieves top-level performance on the ASVspoof 2019 dataset.
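
A sketch of the pipeline described above: an ImageNet-pretrained CNN backbone consumes the spectrogram as an "image", and Jitter/Shimmer values are concatenated to the embedding before classification. The choice of backbone, input layout, and dimensions are assumptions for illustration.

```python
# Sketch: image-pretrained CNN countermeasure with acoustic (Jitter/Shimmer) side features.
import torch
import torch.nn as nn
import torchvision

class ImagePretrainedCM(nn.Module):
    def __init__(self, n_acoustic=2):
        super().__init__()
        backbone = torchvision.models.resnet18(weights="IMAGENET1K_V1")
        backbone.fc = nn.Identity()                        # keep the 512-d embedding
        self.backbone = backbone
        self.classifier = nn.Linear(512 + n_acoustic, 2)   # bona fide vs spoof

    def forward(self, spec_image, jitter_shimmer):
        # spec_image: (batch, 3, H, W) spectrogram replicated to 3 channels
        emb = self.backbone(spec_image)
        return self.classifier(torch.cat([emb, jitter_shimmer], dim=1))
```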

Human Perception of Audio Deepfakes

  • Nicolas M. Müller
  • Karla Pizzi
  • Jennifer Williams

The recent emergence of deepfakes has brought manipulated and generated content to the forefront of machine learning research. Automatic detection of deepfakes has seen many new machine learning techniques; human detection capabilities, however, are far less explored. In this paper, we present results comparing the abilities of humans and machines at detecting audio deepfakes used to imitate someone's voice. For this, we use a web-based application framework formulated as a game in which participants were asked to distinguish between real and fake audio samples. In our experiment, 410 unique users competed against a state-of-the-art AI deepfake detection algorithm over a total of 13,229 rounds of the game. We find that humans and deepfake detection algorithms share similar strengths and weaknesses, both struggling to detect certain types of attacks. This contrasts with the superhuman performance of AI in many application areas such as object detection or face recognition. Concerning human success factors, we find that IT professionals have no advantage over non-professionals, but native speakers have an advantage over non-native speakers. Additionally, we find that older participants tend to be more susceptible than younger ones. These insights may be helpful when designing future cybersecurity training for humans as well as developing better detection algorithms.

Improving Spoofing Capability for End-to-end Any-to-many Voice Conversion

  • Hua Hua
  • Ziyi Chen
  • Yuxiang Zhang
  • Ming Li
  • Pengyuan Zhang

Audio deep synthesis techniques can now generate high-quality speech whose authenticity is difficult for humans to recognize. Meanwhile, many anti-spoofing systems have been developed to capture artifacts in synthesized speech that are imperceptible to human hearing, and a continuously escalating race of 'attacking and defending' in voice deepfakes has started. Hence, to further improve the probability of successfully cheating anti-spoofing systems, we propose a fully end-to-end, any-to-many voice conversion method based on a non-autoregressive structure, with the addition of two light but strong post-processing strategies, namely silence replacement and global noise perturbation. Experimental results show that the proposed method performs better than current baselines at fooling several state-of-the-art anti-spoofing systems. Better naturalness and speaker similarity are also achieved, so the proposed method also shows strong deception performance against human listeners.
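
Below is a minimal sketch of one plausible reading of the two post-processing ideas named above. The noise level, frame length, and silence-detection threshold are illustrative assumptions, not the authors' settings.

```python
# Sketch: global noise perturbation and silence replacement on a converted waveform.
import numpy as np

def global_noise_perturbation(wav, snr_db=40.0):
    """Add low-level Gaussian noise across the whole converted waveform."""
    noise = np.random.randn(len(wav))
    scale = np.sqrt(np.mean(wav ** 2) / (10 ** (snr_db / 10) * np.mean(noise ** 2)))
    return wav + scale * noise

def silence_replacement(converted, reference, frame=400, energy_thresh=1e-4):
    """Replace near-silent frames of the converted audio with frames from a real recording."""
    out = converted.copy()
    for start in range(0, min(len(converted), len(reference)) - frame, frame):
        if np.mean(converted[start:start + frame] ** 2) < energy_thresh:
            out[start:start + frame] = reference[start:start + frame]
    return out
```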