Abstract
To eliminate the effect that the imbalance between the numbers of bona fide and fake speech samples in the training dataset has on the performance of synthetic speech detection systems, and to further improve detection accuracy, a synthetic speech detection method based on self-supervised contrastive learning is proposed. The method treats pitch-transformed samples as negative samples and trains a neural network so that anchor sample features differ from negative sample features, which drives the network to extract features that are sensitive to pitch transformation. A deep residual network then serves as the back-end classifier that judges whether the speech is genuine. Experimental results show that, compared with methods using traditional hand-crafted acoustic features, deep learning-based spoofed speech detection systems, and end-to-end spoofed speech detection systems, the proposed method significantly reduces the equal error rate. Because the method trains the network to extract pitch-sensitive features and is not affected by the imbalance between bona fide and fake speech in the dataset, it significantly improves the accuracy of synthetic speech detection.
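The abstract does not state the training objective, so the following is only a minimal sketch of the idea that anchor features should differ from the features of pitch-transformed negatives. It assumes a hinge penalty on cosine similarity; the function names `cosine_similarity` and `negative_pair_loss` are hypothetical, and the feature vectors stand in for the output of whatever encoder network is used.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def negative_pair_loss(anchor_features, shifted_features):
    """Assumed loss, not taken from the paper.

    Each anchor vector is paired with the feature vector of its
    pitch-transformed copy (the negative sample). Positive cosine
    similarity is penalised, so minimising the loss pushes each
    anchor away from its negative.
    """
    sims = [
        cosine_similarity(a, n)
        for a, n in zip(anchor_features, shifted_features)
    ]
    return sum(max(0.0, s) for s in sims) / len(sims)
```

Under this assumption, a pair of identical features incurs the maximum penalty of 1, while orthogonal or opposed features incur none, so an encoder that ignores pitch changes is penalised most.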
Authors
杨曼
简志华
梁承涵
YANG Man; JIAN Zhihua; LIANG Chenghan (School of Communication Engineering, Hangzhou Dianzi University, Hangzhou 310018, China)
Source
《电信科学》
Peking University Core Journal (北大核心)
2024, No. 11, pp. 40-49 (10 pages)
Telecommunications Science
Funding
Supported by the National Natural Science Foundation of China (No. 61201301, No. 61772166).
Keywords
spoofing speech detection
synthesized speech detection
self-supervised contrastive learning
deep residual network
pitch transformation