Abstract
With the rapid growth of the online music industry, demand for automatic music retrieval and classification systems is increasing. Correctly labeling music genres by computer is an important prerequisite for accurate classification of music types and for the performance of music recommendation systems. Because convolution operations cannot extract global representations, deep convolutional neural networks model the global structure of music genre data poorly. To address this problem, an automatic music genre classification method based on the Vision Transformer (ViT) neural network was proposed. After the audio to be classified is preprocessed, the Short-Time Fourier Transform (STFT) is used to convert it into spectrogram slices of uniform size, which transforms the music into frequency-domain features. To avoid overfitting during training, the spectrogram slice set is augmented by adding white noise. The generated slice set and its augmented counterpart are then used to train the constructed ViT network, which performs automatic classification of music genres. Simulation results show that the constructed ViT network reaches a test accuracy of 91.01% on the public music genre dataset GTZAN, which is 1.00 to 5.00 percentage points higher than music genre classification methods based on traditional Convolutional Neural Networks (CNNs) such as AlexNet, AlexNet-enhanced and VGG16.
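The preprocessing steps named in the abstract (STFT, uniform-size spectrogram slices, white-noise augmentation) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the FFT size, hop length, slice width, log scaling and signal-to-noise ratio are all assumptions chosen for illustration, since the abstract does not state them, and a synthetic tone stands in for a GTZAN clip.

```python
import numpy as np

def stft_magnitude(signal, n_fft=512, hop=256):
    """Log-magnitude STFT of a 1-D signal, shape (n_fft // 2 + 1, n_frames).
    n_fft and hop are illustrative values, not taken from the paper."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    # Cut the signal into overlapping windowed frames, one frame per column.
    frames = np.stack(
        [signal[i * hop : i * hop + n_fft] * window for i in range(n_frames)],
        axis=1,
    )
    spectrum = np.fft.rfft(frames, axis=0)
    return np.log1p(np.abs(spectrum))

def slice_spectrogram(spec, width=128):
    """Cut a spectrogram into non-overlapping slices of equal width.
    Trailing frames that do not fill a whole slice are dropped."""
    n_slices = spec.shape[1] // width
    return np.stack([spec[:, k * width : (k + 1) * width] for k in range(n_slices)])

def add_white_noise(slices, snr_db=20.0, rng=None):
    """Return a copy of the slices with Gaussian white noise added.
    snr_db (assumed value) sets the ratio of mean slice power to noise power."""
    rng = np.random.default_rng() if rng is None else rng
    signal_power = np.mean(slices ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=slices.shape)
    return slices + noise

# Synthetic 3-second tone used as a placeholder for a real audio clip.
sr = 22050
t = np.arange(3 * sr) / sr
clip = np.sin(2 * np.pi * 440.0 * t)

slices = slice_spectrogram(stft_magnitude(clip))
augmented = add_white_noise(slices, rng=np.random.default_rng(0))
# Training set = original slices plus their noisy counterparts.
train_set = np.concatenate([slices, augmented])
print(slices.shape, train_set.shape)
```

In this sketch each slice keeps the genre label of the clip it came from, so the augmented set doubles the number of labeled training images before they are passed to a classifier such as a ViT.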
Authors
董安明
刘宗银
禹继国
韩玉冰
周酉
DONG Anming; LIU Zongyin; YU Jiguo; HAN Yubing; ZHOU You (Big Data Institute, Qilu University of Technology, Jinan, Shandong 250353, China; School of Mathematics and Statistics, Qilu University of Technology, Jinan, Shandong 250353, China; School of Computer Science and Technology, Qilu University of Technology, Jinan, Shandong 250353, China; Shandong HiCon New Media Institute Company Limited, Jinan, Shandong 250013, China)
Source
《计算机应用》
CSCD
Peking University Core Journal
2022, Issue S01, pp. 54-58 (5 pages)
Journal of Computer Applications
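Funding
National Key Research and Development Program of China (2017YFB1400500)
Key Research and Development Program of Shandong Province (2019JZZY020124)
Natural Science Foundation of Shandong Province (ZR2017BF012)
Youth Innovation Team Development Program of Higher Education Institutions of Shandong Province (2019KJN010)
Basic Research Enhancement Program of the Computer Science and Technology Discipline, Qilu University of Technology (Shandong Academy of Sciences) (2021JC02014)
Talent Training Promotion Program of the Computer Science and Technology Discipline, Qilu University of Technology (Shandong Academy of Sciences) (2021PY05001).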
Keywords
Vision Transformer network
music genre
feature transformation
spectrogram
deep learning
data augmentation