Abstract
Machines still answer riddles poorly, in part because riddle question-answering datasets are scarce. To address this, this paper constructs and evaluates a large-scale, multi-type Chinese riddle question-answering dataset. A large number of Chinese riddles were first crawled from riddle websites. After data cleaning, filtering, and classification, four distractor options were generated for each riddle question-answer pair using a combination of manual and automatic generation. The resulting dataset contains 1,000 Chinese brain-teaser questions, 300 Chinese number riddles, and 16,766 Chinese character riddles. Experiments on several baseline question-answering models show that the best existing model reaches an accuracy of only 37.56% on this dataset, well below the human accuracy of 77.33%. The dataset is intended to help advance intelligent riddle question-answering models.
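The multiple-choice setup described in the abstract can be illustrated with a minimal Python sketch. It assumes each item pairs the correct answer with its four distractors, giving five options, so uniform random guessing would score about 20%. The `RiddleItem` schema, the function names, and the sample riddle are hypothetical and are not taken from the paper or its dataset; only the three subset sizes come from the abstract.

```python
import random
from dataclasses import dataclass

# Subset sizes as reported in the abstract.
DATASET_SIZES = {"brain_teaser": 1000, "number_riddle": 300, "character_riddle": 16766}


@dataclass
class RiddleItem:
    """One multiple-choice riddle (hypothetical schema, not the paper's)."""
    question: str
    options: list[str]   # correct answer plus four distractors, shuffled
    answer_index: int    # position of the correct answer in `options`


def build_item(question, answer, distractors, rng):
    """Shuffle the correct answer in among exactly four distinct distractors."""
    if len(distractors) != 4 or answer in distractors:
        raise ValueError("expected four distractors, none equal to the answer")
    options = [answer, *distractors]
    rng.shuffle(options)
    return RiddleItem(question, options, options.index(answer))


def accuracy(items, predictions):
    """Fraction of items whose predicted option index matches the answer."""
    if not items or len(items) != len(predictions):
        raise ValueError("items and predictions must be non-empty and aligned")
    correct = sum(pred == item.answer_index for item, pred in zip(items, predictions))
    return correct / len(items)


# Illustrative item only; it is not drawn from the dataset.
rng = random.Random(0)
item = build_item(
    "What has keys but cannot open locks?",
    "piano",
    ["door", "map", "clock", "river"],
    rng,
)
total_items = sum(DATASET_SIZES.values())   # 1000 + 300 + 16766
chance_accuracy = 1 / len(item.options)     # uniform guessing over five options
```

Under the five-option assumption, the reported 37.56% best-model accuracy sits above the 20% chance level but far below the 77.33% human accuracy.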
Authors
李建华
陈涛
贾旭东
常青玲
LI Jian-hua; CHEN Tao; JIA Xu-dong; CHANG Qing-ling (Faculty of Intelligent Manufacturing, Wuyi University, Jiangmen 529020, China; College of Engineering and Computer Science, California State University Northridge, Northridge 91330, USA)
Source
《五邑大学学报(自然科学版)》
CAS
2023, No. 4, pp. 38-46 (9 pages)
Journal of Wuyi University(Natural Science Edition)
Funding
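Guangdong Province 2019 Provincially Funded High-Level University Construction "Chong-Bu-Qiang" Special Project (5041700175)
Second Batch of New Engineering Research and Practice Projects of the Ministry of Education (E-RGZN20201036).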
Keywords
Riddles
Intelligent question answering
Multiple choice
Natural language understanding