Abstract
Topic detection on short texts such as microblog posts is limited by the feature extraction ability of the SBERT model and, at the clustering stage, by the small-cluster and efficiency problems of the single-pass clustering algorithm. To address both, a microblog hot topic detection method based on semi-supervised SBERT with SinglePass clustering (Semi-SBERT-SP) is proposed. The SBERT model is combined with semi-supervised training to improve its short-text feature extraction. In the clustering stage, a time window and dimensionality reduction are introduced to improve efficiency, and a merging layer is added to handle the small clusters the algorithm produces. In the topic representation layer, a word contribution index that incorporates word popularity is proposed for extracting keywords from topic clusters. Experimental results show that the method outperforms the comparison models and methods on accuracy, F1, and mutual information, and that it can effectively detect the hot topics contained in microblogs.
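The clustering stage described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract gives no thresholds, window length, or merge rule, so the function names (`single_pass`, `merge_small`), the cosine-similarity criterion, the centroid update, and every default value below are assumptions chosen for illustration. The SBERT encoding and dimensionality-reduction steps are omitted; the input vectors stand in for already-reduced sentence embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def single_pass(items, sim_threshold=0.8, window=3600.0):
    """Single-pass clustering over (timestamp, vector) pairs sorted by time.

    A post joins the most similar cluster that was updated within `window`
    seconds (assumed time-window rule); otherwise it starts a new cluster.
    """
    clusters = []
    for idx, (t, vec) in enumerate(items):
        best, best_sim = None, sim_threshold
        for c in clusters:
            if t - c["last_seen"] > window:
                continue  # cluster is outside the time window, skip comparison
            s = cosine(vec, centroid(c["vectors"]))
            if s >= best_sim:
                best, best_sim = c, s
        if best is None:
            clusters.append({"members": [idx], "vectors": [vec], "last_seen": t})
        else:
            best["members"].append(idx)
            best["vectors"].append(vec)
            best["last_seen"] = t
    return clusters


def merge_small(clusters, min_size=2, merge_threshold=0.6):
    """Assumed merge layer: fold each small cluster into its closest large one.

    Small clusters with no sufficiently similar large cluster are kept as-is.
    Note: the cluster dicts passed in are modified in place.
    """
    large = [c for c in clusters if len(c["members"]) >= min_size]
    small = [c for c in clusters if len(c["members"]) < min_size]
    leftover = []
    for s in small:
        target, best_sim = None, merge_threshold
        for c in large:
            sim = cosine(centroid(s["vectors"]), centroid(c["vectors"]))
            if sim >= best_sim:
                target, best_sim = c, sim
        if target is None:
            leftover.append(s)
        else:
            target["members"].extend(s["members"])
            target["vectors"].extend(s["vectors"])
    return large + leftover


# Toy data: (timestamp in seconds, 2-D stand-in for an embedding).
items = [
    (0, [1.0, 0.0]),
    (10, [0.9, 0.1]),
    (20, [0.0, 1.0]),
    (30, [0.1, 0.9]),
    (5000, [1.0, 0.0]),  # arrives after the window has expired
]
clusters = single_pass(items)
members_before_merge = [list(c["members"]) for c in clusters]
merged = merge_small(clusters)
members_after_merge = [list(c["members"]) for c in merged]
```

In this toy run the last post arrives after the window has closed, so the single pass leaves it as a one-member cluster, and the merge step then folds it into the first cluster. The time window limits how many clusters each new post is compared against, which is where the efficiency gain would come from under these assumptions.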
Authors
LI Yan (李彦)
DENG Yu-hao (邓宇浩)
College of Computer Science and Engineering, Chongqing University of Technology, Chongqing 400054, China
Source
Computer Engineering and Design (《计算机工程与设计》)
PKU Core Journal (北大核心)
2024, Issue 11, pp. 3329-3336 (8 pages)
Funding
National Natural Science Foundation of China, General Program (61173184)
Natural Science Foundation of Chongqing (cstc2018jcyjA2328, cstc2018jcyjAX0694)
Keywords
microblog
topic detection
short text
pre-trained model
semi-supervised learning
siamese network
SinglePass clustering