Abstract
Co-word analysis is a fundamental method for text content analysis, but existing co-word analysis methods have two shortcomings. First, the semantic relevance of word pairs is not considered when the keyword co-word matrix is constructed. Second, the cluster analysis of the co-word matrix does not allow a word to belong to more than one topic. This study proposes a co-word analysis method based on semantic relevance and fuzzy clustering. Domain keywords are extracted by combining Donohue's formula for the boundary between high- and low-frequency words with the g-index of word frequency. A word embedding model is used to learn semantic vector representations of the keywords, from which a semantically weighted co-word matrix is constructed, so that the correlation between word pairs is measured by both co-occurrence features and semantic relevance. The fuzzy C-means clustering algorithm is then combined with factor-based dimensionality reduction to perform fuzzy clustering of keywords on the semantically weighted co-word matrix. This remedies the single-topic assignment of words imposed by hard clustering, improves the information quality of the clusters, and reveals the relationships between clusters. Experiments on journal articles in the infectious diseases category verify the effectiveness and superiority of the proposed method.
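The two core steps named in the abstract, semantic weighting of the co-word matrix and fuzzy C-means clustering, can be illustrated with a minimal Python sketch. The abstract does not give the paper's weighting formula, so the sketch assumes a simple one: each co-occurrence count is scaled by (1 + cosine similarity) / 2. The function names, the toy keyword vectors, and the toy documents are all hypothetical, and the factor-based dimensionality reduction step is omitted.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def semantic_weighted_coword(docs, keywords, vectors):
    """Co-occurrence counts scaled by (1 + cosine) / 2.

    The scaling is an assumed weighting for illustration only,
    not the formula used in the paper.
    """
    idx = {k: i for i, k in enumerate(keywords)}
    n = len(keywords)
    matrix = [[0.0] * n for _ in range(n)]
    for doc in docs:
        present = {idx[w] for w in doc if w in idx}
        for a in present:
            for b in present:
                if a != b:
                    matrix[a][b] += 1.0
    for a in range(n):
        for b in range(n):
            if a != b:
                sim = cosine(vectors[keywords[a]], vectors[keywords[b]])
                matrix[a][b] *= (1.0 + sim) / 2.0
    return matrix


def fuzzy_c_means(rows, c, m=2.0, iters=100, seed=0):
    """Standard fuzzy C-means; returns (memberships, centers)."""
    rng = random.Random(seed)
    n, d = len(rows), len(rows[0])
    # Random initial memberships, normalized so each row sums to 1.
    u = []
    for _ in range(n):
        r = [rng.random() + 1e-9 for _ in range(c)]
        s = sum(r)
        u.append([x / s for x in r])
    centers = []
    for _ in range(iters):
        # Cluster centers: membership-weighted means of the rows.
        centers = []
        for j in range(c):
            w = [u[i][j] ** m for i in range(n)]
            total = sum(w)
            centers.append(
                [sum(w[i] * rows[i][k] for i in range(n)) / total for k in range(d)]
            )
        # Memberships: inverse-distance update with fuzzifier m.
        for i in range(n):
            dist = [max(math.dist(rows[i], centers[j]), 1e-12) for j in range(c)]
            inv = [dj ** (-2.0 / (m - 1.0)) for dj in dist]
            s = sum(inv)
            u[i] = [x / s for x in inv]
    return u, centers


if __name__ == "__main__":
    # Hypothetical toy data: keyword lists per document and 2-d "embeddings".
    keywords = ["HIV", "AIDS", "tuberculosis", "drug resistance"]
    vectors = {
        "HIV": [0.9, 0.1],
        "AIDS": [0.8, 0.2],
        "tuberculosis": [0.2, 0.9],
        "drug resistance": [0.3, 0.8],
    }
    docs = [
        ["HIV", "AIDS"],
        ["HIV", "AIDS", "tuberculosis"],
        ["tuberculosis", "drug resistance"],
        ["HIV", "drug resistance"],
    ]
    matrix = semantic_weighted_coword(docs, keywords, vectors)
    memberships, _ = fuzzy_c_means(matrix, c=2)
    for word, row in zip(keywords, memberships):
        print(word, [round(x, 3) for x in row])
```

Each keyword receives a membership degree in every cluster rather than a single label, which is the property the abstract contrasts with hard clustering. In practice the keyword vectors would come from a trained word embedding model rather than hand-written values.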
Authors
陆泉
曹越
陈静
Lu Quan; Cao Yue; Chen Jing (Center for Studies of Information Resources, Wuhan University, Wuhan 430072; Big Data Institute, Wuhan University, Wuhan 430072; School of Information Management, Central China Normal University, Wuhan 430079)
Source
《情报学报》
CSSCI
CSCD
Peking University Core Journals (北大核心)
2022, No. 10, pp. 1003-1014 (12 pages)
Journal of the China Society for Scientific and Technical Information
Funding
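Key Project of the National Social Science Fund of China, "Research on Precision Information Services in Online Health Communities from the Perspective of Mental Accounting Theory" (20ATQ008).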
Keywords
co-word analysis
semantic relevance
word embedding model
fuzzy C-means clustering