Abstract
Assessing the data quality of Web article content determines how useful the retrieved data is. Existing approaches rely heavily on lexical features or user interactions to obtain quality indicators; they generalize poorly and cannot capture the factual semantics of the content. This article therefore proposes a fact-based quality assessment (FQA) approach. Given a target article, the approach first builds its context on the Web by collecting relevant articles, and extracts the facts from every article. It then constructs a baseline for the accuracy dimension by voting and a baseline for the completeness dimension by iterating over fact graphs. Finally, it quantifies accuracy and completeness by comparing the facts of the target article with the corresponding dimension baselines. Because FQA quantifies data quality dimensions from the facts in the content rather than from particular features, it can achieve high assessment precision. Experimental results demonstrate the superior performance of FQA.
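The voting and comparison steps described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: facts are represented as hypothetical `(subject, predicate, object)` triples, the accuracy baseline is a simple majority vote over the context articles, and the completeness reference is passed in as a ready-made set of `(subject, predicate)` keys, because the abstract does not specify how the fact-graph iteration works.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple

# Hypothetical fact representation: (subject, predicate, object).
Fact = Tuple[str, str, str]
Key = Tuple[str, str]


def accuracy_baseline(context_facts: List[Set[Fact]]) -> Dict[Key, str]:
    """Majority vote: for each (subject, predicate), keep the object
    asserted by the largest number of context articles."""
    votes: Dict[Key, Counter] = {}
    for facts in context_facts:
        for subj, pred, obj in facts:
            votes.setdefault((subj, pred), Counter())[obj] += 1
    return {key: counts.most_common(1)[0][0] for key, counts in votes.items()}


def accuracy(target: Set[Fact], baseline: Dict[Key, str]) -> float:
    """Share of the target's checkable facts that agree with the baseline."""
    checkable = [f for f in target if (f[0], f[1]) in baseline]
    if not checkable:
        return 0.0
    agreeing = sum(1 for s, p, o in checkable if baseline[(s, p)] == o)
    return agreeing / len(checkable)


def completeness(target: Set[Fact], reference_keys: Set[Key]) -> float:
    """Share of reference (subject, predicate) keys the target covers.
    The reference is assumed to be given; how it is built is out of scope."""
    if not reference_keys:
        return 0.0
    covered = {(s, p) for s, p, _ in target} & reference_keys
    return len(covered) / len(reference_keys)
```

In this sketch a target fact counts as accurate only when its object matches the voted value exactly; a real system would likely need fact normalization or fuzzy matching, which is left out here.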
Source
Computer Science (《计算机科学》)
CSCD
Peking University Core Journals (北大核心)
2014, No. 11, pp. 247-251, 255 (6 pages)
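Funding
Supported by the National Natural Science Foundation of China (61003040, 61100135) and the Fundamental Research Funds for the Central Universities (LGZD201324).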