Min-wise hash function-based sampling over distributed data streams

分布数据流上基于Min-wise散列函数的采样(英文)

下载PDF

导出

摘要 In order to avoid the redundant and inconsistent information in distributed data streams, a sampling method based on min-wise hash functions is designed and the practical semantics of the union of distributed data streams is defined. First, for each family of min-wise hash functions, the data with the minimum hash value are selected as local samples and the biased effect caused by frequent updates in a single data stream is filtered out. Secondly, for the same hash function, the sample with the minimum hash value is selected as the global sample and the local samples are combined at the center node to filter out the biased effect of duplicated updates. Finally, based on the obtained uniform samples, several aggregations on the defined semantics of the union of data streams are precisely estimated. The results of comparison tests on synthetic and real-life data streams demonstrate the effectiveness of this method. 针对分布数据流中存在的冗余和不一致信息问题,提出了一种基于Min-wise散列的采样方法,并定义了反映应用需求的分布数据流并的语义.首先,对于每一族Min-wise散列函数选取具有最小散列值的数据作为局部样本,滤除单个数据流中的频繁更新对采样偏斜的影响.然后,对于相同散列函数产生的样本选取具有最小散列值的样本作为全局样本,完成局部样本集在中心节点的合并,滤除在分布节点上的重复更新对样本偏斜的影响.最后,利用获得的均匀样本集,在多种数据流并的语义上精确估计聚集函数的值.基于人造数据和真实数据的对比试验验证了该方法的有效性.

作者崇志宏倪巍伟徐立臻吕建华谢英豪

机构地区东南大学计算机科学与工程学院

出处《Journal of Southeast University(English Edition)》 EI CAS 2009年第4期456-459,共4页 东南大学学报（英文版）

基金 The National Natural Science Foundation of China(No60973023,60603040) the Natural Science Foundation of Southeast University(NoKJ2009362)

关键词 data streams AGGREGATION rain-wise hashing 数据流聚集 Min-wise散列

分类号 TP392 [自动化与计算机技术—计算机应用技术]

引文网络
相关文献

参考文献8

1Gibbons P B,Tirthapura S.Estimating simple functions onthe union of data streams[].Annual ACM Symposium on Parallel Algorithms and Architectures.2001
2Ganguly S,Garofalakis M,Rastogi R.Processing set ex- pressions over continuous update streams[].Proceedings of the ACM SIGMOD International Conference on Manage- ment of Data.2003
3Ganguly S,Garofalakis M,Kumar A, et al.Join-distinct aggregate estimation over update streams[].Proceedings of the ACM SIGACT-SIGMOD-SIGART Symposium on Prin- ciples of Database Systems.2005
4Indyk P.Stable distributions, pseudorandom generators, embeddings and data stream computation [ C][].Annual Symposium on Foundations of Computer Science Proceed- ings.2000
5Gilbert A C,Kotidis Y,Muthuukrishnan S, et al.How to summarize the universe: dynamic maintenance of quantiles[].Proceedings of theth Annual International Confer- ence on Very Large Data Bases.2002
6Flajolet P,Martin GN.Probabilistic counting algorithms for data base applications[].Journal of Computer and System Sciences.1985
7Hoeffding W.Probability inequalities for sums of bounded random variables[].Journal of the American Statistical Association.1963
8Broder,AZ,Charikar,M,Frieze,A.M,et al.Min-wise independent permutations[].Proceedings of the Annual ACM Symposium on Theory of Computing.1998

1杨宜东,孙志挥,张净.基于核密度估计的分布数据流离群点检测[J].计算机研究与发展,2005,42(9):1498-1504. 被引量：9
2王树广.分布式数据流上的连续异常检测[J].微电子学与计算机,2008,25(9):158-160. 被引量：1
3詹艳梅,曾向阳,孙进才.基于粗糙集理论的目标特征选择方法[J].自然科学进展,2004,14(12):1483-1487. 被引量：2
4刘华富.基于模糊支持向量机的Web挖掘[J].计算机工程与应用,2007,43(10):189-190. 被引量：1
5刘力雄,郭云飞,康晶,马宏.分布式数据流聚类算法[J].计算机工程与设计,2011,32(8):2708-2711. 被引量：2
6朱雪玲,兰军,李寿其,贾焰.基于张量分解的数据流异常检测[J].计算机工程与科学,2009,31(6):75-78.
7马长安.一种分布式信息监控系统的设计[J].安徽电子信息职业技术学院学报,2007,6(4):89-89.
8孙昌霞,车银超,郭玉峰,庞晓丹,马新明,张龙龙.基于安全策略的分布式农产品数字认证[J].河南农业大学学报,2011,45(2):236-240.
9李雄飞,苑森淼,董立岩.基于相联规则的数据挖掘理论[J].吉林工业大学自然科学学报,2000,30(2):43-46. 被引量：9
10刘维弋,金远平.一种基于动态修正值的分布数据流Top-K查询处理算法[J].计算机应用与软件,2009,26(1):53-56.

Journal of Southeast University(English Edition)

2009年第4期

浏览历史

内容加载中请稍等...

Min-wise hash function-based sampling over distributed data streams

参考文献8

相关作者

相关机构

相关主题

浏览历史