期刊文献+

Min-wise hash function-based sampling over distributed data streams

分布数据流上基于Min-wise散列函数的采样(英文)
在线阅读 下载PDF
导出
摘要 In order to avoid the redundant and inconsistent information in distributed data streams, a sampling method based on min-wise hash functions is designed and the practical semantics of the union of distributed data streams is defined. First, for each family of min-wise hash functions, the data with the minimum hash value are selected as local samples and the biased effect caused by frequent updates in a single data stream is filtered out. Secondly, for the same hash function, the sample with the minimum hash value is selected as the global sample and the local samples are combined at the center node to filter out the biased effect of duplicated updates. Finally, based on the obtained uniform samples, several aggregations on the defined semantics of the union of data streams are precisely estimated. The results of comparison tests on synthetic and real-life data streams demonstrate the effectiveness of this method. 针对分布数据流中存在的冗余和不一致信息问题,提出了一种基于Min-wise散列的采样方法,并定义了反映应用需求的分布数据流并的语义.首先,对于每一族Min-wise散列函数选取具有最小散列值的数据作为局部样本,滤除单个数据流中的频繁更新对采样偏斜的影响.然后,对于相同散列函数产生的样本选取具有最小散列值的样本作为全局样本,完成局部样本集在中心节点的合并,滤除在分布节点上的重复更新对样本偏斜的影响.最后,利用获得的均匀样本集,在多种数据流并的语义上精确估计聚集函数的值.基于人造数据和真实数据的对比试验验证了该方法的有效性.
出处 《Journal of Southeast University(English Edition)》 EI CAS 2009年第4期456-459,共4页 东南大学学报(英文版)
基金 The National Natural Science Foundation of China(No60973023,60603040) the Natural Science Foundation of Southeast University(NoKJ2009362)
关键词 data streams AGGREGATION rain-wise hashing 数据流 聚集 Min-wise散列
  • 相关文献

参考文献8

  • 1Gibbons P B,Tirthapura S.Estimating simple functions onthe union of data streams[].Annual ACM Symposium on Parallel Algorithms and Architectures.2001
  • 2Ganguly S,Garofalakis M,Rastogi R.Processing set ex- pressions over continuous update streams[].Proceedings of the ACM SIGMOD International Conference on Manage- ment of Data.2003
  • 3Ganguly S,Garofalakis M,Kumar A, et al.Join-distinct aggregate estimation over update streams[].Proceedings of the ACM SIGACT-SIGMOD-SIGART Symposium on Prin- ciples of Database Systems.2005
  • 4Indyk P.Stable distributions, pseudorandom generators, embeddings and data stream computation [ C][].Annual Symposium on Foundations of Computer Science Proceed- ings.2000
  • 5Gilbert A C,Kotidis Y,Muthuukrishnan S, et al.How to summarize the universe: dynamic maintenance of quantiles[].Proceedings of theth Annual International Confer- ence on Very Large Data Bases.2002
  • 6Flajolet P,Martin GN.Probabilistic counting algorithms for data base applications[].Journal of Computer and System Sciences.1985
  • 7Hoeffding W.Probability inequalities for sums of bounded random variables[].Journal of the American Statistical Association.1963
  • 8Broder,AZ,Charikar,M,Frieze,A.M,et al.Min-wise independent permutations[].Proceedings of the Annual ACM Symposium on Theory of Computing.1998

相关作者

内容加载中请稍等...

相关机构

内容加载中请稍等...

相关主题

内容加载中请稍等...

浏览历史

内容加载中请稍等...
;
使用帮助 返回顶部