摘要
In order to avoid the redundant and inconsistent information in distributed data streams, a sampling method based on min-wise hash functions is designed and the practical semantics of the union of distributed data streams is defined. First, for each family of min-wise hash functions, the data with the minimum hash value are selected as local samples and the biased effect caused by frequent updates in a single data stream is filtered out. Secondly, for the same hash function, the sample with the minimum hash value is selected as the global sample and the local samples are combined at the center node to filter out the biased effect of duplicated updates. Finally, based on the obtained uniform samples, several aggregations on the defined semantics of the union of data streams are precisely estimated. The results of comparison tests on synthetic and real-life data streams demonstrate the effectiveness of this method.
针对分布数据流中存在的冗余和不一致信息问题,提出了一种基于Min-wise散列的采样方法,并定义了反映应用需求的分布数据流并的语义.首先,对于每一族Min-wise散列函数选取具有最小散列值的数据作为局部样本,滤除单个数据流中的频繁更新对采样偏斜的影响.然后,对于相同散列函数产生的样本选取具有最小散列值的样本作为全局样本,完成局部样本集在中心节点的合并,滤除在分布节点上的重复更新对样本偏斜的影响.最后,利用获得的均匀样本集,在多种数据流并的语义上精确估计聚集函数的值.基于人造数据和真实数据的对比试验验证了该方法的有效性.
基金
The National Natural Science Foundation of China(No60973023,60603040)
the Natural Science Foundation of Southeast University(NoKJ2009362)