Duplicate Elimination Query Optimization in Large-Scale Data-Intensive Systems (cited by: 6)

Duplication Elimination in Large Scale Data Intensive Systems

Abstract: Duplicate elimination over very large data volumes is a serious challenge for large-scale data-intensive systems built on a shared-nothing architecture. The authors propose an adaptive data placement strategy that combines hash partitioning with histograms, together with an asynchronous parallel query engine (APQE) implemented as query middleware. The placement strategy targets duplicate elimination on related attributes, and the query engine targets duplicate elimination on unrelated attributes. Hash partitioning divides the data into independent, non-overlapping subsets, which reduces data migration during duplicate elimination, while the histogram keeps data volumes balanced across nodes as data is written. When data skew occurs, the adaptive method automatically rebalances the distribution. The query engine exploits coarse-grained pipeline parallelism in duplicate elimination processing, and its asynchronous design removes the overhead of multiple nodes waiting on each other to synchronize. APQE starts merging as soon as some database nodes return intermediate results, returns each part of the final result as soon as the slowest node has returned the relevant data, and then frees the corresponding memory. Experiments on the production system DBroker show that the data placement strategy greatly improves duplicate elimination queries on related attributes, and that the asynchronous parallel query engine exploits the available parallelism and significantly improves duplicate elimination queries on unrelated attributes.
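The two ideas in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `NUM_BUCKETS`, `bucket_of`, `assign_buckets`, and `EarlyMerger` are hypothetical, the greedy bucket-to-node assignment is only one plausible way to combine a hash partition with a histogram, and the merger assumes that every node streams its locally distinct values in strictly ascending order, which the abstract does not state.

```python
import hashlib
from collections import Counter

NUM_BUCKETS = 64  # hypothetical: many more hash buckets than nodes


def bucket_of(key):
    """Stable hash of the partitioning attribute into a fixed bucket id."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_BUCKETS


def assign_buckets(histogram, num_nodes):
    """Greedy placement: heaviest bucket first, always onto the lightest node.

    Re-running this on a fresh histogram is one way to rebalance after skew.
    """
    load = [0] * num_nodes
    placement = {}
    for bucket in sorted(range(NUM_BUCKETS), key=lambda b: -histogram[b]):
        node = load.index(min(load))
        placement[bucket] = node
        load[node] += histogram[bucket]
    return placement


class EarlyMerger:
    """Merges ascending distinct-value streams and emits results early."""

    def __init__(self, node_ids):
        self.watermark = {n: None for n in node_ids}  # largest value seen per node
        self.done = set()      # nodes that have sent their last batch
        self.pending = set()   # values received but not yet emitted

    def on_batch(self, node, values, finished=False):
        """Accept one ascending batch; return the values that are now final."""
        self.pending.update(values)  # the set removes cross-node duplicates
        if values:
            self.watermark[node] = values[-1]
        if finished:
            self.done.add(node)
        live = [n for n in self.watermark if n not in self.done]
        if any(self.watermark[n] is None for n in live):
            return []  # some node has not reported anything yet
        if live:
            # No live node can still send a value at or below the slowest watermark.
            bound = min(self.watermark[n] for n in live)
            ready = sorted(v for v in self.pending if v <= bound)
        else:
            ready = sorted(self.pending)
        self.pending.difference_update(ready)  # free memory for emitted values
        return ready
```

Because equal keys always hash to the same bucket, a distinct query on the partitioning attribute can be answered per node without moving data. For other attributes, the merger returns every value at or below the lowest watermark among unfinished nodes, so partial results leave the coordinator without waiting for all nodes to finish.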

Authors: Song Huaiming (宋怀明), An Mingyuan (安明远), Wang Yang (王洋), Yuan Chunyang (袁春阳), Sun Ninghui (孙凝晖)

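Affiliations: Key Laboratory of Computer System and Architecture, Institute of Computing Technology, Chinese Academy of Sciences; Graduate University of Chinese Academy of Sciences; National Computer Network Emergency Response Technical Team/Coordination Center of China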

Source: Journal of Computer Research and Development (《计算机研究与发展》), indexed in EI, CSCD, and the Peking University Core Journals list; 2010, No. 4, pp. 581-588 (8 pages)

Funding: National High Technology Research and Development Program of China (863 Program), grant 2007AA010505

Keywords: duplication elimination; data partitioning; large scale data intensive system; asynchronous query; parallel query engine (PQE)

Classification: TP311.13 [Automation and Computer Technology: Computer Software and Theory]
