Duplicate-Elimination Query Optimization in Large-Scale Data-Intensive Systems (cited by: 6)

Duplication Elimination in Large Scale Data Intensive Systems
Abstract: Duplicate elimination over large data volumes is a serious challenge for large-scale data-intensive systems built on a shared-nothing architecture. The authors propose an effective, adaptive data placement method and a parallel processing method that optimize duplicate elimination on related and unrelated attributes, respectively: a data placement strategy that combines hash partitioning with histograms, and an asynchronous parallel query engine (APQE) implemented as query middleware. Hash partitioning divides the data into unrelated subsets, which reduces data migration during duplicate elimination, while the histogram method keeps data sizes balanced across nodes as data is written. When data skew occurs, the adaptive approach automatically rebalances the data distribution. The parallel query engine exploits coarse-grained pipeline parallelism in duplicate-elimination processing, and its asynchronous design removes the overhead of waiting for multiple nodes to synchronize. APQE starts merging as soon as some database nodes return intermediate results, returns part of the final result as soon as the slowest node has returned the relevant data, and then frees the memory. Experiments on the production system DBroker show that the data placement strategy and its adaptive adjustment greatly improve duplicate-elimination queries on related attributes, and that the asynchronous parallel query engine fully exploits parallelism and significantly improves duplicate-elimination queries on unrelated attributes.
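The two ideas in the abstract can be illustrated with a minimal sketch. It is not the DBroker or APQE implementation, which the abstract does not describe in detail. It assumes that a "related" attribute is the attribute used for hash partitioning, that a per-node row count stands in for the histogram, and that a hash-set merge over results arriving in completion order stands in for the asynchronous merge. All function names (`node_for`, `place`, `is_skewed`, `distinct_async`, and so on) and the skew threshold `factor` are hypothetical.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor, as_completed


def node_for(key, num_nodes):
    """Stable hash partitioning: equal keys always map to the same node."""
    digest = hashlib.md5(repr(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


def place(rows, key_attr, num_nodes):
    """Distribute rows (dicts) over nodes by hashing the partitioning attribute."""
    nodes = [[] for _ in range(num_nodes)]
    for row in rows:
        nodes[node_for(row[key_attr], num_nodes)].append(row)
    return nodes


def size_histogram(nodes):
    """Row count per node (a simplified stand-in for the paper's histogram)."""
    return [len(node) for node in nodes]


def is_skewed(nodes, factor=2.0):
    """Flag skew when the largest node exceeds `factor` times the mean size.
    The threshold is an assumption made for illustration."""
    sizes = size_histogram(nodes)
    mean = sum(sizes) / len(sizes)
    return mean > 0 and max(sizes) > factor * mean


def local_distinct(node, attr):
    """Duplicate elimination inside a single node."""
    return {row[attr] for row in node}


def distinct_on_partition_key(nodes, key_attr):
    """DISTINCT on the partitioning attribute: per-node results are disjoint,
    so they can be concatenated without moving data between nodes."""
    result = []
    for node in nodes:
        result.extend(local_distinct(node, key_attr))
    return result


def distinct_async(nodes, attr):
    """DISTINCT on another attribute: merge per-node results in the order the
    nodes finish and yield each new value immediately, without waiting for
    the slowest node."""
    seen = set()
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        futures = [pool.submit(local_distinct, node, attr) for node in nodes]
        for done in as_completed(futures):
            for value in done.result():
                if value not in seen:
                    seen.add(value)
                    yield value
```

In this sketch, a query on the partitioning attribute needs no cross-node comparison because hashing already guarantees that duplicates share a node, whereas a query on any other attribute needs a global merge, which is where returning early results as nodes complete pays off.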
Source: Journal of Computer Research and Development (《计算机研究与发展》), indexed in EI, CSCD, and the Peking University Core Journals list; 2010, No. 4, pp. 581-588 (8 pages)
Funding: National High Technology Research and Development Program of China ("863" Program), grant 2007AA010505
Keywords: duplication elimination; data partitioning; large scale data intensive system; asynchronous query; parallel query engine (PQE)