Funding: Project supported by the National Natural Science Foundation of China (No. 61272145) and the National High-Tech R&D Program (863) of China (No. 2012AA012706).
Abstract: OpenCL is an open heterogeneous programming framework. Although OpenCL programs are functionally portable, they do not provide performance portability, so code transformation often plays an irreplaceable role. When adapting GPU-specific OpenCL kernels to run on multi-core/many-core CPUs, coarsening the thread granularity is necessary and has thus been used extensively. However, locality concerns exposed in GPU-specific OpenCL code are usually inherited without analysis, which may adversely affect CPU performance. Typically, the use of OpenCL's local memory on multi-core/many-core CPUs may degrade rather than improve performance, because local-memory arrays no longer match the hardware well and the associated synchronizations are costly. To solve this dilemma, we actively analyze the memory access patterns using array-access descriptors derived from GPU-specific kernels, which can thus be adapted for CPUs by (1) removing all the unwanted local-memory arrays together with the obsolete barrier statements and (2) optimizing the coalesced kernel code with vectorization and locality re-exploitation. Moreover, we have developed an automated tool chain that performs this transformation of GPU-specific OpenCL kernels into a CPU-friendly form, accompanied by a scheduler that forms a new OpenCL runtime. Experiments show that the automated transformation can improve OpenCL kernel performance on a multi-core CPU by an average factor of 3.24. Satisfactory performance improvements are also achieved on Intel's many-integrated-core coprocessor. The resultant performance on both architectures is better than or comparable to the corresponding OpenMP performance.
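As a rough illustration of step (1) in the abstract above, the following Python sketch emulates a 1D three-point stencil in two forms: a GPU-oriented form that stages a tile in a local-memory array and needs a barrier between the load and compute phases, and a coarsened CPU-oriented form in which the work-items of a group become a plain loop that reads global memory directly. It is not the authors' tool chain or its output; the names `GROUP`, `stencil_gpu_style`, and `stencil_cpu_style` and the stencil itself are assumptions chosen for illustration.

```python
# Illustrative emulation only; all names here are hypothetical.
GROUP = 4  # assumed work-group size


def stencil_gpu_style(src):
    """Emulate a GPU-oriented kernel: load a tile into local memory, barrier, compute."""
    n = len(src)
    out = [0] * n
    for group_start in range(0, n, GROUP):
        # Phase 1: work-items copy their elements (plus a one-element halo
        # on each side) into the local-memory tile.
        tile = [0] * (GROUP + 2)
        for lid in range(-1, GROUP + 1):
            gid = group_start + lid
            tile[lid + 1] = src[gid] if 0 <= gid < n else 0
        # On a GPU, barrier(CLK_LOCAL_MEM_FENCE) would separate the two phases.
        # Phase 2: each work-item reads its neighbours from the tile.
        for lid in range(min(GROUP, n - group_start)):
            out[group_start + lid] = tile[lid] + tile[lid + 1] + tile[lid + 2]
    return out


def stencil_cpu_style(src):
    """Coarsened form: one loop per work-group, no local array, no barrier."""
    n = len(src)
    out = [0] * n
    for group_start in range(0, n, GROUP):
        # The work-items of one group become iterations of a sequential loop.
        for gid in range(group_start, min(group_start + GROUP, n)):
            left = src[gid - 1] if gid > 0 else 0
            right = src[gid + 1] if gid + 1 < n else 0
            out[gid] = left + src[gid] + right
    return out


data = list(range(1, 11))
result_gpu = stencil_gpu_style(data)
result_cpu = stencil_cpu_style(data)
```

Both functions return the same values. The second form drops the staging copy and the synchronization point, and its inner loop is the kind of contiguous access that a CPU compiler could vectorize.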
Funding: Partially supported by the National Natural Science Foundation of China under Grant No. 62225206.
Abstract: Unified programming models can effectively improve program portability on various heterogeneous high-performance computers. Existing unified programming models put considerable effort into code portability but are still far from achieving good performance portability. In this paper, we present a preliminary design of a performance-portable unified programming model comprising four aspects: programming language, programming abstraction, compilation optimization, and scheduling system. Specifically, domain-specific languages introduce domain knowledge to decouple the optimizations for different applications and architectures. The unified programming abstraction unifies the common features of different architectures to support common optimizations. Multi-level compilation optimization enables comprehensive performance optimization based on multi-level intermediate representations. A resource-aware, lightweight runtime scheduling system improves the resource utilization of heterogeneous computers. This is a perspective paper that presents our viewpoints on programming models for emerging heterogeneous systems.