J Shanghai Jiaotong Univ Sci ›› 2021, Vol. 26 ›› Issue (2): 245-256.doi: 10.1007/s12204-020-2240-x

• Welding Automation & Computer Technology •

Bypass-Enabled Thread Compaction for Divergent Control Flow in Graphics Processing Units

LI Bingchao (李炳超), WEI Jizeng (魏继增), GUO Wei (郭炜), SUN Jizhou (孙济州)   

  (1. College of Computer Science and Technology, Civil Aviation University of China, Tianjin 300300, China;
    2. College of Intelligence and Computing, Tianjin University, Tianjin 300350, China)
  • Online: 2021-04-28  Published: 2021-03-24
  • Contact: WEI Jizeng (魏继增) E-mail: weijizeng@tju.edu.cn

Abstract: Graphics processing units (GPUs) employ single-instruction multiple-data (SIMD) hardware to run threads in parallel while allowing each thread to follow an arbitrary control flow. Threads running concurrently within a warp may jump to different paths after a conditional branch. Such divergent control flow leaves some SIMD lanes idle and hence reduces the SIMD utilization of GPUs. To alleviate this waste, threads from multiple warps can be collected together and compacted into the idle lanes, improving SIMD lane utilization. However, this mechanism introduces extra barrier synchronizations, since warps must stall to wait for other warps before a compaction; in some cases no warps can be scheduled at all. In this paper, we propose an approach to reduce the overhead of the barrier synchronizations induced by compactions. In our approach, a compaction is bypassed by warps whose threads all jump to the same path after a branch. Moreover, warps waiting for a compaction can also bypass it when no warps are ready to issue. In addition, a compaction is canceled if it cannot reduce the number of idle lanes. The experimental results demonstrate that our approach provides an average improvement of 21% over the baseline GPU for applications with many divergent branches, while recovering, on average, 13% of the performance loss induced by compactions for applications dominated by non-divergent control flow.
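The abstract's three rules (bypass, wait-bypass, cancel) can be illustrated with a toy software model. The sketch below is our own simplification, not the paper's hardware design: warps are represented as boolean lane masks, a fully converged warp bypasses compaction, and a compaction is canceled when packing would not reduce the warp count. All names and the packing policy are hypothetical.

```python
# Toy model of thread compaction with bypass, assuming 4-wide warps.
# Each warp is a list of booleans marking which SIMD lanes are active
# on the taken path after a conditional branch.
WARP_SIZE = 4

def is_divergent(mask):
    # A warp whose threads all took the same path bypasses compaction.
    return 0 < sum(mask) < len(mask)

def compact(warps):
    """Pack active threads of divergent warps into as few warps as possible."""
    divergent = [m for m in warps if is_divergent(m)]
    bypassed = [m for m in warps if not is_divergent(m)]
    active = sum(sum(m) for m in divergent)
    packed_count = -(-active // WARP_SIZE)  # ceiling division
    if packed_count == len(divergent):
        # Cancel: compaction would not reduce the number of idle lanes.
        return warps
    packed = []
    for i in range(packed_count):
        lanes = min(WARP_SIZE, active - i * WARP_SIZE)
        packed.append([True] * lanes + [False] * (WARP_SIZE - lanes))
    return bypassed + packed

warps = [
    [True, False, True, False],   # divergent: 2 of 4 lanes active
    [False, True, False, True],   # divergent: 2 of 4 lanes active
    [True, True, True, True],     # converged: bypasses compaction
]
result = compact(warps)           # two divergent warps merge into one full warp
```

Here the two half-empty warps are compacted into a single fully occupied warp, while the converged warp proceeds without waiting; a lone divergent warp would trigger the cancel rule and keep its original mask.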


Key words: graphics processing unit (GPU); single instruction multiple data (SIMD); thread; warp; bypass

