Bypass-Enabled Thread Compaction for Divergent Control Flow in Graphics Processing Units

doi:10.1007/s12204-020-2240-x

Abstract

Graphics processing units (GPUs) use single instruction, multiple data (SIMD) hardware to run threads in parallel while allowing each thread to follow its own control flow. Threads running concurrently within a warp may take different paths after a conditional branch. Such divergent control flow leaves some lanes idle and hence reduces the SIMD utilization of GPUs. To reduce this waste, threads from multiple warps can be gathered and compacted into idle lanes, which improves SIMD lane utilization. However, this mechanism introduces extra barrier synchronizations, because warps must stall while they wait for other warps to arrive at the compaction; in some cases, no warp can be scheduled at all. In this paper, we propose an approach that reduces the overhead of the barrier synchronizations induced by compaction. In our approach, a warp bypasses a compaction if all of its threads take the same path after the branch. Moreover, warps waiting for a compaction can also bypass it when no warp is ready to issue. In addition, a compaction is canceled if it cannot reduce the number of idle lanes. The experimental results demonstrate that our approach provides an average improvement of 21% over the baseline GPU for applications with massive divergent branches, while recovering the performance loss induced by compaction by 13% on average for applications with many non-divergent control flows.
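The three rules in the abstract (bypass for non-divergent warps, bypass when no warp is ready to issue, and cancellation when idle lanes cannot be reduced) can be illustrated with a small software model. The sketch below is not the hardware design from the paper. It assumes that every lane is active before the branch, that a compacted thread must stay in its original lane, and that the benefit of a compaction is measured in issue slots. The names `Mask`, `is_divergent`, `issue_slots_without_compaction`, `issue_slots_with_compaction`, and `decide` are hypothetical and chosen only for illustration.

```python
from typing import Dict, List

# One entry per SIMD lane: True if that lane's thread takes the branch.
# Assumption: every lane holds an active thread before the branch.
Mask = List[bool]


def is_divergent(mask: Mask) -> bool:
    """A warp diverges when its threads do not all choose the same path."""
    return any(mask) and not all(mask)


def issue_slots_without_compaction(masks: List[Mask]) -> int:
    """Each warp issues once per path that at least one of its threads takes."""
    return sum(int(any(m)) + int(not all(m)) for m in masks)


def issue_slots_with_compaction(masks: List[Mask]) -> int:
    """Assumed lane-preserving compaction: for each path, the lane with the
    most threads on that path determines how many compacted warps are needed."""
    lanes = list(zip(*masks))  # lanes[i] = lane i's decision in every warp
    taken = max(sum(lane) for lane in lanes)
    not_taken = max(len(lane) - sum(lane) for lane in lanes)
    return taken + not_taken


def decide(masks: List[Mask], scheduler_has_ready_warp: bool) -> Dict[str, object]:
    """Apply the three rules from the abstract to one branch."""
    # Rule 1: warps whose threads all take the same path bypass the compaction.
    bypass = [i for i, m in enumerate(masks) if not is_divergent(m)]
    waiting = [i for i, m in enumerate(masks) if is_divergent(m)]
    if not waiting:
        action = "nothing to compact"
    elif not scheduler_has_ready_warp:
        # Rule 2: waiting warps bypass when no warp is ready to issue.
        action = "waiting warps bypass"
    else:
        # Rule 3: cancel when compaction would not reduce idle lanes,
        # modeled here as not reducing the number of issue slots.
        candidates = [masks[i] for i in waiting]
        before = issue_slots_without_compaction(candidates)
        after = issue_slots_with_compaction(candidates)
        action = "compact" if after < before else "cancel"
    return {"bypass": bypass, "waiting": waiting, "action": action}
```

In this model, two divergent warps with complementary masks such as `[True, False, True, False]` and `[False, True, False, True]` are worth compacting, because four partially filled issue slots shrink to two full ones. Two warps with identical divergent masks gain nothing under the lane-preserving assumption, so the compaction is canceled.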

Key words: graphics processing unit (GPU), single instruction multiple data (SIMD), thread, warps, bypass

CLC Number: TP 33

LI Bingchao (李炳超), WEI Jizeng (魏继增), GUO Wei (郭炜), SUN Jizhou (孙济州). Bypass-Enabled Thread Compaction for Divergent Control Flow in Graphics Processing Units[J]. J Shanghai Jiaotong Univ Sci, 2021, 26(2): 245-256.
