[1] LINDHOLM E, NICKOLLS J, OBERMAN S, et al. NVIDIA Tesla: A unified graphics and computing architecture[J]. IEEE Micro, 2008, 28(2): 39-55.
[2] DAI H, LIN Z, LI C, et al. Accelerate GPU concurrent kernel execution by mitigating memory pipeline stalls[C]//Proceedings of the 24th International Symposium on High Performance Computer Architecture (HPCA). Piscataway, NJ, USA: IEEE, 2018: 208-220.
[3] KIM K, RO W W. WIR: Warp instruction reuse to minimize repeated computations in GPUs[C]//IEEE International Symposium on High Performance Computer Architecture (HPCA). Piscataway, NJ, USA: IEEE, 2018: 389-402.
[4] ABBASITABAR H, SAMAVATIAN M H, SARBAZIAZAD H. ASHA: An adaptive shared-memory sharing architecture for multi-programmed GPUs[J]. Microprocessors and Microsystems, 2016, 46: 264-273.
[5] OH B, KIM N S, AHN J, et al. A load balancing technique for memory channels[C]//International Symposium on Memory Systems. New York, USA: ACM, 2018: 55-66.
[6] WANG B, YU W K, SUN X H, et al. DaCache: Memory divergence-aware GPU cache management[C]//29th ACM International Conference on Supercomputing (ICS). New York, USA: ACM, 2015: 89-98.
[7] TANASIC I, GELADO I, JORDA M, et al. Efficient exception handling support for GPUs[C]//Proceedings of the 50th International Symposium on Microarchitecture (MICRO). New York, USA: ACM, 2017: 109-122.
[8] DIAMOS G, ASHBAUGH B, MAIYURAN S, et al. SIMD re-convergence at thread frontiers[C]//Proceedings of the 44th International Symposium on Microarchitecture (MICRO). New York, USA: ACM, 2011: 477-488.
[9] FUNG W W L, SHAM I, YUAN G, et al. Dynamic warp formation and scheduling for efficient GPU control flow[C]//Proceedings of the 40th International Symposium on Microarchitecture (MICRO). Piscataway, NJ, USA: IEEE, 2007: 407-420.
[10] JIN X X, DAKU B, KO S B. Improved GPU SIMD control flow efficiency via hybrid warp size mechanism[J]. Microprocessors and Microsystems, 2014, 38(7): 717-729.
[11] RHU M, EREZ M. The dual-path execution model for efficient GPU control flow[C]//Proceedings of the 19th International Symposium on High Performance Computer Architecture (HPCA). Piscataway, NJ, USA: IEEE, 2013: 591-602.
[12] ZHANG T, JING N, JIANG K, et al. Buddy SM: Sharing pipeline front-end for improved energy efficiency in GPGPUs[J]. ACM Transactions on Architecture and Code Optimization, 2015, 12(2): 16.
[13] KHORASANI F, GUPTA R, BHUYAN L N. Efficient warp execution in presence of divergence with collaborative context collection[C]//Proceedings of the 48th International Symposium on Microarchitecture (MICRO). Piscataway, NJ, USA: IEEE, 2015: 204-215.
[14] ELTANTAWY A, AAMODT T M. MIMD synchronization on SIMT architectures[C]//Proceedings of the 49th International Symposium on Microarchitecture (MICRO). Piscataway, NJ, USA: IEEE, 2016: 11.
[15] WANG Y, WANG D, CHEN S, et al. Iteration interleaving-based SIMD lane partition[J]. ACM Transactions on Architecture and Code Optimization, 2016, 12(4): 58.
[16] FUNG W W L, AAMODT T M. Thread block compaction for efficient SIMT control flow[C]//Proceedings of the 17th International Symposium on High Performance Computer Architecture (HPCA). Piscataway, NJ, USA: IEEE, 2011: 25-36.
[17] LIU Y, YU Z, EECKHOUT L, et al. Barrier-aware warp scheduling for throughput processors[C]//Proceedings of the International Conference on Supercomputing (ICS). New York, USA: ACM, 2016: 42.
[18] ELTANTAWY A, AAMODT T M. Warp scheduling for fine-grained synchronization[C]//Proceedings of the 24th International Symposium on High Performance Computer Architecture (HPCA). Piscataway, NJ, USA: IEEE, 2018: 375-388.
[19] GRAUER-GRAY S, XU L, SEARLES R, et al. Autotuning a high-level language targeted to GPU codes[C]//Innovative Parallel Computing (InPar). Piscataway, NJ, USA: IEEE, 2012: 1-10.
[20] HE B, FANG W, LUO Q, et al. Mars: A MapReduce framework on graphics processors[C]//International Conference on Parallel Architectures and Compilation Techniques (PACT). New York, USA: ACM, 2008: 260-269.
[21] BURTSCHER M, NASRE R, PINGALI K. A quantitative study of irregular programs on GPUs[C]//Proceedings of the International Symposium on Workload Characterization (IISWC). Piscataway, NJ, USA: IEEE, 2012: 141-151.
[22] CHE S, BOYER M, MENG J, et al. Rodinia: A benchmark suite for heterogeneous computing[C]//Proceedings of the International Symposium on Workload Characterization (IISWC). Piscataway, NJ, USA: IEEE, 2009: 44-54.
[23] BAKHODA A, YUAN G L, FUNG W W L, et al. Analyzing CUDA workloads using a detailed GPU simulator[C]//International Symposium on Performance Analysis of Systems and Software (ISPASS). Piscataway, NJ, USA: IEEE, 2009: 163-174.