[1] |
LINDHOLM E, NICKOLLS J, OBERMAN S, et al. NVIDIA Tesla: A unified graphics and computing architecture[J]. IEEE Micro, 2008, 28(2): 39-55.
|
[2] |
DAI H, LIN Z, LI C, et al. Accelerate GPU concurrent kernel execution by mitigating memory pipeline stalls[C]//Proceedings of the 24th International Symposium on High Performance Computer Architecture (HPCA). Piscataway, NJ, USA: IEEE, 2018: 208-220.
|
[3] |
KIM K, RO W W. WIR: Warp instruction reuse to minimize repeated computations in GPUs[C]//IEEE International Symposium on High Performance Computer Architecture (HPCA). Piscataway, NJ, USA: IEEE, 2018: 389-402.
|
[4] |
ABBASITABAR H, SAMAVATIAN M H, SARBAZI-AZAD H. ASHA: An adaptive shared-memory sharing architecture for multi-programmed GPUs[J]. Microprocessors and Microsystems, 2016, 46: 264-273.
|
[5] |
OH B, KIM N S, AHN J, et al. A load balancing technique for memory channels[C]//International Symposium on Memory Systems. New York, USA: ACM, 2018: 55-66.
|
[6] |
WANG B, YU W K, SUN X H, et al. DaCache: Memory divergence-aware GPU cache management[C]//29th ACM International Conference on Supercomputing (ICS). New York, USA: ACM, 2015: 89-98.
|
[7] |
TANASIC I, GELADO I, JORDA M, et al. Efficient exception handling support for GPUs[C]//Proceedings of the 50th International Symposium on Microarchitecture (MICRO). New York, USA: ACM, 2017: 109-122.
|
[8] |
DIAMOS G, ASHBAUGH B, MAIYURAN S, et al. SIMD re-convergence at thread frontiers[C]//Proceedings of the 44th International Symposium on Microarchitecture (MICRO). New York, USA: ACM, 2011: 477-488.
|
[9] |
FUNG W W L, SHAM I, YUAN G, et al. Dynamic warp formation and scheduling for efficient GPU control flow[C]//Proceedings of the 40th International Symposium on Microarchitecture (MICRO). Piscataway, NJ, USA: IEEE, 2007: 407-420.
|
[10] |
JIN X X, DAKU B, KO S B. Improved GPU SIMD control flow efficiency via hybrid warp size mechanism[J]. Microprocessors and Microsystems, 2014, 38(7): 717-729.
|
[11] |
RHU M, EREZ M. The dual-path execution model for efficient GPU control flow[C]//Proceedings of the 19th International Symposium on High Performance Computer Architecture (HPCA). Piscataway, NJ, USA: IEEE, 2013: 591-602.
|
[12] |
ZHANG T, JING N, JIANG K, et al. Buddy SM: Sharing pipeline front-end for improved energy efficiency in GPGPUs[J]. ACM Transactions on Architecture and Code Optimization, 2015, 12(2): 16.
|
[13] |
KHORASANI F, GUPTA R, BHUYAN L N. Efficient warp execution in presence of divergence with collaborative context collection[C]//Proceedings of the 48th International Symposium on Microarchitecture (MICRO). Piscataway, NJ, USA: IEEE, 2015: 204-215.
|
[14] |
ELTANTAWY A, AAMODT T M. MIMD synchronization on SIMT architectures[C]//Proceedings of the 49th International Symposium on Microarchitecture (MICRO). Piscataway, NJ, USA: IEEE, 2016: 11.
|
[15] |
WANG Y, WANG D, CHEN S, et al. Iteration interleaving-based SIMD lane partition[J]. ACM Transactions on Architecture and Code Optimization, 2016, 12(4): 58.
|
[16] |
FUNG W W L, AAMODT T M. Thread block compaction for efficient SIMT control flow[C]//Proceedings of the 17th International Symposium on High Performance Computer Architecture (HPCA). Piscataway, NJ, USA: IEEE, 2011: 25-36.
|
[17] |
LIU Y, YU Z, EECKHOUT L, et al. Barrier-aware warp scheduling for throughput processors[C]//Proceedings of the International Conference on Supercomputing (ICS). New York, USA: ACM, 2016: 42.
|
[18] |
ELTANTAWY A, AAMODT T M. Warp scheduling for fine-grained synchronization[C]//Proceedings of the 24th International Symposium on High Performance Computer Architecture (HPCA). Piscataway, NJ, USA: IEEE, 2018: 375-388.
|
[19] |
GRAUER-GRAY S, XU L, SEARLES R, et al. Autotuning a high-level language targeted to GPU codes[C]//Innovative Parallel Computing (InPar). Piscataway, NJ, USA: IEEE, 2012: 1-10.
|
[20] |
HE B, FANG W, LUO Q, et al. Mars: A MapReduce framework on graphics processors[C]//International Conference on Parallel Architectures and Compilation Techniques (PACT). New York, USA: ACM, 2008: 260-269.
|
[21] |
BURTSCHER M, NASRE R, PINGALI K. A quantitative study of irregular programs on GPUs[C]//Proceedings of the International Symposium on Workload Characterization (IISWC). Piscataway, NJ, USA: IEEE, 2012: 141-151.
|
[22] |
CHE S, BOYER M, MENG J, et al. Rodinia: A benchmark suite for heterogeneous computing[C]//Proceedings of the International Symposium on Workload Characterization (IISWC). Piscataway, NJ, USA: IEEE, 2009: 44-54.
|
[23] |
BAKHODA A, YUAN G L, FUNG W W L, et al. Analyzing CUDA workloads using a detailed GPU simulator[C]//International Symposium on Performance Analysis of Systems and Software (ISPASS). Piscataway, NJ, USA: IEEE, 2009: 163-174.
|