Exploiting Parallelism in the Simulation of General Purpose Graphics Processing Unit Program

doi:10.1007/s12204-016-1723-2

Abstract: Simulation is an important means of evaluating the performance of computer architectures. At present, serial simulation of the general purpose graphics processing unit (GPGPU) architecture is the main bottleneck for simulation speed. To address this issue, we propose intra-kernel parallelization on a multicore processor and inter-kernel parallelization on a multiple-machine platform, and we apply both methods to the GPGPU-sim simulator. The intra-kernel parallelization method first parallelizes the serial simulation of multiple compute units within one cycle. It then parallelizes the timing and functional simulation to reduce the performance loss caused by synchronization between different compute units. The inter-kernel parallelization method divides the kernels of a CUDA program into several groups and distributes these groups across multiple simulation hosts. Experimental results show that the intra-kernel parallelization method achieves a speed-up of up to 12 with a maximum error rate of 0.0094% on a 32-core machine, and that the inter-kernel parallelization method accelerates the simulation by a factor of up to 3.9 with a maximum error rate of 0.11% on four simulation hosts. Because the two methods are orthogonal, they can be combined on multiple multicore hosts for further performance improvement.
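The two ideas in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation and does not use any GPGPU-sim API: the `ComputeUnit` class, the `simulate_intra_kernel` and `group_kernels` functions, and the contiguous grouping policy are all hypothetical names and assumptions chosen for illustration. The first function steps all compute units concurrently within a cycle and waits for all of them before advancing; the second splits a list of kernels into one group per simulation host. The separation of timing and functional simulation mentioned in the abstract is not modeled.

```python
from concurrent.futures import ThreadPoolExecutor


class ComputeUnit:
    """Hypothetical stand-in for one simulated compute unit."""

    def __init__(self, unit_id):
        self.unit_id = unit_id
        self.cycles_done = 0

    def step(self, cycle):
        # Placeholder for the per-cycle timing model of this unit.
        self.cycles_done += 1
        return (self.unit_id, cycle)


def simulate_intra_kernel(units, num_cycles, workers=4):
    """Step every compute unit concurrently, one simulated cycle at a time."""
    trace = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for cycle in range(num_cycles):
            # list() waits for every unit to finish, so it serves as the
            # per-cycle synchronization point between compute units.
            results = list(pool.map(lambda u: u.step(cycle), units))
            trace.append(results)
    return trace


def group_kernels(kernels, num_hosts):
    """Split kernels into contiguous groups, one per simulation host.

    Assumption: contiguous, near-equal groups; the abstract does not
    state which grouping policy the paper uses.
    """
    base, extra = divmod(len(kernels), num_hosts)
    groups, start = [], 0
    for host in range(num_hosts):
        size = base + (1 if host < extra else 0)
        groups.append(kernels[start:start + size])
        start += size
    return groups
```

In CPython the thread pool mainly illustrates the structure of the per-cycle synchronization rather than delivering real speed-up, since the interpreter lock serializes pure-Python work; a native simulator would use operating-system threads directly.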

Key words: general purpose graphics processing unit (GPGPU), multicore, intra-kernel, inter-kernel, parallel


CLC number: TP 302.7

ZHAO Xia* (赵夏), MA Sheng (马胜), CHEN Wei (陈微), WANG Zhiying (王志英). Exploiting Parallelism in the Simulation of General Purpose Graphics Processing Unit Program[J]. Journal of Shanghai Jiaotong University (Science), 2016, 21(3): 280-288.
