Journal of Shanghai Jiaotong University (Science), 2016, Vol. 21, Issue (3): 280-288. doi: 10.1007/s12204-016-1723-2
ZHAO Xia* (赵 夏), MA Sheng (马 胜), CHEN Wei (陈 微), WANG Zhiying (王志英)
Published: 2016-06-30
Online: 2016-06-30
Corresponding author: ZHAO Xia (赵 夏)
E-mail: xiazhao@nudt.edu.cn
Abstract: Simulation is an important means of evaluating the performance of computer architectures. Currently, the serial simulation of general purpose graphics processing unit (GPGPU) architectures is the main bottleneck to simulation speed. To address this issue, we propose intra-kernel parallelization on a multi-core processor and inter-kernel parallelization on a multiple-machine platform, and apply both methods to the GPGPU-Sim simulator. The intra-kernel parallelization method first parallelizes the otherwise serial simulation of multiple compute units within one cycle, and then parallelizes the timing and functional simulation to reduce the performance loss caused by synchronization between compute units. The inter-kernel parallelization method divides the kernels of a CUDA program into several groups and distributes these groups across multiple simulation hosts. Experimental results show that the intra-kernel parallelization method achieves a speed-up of up to 12× with a maximum error rate of 0.0094% on a 32-core machine, and the inter-kernel parallelization method accelerates the simulation by up to 3.9× with a maximum error rate of 0.11% on four simulation hosts. Because the two methods are orthogonal, they can be combined on multiple multi-core hosts for further performance improvement.
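The two decompositions described in the abstract can be illustrated with a minimal Python sketch. Everything in it is hypothetical: `ComputeUnit`, `simulate_intra_kernel` and `partition_kernels` are illustrative names rather than GPGPU-Sim code, and the toy per-cycle work stands in for real timing and functional models. The intra-kernel part steps every compute unit of one cycle concurrently, waits for all of them (a per-cycle barrier), and then updates shared state serially in a fixed order so results stay deterministic. The inter-kernel part assumes a per-kernel cost estimate is available and greedily cuts the kernel sequence into contiguous groups, one per host; the paper's actual grouping criterion may differ.

```python
from concurrent.futures import ThreadPoolExecutor


class ComputeUnit:
    """Hypothetical stand-in for one simulated GPU compute unit."""

    def __init__(self, uid):
        self.uid = uid
        self.retired = 0   # toy counter of instructions retired
        self.outbox = []   # requests destined for shared state this cycle

    def cycle(self, now):
        # Touches only this unit's private state, so all units can be
        # stepped concurrently within the same simulated cycle.
        issued = (self.uid + now) % 3  # toy, deterministic per-cycle work
        self.retired += issued
        if issued == 2:
            self.outbox.append((now, self.uid))


def simulate_intra_kernel(num_units, num_cycles, workers=4):
    """Step all compute units of each cycle in parallel, then synchronize."""
    units = [ComputeUnit(i) for i in range(num_units)]
    shared_log = []  # stands in for shared interconnect/memory state
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for now in range(num_cycles):
            # Parallel phase: one task per compute unit for this cycle.
            futures = [pool.submit(u.cycle, now) for u in units]
            # Per-cycle barrier: wait for every unit (re-raises any error).
            for f in futures:
                f.result()
            # Serial phase: merge requests in unit order so the outcome
            # does not depend on thread scheduling.
            for u in units:
                shared_log.extend(u.outbox)
                u.outbox.clear()
    return [u.retired for u in units], shared_log


def partition_kernels(kernel_costs, num_hosts):
    """Cut kernels (in launch order) into at most num_hosts contiguous
    groups of roughly equal estimated cost; costs are assumed inputs."""
    if num_hosts < 1:
        raise ValueError("num_hosts must be at least 1")
    target = sum(kernel_costs) / num_hosts
    groups, current, acc = [], [], 0.0
    for k, cost in enumerate(kernel_costs):
        current.append(k)
        acc += cost
        # Close the group once it reaches the target, leaving the
        # final host to take whatever kernels remain.
        if acc >= target and len(groups) < num_hosts - 1:
            groups.append(current)
            current, acc = [], 0.0
    if current:
        groups.append(current)
    return groups
```

For example, `partition_kernels([1, 1, 1, 1], 2)` returns `[[0, 1], [2, 3]]`, and `simulate_intra_kernel` gives identical results with `workers=1` and `workers=4` because shared state is only touched in the serial phase. Because of CPython's global interpreter lock, this sketch shows the synchronization structure rather than real speed-up; a native-threaded implementation would be needed for that. Each inter-kernel group would also start without the microarchitectural state (such as cache contents) that earlier kernels leave behind, which is one plausible source of small accuracy losses; how the paper handles this is not described here.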
ZHAO Xia (赵 夏), MA Sheng (马 胜), CHEN Wei (陈 微), WANG Zhiying (王志英). Exploiting Parallelism in the Simulation of General Purpose Graphics Processing Unit Program[J]. Journal of Shanghai Jiaotong University (Science), 2016, 21(3): 280-288.