Article

Recovering single precision accuracy from Tensor Cores while surpassing the FP32 theoretical peak performance

Publisher

SAGE PUBLICATIONS LTD
DOI: 10.1177/10943420221090256

Keywords

Tensor cores; error correction; SGEMM; mixed precision; rounding

Funding

  1. JSPS KAKENHI, Japan Society for the Promotion of Science [JP18H03248, JP21H03447, JP21J14694]
  2. Japan Science and Technology Agency (JST CREST) [JPMJCR19F5]
  3. Joint Usage/Research Center for Interdisciplinary Large-scale Information Infrastructures in Japan [jh210024NAHI]

Abstract

Tensor Core is a high-performance unit on NVIDIA GPUs for mixed-precision matrix-matrix multiplication. It meets the demands of machine learning and scientific computing, but its accuracy is limited compared to FP32 SIMT Cores. We have developed a Tensor Core implementation that achieves the same accuracy as FP32 SIMT Cores while delivering higher throughput.
Tensor Core is a mixed-precision matrix-matrix multiplication unit on NVIDIA GPUs with a theoretical peak performance of more than 300 TFlop/s on the Ampere architecture. Tensor Cores were developed in response to the high demand for dense matrix multiplication in machine learning. However, many applications in scientific computing, such as preconditioners for iterative solvers and low-precision Fourier transforms, can also exploit them. To compute a matrix multiplication on Tensor Cores, the input matrices must be converted to half precision, which results in a loss of accuracy. To avoid this, the mantissa bits lost in the conversion can be kept in additional half-precision variables and used to correct the accuracy of the matrix-matrix multiplication. Even with this correction, Tensor Cores yield higher throughput than FP32 SIMT Cores. Nevertheless, the correcting capability of this method alone is limited, and the resulting accuracy cannot match that of a matrix multiplication on FP32 SIMT Cores. We address this problem and develop a high-accuracy, high-performance, and low-power-consumption matrix-matrix multiplication implementation using Tensor Cores, which exactly matches the accuracy of FP32 SIMT Cores while achieving superior throughput. The implementation is based on NVIDIA's CUTLASS. We find that the keys to achieving this accuracy are how the rounding inside Tensor Cores is handled and how the underflow probability during the correction computation is reduced. Our implementation achieves 51 TFlop/s for a limited exponent range using FP16 Tensor Cores and 33 TFlop/s for the full exponent range of FP32 using TF32 Tensor Cores on NVIDIA A100 GPUs, outperforming the theoretical FP32 SIMT Core peak performance of 19.5 TFlop/s.
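To make the correction scheme described above concrete, below is a minimal scalar-level sketch in CUDA C++. It is not the paper's CUTLASS-based GEMM: the names `split` and `corrected_mul` and the scaling factor of 2^11 are illustrative assumptions. Each FP32 input is split into an FP16 high part plus an FP16 low part holding the mantissa bits lost in the conversion; the low part is scaled up before conversion so that it is less likely to underflow, and the main and cross terms are accumulated in FP32.

```cuda
#include <cstdio>
#include <cuda_fp16.h>

// Assumed scaling factor for the residual: scaling the low part up before
// converting it to FP16 reduces the chance it underflows, which the
// abstract identifies as a key accuracy issue in the correction step.
constexpr float SCALE = 2048.0f;  // 2^11, illustrative choice

// Split an FP32 value into an FP16 high part and a scaled FP16 low part
// that stores the mantissa bits lost in the conversion.
void split(float v, half &hi, half &lo) {
    hi = __float2half(v);
    lo = __float2half((v - __half2float(hi)) * SCALE);
}

// Corrected scalar product: the main term plus the two cross terms,
// accumulated in FP32. In the real GEMM, each product would be a Tensor
// Core MMA with FP32 accumulators; the lo*lo term is dropped because it
// lies below FP32 precision.
float corrected_mul(float a, float b) {
    half ah, al, bh, bl;
    split(a, ah, al);
    split(b, bh, bl);
    const float main_term = __half2float(ah) * __half2float(bh);
    const float corr_term = (__half2float(ah) * __half2float(bl)
                           + __half2float(al) * __half2float(bh)) / SCALE;
    return main_term + corr_term;
}

int main() {
    const float a = 1.2345678f, b = 7.6543210f;
    printf("FP32 reference : %.9f\n", a * b);
    printf("FP16 only      : %.9f\n",
           __half2float(__float2half(a)) * __half2float(__float2half(b)));
    printf("With correction: %.9f\n", corrected_mul(a, b));
    return 0;
}
```

Comparing the three printed values shows the idea: the FP16-only product deviates from the FP32 reference in the lower mantissa bits, while the corrected product recovers them. The paper's contribution beyond this basic scheme is handling the round-toward-zero behavior inside Tensor Cores and the underflow of the correction terms so that the full-matrix result exactly matches FP32 SIMT Core accuracy.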
