Yu Chen, 2020/12
THE INT8 QUANTIZATION OF FASTER TRANSFORMER 3.0 ENCODER

2  WHAT IS FASTER TRANSFORMER
Introduce the Faster Transformer

18  HOW TO DO INT8 QUANTIZATION WITH CUBLASLT
Suppose we calculate C = A × Bᵀ, where A is a column-major matrix of size (m, k) and B is a column-major matrix of size (n, k).

Step 1. Create the descriptor of the matmul:

    cublasLtHandle_t ltHandle = NULL;
    cublasLtCreate(&ltHandle);
    cublasLtMatmulDesc_t matmulDesc = NULL;
    cudaDataType_t scaleType = CUDA_R_32I;
    int32_t alpha = 1, beta = 0;
    cublasComputeType_t computeType = CUBLAS_COMPUTE_32I;
    cublasOperation_t opTranspose = CUBLAS_OP_T;
    cublasLtMatmulDescCreate(&matmulDesc, computeType, scaleType);
    cublasLtMatmulDescSetAttribute(matmulDesc, CUBLASLT_MATMUL_DESC_TRANSB,
                                   &opTranspose, sizeof(opTranspose));

Sample code can be found here.

Step 2. Create the descriptors of the matrices:

    cublasLtMatrixLayout_t Adesc = NULL, Bdesc = NULL, Cdesc = NULL;
    int lda = m, ldb = n, ldc = m;
    cublasLtMatrixLayoutCreate(&Adesc, CUDA_R_8I, m, k, lda);
    cublasLtMatrixLayoutCreate(&Bdesc, CUDA_R_8I, n, k, ldb);
    cublasLtMatrixLayoutCreate(&Cdesc, CUDA_R_32I, m, n, ldc);

Step 3. Create the descriptors of the transformed matrices:

    int ldatransform = 32 * m;
    int ldbtransform = 32 * roundoff(n, 8);
    int ldctransform = 32 * m;
    int8_t *Atransform = NULL, *Btransform = NULL;