2023 SiFive

Accelerating the migration from ARM NEON to RISC-V Vectors
Han-Kuan Chen
Senior Engineer, SiFive

Outline
    What are intrinsics?
    How does software support various intrinsics?
    SiFive Recode
    Improve SiFive Recode
    Arm Compute Library benchmark
    OpenCV benchmark

Acknowledgments
Special thanks to Craig Topper, Kito Cheng, Peter Liao and Yi-Hsiu Hsu, who provided mentorship and guidance.

What are intrinsics?
Intrinsics are low-level functions provided by the compiler that allow direct access to specific CPU instructions.
Using intrinsics directly leverages hardware capabilities, which improves the execution speed of performance-critical software tasks.
Most major vendors (Intel, AMD, ARM, etc.) offer intrinsics.
    x86: SSE & AVX
    arm: NEON
    RISC-V: RVV
Intrinsics are widely used in software.
    e.g., TensorFlow, Arm Compute Library, OpenCV, libyuv

How does software support various intrinsics?
Because many different intrinsics exist, several projects have been proposed to minimize the effort required for porting.
Provide a universal interface and translate it to different targets.
    e.g., xnnpack and highway
Translate one set of intrinsics into another internally.
    e.g., simde, AvxToNeon, neon2sse and sse2neon
RISC-V is new. How do we support the various software and intrinsics?

SiFive Recode
Protect your existing software investment, migrate with confidence.
NEON to RVV
Compiler option: -msifive-recode=neon

ARM NEON source code:

    #include <arm_neon.h>

    float32_t dot_prod(const float32_t *in1, const float32_t *in2,
                       uint32_t blockSize) {
      float32x4_t acc = vdupq_n_f32(0.0f);
      for (uint32_t i = 0; i != blockSize; i += 4) {
        float32x4_t A = vld1q_f32(in1 + i);
        float32x4_t B = vld1q_f32(in2 + i);
        acc = vmlaq_f32(acc, A, B);
      }
      return vaddvq_f32(acc);
    }

Compiled SiFive assembly code:

    dot_prod:
        beqz        a2, .LBB0_3
        vsetivli    zero, 4, e32, mf2, ta, ma
        vmv.v.i     v8, 0
    .LBB0_2:
        vle32.v     v9, (a0)
        addi        a0, a0, 16
        vle32.v     v10, (a1)
        addi        a1, a1, 16
        vfmacc.vv   v8, v9, v10
        addiw       a2, a2, -4
        bnez        a2, .LBB0_2
        j           .LBB0_4
    .LBB0_3:
        vsetivli    zero, 4, e32, mf2, ta, ma
        vmv.v.i     v8, 0
    .LBB0_4:
        li          a0, 32
        vsetivli    zero, 2, e32, mf2, ta, ma
        vnsrl.wi    v9, v8, 0
        vnsrl.wx    v8, v8, a0
        vfadd.vv    v8, v9, v8
        vsetivli    zero, 1, e32, mf2, ta, ma
        vslidedown.vi v9, v8, 1
        vfadd.vv    v8, v8, v9
        vfmv.f.s    fa0, v8
        ret

Improve SiFive Recode
We moved Recode from a header-based implementation to an LLVM-based implementation.
    Header-based: NEON intrinsics are implemented with RVV intrinsics in a header.
    LLVM-based: NEON intrinsics are lowered into LLVM IR.
The issues of header-based Recode
    Hard to improve the performance.
    Hard to maintain.
The advantages of LLVM-based Recode
    LLVM has rich optimizations.

Lower NEON intrinsics into LLVM IR

vabsq_s8, NEON intrinsics:

    define void @vabsq_s8(ptr %in_0, ptr %out) {
    entry:
      %0 = load <16 x i8>, ptr %in_0, align 1
      %2 = tail call <16 x i8> @llvm.aarch64.neon.abs.v16i8(<16 x i8> %0)
      store <16 x i8> %2, ptr %out, align 1
      ret void
    }

vabsq_s8, LLVM IR:

    define void @vabsq_s8(ptr %in_0, ptr %out) {
    entry:
      %0 = load <16 x i8>, ptr %in_0, align 1
      %1 = call <16 x i8> @llvm.abs.v16i8(<16 x i8> %0, i1 false)
      store <16 x i8> %1, ptr %out, align 1
      ret void
    }

vmull_u8, NEON intrinsics:

    define void @vmull_u8(ptr %in_0, ptr %in_1, ptr %out) {
    entry:
      %0 = load <8 x i8>, ptr %in_0, align 1
      %1 = load <8 x i8>, ptr %in_1, align 1
      %2 = tail call <8 x i16> @llvm.aarch64.neon.umull.v8i16(<8 x i8> %0, <8 x i8> %1)
      store <8 x i16> %2, ptr %out, align 2
      ret void
    }

vmull_u8, LLVM IR:

    define void @vmull_u8(ptr %in_0, ptr %in_1, ptr %out) {
    entry:
      %0 = load <8 x i8>, ptr %in_0, align 1
      %1 = load <8 x i8>, ptr %in_1, align 1
      %2 = zext <8 x i8> %0 to <8 x i16>
      %3 = zext <8 x i8> %1 to <8 x i16>
      %4 = mul nuw <8 x i16> %2, %3
      store <8 x i16> %4, ptr %out, align 2
      ret void
    }

Header-based Recode vs. LLVM-based Recode
43% speedup after switching to the LLVM solution.
We adopted the LLVM solution.

Arm Compute Library benchmark
42% average speedup compared to the competitor.

OpenCV benchmark
In the core module, SiFive-X280 Recode achieves a 6.7x speedup and SiFive-P470 Recode achieves a 6.9x speedup compared to SiFive-X280 scalar code.

OpenCV benchmark (Cont.)
In the core module, SiFive-P470 Recode achieves a 1.6x speedup compared to the competitor.

Is it the end?
No. We strive for higher performance.
    We are investing in new optimizations.
    At least a 20% improvement is expected.
No. Recode will support more intrinsics.
    SSE, AVX and SVE will be supported.

The future of Empowering
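To make the idea of translating one set of intrinsics into another more concrete, the sketch below re-implements the four NEON intrinsics used by the dot_prod example in portable scalar C. It is only an illustration of the header-style mapping technique, not SiFive Recode itself: a real header-based translation would target RVV intrinsics, and every emu_ name here is hypothetical. The reduction is written pairwise, (lane0 + lane1) + (lane2 + lane3), which is the same order the assembly above produces with its two vnsrl instructions followed by vfadd.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for float32x4_t: four float lanes. */
typedef struct { float lane[4]; } emu_f32x4;

/* Models vdupq_n_f32: broadcast one value to all four lanes. */
static emu_f32x4 emu_vdupq_n_f32(float x) {
    emu_f32x4 r = {{x, x, x, x}};
    return r;
}

/* Models vld1q_f32: load four consecutive floats. */
static emu_f32x4 emu_vld1q_f32(const float *p) {
    emu_f32x4 r = {{p[0], p[1], p[2], p[3]}};
    return r;
}

/* Models vmlaq_f32: per-lane acc + a * b. */
static emu_f32x4 emu_vmlaq_f32(emu_f32x4 acc, emu_f32x4 a, emu_f32x4 b) {
    for (int i = 0; i < 4; ++i)
        acc.lane[i] += a.lane[i] * b.lane[i];
    return acc;
}

/* Models vaddvq_f32 as a pairwise reduction over the four lanes. */
static float emu_vaddvq_f32(emu_f32x4 v) {
    return (v.lane[0] + v.lane[1]) + (v.lane[2] + v.lane[3]);
}

/* Same structure as the NEON dot_prod kernel, built on the emulated calls. */
float emu_dot_prod(const float *in1, const float *in2, uint32_t block_size) {
    /* Like the original loop, this only handles multiples of four. */
    assert(block_size % 4 == 0);
    emu_f32x4 acc = emu_vdupq_n_f32(0.0f);
    for (uint32_t i = 0; i != block_size; i += 4) {
        emu_f32x4 a = emu_vld1q_f32(in1 + i);
        emu_f32x4 b = emu_vld1q_f32(in2 + i);
        acc = emu_vmlaq_f32(acc, a, b);
    }
    return emu_vaddvq_f32(acc);
}
```

A scalar model like this can serve as a reference when checking that a translated kernel still returns the same values as the NEON original.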
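The two lowering examples (vabsq_s8 and vmull_u8) can also be read as plain per-element arithmetic. The scalar C sketch below models what the generic LLVM IR forms compute; the emu_ function names are hypothetical and the code is a semantic model, not compiler output. Two details are worth noting: llvm.abs is called with its second argument set to false, so the absolute value of -128 is defined and stays -128, and the product of two zero-extended 8-bit values is at most 255 * 255 = 65025, which fits in 16 bits and is why the multiply can carry the nuw (no unsigned wrap) flag.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Models one lane of llvm.abs.v16i8 with the poison flag set to false:
 * the result for INT8_MIN is INT8_MIN (it wraps instead of being undefined). */
static int8_t emu_abs_i8(int8_t x) {
    if (x == INT8_MIN)
        return INT8_MIN;
    return (int8_t)(x < 0 ? -x : x);
}

/* Models vabsq_s8: absolute value of 16 signed 8-bit lanes. */
void emu_vabsq_s8(const int8_t in[16], int8_t out[16]) {
    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < 16; ++i)
        out[i] = emu_abs_i8(in[i]);
}

/* Models vmull_u8: zero-extend 8 unsigned 8-bit lanes to 16 bits, then multiply. */
void emu_vmull_u8(const uint8_t a[8], const uint8_t b[8], uint16_t out[8]) {
    assert(a != NULL && b != NULL && out != NULL);
    for (size_t i = 0; i < 8; ++i) {
        uint16_t wa = a[i];                      /* zext i8 -> i16 */
        uint16_t wb = b[i];                      /* zext i8 -> i16 */
        out[i] = (uint16_t)((uint32_t)wa * wb);  /* never exceeds 65025 */
    }
}
```

Expressing the operations in generic form is what lets the existing LLVM optimizations apply to them, in line with the advantage claimed for LLVM-based Recode.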