26-d3s4-4-SiFive_Accelerating the migration from ARM NEON to RISC-V Vectors_Han-Kuan Chen.pdf


Accelerating the migration from ARM NEON to RISC-V Vectors
Han-Kuan Chen
Senior Engineer, SiFive
2023 SiFive

Outline
- What are intrinsics?
- How does software support various intrinsics?
- SiFive Recode
- Improve SiFive Recode
- Arm Compute Library benchmark
- OpenCV benchmark

Acknowledgments
Special thanks to Craig Topper, Kito Cheng, Peter Liao and Yi-Hsiu Hsu, who provided mentorship and guidance.

What are intrinsics?
- Intrinsics are low-level functions provided by the compiler that allow direct access to specific CPU instructions.
- Using intrinsics directly leverages hardware capabilities, which improves the execution speed of performance-critical software tasks.
- Most major vendors (Intel, AMD, ARM, etc.) offer intrinsics.
  - x86: SSE & AVX
  - arm: NEON
  - RISC-V: RVV
- Intrinsics are widely used in software, e.g., TensorFlow, Arm Compute Library, OpenCV, libyuv.

How does software support various intrinsics?
- Because so many intrinsic sets exist, several projects have been proposed to minimize the effort required for porting.
  - Provide a universal interface and translate it to different targets, e.g., xnnpack and highway.
  - Translate one set of intrinsics internally into another, e.g., simde, AvxToNeon, neon2sse and sse2neon.
- RISC-V is new. How do we support various software and intrinsics?

SiFive Recode
Protect your existing software investment, migrate with confidence.

NEON to RVV, compiler option: -msifive-recode=neon

ARM NEON source code:

    #include <arm_neon.h>

    /* The loop exits on i != blockSize, so blockSize must be a multiple of 4. */
    float32_t dot_prod(const float32_t *in1, const float32_t *in2,
                       uint32_t blockSize) {
      float32x4_t acc = vdupq_n_f32(0.0f);
      for (uint32_t i = 0; i != blockSize; i += 4) {
        float32x4_t A = vld1q_f32(in1 + i);
        float32x4_t B = vld1q_f32(in2 + i);
        acc = vmlaq_f32(acc, A, B);
      }
      return vaddvq_f32(acc);
    }

Compiled SiFive assembly code:

    dot_prod:
            beqz    a2, .LBB0_3
            vsetivli        zero, 4, e32, mf2, ta, ma
            vmv.v.i v8, 0
    .LBB0_2:
            vle32.v v9, (a0)
            addi    a0, a0, 16
            vle32.v v10, (a1)
            addi    a1, a1, 16
            vfmacc.vv       v8, v9, v10
            addiw   a2, a2, -4
            bnez    a2, .LBB0_2
            j       .LBB0_4
    .LBB0_3:
            vsetivli        zero, 4, e32, mf2, ta, ma
            vmv.v.i v8, 0
    .LBB0_4:
            li      a0, 32
            vsetivli        zero, 2, e32, mf2, ta, ma
            vnsrl.wi        v9, v8, 0
            vnsrl.wx        v8, v8, a0
            vfadd.vv        v8, v9, v8
            vsetivli        zero, 1, e32, mf2, ta, ma
            vslidedown.vi   v9, v8, 1
            vfadd.vv        v8, v8, v9
            vfmv.f.s        fa0, v8
            ret

Improve SiFive Recode
- We moved Recode from a header-based implementation to an LLVM-based one: RVV intrinsics in a header versus lowering NEON intrinsics into LLVM IR.
- The issues of header-based Recode:
  - Hard to improve the performance.
  - Hard to maintain.
- The advantages of LLVM-based Recode:
  - LLVM has rich optimizations.

Lower NEON intrinsics into LLVM IR

vabsq_s8, as a NEON intrinsic call:

    define void @vabsq_s8(ptr %in_0, ptr %out) {
    entry:
      %0 = load <16 x i8>, ptr %in_0, align 1
      %1 = tail call <16 x i8> @llvm.aarch64.neon.abs.v16i8(<16 x i8> %0)
      store <16 x i8> %1, ptr %out, align 1
      ret void
    }

vabsq_s8, lowered to generic LLVM IR:

    define void @vabsq_s8(ptr %in_0, ptr %out) {
    entry:
      %0 = load <16 x i8>, ptr %in_0, align 1
      %1 = call <16 x i8> @llvm.abs.v16i8(<16 x i8> %0, i1 false)
      store <16 x i8> %1, ptr %out, align 1
      ret void
    }

vmull_u8, as a NEON intrinsic call:

    define void @vmull_u8(ptr %in_0, ptr %in_1, ptr %out) {
    entry:
      %0 = load <8 x i8>, ptr %in_0, align 1
      %1 = load <8 x i8>, ptr %in_1, align 1
      %2 = tail call <8 x i16> @llvm.aarch64.neon.umull.v8i16(<8 x i8> %0, <8 x i8> %1)
      store <8 x i16> %2, ptr %out, align 2
      ret void
    }

vmull_u8, lowered to generic LLVM IR:

    define void @vmull_u8(ptr %in_0, ptr %in_1, ptr %out) {
    entry:
      %0 = load <8 x i8>, ptr %in_0, align 1
      %1 = load <8 x i8>, ptr %in_1, align 1
      %2 = zext <8 x i8> %0 to <8 x i16>
      %3 = zext <8 x i8> %1 to <8 x i16>
      %4 = mul nuw <8 x i16> %2, %3
      store <8 x i16> %4, ptr %out, align 2
      ret void
    }

Header-based Recode vs. LLVM-based Recode
- 43% speedup after switching to the LLVM solution.
- Decision: adopt the LLVM solution.

Arm Compute Library benchmark
- 42% average speedup compared to the competitor.

OpenCV benchmark
- In the core module, SiFive-X280 Recode achieves a 6.7x speedup and SiFive-P470 Recode a 6.9x speedup compared to SiFive-X280 scalar.

OpenCV benchmark (Cont.)
- In the core module, SiFive-P470 Recode achieves a 1.6x speedup compared to the competitor.

Is it the end?
- No. We strive for higher performance.
  - We are investing in new optimizations.
  - At least 20% improvement is expected.
- No. Recode will support more intrinsics.
  - SSE, AVX and SVE will be supported.

The future of Empowering
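The header-based style of translation mentioned under "Improve SiFive Recode" (and used by projects such as simde or neon2sse) re-implements each source intrinsic on top of something the target understands. The sketch below is not SiFive Recode's code. It is a minimal, portable illustration that assumes plain scalar C as the "target" and defines local stand-ins for the four NEON intrinsics used by dot_prod, so the NEON source compiles unchanged without arm_neon.h. The struct layout of float32x4_t is an assumption made for the example only.

```c
#include <stdint.h>

/* Assumption: local stand-ins for the arm_neon.h types. A real header-based
   translator would map these onto the target's vector types instead. */
typedef float float32_t;
typedef struct { float32_t lane[4]; } float32x4_t;

/* Broadcast one scalar to all four lanes. */
static inline float32x4_t vdupq_n_f32(float32_t value) {
  float32x4_t r = {{value, value, value, value}};
  return r;
}

/* Load four consecutive floats from memory. */
static inline float32x4_t vld1q_f32(const float32_t *ptr) {
  float32x4_t r = {{ptr[0], ptr[1], ptr[2], ptr[3]}};
  return r;
}

/* Per-lane multiply-accumulate: a + b * c. */
static inline float32x4_t vmlaq_f32(float32x4_t a, float32x4_t b,
                                    float32x4_t c) {
  float32x4_t r;
  for (int i = 0; i < 4; ++i) {
    r.lane[i] = a.lane[i] + b.lane[i] * c.lane[i];
  }
  return r;
}

/* Horizontal add across the vector, summed pairwise: (l0 + l1) + (l2 + l3). */
static inline float32_t vaddvq_f32(float32x4_t a) {
  return (a.lane[0] + a.lane[1]) + (a.lane[2] + a.lane[3]);
}

/* The NEON source from the slide, unchanged.
   blockSize must be a multiple of 4. */
float32_t dot_prod(const float32_t *in1, const float32_t *in2,
                   uint32_t blockSize) {
  float32x4_t acc = vdupq_n_f32(0.0f);
  for (uint32_t i = 0; i != blockSize; i += 4) {
    float32x4_t A = vld1q_f32(in1 + i);
    float32x4_t B = vld1q_f32(in2 + i);
    acc = vmlaq_f32(acc, A, B);
  }
  return vaddvq_f32(acc);
}
```

With in1 = {1, ..., 8} and in2 all ones, dot_prod(in1, in2, 8) returns 36. A layer like this shows why the header approach is hard to optimize: each intrinsic is translated in isolation, whereas lowering to LLVM IR lets the compiler optimize across intrinsic boundaries.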
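The two lowerings shown under "Lower NEON intrinsics into LLVM IR" can be read as ordinary per-lane arithmetic. The following is a scalar C model of what the generic IR computes, written only to make the semantics concrete. The function names ref_vabsq_s8 and ref_vmull_u8 are hypothetical and do not belong to any library.

```c
#include <stdint.h>

/* Models llvm.abs.v16i8(%0, i1 false): because the second operand is false,
   INT8_MIN is not poison and the result wraps back to INT8_MIN. */
void ref_vabsq_s8(const int8_t in[16], int8_t out[16]) {
  for (int i = 0; i < 16; ++i) {
    int v = in[i];
    if (v == INT8_MIN) {
      out[i] = INT8_MIN;                  /* |-128| does not fit in i8 */
    } else {
      out[i] = (int8_t)(v < 0 ? -v : v);
    }
  }
}

/* Models zext <8 x i8> to <8 x i16> followed by mul nuw: each 8-bit lane is
   widened to 16 bits before multiplying. The largest product is
   255 * 255 = 65025, which fits in 16 bits, so the multiply never wraps. */
void ref_vmull_u8(const uint8_t a[8], const uint8_t b[8], uint16_t out[8]) {
  for (int i = 0; i < 8; ++i) {
    out[i] = (uint16_t)((uint16_t)a[i] * (uint16_t)b[i]);
  }
}
```

Expressed this way, both operations are patterns that generic LLVM passes already understand, which is the advantage the slides attribute to LLVM-based Recode.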
