27-The_Super_Long_RISC_V_Vector_Machine_RVSC.pdf

T1 - The Long RISC-V Vector Machine Generator

Jiuyang Liu (HUST & Chips Alliance)
Qinjun Li (PLCT/ISCAS)
Yunqian Luo (Tsinghua University)
And other PLCT-CAAT interns and FTEs
Aug 25, 2023

Bio

Jiuyang Liu
liujiuyang.me
sequencer (GitHub)
PGP Key: 0x8D7B5A
- Open Source Chip Developer
- Chisel Developer
- RocketChip Maintainer
- Chips Alliance TSC Member

Topic Today - T1

- Microarchitecture show-off;
- Methodology to tune RISC-V Vector performance;
- My 50 cents on the future of RISC-V Vector in HPC;
- https:/

[Figure: T1 block diagram, top level: core with I$ and D$, Sequencer with Mask Unit and VRF0 (dup), Vector$ Coherence Manager, LSU with ports to Next Level Memory, lanes with buffers.]

[Figure: T1 block diagram, lane detail: Lane XBar, banked VRF (SRAM), VFUs, long-latency VFU (divider), custom VFU (SoftMax, etc.), ring buffer (widen) to/from the Sequencer, slot FSMs.]

The Real Vector Machine

- Large DLEN with multiple VFUs
- Large VRF
- Support for instruction chaining

Figure: T0 by Prof. Krste Asanović

Pioneers - XiangShan

Characteristics:
- Dedicated vector pipeline and rename unit
- No chaining
- Shares float registers with vector registers
- Dedicated VFU pipelines
- Shared load-store unit
- Small DLEN, improves code density, IF and BFU friendly.

Figure: XiangShan Vector main pipeline
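Chaining is the feature that separates the "real vector machine" above from a design without it, so a toy timing model may help. The sketch below is not taken from the slides: it assumes one element per cycle and a fixed start-up latency per instruction, and the function names and the numbers in the example are hypothetical.

```python
def unchained_cycles(vector_length, startup_latencies):
    """Dependent instructions run back to back: each one waits until the
    previous instruction has produced its whole result vector."""
    return sum(latency + vector_length for latency in startup_latencies)


def chained_cycles(vector_length, startup_latencies):
    """With chaining, a dependent instruction starts as soon as the first
    element of its source is available, so the element streams overlap."""
    return sum(startup_latencies) + vector_length


if __name__ == "__main__":
    # Assumed example: 64 elements, a multiply with a 7-cycle start-up
    # feeding an add with a 6-cycle start-up.
    print(unchained_cycles(64, [7, 6]))  # 141
    print(chained_cycles(64, [7, 6]))    # 77
```

In this simplified model the benefit of chaining grows with the vector length and with the number of dependent instructions in the chain.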

Pioneers - X280

Characteristics:
- Dedicated vector pipeline and flop-based registers;
- Configurable to 512-bit DLEN;
- Accesses L1D$ and L2$ simultaneously;
- Dual-issue scalar core with large DLEN but only one lane.

Figure: SiFive X280 Architecture

Pioneers - Others

There are other pioneers, not listed because they are academic, closed-source, or have no silicon products:
- ETH Ara: multiple lanes but no chaining support, interesting toy project.
- UCB Hwacha: non-standard vector implementation, predating RISC-V Vector 1.0, interesting for its decoupled vector architecture.
- Semidynamics Vector: multiple lanes with renaming, looking forward to its customer products.
- T-Head: C908 is too weak to be called a vector processor.
- StarFive/Nuclei/Andes/Rivai: wait for products, god bless them.

Architecture Overview

- Rocket-based RV32 core with OoO write-back, w/o MMU, w/o fast interrupt
- Ring-based interconnection for widening and narrowing instructions (configurable buffer for the ring)
- Single unit on the Sequencer for non-data-parallel instructions, e.g. mask, ffo, reduce.
- Duplicated VRF0 on the Sequencer, used by the mask unit to reduce bandwidth usage from the Sequencer to the lanes
- Single banked LSU with strong outstanding slots.

[Figure: T1 architecture block diagram.]
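Why a vector LSU wants many outstanding slots can be illustrated with Little's law: the number of requests in flight must cover bandwidth times latency. The sketch below is a generic back-of-the-envelope calculation, not T1's actual sizing; the function names and all numbers are assumptions chosen for illustration.

```python
import math


def outstanding_requests_needed(bandwidth_bytes_per_cycle, latency_cycles, line_bytes):
    """Little's law: bytes in flight = bandwidth * latency.
    Dividing by the request size gives the outstanding slots required."""
    bytes_in_flight = bandwidth_bytes_per_cycle * latency_cycles
    return math.ceil(bytes_in_flight / line_bytes)


def sustained_bandwidth(outstanding, latency_cycles, line_bytes):
    """Upper bound on bytes per cycle when only `outstanding` requests
    can be in flight at once."""
    return outstanding * line_bytes / latency_cycles


if __name__ == "__main__":
    # Assumed: 64 B/cycle target, 100-cycle memory latency, 64 B cache lines.
    print(outstanding_requests_needed(64, 100, 64))
    # Assumed: only 12 slots available under the same latency and line size.
    print(sustained_bandwidth(12, 100, 64))
```

Under these assumed numbers a handful of slots caps the achievable bandwidth well below the target, which is the motivation for provisioning outstanding requests generously.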

Architecture Principle

[Figure: T1 architecture block diagram: Rocket Core with I$ and D$; Sequencer with Mask Unit and VRF0 (dup); Vector$ Coherence Manager; LSU with ports to Next Level Memory; lanes with buffers, each containing a Lane XBar, banked VRF (SRAM), VFUs, a long-latency VFU (divider), a custom VFU (SoftMax, etc.), a ring buffer (widen) to/from the Sequencer, and slot FSMs.]

- High-throughput design with scalable lane size, RF size, VFU type and connection matrix;
- Banked SRAM VRF;
- Load to VFU to VFU to Store chaining ability;
- HPC only: slow external interrupt, w/o MMU;

Bandwidth Performance Points

- VRF - Vector Register File
- VFU - Vector Function Unit
- LSU - Load Store Unit

VRF

- The chip frequency bottleneck is the VRF SRAM.
- Flop-based VRF vs. SRAM-based VRF.
- T1 chooses the banked SRAM.

VFU

Trade-off by:
- Cell area
- Routing area to and from the VRF
- Power

Methodology:
- Full custom design for function cells;
- Limit function units to specific cells;
- Token + asynchronous function circuit;
- Retime with commercial EDA tools

T1 only uses retiming for now (we need more human resources).

LSU

There is no magic for the LSU. Our design already supports these features:
- LSU port + banked next-level cache;
- Instruction-level memory interleaving;
- Different LSU logic for handling unit-stride, strided, and indexed loads and stores;
- Merging multiple uops into cacheline-sized accesses to reduce the memory access overhead in the next-level cache;
- Intensive outstanding strategy: 3 MSHRs for each bank, supporting hundreds of outstanding requests;

RISC-V Memory in the future

However, memory is the non-trivial part, which still needs a lot of features to be explored:
- MMU
- Prefetch and cache for sparsity
- Configuration for HBM/DDR

The Trade-off

- Balance the bandwidth among VRF, VFU and LSU, keeping them as busy as possible.
- Hit the memory bandwidth, and keep everything as busy as possible.

Performance

The current result (untuned yet) is a little awkward:

| #bank | cycle/elem | memop/cycle | vrf writes/cycle | avg chaining size |
|---|---|---|---|---|
| 2 | 2.03 | 1.48 | 0.25 | 0.79 |
| 4 | 1.70 | 1.76 | 0.31 | 0.94 |
| 8 | 1.65 | 1.81 | 0.30 | 0.94 |
| 16 | 1.65 | 1.81 | 0.30 | 0.94 |

Table: matmul 128 128

The Layout

Based on our auto floorplanner, the test layout is:

[Figure: test layout]

Tuning Step

Thanks to Chisel, all parameters are configurable!
- Check the bandwidth that the SoC can provide;
- Check the frequency that the SRAM can provide;
- From the bandwidth and frequency plus architecture-level information, calculate the required SRAM ports and VFU counts;
- Based on the workload (specifically the tiling size), decide the VLEN;
- Based on physical design constraints, decide the lane count and the VFU count inside each lane;

How to tune?

For a long vector architecture, we are always bitten by the memory wall:
- In hardware: sparse vector cache; interconnect-protocol friendly (TileLink); memory-protocol friendly.
- In compiler: increase the n in Load to VFU n to Store; be uArch-aware and generate chainable instructions as much as possible.
- In software: friendly to tiling, no warp, no SM.
- In architecture: add custom vector instructions that operate on elements to reduce VRF/LSU bandwidth waste.

BuddyCompiler + MLIR to speed up compiler exploration

Burn the old world down:
- limited programming model
- limited loop nest optimization
- limited to CUDA

To make it friendly: rewrite high-level operations in MLIR.

Future Work

- MMU support;
- FGMT scalar core for handling interrupts while vector SIMD code is running;
- MLIR Sparse Compiler for RVV;
- Custom instruction support;

The Future of Vector is unlimited.

| Processor (year) | Vector clock rate (MHz) | Vector registers | Elements per register | Vector arithmetic units | Vector load-store units | Lanes |
|---|---|---|---|---|---|---|
| Cray-1 (1976) | 80 | 8 | 64 | 6: FP add, FP multiply, FP reciprocal, integer add, logical, shift | 1 | 1 |
| Cray X-MP (1983) | 118 | 8 | 64 | 6: FP add, FP multiply, FP reciprocal, integer add, logical, shift, population count/parity | 2 loads 1 store | 1 |
| Cray Y-MP (1988) | 166 | 8 | 64 | 6: FP add, FP multiply, FP reciprocal, integer add, logical, shift, population count/parity | 2 loads 1 store | 1 |
| Cray-2 (1985) | 244 | 8 | 64 | 5: FP add, FP multiply, FP reciprocal/sqrt, integer add/shift/population count, logical | 1 | 1 |
| Fujitsu VP100 (1982) | 133 | 8-256 | 32-1024 | 3: FP or integer add/logical, multiply, divide | 2 | 1 |
| Fujitsu VP200 (1982) | 133 | 8-256 | 32-1024 | 3: FP or integer add/logical, multiply, divide | 2 | 2 |
| Hitachi S810 (1983) | 71 | 32 | 256 | 4: FP multiply-add, FP multiply/divide-add unit, 2 integer add/logical | 3 loads 1 store | 1 |
| Hitachi S820 (1983) | 71 | 32 | 256 | 4: FP multiply-add, FP multiply/divide-add unit, 2 integer add/logical | 3 loads 1 store | 2 |
| Convex C-1 (1985) | 10 | 8 | 128 | 2: FP or integer multiply/divide, add/logical | 1 | 1 (64 bit), 2 (32 bit) |
| NEC SX/2 (1985) | 167 | 8+32 | 256 | 4: FP multiply/divide, FP add, integer add/logical, shift | 1 | 4 |
| Cray C90 (1991) | 240 | 8 | 128 | 8: FP add, FP multiply, FP reciprocal, integer add, 2 logical, shift, population count/parity | 2 loads 1 store | 2 |
| Cray T90 (1995) | 460 | 8 | 128 | 8: FP add, FP multiply, FP reciprocal, integer add, 2 logical, shift, population count/parity | 2 loads 1 store | 2 |
| NEC SX/5 (1998) | 312 | 8+64 | 512 | 4: FP or integer add/shift, multiply, divide, logical | 1 | 16 |
| Fujitsu VPP5000 (1999) | | | -4096 | 3: FP or integer multiply, add/logical, divide | 1 load 1 store | 16 |
| Cray SV1 (1998) | 300 | 8 | 64 (MSP) | 8: FP add, FP multiply, FP reciprocal, integer add, 2 logical, shift, population count/parity | 1 load-store | 2 |
| SV1ex (2001) | 500 | 8 | 64 (MSP) | 8: FP add, FP multiply, FP reciprocal, integer add, 2 logical, shift, population count/parity | 1 load-store | 8 (MSP) |
| VMIPS (2001) | 500 | 8 | 64 | 5: FP multiply, FP divide, FP add, integer add/shift, logical | 1 load-store | 1 |
| NEC SX/6 (2001) | 500 | 8+64 | 256 | 4: FP or integer add/shift, multiply, divide, logical | 1 | 8 |
| NEC SX/8 (2004) | 2000 | 8+64 | 256 | 4: FP or integer add/shift, multiply, divide, logical | 1 | 4 |
| Cray X1 (2002) | 800 | 32 | 64 | 3: FP or integer, add/logical, multiply/shift, divide/square root/logical | 1 load 1 store | 2, 8 (MSP) |
| Cray X1E (2005) | 1130 | 32 | 64 | 3: FP or integer, add/logical, multiply/shift, divide/square root/logical | 1 load 1 store | 2, 8 (MSP) |

Thanks

- Chisel/RocketChip community for the Rocket core generator.
- SiFive for the inclusive cache generator.
- Buddy Compiler and MLIR community for the software generator.
- Krste's PhD thesis for inspiring us.
- YSYX tapeout shuttle.
- PLCT Lab for funding.
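The sizing procedure on the Tuning Step slide (derive SRAM ports and lane/VFU counts from the bandwidth the SoC provides and the frequency the SRAM reaches) can be sketched as a first-order calculation. This is only an illustration under stated assumptions, not the T1 generator's actual method: the `SizingInput` record, its field names, the one-lane-width-per-cycle consumption model, and every number in the example are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class SizingInput:
    """Hypothetical inputs to a first-order sizing pass."""
    mem_bandwidth_gb_per_s: float  # bandwidth the SoC can deliver (assumed)
    sram_freq_ghz: float           # clock the VRF SRAM closes timing at (assumed)
    lane_width_bits: int           # datapath width of one lane (assumed)
    vrf_accesses_per_op: int       # VRF reads plus writes per VFU operation
    ports_per_bank: int            # access ports offered by one SRAM bank


def memory_bytes_per_cycle(cfg):
    # GB/s divided by GHz (giga-cycles per second) gives bytes per cycle.
    return cfg.mem_bandwidth_gb_per_s / cfg.sram_freq_ghz


def lanes_to_match_memory(cfg):
    # Assumes each lane consumes one lane-width of data per cycle.
    lane_bytes = cfg.lane_width_bits // 8
    return max(1, math.ceil(memory_bytes_per_cycle(cfg) / lane_bytes))


def banks_per_lane(cfg, vfus_per_lane):
    # Every busy VFU needs its operand reads and result write served each cycle.
    accesses_per_cycle = vfus_per_lane * cfg.vrf_accesses_per_op
    return max(1, math.ceil(accesses_per_cycle / cfg.ports_per_bank))


if __name__ == "__main__":
    cfg = SizingInput(64.0, 1.0, 64, 3, 1)  # illustrative values only
    print(lanes_to_match_memory(cfg))
    print(banks_per_lane(cfg, vfus_per_lane=2))
```

A real flow would then revisit these counts against the workload's tiling size (for VLEN) and the physical design constraints, as the remaining tuning steps describe.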
