1B-103: Accelerating HPC Applications with SmartNICs (X-ScaleSolutions)

Accelerating HPC Applications with SmartNICs
Donglai Dai, Chief Engineer, X-ScaleSolutions
San Jose, CA, April 26-28, 2022

Outline
- Motivation
- Basic Idea for MVAPICH2-DPU Library Design
- Main Features of MVAPICH2-DPU Library
- Performance Benefits for Benchmarks and Applications
- Conclusion

Requirements for Next-Generation Communication Libraries
- SmartNICs have the potential to take over a wide range of overhead tasks from the host CPUs in a variety of applications.
- Message Passing Interface (MPI) libraries are widely used for parallel and distributed HPC and AI applications in HPC/data centers and clouds.
- Requirements for a high-performance and scalable MPI library:
  - Low-latency communication
  - High-bandwidth communication
  - Minimum contention for host CPU resources to progress non-blocking collectives
  - High overlap of computation with communication
- CPU-based non-blocking communication progress can lead to sub-par performance, because the main application is left with fewer CPU resources for useful application-level computation.

Can MPI Functions be Offloaded?
- The area of network offloading of MPI primitives is still nascent.
- State-of-the-art BlueField DPUs bring more compute power into the network.
- Exploit the additional compute capabilities of modern BlueField DPUs in existing MPI middleware to extract:
  - Peak pure communication performance
  - Overlap of communication and computation (a minimal sketch of this pattern follows below)
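The overlap requirement is easiest to see in code. The sketch below shows the standard MPI 3.1 pattern the rest of the deck builds on: post a non-blocking all-to-all, compute while the library progresses the transfer, then wait. The buffer sizes and the dummy computation are illustrative assumptions, not taken from the slides.

    /* Minimal overlap pattern: a non-blocking collective progressed
     * while the host computes. Buffer sizes and the dummy computation
     * are illustrative. Compile with an MPI compiler, e.g. mpicc. */
    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int nprocs;
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        const int count = 65536;  /* ints exchanged with each peer */
        int *sendbuf = calloc((size_t)count * nprocs, sizeof(int));
        int *recvbuf = calloc((size_t)count * nprocs, sizeof(int));

        /* Post the collective; it can progress in the background
         * (with MVAPICH2-DPU, on the BlueField's ARM cores). */
        MPI_Request req;
        MPI_Ialltoall(sendbuf, count, MPI_INT,
                      recvbuf, count, MPI_INT, MPI_COMM_WORLD, &req);

        /* Application-level computation overlapped with communication. */
        volatile double x = 0.0;
        for (long i = 0; i < 50000000L; ++i)
            x += 1e-9;

        /* Complete the collective before using recvbuf. */
        MPI_Wait(&req, MPI_STATUS_IGNORE);

        free(sendbuf);
        free(recvbuf);
        MPI_Finalize();
        return 0;
    }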

Overview of BlueField-2 DPU
- ConnectX-6 network adapter with 200 Gbps InfiniBand
- System-on-chip containing eight 64-bit ARMv8 A72 cores at 2.7 GHz
- 16 GB of memory for the ARM cores
- The MVAPICH2-DPU MPI library is designed to take advantage of DPUs and accelerate scientific applications.

Basic Idea for MPI Offloading to DPU
- Use of generic and optimized asynchronous progress threads on the ARM cores (the same technique is illustrated host-side in the sketch below) for:
  - Point-to-point operations
  - Collectives
  - RMA operations
[Figure: timeline of processes P0-P3 posting a non-blocking P2P/collective/RMA operation and later calling MPI_Wait/MPI_Waitall; host computation overlaps communication, which is progressed by a communication process/thread on the BlueField, coordinated via control messages.]
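For contrast with DPU offload, the asynchronous-progress-thread idea can be sketched on the host: a helper thread drives an outstanding non-blocking collective with MPI_Test while the main thread computes. This burns a host core, which is exactly the contention the earlier slide argues against; MVAPICH2-DPU instead runs its progress engine on the BlueField's ARM cores. The sketch is a generic illustration, not the library's implementation.

    /* Generic host-side asynchronous progress thread: a helper thread
     * drives an outstanding non-blocking collective with MPI_Test
     * while the main thread computes. Illustrative only. */
    #include <mpi.h>
    #include <pthread.h>

    static MPI_Request g_req;  /* request the helper thread progresses */

    static void *progress_fn(void *arg)
    {
        (void)arg;
        int done = 0;
        while (!done)  /* each MPI_Test call advances the operation */
            MPI_Test(&g_req, &done, MPI_STATUS_IGNORE);
        return NULL;
    }

    int main(int argc, char **argv)
    {
        /* Both threads call MPI, so full thread support is required. */
        int provided;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

        int value = 42;
        MPI_Ibcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD, &g_req);

        pthread_t tid;
        pthread_create(&tid, NULL, progress_fn, NULL);

        volatile double x = 0.0;  /* main thread keeps computing */
        for (long i = 0; i < 50000000L; ++i)
            x += 1e-9;

        pthread_join(tid, NULL);  /* broadcast is complete here */
        MPI_Finalize();
        return 0;
    }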

High Level Design for MPI Offloading to DPU
- Better support for critical collective communication operations
- Enable offloading to the BlueField ARM SoC
- Performance-enhancing algorithm selection based on the communication characteristics of the application
[Figure: MVAPICH2-DPU offload decision logic. Offloadable collectives are routed either to software kernel-based offload on the programmable ARM cores of the high-performance, RDMA-capable BlueField HCA (with separate designs for data on the CPU and data on the DPU), or to hardware (ASIC) based offload on SHARP-enabled switches for in-network collective communication; otherwise generic non-offloaded collective operations are used.]
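The diagram's Yes/No flow can be read as a simple dispatch. The rendering below is hypothetical; every name in it is invented for illustration and none of it is MVAPICH2-DPU internals.

    /* Hypothetical rendering of the diagram's offload decision logic.
     * All identifiers are invented for illustration. */
    typedef enum {
        ROUTE_DPU_ARM_CORES,   /* software kernel offload on BlueField */
        ROUTE_SHARP_SWITCH,    /* hardware (ASIC) offload in the switch */
        ROUTE_NON_OFFLOADED    /* generic non-offloaded collective path */
    } offload_route;

    offload_route choose_offload(int collective_is_offloadable,
                                 int sharp_switch_available,
                                 int data_resides_on_dpu)
    {
        if (!collective_is_offloadable)
            return ROUTE_NON_OFFLOADED;   /* the diagram's "No" branch */
        if (sharp_switch_available)
            return ROUTE_SHARP_SWITCH;    /* in-network collectives */
        /* Otherwise offload to the ARM cores; the library keeps
         * separate designs for data on the CPU vs. on the DPU. */
        (void)data_resides_on_dpu;
        return ROUTE_DPU_ARM_CORES;
    }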

MVAPICH2-DPU Library 2022.02 Release
- Implemented by X-ScaleSolutions
- Based on MVAPICH2 2.3.6; compliant with the MPI 3.1 standard
- Supports all features available with the MVAPICH2 2.3.6 release (http://mvapich.cse.ohio-state.edu)
- Novel framework to offload non-blocking collectives to the DPU
- Offloads non-blocking collectives (MPI_Ialltoall, MPI_Iallgather, MPI_Ibcast, etc.) to the DPU; a usage sketch follows below
- Up to 100% overlap of computation with non-blocking collectives
- Accelerates scientific applications that use non-blocking collectives
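Since the offloaded operations are the standard MPI 3.1 non-blocking collectives, applications call them as usual. A minimal sketch with two outstanding operations from the list above, completed together; the counts, and the assumption that the process count divides the buffer evenly, are illustrative.

    /* Two outstanding non-blocking collectives completed together with
     * MPI_Waitall. Sizes are illustrative and assume nprocs divides
     * the element count evenly. */
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int nprocs;
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        enum { N = 65536 };
        static int gatherbuf[N], resultbuf[N];
        int root_value = 0;

        MPI_Request reqs[2];
        MPI_Iallgather(gatherbuf, N / nprocs, MPI_INT,
                       resultbuf, N / nprocs, MPI_INT,
                       MPI_COMM_WORLD, &reqs[0]);
        MPI_Ibcast(&root_value, 1, MPI_INT, 0, MPI_COMM_WORLD, &reqs[1]);

        /* ... overlapped application computation goes here ... */

        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);  /* complete both */
        MPI_Finalize();
        return 0;
    }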

Total Execution Time with osu_ialltoall (32 Nodes)
[Chart: communication time (ms) vs. message size, BF-2 (osu_ialltoall), MVAPICH2 vs. MVAPICH2-DPU, message sizes 64K-512K, at 32 nodes/32 PPN and 32 nodes/16 PPN. MVAPICH2-DPU reduces total execution time by 17%-23% across the configurations.]

Overlap Between Computation & Communication with osu_ialltoall (32 Nodes)
[Chart: overlap (%) vs. message size, osu_ialltoall, message sizes 1K-512K, at 32 nodes/32 PPN and 32 nodes/16 PPN. MVAPICH2-DPU delivers peak overlap: 100% at 32 PPN and 98% at 16 PPN; the overlap metric is defined below.]
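The overlap percentages in these charts come from the OSU micro-benchmarks. As a point of reference, the overlap metric in OSU-style non-blocking benchmarks is commonly computed as follows; the exact calibration procedure is defined by the benchmark itself.

    overlap(%) = max(0, 100 * (1 - (T_overall - T_compute) / T_pure_comm))

Here T_pure_comm is the time of the collective alone, T_compute is a calibrated computation of comparable duration, and T_overall is the time when both are issued together. 100% means the cost of the collective is fully hidden behind computation.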

Total Execution Time with osu_iallgather (16 Nodes)
[Chart: overall time (ms) vs. message size, BF-2 (osu_iallgather), MVAPICH2 vs. MVAPICH2-DPU, at 16 nodes/32 PPN (message sizes 128K-1M) and 16 nodes/1 PPN (message sizes 2K-16K). MVAPICH2-DPU reduces overall time by 29%-84% across the configurations.]

Overlap Between Computation & Communication with osu_iallgather (16 Nodes)
[Chart: overlap (%) vs. message size, osu_iallgather, 16 nodes/1 PPN, message sizes 128K-1M. MVAPICH2-DPU delivers peak overlap: up to 97%.]

Total Execution Time with osu_ibcast (32 Nodes)
[Chart: communication time (ms) vs. message size, BF-2 (osu_ibcast), MVAPICH2 vs. MVAPICH2-DPU, message sizes 2M-16M, at 32 nodes/16 PPN and 32 nodes/1 PPN. MVAPICH2-DPU reduces communication time by 8%-58% across the configurations.]

Overlap Between Computation & Communication with osu_ibcast (32 Nodes)
[Chart: overlap (%) vs. message size, osu_ibcast, message sizes 2M-16M. MVAPICH2-DPU delivers peak overlap, improving over MVAPICH2 by +30% at 32 nodes/16 PPN and +38% at 32 nodes/1 PPN.]

P3DFFT Application Execution Time (32 Nodes)
[Chart: latency (s) vs. grid size, MVAPICH2 vs. MVAPICH2-DPU. Benefits in application-level execution time: 16%-21% at 32 nodes/32 PPN and 12%-14% at 32 nodes/16 PPN.]

Conclusion
- The MVAPICH2-DPU MPI library efficiently utilizes the BlueField DPU to progress MPI non-blocking collective operations.
- Provides up to 100% overlap of communication and computation for non-blocking Alltoall, Allgather, Bcast, etc.
- Reduces the total execution time of the P3DFFT application by up to 21% on 1,024 processes.
- Work is in progress to efficiently offload more types of non-blocking collective operations to DPUs.

Exhibition and Live Demo
- If you are interested in more details, please visit our exhibit booth #8 next door.
- Live demo of the MVAPICH2-DPU library at our booth: 6-7 pm today and 1-2 pm tomorrow.

Thank You!
Donglai Dai
