HPC Application Performance Analysis and Tuning.pdf

No. 29469, PDF, 31 pages, 4.13 MB

NVIDIA
HPC Application Performance Analysis and Tuning
Pengzhi Zhu, Dec 2020

AGENDA
- Introduction to HPC application compute characteristics
- Key HPC acceleration technologies
- HPC-X + Open MPI + UCX + HCOLL: application profiling and performance tuning

HIGH PERFORMANCE COMPUTING
Introduction to HPC application compute characteristics

How HPC-AI workloads are computed
Stages: data; compute method; workload distribution, computation, and reduction.
Compute input | Compute method | Compute framework
- Weather data | mathematical-physics equations | MPI application
- Model data, e.g. phone materials and structures | mathematical-physics equations | MPI application
- Image files | convolutional neural network model | AI framework, back propagation
- Convolutional neural network model | distributed AI framework, stochastic gradient descent

What does the API of an HPC algorithm framework (GROMACS) look like?
[Terminal screenshot in a (py38) environment: a README with the run command, followed by `head npt.gro`. The file is a GROMACS coordinate file titled "Protein in water"; the visible records belong to a GLN residue (atoms N, H1, H2, H3, CA, HA, CB, HB1, HB2, CG, HG1, HG2, CD, OE1, NE2, HE21, HE22, ...) with position and velocity columns.]

Compute + communication processing flow in an HPC algorithm framework
[Figure: per-step timeline on CPUs versus with GPUs. Source: www.gromacs.org, GPU acceleration page.]

MPI communication in HPC parallel code
[Figure: excerpt from the GROMACS 2020 source code.]

Collective communication in HPC-AI compute clusters
[Figure: All-Reduce across GPU0, GPU1, GPU2, and GPU3.]

RDMA: Remote Direct Memory Access
[Figure: data path between two hosts in Rack 1 and Rack 2, with the layers User Application, Kernel/OS, and Hardware (HCA, NIC), contrasting RDMA over InfiniBand with TCP/IP.]

NVIDIA GPUDirect RDMA acceleration technology
- Designed to accelerate deep learning training
- Provides the lowest communication latency for CUDA accelerator cards
- Eliminates unnecessary system memory copies and CPU communication load
[Figure: CPU, chipset, system memory, GPU, and GPU memory. Without GPUDirect RDMA, the same data is copied 3x on its way to the network. With GPUDirect RDMA (requires RDMA/RoCE), those copies are avoided.]

MPI TAG MATCHING AND RENDEZVOUS OFFLOADS
[Figure only.]

Evolution of collective communication optimization techniques
[Figure: hosts, CPUs, and switches, with aggregation moving into the switch.]

HPC-X
OPEN MPI + UCX + HCOLL + IPM
Application profiling and performance tuning

HPC-X software ecosystem
- Applications
- Programming models: MPI, PGAS/UPC, PGAS/SHMEM (logical shared memory over per-node memory)
- Point-to-point: UCX
  - Reliable messaging optimized for InfiniBand HCA
  - Hybrid transport mechanism
  - Efficient memory registration
  - Receive-side tag matching
- Collectives: HCOLL
  - Hardware acceleration: SHARP, Multicast, CORE-Direct
  - Topology aware
- InfiniBand Verbs API

HPC-X: a software toolkit for accelerating HPC clusters
The HPC-X Toolkit contains:
- Open MPI / OpenSHMEM: cluster message-passing framework
- UCX (Unified Communication X): point-to-point RDMA communication acceleration library
- MXM: point-to-point acceleration library
- FCA v4.x (aka HCOLL): collective communication offloading library
- SHARP: in-network computing library for collective communication
- Profiling tools (IPM): MPI profiling tool
- MPI tests and micro benchmarks (OSU, IMB): MPI benchmarking tools
- Sources and scripts to rebuild HPC-X if needed (Open MPI and UCX): open-source project sources
[Terminal screenshot: directory listing of the hpcx package: archive/, fca/, hcoll/, sharp/, ucx/, utils/, modulefiles/, mxm/, ompi/, sources/.]

HPC-X - OPEN MPI (OMPI)
[Figure: Open MPI component stack. MPI/SHMEM application; ORTE; MPI API; OSHMEM; OSC (MPI RMA) base with an RDMA component; PML (p2p messaging layer) base with the cm and OB1 (message matching) components; R2 (BML); BTL base. Some components are not shown.]

HPC-X - UCX (UNIFIED COMMUNICATION X)
- Applications: MPICH, Open MPI, etc.; RPC, machine learning, etc.; PGAS/SHMEM, UPC, etc.; Spark, Hadoop, etc.
- UC-P (Protocols), high-level API: transport selection, cross-transport multi-rail, fragmentation, emulation of unsupported operations
  - I/O API domain: stream
  - Message passing API domain
  - Task-based API domain: active messages
  - PGAS API domain: remote memory access
- UC-T (Hardware Transports), low-level API: active message, RMA, atomic, tag matching
  - Transport for InfiniBand Verbs (RC, DCT, ...)
  - Transport for GPU memory access (CUDA, AMD ROCm)
  - Other transports
- UC-S (Services): common utilities, data structures, memory access utilities
- OFA Verbs driver, CUDA, ROCm
- Hardware

MPI profiling tools in HPC-X
- Collect and analyze MPI communication statistics
- Observe how application run parameters affect communication, and tune cluster parameters accordingly
- Developers use profiling to tune code and balance system load
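As an illustration of the kind of MPI communication statistic mentioned above, the following minimal sketch bins message sizes into power-of-two buckets. It is not IPM's implementation; the function name `bucket_message_sizes` and the sample sizes are hypothetical and chosen only for the example.

```python
from collections import Counter


def bucket_message_sizes(sizes_bytes):
    """Count messages per power-of-two bucket, keyed by the bucket's upper bound in bytes."""
    counts = Counter()
    for size in sizes_bytes:
        if size < 0:
            raise ValueError("message size cannot be negative")
        # Smallest power of two >= size; sizes 0 and 1 share the 1-byte bucket.
        upper = 1 if size <= 1 else 1 << (size - 1).bit_length()
        counts[upper] += 1
    return dict(sorted(counts.items()))


# Hypothetical message sizes (bytes), as a profiler might record per MPI call.
sample_sizes = [8, 8, 64, 100, 1024, 8192, 5000]
histogram = bucket_message_sizes(sample_sizes)

for upper, count in histogram.items():
    print(f"<= {upper:>5} B: {count} message(s)")
```

A histogram like this is what makes statements such as "most messages fall between 8 B and 8 KB" checkable for a given run.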

How to run MPI communication analysis
Export the environment variables and LD_PRELOAD the IPM profiler shared library:

    export IPM_LOG=FULL

IPM - INTEGRATED PERFORMANCE MONITORING
[Figure: IPM report showing per-host communication versus computation time and the share of MPI time.]

MPI MESSAGE SIZES
GROMACS Profiling - MPI Message Sizes (32 nodes)
- Most data-transfer messages are medium-sized, except for MPI_Sendrecv, which has a large concentration in the 8 B to 8 KB range
- MPI_Bcast shows some concentration
Slide footer: NETWORK OF EXPERTISE, ...Analysis_Intel_E5_2697v3.pdf

MPI DATA TRANSFER (POINT TO POINT)
GROMACS Profiling - MPI Data Transfer (panels: 2 nodes / 56 cores and 32 nodes / 896 cores)
- As the cluster grows, similar communication behavior is seen
- Most communication is between neighboring ranks
- The graph shows non-blocking and point-to-point data transfers
- Collective data communication is small compared to point-to-point communication

Application performance benchmarking and profiling
GROMACS Summary
- The latest system generation improves GROMACS performance at scale
- Compute: the Intel Haswell cluster outperforms the system architectures of previous generations
  - The Haswell cluster outperforms the Sandy Bridge cluster by 110% and the Westmere cluster by 350% at 32 nodes
- Compute: running more CPU cores provides higher performance
  - 7-10% higher productivity with 28 PPN compared to 24 PPN
- Network: EDR InfiniBand delivers superior scalability in application performance
  - EDR InfiniBand provides higher performance and scales better than 1GbE, 10GbE, or 40GbE
  - Performance for Ethernet (1GbE/10GbE/40GbE) stays flat (or stops scaling) beyond 2 nodes
- Running at single precision is approximately twice as fast as running at double precision
  - Around 41%-47% faster when running at SP (single precision) versus DP (double precision)
- The MPI profile shows that most data transfers are point-to-point and non-blocking communications
  - MPI_Sendrecv and MPI_Waitall are the most used MPI communication calls

HPC-AI ADVISORY COUNCIL best practices
[Screenshot: the HPC-AI Advisory Council best practices page, listing cases such as GROMACS and FLUENT with "Download PDF presentation" links for AMD Opteron-based and Intel-based (X5570, E5-2697v2, E5-2697v3) systems.]

NCCL2 test accelerated by GPUDirect
[Chart: NCCL-Test performance. Recoverable labels and values: "w/o GPUDirect RDMA", 7.79017, 12.0333.]

    module load $HOME/application/mpi/hpcx-v2.6.0-gcc-MLNX_OFED_LINUX-4.7-1.0.0.1-redhat7.7-x86_64/modulefiles/hpcx

    mpirun -x NCCL_P2P_DISABLE=0 -x NCCL_SHM_DISABLE=0 -x NCCL_IB_DISABLE=0 \
        -x NCCL_NET_GDR_LEVEL=4 -x NCCL_SOCKET_IFNAME=ib0 \
        -x NCCL_IB_HCA=mlx5_0,mlx5_1 -x NCCL_DEBUG=INFO \
        -x NCCL_MIN_NRINGS=1 -x NCCL_RINGS \
        -mca btl_openib_if_exclude mlx5_0,mlx5_1 \
        -np 2 -pernode -host ops003:1,ops004:1 -pernode -report-bindings \
        -mca mpi_show_mca_params 0 -mca pml_ucx_verbose 0 -mca pml ucx \
        -x UCX_NET_DEVICES=mlx5_0:1 \
        -x LD_LIBRARY_PATH=/global/home/users/pengzhiz/application/mpi/hpcx-v2.6.0-gcc-MLNX_OFED_LINUX-4.7-1.0.0.1-redhat7.7-x86_64/ucx/lib:/global/scratch/groups/mellanox/for_pengzhi/ops/nvidia/cuda-10.2/lib64:/global/home/users/pengzhiz/application/nccl/FETCH-HEAD-cuda-10.2/lib \
        $HOME/application/nccl/FETCH-HEAD-cuda-10.2/nccl-tests/all_reduce_perf -b 1 -f 2 -e 32M -g 1 -c 1 -z
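Long launch lines like the NCCL test command above are easy to mistype. The sketch below shows one way to assemble such an `mpirun` argument list from dictionaries so that settings can be varied between runs. It only builds and prints the command; nothing is executed. The helper name `build_mpirun`, the one-slot-per-host layout, and the `./all_reduce_perf` path are assumptions for illustration; `-np`, `-host`, `-mca`, and `-x NAME=VALUE` are Open MPI's own options.

```python
import shlex


def build_mpirun(env, hosts, program, program_args, mca=None):
    """Assemble an Open MPI mpirun argument list; the command is not run."""
    # One process per host, matching the "host:1" slots used on the slide.
    argv = ["mpirun", "-np", str(len(hosts)),
            "-host", ",".join(f"{host}:1" for host in hosts)]
    for key, value in (mca or {}).items():
        argv += ["-mca", key, value]
    # "-x NAME=VALUE" exports an environment variable to every rank.
    for name, value in env.items():
        argv += ["-x", f"{name}={value}"]
    argv.append(program)
    argv += list(program_args)
    return argv


# Values taken from the slide; the program path is a placeholder.
nccl_env = {
    "NCCL_IB_HCA": "mlx5_0,mlx5_1",
    "NCCL_NET_GDR_LEVEL": "4",
    "NCCL_DEBUG": "INFO",
    "UCX_NET_DEVICES": "mlx5_0:1",
}
argv = build_mpirun(
    env=nccl_env,
    hosts=["ops003", "ops004"],
    program="./all_reduce_perf",
    program_args=["-b", "1", "-f", "2", "-e", "32M", "-g", "1"],
    mca={"pml": "ucx"},
)
print(shlex.join(argv))
```

Keeping the environment in a dictionary makes it straightforward to rerun the same benchmark with a single variable changed and compare the results.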

Distributed AI model training accelerated by SHARP allreduce

    mpirun -x UCX_NET_DEVICES=mlx5_bond_0:1 -mca pml ob1 -mca btl openib \
        -host gpuhost1:8,gpuhost2:8,gpuhost3:8,gpuhost4:8 \
        -x CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7" \
        -x NCCL_NET_GDR_LEVEL=5 -x NCCL_NET_GDR_READ=1 \
        -x NCCL_IB_HCA=mlx5_0,mlx5_1,mlx5_2,mlx5_3 \
        -x NCCL... \
        -x NCCL_COLLNET_ENABLE=1 -x NCCL_IB_DISABLE=0 \
        singularity exec --nv --bind /application /containers/cuda-10.2-cudnn7-devel-centos7 \
        /application/anaconda3/envs/py38-tf220/bin/python \
        /github/tensorflow-benchmarks/tf210/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py \
        --data_name=imagenet --batch_size=512 --use_fp16 --model=alexnet \
        --num_gpus=1 --variable_update=horovod

A100 80GB GPU FOR SUPERCOMPUTING
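For readers new to the allreduce used in the training run above, here is a minimal single-process sketch of what the operation computes: every rank contributes a buffer and every rank receives the elementwise sum. The pairwise tree reduction followed by a broadcast is only an illustrative model of aggregating partial results; it is not how SHARP, NCCL, or Horovod are implemented, and the function name and gradient values are hypothetical.

```python
def allreduce_sum(rank_buffers):
    """Simulate a sum allreduce: return one result buffer per rank, all identical."""
    if not rank_buffers:
        return []
    length = len(rank_buffers[0])
    if any(len(buf) != length for buf in rank_buffers):
        raise ValueError("all ranks must contribute buffers of equal length")

    # Reduce phase: combine buffers pairwise until a single total remains.
    partial = [list(buf) for buf in rank_buffers]
    while len(partial) > 1:
        merged = []
        for i in range(0, len(partial), 2):
            if i + 1 < len(partial):
                merged.append([a + b for a, b in zip(partial[i], partial[i + 1])])
            else:
                merged.append(partial[i])  # odd one out moves up unchanged
        partial = merged

    # Broadcast phase: every rank receives its own copy of the total.
    total = partial[0]
    return [list(total) for _ in rank_buffers]


# Hypothetical per-GPU gradients for a two-parameter model on four ranks.
gradients = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
results = allreduce_sum(gradients)
print(results[0])
```

In data-parallel training, each rank would then divide the summed gradient by the number of ranks before applying the update.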
