NVIDIA HPC Application Performance Analysis and Tuning
Pengzhi Zhu, Dec 2020
#page#
AGENDA
Introduction to the compute characteristics of HPC applications
Overview of key HPC acceleration technologies
HPC-X (Open MPI + UCX + HCOLL): application profiling and performance tuning
#page#
HIGH PERFORMANCE COMPUTING
Introduction to the compute characteristics of HPC applications
#page#
How HPC and AI jobs compute: workload partitioning, computation, and reduction
Input data   | Compute method                      | Compute framework
Weather data | Mathematical/physical equations     | MPI applications
Model data (e.g., phone materials and structures) | Mathematical/physical equations | MPI applications
Image files  | Convolutional neural network models | AI framework (back propagation)
Image files  | Convolutional neural network models | Distributed AI framework (stochastic gradient descent)
#page#
What does the API of an HPC algorithm framework (GROMACS) look like?
Terminal (conda env py38): cat README; head npt.gro
[Terminal output: a GROMACS .gro coordinate file -- title line "Protein in water", atom count 62483, followed by fixed-width atom records for residue 1 GLN (N, H1-H3, CA, HA, CB, HB1/HB2, CG, HG1/HG2, CD, OE1, NE2, HE21/HE22) listing positions and velocities]
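The fixed-width .gro records shown in the GROMACS slide above follow the standard .gro column layout: four 5-character fields (residue number, residue name, atom name, atom number), then positions as %8.3f in nm and velocities as %8.4f in nm/ps. As a rough illustration (not part of the original deck), such a record can be parsed like this:

```python
# Minimal sketch: parse one fixed-width GROMACS .gro atom record.
# Columns (standard .gro format): resnum(5) resname(5) atomname(5) atomnum(5),
# then x/y/z positions (%8.3f, nm) and velocities (%8.4f, nm/ps).
def parse_gro_atom(line):
    return {
        "resnum":   int(line[0:5]),
        "resname":  line[5:10].strip(),
        "atomname": line[10:15].strip(),
        "atomnum":  int(line[15:20]),
        "pos":      tuple(float(line[20 + 8*i : 28 + 8*i]) for i in range(3)),
        "vel":      tuple(float(line[44 + 8*i : 52 + 8*i]) for i in range(3)),
    }

# A record in the shape of the slide's first GLN nitrogen atom:
record = "    1GLN      N    1   6.351   3.533   3.926  0.1891  0.9966  0.3800"
atom = parse_gro_atom(record)
```

Slicing by column offset rather than splitting on whitespace matters here, because .gro fields can run together when coordinates are negative.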
#page#
Computation + communication flow in an HPC algorithm framework
[Figure: GROMACS compute + communication flow on CPUs vs. GPUs (peak throughput comparison); see http://www.gromacs.org/GPU_acceleration]
#page#
MPI communication in HPC parallel code
[Figure: MPI calls in the GROMACS 2020 source code]
#page#
Collective communication in HPC/AI compute clusters
[Figure: ring All-Reduce across GPU0-GPU3]
#page#
RDMA -- Remote Direct Memory Access
[Figure: RDMA over InfiniBand -- application buffers in rack 1 and rack 2 are exchanged directly between HCAs, bypassing the kernel, OS, and TCP/IP stacks]
#page#
NVIDIA GPUDirect RDMA acceleration
Without GPUDirect RDMA, the same data is copied 3x through system memory
With GPUDirect RDMA (requires RDMA/RoCE):
Designed to accelerate deep-learning training
Provides the lowest communication latency for CUDA accelerators
Eliminates unnecessary system-memory copy overhead and CPU communication load
#page#
MPI TAG MATCHING AND RENDEZVOUS OFFLOADS
[Figure: tag matching and rendezvous protocol offloaded from software to the HCA]
#page#
Evolution of collective-communication optimization
[Figure: from host/CPU-driven collectives to switch-based in-network aggregation]
#page#
HPC-X (Open MPI + UCX + HCOLL + IPM): application profiling and performance tuning
#page#
The HPC-X software ecosystem
Applications: MPI, PGAS/UPC, PGAS/SHMEM (logical shared memory)
Point-to-point: UCX; collectives: HCOLL
Reliable messaging optimized for InfiniBand; hardware acceleration: SHARP, multicast, CORE-Direct
Hybrid transport mechanism; efficient memory registration; topology awareness; receive-side tag matching
InfiniBand Verbs API
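Returning to the ring All-Reduce figure a few slides back: the pattern can be simulated in a few lines. This is a toy sketch of the standard reduce-scatter + all-gather schedule (2(N-1) steps for N ranks), not NCCL's actual implementation:

```python
# Toy simulation of ring all-reduce: reduce-scatter, then all-gather.
# Each "GPU" (rank) holds a vector of n values, one chunk per rank.
def ring_allreduce(buffers):
    n = len(buffers)
    bufs = [list(b) for b in buffers]
    # Phase 1, reduce-scatter: at step s, rank r sends chunk (r - s) % n to
    # rank (r + 1) % n, which accumulates it. After n-1 steps, rank r holds
    # the complete sum for chunk (r + 1) % n.
    for s in range(n - 1):
        for r in range(n):
            c = (r - s) % n
            bufs[(r + 1) % n][c] += bufs[r][c]
    # Phase 2, all-gather: at step s, rank r forwards its completed chunk
    # (r + 1 - s) % n to rank (r + 1) % n, which overwrites its copy.
    for s in range(n - 1):
        for r in range(n):
            c = (r + 1 - s) % n
            bufs[(r + 1) % n][c] = bufs[r][c]
    return bufs

# 4 "GPUs", each contributing a 4-element gradient; every rank ends with the sum.
out = ring_allreduce([[1, 2, 3, 4]] * 4)
```

Each rank sends and receives only one chunk per step, which is why the ring algorithm is bandwidth-optimal regardless of rank count.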
#page#
HPC-X -- a software toolkit for accelerating HPC clusters
The HPC-X Toolkit includes:
Open MPI / Open SHMEM -- cluster message-passing framework
UCX (Unified Communication X) -- point-to-point RDMA communication acceleration library
MXM -- point-to-point acceleration library
FCA v4.x (aka HCOLL) -- collective-communication offloading acceleration library
SHARP -- in-network-computing collective-communication acceleration library
Profiling Tools (IPM) -- MPI profiling tool
MPI tests and micro-benchmarks (OSU, IMB) -- MPI benchmarking tools
Sources and scripts to rebuild HPC-X if needed (Open MPI and UCX) -- open-source project sources
Terminal: ls -d */ inside the unpacked HPC-X directory lists: archive/ fca/ hcoll/ sharp/ ucx/ utils/ modulefiles/ mxm/ ompi/ sources/
#page#
HPC-X -- OPEN MPI (OMPI)
[Figure: Open MPI layering for an MPI/SHMEM application -- the MPI API sits on OSC (one-sided RMA) and the PML (point-to-point messaging layer) with its OB1 (message matching) and CM components, over the R2 BML and the BTL transport components; ORTE and OSHMEM components not shown]
#page#
HPC-X -- UCX (UNIFIED COMMUNICATION X)
Applications: MPICH, Open MPI, etc.; PGAS/SHMEM, UPC, etc.; RPC, machine learning, etc.; Spark, Hadoop, etc.
UC-P (Protocols) -- high-level API: transport selection, cross-transport multi-rail, fragmentation, emulation of unsupported operations; API domains for message passing, task-based models, PGAS, remote memory access, streams, and active messages
UC-T (Hardware Transports) -- low-level API: active messages, RMA, atomics, tag matching; transports for RC/Verbs, DCT, GPU memory (CUDA, AMD ROCm), and others
UC-S (Services): common utilities and data structures
Hardware: OFA Verbs driver, CUDA, ROCm
#page#
MPI PROFILING tools in HPC-X
Collect statistics on and analyze MPI communication
Observe how application run parameters affect communication, and tune cluster parameters
Developers use the profile to tune code and balance system load
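IPM gathers these statistics by interposing on MPI calls (loaded via LD_PRELOAD, as shown on the next slide). A hypothetical sketch of the interposition idea in Python terms -- the function names here are stand-ins, not IPM's API:

```python
import time
from collections import defaultdict

# Hypothetical sketch of profiling-by-interposition, the idea behind IPM:
# wrap each communication call, accumulate call counts and elapsed time,
# then report the communication/computation split at the end of the run.
stats = defaultdict(lambda: {"calls": 0, "seconds": 0.0})

def profiled(fn):
    def wrapper(*args, **kwargs):
        t0 = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            s = stats[fn.__name__]
            s["calls"] += 1
            s["seconds"] += time.perf_counter() - t0
    return wrapper

@profiled
def sendrecv(nbytes):      # stand-in for an MPI_Sendrecv wrapper
    time.sleep(0.001)      # pretend transfer cost

for _ in range(5):
    sendrecv(8192)
```

The real tool does the same at the C ABI level: its shared library exports the MPI symbols, records timing and message sizes, then forwards each call to the underlying MPI via the PMPI profiling interface.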
How to run MPI communication profiling: export the environment variables and LD_PRELOAD the IPM profiler shared library, e.g. export IPM_LOG=FULL
#page#
IPM -- INTEGRATED PERFORMANCE MONITORING
[Figure: IPM report across hosts -- communication vs. computation share of wall time, and % of MPI time per call]
#page#
MPI MESSAGE SIZES
HPC GROMACS Profiling -- MPI Message Sizes
Majority of data-transfer messages are medium sized, except for MPI_Sendrecv, which has a large concentration from 8B to 8KB
MPI_Bcast shows some concentration
32 Nodes
NETWORK OF EXPERTISE_InteLE5_2697v3.pdf
#page#
MPI DATA TRANSFER (POINT TO POINT)
HPC GROMACS Profiling -- MPI Data Transfer
As the cluster grows, similar communication behavior is seen
Majority of communications are between neighboring ranks
Non-blocking (point-to-point) data and point-to-point transfers are shown in the graph
Collective data communications are small compared to point-to-point communications
2 Nodes / 56 Cores; 32 Nodes / 896 Cores
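Message-size reports like the one above group each message into power-of-two buckets (8B, 16B, ... 8KB, ...). A minimal sketch of that binning -- the helper name is hypothetical, not IPM code:

```python
from collections import Counter

# Minimal sketch: bin MPI message sizes into power-of-two buckets,
# as in an IPM-style message-size histogram.
def size_bucket(nbytes):
    """Return the smallest power-of-two bucket covering nbytes (e.g. 6000 -> '8KB')."""
    b = 1
    while b < nbytes:
        b *= 2
    if b >= 1 << 20:
        return f"{b >> 20}MB"
    if b >= 1 << 10:
        return f"{b >> 10}KB"
    return f"{b}B"

# Histogram over a handful of illustrative message sizes:
histogram = Counter(size_bucket(s) for s in [8, 64, 6000, 8192, 100_000])
```

A concentration "from 8B to 8KB", as the slide describes for MPI_Sendrecv, would show up as tall bars across exactly those buckets.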
#page#
Application performance benchmarking and profiling
HPC GROMACS Summary
Latest system generation improves GROMACS performance at scale
Compute: the Intel Haswell cluster outperforms previous-generation system architectures
Haswell cluster outperforms the Sandy Bridge cluster by 110%, and outperforms the Westmere cluster by 350%, at 32 nodes
Compute: running more CPU cores provides higher performance
7-10% higher productivity with 28 PPN compared to 24 PPN
Network: EDR InfiniBand delivers superior scalability in application performance
EDR InfiniBand provides higher performance and better scalability than 1GbE, 10GbE, or 40GbE
Performance for Ethernet (1GbE/10GbE/40GbE) stays flat (or stops scaling) beyond 2 nodes
Running at single precision is approximately twice as fast as running at double precision
Seen around 41%-47% faster running at SP (single precision) versus DP (double precision)
MPI profile shows the majority of data transfers are point-to-point, non-blocking communications
MPI_Sendrecv and MPI_Waitall are the most-used MPI communications
#page#
HPC-AI ADVISORY COUNCIL best practices
GROMACS and other application cases (e.g., FLUENT): downloadable PDF presentations for AMD Opteron 6100-, Intel X5570-, E5-2697 v2-, E5-2697 v3-, and E5-2680-based systems
https://www.hpcadvisorycouncil.com/bestpractices.php
#page#
GPUDirect-accelerated NCCL2 testing
NCCL-Test performance
module load $HOME/application/mpi/hpcx-v2.6.0-gcc-MLNX_OFED_LINUX-4.7-1.0.0.1-redhat7.7-x86_64/modulefiles/hpcx
NCCL-test run (the deck compares runs with and without GPUDirect RDMA):
mpirun -x NCCL_P2P_DISABLE=0 -x NCCL_SHM_DISABLE=0 -x NCCL_IB_DISABLE=0 \
  -x NCCL_NET_GDR_LEVEL=4 -x NCCL_SOCKET_IFNAME=ib0 -x NCCL_IB_HCA=mlx5_0,mlx5_1 \
  -x NCCL_DEBUG=INFO -x NCCL_MIN_NRINGS=1 -x NCCL_RINGS \
  -mca btl_openib_if_exclude mlx5_0,mlx5_1 \
  -np 2 --pernode -host ops003:1,ops004:1 --report-bindings \
  -mca mpi_show_mca_params 0 -mca pml_ucx_verbose 0 -mca pml ucx \
  -x UCX_NET_DEVICES=mlx5_0:1 \
  -x LD_LIBRARY_PATH=/global/home/users/pengzhiz/application/mpi/hpcx-v2.6.0-gcc-MLNX_OFED_LINUX-4.7-1.0.0.1-redhat7.7-x86_64/ucx/lib:/global/scratch/groups/mellanox/for_pengzhi/ops/nvidia/cuda-10.2/lib64:/global/home/users/pengzhiz/application/nccl/FETCH-HEAD-cuda-10.2/lib \
  $HOME/application/nccl/FETCH-HEAD-cuda-10.2/nccl-tests/all_reduce_perf -b 1 -f 2 -e 32M -g 1 -c 1 -z
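When reading all_reduce_perf output, note that nccl-tests reports two bandwidths: algorithm bandwidth (bytes reduced / time) and bus bandwidth, which for all-reduce rescales by 2*(n-1)/n to reflect the data each rank must actually send and receive in a ring. A quick sketch of that arithmetic -- the example numbers are illustrative, not measurements from the deck:

```python
# How nccl-tests derives bandwidth for all_reduce (see its PERFORMANCE.md):
# algbw  = message bytes / elapsed time
# busbw  = algbw * 2 * (n - 1) / n   (per-rank traffic factor of ring all-reduce)
def allreduce_bandwidth(nbytes, seconds, nranks):
    algbw = nbytes / seconds / 1e9                  # GB/s
    busbw = algbw * 2 * (nranks - 1) / nranks
    return algbw, busbw

# Illustrative: a 32 MB all-reduce across 4 ranks completing in 1 ms.
algbw, busbw = allreduce_bandwidth(32 * 2**20, 1e-3, 4)
```

Bus bandwidth is the number to compare against the link speed: it approaches 2x the algorithm bandwidth as the rank count grows, which is why it is the metric used to judge whether GPUDirect RDMA is paying off.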
#page#
SHARP ALLREDUCE-accelerated distributed AI model training
mpirun -x UCX_NET_DEVICES=mlx5_bond_0:1 -mca pml ob1 -mca btl openib \
  -host gpuhost1:8,gpuhost2:8,gpuhost3:8,gpuhost4:8 \
  -x CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
  -x NCCL_NET_GDR_LEVEL=5 -x NCCL_NET_GDR_READ=1 \
  -x NCCL_IB_HCA=mlx5_0,mlx5_1,mlx5_2,mlx5_3 \
  -x NCCL_COLLNET_ENABLE=1 -x NCCL_IB_DISABLE=0 \
  singularity exec --nv --bind /application /containers/cuda-10.2-cudnn7-devel-centos7 \
  /application/anaconda3/envs/py38-tf220/bin/python /github/tensorflow-benchmarks/tf210/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py \
  --data_name=imagenet --batch_size=512 --use_fp16 --model=alexnet --num_gpus=1 --variable_update=horovod
#page#
A100 80GB GPU FOR SUPERCOMPUTING
#page#
NVIDIA