NVIDIA HPC Application Performance Analysis and Tuning
Pengzhi Zhu, Dec 2020
#page#
AGENDA
Introduction to the compute characteristics of HPC applications
Overview of key HPC acceleration technologies
HPC-X (Open MPI + UCX + HCOLL): application profiling and performance tuning
#page#
HIGH PERFORMANCE COMPUTING
Introduction to the compute characteristics of HPC applications
#page#
How HPC and AI jobs compute: workload partitioning, computation, and reduction
Input data   | Compute method                      | Compute framework
Weather data | Mathematical/physical equations     | MPI applications
Model data (e.g., phone materials and structures) | Mathematical/physical equations | MPI applications
Image files  | Convolutional neural network models | AI framework (back propagation)
Image files  | Convolutional neural network models | Distributed AI framework (stochastic gradient descent)
#page#
What does the API of an HPC algorithm framework (GROMACS) look like?
Terminal (conda env py38): cat README; head npt.gro
[Terminal output: a GROMACS .gro coordinate file -- title line "Protein in water", atom count 62483, followed by fixed-width atom records for residue 1 GLN (N, H1-H3, CA, HA, CB, HB1/HB2, CG, HG1/HG2, CD, OE1, NE2, HE21/HE22) listing positions and velocities]
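The fixed-width .gro records shown in the GROMACS slide above follow the standard .gro column layout: four 5-character fields (residue number, residue name, atom name, atom number), then positions as %8.3f in nm and velocities as %8.4f in nm/ps. As a rough illustration (not part of the original deck), such a record can be parsed like this:

```python
# Minimal sketch: parse one fixed-width GROMACS .gro atom record.
# Columns (standard .gro format): resnum(5) resname(5) atomname(5) atomnum(5),
# then x/y/z positions (%8.3f, nm) and velocities (%8.4f, nm/ps).
def parse_gro_atom(line):
    return {
        "resnum":   int(line[0:5]),
        "resname":  line[5:10].strip(),
        "atomname": line[10:15].strip(),
        "atomnum":  int(line[15:20]),
        "pos":      tuple(float(line[20 + 8*i : 28 + 8*i]) for i in range(3)),
        "vel":      tuple(float(line[44 + 8*i : 52 + 8*i]) for i in range(3)),
    }

# A record in the shape of the slide's first GLN nitrogen atom:
record = "    1GLN      N    1   6.351   3.533   3.926  0.1891  0.9966  0.3800"
atom = parse_gro_atom(record)
```

Slicing by column offset rather than splitting on whitespace matters here, because .gro fields can run together when coordinates are negative.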
#page#
Computation + communication flow in an HPC algorithm framework
[Figure: GROMACS compute + communication flow on CPUs vs. GPUs (peak throughput comparison); see http://www.gromacs.org/GPU_acceleration]
#page#
MPI communication in HPC parallel code
[Figure: MPI calls in the GROMACS 2020 source code]
#page#
Collective communication in HPC/AI compute clusters
[Figure: ring All-Reduce across GPU0-GPU3]
#page#
RDMA -- Remote Direct Memory Access
[Figure: RDMA over InfiniBand -- application buffers in rack 1 and rack 2 are exchanged directly between HCAs, bypassing the kernel, OS, and TCP/IP stacks]
#page#
NVIDIA GPUDirect RDMA acceleration
Without GPUDirect RDMA, the same data is copied 3x through system memory
With GPUDirect RDMA (requires RDMA/RoCE):
Designed to accelerate deep-learning training
Provides the lowest communication latency for CUDA accelerators
Eliminates unnecessary system-memory copy overhead and CPU communication load
#page#
MPI TAG MATCHING AND RENDEZVOUS OFFLOADS
[Figure: tag matching and rendezvous protocol offloaded from software to the HCA]
#page#
Evolution of collective-communication optimization
[Figure: from host/CPU-driven collectives to switch-based in-network aggregation]
#page#
HPC-X (Open MPI + UCX + HCOLL + IPM): application profiling and performance tuning
#page#
The HPC-X software ecosystem
Applications: MPI, PGAS/UPC, PGAS/SHMEM (logical shared memory)
Point-to-point: UCX; collectives: HCOLL
Reliable messaging optimized for InfiniBand; hardware acceleration: SHARP, multicast, CORE-Direct
Hybrid transport mechanism; efficient memory registration; topology awareness; receive-side tag matching
InfiniBand Verbs API
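Returning to the ring All-Reduce figure a few slides back: the pattern can be simulated in a few lines. This is a toy sketch of the standard reduce-scatter + all-gather schedule (2(N-1) steps for N ranks), not NCCL's actual implementation:

```python
# Toy simulation of ring all-reduce: reduce-scatter, then all-gather.
# Each "GPU" (rank) holds a vector of n values, one chunk per rank.
def ring_allreduce(buffers):
    n = len(buffers)
    bufs = [list(b) for b in buffers]
    # Phase 1, reduce-scatter: at step s, rank r sends chunk (r - s) % n to
    # rank (r + 1) % n, which accumulates it. After n-1 steps, rank r holds
    # the complete sum for chunk (r + 1) % n.
    for s in range(n - 1):
        for r in range(n):
            c = (r - s) % n
            bufs[(r + 1) % n][c] += bufs[r][c]
    # Phase 2, all-gather: at step s, rank r forwards its completed chunk
    # (r + 1 - s) % n to rank (r + 1) % n, which overwrites its copy.
    for s in range(n - 1):
        for r in range(n):
            c = (r + 1 - s) % n
            bufs[(r + 1) % n][c] = bufs[r][c]
    return bufs

# 4 "GPUs", each contributing a 4-element gradient; every rank ends with the sum.
out = ring_allreduce([[1, 2, 3, 4]] * 4)
```

Each rank sends and receives only one chunk per step, which is why the ring algorithm is bandwidth-optimal regardless of rank count.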
#page#
HPC-X -- a software toolkit for accelerating HPC clusters
The HPC-X Toolkit includes:
Open MPI / Open SHMEM -- cluster message-passing framework
UCX (Unified Communication X) -- point-to-point RDMA communication acceleration library
MXM -- point-to-point acceleration library
FCA v4.x (aka HCOLL) -- collective-communication offloading acceleration library
SHARP -- in-network-computing collective-communication acceleration library
Profiling Tools (IPM) -- MPI profiling tool
MPI tests and micro-benchmarks (OSU, IMB) -- MPI benchmarking tools
Sources and scripts to rebuild HPC-X if needed (Open MPI and UCX) -- open-source project sources
Terminal: ls -d */ inside the unpacked HPC-X directory lists: archive/ fca/ hcoll/ sharp/ ucx/ utils/ modulefiles/ mxm/ ompi/ sources/
#page#
HPC-X -- OPEN MPI (OMPI)
[Figure: Open MPI layering for an MPI/SHMEM application -- the MPI API sits on OSC (one-sided RMA) and the PML (point-to-point messaging layer) with its OB1 (message matching) and CM components, over the R2 BML and the BTL transport components; ORTE and OSHMEM components not shown]
#page#
HPC-X -- UCX (UNIFIED COMMUNICATION X)
Applications: MPICH, Open MPI, etc.; PGAS/SHMEM, UPC, etc.; RPC, machine learning, etc.; Spark, Hadoop, etc.
UC-P (Protocols) -- high-level API: transport selection, cross-transport multi-rail, fragmentation, emulation of unsupported operations; API domains for message passing, task-based models, PGAS, remote memory access, streams, and active messages
UC-T (Hardware Transports) -- low-level API: active messages, RMA, atomics, tag matching; transports for RC/Verbs, DCT, GPU memory (CUDA, AMD ROCm), and others
UC-S (Services): common utilities and data structures
Hardware: OFA Verbs driver, CUDA, ROCm
#page#
MPI PROFILING tools in HPC-X
Collect statistics on and analyze MPI communication
Observe how application run parameters affect communication, and tune cluster parameters
Developers use the profile to tune code and balance system load
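IPM gathers these statistics by interposing on MPI calls (loaded via LD_PRELOAD, as shown on the next slide). A hypothetical sketch of the interposition idea in Python terms -- the function names here are stand-ins, not IPM's API:

```python
import time
from collections import defaultdict

# Hypothetical sketch of profiling-by-interposition, the idea behind IPM:
# wrap each communication call, accumulate call counts and elapsed time,
# then report the communication/computation split at the end of the run.
stats = defaultdict(lambda: {"calls": 0, "seconds": 0.0})

def profiled(fn):
    def wrapper(*args, **kwargs):
        t0 = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            s = stats[fn.__name__]
            s["calls"] += 1
            s["seconds"] += time.perf_counter() - t0
    return wrapper

@profiled
def sendrecv(nbytes):      # stand-in for an MPI_Sendrecv wrapper
    time.sleep(0.001)      # pretend transfer cost

for _ in range(5):
    sendrecv(8192)
```

The real tool does the same at the C ABI level: its shared library exports the MPI symbols, records timing and message sizes, then forwards each call to the underlying MPI via the PMPI profiling interface.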
How to run MPI communication profiling: export the environment variables and LD_PRELOAD the IPM profiler shared library, e.g. export IPM_LOG=FULL
#page#
IPM -- INTEGRATED PERFORMANCE MONITORING
[Figure: IPM report across hosts -- communication vs. computation share of wall time, and % of MPI time per call]
#page#
MPI MESSAGE SIZES
HPC GROMACS Profiling -- MPI Message Sizes
Majority of data-transfer messages are medium sized, except for MPI_Sendrecv, which has a large concentration from 8B to 8KB
MPI_Bcast shows some concentration
32 Nodes
NETWORK OF EXPERTISE_InteLE5_2697v3.pdf
#page#
MPI DATA TRANSFER (POINT TO POINT)
HPC GROMACS Profiling -- MPI Data Transfer
As the cluster grows, similar communication behavior is seen
Majority of communications are between neighboring ranks
Non-blocking (point-to-point) data and point-to-point transfers are shown in the graph
Collective data communications are small compared to point-to-point communications
2 Nodes / 56 Cores; 32 Nodes / 896 Cores
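Message-size reports like the one above group each message into power-of-two buckets (8B, 16B, ... 8KB, ...). A minimal sketch of that binning -- the helper name is hypothetical, not IPM code:

```python
from collections import Counter

# Minimal sketch: bin MPI message sizes into power-of-two buckets,
# as in an IPM-style message-size histogram.
def size_bucket(nbytes):
    """Return the smallest power-of-two bucket covering nbytes (e.g. 6000 -> '8KB')."""
    b = 1
    while b < nbytes:
        b *= 2
    if b >= 1 << 20:
        return f"{b >> 20}MB"
    if b >= 1 << 10:
        return f"{b >> 10}KB"
    return f"{b}B"

# Histogram over a handful of illustrative message sizes:
histogram = Counter(size_bucket(s) for s in [8, 64, 6000, 8192, 100_000])
```

A concentration "from 8B to 8KB", as the slide describes for MPI_Sendrecv, would show up as tall bars across exactly those buckets.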
#page#
Application performance benchmarking and profiling
HPC GROMACS Summary
Latest system generation improves GROMACS performance at scale
Compute: the Intel Haswell cluster outperforms previous-generation system architectures
Haswell cluster outperforms the Sandy Bridge cluster by 110%, and outperforms the Westmere cluster by 350%, at 32 nodes
Compute: running more CPU cores provides higher performance
7-10% higher productivity with 28 PPN compared to 24 PPN
Network: EDR InfiniBand delivers superior scalability in application performance
EDR InfiniBand provides higher performance and better scalability than 1GbE, 10GbE, or 40GbE
Performance for Ethernet (1GbE/10GbE/40GbE) stays flat (or stops scaling) beyond 2 nodes
Running at single precision is approximately twice as fast as running at double precision
Seen around 41%-47% faster running at SP (single precision) versus DP (double precision)
MPI profile shows the majority of data transfers are point-to-point, non-blocking communications
MPI_Sendrecv and MPI_Waitall are the most-used MPI communications
#page#
HPC-AI ADVISORY COUNCIL best practices
GROMACS and other application cases (e.g., FLUENT): downloadable PDF presentations for AMD Opteron 6100-, Intel X5570-, E5-2697 v2-, E5-2697 v3-, and E5-2680-based systems
https://www.hpcadvisorycouncil.com/bestpractices.php
#page#
GPUDirect-accelerated NCCL2 testing
NCCL-Test performance
module load $HOME/application/mpi/hpcx-v2.6.0-gcc-MLNX_OFED_LINUX-4.7-1.0.0.1-redhat7.7-x86_64/modulefiles/hpcx
NCCL-test run (the deck compares runs with and without GPUDirect RDMA):
mpirun -x NCCL_P2P_DISABLE=0 -x NCCL_SHM_DISABLE=0 -x NCCL_IB_DISABLE=0 \
  -x NCCL_NET_GDR_LEVEL=4 -x NCCL_SOCKET_IFNAME=ib0 -x NCCL_IB_HCA=mlx5_0,mlx5_1 \
  -x NCCL_DEBUG=INFO -x NCCL_MIN_NRINGS=1 -x NCCL_RINGS \
  -mca btl_openib_if_exclude mlx5_0,mlx5_1 \
  -np 2 --pernode -host ops003:1,ops004:1 --report-bindings \
  -mca mpi_show_mca_params 0 -mca pml_ucx_verbose 0 -mca pml ucx \
  -x UCX_NET_DEVICES=mlx5_0:1 \
  -x LD_LIBRARY_PATH=/global/home/users/pengzhiz/application/mpi/hpcx-v2.6.0-gcc-MLNX_OFED_LINUX-4.7-1.0.0.1-redhat7.7-x86_64/ucx/lib:/global/scratch/groups/mellanox/for_pengzhi/ops/nvidia/cuda-10.2/lib64:/global/home/users/pengzhiz/application/nccl/FETCH-HEAD-cuda-10.2/lib \
  $HOME/application/nccl/FETCH-HEAD-cuda-10.2/nccl-tests/all_reduce_perf -b 1 -f 2 -e 32M -g 1 -c 1 -z
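When reading all_reduce_perf output, note that nccl-tests reports two bandwidths: algorithm bandwidth (bytes reduced / time) and bus bandwidth, which for all-reduce rescales by 2*(n-1)/n to reflect the data each rank must actually send and receive in a ring. A quick sketch of that arithmetic -- the example numbers are illustrative, not measurements from the deck:

```python
# How nccl-tests derives bandwidth for all_reduce (see its PERFORMANCE.md):
# algbw  = message bytes / elapsed time
# busbw  = algbw * 2 * (n - 1) / n   (per-rank traffic factor of ring all-reduce)
def allreduce_bandwidth(nbytes, seconds, nranks):
    algbw = nbytes / seconds / 1e9                  # GB/s
    busbw = algbw * 2 * (nranks - 1) / nranks
    return algbw, busbw

# Illustrative: a 32 MB all-reduce across 4 ranks completing in 1 ms.
algbw, busbw = allreduce_bandwidth(32 * 2**20, 1e-3, 4)
```

Bus bandwidth is the number to compare against the link speed: it approaches 2x the algorithm bandwidth as the rank count grows, which is why it is the metric used to judge whether GPUDirect RDMA is paying off.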
#page#
SHARP ALLREDUCE-accelerated distributed AI model training
mpirun -x UCX_NET_DEVICES=mlx5_bond_0:1 -mca pml ob1 -mca btl openib \
  -host gpuhost1:8,gpuhost2:8,gpuhost3:8,gpuhost4:8 \
  -x CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
  -x NCCL_NET_GDR_LEVEL=5 -x NCCL_NET_GDR_READ=1 \
  -x NCCL_IB_HCA=mlx5_0,mlx5_1,mlx5_2,mlx5_3 \
  -x NCCL_COLLNET_ENABLE=1 -x NCCL_IB_DISABLE=0 \
  singularity exec --nv --bind /application /containers/cuda-10.2-cudnn7-devel-centos7 \
  /application/anaconda3/envs/py38-tf220/bin/python /github/tensorflow-benchmarks/tf210/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py \
  --data_name=imagenet --batch_size=512 --use_fp16 --model=alexnet --num_gpus=1 --variable_update=horovod
#page#
A100 80GB GPU FOR SUPERCOMPUTING
#page#
NVIDIA