Intel's Ponte Vecchio GPU - Architecture, Systems and Software
Hong Jiang, Ph.D.
Chief GPU Compute Architect, Intel Fellow
Intel Corporation
August 2022
Intel Confidential - Department or Event Name

Agenda
- Ponte Vecchio Platform
- oneAPI Software Stack
- Ponte Vecchio Architecture Highlights
- Application Performance

Architecture
[Diagram: Xe-core; Ponte Vecchio 2-Stack]

Platform configurations
- Ponte Vecchio OAM
- Ponte Vecchio x4 Subsystem with Xe Links
- Ponte Vecchio x4 Subsystem with Xe Links + 2S Sapphire Rapids
The x4 subsystem supports all-to-all connection across Xe Links.
OAMs support all-to-all topologies for both 4 GPU and 8 GPU platforms.

oneAPI
Specification and more information: https:/
SW Contract: Open, Free, Everywhere, Forever
[Diagram layers]
- Languages: SYCL/OpenMP Languages; Ecosystem Languages
- Libraries: oneAPI Libraries; Ecosystem & Intel Libraries
- Tools: Tools (Port, Compile, Debug, Analyze); Ecosystem Tools
- Hardware abstraction: oneAPI Level Zero; Ecosystem HALs
- Target ISAs: x86 ISA, Xe GPU ISA, FPGA ISA, ASIC ISA, ARM ISA, NV GPU ISA (PTX)

oneAPI Level-Zero
[Diagram: Optimized Applications; Optimized Middleware & Frameworks; Direct Programming (Data Parallel C++ and other programming languages); API-Based Programming (Libraries); Compatibility Tool; Host Interface; Level-Zero Interface; Target System Software for CPU, GPU, FPGA, Other Accelerator; Sysman (Resource Mgmt, Power, Frequency, Temperature, Health)]
The oneAPI Level-Zero APIs provide a low-level hardware interface to support cross-architecture programming:
- Interface for oneAPI and other tools to accelerator devices
- Fine-grained control and low latency to accelerator capabilities
- Multi-threaded design
- For GPUs, ships as a part of the driver
The Level-Zero Sysman APIs provide in-band access to resource management features for each accelerator device.

Intel DPC++ Compatibility Tool
- Assists developers migrating code written in CUDA* to SYCL once, generating human readable code wherever possible
- 90-95% of code typically migrates automatically
- Inline comments are provided to help developers finish porting the application
- Open Source build, SYCLomatic, available today
Resources
- Intel DPC++ Compatibility Tool Resources including Training and Examples: https:/
- Optimization Guide: https:/
Testimonials and Samples
- Samsung experience with Compatibility Tool and VTune: https:/
- examples of others in industry that have used tool: https:/
- National Lab Webinar: https://www.alcf.anl.gov/support-center/aurora/intel-dpc-compatibility-tool

Peak Throughput (Ponte Vecchio 2-Stack)
Data type               Peak throughput
FP64                    52 TFLOPS
FP32                    52 TFLOPS
XMX Float 32 (TF32)     419 TFLOPS
XMX BF16                839 TFLOPS
XMX FP16                839 TFLOPS
XMX INT8                1678 TOPS
XMX: Xe Matrix Extensions
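As a quick reading aid for the peak throughput figures, the ratios between the XMX rates and the 52 TFLOPS vector rate can be worked out directly. This is only arithmetic on the numbers quoted on the slide; the rounding is ours.

```latex
\begin{align*}
\frac{\text{XMX TF32}}{\text{FP32}} &= \frac{419}{52} \approx 8.1 &
\frac{\text{XMX BF16 or FP16}}{\text{FP32}} &= \frac{839}{52} \approx 16.1 &
\frac{\text{XMX INT8}}{\text{FP32}} &= \frac{1678}{52} \approx 32.3 \\
\frac{\text{XMX BF16}}{\text{XMX TF32}} &= \frac{839}{419} \approx 2.0 &
\frac{\text{XMX INT8}}{\text{XMX BF16}} &= \frac{1678}{839} = 2.0
\end{align*}
```

Each halving of the operand width roughly doubles the quoted XMX peak rate.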
Large Bandwidth and Cache bring data close to Compute (Ponte Vecchio 2-Stack)
                        Register File   L1 Cache    L2 Cache    HBM
Maximum Size            64 MB           64 MB       408 MB      128 GB
Peak Read Bandwidth     419 TB/s        105 TB/s    13 TB/s     3.2 TB/s
Ratio annotations on the slide: 1:6, 1:1, 4:1, 8:1, 4:1. These are consistent with size ratios of 1:1 (Register File to L1) and about 1:6 (L1 to L2), and with bandwidth ratios of 4:1 (Register File to L1), 8:1 (L1 to L2) and 4:1 (L2 to HBM).

Compute Efficiency Techniques
Register File
- Register caching
- Accumulators
L1 and L2 Cache
- Write Through
- Write Back
- Write Streaming
- Uncached
Prefetch
- Software (instruction) prefetch to L1 and/or L2
- Command Streamer prefetch to L2 for instructions and data

Sensitivity to L2 Cache Size
- DNN Case: ResNet-50 Training Performance Sensitivity to L2 Cache Size. Larger L2 captures activations between layers.
- 2D-FFT Case: 4Kx4K 2D-FFT DP Performance Sensitivity to L2 Cache Size.
[Charts: relative performance (0% to 100%) versus L2 cache size (32 MB, 80 MB, 408 MB); one chart also plots GFLOPS, with data labels as extracted: 3284, 3489, 6587]
In this sensitivity study, the 408 MB L2 cache in Ponte Vecchio is down-configured to 80 MB and 32 MB.
See backup for workloads and configurations. Results may vary. Intel does not control or audit third-party data. You should consult other sources to evaluate accuracy.

Xe-core
Transpose Dataport:
- Block data load/store
- Address calculation offload
- Hardware data padding
- Hardware Transpose
Latency Coverage:
- Large RF/L1$
- Loop unrolling
- Software prefetch
[Chart: BF16 GEMM Efficiency (K=4096); XMX Efficiency (0% to 100%) versus M,N value; series: A x B, A x Trans(B), Trans(A) x B]
Roofline performance, including transpose, achieved using C++ code.
SPMD and SIMD programming models
SYCL and OpenMP
- Thread Group size of 32 and 16
- Common GPU Programming Model
- DPC++ compatibility tool
SYCL eSIMD and OpenMP
- Flexible SIMD width in offload regions
- Common CPU Programming Model
- Seamless transition from well-tuned CPU code

    for each A
        code 1;
        for each B
            code 2;
        for each C
            code 3;
        code 4;

    // SIMT code
    for each A            // Thread+SIMD32
        code 1;
        for each B
            code 2;
        for each C
            code 3;
        code 4;

    // SIMD code
    for each A            // Thread
        code 1;
        for each B        // SIMD32
            code 2;
        for each C        // SIMD8
            code 3;
        code 4;

https://intel.github.io/llvm-docs/
SPMD: Single Program Multiple Data
SIMT: Single Instruction Multiple Thread
SIMD: Single Instruction Multiple Data

Ponte Vecchio-SYCL vs. NVIDIA-A100-SYCL vs. A100-CUDA (Higher is better)
- Start with some popular legacy CUDA workloads
- Migrated to SYCL using DPC++ Compatibility Tool
  o Comparable performance on competitor's hardware
- Same performance portable SYCL code runs on Ponte Vecchio
  o Ponte Vecchio outperforms NVIDIA A100, up to 2.5x
Chart data labels, listed in workload order as extracted (A100-CUDA is the third series in the chart):
Workload                  A100-SYCL   Ponte Vecchio-SYCL
aobench                   0.93        1.79
apriori                   1.13        1.56
SYCL HP linpack           1.00        1.47
reverse time migration    0.96        1.74
Hashtable                 0.97        2.48
sobel filter              0.99        1.49
Easywave                  1.11        1.42
Image Denoising           1.15        1.40

OpenMP SIMD on Ponte Vecchio
- Start with legacy OpenMP SIMD codes, well-tuned on CPU
  o C, C++, FORTRAN, etc.
- Transition to Ponte Vecchio without code refactorization
- Quickly achieving performance target on Ponte Vecchio with OpenMP SIMD
  o SPMD code needs more tuning
[Chart: HACC on Ponte Vecchio, 16 MPI Ranks (Relative time, lower is better); axis 0 to 4.5; series: Hierarchical Parallelism, SIMD and Hierarchical Parallelism, SIMT]

OpenMP loop SIMD
[Diagram labels: Narrow SIMD width; low register pressure; Avoid memory spill]

    #include <stdio.h>
    #include <stdlib.h>

    #define N (1024 * 1024 * 1024)

    int main()
    {
        /* Heap allocation: 8 GB of doubles does not fit on the stack. */
        double *a = malloc((size_t)N * sizeof(double));
        if (a == NULL) return 1;
        for (int i = 0; i < N; i++) a[i] = i;

        #pragma omp target teams distribute parallel for map(tofrom: a[0:N])
        for (int i = 0; i < N; i++) {
            double x = 0.1;
            /* Narrow SIMD width for the high register pressure loop. */
            #pragma omp simd simdlen(8) reduction(+:x)
            for (int j = 0; j < 16; j++) x++;
            a[i] = a[i] + x;
        }

        for (int i = 0; i < N; i++) printf("%f\n", a[i]);
        free(a);
        return 0;
    }

- Determine SIMD width at loop level
- Go to a lower SIMD width to squeeze through a high register pressure loop and avoid spills
- Sacrifice parallelism only in the high register pressure SIMD loops
miniBUDE
Throughput on Workload BIG5 with 983040 Poses, 2672 Ligands, 2672 Proteins (GFLOPS, higher is better)
[Chart: axis 0 to 35000 GFLOPS; series: NVIDIA A100, Ponte Vecchio]
Application Summary: miniBUDE is an implementation of the core computation of the Bristol University Docking Engine (BUDE) in different HPC programming models, using a tuned empirical free-energy forcefield to predict the binding energy of the ligand with the target. The benchmark is a virtual screening run of the NDM-1 protein and runs the energy evaluation for a single generation of poses repeatedly, for a configurable number of iterations. Increasing the iteration count has similar performance effects to docking multiple ligands back-to-back in a production BUDE docking run. https:/
[Diagram labels: docking problem; Virtual Screening by Docking]
Ponte Vecchio with Intel oneAPI DPC++ implementation
ISC High Performance 2021 conference paper: https:/
Frontiers Asia 2022 presentation: https:/
performance lead

NekRS
Relative Performance of NekRS Benchmarks w/ problem size of 8192 (Averaged throughput, higher is better)
https://ceed.exascaleproject.org/docs/ceed-ms38-report.pdf
ExaSMR: Small modular reactors (SMRs) and advanced reactor concepts (ARCs) will deliver clean, flexible, reliable, and affordable electricity while avoiding the traditional limitations of large nuclear reactor designs. https://www.exascaleproject.org/research-project/exasmr/
Application Summary: NekRS is an open-source Navier Stokes solver based on the spectral element method targeting classical processors and accelerators like GPUs. The code started as a fork of libParanumal in 2019. For API portable programming OCCA is used. https:/
OCCA is an open-source library which aims to make it easy to program different types of devices (e.g. CPU, GPU, FPGA). It provides a unified API for interacting with backend device APIs (e.g. OpenMP, CUDA, OpenCL),
uses just-in-time compilation to build backend kernels, and provides a kernel language, a minor extension to C, to abstract programming for each backend. https://libocca.org
Chart data labels, listed in benchmark order as extracted (axis 0 to 2; series: Ponte Vecchio, NVIDIA A100):
Benchmark        Relative performance
AxHelm (FP64)    1.5
AxHelm (FP32)    1.3
FDM (FP32)       1.7
AdvSub           1.4
Ponte Vecchio with Intel oneAPI DPC++ implementation
1.5x performance lead

OpenMC
Depleted Fuel Inactive Batch Performance on HM-Large Reactor with 40M particles (particles/second, higher is better)
[Chart: series: NVIDIA A100, Ponte Vecchio]
Application Summary: OpenMC is a Monte Carlo particle transport application that has recently been ported to the OpenMP target offloading programming model for use on GPU-based systems. The Monte Carlo method employed by OpenMC is considered the gold standard for high-fidelity simulation while also having the advantage of being a general-purpose method able to simulate nearly any geometry or material without the need for domain-specific assumptions. However, despite the extreme advantages in ease of use and accuracy, Monte Carlo methods like those in OpenMC often suffer from a very high computational cost. The extreme performance gains OpenMC has achieved on GPUs, as compared to traditional CPU architectures, are finally bringing within reach a much larger class of problems that historically were deemed too expensive to simulate using Monte Carlo methods. The leap in performance that GPUs are now offering carries with it the potential to disrupt a number of engineering technology stacks that have traditionally been dominated by non-general deterministic methods. For instance, faster MC applications may greatly expand the design space and simplify the regulation process
for new nuclear reactor designs, potentially improving the economics of nuclear energy and therefore helping to solve the world's climate crisis.
Ponte Vecchio with OpenMP Target offload
Exascale Compute Project Annual Meeting 2022 presentation: https://www.alcf.anl.gov/events/2022-ecp-annual-meeting
International Conference on Physics of Reactors 2022 presentation: https://www.ans.org/meetings/physor2022/session/view-976/
2x performance lead
https://docs.openmc.org

Summary
- Largest high bandwidth cache in a GPU
- Open SW model with SPMD/SIMD flexibility
- Leadership performance

Ready to win? Visit bit.ly/HotWings22 and match Intel speakers to their talks for a chance to win an Intel NUC Mini PC and other prizes.

Notices and Disclaimers
Performance varies by use, configuration and other factors. Learn more on the Performance Index site.
Software and workloads used in performance tests may have been optimized for performance only on Intel products. Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available updates. See backup for configuration details. No product or component can be absolutely secure.
Results that are based on pre-production systems and components as well as results that have been estimated or simulated using an Intel Reference Platform (an internal example new system), internal Intel analysis or architecture simulation or modeling are provided to you for informational purposes only. Results may vary based on future changes to any systems, components, specifications, or configurations.
Intel technologies may require enabled hardware, software or service activation. Your costs and results may vary.
Intel does not control or audit third-party data. You should consult other sources to evaluate accuracy.
Code names are used by Intel to identify products, technologies, or services that are in development and not publicly available. These are not commercial names and not intended to function as trademarks.
Statements in this document that refer to future plans or expectations are forward-looking statements. These statements are based on current expectations and involve many risks and uncertainties that could cause actual results to differ materially from those expressed or implied in such statements. For more information on the factors that could cause actual results to differ materially, see our most recent earnings release and SEC filings at .
Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.
Backup: workloads and configurations

Cache Benefits for 2D-FFT and DNN: Test by Intel as of 8/1/2022, 2s future Intel Xeon CPU codenamed Sapphire Rapids, 1x pre-production two-stack Ponte Vecchio GPU, Ubuntu 20.04, pre-production oneAPI, configuring L2 Cache to 408 MB, 80 MB, and 32 MB. 2D-FFT configuration was with 4K*4K DP 2D FFT. ResNet-50 v1.5 used xemlbench, an internal Python software framework; Explicit scaling with Batch-128 on each of the two Ponte Vecchio stacks; Dataset: ImageNet2012, Precision: BF16/FP32 mixed, Accuracy: 76.17% top1.

XMX Matrix Compute Efficiency: Test by Intel as of 8/1/2022, 2s future Intel Xeon CPU codenamed Sapphire Rapids, 1x pre-production two-stack Ponte Vecchio GPU, Ubuntu 20.04, oneAPI pre-production software. Internal engineering benchmark software; Explicit scaling with the matrix operations with the given dimensions on each of the two Ponte Vecchio stacks.

Intel DPC++ Compatibility Tool Results: Test by Intel as of 8/1/2022.
NVIDIA Config: 2s Intel Xeon Platinum 8360Y, PCIe NVIDIA A100 80GB. Software: SYCL open source/CLANG 15.00, CUDA SDK 11.6 with NVIDIA-NVCC 11.6.55, cuMath 11.6, cuDNN 11.6, Ubuntu 21.10. SYCL open source/CLANG compiler switches: -fsycl-targets=nvptx64-nvidia-cuda. NVIDIA NVCC compiler switches: -O3 -gencode arch=compute_80,code=sm_80. Represented workloads with Intel Internal optimizations.
Intel Config: Intel pre-production platform with 2s 4th Gen Intel Xeon Scalable and 1x two-stack Ponte Vecchio GPU running pre-production oneAPI, Ubuntu 20.04.

HACC SIMD and SIMT Results: Test by Intel as of 8/1/2022, 2s future Intel Xeon CPU codenamed Sapphire Rapids HBM, 1x pre-production two-stack Ponte Vecchio GPU, SLES 15 SP3m, Ubuntu 20.04. HACC settings: 16 ranks, Np^3 = 504^3 particles, Geometry = 4x2x2. SIMT HACC version from https://asc.llnl.gov/coral-2-benchmarks. SIMD HACC version is Intel Internal, utilizing MPI+OpenMP.

miniBUDE: Tested by Intel as of 8/1/2022. BIG5 dataset (2672 ligands, 2672 proteins, and 983040 poses) https:/
NVIDIA Config: 2s Intel Xeon Platinum 8360Y, PCIe NVIDIA A100 80GB, CUDA 11.4.
Intel Config: 2s Intel Xeon Platinum 8360Y, 1x two-stack pre-production Ponte Vecchio GPU, Ubuntu 20.04, pre-production oneAPI.

NekRS: Tested by Intel as of 8/1/2022. Benchmark: NekRS AxHelm (BK5) FP64, AxHelm (BK5) FP32, FDM FP32 and advSub with problem size of 8192 (E = 2x16^3).
NVIDIA Config: CPU: AMD EPYC 7532 32-Core Processor, PCIe NVIDIA A100 80GB, DDR4-3200 256 GB (8x32G DIMMs) RAM.
Intel Config: 2s Intel Xeon Platinum 8360Y 2.40 GHz; Memory: 256 GB DDR4 3200, 1x two-stack pre-production Ponte Vecchio GPU, Ubuntu 20.04, pre-production oneAPI.

OpenMC:
NVIDIA Config: Test by Argonne National Laboratory as of 5/23/2022, 2x AMD EPYC 7532, 256 GB DDR4 3200, HT On, Turbo On, ucode 0x8301038. 1x A100 40GB PCIe. OpenSUSE Leap 15.3, Linux Version 5.3.18. Libraries: CUDA 11.6 with OpenMP clang compiler.
Build Knobs: cmake --preset=llvm_a100 -DCMAKE_UNITY_BUILD=ON -DCMAKE_UNITY_BUILD_MODE=BATCH -DCMAKE_UNITY_BUILD_BATCH_SIZE=1000 -DCMAKE_INSTALL_PREFIX=./install -Ddebug=off -Doptimize=on -Dopenmp=on -Dnew_w=on -Ddevice_history=off -Ddisable_xs_cache=on -Ddevice_printf=off
Benchmark: Depleted Fuel Inactive Batch Performance on HM-Large Reactor with 40M particles.
Intel Config: Test by Intel as of 5/25/2022, 1-node, 2x Intel Xeon Scalable Processor 8360Y, 256 GB DDR4 3200, HT On, Turbo On, ucode 0xd0002c1. 1x pre-production Ponte Vecchio. Ubuntu 20.04, Linux Version 5.10.54, agama 434.
Build Knobs: cmake -DCMAKE_CXX_COMPILER=mpiicpc -DCMAKE_C_COMPILER=mpiicc -DCMAKE_CXX_FLAGS="-cxx=icpx -mllvm -indvars-widen-indvars=false -Xclang -fopenmp-declare-target-global-default-no-map -std=c++17 -Dgsl_CONFIG_CONTRACT_CHECKING_OFF -fsycl -DSYCL_SORT -D_GLIBCXX_USE_TBB_PAR_BACKEND=0" --preset=spirv -DCMAKE_UNITY_BUILD=ON -DCMAKE_UNITY_BUILD_MODE=BATCH -DCMAKE_UNITY_BUILD_BATCH_SIZE=1000 -DCMAKE_INSTALL_PREFIX=./install -Ddebug=off -Doptimize=on -Dopenmp=on -Dnew_w=on -Ddevice_history=off -Ddisable_xs_cache=on -Ddevice_printf=off
Benchmark: Depleted Fuel Inactive Batch Performance on HM-Large Reactor with 40M particles.