Slide 1: AMD Instinct MI200 Series Accelerator and Node Architectures
Alan Smith, AMD Sr. Fellow and Instinct Lead SOC Architect
Norman James, AMD Fellow and Instinct Lead System Architect
(Hot Chips 34, August 22, 2022)

Slide 2: Relentless Demand for Scientific Computing
[Chart: world's fastest supercomputers, Linpack GFlops on a log scale from 1 to 10,000,000,000, roughly 2005 through 2025; performance doubles about every 1.2 years, passing 1 petaFLOP and approaching 1 exaFLOP. Driving workloads: Space Exploration, Machine Learning, Climate Change, Chemical Sciences, Energy Solutions, Real-Time Simulation.]

Slide 3: AMD Instinct MI200 Series
- First multi-die GPU
- Workload-optimized compute architecture
- 3rd Gen AMD Infinity Architecture
Slide 4: MI250X MCM
[Block diagram: two Graphics Compute Dies (GCDs), each with shader engines and compute units, an L2 cache, a VCN block, HBM2E stacks behind memory controllers, and eight Infinity Fabric links (one per GCD configurable as PCI Express or Infinity Fabric), joined by in-package Infinity Fabric.]
- 58B transistors in 6nm
- 220 Compute Units, 880 Matrix Cores
- 128 GB HBM2e at 3.2 TB/s
- In-package Infinity Fabric: 400 GB/s
- Scale up: external Infinity Fabric, 500 GB/s
- Scale out: PCIe Gen4 ESM, 100 GB/s
- Coherent CPU-GPU memory (3rd Gen Infinity Architecture): 144 GB/s

Slide 5: Key Innovations, AMD Instinct MI200 OAM Series
- Two AMD CDNA 2 dies
- Ultra-high-bandwidth die interconnect
- Coherent CPU-to-GPU interconnect
- Eight stacks of HBM2E
- 2.5D Elevated Fanout Bridge (EFB)
- 2nd Gen Matrix Cores for HPC & AI
Slide 6: Domain-specific Architectures
- Moore's Law is slowing; costs and power are increasing
- Optimal efficiency comes from architectures optimized for target workloads
[Chart: efficiency in real-time gaming (frames/second) versus high-performance compute (FLOPS/second) for a general-purpose GPU architecture (GPGPU), a gaming-optimized GPU architecture, and a compute-optimized GPU architecture.]
- Optimal efficiency through domain-specific optimization

Slide 7: AMD CDNA 2 CU Performance-per-Watt
- Delivered double-precision GEMM performance (GFLOPS/watt/CU): 0.22 on MI100 (1P EPYC 7742) versus 0.44 on MI250X (2P Optimized 3rd Gen EPYC), a 2x improvement (see endnote MI200-64)
- Performance contributors: design frequency increase, leveraged CPU expertise, streamlined micro-architecture and design
- Power optimizations: tuned for low-voltage operation, minimized clock and data-movement power
- Architecture innovations: efficient matrix datapaths, extensive operand reuse and forwarding
Slide 8: CDNA 2 Compute Unit with Enhanced Matrix Cores
- Combined register resources with a custom SRAM design for matrix and vector ops: reduced energy on register-file accesses and increased capacity for all operation types
- Enhanced matrix cores, with emphasis on high-performance computing: 4x double-precision and 2x bfloat16 matrix ops throughput relative to the prior generation; new power-efficient, IEEE 754-compliant double-precision matrix instructions for 16x16x4 and 4x4x4 (MxNxK) blocks; full input-operand reuse and output-accumulator forwarding for a substantial reduction in power
- 2x double-precision throughput and packed single-precision vector ops relative to the prior generation; new instructions for packed single-precision vector operations (FMA, MUL, ADD, MOVE)
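For a sense of scale, the sketch below (plain arithmetic, not AMD code; only the two block shapes are taken from the slide) counts the floating-point operations in one MxNxK matrix multiply-accumulate block.

```python
# FLOPs in one MxNxK matrix multiply-accumulate block: each of the M*N outputs
# accumulates K multiply-adds, and a multiply-add is counted as 2 FLOPs.
def mfma_block_flops(m: int, n: int, k: int) -> int:
    return 2 * m * n * k

for m, n, k in [(16, 16, 4), (4, 4, 4)]:   # FP64 block shapes named on the slide
    print(f"{m}x{n}x{k}: {mfma_block_flops(m, n, k)} FLOPs per block")
# 16x16x4: 2048 FLOPs per block; 4x4x4: 128 FLOPs per block
```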
Slide 9: CDNA 2 Memory and Cache Hierarchy
Scaling for throughput and large datasets:
- Per-GCD L2 cache: 8 MB total capacity, 16-way; physically partitioned into 32 slices, each delivering 128 B/clk; enhanced queuing and arbitration; enhanced atomic operations
- Per-GCD memory subsystem: 64 GB HBM2e per GCD at an aggregate 1.6 TB/s; physically partitioned into 32 channels at 64 B/clk for efficient operating voltage
- In-package Infinity Fabric: unified shared memory across GCDs, 400 GB/s bisection bandwidth
[Diagram: per CU, a 16 KB 64-way L1 cache plus a shared 32 KB 4-way instruction cache and a 16 KB 4-way K cache; 64 B/clk and 32 B/clk request paths from the CU-level caches into the L2; 4096 B/clk total L2 bandwidth per GCD; 2048 B/clk NoC links and 400 GB/s in-package Infinity Fabric between the two GCDs; 64 GB, 1.6 TB/s HBM2e behind each L2.]
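A quick consistency check on the cache figures above: multiplying the slice count by the per-slice width and the engine clock reproduces the L2 bandwidth quoted in the endnotes. The sketch below is just that arithmetic; the 1,700 MHz engine clock is taken from endnote MI200-34.

```python
# Per-GCD L2 bandwidth: 32 slices x 128 B/clk at the 1,700 MHz engine clock (MI200-34).
slices, bytes_per_clk, engine_clock_hz = 32, 128, 1.7e9
l2_bytes_per_clk = slices * bytes_per_clk
print(l2_bytes_per_clk, "B/clk per GCD")                                    # 4096
print(round(l2_bytes_per_clk * engine_clock_hz / 1e12, 2), "TB/s per GCD")  # 6.96
```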
Slide 10: Shattering Performance Barriers in HPC & AI
Peak performance (see endnotes):

  Metric             | MI100 (peak) | MI250X (peak) | MI250X peak speedup
  FP64 vector        | 11.5 TF      | 47.9 TF       | 4.2x
  FP32 vector        | 23.1 TF      | 47.9 TF       | 2.1x
  Packed FP32 vector | 23.1 TF      | 95.7 TF       | 4.2x
  FP64 matrix        | 11.5 TF      | 95.7 TF       | 8.3x
  FP32 matrix        | 46.1 TF      | 95.7 TF       | 2.1x
  BF16 matrix        | 92.3 TF      | 383 TF        | 4.2x
  FP16 matrix        | 184.6 TF     | 383 TF        | 2.1x
  Memory size        | 32 GB        | 128 GB        | 4x
  Memory bandwidth   | 1.2 TB/s     | 3.2 TB/s      | 2.7x
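One way to read the table is to back out the throughput per CU per clock implied by the MI250X peaks, using the 220 CUs and 1,700 MHz engine clock from the endnotes. The sketch below is a derived consistency check, not an AMD-stated per-CU specification.

```python
# Implied throughput per CU per clock = peak FLOPS / (CU count * engine clock).
# 220 CUs and the 1,700 MHz clock come from the endnotes (MI200-27, MI200-05).
cus, clock_hz = 220, 1.7e9
peaks_tflops = {
    "FP64 vector": 47.9,
    "FP64/FP32 matrix and packed FP32": 95.7,
    "FP16/BF16 matrix": 383.0,
}
for name, tf in peaks_tflops.items():
    print(f"{name}: ~{tf * 1e12 / (cus * clock_hz):.0f} FLOPs/clk/CU")
# FP64 vector: ~128, FP64/FP32 matrix and packed FP32: ~256, FP16/BF16 matrix: ~1024
```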
Slide 11: 3rd Generation AMD Infinity Architecture
- Unified shared memory: CPUs hardware-coherently cache DDR and HBM memory; GPUs hardware-coherently cache local HBM memory; each GPU connects to the host over a 16-bit link at 18 GT/s; host memory bandwidth is 200 GB/s (8-channel DDR4-3200), and each MI250X is capable of saturating two DRAM channels
- Allows NICs to be attached to the MI250X: enables line rate for PCIe-ordered traffic to host memory, local HBM memory, and peer HBM memory on the same socket
- Flat ID-based routing across 8 GCDs and the host CPU: a single large system with the root complex on the device; TLB shootdowns invalidate host and IOMMU page-table entries across the entire system
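The two bandwidth figures on this slide follow from the quoted link and channel parameters. The sketch below reproduces them; treating a DDR channel as 8 bytes wide and counting the 16-bit, 18 GT/s links in both directions across the two links per OAM are my assumptions, chosen because they reproduce the 200 GB/s and 144 GB/s figures in the deck.

```python
# Host DDR4-3200 bandwidth: 8 channels x 3200 MT/s x 8 bytes per transfer.
ddr_channels, ddr_mt_s, bytes_per_transfer = 8, 3200e6, 8
host_bw_gb_s = ddr_channels * ddr_mt_s * bytes_per_transfer / 1e9
print(round(host_bw_gb_s, 1), "GB/s host memory")        # 204.8, quoted as ~200 GB/s

# Coherent CPU-to-GPU links: 16 bits at 18 GT/s per direction,
# counted over two directions and two links per OAM.
link_bits, gt_s, directions, links = 16, 18e9, 2, 2
cpu_gpu_bw_gb_s = (link_bits / 8) * gt_s * directions * links / 1e9
print(round(cpu_gpu_bw_gb_s), "GB/s CPU-GPU per OAM")    # 144, matching the MI250X slide
```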
Slide 12: MI200 Series Node Topologies and Systems
Slide 13: Flagship HPC Topology with AMD Instinct MI250X GPU: A Unified Computing Node
- Four AMD Instinct MI250X accelerators
- One Optimized 3rd Gen AMD EPYC processor
- 5 GPU-to-GPU Infinity Fabric links
- 1 GPU-connected PCIe NIC per OAM
- 2 coherent CPU-to-GPU links per OAM
- 1.54 TB/s peak Infinity Fabric bandwidth

Slide 14: HPE Cray EX235a Accelerator Blade with AMD Instinct MI250X Accelerators and Optimized 3rd Gen AMD EPYC CPUs
- Four AMD Instinct MI250X per node
- One Optimized 3rd Gen AMD EPYC per node
- Two nodes per blade
Slide 15: MI250X Hits 1.1 Exaflops!
Source: TOP500 list, as of May 30, 2022

  Rank | System | Cores | Rmax (PFlop/s) | Rpeak (PFlop/s) | Power (kW)
  1 | Frontier: HPE Cray EX235a, AMD Optimized 3rd Generation EPYC 64C 2GHz, AMD Instinct MI250X, Slingshot-11, HPE (DOE/SC/Oak Ridge National Laboratory, United States) | 8,730,112 | 1,102.00 | 1,685.65 | 21,100
  2 | Supercomputer Fugaku: A64FX 48C 2.2GHz, Tofu interconnect D, Fujitsu (RIKEN Center for Computational Science, Japan) | 7,630,848 | 442.01 | 537.21 | 29,899
  3 | LUMI: HPE Cray EX235a, AMD Optimized 3rd Generation EPYC 64C 2GHz, AMD Instinct MI250X, Slingshot-11, HPE (EuroHPC/CSC, Finland) | 1,110,144 | 151.90 | 214.35 | 2,942
Slide 16: MI250X Captures Top 4 Spots on the Green500 List
Powerful yet efficient: application-specific optimization provides better performance-per-watt.
[Chart: performance/watt versus power for a domain-specific GPU and a general-purpose CPU across applications.]
Source: Green500 list, as of May 30, 2022

  Rank | TOP500 rank | System | Cores | Rmax (PFlop/s) | Power (kW) | Energy efficiency (GFlops/watt)
  1 | 29 | Frontier TDS: HPE Cray EX235a, AMD Optimized 3rd Generation EPYC 64C 2GHz, AMD Instinct MI250X, Slingshot-11, HPE (DOE/SC/Oak Ridge National Laboratory, United States) | 120,832 | 19.20 | 309 | 62.684
  2 | 1 | Frontier: HPE Cray EX235a, AMD Optimized 3rd Generation EPYC 64C 2GHz, AMD Instinct MI250X, Slingshot-11, HPE (DOE/SC/Oak Ridge National Laboratory, United States) | 8,730,112 | 1,102.00 | 21,100 | 52.227
  3 | 3 | LUMI: HPE Cray EX235a, AMD Optimized 3rd Generation EPYC 64C 2GHz, AMD Instinct MI250X, Slingshot-11, HPE (EuroHPC/CSC, Finland) | 1,110,144 | 151.90 | 2,942 | 51.629
  4 | 10 | Adastra: HPE Cray EX235a, AMD Optimized 3rd Generation EPYC 64C 2GHz, AMD Instinct MI250X, Slingshot-11, HPE (Grand Equipement National de Calcul Intensif - Centre Informatique National de l'Enseignement Supérieur (GENCI-CINES), France) | 319,072 | 46.10 | 921 | 50.028
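The energy-efficiency column is simply Rmax divided by power. A minimal check against the rows above (values copied from the table):

```python
# Green500 energy efficiency = Rmax / power, expressed in GFlops per watt.
# Rmax is in PFlop/s (1 PFlop/s = 1e6 GFlop/s), power is in kW (1 kW = 1e3 W).
systems = {
    "Frontier TDS": (19.20, 309),
    "Frontier":     (1102.00, 21100),
    "LUMI":         (151.90, 2942),
    "Adastra":      (46.10, 921),
}
for name, (rmax_pflops, power_kw) in systems.items():
    gflops_per_watt = rmax_pflops * 1e6 / (power_kw * 1e3)
    print(f"{name}: {gflops_per_watt:.2f} GFlops/W")
# Frontier lands at 52.23 GFlops/W, matching the table; the other rows come out
# within ~1% because the published column uses unrounded Rmax and power values.
```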
Slide 17: Mainstream Accelerated Topology with AMD Instinct MI250 GPUs
- Four AMD Instinct MI250 accelerators
- Two 3rd Gen AMD EPYC processors
- 5 GPU-to-GPU Infinity Fabric links
- PCIe switches for NIC RDMA
- 2 PCIe Gen 4 CPU-to-GPU links per OAM

Slide 18: Flagship Machine Learning Topology with AMD Instinct MI250 GPUs
- Eight AMD Instinct MI250 accelerators
- Two 3rd Gen AMD EPYC processors
- 5 GPU-to-GPU Infinity Fabric links
- PCIe switches for NIC RDMA
- 2 PCIe Gen 4 CPU-to-GPU links per OAM

Slide 19: G262-ZO0 Server
- Dual AMD EPYC CPUs
- Four AMD Instinct MI250 GPUs at 560 W
- 16 DIMMs at 3200 MHz
- 4x 2.5" Gen4 U.2 NVMe/SATA/SAS hot-swap bays
- 2U high
- 6x low-profile PCIe Gen4 x16 slots
- 1x OCP 3.0 Gen4 x16 mezzanine slot

Slide 20: AS-4124GQ-TNMI Server
- Dual AMD EPYC 7003 Series CPUs
- Four AMD Instinct MI250 GPUs at 530 W
- 32 DIMMs, up to 8 TB registered ECC DDR4-3200
- 10 hot-swap 2.5" U.2 NVMe/SATA/SAS hybrid drive bays
- 8 PCIe 4.0 x16 slots (low-profile) via PCIe switch
- 4U high with 4x 3000 W Titanium-level redundant power supplies

Slide 21: Summary
- Domain-specific architecture: optimal efficiency through domain-specific optimization
- HBM2e memory system: 128 GB per MI250X
- 3rd Generation AMD Infinity Architecture: unified shared memory
- Powerful and efficient: number 1 on TOP500 and top 4 on Green500
Slide 22: Endnotes

Measurements conducted by AMD Performance Labs as of Sep 10, 2021 on the AMD Instinct MI250X accelerator, designed with AMD CDNA 2 6nm FinFET process technology at a 1,700 MHz engine clock, resulted in 47.9 TFLOPS peak double-precision (FP64) floating-point and 383.0 TFLOPS peak bfloat16 (BF16) floating-point performance. The results calculated for the AMD Instinct MI100 GPU, designed with AMD CDNA 7nm FinFET process technology at a 1,502 MHz engine clock, were 11.54 TFLOPS peak FP64 and 92.28 TFLOPS peak BF16 performance. MI200-05

The AMD Instinct MI250X accelerator has 220 compute units (CUs) and 14,080 stream cores. The AMD Instinct MI100 accelerator has 120 compute units (CUs) and 7,680 stream cores. MI200-27

Video codec acceleration (including at least the HEVC (H.265), H.264, VP9, and AV1 codecs) is subject to and not operable without inclusion/installation of compatible media players. GD-176
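Two small consistency checks on the endnotes above, using only their published numbers plus one assumption called out in the comments:

```python
# Consistency checks on endnote MI200-27 (CU and stream-core counts).
for name, cus, stream_cores in [("MI250X", 220, 14_080), ("MI100", 120, 7_680)]:
    print(name, stream_cores // cus, "stream cores per CU")   # 64 in both cases

# Assuming each stream core retires one FP64 FMA (2 FLOPs) per clock on MI250X,
# the 1,700 MHz engine clock from MI200-05 reproduces the 47.9 TFLOPS FP64 vector peak.
print(round(14_080 * 2 * 1.7e9 / 1e12, 1), "TFLOPS")          # 47.9
```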
Calculations conducted by AMD Performance Labs as of Oct 29, 2021 for the AMD Instinct MI250X and MI250 (128 GB HBM2e OAM module) 500 W and 560 W accelerators, designed with AMD CDNA 2 6nm FinFET process technology at a 1,700 MHz peak boost engine clock, resulted in 6.96 TB/s peak theoretical L2 cache slice bandwidth. Calculations by AMD Performance Labs as of Oct 5, 2020 for the AMD Instinct MI100 (32 GB HBM2 PCIe) 300 W accelerator, designed with AMD CDNA 7nm FinFET process technology at a 1,502 MHz peak boost engine clock, resulted in 3.07 TB/s peak theoretical L2 cache slice bandwidth. MI200-34

Calculations conducted by AMD Performance Labs as of Oct 18, 2021 for the AMD Instinct MI250X and MI250 accelerators (OAM), designed with CDNA 2 6nm FinFET process technology at a 1,600 MHz peak memory clock, resulted in 128 GB HBM2e memory capacity and 3.2768 TB/s peak theoretical memory bandwidth. The MI250X/MI250 memory bus interface is 8,192 bits and the memory data rate is up to 3.20 Gbps, for a total memory bandwidth of 3.2768 TB/s. Calculations by AMD Performance Labs as of Oct 18, 2021 for the AMD Instinct MI100 accelerator, designed with AMD CDNA 7nm FinFET process technology at a 1,200 MHz peak memory clock, resulted in 32 GB HBM2 memory capacity and 1.2288 TB/s peak theoretical memory bandwidth. The MI100 memory bus interface is 4,096 bits and the memory data rate is up to 2.40 Gbps, for a total memory bandwidth of 1.2288 TB/s. MI200-30
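The bus-width arithmetic in MI200-30 can be reproduced directly; the sketch below uses only the figures quoted in the endnote.

```python
# Endnote MI200-30 arithmetic: peak bandwidth = (bus width / 8) bytes per transfer
# multiplied by the per-pin data rate in GT/s, giving GB/s; divide by 1000 for TB/s.
configs = {
    "MI250X / MI250 (HBM2e)": (8192, 3.20),   # bus width in bits, data rate in Gbps
    "MI100 (HBM2)":           (4096, 2.40),
}
for name, (bus_bits, gbps) in configs.items():
    tb_s = (bus_bits / 8) * gbps / 1000
    print(f"{name}: {tb_s:.4f} TB/s")
# MI250X / MI250 (HBM2e): 3.2768 TB/s
# MI100 (HBM2): 1.2288 TB/s
```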
MI200-64: AMD testing as of 8/3/2021 (MI250X) and 1/20/2021 (MI100) using DGEMM 4K Trig Float Init, DPM on. Systems compared: 2P Optimized 3rd Gen EPYC CPUs with 4x AMD Instinct MI250X (560 W, 220 CUs) running ROCm 4.3.1-59 vs. 1P EPYC 7742 with 1x AMD Instinct MI100 (300 W, 120 CUs) running ROCm 3.7.0-3289. GROMACS: http://www.gromacs.org/ HACC: https://cpac.hep.anl.gov/projects/hacc/

Calculations as of Oct 18, 2021. AMD Instinct MI250/MI250X accelerators, built on AMD CDNA 2 technology, support the AMD Infinity architecture with AMD Infinity Fabric technology providing up to 400 GB/s total aggregate theoretical GCD-to-GCD I/O data transport bandwidth per GPU. Peak theoretical GCD-to-GCD data transport rate is calculated as baud rate * #lanes * #directions * #links / 8 = GB/s per card. MI200-29
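MI200-29 states the formula but not its operand values. Under one set of assumed operands that reproduces the published 400 GB/s figure, the arithmetic looks like this:

```python
# Endnote MI200-29 formula: baud rate * #lanes * #directions * #links / 8 = GB/s per card.
# The operand values below are assumptions (16-lane links at 25 Gbps, both directions,
# four in-package links); only the formula and the 400 GB/s result come from the endnote.
baud_gbps, lanes, directions, links = 25, 16, 2, 4
print(baud_gbps * lanes * directions * links / 8, "GB/s per card")   # 400.0
```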
Calculations as of Sep 18, 2021. AMD Instinct MI250 accelerators, built on AMD CDNA 2 technology, support AMD Infinity Fabric technology providing up to 100 GB/s peak total aggregate theoretical GPU peer-to-peer (P2P) transport bandwidth per AMD Infinity Fabric link, and include up to eight links, providing up to 800 GB/s peak aggregate theoretical GPU P2P transport bandwidth per GPU OAM card. AMD Instinct MI100 accelerators, built on AMD CDNA technology, support PCIe Gen4 providing up to 64 GB/s peak theoretical CPU-to-GPU transport bandwidth per card, and include three links providing up to 276 GB/s peak theoretical GPU P2P transport bandwidth per card. Combined with PCIe Gen4 support, this provides an aggregate GPU card I/O peak bandwidth of up to 340 GB/s. Server manufacturers may vary configuration offerings, yielding different results. MI200-13

© 2022 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo, AMD CDNA, EPYC, AMD Instinct, Infinity Fabric, ROCm, and combinations thereof are trademarks of Advanced Micro Devices, Inc. Ubuntu and the Ubuntu logo are registered trademarks of Canonical Ltd. Red Hat and the Red Hat logo are trademarks or registered trademarks of Red Hat, Inc. or its subsidiaries in the United States and other countries. The OpenMP name and the OpenMP logo are registered trademarks of the OpenMP Architecture Review Board. OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission by Khronos. PCI-SIG, PCIe and the PCI HOT PLUG design mark are registered trademarks and/or service marks of PCI-SIG. Other product names used in this publication are for identification purposes only and may be trademarks of their respective companies.

Slide 23: Thank You for Participating