《HC2022.NVIDIA Grace.JonathonEvans.v5.pdf》由会员分享,可在线阅读,更多相关《HC2022.NVIDIA Grace.JonathonEvans.v5.pdf(20页珍藏版)》请在三个皮匠报告上搜索。
1、NVIDIA GRACEJONATHON EVANS-NVIDIA|HOT CHIPS 34NVIDIA GRACE NVIDIAs First Server CPU 72 Arm v9.0 cores SVE2 support Virtualization Extensions:Nested Virtualization,S-EL2 support RAS v1.1 GIC v4.1 SMMU v3.1 Built on TSMC 4N process nodeDatacenter ReadyNVIDIA GRACEDesigned from the Ground-Up to be a Su
2、perchipNVLINK-C2C Used to create the Grace Hopper,and Grace Superchips Removes the typical cross-socket bottlenecks Up to 900GB/s of raw bidirectional BW Same BW as GPU to GPU NVLINK on Hopper Low power interface-1.3 pJ/bit More than 5x more power efficient than PCIe Enables coherency for both Grace
3、 and Grace Hopper superchipsHigh Speed Chip to Chip Interconnect900 GB/sCPULPDDR5xCPULPDDR5xGRACECPUCPULPDDR5xCPULPDDR5xGRACECPUGRACE SUPERCHIP Targets Arm standards for off the shelf OS compatibility Arm Server Base System Architecture(SBSA)Arm Server Base Boot Requirements(SBBR)Arm Memory Partitio
4、ning and Monitoring(MPAM)Arm Performance Monitoring Units(PMUs)Standards Compliant Platform900 GB/sCPULPDDR5xCPULPDDR5xGRACECPUCPULPDDR5xCPULPDDR5xGRACECPUGRACE HOPPER Unified Memory with shared page tables Shared CPU and GPU virtual address space GPU access to pageable memory System allocator suppo
5、rt for GPU memory Yes,malloced and mmaped pointers!Native atomics,including standard C+atomic supportHeterogenous Coherency900 GB/sCPULPDDR5xGPUHBMHOPPERGPUGPUHBMCPULPDDR5xGRACECPUCPULPDDR5xGPUHBMGPUHBMCPULPDDR5xHOPPERGPUGRACECPUNVLINK-C2CSuperchip Scaling|CPU/GPU|Extended GPU MemoryEnables remote N
6、VLINK connected GPUs,to access Graces memory at native NVLINK speedsNVSWITCHGPUHBMCPULPDDR5xGRACECPUCPULPDDR5xGPUHBMHOPPERGPUNVIDIA GRACE NVIDIA fabric and distributed cache design 3,225.6 GB/s Bi-section BW Scalable to 72+cores 117MB of L3 cache Arm Memory Partitioning and Monitoring(MPAM)Supports
7、up to 4-socket coherency over Coherent NVLINKNVIDIA Scalable Coherency Fabric*Example possible fabric topology for illustrative purposesSCCCSNCoreLPDDRNVLINK C2CPCIe/cNVLINKSCCCoreSCCCSNCoreSCCCoreLPDDRSCCCSNCoreLPDDRSCCCoreSCCCSNCoreSCCCoreLPDDRNVIDIA GRACE SCF Arm standard for partitioning system
8、resources Partition IDs(PARTID)are assigned to entities making requests to memory SCF Cache resources can be partitioned between different PARTIDs Partition both Cache capacity,and Memory Bandwidth Performance Monitor Groups(PMG)can be used to monitor resource usageMemory Partitioning and Monitoring
9、GraceMEMPARTID 0PARTID 1SCF CachePARTID 0CPU 0PARTID 1CPU 1NVIDIA GRACE Up to 512GB of LPDDR5x memory 32 channels Up to 546 GB/s of memory BW But why LPPDR?MemoryLPDDR5xLPDDR5xLPDDR5xLPDDR5xLPDDR5xLPDDR5xLPDDR5xLPDDR5xx56 PCIe Gen5x12 cNVLINK/PCIeL3CacheNVLINK C2CMCMCMCMCMCMCMCMC4xCPU4xCPU4xCPU4xCPU
10、4xCPU4xCPU4xCPU4xCPU4xCPU4xCPU4xCPU4xCPUL3Cache4xCPU4xCPU4xCPU4xCPU4xCPU4xCPUMCMCMCMCMCMCMCMCLPDDR5xLPDDR5xLPDDR5xLPDDR5xLPDDR5xLPDDR5xLPDDR5xLPDDR5xNVIDIA GRACERemember the Superchips?Grace is always pairedHBM2e(4-sites)DDR5(8-channel)LPDDR5x(32-channel)Capacity64GBUp to 4TBUp to 512GBBWUp to 1.8TB
11、/sUp to 358GB/sUp to 546GB/sPower/GBps1x8x1xCost/GB3x1x1xMEMORY CHOICESHBM,DDR,or LPDDR?Remember?C2C BW-900 GB/sNVIDIA GRACEHow Much Memory Do I Need?Natural Language Processing GPT-3 inference fp8 175GB of memory GPT-3 training over 2.5TB of memory Extended GPU Memory to the rescue!4x decrease in t
12、he number of GPUs needed to fit the training set in memory4xHopperGraceHopperGraceHopperGraceHopperGraceHopperHopperHopperHopperHopperHopperHopperHopperHopperHopperHopperHopperHopperHopperHopperHopperGRACE I/O Up to 68 lanes of PCIe 4 GEN5 x16 links 128 GB/s bi-dir per x16 2 x2s for misc Up to 12 la
13、nes of coherent NVLINK Shared with two GEN5 PCIe x16 NVLINK-C2C 900 GB/s of raw bi-dir BWLPDDR5xLPDDR5xLPDDR5xLPDDR5xLPDDR5xLPDDR5xLPDDR5xLPDDR5xx56 PCIe Gen5x12 cNVLINK/PCIeL3CacheNVLINK C2CMCMCMCMCMCMCMCMC4xCPU4xCPU4xCPU4xCPU4xCPU4xCPU4xCPU4xCPU4xCPU4xCPU4xCPU4xCPUL3Cache4xCPU4xCPU4xCPU4xCPU4xCPU4
14、xCPUMCMCMCMCMCMCMCMCLPDDR5xLPDDR5xLPDDR5xLPDDR5xLPDDR5xLPDDR5xLPDDR5xLPDDR5xNVIDIA GRACE SUPERCHIP 144 Arm v9.0 cores with SVE2 Single thread perf optimized Up to 1TB/s memory bandwidth NVLINK-C2C for 3x typical inter-chip bandwidth Energy efficient design with LPPDDR5 allowing more power for comput
15、e 500W TDP,core+memoryPurpose built for Supercomputing and HPC37074000500600700800SpecIntRate2k17-Single Grace(estimate)SpecIntRate2k17-Grace Superchip(estimate)SPEC RATE ESTIMATESSource:Pre-silicon Estimated Performance results with GCC compiler,subject to changeSPECIntRate2k17 Estimated
16、 PerfScores50853650550700500MemReadMemSetMemCopyMemTriadGRACE CPU MEMORY BENCHMARKSource:NVIDIA Grace pre-silicon results for single Grace SOC,subject to changeGB/sStream Benchmark42940750600500ReadWriteRead/WriteHOPPER GPU TO GRACE MEMORY BENCHMARKSource:NVIDIA Grace Hopper pre-silicon results,subject to changeGPU to CPU Memory PerfGB/sNVIDIA GRACE NVIDIAs First Server SOC 72 Arm v9.0 CPU cores NVLINK-C2C LPDDR5x for low power,high bandwidth Extended GPU memory(EGM)for scale outThanks to the entire Grace CPU team!Summary