THE NVLINK-NETWORK SWITCH: NVIDIA'S SWITCH CHIP FOR HIGH COMMUNICATION-BANDWIDTH SUPERPODS
Alexander Ishii and Ryan Wells, Systems Architects

Agenda
4th-Generation NVSwitch Chip
  1. Brief History of NVLink
  2. 4th-Generation New Features
  3. Chip Details
Hopper-Generation SuperPODs
  1. NVSwitch-Enabled Platforms
  2. NVLink Network SuperPODs
  3. SuperPOD Performance

NVLINK MOTIVATIONS
Bandwidth and GPU-Synergistic Operation
- GPU operational characteristics match the NVLink spec: the thread-block execution structure efficiently feeds the parallelized NVLink architecture, and the NVLink-port interfaces match the data-exchange semantics of the GPU L2 cache as closely as possible.
- Faster than PCIe: 100 Gbps per lane (NVLink4) vs 32 Gbps per lane (PCIe Gen5); multiple NVLinks can be "ganged" to realize higher aggregate lane counts.
- Lower overheads than traditional networks: target system scales (256 Hopper GPUs) allow complex features (e.g., end-to-end retry, adaptive routing, packet reordering) to be traded off against increased port counts, and simplified Application/Presentation/Session-layer functionality allows all of it to be embedded directly in CUDA programs and drivers.
[Figure: SMs in thread blocks feed the GPC and L2, which connect to the NVLink-port interfaces.]

NVLINK GENERATIONS
Evolution in step with GPUs (x86 host over PCIe in every generation):
- 2016, P100, NVLink1: 4 NVLinks at 40 GB/s each (x8 lanes, 20 Gbaud NRZ), 160 GB/s total
- 2017, V100, NVLink2: 6 NVLinks at 50 GB/s each (x8 lanes, 25 Gbaud NRZ), 300 GB/s total
- 2020, A100, NVLink3: 12 NVLinks at 50 GB/s each (x4 lanes, 50 Gbaud NRZ), 600 GB/s total
- 2022, H100, NVLink4: 18 NVLinks at 50 GB/s each (x2 lanes, 50 Gbaud PAM4), 900 GB/s total
Listed bandwidths are full duplex (total of both directions). Whitepaper: http:/

DGX SERVER GENERATIONS
Any-to-any connectivity with NVSwitch:
- 2016, DGX-1 (P100): 140 GB/s bisection BW, 40 GB/s AllReduce BW
- 2018, DGX-2 (V100): 2.4 TB/s bisection BW, 75 GB/s AllReduce BW
- 2020, DGX A100: 2.4 TB/s bisection BW, 150 GB/s AllReduce BW
- 2022, DGX H100: 3.6 TB/s bisection BW, 450 GB/s AllReduce BW
[Figure: 8-GPU P100 mesh without NVSwitch; V100, A100, and H100 servers with NVSwitch chips. In the DGX H100, each GPU splits its NVLinks 5/4/4/5 across four NVSwitches, which expose 20/20/16/16 NVLink Network ports.]

NVLINK4 NVSWITCH NEW FEATURES
Expanding server performance:
- NVLink Network support: PHY electrical interfaces compatible with 400G Ethernet/InfiniBand; OSFP support (4 NVLinks per cage) with custom FW for active modules; additional Forward Error Correction (FEC) modes for optical-cable performance and reliability.
- Doubling of bandwidth: 100 Gbps per diff pair (50 Gbaud PAM4), x2-lane NVLinks, and 64 NVLinks per NVSwitch (1.6 TB/s internal bisection BW); more bandwidth with fewer chips.
- SHARP collectives/multicast support: NVSwitch-internal duplication of data avoids the need for multiple accesses from/by the source GPU; embedded ALUs allow NVSwitches to perform AllReduce (and similar) calculations on behalf of GPUs; roughly doubles data throughput on communication-intensive operations in AI applications.

NVLINK4 NVSWITCH CHIP CHARACTERISTICS
Largest NVSwitch ever:
- TSMC 4N process, 25.1B transistors, 294 mm2 die, 50 mm x 50 mm package (2645 balls)
Highest bandwidth ever:
- 64 NVLink4 ports (x2 lanes per NVLink), 3.2 TB/s full-duplex bandwidth, 50 Gbaud PAM4 diff-pair signaling; all ports NVLink Network capable
New capabilities:
- 400 GFLOPS of FP32 SHARP (other number formats are supported); NVLink Network management, security, and telemetry engines
[Figure: four 32-lane PHY blocks and port logic (including SHARP accelerators) around a central XBAR.]

ALLREDUCE IN AI TRAINING
Critical communication-intensive operation:
- Data parallelism: split each batch across multiple GPUs, each holding its own copy of the parameters and computing local gradients.
- NCCL AllReduce: sum the local gradients across GPUs.
- Basic training flow: forward/backward over a batch (e.g., 256 images) produces gradients that update the parameters; the training database holds
GBs of input data (images, sound, ...).

ALLREDUCE IN MULTI-GPU TRAINING
- After the data-parallel forward/backward passes, every GPU holds local gradients; NCCL AllReduce sums them across GPUs so each GPU can apply an identical parameter update.

TRADITIONAL ALLREDUCE CALCULATION
Data exchange and parallel calculation:
1. Exchange partial local gradients among the GPUs.
2. Reduce (sum) the partials in parallel.
3. Broadcast the reduced partials so every GPU holds the full summed gradients.

NVLINK SHARP ACCELERATION
AllReduce traffic at each GPU interface over N partials, A100 vs H100 + NVLink SHARP:
- Step 1, read and reduce: A100 sends partials (N reads) and receives partials (N reads). H100 sends partials (N reads); the in-switch sum leaves only 1 reduced read to receive.
- Step 2, broadcast result: A100 sends the new partial (N writes) and receives new partials (N writes). H100 sends the new partial (1 write); the in-switch multicast makes N duplications; it receives new partials (N writes).
- Traffic summary (at each GPU interface): 2N send and 2N receive on A100 vs N+1 send and N+1 receive on H100, for 2x effective NVLink bandwidth.

NVLINK NETWORK FOR RAW BW
4.5x more bandwidth than maximum InfiniBand (IB). Example: a neural recommender engine with 14 TB embedding tables.
- Linear layers are data parallel, replicated across GPUs; embedding tables are model parallel, distributed across GPUs (per-GPU shards of 40/60/60/10 GB in the example).
- An All2All redistributes activations from model parallel to data parallel (20/10/10 GB in the example).
[Chart: relative performance, 0x to 5x, for A100 + IB, H100 + IB, and H100 + NVLink Network.]
Projected performance subject to change. Example model assumes DLRM with a mix of 300-hot and 1-hot embedding tables with a total capacity of 14 TB. Different recommender models may show different performance characteristics.

NVLINK NETWORK
[Figure: SMs on a source GPU issue requests through NVL NICs, across the NVLink Network Switch, to NVL NICs and HBM on a destination GPU.]
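The SHARP traffic summary above (2N versus N+1 transfers per GPU interface) can be checked with a toy model. This is a sketch only: plain Python stands in for NCCL and the NVSwitch, the "GPUs" are lists of floats, and the function names are invented; the message accounting follows the slide's two steps.

```python
# Toy AllReduce with per-GPU message accounting, following the slide:
# traditional (A100-style) exchange vs in-switch SHARP reduction (H100).

def all_reduce(per_gpu_partials):
    # Functional result of any AllReduce: every GPU ends up with the sum.
    total = [sum(vals) for vals in zip(*per_gpu_partials)]
    return [list(total) for _ in per_gpu_partials]

def traditional_traffic(n):
    # Step 1: send partials (N reads), receive partials (N reads).
    # Step 2: send new partial (N writes), receive new partials (N writes).
    return (2 * n, 2 * n)                 # (send, receive) = (2N, 2N)

def sharp_traffic(n):
    # Step 1: send partials (N reads); in-switch sum -> 1 reduced read back.
    # Step 2: send new partial (1 write); in-switch multicast -> N writes in.
    return (n + 1, n + 1)                 # (send, receive) = (N+1, N+1)

partials = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]   # 4 "GPUs"
reduced = all_reduce(partials)
assert reduced[0] == [16.0, 20.0] and all(r == reduced[0] for r in reduced)

n = len(partials)
print(traditional_traffic(n), sharp_traffic(n))   # (8, 8) (5, 5)
```

For large N the ratio 2N/(N+1) approaches 2, which is the slide's claim of roughly 2x effective NVLink bandwidth from SHARP.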
- Request addressing (figure): the GPU MMU produces a source-GPU virtual address; the Link TLB translates it to a network address naming the destination GPU; the destination maps that to a GPU physical address. New Hopper NIC functions ensure each request is legal and maps into the GPU physical address space.

NVLink vs NVLink Network:
- Address spaces: 1 (shared) vs N (independent)
- Request addressing: GPU physical address vs network address
- Connection setup: during the boot process vs a runtime API call by software
- Isolation: no vs yes

MAPPING TO TRADITIONAL NETWORKING
NVLink Network is tightly integrated with the GPU:
- Physical layer: 400G electrical/optical media -> custom-FW OSFP
- Data link layer: Ethernet -> NVLink custom on-chip HW and FW
- Network layer: IP -> new NVLink Network addressing and management protocols
- Transport layer: TCP -> NVLink custom on-chip HW and FW
- Session layer: sockets -> SHARP groups; CUDA export of network addresses of data structures
- Presentation layer: TLS/SSL -> library abstractions (e.g., NCCL, NVSHMEM)
- Application layer: HTTP/FTP -> AI frameworks or user apps
- NIC: PCIe NIC (card or chip) -> functions embedded in the GPU and NVSwitch
- RDMA off-load: NIC off-load engine -> GPU-internal copy engine
- Collectives off-load: NIC/switch off-load engine -> NVSwitch-internal SHARP engines
- Security off-load: NIC security features -> GPU-internal encryption and "TLB" firewalls
- Media control: NIC cable adaptation -> NVSwitch-internal OSFP-cable controllers

NVLINK4 NVSWITCH BLOCK DIAGRAM
New SHARP blocks:
- ALU matched to the Hopper unit; wide variety of operators (logical, min/max, add) and formats (signed/unsigned integers, FP16, FP32, FP64, BF16)
- SHARP controller can manage up to 128 SHARP groups in parallel
- XBAR bandwidth uprated to carry the additional SHARP-related exchanges
New NVLink Network blocks:
- Security processor protects data and chip configuration from attacks
- Partitioning features isolate subsets of ports into separate NVLink Networks
- Management controller now also handles the attached OSFP cables
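The two-level request addressing described earlier (source virtual address -> network address -> destination physical address, with a legality check at the destination NIC) can be mocked with plain lookups. Everything here is hypothetical: the tables, page numbers, and `translate` function are invented for illustration, with dicts standing in for the Link TLB and NIC mapping hardware.

```python
# Hypothetical mock of NVLink Network request addressing. Dicts stand in
# for the Link TLB and the destination NIC's mapping/legality check.

LINK_TLB = {            # source virtual page -> (dest GPU id, network page)
    0x1000: (3, 0x40),
    0x2000: (7, 0x41),
    0x3000: (5, 0x99),  # deliberately not mapped at any destination
}

DEST_NIC_MAP = {        # (dest GPU id, network page) -> local physical page
    (3, 0x40): 0x9000,
    (7, 0x41): 0xA000,
}

def translate(virt_page):
    dest, net_page = LINK_TLB[virt_page]        # Link TLB lookup
    key = (dest, net_page)
    if key not in DEST_NIC_MAP:                 # Hopper NIC legality check
        raise PermissionError("illegal network address")
    return dest, DEST_NIC_MAP[key]              # dest GPU id, physical page

assert translate(0x1000) == (3, 0x9000)
try:
    translate(0x3000)                           # fails the legality check
    assert False
except PermissionError:
    pass
```

The point of the two levels is the isolation row in the table above: each server keeps its own independent physical address space, and only explicitly exported network addresses are reachable from outside.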
- Expanded telemetry to support InfiniBand-style monitoring
[Figure: 64x64 XBAR (NVLink plus SHARP) surrounded by Port Logic 0 through 63; each port pairs an NVLink (PHY/DL/TL) with transaction tracking and packet transforms, classification and packet transforms, routing, and error check and statistics collection. Side blocks: PCIe I/O, SHARP controller with a Hopper-matched SHARP ALU and SHARP scratch SRAM, security processor, management, and a control processor and state/telemetry proxy (including the associated OSFPs).]

Hopper-Generation SuperPODs
  1. NVSwitch-Enabled Platforms
  2. NVLink Network SuperPODs
  3. SuperPOD Performance

DGX H100 SERVER
8-H100, 4-NVSwitch server:
- 32 PFLOPS of AI performance
- 640 GB aggregate GPU memory
- 18 NVLink Network OSFPs; 3.6 TBps of full-duplex NVLink Network bandwidth (72 NVLinks)
- 8x 400 Gb/s ConnectX-7 InfiniBand/Ethernet ports
- 2 dual-port BlueField-3 DPUs
- Dual Sapphire Rapids CPUs; PCIe Gen5

DGX H100: DATA-NETWORK CONFIGURATION
Full-BW intra-server NVLink:
- All 8 GPUs can simultaneously saturate their 18 NVLinks to other GPUs within the server, limited only by over-subscription from multiple other GPUs.
Half-BW NVLink Network:
- All 8 GPUs can half-subscribe their 18 NVLinks to GPUs in other servers; 4 GPUs can saturate 18 NVLinks to GPUs in other servers.
- With SHARP this is the equivalent of full bandwidth on AllReduce; the reduction in All2All bandwidth is a balance against server complexity and cost.
Multi-rail InfiniBand/Ethernet:
- Each of the 8 GPUs can independently RDMA data over its own dedicated 400 Gb/s HCA/NIC, for 800 GBps of aggregate full-duplex bandwidth to non-NVLink-Network devices.
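The aggregate figures in this configuration follow directly from the per-link rates quoted earlier (50 GB/s full-duplex per NVLink4, 400 Gb/s per CX7 NIC). A quick arithmetic check, written as a sketch rather than anything normative:

```python
# Sanity-check the DGX H100 bandwidth figures from per-link rates.
# NVLink4: 50 GB/s full-duplex per link (both directions combined).

NVLINK_FULL_DUPLEX_GBPS = 50
GPUS = 8
LINKS_PER_GPU = 18

# Intra-server: 8 GPUs x 18 links x 50 GB/s = 7.2 TB/s of port bandwidth;
# bisection (half the GPUs exchanging with the other half) is half of that.
total_port_bw = GPUS * LINKS_PER_GPU * NVLINK_FULL_DUPLEX_GBPS
assert total_port_bw / 2 == 3600          # 3.6 TB/s bisection, as on the slide

# NVLink Network: 72 NVLinks leave the server through the NVSwitches.
assert 72 * NVLINK_FULL_DUPLEX_GBPS == 3600   # 3.6 TBps full duplex

# Multi-rail IB/Ethernet: 8 NICs x 400 Gb/s = 400 GB/s per direction.
per_direction_gbytes = 8 * 400 / 8        # Gb/s -> GB/s
assert per_direction_gbytes * 2 == 800    # 800 GBps aggregate full duplex
```

The half-BW NVLink Network point also falls out of the same numbers: 72 server-leaving NVLinks is half of the 144 links (8 x 18) the GPUs could drive, so all 8 GPUs together can only half-subscribe their links off-server.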
[Figure: DGX H100 data networks; each H100 attaches to a CX7 HCA/NIC through a PCIe switch and to the four NVSwitches (5/4/4/5 NVLinks), with OSFP cages, dual CPUs, and 400 Gb links.]

DGX H100 SUPERPOD: NVLINK SWITCH
NVLink Switch:
- Standard 1RU 19-inch form factor, highly leveraged from InfiniBand switch design
- Dual NVLink4 NVSwitch chips: 128 NVLink4 ports, 32 OSFP cages, 6.4 TB/s full-duplex BW
- Managed switch with out-of-band management communication
- Support for passive-copper, active-copper, and optical OSFP cables (custom FW)

DGX H100 SUPERPOD: AI EXASCALE
DGX H100 SuperPOD Scalable Unit:
- 32 DGX H100 nodes + 18 NVLink Switches; 256 H100 Tensor Core GPUs
- 1 ExaFLOP of AI performance; 20 TB of aggregate GPU memory
- Network optimized for AI and HPC: 128 L1 NVLink4 NVSwitch chips + 36 L2 NVLink4 NVSwitch chips
- 57.6 TB/s bisection NVLink Network spanning the entire Scalable Unit
- 25.6 TB/s full-duplex NDR 400 Gb/s InfiniBand for connecting multiple Scalable Units in a SuperPOD

SCALE-UP WITH NVLINK NETWORK
DGX A100 256 POD vs DGX H100 256 POD (bandwidths in GB/s):

                     A100 SuperPOD                   H100 SuperPOD                   Speedup
                     Dense PFLOP/s  Bisect.  Reduce  Dense PFLOP/s  Bisect.  Reduce  Bisect.  Reduce
1 DGX / 8 GPUs       2.5            2,400    150     16             3,600    450     1.5x     3x
32 DGXs / 256 GPUs   80             6,400    100     512            57,600   450     9x       4.5x

- A100 SuperPOD: 32 nodes (256 GPUs) with IB HDR leaf and spine switches.
- H100 SuperPOD: 32 nodes (256 GPUs), fully NVLink-connected through NVLink Switches; massive bisection bandwidth.

NVLINK NETWORK BENEFITS
Speedup over A100, dependent on communication intensity:
[Charts: HPC (Climate Modelling, Lattice QCD, 3D FFT, Genomics; up to ~8x), AI inference (GPT-3 530B params, i.e., Megatron Turing NLG 530B, at three latency targets; up to ~35x), and AI training (Vision Models, 10TB Recommender, GPT-3 175B, Switch-XXL 395B; up to ~10x), comparing A100, H100, and H100 + NVLink Network.]
Projected performance subject to change. A100 cluster: HDR IB network. H100 cluster: NDR IB network, with NVLink Network where indicated. #GPUs: Climate Modelling 1K, LQCD 1K, Genomics 8, 3D-FFT 256, MT-NLG 32 (batch sizes: 4 for A100 and 60 for H100 at 1 sec; 8 for A100 and 64 for H100 at 1.5 and 2 sec), MRCNN 8 (batch 32), GPT-3 16B 512 (batch 256), DLRM 128 (batch 64K), GPT-3 175B 16K (batch 512), MoE 8K (batch 512, one expert per GPU).

SUMMARY
NVLink4-generation NVSwitch:
- 64 NVLink4 ports and 3.2 TB/s full-duplex BW
- NVLink SHARP (multicast and reductions off-load)
- Inter-server NVLink Network support; custom-FW OSFP NVLink Network cable support
- Basis of the new NVLink Switch
Hopper-generation SuperPOD:
- 32 DGX H100 servers + 18 NVLink Switches; 1 ExaFLOP of AI performance
- 57.6 TB/s NVLink Network bisection BW
- NVLink Network can more than double performance for communication-intensive applications
- Scalable to thousands of GPUs using InfiniBand to connect multiple Scalable Units
Cutting-edge speeds and capabilities.
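The speedup columns in the scale-up table can be reproduced from its raw bandwidth numbers; a minimal check (labels and structure are just for illustration):

```python
# Reproduce the Speedup columns of the SCALE-UP WITH NVLINK NETWORK table.
rows = {
    # label: (A100 bisection GB/s, A100 reduce GB/s,
    #         H100 bisection GB/s, H100 reduce GB/s)
    "1 DGX / 8 GPUs":    (2400, 150, 3600, 450),
    "32 DGXs / 256 GPUs": (6400, 100, 57600, 450),
}

for label, (a_bis, a_red, h_bis, h_red) in rows.items():
    print(f"{label}: bisection {h_bis / a_bis:g}x, reduce {h_red / a_red:g}x")
# 1 DGX / 8 GPUs: bisection 1.5x, reduce 3x
# 32 DGXs / 256 GPUs: bisection 9x, reduce 4.5x
```

Note the asymmetry the table captures: at 256 GPUs the A100 pod's per-GPU reduce bandwidth drops (100 GB/s) as traffic leaves the node over HDR IB, while the H100 pod holds 450 GB/s because SHARP-accelerated AllReduce runs inside the NVLink Network fabric.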