Scaling RDMA Networks for AI Training with RoCEv2: Challenges and Opportunities

Adi Gangidi, Production Network Engineer, AI Systems, Meta
Rohit Puri, Software Engineer, AI Network Systems, Meta

Agenda
- Meta RDMA deployment overview
- Operational challenges
- Solutions
- Challenges we continue to solve

Our Production RDMA Network Is Unique
- Commodity Ethernet infrastructure: a multi-hop network built on commodity network switches.
- Native for AI workloads: our AI training workloads use network + compute as a single large system, and network primitives are an integral part of the overall application pipeline.
- RoCEv2, lossless transport: we use RoCEv2 (RDMA over Converged Ethernet, the routable version) in its native behavior, and we configure the network to be lossless using PFC/DCQCN.
- Large deployment scale: thousands of endpoints in a single fabric, and many such network fabric instances.
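As a rough sketch of what "lossless via PFC/DCQCN" means on the sender side, the following is a minimal model of DCQCN's commonly described rate-update rules (cut on a CNP, decay and recover in quiet periods). The constants and class name are illustrative, not our production tuning.

```python
class DcqcnRate:
    """Minimal model of DCQCN's sender-side rate control loop.

    On each CNP (congestion notification packet, triggered by ECN marks
    at the switch), the sender cuts its rate; in CNP-free periods it
    decays its congestion estimate and recovers toward the target rate.
    Constants are illustrative, not a production configuration.
    """

    def __init__(self, line_rate_gbps: float, g: float = 1 / 16):
        self.rc = line_rate_gbps   # current sending rate
        self.rt = line_rate_gbps   # target rate to recover toward
        self.alpha = 1.0           # congestion estimate in [0, 1]
        self.g = g                 # alpha update gain

    def on_cnp(self) -> None:
        # Multiplicative decrease, scaled by the congestion estimate.
        self.rt = self.rc
        self.rc *= 1 - self.alpha / 2
        self.alpha = (1 - self.g) * self.alpha + self.g

    def on_quiet_period(self) -> None:
        # No CNPs seen: decay alpha and recover halfway to the target.
        self.alpha *= 1 - self.g
        self.rc = (self.rt + self.rc) / 2
```

Starting at 100 Gbps with alpha = 1, one CNP halves the rate to 50, and one quiet period recovers it to 75.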
Overview of Our RDMA Deployment
- Workloads/collectives: GPU-based communication primitives, All-to-All and All-Reduce.
- Topology: two-stage Clos fabric built from commodity Ethernet switches.
- Routing: static routing, with ECMP fallback during failures.
- Transport: RoCEv2.
- Congestion control: DCQCN, i.e. Priority Flow Control (PFC) + Explicit Congestion Notification (ECN). PFC helps us achieve a "lossless fabric".

End-points
- Built with Open Compute form-factor components: ZionEX.
- Wedge400 ToR switch for in-rack RDMA switching.
- Each ZionEX server: 8x GPU and 8x NIC.
- Within a server: shared-memory fabric. Across servers: RDMA fabric.
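The All-Reduce collective mentioned above is the dominant pattern the fabric carries during training. A toy ring All-Reduce over plain Python lists shows the communication pattern; real workloads of course run NCCL-style collectives over the RDMA fabric, and this sketch only illustrates the algorithm.

```python
def ring_all_reduce(bufs):
    """Toy ring All-Reduce; bufs[r] is rank r's vector, split into
    len(bufs) chunks. After n-1 reduce-scatter steps and n-1 all-gather
    steps, every rank holds the element-wise sum of all vectors."""
    n = len(bufs)
    # Reduce-scatter: each rank passes one chunk to its ring neighbor,
    # which accumulates it; after n-1 steps rank r holds the fully
    # reduced chunk (r + 1) % n.
    for step in range(n - 1):
        for r in range(n):
            c = (r - step) % n     # chunk index rank r sends this step
            nxt = (r + 1) % n      # ring neighbor
            bufs[nxt][c] = [a + b for a, b in zip(bufs[nxt][c], bufs[r][c])]
    # All-gather: circulate the reduced chunks so every rank ends up
    # with the complete summed vector.
    for step in range(n - 1):
        for r in range(n):
            c = (r + 1 - step) % n
            nxt = (r + 1) % n
            bufs[nxt][c] = list(bufs[r][c])
    return bufs
```

Note that every step is a synchronous neighbor exchange, which is why (as later slides stress) a single slow endpoint or a single lossy link stalls the whole job.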
RDMA Network Integral to the Application Pipeline
[Figure: rough timeline of a single training iteration from a single end-point, interleaving Compute #1 … #n+2 with All-to-All #1 … #n and All-Reduce #1 … #n+1. Not to scale; does not represent a trace of our workloads.]
- Network collective operations interlock with compute operations.
- Compute operations interlock with network collective operations.

Network Collective Primitives
- Low network flow entropy: some of the network primitives in our workloads do not have enough flow entropy to depend entirely on ECMP.
- We currently use static routing on our production network in steady state.
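The low-entropy point can be made concrete: ECMP picks an uplink by hashing flow headers, so a job that opens one long-lived RoCEv2 queue pair per NIC pair offers only a handful of distinct flows to hash. A toy model (the hash function, addresses, and port choices are illustrative; in practice RoCEv2 uses UDP destination port 4791 and varies the UDP source port for entropy):

```python
import hashlib
from collections import Counter

def ecmp_bucket(flow, n_links):
    """Pick an uplink by hashing the flow tuple, as an ECMP switch
    would (real switches use their own hash, not SHA-256)."""
    digest = hashlib.sha256(repr(flow).encode()).digest()
    return int.from_bytes(digest[:4], "big") % n_links

def link_load(flows, n_links):
    """Distribution of flows across uplinks under ECMP."""
    return Counter(ecmp_bucket(f, n_links) for f in flows)

# 8 long-lived collective flows (one QP per NIC pair) over 16 uplinks:
# at most 8 of the 16 links can carry traffic, and collisions leave
# some links doubly loaded while others sit idle.
few = [("10.0.0.%d" % i, "10.0.1.%d" % i, 49152, 4791) for i in range(8)]

# Queue-pair multiplexing (16 QPs per pair -> 128 flows) gives the
# hash far more entropy to spread across the same 16 links.
many = [("10.0.0.%d" % i, "10.0.1.%d" % i, 49152 + q, 4791)
        for i in range(8) for q in range(16)]
```

With `few`, `link_load(few, 16)` can touch at most 8 of the 16 links no matter how good the hash is, which is why static routing is used in steady state instead.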
Our Challenges
- Load balancing during network failures: static routing works well in steady state but does not adapt to network failures and drains, such as at a switch or link.
- Performance impact of drops: the default transport implementation uses a go-back-N retransmission algorithm, so losing even a few packets causes significant performance loss.
- Debuggability:
  - Separating network vs. non-network contributions to job failures or job slowdowns.
  - End-point/GPU hardware traces instead of kernel/CPU tracing, due to kernel bypass in RDMA.
  - RDMA verbs tracing as opposed to flow tracing.
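The go-back-N cost can be quantified with a toy model: a single drop forces retransmission of everything behind it in the in-flight window. The window size and loss pattern below are arbitrary, and the model assumes retransmissions succeed.

```python
def go_back_n_sends(n_packets, lost, window):
    """Count packet transmissions under go-back-N, where `lost` is the
    set of sequence numbers dropped on their first attempt.

    Simplified model: the sender fills a window, and on any drop it
    rewinds to the first lost packet and resends everything from there
    (retransmissions always succeed).
    """
    sends = 0
    seq = 0
    while seq < n_packets:
        end = min(seq + window, n_packets)
        sends += end - seq                  # transmit the window
        dropped = [s for s in range(seq, end) if s in lost]
        if dropped:
            lost = lost - set(range(seq, end))  # first tries consumed
            seq = dropped[0]                # rewind to the first loss
        else:
            seq = end
    return sends
```

For 100 packets and a 16-packet window, one lost packet (sequence 10) costs 106 sends instead of 100: the six packets behind the loss in the window are all resent. At RDMA line rates, with larger windows and correlated drops, this multiplies quickly, which is the "small loss, big degradation" effect on the challenges slide.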
Network Routing and Load Balancing
- Static routing at steady state; no ECMP. Why? Our workloads lack enough flow entropy.
- Destination-prefix-based ECMP for traffic affected by failures; network drains and failures are common.
- Issues with the static routing scheme: unequal network utilization under specific workload-scheduling schemes, and the ECMP spread of traffic during failures.

Static Routing Challenges
- In corner cases, oversubscription is seen on egress traffic because fallback depends only on the destination IP prefix.
- Challenges with fallback routing: ECMP. [Figure: a failure forcing traffic onto 14-way ECMP through spine switches S7 and S8.]
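The steady-state-static / ECMP-fallback scheme above can be sketched as a route-selection function. Interface names and prefixes are illustrative; the point is that because the fallback hashes only the destination prefix, every flow toward that prefix re-converges onto a single surviving uplink, which is how the egress oversubscription corner case arises.

```python
import hashlib

def next_hop(dst_prefix, primary, uplinks, failed):
    """Route selection sketch: a statically programmed primary next hop
    in steady state, destination-prefix ECMP over surviving uplinks
    when the primary fails.

    Hashing only the destination prefix means all flows to that prefix
    pile onto one fallback uplink, regardless of how many flows there
    are -- the oversubscription failure mode described above.
    """
    if primary not in failed:
        return primary
    alive = [u for u in uplinks if u not in failed]
    h = int.from_bytes(hashlib.sha256(dst_prefix.encode()).digest()[:4], "big")
    return alive[h % len(alive)]
```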
Load Balancing Solutions
- Even rack placement through job-scheduler changes.
- Realizing more entropy with existing workloads: increase the number of flows via queue-pair multiplexing, plus customized hashing schemes.
- Traffic engineering: a software-based controller programs routes based on demand and job-scheduling signals.

Load Balancing Solutions (cont.)
- Spray/reorder and assembly: in-network reorder and assembly. Spray and reorder at the first-hop switch, re-assemble at the last-hop network switch.
- Suitable for supporting RoCEv2 or other transports.
- Custom transports that don't require ordering guarantees.

Sensitivity to Loss
- High workload sensitivity to loss on RDMA networks: isolated link flaps can fail jobs.
- Tuning timeouts can have side effects.
- A small amount of loss can cause significant performance degradation, due to the go-back-N retransmission scheme.
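The last-hop half of the spray/reorder-and-assembly idea above amounts to a per-flow reorder buffer in the switch. The sketch below models only that buffering logic (it is not any specific switch feature): packets sprayed across paths arrive out of order, and the buffer releases them to the NIC in sequence so RoCEv2's in-order expectation still holds.

```python
import heapq

class ReorderBuffer:
    """Last-hop reassembly sketch for a spray-and-reorder fabric."""

    def __init__(self):
        self.next_seq = 0
        self.pending = []  # min-heap of (seq, payload)

    def receive(self, seq, payload):
        """Accept one (possibly out-of-order) packet; return the list
        of payloads that can now be delivered in order."""
        heapq.heappush(self.pending, (seq, payload))
        out = []
        # Drain every packet that is now contiguous with the sequence.
        while self.pending and self.pending[0][0] == self.next_seq:
            out.append(heapq.heappop(self.pending)[1])
            self.next_seq += 1
        return out
```

A real implementation also needs bounded buffer space and a timeout for genuinely lost packets; custom transports that tolerate out-of-order delivery avoid this buffering entirely.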
Not-So-Impactful Challenges
Well-known RoCEv2 challenges that did not have a significant effect on our deployments:
- Head-of-line blocking (HoL): statically carved buffers on the spine + a PFC watchdog in place.
- Scaling to large node counts in a job: NIC QP cache limitations.
- Tuning congestion control: PFC seems to do the trick, and in-cast scenarios are relatively rare.
- PFC should cause HoL and slow down all jobs in the network, but we were able to limit the impact to just the rack, thanks to the statically carved buffers in the spine.
- The pathological static-routing case, revisited, showcases why HoL is not a large concern for our networks.
Sensitivity to End-Point Issues: GPU/NIC
- Collective communication operations are synchronous data exchanges between end-points.
- One slow endpoint can slow down the whole job. Example: an endpoint with a marginal link or PCIe issues.
- One bad endpoint can fail the job. Example: GPU memory page retirement.
- Endpoint issues are common in our deployments: RDMA servers are more complex, and their failure rate is higher.
- Restarting a job is expensive.
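Because collectives are synchronous, one bad endpoint makes every healthy peer time out at roughly the same moment, which makes the raw logs noisy. One way to localize the culprit is to fold per-endpoint timeout reports into a blame ranking; the helper below is a hypothetical illustration of that idea, not Meta's actual analyzer.

```python
from collections import Counter

def likely_culprits(timeout_reports):
    """Rank endpoints by how often peers timed out waiting on them.

    `timeout_reports` maps each reporting rank to the peer ranks it was
    blocked on when the collective timed out. In a synchronous
    collective, one bad endpoint makes everyone else report, so the
    most-blamed rank is the best first suspect. Ties break in favor of
    ranks that never reported at all (they may be hung or dead).
    """
    blame = Counter(p for peers in timeout_reports.values() for p in peers)
    return sorted(blame, key=lambda r: (-blame[r], r in timeout_reports))
```

For example, if ranks 0, 1, and 2 all report timing out on rank 3, rank 3 tops the ranking even though it produced no log line of its own.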
Debuggability
- Determining network vs. non-network root cause.
- Job failures: collective operation timeouts. Hard to analyze logs when multiple endpoints report timeouts.
- Slow jobs:
  - Out-of-sequence packets: a high-signal indicator thanks to our use of static routing; indicates loss anywhere on the fabric, including within a switch.
  - Congestion metrics: pause duration at end-points; CNPs (ECN) handled by the NIC.

Tooling for Debuggability
- Pre-flight checks: a well-known symmetric distributed benchmark run before the start of every job.
- RDMA U-turn test: GPUs in the same server connect to each other via the RDMA fabric.
- UDP and RDMA pingers: identify loss while the workload is in flight.
- Job log analyzer: timeout matrix; network and hardware event correlator.
- Tracing: end-point/GPU traces; RDMA verbs tracing.
- Traffic profiling at 1 ms granularity, in Gbps (on our 100G installation).
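The core of millisecond-granularity profiling is just binning byte counts into 1 ms windows; averaged counters hide exactly the bursts that trip PFC/ECN on a 100G link. A sketch over host-side (timestamp, bytes) samples, with the caveat that production profiling would come from NIC or switch counters rather than a Python loop:

```python
from collections import defaultdict

def profile_gbps(packets, bin_ms=1.0):
    """Bucket (timestamp_seconds, bytes) samples into fixed bins and
    report the rate of each bin in Gbps."""
    bin_s = bin_ms / 1000.0
    bytes_per_bin = defaultdict(int)
    for ts, nbytes in packets:
        bytes_per_bin[int(ts / bin_s)] += nbytes
    # bits per bin / bin duration -> bits per second -> Gbps
    return {b: n * 8 / bin_s / 1e9 for b, n in bytes_per_bin.items()}
```

Two 12.5 KB packets inside the same millisecond register as a 0.2 Gbps bin, even if the per-second average is negligible; a microburst near 100 Gbps in one bin is the kind of event a coarser profile would miss.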
Recap and Call to Action
- We scaled RoCEv2 for our workloads, under unique constraints.
- In the process we ran into unique operational challenges: low-flow-count workloads, load balancing during network drains, sensitivity to packet loss, and debuggability.
- This is a proof point that commodity-Ethernet RDMA deployments can scale.
- Multiple challenges remain, and there are many directions for solutions.
- We invite collaboration on unified solutions.

Questions?
Thank you!