Scheduled Ethernet Fabric for Large-scale AI Training Clusters
ARTIFICIAL INTELLIGENCE (AI) NETWORKING

Pengfei Huo, Senior Network Architect, ByteDance
Rajasekar Jegannathan, Lead AI Infrastructure
Oozie Parizer, Senior Director, Product Marketing, Broadcom

Agenda
- Challenges in AI training fabrics
- How Scheduled Ethernet Fabric works
- Benchmark results
- Call to action

Challenges in AI Networks

Workload characteristics:
- Small number of flows
- High bandwidth per flow; GPUs drive high bandwidth
- High demand for network throughput/utilization
- Job concurrency
- Unexpected network failures
- Topology changes

Challenges to the network:
- Hash polarization
- Uneven link utilization
- Out-of-order delivery
- Head-of-line (HOL) blocking
- Congestion hot spots
- PFC propagation
- Crosstalk between jobs
- Slow failure detection/failover
- Failovers create congestion
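The hash-polarization and uneven-utilization failure mode listed above can be illustrated with a minimal Python sketch. The flow counts, link speeds, and MD5-based hash below are illustrative assumptions, not the actual switch hash or ByteDance's measured traffic: ECMP pins each whole flow to one link by hashing its headers, so a handful of large flows can collide, while spraying traffic evenly over all links, independent of flow size, loads every link equally.

```python
import hashlib

def link_loads(num_flows, num_links, flow_gbps, spray=False):
    """Per-link offered load (Gb/s) under ECMP-style hashing vs. even spraying."""
    loads = [0.0] * num_links
    if spray:
        # Scheduled-fabric style: every flow's traffic is spread over all
        # links, independent of flow size.
        for link in range(num_links):
            loads[link] = num_flows * flow_gbps / num_links
    else:
        # ECMP style: a deterministic hash of the flow ID pins the whole
        # flow to a single link (MD5 stands in for the real header hash).
        for flow in range(num_flows):
            link = int(hashlib.md5(f"flow-{flow}".encode()).hexdigest(), 16) % num_links
            loads[link] += flow_gbps
    return loads

# AI-training-like traffic: few, very large flows (8 flows of 200 Gb/s, 8 links).
ecmp = link_loads(8, 8, 200.0)
spray = link_loads(8, 8, 200.0, spray=True)
print("ECMP: ", ecmp)   # typically uneven: some links carry 2+ flows, some none
print("Spray:", spray)  # uniform: 200.0 Gb/s on every link
```

With only 8 flows over 8 links, hash collisions are almost certain, which is exactly why the small-number-of-large-flows workload above is hostile to per-flow ECMP.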
Perfect Load Balancing
- Equal spraying over all links of the fabric, independent of flow size
- Uniform link utilization avoids hot spots
- Consistent high performance at all network loads

Congestion-Free Operation
- End-to-end scheduled fabric
- No PFC propagation
- Isolation of "slow receivers": no HOL blocking
- Excels regardless of the workload pattern
- Just works "out of the box", regardless of the type or performance of endpoints
- Native multi-tenancy support

Zero Impact Failover (ZIF)
- Self-healing fabric: hardware-based failure detection and recovery
- Linear, predictable change in performance with link failures
- Non-scheduled fabrics may show unpredictable, greater-than-linear degradation with long convergence times
- Fewer checkpoints, shorter Job Completion Time (JCT)

Network Topology for Benchmark Test
- 128 GPUs, 200G links per GPU: Scheduled Fabric* vs. Typical Ethernet
- Two configurations: 1:1 bandwidth ratio for leaf-to-GPU and leaf-to-spine, and 2:1 oversubscription for leaf-to-GPU and leaf-to-spine
*With the scheduled fabric solution from DriveNets, Network Cloud software
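A quick sanity check on the two benchmark configurations: oversubscription is simply the ratio of a leaf's GPU-facing capacity to its spine-facing capacity. The leaf port counts below are hypothetical, chosen only to realize the 1:1 and 2:1 ratios named above; only the 200G port speed and the 128-GPU cluster size come from the test description.

```python
def oversubscription(gpu_ports, uplink_ports, port_gbps=200):
    """Leaf oversubscription: GPU-facing capacity / spine-facing capacity."""
    return (gpu_ports * port_gbps) / (uplink_ports * port_gbps)

# Hypothetical 200G leaf port counts for the two benchmark configurations:
assert oversubscription(16, 16) == 1.0  # 1:1 leaf-to-GPU vs. leaf-to-spine
assert oversubscription(16, 8) == 2.0   # 2:1 oversubscribed uplinks

# Aggregate GPU-side bandwidth of the 128-GPU, 200G-per-GPU test cluster:
print(128 * 200 / 1000, "Tb/s")  # 25.6 Tb/s
```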
Benchmark Results for Scheduled Fabric versus Typical Ethernet (PerfTest* throughput)
- Measured 20.3% Job Completion Time (JCT) improvement over Typical Ethernet
- Improvement mainly attributed to perfect load balancing
- 3K QPs, 64 KB per QP; tested under standard DCQCN congestion control, all-to-all, 2 GB message size
*Over 8 CPU servers

GPU Benchmark Performance
- All2All throughput for different GPU cluster sizes, single job; normalized to Typical Ethernet with 64 GPUs
- 2% throughput improvement with a single job on 64 GPUs
- 8% throughput improvement with a single job on 128 GPUs: a 4x premium when the cluster doubles
- The larger the cluster, the higher the performance premium from the scheduled fabric
- Native multi-tenancy support: close to ZERO impact with multiple jobs, versus sizeable degradation with Typical Ethernet
- Scheduled Fabric demonstrates consistent performance with bigger cluster sizes and more concurrent jobs

Normalized All2All throughput:
                      64 GPUs    128 GPUs
    Typical Ethernet    100%       92%
    Scheduled Fabric    102%      100%
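The 2% and 8% premiums quoted above follow directly from the normalized throughput figures; a small arithmetic check (my computation of the slide's numbers, not an additional measurement):

```python
def premium_pct(scheduled, typical):
    """Scheduled-fabric throughput premium over typical Ethernet, in percent."""
    return (scheduled / typical - 1.0) * 100.0

# Normalized All2All throughput (typical Ethernet @ 64 GPUs = 100%):
print(round(premium_pct(102, 100), 1))  # 2.0 -> the quoted ~2% at 64 GPUs
print(round(premium_pct(100, 92), 1))   # 8.7 -> the quoted ~8% at 128 GPUs
```

Note the 128-GPU premium is computed against the degraded Typical Ethernet baseline (92%), so it rounds to roughly 8.7%, which the slide reports as 8%.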
Seamless Operation & Maintenance
- One platform: orchestration, controller
- Distributed tools: gRPC / SNMP / ERSPAN
- White box or commercial box
- Fully compatible with existing O&M tool kits: no changes to user habits or behaviors, minimum integration effort
- Supports all popular telemetry features
- Additional telemetry for operation and troubleshooting: HW-based mirror-on-drop, HW-based sFlow, flow latency monitoring, HW microburst detection, topology information query

Agnostic to Endpoints
- No dependence on sophisticated congestion-control (CC) mechanisms
- No dependence on endpoint reordering
- Interoperates with NICs and GPUs from different vendors (Vendor-A, Vendor-B, Vendor-C)
- Minimal tuning effort: performance is independent of traffic pattern and port/flow speed
- Self-adaptable to various training models
- Easy scaling: any scale for AI inference networking and training networking
- In-service network upgrades
- Time to deployment: 100GE, 200GE, 400GE, and 800GE

Call to Action
- Stay tuned: the first Scheduled Fabric deployment for an AI cluster with 1K+ GPUs is to be announced soon!
- More about the first deployment (performance and operations) will be shared at the OCP Global Summit in Sept/Oct 2024
- Please contribute: continue to test and quantify the performance delta between Scheduled Fabric and Typical Ethernet for larger clusters

Additional information
- "Distributed Forwarding in a Virtual Output Queue (VOQ) Architecture"
- https://www.opencompute.org/documents/ddc-v3-ocp-base-specification-revision-5-0-pdf
- https:/
- https:/

Thank you!