Super-Compute System Scaling for ML Training
Bill Chang, Rajiv Kurian, Doug Williams, Eric Quinnell

Path to General Autonomy
- Model architecture: vision, path planning, auto-labeling; new model architectures with parameter sizes increasing exponentially
- Training data: video training data with 4D labels; ground-truth generation
- Training infrastructure: training and evaluation pipeline, flexible system architecture, software at scale

Accelerated ML Training System
- Typical system: fixed ratio of compute, I/O, and memory
- Optimized ML training system: ML requirements keep evolving
- Disaggregated system architecture: flexible ratio of compute, I/O, and memory

Optimized Compute: Technology-Enabled Scaling
System-on-Wafer technology
- 25 D1 compute dies + 40 I/O dies
- Compute and I/O dies optimized for efficiency and reach
- Heterogeneous RDL optimized for high-density, high-power layout
Maximize performance and yield
- Known-good-die and fault-tolerant designs
- Each tile assembled with fully functional dies
- Harvesting and fully configurable routing for yield

Training Tile: Unit of Scale
- Large compute with optimized I/O
- Fully integrated system module (power/cooling)
- Uniform high bandwidth: 10 TB/s on-tile bisection bandwidth, 36 TB/s off-tile aggregate bandwidth
- 9 PFLOPS BF16/CFP8
- 11 GB high-speed ECC SRAM
- 36 TB/s aggregate I/O bandwidth

Flexible Building Block
- 9 TB/s per tile edge
- Scale with multiple tiles: no additional power/cooling design needed

Disaggregated Memory: V1 Dojo Interface Processor
- 32 GB high-bandwidth memory
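As a quick consistency check on the tile's I/O figures, a sketch (assuming the 9 TB/s figure is per tile edge and a square tile exposes four such edges; the edge count is inferred, not stated):

```python
# Consistency check for the Training Tile I/O figures (sketch).
# Assumption: 9 TB/s is the per-edge off-tile bandwidth and a square
# tile has four edges, reproducing the 36 TB/s aggregate figure.

PER_EDGE_TBPS = 9      # TB/s per tile edge (from the slide)
EDGES_PER_TILE = 4     # assumed: one link per side of a square tile

aggregate = PER_EDGE_TBPS * EDGES_PER_TILE
print(aggregate)       # 36 TB/s aggregate off-tile bandwidth
```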
- 800 GB/s total memory bandwidth
- 900 GB/s TTP interface: Tesla Transport Protocol (TTP), a fully custom protocol that provides full DRAM bandwidth to the training tile
- 50 GB/s TTP over Ethernet (TTPoE): extends communication over standard Ethernet, with native hardware support
- 32 GB/s Gen4 PCIe interface

Dojo Interface Processor: PCIe Topology
- 160 GB total DRAM per tile edge, shared memory for the training tiles
- 5 DIP cards provide maximum bandwidth: 4.5 TB/s aggregate bandwidth to DRAM over TTP
- 80 lanes of PCIe Gen4 provide standard connectivity to hosts
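The per-edge totals follow directly from the per-card figures; a sketch of the arithmetic (the x16-per-card split is implied by the totals, not stated):

```python
# DIP bank arithmetic (sketch): each tile edge carries 5 Dojo Interface
# Processor (DIP) cards; per-card figures come from the slides above.

DIPS_PER_EDGE = 5
DRAM_PER_DIP_GB = 32       # 32 GB HBM per DIP
TTP_PER_DIP_GBPS = 900     # 900 GB/s TTP interface per DIP
PCIE_LANES_TOTAL = 80      # 80 lanes of PCIe Gen4 per 5-card bank

print(DIPS_PER_EDGE * DRAM_PER_DIP_GB)    # 160 GB DRAM per tile edge
print(DIPS_PER_EDGE * TTP_PER_DIP_GBPS)   # 4500 GB/s = 4.5 TB/s aggregate
print(PCIE_LANES_TOTAL // DIPS_PER_EDGE)  # 16 lanes per card (implied x16)
```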
Scalable Communication: Tesla Transport Protocol
- TTP links D1 dies, tiles, and DIPs; TTPoE extends to other nodes over Ethernet, trading latency for bandwidth and reach

Dojo Interface Processor: Z-Plane Topology
- TTPoE: point-to-point over Ethernet through an Ethernet switch
- Provides high-radix connectivity in the Z-plane TTP network
- Enables "shortcuts" across the network: a route that takes 30 hops through the compute mesh takes only 4 hops through the switch
- Manages latency for sync and control across the compute plane

Dojo Network Interface Card (DNIC)
- Remote DMA over TTPoE: DMA to/from any TTP endpoint (compute SRAM, DRAM)
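The 30-hop-to-4-hop shortcut can be illustrated with a toy graph (this is not the real Dojo topology; the chain length and node names are made up to reproduce the slide's hop counts):

```python
from collections import deque

# Toy model of the Z-plane "shortcut" (illustrative only; not the real
# Dojo topology). A chain of 31 mesh stops is 30 hops end to end, but a
# DIP at each end of the chain, attached to a shared Ethernet switch,
# shortens the route to 4 hops: mesh -> DIP -> switch -> DIP -> mesh.

def hops(adj, src, dst):
    """Hop count between src and dst via breadth-first search."""
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == dst:
            return dist
        for nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None

# Long path through the 2D compute mesh: m0 - m1 - ... - m30.
mesh = {f"m{i}": set() for i in range(31)}
for i in range(30):
    mesh[f"m{i}"].add(f"m{i+1}")
    mesh[f"m{i+1}"].add(f"m{i}")

# Same mesh plus a Z-plane shortcut through the Ethernet switch.
zplane = {node: set(nbrs) for node, nbrs in mesh.items()}
zplane.update({"dip0": {"m0", "switch"}, "dip1": {"m30", "switch"},
               "switch": {"dip0", "dip1"}})
zplane["m0"].add("dip0")
zplane["m30"].add("dip1")

print(hops(mesh, "m0", "m30"))    # 30 hops through the mesh
print(hops(zplane, "m0", "m30"))  # 4 hops via the shortcut
```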
- Leverages switched Ethernet networks
- Enables remote compute for pre- and post-processing

Remote DMA Topology
- Scale-out for CPU/memory-bound pre-processing workloads
- CPU/DRAM hosts with DNICs reach the tiles and DIP banks through the Ethernet switch

V1 Dojo Training Matrix
- Training tiles connected edge-to-edge at 9 TB/s, fed by 5-DIP HBM banks at 4.5 TB/s per edge
- Host CPUs with DNICs attach through the Ethernet switch
- 1 EFLOP BF16/CFP8
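The deck gives system-level headline numbers (1 EFLOP, 1.3 TB SRAM, 13 TB DRAM) without a tile count. A back-of-envelope sketch shows the figures are mutually consistent (the tile count below is derived from the SRAM numbers, not stated in the deck):

```python
# Back-of-envelope check that the system specs are mutually consistent.
# The tile count is derived from the SRAM figures, not stated on the slide.

TILE_PFLOPS = 9              # BF16/CFP8 per training tile
TILE_SRAM_GB = 11            # ECC SRAM per tile
SYSTEM_SRAM_GB = 1300        # 1.3 TB system SRAM
SYSTEM_PFLOPS_TARGET = 1000  # "1 EFLOP"

tiles = round(SYSTEM_SRAM_GB / TILE_SRAM_GB)
print(tiles, tiles * TILE_PFLOPS)  # ~118 tiles -> ~1062 PFLOPS, i.e. ~1 EFLOP
```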
- 1.3 TB high-speed ECC SRAM
- 13 TB high-bandwidth DRAM

Disaggregated Scalable System
- Compute: training tile; Memory: interface processor; I/O: network interface

Software at Scale: Model Execution
- Workloads operate almost entirely out of SRAM
- A single copy of the parameters, replicated just in time
- High utilization
- Unlike typical accelerators, all forms of parallelism may cross die boundaries, thanks to the high TTP bandwidth
- Running example from the diagrams: two convolution layers with weights P0 [C, K1, R, S] and P1 [K1, K2, R, S] and input I0 [N, C, H, W], mapped onto four tiles and two 5-DIP banks
- Parameters are distributed across the DIPs
- Parameters are sharded across the tiles at load time, once per training run
- Inputs are sharded across the DIPs in the batch dimension
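The load-time sharding can be sketched with NumPy (the concrete shapes and the four-tile split are illustrative, following the [C, K1, R, S] weight and [N, C, H, W] input layout in the diagrams; none of this is a Dojo API):

```python
import numpy as np

# Load-time sharding sketch (illustrative, not a Dojo API).
# Conv weight P0 [C, K1, R, S] is sharded across 4 tiles along the
# output-channel axis K1; input I0 [N, C, H, W] is sharded across the
# same tiles along the batch axis N.

TILES = 4
N, C, H, W = 8, 16, 32, 32   # batch, channels, height, width
K1, R, S = 64, 3, 3          # output channels, kernel height/width

rng = np.random.default_rng(0)
P0 = rng.standard_normal((C, K1, R, S))
I0 = rng.standard_normal((N, C, H, W))

p_shards = np.split(P0, TILES, axis=1)   # each tile: [C, K1/4, R, S]
i_shards = np.split(I0, TILES, axis=0)   # each tile: [N/4, C, H, W]

print(p_shards[0].shape)  # (16, 16, 3, 3)
print(i_shards[0].shape)  # (2, 16, 32, 32)
```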
- Inputs are also sharded by batch across the tiles: each tile holds an N/4 slice [N/4, C, H, W]
- Parameters are replicated across the tiles just in time: there is a single copy of each parameter in the entire system, and the high bandwidth is used to replicate it to the tiles just before use
- The first layer is run in a data-parallel manner: each tile applies the full P0 [C, K1, R, S] to its own N/4 batch slice, producing activations [N/4, K1, H, W]
- Parameters for the next layer (P1) are replicated concurrently, one copy per two tiles; the next layer is better executed in a model-parallel manner
- Replicated parameters and inputs are discarded for a minimal SRAM footprint
- Input activations for the next layer are replicated and split across channels, K1/2 channels per tile (only one N/4 batch shown in the diagrams)
- Each tile computes a partial sum for its channel slice of each N/4 batch
- Partial sums for each N/4 batch are reduced across tiles: small packet sizes, fine-grained synchronization, and the low-latency network make pipelined partial sums work
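Numerically, the channel-split partial-sum step reduces to a simple identity: split the input channels across tiles, let each tile compute a partial output, and the cross-tile reduction equals the unsplit result. A sketch, treating a 1x1 convolution as a matmul (shapes illustrative):

```python
import numpy as np

# Channel-split partial sums (sketch). Treating a 1x1 convolution as a
# matmul x [N, C] @ w [C, K2]: split the C axis across 4 "tiles"; each
# tile computes a partial product, and the reduction of the partials
# across tiles equals the unsplit layer output.

rng = np.random.default_rng(0)
N, C, K2, TILES = 8, 16, 32, 4
x = rng.standard_normal((N, C))
w = rng.standard_normal((C, K2))

x_parts = np.split(x, TILES, axis=1)   # each tile's channel slice of x
w_parts = np.split(w, TILES, axis=0)   # matching slice of the weights

partials = [xp @ wp for xp, wp in zip(x_parts, w_parts)]  # per-tile partials
reduced = np.sum(partials, axis=0)     # reduction across tiles

print(np.allclose(reduced, x @ w))     # True
```

In hardware, this reduction is what the small packets and fine-grained synchronization pipeline across the tiles.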
- The same computation runs on every other N/4 batch: a combination of data and model parallelism

End-to-End Training Workflow
- Data loading: file loading, decode, augmentation, ground-truth generation
- Compute (training)
- Post-processing: output compression, file write

Video-Based Training: Data Loading
- Multi-camera, multi-frame models require decoding GOP_SIZE/2 frames for the first frame of each camera stream, and one decode for every frame after
- Flexible compute required for augmentation, image rectification, and ground-truth generation

Data Loading Needs of Different Models
- [Charts: decode, PCIe, storage-bandwidth, and CPU-core requirements as a percentage of a single host's capacity; Models 1 and 2 fit within one host, while Model 3 requires up to ~700% of a single host's capacity]
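The decode-cost rule above can be written down directly (a sketch; the clip length, GOP size, and camera count are illustrative, not from the deck):

```python
# Decode cost for video data loading (sketch of the slide's rule:
# GOP_SIZE/2 decodes on average to reach the first frame of a camera
# stream, then one decode per subsequent frame).

def decodes_needed(frames_per_camera: int, gop_size: int) -> int:
    """Average number of frame decodes for one camera's clip."""
    first = gop_size // 2         # seek into a group of pictures
    rest = frames_per_camera - 1  # one decode per later frame
    return first + rest

# e.g. an 8-frame clip with GOP_SIZE=30: 15 + 7 = 22 decodes per camera;
# a multi-camera model multiplies that by the camera count.
print(decodes_needed(8, 30) * 8)  # 176 decodes for 8 cameras
```

This is why decode dominates the requirements charts for multi-camera, multi-frame models: the GOP seek cost is paid once per camera per clip.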
Disaggregated Data Loading Tier
- CPU/DRAM hosts with DNICs stream batches (1A, 1B, 1C, 1D) into the DIP HBM banks through the Ethernet switch
- The data-loading tier scales independently of the ML compute

Disaggregated Resources
- Memory, I/O, and ML compute can be partitioned per job (Model 1, Model 2, Model 3)

Dojo Supercomputer for ML Training
- New integration enables high bandwidth and high performance
- Uniform high bandwidth lets software fully exploit parallelism
- Vertically integrated I/O addresses all workload bottlenecks, including data loading