OCP Global Summit
October 18, 2023 | San Jose, CA

Siamak Tavallaei
Chief Systems Architect
CXL Advisor to the Board, CXL Consortium

Scale-up and Scale-out Challenges for a Disaggregated AI/ML Infrastructure
CXL as the standard protocol for data movement through coherent memory and for the associated management and system composability
Photonics interconnect for extended connectivity

Abstract
An Exa-FLOP AI/ML/HPC system requires 256 xPUs at 4 Tera-FLOP each. Such an xPU may consist of a tightly connected constellation of chiplets. To feed the pipeline of a 4-Tera-FLOP compute engine, we need large SRAM backed by over 80 GiB of high-bandwidth DRAM. To operate such an xPU, we may require 1000 W of power and cooling. Each Node may hold eight such xPUs (32 Nodes at 8 kW each). Each Rack may hold four such Nodes (32 kW). These Nodes need to connect to the public world and be interconnected to each other (parameter exchange) via an efficient fabric. Enabling technologies include PCIe, CXL, HBM, UCIe, and photonics. Four x16 CXL 3.0 ports running at 64 GT/s offer 1 TB/s of aggregate peak bandwidth. A photonics interconnect offers a degree of freedom on distance and placement of compute and memory components while avoiding hop latency. Software/hardware codesign allows data to be present at the right xPU ahead of execution to keep the pipelines full.
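As a rough sanity check of the 1 TB/s figure quoted above (a sketch added here, not part of the original deck), the aggregate number follows from lane count, signaling rate, and port count, ignoring encoding and protocol overhead:

    # Peak bandwidth of four x16 CXL 3.0 ports at 64 GT/s, counting both directions.
    GT_PER_S = 64      # per lane, per direction (PAM4 signaling)
    LANES = 16         # x16 link
    PORTS = 4          # four interleaved ports

    per_dir_GBps = GT_PER_S * LANES / 8        # 128 GB/s per direction per port
    bidir_GBps = per_dir_GBps * 2              # 256 GB/s Tx+Rx per port
    aggregate_GBps = bidir_GBps * PORTS        # 1024 GB/s, i.e. ~1 TB/s aggregate
    print(per_dir_GBps, bidir_GBps, aggregate_GBps)   # 128.0 256.0 1024.0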
Scope
While focusing on AI/ML workloads, this presentation addresses a general-purpose system that serves multiple use cases, algorithms, and frameworks, in response to the fact that Artificial Intelligence and Machine Learning (AI/ML) frameworks are in flux and the associated systems are rather complex and expensive. While the compute elements may be optimized for different algorithms, and software programmers may apply different communication techniques for intermediate data-movement steps such as parameter exchange (all-reduce, all-gather, all-to-all, etc.; see the data-movement sketch below), we describe the techniques that a system-level solution should cover for both data- and model-parallelism.

Scope
Other sessions cover computational elements and software algorithms. This talk covers memory and interconnect, and how to physically realize the system and manage it.
- Efficient data movement (storage, network, memory, compute element)
- For general-purpose, we need balance
- For special-purpose, we need flexibility
- Ease of maintenance is key for at-scale deployment
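To make the cost of the parameter-exchange step concrete, here is a small illustration (assumed model size and ring all-reduce algorithm; these numbers are not from the deck):

    # Bytes each xPU must move for one ring all-reduce of the model gradients.
    def ring_allreduce_bytes_per_xpu(param_bytes: float, n: int) -> float:
        """Each xPU sends (and receives) 2*(n-1)/n of the payload."""
        return 2 * (n - 1) / n * param_bytes

    grad_bytes = 10e9 * 2       # assumed: 10B parameters in FP16 -> 20 GB of gradients
    n_xpus = 256
    vol = ring_allreduce_bytes_per_xpu(grad_bytes, n_xpus)   # ~40 GB per xPU per step
    link_GBps = 128             # one x16 PCIe 6.0 / CXL port, per direction
    print(f"{vol / 1e9:.1f} GB per xPU, ~{vol / 1e9 / link_GBps:.2f} s on a {link_GBps} GB/s link")

All-gather moves a roughly comparable volume, while all-to-all patterns stress bisection bandwidth instead of a single ring; either way, the fabric has to absorb tens of gigabytes per xPU per step.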
Scope
The cure for uncertainty is flexibility! General-purpose solutions form the high-volume spike, whereas there are many different special-purpose solutions; they form the long tails!

Outline
- Compute: Exa-FLOP (256 xPUs at 4 Tera-FLOP each)
- Memory: SRAM backed by high-bandwidth DRAM and Bulk DRAM
- Interconnect: tightly connected constellation of chiplets, chips, and systems
- Infrastructure: power, cooling, and mechanical enclosure (Racks, Clusters, Rows, and Datacenters)
- Public network (front-end); private network (back-end)
- Parameter and data exchange via an efficient fabric
- Enabling technologies (PCIe, CXL, HBM, UCIe, and photonics)
- Photonics interconnect for the required bandwidth (TB/s) while maintaining low latency
- Composing systems over a distance (compute, memory, storage)
- Software/hardware codesign for just-in-time data placement

Key points
- Scale (distance, density, power/cooling, management, maintenance)
- Bandwidth, latency (hop count)
- Connectivity (how many places to connect?)
- Data movement (communication): all-to-all, all-reduce, all-gather
- Time
- Energy

Computing
- Compute element, memory, storage: the rest is interconnect for moving data!
- Point-to-point, fabric, networking
- Topology, congestion
- Blocking pathways (a la patch panel) vs. non-blocking
1 Exa-FLOP of Computing
- 256 xPUs of 4 Tera-FLOPs each
- Organization (8 x 4 x 8): 8 Racks of four Nodes each; 32 Nodes of eight xPUs each
- These FLOPs may be "compressed": dynamic range, resolution, precision
- Operand width (bits of data to move, to operate on, to store): integer; double-precision floating point; FP32, FP16, FP8, ... (see the sketch below)
- Consumed energy for computation and data movement
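The operand-width bullet is where precision shows up in the data-movement budget. A rough illustration (the reuse factor below is an assumption, not a number from the deck):

    # Memory traffic needed to keep one 4 Tera-FLOP xPU busy at different operand widths,
    # for a workload that performs a fixed number of operations per element fetched.
    PEAK_FLOPS = 4e12          # per xPU, from the deck
    OPS_PER_ELEMENT = 50       # assumed reuse (FLOPs per element fetched)

    for name, bytes_per_elem in [("FP64", 8), ("FP32", 4), ("FP16", 2), ("FP8", 1)]:
        elements_per_s = PEAK_FLOPS / OPS_PER_ELEMENT
        bw_needed_GBps = elements_per_s * bytes_per_elem / 1e9
        print(f"{name}: ~{bw_needed_GBps:.0f} GB/s to stay compute-bound")

Halving the operand width halves both the bandwidth needed to feed the pipeline and the energy spent moving each operand, which is why "compressed" FLOPs matter as much for the interconnect as for the ALUs.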
Memory for 1 Exa-FLOP of Computing
- 96 GiB of HBM3 (four 8-high stack devices with 24 Gb memory technology)
- 768 GiB of Bulk DRAM memory (with 24 Gb memory technology), via one of:
  - 4 x 192 GiB DIMMs (dual-rank, dual-die package, x4 DRAMs, in four sockets)
  - 8 x 96 GiB DIMMs (dual-rank, single-die package, in eight sockets)
  - 8 x 96 GiB LPDDR devices (soldered down)
- 2 TiB of pooled memory: 8 x 256 GiB DIMMs (four DDR5 buses, 2 DPC)
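The capacities above cross-check with simple arithmetic (an added sketch; the device counts are the slide's, the math is not):

    # Each Bulk DRAM option lands at 768 GiB per xPU node; pooled memory at 2 TiB.
    hbm_GiB = 4 * 8 * 24 / 8        # four 8-high stacks of 24 Gb dies -> 96 GiB
    bulk_a  = 4 * 192               # 4 x 192 GiB DIMMs                -> 768 GiB
    bulk_b  = 8 * 96                # 8 x 96 GiB DIMMs                 -> 768 GiB
    bulk_c  = 8 * 96                # 8 x 96 GiB LPDDR devices         -> 768 GiB
    pooled  = 8 * 256               # 8 x 256 GiB DIMMs                -> 2048 GiB = 2 TiB
    print(hbm_GiB, bulk_a, bulk_b, bulk_c, pooled)   # 96.0 768 768 768 2048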
Enabling Technologies
- DDR5 Bus
- CXL over UCIe (standard packaging, advanced packaging)
- CXL over PCIe 6.0
- HBM3
- InfiniBand, Ethernet (RoCE)
- NVLINK

Bandwidth Comparison

x16 CXL over PCIe
  PCIe Revision                         5.0        6.0
  Lanes per Link                        4          4
  Bit-rate per direction (GT/s)         32 (NRZ)   64 (PAM4)
  Granular Link count                   4          4
  Width (lanes)                         16         16
  BW per Module (Gbps)                  128        256
  BW (Gbps)                             512        1024
  BW per direction (GB/s)               64         128
  BW bi-directional, Tx+Rx (GB/s)       128        256
  Interleave count                      4          4
  Interleaved Bandwidth, Tx+Rx (GB/s)   512        1024

NVLINK
  Generation                            v4 (18 links)
  Lanes per Link                        2
  Bit-rate per Lane (Gbps)              100 per direction
  Module count (Switch, GPU)            18
  Width (lanes)                         36
  BW per Module (Gbps)                  200
  BW (Gbps)                             3600
  BW per direction (GB/s)               450
  BW bi-directional, Tx+Rx (GB/s)       900

HBM
  Generation                            3          3+
  Signals per Channel                   64         64
  Bit-rate per signal (Gbps)            8          9.375     (single-ended)
  Bus Width (bits)                      1024       1024
  BW per Bus (Gbps)                     8192       9600
  BW bi-directional (GB/s)              1024       1200

UCIe Examples
  Packaging (UCIe bit rate)             Std 16     Std 8     Adv 8
  Bits per Module                       16         16        64
  Bit-rate per direction (GT/s)         16         8         8       (2, 4, 8, 12, 16 defined)
  Module count                          8          16        16      (8 or 16)
  Width (bits)                          128        256       1024
  BW per Module (Gbps)                  256        128       512
  BW (Gbps)                             2048       2048      8192
  BW bi-directional (GB/s)              256        256       1024    (single-ended)

DDR5 DIMM Examples
  Number of Stacks per Device           1          2         2       2
  Memory Technology (Gb)                16         16        24      16
  Capacity per DRAM Device (Gib)        16         32        48      32
  Bits per DRAM Device                  4          4         4       4
  DIMM Bus Width (bits)                 64         64        64      64
  On-DIMM Devices for one Rank          16         16        16      16
  Number of Ranks                       2          2         2       4
  Devices per DIMM (without ECC)        32         32        32      64
  DIMM Capacity (GiB)                   64         128       192     256
  DRAM bus bit-rate (MT/s)              5600       5600      5600    5200
  DRAM bus peak bandwidth (GB/s)        44.8       44.8      44.8    41.6
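The GB/s rows above all come from the same width-times-rate arithmetic. A small helper (added here as an illustration; encoding and protocol overheads are ignored):

    # Peak link bandwidth from lane count and per-lane signaling rate.
    def link_bw_GBps(lanes: int, gbps_per_lane: float, both_directions: bool = False) -> float:
        bits_per_s = lanes * gbps_per_lane * (2 if both_directions else 1)
        return bits_per_s / 8

    print(link_bw_GBps(16, 64))             # x16 PCIe 6.0 / CXL: 128 GB/s per direction
    print(link_bw_GBps(16, 64, True) * 4)   # four interleaved x16 ports: 1024 GB/s Tx+Rx
    print(link_bw_GBps(18 * 2, 100, True))  # NVLink v4, 18 links x 2 lanes: 900 GB/s Tx+Rx
    print(link_bw_GBps(1024, 9.375))        # HBM3+ 1024-bit bus: 1200 GB/s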
Synchronizing Computation and Communication
- Execution time vs. communication time
- Parallelism
- Pipelining to overlap communication and computation (toy timing model below)
- Software/hardware co-design
- Preparing and staging the data ahead of execution (just-in-time delivery)
- Parameter servers
- Removing time bubbles
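A toy timing model of the overlap idea (added illustration; the step and exchange times are assumptions, not measurements from the deck):

    # Per-step time with and without overlapping parameter exchange and compute.
    compute_s = 0.50    # assumed time to compute one training step on an xPU
    comm_s    = 0.31    # assumed time to exchange that step's gradients

    serial_step  = compute_s + comm_s       # no overlap: 0.81 s per step
    overlap_step = max(compute_s, comm_s)   # perfect overlap: 0.50 s per step
    print(serial_step, overlap_step)        # the bubble disappears once comm hides under compute

Staging data just-in-time (parameter servers, prefetching the next shard while the current one is being consumed) is what keeps the communication term hidden under compute in practice.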
Interconnect Attributes
- Connectivity (can I get there?)
- Throughput (how many ports? how much BW? fat pipes)
- Latency (response time?)
- Physical attach points (shoreline, robustness, pluggable, serviceable)
- Topology (efficient pathways): hop count, bandwidth, fault-tolerance, quality of service

Interconnect Considerations
- Throughput (fat pipes)
  - Signaling (resolution, frequency, and pulse-amplitude modulation, PAM)
  - Bit rate (serialization, time-division multiplexing, TDM)
  - Link width (lane count, space-division multiplexing, SDM; wavelength-division multiplexing, WDM)
  - Bandwidth (SDM, TDM, PAM, WDM, ...) (see the sketch below)
- Latency (hop count, distance)
- Protocol efficiency
- Connectivity (how many places can I go? how many ports?)
- Reliability: bit-error rate, EMI/EMC/noise immunity, signal integrity, signal loss
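How those multiplexing dimensions compose into aggregate bandwidth, with assumed example values (added sketch, not deck data):

    # Aggregate link bandwidth as a product of the multiplexing dimensions.
    lanes         = 16    # SDM: parallel lanes or fibers
    wavelengths   = 4     # WDM: optical carriers per lane (1 for an electrical link)
    symbol_rate_G = 32    # TDM: Gbaud per carrier
    bits_per_sym  = 2     # PAM4: 2 bits per symbol

    aggregate_Gbps = lanes * wavelengths * symbol_rate_G * bits_per_sym
    print(aggregate_Gbps / 8, "GB/s per direction")   # 512.0 GB/s for these example values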
Topology / Fabric Computing
- It is not sufficient to account for bandwidth; latency is important as well!
- The number of independent connections is key to reducing hop count!
- Distance, repeaters, switches, cables
- Speed of light and time of flight! (worked example below)
- Switches (packet-switching, circuit-switching)
- Latency impact of hopping
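A worked example of time of flight versus hop latency (added illustration; all values are assumptions for the sake of the arithmetic):

    # End-to-end latency across a row of racks: propagation delay plus switch hops.
    distance_m        = 30     # assumed cable run
    ns_per_m          = 5      # ~5 ns/m in fiber or copper (signal travels at ~2/3 c)
    switch_hops       = 2      # assumed switches traversed
    switch_latency_ns = 300    # assumed per-hop switching latency

    total_ns = distance_m * ns_per_m + switch_hops * switch_latency_ns
    print(total_ns, "ns")      # 750 ns: the two hops dominate the 150 ns of pure time of flight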
Connectivity
- On-die, on-package, chip-to-chip
- Intra-chassis, across chassis in a Rack
- Datacenter-scale
- Enabling technologies: CXL, UCIe, NVMe, PCIe, Ethernet (RoCE), InfiniBand, Photonics
- Within OCP: Server Project (ODSA, DC-MHS, OAI, Extended Connectivity), Networking, Photonics

Interconnect
- Public network (interacting with external data and producing results)
- Computation
- Parameter exchange
- The concept of a torus (each node can be simple or a hierarchy)
- A full mesh (simple nodes or a hierarchy) (see the comparison sketch below)
- Balanced for connectivity, throughput, latency, deployment, and maintenance
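To make the torus-versus-full-mesh trade-off concrete, a small comparison (added sketch; the 8 x 8 x 4 arrangement is one possible way to place the deck's 256 xPUs):

    # Link count and worst-case hop count: full mesh vs. 3D torus over 256 xPUs.
    n = 256
    mesh_links    = n * (n - 1) // 2             # every pair directly connected
    mesh_max_hops = 1

    torus_dims     = (8, 8, 4)                   # 8 x 8 x 4 = 256 nodes
    torus_links    = 3 * n                       # 6 links per node, each shared by two nodes
    torus_max_hops = sum(d // 2 for d in torus_dims)   # wrap-around links halve each dimension

    print(mesh_links, mesh_max_hops)             # 32640 links, 1 hop
    print(torus_links, torus_max_hops)           # 768 links, up to 10 hops

The full mesh minimizes hop count at the cost of a quadratic link count; the torus keeps per-node connectivity constant but pays in hops, which is the balance the final slide asks the interconnect to strike.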