Advantages and Use Cases for Adding the CXL Interface to DPUs
Pavel (Pasha) Shamis, Sr. Principal Engineer
Kshitij Sudan, Director, Storage & Accelerator Segment Marketing
Arm Inc.

Enabling technologies for disaggregation and heterogeneity

[Figure: timeline placing disaggregation and heterogeneous-compute technologies on a Past/Today/Future axis with maturity stages Concept, Arch Exploration, and In Execution/Deployment. Storage disaggregation (NVMe-oF) is complete; accelerator disaggregation (GPUs, FPGAs, TPUs, Habana) is in execution/deployment; memory disaggregation (DRAM, SCM) is in architecture exploration; storage+memory convergence (SDIO) is at the concept stage. Heterogeneous-compute entries include FPGAs, SmartNICs, video acceleration, TPUs, Gen1 computational storage, multi-tenant GPUs, and CXL Type-1, Type-2, and Type-3 (memory) devices.]

Drivers for the CXL interface on DPUs
- Domain-specific accelerators are gaining prominence in the datacenter
- Line-rate processing of network traffic is critical for certain use cases
  - e.g., compression, de-dupe, encryption, streaming data processing
- Near-storage/near-memory processing technologies are advancing
  - Computational storage, in-memory processing
- All of these use cases need larger DPU memory at low cost
- The CXL interface addresses this need

Emerging datacenter storage & memory architecture

[Figure: Compute Servers 0..N, each with general-purpose compute, a server SoC with DRAM, direct-attached storage (flash), and an initiator DPU (NVMe-oF) carrying a CXL memory expansion used as an OS page cache. Over the datacenter network they reach a warm caching tier (a target DPU with a PCIe switch, NVMe flash, and CXL memory expansion), a storage/cold tier (a NIC SoC with DRAM and PCIe HBAs), and a disaggregated DRAM pool in which a CXL controller bridges a CXL interface to DDR interfaces.]
Sharing DPU-attached memory with the host
- Advantages
  - Lower pin count on the host
  - Lower cost/bit
  - Techniques for page/data management over far memory have been described by large cloud providers (see the sketch after this slide):
    - TMO: Transparent Memory Offloading in Datacenters, J. Weiner et al., ASPLOS 2022
    - Software-Defined Far Memory in Warehouse-Scale Computers, A. Lagar-Cavilla et al., ASPLOS 2019
    - First-generation Memory Disaggregation for Cloud Platforms, H. Li et al., arXiv 2022
- Concerns
  - Stranded-ness of DRAM directly attached to the DPU

[Figure: Compute Server 0, where the initiator DPU's (NVMe-oF) CXL memory expansion is shared with the host server SoC over CXL/PCIe, with a further CXL link out to the memory pool.]
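On Linux, host-visible CXL Type-3 memory such as a DPU-attached expander typically surfaces as a CPU-less NUMA node, so the host can place pages on it with ordinary NUMA APIs. Below is a minimal sketch using libnuma; the node number 1 is an assumption, and on a real system the CXL node would be discovered from the firmware-described topology.

```c
/* Sketch: explicit placement of pages on a CXL-attached memory node.
 * Assumes the expander appears as NUMA node 1 (platform-specific).
 * Build: cc cxl_place.c -lnuma */
#include <numa.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not available on this system\n");
        return 1;
    }

    int cxl_node = 1;          /* assumption: CPU-less CXL expander node */
    size_t len = 64 << 20;     /* 64 MiB */

    /* Back this allocation with pages from the CXL node only. */
    void *buf = numa_alloc_onnode(len, cxl_node);
    if (buf == NULL) {
        fprintf(stderr, "allocation on node %d failed\n", cxl_node);
        return 1;
    }

    memset(buf, 0, len);       /* fault the pages in on the far node */
    printf("placed %zu bytes on node %d\n", len, cxl_node);

    numa_free(buf, len);
    return 0;
}
```

The cloud-provider papers cited above instead rely on kernel-driven tiering that demotes cold pages to such a node automatically; the sketch only shows the explicit-placement end of that spectrum.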
Reducing stranded-ness with memory pools
- Expands the DPU's ability to leverage pooled memory
- Resolves the stranded-ness challenge
- Future CXL specifications are discussing the potential to share memory at the memory pool
  - This raises security challenges

[Figure: Compute Server 0 as before, but with both the host and the initiator DPU linked over CXL to the disaggregated DRAM pool, where a CXL controller bridges the CXL interface to DDR interfaces.]
Appliance use case for DPUs with CXL
- Advantages
  - Builds on the DPU-as-target use case
  - Uses large CXL-connected memory capacity as a "conditioning tier"
  - Lower pin count for DPU memory
  - Lower DPU memory cost/bit
- Concerns
  - Increased BOM cost from the CXL interface and the CXL memory expansion card

[Figure: a target DPU with CXL memory expansion, fronting NVMe flash drives through a PCIe switch.]

DPU + NVDIMM Storage Appliance Study
- BlueField
  - 16-core SoC for off-chip in-memory processing
  - Direct access to memory and attached storage
  - Integrated into an InfiniBand adapter
  - Runs a full Linux/IB stack
- NVDIMM-N
  - Installed with a battery pack; fits into a DDR module slot
  - Appears to Linux as a distinct PMEM device type
  - Using Linux DAX, it appears as a file system but behaves as memory when files are mmap'd (see the sketch below)
  - Syncs files to the on-module battery-backed NVM on processor request or on emergency power loss, so data is persistent
- HPE Apollo 70 client
- OpenSHMEM programming model

Grodowitz, Megan, Pavel Shamis, and Steve Poole. "OpenSHMEM I/O Extensions for Fine-Grained Access to Persistent Memory Storage." Smoky Mountains Computational Sciences and Engineering Conference, Springer, Cham, 2020.
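The DAX behavior described above maps onto standard POSIX calls: a file on a DAX-mounted PMEM file system is mmap'd and then accessed with ordinary loads and stores, and msync() (or user-space cache flushes via a library such as libpmem) makes the stores durable on the module. A minimal sketch; the mount point /mnt/pmem0 is a hypothetical path.

```c
/* Sketch: load/store access to an NVDIMM-backed file via Linux DAX.
 * Assumes a file system mounted with -o dax at /mnt/pmem0 (hypothetical). */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    const size_t len = 4096;
    int fd = open("/mnt/pmem0/record.dat", O_CREAT | O_RDWR, 0644);
    if (fd < 0) { perror("open"); return 1; }
    if (ftruncate(fd, len) != 0) { perror("ftruncate"); return 1; }

    /* With DAX there is no page-cache copy: the mapping goes straight
     * to the NVDIMM media. */
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    strcpy(p, "persistent record");   /* a plain store into the file */

    /* Processor-requested sync: make the stores durable on the module. */
    if (msync(p, len, MS_SYNC) != 0) { perror("msync"); return 1; }

    munmap(p, len);
    close(fd);
    return 0;
}
```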
Edge Sort Total Runtime

[Figure: weak scaling of total runtime for graph edge decomposition, plotting slowdown relative to 2 PEs against the number of PEs for the Fspace and POSIX file-I/O versions of the read and generate/write apps, with ideal (no slowdown) and linear-slowdown reference lines.]

- POSIX file I/O on NFS degrades linearly as the number of processes increases, even though all accesses are parallel and to non-overlapping file regions
- The read app (App1) also writes back after the sort, so its performance degrades worse, as expected
- Fspace file I/O over the same network fabric shows only a small performance degradation as the number of processes increases (the underlying model is sketched below)
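The Fspace results above build on OpenSHMEM's one-sided model, in which each PE reads and writes remote symmetric memory directly rather than funneling data through a file server. The Fspace I/O extensions themselves are defined in the Grodowitz et al. paper cited above; the sketch below shows only the baseline OpenSHMEM pattern they extend.

```c
/* Sketch: baseline one-sided OpenSHMEM communication, the model the
 * Fspace I/O extensions build on. Build with an OpenSHMEM wrapper such
 * as oshcc and launch with oshrun -np <N>. */
#include <shmem.h>
#include <stdio.h>

int main(void)
{
    shmem_init();
    int me   = shmem_my_pe();
    int npes = shmem_n_pes();

    /* Symmetric allocation: the same remotely accessible buffer exists
     * on every PE. */
    long *slot = shmem_malloc(sizeof(long));
    *slot = -1;
    shmem_barrier_all();

    /* Each PE writes its rank directly into its right neighbor's
     * memory, with no intermediary server in the path. */
    long val = me;
    shmem_long_put(slot, &val, 1, (me + 1) % npes);

    shmem_barrier_all();   /* completes all puts before reading */
    printf("PE %d received %ld\n", me, *slot);

    shmem_free(slot);
    shmem_finalize();
    return 0;
}
```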
Key challenges emerging in datacenter memory and storage hierarchies
- Cloud-vendor-specific programmability
- New memory technologies
- Increasing network costs
- Increasing compute heterogeneity
- Minimizing software overheads for emerging storage devices
- New price/performance points and interfaces for storage (SCM, CXL)
- The need to minimize data movement
- Improving efficiency via acceleration

Watch the companion talk in session B-201: Extending DPUs to Enable Software-Defined I/O (SDIO)

Thank You!