Scaling of Memory Performance and Capacity with CXL Memory Expander
August 2022 | Samsung Electronics Co., Ltd.
S.J. Park, K.-S. Kim, H. Kim, J. So, J. Ahn, J. Jung, I. Yun, S. Ryu, W.-J. Lee, J.-G. Lee, H.-Y. Ryu, C.Y. Lee, J. Prout, K.-C. Ryoo, S.-J. Han, M.-K. Kook, J.S. Choi, J. Gim, Y.S. Ki, S. Ryu, C. Park, D.-G. Lee, J. Cho, H. Song, and J.Y. Lee

Agenda
- Industry Trends and Challenges
- Introduction of CXL (Compute Express Link)
- CXL Memory Expander Features
- SMDK: Unified Software Solution for CXL
- Application Benchmark Test Results
- Summary and Future Plan

Industry Trends and Challenges
Artificial Intelligence, Big Data, Edge, Cloud, 5G
- Massive demand for data-centric technologies and applications
- Memory bandwidth and density are not keeping up with increasing CPU core count
- Need a next-gen interconnect for heterogeneous computing and server disaggregation

The Fast-Growing Computing Workloads
[Figure: timeline from 1960 to 2020, split into a "First Era" and a "Modern Era", with the models NETtalk, Perceptron, ALVINN, RNN for Speech, TD-Gammon v2.1, LeNet-5, BiLSTM for Speech, Deep Belief Nets & layer-wise pretraining, AlexNet, ResNets, GPT-3, LaMDA]
- Large-scale adoption of AI and ML
- Smarter devices
- Hyper-connected networks
- Super-intelligent services
- Digital transformation
- Pandemic

Evolution of Hyperscale Computing Environment
From Converged to Composable Architecture
[Figure: three architectures. Converged Architecture: TOR-based rack-scalable architecture (servers, TOR switches, storage, network & storage switch). Hyper-Converged Architecture: server & storage combined architecture (SmartNIC, e.g. MS Catapult, AWS Nitro). Disaggregated/Composable Architecture: pooled architecture of memory, compute, and storage. Each server holds CPU, DRAM, GPU, and SSD. Challenges labeled on the figure: Network Challenge, Interconnect Challenge, Divergence Challenge.]

The Rising Need for Better Connectivity
[Figure: interconnect scales: SoC Interconnect (die/package), Processor Interconnect (node), Data Center Interconnect (data center), Customer Interconnect (mobile/broadband)]
- A new class of interconnect for device connectivity in the era of AI
- Can
  be tailored and optimized for various AI applications

CXL: Solution for the Era of HPC
Key Features of CXL Interface
- Cache Coherence
- Connectivity
- Byte Addressable
- Low Latency
CXL as the core of composable computing infrastructure

CXL Features
CXL is a high-performance, low-latency protocol that leverages the PCIe physical layer.
[Figure: processor, PCIe connector, PCIe channel, PCIe card or CXL card]
- High-speed and low-latency interconnect
- Leverages PCIe physical layer (PCIe 5.0, PCIe 6.0)
- Supports various types of memories (volatile, non-volatile)
- CPU and CXL device memory coherency
- Supports switching and memory pooling
- Supports link-level integrity and data encryption
- Open standard (non-proprietary)
- Broad industry support in the CXL consortium
- Regular specification updates (CXL 1.1, CXL 2.0, CXL 3.0)

CXL Use Cases (1/2)
Capacity and Bandwidth Expansion
[Figure: Capacity Expansion: IMDB servers with x TB DRAM per CPU (CPU 0, CPU 1) compared with an IMDB server with y TB DRAM plus z TB CXL memory per CPU. Bandwidth Expansion: IMC servers with x GB DRAM per CPU compared with IMC servers with y GB DRAM plus z GB CXL memory per CPU.]

CXL Use Cases (2/2)
Tiering and Pooling
[Figure: Memory Tiering: IMC servers with x TB DRAM per CPU compared with an IMC server with y TB DRAM plus z TB CXL memory per CPU. Memory Pooling: IMC servers with x TB DRAM per CPU attached to a memory box of y TB CXL modules.]

CXL Memory Expansion Solution
[Figure: conventional system: CPU with 512GB DDRx DIMMs, 8x 2DPC (DIMMs per channel), max. 8TB for 1 CPU. CXL system: the same 8x 2DPC DDRx configuration plus 512GB memory expanders (Mem Ex) on 8x CXL links, max. 16TB for 1 CPU.]
Double the capacity of conventional memory
Note: Max capacity varies with system configurations

CXL Memory Expander
New Solution for Memory Dominant Applications
- Data Acceleration
- High Capacity/Bandwidth
- Enhanced Security/RAS
[Figure: CPU, GPU, and SmartNIC connect over the CXL interface to the CXL Memory Expander, which contains a CXL controller and DDR memory]

CXL Memory Expander Line-up
Built with FPGA and ASIC Controller (as of August 2022)

                     FPGA ('21)          ASIC ('22)
  Host (Controller)  PCIe 3.0 (x16)      PCIe 5.0 (x8)
  Media (DRAM)       DDR4 3200, 128GB    DDR5 4800+, 512GB

CXL Memory Expander (1/3)
Solution Overview
[Figure: E3.S form factor module in a 2T enclosure: CXL controller, DDR5 DRAMs, PMICs*, SPI, PCIe x8 connector, debug ports]
*Bottom-side

CXL Memory Expander (2/3)
Product Features
- Form Factor: EDSFF (E3.S)
- Media: DDR5 4800
- Module Capacity: Max 512 GB
- CXL Link Width: x8
- Maximum CXL Bandwidth: 32 GB/s (PCIe 5.0)
- Other Features: RAS, interleaving, diagnostics, etc.
- Availability: Q3'22 for evaluation/testing

CXL Memory Expander (3/3)
Supported Features
- CXL 2.0
- Device Type: Type 3
- Support for viral and data poisoning
- Memory error injection
- Multi-symbol ECC
- Media scrubbing
- Post package repairs (hard/soft)

CXL 2.0 Switching Benefits
*Image Source: CXL Consortium

SMDK*, Unified Interface for Memory
*Scalable Memory Development Kit
SW development kit to enable a Software-Defined Memory system on heterogeneous memories
[Figure: SMDK software stack. Software: HPC applications (ML/AI, IMDB, big data, etc.); plugin: SMDK Allocator with Compatible Path and Optimization Path; kernel: SMDK Kernel with Intelligent Tiering Engine, Memory Pool Management, DRAM Partition, CXL Partition. Hardware: CPU and DRAM on the server main board, CXL Memory Expander.]
- Intelligent Tiering Engine supports memory tiering scenarios by priority, capacity, bandwidth, and so on
- Memory Pool Management supports scalability reflecting memory request status and system resources
- Memory Partitioning allows logical memory views for heterogeneous physical DRAM and CXL memory
- Two selectable paths, Compatible Path and Optimization Path, without or with modification of application SW

Benefits of SMDK
- Client Experience: transparent as well as optimized memory use
- Differentiated Cloud Performance
- CXL Ecosystem: OSS for the CXL industry and research field
[Figure: SMDK as a unified interface for memory over main memory (DDR5, LPDDR5, etc.), CXL Type 2 (accelerator + memory expander), and CXL Type 3 (memory expander)]
Unified SW Solution: full-stack SW all about heterogeneous memory systems
SMDK is available as open source on GitHub: https:/

Setup
Configuration of Test Bed
[Figure: test bed. SW: BM (ML/AI: BERT, NASNet; IMDB: Memcached, Redis; MLC, STREAM), container, SMDK. HW: CXL CRB with three memory configurations: DDR5 4800 only; DDR5 plus CXL DRAM (FPGA); DDR5 plus CXL DRAM (ASIC). Memory expander with EDSFF riser card.]

Memory Benchmark Test Results
Comparable Performance with DDR Memory
Normalized bandwidth:

                MLC 1:1 R/W    STREAM Copy
  DDR           1.0            1.0
  CXL-FPGA      0.19           0.19
  CXL-ASIC      0.88           0.92

Chart annotations: 4.6x, 4.7x
[Figure: ML/AI processes #1 to #N, each with its own test sets, run on CPU cores against DDR up to DDR bandwidth saturation; additional processes #N+1 to #N+K run against CXL DRAM up to CXL bandwidth saturation]

System Test Results (ML/AI)
ML/AI Applications (BERT* & NASNet**)
*Bidirectional Encoder Representations from Transformers
**Neural Architecture Search Network
[Figure: two bar charts, BERT and NASNet, of inferences per minute (normalized) for the configurations DDR, 1-CXL, 2-CXL, 4-CXL, and k-CXL, with series CXL-FPGA and CXL-ASIC and a "Theoretical max." marker on each chart. Values as extracted, first chart: 1.00, 1.12, 1.16, 1.11, 1.00, 1.17, 1.30, 1.45, 1.79; second chart: 1.00, 1.26, 1.39, 1.33, 1.01, 1.35, 1.64, 2.00, 2.89. Annotated gains: +100%, +45%.]
(See appendix for detailed test conditions)

System Test Results (IMDB)
IMDB Redis* Memory Usage (Scale-up vs Scale-out)
*Remote dictionary server
[Figure: Single node (DDR + CXL FPGA): System #1 runs the client and Redis with DDR5 (32GB) plus CXL memory (64GB) over a CXL link; set 60GB, get 60GB. 2-node cluster (DDR x 3): System #1 and System #2 connected by Ethernet, with a client and a cluster of three Redis instances, each with DDR5 (32GB); set 60GB, get 60GB.]
[Figure: performance (MB/s) for SET and GET at value sizes 128B, 4KB, and 1MB, single node (DRAM + CXL) vs 2-node cluster (DRAM). The bar values are run together in the source as 304556992749665949189. Annotated ratios: 2.86x, 2.64x.]
(See appendix for detailed test conditions)

A Proven Memory Expansion Solution
- Increasing System Memory Capacity
- Widening Memory Bandwidth
- Supporting RAS/Security based on Memory Controller
Callouts: 83% increase, 2X increase, RAS, Security

Summary and Future Plan
- AI and the pandemic drive demand for memory bandwidth and capacity, and the new interconnect standard CXL allows expansion of memory
- Samsung developed the industry's first ASIC-based 512GB CXL memory expander, which will be available for early evaluation this quarter
- Memory intensive
  applications such as IMDB and AI/ML have been tested on the CXL memory expander with open-source software, SMDK
- Samsung to cooperate further on CXL 3.0 and beyond, and provide next-gen memory solutions like memory disaggregation, SDM*, and more
*Software-defined memory
[Figure: application areas: Enhanced Data Service; AI/ML (NLP, recommendation); Edge Computing]
Industry-First CXL™ Memory Expanders

Appendix
Test Conditions (ML/AI and IMDB)

ML/AI
For BERT and NASNet
- TensorFlow (CPU) = 1.11.0+, Python 3.7, NumPy 1.20.0
For BERT
- Multi-process, 3 cores/process, batch-size: 128, max_seq_num: 256, num-test-data/process: 512
- dataset=CoLA
- do_train=true, do_eval=true, data_dir=$GLUE_DIR/CoLA
- vocab_file=vocab.txt
- init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt
- max_seq_length=128, train_batch_size=32
- learning_rate=2e-5, num_train_epochs=3.0
For NASNet
- Multi-process, 3 cores/process, batch-size: 100, eval_image_size: 236, num-test-data/process: 200
- dataset_name=imagenet, num_preprocessing_threads=4
- labels_offset=0, model_name=inception_v3
- preprocessing_name=inception_v3
- moving_average_decay=None, quantize=False, use_grayscale=False

IMDB
For scale-up vs scale-out
Redis-server: master
    cluster-enabled yes
    cluster-node-timeout 300000
    save
    stop-writes-on-bgsave-error yes
    rdbcompression yes
    rdbchecksum yes
    rdb-del-sync-files no
    repl-diskless-sync no
    rdb-del-sync-files no
    replica-serve-stale-data yes
    replica-read-only yes
    repl-diskless-sync-delay 5
    repl-diskless-load disabled
    repl-disable-tcp-nodelay no
    replica-priority 100
    client-output-buffer-limit replica 0 0 0
    maxclients 1000000
    maxmemory-policy noeviction
    maxmemory-samples 10
    maxmemory-eviction-tenacity 100
    repl-diskless-sync no  # master-replica disk-based sync
    rdb-del-sync-files no
    replica-serve-stale-data yes
    replica-read-only yes
    replica-priority 100
    client-output-buffer-limit replica 0 0 0
    io-threads 4
    io-threads-do-reads yes
Redis-server: replica
    save
    port 6380
    replicaof 127.0.0.1 6379
    replica-read-only yes
    stop-writes-on-bgsave-error yes
    rdbcompression yes
    rdbchecksum yes
    rdb-del-sync-files no
    repl-diskless-sync no
    rdb-del-sync-files no
    replica-serve-stale-data yes
    replica-read-only yes
    repl-diskless-sync-delay 5
    repl-diskless-load disabled
    repl-disable-tcp-nodelay no
    replica-priority 100
    client-output-buffer-limit replica 0 0 0
    maxclients 1000000
    maxmemory-policy noeviction
    maxmemory-samples 10
    maxmemory-eviction-tenacity 100
    replica-lazy-flush no
    lazyfree-lazy-user-del no
    lazyfree-lazy-user-flush no
    oom-score-adj no
    oom-score-adj-values 0 200 800
    disable-thp yes
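
The master and replica settings listed above are ordinary redis.conf directives. As a minimal sketch that is not part of the original slides, the Python snippet below parses an excerpt of the replica directives and reads back a few of them, for example that the replica listens on port 6380 and follows a master at 127.0.0.1:6379, and that eviction is disabled. The function name parse_redis_conf and the SAMPLE_REPLICA string are hypothetical, and treating everything after a "#" as a comment is a simplification that matches the slide layout rather than Redis's own parser.

```python
def parse_redis_conf(text):
    """Parse redis.conf-style text into {directive: [args...]}.

    Simplified, illustrative parser: the last occurrence of a directive wins,
    and anything after '#' on a line is dropped as a comment.
    """
    conf = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        name, *args = line.split()
        conf[name.lower()] = args
    return conf


# Excerpt of the replica directives from the appendix listing.
SAMPLE_REPLICA = """
port 6380
replicaof 127.0.0.1 6379
replica-read-only yes
repl-diskless-sync no  # disk-based sync
maxclients 1000000
maxmemory-policy noeviction
"""

conf = parse_redis_conf(SAMPLE_REPLICA)
master_host, master_port = conf["replicaof"]
print(f"replica port {conf['port'][0]} follows {master_host}:{master_port}")
print("eviction policy:", conf["maxmemory-policy"][0])
```

With maxmemory-policy set to noeviction, Redis rejects writes once a configured memory limit is reached instead of evicting keys.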
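
The 32 GB/s maximum CXL bandwidth quoted on the product-features slide for an x8 PCIe 5.0 link can be reproduced with simple arithmetic. The sketch below assumes the standard PCIe 5.0 signaling rate of 32 GT/s per lane and 128b/130b line encoding, neither of which is stated on the slides. The function name is illustrative, the result is per direction, and protocol overhead above the physical layer is ignored.

```python
def pcie_link_bandwidth_gb_s(gt_per_s, lanes, payload_bits=128, total_bits=130):
    """Return (raw, encoded) one-direction bandwidth in GB/s for a PCIe link.

    raw:     one bit per transfer per lane, divided by 8 bits per byte
    encoded: raw scaled by the line-encoding efficiency (128b/130b by default)
    """
    raw = gt_per_s * lanes / 8
    encoded = raw * payload_bits / total_bits
    return raw, encoded


# Assumed PCIe 5.0 rate (32 GT/s per lane) on the x8 link of the ASIC expander.
raw, encoded = pcie_link_bandwidth_gb_s(gt_per_s=32, lanes=8)
print(f"raw: {raw:.1f} GB/s, after 128b/130b encoding: {encoded:.1f} GB/s")
```

Under these assumptions the raw figure matches the 32 GB/s on the slide, and the encoded figure is slightly lower at about 31.5 GB/s.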