2023 SNIA. All Rights Reserved.

Slide 1: CXL Memory Disaggregation and Tiering: Lessons Learned from Storage
Virtual Conference, September 28-29, 2021
Presented by Andy Mills, SMART Modular Technologies

Slide 2: Summary
Memory tiering can learn a lot from storage:
- Scaling
- High availability and redundancy
- Recovery objectives (RTO/RPO)
- Stateful vs. stateless virtualized environments
Topics covered:
- Quick refresh on SANs, tiering, and CXL memory expansion/sharing
- Case study: SSD transparent page tiering, a real-world product
- Lessons learned that are useful for memory tiering

Slide 3: Background
- 1980s: Multi-parallel processor system design (Transputer), real-time embedded signal processing, parallel processor programming
- 1990s: Early involvement in 100 Mbps optical fiber networks (FDDI), Ethernet networks, SDH/SONET, and network redundancy/failover
- 2000s: Shared networked storage HA and virtualized storage appliances supporting tiering, remote replication, and snapshots
- 2010s: Software-defined storage, workload-driven dynamic storage allocation, and real-time server/OS-based storage tiering development and deployment. Developed a production transparent page tiering solution for SSDs for Linux/Windows
- 2020s: PCIe Gen 4/5 advanced SSDs and CXL memory disaggregation architecture and development. Now exploring pooled memory add-in components and appliances

Slide 4: Disaggregation and Composability
Disaggregation
- 1993-4: Storage Area Networks (FCSI formed in 1992), the first industry-standards-based storage "disaggregation"
- 2013: Intel and Facebook first use the term "disaggregation" at the Open Compute Summit
- 2017: Intel white paper, "Disaggregated servers drive data center efficiency and innovation": decouple CPU/DRAM and NIC/drives from other server components
Composability
- 2020: GPUs start to become disaggregated
- 2022: First demo of memory disaggregation
* Disaggregate (verb): to separate into component parts. Source: Merriam-Webster Dictionary, https:/www.merriam-
* Composability: ability of a system to configure disaggregated components

Slide 5: CXL Memory Expansion Types
[Figure labels: CPU (x2); 6-12 DDR5 DIMMs (x4); MC (x8); CXL/DDR5-capable server motherboard; CXL 1.1/2.0 memory expansion; CXL 2.0 memory expansion add-in cards; CXL 2.0 E3.S/L CMMs and NV-CMMs; CXL 3.0 fabric switch; CXL 3.0 memory expansion units; JBOM chassis; CXL 3.0 fabric/expansion cable connection.]
[Figure labels, continued: CXL DIMMs; custom modules; standard DIMMs (x2).]

Slide 6: Latency and Bandwidth, NUMA and CXL
[Figure labels: CPU 0 + I/O and CPU 1 + I/O, each with local CPU caches; HBM (x2); near DDR DRAM; far DDR DRAM; 1 NUMA hop; PCIe Gen 5/CXL slot (x2); PCIe Gen 5/CXL host adapter; CXL switch (x2, one marked "+100ns+"); CXL expansion memory controller; CXL expansion memory appliance.]
Latency and bandwidth pairs, in the order they appear on the slide:
- 50-100 ns, 256 GB/s-1 TB/s
- 60-100 ns, 50-200 GB/s
- 120-200 ns, 50-200 GB/s
- 180-260 ns, 60-100 GB/s
- CXL switch + CXL endpoint: 280-500 ns, 60-100 GB/s
- Far CXL expansion memory controller: 240-320 ns, 60-100 GB/s
Latencies are approximate and vary by controller and memory type.

Slide 7: Caching Refresher
[Figure: three diagrams, each with a cache engine, cache table, cache, and primary storage or memory media.
Read operation: read, hit, miss, copy.
Write-through operation: write, write through, copy.
Write-back operation: write, write, lazy write.
Annotations: "Copies of data managed here" and "All data eventually ends up here".]
- Capacity visible to the application in all cases is only that of the primary storage "tier"
- Caches are rarely larger than a few percent of the primary storage
- Returns diminish because the cache must be flushed to make room for new data

Slide 8: Transparent Tiering
[Figure: three diagrams, each with a virtualization engine, page translation table (vPage -> pPage, tier ID), background tiering engine, tiering policies, fast tier, and slow tier storage or memory.
Read operation: read, served as a fast tier read or a slow tier read.
Write operation: write, issued as a fast tier write or a slow tier write.
Balancing operation: hot pages and cold pages moved between tiers.
Annotation: "Data split across fast and slow".]
- Capacity visible to the application in all cases is the sum of the fast and slow tiers, unless a reservation scheme is used
- Reads and writes go directly through to the media via a simple page translation table
- Data/files are split across the fast and slow tiers (not copied)

Slide 9: Disaggregated Storage Refresher - SANs
[Figure labels: CPU; Memory; LAN; Disks; LAN/networked drive copy or backup; SAN; Floppy net; RAID; SAN appliance; JBOD expansion; internal block tiering between SSD and HDDs; compute node(s).]
Networked servers (no SAN): networked backup/tiering
- Early manual copy
- Automated copying of files
- As file sizes increased: batch file-based tiering, file extent tiering (partial files), block-based tiering, file stubbing
SAN-based environment: SAN backup/tiering
- Automated copying of files
- File extent tiering
- As SAN appliances migrated to virtualized: transparent block-based activity tiering using SSD and HDD combinations, remote replication
Newer forms: NVMe over Fabrics, which bridges the gap between DAS and SAN

Slide 10: SAN High Availability
- Removing single points of failure by ensuring multiple paths between compute nodes and storage
- Active-Active: both paths are used for parallelism
- Active-Passive: earlier dual controllers used only one as active, with the other on standby, but the standby needs to be kept in sync
- Active-Failover: one controller has failed until it is replaced
- Ability to replace controllers, switches, and other key components (e.g. PSUs) without taking the system offline
[Figure labels: Compute/Memory (x4); SAN adapter (x8); SAN switch A; SAN switch B; shared storage controller A; shared storage controller B; shared storage (dual ported).]

Slide 11: Case Study: Transparent Page Block Tiering
- Auto discovery and classification
  - Profiling the tiers using live probe/performance data
- Page virtualization
  - Efficient, low-tax method for virtualizing and aggregating physical storage components to an application, optimized for PCIe Gen 4/5 NVMe
  - Memory cache to RAM or NVDIMM as a "third" tier
- Hot page tracking and ranking
  - High-frequency sampling of storage I/O access patterns to determine high-use areas
  - Preference is to use hardware counters or fast memory if possible
- Cold page tracking and ranking
  - Less intensive background task to determine which areas of storage are not being heavily used
- Background migration
  - Migrating performance data from cold to hot tiers and vice versa
- Policy-based APIs
  - Promote and policy setup
  - Page pinning
  - Manual or directed promote/demote controls
[Figure: Key Tiering Components. Labels: discovery and APIs; hot/cold page tracking; background migration; page virtualization; fast media; slower media; virtual SSD (x2).]

Slide 12: Storage Tiering Stack
[Figure labels: OS/hypervisor/file system/applications; file system; logical volume managers; kernel block I/O; device drivers; EFI BIOS drivers; virtual block devices; block device layer interface; block virtualization layer; fast block-level remapping layer (1 us); MicroTiering block data migration; policy-driven stats and decision engine; device striping, mirror, redundant copy; RAM cache; kernel driver; UEFI virtual boot; class library; local machine utilities/tools; system management and logging; storage devices: NVMe, AHCI, SCSI/RAID/SAN.]
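
The page translation table from the Transparent Tiering slide (vPage -> pPage, tier ID), which is also what a block-level remapping layer has to maintain, can be illustrated with a small in-memory model. This is a minimal sketch under stated assumptions, not the product's implementation: the names `TieredVolume`, `Mapping`, `FAST`, `SLOW`, and `swap` are hypothetical, media is modeled as Python lists of byte strings, and the balancing step is reduced to a synchronous swap of two pages.

```python
from dataclasses import dataclass

FAST, SLOW = 0, 1  # hypothetical tier IDs, not taken from the slides


@dataclass
class Mapping:
    tier: int   # tier ID the virtual page currently lives on
    ppage: int  # physical page index within that tier


class TieredVolume:
    """Toy transparent-tiering volume: data is split across tiers, not copied."""

    def __init__(self, fast_pages: int, slow_pages: int, page_size: int = 4096):
        self.page_size = page_size
        self.media = {
            FAST: [bytes(page_size) for _ in range(fast_pages)],
            SLOW: [bytes(page_size) for _ in range(slow_pages)],
        }
        # Visible capacity is the sum of both tiers. Virtual pages are laid
        # out fast tier first, then slow tier (an arbitrary initial placement).
        self.table = [Mapping(FAST, p) for p in range(fast_pages)]
        self.table += [Mapping(SLOW, p) for p in range(slow_pages)]

    @property
    def capacity_pages(self) -> int:
        return len(self.table)

    def read(self, vpage: int) -> bytes:
        m = self.table[vpage]  # one table lookup, then straight to the media
        return self.media[m.tier][m.ppage]

    def write(self, vpage: int, data: bytes) -> None:
        if len(data) != self.page_size:
            raise ValueError("data must be exactly one page")
        m = self.table[vpage]
        self.media[m.tier][m.ppage] = data

    def swap(self, hot_vpage: int, cold_vpage: int) -> None:
        """Balancing step: exchange the physical homes of two virtual pages."""
        a, b = self.table[hot_vpage], self.table[cold_vpage]
        data_a = self.media[a.tier][a.ppage]
        data_b = self.media[b.tier][b.ppage]
        self.media[a.tier][a.ppage] = data_b
        self.media[b.tier][b.ppage] = data_a
        # Only the mapping changes from the application's point of view.
        self.table[hot_vpage], self.table[cold_vpage] = b, a


vol = TieredVolume(fast_pages=2, slow_pages=4, page_size=8)
vol.write(5, b"hot-data")            # virtual page 5 starts on the slow tier
vol.swap(hot_vpage=5, cold_vpage=0)  # promote page 5, demote page 0
```

After the swap, virtual page 5 still returns the same bytes but is served from the fast tier, which is the property that distinguishes tiering from caching: there is one copy of the data, and only its location changes.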

Slide 13: Tiering Engine High Level Functions
[Figure labels: host storage I/O; host block storage I/O commands and data; data path; command/control path; RAM; LBA; virtual to physical remapping; virtual remapping tables; access pattern page tables; page activity database; statistics table; API; user policies; data movement engine; storage devices; block storage (x3).
Engine loop: Measure/Store (capture attributes within page boundaries), Analyze (scan page database, rank pages by relative activity and activity policy, produce prioritized data movement list), Modify (evaluate traffic patterns, move data, adjust virtual page maps, adjust/update rigidity), Repeat (default 2 s).
Optional: learn and lock to pin pages via application directives.]
The goal of the engine is to continuously modify the virtual mapping until optimal performance is achieved, using a policy-driven model.

Slide 14: Background Data Movement
[Figure labels: virtual page states; N virtual volume pages; page of B blocks; example page size: 128K-4M; fast tier pages; slow tier pages; host I/O; foreground page translation; statistics table; page promote/demote queues; pages migrated in background.
Example page states: (0,P), (1,X), (1,Y), (0,Q).
Annotations: "Page mapped to slow tier heating up"; "Page on fast tier cooled down"; "Highly active page stays on fast tier"; "Mostly inactive page stays on slow tier".]

Slide 15: Mix of Static and Dynamic Allocation
[Figure: OS kernel transparent memory tiering. Labels: compute node; CPUs; MC; PCIe/CXL; cores A-H; I/O die; PCIe (x2); DDR5 (x6); process allocation; dynamically allocated to cores (x2).
Software-defined tiering layers: application (user); O/S or container (user/kernel); hypervisor (kernel); tiering virtualization (kernel); tiering data movement (kernel); tiering telemetry tables (kernel); core CXL driver (kernel).
Annotations: "Background tasks and processor/memory/PCIe affinity dependent"; "Kernel tiering form of software defined tiering".]

Slide 16: Page Statistics Table
[Figure: table indexed by virtual pages 0 ... N-1, N, with two groups of counters, listed in slide order.]
- RD IOs, WR IOs, RD Blocks, WR Blocks, Cur Policy Count, Time Last Accessed, Rigidity Control/Lock/Pcount, Host Access Count (labeled "Mega Region Counters")
- RD IOs, WR IOs, RD Blocks, WR Blocks, Promotes Pending/Threshold, Total Promotes (labeled "Virtual Page Counters")

Slide 17: Tiering Policy Engine
- Allows users/administrators/system architects to tune policies
- The case study example used page activity counters
- Policy settings
  - Promote on read I/O threshold
  - Promote on read and write I/O threshold
  - Promote on write I/O threshold
  - Same as above for MB/size (i.e. amount) of data changed per page
- Rigidity settings: how fluid a range of pages should be
- Page locking: pre- and post-promote actions
- Numerous rate- and time-driven policies about when and how aggressively to move data
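
The threshold policies on the Tiering Policy Engine slide, combined with the "rank pages, produce prioritized data movement list" step of the tiering engine, can be sketched as follows. This is an illustrative model only, under several assumptions: `PageStats`, `Policy`, and `movement_list` are hypothetical names, the `pinned` flag is a simplified stand-in for the rigidity and page-locking controls, and only I/O counts (not MB changed per page) are considered.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class PageStats:
    rd_ios: int = 0        # read I/Os seen in the current sampling interval
    wr_ios: int = 0        # write I/Os seen in the current sampling interval
    on_fast: bool = False  # True if the page already lives on the fast tier
    pinned: bool = False   # simplified stand-in for rigidity / page locking


@dataclass
class Policy:
    # A threshold of None disables that rule.
    read_threshold: Optional[int] = None
    write_threshold: Optional[int] = None
    read_write_threshold: Optional[int] = None

    def wants_promote(self, s: PageStats) -> bool:
        checks = [
            (self.read_threshold, s.rd_ios),
            (self.write_threshold, s.wr_ios),
            (self.read_write_threshold, s.rd_ios + s.wr_ios),
        ]
        return any(limit is not None and count >= limit
                   for limit, count in checks)


def movement_list(stats: Dict[int, PageStats], policy: Policy,
                  fast_free: int) -> List[int]:
    """Return slow-tier virtual pages to promote, most active first."""
    candidates = [
        vpage for vpage, s in stats.items()
        if not s.on_fast and not s.pinned and policy.wants_promote(s)
    ]
    # Rank by total I/O activity; the fast tier has room for fast_free pages.
    candidates.sort(key=lambda v: stats[v].rd_ios + stats[v].wr_ios,
                    reverse=True)
    return candidates[:fast_free]


stats = {
    0: PageStats(rd_ios=50),                  # below every threshold
    1: PageStats(rd_ios=500, wr_ios=10),      # hot on reads
    2: PageStats(rd_ios=900, pinned=True),    # hot, but locked in place
    3: PageStats(wr_ios=300),                 # hot on writes
    4: PageStats(rd_ios=1000, on_fast=True),  # already on the fast tier
}
policy = Policy(read_threshold=100, write_threshold=200)
to_promote = movement_list(stats, policy, fast_free=2)  # [1, 3]
```

In a running engine this evaluation would repeat on each sampling interval (the slides mention a default of 2 s), with the counters reset or aged between passes; that aging step is omitted here.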

Slide 18: Analytics Data Collection and Reporting
- Deep sub-file, block-based analytics that can determine active capacity over time and help size workloads from both a performance and a capacity-required standpoint
- RESTful/JSON or Redfish-type interface
- Interface to logging and alert systems, e.g. Splunk, ElasticStack, Logstash, Kibana
[Figure: activity charts. Labels: workload bursts; time of day; relative tier activity; RD vs. WR activity; tier promote activity; long-term activity; activity over capacity range; zoom in on activity; time-based activity view; volume-based activity view.]

Slide 19: Future Memory Tiering HA Appliance
[Figure labels: compute node; CXL 3.0 adapter (x2); CXL 3.0 switch (x2).
HA CXL memory appliance: intelligent CXL memory controller A; intelligent CXL memory controller B; DP CMM (x5).
HA CXL hybrid memory appliance: intelligent CXL memory controller A; intelligent CXL memory controller B; DP CMM (x2); DP NV CMM; DP CXL NVMe (x2).
Annotations: low latency memory (x2); high latency memory (low bandwidth); persistent storage.]

Slide 20: Summary and Lessons Learned
- Kernel-based VMAP metadata maintenance
  - Took many iterations to get this right and solid
  - Lived through a "vmap repair" nightmare as we hardened for power loss and removable drives
- Processor affinity and I/O handler process placement
  - Tiering engine processes were allocated dynamically and not always on a CPU nearest the I/O path handlers
  - Moving data can significantly impact the application; policies are needed to deal with this
  - OS maintenance (e.g. indexing, virus scans) messes with your algorithms; policies are needed to deal with this
- Translation of I/O (load/store) access
  - Using system memory for storing the tables is fast; however, careful attention needs to be paid to CPU association of table vs. I/O to prevent large context switches or wait times
  - Significantly more challenging for software-based memory tiering
- Low-level device media conflict management
  - SSD housekeeping and block migration often conflict with tiering migration
  - Important for hybrid/persistent CXL storage and intelligent CXL appliances
- No one size fits all
  - Mission critical vs. non-critical, tiering in scale-up vs. scale-out, hyper-converged vs. tenant-based

Slide 21: OCP Composable Memory Systems
- OCP Composable Memory System (CMS) is a sub-project within the Server Project
- Led by Manoj Wadekar (Meta) and Reddy Chagam (Intel)
- Members include device vendors, CPU vendors, CSP, ISV
- Charter
  - Focus on key applications driving CMS adoption
  - Establish CMS architecture and nomenclature
  - Identify gaps in specifications across the full stack
  - Offer benchmarks enabling innovations in new and emerging use cases
- Currently working on a draft specification for memory tiering
- More at: https://www.opencompute.org/projects/composable-memory-system

Slide 22: SMART at SDC23
Demoing E3.S CXL 2.0 Memory Module at the SDC23 Hackathon
Wednesday, September 20, starting at 10:35 am in Salon 8

Slide 23
Please take a moment to rate this session. Your feedback is important to us.