Coherence Deep Dive for CXL
Rob Blankenship, Intel Corporation and CXL Protocol Working Group co-chair
August 2022
Copyright | CXL Consortium 2020 - Hot Chips 2022 CXL Tutorial, 8/18/2022 | Public

Agenda
- Coherence/Caching Primer
- CXL Cache Hierarchy
- CXL.Cache Deep Dive
  - What is new in CXL3 (Device Scaling)
- CXL.Mem Deep Dive
  - What is new in CXL3: Direct P2P to HDM / Multi-Host Coherence

Caching Primer

Caching Overview
- Caching temporarily brings data closer to the consumer.
- It improves latency and bandwidth using prefetching and/or locality:
  - Prefetching: loading data into the cache before it is required
  - Spatial locality (locality in space): access address X, then X+n
  - Temporal locality (locality in time): multiple accesses to the same data
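The two kinds of locality named above can be illustrated with a toy simulation. This is a minimal sketch, not something taken from the tutorial: the `TinyCache` class, its LRU replacement policy, the 4-line capacity, and the 64-byte line size are all assumptions chosen for illustration.

```python
from collections import OrderedDict

LINE_BYTES = 64  # assumed line size for this illustration


class TinyCache:
    """Hypothetical fully associative cache with LRU replacement."""

    def __init__(self, num_lines):
        self.num_lines = num_lines
        self.lines = OrderedDict()  # line tag -> True, oldest first
        self.hits = 0
        self.misses = 0

    def access(self, addr):
        tag = addr // LINE_BYTES
        if tag in self.lines:
            self.lines.move_to_end(tag)  # mark as most recently used
            self.hits += 1
        else:
            self.misses += 1
            self.lines[tag] = True
            if len(self.lines) > self.num_lines:
                self.lines.popitem(last=False)  # evict least recently used


def hit_rate(addresses, num_lines=4):
    """Fraction of accesses served from the cache."""
    cache = TinyCache(num_lines)
    for addr in addresses:
        cache.access(addr)
    return cache.hits / (cache.hits + cache.misses)


# Spatial locality: sequential 8-byte reads (X, then X+8, ...).
spatial = hit_rate(range(0, 1024, 8))
# Temporal locality: repeated reads of the same address.
temporal = hit_rate([0x1000] * 100)
# No locality: every access touches a new line.
strided = hit_rate(range(0, 64 * 100, 64))
```

With sequential 8-byte reads, only the first access to each 64-byte line misses, so seven of every eight accesses hit; the strided pattern never reuses a line and never hits.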
Figure (Caching Overview): an accelerator reads data either from its local data cache (access latency: 10 ns, dedicated bandwidth: 100+ GB/s) or from host memory (access latency: 200 ns, shared bandwidth: 100+ GB/s).

CPU Cache/Memory Hierarchy with CXL
- Modern CPUs have 2 or more levels of coherent cache.
- Lower levels (L1) are smaller in capacity, with the lowest latency and highest bandwidth per source.
- Higher levels (L3) have less bandwidth per source but much higher capacity, and support more sources.
- Device caches are expected to be up to 1 MB.
- Note: cache/memory capacities are examples and not aligned to a specific product.
Figure labels: CPU Socket 0 and CPU Socket 1 joined by coherent CPU-to-CPU symmetric links; per-CPU L1 (50 KB) and L2 (500 KB); L3 (aka LLC, 10 MB); Home Agent; directly connected memory (aka DDR, 10 GB); CXL.Cache devices with a 1 MB device cache; CXL.mem (10 GB); CXL.io/PCIe; write caches (Wr Cache, 50 KB).

Cache Consistency
- How do we make sure updates in cache are visible to other agents?
  - Invalidate all peer caches prior to the update.
  - This can be managed with software or hardware; CXL uses hardware coherence.
- Define a point of "Global Observation" (aka GO) when new data is visible from writes.
- Tracking granularity is a "cacheline" of data: 64 bytes for CXL.
- All addresses are assumed to be Host Physical Addresses (HPA) in the CXL cache and memory protocols; translations are done using Address Translation Services (ATS).

Cache Coherence Protocol
- Modern CPU caches and CXL are built on the M, E, S, I protocol/states:
  - Modified: only in one cache; can be read or written; data is NOT up to date in memory
  - Exclusive: only in one cache; can be read or written; data IS up to date in memory
  - Shared: can be in many caches; can only be read; data IS up to date in memory
  - Invalid: not in cache
- M, E, S, I is tracked for each cacheline address in each cache. The cacheline address in CXL is Addr[51:6].
- Notes:
  - Each level of the CPU cache hierarchy follows MESI, and layers above must be consistent.
  - Other extended states and flows are possible but are not covered in the context of CXL.
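The MESI state properties and the Addr[51:6] cacheline address above can be written down compactly. The following is a minimal sketch for illustration only: the names `State`, `PROPERTIES`, `is_coherent`, and `cacheline_address` are hypothetical and do not come from the tutorial or the CXL specification.

```python
from enum import Enum


class State(Enum):
    M = "Modified"
    E = "Exclusive"
    S = "Shared"
    I = "Invalid"


# Per-state properties as listed above:
# (may_read, may_write, memory_up_to_date). Invalid means "not in cache".
PROPERTIES = {
    State.M: (True, True, False),
    State.E: (True, True, True),
    State.S: (True, False, True),
}


def is_coherent(states):
    """Check one cacheline's state across all peer caches.

    M and E mean "only in one cache", so at most one cache may hold the
    line in M or E, and then no other cache may hold it in S.
    """
    owners = sum(1 for s in states if s in (State.M, State.E))
    sharers = sum(1 for s in states if s is State.S)
    return owners == 0 or (owners == 1 and sharers == 0)


def cacheline_address(hpa):
    """Return Addr[51:6] of a host physical address (64-byte lines)."""
    return (hpa >> 6) & ((1 << 46) - 1)


# Two byte addresses inside the same 64-byte line share one tracked address.
same_line = cacheline_address(0x1040) == cacheline_address(0x107F)
```

For example, `is_coherent([State.M, State.S])` is false, because a Modified line must not be visible in any other cache.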
How are Peer Caches Managed?
- All peer caches are managed by the "Home Agent" within the cache level.
- A "Snoop" is the term for the Home checking cache state and causing cache state changes.
- Example CXL snoops:
  - Snoop Invalidate (SnpInv): causes a cache to degrade to I-state; the cache must return any Modified data.
  - Snoop Current (SnpCurr): does not change cache state, but does return an indication of the current state and any modified data.

CXL Cache Protocol
Cache Protocol Summary
- A simple set of 15 reads and writes from the device to host memory.
- Keeps the complexity of global coherence management in the host.
- CXL3 enables up to 16 cache devices below each root port; prior generations were limited to 1 per root port.

Cache Protocol Channels
- 3 channels in each direction: D2H vs H2D.
- Data and RSP channels are pre-allocated.
- D2H Requests come from the device.
- H2D Requests are snoops from the host.
- Ordering: H2D Req (Snoop) pushes H2D RSP.
Read Flow
- Diagram to show message flows in time
- X-axis: Agents
- Y-axis: Time
Mapping Flow Back to CPU Hierarchy
Figure: the CPU cache/memory hierarchy diagram (two CPU sockets, L1/L2/L3 caches, Home Agents, directly connected DDR, CXL.Cache, CXL.mem and CXL.io devices), annotated with the roles used in the flow diagrams: CXL Device, Peer Cache, Home, Memory Controller.
- Peer Cache can be:
  - Peer CXL device with cache
  - CPU cache in the local socket
  - CPU cache in a remote socket
- Memory Controller can be:
  - Native DDR on the local socket
  - Native DDR on a remote socket
  - CXL.mem on a peer device

Example #2: Write
- For cache writes there are three phases: Ownership, Silent Write, Cache Eviction.
Flow diagram (agents: CXL Device, Peer Cache, Home, Memory Controller; legend: cache states Modified, Exclusive, Shared, Invalid; Allocate Tracker; Deallocate Tracker):
- Start: CXL Device cache in I, Peer Cache in S.
- Ownership: CXL Device goes I to E; Peer Cache goes S to I.
- Write: CXL Device goes E to M.
- Eviction: Data is sent; CXL Device goes M to I.

Example #3: Streaming Write
- Direct write to host: ownership + write in a single flow.
- Relies on completion to indicate ordering; may see reduced bandwidth for ordered traffic.
- Host may install data into the LLC instead of writing to memory.
Flow diagram (same agents and legend): CXL Device cache stays in I; Peer Cache goes S to I; Data is sent.

15 Requests in CXL
- Reads: RdShared, RdCurr, RdOwn, RdAny
- Read-0: RdOwnNoData, CLFlush, CacheFlushed
- Writes: DirtyEvict, CleanEvict, CleanEvictNoData
- Streaming Writes: ItoMWr, WrCur, WOWrInv, WrInv(F)
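The three phases of the cached write in Example #2 can be traced as state changes on one cacheline. This is a minimal sketch under stated assumptions: `cached_write` is a hypothetical helper, it models only the device and peer cache states shown in the flow diagram, and it does not model the individual messages or the Home's trackers.

```python
def cached_write(device="I", peer="S"):
    """Walk one cacheline through the three write phases.

    Returns a list of (phase, device_state, peer_state) tuples.
    """
    trace = [("start", device, peer)]

    # Phase 1, ownership: the device requests ownership, and the Home
    # invalidates any peer copy before the device takes E-state.
    if peer != "I":
        peer = "I"
    device = "E"
    trace.append(("ownership", device, peer))

    # Phase 2, silent write: E to M happens inside the device cache,
    # with no messages on the link.
    device = "M"
    trace.append(("silent write", device, peer))

    # Phase 3, cache eviction: the modified data is sent back and the
    # line becomes Invalid in the device cache.
    device = "I"
    trace.append(("eviction", device, peer))
    return trace


for phase, dev_state, peer_state in cached_write():
    print(f"{phase:12s} device={dev_state} peer={peer_state}")
```

The streaming write of Example #3 folds the ownership and write phases into a single flow, so the device cache never leaves I-state.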
CXL Memory Protocol

Memory Protocol Summary
- Simple reads and writes from host to memory.
- Memory technology independent: HBM, DDR, PMem.
  - Architected hooks to manage persistence.
- Includes 2 bits of "meta-state" per cacheline:
  - Memory-only device: up to the host to define usage.
  - For accelerators: the host encodes the required cache state.
- Host-managed Device Memory (HDM) comes in 3 types:
  - Host Managed Coherence (HDM-H)
  - Device Managed Coherence (HDM-D)
  - Device Managed Coherence with Back-Invalidation (HDM-DB), new in CXL3

Memory Protocol Channels
- 3 channels in each direction:
  - M2S: Request (Req), Request with Data (RwD)
  - S2M: Non-Data Response (NDR), Data Response (DRS), which are pre-allocated
  - M2S BIRsp and S2M BISnp, used for HDM-DB to manage coherence (new in CXL3)
- Limited ordering:
  - Req channel for HDM-D memory (CXL2 accelerators)
  - NDR channel for conflict flows with HDM-DB

Example #1: Write
- Media ECC is handled by the device.
- HDM-H provides 2 bits of host-defined Meta Value, which the device optionally supports.
- Note: only host caching with HDM-H (host-only coherent).
Flow diagram (Memory Only Device; agents: Host, CXL Device Memory Controller, Memory Media): the write carries the new MetaValue; the completion is sent when the data is visible to future reads.

Example #2: Read
- A Meta Value change requires the device to write.
Flow diagram (same agents): the read carries the new MetaValue; the old MetaValue is returned.

Example #3: Read no Meta
- The host may indicate that no meta-state update is required on reads.
Flow diagram (same agents): no meta update needed.

Example #4: MemInv
- Used to read/update meta-state without reading the data itself.
Flow diagram (same agents): new MetaValue sent; old MetaValue returned.
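The four memory-only examples above differ mainly in whether data and the 2-bit meta-state are read or written. A minimal sketch of that behavior follows; `MemoryOnlyDevice`, its method names, the `media_writes` counter, and the default meta value of 0 for untouched lines are all assumptions made for illustration, not the message formats of the protocol.

```python
class MemoryOnlyDevice:
    """Hypothetical model of a memory-only device with 2-bit meta-state."""

    def __init__(self):
        self.data = {}         # cacheline address -> payload
        self.meta = {}         # cacheline address -> 2-bit meta value
        self.media_writes = 0  # counts writes to the backing media

    @staticmethod
    def _check(meta):
        if not 0 <= meta <= 3:
            raise ValueError("meta value must fit in 2 bits")

    def _update_meta(self, line, meta):
        self._check(meta)
        if self.meta.get(line, 0) != meta:
            # A meta value change requires the device to write.
            self.meta[line] = meta
            self.media_writes += 1

    def mem_wr(self, line, payload, meta):
        """Example #1: write data together with the new meta value."""
        self._check(meta)
        self.data[line] = payload
        self.meta[line] = meta
        self.media_writes += 1

    def mem_rd(self, line, new_meta=None):
        """Examples #2 and #3: read data; new_meta=None means no update."""
        old = self.meta.get(line, 0)
        if new_meta is not None:
            self._update_meta(line, new_meta)
        return self.data.get(line), old

    def mem_inv(self, line, new_meta):
        """Example #4: read/update meta-state without reading the data."""
        old = self.meta.get(line, 0)
        self._update_meta(line, new_meta)
        return old


dev = MemoryOnlyDevice()
dev.mem_wr(0x40, b"payload", meta=1)       # one media write
data, old_meta = dev.mem_rd(0x40)          # no meta update, no extra write
```

The point of the "no meta" read is visible in the counter: a read that leaves the meta value alone costs no media write, while a read that changes it does.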
HDM-D/HDM-DB Common Attributes
- "Device Coherent": provides the ability for both host and device to cache.
- The request MetaValue field indicates the host cache state:
  - Any: the host can be in M, E, S, or I state.
  - Shared: the host can be in S or I state, and the host is requesting S-state.
  - Invalid: the host is in I-state and is not requesting cache state.
- The request SnpType indicates the Device Cache state change:
  - SnpInv: invalidate the Device Cache.
  - SnpData: Device Cache ends in I or S state.
- The Device Coherence Engine (DCOH) is the final conflict resolution arbiter between host and device accesses for HDM-D* memory.

Device Coherent (HDM-D) Specifics
- CXL.mem requests indicate the coherence required from the host.
- CXL.Cache is used for the device to change host cache state.
  - The host must detect the device accessing its own memory and trigger special flows which return a "Forward" message.
  - These can be blocked behind accesses to host memory.
- Requires the device to implement full directory tracking (aka Bias Table).

Full Directory Bias Table
- Device state of 1 or 2 bits per cacheline, indicating whether the host has a cached copy:
  - Device Bias: no host caching, allowing direct reads.
  - Host Bias: the host may have a cached copy, so the read goes through the host.
- Optionally tracks Shared vs Any state in the host. With Shared state, the device may directly read data but must not modify it.
- The host tracks which peer caches have copies.

Host Bias Read
- The MemRdFwd message is sent on the M2S Request channel after coherence is resolved.
Flow diagram (agents: Peer Cache, Host/HA, DCOH, Dev$, Dev Mem; accelerator device with HDM-D; initial BIAS=HOST, Dev$ in I, Peer Cache in S): RdOwn X from the device; SnpInv X to the peer cache, which goes S to I and responds; MemRdFwd X to the device; BIAS changes to DEVICE; Data is read from device memory; GO-E; Dev$ goes I to E.

Device Bias Read
- No messages on the CXL interface.
Flow diagram (same agents; BIAS=DEVICE, Dev$ in I): RdOwn X; MemRdFwd X; Data; GO-E; Dev$ goes I to E.

Device Cache Evictions
- E/M in the cache imply Bias=Device, so there is no indication to the host.
Flow diagram (same agents; BIAS=DEVICE, Dev$ in M): DirtyEvict X; GO-WrPull; Data X; Dev$ goes M to I; MemWr X; Cmp.

Host Bias Streaming Write
- The MemWrFwd message is sent after coherence is resolved.
Flow diagram (same agents; initial BIAS=HOST, Dev$ in I, Peer Cache in S): WOWr X from the device; SnpInv X to the peer cache, which goes S to I and responds; MemWrFwd X to the device; BIAS changes to DEVICE; GO-WrPull; Data; MemWr X; Cmp; ExtCmp.

Device Bias Streaming Write
- No message to the host.
Flow diagram (same agents; BIAS=DEVICE, Dev$ in I): WOWr X; GO-WrPull; Data; MemWr X; Cmp; ExtCmp.

HDM-DB: New in CXL3

HDM-DB
- "Device Coherent with Back-Invalidation" (HDM-DB) adds BISnp and BIRsp channels to optimize coherence management, enabling inclusive Snoop Filter (SF) architectures.
- Same "BIAS Table" states tracking host coherence: I, S, A.
- An inclusive SF architecture may block an M2S Request while waiting for a Back-Invalidation Snoop (BISnp) to complete, which enables sizing the SF to match expected host caching instead of memory capacity.
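The inclusive snoop filter idea can be sketched as a bounded table of host-cached lines that must evict a victim, by back-invalidating it in the host, before it can track a new line. This is a toy model under stated assumptions: the `InclusiveSnoopFilter` class, its oldest-first victim choice, and the `bisnp_log` list standing in for BISnp messages are all hypothetical and are not taken from the CXL specification.

```python
from collections import OrderedDict


class InclusiveSnoopFilter:
    """Hypothetical device-side filter tracking host state per cacheline.

    Lines absent from the filter are in I-state in the host, which is
    what makes the filter inclusive of host caching.
    """

    def __init__(self, capacity):
        self.capacity = capacity      # sized for expected host caching
        self.entries = OrderedDict()  # line -> "S" or "A", oldest first
        self.bisnp_log = []           # lines back-invalidated in the host

    def host_state(self, line):
        return self.entries.get(line, "I")

    def host_request(self, line, state):
        """Record that the host now caches `line` in state "S" or "A"."""
        if state not in ("S", "A"):
            raise ValueError("tracked host state must be S or A")
        if line not in self.entries and len(self.entries) >= self.capacity:
            # The filter is full: pick a victim and back-invalidate it.
            # A real device would hold the new request until the
            # matching BIRsp arrives; here the step is instantaneous.
            victim, _ = self.entries.popitem(last=False)
            self.bisnp_log.append(victim)
        self.entries[line] = state
        self.entries.move_to_end(line)


sf = InclusiveSnoopFilter(capacity=2)
sf.host_request(0x000, "S")
sf.host_request(0x040, "A")
sf.host_request(0x080, "S")  # forces a back-invalidation of line 0x000
```

Because every host-cached line must have an entry, the filter capacity bounds how much the host can cache from this device, independent of how large the device memory is.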
Back-Invalidation Snooping (HDM-DB)
- Enables an inclusive Snoop Filter (SF) to track host caching.
- The device can block new requests while waiting for an SF victim.

Block Access with BISnp
- To improve efficiency, there are BISnp messages that cover more than one cacheline (aka "Block").
- Either 2 (128 B) or 4 (256 B) cachelines are supported.

New Use Models with HDM-DB

Direct Peer-to-Peer (P2P) to HDM
- HDM-DB enables direct P2P from CXL or PCIe sources in CXL3.
- In prior generations, all HDM accesses must go through the host CPU to resolve coherence.
- HDM-DB will directly resolve coherence with the host before committing the P2P.
Figure labels: H1; CXL Switch; D1 (CXL Type-2/3, HDM-DB); D2 (PCIe Accel); D3 (CXL Accel).

Pooled and Shared Memory
- Pooled memory and CXL switching, added in CXL2, allow for dedicated assignment of memory resources to a host.
- Shared memory assigned to multiple hosts is enabled in CXL3.
- Multi-host hardware-coherent shared memory is possible with HDM-DB.
- More on these uses in the Fabric Tutorial.
Figure labels: hosts H1, H2, H3, ... H#; CXL Switch; devices D1, D2, D3, D4, ... D#; Hardware Coherent Shared; Software Coherent Shared.

Summary
- CXL protocols are evolving.
- CXL2 added switching and pooled memory capabilities.
- CXL3 enables new capabilities:
  - CXL.Cache scaling
  - CXL.Mem Back-Invalidation channel for SF, direct P2P, and multi-host coherence
  - Port Based Routing (covered in the Fabric Tutorial)

Thank You

Join Today! puteexpresslink.org/join

Audience Q&A