PCIEEX~1.PDF


PCI Express correctable errors handling (RAS) solution implementation considerations in Meta's AI/ML Training Clusters

PCI Exp Error Handling (RAS) Considerations

Carlos Fernandez, Hardware Systems Engineer, Meta
Anil Agrawal, Hardware Systems Engineer, Meta

SUSTAINABLE SCALABLE COMPUTATIONAL INFRASTRUCTURE
HW MGMT


PCIe Error Handling - Abstract

Meta's AI/ML Training Clusters are built using a large number of PCIe devices, including GPUs, NICs, NVMe storage, and PCIe switches. It is important to implement a robust fault handling (RAS) solution within this PCIe device hierarchy to ensure target uptime, availability, and serviceability objectives. A high rate of PCIe correctable errors is expected. In this presentation, we would like to share our learnings and an innovative solution we developed to manage such large-scale PCIe correctable errors within Meta AI/ML training clusters.


AI/ML Training Cluster 30K ft view


AI/ML Training Cluster - Overview

Reference: https:/


Teton Training - Overview

Reference:


Teton Training - Platform View

OAM: OCP Accelerator Module


Grand Teton Training Platform - PCIe Hierarchy


A Large PCIe Device Hierarchy: Increased PCIe Correctable Errors

B:D.F root_port, slot#, device present, power: On, speed 32GT/s, width x16
    B:D.F endpoint, CPU-NIC
B:D.F root_port, slot#, device present, power: On, speed 32GT/s, width x16
    B:D.F upstream_port, PCIe Gen 5 Switch
        B:D.F downstream_port, slot#, device present, speed 32GT/s, width x16
            B:D.F endpoint, IOX-NIC
        B:D.F downstream_port, slot#, device present, speed 8GT/s, width x4
            B:D.F endpoint, current speed 8GT/s, target speed 32GT/s
        B:D.F downstream_port, slot#, device present, speed 8GT/s, width x4
            B:D.F endpoint, IOX-SSD, current speed 8GT/s, target speed 16GT/s
        B:D.F downstream_port, slot#, device present, speed 32GT/s, width x16
            B:D.F endpoint, GPU
        B:D.F downstream_port, speed 32GT/s, width x16
            B:D.F endpoint, PCIe Gen 5 Switch
        B:D.F downstream_port, speed 32GT/s, width x16
            B:D.F endpoint, PCIe Switch management endpoint
B:D.F root_port, slot#, device present, speed 32GT/s, width x16
    B:D.F upstream_port, PCIe Gen 5 Switch
        B:D.F downstream_port, slot#, device present, speed 32GT/s, width x16
            B:D.F endpoint, IOX-NIC2
        B:D.F downstream_port, slot#, device present, speed 8GT/s, width x4
            B:D.F endpoint, current speed 8GT/s, target speed 32GT/s
        B:D.F downstream_port, slot#, device present, speed 8GT/s, width x4
            B:D.F endpoint, IOX-SSD, current speed 8GT/s, target speed 16GT/s
        B:D.F downstream_port, slot#, device present, speed 32GT/s, width x16
            B:D.F endpoint, GPU
        B:D.F downstream_port, speed 32GT/s, width x16
            B:D.F endpoint, PCIe Gen 5 Switch
        B:D.F downstream_port, speed 32GT/s, width x16
            B:D.F endpoint, PCIe Switch management endpoint


PCI Express Fault Domain and Coverage

Diagram: CPU Root Port and PCIe Device, each with Transaction Layer, Data Link Layer, and Physical Layer, connected by a PCIe Link. A PCI Express Switch (USP, DSP) sits on the PCIe Link.

PCIe Faults: Bit errors
    Error Class: Corrected
    Error Type: Rx Error, Bad TLP, Bad DLLP, Replay Timer Time-out, REPLAY_NUM Rollover, Corrected Internal, Header Log Overflow
    Fault Coverage (RAS Features): Error Reporting, Link CRC and Retry, Link Retraining and Recovery

PCIe Faults: Bus faults, Protocol errors
    Error Class: Uncorrected NonFatal
    Error Type: Unsupported Request, Completion Timeout, Completer Abort, Unexpected Completion, ACS Violation, Poison TLP Received, Poison TLP Egress
    Fault Coverage (RAS Features): Error Reporting, Contain the poisoned data, DPC

    Error Class: Uncorrected Fatal
    Error Type: Data Link Protocol Error, Surprise Link Down, Receiver Overflow, Flow Control Protocol, Malformed TLP, Uncorrectable Internal
    Fault Coverage (RAS Features): Error Reporting, DPC

PCIe Faults: Internal errors
    Error Class: Uncorrected NonFatal/Fatal
    Error Type: Bit flips in data buffers within the intermediate modules that remain undetected by the link integrity.
    Fault Coverage (RAS Features): Vendor specific Error Detection and Reporting, ECRC

DSP: Downstream Port
USP: Upstream Port
DPC: Downstream Port Containment


Existing PCIe Correctable Error Reporting

Diagram: PCI Express Switch (USP, DSP), Platform Error Handler (SMM), PCIe Error Logger, BMC (SEL), Meta Specific Remediation Tooling; signals SMI, SCI, Polling.

- The PCIe CE are logged in BMC's NVRAM
- SMM Handler: 1. Logs SEL, 2. Notifies OS AER Handler
- PCIe CE is detected within HW
- PCIe Error Logger service polls and computes error rate
- DC Tooling polls periodically and checks for error rate threshold and triggers alarm

PCI Express AER is configured in Firmware First Mode.
Both correctable and uncorrectable errors trigger SMI.
Hardware triggers SMI for every PCI Express Correctable Error. No Threshold Used.
Risk of increased software latency and associated application stalls.


Proposed PCIe Correctable Error Reporting Solution

Diagram: PCI Express Switch (USP, DSP), Platform Error Handler (SMM), PCIe Error Logger, Rasdaemon, BMC (SEL), Meta Remediation Tooling; signals MSI, SMI, SCI, Polling.

- Rasdaemon logs PCIe CE in the BMC's NVRAM
- Platform triggers MSI
- PCIe CE is detected within HW
- PCIe Error Logger service polls and computes error rate
- Meta DC Tooling polls periodically and checks for error rate threshold and triggers alarm
- OS AER Driver handles the error and notifies Rasdaemon

PCI Express AER is configured in Meta Platform specific Hybrid Mode.
Correctable errors trigger MSI and uncorrectable errors trigger SMI.
Hardware triggers MSI for every PCI Express Correctable Error.
OS's AER handler and Rasdaemon handle the error and save the SEL in the existing BMC's NVRAM.


Next Steps

- Develop PCI Express Firmware Specification ECR to revise the FW/OS interface implementation (OSPM).
- Work with ecosystem players to implement the changes in the Platform FW, Linux OSPM, and AER drivers.
- Contribute the "PCI Express Correctable Error Handling Requirements" to the OCP Hardware Fault Management Sub-project (part of Hardware Management Project).

HW MGMT
HW Fault MGMT

Thank you!
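
Both reporting flows in the deck end with the same step: a logger service polls, computes an error rate, and tooling compares that rate against a threshold to raise an alarm. The deck does not say how the rate is computed or what the threshold is, so the following is only a minimal sketch under stated assumptions: the class name `CorrectableErrorMonitor`, the one-hour sliding window, and the per-hour threshold are all hypothetical, and the input is assumed to be a cumulative correctable-error count per device.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class CorrectableErrorMonitor:
    """Sliding-window error rate for one PCIe device (hypothetical helper)."""

    threshold_per_hour: float       # assumed alarm threshold, not from the deck
    window_seconds: float = 3600.0  # assumed window length, not from the deck
    samples: deque = field(default_factory=deque)  # (timestamp, cumulative_count)

    def record(self, timestamp: float, cumulative_count: int) -> None:
        """Store one polled sample of the cumulative correctable-error count."""
        # A count lower than the last one means the counter was reset
        # (for example after a reboot), so the old history is discarded.
        if self.samples and cumulative_count < self.samples[-1][1]:
            self.samples.clear()
        self.samples.append((timestamp, cumulative_count))
        # Drop samples that fall outside the sliding window.
        while self.samples and self.samples[0][0] < timestamp - self.window_seconds:
            self.samples.popleft()

    def rate_per_hour(self) -> float:
        """Errors per hour between the oldest and newest sample in the window."""
        if len(self.samples) < 2:
            return 0.0
        (t0, c0), (t1, c1) = self.samples[0], self.samples[-1]
        elapsed = t1 - t0
        if elapsed <= 0:
            return 0.0
        return (c1 - c0) * 3600.0 / elapsed

    def should_alarm(self) -> bool:
        """True when the windowed rate exceeds the configured threshold."""
        return self.rate_per_hour() > self.threshold_per_hour
```

In use, a poller would call `record()` once per polling interval for each B:D.F and check `should_alarm()` afterwards. Working from cumulative counts rather than per-event callbacks keeps the cost independent of how many errors occur between polls.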
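
The Corrected error types listed in the fault coverage table correspond to individual bits of the AER Correctable Error Status register. As an illustration of how a logger might turn a raw register value into those names, the sketch below uses the bit masks defined for that register in the PCI Express Base Specification (the same values as the `PCI_ERR_COR_*` constants in Linux's `pci_regs.h`). The function name `decode_correctable_status` is hypothetical, and the deck does not describe how Meta's logger decodes the register.

```python
# Bit masks of the AER Correctable Error Status register.
# Advisory Non-Fatal Error is defined by the specification but is not
# listed in the deck's table; it is included here for completeness.
COR_STATUS_BITS = {
    0x0001: "Receiver Error",
    0x0040: "Bad TLP",
    0x0080: "Bad DLLP",
    0x0100: "REPLAY_NUM Rollover",
    0x1000: "Replay Timer Timeout",
    0x2000: "Advisory Non-Fatal Error",
    0x4000: "Corrected Internal Error",
    0x8000: "Header Log Overflow",
}


def decode_correctable_status(status: int) -> list[str]:
    """Return the names of all correctable-error bits set in `status`."""
    return [name for mask, name in COR_STATUS_BITS.items() if status & mask]
```

For example, a status value of `0x1041` has the Receiver Error, Bad TLP, and Replay Timer Timeout bits set.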
