PCI Express correctable errors handling (RAS) solution implementation considerations in Meta's AI/ML Training Clusters
PCI Exp Error Handling (RAS) Considerations

Carlos Fernandez, Hardware Systems Engineer, Meta
Anil Agrawal, Hardware Systems Engineer, Meta

SUSTAINABLE SCALABLE COMPUTATIONAL INFRASTRUCTURE
HW MGMT

PCIe Error Handling - Abstract
Meta's AI/ML Training Clusters are built using a large number of PCIe devices, including GPUs, NICs, NVMe storage, and PCIe switches. It is important to implement a robust fault handling (RAS) solution within this PCIe device hierarchy to meet target uptime, availability, and serviceability objectives. A high rate of PCIe correctable errors is expected. In this presentation, we share our learnings and an innovative solution we developed to manage PCIe correctable errors at this scale within Meta's AI/ML training clusters.

AI/ML Training Cluster 30K ft view

AI/ML Training Cluster - Overview
Reference: https:/

Teton Training - Overview
Reference:

Teton Training - Platform View
OAM: OCP Accelerator Module

Grand Teton Training Platform - PCIe Hierarchy

A Large PCIe Device Hierarchy: Increased PCIe Correctable Errors
B:D.F root_port, slot#, device present, power: On, speed 32GT/s, width x16
  B:D.F endpoint, CPU-NIC
B:D.F root_port, slot#, device present, power: On, speed 32GT/s, width x16
  B:D.F upstream_port, PCIe Gen 5 Switch
    B:D.F downstream_port, slot#, device present, speed 32GT/s, width x16
      B:D.F endpoint, IOX-NIC
    B:D.F downstream_port, slot#, device present, speed 8GT/s, width x4
      B:D.F endpoint, current speed 8GT/s, target speed 32GT/s
    B:D.F downstream_port, slot#, device present, speed 8GT/s, width x4
      B:D.F endpoint, IOX-SSD, current speed 8GT/s, target speed 16GT/s
    B:D.F downstream_port, slot#, device present, speed 32GT/s, width x16
      B:D.F endpoint, GPU
    B:D.F downstream_port, speed 32GT/s, width x16
      B:D.F endpoint, PCIe Gen 5 Switch
    B:D.F downstream_port, speed 32GT/s, width x16
      B:D.F endpoint, PCIe Switch management endpoint
B:D.F root_port, slot#, device present, speed 32GT/s, width x16
  B:D.F upstream_port, PCIe Gen 5 Switch
    B:D.F downstream_port, slot#, device present, speed 32GT/s, width x16
      B:D.F endpoint, IOX-NIC2
    B:D.F downstream_port, slot#, device present, speed 8GT/s, width x4
      B:D.F endpoint, current speed 8GT/s, target speed 32GT/s
    B:D.F downstream_port, slot#, device present, speed 8GT/s, width x4
      B:D.F endpoint, IOX-SSD, current speed 8GT/s, target speed 16GT/s
    B:D.F downstream_port, slot#, device present, speed 32GT/s, width x16
      B:D.F endpoint, GPU
    B:D.F downstream_port, speed 32GT/s, width x16
      B:D.F endpoint, PCIe Gen 5 Switch
    B:D.F downstream_port, speed 32GT/s, width x16
      B:D.F endpoint, PCIe Switch management endpoint

PCI Express Fault Domain and Coverage
[Diagram: a CPU Root Port and a PCIe Device, each with a Transaction Layer, Data Link Layer, and Physical Layer, connected by a PCIe Link; a PCI Express Switch with USP and DSP sits on the PCIe Link]
DSP: Downstream Port
USP: Upstream Port
DPC: Downstream Port Containment

PCIe Faults: Bit errors
  Error Class: Corrected
  Error Type: Rx Error, Bad TLP, Bad DLLP, Replay Timer Time-out, REPLAY_NUM Rollover, Corrected Internal, Header Log Overflow
  Fault Coverage (RAS Features): Error Reporting, Link CRC and Retry, Link Retraining and Recovery

PCIe Faults: Bus faults, Protocol errors
  Error Class: Uncorrected NonFatal
  Error Type: Unsupported Request, Completion Timeout, Completer Abort, Unexpected Completion, ACS Violation, Poison TLP Received, Poison TLP Egress
  Fault Coverage (RAS Features): Error Reporting, Contain the poisoned data, DPC

  Error Class: Uncorrected Fatal
  Error Type: Data Link Protocol Error, Surprise Link Down, Receiver Overflow, Flow Control Protocol, Malformed TLP, Uncorrectable Internal
  Fault Coverage (RAS Features): Error Reporting, DPC

PCIe Faults: Internal errors
  Error Class: Uncorrected NonFatal/Fatal
  Error Type: Bit flips in data buffers within the intermediate modules that remain undetected by the link integrity checks
  Fault Coverage (RAS Features): Vendor-specific Error Detection and Reporting, ECRC

Existing PCIe Correctable Error Reporting
PCIe CE is detected within HW
SMM Handler: 1. Logs SEL, 2. Notifies OS AER Handler
The PCIe CEs are logged in the BMC's NVRAM
PCIe Error Logger service polls and computes the error rate
DC Tooling polls periodically, checks the error rate against a threshold, and triggers an alarm
[Diagram: PCI Express Switch (USP, DSP), Platform Error Handler (SMM), BMC (SEL), PCIe Error Logger, Meta Specific Remediation Tooling; signals: SMI, SCI, Polling]
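
The last two steps of this flow (computing an error rate from polled data, then comparing it to a threshold) might be sketched as follows. This is a minimal illustration under stated assumptions, not Meta's tooling: the class name ErrorRateMonitor, the averaging window, and the threshold are hypothetical, and the input is assumed to be a cumulative correctable error count polled for a single PCIe device.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ErrorRateMonitor:
    """Hypothetical per-device monitor for polled correctable error counts."""

    window_seconds: float      # assumed averaging window
    threshold_per_hour: float  # assumed alarm threshold
    samples: deque = field(default_factory=deque)  # (timestamp, cumulative_count)

    def record(self, timestamp: float, cumulative_count: int) -> None:
        """Store one polled sample and drop samples older than the window."""
        self.samples.append((timestamp, cumulative_count))
        while timestamp - self.samples[0][0] > self.window_seconds:
            self.samples.popleft()

    def rate_per_hour(self) -> float:
        """Errors per hour between the oldest and newest sample in the window."""
        if len(self.samples) < 2:
            return 0.0
        (t0, c0), (t1, c1) = self.samples[0], self.samples[-1]
        elapsed = t1 - t0
        if elapsed <= 0:
            return 0.0
        # A counter that went backwards (e.g. after a reset) counts as no new errors.
        delta = max(c1 - c0, 0)
        return delta * 3600.0 / elapsed

    def should_alarm(self) -> bool:
        """True when the windowed rate exceeds the configured threshold."""
        return self.rate_per_hour() > self.threshold_per_hour
```

A poller would call record() once per polling interval for each device and raise an alarm when should_alarm() returns True. Using a windowed rate rather than the raw count keeps a device with a long but quiet history from tripping the alarm.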
PCI Express AER is configured in Firmware First mode. Both correctable and uncorrectable errors trigger an SMI.
Hardware triggers an SMI for every PCI Express correctable error. No threshold is used.
Risk of increased software latency and associated application stalls.

Proposed PCIe Correctable Error Reporting Solution
PCIe CE is detected within HW
Platform triggers MSI
OS AER driver handles the error and notifies Rasdaemon
Rasdaemon logs the PCIe CE in the BMC's NVRAM
PCIe Error Logger service polls and computes the error rate
Meta DC Tooling polls periodically, checks the error rate against a threshold, and triggers an alarm
[Diagram: PCI Express Switch (USP, DSP), Rasdaemon, Platform Error Handler (SMM), BMC (SEL), PCIe Error Logger, Meta Remediation Tooling; signals: MSI, SMI, SCI, Polling]
PCI Express AER is configured in a Meta platform-specific Hybrid mode. Correctable errors trigger an MSI and uncorrectable errors trigger an SMI.
Hardware triggers an MSI for every PCI Express correctable error.
The OS's AER handler and Rasdaemon handle the error and save the SEL in the existing BMC's NVRAM.

Next Steps
Develop a PCI Express Firmware Specification ECR to revise the FW/OS interface implementation (OSPM).
Work with ecosystem players to implement the changes in the platform FW, Linux OSPM, and AER drivers.
Contribute the "PCI Express Correctable Error Handling Requirements" to the OCP Hardware Fault Management Sub-project (part of the Hardware Management Project).

HW Fault MGMT
Thank you!