Increasing Energy Efficiency of Server Cooling over Traditional Methods with Deep Reinforcement Learning Agents Running on OCP-Compliant BMC Platforms

Raghu Kondapalli, Chief Technology Officer, Axiado Corporation
Sundaram Arumugasundaram, Principal Security Architect, Axiado Corporation
Zhichao Zhang, Principal Machine Learning Architect, Axiado Corporation

DC Sustainability

An AI/ML model for dynamic server fan speed control achieves better energy efficiency than traditional fan control methods. The model runs on the ML engine of a BMC chip.

Trusted Control/Compute Unit (TCU) Overview

The TCU chip consists of the following components:
1. App processors: cores for running applications such as the BMC, host vulnerability management, and extended detection and response (XDR) agents
2. Programmable AI engine to run ML models such as server thermal management
3. Smart-NIC for control/management-plane traffic such as BMC traffic
4. Hardware root of trust (HRoT) and TPM (Trusted Platform Module) to enhance server security
Axiado offers Smart-SCM, which is compliant with the Open Compute Project (OCP) Datacenter-ready Secure Control Module (DC-SCM) standard.

Next-Gen Thermal Management: The Power of ML on TCU/BMC Overview

AI-Powered Dynamic Thermal Management (DTM) from BMC: The BMC is ideal for server thermal management because of its existing role in various server management functions, including power control.
Faster Thermal Prediction and Calibration: The TCU collects sensor data directly, bypassing the host OS, which enables faster thermal prediction and fan speed calibration.
Rich Dataset for Decision Making: As an OCP DC-SCM compliant BMC, the TCU gathers comprehensive data from all chassis components (CPUs, GPUs, etc.) via diverse connections (I2C, eSPI, USB, PCIe), providing a rich dataset for optimal fan control decisions.
Dedicated ML Engine for the DTM-ML Model: Because the TCU's ML engine runs only the DTM-ML model, it offers timely inference and fan speed control.
Hardware-Based Security: Leveraging confidential computing and other security features, the TCU protects the DTM-ML model from potential vulnerabilities on the host OS, offering a more secure solution.
Proactive Management with PMC Data: The TCU uses CPU and GPU Performance Monitoring Counters (PMC) to proactively manage thermals based on workload demands.
Integration of Axiado's DTM-ML with open standards such as OpenBMC and DMTF's Redfish is a work in progress.

ML-DTM Result: Significant Energy Savings

[Chart: 1-year energy cost, $0 to $150, for DRL Agent, Medium Speed, and High Speed]

Fan Energy Savings: Energy savings up to 50%
Annual savings per server: $70
Annual savings for 100K servers: $7 million

Configuration       Fan cycles   kWh per hour   1-year cost ($0.10 per kWh)
TCU DTM-ML          Optimized    0.076          $67
Medium Fan Speed    65.5%        0.131          $115
High Fan Speed      80%          0.160          $140

DTM-ML Model Training and Deployment Details

Data Collection: Data is pulled from sensors every 5 seconds for six months. The model is trained on random workloads of diverse intensity with a massive data set.
Analysis and Prediction ML Type: DRL (Deep Reinforcement Learning)
Continuous Learning: Improves energy efficiency over time.
Surpasses PID Fan Controllers: Delivers superior results to PID controllers through broader dataset correlation. Unlike reactive PID controllers, it proactively adjusts fan speeds based on anticipated workload demands.

Benefits of DRL (Deep Reinforcement Learning)

DRL is an AI methodology that combines reinforcement learning and deep neural networks. By iteratively interacting with an environment and making choices that maximize cumulative rewards, agents learn sophisticated strategies and learn rules directly from sensory inputs, making use of deep learning's ability to extract complex features from unstructured data (https://www.geeksforgeeks.org/what-is-reinforcement-learning/).
DTM-DRL learns from the environment on its own, continuously improves the balance between temperature and energy usage, and proactively anticipates cooling needs based on workloads and other dynamic factors.

Dynamic Thermal Management with Proactive Fan Speed Control Through Reinforcement Learning

Reinforcement learning is solving many complex problems.
AlphaGo Zero: "Mastering the game of Go without human knowledge"
AI for Google data center cooling success story: https://blog.google/inside-google/infrastructure/safety-first-ai-autonomous-data-center-cooling-and-industrial-control/
One rule can't fit them all: to get the best savings, policies need to be learned from different environments, hardware, and workloads.
Reinforcement learning learns the meaning of states from the environment.
Rules don't get better over time, but AI does.

AI-Powered Dynamic Thermal Management (DTM)

Reinforcement learning redefines DTM by replacing heuristic human input with self-optimizing AI agents.
Human vs. AI Agency: From manually tuned protocols to AI-driven, Q-learning-based autonomous agents.
AI Superiority: The RL agents' predictive management cuts fan power by 40%.
Outcome: Autonomous agents offer continual learning, precision, and efficiency, redefining DTM in data-centric environments.
1. Precision Control: The RL model develops fine-tuned cooling algorithms, directly improving energy management.
2. Intelligent Adaptation: It swiftly adjusts to fluctuations, ensuring consistent performance under varying load conditions.
3. Sustainable Operations: It forecasts and adjusts to future demands, significantly reducing the carbon footprint and operating costs.

Diverse Policies Learned by Deep Reinforcement Learning

[Figure: Temperature and Fan Speed]

Server Room Temperature Monitoring

From the real-time temperature data manually collected in Axiado HQ's server room, it is evident that server temperatures vary significantly not only by rack position but also by time of day. For instance, the consistent decrease in temperatures from 10 PM to 7 AM across most servers suggests that ambient factors, possibly lower night-time room temperatures or reduced server activity, greatly influence server temperatures. Leveraging this data can inform efficient cooling, power usage, cost reduction, and other server optimization strategies.

Summary: Redefining Data Centers with AI-Driven DTM

TCU BMC Integration: A power-efficient, OCP-compliant BMC controller equipped with an on-chip NPU, requiring only 0.5 TOPS for this application. This integration is not just a step towards modernization but a leap towards cost-effective and green computing, given the necessity of BMC controllers in modern data centers.
Smart Scaling: Tailored AI dynamically adapts to diverse server configurations, ensuring optimal performance across any data center layout.
Operational Excellence Reimagined: Transition from traditional, labor-intensive methods to AI-driven strategies. Our real-world deployments demonstrate how integrating AI with real-time sensor data and machine learning not only enhances system reliability but also significantly reduces operational costs.
Energy Efficiency & Sustainability: Leveraging AI for real-time control of cooling systems results in up to 40% savings on cooling energy costs. This approach not only cuts energy bills but also substantially reduces the carbon footprint, contributing to greener data center operations.

AI-Driven DTM PUE Impact

Achieving up to 18.6% PUE improvement
After reallocating fan power, the PUE changes from 1.09 to 1.61.
With a 5% reduction in fan power, the new PUE is 1.57.
With a 50% reduction in fan power, the new PUE is 1.31, an improvement of (1.61 - 1.31) / 1.61 ≈ 18.6%.
Air cooling at the server level is widely used in data centers, particularly for configurations up to 10 kW per rack. While alternative cooling methods are gaining traction for higher-density setups, server-level air cooling remains a common, cost-effective choice.

Call for Action

Problem to Solve: Let's collaborate to create ML-based fan speed control as part of OCP, OpenBMC, and DMTF and save energy.
How to Get Involved in the Project: By piloting the deployment of the DTM-ML model in your data center.
Timeline for Contribution Availability: From now to end of 2025
Timeline for Product Availability: From now to end of 2025
Where to Find Additional Information (URL links): Work in progress

Thank you!
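
As a rough check of the energy-savings figures, the 1-year costs follow from the kWh-per-hour values at $0.10 per kWh over 8,760 hours. The Python sketch below reproduces that arithmetic. The variable names are illustrative, and comparing against the High Fan Speed baseline is an assumption about how the quoted savings were derived.

```python
HOURS_PER_YEAR = 24 * 365   # 8,760 hours, ignoring leap years
PRICE_PER_KWH = 0.10        # USD per kWh, the rate quoted with the results

# Average fan energy draw in kWh per hour, taken from the results above.
fan_kwh_per_hour = {
    "TCU DTM-ML": 0.076,
    "Medium Fan Speed": 0.131,
    "High Fan Speed": 0.160,
}


def annual_cost(kwh_per_hour: float) -> float:
    """Yearly energy cost in USD for a constant average draw."""
    return kwh_per_hour * HOURS_PER_YEAR * PRICE_PER_KWH


costs = {name: annual_cost(draw) for name, draw in fan_kwh_per_hour.items()}
for name, cost in costs.items():
    print(f"{name}: ${cost:.0f} per year")

# Assumption: savings are measured against the High Fan Speed baseline.
saving_per_server = costs["High Fan Speed"] - costs["TCU DTM-ML"]
saving_fraction = saving_per_server / costs["High Fan Speed"]
fleet_saving = saving_per_server * 100_000
print(f"Saving per server: ${saving_per_server:.0f} ({saving_fraction:.0%})")
print(f"Saving for 100K servers: ${fleet_saving / 1e6:.1f} million")
```

Under that assumption the saving is about $74 per server per year (roughly 52%), or about $7.4 million for 100K servers, in line with the rounded figures of $70, up to 50%, and $7 million.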
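
To make the idea of a Q-learning-based fan agent more concrete, here is a minimal toy sketch in Python. It is not Axiado's DTM-ML model: it uses tabular Q-learning rather than a deep network, and the thermal model, temperature buckets, fan costs, overheat penalty, and hyperparameters are all hypothetical values chosen for illustration. The reward trades off fan energy against overheating, which is the balance the DTM-DRL description refers to.

```python
import random

N_TEMPS = 6                  # hypothetical temperature buckets: 0 (cool) .. 5 (overheated)
FAN_COST = (0.0, 1.0, 3.0)   # hypothetical energy cost of fan levels 0, 1, 2
OVERHEAT_PENALTY = 20.0      # hypothetical penalty for landing in the hottest bucket
ALPHA, GAMMA, EPSILON = 0.02, 0.9, 0.2   # learning rate, discount, exploration rate


def step(temp, fan, rng):
    """Toy thermal model: the workload adds 0-2 buckets of heat, the fan removes `fan`."""
    heat = rng.randint(0, 2)
    next_temp = min(max(temp + heat - fan, 0), N_TEMPS - 1)
    reward = -FAN_COST[fan]
    if next_temp == N_TEMPS - 1:
        reward -= OVERHEAT_PENALTY
    return next_temp, reward


def train(episodes=20_000, steps_per_episode=20, seed=0):
    """Tabular Q-learning with epsilon-greedy exploration and random start states."""
    rng = random.Random(seed)
    q = [[0.0] * len(FAN_COST) for _ in range(N_TEMPS)]
    for _ in range(episodes):
        temp = rng.randrange(N_TEMPS)   # random start so every bucket gets visited
        for _ in range(steps_per_episode):
            if rng.random() < EPSILON:
                fan = rng.randrange(len(FAN_COST))
            else:
                fan = max(range(len(FAN_COST)), key=lambda a: q[temp][a])
            next_temp, reward = step(temp, fan, rng)
            # Standard Q-learning update toward reward plus discounted best next value.
            target = reward + GAMMA * max(q[next_temp])
            q[temp][fan] += ALPHA * (target - q[temp][fan])
            temp = next_temp
    return q


q_table = train()
# Greedy policy: the fan level with the highest learned value in each bucket.
policy = [max(range(len(FAN_COST)), key=lambda a: q_table[t][a]) for t in range(N_TEMPS)]
print("fan level per temperature bucket:", policy)
```

In this toy setting the learned policy keeps the fan low when the server is cool and raises it as the temperature approaches the overheat bucket, without that rule being written by hand. A production agent would replace the table with a neural network and the toy model with real sensor and PMC data.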