Beyond Compute: Enabling AI Through System Integration

Computing: Input Data -> Processing -> Useful Outputs.
Machines across history: Abacus, Difference Engine, ENIAC, Calculator, Cray-1, Personal Computer, Laptop Computer, Smart Phone Computer, Consoles, IoT Computer, Datacenters, Training Server, Training Datacenters, FSD Computer, Human Computer. (Source: Google Images)

Big Data - Data Explosion
[Figure: volume of data in zettabytes, 2010-2025. Reality and new projections have far outpaced the projections made in 2011.]

Corporate Data Type Changes
More than 80% of data is unstructured in nature. [Figure: worldwide corporate data growth in exabytes, 2010-2025, structured vs. unstructured.] 80% of data growth is unstructured, calling for new types of processing alongside traditional processing to turn data into products and profits. (Source: IDC)

Software 1.0 -> Software 2.0: hand-written algorithmic code gives way to machine learning (ML)/AI.

"The history of numerical weather prediction and climate simulation is almost exactly coincident with the history of digital computing itself." - V. Balaji, Climbing down Charney's ladder: machine learning and the post-Dennard era of computational climate science

Climate Super-Computing Architectures Over Time
[Figure: history of computational power at the NOAA Geophysical Fluid Dynamics Laboratory (GFDL), 1950-2020, measured in aggregate floating point operations per second (log scale) against years of delivery. Architectures progress from scalar (IBM 701, 704, 7090, 7030/Stretch, CDC 6600, UNIVAC 1108, IBM 360/91, 360/195) through vector (TI ASC, CDC CYBER 205 x2) and parallel vector (CRAY Y-MP, C90, T3E, T90/16, T90/24, T90/30) to scalable (SGI clusters 01/02/04, R&D HPCS and upgrade, NOAA ORNL and upgrade), with ML/AI next. Milestones: first simulation of water, cloud-radiation, and ice-albedo feedbacks; first GFDL hurricane model; the coupled model on which the first IPCC assessment was based; first simulation of the chemistry-transport-radiation of the Antarctic ozone hole; first estimate of the effects of 2x CO2; GFDL hurricane model goes operational; interactive atmospheric chemistry and aerosols; simulation and prediction of Category 4-5 hurricanes; projections of fish stocks under climate change; FV3 models used both for climate studies and US operational weather forecasting; coupled ENSO forecasts; attribution of the role of ozone depletion in climate change.] (Source: V. Balaji, Princeton-NOAA: Climbing down Charney's ladder)

"this represents a sea change in computational Earth system science that rivals the von Neumann revolution." - on the increase of machine learning techniques in climate computing, as quoted in the same paper

Traditional Data - New Processing Methods
ClimateAI, a pioneer in applying artificial intelligence to climate risk modeling, today announced its team has solved a critical weather forecasting challenge, leveraging advances in AI to improve weather and climate forecasts. "Artificial intelligence and machine learning breakthroughs are changing weather forecasting, and resource-heavy regional weather models might soon be completely replaced by machine learning approaches." - Dr. Stephan Rasp, Lead Data Scientist, ClimateAI (Sources: ClimateAI; Los Alamos National Lab)

Real World Data: only machine learning techniques can enable these exploding ML use cases and new data types.

Computing Architecture Categories
Accelerators handle strictly structured data; Algorithmic
Computers handle semi-structured data; Learning Computers handle unstructured data of any type. [Figure: the machines from the opening timeline, regrouped into these three categories.]

Processing for Learning: AI - ML - DL
AI: tasks requiring near-human intelligence in real-world settings. ML: a subset of AI that performs specific tasks by learning from data and making predictions. DL: a subset of ML using deep neural network architectures. (Image courtesy: Buffaloboy)

Why Do We Need a Different Compute Platform?
Traditional computers: Input Data + Program Logic -> Useful Outputs.
Learning computers: in Training, Input Data + Output Data -> Trained Logic; in Inference, Input Data + Trained Logic -> Useful Outputs.

Exponential Rate Gaps in Training Systems
[Figure: AI model compute needs (log scale) vs. Moore's law - a huge gap. Source: OpenAI]

Designing Solutions for AI-Level Needs
AI-adept systems span datasets, models, compute, scale, SW, and HW.

Typical ML Training Flows: define goals, dataset(s), model(s), and desired outputs; iterative flows with high manual effort and human reviews.

Data Labeling
2D image labeling of real-world inputs gives way to 4D (space + time) labeling: label once, and it simultaneously labels all cameras at many frames - 100x labeling throughput. Behind it: a 1,000-person in-house data labeling team and fully custom-built data labeling & analytics infrastructure. [Figure: number of labels growing over time.]

(Source: Meta AI - How Facebook Annotates Multimodal Training Data for ML) "We refer to this framework as Human-AI loop (Halo)... researchers can streamline annotation tasks, visualize the results and accuracy metrics of annotations, and export the annotations to start their training modules."

Historical Parallels
Eliminating manual effort led to the advent of programmable computers: Colossus, ENIAC, ACE. "Once the human element is eliminated, the increase in speed is enormous." - Alan Turing, as stated in his report from the early 1940s (Source: Jack Copeland, Alan Turing's Electronic Brain)
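The earlier training-vs-inference distinction (Input Data + Output Data -> Trained Logic; Input Data + Trained Logic -> Useful Outputs) can be sketched in a few lines. This is a toy illustration in plain NumPy - the function names and the linear "logic" are our choices, not anything from the talk:

```python
import numpy as np

# Training: Input Data + Output Data -> Trained Logic.
# Here the "trained logic" is just a least-squares linear map.
def train(inputs, outputs):
    w, *_ = np.linalg.lstsq(inputs, outputs, rcond=None)
    return w

# Inference: Input Data + Trained Logic -> Useful Outputs.
def infer(inputs, trained_logic):
    return inputs @ trained_logic

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([[2.0], [-1.0], [0.5]])
y = X @ true_w                              # paired output data to learn from

w = train(X, y)                             # training phase
preds = infer(rng.normal(size=(5, 3)), w)   # inference on fresh inputs
print(np.allclose(w, true_w))               # logic was learned, not programmed
```

The point of the sketch: nobody wrote program logic mapping X to y; the mapping was recovered from example pairs, which is exactly what makes the compute pattern different from traditional computers.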
How do we do this for AI? A chicken-and-egg problem? The solution is in the ML/AI space itself: recursive loops - use offline models for real-world dataset curation and labeling, with reduced dependence on humans in the loop. [Figure: network complexity (log scale) spans many orders of magnitude, from ResNet up to the automation networks.]

Dataset Labeling Automation
Auto-labeling pipeline: clip videos, GPS, IMU, and odometry feed offline neural networks that reconstruct the ego trajectory and the static world, plus moving objects and their kinematics, and emit labels.

Ask the Test Fleet for Interesting Clips - an example trigger:

  name: cipv-low-vis
  requester: img-vid-cipv-low-vis-seq
  description: Low visibility with a CIPV
  query: $and of (wrapped in $decimate/$conv):
    $eq: active-gear, 4                                      // in drive
    $not: VisionSceneTags.main.scene_tag_array13.activated   // GARAGE_DOOR_CLOSED
    $not: VisionSceneTags.main.scene_tag_array15.activated   // INDOOR
    $gt: TelemetryOutput.distance_travelled_m, 1000
    $not: lss_app.right_lane.lane_change                     // no right lane change
    $not: lss_app.left_lane.lane_change                      // no left lane change
    $not: moving_object_output0.cutin_active_in_scene        // no cut-in
    $lt: moving_object_output0.max_region_tag_cutin_prob, 0.1
    $lt: moving_object_output2.max_region_tag_cutin_prob, 0.1
    $gt: veh-speed-mps, 2.2                                  // 5 mph
  h: 1,1,1,1,1,1,1,1,1,1    // 10 s
  N: 50                     // 1 s period
  stateless-child: true

And the Test Fleet Giveth Back: 10k such clips were collected and automatically labeled within a week.

Investments for Offline Dataset Speedup
[Figure: labeling effort over time - manual effort flattens while automation, backed by more ML resources, takes over.]

AI-Adept Systems: datasets, models, compute, scale, SW, HW.
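The trigger query above is a Mongo-style predicate tree over vehicle signals ($and/$not/$eq/$gt/$lt). A minimal evaluator for just that operator subset might look like the sketch below - the evaluator, its signal names, and the cut-down query are our illustration, not the fleet's actual software:

```python
# Minimal evaluator for a Mongo-style trigger-query subset:
# $and / $not / $eq / $gt / $lt over named vehicle signals.
def matches(query, signals):
    op, arg = next(iter(query.items()))
    if op == "$and":
        return all(matches(q, signals) for q in arg)
    if op == "$not":
        return not matches(arg, signals)
    field, value = arg                   # leaf operators compare one signal
    if op == "$eq":
        return signals[field] == value
    if op == "$gt":
        return signals[field] > value
    if op == "$lt":
        return signals[field] < value
    raise ValueError(f"unknown operator {op}")

# A cut-down version of the "low visibility with a CIPV" trigger.
query = {"$and": [
    {"$eq": ("active-gear", 4)},                    # in drive
    {"$gt": ("distance_travelled_m", 1000)},
    {"$lt": ("max_region_tag_cutin_prob", 0.1)},    # cut-in unlikely
    {"$gt": ("veh-speed-mps", 2.2)},                # above ~5 mph
    {"$not": {"$eq": ("indoor", True)}},
]}

clip = {"active-gear": 4, "distance_travelled_m": 2500,
        "max_region_tag_cutin_prob": 0.02, "veh-speed-mps": 14.0,
        "indoor": False}
print(matches(query, clip))
```

Each car evaluates predicates like this against its live signals and uploads only the clips that match, which is how 10k targeted clips can arrive within a week.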
Video Training Modules
A recurrent neural net extends a 20x80x300 (W x H x C) feature volume to 20x80x300x12 (W x H x C x T). A transformer over the same 20x80x300x12 volume attends with multi-head self-attention (MHSA) over key/query/value and a read-out token. A 3D-convolution stack likewise turns 20x80x300 (W x H x C) into 20x80x300x12 (W x H x C x T).

Many More in Research: knowledge graphs, semantic networks, ART networks, multi-modal AI, BigGAN, transfer learning, data-centric AI. (Image source: Google Images)

Exponential Rate Gaps in Training Systems - The Reality
[Figure: effective compute over time, log scale, roughly 2012-2019. Moore's law alone delivers ~8x; algorithmic efficiency delivers ~25x; scale and dollars deliver ~37,500x, via scale-up flexibility and scale-out. Source: D. Hernandez, T. Brown, OpenAI: Measuring the Algorithmic Efficiency of Neural Networks]

Climbing Up to AI
2005-2010: big data. 2010-2018: big data + big compute. 2018-now: big data + big compute + big models = machine learning. Climbing from ML up to AI adds real-world data, better compute, and better models.

AI System Traits: flexible compute, real-world datasets, gigantic models, huge scale-out, real-time performance - feed the beast.
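The 8x / 25x / 37,500x growth factors cited above imply very different doubling rates. The arithmetic below is ours, taking the slide's factors at face value over roughly the 2012-2019 span of the OpenAI chart:

```python
import math

YEARS = 7  # roughly 2012-2019, per the OpenAI chart

def doubling_time_months(total_factor, years=YEARS):
    """Months per 2x, given total growth over the period."""
    doublings = math.log2(total_factor)
    return 12 * years / doublings

moore = doubling_time_months(8)         # hardware alone
algo = doubling_time_months(25)         # algorithmic efficiency
all_in = doubling_time_months(37_500)   # scale & dollars included

print(f"Moore's law alone: {moore:.0f} months per doubling")
print(f"Algorithmic efficiency: {algo:.1f} months per doubling")
print(f"Scale & $s: {all_in:.1f} months per doubling")
```

Since the factors multiply, 8x hardware times 25x algorithms accounts for only 200x; the remaining ~187x of the 37,500x had to come from spending and scale-out - which is why training-system architecture, not transistor scaling, is the lever.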
Traditional Hierarchies: chip -> package -> boards -> boxes -> racks -> datacenter/buildings. (Example hierarchy sources: Graphcore; Nvidia)

Traditional BW & Latency Scaling Discontinuities
[Figure: bandwidth and latency degrade by orders of magnitude stepping from chip to package to system.]

Traditional Hierarchy Power
[Figure: pJ per 64-bit operation in 2008 (45 nm) and 2018 (11 nm), spanning roughly 1 to 10,000 pJ from a DP flop and register access, through 1 mm, 5 mm, and 15 mm on-chip wires, to off-chip/DRAM, local interconnect, and cross-system transfers. Source: Kogge & Shalf, article in Computing in Science & Engineering]

Mitigation for Integration Hierarchy Discontinuities: reticle-sized dies; 2-3x reticle-sized interposers/EMIBs/MCMs.

Power Trend
[Figure: GPU TDP trend, roughly 250 W to 700 W across 2012-2022.]

Cooling Difficulties
[Figure: inverse thermal resistance vs. year of introduction - the 2000-2010 single-core trend, the 2011-2017 multi-core trend, and the 2018-2025 power-war trend; air-cooled form factors grow from 3U (135 mm) in 2016 to 8U (356 mm) in 2022. Source: white paper on the emergence and expansion of liquid cooling in mainstream data centers]

Lateral Power Delivery Challenges
[Figure: process Vmin falling from ~1.00 V toward ~0.50 V across process generations while currents climb. Sources: Xu, Power Delivery in High Current 3-D Systems; photo by Stephen Shankland, CNET]

Tech Scaling Gaps
[Figure: performance/capabilities over time for compute, memory, communication, and storage. Traditional systems and edge compute fit within these scaling curves; AI training does not.]
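The pJ-per-operation hierarchy shown earlier is the core of the integration argument: compute is cheap, traversal is expensive. The numbers below are illustrative ballpark values in the spirit of the Kogge & Shalf data, not the chart's exact figures:

```python
# Illustrative pJ per 64-bit transaction (ballpark values, NOT the
# chart's exact data): cost grows as data crosses hierarchy levels.
PJ_PER_OP = {
    "DP flop":         20,
    "register":         2,
    "15 mm on-chip":   50,
    "off-chip/DRAM": 1500,
    "cross-system": 10000,
}

flop = PJ_PER_OP["DP flop"]
for where, pj in PJ_PER_OP.items():
    print(f"{where:>14}: {pj:>6} pJ  ({pj / flop:>6.1f}x a DP flop)")

# Corollary: a 400 W budget spent purely on DRAM traffic buys only
# 400 J/s / (1500e-12 J per 64-bit word) * 8 bytes ~= 2.1 TB/s.
dram_bw_at_400w = 400 / (PJ_PER_OP["off-chip/DRAM"] * 1e-12) * 8  # bytes/s
print(f"400 W of pure DRAM traffic: {dram_bw_at_400w / 1e12:.1f} TB/s")
```

Under these assumptions a cross-system transfer costs ~500x a flop, so flattening the hierarchy (keeping traffic on-wafer rather than cross-system) recovers orders of magnitude of efficiency without any new transistors.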
Scaling Gaps for Efficient Scale-Out: BW & latency losses; power to traverse hierarchies; device vs. I/O scaling; integration platform constraints; cooling needs; power delivery. We need integrated solutions designed with the whole system in mind - discrete, chip-centric approaches severely limit the value proposition.

Datacenter Arch Evolution
[Figure: servers evolve from CPUs with system memory and PCIe slots, to PCIe switches feeding accelerators with attached memory, to accelerator/memory groups joined by a dedicated switch, to switch fabrics linking those groups across servers.]

Network tradeoffs: Internet-optimized fabrics target latencies, packet sizes, and bandwidth but carry legacy-support baggage; HPC/CPU-centric fabrics are rack-to-rack optimized for latencies and bandwidth; AI-centric fabrics optimize latencies, bandwidth, and packet sizes for training traffic.

Beyond Compute - Communication Focus
[Figure: an NVIDIA H100 system - H100 GPUs joined by Gen-4 NVLink to NV Switches, with an NVLink network over OSFP connectors and NVLink cables, and PCIe Gen 5 x16 to the host. Source: NVIDIA, spring conference 2021]

Next-Generation ML Infrastructure for Large Model Training
"TPU v4 chips are networked together into a Cloud TPU v4 pod by ultra-fast interconnect that provides 10x the bandwidth per chip at scale compared to typical GPU-based large scale training systems. Large models are very communication intensive: local computation often depends on results from remote computation that are communicated across the network. TPU v4's ultra-fast interconnect has an outsized impact on computational efficiency of large models by eliminating latency and congestion in the network." (Source: Google)

[Figure: the Tenstorrent Wormhole chip - Tensix cores with Ethernet and NoC integration.]
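Why "large models are very communication intensive" can be made concrete with a back-of-envelope gradient all-reduce. The model below uses the classic ring all-reduce traffic formula; the parameter count, device count, and link speeds are our assumptions, not TPU v4 or NVLink specifications:

```python
def ring_allreduce_seconds(param_bytes, n_devices, link_bytes_per_s):
    """Classic ring all-reduce: each device sends and receives
    2*(N-1)/N of the buffer, pipelined over its link."""
    traffic = 2 * (n_devices - 1) / n_devices * param_bytes
    return traffic / link_bytes_per_s

grad_bytes = 10e9 * 2  # assumed: 10B parameters of BF16 gradients = 20 GB
for bw_gbps in (50, 400, 900):  # assumed per-device link speeds, GB/s
    t = ring_allreduce_seconds(grad_bytes, 64, bw_gbps * 1e9)
    print(f"{bw_gbps:>4} GB/s links: {t * 1e3:7.1f} ms per all-reduce")
```

Every optimizer step pays this cost, so once step times drop below a second the fabric - not the FLOPs - sets training throughput. That is why per-chip bandwidth at scale is the headline claim for these interconnects.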
Accelerating the AI Accelerators
[Figure: achieved performance over time. The rate of change from typical HPC improvements - traditional integration, communication networks, cooling, et al. - leaves numerous opportunities to do better, quicker.]

Can We Do Better?
Widen the aperture beyond chips and into systems. Reduce the drag coefficient from the traditional hierarchies. A clean abstraction exists from frameworks down to the underlying HW. Enable flexible rather than fixed ratios of compute, memory, and I/O through disaggregation. Concentrate on the full solution stack. What would you do if designing from first principles for AI?

Dojo D1 Chip
645 mm2 in 7 nm technology; 50 billion transistors; 11+ miles of wires; 362 TFLOPs BF16/CFP8 and 22.6 TFLOPs FP32; 10 TB/s/dir on-chip bandwidth; 4 TB/s/edge off-chip bandwidth; 400 W TDP.

Dojo Unique Innovation: Flattened Hierarchies
Instead of silicon wafer -> known good dies (test & sort) -> package -> PCB, Dojo takes the known good dies into a new integration: a reconstructed fan-out wafer.

Training Tile: 25x D1 dies; 9 PFLOPs (BF16/CFP8); 40 TB/s bisection bandwidth (X+Y); 36 TB/s I/O bandwidth. Vertical power delivery: [figure: impedance vs. frequency, vertical vs. lateral delivery].

Disaggregated Scalable System
Tiles, interface processors, and network interfaces let compute, I/O, and memory scale independently, from tile up to datacenter/buildings.

Feeding the Beast(s)
In traditional ML model fitting, models have to fit into each accelerator. Gigantic models break this: they spill to system memory or a local pool of DRAMs, and we can't assume future models will fit into one accelerator. Gigantic models should be designed from the get-go to split across multiple chips and tiles.
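Splitting a model across chips can be pictured with a column-parallel linear layer: each device holds one shard of the weights, computes its slice of the output, and a gather across the fabric reassembles the activation. This NumPy sketch stands in for real accelerators; all names are ours:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 512))      # activations, batch of 4
W = rng.normal(size=(512, 2048))   # a layer "too big for one chip"

# Column-parallel split: each of 4 "chips" stores a 512x512 shard of W.
shards = np.split(W, 4, axis=1)

# Each chip computes only its slice of the output...
partials = [x @ w_shard for w_shard in shards]

# ...and an all-gather across the fabric reassembles the activation.
y_parallel = np.concatenate(partials, axis=1)

y_single = x @ W                   # reference: one giant chip
print(np.allclose(y_parallel, y_single))
```

The math is unchanged; what changes is that every layer now ends with fabric traffic, which is why the interconnect dominates the design of systems built for models that cannot fit on one accelerator.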
(Source: Google Research, PaLM: Scaling Language Modeling with Pathways)

AI-Focused System Architectures: the datacenter network is part of the machine.

"A large body of programming must be completed beforehand, if any serious work is to be done on the machine when it is made." - Alan Turing

Ease of programmability is essential: fully flexible, compiler-friendly, yet high-performance architectures.

Software Stack
Neural net models enter through PyTorch with a Dojo extension; a JIT NN compiler (Dojo compiler engine with an LLVM backend) lowers them; Dojo drivers handle multi-host, multi-partition management; Dojo interface processors handle ingest and shared memory over PCIe and SerDes, feeding the ExaPOD.

HW Help for SW Stacks: compiler-friendly ISAs; HW sync/barriers; flexible state machines for ML layers; fire-and-forget communication protocols; fault tolerance.
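Of the hardware helps listed above, sync/barriers are the easiest to picture: every participant blocks until all have arrived, then all proceed together into the next phase. A software stand-in using Python threads (illustrative only - Dojo's barriers are hardware mechanisms, and this toy is our construction):

```python
import threading

N = 4
barrier = threading.Barrier(N)
order = []
lock = threading.Lock()

def worker(i):
    with lock:
        order.append(("compute", i))   # local work, completes in any order
    barrier.wait()                     # nobody passes until all N arrive
    with lock:
        order.append(("exchange", i))  # post-barrier phase, e.g. comms

threads = [threading.Thread(target=worker, args=(i,)) for i in range(N)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Every "compute" entry precedes every "exchange" entry.
print(all(phase == "compute" for phase, _ in order[:N]))
```

Doing this in hardware removes the OS and driver round-trips from the critical path, which matters when thousands of cores synchronize every few microseconds.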
Software Stack Opportunities
Take advantage of clean abstraction layers for ML. Massively parallel architectures need changes: more compilers, fewer kernels; a renewed focus on distributed compiler technology; reduced OS roles.

Beyond Compute: Many Aspects - One Goal
Scale-out: seamless scale-out fabric. Communication: terabytes per second. Memory: GBs/TBs. Disaggregation: ratios move with workloads. Compiler technologies: truly distributed compilers. Networking topologies: TTP/TTPoE. Framework enhancements: Dojo-PyTorch. Data types: CFP formats for efficiency. Compute: CPU + GPU + NPU + NNA.

Beyond Compute: Scale-Out From ML to AI!
ExaPod: job-specific sizing of resources; a tightly integrated yet disaggregated system.

Innovation Opportunities Beyond Compute for AI Systems
Architectures: scale-out-focused parallel HW architectures. Integration: reduce the tax that traditional hierarchies impose on performance and power. Disaggregation: alterable ratios of compute/memory/communication/storage. Abstractions: take advantage of the clean abstraction layers of frameworks. Algorithms: flexibility of compute to adapt to new algorithms and workloads. Compilers: explore and revive distributed compiler technologies. Above all: a design approach that explores the full solution space across system and software.

Next Phase in Computing Evolution: INTEGRATED SYSTEMS
[Figure: 122 years of Moore's law - not on the same scale as the chart above.]
Learn more at ...
Thank You!