
ISSCC 2024, Session 11: Industry Invited

11.1: AMD Instinct MI300 Series Modular Chiplet Package HPC and AI Accelerator for Exa-Class Systems

Alan Smith, Eric Chapman, Chintan Patel, Raja Swaminathan, John Wuu, Tyrone Huang, Wonjun Jung, Alexander Kaganov, Hugh McIntyre, Ramon Mangaser

Outline
- Overview and motivation
- Logical and physical organization
- Chiplet modular construction
- Advanced packaging
- Power and thermal management
- Generational scaling

The AMD Instinct Accelerator Journey
Multiple generations of architecture focused on advancing HPC and AI compute:
- MI100 (AMD CDNA): 1st domain-specific GPU accelerator for compute. Dedicated acceleration for compute datatypes (FP64 | FP32 | FP16) for HPC and AI.
- MI200 (AMD CDNA 2): 1st MCM GPU accelerator. Denser compute architecture with leading memory capacity and bandwidth.
- MI300 (AMD CDNA 3): 1st chiplet-architecture GPU accelerator. Focused improvements on unified memory, AI data-format performance, and in-node networking.

MI300 GPU Accelerator Overview
Architected to deliver maximum HPC and AI capability from the latest silicon and advanced packaging technology.
- AMD Instinct MI300X: 304 CDNA 3 CUs, 192 GB HBM3 memory, 5.3 TB/s memory bandwidth; CPU-hosted PCIe accelerator device.
- AMD Instinct MI300A: 228 CDNA 3 CUs plus 24 "Zen 4" CPU cores, 128 GB HBM3 memory, 5.3 TB/s memory bandwidth; self-hosted accelerated processing unit (APU).

3D + 2.5D Packaging Motivation
[Figure: relative bits/joule for off-package copper, off-package optical, on-package 2.5D (HBM), and 3D-stacked interconnect; ISSCC 2020, ISSCC 2022]
- The key to power-efficient performance is tight integration.
- Advanced 3D hybrid bonding provides, by orders of magnitude, the densest, most power-efficient chiplet interconnect.
- Advanced 2.5D enables more compute and HBM in a package.
- Increased system-level efficiency.

AMD Instinct MI300A Modular Chiplet Package
- 24 "Zen 4" cores [ISSCC 2023]
- 228 AMD CDNA 3 compute units
- 3D hybrid bonding; 2.5D silicon interposer
- 128-channel HBM3 interface; 8 physical stacks; MI300A: 128 GB (8-high)
- 256 MB AMD Infinity Cache
- Infinity Fabric network-on-chip
- 2 x16 PCIe 5 + 4th-gen Infinity Fabric links; 6 x16 4th-gen Infinity Fabric links

MI300X Block Diagram and Chiplet Design
12 chiplets as a single device: 4 IOD + 8 XCD.
[Block diagram and chip image: HBM, XCD, IOD, I/O]

AMD Instinct MI300 Accelerator Modular Construction
- Multi-variant (MI300A/MI300X) architecture requires all chiplets to act as if they are LEGO blocks.
- Many new construction and analysis tools needed to be developed to enable this capability.
- Mirrored versions of the IODs enable symmetric construction (IOD, IOD R180, IOD-Mirror, IOD-Mirror R180; XCD and CCD placed at 0 and 180 degrees).
- Four unique tape-outs: two IODs (mirrored), one XCD, and one CCD.

3D Hybrid Bonding Evolved
AMD 3D V-Cache technology:
- Hybrid bonding size: 7 x 10 mm; logic die as base; N7 (X3D) on N5 base (CCD) die.
- Significant performance gains for desktop gaming and servers.
- Up to 2.5 TB/s vertical bandwidth.
AMD Instinct MI300 accelerator:
- Leverages integration and manufacturing learnings from V-Cache.
- Hybrid bonding size: 13 x 29 mm (0.45x reticle); logic die on top enables improved thermals.
- N5 XCD/CCD stacked on N6 base die (IOD); same 9 µm TSV pitch.
- Up to 17 TB/s vertical bandwidth.

MI300 Advanced Packaging
[Cross-section illustration (for illustration purposes only): carrier Si, XCD/CCD dies and dummy (DMY) spacers on IODs, HBM stacks, silicon interposer, LGA pads, lid with BSM + TIM]
- Advanced 3D hybrid-bonded architecture for compute density and perf/W.
- Advanced 2.5D architecture for IOD-IOD and HBM3 integration.
- Large module on substrate.
- BPM / bond-pad via (BPV) on thick Al vs. Cu.

USR Clocking and TX FIFO
Ultra-short-reach (USR) links demonstrated 0.6 UI eye width at 10^-12 BER.

AMD Instinct MI300 Accelerator Floorplan: Power TSVs
Power delivery to the top die must support:
- IOD mirroring
- XCD/CCD rotation (0 and 180 degrees)
- Different stacked die (CCD and XCD)
This placed new symmetry requirements on the power grid. Significant advanced planning to ensure exact alignment of all power and ground TSVs + BPVs.
[Figure: XCDs and CCDs on the base chiplet (AID), showing the BPVs, TSVs, and micro-bumps of the hybrid-bonded stacked chiplets (SC)]

AMD Instinct MI300 Power Delivery
- XCD and IOD are on split power planes.
- Two XCDs share one plane; XCD groups are managed independently.
- IODs share a unified power plane.
- XCD power is delivered through IOD TSVs.
- USR is on an independent power plane.

AMD Instinct MI300 Accelerator Power Management and Heat Extraction
- Key to MI300 power efficiency is the ability to dynamically "slosh" power between fabric (IOD), GPU (XCD), and CPU (CCD).
- Massive HBM and Infinity Cache bandwidth can drive high data-movement power in the SoC domain; compute capability can similarly consume high power.
- This creates two types of extreme operating conditions: GPU-intensive and memory-intensive.
- Both thermal and power delivery must support the full range, requiring careful engineering of TSVs and the power map.
[Illustration (for illustration purposes only): power sharing between CPU CCDs, GPU XCDs, AID, and HBM across CPU-intensive, GPU-intensive, CPU+GPU-balanced, and memory-intensive workloads]
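The power-sloshing behavior described above can be illustrated with a small allocation loop. This is a minimal sketch, not AMD's power-management firmware: the domain floor values and the proportional-sharing rule are assumptions made for illustration; only the 760 W package figure comes from the table later in this talk.

```python
# Minimal sketch of power "sloshing" between domains under one package cap.
# Hypothetical policy and floor values; AMD's actual algorithm is not described here.
PACKAGE_LIMIT_W = 760  # MI300A package power from the generational-scaling table

# Baseline (floor) power each domain always receives, in watts (assumed).
floors = {"CCD (CPU)": 60, "XCD (GPU)": 200, "IOD (fabric/HBM)": 120}

def slosh(demands):
    """Grant each domain its floor, then split the remaining headroom
    in proportion to unmet demand (a simple illustrative policy)."""
    grants = dict(floors)
    headroom = PACKAGE_LIMIT_W - sum(floors.values())
    unmet = {d: max(demands[d] - floors[d], 0) for d in floors}
    total_unmet = sum(unmet.values()) or 1
    for d in floors:
        grants[d] += headroom * unmet[d] / total_unmet
    return grants

# GPU-intensive phase: the XCDs pull most of the budget.
print(slosh({"CCD (CPU)": 80, "XCD (GPU)": 600, "IOD (fabric/HBM)": 150}))
# Memory-intensive phase: data movement shifts power toward the IOD/HBM domain.
print(slosh({"CCD (CPU)": 80, "XCD (GPU)": 300, "IOD (fabric/HBM)": 400}))
```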

3D Signal Interface
- Same simple digital signal interface between dies as 3D V-Cache [ISSCC 2022].
- Enabled by hybrid bonding technology's low parasitics.
[Circuit figure: clock in/out, isolation, ESD, and weak-keeper structures across the TSV; 20 µm and 9 µm dimension labels]

3D Clocking Topology
- Per-XCD GFX clock; unified SoC clock; 3D signal TSV connections; HBM memory; USR inter-IOD connectivity; high-speed I/O.
- Signal TSVs are on the same SoC clock domain on IOD and XCD.
- The IOD forwards the clock to the XCD through the same TSV structure.

Generational Scaling
- Chiplet technologies, advanced packaging, N5 and N6 process nodes, single GPU accelerator device, unified 8-stack HBM3 memory.
- 2.5x matrix FMA FP16 OPS/CLK, 1.5x HBM capacity and peak bandwidth, 1.2x higher peak engine clocks versus MI250X.

Product                            | AMD Instinct MI250X | AMD Instinct MI300A | AMD Instinct MI300X
GPU architecture                   | AMD CDNA 2          | AMD CDNA 3          | AMD CDNA 3
Lithography                        | TSMC 6nm FinFET     | TSMC 6nm, TSMC 5nm  | TSMC 6nm, TSMC 5nm
Power                              | 560 W               | 760 W               | 750 W
Peak engine clock                  | 1700 MHz            | 2100 MHz            | 2100 MHz
Peak DP (FP64) performance         | 47.9 TFLOPS         | 61.3 TFLOPS         | 81.7 TFLOPS
Peak DP matrix (FP64) performance  | 95.7 TFLOPS         | 122.6 TFLOPS        | 163.4 TFLOPS
Peak bfloat16 matrix performance   | 383 TFLOPS          | 980.6 TFLOPS        | 1307.4 TFLOPS
Memory type                        | HBM2e               | HBM3                | HBM3
Memory clock                       | 1.6 GHz             | 2.6 GHz             | 2.6 GHz
Memory interface                   | 8192-bit            | 8192-bit            | 8192-bit
Peak memory bandwidth              | 3276.8 GB/s         | 5324.8 GB/s         | 5324.8 GB/s

[Chart: MFMA64 kOPS/CLK, MFMA16 kOPS/CLK, HBM capacity, and HBM pin width x rate, normalized to MI250X (die 0 / die 1), for MI250X, MI300A, and MI300X]
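The headline rates in the table follow from CU count, peak clock, and per-CU operations per clock. The per-CU rates used below (128 vector FP64 FLOPs/clk, 256 matrix FP64 FLOPs/clk, 2,048 matrix bfloat16 FLOPs/clk) are assumptions inferred so the arithmetic reproduces the published numbers; they are not stated on the slide.

```python
# Back-of-the-envelope check of the MI300X peak rates in the table above.
# Per-CU ops/clk values are inferred assumptions, not taken from the slide.
cus = 304                 # MI300X CDNA 3 compute units
clk = 2.1e9               # peak engine clock, Hz

fp64_vector_per_cu = 128      # assumed FLOPs per CU per clock
fp64_matrix_per_cu = 256
bf16_matrix_per_cu = 2048

print(cus * clk * fp64_vector_per_cu / 1e12)   # ~81.7 TFLOPS (table: 81.7)
print(cus * clk * fp64_matrix_per_cu / 1e12)   # ~163.4 TFLOPS (table: 163.4)
print(cus * clk * bf16_matrix_per_cu / 1e12)   # ~1307.4 TFLOPS (table: 1307.4)

# Peak HBM3 bandwidth: 8192-bit interface, 2.6 GHz memory clock, double data rate.
print(8192 / 8 * 2.6e9 * 2 / 1e9)              # ~5324.8 GB/s (table: 5324.8)
```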

[Product photos: AMD Instinct MI300A and AMD Instinct MI300X]

We would like to thank our talented AMD team across the global AMD design, validation, and manufacturing sites, and all AMDers who contributed to MI300. Thank you!

References
- B. Munger et al., "'Zen 4': The AMD 5nm 5.7GHz x86-64 Microprocessor Core," 2023 IEEE International Solid-State Circuits Conference (ISSCC), San Francisco, CA, USA, 2023, pp. 38-39, doi: 10.1109/ISSCC42615.2023.10067540.
- "4th Gen AMD EPYC Processor Architecture," https:/
- S. Naffziger et al., "Pioneering Chiplet Technology and Design for the AMD EPYC and Ryzen Processor Families: Industrial Product," 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA), Valencia, Spain, 2021, pp. 57-70, doi: 10.1109/ISCA52012.2021.00014.
- T. Vijayaraghavan et al., "Design and Analysis of an APU for Exascale Computing," 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA), Austin, TX, USA, 2017, pp. 85-96, doi: 10.1109/HPCA.2017.42.
- M. J. Schulte et al., "Achieving Exascale Capabilities through Heterogeneous Computing," IEEE Micro, vol. 35, no. 4, pp. 26-36, July-Aug. 2015, doi: 10.1109/MM.2015.71.
- J. Wuu et al., "3D V-Cache: the Implementation of a Hybrid-Bonded 64MB Stacked Cache for a 7nm x86-64 CPU," 2022 IEEE International Solid-State Circuits Conference (ISSCC), San Francisco, CA, USA, 2022, pp. 428-429, doi: 10.1109/ISSCC42614.2022.9731565.
- R. Agarwal et al., "3D Packaging for Heterogeneous Integration," 2022 IEEE 72nd Electronic Components and Technology Conference (ECTC), San Diego, CA, USA, 2022, pp. 1103-1107, doi: 10.1109/ECTC51906.2022.00178.
- A. Smith and N. James, "AMD Instinct MI200 Series Accelerator and Node Architectures," 2022 IEEE Hot Chips 34 Symposium (HCS), Cupertino, CA, USA, 2022, pp. 1-23, doi: 10.1109/HCS55958.2022.9895477.

Copyright and Disclaimer
© 2024 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo, AMD EPYC, AMD Infinity Fabric, AMD Infinity Cache, AMD Instinct MI250X, AMD Instinct 300 Series and combinations thereof are trademarks of Advanced Micro Devices, Inc. Other product names used in this publication are for identification purposes only and may be trademarks of their respective companies.
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions, and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. Any computer system has risks of security vulnerabilities that cannot be completely prevented or mitigated. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.
THIS INFORMATION IS PROVIDED "AS IS." AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS, OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY RELIANCE, DIRECT, INDIRECT, SPECIAL, OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.


11.2: A 3D Integrated Prototype System-on-Chip for Augmented Reality Applications Using Face-to-Face Wafer-Bonded 7nm Logic at 2 µm Pitch with up to 40% Energy Reduction at Iso-Area Footprint

Tony F. Wu, Huichu Liu, H. Ekin Sumbul, Lita Yang, Dipti Baheti, Jeremy Coriell, William Koven, Anu Krishnan, Mohit Mittal, Matheus Trevisan Moreira, Max Waugaman, Laurent Ye, Edith Beigne
Meta Platforms, Inc.

- "Wow, Saturn is so cool!": immersive experiences spanning videos, photos, and text.
- From the physical world to virtual reality to augmented reality (Meta, Reality Labs).

So Why Is AR Challenging?
- Must be socially acceptable, lightweight, all-day battery, and must not heat up.
- Spans optics and displays, audio, interaction, computer vision, AI, system design, and UX.

AR Silicon Requirements
- Low power, high performance, small form factor.
- Video: Somasundaram, Kiran, et al., "Project Aria: A new tool for egocentric multi-modal AI research," arXiv:2308.13561 (2023).

This Work: A Path to AR Silicon
3D integration enables:
- 0.15 pJ/Byte inter-die access
- No increase in chip footprint
- Larger workloads deployed

3D Integration: Enabling Small Form Factor
- Logic/SRAM stacked on logic/SRAM; no special drivers/ESD needed.
- Bottom die and top die each combine logic + SRAM; backside TSVs; SoIC bonds; 4.1 mm x 3.7 mm die.

Prototype 3D AR SoC
- Implemented in 7nm technology; wafer-on-wafer stacked at 2 µm pitch.
- 33k+ signal connections between dies; 6M+ power connections between dies; connections not limited by technology.
- 3D-expanded SRAM: 16 MB on 16 MB, shared amongst several accelerators for AR; important for many AR workloads (QDI: Moreira, ASYNC Symposium, 2023). Non-ML workloads emulated with traffic generators.
- 3D-augmented ML accelerator.

3D Augmented ML Accelerator
- Bottom die: in-house ML accelerator (with 1 MB local SRAM, 1K processing elements, 4 compute cores).
- Top die: an additional 3 MB of local SRAM with the same latency, plus registers and glue logic; 27k SoIC bonds.

Designing 33k 3D Interconnects
- Flop-to-flop connections across the SoIC bonds.
- Treat the dies as hierarchical blocks, with input/output delay constraints on the 3D pins.

Inter-die Clock Forwarding
- Top and bottom dies are place-and-routed separately; clocks are forwarded from the bottom die to the top die across SoIC bonds.
- Multi-die process variation: the bottom and top dies can sit at different corners, giving 200+ cross-corner combinations for timing signoff.
- Problem: long divergent clock paths accumulate large variation between launch and capture.
- Solution: custom clock-tree constraints maximize the common clock path, leaving only short divergent 3D clock paths and small cumulative variation.
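The benefit of maximizing the common clock path can be seen numerically. The sketch below is a toy Monte-Carlo model, not the authors' signoff flow: it assumes each clock-buffer stage contributes an independent Gaussian delay error, so stages shared by both dies cancel out of the skew and only the divergent portion contributes.

```python
import random

def skew_samples(common_stages, divergent_stages, sigma=0.005, n=10000):
    """Toy model: each clock buffer adds an independent Gaussian delay error
    (sigma in ns). Stages common to both launch and capture paths cancel,
    so only the divergent stages on each die contribute to skew."""
    skews = []
    for _ in range(n):
        top = sum(random.gauss(0, sigma) for _ in range(divergent_stages))
        bottom = sum(random.gauss(0, sigma) for _ in range(divergent_stages))
        skews.append(top - bottom)  # the common_stages cancel identically
    return skews

def spread(samples):
    mean = sum(samples) / len(samples)
    return (sum((s - mean) ** 2 for s in samples) / len(samples)) ** 0.5

# Long divergent paths (little clock sharing between dies): large skew spread.
print(spread(skew_samples(common_stages=2, divergent_stages=20)))
# Long common path with short divergent 3D tails: much smaller spread.
print(spread(skew_samples(common_stages=20, divergent_stages=2)))
```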

Energy Cost of Inter-die Communication
- Measured shared-memory access energy (pJ/B) versus bandwidth (GB/s) for bottom-die SRAM and top-die SRAM, at 0.63 V VDD and 500 MHz (die temperatures of 24.7, 25.5, and 26.6 degrees C).
- Minimal access-energy overhead: the SoIC bond and SoIC driver contribute roughly 0.03 and 0.13 pJ/B, with the remaining measured components (0.95 and 0.88 pJ/B) in the network-on-chip and the 1 MB SRAM (at 0.63 V VDD).
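If the first two components of this breakdown are indeed the SoIC bond and its driver, as the slide's labels suggest, then the 3D-crossing portion of a shared-memory access costs roughly 0.03 + 0.13 = 0.16 pJ/B, consistent with the 0.15 pJ/Byte inter-die access figure quoted earlier; the rest of the access energy is dominated by the network-on-chip traversal and the SRAM macro itself.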

Enabled Application: Simultaneous 2-Hand Tracking
- Raw video in, direct pose estimation out.
- [Chart: 40% and 2x annotations for batch-2 and batch-4 HPE]

More than ML Applications
- ISP logic with SRAM over SoIC bonds and LPDDR4X DRAM, at an iso-area footprint.

Enabled Application: Full HD Image Processing
- Full HD image processing in the same energy footprint as compressed image processing.

A Path to AR Silicon
- Low power, high performance, small form factor.
- 0.15 pJ/Byte inter-die access.
- No increase in chip footprint.
- Up to 2X larger workloads enabled.

Thank you. Questions?

11.3: Metis AIPU: A 12nm 15TOPS/W 209.6TOPS SoC for Cost- and Energy-Efficient Inference at the Edge

- NoC: 1 Tbit/s bandwidth to shared memories; ensures AI cores will not stall, not even in highly congested multi-core scenarios.

AI Core Key Components
- Matrix-Vector Multiplier (MVM): D-IMC based.
- Data Processing Unit: element-wise vector operations; applies activation functions.
- Depth-Wise Processing Unit (DWPU): depth-wise convolution, pooling, and up-sampling.
- 4 MiByte L1 SRAM.
- RISC-V control core.

AI Core Operational Model
- Dataflow engine, RISC-V controlled.
- Dual high-throughput streaming data paths, one for the MVM and one for the DWPU, which can operate fully in parallel.
- Background weight loading: weights for the next operation are written in parallel with the current operation, enabled by multiple weight sets.

AI Core Deployment Scenarios
- A single AI core can execute all layers of a neural network, eliminating the need for external interactions.
- Flexible deployment of multiple AI cores: manage different neural networks independently in multi-network applications, jointly tackle a workload to enhance throughput, or work on the same neural network to reduce latency.
[Diagram: RISC-V system controller, LPDDR4x, security block, and N AI cores mapped to different networks]

Data Processing Units
- Full-precision accumulation.
- Bit-serial MVM: bit-serial inputs, bit-parallel weights; 8 cycles for 8b x 8b operations.
- D-IMC array: 512 1b inputs against a 512 x 512 8b weight matrix; 512 results per cycle for 1b input x 8b weight, accumulated over the input bits; 512 complete results per 8 cycles.
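The bit-serial scheme just described can be written out functionally. This is a sketch under the stated scheme (bit-serial inputs, bit-parallel 8b weights, full-precision accumulation over 8 cycles); the array is shrunk from 512 x 512 for readability, inputs are treated as unsigned, and nothing here models the in-memory-compute circuit itself.

```python
import numpy as np

def bit_serial_mvm(x, W, input_bits=8):
    """Bit-serial matrix-vector multiply: inputs are fed one bit per cycle
    (LSB first), weights are applied bit-parallel, and partial results are
    accumulated at full precision. After `input_bits` cycles the result
    equals the ordinary integer product W @ x."""
    acc = np.zeros(W.shape[0], dtype=np.int64)
    for b in range(input_bits):                # one "cycle" per input bit
        x_bit = (x >> b) & 1                   # 1-bit slice of every input
        acc += (W @ x_bit) << b                # weight it by the bit position
    return acc

rng = np.random.default_rng(0)
x = rng.integers(0, 256, size=16).astype(np.int64)           # 8b unsigned inputs
W = rng.integers(-128, 128, size=(16, 16)).astype(np.int64)  # 8b weights
assert np.array_equal(bit_serial_mvm(x, W), W @ x)           # matches dense MVM
```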

Clock and Activity Gating
- Bank gating: if all 64 outputs of an IMC bank pair are unused, the entire bank is clock-gated.
- Block gating: within an active bank, blocks correspond to 64 inputs; if a block is not used, it is clock-gated and its inputs are silenced.
- Energy efficiency stays high even at low utilization.
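A small sketch of the gating decision just described, using the granularities stated on the slide (64 outputs per IMC bank pair, input blocks of 64); this is an illustration of the policy, not the chip's actual clock-gating logic.

```python
def gating_plan(used_outputs, used_inputs, out_per_bank_pair=64, in_per_block=64):
    """Decide which IMC bank pairs and which input blocks can be clock-gated,
    given boolean usage masks over the 512 outputs and 512 inputs."""
    gated_banks = []
    gated_blocks = []
    for b in range(0, len(used_outputs), out_per_bank_pair):
        if not any(used_outputs[b:b + out_per_bank_pair]):
            gated_banks.append(b // out_per_bank_pair)    # whole bank pair gated
    for blk in range(0, len(used_inputs), in_per_block):
        if not any(used_inputs[blk:blk + in_per_block]):
            gated_blocks.append(blk // in_per_block)      # block gated, inputs silenced
    return gated_banks, gated_blocks

# Example: only the first 64 outputs and first 100 inputs of the array are used.
outs = [i < 64 for i in range(512)]
ins = [i < 100 for i in range(512)]
print(gating_plan(outs, ins))   # bank pairs 1..7 gated, input blocks 2..7 gated
```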

Outline
- Metis
- Measurement results (on a beta version of the Metis AIPU SoC, 12nm process): D-IMC, Metis SoC
- Conclusions

D-IMC Measurement Results
- Peak throughput: 57.3 TOPS at 0.7 V, 875 MHz.
- 15 TOPS/W for random uniform activations and weights (no sparsity).
- 82 TOPS/W under high-sparsity conditions, at reduced throughput.
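The 57.3 TOPS figure is consistent with the array dimensions given earlier, assuming one multiply-accumulate counts as two operations (a common convention, assumed here rather than stated on the slide).

```python
# Sanity check of the D-IMC peak-throughput number, assuming 1 MAC = 2 ops.
rows, cols = 512, 512          # 512 inputs x 512 outputs, 8b weights
cycles_per_result = 8          # bit-serial: 8 cycles for an 8b x 8b result
freq = 875e6                   # Hz, at 0.7 V

ops_per_cycle = rows * cols * 2 / cycles_per_result
print(ops_per_cycle * freq / 1e12)   # ~57.3 TOPS, matching the measured peak
```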

D-IMC Comparison to State-of-the-Art
- The massive array size stands out; still, good energy efficiency and compute density.
- Further scaling will provide significant further improvements.

Metis AIPU SoC Performance
- Measured accuracy closely aligns with the reference accuracy obtained using FP32 arithmetic (small deviation from FP32 accuracy).
- Measured efficiency points: 92 FPS/W and 354 FPS/W.

YOLOv5s on Metis: demo preview.

Silicon Implementation

Outline
- Metis
- Measurement results
- Conclusions

Conclusions
- The Metis AIPU SoC is an innovative and advanced inference solution optimized for AI computer-vision applications at the edge.
- It features four homogeneous AI cores, each capable of independently performing full neural-network inference offloading.
- Every core has a large D-IMC array for top-tier energy efficiency, achieving 15 to 82 TOPS/W.
- It demonstrates exceptional performance, e.g., 353 FPS/W at 2502 FPS for ResNet-50 and 92 FPS/W at 497 FPS for YoloV5s.

11.4: IBM NorthPole: An Architecture for Neural Network Inference with a 12nm Chip

Andrew S. Cassidy, John V. Arthur, Filipp Akopyan, Alexander Andreopoulos, Rathinakumar Appuswamy, Pallab Datta, Michael V. Debole, Steven K. Esser, Carlos Ortega Otero, Jun Sawada, Brian Taba, Arnon Amir, Deepika Bablani, Peter J. Carlson, Myron D. Flickner, Rajamohan Gandhasri, Guillaume J. Garreau, Megumi Ito, Jennifer L. Klamo, Jeffrey A. Kusnitz, Nathaniel J. McClatchey, Jeffrey L. McKinstry, Yutaka Nakamura, Tapan K. Nayak, William P. Risk, Kai Schleupen, Ben Shaw, Jay Sivagnaname, Daniel F. Smith, Ignacio Terrizzano, Takanori Ueda, Dharmendra Modha
IBM Research

Why NorthPole?
- The operating and capital costs of AI are unsustainable.
- The brain is vastly more energy-efficient than modern computers (Dharmendra S. Modha and Raghavendra Singh, "Macaque brain long-distance network," PNAS 2010; 107:13485-13490; © 2010 by National Academy of Sciences).
- Architecture trumps Moore's Law.

NorthPole AI: Image Classification (ResNet-50)
- Same accuracy and throughput using fewer transistors: lower capital cost (cheaper to manufacture AI hardware).
- Same accuracy and throughput using less energy: lower operating cost (cheaper to run the AI application).
- Comparative numbers as published by papers, respective parties, and third parties (a reference is available for every data point).
- NorthPole achieves a 25x higher energy metric, 5x higher space metric, and 22x lower time metric than a GPU on a comparable 12-nm process.

NorthPole AI: Object Detection (Yolo-v4)
- Same accuracy and throughput using fewer transistors (lower capital cost) and less energy (lower operating cost).

NorthPole AI: Language Modeling (BERT-base)
- Same accuracy and throughput using fewer transistors (lower capital cost) and less energy (lower operating cost).

Inspired by the brain, optimized for silicon.

Axiomatic Architecture
- Axiom 1: neural inference specialization
- Axiom 2: brain-inspired low precision
- Axiom 3: brain-inspired distributed, modular core array with massive compute parallelism within and among cores
- Axiom 4: brain-inspired memory near compute
- Axiom 5: brain-inspired networks-on-chip
- Axiom 6: silicon-optimized networks-on-chip
- Axiom 7: stall-free, deterministic control (for high compute utilization)
- Axiom 8: co-optimized training algorithms
- Axiom 9: co-designed software
- Axiom 10: simplest usage model: write input, run network, read output

NorthPole Core: Compute
- VMM unit: 2,048 (4,096 and 8,192) operations per core per cycle at 8-bit (at 4-bit and 2-bit, respectively) precision.
- Vector unit: 256 operations per core per cycle at FP16 precision.
- ActFx unit: 32 operations per core per cycle.
[Core block diagram: unified memory; weight and partial-sum/constant memories; VMM, Vector (3:0), and ActFx units with their control threads; ANoC, INoC, MNoC, PSNoC; activation repack]

NorthPole Core: Memory
- 0.75 MB unified memory (activation data + model parameters + program).
- Weight, partial-sum, and constant local buffers.

NorthPole Core: Communication
- Partial Sum NoC (PSNoC): communicates within a layer for spatial computing.
- Activation NoC (ANoC): reorganizes activations between layers.
- Model NoC (MNoC): delivers weights during layer execution.
- Instruction NoC (INoC): delivers the program for each layer prior to layer start.

NorthPole Core: Control
- 8 parallel threads per core.
- Stall-free, deterministic control for high compute utilization: no data-dependent branching, no cache misses, no speculative execution.

Operation: Two-Layer Timing Diagram
- Compute and communication are both concurrent and pipelined, within units and between units.
- Memory is double-buffered (A/B instruction buffers, weight and partial-sum buffers, unified-memory halves UM A / UM B).
- Parallel and concurrent control threads (VMM, Vector 3:0, ActFx, ANoC, INoC, and MNoC threads).
- All operations (compute, communication, memory, control) are fully deterministic.
- All operations are orchestrated by a software scheduler at compile time.
[Timing diagram: for layers n and n+1, INoC/MNoC delivery, VMM, PSNoC, Vector, and ActFx activity overlap across the double-buffered memories]
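The double-buffered, compile-time-scheduled overlap in these timing diagrams can be sketched as a simple static schedule. This is an illustrative model only: the phase names and cycle counts are invented, and the real scheduler orchestrates far more threads and NoCs than shown.

```python
# Toy static schedule for two overlapping activities per layer on one core:
# instruction/weight delivery for layer n+1 fills the idle (A/B) buffer while
# layer n computes. Durations are made-up cycle counts for illustration.
def schedule(layers, cycles_per_layer=100):
    t = 0
    buf = "A"
    events = []
    for n in range(layers):
        other = "B" if buf == "A" else "A"
        events.append((t, f"layer {n}: compute from buffer {buf}"))
        events.append((t, f"layer {n + 1}: INoC/MNoC fill buffer {other}"))
        t += cycles_per_layer        # deterministic, known at compile time
        buf = other                  # swap double buffers for the next layer
    return events

for cycle, activity in schedule(3):
    print(cycle, activity)
```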

NorthPole Core Physical Layout
- Memory near compute: compute (red) and control (yellow) blocks are tightly integrated with their associated memory (blue).
- Dense interconnectivity: over 4,096 wires crossing each core, both horizontally and vertically.

NorthPole Inference Chip: Organization
- I/O interface.
- 32 MB frame buffer (SRAM).
- 256-core array with 192 MB of SRAM.
- Cortex-like modularity enables homogeneous scalability in two dimensions.

NorthPole Inference Chip
- Fabricated in a GlobalFoundries 12nm FinFET process.
- 22 billion transistors in 795 mm2 of silicon area.
- Fully operational in the first silicon implementation.
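Scaling the per-core figures quoted on the core slides to the 256-core array reproduces the chip-level numbers; a quick consistency check using only values stated above:

```python
# Chip-level totals implied by the per-core figures quoted earlier.
cores = 256
ops_per_core_per_cycle = {"8b": 2048, "4b": 4096, "2b": 8192}  # VMM unit, per precision
for precision, ops in ops_per_core_per_cycle.items():
    print(precision, cores * ops, "ops/cycle across the core array")

# On-chip SRAM: 256 cores x 0.75 MB unified memory per core = 192 MB,
# matching the 192 MB quoted for the core array (the 32 MB frame buffer is separate).
print(cores * 0.75, "MB")
```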

Test Results: Voltage/Frequency Shmoo Plot
- NorthPole is operational across a range of voltages and frequencies.

Test Results: Voltage/Frequency Scaling (ResNet-50)
- Voltage/frequency scaling adjusts the operating point to optimize power and energy.

NorthPole Systems
- NorthPole PCIe assembly: front cover, heatsink, NorthPole PCB, insulator, stiffener, back cover, tailstock.
- Single NorthPole assembly in a server.
- Four NorthPole assemblies in a server; 8, 10, or 16 NorthPole assemblies per server are possible.

NorthPole Applications

End-to-End Software Toolchain
- Flow: trained neural network -> training with quantization -> quantized neural network -> compiler -> NorthPole binary (ELF) -> runtime (write tensor, run network, read tensor; input frame in, output result out).
- Training with quantization: Fine-tuning After Quantization (FAQ), Learned Step-size Quantization (LSQ).
- Layer-specific precision selection: Entropy Approximation Guided Layer selection (EAGL), Accuracy-aware Layer Precision Selection (ALPS).

NorthPole running YOLOv4.
NorthPole running PSPNet.
Demo of NorthPole processing a Yolo-v4 network with input from two cameras in real time, using under 5 W of chip power.

Acknowledgement
This material is based upon work supported by the United States Air Force under Contract No. FA8750-19-C-1518. Support from OUSD(R&E) is gratefully acknowledged.
