SESSION 2 - Processors and Communication SoCs.pdf


ISSCC 2024, Session 2: Processors and Communication SoCs

2.1: A 4nm 3.4GHz Tri-Gear Fully Out-of-Order ARMv9.2 CPU Subsystem-Based 5G Mobile SoC
2024 IEEE International Solid-State Circuits Conference

Anshul Varma, Sumanth Gururajarao, HsinChen Chen, Tao Chen, Gordon Gammie, Hugh Mair, Jen-Hang Yang, Hao-Hsiang Yu, Shun-Chieh Chang, Cheng-Hao Yang, Li-An Huang, Kumar Ramanathan, Ramesh Halli, Efron Ho, Ta-Wen Hung, Sung S.-Y. Hsueh, LiangChe Li, Achuta Thippana, Ericbill Wang, SA Huang

Outline
- Fully OoO Tri-Gear Processor Subsystem
- Circuit Technologies
  - Logic SRAM
  - Standard Cell Improvements
  - High Bandwidth Voltage Control
  - Dual-Rail Equalizer
  - Hardware DVFS with PMIC Passive Discharge
- Summary

CPU Subsystem
- Both HP & BP cores upgraded to Cortex-X4
- OoO Cortex-A720 Efficiency core provides 68% power saving
- Figure labels: Tri-Gear CPU Subsystem (ISSCC22 #2.5), Big/Little; Fully OoO Tri-Gear CPU Subsystem (Current), All-Big

Single Thread Performance: HP Core
- 15% IPC uplift driving higher single-thread performance
- -38% power, +16% peak performance

| HP | Peak Perf. | Power Iso-Perf. | CPU Core Area |
|---|---|---|---|
| X4 (vs. X3) | 152% (+16%) | 143% (-38%) | 174% (+12%) |
| X3 | 131% | 232% | 155% |
| A715 | 100% | 100% | 100% |

Dimensity 9200/A715 (ISSCC22 #2.5) as performance/power/area reference. Figure legend: Prior (ISSCC22 #2.5), Current Work.

Single Thread Performance: BP Core (figure only)

Single Thread Performance: HE Core
- OoO Cortex 720 HE core: up to 68% power saving
- Higher IPC from OoO-op., Private L2$
- Low leakage library implementation
- L3$/Non-CPU power is a factor
- +172% peak performance, -68% power

| HE | Peak Perf. | Power Iso-Perf. | CPU Core Area |
|---|---|---|---|
| A720 (vs. A510) | 71% (+172%) | 7.8% (-68%) | 90% (+114%) |
| A510 | 26% | 24% | 42% |
| A715 | 100% | 100% | 100% |

Dimensity 9200/A715 (ISSCC22 #2.5) as performance/power/area reference. Figure legend: Prior (ISSCC22 #2.5), Current Work.
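The bracketed deltas in the HP and HE tables can be re-derived from the normalized values (A715 = 100%). The sketch below is only a consistency check on the slide numbers; the dictionary layout and function name are illustrative and not part of the presentation. Because the published values are rounded, the recomputed deltas can differ from the slide by about one percentage point.

```python
# Normalized values copied from the single-thread tables (A715 = 100%).
# Each tuple is (new core, previous core).
hp = {"peak_perf": (152, 131), "power_iso_perf": (143, 232), "core_area": (174, 155)}
he = {"peak_perf": (71, 26), "power_iso_perf": (7.8, 24), "core_area": (90, 42)}


def relative_change_pct(new, old):
    """Percentage change of `new` relative to `old`."""
    return (new / old - 1.0) * 100.0


hp_delta = {name: relative_change_pct(*pair) for name, pair in hp.items()}
he_delta = {name: relative_change_pct(*pair) for name, pair in he.items()}

for label, deltas in (("X4 vs. X3", hp_delta), ("A720 vs. A510", he_delta)):
    for name, value in deltas.items():
        print(f"{label:14s} {name:15s} {value:+.1f}%")
```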


Level-1 Load-Store Data Cache trends
- Load-Store Data Cache trends over generations:
  - Increasing total size of Load-Store Data cache
  - Increasing segmentation
  - Smaller unit instance
- Revisit circuit topology for power-efficient smaller instances
- Alternative options to traditional Sense-Amplifier based SRAM

Traditional HP CPU Level-1 Cache SRAM
- Icell variation: RDF/Stochastic effects
- 6-Sigma tolerance to guarantee functionality
- Small Signal Swing with worst-case Icell
- Almost Full-Rail Swing with nominal Icell
- Large-cap on bit-lines, Sense-Amplifier
- Increased Read energy
- Dual power supply to support large DVFS range
- Figure labels: bl, blb; Icell_avg, Icell_6-Sig; X% derate; Small Signal, Full-Rail

Logic Memory arch: Bit cell
- 12-T full-CMOS bit cell
- No Read-disturb, Write-contention
- Eliminate Second array supply
- Decoupled Read/Write bitlines
- Natively support Read+Write
- Independent optimization of read and write circuits/segmentation
- Scalable to additional read ports for higher bandwidth

Logic Memory arch: Read Circuits
- Read-Word-Lines and Bit-cells logically recombined through AOI, NAND, NOR Logic
  - M-Input AOI
  - N-Input NAND
  - P-Input NOR
  - Q-Input NAND
- 4 Logic Stages from rwl to Output Latch, M*N*P*Q entries
- Only one fan-in leg of AOI/NAND/NOR/NAND switches during read-access
  - -73% switching capacitance vs. 6T bit cell SRAM
- Potential performance limiter for larger memories
  - Meets HP/BP CPU frequencies with 256 entries (M=N=P=Q=4)

Power benefits vs. 6T based SRAM
- 3.7x / 73% reduction in read energy
  - Only 1 of the fan-in legs of AOI, NAND, NOR stages switches
  - Only toggle when read value opposite idle value
- 2x / 51% reduction in write energy
- 60% reduction in power for typical work-loads
- 4.6% reduction in total CPU power for typical work-loads
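The four-stage read path (AOI, NAND, NOR, NAND with M=N=P=Q=4) can be illustrated with a small behavioural model. This is a sketch under assumptions: the slides do not give the gate-level netlist, so the grouping of entries into stages, the one-hot read word-line encoding, and the function names below are hypothetical. The model only shows the two properties stated on the slides: the output equals the selected bit, and internal nodes leave their idle value only when the read value is the opposite of the idle value.

```python
import random

M = N = P = Q = 4            # fan-in per stage; 4 * 4 * 4 * 4 = 256 entries
ENTRIES = M * N * P * Q


def read_path(bits, addr=None):
    """Evaluate one bit column. Returns (output, list of internal node values).

    bits: ENTRIES stored values (0/1); addr: selected entry, or None when idle.
    """
    rwl = [int(i == addr) for i in range(ENTRIES)]   # one-hot read word-lines
    # Stage 1: M-input AOI, one AND leg per (rwl, bit) pair, NOR-combined.
    s1 = [int(not any(rwl[g * M + i] and bits[g * M + i] for i in range(M)))
          for g in range(N * P * Q)]
    # Stage 2: N-input NAND over stage-1 outputs.
    s2 = [int(not all(s1[g * N + i] for i in range(N))) for g in range(P * Q)]
    # Stage 3: P-input NOR over stage-2 outputs.
    s3 = [int(not any(s2[g * P + i] for i in range(P))) for g in range(Q)]
    # Stage 4: Q-input NAND feeding the output latch.
    out = int(not all(s3))
    return out, s1 + s2 + s3 + [out]


def toggled_nodes(bits, addr):
    """Count internal nodes whose value differs from the idle (no read) state."""
    _, idle = read_path(bits, None)
    _, active = read_path(bits, addr)
    return sum(a != b for a, b in zip(idle, active))


random.seed(0)
stored = [random.randint(0, 1) for _ in range(ENTRIES)]
reads_ok = all(read_path(stored, a)[0] == stored[a] for a in range(ENTRIES))
print("all reads return the stored bit:", reads_ok)
```

In this model a read of a stored 1 flips exactly one node per stage, and a read of a stored 0 flips none. Word-line drivers and the output latch are outside the model.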


Split Double Height
- Reduce metal zero resistance
- Split double height form factor
- Short M1 stub connects two smaller M0 wires
- Figure labels: M0, higher R (PO, M0); M1, lower R (PO, M0, M1)

Multiple Via Connection
- Reduce via resistance
- Multiple port shapes with must-join attribute
- Router makes multiple parallel connections
- Figure labels: Routing (Single stack VIA), Cell (Single M2 pin), VIA*1; Routing (VIA pillar), Cell (Must Join M1 pins), VIA*N; layers M1, M2, M4, M6

High Speed Combo Cell
- Reduce cell routing length
- Identify frequently used cell pairs from timing critical paths
- Combined cell with shorter M0 connection
- Figure labels: CELL A (input pin) and CELL B (output pin) connected through M0, M1, M2; HSCC (CELL A+B) with input pin and output pin connected through M0
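The benefit of the multiple-via (VIA*N) connection follows from resistors in parallel. The sketch below assumes N identical vias and ignores the metal segments between them; the 20-ohm single-via value is a hypothetical number chosen for illustration, not a figure from the presentation.

```python
def via_stack_resistance(r_single_ohm, n_parallel):
    """Effective resistance of n identical vias connected in parallel."""
    if n_parallel < 1:
        raise ValueError("need at least one via")
    return r_single_ohm / n_parallel


R_SINGLE_OHM = 20.0   # hypothetical single-via resistance, for illustration only
for n in (1, 2, 4):
    print(f"VIA*{n}: {via_stack_resistance(R_SINGLE_OHM, n):.1f} ohm")
```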

Voltage/Frequency Control: Prior Work
- FLL using ROSC with programmable code drives CPU clock
- ROSC code not allowed to go below MinCode
- Speed-binned for max CPU frequency
- Voltage determined by VF table lookup + unit step change, 1ms

SW Based Voltage Control: Prior Work
- Heavy load: Target -> Voltage -> FLL Freq
- Light load: Target -> FLL Freq -> Code -> Voltage
- Plot: Frequency, Code and Voltage vs. Time (ms); labels: Code error, MinCode, ROSC Code, Freq. error; markers 1, 2, 3

High Bandwidth Voltage Control
- Achieves complete HW DVFS
- Energy Aware Scheduler (EAS) provides a performance target
- HW realigns Voltage and Freq. to meet target at lowest power
- Voltage updates as quick as every 30us
- Block diagram labels: Performance target, Frequency target, Voltage request, Frequency/code error, Power supply, CPU

HW DVFS Operation
- ROSC tracks target frequency with optimal voltage
- Plot: Freq (MHz) and Voltage (a.u.) vs. Time (ms)

HW DVFS Power Savings
- Voltage distribution of HW & SW DVFS for mid-range workload
- 80mV voltage reduction for top 20% of distribution
- Plot: cumulative frequency vs. voltage (Low Voltage to High Voltage), 80mV gap marked
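To give a feel for what the shorter update interval means, the sketch below compares how long a step-wise voltage controller takes to cover the same voltage change at a 1 ms interval (one reading of the prior-work "unit step change 1ms") and at the 30 us interval quoted for the hardware loop. The start voltage, target voltage and 5 mV step size are hypothetical values, and using the same step size for both controllers is an assumption; none of these come from the slides.

```python
import math


def settle_time_us(v_start_mv, v_target_mv, step_mv, interval_us):
    """Time for a controller that moves one fixed step per update to reach the target."""
    steps = math.ceil(abs(v_target_mv - v_start_mv) / step_mv)
    return steps * interval_us


# Hypothetical scenario: lower the supply from 900 mV to 800 mV in 5 mV steps.
sw_time_us = settle_time_us(900, 800, step_mv=5, interval_us=1000)  # 1 ms per step
hw_time_us = settle_time_us(900, 800, step_mv=5, interval_us=30)    # 30 us per step

print(f"1 ms updates:  {sw_time_us} us")
print(f"30 us updates: {hw_time_us} us")
print(f"ratio: {sw_time_us / hw_time_us:.1f}x")
```

The faster loop spends less time at a voltage higher than the workload needs, which is consistent with the lower voltage distribution reported for HW DVFS, although the model says nothing about the size of that saving.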


SRAM Power Management
- Off-chip LDO controlled by software
  - VSRAM buck max VLOGIC + LDO dropout = Power loss
  - Frequent voltage changes make LDO control complex
- On-chip LDO with voltage sense automates control, has poor efficiency
- Dual-rail Equalizer eliminates LDO, VSRAM buck = VSRAM Voltage
- Figure labels: Off-chip LDO; On-chip LDO with voltage sense; Proposed solution without LDO, VBITCELL = max(VLOGIC, VSRAM)

Dual-Rail Equalizer
- Dual-rail equalizer dynamically switches between VLOGIC & VSRAM
- Improves power efficiency and reduces cost
- Eliminates software dependence for voltage control
- Built-in hysteresis to limit unnecessary comparator toggling from supply noise and power switch ripple
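The switching behaviour can be sketched as a comparator with hysteresis that connects the bit-cell supply to whichever rail is higher, following the stated relation VBITCELL = max(VLOGIC, VSRAM). The class name, the 20 mV hysteresis band and the sample voltages are hypothetical; the presentation does not give the comparator thresholds.

```python
class DualRailSelector:
    """Behavioural model: pick the higher of two rails, with hysteresis."""

    def __init__(self, hysteresis_mv=20, initial="VSRAM"):
        self.hysteresis_mv = hysteresis_mv   # hypothetical band, for illustration
        self.selected = initial

    def update(self, vlogic_mv, vsram_mv):
        # Switch only when the other rail exceeds the current one by more
        # than the hysteresis band, so noise near the crossover is ignored.
        if self.selected == "VSRAM" and vlogic_mv > vsram_mv + self.hysteresis_mv:
            self.selected = "VLOGIC"
        elif self.selected == "VLOGIC" and vsram_mv > vlogic_mv + self.hysteresis_mv:
            self.selected = "VSRAM"
        return self.selected


selector = DualRailSelector()
trace = [selector.update(vlogic, 750) for vlogic in (700, 760, 780, 745, 720)]
print(trace)
```

With hysteresis the selected rail can sit slightly below the true maximum, by at most the band width, in exchange for fewer comparator toggles.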


PMIC Passive Discharge
- Disable PMIC active pull-down during a voltage decrease
- Supply decays based on load, not with a specific slew rate

HW DVFS with PMIC Passive Discharge
- Power benefit up to 12.2% for popular gaming scenario
- Application-level power measurements
- Plot labels: PMIC Voltage, PMIC Current, Wasted Power

System Level Power Comparison
- Up to 28% CPU and 22% system power savings

| Scenario | CPU current (Prior) | CPU current (This work) | CPU Power Saving | System current (Prior) | System current (This work) | System Power Saving |
|---|---|---|---|---|---|---|
| Heavy | 723 mA | 523 mA | 28% | 1558 mA (59.1 FPS) | 1220 mA (59.6 FPS) | 22% |
| Light | 219 mA | 171 mA | 22% | 658 mA (59.8 FPS) | 524 mA (59.9 FPS) | 20% |

Summary
- Fully OoO Tri-Gear CPU Subsystem
  - Higher IPC delivers higher performance & power savings
- 4.6% CPU power savings from 12T full-CMOS L1$
- Higher peak frequency with standard cell improvements
- 12.2% CPU power savings with HW DVFS control
- Overall, 28% CPU and 22% system power saving
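The saving percentages in the system-level power comparison can be reproduced from the measured currents. The sketch assumes the supply voltage is the same for the prior and current measurements, so that the current ratio equals the power ratio; the slide does not state this explicitly. Variable names are illustrative.

```python
# (prior, this work) currents in mA, copied from the comparison table.
measurements_ma = {
    ("Heavy", "CPU"): (723, 523),
    ("Heavy", "System"): (1558, 1220),
    ("Light", "CPU"): (219, 171),
    ("Light", "System"): (658, 524),
}


def saving_pct(prior, this_work):
    """Relative reduction of `this_work` versus `prior`, in percent."""
    return (1.0 - this_work / prior) * 100.0


savings = {key: round(saving_pct(*pair)) for key, pair in measurements_ma.items()}
for (scenario, domain), value in savings.items():
    print(f"{scenario:5s} {domain:6s} saving: {value}%")
```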

2.2: Zen 4c: The AMD 5nm Area-Optimized x86-64 Microprocessor Core
2024 IEEE International Solid-State Circuits Conference

T. Burd¹, S. Venkataraman¹, W. Li¹, T. Johnson¹, J. Lee¹, S. Velaga¹, M. Wasio¹, T. Yiu¹, F. Bodine¹, M. McCabe¹, U. Salim¹, S. K. Thouta¹, M. Golden¹, S. Ramachandran¹, G. Devi¹, J. Wuu², Y. Kuszczak³, G. Singla³, C. Henrion², A. Robison², S. Balagangadharan⁴, U. Nair⁴, N. Srivastava⁴, H. Prasad⁴, M. Polimetla⁴, P. Chennupati⁴, E. Gupta⁴, M. Vykuntam⁴, S. Sarkar⁴, P. K. Duvvuru⁴, T. Mardi⁴, Swetha. G⁴
¹AMD Santa Clara, CA; ²AMD Fort Collins, CO; ³AMD Markham, Canada; ⁴AMD Bangalore, India

Outline
- Market segments and design goals
- Core architecture
- SRAM optimization
- Physical design optimization
- L3 cache
- Core frequency and power
- CCD (CPU Compute Die)
- Server and mobile configurations
- Performance comparisons
- Conclusion

“Zen 4c” Market Segments
- 4th Gen AMD EPYC Server Processors: 5 nm Compute Chiplet + 6 nm IO Chiplet
  - Power-efficient and cost-optimized x86 core
  - “Bergamo” (Cloud Native): up to 128-core server processors, 400W TDP (max), for cloud optimized compute
  - “Siena” (Intelligent Edge): up to 64-core server processors, 200W TDP (max), for edge compute
- AMD RYZEN 7545U and 7440U Mobile Processors: 4 nm SoC
  - Four “Zen 4c” cores + two “Zen 4” cores for a low-cost, power-efficient mobile APU, 15W TDP

“Zen 4c” Design Goals
- Focus on Power and Area Efficiency
- Same TSMC 5nm technology as “Zen 4” core
- Area optimized design delivering sizable improvement in Perf/Watt
- ISO performance at reduced area and power compared to “Zen 4” core
- Versus “Zen 4” achieved:
  - 35% area reduction
  - 25% [1] improvement in Performance/mm² on SPECrate2017_int_base
  - 9% [2] improvement in Performance/Watt on SPECrate2017_int_base

[1] SP5-001C, SP5-143A. [2] Comparing at 360W TDP. (See endnotes)

“Zen 4c” Core Architecture
- “Zen 4c” has the same architectural performance as “Zen 4”
- “Zen 4c” is fully instruction set architecture (ISA) and feature compatible with “Zen 4”
- Figure source: Munger, ISSCC23

Structure Sizes and Latencies

| Structure | Zen 4 | Zen 4c |
|---|---|---|
| LDQ | 88 | 88 |
| STQ | 64 | 64 |
| Micro-op cache | 6.75k ops | 6.75k ops |
| L1 I/D-cache | 32/32k | 32/32k |
| L2 cache | 1M | 1M |
| L3 cache/core | 4M | 2M |
| L2 TLB | 3k | 3k |
| L2 latency | 14 cycles | 14 cycles |
| Issue width (Int + FP/SIMD) | 10+6 | 10+6 |
| Int reg | 224 | 224 |
| Int scheduler | 96 | 96 |
| FP reg | 192 | 192 |
| ROB | 320 | 320 |
| FADD/FMUL/FMA latency | 3/3/4 cycles | 3/3/4 cycles |
| L1 BTB | 1.5k | 1.5k |
| L2 BTB | 7k | 7k |

Core architecture the same as “Zen 4.” For server, L3 cache per core changed from 4 MB to 2 MB.

SRAM Optimization (1)
- “Zen 4” CPU core used 8T bitcell arrays
  - Enables high-frequency, used on all prior “Zen” cores
  - L2 cache uses 6T arrays
- “Zen 4c” converted these to 6T bitcell arrays
  - Double-pumped to maintain architectural performance
  - Aggressive design optimizations pushed frequency of double-pumped 6T arrays to within 20% of 8T arrays
  - Delivered 40% macro area reduction
- Figure labels: (1) Wordline, Bitline, Bitline (standard illustration); (2) Write Wordline, Write Bitline, Read Wordline, Read Bitline
- Note for figures 1, 2: Hand-drawn illustration based on commonly available illustrations.

SRAM Optimization (2)
- 8T to 6T conversion yielded 40% macro area reduction
- Reduced double-pumped 6T array frequency enabled significant additional area and power reduction
- Figure labels: L2 Cache Arrays and CPU Core Arrays for “Zen 4” and “Zen 4c”

Indirect Impact of Optimized Frequency
- Comprehensive synthesis place and route methodologies leveraged from graphics drove additional area reduction
  - Lower Cac, leakage
- Smaller core area, smaller clock mesh
  - More multi-bit flip-flops also reduced clock loading
  - Lower Cac (Active Capacitance)
- Figure labels: L2 Cache and CPU Core floorplans; “Zen 4” 3.84 mm², “Zen 4c” 2.48 mm²
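The two floorplan areas quoted for “Zen 4” and “Zen 4c” are consistent with the 35% area reduction stated in the design goals. The short check below uses only those two numbers; the variable names are illustrative.

```python
ZEN4_AREA_MM2 = 3.84    # area shown for "Zen 4" in the floorplan figure
ZEN4C_AREA_MM2 = 2.48   # area shown for "Zen 4c" in the floorplan figure

area_reduction_pct = (1.0 - ZEN4C_AREA_MM2 / ZEN4_AREA_MM2) * 100.0
area_ratio = ZEN4_AREA_MM2 / ZEN4C_AREA_MM2

print(f"area reduction: {area_reduction_pct:.1f}%")
print(f"Zen 4 / Zen 4c area ratio: {area_ratio:.2f}x")
```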

Direct Impact of Optimized Frequency
- Smaller total standard cell area
  - Reduced 20%
  - Reduced Cac, leakage
- Fewer low Vt devices
  - Reduced leakage
- More multi-bit flip-flops
  - Reduced input clock Cac
  - Required fewer clock gaters/buffers
- Figure labels: “Zen 4” Stdcells, “Zen 4c” Stdcells

Summary of AMD Zen 4c Core Improvements
- Chart values: -35%, -25%, -50%

Core Complex (CCX)
- “Zen 4”: eight Core+L2 blocks with a 32 MB L3 Cache
- “Zen 4c”: eight Core+L2 blocks with a 16 MB L3 Cache
- Reduction in core area coupled with an optimized and halved L3 cache enables an 8-core compute complex in roughly half the area
- Figure source: Munger, ISSCC23

L3 Cache Optimization
1. Lower frequency target enabled
   - Pairs of L3 data macros merged: 10% macro area reduction
   - Reduced standard cell area 20%
2. Reduced L3 per core from 4 MB to 2 MB

Combined, enables twice the CPU cores per chiplet.
Figure labels: L3 Cache Arrays; “Zen 4” 32MB L3 Cache, “Zen 4c” 16MB L3 Cache

Measured Core Power Efficiency
- SPECrate2017_int_base power for eight CPU cores + 16 MB L3 cache (32 MB for “Zen 4”)
- Very high core-count servers require 2W per core, below the minimum operating voltage of “Zen 4,” necessitating frequency scaling to achieve

Measured Silicon Frequency (figure only)

CPU Compute Die (CCD) Chiplet
- System management unit (SMU)
  - Microcontroller, power & thermal management, clocking, reset, and fuses
- Infinity Fabric™ (Network on Chip)
  - Multiplexes the two 8-core CCXs to the SerDes
- Infinity Fabric™ On-Package (IFOP) links
  - Each SerDes link is comprised of 16 TX lanes, 20 RX lanes, two Clk and two Control lanes
  - Speeds up to 36 Gbps
- Die: 73 mm², 9B transistors
- Floorplan labels: two CCXs (eight Core+L2 blocks and a 16 MB L3 Cache each), IFOP1, IFOP2, SMU, DFT & Debug

Server Configurations
- 4th Gen AMD EPYC™ “Bergamo”: 4-8 CCDs (5 nm), 82B transistors, 200-400W TDP
- 4th Gen AMD EPYC™ “Siena”: 2-4 CCDs (5 nm), 47B transistors, 70-225W TDP
- IO Die (6 nm): 11B transistors, 387 mm²
- Same IO die used in all “Zen 4” and “Zen 4c” server products
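The core counts and transistor totals for the server packages can be cross-checked against the per-die figures. The sketch assumes the maximum CCD count for each product (8 for “Bergamo”, 4 for “Siena”) and uses the rounded per-die numbers from the slides, so the recomputed total can differ from the slide value by rounding: it gives 83B for “Bergamo” where the slide states 82B. Names are illustrative.

```python
CORES_PER_CCD = 16          # two 8-core CCXs per CCD
CCD_TRANSISTORS_B = 9       # billions, rounded figure from the CCD slide
IOD_TRANSISTORS_B = 11      # billions, IO die


def package_totals(n_ccd):
    """Return (core count, transistor count in billions) for n CCDs plus one IO die."""
    cores = n_ccd * CORES_PER_CCD
    transistors_b = n_ccd * CCD_TRANSISTORS_B + IOD_TRANSISTORS_B
    return cores, transistors_b


bergamo = package_totals(8)
siena = package_totals(4)
print("Bergamo (8 CCDs):", bergamo)
print("Siena (4 CCDs):  ", siena)
```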

Optimized Server Performance

| Benchmark | Units | “Zen 4c” 128 Cores (9754) | “Zen 4” 96 Cores (9654) | Uplift |
|---|---|---|---|---|
| SPECint2017 [1,2] | operations/sec | 1950 | 1790 | +9% |
| SPECpower_ssj2008 [3,4] | operations/sec/W | 33.3k | 30.6k | +9% |
| VMmark 3.1.1 [5] | SAN storage scoring | 44.15, 49 tiles | 40.66, 42 tiles | +9% |
| Chaos V-Ray [6] | max vsamples | 229k | 209k | +10% |
| HPL Benchmark [7] | GFLOPs | 10,134 | 8,856 | +14% |
| TPC™ Express Benchmark AI [8] | AI test cases/min | 1841 | 1554 | +18% |

[1] SP5-143A [2] SP5-001C [3] SP5-072A [4] SP5-011E [5] SP5-049D [6] SP5-038C [7] SP5-154 [8] SP5-051 [9] FN-9 [10] FN-10 (See endnotes)

“Zen 4c” delivers 9-18% performance increase over “Zen 4” on select cloud server benchmarks.

Competitive Cloud Server Performance
- For cloud native workloads (all 2P systems)
- “Bergamo” delivers up to 2.6x higher performance and on average has 2.0x higher performance
- “Bergamo” is 2.0x more energy-efficient (SPECpower_ssj 2008)
- Source: https:/

Competitive Edge Server Performance

| Benchmark | Units | “Siena” 64-core 8534P | Xeon Platinum 52-core 8471N | Uplift |
|---|---|---|---|---|
| SPECint2017 [1] | operations/sec | 474 (200W TDP) | 450 (300W TDP) | +5% |
| SPECint2017 [1] / W | operations/sec/W | 2.37 | 1.5 | +58% |
| SPECpower_ssj2008 [2] | operations/sec/W | 27.3k | 12.8k | +113% |
| FFmpeg (DTS raw to VP9 codec) [3] | frames/hour | 438k | 264k | +66% |

[1] FN-11 [2] SP6-007 [3] SP6-014 [4] SP6-006 (See endnotes)

AI/ML Workloads [4]: on average over Neural Magic DeepSparse, oneDNN, OpenCV, OpenVINO, and TensorFlow
- 1P “Siena” 64-core 8534PN (175W TDP): 247W measured average system power
- 1P Xeon Platinum 52-core 8471N (300W TDP): 423W measured average system power
- 8534PN exceeds competition across the workloads
- 8534PN has 50% lower average system power compared to competition

Heterogeneous CCX for Client
- Eight core “Zen 4” APU (“Phoenix”) modified
- Supports heterogeneous core design with 2 Zen 4 cores + 4 Zen 4c cores
  - Delivers max performance when running few threads
  - Delivers improved many-threads performance
- Has high-frequency 16 MB L3 cache to operate at “Zen 4” frequencies
- Like server configurations, “Zen 4c” delivers higher performance when operating at lower power per core
- Delivers higher 6-core performance for TDPs …

… =9% performance increase over “Zen 4” on cloud server benchmarks [1,2]
- On AI/ML workloads, 8534PN has 50% higher performance/Watt [4]

[1] SP6-003 [2] SP6-007 [3] SP6-014 [4] SP6-006 (See endnotes)

Acknowledgements
We would like to thank our talented AMD design team across Bangalore, Hyderabad, Fort Collins, Santa Clara, Austin, Boston and all AMDers who contributed to “Zen 4c.”
Come check out “Bergamo” up-close and de-lidded at the AMD exhibition booth.
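The Uplift columns in the server and edge benchmark tables earlier in this paper can be re-derived from the two score columns. The sketch below is only a consistency check: scores are copied from the tables (with "k" expanded to thousands), and the dictionary and function names are illustrative.

```python
# (score for first system, score for second system, uplift stated on the slide in %)
server = {  # "Zen 4c" 9754 vs. "Zen 4" 9654
    "SPECint2017": (1950, 1790, 9),
    "SPECpower_ssj2008": (33.3e3, 30.6e3, 9),
    "VMmark 3.1.1": (44.15, 40.66, 9),
    "Chaos V-Ray": (229e3, 209e3, 10),
    "HPL Benchmark": (10134, 8856, 14),
    "TPC Express Benchmark AI": (1841, 1554, 18),
}
edge = {  # "Siena" 8534P vs. Xeon Platinum 8471N
    "SPECint2017": (474, 450, 5),
    "SPECint2017/W": (2.37, 1.5, 58),
    "SPECpower_ssj2008": (27.3e3, 12.8e3, 113),
    "FFmpeg": (438e3, 264e3, 66),
}


def uplift_pct(new, old):
    """Percentage gain of `new` over `old`."""
    return (new / old - 1.0) * 100.0


def check(table):
    """Map each benchmark to (recomputed uplift, whether it rounds to the stated value)."""
    return {name: (uplift_pct(new, old), round(uplift_pct(new, old)) == stated)
            for name, (new, old, stated) in table.items()}


server_check = check(server)
edge_check = check(edge)
for name, (value, ok) in {**server_check, **{f"edge {k}": v for k, v in edge_check.items()}}.items():
    print(f"{name:28s} {value:+6.1f}%  matches slide: {ok}")
```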

Endnotes

SP5-001C: SPECrate2017_int_base comparison based on published results as of 11/10/2022. Configurations: 2P AMD EPYC 9654 (1790 SPECrate2017_int_base, 192 total cores, www.spec.org/cpu2017/results/res2022q4/cpu-32607.html) vs. 2P AMD EPYC 7763 (861 SPECrate2017_int_base, 128 total cores, www.spec.org/cpu2017/results/res2021q4/cpu-30148.html). SPEC and SPECrate are registered trademarks of the Standard Performance Evaluation Corporation. See www.spec.org for more information.

SP5-143A: SPECrate2017_int_base comparison based on performing system published scores from www.spec.org as of 6/13/2023. 2P AMD EPYC 9754 scores 1950 SPECrate2017_int_base (http://www.spec.org/cpu2017/results/res2023q2/cpu-36617.html), which is higher than all other 2P servers. 1P AMD EPYC 9754 scores 981 SPECrate2017_int_base (981.4 score/socket, http://www.spec.org/cpu2017/results/res2023q2/cpu-36613.html), which is higher per socket than all other servers. SPEC, SPEC CPU, and SPECrate are registered trademarks of the Standard Performance Evaluation Corporation. See www.spec.org for more information.

SP5-011E: SPECpower_ssj2008 comparison based on published 2P server results as of 6/13/2023. Configurations: 2P AMD EPYC 9654 (30,602 overall ssj_ops/W, 2U, https://spec.org/power_ssj2008/results/res2022q4/power_ssj-01204.html) is 1.81x the performance of best published 2P Intel Xeon Platinum 8490H (16,902 overall ssj_ops/W, 2U, https://spec.org/power_ssj2008/results/res2023q2/power_ssj-01251.html). SPEC and SPECpower_ssj are registered trademarks of the Standard Performance Evaluation Corporation. See www.spec.org for more information.

SP5-072A: As of 6/13/2023, a 4th Gen EPYC 9754 powered server has highest overall scores in key industry-recognized energy efficiency benchmarks SPECpower_ssj2008, SPECrate2017_int_energy_base, SPECrate2017_fp_energy_base and VMmark Server-Power-Performance. See details at https:/

… based on published scores from https:/ as of 6/13/2023. Comparison of 2P AMD EPYC 9754 (229,471 max vsamples, https:/ 2.40x the performance of published 2P Intel Xeon Platinum 8490H (95,471 max vsamples, https:/ AMD EPYC 9654 (209,102 max vsamples, https:/ shown for reference at 2.19x. 2P EPYC 7763 (115,385 vsamples, https:/ at 1.21x and 2P Intel Xeon Platinum 8380 (62,919 vsamples, https:/ at 0.66x for reference. Chaos, V-Ray and Phoenix FD are registered trademarks of Chaos Software EOOD in Bulgaria and/or other countries. NOTE: Red text only needs to be included with charts that show the 7763/9654/8380.

SP5-051: TPCx-AI SF3 derivative workload comparison based on AMD internal testing running multiple VM instances as of 6/13/2023. The aggregate end-to-end AI throughput test is derived from the TPCx-AI benchmark and as such is not comparable to published TPCx-AI results, as the end-to-end AI throughput test results do not comply with the TPCx-AI Specification. Configurations:
- 2 x AMD EPYC 9754 on Titanite (BIOS and Settings: AMI Core Ver. 5.25, Project Ver. RTI1000F and Default BIOS settings (SMT=on, Determinism=Auto, NPS=1), 1.5TB (24) Dual-Rank DDR5-4800 64GB DIMMs, 1DPC, SK Hynix SHGP31-500GM 500GB NVMe, Ubuntu 22.04 LTS (8-instances, 30 vCPUs/instance, 1841 AI test cases/min);
- 2 x AMD EPYC 9654 on Titanite (BIOS and Settings: AMI Core Ver. 5.25, Project Ver. RTI1000F and Default BIOS settings (SMT=on, Determinism=Auto, NPS=1), 1.5TB (24) Dual-Rank DDR5-4800 64GB DIMMs, 1DPC, Samsung SSD 983 DCT 960GB, Ubuntu 22.04.1 LTS (6-instance, 28 vCPUs/instance, 1554 AI test cases/min);
- 2 x Intel(R) Xeon(R) Platinum 8490H on Dell PowerEdge R760 (BIOS and Settings: ESE110Q-1.10 and Package C1E, Default BIOS settings (C State=Disabled, Hyper-Threading, Turbo boost=enabled (ALL)=Enabled, SNC (Sub NUMA)=Disabled), 2TB (32) Dual-Rank DDR5-4800 64GB DIMMs, 1DPC, Dell 1.7TB NVMe, Ubuntu 22.04.2 LTS (4-instance, 30 vCPUs/instance, 831 AI test cases/min).
Results may vary due to factors including system configurations, software versions and BIOS settings. TPC Benchmark is a trademark of the TPC.

SP5-049D: VMmark 3.1.1 matched pair comparison based on published results as of 9/19/2023. 2-node, 2P AMD EPYC 9754, 512 total cores, SAN storage scoring 44.15, 49 tiles on VMmark 3.1.1. Source: https:/ 2-node, 2P AMD EPYC 9654, 384 total cores, SAN storage scoring 40.66, 42 tiles on VMmark 3.1.1. Source: https:/ is a registered trademark of VMware in the US or other countries.

SP5-154: HPL benchmark based on AMD internal testing as of 6/13/2023. 2P server configurations: 2P EPYC 9754, BIOS 1003F (Memory Target Speed=DDR4800, TSME=Disabled, IOMMU=Auto, TDP Control=Manual, TDP=400, PPT Control=Manual, PPT=400, Determinism Control=Manual, Determinism Enable=Power, NUMA nodes per socket=NPS4, SMT Control=Disable), 768 GB (24x 32GB 2R DDR5-4800) scores an average 10,134 GFLOPS, which is 1.66x the performance of AMD estimated 2P Xeon Platinum 8490H (6115 GFLOPS). 2P EPYC 9654, BIOS 1003F (Memory Target Speed=DDR4800, TSME=Disabled, IOMMU=Auto, TDP Control=Manual, TDP=400, PPT Control=Manual, PPT=400, Determinism Control=Manual, Determinism Enable=Power, NUMA nodes per socket=NPS4, SMT Control=Disable), 768 GB (24x 32GB 2R DDR5-4800) scores 8856 GFLOPS for 45% better GFLOPS as reference. Results may vary due to factors including system configurations, software versions and BIOS settings.

Endnotes (continued)

SP6-007: Server-side Java overall operations/watt (SPECpower_ssj2008) claim based on 1P published results at spec.org as of 9/18/2023. 1P servers: EPYC 8534P (64-core, 200W TDP) scoring 27,342 overall ssj_ops/W, https://spec.org/power_ssj2008/results/res2023q3/power_ssj-01309.html, 52.5W active idle, 7,289,747 ssj_ops at 100% target load, 212W) vs. 1P Xeon Platinum 8471N (52-core, 300W TDP, https://spec.org/power_ssj2008/results/res2023q3/power_ssj-01294.html, 88.2W active idle, 6,749,219 ssj_ops at 100% target load, 419W) scoring 12,821 ssj_ops/W for 2.10x the SPECpower_ssj2008 overall ssj_ops/W. Assuming an 8kW rack deploying servers, 37 ea. EPYC 8534P vs. 19 ea. Xeon 8471N can fit within the power budget delivering 1.91x the total server-side Java throughput/rack. This scenario contains many assumptions and estimates and, while based on AMD internal research and best approximations, should be considered an example for information purposes only, and not used as a basis for decision making over actual testing. SPEC and SPECpower_ssj are registered trademarks of the Standard Performance Evaluation Corporation.

SP6-014: Transcoding (FFmpeg DTS raw to VP9 codec) aggregate frames/hour/system W comparison based on AMD internal testing as of 9/16/2023. Configurations: 1P 64C EPYC 8534P (437,774 fph median performance, 16 jobs/8 threads each, avg system power 362W) powered server versus 1P 52C Xeon Platinum 8471N (263,691 fph median performance, 13 jobs/8 threads each, avg system power 522W) for 1.66x the performance and 2.39x relative performance per W. Scores will vary based on system configuration and determinism mode used (max TDP power determinism mode profile used). This scenario contains many assumptions and estimates and, while based on AMD internal research and best approximations, should be considered an example for information purposes only, and not used as a basis for decision making over actual testing.

SP6-006: AI/ML workloads (Neural Magic DeepSparse, oneDNN, OpenCV, OpenVINO, and TensorFlow averaged) performance/system W/system $ comparison based on Phoronix Test Suite paid testing as of 8/18/2023. Configurations: 1P 64C EPYC 8534PN (0.96x relative performance, 247 avg system W, est $8,482 system cost USD) powered server versus 1P 52C Xeon Platinum 8471N (1.07x relative performance, 423 avg system W, est $8,477 system cost USD) powered server for 0.89x the performance, 42% lower system power (1.52x the performance/system W), comparable system cost for 1.52x the overall system performance/W/$. Assuming an 8kW rack deploying servers, 32 ea. EPYC 8534PN vs. 18 ea. Xeon 8471N can fit within the power budget delivering 1.58x the total AI/ML throughput/rack on average. Testing not independently verified by AMD. Scores will vary based on system configuration and determinism mode used (default TDP power determinism mode profile used). This scenario contains many assumptions and estimates and, while based on AMD internal research and best approximations, should be considered an example for information purposes only, and not used as a basis for decision making over actual testing. Results may vary and are normalized to EPYC 8534P Performance Determinism mode (always 1 in all measurements). Estimated system pricing based on Bare Metal Server GHG TCO v9.52.

PHX-3: Testing as of 5/8/23 by AMD Performance Labs utilizing System configuration for AMD Ryzen 7 PRO 7840U 28W TDP: MAYAN FP7-101DRC3INT-230331 (CRB), 16GB RAM, 1TB NVMe SSD, Integrated Radeon Graphics, Windows 11 Pro. System configuration for Ryzen 7 PRO 6850U 28W TDP: Mayan B10L(B) CRB, 16GB RAM, 1TB NVMe SSD, AMD Radeon graphics, Windows 11 Pro using the following tests: Cinebench R23 1T, Cinebench R23 nT, 3DMark Night Raid Graphics, PassMark 10 CPU Mark, PCMark 10 Express, PCMark 10 Productivity Test Group, and Puget Photoshop GPU. Laptop manufacturers may vary configurations yielding different results. PHXP-3

PHX-50: Testing as of 9/21/23 by BOXX Technologies, commissioned by AMD, utilizing Dell Latitude 5440 with Intel Core i5 1345U processor, Intel Integrated graphics, 16GB RAM, 512GB NVMe SSD and Windows 11 Pro, Dell Latitude 5440 with Intel Core i5 1350P processor, Intel Integrated graphics, 16GB RAM, 256GB NVMe SSD and Windows 11 Pro, and HP EliteBook 845 G10 with Ryzen 5 PRO 7540U processor, Integrated Radeon Graphics, 32GB RAM, 256GB NVMe SSD, Windows 11 Pro. Using the following tests: Geekbench v5 Single Core, Geekbench v5 Multi Core, Blender Bench CPU-Classroom, Passmark 11 Overall, PCMark 10 Productivity Test Group, Puget Bench Adobe Photoshop Overall, P

113、assmark 11 3D Graphics Mark.PC manufacturers may vary configurations yielding different results.Results may vary.FN-9:https:/www.spec.org/cpu2017/results/res2020q2/cpu-22554.html FN-10:https:/www.spec.org/cpu2017/results/res2018q3/cpu-08666.htmlFN-11:https:/www.spec.org/cpu

114、2017/results/res2023q3/cpu-38888.html

2.2: Zen 4c: The AMD 5nm Area-Optimized x86-64 Microprocessor Core, 2024 IEEE International Solid-State Circuits Conference, 29 of 29

Disclaimer

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions, and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. Any computer system has risks of security vulnerabilities that cannot be completely prevented or mitigated. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.

THIS INFORMATION IS PROVIDED "AS IS." AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS, OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY RELIANCE, DIRECT, INDIRECT, SPECIAL, OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

© 2024 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo, EPYC, Infinity Fabric, Ryzen, and combinations thereof are trademarks

120、 of Advanced Micro Devices, Inc. PCIe is a registered trademark of PCI-SIG. Other product names used in this publication are for identification purposes only and may be trademarks of their respective companies.

127、2.3:5th-Generation Intel Xeon Scalable Processors,codename Emerald Rapids 2024 IEEE International Solid-State Circuits Conference1 of 35Emerald Rapids:5th-Generation Intel Xeon Scalable ProcessorsAshley O.Munch1,Nevine Nassif1,Carleton L.Molnar1,Jason Crop2,Rich

128、 Gammack1,Chinmay P.Joshi3,Goran Zelic1,Kambiz Munshi1,Min Huang4,Charles R.Morganti2,Sireesha Kandula1,Arijit Biswas1Intel1Hudson,MA;2Fort Collins,CO;3Hillsboro,OR;4Santa Clara,CA2.3:5th-Generation Intel Xeon Scalable Processors,codename Emerald Rapids 2024 IEEE International Solid-State Circuits C

129、onference2 of 35

Outline: Overview (Features, Process Technology); Floorplan; Power and Performance; Last Level Cache (LLC); Memory; IO; Clocking & Timing; Summary

Overview: Features
  64 high performance Raptor Cove cores including Intel Advanced Matrix Extensions (AMX) for machine learning
  8 channel DDR5 5600 MT/s 1DPC
  5 x16 PCIe/Compute Express Link (CXL) Lanes 32 GT/s
  4 x24 Ultra Path Interconnect (UPI) Lanes 20 GT/s
  Integrated Accelerators: Quick Assist Technology (QAT), Dynamic Load Balancer (DLB), Data Streaming Accelerator (DSA), In Memory Analytics Accelerator (IAA)
  2-die multi-chip package

Overview: Process Technology
  Intel 7
  Dual poly pitch SuperFin transistors
  17 metal layers including 2 thick metal layers
  Optimized transistor speed with a focus on leakage and dynamic capacitance
  +3% frequency/W enhancements
  5% frequency increase at iso leakage
  0.5% Cdyn reduction

2.3:5th-Generation Intel Xeon Scalable Processors,codename Emerald Rapids 2024 IEEE Inte

133、rnational Solid-State Circuits Conference5 of 35

Package

Floorplan
  The highest core count configuration (XCC) is constructed of 2 unique die built as mirrors of each other.
  3 embedded silicon bridges allow for 7 full-bandwidth SCF mesh crossings.
  The result is a quasi-monolithic die with aggregate area (1500.56 mm²) beyond the reticle limit of a single monolithic die.
  Each die is built as a 7×7 array of modular integration components.
  Components are arranged with core & cache in the center, the UPI, PCIe and accelerators in the north, the DDR & memory controllers on the west and east, and MDF on the south.
  Components are connected with an on-die scalable coherent fabric (SCF) in a mesh topology.
  To enable full bandwidth to each I/O agent, two dedicated interleaving rows of fabric are used in the north, allowing parallel paths to/from the cores and memory.

2.3:5th-Generation Intel Xeon Scalable Processors,codename Emerald Rapids 2024 IEEE I

145、nternational Solid-State Circuits Conference16 of 35

Power and Performance
  Key power and performance improvements over prior generation:
    Raptor Cove core
    Core count increase
    LLC size increase: 1.875 to 5 MB per tile
    DDR5 speed increase: 4800 to 5600 MT/s 1DPC
    UPI speed increase: 16 to 20 GT/s
    SoC die topology change: 4 to 2 die package
    Idle power reductions: single-phase Fully Integrated Voltage Regulator (FIVR) mode; Enhanced Active Idle Mode
  Performance improvements within same power envelope, Emerald Rapids (EMR) 64 Core versus Sapphire Rapids (SPR) 56 Core, 350W:
    1.21X General Integer Compute
    1.42X AI Workloads
  See backup for workloads and configurations. Results may vary.

Last Level Cache (LLC)
  Unified cache architecture shared across all cores with a total 300MB
  Per-core tile LLC size 2.5X of prior generation
  30% higher density with same pipeline latency at iso-frequency
  All arrays built using 6T SRAM with read/write assist (RWA) circuits
  DECTED in LLC data for SER benefit, SECDED for the remaining arrays
  Extensive power saving features,

  30% higher density LLC with same pipeline latency at iso-frequency
  Improvements in design, modeling and error correction for large die and system effects increasing robustness
  Increase in cores and over 2.84 times the cache implemented in lower total silicon area on the same base process
  See backup for workloads and configurations. Results may vary.

Acknowledgements
  Thank you to all the Intel Xeon Scalable processor development teams for the uncompromising dedication in bringing this product to market.

2.4: ATOMUS: A 5nm 32TFLOPS/128TOPS ML System-on-Chip for Latency Critical Applications, 2024 IEEE International Solid-State Circuits Conference, 1/32

ATOMUS: A 5nm 32TFLOPS/128T

154、OPS ML System-on-Chip forLatency Critical ApplicationsC-H.Yu,H-E.Kim,S.Shin,K.Bong,H.Kim,Y.Boo,J.Bae,M.Kwon,K.Charfi,J.Kim,H.Kim,M.Shim,C.Ha,W.Shin,J-S.Yoon,M.Chi,B.Lee,S.Choi,D.Kim,J.Woo,S.Yoon,H.Jo,H.Kim,H.Heo,Y-J.Jin,J.Yu,J.Lee,H.Kim,M.Kang,S.Choi,S-G.Kim,M.Choi,J.Oh,Y.Kim,H.Kim,S.Je,J.Ham,J.Yoon

155、,J.Lee,S.Park,Y.Park,J.Lee,B.Hong,J.Ryu,H.Ko,K.Chung,J.Choi,S.Jung,Y.F.Arthanto,J.Kim,H.Cho,H.Jeong,S.Choi,S.Han,J.Park,K.Lee,S-I.Bae,J.Bang,K-J.Lee,Y.Jang,J.Park,S.Park,J.Park,H.Shin,S.Park,J.OhJinwook OhRebellions,Republic of Korea2.4:ATOMUS:A 5nm 32TFLOPS/128TOPS ML System-on-Chip for Latency Cri

156、tical Applications 2024 IEEE International Solid-State Circuits Conference2/32

Outline: Motivation; ML System-On-Chip (Neural Engine, Memory Subsystem, Synchronization); Benchmark Measurement (Utilization and Latency, Performance Efficiency); Conclusion

Latency-critical AI Applications
  Many cloud services show input query distributions ranging from single to multiple streams.
  Service models in public cloud platforms are often centered around responsiveness and stable 99-percentile latency.
  High Frequency Trading; AI-as-a-Service at Cloud; GenAI-based Service at Edge Device/Server
  Inference Tasks Where Every Millisecond Matters

System Requirements for Generative AI
  Need balance of high compute and bandwidth capability
  Flexibility is preferred for large supportability as well as for high utilization (performance)
  From Text LLM to Multi-modal Models
  [Figure: Memory Op and Compute Op relative density for Text to Text (Transformer, Google), Text to Image (DALL-E2, OpenAI), Text to Video (Make-a-video, Meta)]

Latency vs. Utilization: Contrasting Spaces To Conquer at The Same Time
  Utilization comes at the expense of latency (batches, et al.)
  Tackling Both Spaces Simultaneously
    Reducing the cold latency as much as possible: finer granularity
    Resolving the dependencies as quickly as possible
      Multiple layered NoC & memory subsystem hierarchies: data dependency
      Fine-granule & multiple layered sync protocols: control dependency

Chip Specification (Single Chip)
  Process: Samsung 5nm
  FP16 (Bfloat): 32 TFLOPS
  INT8: 128 TOPS
  On-chip SRAM: 64 MB
  TDP: 60-130 Watt (Configurable)
  External Memory: 16GB, GDDR6 (ECC enabled), 256 GB/s
  Host/C2C I/F: PCIe Gen5, 64 GB/s
  Multi-instance NPU Support: Up to 16 Independent tasks
  [Die Photograph: GDDR6 (x4), Peri I/F, Neural Engine Cluster 0, Neural Engine Cluster 1, L2 Shared Memory, CP, RoT, PCIe Gen5 x16]

2.4:ATOMUS:A 5nm 32TFLOPS/128TOPS ML System-on-Chip for Latency Critical Ap

165、plications 2024 IEEE International Solid-State Circuits Conference8/32

Top Block Diagram of ML SoC
  Multi-core SoC with multi-layer memory subsystem
  Flexible neural engines and dedicated HW-based on-chip synchronization
  [Figure: ATOMUS Top Diagram, with Task DMA and Neural Engine Cluster]
  SoC Component | Operation
  T-DMA | SoC level data transfer
  N-DMA | NE task data transfer
  L0/L1 Cache/L2/GDDR | Memory subsystem
  CP, RoT | Control unit
  System/Cluster NoC | Low-latency data NoC
  PCIe (HDMA) | Host/P2P Interconnect

SW Semantic at ATOMUS
  [Figure labels: Runtime, Compiler, CP, TM, Compute Unit, CFG, CMD, Task, Program]
  Semantic | Command (CMD) | Task | Program
  Target Unit | CP | Task Manager | Compute units in Neural Engine
  Content | High-level Op Type, Associated Resources, Dependency | Low-level Op Type, Associated Resources, Dependency | Instruction Stream for ML Ops
  Memory Space | Host + Device Space | Device Space (L0L2/DRAM) | Neural Engine Space (L0)

Neural Engine
  Compute Units: Heterogeneous SIMD/MIMD elements; 4 MB Scratch-Pad for Working Memory; Instruction Level Parallelism/Synchronization
  Task Manager (TM): Multiple Task Queues (COMP/DMAs); Task Level Parallelism/Synchronization
  Neural DMA (N-DMA): Prioritized Multiple DMA Operation

L0/L1 Control for Parallel Execution
  L0 Sync Bus: Instruction-based sync for parallel execution of processors
  L1 Sync Bus: Task-level sync for parallel execution by TMs

Hierarchical Memory Subsystem (HMS)
  Unit | Size | Purpose
  L2 Shared Memory | 32MB | Increasing effective bandwidth by minimization of port conflicts
  L1 Cluster Cache | NA | Required bandwidth reduction for intra cluster shared data
  L0 Scratch Pad (SP) | 4MB | Latency reduction while maximizing bandwidth
  L0 SP via Local NoC | 16MB | Latency reduction while maximizing bandwidth; preserving streaming data within a cluster

Bandwidth and Latency Improvements
  Direct data from each L0 to GDDR: load/store latency reduction 4X
  Shared data from each L0 to GDDR: accumulative required BW reduction
  Multi-level interleaving: Bank/Bank Group/Rank
  [Figure labels: 92%, 3.4X, L1 Access, L2 Access; Efficiency 97%, 92%, 85%, 50%]

Multi-Level SoC Synchronization: Multi Level Sync with Dedicated L1/L2 Sync Bus
  HW: L0 sync (Neural Engine), L1/L2 sync (Engine Cluster and T-DMA)
  SW: L3 sync (L2 and CP in command level), L4 sync (L3 and H-DMA in command buffer level)

2.4:ATOMUS:A 5nm 32TFLOPS/128TOPS ML System-on-Chip for

176、Latency Critical Applications 2024 IEEE International Solid-State Circuits Conference16/32

Sync Mechanism for Maximum Parallelism: execution time (T-DMA to N-COMP1) comparison
  Sequential Command Execution: CP-centric dependency control
  Parallel Command Execution by CP: CP resolves the command dependency across 4 command queues; N-DMA0 early kick-off
  Parallel Task Execution by TM: TM resolves the task dependency across 4 task queues; TM updates dependency table at the end of local execution; TM-based N-DMA/N-COMP control
  Dedicated L1/L2 Sync based Global TM Synchronization: TM resolves T-DMA dependency to all Neural Engines via L1/L2 sync bus; T-DMA0-dependent COMP tasks can be executed simultaneously
  [Figure: execution timelines showing CP dependency updates, H-DMA0, T-DMA0, N-DMA0, N-DMA1, N-COMP0, N-COMP1 on Neural Engine0; Neural Engine0, Neural Engine1, Neural Engine7 for global TM synchronization]

Efficiency Uplift with Command Scheduling
  Preemptive (V,F) scheduling
    Fine power domain with on-chip PVT* sensors
    Network-based OPP* (Diffusion vs LLM)
    Performance-oriented OD/SOD implementation (original design)
  Command-aware Preemptive Power Control
    CP-based PMIC/PLL monitoring
    Command-level V/F configuration enhance optimum VDD: (P, V, T, ISoC, Next CMD) = V,Fnew
  Higher V,F with optimum energy efficiency
  *OPP: Operating Performance Point
  *PVT: Process, Voltage and Temperature
  [Figure: Normalized Performance/Watt at V,FMin, V,FNorm, V,FMax for Original Design and CMD-aware V,F, with Optimum OPP marked]

2.4:ATOMUS:A 5nm 32TFLOPS/128TOPS ML System-on-Chip for L

185、atency Critical Applications 2024 IEEE International Solid-State Circuits Conference21/32

Utilization Measurement
  Cycles consumed for a host memory copy are included in the total execution cycles.
  Compute utilization: active cycles of Neural Engine / total cycle count
  90% compute utilization over model suite (vision and transformer)
  [Figure: Compute Utilization and DMA Utilization; High Memory Util, Low Memory Util]

Utilization Breakdown of MLPerf™ Networks
  ATOMUS maintains parallel processing of (prefetching) memory tasks and compute tasks extremely well
  [Figure: T-DMA, N-DMA, N-COMP utilization for BERT-Large, ResNet50-SS, ResNet50-MS, RetinaNet]

Small Batch Inference Solution
  Model | Use case | Typical Batch Size
  Time series | Finance | 1-8
  Typical CV | Industrial AI | 1-32
  Diffusion | IT Service | 1
  Transformer (e.g. BERT) | Search Engine | 1-128
  LLM (e.g. T5, LLama2) | Chatbot | 1-128

AI Acceleration for Small Batch Inference
  Best and stable performance for low-batch inference
  Multi-card deployment efficiently supports multi stream tasks (e.g. 16 for BERT)
  [Figure: BERT-Base (FP16), Input Sequence Size=128; Latency (ms) and Throughput (inf/sec)]
  GPU L4 2.9ms (Pytorch 2.0) vs. ATOMUS 400XGPU

Real-Time Ray-Tracing on Edge Device

191、Requirement
  Sponzan mesh primitives
  Render complexity O(n log n)

2.5: A 28nm Physical-Based Ray-Tracing Rendering Processor for Photorealistic Augmented Reality with Inverse Rendering and Background Clustering for Mobile Devices, 2024 IEEE International Solid-State Circuits Conference, 11 of 60

This work: Efficient Mobile Ray Tracing Rendering Processor for AR
  End-to-end inverse-rendering & ray-tracing solution for mobile devices
  O(n log n) to O(1) background complexity reduction
  Scalable partitioning scheme for 13X speed up over baseline
  Global schedulers reduce memory conflicts and improve PE utilization
  [Figure: Platform (Phone, VR, AR); Inverse Rendering Mode: Real scene, Inverse Rendering by Neural Network (Encoder, Decoder), A Map, L Map, N Map, 3D Scene Reconstruction; Render Mode: User Defined Object, Virtual Object Insertion, Ray-Tracing (RT) Rendering Processing]

Outline
  Introduction and Motivation
  Overall Algorithms and Flow
    End-to-end Inverse Rendering and Ray-Tracing
    Ray-tracing Rendering with scalable partitioning scheme
  Hardware Architecture
    Overall Chip Architecture
    Reconfigurable PE for Ray-Tracing and Inverse Rendering
    Global RT Scheduler and Global Memory Access Controller
  Measurement Results
  Summary

Overall Algorithms and Flow
  Step1: Inverse Rendering
  Step2: 3D Construction
  Step3: Ray-Tracing Rendering

Inverse Rendering with Camera Input (Step1: Inverse Rendering)
  Input: 2D Camera (x, y, R, G, B)
  CNN-based physical encoder-decoder
  Output: Background physical attributes: Surface Albedo map (A), Surface Normal map (N), Lighting map (L), Depth map (D)
  Background physical attributes are saved to the Physical Attributes MEM (PAMEM)
  [Figure: 2D Input, Physical Encoder, Physical Decoder, Inverse Render Result; Background Clustering; Physically-Based Inverse Rendering; Skip Index Mapping; Threshold Skip; Per-Pixel Lighting]
  [Figure: PAMEM contents: 3D Lighting Map (Color, Position, Intensity), Albedo Map (Background Surface Albedo Index), Normal Map (Background Surface Normal Vector); Per Pixel Compression Decoder (PPCD): Pixel ID x, Pixel ID y, X-CMP0, Y-CMP0, X Channel, Y Channel, Compression Offset; Unified Addr. Converter (UAC): Channel ID x, Channel ID y; PE(x,y) PAMEM Access Addr.]

  [Figure labels: 10000X, 5000X, 22X, 2.6X]
  Bottleneck: Complex Background Mesh
  76% Workload reduction due to Step 1&2
  O(n log n)
  [Figure: Ray Generation (Per PE), BBOX Intersect, Scene Intersect, OBJ Intersect, Miss, Closest Hit, Iterative Traversal, Shading, Result]
  Ray-Tracing Computi

210、ng FlowOverall Performance Improvementn Workload saving compared with conventional RTn Reduced background complexity:On-Chip memory saving2.5:A 28nm Physical-Based Ray-Tracing Rendering Processor for Photorealistic Augmented Reality with Inverse Rendering and Background Clustering for Mobile Devices

Slide 30 of 60
Outline
- Introduction and Motivation
- Overall Algorithms and Flow
  - Inverse Rendering for 3D construction
  - Ray-tracing Rendering with scalable partitioning scheme
- Hardware Architecture
  - Overall Chip Architecture
  - Reconfigurable PE for Ray-Tracing and Inverse Rendering
  - Global RT Scheduler and Global Memory Access Controller
- Measurement Results
- Summary

Slide 31 of 60
Scalable 3D Partitioning Scheme
- Challenge: on-chip memory limitation
- Solution: Global Tracing Bounding Box
[Figure: direct segmentation without global tracing loses shadow information; rendering the segmented object with the Global Tracing Bounding Box keeps shadow information]

Slide 32 of 60
Scalable 3D Partitioning Scheme
- Global Tracing Bounding Box: EBBOX and TBBOX
  - Empty Bounding Box (EBBOX) and Target Bounding Box (TBBOX)
- Scalable flow embedded in each PE controller
- 13X average speed-up with 5.6% memory overhead
[Figure: example model with total V#: 34834 and total T#: 69451. The 3D model/BBOX is pre-defined by the user. A light ray and a shadow ray are cast from (x0, y0). The empty BBOX skips internal triangle intersection computation for speed-up but keeps the estimated light transportation trace, so the shadow from the BBOX is kept; the cast ray is evaluated against the target BBOX.]

Slide 33 of 60
Bullets and figure as on slide 32, plus:
Scalable 3D Model Partitioning Flow
- Cast-Ray (per PE), then BBOX Intersection Evaluation
  - BBIF=0: Skip Intersection Computation
  - BBIF=1: Triangle Intersection Evaluation; on Intersect, 3D Partition Check
    - TBBOX: RT Light Transport & Shading
    - EBBOX: RT Light Transport (Skip Shading)
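The partitioning flow above can be read as a small per-ray decision procedure. The sketch below is only one plausible reading of the slide's flow chart: the ordering of the checks, the function name `stages_for_ray`, and the stage strings are assumptions made for illustration, not the controller's actual implementation.

```python
from enum import Enum
from typing import List


class BoxKind(Enum):
    TBBOX = "target"  # Target Bounding Box: fully shaded
    EBBOX = "empty"   # Empty Bounding Box: only light transport is kept


def stages_for_ray(bbif: int, triangle_hit: bool, box: BoxKind) -> List[str]:
    """Return the stages a PE would run for one cast ray (hypothetical model)."""
    stages = ["bbox_intersection_evaluation"]
    if bbif == 0:
        # Ray misses the bounding box: no triangle tests are needed.
        stages.append("skip_intersection_computation")
        return stages
    stages.append("triangle_intersection_evaluation")
    if not triangle_hit:
        return stages
    # 3D partition check: light transport always runs, shading only for TBBOX.
    stages.append("light_transport")
    if box is BoxKind.TBBOX:
        stages.append("shading")
    return stages


if __name__ == "__main__":
    print(stages_for_ray(0, False, BoxKind.TBBOX))
    print(stages_for_ray(1, True, BoxKind.EBBOX))
    print(stages_for_ray(1, True, BoxKind.TBBOX))
```

In this reading, an EBBOX hit still contributes to light transport (so shadows cast by a neighbouring partition survive) while skipping the shading work, which is where the slide's speed-up would come from.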

Slide 34 of 60
Scalable 3D Partitioning Scheme
Bullets, BBOX figure and partitioning flow as on slide 33, plus:
[Figure: Speed Up with Model Partitioning. Normalized run time (0 to 1) per test object, split into Cast Ray, Shading and Intersection Evaluation; left bar of each pair is the baseline design. Speed-up labels: 14X, 39X, 5.7X, 1.5X, 15X, 10X, 11X, 8.4X.]
[Figure: On-Chip Memory Overhead for BBOX. Normalized model size (0 to 1) without and with BBOX: 5.6% overhead.]

Slide 35 of 60
Outline (unchanged from slide 30)

Slide 36 of 60
Overview of PBRT Rendering Processor
- 8 x 6 2D computing array supports RT and IR mode
- Inverse render mode
- Ray-Tracing mode
[Figure: Top-level Architecture. 8 x 6 2D computing array (Row0 to Row7, PE0 to PE5 per row) with WMEM, OMEM (196Kb), ACT units and Mem Ctrl per column, plus VCO, SCAN I/O, Mode Ctrl, CG and the Global RT Scheduler (GRTS).]

Slide 37 of 60
Overview of PBRT Rendering Processor
Bullets and top-level architecture figure as on slide 36, plus:
[Figure: Data Flow for Inverse Render and Ray-tracing Mode. IR mode: input data and weight data, weight stationary, with ReLU, Scaling, OMEM and MEM Ctrl per column. RT mode: GRTS maps tasks (Task a,b ... Task y,z) onto PE(0,0) to PE(7,5); PAMEM bank; the Global MEM Access Ctrl (GMAC) pushes results to OMEM banks through ADDR0 to ADDRN.]

Slide 38 of 60
Same content as slide 37.

2.5: A 28nm Physical-Based Ray-Tracing Rendering Processor for

Photorealistic Augmented Reality with Inverse Rendering and Background Clustering for Mobile Devices
2024 IEEE International Solid-State Circuits Conference

Slide 39 of 60
Outline (unchanged from slide 30)

Slide 40 of 60
Reconfigurable PE for RT
- PE computing units for RT mode: MAC8-32, DIV64, SQRT64
[Figure: PE (Ray-Tracing Mode) with PE CTRL, CG, OBJMEM (4KB), Mix-Precision MAC (INT8/16/32), Div64, SQRT64, RIHD and RT CTRL.]
[Figure: HW Reuse for INT32 Operations. Operands A = {A1, A0} and B = {B1, B0}, 16b per half; partial products A0B0, A0B1, A1B0, A1B1 are added into a 32b + 32b result.]

Slide 41 of 60
Reconfigurable PE for RT (RT Mode)
- PE computing units for RT mode: MAC8-32, DIV64, SQRT64
- RIHD is to record and fetch object info from OBJMEM
- RTC is for PE RT state control in RT mode
[Figure: PE (Ray-Tracing Mode), as on slide 40, with added detail. Ray-Tracing Intersection Hashmap Decoder (RIHD) fields: Bounding Box Intersection Flag (BBIF), Axis Aligned Bounding Box (AABB) ID, Triangle Mesh Group Address (TMGA). Ray-Tracing Controller (RTC): BBOX Intersection Evaluator (BBIE), Triangle Intersection Evaluator (TIE), Object Shader (OS). Connected blocks: OBJMEM, PE CTRL, GRTS, PAMEM, PE Task ID, GMAC, PE Compute Units (PCU), OMEM, DFFs.]
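The "HW Reuse for INT32 Operations" diagram splits each 32-bit operand into 16-bit halves (A1, A0 and B1, B0) and combines the four partial products A0B0, A0B1, A1B0 and A1B1. A minimal sketch of that decomposition follows, assuming unsigned operands and the usual shift-and-add recombination; the slide does not state signedness or the adder arrangement, and `mul32_from_16` is a hypothetical name.

```python
MASK16 = 0xFFFF


def mul32_from_16(a: int, b: int) -> int:
    """Multiply two unsigned 32-bit values using only 16x16-bit products."""
    a0, a1 = a & MASK16, (a >> 16) & MASK16  # low and high halves of A
    b0, b1 = b & MASK16, (b >> 16) & MASK16  # low and high halves of B
    # Four 16x16 partial products, each at most 32 bits wide.
    p00 = a0 * b0
    p01 = a0 * b1
    p10 = a1 * b0
    p11 = a1 * b1
    # Recombine: the cross terms are weighted by 2^16, the high term by 2^32.
    return p00 + ((p01 + p10) << 16) + (p11 << 32)


if __name__ == "__main__":
    x, y = 123456789, 987654321
    print(mul32_from_16(x, y) == x * y)
```

The full product is up to 64 bits wide, which matches the two 32b result words drawn on the slide.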

Slide 42 of 60
Same content as slide 41.

Slide 43 of 60
Reconfigurable PE for RT (RT Mode)
- PE computing units for RT mode: MAC8-32, DIV64, SQRT64
- RIHD is to fetch object spatial info from OBJMEM
- RTC is for RT state control in RT mode
- RT computation flow

RT Shading Flow: Ray-Generation, Ray-Iteration, Light Transportation, Object Hit, Object Surface Shading Computation

Ray-Tracing Light Transportation Effect
Ray Dir. = Reflection + Refraction
Reflection = L - 2 x N x (N.dot(L))
Refraction = I x eta + (-N) x (eta x cos(i) - sqrt(k))
cos(i) = -MAX(-1, MIN(1, I.dot(N)))
etat = OBJ Refractive Index
k = 1 - eta x eta x (1 - cos(i) x cos(i))

[Figure: PE (Ray-Tracing Mode) and HW Reuse for INT32 Operations, as on slide 40]
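The light-transportation formulas on the slide above (reflection and refraction of a ray at an object hit) can be sketched in a few lines. This is the textbook vector form of mirror reflection and Snell refraction in floating point, not the chip's integer datapath. Assumptions: all vectors are unit length, the normal `n` points against the incoming ray, the ray enters the object from air so that `eta = 1 / etat`, and the function names are illustrative. With this normal convention the refracted direction uses `+n`; the slide's `(-N)` term appears to use the opposite normal orientation.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def reflect(l, n):
    """Mirror direction: L - 2 * N * (N . L)."""
    d = dot(n, l)
    return tuple(li - 2 * d * ni for li, ni in zip(l, n))


def refract(i, n, etat):
    """Refracted direction for a ray entering a medium of refractive index etat.

    Returns None when k < 0 (total internal reflection).
    """
    cos_i = -max(-1.0, min(1.0, dot(i, n)))  # clamp, then flip sign
    eta = 1.0 / etat                          # assumed: incident medium is air
    k = 1.0 - eta * eta * (1.0 - cos_i * cos_i)
    if k < 0:
        return None
    scale = eta * cos_i - math.sqrt(k)
    return tuple(eta * ii + scale * ni for ii, ni in zip(i, n))


if __name__ == "__main__":
    s = math.sqrt(0.5)
    print(reflect((1, -1, 0), (0, 1, 0)))
    print(refract((s, 0.0, -s), (0.0, 0.0, 1.0), 1.5))
```

For the 45-degree example, the tangential component of the refracted ray equals sin(45°)/1.5, which is Snell's law for a refractive index of 1.5.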

Slide 44 of 60
Reconfigurable PE for RT (RT Mode)
Bullets, RT Shading Flow and PE figure as on slide 43, plus:

RT Shading Effect
Object Shading += Ambient Lighting + Diffuse Lighting + Specular Lighting
Ambient Lighting = Light Color x Background Albedo Index x Light Intensity
Diffuse Lighting = Light Color x OBJ Albedo Index x Light Intensity x MAX(0, N.dot(L))
Specular Lighting = Light Color x OBJ Specular Index x Light Intensity x (R.dot(V))^(OBJ SpecularExp)

Slide 45 of 60
Reconfigurable PE for IR
- PE computing units for IR mode: MAC8-16
- Power gating for RT logic for energy saving
[Figure: PE (Inverse Render Mode) with PE CTRL, CG, OBJMEM (4KB), MAC (INT8), Div64, SQRT64, RIHD, RT CTRL and a clock gating module. Input disable and clock gating for power saving in IR INF mode.]

Slide 46 of 60
Outline (unchanged from slide 30)

Slide 47 of 60
GRTS and GMAC for Memory Access
- Baseline: memory access confliction and low PE utilization
- Solution: Global RT Scheduler (GRTS) and Global Mem Access Control (GMAC)
[Figure: RT mode status of the PE array. Legend: execution PE (computing), idle PE, finished PE (OMEM access), execution PE (IMEM access). Problems shown: multiple PEs in idle stage; multiple PEs accessing the same MEM. Blocks: GMAC, global OMEM, PAMEM, GRTS.]

Slide 48 of 60
GRTS and GMAC for Memory Access
- CLK-controlled RTTC is used to check PE access and state
[Figure: Global Mem Access Ctrl (GMAC): clock rotate register, ADDR0 to ADDRN, OMEM FULL / OMEM EMPTY, accumulate pixel write address. Global RT Scheduler (GRTS): task queue (Task a,b ... Task y,z), CLK, RT Token Checker (RTTC). PE array: write request with PE(ID), push, check write request, PE status update, pixel task ID. PE CTRL: Ray Gen., Ray Iter., RTTC, computing status.]

Slide 49 of 60
GRTS and GMAC for Memory Access
- CLK-controlled RTTC is used to check PE access and state
- GRTS and GMAC update memory access and states every clock cycle
[Figure: as on slide 48]

Slide 50 of 60
GRTS and GMAC for Memory Access
- CLK-controlled RTTC is used to check PE access and state
- GRTS and GMAC update memory access and states every clock cycle
- 42.8X and 16X speed up with 2.8% and 0.6% area overhead
[Figure: as on slide 48, plus bar charts of normalized run time and normalized area, baseline vs. with the block. GRTS: 42.8X speed-up, 2.8% area overhead. GMAC: 16X speed-up, 0.6% area overhead.]

Slide 51 of 60
Outline (unchanged from slide 30)

Slide 52 of 60
Chip Implementation
Process: 28nm CMOS
Area: 3.56 mm2 (1.60mm x 2.22mm)
SRAM: 296KB
Supply Vdd: 0.6-0.9V
Bit Precision: INT8, 16, 32, 64
Max IR Frequency: 200MHz
Max RT Frequency: 148MHz
IR Power: 40mW at 0.9V
RT Power: 55mW at 0.9V
IR Efficiency (FPS/W): 500
RT Efficiency (FPS/W): 1418
[Die photo labels: 8 x 6 PE Array, OMEM, IMEM/PAMEM, WMEM, VCO, SCAN IO, TOP Ctrl]

Slide 53 of 60
Measurement Result
- Power and frequency with voltage scaling
  - IR mode power is 40mW, RT mode power is 55mW at 0.9V
  - IR mode frequency is 200MHz, RT mode frequency is 142MHz at 0.9V
- Efficiency over prior ASICs
  - Render efficiency is 790FPS/W for IR-RT and 1418FPS/W for RT

only.
  - 3.95-28.8X power efficiency over prior ASICs.
[Figure: Frequency vs. Voltage and Power vs. Voltage for IR mode and RT mode, 0.5V to 0.9V; at 0.9V the curves reach 200MHz and 142MHz, and 40mW and 55mW.]
[Figure: Efficiency over Prior ASICs (FPS/W, axis 0 to 800): This Work vs. ISSCC23 [2] (3.95X) and Reconf. SIMT [3] (28.8X). *Different rendering algorithm.]

Slide 54 of 60
Rendering Case Demonstration
- Average inverse rendering quantization loss: 4.15%
- Background cluster: light ray transportation effect
- Impact is nearly negligible in most rendering cases
[Figure: Inverse Render INF average quantization accuracy loss (relative accuracy, 0 to 100%): 6.4%, 8.9%, 0.4%, 0.9% for lighting map, albedo map, normal map, depth map.]
[Figure: inverse rendering input, rendering result and background cluster rendering result; baseline vs. clustered, labelled SSIM=0 and SSIM=0.]

Slide 55 of 60
Rendering Case Demonstration
- Virtual object insertion: IR and RT mode
- Real-time AR rendering: 24+FPS
- Render performance: average 26FPS
[Figure: Virtual Object Insertion Cases: 38FPS, 24FPS, 29FPS, 13FPS; Utah Teapot, Tea Cup, Stanford Bunny, Spheres]
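The headline numbers on the measurement and demonstration slides can be cross-checked with simple arithmetic. The sketch below assumes that efficiency is frame rate divided by power at 0.9V and that the four per-case frame rates are equally weighted; the variable names are illustrative.

```python
# Per-case frame rates from the virtual object insertion slide (IR + RT mode).
case_fps = [38, 24, 29, 13]
average_fps = sum(case_fps) / len(case_fps)

# Cases that meet the slide's real-time threshold of 24 FPS.
REALTIME_FPS = 24
realtime_cases = [f for f in case_fps if f >= REALTIME_FPS]

# RT-only mode: 1418 FPS/W at 55 mW implies this frame rate.
rt_efficiency_fps_per_w = 1418
rt_power_w = 0.055
implied_rt_fps = rt_efficiency_fps_per_w * rt_power_w

if __name__ == "__main__":
    print(f"average FPS over the four cases: {average_fps:.1f}")
    print(f"cases at or above {REALTIME_FPS} FPS: {realtime_cases}")
    print(f"implied RT-only frame rate: {implied_rt_fps:.1f} FPS")
```

The mean of the four cases reproduces the quoted 26FPS average, and the RT-only efficiency at 55mW corresponds to roughly 78 frames per second.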

Slide 56 of 60
Rendering Case Demonstration
- 3D rendering for AR insertion: RT mode only, no IR needed
- Render performance: 78FPS+
[Figure: Physically-based AR Object Insertion (RT, 78FPS). User-defined objects in a pre-defined background scene; spheres with different materials inserted into the original scene, showing refraction, shadow and reflection.]

Slide 57 of 60
Comparison with Prior Works
(columns: Reconf. SIMT [3] | ISSCC23 [2] | This Work | NVIDIA GTX1080Ti)
Process (nm): 90 | 28 | 28 | 16
Area (mm2): 16 | 20.25 | 3.56 | 471
Architecture: SIMT | Systolic Array | Systolic Array | SIMT
Solutions: BVH-Acceleration + Ray-Tracing Process | 3D model DNN Rendering | Inverse Rendering + Ray-Tracing Process | Nvidia Optix
Supply Vdd: 1.2V | 0.6-0.95V | 0.9V | 1.0V
Clock Frequency: 100-400 MHz | 50-100 MHz | 142-200 MHz | 1480MHz
SRAM: 19.3KB | 2MB | 296KB | 2MB
Bit Precision: FP32 | FP8-FP16 | INT8-INT64 | FP32, FP64
Power at Frequency/Supply: 221mW at 200MHz, 1.2V | 310mW at 100MHz, 0.7V | 40mW at 200MHz, 0.9V (IR); 55mW at 142MHz, 0.9V (RT) | 250W at 1480MHz, 1V
Throughput (FPS): 6* | 32-118 | 13-38 | 0.75
Peak Throughput Efficiency (FPS/W): 27.38* | 199 (0.7V) | 1418 (RT only); 790 (IR and RT) | 0.003
- Supports both Ray Tracing and Inverse Rendering
- An efficient Ray Tracing rendering processor for mobile AR
- 3.95-28.8X power efficiency over prior ASICs

Slide 58 of 60
Outline (unchanged from slide 30)

Slide 59 of 60
Summary
- There is a strong demand on physical-based photorealistic rendering on mobile devices
- Conventional ray-tracing solutions, e.g. GPUs, are highly expensive for mobile devices
- Proposed a hardware-efficient RT solution for edge devices:
  - Reconfigurable PE architecture for RT and IR
  - FOV-focused 3D construction reducing scene complexity from O(n log n) to O(1)
  - Scalable partitioning scheme with 13x average speed-up
  - GRTS and GMAC for 42.8x and 16x speed-up over baseline
  - Measured peak RT and RT-IR performance of 1418FPS/W and 790FPS/W at 0.9V
  - 3.95-28.8X energy efficiency over prior ASICs

2.5: A 28nm Physical-Based Ray-Tracing Rendering Processor for

Photorealistic Augmented Reality with Inverse Rendering and Background Clustering for Mobile Devices
2024 IEEE International Solid-State Circuits Conference

Slide 60 of 60
Thank you!

2.6: A 131mW 6.4Gbps 256x32 Multi-User MIMO OTFS Detector for Next-Gen Communication Systems
2024 IEEE International Solid-State Circuits Conference

Slide 1 of 37
A 131mW 6.4Gbps 256x32 Multi-User MIMO OTFS Detector for Next-Gen Communication Systems
Tang Lee, Ting-Yang Chen, I-Hsuan Liu, Chia-Hsiang Yang
National Taiwan University

Slide 2 of 37
Outline
- Introduction
- Preliminaries
- Algorithm-Architecture Co-Optimization
- System Architecture
- Chip Implementation
- Summary

Slide 3 of 37
Introduction
- Future wireless communication system needs to support devices with high

mobility [1]
  - Severe Doppler spread associated with fast time-varying channel
- However, current mainstream orthogonal frequency division multiplexing (OFDM) technique is vulnerable to Doppler spread
[1] Z. Wei, IEEE Trans. on Communications, 2021

Slide 4 of 37
Orthogonal Frequency Division Multiplexing (OFDM)
- OFDM multiplexes symbols over time-frequency (TF) domain
- The orthogonality between subcarriers degrades due to inter-carrier interference (ICI) under severe Doppler spread
- Performance of OFDM degrades for high-mobility scenarios

Slide 5 of 37
Orthogonal Time Frequency Space (OTFS) Modulation
- OTFS multiplexes symbols over delay-Doppler (DD) domain
- Equivalent channel for symbols reflects physical geometry of environment
- The delay and Doppler spread of each propagation path are well-separated
- OTFS technique is a promising modulation scheme for high-mobility scenarios
Source: M. Ramachandran, Journal of Indian Institute of Science, 2020

Slide 6 of 37
Comparison between OFDM and OTFS Modulations
- The bit error rate (BER) performance of OTFS modulation is more robust than OFDM in high-mobility scenarios
System setting: 256x32 multi-user multi-input multi-output (MU-MIMO), 256 quadrature amplitude modulation (QAM), w/o calibration

Slide 7 of 37
State-of-the-Art Massive MU-MIMO Detector
- Existing MU-MIMO detectors only support OFDM, but OTFS has 32x higher run-time complexity according to experiment on CPU
[3] Y.-T. Chen, VLSI, 2017
[6] C.-C. Wen, VLSI, 2020
[4] W. Tang, ISSCC, 2018

Slide 8 of 37
State-of-the-Art Massive MU-MIMO Detector
Bullet and references as on slide 7, plus:
- This work presents the first massive MU-MIMO detector for

OTFS systems

Slide 9 of 37
Outline (unchanged from slide 2)

Slide 10 of 37
OFDM: Time-Frequency Signal Processing
- OFDM system is implemented by inverse fast Fourier transform (IFFT) at Tx and fast Fourier transform (FFT) at Rx

Slide 11 of 37
OTFS: Delay-Doppler Signal Processing
- Inverse symplectic finite Fourier transform (ISFFT) and Heisenberg transform at Tx while SFFT and Wigner transform at Rx

Slide 12 of 37
Massive MU-MIMO OTFS System
- Integration of OTFS and massive MU-MIMO for future high-speed, high-mobility communication systems

Slide 13 of 37
Message-Passing (MP) Detection Algorithm [2]
- Interference term is approximated as a Gaussian random variable
- The mean and variance are iteratively updated and exchanged over a factor

graph with symbol estimates
[2] H. Zhang, IEEE Wireless Communications Letters, 2021

Slide 14 of 37
Outline (unchanged from slide 2)

Slide 15 of 37
Variance Estimation Formulation
- Adaptive variance estimation is adopted to approximate the interference variance [3]
- Structure of Gram matrix in OTFS system is fully utilized
[3] Y.-T. Chen, VLSI, 2017
System setting: U=8, M=8, N=4

Slide 16 of 37
Low-Complexity Variance Estimation (1/2)
- The structure of Gram matrix is leveraged to reduce complexity
  - Hermitian symmetry, variance sharing, and diagonal maximum search
System setting: U=8, M=8, N=4

Slide 17 of 37
Low-Complexity Variance Estimation (2/2)
- The overall MP complexity is reduced by 75% and the variance can still be estimated accurately
System setting: 256x32 MU-MIMO, M=16, N=4

Slide 18 of 37
Low-Complexity Mean Computation (1/2)
- Block-wise computation flow
  - Block sparsity in each submatrix is utilized to reduce complexity in estimating interference means by skipping associated computations in [2]
System setting: M=8, N=4
[2] H. Zhang, IEEE Wireless Communications Letters, 2021

Slide 19 of 37
Low-Complexity Mean Computation (2/2)
- Delay-time (DT)-domain computation flow
  - Block-wise circular convolution in DD domain is implemented efficiently by elementwise multiplications in DT domain by Fourier transform
- Overall MP complexity is reduced by 70%
System setting: M=8, N=4

Slide 20 of 37
Residual Noise (RN) Update Formulation
- Layered scheduling is usually applied to improve the convergence speed through layer-by-layer detection
  - Updated symbols are forwarded to the next layer
  - Additional memory storage for storing partial interference of each layer
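The delay-time-domain mean computation described earlier rests on the convolution theorem: a circular convolution in one domain equals an elementwise product after a Fourier transform. The sketch below checks this on a short 1-D sequence using a plain O(n^2) DFT from the standard library. It is a generic illustration only; the detector's actual block sizes, the axis along which it transforms, and its fixed-point arithmetic are not given here, and all function names are assumptions.

```python
import cmath


def dft(x, inverse=False):
    """Naive discrete Fourier transform (or its inverse) of a sequence."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [
        sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]
    return [v / n for v in out] if inverse else out


def circular_conv_direct(x, h):
    """y[i] = sum_j x[j] * h[(i - j) mod n], computed term by term."""
    n = len(x)
    return [sum(x[j] * h[(i - j) % n] for j in range(n)) for i in range(n)]


def circular_conv_via_dft(x, h):
    """Same result through transform, elementwise multiply, inverse transform."""
    xf, hf = dft(x), dft(h)
    return dft([a * b for a, b in zip(xf, hf)], inverse=True)


if __name__ == "__main__":
    x = [1, 2, 3, 4]
    h = [1, 0, 0, 1]
    print(circular_conv_direct(x, h))
    print([round(v.real, 6) for v in circular_conv_via_dft(x, h)])
```

The direct form costs n multiplications per output sample, while the transformed form needs only one multiplication per sample once both sequences are in the transform domain, which is the source of the complexity reduction when a fast transform is used.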

Slide 21 of 37
Memory-Efficient Residual Noise (RN) Update
- RN is updated recursively by the difference between symbols in the current and last iterations
- Correlated symbols can be reused for four consecutive layers
- 2x better convergence speed with 94% memory size reduction

Slide 22 of 37
Outline (unchanged from slide 2)

Slide 23 of 37
System Architecture
LLR: Log-likelihood ratio

Slide 24 of 37
Processing Core Architecture
- Interference cancellation unit (ICU)
  - Interference cancellation
  - Symbol estimation
- Mean computation unit (MCU)
  - Mean computation
- Variance computation unit (VCU)
  - Variance computation
- Core memory bank
  - Storage for residual noise
  - Storage for reused symbols

Slide 25 of 37
Mean Computation Unit: Proposed Design (1/2)
- Mean values are computed along diagonal band of

submatrix in a block-by-block manner
- To eliminate data dependency
  - Mean values for lower-left blocks are derived from symbols updated in the preceding layer
  - Mean values for upper-right blocks are pre-computed as partial sums

Slide 26 of 37
Mean Computation Unit: Proposed Design (2/2)
- Sum unit and accumulation unit are designed for performing temporal and sequential operations, respectively

Slide 27 of 37
Mean Computation Unit: Latency Reduction
- Compared with row-wise baseline design, the proposed architecture achieves full hardware utilization
System setting: M=16, N=4

Slide 28 of 37
Mean Computation Unit: Dual-Mode Multiplier
- Maximal input symbol difference of MCU decreases as MP detection progresses, leading to an underutilized MCU
- A dual-mode multiplier in CMAC computation unit is proposed to reduce latency by 24% in high-SNR regime (with BER

8Gbps/users; IEEE 802.3 standard.
[Figure: design targets High Throughput, Low Latency and Configurability; latency labels 20-30ms, less than 2ms and 10ms across 4G, 5G and B5G/6G; Laser-mm-wave Aggregation; Cooperative Networks]
Low Latency: increasingly stringent latency requirements from 4G to 6G.

This channel estimator involves three main modules:
(1) Input Buffer B: stores the descrambled data (per ant.);
(2) GAMP: implemented channel estimation algorithm;
(3) Output Buffer: buffers the channel matrix data (per subcarrier).
Pipelined architecture; rearrange the data dimension.
[Figure: XGE ETH Frame Parser, Preprocessor and Channel Estimator (Input Buffer B, GAMP, Output Buffer) with pipeline processing; antenna accumulation]
These high-throughput designs contribute to 9.6Gbps system throughput.

2.7: BayesBB: A 9.6Gbps 1.61ms Configurable All-Message-Passing Baseband-Accelerator for B5G/6G Cell-Free Massive-MIMO in 40nm CMOS
2024 IEEE International Solid-State Circuits Conference

Slide 13 of 23
Low-Latency Design: Fully Unfolded BP Detector
Figure labels: 8x8 MIMO Detector; Data Conversion; Soft Demodulation; Interference Measurement; Poster. Mess. Calculation; Ping-Pong Buffer; Interference Measurement; Poster. Mess. Calculation; Poster. Mess. Accumulation; Ping-Pong Buffer; Ping-Pong Buffer

323、Ping-PongBufferPrior Prob.CalculationBP MIMO(1st iteration)Interference MeasurementPoster.Mess.CalculationPing-PongBufferInterference MeasurementPoster.Mess.CalculationPoster.Mess.AccumulationPing-PongBufferPing-PongBufferPing-PongBufferPrior Prob.CalculationBP MIMO(2nd iteration)Interference Measur

324、ementPoster.Mess.CalculationPing-PongBufferInterference MeasurementPoster.Mess.CalculationPoster.Mess.AccumulationPing-PongBufferPing-PongBufferPing-PongBufferPrior Prob.CalculationBP MIMO(5th iteration)Poster.Mess.AccumulationSoft DemodulationPoster.Mess.CalculationCALCALInterference Meas.AVGAVGAVG

325、.PE CorePE for polar decodinga)iteration number fordecoders;b)CRC check mode;c)Other configurableparameters:2.7:BayesBB:A 9.6Gbps 1.61ms Configurable All-Message-Passing Baseband-Accelerator for B5G/6G Cell-Free Massive-MIMO in 40nm CMOS 2024 IEEE International Solid-State Circuits Conference15 of 2

326、3OutlineIntroduction and ChallengesBayesBB Architecture DesignAlgorithm and Overall ArchitectureHigh-Throughput ArchitectureLow-Latency MIMO DetectorConfigurable Channel Decoder Test ScenarioBayesBB ResultsConclusion2.7:BayesBB:A 9.6Gbps 1.61ms Configurable All-Message-Passing Baseband-Accelerator f

327、or B5G/6G Cell-Free Massive-MIMO in 40nm CMOS 2024 IEEE International Solid-State Circuits Conference16 of 23Room 1Room 2Room 3Room 4UEUEUEUEUEUEUEUEUEUEUEUERAURAURAURAURAURAURAURAURAURAURAURAURoom 5RAURAURAURAUUEUEUEUEServerRAU1RAU2RAU3RAU4UE1UE2UE3UE4Room1BayesBBAntenna ArrayRAUBayesBBData SentDat

328、a ReceivedAntenna ArrayThe OFDM systems subcarrier center frequency is 3.65GHz.The 128128 Cell-Free MIMO testbedBayesBB test scenarioSystems centered frequency Test Scenario:128128 Cell-Free MIMO Testbed1234BayesBB is demonstrated in a 128128 scalable Cell-Free MIMO testbed(each BayesBB supports 88

329、MIMO scenarios):Diagram of Cell-Free MIMO architecture;1Real photo of the Cell-Free MIMO in Room1;Specifical test scenario when applied for UE3s baseband signal processing;Observed OFDM subcarriers,and center freq.is 3.65GHz.2342.7:BayesBB:A 9.6Gbps 1.61ms Configurable All-Message-Passing Baseband-A

330、ccelerator for B5G/6G Cell-Free Massive-MIMO in 40nm CMOS 2024 IEEE International Solid-State Circuits Conference17 of 23Using 40nm Technology High-speed InterfaceLow LatencyConfigurabilityRegister and RAM initialization Successes!Two Decoding Modes Available:LDPC and Polar Codes.Register Reading an

331、d Writing Success!Graphical user interface(GUI)for BayesBBConfiguration interface for BayesBBTest Scenario:GUI Design for BayesBBThis GUI enables to configure the BayesBB and shows itsreal-time processing results.Three features:1)high throughput(9.6Gbps),2)low latency(1.61ms),and 3)configurability(5

332、G LDPC&polar codes).Configure BayesBBsregisters and RAMs with SPI serial bus!2.7:BayesBB:A 9.6Gbps 1.61ms Configurable All-Message-Passing Baseband-Accelerator for B5G/6G Cell-Free Massive-MIMO in 40nm CMOS 2024 IEEE International Solid-State Circuits Conference18 of 23OutlineIntroduction and Challe

333、ngesBayesBB Architecture DesignAlgorithm and Overall ArchitectureHigh-Throughput ArchitectureLow-Latency MIMO DetectorConfigurable Channel DecoderTest Scenario BayesBB ResultsConclusion2.7:BayesBB:A 9.6Gbps 1.61ms Configurable All-Message-Passing Baseband-Accelerator for B5G/6G Cell-Free Massive-MIMO in 40nm CMOS 2024 IEEE International Solid-State Circuits Conference19 of 23Frequency domain resul

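The recursive residual-noise update described on the "Memory-Efficient Residual Noise (RN) Update" slide of paper 2.6 can be illustrated with a minimal sketch. It assumes the common linear model in which the residual is r = y - H·x̂, so that changing one symbol estimate only requires subtracting that symbol's channel column scaled by the difference between its current and previous values. The function names, the small integer-valued H and y, and the update order are all hypothetical and chosen for illustration; the slides do not give the detector's actual formulation, the four-layer symbol reuse, or the memory layout, and a real OTFS channel would be complex-valued.

```python
def full_residual(y, H, x_hat):
    """Reference: recompute r = y - H @ x_hat from scratch."""
    return [y_i - sum(h * x for h, x in zip(row, x_hat))
            for y_i, row in zip(y, H)]


def recursive_residual_update(r, H, k, x_old, x_new):
    """Update r in place after symbol k changes from x_old to x_new.

    Only the difference (x_new - x_old) is needed, so the full
    product H @ x_hat is never recomputed.
    """
    delta = x_new - x_old
    for i, row in enumerate(H):
        r[i] -= row[k] * delta
    return r


# Toy 4x3 system with integer entries so the comparison is exact.
H = [[2, 1, 0],
     [0, 3, 1],
     [1, 0, 2],
     [1, 1, 1]]
y = [5, 7, 4, 6]

x_hat = [0, 0, 0]                 # initial symbol estimates
r = full_residual(y, H, x_hat)    # equals y at the start

# Hypothetical sequence of (symbol index, new estimate) updates.
updates = [(0, 1), (1, 2), (2, 1), (0, 2)]
for k, x_new in updates:
    recursive_residual_update(r, H, k, x_hat[k], x_new)
    x_hat[k] = x_new

# The recursively maintained residual matches a full recomputation.
assert r == full_residual(y, H, x_hat)
```

Each update in this sketch touches one column of H, whereas a full recomputation touches the whole matrix, which is the general motivation for maintaining the residual recursively.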