Analog In-Memory Computing with Multilevel Memristive Devices for High-Performance Computing
Glenn Ge, Co-founder & CEO, TetraMem Inc.

The AI and AI Chip Market
- AI applications will add $30 trillion to global equity market capitalization over the next two decades (Source: ARK Big Ideas 2022), and the AI chip market will reach $150 billion by 2030 at a 30% CAGR (Source: Global X 2023).

OpenAI's Sora Ignites Increased Computing Demand
- One minute of OpenAI's Sora video may take over an hour to generate.
- Without new, more efficient computing, we would need 18,460,000 A100 GPUs to generate the volume of video watched daily on TikTok. (Source: video and pictures from OpenAI; analysis from Minsheng Securities, 2024)

Customer Pain Points for Energy Efficiency
- AMD plenary talk by Dr. Lisa Su, ISSCC 2023.

In-Memory Computing (IMC) Solution for AI Computing
- Traditional von Neumann architecture: ALUs with cache memory (L1/L2/L3) and DRAM; performance is dominated by data movement between them. The IMC architecture replaces this with an array of IMC units.
- Data is processed in the same physical location where it is stored, with minimal intermediate data movement and storage = low power consumption
- Massively parallel computation via the crossbar-array architecture with device-level-grained cores = high throughput
- Computing by physical laws (Ohm's law and Kirchhoff's current law) = low latency

Superior Architecture, but the Right Device Is the Key
(Tang et al., 2019 Symposium on VLSI Circuits. Note: nvCIM = nonvolatile compute-in-memory; PE = processing element)
- SRAM: very fast with little energy (pJ level) for data movement, but very limited capacity (a few KB to a few hundred MB on chip).
- DRAM: large capacity (GB), but high energy (1000 pJ level) and slow (ns-level) access.

Computing Memory: A Memory Device with Special Attributes for Computing
J. Joshua Yang et al., "Memristive devices for computing," Nature Nanotechnology 8, 13-24 (2013) (3,200 citations).
- Memory devices split into volatile (SRAM, DRAM) and non-volatile; non-volatile devices are either traditional (NAND, NOR, EEPROM) or emerging (MRAM, FRAM, PCRAM, ReRAM). Among ReRAMs, the computing memristor stands apart from other ReRAMs.
- The IMC architecture is great, but a suitable memory device is the key to its success. However, most memory devices were invented for memory applications, and a memory device needs decades of development.

Current Memory Devices' Main Limitations for Computing Applications
- SRAM: volatile binary device; an accompanying storage memory is needed; minimum 6T/cell, expensive; speed vs. power trade-off.
- NOR flash: expensive flash memory process; scalability issues below 28 nm; high operating voltage; long program and erase times; retention issues for multi-level programming (MLP).
- STT/SOT MRAM: complicated magnetic tunnel junction (MTJ) stack structure; yield and cost challenges; multi-level challenge; low off/on resistance ratio.
- PCRAM: wide distribution of the SET states; conductance drift; temperature-induced conductance variations; scaling and disturbance issues.
- Other ReRAMs: limited multi-level operation (binary to 6 bits/cell); retention issues for MLP; endurance issues; I-V linearity issues (VLSI, 2019).
- Each memory device can find its own application space with its unique set of performance attributes. However, none of the current memory devices meets the ideal requirements for IMC. Therefore, at TetraMem, we built the solution from the material and device level up.

Results from the TetraMem Device, published in "Thousands of conductance levels in memristors monolithically integrated on CMOS", Nature, Mar 2023 (https://rdcu.be/c8GWo)

Computing Memristor Optimized for Computing Applications
- Excellent I-V linearity.
- 11-bit/cell multi-level programming.
- Superior retention and stability at elevated temperature (125 °C).
- Excellent device uniformity.
- Superior memory-grade endurance: 100 million (10^8) cycles.
- Neural-network weight (memory conductance) mapping in a 256x256 array.
- One-shot programming, measured across wafers.

The Secret of the World-Record Accuracy, on Top of Material & Device Engineering
- Just as the functioning of synapses and neurons is driven by the intricate movement of ions, so too is the main operation of memristors.
- By controlling the conductive filament's size, ion concentration, and height, different multi-levels of cell resistance can be achieved precisely.
- Zoomed-in image of the 2,048 resistance levels: stable 11-bit multi-levels demonstrated with high I-V linearity.
- Memristive switch stack: top electrode metal (TE), dielectric layer (e.g., a metal oxide), bottom electrode metal (BE). Forming takes the virgin state to a metallic conductive filament; Set and Reset (+V and -V across TE/BE) switch between the low-resistance state (LRS) and high-resistance state (HRS).
- Programming trade-offs (high speed vs. high accuracy): one-shot coarse tuning, then feedback-loop fine tuning, then denoising, moving from 5-bit to 8-bit to 11-bit resolution. A nano-scale unstable phase is responsible for noise, which the denoising process removes.
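The coarse-tune / fine-tune flow above can be pictured as a closed-loop program-and-verify routine. The sketch below is an illustrative toy model only: the device update gain, noise level, conductance range, target, and tolerance are assumptions, not TetraMem device parameters, and the denoising step is not modeled.

```python
import random

class MemristorModel:
    """Toy memristor: each pulse nudges the conductance, with analog noise.

    The update behavior and noise magnitude are illustrative assumptions,
    not measured TetraMem device characteristics.
    """
    def __init__(self, g_min=1e-6, g_max=1e-4):
        self.g_min, self.g_max = g_min, g_max
        self.g = g_min  # start from the high-resistance state

    def pulse(self, delta):
        noise = random.gauss(0.0, 0.02 * abs(delta) + 1e-9)
        self.g = min(self.g_max, max(self.g_min, self.g + delta + noise))

    def read(self):
        return self.g

def program(dev, target, tol, max_iters=200):
    """One-shot coarse tuning, then a feedback (program-and-verify) loop."""
    dev.pulse(target - dev.read())      # 1-shot coarse tuning
    for i in range(max_iters):          # feedback-loop fine tuning
        err = target - dev.read()
        if abs(err) <= tol:
            return i                    # converged to the target level
        dev.pulse(0.5 * err)            # damped correction pulse
    return max_iters

random.seed(0)
dev = MemristorModel()
iters = program(dev, target=5e-5, tol=5e-8)
print(iters, dev.read())
```

Because the write noise here shrinks with the pulse size, the loop converges geometrically toward the target conductance, which is the intuition behind reaching fine-grained levels through verify-and-correct cycles.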
Scalable Technology for Scaling and Back-End 3D Stacking
- Go 3D. Stacking: eight layers of memristor crossbars, a 3D AI processor (Nature Electronics, Apr 2020, cover of the issue). Eight layers of computing-memristor crossbars with 300 nm linewidth form a 3D convolutional neural network (CNN) video processor.
- Go Small. Scaling: a 2 x 2 nm^2 memristor in a crossbar array, the smallest working devices in electronic circuits (Nature Nanotechnology 14, 35-39 (2019), highlighted in the editorial "Testing memory downsizing limits"). A "NANO" pattern written into the array; 6-nm half-pitch; 4.5 Tbit/in^2; 100 Ω/µm wire resistance.
- The device has a very scalable roadmap to support future sub-10 nm and 3D integration development.

Results from the TetraMem Chip, published in "Programming memristor arrays with arbitrarily high precision for analog computing", Science, Feb 2024 (https://www.science.org/doi/10.1126/science.adi9405); TetraMem's MX100 test SoC.

High-Precision Computing by True Analog Computing
- In the traditional bit-slicing approach, the accuracy of the VMM result relies heavily on the programming accuracy of each cell in the array. This approach forces analog devices to behave like digital devices, negating the advantages of analog devices and analog computing.
- In the true-analog approach, we use the weighted sum of multiple devices to represent one number, and use each subsequently programmed device to compensate for the conductance error of the previously programmed devices. Adding more devices eventually makes the final value arbitrarily precise.

Analog Computing Solves Numerical Problems with Arbitrarily High Precision
- For the first time, a TetraMem chip demonstrated the same precision as software in solving complex numerical problems with analog computing.
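A minimal numerical sketch of the weighted-sum idea described above: each device is written with some analog error, and each newly programmed device is targeted at the rescaled residual of the devices before it. The device error model, radix, and values below are illustrative assumptions, not parameters from the Science paper.

```python
import random

def program_device(target, rel_err=0.05, rng=random.Random(1)):
    """Toy analog write: the stored value lands within ~rel_err of target.
    (The shared seeded RNG default keeps the sketch reproducible.)"""
    return target * (1.0 + rng.gauss(0.0, rel_err))

def represent(x, n_devices, radix=16.0):
    """Represent one number as a weighted sum of devices; each newly
    programmed device compensates the residual error of the previous ones.
    In hardware, device dynamic range would bound the usable radix."""
    stored, total = [], 0.0
    for k in range(n_devices):
        weight = radix ** (-k)
        residual = (x - total) / weight        # remaining error, rescaled
        stored.append((weight, program_device(residual)))
        total = sum(w * v for w, v in stored)  # weighted-sum read-out
    return total

x = 0.7310585
err1 = abs(represent(x, 1) - x)   # one device: limited by write error
err4 = abs(represent(x, 4) - x)   # four devices: error shrinks per device
print(err1, err4)
```

Each added device multiplies the remaining error by roughly its own relative write error, so precision improves geometrically with the number of devices, matching the "arbitrarily precise" claim in spirit.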
Scalability and Energy Efficiency of the Analog Computing Approach
- The number of computing iterations can be reduced by increasing the mesh size (using larger arrays).
- The proposed solution is more than one order of magnitude more energy-efficient than a digital solution.
- It solves complex time-evolving problems that are highly error-sensitive (due to error accumulation and propagation).
- Magnetohydrodynamics (MHD) problem example: software results vs. memristor-array results.

In-Memory Computing with an Analog Non-volatile Memory Crossbar: One-Step VMM (MAC) Calculation
- Computing with a 1T1R crossbar array: a programmable 8-bit NVM variable resistor, configured as N x N crossbars.
- Multiply-accumulate (MAC) operation: voltage is the input, current is the output; 1/R is the weight, and the transistor provides selection and current control.
- In-memory computing (IMC) by Ohm's law and Kirchhoff's current law: I_k = Σ_{i=1..N} G_ik · V_i, which realizes Y_k = Σ_{i=1..N} W_ik · X_i.
- Efficient processing (orders-of-magnitude efficiency and speed improvement): no need to load weights, so no memory-wall trade-off; the same input is applied to multiple filters in parallel; massive operations in parallel; minimized intermediate loads and stores; a CNN is many MAC operations; arbitrary precision through architecture-level innovation.
- In-memory computing! Parallel computing! Analog computing!
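The one-step MAC above, I_k = Σ_i G_ik · V_i, can be sketched as a simple numerical model of an N x N crossbar. The conductance range, read voltages, and array size below are illustrative assumptions.

```python
import numpy as np

def crossbar_vmm(G, V):
    """One-step vector-matrix multiply on a crossbar model.

    G : (N, N) cell conductances in siemens, the stored weights (1/R).
    V : (N,) input voltages applied to the rows.
    Returns the column currents I_k = sum_i G[i, k] * V[i]
    (Ohm's law per cell, Kirchhoff's current law summing each column).
    """
    return V @ G

rng = np.random.default_rng(0)
N = 256
G = rng.uniform(1e-6, 1e-4, size=(N, N))  # illustrative conductance range
V = rng.uniform(0.0, 0.2, size=N)         # illustrative read voltages
I = crossbar_vmm(G, V)                    # all N MACs in "one step"
print(I.shape)
```

In hardware all N columns settle simultaneously, so the whole vector-matrix product is produced in a single read operation rather than N x N sequential multiplies.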
TetraMem Software Solution for AI Applications
- Flow: quantization, optimization, localization, deploying ONE pre-trained model to MILLIONS of chips (Chip #1, Chip #2, ... Chip #1,000,001), each with an analog IMC, an AI engine, and other blocks.
- Unique advantages: model deployment and on-chip calibration.
- Near-zero boot-on time: non-volatile 8-bit weight storage.
- Ultra-low latency: ultra-fast direct uint8-to-uint8 NN layer operation.
- Ultra-low power: sequential but fast layer execution; minimized memory and peripheral-circuit overhead; overseen with a system model to overcome the analog loss.

TetraMem's MX100 SoC Chip Architecture
- RISC-V with multiple digital functions; can perform neural-network model inference independently.
- 620K neural parameters in 10 crossbar arrays, with each cell supporting up to UINT8 weights.

MNIST Dataset Classification
- Optical recognition of handwritten digits; the dataset used to train this NN is MNIST.
- 3-layer CNN model implemented on 1 NPU.
- 98 ± 1% (98.6%) accuracy on 10,000 test images.
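The quantization step in the software flow above, mapping trained float weights to 8-bit codes for the UINT8 cells, might look like the following sketch. This uses standard affine (scale and zero-point) quantization as an assumed scheme; it is not taken from TetraMem's toolchain.

```python
import numpy as np

def quantize_uint8(w):
    """Affine-quantize float weights to uint8 codes plus (scale, zero_point)."""
    w_min, w_max = float(w.min()), float(w.max())
    scale = (w_max - w_min) / 255.0 or 1.0       # guard against a flat tensor
    zero_point = int(round(-w_min / scale))      # code that represents 0.0
    q = np.clip(np.round(w / scale) + zero_point, 0, 255).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Recover approximate float weights from the uint8 codes."""
    return (q.astype(np.float32) - zero_point) * scale

rng = np.random.default_rng(1)
w = rng.normal(0.0, 0.5, size=(64, 10)).astype(np.float32)  # toy layer weights
q, s, z = quantize_uint8(w)
w_hat = dequantize(q, s, z)
max_err = float(np.abs(w - w_hat).max())
print(q.dtype, max_err)
```

On chip, the uint8 codes would be programmed as cell conductances, and the per-tensor scale and zero point folded into the surrounding digital arithmetic.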
PCN (Pupil Center Net) Gaze Tracking
- Processes the output of an AR/VR headset camera to predict the gaze direction for each eye.
- 7-layer CNN model implemented on 8 NPUs.
- 0.0006-0.0020 (0.0008) MSE (mean squared error) for pupil-center tracking.
- 78 FPS (frames per second) at 400 MHz sclk.

VWW (Visual Wake Words) Person Detection
- Benchmark that determines whether there is a person in the frame.
- 5-layer CNN model implemented on 9 NPUs.
- 80% ± 1% (81%) accuracy on the reference eval set.
- 56 FPS (frames per second) at 400 MHz sclk.

These are real neural-network demos based on the MX100.

Scalable System Architecture for Various Applications
- Single MAC engine core architecture: a 1T1R crossbar array (256x256 to 1K x 1K), A/B switches, DACs, ADCs, A/B DAC registers, control, linear transform, configuration/parameter registers, activation, pooling, data input and output buffers, a special crossbar for depthwise convolution only, and a bus.
- Other functional blocks: IO, PMU, etc.; MAC engines; security features; MAC buffer; DMA; CPUs; bus.
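One way to picture the MAC-engine datapath listed above (input buffer, DAC, 1T1R crossbar, ADC, then the digital post-processing stages) is as a chain of stage functions. The DAC/ADC models, bit widths, voltages, and array size below are assumptions for illustration, not MX100 specifications.

```python
import numpy as np

def dac(codes, vref=0.2, bits=8):
    """Input buffer -> DAC: digital input codes become row voltages (ideal model)."""
    return codes.astype(np.float64) / (2**bits - 1) * vref

def adc(currents, i_full_scale, bits=8):
    """ADC: column currents are converted back to digital codes (ideal model)."""
    lsb = i_full_scale / (2**bits - 1)
    return np.clip(np.round(currents / lsb), 0, 2**bits - 1).astype(np.uint32)

def mac_engine(x_codes, G, i_full_scale):
    """DAC -> analog MAC in the 1T1R crossbar -> ADC.
    Linear transform, activation, pooling, and the output buffer
    would follow in the full core."""
    v = dac(x_codes)          # convert digital activations to voltages
    i = v @ G                 # Ohm's law + Kirchhoff's law, one step
    return adc(i, i_full_scale)

rng = np.random.default_rng(2)
G = rng.uniform(1e-6, 1e-4, size=(256, 256))  # 256x256 array (illustrative)
x = rng.integers(0, 256, size=256)            # uint8 input codes
i_fs = 0.2 * 1e-4 * 256                       # worst-case column current
y = mac_engine(x, G, i_fs)
print(y.shape)
```

Sizing the ADC full scale to the worst-case column current, as done here, is one simple design choice; real cores trade ADC range and resolution against array size and signal swing.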
- The highly scalable system-architecture design supports everything from the tiny edge (e.g., smart sensors) to servers and data centers (e.g., HPC accelerators).

Pathway to HPC and Data-Center Solutions
- Large SoC based on the high-performance TetraMem NPU macro plus the TetraMem Instinct(TM) software (convert to ONNX; efficient ML and runtime compiler), enabling HPC and AIGC.
- 12 nm / 5 nm NPU SoC; 7G-200G storage at 8-bit; toward 300 TOPS/W efficiency.
- Efficient runtime for large models; low-latency human-machine interaction.
- Multi-model fusion support for CNN, RNN, LSTM, Transformer, etc. (GPT, GAN, NLG, BERT, Transformer).
- Enables large-model applications: text generation and Q&A; picture generation and text-to-image; audio/speech generation; video generation.

Accelerate the World