© 2022 Groq, Inc. | Public | HotChips34-2022

The Groq Software-defined Scale-out Tensor Streaming Multiprocessor
From chips-to-systems architectural overview

Dennis Abts, Chief Architect & Groq Fellow

Dennis Abts, John Kim, Garrin Kimmell, Matthew Boyd, Kris Kang, Sahil Parmar, Andrew Ling, Andrew Bitar, Ibrahim Ahmed, Jonathan Ross

Outline
01 Tensor Streaming Processor (TSP) Background
02 Software-defined Hardware and Deterministic Execution
03 TSP Microarchitecture
04 System Packaging, Topology, Routing, and Flow Control
05 Summary

The Software-defined Approach
Hardware-software co-design is nothing new. What we are doing is re-examining the hardware-software interfaces.
- Static-dynamic interface: what is performed at “compile time” (statically) versus “execution time” (dynamically). This interface is managed by the runtime layer.
- Hardware-software interface: what architectural state is “visible” to the compiler, so that it can reason about correctness and provide predictable performance.
- “Nodes” in the computational graph represent operators, and “edges” are the operands and results.
- Operators fire only when all their input operands are available.
- Machine learning models are a good fit for this static analysis and deterministic execution.

[Figure: software stack. Labels: Software; Runtime System; TSP Hardware; Parallelizing Compiler; frameworks (CoreML, PyTorch, TensorFlow, Keras, XGBoost, Scikit-learn, Custom Applications); Model Converters (ONNX, MLIR); Assembler; Bare-metal Programming Interface; Hardware-software Interface; Exception Handling]

Designing for Determinism
Building hardware to be an efficient compiler target
- Design choices along the way need to accommodate the “design for determinism” philosophy.
- Hardware must enable the compiler and runtime interfaces to reason about program execution:
  - The memory consistency model must be well understood, disallowing reordering of memory references.
  - No “reactive components” such as arbiters, crossbars, replay mechanisms, caches, etc.
- Software must have access to the architecturally visible machine state in order to intercept the data (operands) with the instruction that will execute on them:
  - The compiler knows the exact location of every tensor on-chip.
  - In this way, the compiler orchestrates the arrival of operands and the instructions that use them.
- The producer-consumer stream programming model allows a set of “streaming register files” to track the state of each tensor flowing through the chip.

Avoiding Complexity at the Chip Level
Conventional CPUs add features and complexity (example: Intel Cascade Lake)
- Speculative execution and out-of-order retirement improve instruction-level parallelism but increase tail latency.
- Implicit data flow through cache memory hierarchies (e.g. DRAM → L3 → L2 → L1 → GPRs) introduces complexity and non-determinism in order to hide DRAM access latency and pressure; it is not energy or silicon efficient.
- Requires dynamic profiling to understand the execution time and throughput characteristics of a deep learning model.
The TSP simplifies data flow through stream programming
- A large, single-level scratchpad SRAM with fixed, deterministic latency.
- Explicitly allocate tensors in space and time, unlocking massive memory concurrency and compute flexibility along multiple dimensions: device, hemisphere, memory slice, bank, address offset.
- Predictable performance at scale.
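The determinism argument above can be made concrete with a small sketch. Assuming every operator has a fixed, known latency, a compiler can work out the exact cycle at which each operator fires (once all of its input operands have arrived) without any run-time arbitration. The operator names and cycle counts below are invented for illustration and are not taken from the TSP.

```python
from graphlib import TopologicalSorter

# Hypothetical operator latencies in cycles (illustrative values only).
LATENCY = {"load_a": 5, "load_b": 5, "matmul": 20, "relu": 2, "store": 5}

# Dataflow graph: each operator maps to the operators that produce its inputs.
INPUTS = {
    "load_a": [],
    "load_b": [],
    "matmul": ["load_a", "load_b"],
    "relu": ["matmul"],
    "store": ["relu"],
}

def static_schedule(inputs, latency):
    """Return {op: (issue_cycle, done_cycle)}, computed entirely ahead of time."""
    schedule = {}
    for op in TopologicalSorter(inputs).static_order():
        # An operator fires only when all of its input operands are available.
        issue = max((schedule[p][1] for p in inputs[op]), default=0)
        schedule[op] = (issue, issue + latency[op])
    return schedule

schedule = static_schedule(INPUTS, LATENCY)
for op, (issue, done) in schedule.items():
    print(f"{op:7s} issue={issue:3d} done={done:3d}")
```

Nothing in this schedule depends on run-time events, so computing it twice gives the same cycle counts. That repeatability is the property the slides refer to as deterministic execution.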

Avoiding Complexity at the System Level
Warehouse-scale computers (WSCs) and supercomputers
- Scaling to 20K+ nodes.
- Increasingly heterogeneous (SmartNICs, CPU, GPU, FPGA).
- Latency variance limits application scale; it is related to the diameter of the network.
- Global adaptive routing is complex (out-of-order messages, faults, hotspots, route/load imbalance, congestion).
Software-defined networking
- The compiler controls the traffic pattern in the Groq C2C network.
- Adaptive routing on chip to reduce latency variance and buffer occupancy.
- Extended with existing topologies, without pitfalls:
  - High-radix switches to increase pin-bandwidth on each node/switch.
  - Low-diameter network topology (e.g. Dragonfly, Flattened Butterfly, HyperX).

[Figure: distribution of packet latency, oblivious routing vs. adaptive routing]
[Figure: max buffer entries of middle-stage switch in a 1K-node network, oblivious routing vs. adaptive routing]
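As a rough illustration of why a compiler-controlled traffic pattern can lower buffer occupancy, the sketch below compares two ways of spreading vectors over parallel paths: oblivious random selection, and a round-robin assignment fixed ahead of time. The path count, vector count, and both policies are assumptions chosen for illustration. This is not the Groq C2C routing algorithm, and the load on the busiest path is used only as a stand-in for worst-case buffering.

```python
import random
from collections import Counter

def oblivious_loads(num_vectors, num_paths, rng):
    """Each vector picks a path at random, ignoring the current load."""
    return Counter(rng.randrange(num_paths) for _ in range(num_vectors))

def scheduled_loads(num_vectors, num_paths):
    """A schedule fixed ahead of time spreads vectors round-robin across paths."""
    return Counter(i % num_paths for i in range(num_vectors))

NUM_VECTORS, NUM_PATHS = 1024, 8   # hypothetical sizes
rng = random.Random(0)             # fixed seed so the run is repeatable

oblivious = oblivious_loads(NUM_VECTORS, NUM_PATHS, rng)
scheduled = scheduled_loads(NUM_VECTORS, NUM_PATHS)

max_oblivious = max(oblivious.values())
max_scheduled = max(scheduled.values())
print("max load, oblivious:", max_oblivious)
print("max load, scheduled:", max_scheduled)
```

The scheduled assignment always reaches the minimum possible peak load (here 1024 / 8 = 128 vectors per path), while random selection can only match or exceed it.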

TSP Microarchitecture
What does software-defined hardware mean?

Functionally Sliced Microarchitecture
Reorganizing the multicore mesh

[Figure: Reorganize a Conventional Manycore 2D Mesh. Cores on an on-chip network, each with a canonical 5-stage pipeline (IF, ID, EX, MEM, WB) and blocks for instruction dispatch, NET, I$, LSU, D$, INT, and FP, are regrouped into functional slices: FP/INT, NET/SXM, MEM (LSU+DS+IS), MEM (LSU+DS+IS), NET/SXM, FP/INT, with instruction control dispatch. Instruction flow and data flow are shown on separate axes. Example pipelines: integer ALU instruction pipeline (IF, ID, OD, EX, EX, WB); memory (load/store) instruction pipeline (IF, ID, OD, MEM, MEM, WB).]

- ICU: instruction control units
- MEM: on-chip memory (SRAM)
- VXM: vector processing
- MXM: matrix operations
- SXM: data reshapes and IO
- Tensor operands and results flow on “streams” horizontally.
- Instructions flow vertically and are executed in a SIMD manner.

TSP Data Flow
[Figure: TSP data flow with numbered callouts; instruction queues dispatching instructions]
- The chip is split into two “hemispheres,” East and West, with the VXM at the middle of the chip.
- The vector unit (VXM) has 16 PEs per lane, 5,120 in total, each capable of up to one 32-bit computation per cycle or four INT8 operations per cycle.
- The matrix unit (MXM) has 320 lanes x 320 features: each MXM plane stores 102,400 “weights”, for 409,600 MACCs (multiply-accumulators) on-chip.
- Peak of 750 TOPS (Int8Acc32) at 900 MHz.
- 1 TeraOps/sec per mm².
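The figures on this slide can be cross-checked with a few lines of arithmetic. The sketch below takes the lane, feature, PE, and clock numbers from the slides; the plane count of four follows from the "4x engines" figure quoted elsewhere in the deck, and counting each multiply-accumulate as two operations is an assumption.

```python
LANES = 320              # MXM lanes / vector length (from the slide)
FEATURES = 320           # MXM features per plane (from the slide)
MXM_PLANES = 4           # "4x engines"; also 409,600 / 102,400
VXM_PES_PER_LANE = 16    # from the slide
CLOCK_HZ = 900e6         # 900 MHz (from the slide)
OPS_PER_MACC = 2         # assumption: one multiply plus one accumulate

weights_per_plane = LANES * FEATURES
total_maccs = weights_per_plane * MXM_PLANES
vxm_alus = VXM_PES_PER_LANE * LANES
peak_int8_tops = total_maccs * OPS_PER_MACC * CLOCK_HZ / 1e12

print("weights per MXM plane:", weights_per_plane)
print("MACCs on chip:", total_maccs)
print("VXM ALUs:", vxm_alus)
print(f"peak INT8 throughput: {peak_int8_tops:.2f} TOPS")
```

Under these assumptions the peak works out to roughly 737 TOPS, close to the 750 TOPS quoted on the slide. The slides do not say how the quoted figure was rounded or counted, so the small gap is left unexplained here.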

Software Abstractions
SIMD + spatial architecture + streaming

GroqChip Scalable Architecture
- Networking: 480 GB/s bandwidth; extensible network scalability; multiple topologies.
- Data Switch: shift, transpose, and permuter for improved data movement and data reshapes.
- Instruction Control: multiple instruction queues for instruction parallelism.
- SRAM Memory: massive concurrency; 80 TB/s of bandwidth; stride insensitive.
- Groq TruePoint Matrix: 4x engines; 320x320 fused dot product; integer and floating point.
- Programmable Vector Units: 5,120 vector ALUs* for high performance.

[Figure: GroqChip 1 block diagram. Labels: Matrix Multiply Unit (x2), Vector Unit, Memory (x2), Instruction Control Unit, IO (x2), PCIe]

GroqChip Building Blocks
- SIMD unit (figure labels: instruction dispatch, 320-element vector).
- Build different types of specialized SIMD units:
  - MXM: matrix-vector / matrix-matrix multiply
  - VXM: vector-vector operations
  - SXM: data reshapes
  - MEM: on-chip SRAM
- Lay out SIMD units across the chip area.
- Synchronized instruction dispatch across all SIMD units for lockstep execution (144 instruction dispatch paths).
- High-bandwidth “Stream Registers” for passing data between units.

An ISA That Empowers Software
Software-controlled memory enabled by the low-level abstraction exposed by the architecture
- No dynamic hardware caching: the compiler is aware of all data locations at any given point in time.
- Flat memory hierarchy (no L1, L2, L3, etc.): memory is exposed to software as a set of physical banks that are directly addressed.
- Large on-chip memory capacity (220 MB) at high bandwidth (55 TBps) reduces the need to spill to non-deterministic DRAM, and provides enough “scratchpad” memory to hide external memory accesses behind compute.

[Figure: chip layout, MXM | SXM | MEM (88 SRAM banks) | VXM | MEM (88 SRAM banks) | SXM | MXM]
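To illustrate what "a set of physical banks that are directly addressed" might look like from the compiler's side, here is a minimal sketch of a flat word index split into some of the dimensions named earlier in the talk (hemisphere, bank, address offset). The two hemispheres and 88 banks per hemisphere come from the slides; the words-per-bank value, the field order, and the class and function names are hypothetical.

```python
from typing import NamedTuple

HEMISPHERES = 2            # East and West (from the slides)
BANKS_PER_HEMISPHERE = 88  # from the slide
WORDS_PER_BANK = 4096      # hypothetical capacity, for illustration only

class TensorAddress(NamedTuple):
    hemisphere: int  # 0 or 1; which one is East or West is arbitrary here
    bank: int
    offset: int

def decode(flat: int) -> TensorAddress:
    """Split a flat word index into (hemisphere, bank, offset)."""
    total = HEMISPHERES * BANKS_PER_HEMISPHERE * WORDS_PER_BANK
    if not 0 <= flat < total:
        raise ValueError(f"address {flat} out of range")
    bank_index, offset = divmod(flat, WORDS_PER_BANK)
    hemisphere, bank = divmod(bank_index, BANKS_PER_HEMISPHERE)
    return TensorAddress(hemisphere, bank, offset)

def encode(addr: TensorAddress) -> int:
    """Inverse of decode: rebuild the flat word index."""
    bank_index = addr.hemisphere * BANKS_PER_HEMISPHERE + addr.bank
    return bank_index * WORDS_PER_BANK + addr.offset

addr = decode(500_000)
print(addr, encode(addr))
```

With no cache between the instruction and the SRAM, the location the compiler computes is the location the hardware accesses, which is what lets it track every tensor's position at every cycle.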

An ISA That Empowers Software
Compiler empowered to perform cycle-accurate instruction scheduling
- Functional units execute in lockstep: one instruction issued per cycle at each dispatch path.
- Can be viewed as fully-pipelined 144-wide VLIW instructions.
- Little hardware control needed for managing instruction execution.

- Send(vector) enables fine-grained communication across the 16 directly connected links on each TSP.
- Comparison made with an 8-GPU A100 system with NCCL.
- The A100 system has approximately 3x higher network channel bandwidth.
- When normalized, the Groq TSP matches the bandwidth at large tensor sizes while significantly improving bandwidth at intermediate tensor sizes.

Summary and Takeaways
Next steps and future work

Compiler Progress
What we've enabled thus far (under planning, subject to change)

[Figure: models enabled per quarter (Q2 2021, Q3 2021, Q4 2021, Q1 2022, Q2 2022). Legend: computer vision, text-to-speech, natural language processing, high performance computing. Models named: Toy LSTM, Toy RNN, bidir-LSTM, Yolo, FFT, LSTM, DistilBERT, RN-50/101, RN-18/34/101, GPT-Neo, Tacotron 2, T5, VoVNet, MobileNet, Bregman, BERT-Large, WaveGlow, Toy MLP, Q8BERT, Electra, EfficientNet, SqueezeNet, Segformer, DETR, MobileBERT, LSTM-ST, EfficientDet, BERT-64, MT5 encoder, MT5 small, realm, splinter, T5 encoder, small-XLNet, 1hot-CV, STAC LSTM A/B/C, Bidir LSTM, MobNet2, SVM, small-CLIP, WebGPT, GNN, Random Forest]

Summary and Takeaways
- Low latency and high throughput are necessary for compute-intensive deep learning.
- Delivering predictable and repeatable performance is critical for many user-facing applications.
- Batch-1 inference is important for responsiveness and for delivering quality-of-service (QoS) that is impossible with more traditional microarchitectures using crossbars, cache hierarchies, etc.
- Determinism enables software-defined hardware and entails a design philosophy that spans both hardware and software.
- The ISA is not about abstraction of hardware details, but about exerting control of the underlying hardware:
  - 144 independent instruction control units (ICUs) of the TSP
  - Expose the architecturally-visible state (GPRs, SRAM, instruction buffers, etc.)
  - Software-based replay and exception handling
- Extending the single-chip TSP determinism to the multiprocessor, using software-scheduled networking to explicitly schedule tensors on the network links.
- The synchronous communication model allows for lock-free communication up to very large systems.

Thank You
Contact me or the Groq team at