The Microarchitecture of Tesla's Exa-Scale Computer
Emil Talpes, Douglas Williams, Debjit Das Sarma

What is DOJO?
- Tesla's in-house supercomputer for Machine Learning
- Highly scalable and fully flexible distributed system
- Optimized for Neural Network training workloads
- General-purpose system capable of adapting to new algorithms and applications
- Built from the ground up with large systems in mind; not evolved from existing small systems

Anatomy of a distributed system
- Distributed systems are built as hierarchies of nesting boxes: CPU - Die - Module - Board - Rack - Cabinet - System
- Integration gets looser as we move outward: lower bandwidth, higher latencies
- The system is described by three models:
  - Compute: architecture of the inner box
  - Communication: how data moves between boxes
  - Synchronization: how events get ordered across the entire system
- This talk describes our way of filling these boxes

High throughput, general purpose CPU
- DOJO nodes are full-fledged computers: dedicated CPU, local memory, communication interface
- Superscalar, multi-threaded organization, optimized for high-throughput math applications rather than control-heavy code
- Custom ISA optimized for ML kernels

Microarchitecture of the DOJO node
- Processing pipeline:
  - 32B fetch window holding up to 8 instructions
  - 8-wide decode handling 2 threads per cycle
  - 4-wide scalar scheduler, 4-way SMT: 2 integer ALUs, 2 address units, register file replicated per thread
  - 2-wide vector scheduler, 4-way SMT: 64B-wide SIMD unit, 8x8x4 matrix multiplication units (sketched below)
- SMT support focuses on single-threaded applications: no virtual memory, limited protection mechanisms, SW-managed sharing of resources
- A typical application uses 1 or 2 compute threads and 1-2 communication threads
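The slides name an 8x8x4 matrix multiplication unit (elsewhere referred to as the 8x8 matrix multiply engine). As a minimal C sketch of the arithmetic such a unit would perform per invocation - an 8x8 accumulator tile updated by an 8x4-by-4x8 product - with the operand types, accumulation width, and layout being assumptions rather than details from the slides:

    #include <stdint.h>

    /* Illustrative 8x8x4 matrix-multiply step: an 8x8 accumulator tile is
     * updated with the product of an 8x4 tile and a 4x8 tile. Operand and
     * accumulator types are assumed (e.g., BF16/CFP8 inputs accumulated in
     * FP32); here everything is float for simplicity. */
    void mma_8x8x4(float acc[8][8], const float a[8][4], const float b[4][8])
    {
        for (int i = 0; i < 8; i++) {
            for (int j = 0; j < 8; j++) {
                float sum = acc[i][j];
                for (int k = 0; k < 4; k++)
                    sum += a[i][k] * b[k][j];   /* 4-deep dot product per output element */
                acc[i][j] = sum;
            }
        }
    }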
Node memory
- 1.25MB SRAM per node: 400 GBps load, 270 GBps store
- Gather engine with 8B and 16B granularity
- Load, store, and load+execute from local memory
- Explicit transfer instructions for remote memory access
- List parsing

Network interface
- 2D mesh spanning all processing nodes: eight packets per cycle across the node boundary
- Each node has an independent network connection: direct SRAM connection, one read and one write packet per cycle
- Single cycle per hop in every direction
- Block-level DMA operations for data push and pull
- Seamless connection to neighboring nodes

Datapath
- Pipeline width reduces progressively: 8-way in Decode, 4-way in the Scalar engine, 2-way in the Vector engine
- Simple primitives can execute early in Decode: looping, list parsing, predication
- Scalar instructions: regular integer code, address generation, network synchronization primitives
- Vector datapath: 8x8 matrix multiplication instructions, 64B SIMD pipeline, special ML formats (CFP8, storage CFP16), special ML instructions (e.g., stochastic rounding)
List parsing

    parse_record DmaDB
      movi32 x1, x0, db_local_addr, 6      ; bring in a src address
      movi32 x1, x1, db_node_addr, 24
      vxor r31, r31, r31                   ; create a zero register
    loop
      loop db_cnt
        db_type=0:  st  x2!+64, r31        ; padding
        db_type!=0: ldr x2!+64, x1!+64, s7 ; pull request
      loop_end
      next_record
      db_type=0: loop_end
      mov x3, db_offset
      add x1, x1, x3
      db_type=1: loop_end
      db_type=3: loop_break all_done       ; exit
      movi32 x1, x0, db_local_addr, 6      ; bring in new src addr
      movi32 x1, x1, db_node_addr, 24
    loop_end
    all_done:
      swait s7, db_total_cnt               ; wait for data to arrive

CONCAT operations list:

    #define db_type        db0:3
    #define db_cnt         db3:12
    #define db_offset      db15:9
    #define db_local_addr  db15:17
    #define db_node_addr   db32:32
    #define db_total_cnt   db3:21

- The list parser allows efficient packaging of complex transfer sequences
- Most instructions execute in the front-end
- The sequence can run asynchronously on its own thread
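The #define lines describe bit fields of the descriptor records the list parser walks. As a rough host-side illustration only, a helper might pack one such record as below; reading the defines as start:width pairs, the 64-bit record size, and the handling of the overlapping offset/address fields are assumptions, not the actual format:

    #include <stdint.h>

    /* Set a bit field of width 'width' starting at bit 'start' in a record.
     * Field positions follow the slide's defines interpreted as start:width. */
    static uint64_t set_field(uint64_t rec, unsigned start, unsigned width, uint64_t value)
    {
        uint64_t mask = (width < 64) ? ((1ULL << width) - 1) : ~0ULL;
        return (rec & ~(mask << start)) | ((value & mask) << start);
    }

    /* Hypothetical packer for one address-type DMA record. */
    uint64_t pack_dma_record(unsigned type, unsigned cnt, uint32_t local_addr, uint32_t node_addr)
    {
        uint64_t rec = 0;
        rec = set_field(rec, 0, 3, type);          /* db_type       db0:3  */
        rec = set_field(rec, 3, 12, cnt);          /* db_cnt        db3:12 */
        rec = set_field(rec, 15, 17, local_addr);  /* db_local_addr db15:17 (shares bits with db_offset) */
        rec = set_field(rec, 32, 32, node_addr);   /* db_node_addr  db32:32 */
        return rec;
    }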
DOJO Instruction Set
- Fully featured, general-purpose instruction set:
  - 64b scalar instructions
  - 64B-wide SIMD instructions
  - Load+execute encodings to reduce pressure on architectural registers
  - Masked execution
- Network transfer and synchronization primitives:
  - Local to remote memory transfers
  - Semaphore and barrier support
- ML-specific primitives:
  - 8x8 matrix multiplication engine
  - Inline support for loading and gathering operands, transposing or expanding compressed operands
  - Special set of shuffle, transpose and convert instructions
  - Stochastic rounding
  - Implicit 2D padding

    Inst Type   Unique opcodes   Variants
    Front-end   12               21
    Scalar      74               143
    Vector      142              1095
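Stochastic rounding is listed among the ML-specific primitives. As a software illustration of the idea only (not the hardware's actual algorithm or RNG), rounding FP32 down to a coarser grid - shown here for BF16 - can look like this:

    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    /* Software sketch of stochastic rounding from FP32 to BF16: the 16 low
     * mantissa bits that truncation would discard are compared against a
     * random value, so the result rounds up with probability proportional
     * to the discarded fraction. Finite inputs assumed; rand() is used as
     * illustrative randomness only. */
    uint16_t fp32_to_bf16_stochastic(float x)
    {
        uint32_t bits;
        memcpy(&bits, &x, sizeof bits);

        uint32_t discarded = bits & 0xFFFFu;        /* bits lost by truncation    */
        uint32_t r = (uint32_t)rand() & 0xFFFFu;    /* needs 16 uniform bits      */
        if (r < discarded)
            bits += 0x10000u;                       /* probabilistic round up     */

        return (uint16_t)(bits >> 16);              /* sign, exponent, 7-bit mantissa */
    }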
DOJO arithmetic formats
- FP32 has more range and precision than the application requires
- 16b FP formats allow higher representation density, with some usage restrictions:
  - IEEE FP16 does not have enough range to cover all layers
  - BFP16 has more range, but sparser coverage density
- CFP8 has an even higher total representation range
  - A configurable bias allows the application to shift the representation range based on local needs
- CFP16 can be used when higher precision is required (e.g., gradients)
- DOJO implements FP32, BFP16, CFP8 and CFP16
- Up to 16 data types can be used at any time, with each 64B packet sharing a single type
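To make the configurable-bias idea concrete, here is a small decoding sketch. The 1-4-3 sign/exponent/mantissa split and the subnormal handling are assumptions chosen for illustration; the slides only state that the exponent bias is software-adjustable:

    #include <math.h>

    /* Illustrative decode of a configurable-bias 8-bit float (CFP8).
     * Assumed layout: 1 sign bit, 4 exponent bits, 3 mantissa bits.
     * A larger bias shifts the representable range toward smaller
     * magnitudes, a smaller bias toward larger ones. */
    float cfp8_decode(unsigned char v, int bias)
    {
        int sign = (v >> 7) & 0x1;
        int exp  = (v >> 3) & 0xF;
        int man  =  v       & 0x7;

        float mag;
        if (exp == 0)                       /* assumed subnormal encoding */
            mag = ldexpf((float)man / 8.0f, 1 - bias);
        else                                /* normal: implicit leading 1 */
            mag = ldexpf(1.0f + (float)man / 8.0f, exp - bias);

        return sign ? -mag : mag;
    }

Under these assumptions, a bias of 7 puts the largest normal value near 480, while raising the bias to 11 rescales the whole range down by a factor of 16 - the kind of per-tensor adjustment the configurable bias enables.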
First integration box - the D1 Die
- TSMC 7nm, 645mm2
- Physically and logically arranged as a 2D array: 354 DOJO processing nodes on die
- Extremely modular design
- 362 TFlops BF16/CFP8, 22 TFlops FP32 at 2GHz
- 440 MB SRAM
- Custom low-power serdes channels on all edges: 576 bidirectional channels, 2 TB/s bandwidth on each edge
- Seamless connection to neighboring dies

Second integration box - the Dojo Training Tile
- 5x5 array of known-good D1 chips
- 4.5 TB/s off-tile bandwidth per edge (half of the in-tile bandwidth)
- Fully integrated module: electrical + thermal + mechanical, 15kW of power delivery
- Custom power delivery: horizontal data communication plane, vertical power delivery and cooling, 15kW per module
- Custom high-density connectors: seamless connection to neighboring training tiles

DOJO System Topology
- The full system is a plane of DOJO training tiles organized as a 2D mesh
- DOJO interface processors (DIPs): located on the edges of the mesh, providing connectivity to the outside world and shared memory support
- Shared DRAM vs. private SRAM:
  - Each training tile has 11GB of private SRAM, highly distributed
  - 160GB of shared DRAM, lower bandwidth but more contiguous
Communication mechanisms
- Logical 2D mesh connecting all processing nodes:
  - 256 GBps bidirectional bandwidth on every row and column on-die, single cycle per hop, 5 TB/s cross-section
  - 2 TB/s bandwidth on every D1 die edge within the training tile, 100ns die-to-die latency
  - 900 GBps bandwidth on every D1 die edge out of the training tile
- Edge communication: PCIe links between DIPs and host systems, 32 GBps bandwidth per DIP, 160 GBps per host
- Global communication: Z-plane links between DIPs
  - Very long global communication routes are expensive in the mesh
  - The Tesla Transport Protocol (TTP) links all DIPs through an ethernet switch, 32 GBps bandwidth per DIP

Why treat long-range communication differently?
- The DOJO network has less bandwidth for long routes: long routes span multiple integration boundaries
- Long routes consume a lot more of the system resources: packets need to traverse multiple local communication links
- Strong incentive for software to keep most communication local: the amount of data transferred should go down quickly with distance
- Restrict global traffic to synchronization and All-Reduce primitives; data-parallel instances should be kept relatively small

[Charts: "Bandwidth vs route length (TBps)" and "Resource utilization vs route length", with die and tile boundaries marked along the route-length axis]
Bulk Communication
- Any processing node can access data stored anywhere in the system:
  - Can issue PUSH or PULL requests for each packet
  - Can use bulk transfer requests to initiate larger transfers (see the sketch below)
- SRAM to DRAM: DIPs accept DMA commands to copy data to and from DRAM; contiguous or strided transfers up to 3 dimensions
- SRAM to SRAM PUSH broadcast: enabled within sets of nodes within the same D1 die
- PUSH/PULL memory region: copy a contiguous region from a single source to a single destination
- List-based operations: allow complicated data gather and scatter patterns, using dedicated instruction sequences or threads utilizing the list-parsing engines
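As an informal illustration of what a 3-dimensional strided copy has to carry, a descriptor and the copy loop it implies might look like the following; the structure and field names are assumptions for exposition, not the actual DIP command format:

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* Hypothetical 3D strided-copy descriptor: an innermost contiguous run
     * of 'len' bytes, repeated 'count1' times at the middle strides, and
     * that pattern repeated 'count2' times at the outer strides. */
    struct strided_copy3d {
        uint8_t *dst, *src;
        size_t len;                               /* innermost contiguous length */
        size_t count1, dst_stride1, src_stride1;  /* middle dimension            */
        size_t count2, dst_stride2, src_stride2;  /* outer dimension             */
    };

    void strided_copy3d_run(const struct strided_copy3d *d)
    {
        for (size_t j = 0; j < d->count2; j++)
            for (size_t i = 0; i < d->count1; i++)
                memcpy(d->dst + j * d->dst_stride2 + i * d->dst_stride1,
                       d->src + j * d->src_stride2 + i * d->src_stride1,
                       d->len);
    }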
DOJO System Network
- A flat addressing scheme exposes the system topology to software; the target is hardware simplicity
- Routing scheme:
  - Simple 2D routing within the destination D1 die
  - All routers must work on a functional die; dead processing nodes are avoided by SW
  - Each D1 die has a programmable routing table: it can route packets towards the TTP-based Z dimension, allowing routing around dead D1 dies
- Packet ordering:
  - The DOJO network does not guarantee end-to-end traffic ordering
  - Counting: arriving packets must be counted at the destination before use
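A minimal sketch of the kind of decision such a router makes, assuming dimension-ordered X-then-Y routing inside the destination die and a per-die table entry that can divert traffic to the Z (TTP) port; the port names, table, and tie-breaking order are all assumptions:

    /* Output ports of a hypothetical on-die router. */
    enum port { PORT_LOCAL, PORT_EAST, PORT_WEST, PORT_NORTH, PORT_SOUTH, PORT_Z };

    /* Pick the next hop for a packet addressed to (dst_x, dst_y) on die dst_die,
     * seen by a router at (x, y) on die my_die. A per-die routing table (not
     * shown) supplies 'divert_to_z', e.g. to route around a dead die. */
    enum port route_next_hop(int my_die, int x, int y,
                             int dst_die, int dst_x, int dst_y,
                             int divert_to_z)
    {
        if (dst_die != my_die)
            return divert_to_z ? PORT_Z
                               : (dst_die > my_die ? PORT_EAST : PORT_WEST); /* toward the die */
        if (dst_x != x)
            return dst_x > x ? PORT_EAST : PORT_WEST;   /* resolve X first */
        if (dst_y != y)
            return dst_y > y ? PORT_NORTH : PORT_SOUTH; /* then Y          */
        return PORT_LOCAL;                              /* arrived         */
    }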
DOJO System Synchronization
- Counting semaphores are hardware-managed event trackers; HW ensures update atomicity and starvation avoidance
- Event generation:
  - Execute S(emaphore)SET on one of the local threads
  - Execute R(emote)S(emaphore)SET on a remote thread
  - Receive a data packet from the network
- Event monitoring:
  - Execute S(emaphore)WAIT on one of the local threads
  - Execute R(emote)S(emaphore)WAIT on a remote thread
  - Waiting threads are put to sleep until the condition is met
- Typical use models: mutexes, producer-consumer thread synchronization (local and remote), completion for network transfers
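A sketch of the producer-consumer pattern these primitives support, written against hypothetical sset/swait intrinsics: only the SSET/SWAIT names come from the slides, while the signatures, semaphore IDs, and buffer layout are invented for illustration:

    #include <stdint.h>

    /* Hypothetical intrinsics standing in for the semaphore instructions:
     * sset(id, n) adds n events to counting semaphore 'id';
     * swait(id, n) sleeps the thread until n events arrive, then consumes them. */
    void sset(int sem, int count);
    void swait(int sem, int count);

    enum { SEM_DATA_READY = 3, SEM_BUF_FREE = 4, TILE_BYTES = 4096 };
    /* SEM_BUF_FREE is assumed pre-initialized to the number of slots (2). */

    /* Producer thread: fill a slot, then signal the consumer. */
    void producer(uint8_t *buf, int tiles)
    {
        for (int t = 0; t < tiles; t++) {
            swait(SEM_BUF_FREE, 1);        /* wait until the consumer freed a slot */
            /* ... generate or DMA one tile into buf + (t % 2) * TILE_BYTES ... */
            sset(SEM_DATA_READY, 1);       /* publish: one more tile is ready      */
        }
    }

    /* Consumer thread: wait for data, process it, then release the slot. */
    void consumer(uint8_t *buf, int tiles)
    {
        for (int t = 0; t < tiles; t++) {
            swait(SEM_DATA_READY, 1);      /* sleep until the producer signals     */
            /* ... consume the tile at buf + (t % 2) * TILE_BYTES ... */
            sset(SEM_BUF_FREE, 1);         /* hand the slot back to the producer   */
        }
    }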
DOJO System Synchronization
- Software-defined barrier trees:
  - The compiler defines sets of nodes for each barrier domain and a communication tree to cover each set
  - Barrier domains can span any number of nodes
  - Nodes can be assigned to multiple barrier domains
- Software signals reaching and checking the barrier:
  - All nodes execute B(arrier)ARM to complete an upstream wave
  - The root node triggers a downstream wave
  - All nodes execute B(arrier)CHECK to wait until the downstream wave reaches them
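To show how the two-wave scheme composes over a tree, here is a schematic C version with invented barm()/bcheck() helpers; the data structure and the signalling primitives are illustrative assumptions, and only the ARM-upstream / CHECK-downstream protocol comes from the slides:

    /* One node's view of a software-defined barrier tree. */
    struct barrier_node {
        int n_children;
        int children_arrived;   /* upstream arrivals seen so far */
        int is_root;
        struct barrier_node *parent;
    };

    void signal_parent(struct barrier_node *n);     /* assumed: notify the parent      */
    void release_children(struct barrier_node *n);  /* assumed: push the downstream wave */
    void wait_for_release(struct barrier_node *n);  /* assumed: sleep until released   */

    /* BARM: take part in the upstream wave. A node arms once its whole
     * subtree has arrived, then reports the arrival to its parent. */
    void barm(struct barrier_node *n)
    {
        while (n->children_arrived < n->n_children)
            ;                               /* wait for the subtree (placeholder spin) */
        if (!n->is_root)
            signal_parent(n);               /* propagate the arrival upstream          */
    }

    /* BCHECK: block until the downstream wave reaches this node, then
     * forward it to the node's own children. The root starts the wave. */
    void bcheck(struct barrier_node *n)
    {
        if (!n->is_root)
            wait_for_release(n);            /* sleep until the downstream wave arrives */
        release_children(n);                /* forward the wave down the tree          */
    }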
Summary
- DOJO is a large-scale distributed system: one exa-pod has more than 1,000,000 CPUs
- DOJO component modules sit at the edge between general-purpose and application-optimized hardware
- The defining characteristic is extreme scalability: de-emphasize poorly scaling mechanisms like virtual memory, coherency, and global data structures
- DOJO relies less on local storage and more on fast data movement: an order of magnitude higher interconnect bandwidth than typical distributed systems
- Integration and utilization of these components is the subject of the next presentation!