NVIDIA ORIN SYSTEM-ON-CHIP
Michael Ditty | August 2022

INTRODUCING ORIN

Advanced CPU
- 12x ARM Cortex-A78AE cores
- ARM architecture v8.2
Rich IO connectivity
- Up to 4x 10 Gb Ethernet
- x24 SerDes, x16 CSI
Ampere GPU
- Up to 2 GPC / 8 TPC / 16 SMs
- 5.3 FP32 CUDA TFLOPS
- 10.6 FP16 CUDA TFLOPS
Strong DL performance
- Up to 275 INT8 DL TOPS (170 GPU + 105 DLA)
- 85 FP16 DL TOPS (GPU)
Higher DRAM bandwidth
- Up to 256-bit LPDDR5
- 205 GB/s
Enhanced PVA
- Up to 512 INT16 GMAC/s
- 2048 INT8 GMAC/s
Safety Island
- Up to 10K ASIL D DMIPS
- 4x lockstep ARM Cortex-R52
SoC safety
- FUSA: ASIL-B chip, ASIL-D systematic
Process
- Samsung 8 nm

ORIN CPU COMPLEX

- ARM Cortex-A78AE v8.2 high-performance CPU
- Lockstep support
- 2.2 GHz frequency
- Cache hierarchy
    L1 (per core): 64 KB I$, 64 KB D$
    L2 (per core): 256 KB
    L3 (per cluster): 2 MB
    System cache (L4): 4 MB shared cache

[Block diagram, CPU complex: three CPU clusters, each with four Cortex-A78AE cores (64 KB L1 I, 64 KB L1 D, and 256 KB L2 per core) and a 2 MB L3; all clusters share a 4 MB system cache.]

CPU PERFORMANCE
Orin silicon-based measurements

Benchmark                                         Score
SPEC CPU2006 speed integer, single core           31.8*
SPEC CPU2006 rate integer, 12-core                269.5*
SPEC CPU2006 speed floating-point, single core    41.6*
SPEC CPU2006 rate floating-point, 12-core         332.0*
SPEC CPU2017 rate integer, single core            4.04*
SPEC CPU2017 rate integer, 12-core                39.36*
Geekbench 5, single core                          754
Geekbench 5, 12-core                              7773

Notes:
- CPU clock speed used for testing is 2.2 GHz
- Memory running at 3200 MHz for all tests
- SPEC CPU2017 single-core uses rate, not speed
* SPEC scores are estimates

ORIN SAFETY ISLAND

- Isolated ASIL-D compute subsystem
- Lockstep ARM Cortex-R52 CPUs
- Dedicated IO
- Dedicated security processor
- Power isolation

[Block diagram, Safety Island: R52 lockstep CPUs with 32 KB L1 I and 32 KB L1 D, 512 KB tightly coupled memory, 3 MB SRAM, security engine, HW safety manager, SPI, CAN, UART, GPIO, HW sync/IPC/timers, and DMA, joined by the Safety Island fabric with a link to the SoC.]

ORIN SAFETY ISLAND
CPU & memory configuration

Safety Island spec                      Orin
CPU configuration                       4x Cortex-R52 lockstep pairs
Aggregate ASIL D DMIPS                  10K
ICache/DCache per core pair             32 KB / 32 KB
Tightly coupled memory per core pair    512 KB
Shared SRAM                             3 MB
Total shared SRAM + TCM                 5 MB

AMPERE GPU

Orin features the Ampere GPU architecture with enhanced DL throughput, the latest graphics capabilities including ray-tracing, and improved power efficiency.
- 2 GPC / 8 TPC / 16 SM (2x Xavier)
- 192 KB L1 cache per SM
- 4 MB L2 cache
- Enhanced Tensor Cores

[Block diagram, Orin Ampere GPU: two GPCs, each with four TPCs; each TPC has two SMs with 192 KB L1 per SM. The GPCs share the 4 MB L2 and a host interface.]
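
Several of the headline figures on the slides above can be cross-checked with simple arithmetic. The sketch below is illustrative only and is not part of the original deck. It uses numbers quoted on the slides (256-bit LPDDR5, 3200 MHz memory clock, 4 lockstep pairs with 512 KB TCM each, 3 MB shared SRAM, 170 GPU + 105 DLA TOPS). One assumption is not stated in the deck: the 3200 MHz memory clock is treated as a double-data-rate clock, giving 6400 MT/s. The function names are hypothetical helpers chosen for this example.

```python
# Cross-check of headline Orin figures using numbers quoted on the slides.
# Function names are hypothetical helpers, not NVIDIA APIs.


def dram_bandwidth_gb_s(bus_width_bits: int, clock_mhz: float,
                        transfers_per_clock: int = 2) -> float:
    """Peak DRAM bandwidth in GB/s (1 GB = 1e9 bytes).

    ASSUMPTION: the memory clock is double data rate, so each clock
    cycle carries two transfers (3200 MHz -> 6400 MT/s).
    """
    bytes_per_transfer = bus_width_bits / 8
    transfers_per_second = clock_mhz * 1e6 * transfers_per_clock
    return bytes_per_transfer * transfers_per_second / 1e9


def safety_island_memory_mb(core_pairs: int, tcm_kb_per_pair: int,
                            shared_sram_mb: float) -> float:
    """Total Safety Island memory: per-pair TCM plus shared SRAM, in MB."""
    return core_pairs * tcm_kb_per_pair / 1024 + shared_sram_mb


def total_int8_tops(gpu_tops: float, dla_tops: float) -> float:
    """Aggregate INT8 DL throughput as the sum of GPU and DLA TOPS."""
    return gpu_tops + dla_tops


if __name__ == "__main__":
    print(f"DRAM bandwidth:       {dram_bandwidth_gb_s(256, 3200):.1f} GB/s")
    print(f"Safety Island memory: {safety_island_memory_mb(4, 512, 3):.1f} MB")
    print(f"INT8 DL throughput:   {total_int8_tops(170, 105):.0f} TOPS")
```

Under that data-rate assumption the bandwidth works out to 204.8 GB/s, which the introduction slide rounds to 205 GB/s. The Safety Island total (4 x 512 KB + 3 MB) and the TOPS sum (170 + 105) match the 5 MB and 275 TOPS quoted on the slides.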

AMPERE GPU
Compute enhancements

- MIG (Multi-Instance GPU): GPU can be split into two separate GPUs for compute
- Sparsity: fine-grained structured sparsity doubles throughput and reduces memory usage
- 2x CUDA floating-point performance: higher compute math speed

NEXT GEN DLA
Deep Learning Accelerator

- Focused on INT8 inference performance
    Increased performance to 52.5 TOPS (INT8)
- Aligned with GPU
    Compatible math pipeline
    Structured sparsity
    TensorRT API
- Performance
    Structured sparsity
    Larger SRAM
    HW support for layer scheduling
    Dedicated depth-wise convolution engine
    Additional native HW features supported

[Block diagram, DLA: micro-controller, depth-wise processor, hardware scheduler, 1 MB shared buffer, and two convolution cores, each with a 608 KB buffer and post-processing, connected to the memory interface.]

NEW FEATURES
Compared to Xavier DLA

Feature                          Comments
Softmax                          New function support
Clamped RELU                     New function support
Exclusive average pooling        New function support
Per-channel scaling              New function support
Full-channel normalization       New function support
UINT8 support                    New data format support
Support 3D convolution           New function support
Hardware scheduler               New engine for optimization
Structured sparsity              New optimization feature
Group function optimization      Optimization for group function performance
Depth-wise convolution engine    Highly optimized dedicated engine for DW performance

MULTI-ORIN

- Support for up to 4x Orin SoCs with direct high-speed connections
- Gen4 PCIe x4
    Support Root Port and End Point modes
- 10 Gb Ethernet

[Diagram: four Orin SoCs connected for high-speed data sharing.]

ADDITIONAL ENHANCEMENTS

- AV1 video encode & decode support
- 8K60 display support
- 10 Gb Ethernet
- Improved Optical Flow Accelerator
- Improved ISP
- Gen4 PCIe

JETSON AGX ORIN
UP TO 8X PERFORMANCE

Metric (unit)                    Jetson AGX Xavier    Jetson AGX Orin 64GB    Gain
DL/AI (INT8 TOPS)                32                   275                     8x
DRAM BW (GB/s)                   137                  205                     1.5x
CPU (estimated SpecInt 2006)     140                  269                     1.9x
CUDA (FP32 TFLOPS)               1.4                  5.3                     3.7x
DLA (INT8 TOPS)                  11.4                 105                     9x

* MaxN performance

JETSON BLASTS AHEAD
Delivers up to 5x more inference performance and 2.3x energy efficiency

[Bar chart: Edge Performance and Performance/Watt. For each of ResNet-50, SSD-Small, SSD-Large, RNN-T, and BERT-Large, bars show Perf and Perf/Watt for Jetson AGX Xavier and Jetson AGX Orin on a 0.0x to 5.0x scale. Performance normalized to Jetson AGX Xavier (v1.1 submission).]

MLPerf v2.0 Inference Edge Closed and Edge Closed Power; Performance/Watt from MLPerf results for respective submissions for Data Center and Edge, Offline Throughput and Power. NVIDIA Jetson AGX Xavier: 1.1-110 and 1.1-111 | Jetson AGX Orin: 2.0-140 and 2.0-141. MLPerf name and logo are trademarks. See www.mlcommons.org for more information.

JETSON ORIN

- Up to 275 INT8 TOPS powered by Ampere GPU + DLA
- Up to 12x A78AE ARM CPUs
- Up to 64 GB memory, 204 GB/s
- TDP from 10 W to 60 W
- Jetson AGX and Jetson NX Orin based products

DRIVE ORIN AUTOMOTIVE

- Up to 254 INT8 TOPS powered by Ampere GPU + DLA
- Up to 12x A78AE ARM CPUs
- Up to 204 GB/s
- 50-60 W air cooled, 100 W liquid cooled
- Scaling to 4 high-BW connected Orin
- Connectivity via Gen4 PCIe x4 or 10GbE
- High-performance scaling to 4 Orin
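
The gains quoted on the JETSON AGX ORIN comparison slide (8x DL/AI, 1.5x DRAM BW, 1.9x CPU, 3.7x CUDA, 9x DLA) can be reproduced from the per-metric values on the same slide. The sketch below is illustrative only. The pairing of each Xavier/Orin value pair with its label is an assumption inferred from the flattened chart text and the units listed on the slide, and the names `COMPARISON`, `speedup`, and `ratios` are hypothetical.

```python
# Xavier-to-Orin ratios from the JETSON AGX ORIN comparison slide.
# ASSUMPTION: the value-to-label pairing below is inferred from the
# flattened chart text; it is not stated row by row in the source.

# (label, unit, Jetson AGX Xavier, Jetson AGX Orin 64GB, gain quoted on slide)
COMPARISON = [
    ("DL/AI", "INT8 TOPS", 32, 275, 8.0),
    ("DRAM BW", "GB/s", 137, 205, 1.5),
    ("CPU", "Estimated SpecInt 2006", 140, 269, 1.9),
    ("CUDA", "FP32 TFLOPs", 1.4, 5.3, 3.7),
    ("DLA", "INT8 TOPs", 11.4, 105, 9.0),
]


def speedup(xavier: float, orin: float) -> float:
    """Ratio of the Orin value to the Xavier value for one metric."""
    return orin / xavier


def ratios() -> dict:
    """Map each metric label to its computed Orin/Xavier ratio."""
    return {label: speedup(x, o) for label, _unit, x, o, _quoted in COMPARISON}


if __name__ == "__main__":
    computed = ratios()
    for label, unit, _x, _o, quoted in COMPARISON:
        print(f"{label:8s} {computed[label]:5.2f}x computed, "
              f"{quoted}x quoted ({unit})")
```

With this pairing, every computed ratio lands within 10% of the quoted gain, which suggests the slide figures are rounded or truncated versions of the raw ratios (275 / 32 is about 8.6, for example, against the quoted 8x).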