HOTCHIPS 2022

[1 of 16] Trinity: End-to-End In-Database Near-Data Machine Learning Acceleration Platform for Advanced Data Analytics
Ji-Hoon Kim 1), Seunghee Han 1), Kwanghyun Park 2), Soo-Young Ji 3) and Joo-Young Kim 1)
[Affiliations 1), 2), 3): logos only, text not captured]

[2 of 16] In-DBMS Data Analytics
Three Important yet Independent Technology Trends
- ML-based Advanced Data Analytics
- Database HW Acceleration
- Near-Data/In-Storage Processing
[Figure labels: Enterprise-level DBMS; In-DBMS ML support; GPU-based DBMS; ASIC/FPGA/GPU; Data-intensive application; SmartSSD/SmartNIC/HMC]

[3 of 16] Trinity: In-Database, In-Storage Platform
Full Stack System for Near-Data-based In-DBMS ML Inference
[Figure: Trinity positioned among ML-based Data Analytics, Database HW Acceleration and In/Near-Data Processing. Systems named: Enterprise DBMSs, MADlib, Spark ML, Q100, BlazingSQL, CPU-FPGA, HMC, SmartSSD, SmartNIC, Gorgon, DAnA, Aquoman, Mondrian]
Software Stack: SmartSSD-enabled DBMS (PostgreSQL+)
- Conventional SW Stack: CPU Executor
- Extended SW Stack: MADlib, SmartSSD Executor (XRT Platform)
- Host code (XRT C/C++ API), Device code (.sv), XRT Linux Kernel Driver
Hardware: SmartSSD
- NAND Flash (3.84TB)
- PCIe Switch
- FPGA HW Accelerator (i-DPA)
- FPGA DRAM (4GB)
- PCIe

[4 of 16] Trinity: In-Database, In-Storage Platform
Full Stack System for Near-Data-based In-DBMS ML Inference
[Figure: the same software and hardware stack diagram as on slide 3]
Layers: Software DBMS, SW-HW Interface, Hardware Accelerator
Key features called out on the stack diagram:
- Data Format Converting
- Dynamic Offloading Decision
- Seamless Integration of SmartSSD
- Direct Page Decoding
- Dynamic Tuple Binding
- Heterogeneous Core Arch. w/ Reconfig. On-chip Interconnect
Trinity shows up to 57.18x faster query processing speed than CPU-based DBMS.

[5 of 16] Computational Storage Device
New Hardware Backend: Samsung's SmartSSD
- Xilinx Kintex UltraScale+ FPGA, 4GB DRAM and 3.84TB NAND flash
- Direct FPGA-to-SSD data access using internal PCIe switch
[Figure: SmartSSD board with dimension labels 69mm, 15mm, 100mm. Block diagram: CPU (Host), SSD, DRAM, SSD Controller, FPGA, PCIe Switch. Data paths: SSD R/W, FPGA DRAM R/W, P2P Comm.]

[6 of 16] Software Stack for Trinity
Seamless Integration of SmartSSD in DBMS
- Extended SW stack: extended analyzer 1) + optimizer 2)
  1) Converting query information to the HW data format
  2) Making runtime offloading decision to select an optimal HW backend
[Figure: query flow through Parser, Extended Analyzer (Query Extractor, Query Checker, Data Reconstructor), Extended Optimizer (Predictor, cost model) and Executor. Other labels: Yes/No branch; PostgreSQL pipeline (CPU Executor); (HW Data format); Query plan; SmartSSD Cost Model; CPU Cost Model; Final Decision; Meta-Data]
Example query info.:
- Operation: Linregr
- Table oid: 219,517
- Filtering: =
- Aggregation: CNT

[7 of 16] Extended Query Optimizer
Performance Cost Model
- Determining optimal hardware backend (SmartSSD vs CPU)
- Showing 5.3% & 12.97% average error + 96% offloading accuracy
[Figure: SmartSSD Cost Model and CPU Cost Model plots of Latency (ms) against Database Size (GB), comparing hardware with cost model, with Error (%). An equation-based performance model and a regression-based performance model are shown as Latency against Query Complexity, with fine-tuning. The model equations are not recoverable from the extraction.]
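The runtime offloading decision described above can be sketched as a comparison of two predicted latencies. The sketch below is only an illustration: the slide's actual model equations were lost in extraction, so the linear model form, the class and function names, and every coefficient here are hypothetical placeholders rather than values from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinearCostModel:
    """Hypothetical latency model: base + per_gb * size + per_unit * complexity."""
    base_ms: float
    per_gb_ms: float
    per_complexity_ms: float

    def predict_ms(self, db_size_gb: float, complexity: float) -> float:
        # Predicted query latency in milliseconds.
        return (self.base_ms
                + self.per_gb_ms * db_size_gb
                + self.per_complexity_ms * complexity)


# Placeholder coefficients chosen for illustration, NOT measured values.
# The SmartSSD model is given a larger fixed cost and a smaller per-GB cost.
SMARTSSD_MODEL = LinearCostModel(base_ms=50.0, per_gb_ms=300.0, per_complexity_ms=1.0)
CPU_MODEL = LinearCostModel(base_ms=5.0, per_gb_ms=2000.0, per_complexity_ms=20.0)


def choose_backend(db_size_gb: float, complexity: float) -> str:
    """Return the backend whose model predicts the lower latency."""
    ssd_ms = SMARTSSD_MODEL.predict_ms(db_size_gb, complexity)
    cpu_ms = CPU_MODEL.predict_ms(db_size_gb, complexity)
    return "SmartSSD" if ssd_ms < cpu_ms else "CPU"
```

With these placeholder numbers, `choose_backend(0.001, 1.0)` keeps a tiny query on the CPU, while `choose_backend(10.0, 1.0)` offloads a large scan to the SmartSSD. A per-query decision of this kind is what lets the optimizer avoid offloading queries that would run slower on the device.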
[8 of 16] Overall Architecture of i-DPA*
1. Database Page Decoder
   - Page & data processing unit
   - Page-level parallelism
2. Database Tuple Binder
   - Dynamic tuple binding
   - Tuple-level parallelism
3. Heterogeneous Core Arch.
   - Reconfig. on-chip interconnect
   - Task-level parallelism
[Figure: i-DPA block diagram.
 Database Page Decoders: Data Decoders, Model Decoder; CH#1-N; PB = Page Buffer (8KB), PPU = Page Processing Unit, DPU = Data Processing Unit; Unicasting, Broadcasting.
 Query Processing Core #1-N: IMEM (8KB), WMEM (16KB), Database Tuple Binder, Reconfigurable On-Chip Interconnect, OMEM (8KB), Ctrl.
 Linear Comp. Unit: PE Array #0-7; PE#0, PE#1, PE#2, PE#3; Adder Tree; Configurable Aggregation Unit.
 Relational Comp. Unit: Filtering Unit, Aggr. Unit.
 Tree Comp. Unit: Tree PE#0-7, Checker, Address Generator, Comparator.
 Multi-Func Comp. Unit.
 Top Aggregation Unit, Top Relational Aggregator, Output DMA.]
* i-DPA = in-Database Processing Accelerator

[9 of 16] Database Page Decoder
Direct Tuple Extraction from Database Page
- Removing host interaction
- Keep BW benefit of in-storage processing
- Faster page decoding w/ page-level parallelism
[Figure: data path compared w/o Page Decoder and w/ Page Decoder. Labels: SSD (Database), Page, Table, CPU, FPGA]
[Figure: Database Page Decoders, channels CH#1 to CH#N (page-level parallelism). Labels: Ctrl, Page Buffer, RF, ALU, Reg, Mask info., Filtering, Reg, Page Processing Unit, Data Processing Unit. Steps 1 and 2 are marked over PCIe, with one path labeled "Unnecessary Data Copy".]

[10 of 16] Database Tuple Binder
Dynamic Tuple Binding
- Dynamically varying tuple packing density according to the tuple size
- Tuple-level parallelism & hardware utilization
[Figure: Database Tuple Binder feeding the PE Arrays and the Config. Aggregation Unit]
- Packing tuple w/ zero-padding
- Increasing parallelism up to 8x
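The packing idea on this slide can be illustrated in software. The sketch below is a hedged reading, not the i-DPA logic: it assumes 8 PE arrays of 4 PEs each (taken from the architecture figure), assumes that a tuple is bound to 1, 2, 4 or 8 PE arrays depending on which of the slide's four attribute-count cases it falls into, and reads the two cases whose comparison signs were lost as "fewer than 5" and "more than 16". The function names are hypothetical.

```python
from typing import List, Sequence

PES_PER_ARRAY = 4   # PE#0..PE#3 per PE array (from the architecture figure)
NUM_PE_ARRAYS = 8   # PE Array #0..#7 (from the architecture figure)


def arrays_per_tuple(num_attrs: int) -> int:
    """Assumed number of PE arrays bound to one tuple, by attribute count."""
    if num_attrs < 1:
        raise ValueError("a tuple needs at least one attribute")
    if num_attrs < 5:       # CASE 0 (assumed: fewer than 5 attributes)
        return 1
    if num_attrs <= 8:      # CASE 1 (5-8 attributes)
        return 2
    if num_attrs <= 16:     # CASE 2 (9-16 attributes)
        return 4
    return 8                # CASE 3 (assumed: more than 16 attributes)


def tuples_in_parallel(num_attrs: int) -> int:
    """How many tuples of this width fit across all PE arrays at once."""
    return NUM_PE_ARRAYS // arrays_per_tuple(num_attrs)


def pad_tuple(values: Sequence[float]) -> List[float]:
    """Zero-pad a tuple to the lane width of the PE arrays bound to it."""
    width = arrays_per_tuple(len(values)) * PES_PER_ARRAY
    if len(values) > width:
        # Tuples wider than all 32 PEs are outside the scope of this sketch.
        raise ValueError("tuple wider than the PE fabric is not modeled")
    return list(values) + [0.0] * (width - len(values))
```

Under these assumptions a 3-attribute tuple occupies one PE array, so 8 tuples are processed side by side, which matches the "up to 8x" parallelism figure; a 12-attribute tuple spans four arrays and only 2 fit at once.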
Figure annotation: set proper aggregation link for final output.
Binding cases by number of attributes:
- CASE 0: # of Attr < 5
- CASE 1: # of Attr 5-8
- CASE 2: # of Attr 9-16
- CASE 3: # of Attr > 16

[11 of 16] Query Processing Core
Heterogeneous Core Architecture
- Reconfigurable on-chip interconnect
- Enabling flexible data streaming
- Task-level parallelism across the computing units
[Figure: two interconnect configurations over Meta-Data, Filtering Unit, Linear Comp. Unit, Tree Comp. Unit, Aggregation Unit and MultiFunc Unit, with steps 1 to 4 and the Relational & ML Opcode Set. Task-level pipelining over time steps t, t+1, t+2, t+3, with Tuple0 to Tuple3 occupying successive stages.]

[12 of 16] FPGA Implementation Result
System Setup & FPGA Implementation Result
Computational Storage Server
- 2 Intel Xeon Silver 4210 CPUs
- PostgreSQL v12.6
- MADlib v1.17.0
- 156GB DIMM
- 3.84TB SmartSSD
Specifications
- FPGA: Kintex Ultrascale+
- Freq.: 170MHz
Resource Utilization (columns: Interface, Core0, Core1, Others, Utilization)
- LUT: per-column counts not recoverable (extracted as "267.4%")
- FF: per-column counts not recoverable (extracted digits "924379757"); Utilization 36.1%
- BRAM: per-column counts not recoverable (extracted digits "311.5151522"); Utilization 46.9%
- URAM: 8, 32, 32, 0; Utilization 56.3%
- DSP: 9, 342, 342, 10; Utilization 35.7%

[13 of 16] End-to-End Trinity Evaluation
Evaluate Against CPU-based DBMS Platform
- 0.85x to 57.18x faster query processing than CPU-based DBMS
- 15.21x faster than CPU-based DBMS on average
[Figure: Latency (ms) and Speedup per query, CPU-based DBMS vs Trinity]

[14 of 16] Scaling-up with Multiple SmartSSDs
Scale-up the Overall System
- SmartSSD: easy to scale out the number of devices w/ U.2 form factor
- Deploying 4 SmartSSDs: 200x faster than CPU-based DBMS
- With 2 SmartSSDs: 1.85x performance gain
- With 4 SmartSSDs: 3.66x performance gain
- Linear Performance Improvement; Overall 200x Speedup Achieved
[Figure: Latency (ms) for 1, 2 and 4 SmartSSDs. Labels: Uniformly distribute database; PCIe Sub-system]

[15 of 16] Conclusion
Trinity: In-Database, Near-Data Machine Learning Acceleration Platform for Advanced Data Analytics
1. Full Stack System for In-DBMS Advanced Data Analytics
   - Up to 57.18x faster query processing than CPU-based DBMS
2. Software Stack (PostgreSQL+) for SmartSSD Integration
   - Dynamic offloading decision: 96% accuracy
3. Near-Storage based Hardware Accelerator (i-DPA)
   - Direct data page decoding & abundant parallel processing (3 levels)

[16 of 16] Thank You! Questions?
Feel Free to Contact Me!
E-mail: jihoon0708@kaist.ac.kr
LinkedIn: https:/