ML Frameworks and Frontends in MLIR
Suraj Sudhir, Arm
Hot Chips 34, August 21, 2022

ML Framework Frontends
- Environments to define and build ML models
- Offer a range of capabilities
- A very dynamic and evolving space
- ML compiler/systems design goals: support multiple frameworks, and keep up with their evolution

ML Framework Characteristics
- Expressiveness: high-level language capabilities and paradigms
- Feature richness: operator sets and libraries
- Infrastructural: training, quantization, optimization/performance
- ML compiler ask: all of this needs to "just work"
MLIR in ML Frameworks
- Starts with an ML model constructed within a framework
- Translators convert the serialized model to MLIR form
- Enables construction of MLIR-based compiler infrastructure
- MLIR dialects of multiple frameworks are already present:
  - TensorFlow and TensorFlow Lite, from the TensorFlow project
  - PyTorch, via Torch-MLIR
  - JAX
  - ONNX, via ONNX-MLIR

Framework Consumption in MLIR

Connecting Frameworks and Code Generation
- Reduction or "waistline" mid-level dialects, designed to be compilation friendly
- Convert frontend ops to the mid-level dialect(s)
- Complex ops are decomposed into sequences of simpler ones
- Backend code generation paths target the reduction dialect(s)
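As an illustration of this kind of decomposition (the softmax operator here is a hypothetical example, not one from the talk), a single framework-level op can be legalized into a short sequence of simpler whole-tensor ops:

```python
import numpy as np

# Hypothetical example: decompose a framework-level "softmax" op into the
# kind of simpler whole-tensor ops a mid-level dialect provides.
def softmax_decomposed(x, axis=-1):
    m = x.max(axis=axis, keepdims=True)      # reduce_max (for stability)
    e = np.exp(x - m)                        # sub, exp
    s = e.sum(axis=axis, keepdims=True)      # reduce_sum
    return e / s                             # div

x = np.array([[1.0, 2.0, 3.0]])
print(softmax_decomposed(x).sum())           # each row sums to 1, as softmax must
```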
Mid-level MLIR Dialects
- TOSA (Tensor Operator Set Architecture) dialect: specification-based; defines functionality and precision; enables hardware/software codesign
- HLO (High Level Operations) dialect: the input language to the XLA compiler at Google; the primary output of JAX for compilation
- LinAlg dialect: a powerful codegen-oriented dialect; enables tiling, vectorization, bufferization, and other capabilities

Case Study: TOSA
- Designed at Arm
- Problem: frontend dynamism and heterogeneity; hardware design needed to be stabilized
- Goal: target multiple frontends, with a stable path to ML accelerators
- A whole-tensor operator set architecture backed by a specification
- Defines functionality, precision, and quantization
- The MLIR dialect implements the specification

TOSA
- TOSA is stable and versioned
- Defines profiles: base inference, main inference, training
- Conversions from frontends to TOSA:
  - TensorFlow, TensorFlow Lite (stable)
  - Torch-MLIR for PyTorch (advanced development)
  - ONNX-MLIR for ONNX (WIP)
- Hardware and software designed to TOSA compliance; TOSA-compliant hardware development at Arm
- Used within Google's IREE MLIR compiler, and elsewhere
Example: Quantized Conv2D
Input frontend: TensorFlow Lite
- Quantized Conv2D + bias addition
- Fused relu6 activation
- Symmetrically quantized signed 16-bit datatype

```mlir
module attributes {tf_saved_model.semantics, tfl.description = "MLIR Converted.", tfl.schema_version = 3 : i32} {
  func @main(%arg0: tensor<1x32x32x8x!quant.uniform<...>>) -> (tensor<1x32x32x16x!quant.uniform<...>>) {
    %0 = "tfl.pseudo_qconst"() {qtype = tensor<16x1x1x8x!quant.uniform<i8:f32:0, ...>>, value = dense<...> : tensor<...>} : () -> tensor<16x1x1x8x!quant.uniform<i8:f32:0, ...>>
    %1 = "tfl.pseudo_const"() {value = dense<...> : tensor<...>} : () -> tensor<...>
    %2 = "tfl.conv_2d"(%arg0, %0, %1) {dilation_h_factor = 1 : i32, dilation_w_factor = 1 : i32, fused_activation_function = "RELU6", padding = "SAME", stride_h = 1 : i32, stride_w = 1 : i32} : (tensor<1x32x32x8x!quant.uniform<...>>, tensor<16x1x1x8x!quant.uniform<i8:f32:0, ...>>, tensor<...>) -> tensor<1x32x32x16x!quant.uniform<...>>
    return %2 : tensor<1x32x32x16x!quant.uniform<...>>
  }
}
```
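For reference, the `!quant.uniform` types in the listing encode an affine mapping between real values and stored integers. A minimal sketch of that mapping (the scale value here is illustrative, not taken from the model; symmetric quantization means the zero point is 0):

```python
import numpy as np

# Affine (uniform) quantization: real = scale * (quantized - zero_point).
# The scale below is illustrative; symmetric quantization uses zero_point = 0.
def quantize(real, scale, zero_point=0, qmin=-32768, qmax=32767):
    q = np.round(real / scale) + zero_point
    return np.clip(q, qmin, qmax).astype(np.int16)

def dequantize(q, scale, zero_point=0):
    return scale * (q.astype(np.float32) - zero_point)

x = np.array([0.5, -1.25, 3.0], dtype=np.float32)
q = quantize(x, scale=0.001)
print(dequantize(q, scale=0.001))  # recovers x to within one quantization step
```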
Conv2D: Conversion to TOSA
- tosa.conv2d consumes the input zero points; the bias is added to the accumulated result
- tosa.rescale + tosa.clamp perform the output rescaling + activation

```mlir
module attributes {tf_saved_model.semantics, tfl.description = "MLIR Converted.", tfl.schema_version = 3 : i32} {
  func @main(%arg0: tensor<...>) -> (tensor<...>) {
    %0 = "tosa.const"() {value = dense<...> : tensor<...>} : () -> tensor<...>
    %1 = "tosa.const"() {value = dense<...> : tensor<...>} : () -> tensor<...>
    %2 = "tosa.conv2d"(%arg0, %1, %0) {dilation = [1, 1], pad = [0, 0, 0, 0], quantization_info = {input_zp = 0 : i32, weight_zp = 0 : i32}, stride = [1, 1]} : (tensor<...>, tensor<...>, tensor<...>) -> tensor<...>
    %3 = "tosa.rescale"(%2) {double_round = false, input_zp = 0 : i32, multiplier = [21438 : i32, 18643 : i32, 20949 : i32, 19892 : i32, 18542 : i32, 20624 : i32, 20035 : i32, 21773 : i32, 19670 : i32, 31465 : i32, 18895 : i32, 21587 : i32, 31080 : i32, 19230 : i32, 21345 : i32, 20069 : i32], output_zp = 0 : i32, per_channel = true, scale32 = false, shift = [...]} : (tensor<...>) -> tensor<...>
    %4 = "tosa.clamp"(%3) {max_fp = 0.000000e+00 : f32, max_int = 32767 : i64, min_fp = 0.000000e+00 : f32, min_int = 0 : i64} : (tensor<...>) -> tensor<...>
    return %4 : tensor<...>
  }
}
```
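The fixed-point rescale step can be sketched in scalar form. This is a simplified single-rounding approximation of what a tosa.rescale does, not the spec's exact double-rounding routine; the multiplier is the first per-channel value from the listing, while the shift value was lost from the slide and is an assumption here:

```python
import numpy as np

# Simplified sketch of fixed-point requantization in the style of tosa.rescale:
# out = round_to_nearest((acc * multiplier) >> shift) + output_zp, then clamp.
# multiplier=21438 is from the listing above; shift=30 is an assumed value
# (the actual per-channel shifts were lost from the slide).
def rescale(acc, multiplier, shift, output_zp=0):
    acc = acc.astype(np.int64)
    round_bit = np.int64(1) << (shift - 1)
    return ((acc * multiplier + round_bit) >> shift) + output_zp

def clamp(x, lo, hi):
    return np.clip(x, lo, hi)

acc = np.array([100000, -50000, 2500000])   # example int32 accumulator values
out = clamp(rescale(acc, multiplier=21438, shift=30), 0, 32767)
print(out)  # negative accumulators clamp to 0, matching the fused relu bound
```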
Example: Complex MatMul
- PyTorch n-dim matrix multiplication: matmul(4x8x16x32, 8x32x17) -> 4x8x16x17
- Additional artifacts: implicit and explicit broadcasting; shape inference
- Map to the fixed, hardware-friendly rank-3 TOSA matmul

```mlir
module attributes {torch.debug_module_name = "Matmul"} {
  func @forward(%arg0: !torch.vtensor<...>, %arg1: !torch.vtensor<...>) -> !torch.vtensor<...> {
    %0 = torch.aten.matmul %arg0, %arg1 : !torch.vtensor<...>, !torch.vtensor<...> -> !torch.vtensor<...>
    return %0 : !torch.vtensor<...>
  }
}
```
MatMul-ND: Conversion to TOSA
- Transpose + reshape to a canonical form:
  - LHS: common x lhs_bcast x reduction
  - RHS: common x reduction x rhs_bcast

```mlir
module attributes {torch.debug_module_name = "Matmul"} {
  func @forward(%arg0: tensor<...>, %arg1: tensor<...>) -> tensor<...> {
    %0 = "tosa.const"() {value = dense<...> : tensor<...>} : () -> tensor<...>
    %1 = "tosa.const"() {value = dense<...> : tensor<...>} : () -> tensor<...>
    %2 = "tosa.reshape"(%arg1) {new_shape = [1, 8, 32, 17]} : (tensor<...>) -> tensor<...>
    %3 = "tosa.transpose"(%arg0, %1) : (tensor<...>, tensor<...>) -> tensor<...>
    %4 = "tosa.reshape"(%3) {new_shape = [8, 64, 32]} : (tensor<...>) -> tensor<...>
    %5 = "tosa.transpose"(%2, %0) : (tensor<...>, tensor<...>) -> tensor<...>
    %6 = "tosa.reshape"(%5) {new_shape = [8, 32, 17]} : (tensor<...>) -> tensor<...>
    %7 = "tosa.matmul"(%4, %6) : (tensor<...>, tensor<...>) -> tensor<...>
    %8 = "tosa.reshape"(%7) {new_shape = [8, 4, 16, 17]} : (tensor<...>) -> tensor<...>
    %9 = "tosa.transpose"(%8, %1) : (tensor<...>, tensor<...>) -> tensor<...>
    %10 = tensor.cast %9 : tensor<...> to tensor<...>
    return %10 : tensor<...>
  }
}
```
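The same canonicalization can be sketched in NumPy. The shapes come from the talk's example; the permutation choices are mine, picked to match row-major reshapes:

```python
import numpy as np

# Canonicalize matmul(4x8x16x32, 8x32x17) -> 4x8x16x17 into a single
# batched rank-3 matmul via transpose + reshape, as the TOSA lowering does.
lhs = np.random.rand(4, 8, 16, 32)
rhs = np.random.rand(8, 32, 17)

# LHS: move the common batch dim (8) to the front, then fold the broadcast
# dim (4) together with the rows (16): 8 x 64 x 32.
lhs3 = lhs.transpose(1, 0, 2, 3).reshape(8, 4 * 16, 32)
rhs3 = rhs.reshape(8, 32, 17)        # RHS is already common x reduction x cols

out3 = lhs3 @ rhs3                   # rank-3 batched matmul: 8 x 64 x 17
out = out3.reshape(8, 4, 16, 17).transpose(1, 0, 2, 3)   # back to 4x8x16x17

# NumPy's broadcasting matmul gives the reference result directly.
assert np.allclose(out, np.matmul(lhs, rhs))
```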
Example: n-Dim Gather
Input frontend: TensorFlow GatherNd

```mlir
module attributes {tf.versions = {bad_consumers = [], min_consumer = 0 : i32, producer = 1011 : i32}} {
  func @main(%arg0: tensor<...>) -> tensor<...> attributes {tf.entry_function = {control_outputs = "", inputs = "placeholder_0", outputs = "result"}} {
    %cst = "tf.Const"() {device = "", value = dense<...> : tensor<...>} : () -> tensor<...>
    %0 = "tf.GatherNd"(%arg0, %cst) {device = ""} : (tensor<...>, tensor<...>) -> tensor<...>
    return %0 : tensor<...>
  }
}
```
GatherND: Conversion to TOSA
- TensorFlow-to-TOSA conversion: TOSA transpose + reshape + gather

```mlir
module attributes {tf.versions = {bad_consumers = [], min_consumer = 0 : i32, producer = 1011 : i32}} {
  func @main(%arg0: tensor<...>) -> tensor<...> attributes {tf.entry_function = {control_outputs = "", inputs = "placeholder_0", outputs = "result"}} {
    %0 = "tosa.const"() {value = dense<...> : tensor<...>} : () -> tensor<...>
    %1 = "tosa.const"() {value = dense<...> : tensor<...>} : () -> tensor<...>
    %2 = "tosa.reshape"(%arg0) {new_shape = [1, 32, 256]} : (tensor<...>) -> tensor<...>
    %3 = "tosa.mul"(%1, %0) {shift = 0 : i32} : (tensor<...>, tensor<...>) -> tensor<...>
    %4 = "tosa.reduce_sum"(%3) {axis = 1 : i64} : (tensor<...>) -> tensor<...>
    %5 = "tosa.reshape"(%4) {new_shape = [1, 9]} : (tensor<...>) -> tensor<...>
    %6 = "tosa.gather"(%2, %5) : (tensor<...>, tensor<...>) -> tensor<...>
    %7 = "tosa.reshape"(%6) {new_shape = [3, 3, 32, 8]} : (tensor<...>) -> tensor<...>
    return %7 : tensor<...>
  }
}
```
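The mul + reduce_sum pair in this lowering linearizes multi-dimensional indices into flat offsets so a plain gather suffices. A NumPy sketch of the idea (the shapes here are illustrative, not the example's):

```python
import numpy as np

# GatherNd over the first two dims of a params tensor, lowered to a flat
# gather: multiply each index tuple by row-major strides, sum to get a
# linear offset, then index the flattened tensor. Shapes are illustrative.
params = np.arange(4 * 8 * 16).reshape(4, 8, 16)
indices = np.array([[1, 2], [3, 0], [0, 7]])   # three (i, j) index pairs

strides = np.array([8, 1])                     # row-major strides of (4, 8)
flat_idx = (indices * strides).sum(axis=1)     # the mul + reduce_sum step
out = params.reshape(4 * 8, 16)[flat_idx]      # flat gather, result 3x16

# Matches direct n-dim indexing.
assert np.array_equal(out, params[indices[:, 0], indices[:, 1]])
```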
TOSA to Code Gen
- Implemented by the Google IREE MLIR team
- Example lowering passes:
  - -tosa-to-arith: lower TOSA to the Arith dialect
  - -tosa-to-linalg: lower TOSA to LinAlg on tensors
  - -tosa-to-linalg-named: lower TOSA to LinAlg named operations
  - -tosa-to-scf: lower TOSA to the SCF dialect
  - -tosa-to-tensor: lower TOSA to the Tensor dialect

```mlir
module attributes {tf_saved_model.semantics, tfl.description = "MLIR Converted.", tfl.schema_version = 3 : i32} {
  func.func @main(%arg0: tensor<...>) -> (tensor<...>) {
    %0 = "tosa.max_pool2d"(%arg0) {kernel = [2, 2], pad = [0, 1, 0, 1], stride = [1, 1]} : (tensor<...>) -> tensor<...>
    return %0 : tensor<...>
  }
}
```

Example: TOSA to LinAlg

$ mlir-opt -pass-pipeline='func.func(tosa-to-linalg-named,tosa-to-linalg)' maxpool.mlir

```mlir
module attributes {tf_saved_model.semantics, tfl.description = "MLIR Converted.", tfl.schema_version = 3 : i32} {
  func.func @main(%arg0: tensor<...>) -> (tensor<...>) {
    %cst = arith.constant -3.40282347E+38 : f32
    %0 = tensor.pad %arg0 low[0, 0, 0, 0] high[0, 1, 1, 0] {
    ^bb0(%arg1: index, %arg2: index, %arg3: index, %arg4: index):
      tensor.yield %cst : f32
    } : tensor<...> to tensor<...>
    %cst_0 = arith.constant -3.40282347E+38 : f32
    %1 = linalg.init_tensor [1, 32, 32, 8] : tensor<...>
    %2 = linalg.fill ins(%cst_0 : f32) outs(%1 : tensor<...>) -> tensor<...>
    %3 = linalg.init_tensor [2, 2] : tensor<...>
    %4 = linalg.pooling_nhwc_max {dilations = dense<...> : vector<...>, strides = dense<...> : vector<...>} ins(%0, %3 : tensor<...>, tensor<...>) outs(%2 : tensor<...>) -> tensor<...>
    return %4 : tensor<...>
  }
}
```
TOSA: Current Status
- Part of the core MLIR dialect set
- Significant support infrastructure around the dialect: a reference model; a large unit-testing infrastructure
- Multiple stable frontend consumption paths
- Thousands of models run (at Arm, Google, and elsewhere)
- Hardware development at Arm and elsewhere
- Collaboration and interest across the MLIR ecosystem

Mid-level IR Design: Reflections from TOSA
- Defining the overall requirement is critical: close to the frontend? co-design friendly? spec-backed? other?
- Define principles and/or write a rationale document
- Development and connectivity were significant efforts: multiple person-years; inter-company collaboration (Arm, Google, AMD, and more)
- What kind of quarterback do you want?

Summary
- ML compiler developers may have to support a range of capabilities present across multiple frameworks
- There's a substantial gap in abstraction between framework-level ops and backend code-generation patterns
- Choosing or developing an appropriate mid-level IR is critical to effectively connect the framework and low-level code gen
- Developers can leverage the experience that went into existing mid-level IRs to make the right design choices

Acknowledgments
- Arm ML Technology and Engineering teams
- Google IREE team and the MLIR community

Thank You | Danke | Gracias | Grazie | 谢谢 | Asante | Merci | Kiitos

The Arm trademarks featured in this presentation are registered trademarks or trademarks of Arm Limited (or its subsidiaries) in the US and/or elsewhere. All rights reserved. All other marks featured may be trademarks of their respective owners.