Best Practices for Deploying CLIP Models Online at Scale

Wang Feng, felix.wang@jina.ai (@Hnumb3r3)
QCon 全球软件开发大会 (QCon Global Software Development Conference) · InfoQ

About the speaker: Wang Feng is a core contributor to the open-source MLOps framework Jina, focused on putting machine-learning and deep-learning algorithms to work in NLP, multimodal representation learning, and information retrieval.
- 2021-now: Engineering Manager, Jina AI
- 2020-21: Senior Researcher, Huya AI
- 2018-19: Senior Researcher, Tencent AI
- 2011-18: Ph.D., Hong Kong Baptist University

The most advanced MLOps platform for multimodal AI: Jina AI is a commercial open-source software company that builds MLOps platform tools for multimodal AI applications. The Jina AI open-source community promotes the adoption and spread of multimodal AI, using AI and deep-learning technology to help developers and companies cut the cost of learning and development and speed up building and deployment.

Jina AI is headquartered in Berlin, Germany, with offices in San Jose, Beijing, Shenzhen and Barcelona; more than two thirds of its employees are based overseas.

slack.jina.ai · get.jina.ai
GitHub: 36.5K stars. "Build neural search and creative AI services on the cloud at scale."

Agenda
01 An introduction to the CLIP model
02 The CLIP-as-service framework
03 Inference acceleration techniques

The CLIP model: text and images

OpenAI released CLIP (Contrastive Language-Image Pre-training) in January 2021. It is a pre-training method, and model, built on contrastive text-image pairs. CLIP broke down the long-standing wall between the natural-language-processing and computer-vision camps and realized a multimodal AI system.

CLIP training details

Training:
- 400 million image-text pairs collected from the internet
- a batch size of 30,000+

Architecture:
- the image encoder can be either a ViT or a ResNet
- the text encoder is a Transformer

| Hyperparameter      | Value                        |
|---------------------|------------------------------|
| Batch size          | 32768                        |
| Vocabulary size     | 49408                        |
| Training epochs     | 32                           |
| Maximum temperature | 100.0                        |
| Weight decay        | 0.2                          |
| Warm-up iterations  | 2000                         |
| Adam β1             | 0.9                          |
| Adam β2             | 0.999 (ResNet), 0.98 (ViT)   |
| Adam ε              | 10^-8 (ResNet), 10^-6 (ViT)  |

The CLIP model: cross-modal text-image retrieval

[Figure: a text query such as "A football ..." passes through the CLIP text encoder while images pass through the CLIP image encoder; retrieval ranks images by similarity in the shared embedding space.]

The CLIP model: zero-shot classification

1. Prepare the prompts (one per candidate label).
2. Run zero-shot inference: embed the image, embed every prompt, and pick the best-matching prompt.
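The two steps can be sketched in a few lines of numpy. The embeddings below are made-up stand-ins (in real use they would come from the CLIP text and image encoders), and the ×100 logit scaling mirrors CLIP's temperature; only the similarity-plus-softmax logic is the point here.

```python
import numpy as np

def zero_shot_scores(image_emb, prompt_embs, logit_scale=100.0):
    # cosine similarity between the image and every class prompt,
    # turned into softmax percentages like the demo output
    a = image_emb / np.linalg.norm(image_emb)
    B = prompt_embs / np.linalg.norm(prompt_embs, axis=1, keepdims=True)
    logits = logit_scale * (B @ a)
    e = np.exp(logits - logits.max())
    return e / e.sum()

# toy stand-in embeddings; class 0 is deliberately closest to the image
image = np.array([1.0, 0.2, 0.0])
prompts = np.array([
    [0.9, 0.1, 0.0],   # "This is a photo of a dog"
    [0.2, 1.0, 0.0],   # "This is a photo of a cat"
    [0.0, 0.0, 1.0],   # "This is a blurry photo"
])
probs = zero_shot_scores(image, prompts)
```

In real use, the prompt embeddings for a label set can be computed once and cached, so classifying a new image is a single matrix-vector product.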
[Figure: example datasets comparing a supervised ResNet with zero-shot CLIP scores.]

The CLIP model: image generation

Where does the supervision come from? At every diffusion step:
- cut the intermediate image into small patches,
- ask CLIP to score the patches for guidance,
- use the guidance to steer the direction of the next step.
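That loop can be mimicked with a deliberately tiny toy. Here `clip_guidance_score` is a stand-in for CLIP (it just scores 4×4 patches against a target image) and the "diffusion step" is random perturbation plus greedy selection; every name in it is invented for illustration, and the real pipeline typically backpropagates CLIP's score instead of sampling candidates.

```python
import numpy as np

rng = np.random.default_rng(0)
target = np.full((8, 8), 0.7)   # stand-in for "the image the text describes"

def clip_guidance_score(image):
    # stand-in for CLIP: score each small patch, then average
    # (the real pipeline embeds every patch and compares it to the text embedding)
    scores = []
    for i in (0, 4):
        for j in (0, 4):
            patch = image[i:i + 4, j:j + 4]
            scores.append(-np.abs(patch - target[i:i + 4, j:j + 4]).mean())
    return float(np.mean(scores))

def guided_steps(image, n_steps=50, n_candidates=8, step=0.05):
    # at every "diffusion" step: propose noisy candidates and steer
    # toward the one the (stand-in) CLIP scores highest
    for _ in range(n_steps):
        candidates = [image] + [image + step * rng.standard_normal(image.shape)
                                for _ in range(n_candidates)]
        image = max(candidates, key=clip_guidance_score)
    return image

start = np.zeros((8, 8))
final = guided_steps(start)
```

The per-patch scoring is what lets CLIP guide parts of the image independently, which is the trick the slide's bullet list describes.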
Demo (done in 314 ms), reasoning results with scores in softmax:

| prompt                          | score |
|---------------------------------|-------|
| This is a photo of dog          | 86.26 |
| This is a photo of an animal    | 6.28  |
| This is a photo of cat          | 4.64  |
| This is a blurry photo          | 1.67  |
| This is a black and white photo | 0.76  |

GitHub: jina-ai/discoart

CLIP-as-service: serving CLIP model inference

Supported model sources: OpenAI, LAION-AI, Hugging Face.

    Flow is ready to serve!
    Protocol   GRPC
    Local      0.0.0.0:51000
    Private    192.168.31.160:51000
    Public     217.70.138.123:51000
Jina tech stack: scaling a service

A Flow YAML file declares the service topology; `replicas` scales an Executor horizontally:

    jtype: Flow
    executors:
      - name: encoder
        uses: CLIPEncoder
        uses_with:
          model_name: ViT-B/32
          device: cuda
        replicas: 3

[Diagram: the gateway routes requests to the encoder, which runs three CLIPEncoder replicas.]

Jina tech stack: monitoring a service

    jtype: Flow
    with:
      monitoring: true
    executors:
      - name: encoder
        uses: CLIPEncoder
        uses_with:
          model_name: ViT-B/32
          device: cuda
        replicas: 3

- OpenTelemetry: collects tracing and metrics data
- Jaeger + Prometheus: receive and aggregate the data
- Grafana: front-end dashboards

[Grafana dashboard "General/clip-as-service", life cycle of a request: receiving-gateway 65.6 ms, sending-gateway 64.0 ms, gateway/worker network 44.3 ms, visual-model inference 35.9 ms, receiving-exec 19.7 ms, processing-encode 19.1 ms, image preprocessing 16.8 ms, text-model inference 11.7 ms, text preprocessing 7.47 ms; plus counters such as requests processed (1913), documents processed per endpoint (14520), requests per endpoint (1920), and the `jina_receiving_request_seconds_sum` time series.]
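The dashboard above is built from Prometheus series such as `jina_receiving_request_seconds_sum`, which is visible in the panel. Assuming the usual Prometheus convention that a matching `..._seconds_count` series also exists (an assumption, not something shown on the slide), the average receive latency per runtime could be plotted with a query along these lines:

```promql
# average seconds spent receiving a request, over the last minute
rate(jina_receiving_request_seconds_sum[1m])
  / rate(jina_receiving_request_seconds_count[1m])
```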
Jina tech stack: cloud-native deployment

- One-click export of Kubernetes or docker-compose configuration files
- Integrates with Grafana, Prometheus and fluentd

    f.to_kubernetes_yaml('./k8s_grpc_yaml', k8s_namespace='clip-server')

Generated layout:

    k8s_grpc_yaml/
    ├── clip_t/
    │   └── clip-t.yml
    └── gateway/
        └── gateway.yml

Jina tech stack: the JCloud hosted service

    $ pip install jcloud
    $ jc deploy clip_flow.yml

- one-click deployment
- a fully hosted service
- monitoring and logging included

[Grafana panel, executor-level metrics: 2516 requests processed by each of encoder-clip and indexer.]

CLIP-as-service hosted inference (cloud.jina.ai/user/apps/inference-api)

Select your plan:
- Free tier: ViT-L/14-336px hosted completely free; 15,000 queries per month; 8 embeddings (images or text) per query.
- Premium tier: coming soon.
Track record of the free tier ($0):
- served 1,000+ users
- handled a peak request rate of 462 docs/second
- ran for three months of continuous, incident-free service

[Web console: generate an access token, then paste it to test the demos and personalize the code snippets.]
[Demo tabs: Text & Image Embedding, and Visual Reasoning. You can input a sentence or an image URL, upload your own image, or click one of the examples, e.g. "First do it, then do it right, then do it better".]

Jina is all you need!

Jina: Document, Executor & Flow
- Document is the basic data structure in Jina.
- Executor is a group of functions that take Documents as input and output.
- Flow is how Jina streamlines and distributes Executors.

Jina: production ready?
- Replicas, sharding, scalability
- Duplex streaming
- Async, non-blocking data processing
- gRPC, WebSockets, HTTP and GraphQL gateways
- Microservices, Docker containerization
- Observability via Prometheus and Grafana
- Hub plugin ecosystem
- Seamless Kubernetes integration

Design of Jina
[Figure: Jina's layers of abstraction, from Document and Executor up to the Flow.]

Inference acceleration in practice

Inference stacks considered: PyTorch, ONNX Runtime, TensorRT. Techniques covered in the rest of this talk:
- mixed half-precision (fp16) inference
- the AITemplate inference engine
- FlashAttention
- CUDA Graph + Dynamo
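As a quick aside before the acceleration techniques: the Document/Executor/Flow trio described above can be illustrated with a tiny framework-free sketch. This is plain Python mimicking the pattern, not Jina's actual API; `NormalizeText` and `FakeEncoder` are made-up stand-ins for real Executors.

```python
class Document:
    """Toy Document: carries data and, later, an embedding."""
    def __init__(self, text):
        self.text = text
        self.embedding = None

class NormalizeText:
    """Toy Executor: takes Documents in, returns Documents out."""
    def __call__(self, docs):
        for d in docs:
            d.text = d.text.strip().lower()
        return docs

class FakeEncoder:
    """Toy Executor: attaches a stand-in 'embedding' to each Document."""
    def __call__(self, docs):
        for d in docs:
            d.embedding = [float(len(d.text))]
        return docs

class Flow:
    """Toy Flow: chains Executors; the real Flow also replicates and distributes them."""
    def __init__(self, *executors):
        self.executors = executors

    def post(self, docs):
        for ex in self.executors:
            docs = ex(docs)
        return docs

docs = Flow(NormalizeText(), FakeEncoder()).post([Document("  Hello World ")])
```

In real Jina each Executor runs as its own, possibly replicated, microservice behind the gateway, which is what makes replica-based scaling work.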
Acceleration #1: mixed half-precision (fp16) inference

The conversion walks the model and casts the relevant parameters to fp16 (reassembled from the slide):

    def convert_weights(model: nn.Module):
        """Convert applicable model parameters to fp16"""

        def _convert_weights_to_fp16(l):
            if isinstance(l, (nn.Conv1d, nn.Conv2d, nn.Linear)):
                l.weight.data = l.weight.data.half()
                if l.bias is not None:
                    l.bias.data = l.bias.data.half()

            if isinstance(l, nn.MultiheadAttention):
                for attr in [*[f'{s}_proj_weight' for s in ['in', 'q', 'k', 'v']],
                             'in_proj_bias', 'bias_k', 'bias_v']:
                    tensor = getattr(l, attr)
                    if tensor is not None:
                        tensor.data = tensor.data.half()

            for name in ['text_projection', 'proj']:
                if hasattr(l, name):
                    attr = getattr(l, name)
                    if attr is not None:
                        attr.data = attr.data.half()

        model.apply(_convert_weights_to_fp16)

Dataset: CIFAR10; task: zero-shot classification; accuracy reported as fp16 (Δ vs fp32):

| model                             | acc1          | acc5          | mean recall   |
|-----------------------------------|---------------|---------------|---------------|
| RN50::openai                      | 71.55 (-0.03) | 98.11 (-0.04) | 71.50 (-0.01) |
| ViT-B-16::laion400m_e31           | 91.70 (-0.01) | 99.72 (0.00)  | 91.69 (0.00)  |
| ViT-B-32-quickgelu::laion400m_e32 | 90.72 (+0.03) | 99.79 (0.00)  | 90.74 (+0.03) |
| ViT-L-14::laion400m_e31           | 94.66 (0.00)  | 99.90 (0.00)  | 94.67 (0.00)  |
| ViT-H-14::laion400m_e31           | 97.45 (0.00)  | 99.93 (0.00)  | 97.44 (+0.01) |
| ViT-g-14::laion2b_s12b_b42k       | 97.06 (fp32 ran out of memory) | 99.93 | 97.08 |

Dataset: VOC2007; task: zero-shot classification; precision reported as fp16 (Δ vs fp32):

| model                             | precision          |
|-----------------------------------|--------------------|
| RN50::openai                      | 74.83750 (-0.0052) |
| ViT-B-16::laion400m_e31           | 78.35778 (-0.0043) |
| ViT-B-32-quickgelu::laion400m_e32 | 76.27471 (+0.0071) |
| ViT-L-14::laion400m_e31           | 78.53070 (+0.0009) |
| ViT-H-14::laion2b_s32b_b79k       | 80.12784 (+0.0039) |
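The tables above show that rounding weights to fp16 barely moves accuracy. The intuition can be checked in a few lines of numpy: round a fake fp32 weight matrix to fp16 and measure how much a linear layer's output moves (an illustrative sketch, not the benchmark from the slides).

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((512, 512)).astype(np.float32)   # a stand-in fp32 weight
x = rng.standard_normal((8, 512)).astype(np.float32)     # a stand-in activation batch

y32 = x @ W                                    # full-precision output
W16 = W.astype(np.float16).astype(np.float32)  # weights rounded to fp16, as .half() does
y16 = x @ W16

# fp16 keeps ~11 bits of mantissa, so the relative error stays tiny
rel_err = np.abs(y16 - y32).max() / np.abs(y32).max()
```

The per-weight rounding error is around 2^-11 and largely averages out over the 512-term dot products, which is why the zero-shot metrics above shift only in the second decimal place.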
Acceleration #2: the AITemplate inference engine

Soumith Chintala (@soumithchintala): "We just released AITemplate a high-performance Inference Engine similar to TensorRT but open-source. It is really fast! On StableDiffusion, it is 2.5x faster than the XLA based version released last week."

Meta AI: "Get faster, more flexible inference on GPUs using our newly open-sourced AITemplate, a revolutionary new inference engine that delivers up to 12X performance improvements on NVIDIA GPUs & 4X on AMD GPUs compared to eager-mode within Pytorch."

[Figure: AIT speedup relative to PyTorch eager mode on an A100 with CUDA 11.6.]

AITemplate is a Python framework that converts AI models into high-performance C++ GPU template code, with a design focused on performance and on keeping the system simple. It has two layers: a front end that performs graph optimization, and a back end that generates C++ template code for the target GPU.

Tested on an RTX 3080 with ViT-L-14::laion2b_s32b_b82k:

| shape            | pt (ms) | ait (ms) | ait without flash_attn (ms) |
|------------------|---------|----------|-----------------------------|
| (1, 77)          | 8.6888  | 0.8523   | 1.6269 |
| (2, 77)          | 8.7543  | 0.9854   | 2.0161 |
| (4, 77)          | 8.7231  | 1.2459   | 2.8970 |
| (8, 77)          | 9.4466  | 2.0201   | 4.8552 |
| (16, 77)         | 10.0222 | 3.4399   | 8.7880 |
| (1, 224, 224, 3) | 18.0799 | 3.7753   | 8.4608 |
| (2, 224, 224, 3) | 17.9421 |          | 8.4604 |

(The ait value in the last row is missing on the original slide.)
Acceleration #3: FlashAttention

[Figure: the memory hierarchy, with bandwidth and size per level: GPU SRAM at 19 TB/s (20 MB), GPU HBM at 1.5 TB/s (40 GB), main memory (CPU DRAM) at 12.8 GB/s (>1 TB). In PyTorch, attention (matmul → mask → softmax → dropout → matmul, over Q: N×d, K^T: d×N, V: N×d and the N×N scores of sm(QK^T)V) writes its intermediates to HBM; FlashAttention runs as one fused kernel that copies blocks into SRAM, computes on them there in inner and outer loops, and writes only the output back.]
The traditional strategies for optimizing attention target FLOPs, for example via sparse or low-rank approximations. FlashAttention instead optimizes GPU memory IO: it avoids repeatedly reading and writing the attention matrix to HBM, and performs as much of the computation as possible in SRAM.

Model: ViT-L-14::laion2b_s32b_b82k

| shape             | baseline (s) | flash_attn (s) | speedup (x) | mean diff |
|-------------------|--------------|----------------|-------------|-----------|
| (1, 77)           | 0.81193      | 0.66437        | 1.22211     | 0.00066   |
| (2, 77)           | 0.8264       | 0.70035        | 1.17998     | 0.0007    |
| (4, 77)           | 0.81998      | 0.69887        | 1.1733      | 0.00064   |
| (8, 77)           | 1.05975      | 0.85742        | 1.23597     | 0.00054   |
| (16, 77)          | 2.12992      | 1.68367        | 1.26504     | 0.00057   |
| (1, 3, 224, 224)  | 2.00593      | 1.34507        | 1.49131     | 0.00155   |
| (2, 3, 224, 224)  | 3.74493      | 2.29818        | 1.62952     | 0.00122   |
| (4, 3, 224, 224)  | 7.35365      | 4.44447        | 1.65456     | 0.00159   |
| (8, 3, 224, 224)  | 14.63006     | 9.05604        | 1.6155      | 0.00135   |
| (16, 3, 224, 224) | 29.63732     | 18.29142       | 1.62028     | 0.00155   |
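The IO argument above is easiest to see in code. Below is a small numpy sketch of the tiling trick FlashAttention builds on: stream over K/V in blocks while maintaining a running softmax max and denominator, so the full N×N score matrix never has to exist at once. This is only the numerical skeleton (no GPU, no fused kernel), but it produces the same output as naive attention.

```python
import numpy as np

def naive_attention(Q, K, V):
    # materializes the full N x N score matrix (the HBM-hungry way)
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    return (P / P.sum(axis=-1, keepdims=True)) @ V

def tiled_attention(Q, K, V, block=4):
    # FlashAttention-style: visit K/V in blocks, keep a running max (m)
    # and a running softmax denominator (denom); never store N x N scores
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros((N, d))
    m = np.full(N, -np.inf)
    denom = np.zeros(N)
    for j in range(0, K.shape[0], block):
        Kj, Vj = K[j:j + block], V[j:j + block]
        S = (Q @ Kj.T) * scale                 # scores for this block only
        m_new = np.maximum(m, S.max(axis=-1))
        alpha = np.exp(m - m_new)              # rescales what was accumulated so far
        P = np.exp(S - m_new[:, None])
        out = out * alpha[:, None] + P @ Vj
        denom = denom * alpha + P.sum(axis=-1)
        m = m_new
    return out / denom[:, None]

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((10, 8)) for _ in range(3))
```

On a GPU the pay-off is that each block of S and P lives in SRAM; here the sketch just demonstrates that the blockwise rescaling is exact.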
Acceleration #4: CUDA Graph

CUDA Graph, available since CUDA 10, is a mechanism for launching a whole sequence of GPU kernels with a single CPU operation, which removes most of the per-kernel launch overhead. torch.fx is used first to capture and transform the PyTorch model.

A minimal symbolic-tracing example (reassembled from the slide):

    import torch
    from torch.fx import symbolic_trace, GraphModule

    def my_func(x):
        return torch.relu(x).neg()

    # Program capture via symbolic tracing
    traced: GraphModule = symbolic_trace(my_func)
    print(traced.code)
    # def forward(self, x):
    #     relu = torch.relu(x);  x = None
    #     neg = relu.neg();  relu = None
    #     return neg

    for n in traced.graph.nodes:
        print(f'{n.name} = {n.op} target={n.target} args={n.args}')
    # x = placeholder target=x args=()
    # relu = call_function target=relu args=(x,)
    # neg = call_method target=neg args=(relu,)
    # output = output target=output args=(neg,)

Capturing the traced CLIP vision model into a CUDA graph (reassembled from the slide):

    from clip_server.model.openclip_model import OpenCLIPModel

    model = OpenCLIPModel(name='ViT-B-32::openai', device='cuda')._model.visual

    # Symbolic tracing frontend - captures the semantics of the module
    graph_module: torch.fx.GraphModule = symbolic_trace(model)

    static_inputs = [torch.zeros_like(x, device='cuda') for x in example_inputs]

    stream = torch.cuda.Stream()
    stream.wait_stream(torch.cuda.current_stream())

    # Build CUDAGraph
    graph = torch.cuda.CUDAGraph()
    with torch.cuda.graph(graph, stream=stream, pool=torch.cuda.graph_pool_handle()):
        static_outputs = graph_module(*static_inputs)
        if not isinstance(static_outputs, (list, tuple)):
            static_outputs = (static_outputs,)

    # forward with CUDAGraph: copy new data into the static input
    # addresses, then replay the captured graph
    def run(*new_inputs):
        for dst, src in zip(static_inputs, new_inputs):
            dst.copy_(src)
        graph.replay()
        return static_outputs
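The essential pattern in the code above is "capture once on static buffers, then replay". A framework-free toy (pure Python, nothing to do with real CUDA; `ToyGraph` is invented for illustration) shows why inputs must be copied into `static_inputs` rather than passed in: replay re-executes the recorded ops against the same memory addresses.

```python
class ToyGraph:
    """Toy capture/replay: record a fixed op sequence once, rerun it with one call."""

    def __init__(self):
        self.ops = []

    def capture(self, fn):
        self.ops.append(fn)   # remember the op for later replays...
        fn()                  # ...and run it once during capture (like warm-up)

    def replay(self):
        for fn in self.ops:   # a single "launch" replays the whole sequence
            fn()

# static buffers: every replay reads and writes these same addresses,
# mirroring static_inputs/static_outputs in the CUDA Graph code
static_in = [0.0]
static_out = [0.0]

g = ToyGraph()
g.capture(lambda: static_out.__setitem__(0, static_in[0] * 2.0 + 1.0))

def run(x):
    static_in[0] = x          # copy fresh input into the captured buffer
    g.replay()
    return static_out[0]
```

This is also why the technique needs static shapes: the recorded sequence and its buffer sizes are frozen at capture time.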
Acceleration #4: Dynamo

Dynamo is the new-generation tracing tool shipped with Torch 2.0. [Figure: Dynamo hooks into CPython's frame evaluation (PyEval_EvalFrameDefault) to capture graphs just in time.]

Reassembled from the slide:

    def optimize_model_dynamo(
        original_model: Union[CLIPTextTransformer, CLIPVisionTransformer],
        pool=torch.cuda.graph_pool_handle(),
    ) -> Callable:
        def compiler(gm: torch.fx.GraphModule, example_inputs: List[torch.Tensor]):
            return cuda_graphs_wrapper(gm, example_inputs, pool=pool)

        @torchdynamo.optimize(compiler)
        def run(*args, **kwargs):
            return original_model.forward(*args, **kwargs)

        return run

Performance on the text model (times in seconds; paired values are reproduced as printed on the slide):

| shape    | pt (s) | graph (s)         | speedup (x)     | GPU (MB) / RES (MB) |
|----------|--------|-------------------|-----------------|---------------------|
| (1, 77)  | 0.0068 | 0.00182 / 0.00309 | 3.6558 / 2.1231 | 1857 / 3011 / 2993  |
| (2, 77)  | 0.0073 | 0.00216 / 0.00419 | 3.0058 / 1.6446 | 1859 / 3005 / 3132  |
| (4, 77)  | 0.0070 | 0.00294 / 0.00753 | 2.2076 / 0.9188 | 1859 / 3087 / 3043  |
| (8, 77)  | 0.0071 | 0.00450 / 0.01200 | 2.3033 / 0.5882 | 1863 / 3369 / 3315  |
| (16, 77) | 0.0087 | 0.00791 / 0.02126 | 1.0675 / 0.4427 | 1883 / 3619 / 3591  |

Challenges and limitations
- AITemplate and CUDA Graph + Dynamo currently support only static shapes, which restricts the scenarios they can be applied to.
- Still, we believe the trade-off is worth it for the performance it buys, especially in scenarios that are naturally static-shape, such as content-generation models like Stable Diffusion and GPT.
- We look forward to further improvements in Torch 2.x around model export and inference optimization.

Summary and outlook
- The CLIP model broke down the wall between natural language processing and computer vision.
- CLIP-as-service implements a CLIP inference service on top of the open-source Jina MLOps platform.
- Hands-on practice with the latest inference acceleration techniques:
  - mixed-precision fp16 inference
  - FlashAttention
  - AITemplate
  - CUDA Graph + Dynamo
- The model-inference ecosystem may change very quickly in 2023; the release of Torch 2.0 marks a Torch production ecosystem that matures by the day. Let's wait and see!

Scan the QR code to join the Jina technical discussion group.

The Jina AI open-source community promotes the adoption and spread of multimodal AI, using AI and deep-learning technology to help developers and companies cut learning and development costs and speed up development and deployment.

Website: https://jina.ai/
GitHub: https://github.com/jina-ai
Join the global developer community: https://slack.jina.ai/