NVIDIA: Optimizing Large-Scale Deployment of Deep Learning Inference with Triton
徐添豪, 张雪萌, 黄孟迪
#page#
AGENDA
- Triton Overview
- Inference Server Pipeline
- A100 Multi-Instance GPU (MIG)
- Deployment on Kubernetes
- Integration with KFServing
- Metrics for Monitoring and Autoscaling
- Performance Analyzer: Optimization Guidance
- Customer Case Studies
#page#
Triton Overview
#page#
Inefficiency Limits Innovation
Difficulties with Deploying Data Center Inference
- Single Framework Only: Solutions can only support models from one framework
- Single Model Only: Some systems are overused while others are underutilized
- Custom Development: Developers need to reinvent the plumbing for every application
[Figure: framework and workload logos, including Chainer, PyTorch, TensorFlow, theano, RecSys, NLP, ASR]
#page#
NVIDIA Triton Inference Server
Production Inference Server on GPU and CPU
- Maximize real-time inference performance of CPUs and GPUs
- Quickly deploy and manage multiple models per GPU per node
- Easily scale to heterogeneous GPUs and multi-GPU nodes
- Integrates with orchestration systems and auto scalers via latency and health metrics
- Open source for seamless customization and integration
[Figure: supported hardware: NVIDIA T4, Tesla V100, NVIDIA A100, CPU]
#page#
Triton Inference Server Architecture
Previously "TensorRT Inference Server"
- Support for multiple frameworks
- Concurrent model execution
- CPU and multi-GPU support
- Dynamic batching
- Sequence batching for stateful models
- HTTP/REST, gRPC, shared library
- Health and status metrics (Prometheus) reporting
- Model ensembling and pipelining
- Shared-memory API (system and CUDA)
- GCS and S3 support
- Open source, monthly releases on NGC and GitHub
#page#
Features
Columns: Utilization | Usability | Customization | Performance
- Concurrent Model Execution: Multiple models (or multiple instances of the same model) may execute on the GPU simultaneously
- Dynamic Batching: Inference requests can be batched up by the inference server to 1) the model-allowed maximum or 2) the user-defined latency SLA
- CPU Model Inference Execution: Framework native models can execute inference requests on the CPU
- Metrics: Utilization, count, memory, and latency
- Multiple Model Format Support: TensorRT; PyTorch JIT (.pt); TensorFlow 1.x GraphDef/SavedModel; TensorFlow+TensorRT 1.x GraphDef; TensorFlow 2.x SavedModel; TensorFlow+TensorRT 2.x SavedModel; ONNX graph (ONNX Runtime)
- Streaming API: Built-in support for audio streaming input. Accommodates stateful/sequence models that have a sequence of inputs to keep track of (speech, translation, etc.)
- Model Control API: Explicitly load/unload models into and out of Triton based on changes made in the model-control configuration
- Model Ensemble: Pipeline of one or more models and the connection of input and output tensors between those models (can be used with custom backend)
- Custom Backend for C++ and Python: Custom backend allows the user more flexibility by providing their own implementation of an execution engine through the use of a shared library
- System/CUDA Shared Memory: Inputs/outputs needed to be passed to/from Triton are stored in system/CUDA shared memory. Reduces HTTP/gRPC overhead
- Library Version: Link against libtrtserver.so so that you can include all the inference server functionality directly in your application
#page#
TRITON 2.5
What Is New
- KFServing's new community standard gRPC and HTTP/REST data plane v2 protocol: Easily deploy serverless inferencing with Triton in Kubernetes
- Data Loading Library (DALI) backend: Allows for accelerated pre-processing and augmentation pipelines within Triton for images, videos, and speech
- Decoupled inference serving: Engages a model once sufficient but not all inputs are received, e.g. speech recognition and synthesis
- Python custom backend: Allows Python code execution inside Triton (e.g. pre- and post-processing)
- Triton Model Analyzer: Tools to characterize model performance and memory footprint for efficient serving
- Support for A100, MIG: Higher performance inference serving. Triton on MIG with performance and fault isolation
- DeepStream 5.0 Integration: Native integration in DeepStream 5.0 for multi-framework, multi-sensor streaming analytics
- Latest framework backends: TensorRT 7.1, TensorFlow 2.2, PyTorch 1.6, ONNX RT 1.5.3
- Azure Machine Learning Integration: Azure ML integrated Triton as the platform's inference server to deploy models at scale
- Google CAIP Integration: Triton is now available on GCP CAIP as a custom container
#page#
DEVELOPERS CAN FOCUS ON MODELS AND APPLICATIONS
Triton Takes Care of Plumbing To Deploy Models for Inference
- Multiple Frameworks (TensorFlow, ONNX Runtime, PyTorch, Custom): All Major Framework Backends For Flexibility & Consistency
- Inferencing on GPU and CPU: Inference Serving on GPU & CPU Across Cloud | Data Center | Edge, Bare metal | Virtualization
- Different Types of Queries (Batch, Real time, Stream, Ensemble): Support For Different Types Of Inference Queries For Different Use Cases
- Dynamic Batching: Dynamic Batching Maximizes Throughput Under Latency Constraint
- Concurrent Model Execution For High Throughput & Utilization
- Standard HTTP/gRPC Communication
#page#
Designed for DevOps/MLOps
Triton Integrates Easily In Organizations' Workflow For ALL AI Use Cases
- Kubernetes Integration: Scalable Microservice In Kubernetes; Helm Chart For Fast Deployment; KFServing Integration; GPU Util., Memory, Inference Load & Latency Metrics
- MLOps: Live Model Updates (Dynamic Model Loading); Triton Model Analyzer; Google CAIP, Azure ML Integration
- Open Source & Customizable: Completely Open Source; Inspect, Customize & Extend; Customizable Container; Modular Backends For Low Memory Footprint
#page#
Inference Server Pipeline
#page#
Inference Pipeline
Typical Pipeline
[Figure: typical inference pipeline between the client and Triton; labeled stages include request serialization, network, request receive, de-serialization, queue, compute, serialization, and inference request completed]
#page#
Running Triton
Triton Docker Container Available on NGC
Prerequisite: Docker and nvidia-docker installed

    $ docker pull nvcr.io/nvidia/tritonserver:20.11-py3
    $ docker run --gpus=1 --rm -p8000:8000 -p8001:8001 -p8002:8002 \
        -v/path/to/model/repository:/models \
        nvcr.io/nvidia/tritonserver:20.11-py3 \
        tritonserver --model-repository=/models

Model configuration:

    name: "bert_tf_v2_large_fp16_128_v2"
    platform: "tensorflow_savedmodel"
    max_batch_size: 1
    input [
      { name: "unique_ids"   data_type: TYPE_INT32  dims: [ 1 ]  reshape: { shape: [ ] } },
      { name: "segment_ids"  data_type: TYPE_INT32  dims: [ 128 ] },
      { name: "input_ids"    data_type: TYPE_INT32  dims: [ 128 ] },
      { name: "input_mask"   data_type: TYPE_INT32  dims: [ 128 ] }
    ]
    output [
      { name: "end_logits"    data_type: TYPE_FP32  dims: [ 128 ] },
      { name: "start_logits"  data_type: TYPE_FP32  dims: [ 128 ] }
    ]
    instance_group [
      { count: 1  kind: KIND_GPU  gpus: [ 0 ] }
    ]
#page#
Running Inference
Available at NVIDIA Deep Learning Examples GitHub
QA - Tokenization

    import tritonclient.http as httpclient

    model_name = "bert_tf_v2_large_fp16_128_v2"
    model_version = -1
    batch_size = 1
    url = "x.x.x.x:8000"
#page#
Running Inference
Available at NVIDIA Deep Learning Examples GitHub

    triton_client = httpclient.InferenceServerClient(url=url)

    # Health
    if not triton_client.is_server_live():
        print("FAILED: is_server_live")
    if not triton_client.is_server_ready():
        print("FAILED: is_server_ready")
    if not triton_client.is_model_ready(model_name):
        print("FAILED: is_model_ready")

    # Metadata
    metadata = triton_client.get_server_metadata()
    if not (metadata["name"] == "triton"):
        print(metadata)
    metadata = triton_client.get_model_metadata(model_name)
    print(metadata)
#page#
Running Inference
Available at NVIDIA Deep Learning Examples GitHub

    # Create the inference input/output for the model
    inputs = []
    outputs = []
    inputs.append(httpclient.InferInput("unique_ids", [1, 1], "INT32"))
    inputs.append(httpclient.InferInput("segment_ids", [1, 128], "INT32"))
    inputs.append(httpclient.InferInput("input_ids", [1, 128], "INT32"))
    inputs.append(httpclient.InferInput("input_mask", [1, 128], "INT32"))
    outputs.append(httpclient.InferRequestedOutput("end_logits", binary_data=False))
    outputs.append(httpclient.InferRequestedOutput("start_logits", binary_data=False))

    # Send inference request to the inference server and get the
    # results for both output tensors.
    results = triton_client.infer(model_name, inputs, outputs=outputs)

    # We expect there to be 2 results (each with batch size 1).
    end_logits = results.as_numpy("end_logits")[0]
    start_logits = results.as_numpy("start_logits")[0]
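The client code above stops once `start_logits` and `end_logits` are retrieved. As a minimal sketch of what a question-answering client might do next, the following picks the answer span with the highest combined start and end score. The function name `best_answer_span` and the `max_answer_len` limit are hypothetical and are not taken from the slides or from the NVIDIA example code.

```python
from typing import Sequence, Tuple


def best_answer_span(
    start_logits: Sequence[float],
    end_logits: Sequence[float],
    max_answer_len: int = 30,  # hypothetical limit on the answer length, in tokens
) -> Tuple[int, int, float]:
    """Return (start, end, score) maximizing start_logits[s] + end_logits[e].

    Only spans with s <= e < s + max_answer_len are considered.
    """
    best = (0, 0, float("-inf"))
    for s, s_score in enumerate(start_logits):
        last = min(s + max_answer_len, len(end_logits))
        for e in range(s, last):
            score = s_score + end_logits[e]
            if score > best[2]:
                best = (s, e, score)
    return best


if __name__ == "__main__":
    # Toy logits for a 4-token sequence (illustrative values only).
    start = [0.1, 2.5, 0.3, 0.2]
    end = [0.0, 0.4, 3.1, 0.5]
    s, e, score = best_answer_span(start, end)
    print(f"answer tokens {s}..{e}")
```

The token indices would then be mapped back to text with the same tokenizer that produced `input_ids`.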
#page#
A100 Multi-Instance GPU (MIG)
#page#
A100 MIG Support
Optimize GPU Utilization, Expand Access to More Users with Guaranteed Quality of Service
- Up To 7 GPU Instances In a Single A100: Dedicated SM, Memory, L2 cache, Bandwidth for hardware QoS & isolation
- Simultaneous Workload Execution With Guaranteed Quality Of Service: All MIG instances run in parallel with predictable throughput & latency
- Right Sized GPU Allocation: Different sized MIG instances based on target workloads
- Diverse Deployment Environments: Supported with Bare metal, Docker, Kubernetes, Virtualized Env.
[Figure: one A100 partitioned into GPU instances assigned to USER0 through USER6]
See also CNS20428: Multi-Instance GPU (MIG) Deep Learning Best Practices and Examples
#page#
Inference with Triton
7 ResNet Models on 7 MIG Instances in Parallel
[Figure: a gRPC client sends requests through a load balancer to seven triton-trt servers, each running ResNet50 on one of the MIG instances MIG1 to MIG7 of a single A100]
#page#
Inference with Triton
Measure Performance Using the Perf Analyzer
Perf Analyzer (formerly "perf_client"): Measures latency and throughput (inf/s) under varying client loads. It can be used to measure performance at the lowest possible load on the model, by sending one inference request to Triton and waiting for the response.
Use the --concurrency-range option to send multiple requests at the same time.

    perf_analyzer -m flower -u 127.0.0.1:50058 -i http --concurrency-range 1:100 -f results.csv
#page#
Triton on A100 with MIG
4.5X Throughput Speedup Using 7 MIG Instances on ResNet50
One A100 can be partitioned into up to seven GPU instances to maximize GPU utilization and provide dynamic scalability.
Each MIG instance brings a consistent increase in throughput. Using only two MIG instances on A100 already provides an improvement over both V100 16GB and T4.
[Chart: Throughput (HTTP, batch size = 1) vs. concurrency for A100-MIG7 x1 through x7 and V100]
#page#
Triton on A100 with MIG
4X Latency Speedup Using 7 MIG Instances on ResNet50
Typically, when model concurrency increases, latency tends to suffer. This chart shows how increasing the number of MIG instances used can keep the latency low at higher concurrency values.
The latency decreases monotonically when more MIGs are added. Using only two MIG instances on A100 already represents an improvement over both V100 16GB and T4. At seven MIG instances, we obtain a significant latency speedup.
[Chart: Latency (p90, HTTP, batch size = 1) vs. concurrency for A100-MIG7 x1 through x7 and V100]
#page#
[Figure: 1 T4 GPU: a Triton client sends requests to NVIDIA Triton, which serves a ResNet50 TensorRT model from the model repository on an NVIDIA T4 GPU]
#page#
MIG for Optimized Inferencing
Scaling out microservices with Triton and MIG
- Horizontally scale out your containers or VMs 7x by using MIG GPU compute instances instead of GPU devices
- No updates needed for application code; deployment code must be updated to use MIG rather than GPU resources
- Continue using microservice best practice, one server per app, or allow Triton to manage all MIG devices
- Ideal for batch-size 1 inferencing
- 56 inference jobs on a DGX A100: 8 * 7 * MIG 1g.5gb
#page#
Deployment on Kubernetes
#page#
Deployment on Kubernetes
Helm Chart

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: {{ template "triton-inference-server.fullname" . }}
      namespace: {{ .Release.Namespace }}
      labels:
        app: {{ template "triton-inference-server.name" . }}
        ...
    spec:
      replicas: {{ .Values.replicaCount }}
      selector:
        matchLabels:
          ...
          release: {{ .Release.Name }}
      template:
        metadata:
          labels:
            app: {{ template "triton-inference-server.name" . }}
            release: {{ .Release.Name }}
        spec:
          containers:
            - name: {{ .Chart.Name }}
              ...
              imagePullPolicy: {{ .Values.image.pullPolicy }}
              resources:
                limits:
                  ...
              args: ["tritonserver", "--model-store={{ .Values.image.modelRepositoryPath }}"]
              ports:
                - containerPort: 8000
                  name: http
                - containerPort: 8001
                  name: grpc
                - containerPort: 8002
                  name: metrics
              livenessProbe:
                httpGet:
                  ...
#page#
Triton on Azure Kubernetes Service
[Figure: a BERT chatbot client on an Azure VM (Triton inference client Docker container) sends HTTP requests to a K8s service on AKS (port 8000); the Triton pod (Triton Inference Server Docker container) loads the BERT model from a model repository in storage and exposes metrics on port 8002]
#page#
Triton Inference Server on AKS
Autoscaling
[Figure: on AKS, a K8s service fronts several Triton pods (Triton Inference Server Docker containers); their metrics drive the HPA (Horizontal Pod Autoscaler)]
#page#
Questions and Answers with BERT
Context: Following their loss in the divisional round of the previous season's playoffs, the Denver Broncos underwent numerous coaching changes, including a ... coach John Fox (who had won ...
[Demo screenshot: question (Q) and answer (A) fields]
#page#
Kubernetes Support for MIG
Both strategies use the same pod skeleton and differ in the resource type requested under limits (example GPU: A100-SXM4-40GB):

    apiVersion: v1
    kind: Pod
    metadata:
      name: gpu-example
    spec:
      containers:
      - name: gpu-example
        image: nvidia/cuda:11.0-base
        resources:
          limits:
            ...

"Single" Strategy
- Exposed using traditional resource GPU types
- Good for large clusters (homogeneous nodes)
- Users don't need to learn new type syntax
"Mixed" Strategy
- Exposed using new resource GPU types
- Good for smaller clusters (heterogeneous nodes)
- Users need to learn new type syntax
See also CNS20428: Multi-Instance GPU (MIG) Deep Learning Best Practices and Examples
#page#
Integration with KFServing
#page#
Triton Integration with Kubeflow
What is Kubeflow?
- Open-source project to make ML/DL workflows on Kubernetes simple, portable, and scalable
- Customizable scripts and configuration files to deploy containers on their chosen environment
Problems it solves
- Easily set up an ML/DL stack or pipeline that can fit into the majority of enterprise datacenter and multi-cloud environments
How it helps Triton Inference Server
- Triton Inference Server is deployed as a component inside of a production workflow to:
  - Optimize GPU performance
  - Enable auto-scaling, traffic load balancing, and redundancy/failover via metrics
#page#
Kubeflow Serving (KFServing)
Overview
KFServing enables serverless inferencing on Kubernetes and provides performant, high-abstraction interfaces for common machine learning (ML) frameworks to solve production model serving use cases.
You can use KFServing to:
- Provide a Kubernetes Custom Resource Definition for serving ML models on arbitrary frameworks.
- Encapsulate the complexity of autoscaling, networking, health checking, and server configuration to bring cutting edge serving features like GPU autoscaling, scale to zero, and canary rollouts to your ML deployments.
- Enable a simple, pluggable, and complete story for your production ML inference server by providing prediction, pre-processing, post-processing and explainability out of the box.
#page#
INFERENCE SERVER ARCHITECTURE: KUBEFLOW
[Figure: your training/pruning/validation flow dumps models into a model repository (persistent volume); multiple workloads (REC, IMG, ASR) reach containerized inference services through a load balancer; each service runs pre-processing, GPU inference with NVIDIA Triton, and post-processing (CPU/GPU); a metrics service feeds the auto scaler for the cluster; the legend separates already existing components from new NVIDIA ones; model formats: TensorRT, TensorFlow, C2/ONNX]
#page#
Architecture Overview
Inference Service Data Plane
[Figure: the default endpoint routes requests through a Transformer (predict) to the Predictor (predict, explain) and to the Explainer (explain)]
Triton Inference Server First To Adopt KFServing V2 Protocol
#page#
KFServing: Interface
Inference Service

    apiVersion: "serving.kubeflow.org/v1alpha2"
    kind: "InferenceService"
    metadata:
      name: "triton-simple-string"
    spec:
      default:
        predictor:
          triton:
            storageUri: "gs://kfserving-samples/models/tensorrt"

Apply the CRD:

    $ kubectl apply -f triton.yaml

Expected Output:

    inferenceservice.serving.kubeflow.org/triton-simple-string created
#page#
KFServing
Run a prediction
Uses the client at: https:/ example.html
1. Determine the ingress IP and ports and set INGRESS_HOST and INGRESS_PORT

    SERVICE_HOSTNAME=$(kubectl get inferenceservice triton-simple-string -o jsonpath='{.status.url}' | cut -d "/" -f 3)

2. Check server status

    curl -H "Host: ${SERVICE_HOSTNAME}" http://${INGRESS_HOST}:${INGRESS_PORT}/api/status

3. Edit /etc/hosts to map the CLUSTER IP to the service hostname (triton-simple-...)
4. Run the client

    docker run -e SERVICE_HOSTNAME=$SERVICE_HOSTNAME -it --rm --net=host nvcr.io/nvidia/tritonserver:20.11-py3-clientsdk
    root@trantor:/workspace# ./build/simple_string_client -u $SERVICE_HOSTNAME
    0 + 1 = 1
    ...
#page#
BERT Example
Extend KFServing and implement pre/post processing and prediction

    class BertTransformer(kfserving.KFModel):
        def __init__(self, name: str, predictor_host: str):
            super().__init__(name)
            self.short_paragraph_text = "The Apollo ..."
            self.predictor_host = predictor_host
            self.tokenizer = tokenization.FullTokenizer(
                vocab_file="/mnt/models/vocab.txt", do_lower_case=True)
            self.model_name = "bert_tf_v2_large_fp16_128_v2"
            self.model_version = -1
            self.protocol = ProtocolType.from_str("http")

        def preprocess(self, inputs: Dict) -> Dict:
            ...
            return self.features

        def predict(self, features: Dict) -> Dict:
            ...
            return result

        def postprocess(self, result: Dict) -> Dict:
            ...
            return {"predictions": prediction, "prob": ...}

- The preprocess method converts the paragraph and the question to BERT input with the help of the tokenizer
- The predict method calls the Triton inference server Python API to communicate with the inference server over HTTP
- The postprocess method converts the raw prediction to the answer with the probability
#page#
BERT Example
Create the Inference Service

    apiVersion: "serving.kubeflow.org/v1alpha2"
    kind: "InferenceService"
    metadata:
      name: "bert-large"
    spec:
      default:
        transformer:
          custom:
            container:
              name: kfserving-container
              image: gcr.io/kubeflow-ci/kfserving/bert-transformer:latest
              resources:
                limits:
                  cpu: "1"
                  memory: 1Gi
                requests:
                  cpu: "1"
                  memory: 1Gi
              command:
                - "python"
                - "bert_transformer"
              env:
                - name: STORAGE_URI
                  value: "gs://kfserving-samples/models/triton/bert-transformer"
        predictor:
          triton:
            resources:
              limits:
                cpu: "1"
                memory: 16Gi
              requests:
                cpu: "1"
                memory: 16Gi
            storageUri: "gs://kfserving-samples/models/triton/bert"

    kubectl apply -f bert.yaml
#page#
BERT Example
Run Inference
input.json:

    {"instances": ["What President is credited with the original notion of putting Americans in space?"]}

    MODEL_NAME=bert-large
    INPUT_PATH=@./input.json
    SERVICE_HOSTNAME=$(kubectl get inferenceservices bert-large -o jsonpath='{.status.url}' | cut -d "/" -f 3)
    curl -v -H "Host: ${SERVICE_HOSTNAME}" -d $INPUT_PATH http://$INGRESS_HOST:$INGRESS_PORT/v1/models/$MODEL_NAME:predict

Expected output:

    {"predictions": "John F. Kennedy", "prob": 77.904}
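The curl call above can also be issued from Python. The sketch below only builds the request object for the `/v1/models/<name>:predict` endpoint and the `{"instances": [...]}` payload shown on the slide, so it runs without a cluster. The helper name `build_predict_request` and the host, port, and hostname values in the demo are hypothetical placeholders.

```python
import json
import urllib.request


def build_predict_request(
    ingress_host: str,
    ingress_port: int,
    service_hostname: str,
    model_name: str,
    question: str,
) -> urllib.request.Request:
    """Build (but do not send) a predict request like the curl call above."""
    url = f"http://{ingress_host}:{ingress_port}/v1/models/{model_name}:predict"
    body = json.dumps({"instances": [question]}).encode("utf-8")
    req = urllib.request.Request(url, data=body, method="POST")
    # The Host header selects the InferenceService behind the shared ingress.
    req.add_header("Host", service_hostname)
    req.add_header("Content-Type", "application/json")
    return req


if __name__ == "__main__":
    # Placeholder values; substitute INGRESS_HOST, INGRESS_PORT and
    # SERVICE_HOSTNAME from your own cluster.
    req = build_predict_request(
        "10.0.0.1", 80, "bert-large.default.example.com", "bert-large",
        "What President is credited with the original notion of putting Americans in space?",
    )
    print(req.get_method(), req.full_url)
    # To send it: urllib.request.urlopen(req).read()
```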
#page#
Metrics for Monitoring and Autoscaling
#page#
Triton Inference Server Metrics For Autoscaling
Before Triton Inference Server - 5,000 FPS
- One model per GPU
- Requests are steady across all models
- Utilization is low on all GPUs
Before Triton Inference Server - 800 FPS
- Spike in requests for blue model
- GPUs running blue model are being fully utilized
- Other GPUs remain underutilized
#page#
Triton Inference Server Metrics For Autoscaling
After Triton Inference Server - 15,000 FPS
- Load multiple models on every GPU
- Load is evenly distributed between all GPUs
After Triton Inference Server - 5,000 FPS
- Spike in requests for blue model
- Each GPU can run the blue model concurrently
Metrics to indicate time to scale up
- GPU utilization
- Power usage
- Inference count
- Queue time
- Number of requests/sec
#page#
AVAILABLE METRICS
Category | Name | Use Case | Granularity | Frequency
GPU Utilization | Power usage | Proxy for load on the GPU | Per GPU | Per second
GPU Utilization | Power limit | Maximum GPU power limit | Per GPU | Per second
GPU Utilization | GPU utilization | GPU utilization rate (0.0-1.0) | Per GPU | Per second
GPU Memory | GPU Total Memory | Total GPU memory, in bytes | Per GPU | Per second
GPU Memory | GPU Used Memory | Used GPU memory, in bytes | Per GPU | Per second
Count (GPU & CPU) | Request count | Number of inference requests | Per model | Per request
Count (GPU & CPU) | Execution count | Number of model inference executions; request batching | Per model | Per request
Count (GPU & CPU) | Inference count | Number of inferences performed (one request counts as "batch size" inferences) | Per model | Per request
Latency (GPU & CPU) | Latency: request time | End-to-end inference request handling time | Per model | Per request
Latency (GPU & CPU) | Latency: compute time | Time a request spends executing the inference model (in the appropriate framework) | Per model | Per request
Latency (GPU & CPU) | Latency: queue time | Time a request spends waiting in the queue before being executed | Per model | Per request
#page#
Triton Metrics: Autoscaling
Horizontal Pod Autoscaler
The HPA controller operates on the ratio between the desired metric value and the current metric value. The following equation returns the number of desired replicas:

$R = \left\lceil CR \cdot \frac{CV}{DV} \right\rceil$

where R is the number of replicas that Kubernetes needs to have, CR is the current number of replicas, CV is the current metric value and DV is the desired metric value.
When R is different from CR, the HPA increases or decreases the number of replicas by acting on the Kubernetes deployment (in our case the TRTIS deployment).
[Figure: the Horizontal Pod Autoscaler scales a Deployment from Pod 1 to Pod n]
#page#
Triton Metrics: Autoscaling
Custom Metrics
Here is the summary of what we need to deploy:
1. Prometheus operator and Prometheus
2. Horizontal Pod Autoscaler
3. Service Monitor
4. Custom Metrics
We want the HPA to perform autoscaling based on the following metric: the average time spent by each request in the queue, QT = (total time spent in the queue) / Req, where Req is the total number of requests.
We need to express this equation using PromQL, the Prometheus query language, with the actual names of the metrics exposed by TRTIS:

    QT = delta(nv_inference_queue_duration_us[30s]) / delta(nv_inference_request_success[30s])
#page#
Triton Metrics: Autoscaling
Custom Metrics

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: adapter-config
      namespace: custom-metrics
    data:
      config.yaml: |
        rules:
        - seriesQuery: ...
          resources:
            overrides:
              namespace: {resource: "namespace"}
              ...
          name:
            matches: ...
            as: "avg_time_queue_ms"
          metricsQuery: 'avg(delta(nv_inference_queue_duration_us{<<.LabelMatchers>>}[30s]) / delta(nv_inference_request_success{<<.LabelMatchers>>}[30s]) / 1000) by (<<.GroupBy>>)'
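To make the two formulas above concrete, here is a minimal sketch that computes the average queue time in milliseconds from two counter deltas and then applies the replica equation R = ceil(CR * CV / DV). The function names and the sample numbers are illustrative assumptions; the real HPA controller additionally applies a tolerance band and min/max replica bounds, which are not modeled here.

```python
import math


def avg_queue_time_ms(queue_duration_us_delta: float, request_success_delta: float) -> float:
    """Average queue time per request, in milliseconds, over one window.

    Mirrors delta(nv_inference_queue_duration_us) / delta(nv_inference_request_success) / 1000.
    Returns 0.0 when no request completed in the window.
    """
    if request_success_delta <= 0:
        return 0.0
    return queue_duration_us_delta / request_success_delta / 1000.0


def desired_replicas(current_replicas: int, current_value: float, desired_value: float) -> int:
    """R = ceil(CR * CV / DV), as on the Horizontal Pod Autoscaler slide."""
    return math.ceil(current_replicas * current_value / desired_value)


if __name__ == "__main__":
    # Illustrative window: 300 requests spent 1.2 s in the queue in total.
    cv = avg_queue_time_ms(1_200_000, 300)
    # Hypothetical target of 2 ms average queue time with 3 replicas running.
    print(cv, desired_replicas(3, cv, 2.0))
```

With a current value at or below the target the equation returns a replica count no larger than the current one, which is how the deployment scales back down.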
#page#
Performance Analyzer: Optimization Guidance
#page#
Measuring Inference Performance
Triton includes a performance measurement tool called the perf analyzer (formerly "perf client")
- Measures throughput and latency under varying client loads
- Real or synthetic input tensor data and client loads
- HTTP/REST or gRPC APIs
- Complete feature coverage: shared memory, stateless and stateful models, batching, etc.
- Command-line tool w/ spreadsheet templates
- Generates charts to help visualize the throughput vs latency tradeoffs
#page#
Perf_Analyzer
On the GitHub repository and in the NGC Triton Client Container
The perf_analyzer helps you determine the ideal model configuration which maximizes performance based on specific constraints.
The throughput and latency are taken over a time window, and the measurements are then repeated until they reach stable values.
By default the perf_analyzer uses average latency to determine stability, but you can use the --percentile flag to stabilize results based on that confidence level.

    $ perf_analyzer -m flower -u 127.0.0.1:50050 -i http --concurrency-range 1:50 -p 5000

    Settings
      Batch size: 1
      Measurement window: 5000 msec
      Latency limit: 9 msec
      Concurrency limit: 108 concurrent requests
      Stabilizing using average latency

    Request concurrency: 6
      Client:
        Request count: 8387
        Throughput: 1677.4 infer/sec
        Avg latency: 3575 usec (standard deviation 248 usec)
        p50 latency: 3578 usec
        p90 latency: 3626 usec
        p95 latency: 4161 usec
        p99 latency: 4218 usec
        Avg HTTP time: 3569 usec (send/recv 146 usec + response wait 3423 usec)
      Server:
        Inference count: 10366
        Execution count: 16866
        Successful request count: 10866
        Avg request latency: 2223 usec (... + compute infer 549 usec + compute output 12 usec)
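The "repeat until stable" behaviour described above can be sketched as follows. The rule used here, that the last three windows must each lie within 10% of their own average for both throughput and latency, is an assumption modeled on perf_analyzer's documented stability percentage, not its actual implementation; the function names and sample windows are hypothetical.

```python
from typing import Iterable, List, Optional, Tuple


def is_stable(values: List[float], pct: float = 10.0, window: int = 3) -> bool:
    """True if the last `window` values are all within +/- pct% of their average."""
    if len(values) < window:
        return False
    recent = values[-window:]
    avg = sum(recent) / window
    return all(abs(v - avg) <= avg * pct / 100.0 for v in recent)


def measure_until_stable(
    windows: Iterable[Tuple[float, float]], pct: float = 10.0, window: int = 3
) -> Optional[Tuple[float, float, int]]:
    """Consume (throughput, latency_us) measurement windows until both stabilize.

    Returns (avg throughput, avg latency, windows used), or None if the
    measurements never stabilize.
    """
    throughputs: List[float] = []
    latencies: List[float] = []
    for infer_per_sec, latency_us in windows:
        throughputs.append(infer_per_sec)
        latencies.append(latency_us)
        if is_stable(throughputs, pct, window) and is_stable(latencies, pct, window):
            return (
                sum(throughputs[-window:]) / window,
                sum(latencies[-window:]) / window,
                len(throughputs),
            )
    return None


if __name__ == "__main__":
    # Illustrative windows: a slow warm-up window followed by steady ones.
    samples = [(900, 6000), (1500, 3900), (1650, 3600), (1680, 3570), (1677, 3575)]
    print(measure_until_stable(samples))
```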
#page#
Basic Optimization - Inference Schedulers
- Default model scheduler:
  - Processes 1 inference request at a time for each model
  - If there are multiple models, each can have 1 inference request executing at any given time
  - An inference request can be a batch (if the client creates the batch)
  - Minimizes latency
[Figure: batch-1 and batch-4 requests pass through the default scheduler to the model backend and framework runtime]
#page#
Dynamic Batching Optimization
- Grouping requests into a single "batch" increases overall GPU throughput
- Process multiple inference requests at the same time for a model
- Individual requests are batched and executed together
[Figure: inside Triton Inference Server, the dynamic batcher feeds the model backend, whose runtime executes batches in contexts]
#page#
Dynamic Batchers
- Each model's scheduler is configured independently
- Dynamic batcher controls:
  - Preferred batch sizes
  - Maximum delay to hold inference request to form larger batch
  - Prioritization and timeout

    dynamic_batching { preferred_batch_size: [ 4, 8 ] }
#page#
Dynamic Batcher Results
Deep-recommender TensorRT model
[Chart: throughput vs. p95 latency (ms) for dynamic batch sizes 1-32]
#page#
Dynamic Batching
2.5X Faster Inferences/Second at a 50 ms End-to-End Server Latency Threshold
Triton Inference Server groups inference requests based on customer defined metrics for optimal performance.
Customer defines 1) batch size (required) and 2) latency requirements (optional).
Example: No dynamic batching (batch size 1 & 8) vs dynamic batching
[Chart: Static vs Dynamic Batching (T4, TRT ResNet50 FP16, Instance 1)]
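The interaction between the preferred batch size and the maximum queue delay can be illustrated with a small simulation. This is a deliberately simplified model, not Triton's scheduler: a batch is dispatched as soon as it reaches the largest preferred size, or when a new request arrives after the oldest queued request has already waited longer than the delay (the delay corresponds to the `max_queue_delay_microseconds` setting of the dynamic batcher). The function name and arrival times are hypothetical.

```python
from typing import List, Sequence


def dynamic_batches(
    arrivals_us: Sequence[int],
    preferred_sizes: Sequence[int] = (4, 8),
    max_delay_us: int = 100,
) -> List[List[int]]:
    """Group request arrival times (microseconds) into batches.

    Simplified model: dispatch when the batch reaches max(preferred_sizes),
    or when the next arrival comes more than max_delay_us after the oldest
    request currently waiting.
    """
    max_batch = max(preferred_sizes)
    batches: List[List[int]] = []
    current: List[int] = []
    for t in sorted(arrivals_us):
        if current and t - current[0] > max_delay_us:
            batches.append(current)  # oldest request waited long enough
            current = []
        current.append(t)
        if len(current) == max_batch:
            batches.append(current)  # preferred size reached
            current = []
    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    # A burst of 8 requests, then two close together, then a straggler.
    arrivals = [0, 10, 20, 30, 40, 50, 60, 70, 500, 520, 1000]
    print([len(b) for b in dynamic_batches(arrivals)])
```

Raising the delay lets sparse traffic form larger batches at the cost of queue time, which is the throughput versus latency tradeoff the charts above illustrate.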
#page#
Basic Optimization - Concurrent Model Execution
- By default Triton creates 1 copy of each model (on each available GPU)
- Each copy is known as an instance of the model
- Inference requests are scheduled independently to each model instance
- Using more than 1 instance increases throughput and reduces latency if:
  - GPU has sufficient memory to hold multiple copies of the model
  - GPU has sufficient compute to execute multiple inferences simultaneously
  - GPU / PCIe has sufficient bandwidth for simultaneous inferences
- Prioritization and timeout

    instance_group [ { count: 2 } ]
#page#
Concurrent Model Execution - ResNet50
4x Better Performance and Improved GPU Utilization Through Multiple Model Concurrency
Common Scenario 1: One API using multiple copies of the same model on a GPU
Example: 12 instances of TRT FP16 ResNet50 (each model takes 1.33GB GPU memory) are loaded onto the GPU and can run concurrently on a 16GB V100 GPU. 14 concurrent inference requests happen: each model instance fulfils one request simultaneously and 2 are queued in the per-model scheduler queues in Triton Inference Server to execute after the 12 requests finish. With this configuration, 2832 inferences per second at 33.94 ms with batch size 8 on each inference server instance is achieved.
[Figure: timeline on a V100 16GB GPU in which Triton Inference Server runs RN50 instances 1 to 12, each on its own CUDA stream, serving inference requests concurrently]
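The arithmetic behind the ResNet50 example can be checked with a few lines of code. This is only a back-of-the-envelope sketch: it counts model weights against total GPU memory and ignores runtime workspace and CUDA context overhead, and the function name `plan_instances` is hypothetical.

```python
from typing import Tuple


def plan_instances(
    gpu_mem_gb: float, model_mem_gb: float, concurrent_requests: int, instances: int
) -> Tuple[bool, int, int]:
    """Return (fits_in_memory, requests_running, requests_queued).

    Each model instance serves one request at a time; extra requests wait
    in the per-model scheduler queue.
    """
    fits = instances * model_mem_gb <= gpu_mem_gb
    running = min(concurrent_requests, instances)
    queued = concurrent_requests - running
    return fits, running, queued


if __name__ == "__main__":
    # Numbers from the slide: 12 instances of 1.33 GB each on a 16 GB V100,
    # with 14 concurrent requests.
    print(plan_instances(16.0, 1.33, 14, 12))
```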
#page#
Multiple Model Instances Increase Throughput
Deep-recommender TensorRT model
[Chart: throughput vs. p95 latency (ms) for dynamic batch sizes 1-32, comparing 1 instance and 2 instances]
#page#
Customer Case Studies
#page#
KINGSOFT CLOUD ADOPTS NVIDIA TRITON INFERENCE SERVER TO MAXIMIZE PERFORMANCE AT SCALE
- 15+ online AI computer vision services using Triton for inference serving
- 50% higher QPS per GPU with Triton
- 4-5x higher QPS with Triton + TensorRT/TVM
"Besides the increase of QPS and latency, we can now smoothly shift our service from the offline mode (which cares more about throughput) to online mode (which cares more about latency) thanks to Triton's scheduling + batching and TensorRT's dynamic batch support" - Kingsoft Cloud
#page#
TENCENT YOUTU INTEGRATES TNN BACKEND INTO NVIDIA TRITON INFERENCE SERVER TO STANDARDIZE INFERENCE AT SCALE
"In order to standardize inferencing across Tencent, Tencent Youtu developed a new open source high performing framework called TNN. Next, they chose Triton Inference Server for inference serving because of its product maturity and dynamic batching & concurrent model execution capabilities. Tencent easily customized Triton by integrating TNN as a Triton custom backend. TNN and Triton together help achieve standardized high performance inferencing for all developers building AI applications" - Tencent YouTu Lab
#page#
UNIVERSAL INFERENCE ENGINE
Naver is the #1 search engine and internet services company in South Korea. They use deep learning, built with multiple frameworks, to enrich and diversify query results.
Naver uses Triton Inference Server to accelerate the deployment of models in production. The platform supports multiple frameworks, batch and real-time inferencing, and inferencing on GPUs and CPUs. It helps Naver roll out new AI-based services faster and with lower development costs.
#page#
ANNOUNCING
MICROSOFT ADOPTS NVIDIA AI TO CREATE SMART EXPERIENCES IN MICROSOFT OFFICE
Correct Grammar | Q&A | Predict Text
- 200ms Inference Response for SOTA AI
- 1/3 Lower Cost Than CPU
- Azure ML & ONNX Runtime with Triton for Inference Serving
- Half a Trillion Queries per Year for Grammar
Link to Announcement
Microsoft Blog
#page#
TRITON CUSTOMER ADOPTION
Company | Use Case | Workflows | Outcome
Tencent Youtu | Computer vision use cases, facial recognition, and biometrics | Integrated their highly performant TNN framework into Triton (custom backend) | Standardized high performance inferencing for all AI applications
Kingsoft Cloud | Facial recognition and human attributes recognition | Real-time online inferencing on Triton in the cloud | 50% higher QPS per GPU with Triton; 4-5x higher QPS with Triton + TensorRT/TVM
Microsoft Azure Machine Learning | Any ML/DL workflow | Real-time inference serving on Triton | Increased throughput by ~7x compared to a Flask end-to-end Python server
Microsoft Office | NLP models (BERT, Turing-NLG, etc.) | Real-time grammar checker running ... | Slashed costs by ~70% and achieved a throughput of 450 queries per second on a single V100 GPU, with more than half a trillion queries a year
American Express | Fraud detection on 8B credit card transactions | Used Triton to deploy a TensorRT-optimized Gated Recurrent Unit model to analyze tens of millions of daily transactions | Can operate within a 2 ms latency budget (a 50x improvement compared to CPUs that could not meet the latency requirement)
Naver | Search recommendations and image classification | Models in multiple frameworks (TensorFlow, PyTorch, Caffe, and TensorRT) on CPU and GPU | A single inference platform that allowed for faster rollout of new DL models from multiple frameworks
SPIL | Defect detection on 30,000 wafer images per day | Used Triton to deploy and manage DenseNet, Autoencoder, and UNET models in TensorRT | Can detect 100% of defects with 10% fals... (... 7x improvement), and scale to 100 diff... without changes to serving infrastructure
#page#
Try Triton Inference Server
Download from GitHub or from Docker Registry
- Triton Inference Server GitHub: https:/
- NVIDIA GPU Cloud (NGC) - Docker Container: https:/ /
- Documentation: https:/ /