Optimizing Large-Scale Deployment of Deep Learning Inference with Triton.pdf

NVIDIA
Optimizing Large-Scale Deployment of Deep Learning Inference with Triton
徐添豪, 张雪萌, 黄孟迪

AGENDA
- Triton Overview
- Inference Server Pipeline
- A100 Multi-Instance GPU (MIG)
- Deployment on Kubernetes
- Integration with KFServing
- Metrics for Monitoring and Autoscaling
- Performance Analyzer: Optimization Guidance
- Customer Case Studies

Triton Overview

Inefficiency Limits Innovation
Difficulties with Deploying Data Center Inference
- Single Framework Only: Solutions can only support models from one framework
- Single Model Only: Some systems are overused while others are underutilized
- Custom Development: Developers need to reinvent the plumbing for every application
[Figure: framework logos (Chainer, PyTorch, TensorFlow, Theano) and workload labels (RecSys, NLP, ASR)]

NVIDIA Triton Inference Server
Production Inference Server on GPU and CPU
- Maximize real-time inference performance of CPUs and GPUs
- Quickly deploy and manage multiple models per GPU per node
- Easily scale to heterogeneous GPUs and multi-GPU nodes
- Integrates with orchestration systems and auto scalers via latency and health metrics
- Open source for seamless customization and integration
[Figure: NVIDIA T4, Tesla V100, NVIDIA A100, and CPU targets]

Triton Inference Server Architecture
Previously "TensorRT Inference Server"
- Support for multiple frameworks
- Concurrent model execution
- CPU and multi-GPU support
- Dynamic batching
- Sequence batching for stateful models
- HTTP/REST, gRPC, shared library
- Health and status metrics (Prometheus) reporting
- Model ensembling and pipelining
- Shared-memory API (system and CUDA)
- GCS and S3 support
- Open source: monthly releases on NGC and GitHub

Features
(grouped on the slide under Utilization, Usability, Customization, Performance)
- Concurrent Model Execution: Multiple models (or multiple instances of the same model) may execute on the GPU simultaneously.
- Multiple Model Format Support: TensorRT; PyTorch JIT (.pt); TensorFlow 1.x GraphDef/SavedModel; TensorFlow+TensorRT 1.x GraphDef; TensorFlow 2.x SavedModel; TensorFlow+TensorRT 2.x SavedModel; ONNX graph (ONNX Runtime).
- Model Ensemble: Pipeline of one or more models and the connection of input and output tensors between those models (can be used with a custom backend).
- System/CUDA Shared Memory: Inputs/outputs that need to be passed to/from Triton are stored in system/CUDA shared memory. Reduces HTTP/gRPC overhead.
- Dynamic Batching: Inference requests can be batched up by the inference server to 1) the model-allowed maximum or 2) the user-defined latency SLA.
- Custom Backend for C++ and Python: A custom backend allows the user more flexibility by providing their own implementation of an execution engine through the use of a shared library.
- Library Version: Link against libtrtserver.so so that you can include all the inference server functionality directly in your application.
- CPU Model Inference Execution: Framework-native models can execute inference requests on the CPU.
- Streaming API: Built-in support for audio streaming input. Accommodates stateful/sequence models that have a sequence of inputs to keep track of (speech, translation, etc.).
- Metrics: Utilization, count, memory, and latency.
- Model Control API: Explicitly load/unload models into and out of Triton based on changes made in the model-control configuration.

TRITON 2.5
What Is New
- KFServing's new community standard gRPC and HTTP/REST data plane v2 protocol: Easily deploy serverless inferencing with Triton in Kubernetes.
- Data Loading Library (DALI) backend: Allows for accelerated pre-processing and augmentation pipelines within Triton for images, videos, and speech.
- Decoupled inference serving: Engages a model once sufficient, but not all, inputs are received, e.g. speech recognition and synthesis.
- Python custom backend: Allows Python code execution inside Triton (e.g. pre- and post-processing).
- Triton Model Analyzer: Tools to characterize model performance and memory footprint for efficient serving.
- Support for A100, MIG: Higher-performance inference serving. Triton on MIG with performance and fault isolation.
- DeepStream 5.0 Integration: Native integration in DeepStream 5.0 for multi-framework, multi-sensor streaming analytics.
- Latest framework backends: TensorRT 7.1, TensorFlow 2.2, PyTorch 1.6, ONNX RT 1.5.3.
- Azure Machine Learning Integration: Azure ML integrated Triton as the platform's inference server to deploy models at scale.
- Google CAIP Integration: Triton is now available on GCP CAIP as a custom container.

DEVELOPERS CAN FOCUS ON MODELS AND APPLICATIONS
Triton Takes

Care of Plumbing To Deploy Models for Inference
- Multiple Frameworks (PyTorch, TensorFlow, ONNX Runtime, Custom): All Major Framework Backends For Flexibility & Consistency
- Inferencing on GPU and CPU: Inference Serving on GPU & CPU Across Cloud | Data Center | Edge, Bare metal | Virtualization
- Different Types of Queries (Batch, Real time, Stream, Ensemble): Support For Different Types Of Inference Queries For Different Use Cases
- Dynamic Batching: Dynamic Batching Maximizes Throughput Under Latency Constraint
- Concurrent Model Execution For High Throughput & Utilization
- Standard HTTP/gRPC Communication

Designed for DevOps/MLOps
Triton Integrates Easily In Organization's Workflow For ALL AI Use Cases
(grouped on the slide under Kubernetes Integration, MLOps, Open Source & Customizable)
- Scalable Microservice In Kubernetes
- Helm Chart For Fast Deployment
- KFServing Integration
- Live Model Updates
- Dynamic Model Loading
- Triton Model Analyzer
- Google CAIP, Azure ML Integration
- GPU Util., Memory, Inference Load & Latency Metrics
- Completely Open Source
- Inspect, Customize & Extend
- Customizable Container
- Modular Backends For Low Memory Footprint

NVIDIA
Inference Server Pipeline

Inference Pipeline
Typical Pipeline
[Diagram: client and Triton server stages labeled Inference Request, Network, Request De-Serialization, Queue, Request Receive, Compute, Serialization, Request Completed]

Running Triton
Triton Docker Container Available on NGC
Prerequisite: Docker and nvidia-docker installed

    $ docker pull nvcr.io/nvidia/tritonserver:20.11-py3
    $ docker run --gpus=1 --rm -p8000:8000 -p8001:8001 -p8002:8002 -v/path/to/model/repository:/models nvcr.io/nvidia/tritonserver:20.11-py3 tritonserver --model-repository=/models

Model configuration shown on the slide (the platform value is not legible in the source):

    name: "bert_tf_v2_large_fp16_128_v2"
    platform: ...
    max_batch_size: 1
    input [
      { name: "unique_ids"  data_type: TYPE_INT32  dims: 1  reshape: { shape: [] } },
      { name: "segment_ids" data_type: TYPE_INT32  dims: 128 },
      { name: "input_ids"   data_type: TYPE_INT32  dims: 128 },
      { name: "input_mask"  data_type: TYPE_INT32  dims: 128 }
    ]
    output [
      { name: "end_logits"   data_type: TYPE_FP32  dims: 128 },
      { name: "start_logits" data_type: TYPE_FP32  dims: 128 }
    ]
    instance_group [
      { count: 1  kind: KIND_GPU  gpus: ... }
    ]

Running Inference
Available at NVIDIA Deep Learning Examples GitHub
QA - Tokenization

    import tritonclient.http as httpclient

    model_name = "bert_tf_v2_large_fp16_128_v2"
    model_version = -1
    batch_size = 1
    url = "x.x.x.x:8000"

Running Inference
Available at NVIDIA Deep Learning Examples GitHub
Health and Metadata

    triton_client = httpclient.InferenceServerClient(url=url)

    # Health checks
    if not triton_client.is_server_live():
        print("FAILED: is_server_live")
    if not triton_client.is_server_ready():
        print("FAILED: is_server_ready")
    if not triton_client.is_model_ready(model_name):
        print("FAILED: is_model_ready")

    # Metadata
    metadata = triton_client.get_server_metadata()
    if metadata["name"] == "triton":
        print(metadata)

Running Inference
Available at NVIDIA Deep Learning Examples GitHub
Create the inference input/output for the model, then send the inference request to the inference server and get the results for both output tensors. The input names are garbled on the slide; the ones below are taken from the model configuration above.

    inputs = []
    outputs = []
    inputs.append(httpclient.InferInput("unique_ids", [1, 1], "INT32"))
    inputs.append(httpclient.InferInput("input_ids", [1, 128], "INT32"))
    inputs.append(httpclient.InferInput("input_mask", [1, 128], "INT32"))
    inputs.append(httpclient.InferInput("segment_ids", [1, 128], "INT32"))
    # Attach the tokenized data to each input with set_data_from_numpy() before calling infer().
    outputs.append(httpclient.InferRequestedOutput("end_logits", binary_data=False))
    outputs.append(httpclient.InferRequestedOutput("start_logits", binary_data=False))

    results = triton_client.infer(model_name, inputs, outputs=outputs)

    # We expect there to be 2 results (each with batch size 1).
    end_logits = results.as_numpy("end_logits")[0]
    start_logits = results.as_numpy("start_logits")[0]

NVIDIA
A100 Multi-Instance GPU (MIG)

A100 MIG Support
Optimize GPU Utilization, Expand Access to More Users with Guaranteed Quality of Service
- Up To 7 GPU Instances In a Single A100: Dedicated SM, Memory, L2 cache, Bandwidth for hardware QoS & isolation
- Simultaneous Workload Execution With Guaranteed Quality Of Service: All MIG instances run in parallel with predictable throughput & latency
- Right Sized GPU Allocation: Different sized MIG instances based on target workloads
- Diverse Deployment Environments: Supported with Bare metal, Docker, Kubernetes, Virtualized Env.
[Figure: users USER0 to USER5 each mapped to a separate GPU instance]
CNS20428: Multi-Instance GPU (MIG) deep learning best-practice examples

Inference with Triton
7 ResNet Models on 7 MIG Instances in Parallel
[Diagram: gRPC Client -> Load Balancer -> seven triton-trt servers running ResNet50, one per MIG instance (MIG1 to MIG7) on a single A100]

Inference with Triton
Measure Performance Using the Perf Analyzer
Perf Analyzer (formerly "perf_client"): Measures latency

and throughput (inf/s) under varying client loads. It can be used to measure performance at the lowest possible load on the model, by sending one inference request to Triton and waiting for the response. Use the --concurrency-range option to send multiple requests at the same time.

    perf_analyzer -m flower -u 127.0.0.1:50058 -i http --concurrency-range 1:100 -f results.csv

Triton on A100 with MIG
4.5X Throughput Speedup Using 7 MIG Instances on ResNet50
- One A100 can be partitioned into up to seven GPU instances to maximize GPU utilization and provide dynamic scalability.
- Each MIG instance brings a consistent increase in throughput. Using only two MIG instances on A100 already provides an improvement over both V100 16GB and T4.
[Chart: Throughput (HTTP, Batch Size = 1) vs. Concurrency for A100-MIG7 x1 through x7 and V100]

Triton on A100 with MIG
4X Latency Speedup Using 7 MIG Instances on ResNet50
- Typically, when model concurrency increases, latency performance tends to suffer. This chart shows how increasing the number of MIG instances used can keep the latency low at higher concurrency values.
- The latency decreases monotonically when more MIGs are added. Using only two MIG instances on A100 already represents an improvement over both V100 16GB and T4. At seven MIG instances, we obtain a significant increase in latency speedup.
[Chart: Latency (p90, HTTP, Batch Size = 1) vs. Concurrency for A100-MIG7 x1 through x7 and V100]

[Diagram: 1 T4 GPU. Client -> NVIDIA Triton (Model Repository: ResNet50 TensorRT) -> NVIDIA T4 GPU]

MIG for Optimized Inferencing
Scaling out microservices with Triton and MIG
- Horizontally scale out your containers or VMs 7x by using MIG GPU compute instances instead of GPU devices
- No updates needed for application code; deployment code must be updated to use MIG rather than GPU resources
- Continue using microservice best practice, one server per app, or allow Triton to manage all MIG devices
- Ideal for batch-size 1 inferencing
- 56 inference jobs on a DGX A100: 8 * 7 * MIG 1g.5gb

NVIDIA
Deployment on Kubernetes

Deployment on Kubernetes
Helm Chart
(fields that are illegible in the source are shown as "...")

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: {{ template "triton-inference-server.fullname" . }}
      namespace: {{ .Release.Namespace }}
      labels:
        app: {{ template "triton-inference-server.name" . }}
    spec:
      replicas: {{ .Values.replicaCount }}
      selector:
        matchLabels:
          release: {{ .Release.Name }}
      template:
        metadata:
          labels:
            app: {{ template "triton-inference-server.name" . }}
            release: {{ .Release.Name }}
        spec:
          containers:
            - name: {{ .Chart.Name }}
              imagePullPolicy: {{ .Values.image.pullPolicy }}
              resources:
                limits: ...
              args: ["tritonserver", "--model-store={{ .Values.image.modelRepositoryPath }}"]
              ports:
                - containerPort: 8000
                  name: http
                - containerPort: 8001
                  name: grpc
                - containerPort: 8002
                  name: metrics
              livenessProbe:
                httpGet: ...

https:/ server

Triton on Azure Kubernetes Service

[Diagram: a BERT ChatBot client on an Azure VM (Triton Inference Client Docker container) sends HTTP requests on port 8000 through a K8s Service to a Triton POD (Triton Inference Server Docker container) on AKS; the BERT model is stored in the Model Repository; metrics are exposed on port 8002]

Triton Inference Server on AKS
Autoscaling
[Diagram: AKS cluster with a K8s Service in front of Triton PODs (Triton Inference Server Docker containers); pod metrics feed the HPA (Horizontal Pod Autoscaler)]

Questions and Answers with BERT
[Screenshot of the Q&A demo. Context: "Following their loss in the divisional round of the previous season's playoffs, the Denver Broncos underwent numerous coaching changes, including a ... coach John Fox (who had won ..." The questions and answers are not legible in the source.]

Kubernetes Support for MIG
Both strategies are illustrated with the same Pod spec; the resource limit values are not legible in the source (legible fragment: A100-SXM4-48GB).

    apiVersion: v1
    kind: Pod
    metadata:
      name: gpu-example
    spec:
      containers:
        - name: gpu-example
          image: nvidia/cuda:11.0-base
          resources:
            limits: ...

"Single" Strategy
- Exposed using traditional resource GPU types
- Good for large clusters (homogeneous nodes)
- Users don't need to learn new type syntax

"Mixed" Strategy
- Exposed using new resource GPU types
- Good for smaller clusters (heterogeneous nodes)
- Users need to learn new type syntax

NVIDIA
Integration with KFServing

Triton Integration with Kubeflow
What is Kubeflow?
- Open-source project to make ML/DL workflows on Kubernetes simple, portable, and scalable
- Customizable scripts and configuration files to deploy containers on their chosen environment
Problems it solves
- Easily set up an ML/DL stack or pipeline that can fit into the majority of enterprise datacenter and multi-cloud environments
How it helps Triton Inference Server
- Triton Inference Server is deployed as a component inside of a production workflow to
  - Optimize GPU performance
  - Enable auto-scaling, traffic load balancing, and redundancy/failover via metrics

Kubeflow Serving (KFServing)
Overview
KFServing enables serverless inferencing on Kubernetes and provides performant, high-abstraction interfaces for common machine learning (ML) frameworks to solve production model serving use cases.
You can use KFServing to:
- Provide a Kubernetes Custom Resource Definition for serving ML models on arbitrary frameworks.
- Encapsulate the complexity of autoscaling, networking, health checking, and server configuration to bring cutting-edge serving features like GPU autoscaling, scale to

  zero, and canary rollouts to your ML deployments.
- Enable a simple, pluggable, and complete story for your production ML inference server by providing prediction, pre-processing, post-processing and explainability out of the box.

INFERENCE SERVER ARCHITECTURE: KUBEFLOW
[Diagram: your training/pruning/validation flow dumps models into a model repository on a persistent volume; multiple workloads (REC, IMG, ASR) pass through a load balancer to a containerized inference service with pre-processing, NVIDIA Triton (GPU), and post-processing (CPU/GPU); a metrics service drives the auto scaler. Legend: already existing vs. new from NVIDIA. Model formats: TensorRT, TensorFlow, C2/ONNX.]

Architecture Overview
Inference Service Data Plane
[Diagram: Default Endpoint with Transformer (:predict, :explain), Predictor (:predict), and Explainer (:explain)]
Triton Inference Server First To Adopt KFServing V2 Protocol

KFServing: Interface
Inference Service

    apiVersion: "serving.kubeflow.org/v1alpha2"
    kind: "InferenceService"
    metadata:
      name: "triton-simple-string"
    spec:
      default:
        predictor:
          triton:
            storageUri: "gs://kfserving-samples/models/tensorrt"

Apply the CRD:

    $ kubectl apply -f triton.yaml

Expected Output:

    inferenceservice.serving.kubeflow.org/triton-simple-string created

KFServing
Run a prediction
Uses the client at: https:/ example.html
1. Determine the ingress IP and ports and set INGRESS_HOST and INGRESS_PORT

    SERVICE_HOSTNAME=$(kubectl get inferenceservice triton-simple-string -o jsonpath='{.status.url}' | cut -d "/" -f 3)

2. Check server status

    curl -H "Host: ${SERVICE_HOSTNAME}" http://${INGRESS_HOST}:${INGRESS_PORT}/api/status

3. Edit /etc/hosts to map the CLUSTER IP to triton-simple-
4. Run the client

    docker run -e SERVICE_HOSTNAME:$SERVICE_HOSTNAME -it --rm --net=host nvcr.io/nvidia/tritonserver:20.11-py3-clientsdk
    ./build/simple_string_client -u $SERVICE_HOSTNAME

    root@trantor:/workspace# ./build/simple_string_client -u triton-simple-
    0 + 1 = 1

BERT Example
Extend KFServing and Implement pre/post processing and prediction
- The preprocess method converts the paragraph and the question to BERT input with the help of the tokenizer.
- The predict method calls the Triton inference server Python API to communicate with the inference server with HTTP.
- The postprocess method converts the raw prediction to the answer with the probability.

    class BertTransformer(kfserving.KFModel):
        def __init__(self, name: str, predictor_host: str):
            super().__init__(name)
            self.short_paragraph_text = "The Apollo ..."
            self.predictor_host = predictor_host
            self.tokenizer = tokenization.FullTokenizer(
                vocab_file="/mnt/models/vocab.txt", do_lower_case=True)
            self.model_name = "bert_tf_v2_large_fp16_128_v2"
            self.model_version = -1
            self.protocol = ProtocolType.from_str("http")

        def preprocess(self, inputs: Dict) -> Dict:
            ...
            return self.features

        def predict(self, features: Dict) -> Dict:
            ...
            return result

        def postprocess(self, result: Dict) -> Dict:
            ...
            return {"predictions": prediction, "prob": ...}

BERT Example
Create the Inference Service

    apiVersion: "serving.kubeflow.org/v1alpha2"
    kind: "InferenceService"
    metadata:
      name: "bert-large"
    spec:
      default:
        transformer:
          custom:
            container:
              name: kfserving-container
              image: gcr.io/kubeflow-ci/kfserving/bert-transformer:latest
              resources:
                limits:
                  cpu: "1"
                  memory: 1Gi
                requests:
                  cpu: "1"
                  memory: 1Gi
              command: "python" ... "bert_transformer"
              env:
                - name: STORAGE_URI
                  value: "gs://kfserving-samples/models/triton/bert-transformer"
        predictor:
          triton:
            resources:
              limits:
                cpu: "1"
                memory: 16Gi
              requests:
                cpu: "1"
                memory: 16Gi
            storageUri: "gs://kfserving-samples/models/triton/bert"

    kubectl apply -f bert.yaml

BERT Example
Run Inference
Input: "instances": "What President is credited with the original notion of putting Americans in space?"

    MODEL_NAME=bert-large
    INPUT_PATH=@./input.json
    SERVICE_HOSTNAME=$(kubectl get inferenceservice bert-large -o jsonpath='{.status.url}' | cut -d "/" -f 3)
    curl -v -H "Host: ${SERVICE_HOSTNAME}" -d $INPUT_PATH http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/$MODEL_NAME:predict

Expected output:

    {"predictions": "John F. Kennedy", "prob": 77.904...}

https:/

NVIDIA
Metrics for Monitoring and Autoscaling

Triton Inference Server Metrics For Autoscaling
Before Triton Inference Server (headline figures on the slide: 800 FPS and 5,000 FPS)
- Steady load: One model per GPU. Requests are steady across all models. Utilization is low on all GPUs.
- Spike in requests for blue model: GPUs running the blue model are being fully utilized. Other GPUs remain underutilized.

Triton Inference Server Metrics For Autoscaling
After Triton Inference Server - 15,000 FPS / After Triton Inference Server - 5,000

FPS
- Steady load: Load multiple models on every GPU. Load is evenly distributed between all GPUs.
- Spike in requests for blue model: Each GPU can run the blue model concurrently.
Metrics to indicate time to scale up:
- GPU utilization
- Power usage
- Inference count
- Queue time
- Number of requests/sec

AVAILABLE METRICS

| Category | Name | Use Case | Granularity | Frequency |
|---|---|---|---|---|
| GPU Utilization | Power usage | Proxy for load on the GPU | Per GPU | Per second |
| GPU Utilization | Power limit | Maximum GPU power limit | Per GPU | Per second |
| GPU Utilization | GPU utilization | GPU utilization rate (0.0 - 1.0) | Per GPU | Per second |
| GPU Memory | GPU Total Memory | Total GPU memory, in bytes | Per GPU | Per second |
| GPU Memory | GPU Used Memory | Used GPU memory, in bytes | Per GPU | Per second |
| Count (GPU & CPU) | Request count | Number of inference requests | Per model | Per request |
| Count (GPU & CPU) | Execution count | Number of model inference executions | Per model | Per request |
| Count (GPU & CPU) | Inference count | Number of inferences performed (one request counts as "batch size" inferences) | Per model | Per request |
| Latency (GPU & CPU) | Latency: request time | End-to-end inference request handling time | Per model | Per request |
| Latency (GPU & CPU) | Latency: compute time | Time a request spends executing the inference model (in the appropriate framework) | Per model | Per request |
| Latency (GPU & CPU) | Latency: queue time | Time a request spends waiting in the queue before being executed | Per model | Per request |

Triton Metrics: Autoscaling
Horizontal Pod Autoscaler
The HPA controller operates on the ratio between the desired metric value and the current metric value. The following equation returns the number of desired replicas:

  $R = \mathrm{ceil}\left(CR \cdot \frac{CV}{DV}\right)$

where R is the number of replicas that Kubernetes needs to have, CR is the current number of replicas, CV is the current metric value, and DV is the desired metric value.
When R is different from CR, the HPA increases or decreases the number of replicas by acting on the Kubernetes deployment (in our case the Triton deployment).
[Diagram: Horizontal Pod Autoscaler -> Deployment (Scale) -> Pod 1 ... Pod n]

Triton Metrics: Autoscaling
Custom Metrics
Here is the summary of what we need to deploy:
1. Prometheus operator and Prometheus
2. Horizontal Pod Autoscaler
3. Service Monitor
4. Custom Metrics

We want the HPA to perform autoscaling based on the following metric: the average time spent by each request in the queue, that is, the total queue time divided by Req, where Req is the total number of requests.
We need to express this equation using PromQL, the Prometheus query language, with the actual names of the metrics exposed by Triton:

  QT = delta(nv_inference_queue_duration_us[30s]) / delta(nv_inference_request_success[30s])

Triton Metrics: Autoscaling
Custom Metrics
(fields that are illegible in the source are shown as "...")

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: ...
      namespace: custom-metrics
    data:
      config.yaml: |
        rules:
        - seriesQuery: ...
          resources:
            overrides:
              namespace: {resource: "namespace"}
              ...
          name:
            matches: ...
            as: "avg_time_queue_ms"
          metricsQuery: avg(delta(nv_inference_queue_duration_us{<<.LabelMatchers>>}[30s]) / delta(nv_inference_request_success{<<.LabelMatchers>>}[30s]) / 1000) by (<<.GroupBy>>)

NVIDIA
Performance Analyzer: Optimization Guidance
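As a small worked example of the autoscaling slides, the sketch below evaluates the Horizontal Pod Autoscaler rule R = ceil(CR * CV / DV) and derives the average queue time per request from two samples of Triton's cumulative counters, mirroring the delta(...) / delta(...) query. The function names and the sample counter values are illustrative assumptions, not part of Triton or Kubernetes.

```python
import math


def desired_replicas(current_replicas: int, current_value: float, desired_value: float) -> int:
    """Return R = ceil(CR * CV / DV), the replica rule quoted in the slides."""
    if desired_value <= 0:
        raise ValueError("desired metric value must be positive")
    return math.ceil(current_replicas * current_value / desired_value)


def avg_queue_time_ms(queue_us_start: int, queue_us_end: int,
                      success_start: int, success_end: int) -> float:
    """Average queue time per request over a window, in milliseconds.

    The four arguments are two samples of the cumulative counters
    nv_inference_queue_duration_us and nv_inference_request_success.
    """
    requests = success_end - success_start
    if requests <= 0:
        return 0.0  # no completed requests in the window
    return (queue_us_end - queue_us_start) / requests / 1000.0


# Hypothetical samples taken 30 s apart: 3,000,000 us of queueing over 1,000 requests.
current = avg_queue_time_ms(1_000_000, 4_000_000, 2_000, 3_000)   # 3.0 ms per request
# With 2 replicas and a hypothetical target of 1.5 ms, the rule asks for 4 replicas.
print(current, desired_replicas(2, current, 1.5))
```

The real controller also applies a tolerance band and stabilization windows before it scales, so this is only the core arithmetic.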

Measuring Inference Performance
Triton includes a performance measurement tool called the perf analyzer (formerly "perf client")
- Measures throughput and latency under varying client loads
- Real or synthetic input tensor data and client loads
- HTTP/REST or GRPC APIs
- Complete feature coverage: shared memory, stateless and stateful models, batching, etc.
- Command-line tool w/ spreadsheet template
- Generates charts to help visualize the throughput vs latency tradeoffs

Perf_Analyzer
on GitHub repository and on NGC Triton Client Container
- The perf_analyzer helps you determine the ideal model configuration which maximizes performance based on specific constraints.
- The throughput and latency are measured over a time window, and the measurements are then repeated until they reach stable values.
- By default the perf_analyzer uses average latency to determine stability, but you can use the --percentile flag to stabilize results based on that confidence level.

    $ perf_analyzer -m flower -u 127.0.0.1:50050 -i http --concurrency-range 1:50 -p 5060

Sample output (several digits are damaged in the source and are reproduced as they appear):

    Settings
      Batch size: ...
      Measurement window: 5 msec
      Latency limit: 9 msec
      Concurrency limit: 108 concurrent requests
      Stabilizing using average latency

    Request concurrency: 6
      Client:
        Request count: 8387
        Throughput: 1677.4 infer/sec
        Avg latency: 3575 usec (standard deviation 248 usec)
        p50 latency: 3578 usec
        p90 latency: 3626 usec
        p95 latency: 4161 usec
        p99 latency: 4218 usec
        Avg HTTP time: 3569 usec (send/recv 146 usec + response wait 3423 usec)
      Server:
        Inference count: 10366
        Execution count: 16866
        Successful request count: 10866
        Avg request latency: 2223 usec (... infer 549 usec + compute output 12 usec)

Basic Optimization - Inference Schedulers
- Default model scheduler: Process 1 inference request at a time for each model
- If there are multiple models, each can have 1 inference request executing at any given time
- An inference request can be a batch (if the client creates the batch)
- Minimize latency
[Diagram: Batch-1 Request and Batch-4 Request -> Default Scheduler -> Model Backend / Framework Runtime]

Dynamic Batching Optimization
- Grouping requests into a single "batch" increases overall GPU throughput
- Process multiple inference requests at the same time for a model
- Individual requests are batched and executed together
[Diagram: Triton Inference Server with Dynamic Batcher -> Model Backend / Runtime / Contexts]

Dynamic Batchers
- Each model's scheduler is configured independently
- Dynamic batcher controls:
  - Preferred batch sizes
  - Maximum delay to hold an inference request to form a larger batch
  - Prioritization and timeout

    dynamic_batching { preferred_batch_size: [ 4, 8 ] }

Dynamic Batcher Results
Deep-recommender TensorRT model, Dynamic Batch Size 1-32
[Chart: throughput vs. p95 Latency (ms)]

Dynamic Batching
2.5X Faster Inferences/Second at a 50 ms End-to-End Server Latency Threshold
[Chart: Static vs Dynamic Batching (T4, TRT ResNet50 FP16, Instance 1)]
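To make the two dynamic batcher controls listed above concrete (a preferred batch size and a maximum delay for holding requests), here is a toy Python model of the grouping decision. It is a simplified sketch under stated assumptions, not Triton's scheduler: it ignores model execution time, priorities, and timeouts, and `form_batches`, `max_queue_delay_us`, and the arrival times are hypothetical names and data.

```python
from typing import List, Sequence


def form_batches(arrivals_us: Sequence[int],
                 preferred_batch_size: int,
                 max_queue_delay_us: int) -> List[List[int]]:
    """Group request indices into batches.

    A batch is dispatched as soon as it reaches preferred_batch_size, or when
    the next arrival shows that the oldest queued request has already waited
    max_queue_delay_us or longer.
    """
    batches: List[List[int]] = []
    pending: List[int] = []
    for idx, now in enumerate(arrivals_us):
        # Flush a partial batch whose oldest request has hit the delay limit.
        if pending and now - arrivals_us[pending[0]] >= max_queue_delay_us:
            batches.append(pending)
            pending = []
        pending.append(idx)
        # Dispatch immediately once the preferred size is reached.
        if len(pending) == preferred_batch_size:
            batches.append(pending)
            pending = []
    if pending:
        batches.append(pending)  # whatever is left at the end of the trace
    return batches


# Four requests arrive close together, two arrive later, one arrives much later.
arrivals = [0, 10, 20, 30, 500, 510, 2000]
print(form_batches(arrivals, preferred_batch_size=4, max_queue_delay_us=100))
# -> [[0, 1, 2, 3], [4, 5], [6]]
```

A burst fills a full batch of four, while sparse traffic is sent on in smaller batches once the delay budget is used up; this is the throughput-versus-latency trade-off the slides describe.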

- Triton Inference Server groups inference requests based on customer-defined metrics for optimal performance
- Customer defines 1) batch size (required) and 2) latency requirements (optional)
- Example: No dynamic batching (batch size 1 & 8) vs dynamic batching

Basic Optimization - Concurrent Model Executions
- By default Triton creates 1 copy of each model (on each available GPU)
- Each copy is known as an instance of the model
- Inference requests are scheduled independently to each model instance
- Using more than 1 instance increases throughput and reduces latency if:
  - GPU has sufficient memory to hold multiple copies of the model
  - GPU has sufficient compute to execute multiple inferences simultaneously
  - GPU / PCIe has sufficient bandwidth for simultaneous inferences
- Prioritization and timeout

    instance_group [ { count: 2 } ]

Concurrent Model Execution - ResNet50
4x Better Performance and Improved GPU Utilization Through Multiple Model Concurrency
Common Scenario 1: One API using multiple copies of the same model on a GPU
Example: 12 instances of TRT FP16 ResNet50 (each model takes 1.33 GB GPU memory) are loaded onto the GPU and can run concurrently on a 16GB V100 GPU. 14 concurrent inference requests happen: each model instance fulfils one request simultaneously, and 2 are queued in the per-model scheduler queues in Triton Inference Server to execute after the 12 requests finish. With this configuration, 2832 inferences per second at 33.94 ms with batch size 8 on each inference server instance is achieved.
[Diagram: Triton Inference Server on a V100 16GB GPU; inference requests are dispatched to RN50 Instance 1 through Instance 12, each on its own CUDA stream, plotted over time]

Multiple Model Instances Increases

Throughput
Deep-recommender TensorRT model, Dynamic Batch Size 1-32
[Chart: throughput vs. p95 Latency (ms) for 1 instance and 2 instances]

NVIDIA
Customer Case Studies

KINGSOFT CLOUD ADOPTS NVIDIA TRITON INFERENCE SERVER TO MAXIMIZE PERFORMANCE AT SCALE
- 15+ online AI computer vision services using Triton for inference serving
- 50% higher QPS per GPU with Triton
- 4-5x higher QPS with Triton + TensorRT/TVM
"Besides the increase of QPS and latency, we can now smoothly shift our service from the offline mode (which cares more about throughput) to online mode (which cares more about latency) thanks to Triton's scheduling + batching and TensorRT's dynamic batch support"
Kingsoft Cloud

TENCENT YOUTU INTEGRATES TNN BACKEND INTO NVIDIA TRITON INFERENCE SERVER TO STANDARDIZE INFERENCE AT SCALE
"In order to standardize inferencing across Tencent, Tencent Youtu developed a new open source high performing framework called TNN. Next, they chose Triton Inference Server for inference serving because of its product maturity and dynamic batching & concurrent model execution capabilities. Tencent easily customized Triton by integrating TNN as a Triton custom backend. TNN and Triton together help achieve standardized high performance inferencing for all developers building AI applications"
Tencent YouTu Lab

UNIVERSAL INFERENCE ENGINE
Naver is the #1 search engine and internet services company in South Korea. They use deep learning, built with multiple frameworks, to enrich and diversify query results.

Naver uses Triton Inference Server to accelerate the deployment of models in production. The platform supports multiple frameworks, batch and real-time inferencing, and inferencing on GPUs and CPUs. It helps Naver roll out new AI-based services faster and with lower development costs.

ANNOUNCING
MICROSOFT ADOPTS NVIDIA AI TO CREATE SMART EXPERIENCES IN MICROSOFT OFFICE
- Correct Grammar | Q&A | Predict Text
- 200 ms Inference Response for SOTA AI
- 1/3 Lower Cost Than CPU
- Azure ML & ONNX Runtime with Triton for Inference Serving
- Half a Trillion Queries per Year for Grammar
Link to Announcement: Microsoft Blog

TRITON CUSTOMER ADOPTION

| Company | Use Case | Workflows | Outcome |
|---|---|---|---|
| Tencent Youtu | Computer vision use cases, facial recognition, and biometrics | Integrated their highly performant TNN framework into Triton as a custom backend | Standardized high performance inferencing for all AI applications |
| Kingsoft Cloud | Facial recognition and human attributes recognition | Real-time online inferencing on Triton in the cloud | 50% higher QPS per GPU with Triton; 4-5x higher QPS with Triton + TensorRT/TVM |
| Microsoft Azure Machine Learning | Any ML/DL workflow | Real-time inference serving on Triton | Increased throughput by ~7x compared to a Flask end-to-end Python server |
| Microsoft Office | NLP models (BERT, Turing-NLG, etc.) | Real-time grammar checker running ... | Slashed costs by ~70% and achieved a throughput of 450 queries per second on a single V100 GPU, with more than half a trillion queries a year |
| American Express | Fraud detection on 8B credit card transactions | Used Triton to deploy a TensorRT-optimized Gated Recurrent Unit model to analyze tens of millions of daily transactions | Can operate within a 2 ms latency budget (a 50x improvement compared to CPUs that could not meet the latency requirement) |
| Naver | Search recommendations and image classification | Models in multiple frameworks (TensorFlow, PyTorch, Caffe, and TensorRT) on CPU and GPU | A single inference platform that allowed for faster rollout of new DL models from multiple frameworks |
| SPIL | Defect detection on 30,000 wafer images per day | Used Triton to deploy and manage DenseNet, Autoencoder, and UNET models in TensorRT | Can detect 100% of defects with 10% fals... (... 7x improvement), and scale to 100 diff... without changes to serving infrastructure |

Try Triton Inference Server
Download from GitHub or from Docker Registry
- Triton Inference Server GitHub: https:/
- NVIDIA GPU Cloud (NGC) - Docker Container: https:/ /
- Documentation: https:/ /
