Locating and Optimizing Service Mesh Performance Bottlenecks with eBPF


First China eBPF Symposium, 2022-11-12
Speaker: Chen Pengfei, Sun Yat-sen University

Outline: 01 Background. 02 Service mesh data-plane optimization. 03 FaaS data-plane optimization. 04 Outlook.

01 Background

Cloud-native systems:
- Continuous delivery: continuous development and delivery shorten the business go-to-market time.
- Containers: the fundamental enabling technology, speeding up how software systems are developed and deployed.
- DevOps: a new software development model that accelerates development.
- Microservices: small, focused software products that are easy to develop, deliver, and maintain.
- In martial arts, only speed is unbeatable; in software, only the fast win.

Kubernetes adoption worldwide, and its popularity (from the CNCF Annual Survey 2021). Kubernetes is an open-source container orchestration engine for automating the deployment, scaling, and management of containerized applications; the project is hosted by the CNCF. The Kubernetes architecture.

Why you might be interested: moving from a monolith to microservices raises the questions "How do you observe?", "How do you manage APIs?", and "How do you enforce security?"

Service mesh (the Istio architecture):
- Envoy: the data plane, responsible for packet filtering and forwarding.
- The component that manages and configures the deployed sidecars (Pilot), configuring traffic rules, fault recovery, retries, and circuit breaking.
- The core security component responsible for identity authentication and certificate management (Citadel).
- Configuration validation and rule distribution (Galley).

Serverless (FaaS): the Knative workflow and architecture. Knative is a typical serverless computing platform built on K8s and Istio that runs Functions at high concurrency in an event-triggered fashion.

Dapr (Distributed Application Runtime): an event-driven, portable runtime for building microservices in the cloud and at the edge.

eBPF: just as a "wormhole" in the universe provides a shortcut from one point in spacetime to another, eBPF is a wormhole between the operating system's user space and kernel space. eBPF is steadily making the OS kernel programmable, becoming the "Swiss Army knife" of cloud-native software systems: system security, system observability, network optimization, protocol parsing, load balancing, storage-system optimization, chaos engineering. What happens when eBPF meets Service Mesh?

02 Service mesh data-plane optimization

Performance problems:
- The K8s-based container management platform creates a multi-layer, complex virtual network; the added network-stack complexity lengthens end-to-end request handling time.
- Introducing a sidecar lengthens the path a request takes through the network protocol stack, adding latency; the linear search through iptables rules adds latency; frequent switches between user mode and kernel mode inside the sidecar add latency.

Performance profiling:
- Use the BookInfo application provided by Istio as the benchmark.
- Trace how a request executes and the system calls involved.
- Observe how request latency is distributed in the kernel.
- Conclusion: requests spend a long time in Envoy's network protocol stack.

From the paper "Tail Latency Optimization in Service Mesh Based on eBPF":

Figure 4: Latency distribution of requests in Bookinfo. Figure 5: The overall architecture of our system. Our optimization based on eBPF is shown in Figures 3(d) and 3(c). It shortens the communication between proxies and the communication between proxies and service instances, and thereby reduces tail latency.

5.2 Intra-pod Optimization

5.2.1 Socket redirection. The intra-pod optimization focuses on reducing packet-filtering time to reduce tail latency. eBPF offers hooks, and corresponding program types, at the socket layer. A SOCK_OPS program is attached to socket hooks such as connection establishment and retransmission timeout; when those operations happen, we can obtain the socket's information. An SK_MSG program is triggered when system calls such as sendmsg() or sendfile() are invoked, and socket redirection can be performed inside it. eBPF provides a helper function, bpf_msg_redirect_hash(), to achieve socket redirection. The helper requires a SOCKHASH map: it forwards data directly from the current socket to the peer socket indexed by a key stored in the SOCKHASH, without passing through the TCP/IP stack and iptables filtering rules. We use the current socket's 4-tuple as the key: source address, source port, destination address, and destination port.

Table 1: Established sockets during one request (the client in Pod 1 requests the server in Pod 2; Figures 3(a) and 3(b)).

socket  source (ip:port)      destination (ip:port)
(1)     172.20.2.132:35484    10.68.241.231:9080
(2)     127.0.0.1:15001       172.20.2.132:35484
(3)     172.20.2.132:52944    172.20.2.134:9080
(4)     172.20.2.134:15006    172.20.2.132:52944
(5)     127.0.0.1:36709       127.0.0.1:9080
(6)     127.0.0.1:9080        127.0.0.1:36709

5.2.2 Workflow after optimization. With the socket redirection provided by eBPF, we optimize process I in Figures 3(a) and 3(b) by attaching SOCK_OPS and SK_MSG programs on each node; the optimization is shown in Figure 3(f). The SOCK_OPS program is responsible for updating the SOCKHASH: when a socket sends a SYN, or the ACK for a SYN, the program obtains the current socket's 4-tuple as the key and the socket itself as the value, then calls bpf_sock_hash_update() to update the SOCKHASH.

Performance profiling (from the paper "Dissecting Service Mesh Overheads"):

Table 2: Contribution of different components to the overhead of a single sidecar instance in different protocol modes. The numbers include both inbound and outbound overheads; percentages are shares of the total.

Latency (µs):
Component         TCP          HTTP          gRPC
IPC               11.59 (30%)  12.75 (8%)    13.04 (7%)
Read              8.14 (16%)   9.01 (5%)     9.37 (5%)
Write             13.22 (34%)  13.80 (8%)    14.35 (7%)
Notification      1.33 (3%)    1.27 (1%)     1.35 (1%)
Protocol Parsing  -            117.35 (70%)  142.38 (73%)
Protocol Other    4.25 (11%)   13.07 (8%)    14.39 (7%)
Total             38.63        167.25        194.79

CPU usage (virtual cores):
Component         TCP          HTTP         gRPC
IPC               0.49 (15%)   0.51 (5%)    0.55 (4%)
Read              0.26 (8%)    0.29 (3%)    0.30 (2%)
Write             0.45 (14%)   0.48 (5%)    0.57 (4%)
Notification      0.26 (8%)    0.27 (3%)    0.26 (2%)
Protocol Parsing  -            6.00 (62%)   9.76 (71%)
Protocol Other    1.79 (55%)   2.09 (22%)   2.34 (17%)
Total             3.25         9.65         13.79

Table 3: Latency overhead of five Envoy filters. The percentage in parentheses denotes the additional overhead atop the baseline HTTP mode (without any filters).

Filter           Latency (µs)    Virtual cores
Fault Injection  5.74 (3.1%)     0.20 (1.9%)
Rate Limit       8.19 (4.5%)     0.21 (2.0%)
Tap              156.09 (85.0%)  2.95 (8.0%)
Lua              80.59 (43.9%)   3.18 (30.2%)
WebAssembly      26.30 (14.3%)   0.69 (6.6%)

The filters (some of which are no-ops) validate MeshInsight's assumption that filter overheads are additive. Five different filters are studied, covering all three ways to write an Envoy filter: 1) Fault Injection: a built-in C++ filter that helps test resilience to communication failures; 2) (Local) Rate Limit: a built-in C++ filter that rate-limits traffic to a service instance; 3) Tap (File): a built-in C++ filter that records traffic, configured to log to a file; 4) Lua: a custom no-op filter written as a Lua script; 5) WebAssembly: a custom no-op filter written as a WebAssembly module. These filters are added to Envoy configured in HTTP mode.

Table 3 shows the overhead of each filter inferred by MeshInsight under the same workload as the previous section (100-byte messages, 30K requests per second). Different filters have widely different overheads. The baseline overhead of a C++ filter is low, as evidenced by the low overhead of the Fault Injection and Rate Limit filters. Tap (file) has a high overhead because of its interaction with the file system. On the other hand, even no-op Lua or WebAssembly filters have substantial latency and CPU overheads, with Lua being 3x more expensive for latency and nearly 5x more expensive for CPU.

To study the composability of filters, five filter configurations are considered, each combining filter types differently: 1) CC: all three C++ filters; 2) CLW: the Lua and WebAssembly filters; 3) CCL: the Lua filter plus all three C++ filters; 4) CCW: the WebAssembly filter plus all three C++ filters; and 5) CCLW: all five filters. Figure 9 shows both the predicted and measured overheads of each combination, the measured overhead being latency and CPU usage with the filters minus that without them. Filter-combination overheads can be quite high when multiple expensive filters are employed (something developers must avoid), and MeshInsight's predictions, based on adding individual filter overheads on top of the base HTTP proxy overhead in Table 3, are quite accurate.
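To make the additive model concrete (the arithmetic here is ours, using the Table 2 and Table 3 numbers above, not a figure from the paper): the predicted latency overhead of the CCLW configuration is the HTTP baseline plus the sum of all five filter overheads, i.e. 167.25 + 5.74 + 8.19 + 156.09 + 80.59 + 26.30 ≈ 444 µs per request, which is the kind of prediction Figure 9 compares against the measured value.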

5.5 Impact of Message Size and Rate. Consistent with the modeling assumptions, overhead increases with each of these factors. Message size is varied from 100 bytes to 16 KB; the upper end is well beyond the maximum size directly profiled (4 KB). Message rate is varied from 10K to 50K requests per second. Latency: Figure 10 plots the latency overhead for the HTTP proxy without filters (the increase is similar for other protocols). Latency overhead increases slowly with message size: going from 100 bytes to 16 KB (a very large message), the latency overhead increases by 53 µs, which represents only about a 30% increase for HTTP. The presence of filters does not significantly change the impact of message size on latency, as most filters operate ...

From the paper "Dissecting Service Mesh Overheads": the latency and CPU overhead introduced by the sidecar, and the contribution of the sidecar's different components to latency under different protocols.

Solution (the slide repeats the Figure 4/5 captions and the Section 5.2 socket-redirection excerpt quoted above).

The basic idea of the optimization:
- Use the redirect capability of eBPF sock_ops/sk_msg to pass socket data directly between sockets within a node (a sketch follows below).
- Use TC/XDP redirect to forward directly between network interfaces across nodes.
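The following is a minimal sketch of that SOCK_OPS + SK_MSG pair, written in libbpf-style eBPF C. The map name, key layout, and byte-order normalization are illustrative rather than taken from the talk, and real code would also need IPv6 handling:

/* sockmap_redirect.bpf.c -- illustrative sketch of intra-node socket
 * redirection; names and key layout are assumptions, not the talk's code. */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

struct sock_key {
	__u32 sip;   /* source IPv4, network byte order */
	__u32 dip;   /* destination IPv4, network byte order */
	__u32 sport; /* source port, host byte order */
	__u32 dport; /* destination port, host byte order */
};

struct {
	__uint(type, BPF_MAP_TYPE_SOCKHASH);
	__uint(max_entries, 65536);
	__type(key, struct sock_key);
	__type(value, __u64);
} sock_hash SEC(".maps");

/* Runs on socket events; on connection establishment (active = SYN sent,
 * passive = SYN-ACK sent), store the socket under its 4-tuple. */
SEC("sockops")
int record_sock(struct bpf_sock_ops *skops)
{
	if (skops->op != BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB &&
	    skops->op != BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB)
		return 0;

	struct sock_key key = {
		.sip   = skops->local_ip4,
		.dip   = skops->remote_ip4,
		.sport = skops->local_port,             /* already host order */
		.dport = bpf_ntohl(skops->remote_port), /* network -> host */
	};
	bpf_sock_hash_update(skops, &sock_hash, &key, BPF_ANY);
	return 0;
}

/* Runs on sendmsg() for any socket in sock_hash: look up the peer socket
 * by the reversed 4-tuple and splice the data to it directly, skipping the
 * TCP/IP stack and iptables entirely. */
SEC("sk_msg")
int redirect_msg(struct sk_msg_md *msg)
{
	struct sock_key peer = {
		.sip   = msg->remote_ip4,
		.dip   = msg->local_ip4,
		.sport = bpf_ntohl(msg->remote_port),
		.dport = msg->local_port,
	};
	bpf_msg_redirect_hash(msg, &sock_hash, &peer, BPF_F_INGRESS);
	return SK_PASS;
}

char LICENSE[] SEC("license") = "GPL";

The sockops program would be attached to a cgroup (e.g., with bpftool cgroup attach), while the sk_msg program is attached to the sock_hash map itself (e.g., bpftool prog attach ... msg_verdict), so it only fires for sockets the sockops program has registered.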

Figure 3: The workflows of communication in nodes before and after optimization, eBPF, and socket redirection. (a) Pod communication within one node. (b) Pod communication across different nodes. (c) Optimized pod communication across nodes. (d) Optimized pod communication within one node. (e) The workflow of eBPF. (f) The workflow of socket redirection.

If the Server is not deployed on Node 1, the packet is forwarded by Flannel: encapsulated in UDP, it is forwarded to the physical NIC eth0 of Node 1, received by the physical NIC of Node 2, and then forwarded to the Flannel.1 vNIC, cni0, and finally veth2. Packets transmitted between containers and proxies in a pod must pass through the TCP/IP stack and iptables packet-filtering rules. iptables matches filtering rules along a chain, which has a certain impact on packet-transmission performance. In a service mesh, because of the proxies, packets transmitted between pods spend more time passing the TCP/IP stack and iptables chain rules, increasing latency.

4.2 Latency Analysis. To understand the latency distribution of packet transmission in Istio after the analysis in Section 4.1, we first attach custom eBPF programs to the kernel probes of the important system calls invoked when packets are sent and received, and then obtain the latency distribution of the Bookinfo application (Figure 4; the experimental environment and tools are introduced in Section 6). As Figure 4 shows, for cross-node requests the communication between pods consumes much of the latency, because packets must pass through the physical NIC, which takes more time than packets that only cross the bridge on the same node. Focusing on the highlighted area in the middle, proxies fetching packets from the physical NIC and Netfilter filtering have a great impact on latency. The remaining time-consuming regions of Figure 4 come from copying packets through socket buffers between user space and kernel space and from the time the server needs to process packets. Based on this observation, the unpredictable communication time between nodes, the queuing time during Netfilter filtering, and the packet-copying time all heavily impact tail latency.

5 Optimization. Based on the analysis in Section 4, reducing tail latency in a service mesh requires reducing the time spent on cross-node communication, packet filtering, and packet copying. eBPF allows us to accelerate packet transmission; optimizing Istio's data plane at the socket layer and at the traffic-control layer helps reduce tail latency.

5.1 Architecture. The overall architecture of this work is shown in Figure 5. The Optimizer uses eBPF to optimize Istio's data plane and accelerate packet transmission: socket redirection and redirection at the traffic-control layer optimize the communication between service instances and proxies and the communication between proxies. Benchmarks are then run with a load generator in Istio (Meshery [20]) to see the efficiency of the optimization; after load generation, a collector gathers the tail latency before and after optimization, plus CPU and memory usage, to evaluate how well the method works. The experimental results are analyzed in Section 6.

Slides: the network transmission stack of pod-proxy and proxy-proxy traffic on a single node; the single-node optimization method and its results.

Solution: single-node optimization via socket redirection.
- The sock_ops program is attached to the hook for socket connection establishment; it obtains the socket information and updates the socket hash.
- The sk_msg program is triggered when a socket present in the socket hash performs a sendmsg system call; it completes the socket redirection and forwards the data (see the sketch above).

Solution: cross-node optimization. The network transmission path of cross-node proxy-proxy traffic, and the result of optimizing it. A DPDK user-space protocol stack (kernel bypass) can also accelerate this path, as in NetEase's solution.

Solution summary:
- Socket redirection: optimizes communication between pods deployed on the same node.
- TC-layer redirection: optimizes communication between pods deployed on different nodes.
- Traffic Control (TC) is the Linux module responsible for traffic control: it establishes queueing disciplines on NIC devices, builds packet queues, and defines how queued packets are sent.
- The clsact qdisc type in TC serves as a hook point for attaching user-defined eBPF programs: Ingress handles incoming traffic, Egress handles outgoing traffic.
- A packet is redirected to another NIC device, achieving packet forwarding; this can alternatively be implemented with XDP redirect (a sketch follows below).
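A minimal sketch of the TC-layer redirect just described, again in libbpf-style eBPF C; the target interface index is a placeholder that a real deployment would resolve from routing or peer information:

/* tc_redirect.bpf.c -- illustrative sketch of forwarding a packet straight
 * to another NIC from a clsact hook, bypassing the bridge/iptables hops. */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* Placeholder: ifindex of the device to hand the packet to (e.g., eth0). */
const volatile __u32 target_ifindex = 2;

SEC("tc")
int redirect_to_nic(struct __sk_buff *skb)
{
	/* Queue the packet directly on the egress path of target_ifindex;
	 * bpf_redirect() returns TC_ACT_REDIRECT on success. The same helper
	 * exists for XDP programs, which is what the "replace with XDP
	 * redirect" variant on the slide refers to. */
	return bpf_redirect(target_ifindex, 0);
}

char LICENSE[] SEC("license") = "GPL";

It would be attached with the usual tc commands, e.g.: tc qdisc add dev veth0 clsact; tc filter add dev veth0 egress bpf da obj tc_redirect.bpf.o sec tc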

Test setup (from the paper's evaluation):

Figure 6: The architectures of the two benchmarks, Bookinfo (a) and Hipstershop (b). Two worker nodes with an Istio service mesh deployed host the Bookinfo application officially provided by Istio and the Hipster Shop application provided by Google.

6.1.1 Applications. Bookinfo shows information about a book, including its description, details such as the ISBN, and comments about it. It comprises four services: the Productpage service asks the Details and Reviews services to generate a web page; the Details service stores details of books; the Reviews service generates book reviews, and its versions v2 and v3 ask Ratings for book ratings. Hipster [15] simulates e-commerce services: when we access the Frontend, it asks Adservice, Currency, Cart, Recommendation, and Productcatalog for details to generate a web page. Figure 6 shows the deployment of those applications on nodes. Solid lines represent service invocations actually exercised in the benchmarks; dashed lines represent invocations that exist in Hipster but are not exercised.

6.1.2 Meshery. For different applications and infrastructures, users need to compare the behavior of different service meshes when selecting one to deploy. Meshery [20] is a reliable behavior-analysis tool for service meshes. It manages service meshes by interacting with them through corresponding adapters, whose configuration (type, name, exposed port, and so on) users can modify. Users can define their own benchmark configuration, such as the duration, concurrency, and type of load generator, through the Meshery UI, mesheryctl commands, or a script. Once users submit a benchmark configuration to the Meshery server, Meshery automatically runs the benchmark with the specified load generator, collects the data, and returns it to the user. With the help of Meshery, we run benchmarks in Istio to test the effectiveness of the optimization.

6.1.3 Service capacity. Before the experiments, we first need to find the service capacity of Bookinfo, so we test the P99 of service response time under different target QPS with different concurrency. As shown in Figure 7(a), the largest target QPS is 100 and an obvious turning point occurs at 70 QPS; we therefore treat target QPS above 70 as a high workload and anything below as a low workload.

6.2 Evaluation Under Low Workload. To assess the proposed optimization, benchmarks first run under a low workload for thirty seconds with a target of 50 QPS in four configurations: the original service mesh, intra-pod communication optimization only, inter-pod communication optimization only, and the combined optimization. The focus is the tail latency of the request response time. Figure 7(b) draws the average P99, with error bars reflecting the standard deviation; P99 represents the response time of 99% of the requests and accurately reflects tail latency. As seen in Figure 7(b), the inter-pod optimization performs similarly to the intra-pod optimization, and the combined optimization achieves the best result. For a packet, the intra-pod optimization removes two iptables filtering operations, while the inter-pod optimization eliminates iptables filtering when the packet is forwarded on the same node and bridge forwarding when it is forwarded across nodes. The optimization also works well when processing five concurrent users at a target of 50 QPS: it reduces tail latency to four-fifths of the original service mesh.

6.3 Evaluation Under High Workload. Benchmarks run under a high load with the target QPS set to 100. By default, the concurrency of the proxy in Istio is 2, meaning a proxy can handle only 2 requests concurrently; manually changing that value is not meaningful, since it is tied to the available CPU and memory resources. Figure 7(c) shows the tail latency at different connection counts for Bookinfo and Hipstershop, comparing our optimization with merbridge [2], which was also proposed to accelerate Istio. For Bookinfo, when the number of connections is less than twenty, the optimized tail latency of the proposed method beats merbridge: it cuts tail latency roughly in half at 1 connection, by 22% at 5 connections, and by 5% at 10 connections, whereas merbridge saves just 1 ms in those cases. When connections increase to 100 and 125, both achieve only a tiny improvement: with 100 connections our optimization makes tail latency 2% longer than the original service mesh while merbridge makes it 1% longer, and at 125 connections neither provides any optimization anymore. For Hipstershop, the results are similar: running benchmarks at a target of 30 QPS (more than it can serve), with 1 connection the optimization improves performance by about one fifth over the original service mesh while merbridge shows no improvement; as the number of connections increases to 10, our method shortens tail latency by 12% while merbridge shortens it by about 16%.

Benchmarks: BookInfo and Hipstershop. Test tool: Meshery. Test results:

Figure 7: The service capacity of Bookinfo and tail latency under different loads. (a) Service capacity of the Bookinfo application (P99 vs. target QPS for 1-125 connections). (b) Benchmark results under low load. (c) Benchmark results under high load (combined optimization vs. no optimization vs. merbridge, on Bookinfo and Hipstershop).

Figure 8: CPU and memory usage on the nodes when running a benchmark under low load. (a) Memory usage on nodes. (b) CPU utilization on Kubernetes-node1. (c) CPU utilization on Kubernetes-node2 (user and system time, with and without the combined optimization).

Figure 7(c) also shows that as connections increase, the tail-latency improvement of both our optimization and merbridge gradually shrinks. Because of the proxies' concurrency limitation, as more connections arrive, packets are more likely to accumulate, increasing queuing time and thus tail latency, which makes the number of connections have a great impact on the performance of both optimizations. Compared with merbridge, our method works better with few connections and performs similarly as connections grow.

6.4 Overhead. Here we run the benchmark introduced in Section 6.2 and evaluate the overhead introduced by our optimization, recording the workers' average CPU usage (Figures 8(b) and 8(c); the benchmark starts at the 2nd second and ends at the 30th). During the benchmark, the variation of CPU utilization in both kernel space and user space is almost identical before and after optimization; that is, the proposed method introduces no extra CPU overhead. For memory (Figure 8(a)), the improvement on Kubernetes-node1 is significant: during the benchmark, memory used without optimization is on average 2.827 MB more than before the benchmark, versus 1.59 MB with the combined optimization, about a 40% improvement. On Kubernetes-node2 the used memory is reduced a little. Overall the optimization reduces memory use by about 20% on average, and memory usage also changes more stably with the combined optimization. The memory saving comes from fewer packet copies between the network card and the Linux kernel: when a packet arrives at the network card and is accepted by the kernel, Linux stores the packet in its memory; since our optimization bypasses that transmission path, it reduces memory copying and thus the memory used.

6.5 Discussion. The experimental results above show the effectiveness of the optimization, but some things can be improved in future work: the experiments are based on Bookinfo and Hipstershop, and compared with existing large-scale microservice systems it is hard to know how far the results generalize.

Baseline tests and optimization tests (screenshots of the project README).

Load-balancer optimization, from "CRAB: a Connection Redirect LoAd Balancer" (SoCC '20, October 19-21, 2020, Virtual Event, USA; Marios Kogias, Rishabh Iyer, and Edouard Bugnion):

Internal load balancers must support rich, stateful scheduling policies while rapidly adjusting to changes in the service set, adding minimum latency overhead to the application, not creating I/O bottlenecks, and avoiding broken connections. State-of-the-art internal load balancers have benefited from recent innovation in protocol design specifically aimed at improving their scalability, including transport protocols other than TCP [33, 38, 45]. Such approaches, though, break backwards compatibility with existing applications, while TCP remains prevalent both for datacenter [2] and cloud communications. This problem statement differs from that of external load balancers, which must accept and filter standards-based traffic from the Internet, mostly deal with HTTP(S) traffic, and might also implement TLS termination.

Our approach bypasses the load balancer without regret: the load balancer is removed from the critical path as much as possible, offering close to direct-communication latencies, while the design still allows elaborate load-balancing policies that improve tail latency and react quickly to changes in the service set. We design CRAB, a Connection Redirect LoAd Balancer. CRAB depends on a new TCP option, included in the SYN and SYN-ACK packets, that enables traffic redirection. This allows CRAB to deal only with SYN packets and stay off the connection datapath, tremendously reducing the load-balancing load while still supporting complex policies that would otherwise require a stateful load-balancer implementation. CRAB's datapath can be easily implemented in a programmable switch or in software using kernel-bypass or kernel-based mechanisms. The CRAB support required in clients and servers is a modest change: it can be implemented in a kernel module with no measurable impact on performance, or as direct kernel modifications offered as pre-built images to cloud tenants. The evaluation demonstrates that CRAB outperforms L4 load balancers in added latency overhead, connection throughput, and load-balancing policies while being built on a simple stateless design.

Contributions:
- The design of a backward-compatible extension to RFC 793 [50] that enables TCP connection redirection.
- The design of the CRAB load balancer, which depends on the new connection-redirect feature of TCP and supports flexible scheduling policies.
- The implementation of the TCP connection-redirection option in the Linux kernel for both clients and servers, and four implementations of the load balancer using P4, DPDK, eBPF, and Netfilter.
- A discussion of the caveats, assumptions, and opportunities for CRAB in the public cloud, and the integration of CRAB for Kubernetes NodePort load balancing.

Load-balancing overhead (slide sketch: browsers reach the datacenter through an external LB to Web 1/Web 2, which use an internal LB to reach Back 1-3). Figure 2: Connection-Request-Response (CRR) and Request-Response (RR) latency benchmarks on Azure with accelerated networking, with and without an Azure internal load balancer. Direct latency is lower than the load-balanced scenario, and the latency overhead introduced by the load balancer is significant for both the RR and the CRR benchmarks: approximately 1 ms and 2 ms respectively. Such a large overhead can overshadow the cost of a non-balanced RPC. Given this significant overhead, the goal is to minimize internal cloud load-balancing latency to get close to direct communication, which requires understanding the underlying load-balancing mechanisms and policies.

2.1 Load Balancing Flavors. State-of-the-art approaches to load balancing for internal cloud workloads running on VMs or containers can be compared along these criteria:
- Load-balancing policy: centralized policies leverage a global view that includes every back-end server, while distributed policies make scheduling decisions based only on local state.
- Persistent Connection Consistency (PCC): can the load balancer route all packets of a connection to the same back-end server in the presence of server arrivals and failures?
- Expected load: how many packets per connection must the load balancer process?
- Latency overhead: how much overhead does the load balancer add?
- Updates: how quickly does the load balancer take scale-up (server-arrival) and scale-down (server-removal) events into account?

Layer 4 load balancing: L4 load balancers operate at the transport layer (TCP/UDP) of the networking stack and remain agnostic to the upper application layers. All public cloud providers offer some form of L4 load balancing; examples include Microsoft's Azure Load Balancer [8], which was used for the experiment in Figure 2, and Amazon's AWS Network Load Balancer [4]. Figure 3a describes the communication between client, load balancer, and back-end servers for an L4 load balancer: the load balancer listens on a virtual IP (VIP) that clients use to reach the service; the service runs on back-end servers listening on direct IPs (DIPs). The load balancer assigns each connection to a particular back-end server and performs address translation, modifying the destination IP (to the DIP) for packets sent by the client and the source IP (to the VIP) for packets sent by the server. All packets therefore pass through the load balancer, adding 1 RTT of latency to end-to-end client-server communication and reducing the load balancer's I/O scalability.

An optimization to the above is Direct Server Return (DSR): packets originating at the server are sent directly to the client without being routed through the load balancer. Servers are aware that they are being load balanced and rewrite the source IP of outgoing packets to the VIP using address-rewriting mechanisms such as tc [64]. DSR reduces the load balancer's load, since it now processes only client packets, and reduces the latency overhead to 0.5 RTT (Figure 3b).

There has been significant research [3, 9, 16, 24, 35, 42, 43, 47-49] on L4 load balancers, split into two categories by whether they store per-connection state. Stateless load balancers [3, 47] typically depend on some form of consistent hashing [34] and daisy chaining to ensure that packets with the same 5-tuple are always forwarded to the same DIP. Eschewing per-connection state brings better performance and scalability, but with two caveats: policies are limited to hashing, namely random load balancing, which leads to imbalance when connections are skewed; and despite daisy chaining, corner cases during server arrival and removal still lead to PCC violations [9]. Stateful load balancers maintain per-connection state to correctly route each packet received from the client, and can also maintain state about each back-end server to support more elaborate policies such as Join-Shortest-Queue or Power-of-Two [44], which cannot be implemented on a stateless load balancer. Per-connection state eliminates PCC violations, but the state lookups can become a bottleneck when the number of active connections is large.

Figure 3: Commonly used load-balancing schemes for cloud services based on VMs or containers ((b) L4 load balancing with DSR; (c) DNS; (d) agent-based load balancing).

Table 1: Feature comparison between different deployed load-balancing schemes and CRAB.

Method       Policy       PCC violations  Expected load            Latency overhead                       Updates
L4           Central      Possible*       every packet             1 RTT for every RTT                    Fast
L4 w/ DSR    Central      Possible*       one-way packets          1/2 RTT for every RTT                  Fast
L7           Central      None            every packet             1 RTT for every RTT                    Fast
DNS          Central      None            1 RPC every few conns.   up to 1 RTT per connection             Slow
Local Agent  Distributed  None            every packet             none                                   Slow
CRAB         Central      None            SYN packets              1/2 RTT per connection establishment   Fast
(* in stateless L4 load balancers)

L7 load balancing: L7 load balancers, or reverse proxies, operate at the application layer. They terminate client connections and open new connections to the back-end servers; Figure 3a could also describe an L7 scheme, since all received and transmitted packets pass through the load balancer, but arrows (1),(4) and (2),(3) would belong to different TCP connections. Popular open-source L7 load balancers include NGINX [46] and HAProxy [27]; cloud providers also offer such services, e.g., Amazon's AWS ALB [4]. L7 load balancers are typically centralized. Terminating client connections and establishing new ones with back-end servers lets them avoid PCC violations, and operating at the application layer lets them understand L7 protocols such as HTTP, enabling fine-grained request-level rather than coarse-grained connection-level load balancing. However, they depend on complicated software typically running in user space, with the corresponding performance implications, in particular a considerable increase in latency overhead (illustrated in Section 5.2).

DNS load balancing: another form of load balancing, used both on the public Internet and by container orchestrators such as Docker Swarm [53] and Mesos [30]. It relies on the fact that most clients use the first IP address they receive for a domain after DNS resolution: the DNS server returns the address list in a different order for each new client (round robin), so different clients direct requests to different servers, effectively distributing load across the server group. Figure 3c describes the client, server, and DNS-server interactions; steps (a)-(b) can be performed once for several connections (1)-(2). DNS load balancing, while centralized, is extremely coarse-grained, balancing only at per-client granularity. Moreover, to avoid repeated resolution and to reduce load on the DNS server, clients cache DNS entries; once an entry is cached, clients and servers talk directly, so load imbalances cannot be mitigated until the entry expires, and removing servers from the back-end pool becomes challenging and slow, since administrators must wait until every possible TTL for the associated entries has expired. DNS load balancing does not suffer from PCC violations, since clients and servers communicate directly.

Local load-balancing agent: this scheme is used in Kubernetes [39] (kube-proxy): in a Kubernetes cluster, every node that runs networked containers also runs a local agent.

Slides: load balancing inside the cluster and its overhead; kube-proxy; the K8s architecture. Load-balancing optimization:

Figure 5 (CRAB): a load-balanced TCP handshake with and without connection redirection; blue boxes correspond to IP headers, red boxes to TCP headers.
(a) TCP handshake over an L4 load balancer with DSR: (1) the client (CIP) sends SYN to the LB (VIP): SRC:CIP DST:VIP, SPORT:5347 DPORT:8080; (2) the LB forwards the SYN to the server (DIP): SRC:CIP DST:DIP; (3) the server replies SYN-ACK directly to the client with the VIP as source: SRC:VIP DST:CIP, SPORT:8080 DPORT:5347; (4) the client's ACK again goes to the VIP: SRC:CIP DST:VIP; (5) the LB forwards the ACK to the server: SRC:CIP DST:DIP.
(b) TCP handshake with connection redirection over CRAB: (1) the client sends SYN to the VIP with REDIR_OPT:ON; (2) the LB forwards the SYN to the server with REDIR_OPT:VIP; (3) the server replies SYN-ACK directly to the client from its DIP, echoing REDIR_OPT:VIP; (4) the client sends its ACK directly to the DIP: SRC:CIP DST:DIP. The optimized load-balancing scheme thus bypasses the LB after the initial SYN.
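To make the SYN-only datapath concrete, here is an illustrative TC eBPF sketch of the load-balancer side (our own reconstruction, not CRAB's code): it touches nothing but SYNs, hashes the client onto a placeholder backend pool, and rewrites VIP -> DIP, leaving routing of the rewritten packet to the kernel. Injecting the Connection Redirect TCP option into the forwarded SYN, which is what lets the server answer the client directly, is omitted for brevity:

/* crab_style_syn_lb.bpf.c -- illustrative SYN-only L4 datapath sketch. */
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <linux/tcp.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

/* Placeholder backend DIPs (10.0.0.1, 10.0.0.2), network byte order. */
static const __be32 backends[2] = {
	__bpf_constant_htonl(0x0a000001),
	__bpf_constant_htonl(0x0a000002),
};

SEC("tc")
int syn_only_lb(struct __sk_buff *skb)
{
	void *data = (void *)(long)skb->data;
	void *data_end = (void *)(long)skb->data_end;

	struct ethhdr *eth = data;
	if ((void *)(eth + 1) > data_end || eth->h_proto != bpf_htons(ETH_P_IP))
		return TC_ACT_OK;

	struct iphdr *ip = (void *)(eth + 1);
	if ((void *)(ip + 1) > data_end || ip->protocol != IPPROTO_TCP ||
	    ip->ihl < 5)
		return TC_ACT_OK;

	__u32 l4_off = ETH_HLEN + ip->ihl * 4;
	struct tcphdr *tcp = data + l4_off;
	if ((void *)(tcp + 1) > data_end)
		return TC_ACT_OK;

	/* The balancer only has to see SYNs; every other packet of the
	 * connection already flows between client and backend directly. */
	if (!tcp->syn || tcp->ack)
		return TC_ACT_OK;

	__be32 vip = ip->daddr;
	__u32 idx = (bpf_ntohl(ip->saddr) ^ bpf_ntohs(tcp->source)) & 1;
	__be32 dip = backends[idx];

	/* daddr is part of the TCP pseudo-header, so patch both checksums,
	 * then store the new destination address (fixed header offsets). */
	bpf_l4_csum_replace(skb, l4_off + 16, vip, dip,
			    BPF_F_PSEUDO_HDR | sizeof(dip)); /* tcp->check */
	bpf_l3_csum_replace(skb, ETH_HLEN + 10, vip, dip, sizeof(dip));
	bpf_skb_store_bytes(skb, ETH_HLEN + 16, &dip, sizeof(dip), 0);
	return TC_ACT_OK;
}

char LICENSE[] SEC("license") = "GPL";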

Figure 11: load balancing 48 single-core servers running a synthetic service-time application with S = 1 ms. The goal of the experiment is to validate whether CRAB can realize the benefits of the elaborate policies shown in Section 2.2. Figures 11 and 12 plot the tail-latency vs. throughput curves for the CPU-bound and I/O-bound applications respectively (Figure 12: load balancing 48 NGINX servers serving an 8 kB static file). For both application classes, despite all three policies achieving the same throughput, CRAB Round-Robin achieves significantly lower tail latency. For application profiles with low service-time dispersion, the Round-Robin policy picks the least-loaded server and forwards requests to it without requiring explicit communication between the load balancer and the server. Thus CRAB, in addition to eliminating I/O bottlenecks and reducing communication latencies, supports elaborate load-balancing policies, truly achieving the best of all worlds.

6 Discussion.
- Port redirection: the existing CRAB implementation considers connection redirection based only on the target IP, assuming the load-balanced service always runs on the same port on the back-end servers. The mechanism can easily be extended to also modify the target port, if desirable, by including both the IP and the port in the Redirection Option.
- Mechanism placement: connection redirection is implemented as part of the Linux kernel under two deployment models: (1) if the kernel patch goes upstream, newer kernel versions support it; (2) if not, cloud providers can offer VM images with the modified kernel that tenants can use. These assumptions are not fundamental to CRAB: its advantages can be retained with alternative placements of connection redirection that client and server kernels remain agnostic to. Cloud providers implement engines either in software [13, 20, 41] or in hardware [21] that accelerate their virtual networking infrastructure by applying address-translation rules and encapsulating/decapsulating packets; connection redirection can be supported by those engines instead of the guest kernels. On the client side, receiving a SYN-ACK with the Connection Redirect option creates two new rules performing source NAT for received packets and destination NAT for transmitted packets: the engine overwrites the DIP with the VIP from the received option for incoming packets, and vice versa for transmitted packets. The server side creates a short-lived rule on receiving a SYN with the redirection option, to echo the option in the outgoing SYN-ACK. The downside of such an implementation is packet modification on the critical path, which can incur performance overheads in a software-based stack. Still, despite the similarity to the agent-based load balancing of Section 2, supporting CRAB in the host infrastructure lets guests benefit from centralized load-balancing policies and easy, fast updates to the server pool.
- Alternative transports: while the focus is TCP, CRAB's core ideas apply to other connection-oriented transport protocols, in particular QUIC [40], a low-latency transport originally designed for HTTPS traffic. QUIC runs over UDP but retains the notion of a connection established between client and server after a handshake (with 0-RTT establishment for endpoints that have communicated in the past). After the initial handshake, the connection is associated with a ConnectionID that defines it; load balancers use this ConnectionID to forward packets of the same connection to the correct back-end server [40], and ConnectionIDs also enable seamless connectivity during endpoint migrations (address changes).

Performance after the optimization.

03 FaaS data-plane optimization

Knative FaaS platform: the existing approach and its limitations. Building blocks of a serverless edge cloud for IoT: an ingress gateway; function pods, each with a queue proxy and a user container; an autoscaler; a service mesh; IoT devices, a protocol adaptor, and a broker; and a metrics server that scrapes metrics and feeds scaling decisions. These components are expensive and constantly running, so resource usage must be watched closely.

Auditing the overheads of serverless computing (Knative): the processing involved in a typical serverless function chain (network protocol processing, copies, interrupts, context switches, etc.) abounds. The data path runs: physical NIC -> kernel protocol stack -> broker/front-end container -> kernel protocol stack -> veth pair -> ingress gateway container -> kernel protocol stack -> veth pair -> Function 1's pod (user container + sidecar proxy) -> kernel protocol stack -> veth pair -> Function 2's pod (user container + sidecar proxy), crossing user space and kernel space repeatedly. (From "SPRIGHT: Extracting the Server from Serverless Computing! High-Performance eBPF-based Event-driven, Shared-Memory Processing", SIGCOMM 2022.)

Overhead audit of the data pipelines (per-pipeline counts, with subtotals for the two external pipelines and the three within-chain pipelines; the grouping of pipelines is reconstructed from the flattened table):

                        External        Within chain      Total
                        #1  #2  total   #3  #4  #5  total
# of copies             1   2   3       4   4   4   12    15
# of ctxt switches      1   2   3       4   4   4   12    15
# of irqs               3   4   7       6   6   6   18    25
# of proto. processing  1   2   3       3   3   3   9     12
# of serialization      1   1   2       2   2   2   6     8
# of deserialization    0   1   1       2   2   2   6     7

Key takeaways:
- Takeaway #1: excessive data copies, context switches, and interrupts.
- Takeaway #2: excessive, duplicate protocol processing.
- Takeaway #3: unnecessary serialization/deserialization.
- Takeaway #4: individual, constantly-running heavyweight components.

Extra load caused by the sidecar: having a sidecar proxy results in a 3x reduction in throughput, 3x higher latency, and a significant increase in CPU cycles per request. In the CPU-overhead breakdown, 50% of CPU cycles are consumed by the kernel stack on behalf of the sidecar proxy.

System implementation: the overall architecture. "Extracting the server out of serverless computing": the design combines an eBPF-based event-driven capability with shared-memory processing.
- Optimization #1: event-driven, shared-memory function chain processing.
- Optimization #2: Direct Function Routing (DFR).
- Optimization #3: event-driven proxy.
- Optimization #4: eBPF-based data-plane acceleration for external communication.
- Optimization #5: event-driven protocol adaptation (e.g., IoT).
Data plane and control plane: a routing controller, metric server, and autoscaler form the control plane; an ingress gateway, the SPRIGHT gateway with its EPROXY, and function pods (user containers with SPROXYs) sharing memory form the data plane, wired together by metric, descriptor, packet, and routing-update flows, eBPF programs, the function-chain configuration, and a routing table.

Design details, E/S-PROXY: in Knative, a queue proxy runs as an additional container in the function pod, distinct from the user container, handling buffering, metrics collection, and health checks. Existing sidecar proxy designs are too heavyweight, so SPRIGHT builds a lightweight, event-driven, eBPF-based E/S-PROXY instead:
- Buffering/queueing is offloaded to shared memory.
- eBPF programs are used for metrics collection: a metrics map built from eBPF maps stores the collected metrics, and a user-space metrics agent can access the map to report metrics to the control plane (see the sketch below); internal event-driven metrics-collection hooks inside the gateway provide fine-grained L7 metrics as an enhancement of the EPROXY.
- Health checking is offloaded to the kubelet.
- There is no overhead when there are no events (strictly load-proportional), much less overhead when handling events, and a resource saving compared with the current sidecar design.
The SPRIGHT gateway pod contains the gateway container with its veth and TX/RX sockets, the eBPF maps (metrics map and socket map), and the EPROXY; function pods contain the user container, its socket, and an SPROXY. Payloads stay in shared memory while only descriptors are delivered between pods (read/write with descriptor, with socket-map lookups steering the delivery). External packets enter via the physical NIC and the broker and ingress-gateway pods, whose veth-host interfaces carry XDP/TC programs that consult the kernel FIB table for route lookups: an eBPF-based gateway and an eBPF-based proxy.
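A minimal sketch of the metrics-map idea (our illustration, not SPRIGHT's code): an sk_msg hook bumps a per-CPU counter on every message event, and a user-space agent reads the map on its own schedule, so no proxy thread has to run when there is no traffic:

/* metrics_map.bpf.c -- illustrative event-driven metrics collection. */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
	__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
	__uint(max_entries, 1);
	__type(key, __u32);
	__type(value, __u64);
} metrics_map SEC(".maps");

SEC("sk_msg")
int count_event(struct sk_msg_md *msg)
{
	__u32 key = 0;
	__u64 *cnt = bpf_map_lookup_elem(&metrics_map, &key);

	if (cnt)
		(*cnt)++;   /* per-CPU slot, so a plain increment is safe */
	return SK_PASS; /* metrics only; never interfere with delivery */
}

char LICENSE[] SEC("license") = "GPL";

The user-space metrics agent would sum the per-CPU slots (e.g., via bpf_map_lookup_elem() through libbpf) and report the event rate to the autoscaler, which is what makes the proxy strictly load-proportional.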

Performance with realistic workloads: the Online Boutique application. For the different alternatives, different concurrency levels (numbers of concurrent users) are configured at the load generator: Knative at 4K concurrency (stopping at 4K since it is the maximum load Knative can handle); the event-driven shared-memory design (SKMSG) and the polling-based shared-memory design (DPDK) at 12K concurrency. Throughput and latency: Knative's RPS is highly variable over time (around 890 req/sec), while both DPDK and SKMSG maintain a stable RPS of about 2600 req/sec, 3x higher than Knative. Knative shows clear overload behavior: e.g., from 50 s to 72 s, response time increases significantly due to heavy queueing at Knative's gateway, producing a large tail latency. Shared-memory processing reduces communication overhead within function chains, achieving better RPS and latency than Knative even at much higher traffic load. Test results figure: requests per second over time for DPDK, SKMSG, and Knative.

04 Outlook

- Sink part of the Service Mesh data plane into the kernel with eBPF, including request forwarding, load balancing, and observability.
- Combine eBPF with a proxy (Envoy) to implement rich service governance such as L7 routing, canary releases, and fault injection.
- Istio ambient mesh is a sidecar-less data plane for Istio, aimed at lowering infrastructure cost and improving performance.

Thanks! 2022-11-12
