FlexTOE: Flexible TCP Offload with Fine-Grained Parallelism
Rajath Shashidhara (1), Tim Stamler (2), Antoine Kaufmann (3), Simon Peter (1)
(1) University of Washington, (2) UT Austin, (3) MPI-SWS
San Jose, CA, April 26-28, 2022

High CPU Overhead of TCP
- TCP remains the default protocol in the datacenter
- But TCP stacks have high CPU overhead
- Even with modern optimized stacks (TAS, Snap, ...)
[Figure: CPU profile of Memcached with 32B requests/responses; annotation: "only 26%"]
To go further, we need to offload

Need for Flexible TCP Offload
- Flexibility: datacenter networks evolve rapidly
- Operators need flexibility for agile development
- Existing TOEs are hardwired: slow upgrade cycles
[Figure: CPU profile of Memcached with Chelsio Terminator TOE; annotation: "only 16%"]

TCP Offload: Can we get flexibility and performance?

FlexTOE: Flexible, High-Performance TCP Offload
- Eliminates all host TCP stack overheads
- Supports POSIX sockets, DCTCP/Timely congestion control
- Fully extensible (software development velocity), with eBPF support
[Figure annotation: "53%"]

TCP Offload to SmartNICs: Challenges
SmartNICs are flexible but restrictive:
- E.g.: Netronome Agilio, Mellanox BlueField, Pensando DSC, Fungible DPU, ...
- Parallel architectures geared towards stateless offloads
- Many wimpy cores with limited memories
TCP connections are processed sequentially:
- Stateful code paths track in-flight segments
- Stringent per-packet time budgets
- Sensitive to reordering
Traditional TCP stacks perform poorly on SmartNICs

FlexTOE: Flexible, High-Performance TCP Offload with Fine-grained Parallelism
To provide high performance and flexibility, FlexTOE leverages:
- Modularity: fine-grained modules keep private state and communicate explicitly
- Fine-grained parallelism: modules may be replicated, sharded, and executed out of order
- One-shot data-path offload: payload is never buffered on the NIC

FlexTOE Flexibility: XDP
Supports eXpress Data Path (XDP) modules implemented in eBPF
- Operate on raw packets
- Shared state via BPF maps
Implemented common datacenter features
- Tracing, statistics & profiling
- Connection firewalling
- VLAN encapsulation/decapsulation
- tcpdump
AccelTCP's (NSDI '20) connection splicing in 24 lines of eBPF at NIC line rate!
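
To make the XDP slide concrete, here is a minimal sketch of what a connection-firewalling module does: it inspects a raw packet, consults shared state, and returns a verdict. It is written in Python as a model of the logic only, not as eBPF and not as FlexTOE's actual module. The names `blocked_flows`, `stats`, `parse_tcp_flow`, `firewall_module`, and `make_frame` are hypothetical; a `set` and a `dict` stand in for BPF maps. The verdict constants use the same numeric values as the Linux kernel's `enum xdp_action`.

```python
import struct

# Verdicts, using the same numeric values as the kernel's enum xdp_action.
XDP_DROP, XDP_PASS = 1, 2

ETH_HLEN = 14
ETHERTYPE_IPV4 = 0x0800
IPPROTO_TCP = 6

# Hypothetical stand-ins for BPF maps: state shared between the module
# and whoever configures it (for example, a control plane).
blocked_flows = set()              # {(src_ip, dst_ip, src_port, dst_port)}
stats = {"pass": 0, "drop": 0}     # per-verdict packet counters


def parse_tcp_flow(pkt: bytes):
    """Return the TCP 4-tuple of an Ethernet/IPv4/TCP frame, or None."""
    if len(pkt) < ETH_HLEN + 20:
        return None
    (ethertype,) = struct.unpack_from("!H", pkt, 12)
    if ethertype != ETHERTYPE_IPV4:
        return None
    ihl = (pkt[ETH_HLEN] & 0x0F) * 4          # IPv4 header length in bytes
    if ihl < 20 or pkt[ETH_HLEN + 9] != IPPROTO_TCP:
        return None
    l4 = ETH_HLEN + ihl                        # start of the TCP header
    if len(pkt) < l4 + 4:
        return None
    src_ip, dst_ip = struct.unpack_from("!4s4s", pkt, ETH_HLEN + 12)
    src_port, dst_port = struct.unpack_from("!HH", pkt, l4)
    return (src_ip, dst_ip, src_port, dst_port)


def firewall_module(pkt: bytes) -> int:
    """Drop packets of blocked connections, pass everything else."""
    flow = parse_tcp_flow(pkt)
    if flow is not None and flow in blocked_flows:
        stats["drop"] += 1
        return XDP_DROP
    stats["pass"] += 1
    return XDP_PASS


def make_frame(src_ip: bytes, dst_ip: bytes, src_port: int, dst_port: int) -> bytes:
    """Build a minimal Ethernet/IPv4/TCP frame (zero checksums) for experiments."""
    eth = bytes(12) + struct.pack("!H", ETHERTYPE_IPV4)
    ip = struct.pack("!BBHHHBBH4s4s", 0x45, 0, 40, 0, 0, 64,
                     IPPROTO_TCP, 0, src_ip, dst_ip)
    tcp = struct.pack("!HHIIBBHHH", src_port, dst_port, 0, 0,
                      5 << 4, 0x10, 65535, 0, 0)
    return eth + ip + tcp
```

Adding a 4-tuple to `blocked_flows` and calling `firewall_module(make_frame(...))` with the same tuple yields `XDP_DROP`; frames that are not IPv4/TCP or are too short fall through to `XDP_PASS`.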
FlexTOE Offload Architecture
- Data-path: per-packet transport logic for established connections
- Control-plane: policy, management and infrequent recovery code paths
- libTOE library: provides POSIX sockets to the application with kernel bypass

Parallelizing the TCP Data-path for Offload
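
The slides state that modules keep private state and may be replicated and sharded, but not how a connection is mapped to a replica. As a sketch under that caveat, the following assumes a CRC32 hash over the TCP 4-tuple selects one of a fixed number of replicas, so all segments of a connection reach the same replica and its state never needs to be shared. `NUM_FLOW_GROUPS`, `flow_group`, `FlowGroup`, and `dispatch` are hypothetical names, not FlexTOE identifiers.

```python
import struct
import zlib

NUM_FLOW_GROUPS = 4  # hypothetical number of replicated pipeline instances


def flow_group(src_ip: bytes, dst_ip: bytes, src_port: int, dst_port: int) -> int:
    """Map a TCP 4-tuple to a replica index (assumed policy: CRC32 modulo N)."""
    key = src_ip + dst_ip + struct.pack("!HH", src_port, dst_port)
    return zlib.crc32(key) % NUM_FLOW_GROUPS


class FlowGroup:
    """One replica: owns the state of the connections sharded to it."""

    def __init__(self):
        self.conn_state = {}  # private per-connection state, never shared

    def on_segment(self, four_tuple, payload_len: int) -> dict:
        st = self.conn_state.setdefault(four_tuple, {"bytes": 0, "segments": 0})
        st["bytes"] += payload_len
        st["segments"] += 1
        return st


groups = [FlowGroup() for _ in range(NUM_FLOW_GROUPS)]


def dispatch(four_tuple, payload_len: int):
    """Route a segment to the replica that owns its connection."""
    g = flow_group(*four_tuple)
    return g, groups[g].on_segment(four_tuple, payload_len)
```

Sharding of this kind only exploits parallelism across connections. Parallelism within a single connection needs the pipelined, out-of-order module execution that the slides describe separately.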
Parallel TCP Processing Example: Transmit (TX)
[Figure sequence; step annotations: "enters the pipeline first", "assign sequence number", "stall", "out-of-order transmit"]
- TCP requires in-order processing for loss detection, but data-parallel modules have varying processing times and may reorder segments
- FlexTOE: assign sequence numbers on data-path ingress, reorder segments on egress

Evaluation

Evaluation Setup
Intel Xeon Gold 6138 CPU, 20 cores, 2 GHz, with 40 GB RAM
Compare:
- FlexTOE (flexible offload) on Netronome Agilio CX40 SmartNIC, 40 Gbps
- Linux (in-kernel stack): Intel XL710, 40 Gbps
- TAS (kernel-bypass): Intel XL710, 40 Gbps
- Chelsio TOE (inflexible offload): Terminator 6, 100 Gbps
Identical application binaries across all baselines.

Benefits of Offload: Throughput Scalability
[Figure: Memcached throughput, varying number of server cores]
- FlexTOE saves up to 81% of CPU cycles versus Chelsio and 50% versus TAS
- Offloaded CPU cycles may be used for application work

Benefits of Offload: Low Tail Latency
[Figure: Memcached latency distribution across different stack combinations]
- FlexTOE achieves the lowest median and tail latencies
- Offload provides excellent performance isolation

Is Fine-grained Parallelism Necessary?
- Exploiting both intra- and inter-connection parallelism is necessary
[Figure annotations: "120x", "286x"]

Data-path Parallelism: Does it Generalize across Platforms?
- Single-connection speedup of 4x on BlueField (and 2.4x on x86)

FlexTOE: High-performance and Flexible TCP Offload
- Eliminates all host TCP stack overheads to save CPU cycles for the application
- Data-path parallelism via fine-grained modules with out-of-order processing
- Easily extensible with full user-space programmability: tcpdump with packet filtering, VLAN encap/decap, firewall, connection splicing
FlexTOE is open-source: https:/
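
The transmit example earlier in the deck (sequence numbers assigned on data-path ingress, segments reordered on egress) can be sketched as a stamp-then-reorder pair. This is an illustrative model, not FlexTOE's code: `Sequencer` and `ReorderBuffer` are hypothetical names, and the number stamped here is a pipeline-local ordering number; the slides do not say whether it coincides with the TCP sequence number.

```python
import heapq


class Sequencer:
    """Data-path ingress: stamp each segment with its arrival order."""

    def __init__(self):
        self.next_seq = 0

    def stamp(self, segment):
        seq = self.next_seq
        self.next_seq += 1
        return seq, segment


class ReorderBuffer:
    """Data-path egress: release segments strictly in stamped order."""

    def __init__(self):
        self.expected = 0   # next ordering number allowed to leave
        self.pending = []   # min-heap of (seq, segment) that finished early

    def push(self, seq, segment):
        """Accept a finished segment; return those that may now be sent."""
        heapq.heappush(self.pending, (seq, segment))
        released = []
        while self.pending and self.pending[0][0] == self.expected:
            released.append(heapq.heappop(self.pending)[1])
            self.expected += 1
        return released


# Usage: three segments enter in order A, B, C, but the parallel
# stages in between finish them in the order C, A, B.
ingress = Sequencer()
egress = ReorderBuffer()
a, b, c = (ingress.stamp(s) for s in ("A", "B", "C"))
out = []
for seq, seg in (c, a, b):
    out.extend(egress.push(seq, seg))
print(out)  # ['A', 'B', 'C']
```

The stages between ingress and egress are free to run in parallel and finish in any order; only the egress buffer waits, and only for the oldest outstanding segment.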