Rearchitecting the TCP Stack for I/O-Offloaded Content Delivery
Taehyun Kim, Deondre Martin Ng, Junzhi Gong*, Youngjin Kwon, Minlan Yu*, KyoungSoo Park
KAIST & *Harvard University

Slide 2: Increasing Demand for High-quality Video Streaming
- Global video traffic keeps growing in exabytes per month, 2017-2022 (Cisco whitepaper 2022)
- The COVID-19 pandemic (2020) drove an even more rapid increase in video traffic
- CDN server performance is critical for a cost-effective service

Slide 3: Computing Hardware Development Trend
- A content server's job is simple: read() video from disk and send() it to users via the NIC
- NIC: 1GbE (2005) to 400GbE (2022), a 400x improvement
- Disk: 150-200 IOPS (2005) to a few million IOPS (2022), roughly a 10^4x improvement
- CPU: 2 cores (2005) to 64 cores (2022), only a 32x improvement
- Rapid performance improvement in IO devices, but the demise of Moore's law for CPU advancement (since 2006)
Slide 4: CPU Consumption for an HTTP Video Streaming Server
- Typical server operations: read an HTTP request, read a file chunk for the request, send the response
- Benchmark setting: nginx (v.1.80.0) on Linux, 300KB files on 4x Optane NVMe, 100Gbps NIC, a single core of a Xeon Silver 4210
- Observation 1: CPU cycles are 100% utilized
- Observation 2: the data plane dominates CPU consumption
  - Data plane (disk IO, memory management, network IO): 72%
  - Control plane (application, TCP control logic, open()/fstat()): 28%
- Why is the CPU the bottleneck for an IO-bound workload?

Slide 5: Root Cause: "CPU-centric" OS Abstractions
- Modern OSes were designed with implicit assumptions: the CPU is fast but IO devices are slow, so the CPU is never the bottleneck for IO-bound workloads
- Artifact: CPU-centric IO operations
    size_t read(fd, buf, size);   /* Disk -> Memory */
    size_t write(fd, buf, size);  /* Memory -> IO device */
- All content must be brought to "main" memory first (Disk -> Memory -> CPU -> Network Card), so memory (or the CPU) becomes the bottleneck
- How can we avoid the CPU bottleneck for IO-bound workloads?
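The read()/write() abstraction above forces every byte through host DRAM before it can reach the NIC. A minimal sketch of that CPU-centric delivery loop, with the disk and NIC stubbed out so it stands alone; the function names and the 4KB chunk size are illustrative, not from the talk:

```c
/* Sketch of the conventional CPU-centric delivery path the talk critiques:
   every byte crosses a main-memory bounce buffer between disk and NIC. */
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CHUNK 4096

/* Stand-ins for disk read and NIC send so the sketch is self-contained. */
static size_t disk_read(char *buf, size_t len) { memset(buf, 'v', len); return len; }
static size_t nic_send(const char *buf, size_t len) { (void)buf; return len; }

/* Serve one file chunk: disk -> main memory -> NIC.
   The intermediate buffer is exactly the copy IO-TCP eliminates. */
size_t serve_chunk(void) {
    char buf[CHUNK];                   /* main-memory bounce buffer */
    size_t n = disk_read(buf, CHUNK);  /* DMA: disk -> memory       */
    return nic_send(buf, n);           /* DMA: memory -> NIC        */
}
```

Every chunk served this way costs memory bandwidth twice (write on read-in, read on send-out), which is why memory, not the disks or the NIC, saturates first.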
Slide 6: Opportunity in the Solution Space
- Modern PCIe devices support P2P DMA: peer-to-peer DMA without CPU intervention, with no main-memory copy if the DMA devices have their own memory
- Programmability in IO devices: SmartNICs and computational SSDs (Arm SoC-, FPGA-, or ASIC-based)
- Approach: the SmartNIC as the hub for NVMe disk IO (disk to SmartNIC via P2P DMA, bypassing CPU and main memory)
- Key design issue: where to place the TCP stack?

Slide 7: Challenges in Operating the TCP Stack
- Option 1: TCP stack on the CPU side. But no disk content is available on the CPU.
- Option 2: TCP stack on the SmartNIC. But performance is limited by SmartNIC resources: lighttpd with Linux TCP runs at 12Gbps on BlueField-2 vs. 59Gbps on a Xeon Silver 4210.
- Hybrid solution? Separate the IO-intensive data plane from the TCP stack.
Slide 8: IO-TCP: a Split TCP Stack Architecture for Content Delivery
- The TCP stack is separated into control and data planes
- Host IO-TCP stack (low-latency control logic): reliable data delivery, congestion control, flow control, header generation
- SmartNIC IO-TCP stack (high-throughput data operations): read files from disk, send packets to the client

Slide 9: IO-TCP Overview
- Control plane on the CPU; data plane on the SmartNIC and the NVMe SSDs (P2P DMA data stream)
- Provides 4 offload APIs for SmartNIC execution: offload_open(), offload_fstat(), offload_close(), offload_write()
1. The application calls offload APIs for remote execution
2. The host sends a special command to the SmartNIC for each API
3. The SmartNIC stack performs the corresponding IO operation
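Under the four offload APIs, the host's file-serving path becomes command issuance rather than data movement. A sketch of what a ported path might look like; the offload_* prototypes below are assumptions for illustration, since the talk names the APIs but not their signatures, and local stubs stand in for the real host stack:

```c
/* Hypothetical stubs standing in for the IO-TCP host-stack offload APIs.
   Real prototypes may differ; only the API names come from the talk. */
#include <sys/types.h>

static int     offload_open(int sock, const char *path) { (void)sock; (void)path; return 3; }
static off_t   offload_fstat(int fid) { (void)fid; return 300 * 1024; }
static ssize_t offload_write(int sock, int fid, off_t off, size_t len)
                 { (void)sock; (void)fid; (void)off; return (ssize_t)len; }
static int     offload_close(int fid) { (void)fid; return 0; }

/* Conventional path: open(); fstat(); loop { read(); write(); } --
   every byte via host RAM. IO-TCP path: the host only issues commands;
   the file bytes flow disk -> SmartNIC -> wire. */
ssize_t serve_file(int sock, const char *path) {
    int   fid  = offload_open(sock, path);   /* OPEN command to the NIC */
    off_t size = offload_fstat(fid);         /* metadata only, no copy  */
    ssize_t n  = offload_write(sock, fid, 0, (size_t)size); /* SEND cmd */
    offload_close(fid);
    return n;
}
```

The near one-to-one mapping from open/fstat/close/write is what makes the 10-line lighttpd port plausible.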
Slide 10: IO-TCP-based Web Server Workflow
- Control plane: CPU. Data plane: SmartNIC and disk; files are opened and read via P2P DMA, and ACKs bypass the NIC stack.
1. The client sends an HTTP request, which is delivered to the application on the host
2. The application calls offload_open(); the host stack issues an OPEN command to the SmartNIC stack
3. The application sends the HTTP response headers
4. The application calls offload_write(); the host stack issues a SEND command to the SmartNIC stack
5. The SmartNIC stack reads the file from the NVMe SSDs via P2P DMA and sends the HTTP response body to the client
6. The client's ACKs bypass the NIC stack and go to the host stack
Slide 11: Key Operation for Offloading: the SEND Command
- A SEND command is a "virtual data packet": Eth/IP/TCP headers plus a payload holding (file id, offset, length)
- On the SmartNIC:
1. Translate the payload of the SEND command
2. Read the file asynchronously from disk into the payload, turning it into a real data packet
3. Send out the packet
- SmartNIC HW assists: TSO, checksum offloading, TLS offloading

Slide 12: IO-TCP Challenges
- How to calculate an accurate packet RTT?
- How to deal with retransmission?
- More details in the paper
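The SEND command's payload can be pictured as a small fixed struct; the field widths below are illustrative assumptions, since the slide only names the three fields:

```c
/* Sketch of the SEND command's shape as described on the slide: ordinary
   Eth/IP/TCP headers whose payload is just (file id, offset, length).
   Field widths are assumptions; the deck does not specify them. */
#include <stddef.h>
#include <stdint.h>

struct send_cmd_payload {
    uint32_t file_id;  /* NIC-side handle established by the OPEN command */
    uint64_t offset;   /* byte offset into the file                       */
    uint32_t length;   /* bytes to read from disk and transmit            */
};

/* The NIC "translates" the virtual packet: it keeps the Eth/IP/TCP
   headers, DMAs `length` bytes of the file into the payload slot, and
   sends the resulting real data packet (TSO/checksum/TLS done in HW). */
size_t real_packet_payload_len(const struct send_cmd_payload *p) {
    return p->length;  /* tiny virtual payload expands to file data */
}
```

Because one command can name a range much larger than an MTU, the host sends far fewer packets than the NIC emits, which is the source of the CPU savings.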
Slide 13: How to Handle Retransmission?
- A SEND command implies a long disk IO on the SmartNIC; re-reading the disk for a retransmission could be slow
- Our approach: keep the data in NIC memory until it is confirmed to be delivered (ACKed)
- Problem: only the host receives the ACKs (it needs them for control logic)
- Solution: periodic notification of ACKnowledgeD (ACKD) sequence numbers from host to NIC, sent whenever the required memory size exceeds a threshold; the NIC then frees the ACKed file data
Slide 14: Implementation
- Host stack: extended mTCP to support NIC offload; 1,793 lines of code modified in mTCP
- NIC stack: based on the NVIDIA BlueField-2 SmartNIC; 1,853 lines of C code; implements TSO, scatter-gather IO, and TLS crypto offload
- Easy to port existing apps: open(), fstat(), and close() map to offload_open(), offload_fstat(), and offload_close(); write() maps to offload_write()
- Porting the lighttpd server to IO-TCP took 10 lines of code modification

Slide 15: Experiment Setup
- Baselines:
  - lighttpd on Linux TCP with sendfile() (kernel version 4.14)
  - Atlas [1]: a web server on a kernel-bypass TCP stack with raw disk access; buffer-cache-free design; FreeBSD 11.0 and a Chelsio 100Gbps NIC
- Server: Xeon Silver 4210 2.2GHz, 10 cores, 128GB DDR4; NVIDIA BlueField-2 (100Gbps x 2); Intel Optane 900P NVMe (20Gbps x 4)
- Clients: 2 clients, max 80Gbps
[1] "Disk|Crypt|Net: rethinking the stack for high-performance video streaming." SIGCOMM, 2017.
Slide 16: IO-TCP Performance, Plaintext
- 500KB video file chunks (disk bound); lighttpd ported to IO-TCP
- Figure: throughput (Gbps) vs. number of CPU cores for IO-TCP, Atlas, and Linux TCP; data points include 12.0, 20.3, 56.8, and 78.8Gbps, with a line marking the max throughput of the 4 NVMe disks
- IO-TCP achieves the full bandwidth of the 4 NVMe disks with a single CPU core

Slide 17: IO-TCP Performance, TLS
- 500KB video file chunks (disk bound); lighttpd ported to IO-TCP; cipher mode AES-GCM 256
- Figure: throughput (Gbps) vs. number of CPU cores for IO-TCP, Atlas, and Linux TCP; data points include 6.4, 12.5, 37.4, 44.2, and 77.4Gbps
- IO-TCP still reaches the max throughput for TLS traffic
49、5678910Throughput(Gbps)Number of CPU CoresIO-TCPAtlasLinuxTCP1“Disk|Crypt|Net:rethinking the stack for high-performance video streaming.”SIGCOMM,2017.Max.throughput of 4 NVMesSource of Performance Improvement18SeparationSeparationof Control plane Control plane/Data planeData plane-No main memory rea
50、d/write for IO-No CPU cache eviction by DDIOSource of Performance Improvement18SeparationSeparationof Control plane Control plane/Data planeData plane-No main memory read/write for IO-No CPU cache eviction by DDIOMitigatedCache&Memory contention Cache&Memory contention forControl planeControl planeS
51、eparation achieves 27%lower LLC miss rate27%lower LLC miss rateFaster control planeFaster control planeIPCIPCof the control path improves by 58%58%Source of Performance Improvement18SeparationSeparationof Control plane Control plane/Data planeData plane-No main memory read/write for IO-No CPU cache
52、eviction by DDIOMitigatedCache&Memory contention Cache&Memory contention forControl planeControl planeSeparation achieves 27%lower LLC miss rate27%lower LLC miss rateShorter e2e RTT Shorter e2e RTT L arger window size L arger window size Overall Throughput ImprovementOverall Throughput Improvement99
53、.1 62.9 37.1 28.1 21.6 18.2 02040608080100R el.Throughput(%)Additional Delay by the control plane(s)Faster control planeFaster control planeIPCIPCof the control path improves by 58%58%IO-TCP Overhead Evaluation19 Overhead factors Host-NIC communication overhead Performance limit of Arm-ba
54、sed subsystem on BF2The fewer connections would be advantageous to CPU-only approach(Linux TCP)4.3217.6336.786.379.789.9908 16 32 64Throughput(Gbps)Throughput(Gbps)Number of ConnectionsNumber of ConnectionsIO-TCPLinux TCP0.451.812.760.791.331.3201231248 16 32 64Throughput(Gbps)Throughput(
55、Gbps)Number of ConnectionsNumber of ConnectionsIO-TCPLinux TCP300 KB files10 KB filesSummary20 BIG Trend:IO device advancement outpaces the rate of CPU capacity growth IO-TCP:a split TCP stack architecture for a content delivery system CPU host stack carries out the control plane functionalities of
56、a TCP stack NIC stack serves as data plane of a TCP stack IO-TCP achieves full bandwidth of 4 NVMedisks with a single CPU core Current bottleneck lies in SmartNICmemory bandwidth SmartNICwith higher memory BW will improve the throughput even more Bluefield-3 will achieve 140Gbps per NIC QUIC-based C
57、DN can adopt our separated stack design as well21Thank you!IO-TCP Performance Varying File Sizes2246.451.156.854.778.179.278.879.564.176.678.175.3020406080100100KB300KB500KB1MBThroughput(Gbps)File SizePlaintextLinuxTCPAtlasIO-TCP28.433.137.436.843.443.844.244.164.176.277.474.8020406080100100KB300KB5
58、00KB1MBThroughput(Gbps)File SizeTLSLinuxTCPAtlasIO-TCPIO-TCP Performance Varying Number of Connections2378.3 79.3 79.0 77.4 79.1 79.8 79.5 80.6 02040608030004000Throughput(Gbps)#of Host CPU coresTLS1278.8 79.4 78.9 77.5 79.1 80.0 79.7 81.5 02040608030004000Throughput(Gbps)#of H
59、ost CPU coresPlaintext12Linux TCP vs.IO-TCP24LighttpdLighttpdsetupsetupThroughputThroughput(GbpsGbps)Linux TCP on Bluefield-2only11.98Linux TCP on Bluefiend-2 and 1 CPU core22.02IOIO-TCPTCPon Bluefield-2and 1 CPU core44.13TCP Fairness25 Jains fairness index with varying number of connections IO-TCP:
60、0.910.97 Linux TCP:0.900.97User-level TCP Stacks vs.IO-TCP26 Throughput with 500KB file delivery TAS:9.0 Gbps mTCP:21.4 Gbps F-Stack:36.0 Gbps Linux TCP:56.8 Gbps These stacks are not optimized for large-file content delivery Optimized for small messages Lacks of an implementation for sendfile()and
61、a support for TSOAsynchronous sendfile()on FreeBSD vs.IO-TCP2734.260.867.270.96476.278.175.3020406080100100KB300KB500KB1MBThroughput(Gbps)File SizePlaintextFreeBSD-nginxIO-TCP19.233.934.838.864.176.277.474.8020406080100100KB300KB500KB1MBThroughput(Gbps)File SizeTLSFreeBSD-nginxIO-TCPRetransmission T
62、imer&RTT Measurement28ClientCPUSENDSmartNICIO-TCP StackNVMeSSDsHost IO-TCP StackApplicationoffload_write()Retransmission timer at the host stackRetransmission Timer&RTT Measurement28ClientCPUSENDSmartNICIO-TCP StackNVMeSSDsHost IO-TCP StackApplicationoffload_write()Disk I/O delay Retransmission time
63、r at the host stack Problem:disk access delay is added Up to a few mswhen backloggedRetransmission Timer&RTT Measurement28ClientCPUSENDSmartNICIO-TCP StackNVMeSSDsHost IO-TCP StackApplicationoffload_write()SENDECHODisk I/O delay Retransmission timer at the host stack Problem:disk access delay is add
64、ed Up to a few mswhen backlogged Solution:SEND ECHO packets to the hostRetransmission Timer&RTT Measurement28ClientCPUSENDSmartNICIO-TCP StackNVMeSSDsHost IO-TCP StackApplicationoffload_write()SENDECHODisk I/O delay Retransmission timer at the host stack Problem:disk access delay is added Up to a fe
65、w mswhen backlogged Solution:SEND ECHO packets to the hostShort&fixed delay(3us in our setup)Negligible overhead:a SEND for 10s100s MTUsShort&Static delayRetransmission Timer&RTT Measurement29ClientCPUSENDSmartNICIO-TCP StackNVMeSSDsHost IO-TCP StackApplicationoffload_write()SENDECHO Retransmission
66、timer at the host stack Problem:disk access delay is added Up to a few mswhen backlogged Solution:SEND ECHO packet to the host Short and static delay(3us in our setup)Negligible overhead:a SEND for 10s100s MTURTT measurement with Timestamp optionRetransmission Timer&RTT Measurement29ClientCPUSENDSma
67、rtNICIO-TCP StackNVMeSSDsHost IO-TCP StackApplicationoffload_write()SENDECHO Retransmission timer at the host stack Problem:disk access delay is added Up to a few mswhen backlogged Solution:SEND ECHO packet to the host Short and static delay(3us in our setup)Negligible overhead:a SEND for 10s100s MT
68、URTT measurement with Timestamp option Add the static delay(3us)and the I/O delay to the=+3+/CPU as a Bottleneck for NVMe30 With multiple multiple NVMeNVMedisksdisks,CPU cancanbe a bottleneck Intel Xeon Silver 4210(2.20GHz)6x Intel Optane900P Simple fio A single CPU core cannot even support 2 disks for 4K BS2 disks for 4K BS020406080100123456CPU Utilization(%)Number of NVMe Devices4K BS8K BS16K BS32K BS64K BS