Putting Consistency, Reliability, Availability and Partition-tolerance in the SmartNIC
Paul Borrill, Jonathan Gorard
Plus: Charlie, Steve, Liane, Chuck, Melissa, Susan
SmartNICs Session: Wednesday, June 14th, 4:20-5:20 PM

Daedaelus?
- DDLUS addresses fundamental problems in distributed systems using protocols, data structures and algorithms inspired by Quantum Information Theory and Multiway Systems
- Our market is next-generation platforms for secure, reliable, distributed computing on the edge
- We provide Microdatacenters with a fundamentally more reliable and programmable graph infrastructure
- Initial use-cases include transaction systems, Digital Twins, AI/ML/LLM infrastructure, multiplayer games, and interfaces to Quantum Computers
- See session: Wednesday, 4:20-5:20 PM

At the beginning of time, in networking
- A set of brilliant decisions were made
- Packets could be dropped
  - For congestion
  - And to simplify handling certain corner cases
- TCP sessions would recover those packet drops, and deliver in order
- If the TCP session disconnected, recover in the app
- Even re-running a file transfer over Arpanet wasn't hard

As the Hyperscale era began
- A set of brilliant decisions were made, again:
- Commodity hardware and software, only
  - White-box servers
  - Linux
  - NICs, and later switches
- Existing, proven foundation technology only
- Scale out, not scale up
  - Distributed databases
  - Distributed storage systems
  - Load balancers, to replicated front ends, to stateful back ends

Distributed Applications are Hard
- Nodes need to agree on a lot of things, all the time
- What nodes
are in the cluster? Who's up? Is the "leader" alive?
- Did that storage write (or database update) commit?
- Getting consensus algorithms right is hard
  - Lost packets and broken TCP connections have a big impact
  - Gray failures (performance collapse) happen often (ZOOKEEPER-1465)
- What happens in a network partition is harder
- Partial network partitions are worse

A Surprising (and Scary) Conclusion
- The brilliant networking decisions and the brilliant hyperscale decisions together cause metastable failures in stateful applications like distributed storage and databases
  - At best they cause performance collapse ("gray failure")
  - At worst they cause silent data corruption (University of Waterloo)
- Google SRE Handbook, chapter 23, "Managing Critical State"
- Computer Science has studied this at length, and concluded that these problems can be mitigated but not solved
  - Not true, by the way
- SmartNICs are in an ideal position to do more than mitigate these problems

Can't solve this within "business as usual"
- We tried giving distributed applications reliable, deterministic communication
  - In a software layer over best-practice networking: the industry failed
  - Using custom protocols and drivers over off-the-shelf NICs, with and then without switches: the industry failed
- We found the need for "entanglement" at the Link layer and end-to-end, so the sender and receiver both know immediately whether a packet arrived successfully, without timeouts and retries

DDLUS: Newton's Cable

DDLUS: No Cloning Link Protocol
[Diagram: App (Alice) - FPGA - Link - FPGA - App (Bob); payload with injected token at the "Strike Point"; the link interior is silent (cannot be observed; don't even try)]
- Extraordinary claims require extraordinary proof: see our Table Top Demo in the Exhibit Hall

The Network Changes, But Doesn't
- NICs connect directly to each other; no switches necessary
- Uplinks from the network are backwards compatible; the TCP/IP and Ethernet stack are unchanged
- There are no dropped packets; traffic is paused on link failure and healed locally
- If a transaction packet doesn't reach its destination, we know in microseconds
- We don't use timeouts; we use causality/events on multiple paths
- We ensure both ends have the same facts about whether a packet was delivered or not
- At the application level this enables agreement on facts across a set of nodes, despite the CAP theorem "proof" that it can't be done

The Rack/Row Deterministic Subnet
- Uplinks from that subnet are just Layer 3 via Ethernet
- Addressing and forwarding have novel properties
  - Software endpoints are addressed, not servers
  - A software endpoint can move within the subnet
  - Endpoints can be managed in directed graphs
- New hardware paradigms are enabled
  - The server really is a peripheral of the NIC
  - Enables 10x more, 10x smaller servers
  - 2-centimeter cables between adjacent nodes
  - Connection cost per server is radically lower
- Consensus which actually works enables dividing a distributed app over far more nodes
- Endpoints in sets/graphs simplify management, deployment, and ACLs

Daedaelus
- A graph software company focused on dependable computing
- We solve putatively unsolvable problems in the communication between the pieces of a distributed application
  - Which reside on different computers
  - Which communicate over a fallible network
  - Which require agreement on certain facts in order to operate correctly
- Incidental to our solution, we write code for an FPGA NIC
- Incidental to our solution, we use a mesh network of servers
- As part of our solution, our FPGA NIC provides line-speed, extremely low-latency primitives which assist consensus, atomic update of shared data items, conservation of tokens, etc., for distributed app nodes within our subnet

Distributed Systems APIs for SmartNICs (in honor of Mark Carlson)
- CONSISTENCY, PARTITION TOLERANCE, AVAILABILITY
- AUTONOMOUS CONFIG, AUTONOMOUS HEALING, TRANSACTIONS, ENERGY CONSERVATION, APPLICATION GRAPHS, SECURE CONFINEMENT
- The CAP theorem needs our help; secure enclaves need our help; transactions need our help; grey failure needs our help; Kubernetes needs our help; program with Lingua Franca
- Session: Wednesday, 4:20-5:20 PM; POC Table Top Demo in the Exhibit Hall

DDLUS: We Make Transactions Reliable
- Reliable Substructure Clusters
- Reversible transfers at line rates, with ultra-low latency
- No tradeoff of bandwidth against latency
- No metastable failures
- FPGAs do not have a halt state! They just run. Just circuits: stuff goes in, comes out, no halt states in between
- Unlike ASICs, they don't take years to create

Labyrinth: Formal Verification & Simulation
1. Compile directly from Rust into the Ddlus protocol description language
2. Simulate all possible non-deterministic evolution histories with state transition graphs
3. Extract entanglement information from the protocols, indicating which microstates are non-separable (as in quantum mechanics)
4. Simulate typical failure scenarios (e.g. link failure, packet loss) and quantify robustness and recovery capability

Paul Borrill, Founder/CEO and T
https:/

Application errors caused by communication issues
- 80% of failures have a catastrophic impact, with data loss being the most common (27%)
- 90% of the failures are silent; the rest produce warnings that are unclear
- 21% of the failures lead to permanent damage to the system; this damage persists even after the network partition heals
https://dl.acm.org/doi/10.1145/3576192
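Step 2 of the Labyrinth pipeline above (simulate all possible non-deterministic evolution histories with state transition graphs) can be illustrated with a toy model. This is our own minimal sketch, not Daedaelus/Labyrinth code: a single stop-and-wait exchange over a lossy link, where every deliver-or-drop choice is explored by breadth-first search; the `State` and `explore` names are assumptions of the sketch, not the Ddlus language.

```rust
use std::collections::{HashSet, VecDeque};

// One stop-and-wait exchange over a lossy link, as a state machine.
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
enum State {
    DataInFlight, // sender sent; outcome unknown to both ends
    DataLost,     // link dropped the data frame (terminal)
    AckInFlight,  // receiver has the data; ack outcome unknown
    AckLost,      // receiver has the data; sender will never learn (terminal)
    Done,         // both ends know delivery succeeded (terminal)
}

// All non-deterministic successors of a state (deliver or drop).
fn successors(s: State) -> Vec<State> {
    use State::*;
    match s {
        DataInFlight => vec![AckInFlight, DataLost],
        AckInFlight => vec![Done, AckLost],
        DataLost | AckLost | Done => vec![], // terminal states
    }
}

// Breadth-first exploration of every evolution history,
// building the set of reachable states.
fn explore(start: State) -> HashSet<State> {
    let mut seen = HashSet::new();
    let mut queue = VecDeque::new();
    queue.push_back(start);
    while let Some(s) = queue.pop_front() {
        if seen.insert(s) {
            queue.extend(successors(s));
        }
    }
    seen
}

fn main() {
    let reachable = explore(State::DataInFlight);
    // AckLost is reachable: the receiver holds the payload, but without a
    // timeout-and-retry guess the sender cannot tell it from DataLost.
    // This is the asymmetry of facts the deck's link protocol removes.
    assert!(reachable.contains(&State::AckLost));
    assert!(reachable.contains(&State::DataLost));
    println!("reachable states: {}", reachable.len()); // 5
}
```

Even this five-state graph exhibits the disagreement the slides target: two terminal states (`DataLost`, `AckLost`) in which the two ends hold different facts about delivery, which is exactly what step 4's failure-injection runs would quantify in a real protocol.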
Truncated Tail Latency
[Chart: Percentage of Requests vs. Latency (ms)]
- Daedaelus reduces latency in ways conventional networks cannot:
  - Direct connections
  - Multicast consensus, in parallel over 8 ports instead of serial over 1
  - Truncated tail latency: the protocol knows it failed or succeeded (without heartbeats or timeouts)

Tail Latency
[Chart: Percentage of Requests vs. Latency (ms); Conventional Clos Network vs. FPGA Chiplet Mesh]
- Conventional Clos Network: long tail due to network sharing and (unbounded) retries
- Fallible FPGA Chiplet Mesh: lower latency; truncated tail with atomic protocol (elimination of heartbeats and retries)

Spacecraft Arrays: In Formation*
- Complete redundancy: any cell can become a controller if others fail
- *From the ItsAboutTime.club talk: "Swarming Proxima Centauri: How Really Good Clocks Enable Optical Communication Over Interstellar Distances"

A software infrastructure for datacenter networks based
on algorithms whose assumptions about causality go beyond Newtonian and Minkowski spacetime.
- We design and verify protocols for direct (near-neighbor connected) networks that can be deployed on FPGA-enabled SmartNICs to address fundamental problems in distributed systems
- This leads to a system of rewriting rules that can execute in multiway application fragments, invisibly and indivisibly, in the FPGA substructure cluster

Quantum Ethernet (Dual SAW-Petri-Spekkens Protocol)
- Compile Rust to Petri nets, for formal verification and simulation
- Compile Petri nets to Verilog, for deployment on FPGAs

Centralized, Decentralized, Distributed
- Centralized: long cables, ONE SPOF
- Decentralized: long/medium cables, MULTIPLE SPOFs (network partition possibilities)
- Distributed (Daedaelus Evolving Grid, valency 8 and valency 5): short local links, no SPOFs, no bottlenecks

DDLUS: ItsAboutTime.club
- A relationship with time is intrinsic to everything we do within and between our
networked computers.
- An assumption that time is a smooth, irreversible, global Newtonian/Minkowskian background is a common but rarely questioned belief in computer science; yet physicists now know this model to be incorrect.
- Our guest speakers are all people who have thought deeply about the nature of time. We collectively realize that a new understanding could potentially revolutionize the way we approach physics, computer science, chemistry, neuroscience, and many other subjects.
- SmartNICs in particular can benefit: temporal intimacy with the bits on the wire; decoupled transactions; CAP (Consistency, Availability, Partitioning).
- A place to discuss our evolving knowledge of the nature of time and causality, for physicists, computer scientists, neuroscientists, philosophers and practicing engineers.

SmartNIC Skyscraper Model
- New Substructure Consortium, like SNIA
- New IEEE Standard for Distributed Systems
- Revolutionary technology from DAEDAELUS: willing to share, open-source*, license fairly and reasonably
- The revolution starts here, at the SmartNICs Summit: meet on Wednesday afternoon
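The toolchain described earlier compiles Rust to Petri nets for verification, then Petri nets to Verilog for deployment. As a hedged illustration of the intermediate form, here is a minimal place/transition net with the standard firing rule; this sketches textbook Petri-net semantics, not the actual Ddlus representation, and all names (`PetriNet`, `fire`, the place layout) are ours.

```rust
// A minimal place/transition Petri net. Tokens are conserved per arc,
// which is what makes such nets amenable to formal checks before the
// Verilog-generation step the deck describes.
#[derive(Clone)]
struct Transition {
    name: &'static str,
    consume: Vec<usize>, // input places (one token each)
    produce: Vec<usize>, // output places (one token each)
}

struct PetriNet {
    marking: Vec<u32>, // current token count per place
    transitions: Vec<Transition>,
}

impl PetriNet {
    // A transition is enabled iff every input place holds a token.
    fn enabled(&self, t: &Transition) -> bool {
        t.consume.iter().all(|&p| self.marking[p] > 0)
    }

    // Fire a transition by name; returns false if unknown or not enabled.
    fn fire(&mut self, name: &str) -> bool {
        let t = match self.transitions.iter().find(|t| t.name == name) {
            Some(t) => t.clone(),
            None => return false,
        };
        if !self.enabled(&t) {
            return false;
        }
        for &p in &t.consume { self.marking[p] -= 1; }
        for &p in &t.produce { self.marking[p] += 1; }
        true
    }
}

fn main() {
    // Places: 0 = frame queued, 1 = frame on wire, 2 = frame delivered.
    let mut net = PetriNet {
        marking: vec![1, 0, 0],
        transitions: vec![
            Transition { name: "send",    consume: vec![0], produce: vec![1] },
            Transition { name: "deliver", consume: vec![1], produce: vec![2] },
        ],
    };
    assert!(net.fire("send"));    // token moves: queued -> on wire
    assert!(!net.fire("send"));   // not enabled: place 0 is now empty
    assert!(net.fire("deliver")); // token moves: on wire -> delivered
    assert_eq!(net.marking, vec![0, 0, 1]);
}
```

Because firing only moves tokens along declared arcs, each transition maps naturally onto a clocked hardware update of a small state vector, which is the intuition behind lowering such a net to Verilog.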