2023 Storage Developer Conference. 2023 Samsung Electronics Co., Ltd. All Rights Reserved.
Virtual Conference, September 28-29, 2021

libvfn: A low-level NVMe Application and VFIO Driver Framework
Klaus Jensen, Samsung Electronics

What is libvfn?
Two "libraries"
- A VFIO utility library (#include) with helpers for writing user space drivers for any PCI device
  - Core helpers: vfio configuration, device bring up, IRQ configuration, mmio
  - IOMMU helpers: iommu api, I/O virtual address allocator
- An NVMe user space driver (#include)
  - Polling and event-driven modes
  - Low-level queue and register API
LGPL, MIT dual-licensed
Core library has zero external dependencies
- libnvme (and some GPL licensed support libraries) required for building tests and examples
Designed (for now) for x86_64 and ARM64
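
To give a feel for what an I/O virtual address allocator does, here is a minimal sketch of a bump allocator that hands out page-aligned IOVA ranges and never frees them. Every name in it (`iova_bump`, `iova_bump_alloc`, the 4 KiB page size) is an assumption made for illustration; it is not libvfn's allocator or API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names throughout; this is not libvfn's allocator. */
#define IOVA_PAGE_SIZE 4096ULL

struct iova_bump {
	uint64_t next; /* next free I/O virtual address */
	uint64_t end;  /* one past the last usable address */
};

static void iova_bump_init(struct iova_bump *a, uint64_t base, uint64_t end)
{
	assert(base % IOVA_PAGE_SIZE == 0 && base < end);
	a->next = base;
	a->end = end;
}

/* Reserve len bytes, rounded up to whole pages. Returns 0 on success. */
static int iova_bump_alloc(struct iova_bump *a, size_t len, uint64_t *iova)
{
	uint64_t avail = a->end - a->next;
	uint64_t size;

	if (len == 0 || len > avail)
		return -1;

	size = ((uint64_t)len + IOVA_PAGE_SIZE - 1) & ~(IOVA_PAGE_SIZE - 1);
	if (size > avail)
		return -1; /* range exhausted; nothing is ever freed */

	*iova = a->next;
	a->next += size;
	return 0;
}
```

The returned address is what a driver would then hand to the IOMMU mapping call together with the buffer's virtual address. A real allocator would also need lookup and reuse of freed ranges.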

Sigh, another user space driver? Why?
io_uring, io_uring_cmd, xNVMe? Hello?
- io_uring_cmd has dramatically reduced the need for user space NVMe drivers
  - io_uring_cmd allows user space to "talk shop" (sending raw-ish NVMe commands)
- xNVMe provides high-performance abstractions over block (and raw NVMe I/O)
  - unified API supporting several backends: linux aio, io_uring, io_uring_cmd, spdk, and libvfn
  - command submission helpers
  - callback-based "reactor" for completions
  - NVMe type definitions
  - asynchronous and synchronous submission modes

io_uring_cmd and NVMe
Fundamentally this enables a user to
- Submit raw-ish NVMe commands
  - The submitted payload is slightly different from NVMe
  - PRP1 is repurposed as a single 64-bit pointer to a virtual memory address (or struct iovec)
  - PRP2 is split into two 32-bit values describing the length (or number of vector elements) of the metadata and data pointers
- while not having to worry about bootstrapping the driver (enabling, probing namespaces, configuring queues, etc.)
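
The payload difference can be made concrete with the command layout. The sketch below mirrors the kernel's `struct nvme_uring_cmd` from `<linux/nvme_ioctl.h>` as a local type so it builds without kernel headers; the mirror is written from memory, so treat the layout as an assumption and use the real header in practice. The helper `prep_identify_ctrl` and the type name `uring_cmd_sketch` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Local mirror of struct nvme_uring_cmd (assumed layout, see lead-in).
 * In a native 64-byte NVMe command, bytes 24..31 hold PRP1 and
 * bytes 32..39 hold PRP2.
 */
struct uring_cmd_sketch {
	uint8_t  opcode;
	uint8_t  flags;
	uint16_t rsvd1;
	uint32_t nsid;
	uint32_t cdw2;
	uint32_t cdw3;
	uint64_t metadata;     /* user virtual address of the metadata buffer */
	uint64_t addr;         /* user virtual address, in the PRP1 slot */
	uint32_t metadata_len; /* first half of the PRP2 slot */
	uint32_t data_len;     /* second half of the PRP2 slot */
	uint32_t cdw10;
	uint32_t cdw11;
	uint32_t cdw12;
	uint32_t cdw13;
	uint32_t cdw14;
	uint32_t cdw15;
	uint32_t timeout_ms;
	uint32_t rsvd2;
};

/* Fill in an Identify Controller admin command (opcode 0x06, CNS 0x01). */
static void prep_identify_ctrl(struct uring_cmd_sketch *cmd, void *buf,
			       uint32_t len)
{
	assert(cmd && buf && len);
	memset(cmd, 0, sizeof(*cmd));
	cmd->opcode = 0x06;
	cmd->addr = (uint64_t)(uintptr_t)buf; /* plain pointer, no PRP list */
	cmd->data_len = len;
	cmd->cdw10 = 0x01;
}
```

The point of the sketch is only the field placement: user space passes ordinary virtual addresses and lengths, and the kernel driver builds the actual PRPs.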
Available NVMe lingo remains bound by the environment provided (and optionally enforced) by the kernel driver
- Without CAP_SYS_ADMIN
  - only I/O commands without "dangerous" command effects
  - simple white-listed admin commands (identify, etc.)
- With CAP_SYS_ADMIN
  - What if you delete queues? Detach a namespace? Whelp.
  - You can insmod garbage.ko at any time anyway, so no biggie really.
A cardinal rule of using io_uring_cmd is: do not screw up the driver.

But, you want it all
As a host/device verification engineer,
- you want to issue any command and observe the fallout
- you probably do not care about the block layer, file systems, etc.
- you want to issue malformed commands (invalid PRPs, SGLs)
- without potentially (or rather, likely) bringing down the kernel with a fat finger, resulting in an Oops and a reboot
And that is the domain of the safe user space driver
- It's not just for performance

User space drivers and NVMe
Fundamentally this enables a user to
- Submit raw (as in totally raw) NVMe commands
- in a safe way
  - no risk of breaking the kernel (there is no driver to break)
  - you might brick the drive if not careful
- while having to write a driver in user space to bootstrap the controller (configuring admin queues, probing namespaces, setting up queue memory, handling PRPs/SGLs)

libvfn vs. The Current State of The Art
Other user space drivers
- SPDK
- PyNVMe
- QEMU block/nvme
A direct comparison is not fair to either party
- SPDK is so much more than just an NVMe driver
  - io_uring_cmd-based bdev_xnvme is closing the gap with lib/nvme
- PyNVMe is a test-dedicated NVMe driver with a native Python API
  - To the best of my knowledge, derived from SPDK
  - Since v3, no longer open source

libvfn vs. SPDK
libvfn has similarities to SPDK, why is this work not in SPDK?
- SPDK is more than an NVMe driver, it is a "development kit", an application framework
  - bdev, fabrics support, etc.
  - lots of useful sugar
- libvfn is fundamentally a userspace PCIe driver framework
  - maybe more similarities with DPDK
  - a low-level NVMe driver is included because that is what spurred the "libvfio" part
- libvfn might be more suitable for embedding into other projects (YMMV)
  - Zero dependencies
  - Supports both polling and event-driven modes out of the box
  - minimal API
  - less sugar included

QEMU NVMe driver
The QEMU NVMe driver allows the QEMU block layer to use a PCIe NVMe device directly as the underlying storage of VMs.
- an emulated device is layered on top (e.g., virtio-blk or even hw/nvme)
- single I/O queue pair
- extremely "to the point" when disregarding all the QEMU block layer plumbing
libvfn borrows two techniques
- Fast command tracking
- Relatively simple IOVA allocator
See Fam Zheng at KVM Forum 19: https:/

What makes user space drivers tick (safely)?
Safe user space PCIe device drivers rely on
1. the presence of a DMA remapping facility (a Translation Agent, or TA) to ensure isolation through host-managed mappings (in Address Translation and Protection Tables, or ATPTs)
   - The PCI Express specification defines the concept (but not the implementation) of TAs
   - Intel VT-d, AMD-Vi and ARM SMMU implement and provide such facilities
2. raw register access (typically memory-mapped I/O)
3. and interrupt programmability

PCIe Address Translation and VFIO

Legacy Platforms
In general, on legacy platforms, PCI devices have full access to the entire host physical memory address space
- "entire address space" depends on device capabilities: whether or not the device can access 64-bit addresses
- Maliciously (or not), hardware may exploit this
[Diagram: PCIe device, Root Port, Root Complex, Main Memory; labeled "UNSAFE DMA"]

Modern Platforms
On modern platforms, memory transactions may pass through a Translation Agent
- Software (typically the OS) maintains a translation table which controls what physical addresses a device may access
- Allows the OS to protect itself against faulty (or malicious) hardware, but NOT from buggy drivers
[Diagram: PCIe device, Root Port, Root Complex with Translation Agent (TA) and Address Translation and Protection Table (ATPT), Main Memory; labeled "SAFE DMA"]

Not just for security: Benefits of Address Translation
Address remapping
- map discontiguous memory into a contiguous range (scatter/gather)
- allow 32-bit only devices to access memory in 64-bit host memory space
  - this was the original intent and purpose of a TA
Enable safe user-space drivers
- some smart guys figured out that this could be used by the kernel to allow user-space to create mappings, but only for its own memory pages
- isolate devices from each other

Typical Use
[Diagram, built up over four slides: the CPU reaches Main Memory through the MMU and the CPU page tables (Virtual Address to Physical Address); the PCIe device reaches it through the TA and the TA page tables (I/O Virtual Address to Physical Address); both sets of page tables are OS managed]
1. Application: void *p = mmap()
2. Application: map 0x1000 to physaddr(p) in TA and pin memory
3. read from 0x1000

Translation Agent Implementations
The PCI specification only defines the concept of the Translation Agent as a logical entity
- The details are vendor specific
The capabilities of the TA vary
The format of the ATPT varies
The PCI topology defines the granularity of isolation
VFIO (and now, IOMMUFD) unifies it all under common uAPIs

A unified API
An IOMMU and device agnostic API for securely exposing direct device access to userspace.
Three main concepts: Containers, Groups and Devices
- The container manages address translations (a set of page tables) for a set of groups
  - ioctls: SET_IOMMU, IOMMU_MAP_DMA, ...
- The group represents a set of devices that share an isolation granularity
  - ioctls: SET_CONTAINER, GET_DEVICE_FD, ...
- The device is a, well, device
  - ioctls: GET_REGION_INFO, SET_IRQS, RESET, ...

Do we need a libvfio?
Best practices for using VFIO are scattered amongst various projects
- QEMU (hw/vfio, util/vfio-helpers.c)
- DPDK (lib/eal/linux/eal_vfio.{h,c})
Verbose uAPIs
- The uAPIs can be cumbersome, non-trivial and boiler-plate heavy to use
- risk of mistakes, steep learning curve
Duplication, redundant code
- duplication, redundant code

Using VFIO is a little boiler plate heavy
Steps required to bring up a PCI device (configure iommu group)
1. Configure VFIO "container" (open /dev/vfio/vfio)
   a. Verify API version
   b. Verify IOMMU support
2. Configure IOMMU group
   a. Determine iommu group of device (i.e. /dev/vfio/N)
   b. Determine if iommu group is "viable"
   c. Set (attach) group to container
3. Configure IOMMU
   a. Set IOMMU type on container
   b. Retrieve IOMMU information (capabilities)

Using VFIO is a little boiler plate heavy
Steps required to bring up a PCI device (configure device)
1. Get device handle (file descriptor)
2. Get device information
   a. Verify that device is a PCI device
   b. Get device region information (PCI configuration space)
3. Configure, initialize BARs
   a. Get device region information per BAR
4. Set PCI bus master (write to configuration space)
5. Configure IRQs
   a. Determine IRQ mechanisms (INTx, MSI, MSI-X)
   b. Select IRQ mechanism depending on support

    /* Needs <fcntl.h>, <stdint.h>, <sys/ioctl.h>, <sys/mman.h>, <linux/vfio.h> */

    /* Required data structures */
    int container, group, device, i;
    struct vfio_group_status group_status = { .argsz = sizeof(group_status) };
    struct vfio_iommu_type1_info iommu_info = { .argsz = sizeof(iommu_info) };
    struct vfio_iommu_type1_dma_map dma_map = { .argsz = sizeof(dma_map) };
    struct vfio_device_info device_info = { .argsz = sizeof(device_info) };

    /* Create a new container */
    container = open("/dev/vfio/vfio", O_RDWR);

    if (ioctl(container, VFIO_GET_API_VERSION) != VFIO_API_VERSION) {
        /* Unknown API version */
    }

    if (!ioctl(container, VFIO_CHECK_EXTENSION, VFIO_TYPE1_IOMMU)) {
        /* Doesn't support the IOMMU driver we want. */
    }

    /* Open the group */
    group = open("/dev/vfio/26", O_RDWR);

    /* Test the group is viable and available */
    ioctl(group, VFIO_GROUP_GET_STATUS, &group_status);

    if (!(group_status.flags & VFIO_GROUP_FLAGS_VIABLE)) {
        /* Group is not viable (ie, not all devices bound for vfio) */
    }

    /* Add the group to the container */
    ioctl(group, VFIO_GROUP_SET_CONTAINER, &container);

    /* Enable the IOMMU model we want */
    ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU);

    /* Get additional IOMMU info */
    ioctl(container, VFIO_IOMMU_GET_INFO, &iommu_info);

    /* Allocate some space and setup a DMA mapping */
    dma_map.vaddr = (__u64)(uintptr_t)mmap(0, 1024 * 1024, PROT_READ | PROT_WRITE,
                                           MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
    dma_map.size = 1024 * 1024;
    dma_map.iova = 0; /* 1MB starting at 0x0 from device view */
    dma_map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;

    ioctl(container, VFIO_IOMMU_MAP_DMA, &dma_map);

    /* Get a file descriptor for the device */
    device = ioctl(group, VFIO_GROUP_GET_DEVICE_FD, "0000:06:0d.0");

    /* Test and setup the device */
    ioctl(device, VFIO_DEVICE_GET_INFO, &device_info);

    for (i = 0; i < device_info.num_regions; i++) {
        struct vfio_region_info reg = { .argsz = sizeof(reg) };

        reg.index = i;

        ioctl(device, VFIO_DEVICE_GET_REGION_INFO, &reg);

        /* Setup mappings... read/write offsets, mmaps
         * For PCI devices, config space is a region */
    }

    for (i = 0; i

[The text breaks off here. Of the following slides up to slide 48, only the fragments "0 used 1 used 2 used 3 empty" and "vid);" survive.]

Identify Example (eventfd)

    int efd = eventfd(0, 0);

    /* register eventfd for vector 0 */
    vfio_set_irq(&ctrl.pci.dev, &efd, 1);

    nvme_rq_map_prp(rq, &cmd, iova, NVME_IDENTIFY_DATA_SIZE);
    nvme_rq_exec(rq, &cmd);

    /* wait for interrupt */
    uint64_t v;
    read(efd, &v, sizeof(v));

    /* will not spin */
    nvme_rq_spin(rq, &cqe);
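
The "wait for interrupt" step relies on standard Linux eventfd semantics, which can be tried without any NVMe device. In the sketch below a plain `write()` stands in for the kernel signalling an interrupt on the vector; the helper names `fake_interrupt` and `wait_for_events` are made up for this illustration and are not part of libvfn.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Stand-in for an interrupt being signalled: add 1 to the eventfd counter. */
static int fake_interrupt(int efd)
{
	uint64_t one = 1;

	assert(efd >= 0);
	return write(efd, &one, sizeof(one)) == (ssize_t)sizeof(one) ? 0 : -1;
}

/*
 * Block until the counter is non-zero, then return it. The read also
 * resets the counter to zero (the eventfd was created without EFD_SEMAPHORE).
 * Returns 0 on error.
 */
static uint64_t wait_for_events(int efd)
{
	uint64_t count = 0;

	assert(efd >= 0);
	if (read(efd, &count, sizeof(count)) != (ssize_t)sizeof(count))
		return 0;

	return count;
}
```

Because the counter accumulates, several interrupts that arrive before the read are reported as a single wake-up with a count greater than one, so the caller should drain the completion queue rather than assume one completion per wake-up.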

Case Study: Integration with xNVMe

Integration in xNVMe
libvfn is available as an alternative to the SPDK backend in xNVMe
- makes xNVMe a little lighter
xNVMe requires backends to implement
- buffer allocation and mapping
  - maps directly to mmap (allocation) and DMA mapping of the buffers
- async interface
  - queue init, poke, (vectored) io

Integration in xNVMe
The asynchronous interface in xNVMe is based on callbacks
- libvfn's struct nvme_rq opaque member stores the xNVMe command context
  - holds the callback and argument to be executed upon command completion
The xNVMe asynchronous API maps almost 1-to-1 with libvfn
- queue init: nvme_create_ioqpair()
- io: nvme_rq_post/exec()
- poke: loop around nvme_cq_get_cqe() and nvme_rq_from_cqe()
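
The callback pattern described above can be sketched in isolation. The types below are hypothetical stand-ins (they are neither libvfn's `struct nvme_rq` nor xNVMe's command context), and completions are read from a plain array rather than a real completion queue; the sketch only shows how an opaque pointer on the request leads back to the callback and its argument.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* All types here are hypothetical stand-ins for illustration only. */
typedef void (*complete_fn)(uint16_t status, void *cb_arg);

struct cmd_ctx {        /* caller-owned command context */
	complete_fn cb;     /* executed when the command completes */
	void *cb_arg;
};

struct request {        /* plays the role of a request tracker */
	void *opaque;       /* points at the owning cmd_ctx while in flight */
};

struct completion {     /* plays the role of a completion queue entry */
	uint16_t cid;       /* index of the request that completed */
	uint16_t status;
};

/* Example callback: counts successful completions in *(int *)cb_arg. */
static void count_ok(uint16_t status, void *cb_arg)
{
	if (status == 0)
		(*(int *)cb_arg)++;
}

/* "poke": dispatch n completions; returns the number of callbacks run. */
static size_t poke(struct request *rqs, size_t nrqs,
		   const struct completion *cqes, size_t n)
{
	size_t done = 0;

	for (size_t i = 0; i < n; i++) {
		assert(cqes[i].cid < nrqs);

		struct request *rq = &rqs[cqes[i].cid]; /* request from completion */
		struct cmd_ctx *ctx = rq->opaque;

		if (ctx && ctx->cb) {
			ctx->cb(cqes[i].status, ctx->cb_arg);
			done++;
		}

		rq->opaque = NULL; /* request is idle again */
	}

	return done;
}
```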

Performance Numbers: Are we on par?

Performance (Setup)
- Intel(R) Xeon(R) CPU E3-1240 v6 3.70GHz
  - An oldie, but a goodie
- Intel NVMe Optane Memory Series MEMPEK1W016GA
  - 16GB, M.2 80mm PCIe 3.0, 20nm, 3D XPoint
- 1 core (1 thread, 1 queue)
- random read (512 bytes)
- NVMe queue size is 128 (device max)
- I/O queue depths 1, 2, 4, 8 ... 64
- 10s warmup, 30s proper

Performance (program)
As close to an apples-to-apples comparison with SPDK as possible
- libvfn examples/perf.c mirrors spdk examples/nvme/perf/perf.c
- Re-issue command on completion
- Same time measurement strategy (RDTSC-based, not clock_gettime)
- Everything pre-allocated

Performance (IOPS)

Performance (Average Latency)

Wrapping Up: Examples and Next Steps

What about iommufd?
Yes. We are following the latest developments and testing
- My colleague, Joel Granados, is integrating iommufd support
- Rework vfio parts to use this as appropriate
- Does require some public API changes (planned for v4)
  - vfio_{map,unmap}_vaddr() becomes iommu_{map,unmap}_vaddr()
- We already hide the group-centric VFIO API behind a device-centric abstraction

Next Steps
More NVMe helpers
- SGL mapping helpers
Pluggable IOVA allocation and lookup
More sugar?
- Maybe in another support library?
Non 4KB page size based systems
- Some assumptions in the core. Mads is working on that.
WONTFIXes
- High level event framework (roll your own)

Questions?
Grab libvfn at
v2 v3 just released

Please take a moment to rate this session.
Your feedback is important to us.

Backup for Questions

The Level Two API (nvme_rq)
[Diagram, built up over seven slides: submission queue sq with request trackers 0 to 3, each moving between "executing" and "completed"]
1. rq = nvme_rq_acquire(sq); nvme_rq_exec(rq, cmd);
2. cqe = nvme_cq_get_cqe(cq); rq = nvme_rq_from_cqe(cqe);
3. nvme_rq_release(rq);

nvme_rq_acquire()
Regular

    struct nvme_rq *rq = sq->rq_top;

    if (!rq) {
        errno = EBUSY;
        return NULL;
    }

    sq->rq_top = rq->rq_next;

    return rq;

Atomic

    struct nvme_rq *rq = load_acquire(&sq->rq_top);

    while (rq && !cmpxchg(&sq->rq_top, rq, rq->rq_next))
        ;

    if (!rq)
        errno = EBUSY;

    return rq;

nvme_rq_release()
Regular

    struct nvme_sq *sq = rq->sq;

    nvme_rq_reset(rq);

    rq->rq_next = sq->rq_top;
    sq->rq_top = rq;

Atomic

    struct nvme_sq *sq = rq->sq;

    nvme_rq_reset(rq);

    rq->rq_next = load_acquire(&sq->rq_top);

    while (!cmpxchg(&sq->rq_top, rq->rq_next, rq))
        ;
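
For readers who want to experiment with the atomic variant outside libvfn, the following is a self-contained sketch of the same free-list idea using C11 `<stdatomic.h>`. The types `rq` and `rq_stack` and the function names are hypothetical, and the memory orderings are one reasonable choice rather than a statement of what libvfn uses. Like any plain compare-and-swap stack, this sketch does not protect against the ABA problem when several threads acquire concurrently.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical types; a simplified stand-in for a request tracker stack. */
struct rq {
	struct rq *next;
	int id;
};

struct rq_stack {
	_Atomic(struct rq *) top;
};

/* Chain all n requests together and make rqs[0] the top of the stack. */
static void rq_stack_init(struct rq_stack *s, struct rq *rqs, size_t n)
{
	assert(n > 0);

	for (size_t i = 0; i < n; i++) {
		rqs[i].id = (int)i;
		rqs[i].next = (i + 1 < n) ? &rqs[i + 1] : NULL;
	}

	atomic_init(&s->top, &rqs[0]);
}

/* Pop a free request; returns NULL and sets errno to EBUSY if none is free. */
static struct rq *rq_acquire(struct rq_stack *s)
{
	struct rq *rq = atomic_load_explicit(&s->top, memory_order_acquire);

	/* A failed exchange reloads rq with the current top, so just retry. */
	while (rq && !atomic_compare_exchange_weak_explicit(&s->top, &rq, rq->next,
			memory_order_acquire, memory_order_acquire))
		;

	if (!rq)
		errno = EBUSY;

	return rq;
}

/* Push a request back onto the stack. */
static void rq_release(struct rq_stack *s, struct rq *rq)
{
	rq->next = atomic_load_explicit(&s->top, memory_order_relaxed);

	/* A failed exchange refreshes rq->next with the current top. */
	while (!atomic_compare_exchange_weak_explicit(&s->top, &rq->next, rq,
			memory_order_release, memory_order_relaxed))
		;
}
```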

NVMe Refresher: it's just queue processing

NVMe Refresher
Host informs the device about new entries in the queue using a "doorbell" mechanism
- A "doorbell" is the common name for a write-only memory-mapped I/O register
PCI devices expose these registers through the BARs in the PCI Configuration Space
- In NVMe, the controller registers are located in the NVMe "MBAR" (BAR 0 & 1)

PCI configuration space:
Offset       Contents
0x00         Device ID (bits 31:16) | Vendor ID (bits 15:0)
0x04-0x0C    -
0x10-0x14    NVMe MBAR (BAR 0 & 1), e.g. 0xffabcdef
0x18         BAR 2
0x1C         BAR 3
0x20         BAR 4
0x24         BAR 5
0x28         -
0x2C         Subsystem ID (bits 31:16) | Subsystem Vendor ID (bits 15:0)
0x30-0x3C    -

NVMe Refresher
Start                     End                       Size  Symbol   Description
0x0                       0x07                      8     CAP      Controller Capabilities
0x08                      0x0b                      4     VS       Version
0x14                      0x17                      4     CC       Controller Configuration
0x28                      0x2f                      8     ASQ      Admin Submission Queue Base Address
0x30                      0x37                      8     ACQ      Admin Completion Queue Base Address
...
0x1000                    0x1003                    4     SQ0TDBL  Submission Queue 0 Tail Doorbell (Admin)
0x1004                    0x1007                    4     CQ0HDBL  Completion Queue 0 Head Doorbell (Admin)
0x1008                    0x100b                    4     SQ1TDBL  Submission Queue 1 Tail Doorbell
0x100c                    0x100f                    4     CQ1HDBL  Completion Queue 1 Head Doorbell
0x1000 + (2n << 2)        0x1003 + (2n << 2)        4     SQnTDBL  Submission Queue n Tail Doorbell
0x1000 + ((2n + 1) << 2)  0x1003 + ((2n + 1) << 2)  4     CQnHDBL  Completion Queue n Head Doorbell

Using MMIO, the host writes the tail/head values to the relevant doorbells: "Ringing the Doorbell"
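
The doorbell offsets in the table follow a simple formula, sketched below. The table corresponds to a doorbell stride of 4 bytes (CAP.DSTRD = 0); the NVMe specification defines the general stride as 4 << CAP.DSTRD, which the sketch takes as a parameter. The function names and the `bar0` pointer are placeholders for illustration, not libvfn API.

```c
#include <assert.h>
#include <stdint.h>

#define NVME_DOORBELL_BASE 0x1000u

/* Offset of the Submission Queue qid Tail Doorbell; dstrd is CAP.DSTRD. */
static uint64_t sq_tail_doorbell(uint16_t qid, uint8_t dstrd)
{
	assert(dstrd < 16);
	return NVME_DOORBELL_BASE + (2ull * qid) * (4ull << dstrd);
}

/* Offset of the Completion Queue qid Head Doorbell; dstrd is CAP.DSTRD. */
static uint64_t cq_head_doorbell(uint16_t qid, uint8_t dstrd)
{
	assert(dstrd < 16);
	return NVME_DOORBELL_BASE + (2ull * qid + 1) * (4ull << dstrd);
}

/* "Ring" a doorbell: a single 32-bit store into the memory-mapped BAR. */
static void ring_doorbell(volatile void *bar0, uint64_t offset, uint32_t value)
{
	assert(offset % 4 == 0);
	*(volatile uint32_t *)((volatile uint8_t *)bar0 + offset) = value;
}
```

With `dstrd` set to 0, queue 1 yields 0x1008 and 0x100c, matching the SQ1TDBL and CQ1HDBL rows. On real hardware the store would typically be preceded by a write barrier so that the queue entry is visible to the device before the doorbell is rung.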