
SNIA-SDC23-Mazzie-Understanding-Applications-Through-NVMe_1.pdf


Understanding Applications Through NVMe Driver Tracing Using BPF

John Mazzie
Member of Technical Staff, Systems Performance Engineer
Micron Technology, Inc.
Virtual Conference, September 28-29, 2021
2023 Micron Technology, Inc. All Rights Reserved.

Agenda

- BPF and the NVMe Driver
- Application Analysis: MLPerf Storage

BPF and the NVMe Driver

BPF Overview

- Originally "Berkeley Packet Filter"
  - Developed to analyze network traffic
- Integrated with the kernel
- Executes sandboxed programs in the kernel
- Used to trace, profile and monitor
- Utilizes a just-in-time compiler
- Verification engine to protect kernel space
- Various features supported in different kernel versions
  - Kernel 3.18: eBPF VM
  - Kernel 5.5: BPF Trampoline support
- BPF stack (kernel) is limited to 512 bytes
  - Use maps to increase memory availability

Methods of Tracing Kernel/Drivers

- Tracepoints
  - Stable interface
  - Managed by developers over multiple kernel versions
  - Limited to the data provided by the tracepoint
- Kprobes (kernel probes: kprobe/kretprobe)
  - Can attach/register a probe to virtually any instruction
  - Attachment to non-kernel methods/functions requires a debug kernel
  - Can access data not directly provided
  - Unstable interface: kernel functions are not stable across versions
- BPF Trampoline (kfunc/kretfunc and fentry/fexit)
  - Interface is similar to kprobes
  - Lower overhead than kprobes
  - Doesn't cause events to be missed due to interruption
  - Requires kernel support (added in mainline kernel 5.5)

Need

Original: multiple tools

- Blktrace
  - Used to analyze the read/write pattern going to the device at the block layer
  - Requires post-processing to get the necessary output
- Nvmelat
  - Bpftrace-based tool that gives a latency histogram of transactions at the driver layer
  - Could miss some transactions

New tool

- Data processing done inline
- Collects data for every transaction

Linux Storage Stack

[Diagram: Linux storage stack. Applications issue I/O through VFS/File System (page cache or direct I/O) to the block layer (request queue, I/O scheduler, plug/unplug), which passes BIOs and requests to the block driver (nvme) and the host bus driver for data transfer to the storage device. The diagram also shows DMA mapping/unmapping, the IOMMU, host RAM, HW IRQ handling, and the enqueue/complete task paths. blktrace observes the block layer; nvmetrace observes the NVMe driver.]

NVMeTrace

- Collects information on every transaction in the nvme driver
  - Starting LBA
  - Transaction size/length
  - Start time / completion time / latency
  - Process ID/name
  - Device
  - Queue ID
  - Transaction type (read, write, flush, admin)
- Developed using libbpf
- Kernel version specific (sometimes)

Why Libbpf?

- Bpftrace
  - High-level scripting language
  - Helpful for building tools quickly
  - Built on bcc and libbpf
  - Limited control over implementation
- Libbpf
  - More difficult entry point
  - More detailed control over implementation
    - Kernel space handlers
    - User space processing and output
  - CO-RE (Compile Once Run Everywhere)
    - Can be done, but might be difficult to implement depending on tool requirements

Code Flow

Kernel space

- Memory maps
  - Store data in the program while it is being processed
  - Use per-CPU memory maps to avoid locking of the map
- Ring buffer
  - Used to transfer processed data to user space
- Three handlers tracing functions in the NVMe driver
  - nvme_setup_discard
    - Handles parsing multiple discards sent as a single DSM command
  - nvme_submit_cmd
    - Handles submission of transactions to the NVMe device queue
    - Collects information about the transaction and stores it in a memory map
  - nvme_complete_rq
    - Handles completion of transactions; called when the interrupt is activated
    - Gets the completion time of the transaction
    - Calculates latency
    - Puts processed data on the ring buffer

User space

- Loads the BPF application
  - Verification is done by the JIT compiler/BPF VM
- Handles data passed through from kernel space

Request/Command Structure

- Request
  - Structure containing data from the block layer provided to the NVMe driver
- nvme_iod
  - Structure containing NVMe I/O data
  - Exists immediately after the request in memory
  - Contains nvme_request, nvme_command, nvme_queue
- Not every traced function receives pointers to all of these structures
  - Limits direct access and reusability of code across kernel versions
- The tool needs access to request and nvme_command in all functions
  - Getting data from nvme_iod and request requires moving around memory
- Jumping between structures in memory requires knowledge of the specific structures
  - Size, members, relative memory locations
- Function interfaces and structures are not stable across kernel versions
  - A new kernel version could require a recompile, or even a rewrite of handler logic
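The pointer arithmetic behind "exists immediately after the request in memory" can be illustrated in plain user-space C. This is a minimal sketch under stated assumptions, not the driver's or the tool's code: every `mock_` name and every structure layout below is hypothetical and far smaller than the real kernel definitions. It only shows how a handler that receives a command pointer could step back to the enclosing I/O descriptor and then to the request placed directly in front of it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, heavily simplified stand-ins for the kernel structures. */
struct mock_request {
    uint64_t sector;
    uint32_t data_len;
};

struct mock_nvme_command {
    uint64_t slba;
    uint16_t length;
    uint8_t opcode;
};

struct mock_nvme_iod {
    int aborted;
    struct mock_nvme_command cmd;
};

/* One allocation: the request, immediately followed by the driver data. */
struct mock_slot {
    struct mock_request rq;
    struct mock_nvme_iod iod;
};

/* The sketch relies on there being no padding between the two parts. */
static_assert(offsetof(struct mock_slot, iod) == sizeof(struct mock_request),
              "driver data must start right after the request");

/* Step from a member pointer back to the structure that contains it. */
#define mock_container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Driver data begins one request-size past the request. */
void *mock_rq_to_pdu(struct mock_request *rq)
{
    return rq + 1;
}

/* Inverse step: from the driver data back to the request. */
struct mock_request *mock_rq_from_pdu(void *pdu)
{
    return (struct mock_request *)((char *)pdu - sizeof(struct mock_request));
}

/* What a handler has to do when it is only handed the command pointer. */
struct mock_request *mock_request_from_cmd(struct mock_nvme_command *cmd)
{
    struct mock_nvme_iod *iod =
        mock_container_of(cmd, struct mock_nvme_iod, cmd);
    return mock_rq_from_pdu(iod);
}
```

Both steps depend on structure sizes and member offsets, which is why a change to either structure between kernel versions can force a recompile or a rewrite of the handler.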

nvme_setup_discard Handler

- Loops in BPF are hard
  - Must have a defined end
  - JIT compiler does a basic check
  - A loop helper exists in newer kernel versions: bpf_loop
- Discards are sent through the Data Set Management (DSM) command
  - Up to 256 discards per DSM command
  - Need to loop through the individual discards

    SEC("fentry/nvme_setup_discard")
    int BPF_PROG(do_nvme_setup_discard, struct nvme_ns *ns, struct request *req,
                 struct nvme_command *cmnd)
    {
        int temp_index;
        struct bio *_bio = BPF_CORE_READ(req, bio);

        // max ranges = 256 for discard DSM command.
        for (int index = 0; index < /* ... missing in source ... */
                    temp_discard_data->slba = 0;
                    temp_discard_data->length_bytes = 0;
                    temp_discard_data->length_lbas = 0;
                    break;
                }
                temp_discard_data->slba = BPF_CORE_READ(_bio, bi_iter.bi_sector) >>
                                          (BPF_CORE_READ(ns, lba_shift) - 9);
                temp_discard_data->length_bytes = BPF_CORE_READ(_bio, bi_iter.bi_size);
                temp_discard_data->length_lbas = temp_discard_data->length_bytes >>
                                                 BPF_CORE_READ(ns, lba_shift);
                _bio = BPF_CORE_READ(_bio, bi_next);
            } else
                break;
        }
        return 0;
    }

nvme_submit_cmd Handler

- Generates pointers to the necessary memory locations for structures
- Checks if memory is available on the heap
- Starts collecting available data for the event
- Checks if it is a non-admin command
  - Length = 1 (no device name) indicates an admin command
- Stores collected information in event_map for use in the nvme_complete_rq handler

    SEC("fentry/nvme_submit_cmd")
    int BPF_PROG(do_nvme_submit_cmd, struct nvme_queue *nvmeq,
                 struct nvme_command *cmd, bool write_sq)
    {
        struct nvme_iod *iod = container_of(cmd, struct nvme_iod, cmd);
        struct request *req = blk_mq_rq_from_pdu(iod);
        __u64 req_address = (__u64)req;
        int index = 0;
        struct event *temp_event = bpf_map_lookup_elem(&heap, &index);

        if (temp_event) {
            int length;

            temp_event->qid = BPF_CORE_READ(nvmeq, qid);
            temp_event->pid = bpf_get_current_pid_tgid() >> 32;
            bpf_get_current_comm(temp_event->process_name,
                                 sizeof(temp_event->process_name));
            temp_event->opcode = BPF_CORE_READ(cmd, common.opcode);
            length = bpf_probe_read_str(temp_event->device_name,
                                        sizeof(temp_event->device_name),
                                        BPF_CORE_READ(req, rq_disk, disk_name));
            if (length > 1) {
                if (temp_event->opcode == nvme_cmd_read ||
                    temp_event->opcode == nvme_cmd_write) {
                    __u32 size = 511;
                    temp_event->slba = BPF_CORE_READ(cmd, rw.slba);
                    temp_event->length_bytes = BPF_CORE_READ(req, __data_len);
                    temp_event->length_lbas = BPF_CORE_READ(cmd, rw.length) + 1;
                } else if (temp_event->opcode == nvme_cmd_dsm) {
                    // slba, length_bytes, and length_lbas get handled with nvme_setup_discard
                    // Setting to 0 until set at completion
                    temp_event->slba = 0;
                    temp_event->length_bytes = 0;
                    temp_event->length_lbas = 0;
                } else {
                    temp_event->slba = 0;
                    temp_event->length_bytes = 0;
                    temp_event->length_lbas = 0;
                }
            } else {
                // Admin Command
                temp_event->slba = 0;
                temp_event->length_bytes = 0;
                temp_event->length_lbas = 0;
            }
            temp_event->start_time_ns = bpf_ktime_get_ns();
            bpf_map_update_elem(&event_map, &req_address, temp_event, BPF_ANY);
        }
        return 0;
    }

nvme_complete_rq Handler

- Gets the matching information for the request from event_map
- Reserves space on the ring buffer
- Calculates latency
- Writes all collected data to the ring buffer for user space processing

    SEC("fentry/nvme_complete_rq")
    int BPF_PROG(do_nvme_complete_rq, struct request *req)
    {
        __u64 req_address = (__u64)req;
        struct event *info = bpf_map_lookup_elem(&event_map, &req_address);

        if (info) {
            struct event *e;

            e = bpf_ringbuf_reserve(&ring_buffer, sizeof(*e), 0);
            // This is allocating too slow
            if (!e) {
                bpf_printk("BUFFER FULL!\n");
                return 0;
            }
            e->start_time_ns = info->start_time_ns;
            e->end_time_ns = bpf_ktime_get_ns();
            e->latency_ns = e->end_time_ns - e->start_time_ns;
            e->qid = info->qid;
            e->pid = info->pid;
            bpf_probe_read_str(e->process_name, sizeof(e->process_name),
                               info->process_name);
            bpf_probe_read_str(e->device_name, sizeof(e->device_name),
                               info->device_name);
            e->opcode = info->opcode;
            e->slba = info->slba;
            e->length = info->length;
            bpf_map_delete_elem(&event_map, &req_address);
            bpf_ringbuf_submit(e, 0);
        }
        return 0;
    }

Example Output

    start_time_ns,end_time_ns,latency_ns,process_name,pid,device,qid,slba,length_bytes,length_lbas,opcode
    945661828630244,945661828679823,49579,systemd-udevd,823,nvme2n1,18,0,4096,8,2
    945661828720722,945661828744932,24210,systemd-udevd,823,nvme2n1,18,8,4096,8,2
    945661828762102,945661828780561,18459,systemd-udevd,823,nvme2n1,18,24,4096,8,2
    945661833805074,945661833822884,17810,systemd-udevd,823,nvme2n1,18,0,4096,8,2
    945661833841224,945661833856614,15390,systemd-udevd,823,nvme2n1,18,8,4096,8,2
    945661833869263,945661833884423,15160,systemd-udevd,823,nvme2n1,18,24,4096,8,2
    945661838342307,945661838359766,17459,systemd-udevd,823,nvme2n1,18,0,4096,8,2
    945661838394956,945661838431165,36209,systemd-udevd,823,nvme2n1,41,8,4096,8,2
    945661838451645,945661838466984,15339,systemd-udevd,823,nvme2n1,41,24,4096,8,2
    945661839510777,945661839552986,42209,systemd-udevd,55562,nvme2n1,31,30005842432,4096,8,2
    945661839579855,945661839596465,16610,systemd-udevd,55562,nvme2n1,31,30005842592,4096,8,2
    945661839609995,945661839625125,15130,systemd-udevd,55562,nvme2n1,31,0,4096,8,2

BPF Helpers

- bpf_ktime_get_ns()
  - Gets the current kernel timestamp
- bpf_get_current_comm()
  - Gets the name of the process that triggered the event being traced
- bpf_get_current_pid_tgid()
  - Gets the PID of the process that triggered the event being traced
- BPF_CORE_READ()
  - Reads the memory space of structures
  - Can read arbitrarily deep through structures with pointers
- bpf_probe_read_kernel() / bpf_core_read()
  - Reads an arbitrary memory location
- bpf_probe_read_str() / bpf_core_read_str()
  - Reads a string value and stores it at another point in memory
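The submit/complete pairing in the handlers above can be mimicked in user-space C to make the data flow concrete. This is a rough sketch, not the tool's implementation: a fixed-size array (the hypothetical `inflight_map`, with an assumed capacity of 64) stands in for the BPF hash map keyed by request address, and the timestamps are passed in rather than read with `bpf_ktime_get_ns()`. The last function shows the same `>> 32` split of the combined value that the submit handler applies to `bpf_get_current_pid_tgid()`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define INFLIGHT_SLOTS 64 /* hypothetical capacity for the sketch */

struct inflight {
    uint64_t req_addr; /* key: address of the request */
    uint64_t start_ns; /* value: submission timestamp */
    bool used;
};

struct inflight_map {
    struct inflight slot[INFLIGHT_SLOTS];
};

void inflight_init(struct inflight_map *m)
{
    assert(m != NULL);
    memset(m, 0, sizeof(*m));
}

/* Submission: remember when this request started (overwrites an old entry). */
bool on_submit(struct inflight_map *m, uint64_t req_addr, uint64_t now_ns)
{
    struct inflight *free_slot = NULL;

    for (size_t i = 0; i < INFLIGHT_SLOTS; i++) {
        struct inflight *s = &m->slot[i];
        if (s->used && s->req_addr == req_addr) {
            s->start_ns = now_ns;
            return true;
        }
        if (!s->used && !free_slot)
            free_slot = s;
    }
    if (!free_slot)
        return false; /* map full: this transaction would be missed */
    free_slot->req_addr = req_addr;
    free_slot->start_ns = now_ns;
    free_slot->used = true;
    return true;
}

/* Completion: find the matching entry, compute latency, delete the entry. */
bool on_complete(struct inflight_map *m, uint64_t req_addr, uint64_t now_ns,
                 uint64_t *latency_ns)
{
    for (size_t i = 0; i < INFLIGHT_SLOTS; i++) {
        struct inflight *s = &m->slot[i];
        if (s->used && s->req_addr == req_addr) {
            *latency_ns = now_ns - s->start_ns;
            s->used = false;
            return true;
        }
    }
    return false; /* completion without a recorded submission */
}

/* Upper 32 bits of the combined value hold the process (thread group) ID. */
uint32_t pid_from_pid_tgid(uint64_t pid_tgid)
{
    return (uint32_t)(pid_tgid >> 32);
}
```

Feeding in the first row of the example output (start 945661828630244, end 945661828679823) yields a latency of 49579 ns, matching the latency_ns column.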

BPF CO-RE

https:/

- Compile Once Run Everywhere
  - Compile once and execute on multiple kernel versions
  - Helper functions and methodology that help develop portable applications
- BTF (BPF Type Format)
  - Debug information that describes all kernel/driver type information
  - Used by the BPF verifier
  - Finds matching structures and gets offsets for structure members
  - Makes it possible to access a member of a structure without fully defining that structure
  - Build the kernel with CONFIG_DEBUG_INFO_BTF=y

Application Analysis: MLPerf Storage

How do we size storage for AI training?

- MLCommons produces many AI workload benchmarks
  - Training, Inference, Tiny, HPC, etc.
- The Training benchmark has been scaled to nearly 4 thousand accelerators
- The performance of storage has been optimized out of the Training benchmark
  - It can't be used for measuring storage workload
- Options:
  - De-optimize the training process
  - Develop a new process
- De-optimizing
  - Limit memory to the system to prevent filesystem caching
  - Some datasets are very small, and it is impossible to find a memory capacity that allows the models to train properly without caching the entire dataset
- Develop a new process
  - Must do I/O in the same way as the real AI training process
  - Must reduce hardware requirements for testing (few storage vendors have hundreds of GPU systems for load testing)
  - Must provide larger datasets to limit the effect of filesystem caching

MLPerf Storage Benchmark

- Uses the tool DLIO from Argonne Leadership Computing Facility (ALCF)
  - Uses the same data loaders as the real workload (pytorch, tensorflow, etc.) to move data from storage to CPU memory
  - Implements a sleep in the execution loop for each batch
    - Sleep time is computed from running the real workload
  - A sleep time and batch size effectively define an accelerator
    - How much data per batch and how long to spend on forward and backward passes
  - Scale up/out testing is performed by adding clients running DLIO and using MPIO for multiple emulated accelerators per client
- MLPerf Storage
  - Defines a set of configurations to represent results submitted to MLPerf Training
    - Version 0.5: BERT and Unet3D (NLP and 3D medical imaging)
  - Allows scale-out and scale-up testing without requiring GPUs
  - Reported metrics are:
    - Samples per second
    - Number of supported accelerators
  - Requires maintaining a minimum Accelerator Utilization for a passed test
- Still in early development. Get involved!
  - https://mlcommons.org/en/groups/research-storage/

Unet3D: I/O throughput versus time

- For a single accelerator (top plot)
  - Data transferred in 1-second intervals ranges from 0 to 600 MB, with peaks to 1,600 MB
  - The peaks correspond to the start of an epoch, where the prefetch buffer is filled before starting compute
- For 15 accelerators (bottom plot)
  - Near the drive's limit (17 accelerators)
  - The workload continues to have bursty behavior, with some 1-second intervals showing 0 MB transferred
  - While the system does hit the maximum throughput of the device, the low QD and idle times result in an average throughput that is 15-20% less than the maximum supported
  - Traditional tools will not show the peak throughput as measured here

[Plots: I/O throughput versus time for 1 ACC (top) and 15 ACC (bottom)]
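The per-interval throughput figures above can be derived from per-transaction trace records by summing bytes into 1-second buckets. The following is a minimal sketch of that aggregation under stated assumptions: the `io_record` layout, the function name, and the choice to bucket by completion time are all illustrative and are not taken from the talk's analysis scripts.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NS_PER_SEC 1000000000ULL

/* Hypothetical subset of one trace row: completion time and size. */
struct io_record {
    uint64_t end_time_ns;
    uint64_t length_bytes;
};

/*
 * Sums the bytes of each record into the bucket for the second in which the
 * transaction completed, counted from t0_ns. Returns the number of records
 * that fell outside the bucket range and were skipped.
 */
size_t bytes_per_second(const struct io_record *recs, size_t n, uint64_t t0_ns,
                        uint64_t *buckets, size_t nbuckets)
{
    size_t skipped = 0;

    assert(buckets != NULL || nbuckets == 0);
    for (size_t b = 0; b < nbuckets; b++)
        buckets[b] = 0;

    for (size_t i = 0; i < n; i++) {
        if (recs[i].end_time_ns < t0_ns) {
            skipped++;
            continue;
        }
        uint64_t sec = (recs[i].end_time_ns - t0_ns) / NS_PER_SEC;
        if (sec >= nbuckets) {
            skipped++;
            continue;
        }
        buckets[sec] += recs[i].length_bytes;
    }
    return skipped;
}
```

A bucket left at zero corresponds to a 1-second interval with no data transferred, which is the idle behavior described for the 15-accelerator run. The bursts and idle gaps only show up because every transaction is recorded individually.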

Unet3D: Queue depth versus time

- 1 accelerator (top plot):
  - Over time, queue depth remains low (less than 10)
  - After the initial ramp, QD remains constant even during epoch starts, which showed higher MB per second
- 15 accelerators (bottom plot):
  - Queue depth peaks at 145 early, then stabilizes at 120 and below
  - This heavily loaded system still has low queue depth operations

[Plots: queue depth versus time for 1 ACC (top) and 15 ACC (bottom)]

Unet3D: Percent of I/Os by queue depth for 1 accelerator

- For 1 accelerator:
  - Less than 1% of I/Os are at queue depths 2-5
  - Nearly 50% of I/Os were inserted as the only I/O in the queue
  - Nearly 50% were inserted as the second I/O in the queue (QD1)
- Note: The specific transfer size depends on the device, block settings, and filesystem settings, but we consistently see the maximum available size (512KB to 1280KB)

Unet3D: Percent of I/Os by queue depth for 15 accelerators

- For 15 accelerators:
  - We see a distribution of queue depths
  - The bump at low QDs is important
    - A not-insignificant number of I/Os are inserted at very low queue depths (less than 5)
    - This behavior introduces idle time in workloads that were expected to be constantly high throughput

How device settings can affect I/O pattern

- Maximum Data Transfer Size (MDTS)
  - Controller setting
  - Sets the maximum transfer size the drive will accept
    - /sys/block/nvmeXnY/queue/max_hw_sectors_kb (value in KiB)
  - Can be adjusted down
    - "echo <value> > /sys/block/nvmeXnY/queue/max_sectors_kb"
    - max_sectors_kb is the working limit in the OS
- Namespace Optimal Write Size (NOWS)
  - Namespace setting
  - Cannot be adjusted in the OS
  - Hint for applications and file systems; not enforced by the drive

Unet3D: I/O Blocksize Pattern, 16 Accelerators, XFS Filesystem

[Plots: I/O block size pattern for three configurations: MDTS 4MiB / NOWS 4KiB; MDTS 4MiB / NOWS 256KiB; MDTS 512KiB / NOWS 256KiB]

Future Improvements

- Trace of files accessed
- Trace application processes
- Analysis improvements

Please take a moment to rate this session. Your feedback is important to us.

Reference Links

- libbpf - https:/
- Performance Tools (Brendan Gregg) - https:/
- Storage - https://mlcommons.org/en/groups/research-storage/
