DIF and DIX work in Lustre with LSI Fusion-MPT
Homer Li (李焱), 2023-10
China National GeneBank (深圳国家基因库)

Why DIF and DIX
- Show the silent data corruption
  - Medium errors
  - SSD NAND / magnetic media / DRAM errors
  - OS driver bugs
  - Hardware firmware issues
  - Transmission and receiving errors
- Shortcomings
  - Performance overhead
  - The implementation and configuration of DIX are relatively complex
  - DIX can only detect and correct some data errors

DIF/DIX with SCSI device (mpt3sas 44.00.00.00)
- DIX
  - Application: Lustre
  - OS: DIX/DIF is fully supported in RHEL 7.6 and later
- DIF (T10 PI)
  - HBA driver
  - SCSI device support
- Linux block driver
  - DM and MD linear
  - RAID0/1/10
- ZFS supports end-to-end data protection by design
  - Lustre CRC + ZFS checksum
  - Or add a bee watcher to test ZFS
- Diagram labels: Application, OS, HBA, IO expanders, switch, disk drive; DIF, driver, filesystem, block, DIX; Lustre, Linux; ZFS cksum, Lustre cksum
- Bit flipping in flight without a checksum (silent error)
- Bit flipping in the medium (silent error)
- RAID/EC does not help much
- DIF/DIX cannot replace full-stack protection

CONFIG with SCSI device
- Firmware support
  - Some pages were wrong; confirm with the vendor
  - sg_vpd --page=ei --long /dev/sdap
      SPT=1 (protection types 1 and 2 supported)
  - Some firmware does not fully support this; for more information, contact the vendor of the SCSI device
- mpt3sas kernel module, LSI Fusion-MPT 9300 / 9500
  - insmod mpt3sas.ko prot_mask=0x7f
- sg_format --format --size=4096 --fmtpinfo=2
  - Provides PI protection using 10- and 16-byte commands; does not allow the use of 32-byte commands
- sg_format --format --size=4096 --fmtpinfo=3 (recommended in production environments)
  - Provides checking control and additional expected fields within the 32-byte CDBs
- RHEL 7.6 and later fully support DIF/DIX
  - AlmaLinux 8.8 x86_64
- Lustre 2.12 and later / Lustre 2.15.3 and later
  - Lustre 2.15.3 changelog: LU-16413 - T10PI is broken for CentOS 8.x

Lustre Client 2.15.3
Client# lctl get_param osc.*.checksum_type
osc.lustre-OST0000-osc-ffff933082d9b800.checksum_type=crc32 adler crc32c t10ip4K
OSS# lctl get_param obdfilter.*.checksum_type
obdfilter.lustre-OST0000.checksum_type=crc32 adler crc32c t10ip512 t10ip4K t10crc512 t10crc4K
Client Lustre log:
(osc_request.c:1219:osc_checksum_bulk_t10pi()) GRD tags per page=2048, resend=0, bytes=4194304, pages=1024

OSS
lustre/include/uapi/linux/lustre/lustre_idl.h
enum ost_cmd: OST_READ = 3, OST_WRITE = 4
# bpftrace -e 'kr:obd_cksum_type_pack { printf("returned: 0x%x\n", retval); }'
returned: 0x6000
OBD_FL_CKSUM_T10IP4K = 0x00006000, /* T10PI IP cksum, 4KB sector */
# bpftrace -e 'kr:lustre_msg_get_opc { printf("returned: %d\n", retval); }'
returned: 3
returned: 4
Block device layer
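
The t10ip4K checksum type negotiated above uses an IP-style checksum as the guard tag, one tag per 4 KiB sector. The following is a minimal sketch of that arithmetic, not the Lustre or kernel implementation: `ip_checksum` is a plain RFC 1071 Internet checksum, the names `guard_tags`, `SECTOR_SIZE`, `GUARD_TAG_SIZE` and `PAGE_SIZE` are illustrative, a 4 KiB page size is assumed, and the byte order of the stored tag is not modelled. Under those assumptions, a 4 KiB page holds 2048 two-byte guard tags and a 4194304-byte bulk covers 1024 sectors, which appears to match the figures in the client log line (GRD tags per page=2048, bytes=4194304, pages=1024).

```python
import struct
from typing import List

SECTOR_SIZE = 4096     # t10ip4K: one guard tag per 4 KiB sector
GUARD_TAG_SIZE = 2     # a T10 PI guard tag is 16 bits wide
PAGE_SIZE = 4096       # assumption: 4 KiB pages, as on x86_64


def ip_checksum(data: bytes) -> int:
    """RFC 1071 Internet checksum: complement of the one's-complement sum of 16-bit words."""
    if len(data) % 2:
        data += b"\x00"                      # pad odd-length input with a zero byte
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    while total >> 16:                       # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def guard_tags(buf: bytes, sector_size: int = SECTOR_SIZE) -> List[int]:
    """Return one guard tag per sector of buf (hypothetical helper)."""
    return [ip_checksum(buf[off:off + sector_size])
            for off in range(0, len(buf), sector_size)]


if __name__ == "__main__":
    bulk = bytes(4194304)                    # 4 MiB of zeros, the bulk size seen in the log
    print("guard tags that fit in one page:", PAGE_SIZE // GUARD_TAG_SIZE)
    print("sectors (and guard tags) in the bulk:", len(guard_tags(bulk)))
```

The Internet checksum is cheaper to compute than the T10 CRC, which is the usual reason for choosing the `t10ip*` types over `t10crc*` when the HBA supports both.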
Host and Target
mpt3sas_cm0: host protection capabilities enabled DIF1 DIF2 DIF3 DIX0 DIX1 DIX2 DIX3
- default
- Enable DIX and DIF
cat /sys/block/sdam/integrity/device_is_integrity_capable
    /sys/block/sdam/integrity/tag_size
    /sys/block/sdam/integrity/write_generate
    /sys/block/sdam/integrity/read_verify
    /sys/block/sdam/integrity/format
1
2
1
1
T10-DIF-TYPE1-IP

Really Worked?
- Injected errors into the ZFS device and the DIF device; both show the error at the same time
Limitations
- Errors were injected only at the device; they were not triggered from the HBA or IO expander

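
The mpt3sas log line on the Host and Target slide lists seven capabilities, and the module was loaded with a protection mask of 0x7f. A small sketch of how such a mask decodes, assuming the bit positions of the Linux SCSI host protection capability flags (DIF types 1 to 3 in bits 0 to 2, DIX types 0 to 3 in bits 3 to 6); `decode_prot_mask` and `PROT_BITS` are illustrative names, not part of any driver:

```python
from typing import List

# Bit positions assumed from the Linux SCSI host protection capability flags
# (SHOST_DIF_TYPE1_PROTECTION = 1 << 0 ... SHOST_DIX_TYPE3_PROTECTION = 1 << 6).
PROT_BITS = [
    (1 << 0, "DIF1"),
    (1 << 1, "DIF2"),
    (1 << 2, "DIF3"),
    (1 << 3, "DIX0"),
    (1 << 4, "DIX1"),
    (1 << 5, "DIX2"),
    (1 << 6, "DIX3"),
]


def decode_prot_mask(mask: int) -> List[str]:
    """Return the capability names whose bits are set in mask."""
    return [name for bit, name in PROT_BITS if mask & bit]


if __name__ == "__main__":
    print(" ".join(decode_prot_mask(0x7F)))
```

With all seven bits set, the decoded list matches the capabilities reported in the log line; a mask of 0x7 would enable the three DIF types only, with no DIX.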
Production issues

Incomplete data
- A bad JBOD signal caused read/write errors; a lot of incomplete data was found
  - scrub repaired 1.04G in 69h18m with 0 errors on Mon Sep 18 20:53:09 2023
- Scrub to fix: read/write errors during scrub (endless loop)
- ZFS 2.2 (released 3 days ago), scrub error log (#12812, #12355): fixes the reported corrupt blocks with two or more scrubs, no endless loop
- Replace hardware and add a monitor script
  - ls -l /sys/class/sas_phy
- phy device driver errors (Target, IO Expander, Host)
  - e.g. phy-14:1:22
  - running_disparity_error_count
  - phy_reset_problem_count
  - loss_of_dword_sync_count
  - invalid_dword_count
  - Error count increased

SCSI Enclosure Monitor
- SCSI Enclosure Services
- Element: Device, Enclosure, Temperature, Cooling, Power, Voltage, Custom
- e.g. monitor power
  - Undervoltage may occur in some high-density JBODs (more power supply compatibility testing needed)
- Some pages are not fully supported by some hardware vendors
  - Upgrade firmware to fix

Lustre file system scan
- lfs find -type d
- List the metadata of each directory in parallel
- Find the huge directories
- Scan timeout and skip
- Impacts performance

Backup file by XOR (test)
- Single namespace/tape crashed
  - Data lost
- Supports different filesystems/object storage
- More complex workflow and amplification
- It can't replace 1+1 backup, just 1+
- D1 D2 D3 D4 P1
- Tar file
- /mount1 /mount2 /mount3 /mount4 /mount5
- NFS 4.2 / Lustre / local filesystem

Benchmark

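
The XOR backup test from the Production issues part (D1 to D4 plus P1, each on a different mount) relies on the usual single-parity property: XOR-ing any four of the five chunks reproduces the fifth. A minimal sketch of that idea, assuming in-memory byte strings rather than the tar files and mounts used in the test; `xor_parity` and `recover` are hypothetical helper names, and shorter chunks are zero-padded, so original lengths would have to be recorded separately:

```python
from functools import reduce
from typing import Iterable, List


def xor_parity(chunks: List[bytes]) -> bytes:
    """Byte-wise XOR of all chunks, zero-padding shorter chunks to the longest."""
    size = max(len(c) for c in chunks)
    padded = [c.ljust(size, b"\x00") for c in chunks]
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*padded))


def recover(surviving: Iterable[bytes], parity: bytes) -> bytes:
    """Rebuild the single missing chunk from the surviving chunks and the parity."""
    return xor_parity(list(surviving) + [parity])


if __name__ == "__main__":
    data = [b"D1-data.", b"D2-data.", b"D3-data.", b"D4-data."]
    p1 = xor_parity(data)
    # Simulate losing D3 (for example one namespace or tape) and rebuild it.
    rebuilt = recover([data[0], data[1], data[3]], p1)
    print(rebuilt == data[2])
```

A single parity chunk tolerates the loss of exactly one of the five locations, which is consistent with the slide's point that this scheme does not replace a full 1+1 backup.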
Performance gap
HDD bandwidth in spec
  Vendor 1: 215-262 MB/s
  Vendor 2: 216-250 MB/s
Average latency in spec
  Vendor 1: 4.2 ms
  Vendor 2: 4.16 ms

The job delayed
- The standard test workflow can't show anything
- The lower the better; upgrading firmware does not help

Dynamic test filesystem data integrity by filesystem
- Client mounts the filesystem
- Truncate a sparse file
- Set up a loop block device
  - ext4/xfs with mdadm
  - OpenZFS
- Write small files, then power off the server and client
- e2fsck/xfs_repair ext4 (loop dev)
- IO scrub ZFS (loop dev)
- dm-integrity (optional)
- First blood, Triple Kill, Double Kill
- The irresponsible vendor disables some sync semantics for higher benchmark scores.

2 bash commands to test the consistency of metadata
# name=$(openssl rand -hex 4); for i in {0..9999}; do echo $name > $(date +%s).$name.$i; done
# ls -lt --time=atime --time-style=+%s
- Add a timestamp to the filename
- Compare it with the metadata timestamp

- At the same time (Lustre, within 1 s) and high performance
- Every 15 secs sync once (another distributed file system) and low performance

Thanks
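
The comparison in the metadata consistency test (timestamp in the filename versus the access time reported by the filesystem) can also be scripted instead of read off the `ls` output. A rough sketch, assuming each file name starts with an epoch-seconds timestamp followed by a dot; that layout and the function name `timestamp_skew` are assumptions made for illustration:

```python
import os
from typing import Dict


def timestamp_skew(directory: str) -> Dict[str, int]:
    """Map each file name to (atime - timestamp embedded in the name), in whole seconds.

    Assumes names look like '<epoch-seconds>.<anything>'; other entries are skipped.
    """
    skews = {}
    for entry in os.scandir(directory):
        prefix = entry.name.split(".", 1)[0]
        if not entry.is_file() or not prefix.isdigit():
            continue
        skews[entry.name] = int(entry.stat().st_atime) - int(prefix)
    return skews


if __name__ == "__main__":
    skews = timestamp_skew(".")
    if skews:
        worst = max(skews.values(), key=abs)
        print(f"{len(skews)} files checked, largest skew {worst} s")
    else:
        print("no timestamped files found")
```

A largest skew within about one second would correspond to the Lustre result on the slide; a skew near 15 seconds would correspond to the other distributed file system that syncs metadata only periodically.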