Xfast: Extreme File Attribute Stat Acceleration for Lustre
Qian Yingjin, Liu Y, Nov. 3rd

Outline
- Background and motivation
- Xfast design and implementation
  - Scalable statahead
  - Batch RPC engine
  - Subtree aggregate statahead (SAS)
  - Size on MDT (SoM)
  - Scale-out statahead
  - Thrashing avoidance
- Performance evaluation
- Conclusion and future work

Background and Motivation
- Data is growing at an extreme pace: 10,000,000+ files in a single directory.
- Many HPC applications suffer most from slow directory scans.
- Directory tree walks take a long time (minutes to hours).
- How can directory tree walk performance be improved?
  - File attribute stat
  - Attribute fetching and caching in Lustre
  - Prefetch
  - Aggregate

File Attribute Stat
- Serialized POSIX interface
  - Retrieval operates on only a single directory entry at a time.
  - Traversing a directory with millions of entries can take tens of minutes because of repetitive stat() calls.
  - Use predictable access patterns to prefetch metadata.
- POSIX semantics
  - The most recent file information must be returned when listing directories.
  - The new statx() system call lets applications request specific attributes to minimize unnecessary overhead.
  - This reduces the number of RPCs per statx() operation and allowed us to implement the lazy and strict Size on MDT (SoM) features for Lustre.
- Parallel prefetching of attributes
  - mpiFileUtils: dfind, drm, dcp
  - Convert the serial stat() access from the user process into parallel, asynchronous access.

Attribute Fetching and Caching in Lustre
stat() path in Lustre:
1. An RPC is sent to the MDT to acquire a lock;
2. The MDT returns a protected read (PR) lock, along with the metadata attributes and the layout extended attribute (EA);
3. The client sends a glimpse PR lock request with the extent range [0, EOF] to the OSTs to obtain the current file size and blocks attributes.

Distributed lock manager (DLM)
- Protects data and metadata consistency.
- If a client holds a read lock, it can access the data or metadata locally, without concern that another client modifies it.
- Cached locks on the client provide strong consistency for file attribute caching.

[Figure: stat(). A client calling stat(2) sends lookup/enqueue to the MDT and receives attributes and a lock, then sends glimpse lock requests to OSTs 1..N and receives attributes and locks; the locks are kept in the client cache.]

Xfast Design and Implementation
- Scalable statahead
- Batch RPC engine
- Subtree aggregate statahead (SAS)
- Size on MDT (SoM)
- Scale-out statahead
- Thrashing avoidance

Overview of released Lustre features related to Xfast

Feature                                      Lustre version   Year
FLAT statahead                               v1.8             2009
Asynchronous glimpse lock                    v2.2             2012
Lazy size on MDT (LSoM)                      v2.12            2018
Strict size on MDT (SSoM)                    new              -
Batch RPC engine                             v2.14            2021
Batched statahead                            v2.16            2023
Subtree aggregate statahead (SAS)            new              -
Scale-out statahead (pENT, pSTL and pSTH)    new              -
File naming pattern statahead                In

Scalable Statahead
- Flat statahead algorithm (Lustre 1.8 in 2009)
  - Traverse a flat directory: opendir() followed by readdir() and stat().
  - A kernel statahead thread is launched when the kernel detects user stat() calls in readdir() order.
  - The statahead thread is notified to release its resources when the user process stops the directory traversal by calling closedir().
- Asynchronous glimpse lock (AGL) for size (Lustre 2.2 in 2012)
  - Once attributes are obtained from the MDT, the entry is pushed into the AGL pipeline.
  - The AGL thread scans its pipeline and sends asynchronous glimpse RPCs to the OSTs to fetch file sizes.

[Figure: Simplified statahead workflow for ls]

Batch RPC Engine (Lustre 2.16 in 2023)
- Statahead batching packs several dentry names resulting from a readdir() call into one large batched RPC, which is transferred via bulk I/O.
- Increases communication efficiency.
- Reduces the message size by compacting requests with a similar format.
- batch_max controls the maximum number of items to batch in one aggregate RPC.
- statahead_max controls the statahead window size, default 1024 (batch_max = statahead_max).

[Figure: SAS algorithm for DFS mode (statahead_max=8, dmax=3), drawn on an example tree of directories d0 to d4 and files f0 to f27]

Subtree Aggregate Statahead (SAS)
- Tools such as find and du follow a depth-first search (DFS) access pattern.
- SAS: FLAT + DFS
  - It always starts with the FLAT algorithm and switches to DFS mode if the traversing process descends into the first subdirectory.
  - It is controlled via statahead_max for a directory and via dmax for the new maximum subdirectory lookahead.
- Numbered steps in the figure: 1: f0 f5, d1, d3; 2: f6 f10, d2; 3: f16 f18; 4: f11 f15, d4; 5: f19

Size on MDT (SoM)
- Lazy SoM (LSoM, Lustre 2.12 in 2018)
  - Reduces the number of RPCs required to fetch the size of a file, but cannot guarantee its accuracy.
  - Stores the latest file size update and its block count as extended attributes on the MDT, which are accessible via a single RPC without contacting several OSTs.
  - Updated on the MDT at file close() and truncate().
- Strict SoM (SSoM)
  - An entry is added to the Lustre changelog every time a file is opened for write or truncated.
  - A dedicated Lustre client uses a lease lock to access these changelog records.
  - A flag can be specified in stat() to return the strict or the lazy size.

Scale-out Statahead
Combine Xfast with mpiFileUtils to provide scale-out performance for tree walks.
- Parallel stat on entries (pENT)
  - A single file is the minimal work set for the parallel tree walk.
  - Files within a directory can therefore be randomly distributed among different MPI ranks.
  - Breaks the sequential stat() order from readdir().
- Statahead with limit (pSTL)
  - Trade-off strategy that balances parallelization and statahead speedup.
  - Performs a local directory walk for the first stmax (default 256) files in a directory using the FLAT algorithm.
  - Enqueues the remaining entries into the global libCircle queue.
- Statahead by hash division (pSTH)
  - Hashing the filename ensures that file names and file name sizes evenly partition the hash key space, especially for a larger directory.
  - Splits the stat() workload under a directory evenly according to the hash space (by segment_size).

Thrashing Avoidance
If statahead guesses the wrong access pattern, scarce memory and I/O bandwidth are wasted. In this case:
- Statahead decreases the next statahead window size by a factor of 2.
- When the window size decreases to 1, statahead either waits until the traversing process catches up to the current statahead position, or exits and disables statahead processing.
- When the traversing process catches up, statahead enlarges the window size.

Performance Evaluation
- Flat directory traversal
  - Comparison of FLAT and SSoM
  - Client-side caching of file attributes
  - Network bandwidth impact including batching
- SAS algorithm
  - FLAT vs. SAS
- Scale-out statahead
  - pENT vs. pSTL vs. pSTH
- IO500 mdtest

Testing Environment
- Lustre version: 2.14
- Server: 1 MDT, 8 OSTs (DDN AI400X Appliance: 20 x SAMSUNG 3.84 TB NVMe, 4 x IB-HDR100)
- Client: 16 nodes (1 x Intel Gold 5218 processor, 96 GB DDR4 RAM, CentOS 8.1 Linux)
- Network: InfiniBand IB-HDR100 (by default) + 1 Gbps Ethernet interface
  - The Lustre Network Request Scheduler Token Bucket Filter (NRS-TBF) is used to enforce RPC rate limits that emulate different server capabilities.
  - The netem tool is used to emulate different network conditions by adding delays of 1-10 ms to the 1 Gbps Ethernet network.
  - The notation XXX(m, n) denotes algorithm XXX run with statahead_max=m and batch_max=n.

Comparison of FLAT and SSoM
- ls -l on a directory with 1M file entries, with OST stripe counts between 1 and 16

Client-Side Caching of File Attributes

#files       Cold cache (s)   Warm cache (s)
1,000,000    221              15.7
100,000      21.5             0.993
10,000       2.26             0.102
1,000        0.253            0.015

ls -l for 1K to 1M files for FLAT(1,1) on a single OST

[Table: Statahead performance with write conflicts; time (s) of ls -l with FLAT(1024,1) for #nodes = 0, 2, 4, 6, 8, 10]

Network Bandwidth Impact Including Batching
- Impact of statahead_max and batch_max

[Figure: Speedup ratio for high network latencies (compared to FLAT(1,1)); 1 Gbps Ethernet, stripe_count=1, AGL enabled; series FLAT(1024,256)]

SAS Algorithm (FLAT+DFS) Evaluation

Mode              Thread count   stat() RPCs   Time (s)
FLAT(1024,1)      75,697         1,219,008     114
FLAT(1024,256)    75,665         108,439       112
SAS(1024,1)       1              1,219,544     73
SAS(1024,256)     1              21,108        68

find using FLAT vs. DFS mode

[Table: FLAT vs. DFS for different network latencies (0.1, 1, 2, 3, 4, and 5 ms); rows: Baseline, FLAT(1024,1), FLAT(1024,256), SAS(1024,1), SAS(1024,256)]

Traversed a directory containing 16 Linux source trees (linux-5.12-rc5) via the command find src -uid 0 with dmax=16.

Scale-out Statahead Evaluation
- Statahead combined with mpiFileUtils on 16 nodes
- dwalk on resource-limited metadata servers
- dfind runtimes with various network latencies
- Ran the dwalk and dfind commands on a flat directory with 1M files and on a directory containing 16 Linux source code trees (stmax=256, segment_size=4096).

IO500 mdtest: Sustained Performance Enhancements

Benchmark             Pre-SC19    SC19      ISC20       ISC22       SC22        ISC23       ISC23/Pre-SC19
IOR Easy Write        25.88       28.62     37.56       55.95       58.07       57.88       2.2x
IOR Easy Read         39.94       41.72     45.95       83.86       77.56       79.08       2.0x
IOR Hard Write        2.78        2.96      2.77        5.02        5.27        5.38        2.0x
IOR Hard Read         8.99        42.19     40.81       39.73       49.36       50.77       5.6x
Find                  1,735.41    810       1,698.00    6,248.55    12,628.78   13,229.11   7.6x
Mdtest Easy Write     143.88      152.84    157.22      270.04      312.9       344.70      2.3x
Mdtest Easy Stat      455.03      451.97    453.51      740.01      1,278.50    1,276.31    2.8x
Mdtest Easy Delete    88.52       132.76    135.09      223.61      272.64      311.16      3.5x
Mdtest Hard Write     32.33       79.65     90.47       119.41      157.4       199.36      6.1x
Mdtest Hard Read      44.92       172.59    169         194.33      238.82      391.09      8.7x
Mdtest Hard Stat      20.41       449.93    446.75      514.36      1,214.03    1,105.33    54.1x
Mdtest Hard Delete    16.35       75.15     76.94       101.98      122.44      112.58      6.8x
Bandwidth             12.68       19.65     21.02       31.10       32.90       33.43       2.6x
IOPS                  91.41       207.62    232.69      368.48      544.23      603.39      6.6x
Score                 34.05       63.87     69.93       107.05      133.81      142.03      4.1x

Storage platform: ES400NV, ES400NVX, ES400NVX2
Hardware changes:
- 8 x CPU/node to 12 x CPU/node (1.5x)
- 1 x EDR/node to 1 x HDR200/node (2x)
- PCI Gen3 NVMe to PCI Gen4 NVMe (2x)

https://io500.org/submissions/view/657

Performance improvement goes beyond the hardware upgrades.

Conclusion and Future Work
- Xfast can significantly improve the performance of common directory operations.
- Future work
  - Other statahead patterns and optimizations
    - File naming statahead pattern
    - Given an input file name list, do batched statahead.
    - Combining statahead with readahead.
  - Improve the prefetching pipeline

Thank You!
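
The Batch RPC Engine slide describes packing the entry names from readdir() into aggregate requests of at most batch_max items. A minimal Python sketch of that grouping follows. The function names batch_names and rpc_count are hypothetical, and the sketch only models how names are grouped; it does not reproduce Lustre's RPC format or bulk I/O transfer.

```python
from typing import Iterator, List, Sequence


def batch_names(names: Sequence[str], batch_max: int) -> Iterator[List[str]]:
    """Yield consecutive groups of at most batch_max entry names.

    Each yielded list stands in for one aggregate request (hypothetical model).
    """
    if batch_max < 1:
        raise ValueError("batch_max must be at least 1")
    for start in range(0, len(names), batch_max):
        yield list(names[start:start + batch_max])


def rpc_count(num_entries: int, batch_max: int) -> int:
    """Number of aggregate requests needed to cover num_entries names."""
    if batch_max < 1:
        raise ValueError("batch_max must be at least 1")
    return -(-num_entries // batch_max)  # ceiling division
```

Under this model, a full statahead window of 1024 entries needs rpc_count(1024, 1) = 1024 requests without batching and rpc_count(1024, 256) = 4 requests with batch_max=256.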
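
The window rule from the Thrashing Avoidance slide can be sketched as a small state machine in Python. The class and method names are hypothetical, and this is a toy model rather than Lustre's kernel code. The slides say only that the window is enlarged once the traversing process catches up, so doubling it up to statahead_max is an assumption.

```python
class StataheadWindow:
    """Toy model of the statahead window adaptation (not Lustre's code)."""

    def __init__(self, statahead_max: int = 1024) -> None:
        if statahead_max < 1:
            raise ValueError("statahead_max must be at least 1")
        self.statahead_max = statahead_max  # upper bound on the window
        self.window = statahead_max         # current window size
        self.enabled = True

    def on_wrong_guess(self) -> None:
        """A prefetch guess missed: halve the next window, never below 1."""
        self.window = max(1, self.window // 2)

    def must_wait(self) -> bool:
        """At window size 1, statahead stops running ahead and waits."""
        return self.enabled and self.window == 1

    def on_caught_up(self) -> None:
        """The traversing process reached the statahead position: enlarge.

        Doubling, capped at statahead_max, is an assumption of this sketch.
        """
        if self.enabled:
            self.window = min(self.statahead_max, self.window * 2)

    def give_up(self) -> None:
        """Alternative to waiting at window size 1: disable statahead."""
        self.enabled = False
```

Starting from statahead_max=8, three wrong guesses shrink the window to 4, 2, and 1; a later catch-up grows it back to 2.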
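
The pSTL strategy from the Scale-out Statahead slide walks the first stmax entries of a directory locally and hands the rest to the global work queue. The Python sketch below illustrates only that split. The function pstl_split is hypothetical, and a collections.deque stands in for the libCircle queue, whose real API is not shown in the slides.

```python
from collections import deque
from typing import Deque, List, Sequence, Tuple


def pstl_split(entries: Sequence[str], stmax: int = 256) -> Tuple[List[str], List[str]]:
    """Split directory entries into a local part and a part to enqueue.

    The first stmax entries stay with the local walk, which keeps readdir()
    order so that statahead can help; the remainder is returned for the
    shared queue.
    """
    if stmax < 0:
        raise ValueError("stmax must not be negative")
    return list(entries[:stmax]), list(entries[stmax:])


def walk_directory(entries: Sequence[str], global_queue: Deque[str], stmax: int = 256) -> List[str]:
    """Return the entries to stat locally; enqueue the rest (stand-in queue)."""
    local, remaining = pstl_split(entries, stmax)
    global_queue.extend(remaining)
    return local
```

A directory with at most stmax entries is handled entirely by the local walk, so small directories never touch the shared queue in this model.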