2023 SNIA. All Rights Reserved. Virtual Conference, September 28-29, 2021

Benchmarking Storage with AI Workloads
Presented by Devasena Inupakutika, Charles Lofton, Bridget Davis
Samsung Semiconductor Inc.

Motivation
- Growing production datasets: 10s to 100s of petabytes
- Samsung's datacenter storage and memory products
- Research on the impact of storage on AI/ML pipelines is limited
- How do we showcase the impact of Samsung datacenter products on real-world workloads?

Introduction
- Benchmarking is essential to evaluating storage systems: storage needs for large
machine learning datasets are growing
- Evaluating storage for AI workloads is challenging: real-world AI training requires specialized hardware, and system resources are stressed by the AI application
- Do AI workloads benefit from high-performance storage systems?
- Is there a realistic method to showcase high-performance storage for AI workloads?
- Can the test methods be easily implemented and reproduced?

Introduction
- Benchmark datasets are smaller, whereas data is the moving force of AI algorithms
- Real-world production workloads demand huge amounts of data, both for training and for generation during streaming
- Empirical study to understand how AI workloads utilize storage devices through their I/O patterns

AI Workloads I/O Characterization
- A better understanding of AI I/O profiles provides insights for the design and configuration of storage systems
- Main aspects under consideration: I/O rates, throughput rates, randomness, locality of reference, I/O size distribution, % reads vs. writes

Blocktrace Analysis of AI Workloads
- Gives deeper insight into the I/O profile
- The block report generated by "btt" provides detail about each I/O: command (read or write), precise timestamp, starting LBA, ending LBA
- From this data we can derive:
  - Randomness: if the starting address of I/O "B" equals the ending address of I/O "A", the I/O is sequential
  - Read/write ratios
  - I/O size distribution: ending LBA minus starting LBA equals the block size in sectors
  - Locality of reference: some address ranges are accessed more frequently than others

Rule of Thumb
- AI workloads are computation bound: loading a 200 KB image takes 200 us, while classifying an image takes 10 ms
- Parallelize AI jobs to saturate I/O: use a cluster of GPUs and keep every GPU busy
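The btt-based derivations described above (randomness, read/write ratio, size distribution) can be sketched in a few lines of Python. This is a minimal sketch assuming the block report has already been parsed into (command, timestamp, start, end) tuples in sector units; the record layout and the `characterize` helper are illustrative, not btt's actual output format.

```python
from collections import Counter

def characterize(ios):
    """Derive read %, randomness %, and I/O size distribution from records."""
    reads = sum(1 for op, _ts, _s, _e in ios if op == "R")
    sequential = 0
    prev_end = None
    sizes = Counter()
    for _op, _ts, start, end in ios:
        if prev_end is not None and start == prev_end:
            sequential += 1  # starts exactly where the previous I/O ended
        prev_end = end
        sizes[(end - start) * 512] += 1  # sector count -> bytes
    n = len(ios)
    return {
        "read_pct": 100.0 * reads / n,
        "random_pct": 100.0 * (n - sequential) / n,
        "size_bytes": sizes,
    }

# Three reads: the second continues the first (sequential), the third jumps away.
trace = [("R", 0.001, 0, 256), ("R", 0.002, 256, 512), ("R", 0.003, 8192, 8448)]
stats = characterize(trace)
```

The sequentiality test mirrors the rule on the slide: an I/O counts as sequential when it starts at the previous I/O's ending address.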
I/O-Intensive Methodologies
Benchmarking AI workloads in customer-representative scenarios

Limiting Memory
- Goal: accurately model a realistic workload with a very large training dataset; readily available benchmark datasets are small and fit in memory
- Stress storage in a small but realistic test environment
- Control the dataset-size-to-memory ratio, e.g. for the MLPerf ImageNet dataset (150 GB), via Docker memory limit options:

  Dataset Size (GB) | System Memory (GB) | Ratio
  150               | 768                | 1:5
  150               | 64                 | 2.5:1
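The dataset-to-memory ratios in the table can be turned into a container memory limit. A small sketch (the helper name is hypothetical; the resulting value would then be passed via Docker's --memory option):

```python
def mem_limit_for_ratio(dataset_gb, dataset_to_memory_ratio):
    """System memory (GB) that yields the target dataset:memory ratio.

    A ratio above 1 means the dataset exceeds memory, forcing reads from
    storage during every epoch; below 1 the dataset fits in memory.
    """
    return dataset_gb / dataset_to_memory_ratio

# 150 GB ImageNet at a 2.5:1 dataset:memory ratio -> 60 GB limit, in the
# same range as the 64 GB configuration in the table above.
limit_gb = mem_limit_for_ratio(150, 2.5)
```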
Simultaneous Data Ingestion and Training
- Normally, training is not run in isolation: there are multiple models to be trained
- Realistic scenario: data ingestion and training happen together

Training in Parallel
- Training parallelism: storage must meet the needs of concurrent data ingestion by different training jobs
- Hyper-parameter tuning: run tens to hundreds of instances of the same training job with different configurations of the model

Inference: Streaming Applications
- Inference is more likely to be I/O bound
- Training involves roughly 3x the computation of inference: forward propagation, backward propagation, and weight updates
- Being less CPU bound implies the possibility of being I/O bound

I/O Challenges for Streaming Applications
- Large volume of concurrent input data: one 4K 30 fps video stream is 45 Mbps (6 MBps); 1000 video streams are 45 Gbps (6 GBps)
- Massive intermediate data from the different stages in a pipeline
- Video processing pipeline: videos are split into frames; stages are isolated into containers; each stage consumes frames from the previous stage; frames are passed through Apache Kafka with replicas

Test System
For the inference testbed: compute node cluster (Kubernetes) and storage (message broker) cluster (Kafka, deployed via Helm charts)

Hardware Components | Details
GPU     | 8x Nvidia Tesla V100S, 32 GB
CPU     | Intel Xeon Platinum 8268, 2.9 GHz, 2 sockets, 2 threads per core, 96 (24*2*2) total cores, 768 GB system memory
Storage | Local: 1 Samsung PM9A3 (3.49 TiB) drive per host; PCI Express Gen4 x4 interface, U.2 (EXT4 file system)

Software Components | Details
Ubuntu | 20.04 focal
Tensorflow (tensorflow-gpu), MLPerf | Version: 2.4.1
Docker | Version: 20.10.12
CUDA Toolkit | Version: CUDA-11.2
FIO | Version: 3.26-59
ResNet50 v1.5 model | Distributed multi-GPU training with the ImageNet ILSVRC2012 dataset
OpenMPI | Version: 3.0.0
Horovod | Version: 0.24.2

[Figure: inference pipeline with ingestion, face detection, feature extraction, and classification stages on a Kubernetes compute cluster, exchanging frames through Apache Kafka storage clusters (SDP OS, Samsung SSDs)]

Dataset and Model Details
Task | Model | Framework | Dataset details
Image classification training | ResNet50 | Tensorflow-gpu: 2.4.1 | ImageNet-1k
Video streaming and recognition: inference through the image classification model | ResNet50 | Tensorflow-gpu: 2.11.0 | 1. Videos: a. Big Buck Bunny, frame rate 24 FPS, resolution 1920x1080, size 45 MB, duration 09:56 min; b. CostaRica, frame rate 60 FPS, resolution 3840x2160, size 1.13 GB, duration 05:13 min. 2. ImageNet-1k validation dataset

Impact of Limiting Memory

Baseline vs. Limited Memory: Disk Profiles
- Disk throughput is
substantially increased, 48x
- Training time does not change much when limiting memory with faster, performant storage

*Zero values are discarded from disk metric statistics in the tables. Disk I/O, throughput, block sizes, response time, and CPU/GPU utilization % are average values.

Metric | Baseline | Limited Memory
Avg. IOPS | 23 | 2,244
Avg. Throughput (MiB/s) | 5.84 | 280.46
Avg. Block Size (KiB) | 169.55 | 170.23
Avg. Response time (s) | 203.63 | 185.91
Training time (minutes) | 364 | 357

System Resources
- Baseline and limited-memory runs exhibit comparable performance

I/O Profile: ResNet50 Single-Model Training
I/O | Read % | Random % | Avg. IOPS | Request sizes, KiB (min/median/max/mean/std. dev., reads then writes)
Total | 99.94% | 83.88% | 639 | 4128256 616
Random | 99.96% | 100% | 536 | 4481081313
Sequential | 99.85% | 0% | 353044441918 (IOPS and sizes)
Nearly 100% read, 84% random, with I/O sizes ranging from 4K to 256K

Trace statistics: I/O plots and locality histogram
- Random and sequential reads fall within a relatively narrow address range
- High locality of reference

Trace statistics: I/O Request Sizes
- Random reads ranged from 4K to 256K, but more than 99% were either 128K or 256K (left)
- Random write I/O sizes were more diverse (right); the sequential I/O size distribution was similar

Simultaneous Data Ingestion and Training

Baseline vs. Limited Memory: Disk Profiles
Metric | Baseline | Limited Memory
Avg. IOPS | 25,054 | 25,035
Avg. Throughput (MiB/s) | 3,162.59 | 3,181.91
Avg. Block Size (KiB) | Read: 169.8, Write: 128 | Read: 170.4, Write: 128
Avg. Response time (ms) | 79.418 | 75.48
Training time (minutes) | 373.15 | 373
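As a quick consistency check on disk-profile tables like the one above, average IOPS multiplied by the average block size should approximate the average throughput. A minimal sketch, assuming a roughly 130 KiB blended block size (an assumption, since the exact read/write mix is not reported):

```python
def implied_throughput_mib_s(iops, avg_block_kib):
    """Throughput (MiB/s) implied by an IOPS figure and an average block size."""
    return iops * avg_block_kib / 1024.0

# Baseline simultaneous ingestion + training: ~25,054 IOPS at roughly 130 KiB
# per I/O (a blend of the 169.8 KiB read and 128 KiB write averages) should
# land near the reported ~3,162 MiB/s.
approx = implied_throughput_mib_s(25054, 130)
```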
System Resources
- GPU utilization is unaffected: the GPU does not handle data ingestion operations
- CPU iowait increases due to parallel data ingestion

I/O Characterization
I/O | Read % | Random % | Avg. IOPS | Request sizes, KiB (min/median*/mean/std. dev. reads, then min/median/max/mean/std. dev. writes)
Baseline | 0.33% | 95.47% | 24,741 | 285081286
Limited Memory | 1.78% | 93.86% | 24,786 | 42562455241285081287
*Also the maximum read
- The I/O profile is mostly write and mostly random
- The primary difference between baseline and limited memory is in the read profile
- In the baseline training run, disk reads occur primarily in the first epoch because the entire dataset fits in memory
- In the limited-memory run, reads from disk occur during all training epochs

Trace statistics: Write I/O plots and locality
- Writes are 95% random, but locality of reference is high
[Figure: write I/O plots, baseline vs. limiting memory]

Training in Parallel

Parallel Models Training: Disk Profiles
Containers / Parallel Models | 1 | 2 | 4 | 8
GPUs per training workload | 8 | 4 | 2 | 1
Batch Size | 4512
Disk I/O | 1,658.3 | 1,679.94 | 2,805.26 | 1,245.34
Disk Throughput (MiB/s) | 276.55 | 419.56 | 351.32 | 310.72
Block (KiB) | 169.55 | 253.71 | 127.31 | 254.2
Response time (s) | 203.63 | 304.57 | 162.71 | 195.88
Training time (minutes) | 364 | 258.2441682 (2, 4, and 8 model values)

System Resources
- CPU and GPU utilization increase with the number of read-intensive training workloads

I/O Characterization
Metric | 1 Model | 2 Models | 4 Models | 8 Models
Total Reads | 794,262 | 509,876 | 1,084,946 | 509,674
Mean Read Request | 170 KiB | 256 KiB | 128 KiB | 256 KiB
Median Read Request | 128 KiB | 256 KiB | 128 KiB | 256 KiB
Randomness | 83.9% | 95.4% | 74.8% | 92.6%
Locality Bands | 1 | 3 | 1 | 3
Percent of I/O received by 10% of the address space | 99% | 63% | 98% | 62%
- The 2-model and 8-model parallel training runs are similar: the average request size increased from 256 blocks to 512 blocks (256 KiB)
- 8-model training is 100% read, with randomness increasing from 75% (4 models) to 92%

Trace statistics: I/O Plots
- Two- and eight-model runs show several bands of activity distributed across the drive's address range
[Figure: I/O plots for single, two, four, and eight models]

Trace statistics: Locality
- Highest locality of reference in single-model training: 6% of the address space receives 99% of reads
- Two- and eight-model runs have reads more distributed across the drive's address range
[Figure: locality histograms for 1 & 4 models and 2 & 8 models]

Trace statistics: I/O Request Sizes
- Single model: random read request sizes ranged from 4 KiB to 256 KiB, mainly either 4 KiB or 256 KiB
- Four models: most reads are 128 KiB
[Figure: request size distributions for single, two, four, and eight models]
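FIO appears in the software stack above (v3.26), so a measured profile can be replayed synthetically. Below is a sketch that emits a hypothetical fio job approximating the single-model read profile; the parameter values are assumptions derived from the measurements, not the authors' actual job file.

```python
# A hypothetical fio job approximating the measured single-model read profile:
# ~100% read, ~84% random, request sizes clustered at 128K/256K, and a skewed
# (high-locality) address distribution.
PROFILE = {
    "rw": "randread",                   # read-only workload
    "percentage_random": "84",          # ~84% random, remainder sequential
    "bssplit": "128k/50:256k/50",       # two dominant request sizes
    "random_distribution": "zipf:1.2",  # skewed accesses -> locality
    "direct": "1",                      # bypass the page cache
    "filename": "/dev/nvme0n1",         # placeholder target device
    "runtime": "300",
    "time_based": "1",
}

def to_job_file(name, opts):
    """Render an fio job section in INI form."""
    return "\n".join([f"[{name}]"] + [f"{k}={v}" for k, v in opts.items()]) + "\n"

job = to_job_file("resnet50-read-profile", PROFILE)
```

Writing `job` to a file and running `fio` on it would stress a device with roughly the read mix, sizes, and locality reported above.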
Inference: Streaming Workload

Data Ingestion Disk Metrics
- Frame extraction from 300 concurrent streams, published to a topic: 27K IOPS
- Disk I/O and throughput increase with greater parallelism

Metric / Concurrent Streams | 300 streams, 24 FPS videos, 3 RF (6 partitions), 1 topic | 300, 24 FPS, 3 RF (6 partitions), 3 topics | 300, 60 FPS, 3 RF (6 partitions), 1 topic | 300, 60 FPS, 3 RF (6 partitions), 3 topics
Avg. IOPS | 4,471.79 | 7,327.74 | 27,637.63 | 13,234
Avg. Throughput (MiB/s) | 46.77 | 152.69 | 407.75 | 306.63
Avg. Block Size (KiB) | Read: 110.87, Write: 11.69 | Read: 44, Write: 18 | Read: 157.7, Write: 13.2 | Read: 125, Write: 21.18
Avg. Response time (s) | 838.37 | 1,489.38 | 975.29 | 1,223.09

System Resources
- CPU overhead increased when going from 3 to 6 partitions but remained constant with a further increase to 12 partitions
- Videos with higher frame rate (FPS) and resolution showed relatively higher CPU utilization

Data Ingestion I/O Characterization
- Nearly 100% write, 70% random
I/O | Read % | Random % | Avg. IOPS | Write request sizes, KiB (min, median, max, mean, std. dev.)
30 Streams | 0.08% | 71.43% | 281 | 4, 4, 764, 32, 96
100 Streams | 0.54% | 69.92% | 4,224 | 4, 8, 764, 64, 140
- Writes are more widely distributed across the SSD's address range as streams increase
- The standard deviation suggests a high diversity of write sizes
[Figure: write I/O plots, 30 streams vs. 100 streams]

Trace statistics: Locality of reference and I/O size distribution
- Write locality is high for both 30 and 100 streams, with 6% of the address space receiving 87% and 93% of writes, respectively
- The random write request size distribution was quite varied: 70% of random writes were 28K or less, but the remaining 30% ranged up to 764K
[Figure: 30 streams vs. 100 streams]

System Implications and Discussion
- The majority of the workloads studied were primarily random, with relatively high locality of reference: suitable for testing optimizations such as read caching and write coalescing
- Some workloads (e.g. inference streaming) exhibited a very diverse write I/O size distribution
- A useful "real-world" benchmarking tool for challenging high-performance storage systems

Conclusion
- Simultaneous data ingestion and training, and inference, were particularly effective benchmarks: these approaches present challenging, "real-world" workloads to storage
- Our testing indicates that high-performance storage allows the I/O-intensive and computationally intensive portions of the AI pipeline to run in parallel with minimal impact on training and inference times

Thank You!

Backup Slides

Summary Statistics
(columns: workload description; read %; random %; average IOPS; min, median, max, mean, and std. dev. read request (KiB); min, median, max, and mean write request (KiB); standard
deviation of write request (KiB); counts of random read, random write, sequential read, and sequential write operations; trace length in seconds)

Resnet50 Training Single Model | 99.94% | 83.88% | 6394481081616666,340265127,9221941,244
Resnet50 Training Two Models | 100.00% | 95.43% | 6004256256256644822486,584223,2922850
Resnet50 Training Two Models LM | 100.00% | 96.20% | 2,308425625666 46,231,3161,3121,824,85474420,823
Resnet50 Training Four Models | 99.95% | 74.79% | 890441281120811,309471273,637521,220
Resnet50 Training Eight Models | 100.00% | 92.59% | 2574256256256700000471,924037,74601,983
Inference Baseline, Video Streaming, Ingestion Phase (30 Streams, 3 Partitions) | 0.08% | 71.43% | 2850447643296773720,92740288,6053,599
Inference Baseline, Video Streaming, Ingestion Phase (100 Streams, 3 Partitions) | 0.54% | 69.92% | 422448764641408,0161,054,351260456,7033,599
Simultaneous Data Ingestion and Training (5 Epochs) | 0.33% | 95.47% | 24,7746474,458 175,355,09233,9608,305,4817,456
Simultaneous Data Ingestion and Training (5 Epochs, Limited Memory) | 1.78% | 93.86% | 24,7864256256245524,879,201 157,200,319154,185 10,321,8626,881
Training with Checkpointing Every 100 Steps | 93.27% | 92.61% | 55144161,280431567507,35512,52716,21425,2553,408
Training with Checkpointing Every 1252 Steps (Default Interval) | 99.68% | 96.78% | 5674161,280134362501,25629715,3481,3513,438
BERT 2000-Step Default Checkpoint Interval PM983 | 0.22% | 4.38% | 264492,7407461,1852,511
BERT 2000-Step Default Checkpoint Interval PM9A3 | 0.11% | 60.38% | 434481,28036176215164,87892108,2186,395
BERT 2000-Step Default Checkpoint Interval PM9A3 + Preconditioning + New FS | 0.23% | 0.49% | 2441,2801,2801,,1132,163
BERT 2000-Step Default Checkpoint Interval PM9A3 + New FS + Pytorch Framework | 0.00% | 3.47% | 1810000045081,28057944307,3820205,0781,176
BERT 2000-Step Limited Memory Default Checkpoint Interval PM983 | 0.27% | 3.63% | 264072,1496059,8182,380
BERT 2000-Step Limited Memory Default Checkpoint Interval PM9A3 | 0.12% | 58.17% | 454481,28036174219158,072106113,7076,110
BERT 2000-Step With 250-Step Checkpoint Interval PM983 | 0.10% | 3.70% | 23254339,655119254,3282,504
BERT 2000-Step With 250-Step Checkpoint Interval PM9A3 | 0.08% | 57.94% | 7264481,28089285196202,81499147,2792,680
BERT 2000-Step With Simultaneous Data Ingestion PM983 | 0.05% | 97.63% | 4,4704428127817,13533,601,8801,471814,0307,704
BERT 2000-Step With Simultaneous Data Ingestion PM9A3 | 0.04% | 99.32% | 24,341281,280127126,94962,821,43616,860411,6392,602

Please take a moment to rate this session. Your feedback is important to us.