Fabric Attached Memory: Hardware and Software Architecture
Clarete Crasta, Dave Emberson, Sharad Singhal
Virtual Conference, September 28-29, 2021
2023 Hewlett Packard Enterprise. All Rights Reserved.

Agenda
- Motivation
- Using fabric attached memory in HPC
  - Architecture
  - Software stack
- Results and use cases
  - Microbenchmarks
  - Arkouda-based graph processing
- Summary & future work

Need quick answers on larger data sizes
- Exponentially increasing data x exploding data sources x shrinking time to action = massive advances in computing power needed everywhere
- Data nearly doubles every two years (2013-25)
- Source: IDC Data Age 2025 study, sponsored by Seagate, Nov 2018

Characteristics of emerging applications
- AI and machine learning
  - Applications in simulation, modeling, large language models
- HPC workflows/pipelines, such as those in genomics
  - Applications that transform data in a workflow with large intermediate data sets
- Large-scale graphs, with applications in:
  - Security: website reputation, malware detection
  - Social networks: community detection, link prediction
  - Advertising: brand reputation, click-through prediction
  - Internet of things: traffic management, risk detection
- Applications have enormous memory footprints
  - Datasets can be 10s-100s of terabytes to multi-petabytes in size
  - Analytics performance is currently limited by the total amount of DRAM in the HPC cluster
- Random data access patterns
  - Processor caches are inefficient due to low hit rates
  - Distributed applications often require expert programmers
- Data movement introduces high latencies
  - Demand paging to SSDs is very slow
  - Moving data consumes time and energy
  - Difficult to optimize system resource locality or network performance
Merging memory and storage: the new hierarchy
- Access times down the hierarchy:
  - CPU registers: 150-300 ps
  - Cache: 1.2-30 ns
  - DRAM: 50-150 ns
  - SCM: 100-500 ns
  - SSD: 20-100 µs
  - HDD: 5-15 ms
  - Tape: 50-60 s
- "We are entering the HPC era of data intensive applications. Existing supercomputers are not suitable for these kinds of workloads. There is a 5-orders-of-magnitude gap in the current storage hierarchy." (Jiahua He et al., Proceedings of SC2010, "DASH: a Recipe for a Flash-based Data Intensive Supercomputer")

Technology advances
- Advances in memory technologies: HBM
- Advances in interconnects: Slingshot, CXL
- Heterogeneous compute
Emerging system architectures
- E1.S modules
  - x8 CXL at 32 GT/s = 32 GB/s max
  - Capacity up to 256 GB per module
- E3.S modules
  - x8 CXL at 32 GB/s; x16 CXL at 64 GB/s max (50 GB/s in practice)
  - Capacity up to 512 GB per module
- Running 64 CXL lanes at 32 GT/s = 256 GB/s max (200 GB/s in practice)
- CXL has higher latency than on-chip DDR controllers
- Switches can extend memory capacity beyond the limits of directly connected modules, but incur additional latency
- GNR AP: DDR bandwidth 500 GB/s, max capacity 6 TB (GNR SP: DDR bandwidth 350 GB/s, max capacity 4 TB)
- CXL memory: bandwidth 200 GB/s, max capacity 4 TB
- With CXL memory added, GNR AP bandwidth increases 40% and capacity 67%; GNR SP bandwidth increases 57% and capacity 100%
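As a sanity check on these figures (assuming PCIe Gen5-class signaling, where encoding overhead is only about 2%): one CXL lane at 32 GT/s carries roughly 4 GB/s per direction, so x8 gives about 32 GB/s, x16 about 64 GB/s, and 64 lanes about 256 GB/s. Likewise, adding the 200 GB/s of CXL memory bandwidth to 500 GB/s (GNR AP) or 350 GB/s (GNR SP) of DDR bandwidth yields the quoted 40% and 57% increases, and adding 4 TB of CXL capacity to 6 TB or 4 TB of DDR yields the 67% and 100% capacity increases.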
Fabric attached memory
- Benefits
  - Independent scaling of compute and memory
    - Improved utilization, reduced overprovisioning
    - Decoupling of failure domains
  - Reduced depth of the I/O stack for large workloads
    - Direct, unmediated access
  - Accessible by all compute resources
- Challenges
  - Coherence domains are limited
    - Individual nodes on RDMA-based interconnects
    - Small clusters on CXL-based interconnects
  - Fabric latency is large compared to local memory
    - Software usually has to be refactored for performance
[Diagram: compute nodes (SoCs with local DRAM) connected over an I/O network and a memory fabric to a fabric-attached memory pool of memory/NVM modules.]
GoldenTicket data-driven HPC architecture
- FAM on Slingshot is implemented with Memory Nodes

Prototype implementation
- 32 compute nodes: DL385, dual AMD "Milan" CPUs, 1 TB DRAM each
- FAM Partition A (10 nodes): DL385, dual AMD "Milan" CPUs, 4 TB DRAM each
- FAM Partition B (10 nodes): DL380, dual Intel "Ice Lake" CPUs, 1 TB DRAM each, 8 TB "Barlow Pass" SCM each
- Totals: 400 Gbps Slingshot per node; 50 TB DRAM-based FAM; 80 TB Optane-based FAM
- Future: memory servers with a pool of CXL memory, interconnected over Slingshot
Software stack - components
- Clients: HDF5, Arkouda, C/C++, OpenSHMEM, Chapel
- OpenFAM API
- Data path: RMA and RPC over Mercury/Libfabric
- Control path: RPC over gRPC or Mercury/Libfabric
- Client process management: Slurm, mpirun, PMIx
- Communication: Libfabric over Slingshot, InfiniBand, Omnipath, Sockets
OpenFAM
- API and reference implementation to program FAM
- Targeted at scale-up, scale-out, and emerging FAM architectures
- API is generic; supports both fabric-attached persistent memory and volatile memory
- Supports direct access or RDMA, depending on the environment
- Open source, available on GitHub (https:/ )
- C++ implementation, with C++ and C APIs

Functions supported
- Data path operations (illustrated in the sketch after this list)
  - Get, put, gather, scatter
  - Blocking and non-blocking variants
  - Fetching and non-fetching atomics
  - Map/unmap: map FAM into the process address space; available in scale-up environments (CXL memory support planned)
- Memory ordering and collectives
  - Quiet, fence, barrier
  - Multi-threading and I/O contexts
- Memory management
  - Region creation, resizing, destruction
  - Data item allocation, deallocation
- Miscellaneous
  - FAM-to-FAM copy
  - Permission management
  - Data item/region lookup by name
  - FAM backup and restore
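To make the API shape concrete, here is a minimal put/get round trip based on the open-source OpenFAM C++ API. The region and item names, sizes, and error handling (omitted) are ours, and exact signatures (for example, the final argument of fam_create_region) vary between OpenFAM releases, so treat this as an illustrative sketch rather than a definitive usage.

```cpp
// Minimal OpenFAM round trip: create a region, allocate a data item in it,
// write it from local memory with a blocking put, and read it back with a
// blocking get. Sketch only; check your OpenFAM release for exact signatures.
#include <cstring>
#include <iostream>
#include <fam/fam.h>

using namespace openfam;

int main() {
    fam *myFam = new fam();
    Fam_Options opts;
    memset(&opts, 0, sizeof(opts));
    myFam->fam_initialize("default", &opts);      // join the FAM group

    // A region is a named pool of FAM; data items are allocated inside it.
    Fam_Region_Descriptor *region =
        myFam->fam_create_region("demoRegion", 1 << 20, 0777, NULL);
    Fam_Descriptor *item =
        myFam->fam_allocate("demoItem", 4096, 0777, region);

    char src[4096], dst[4096];
    memset(src, 'x', sizeof(src));

    // Blocking data-path operations return only after the transfer completes.
    myFam->fam_put_blocking(src, item, 0 /*offset*/, sizeof(src));
    myFam->fam_get_blocking(dst, item, 0 /*offset*/, sizeof(dst));
    std::cout << (memcmp(src, dst, sizeof(src)) == 0 ? "match" : "mismatch")
              << std::endl;

    myFam->fam_deallocate(item);
    myFam->fam_destroy_region(region);
    myFam->fam_finalize("default");
    return 0;
}
```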
Throughput tests for blocking calls
- Throughput scales linearly with the number of PEs and memory servers
- For 8 memory servers (200 GB/s aggregate bandwidth):
  - fam_get_blocking: 192.8 GB/s
  - fam_put_blocking: 193.6 GB/s
- Test parameters (a measurement sketch follows this list):
  - Interleave size: 65,536
  - I/O operations: 100
  - Message size: 64 MiB
  - Data items/PE: 1
  - Number of memory servers: 1, 2, 4, 8
  - Number of PEs: 2, 4, 8, 16
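The message size and operation count below come from the slide; the region/item names and overall loop structure are our assumptions about how such a per-PE blocking-throughput measurement might be written, not the benchmark's actual code.

```cpp
// Per-PE throughput sketch for fam_put_blocking: time 100 blocking puts of
// 64 MiB each and report GB/s. Illustrative only; the real microbenchmark
// also sweeps PEs, memory servers, and interleave size.
#include <chrono>
#include <cstring>
#include <iostream>
#include <vector>
#include <fam/fam.h>

using namespace openfam;

int main() {
    constexpr uint64_t kMsgBytes = 64ULL << 20;   // 64 MiB per operation
    constexpr int kOps = 100;                     // 100 I/O operations

    fam *myFam = new fam();
    Fam_Options opts;
    memset(&opts, 0, sizeof(opts));
    myFam->fam_initialize("default", &opts);

    Fam_Region_Descriptor *region =
        myFam->fam_create_region("bwRegion", 2 * kMsgBytes, 0777, NULL);
    Fam_Descriptor *item =
        myFam->fam_allocate("bwItem", kMsgBytes, 0777, region);

    std::vector<char> buf(kMsgBytes, 'x');
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < kOps; ++i)                // blocking: each put completes
        myFam->fam_put_blocking(buf.data(), item, 0, kMsgBytes);
    auto t1 = std::chrono::steady_clock::now();

    double secs = std::chrono::duration<double>(t1 - t0).count();
    std::cout << "fam_put_blocking: "
              << (double(kMsgBytes) * kOps / secs) / 1e9 << " GB/s\n";

    myFam->fam_deallocate(item);
    myFam->fam_destroy_region(region);
    myFam->fam_finalize("default");
    return 0;
}
```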
LSD Radix Sort application
- Phases of a parallel LSD radix sort: a series of stable sorts, iterating through a sequence of select functions, to create a sorted result of 10-byte keys
- Changes to the radix sort to use FAM (see the sketch after this list):
  - Step 0: data read from FAM instead of SSD
  - Step 3: global histogram updates in FAM
  - Step 6: count (all-to-all) updates use FAM
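For readers unfamiliar with the algorithm, the sketch below shows the per-digit stable counting-sort pass at the core of an LSD radix sort. The slide's implementation sorts 10-byte keys and distributes the histogram and scatter steps across PEs with the shared counts held in FAM; for simplicity this sequential sketch uses 64-bit keys, and the function names are ours.

```cpp
// One stable counting-sort pass of an LSD radix sort over one byte of the
// key: histogram the digit (cf. slide step 3), prefix-sum the counts into
// bucket offsets, then stably scatter (cf. the step 6 count updates).
#include <array>
#include <cstdint>
#include <vector>

void radix_pass(const std::vector<uint64_t> &in, std::vector<uint64_t> &out,
                int byteIndex) {
    std::array<size_t, 256> count{};             // per-digit histogram
    for (uint64_t key : in)
        ++count[(key >> (8 * byteIndex)) & 0xFF];

    size_t offset = 0;                           // exclusive prefix sum gives
    for (auto &c : count) {                      // each bucket's start offset
        size_t n = c;
        c = offset;
        offset += n;
    }

    for (uint64_t key : in)                      // stable scatter into buckets
        out[count[(key >> (8 * byteIndex)) & 0xFF]++] = key;
}

// Full LSD sort: repeat the pass from the least significant byte to the
// most significant, ping-ponging between two buffers.
void lsd_radix_sort(std::vector<uint64_t> &data) {
    std::vector<uint64_t> tmp(data.size());
    for (int b = 0; b < 8; ++b) {                // 8 bytes for 64-bit keys
        radix_pass(data, tmp, b);
        data.swap(tmp);
    }
}
```

Because each pass is stable, earlier (less significant) byte orderings are preserved, which is why running the pass once per byte yields a fully sorted result.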
Benefit of FAM in SHMEM-based LSD radix sort
- Purpose: compare the SHMEM-only version against hybrid-mode results; graphs show overall application execution time
- Configuration:
  - Problem size from 64 million to 512 million elements per PE
  - Total 32 PEs (2,048 million to 16,384 million elements)
  - 16 OMP threads per PE
  - For FAM access (hybrid mode), 4 OpenFAM memory servers
  - OpenFAM v3.1.0, Cray-OpenSHMEMx 11.5.6
- Inference: in initial experiments, hybrid mode performs better than the SHMEM-only version as the problem size increases; end-to-end application time is 45 to 55 percent better than the SHMEM-only version
Generalized data-flow for HPC workflows
High performance computing at scale for interactive workloads
- Ingest threads: periodically ingest data from external sources (e.g., scientific instruments)
- Analysis threads: background processing of incoming data; long-running analyses; interactive queries
- In-memory data store: high-speed access to data from queries or analysis threads
- Archival storage: backup of data no longer in use
(A threading sketch of this pattern follows.)
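The sketch below shows the ingest/analysis split in miniature. The FAM-resident data store is mocked here as an in-process map guarded by a mutex; in the real workflow it would be OpenFAM regions shared across nodes, and the thread names, batch sizes, and timings are purely illustrative.

```cpp
// Ingest/analysis data-flow sketch: one thread periodically deposits batches
// into a shared store; another incrementally processes each batch once.
#include <atomic>
#include <chrono>
#include <iostream>
#include <map>
#include <mutex>
#include <thread>
#include <vector>

std::map<int, std::vector<double>> store;    // stand-in for the FAM data store
std::mutex storeMu;
std::atomic<bool> done{false};

// Ingest thread: periodically pulls a batch from an external source
// (here, fabricated data standing in for a scientific instrument).
void ingest() {
    for (int batch = 0; batch < 5; ++batch) {
        std::vector<double> sample(1000, batch * 1.0);
        {
            std::lock_guard<std::mutex> lk(storeMu);
            store[batch] = std::move(sample);
        }
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
    }
    done = true;                              // set after the last batch lands
}

// Analysis thread: processes each batch exactly once as it arrives,
// reading it back from the shared store (incremental processing).
void analyze() {
    int next = 0;                             // next unprocessed batch id
    for (;;) {
        bool have = false;
        double sum = 0.0;
        {
            std::lock_guard<std::mutex> lk(storeMu);
            auto it = store.find(next);
            if (it != store.end()) {
                for (double v : it->second) sum += v;
                have = true;
            }
        }
        if (have) {
            std::cout << "batch " << next << " sum " << sum << "\n";
            ++next;
        } else if (done) {
            break;                            // everything ingested and processed
        } else {
            std::this_thread::sleep_for(std::chrono::milliseconds(5));
        }
    }
}

int main() {
    std::thread producer(ingest);
    std::thread consumer(analyze);
    producer.join();
    consumer.join();
    return 0;
}
```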
High performance interactive data analytics
- Example application workflow using FAM-enhanced Arkouda

Benefit of FAM-based intermediate data store
- Incremental processing leverages previously computed results stored in FAM to speed time to new results

Summary and future work
- A fabric attached memory architecture for HPC applications
  - Defined the hardware components and configurations
  - Built the software stack to access and use FAM
  - Performance benchmarks for FAM access latencies and throughput
  - Use cases that can benefit from FAM
- Future work
  - CXL memory and future memory technologies on memory nodes
  - More ecosystem enablement: OpenSHMEM, Chapel, Arkouda, data formats

Please take a moment to rate this session. Your feedback is important to us.