Source: SNIA-SDC23-Montana-DNAe2c-ECC-for-DNA-Data-Storage.pdf (40 pages)

DNAe2c
ECC for DNA Data Storage: 10x Improvement over RS Codes
M. Montana, A. Marelli, R. Micheloni, V. DeCian, C. Spolaore, C. Tocalli
Presented by Mario Montana
Virtual Conference, September 28-29, 2021
2023 SNIA. All Rights Reserved.

Agenda
- About DNAalgo
- Why DNA storage and ECC
- Error Sources in the DNA Channel
- CODECs in literature
- The DNAalgo CODEC: DNAe2c
- Conclusion

DNAalgo Inc

DNAalgo: Company Profile
- Executive Team
  - Sabrina Barbato (biologist), CEO
  - Rino Micheloni (engineer), COO
  - Alessia Marelli (mathematician), CTO
  - Mario Montana (senior executive), Chief Strategy and Alliance Officer, Board Member
- Located in Italy
- Privately held

DNAalgo: Mission
To leverage "Information Theory" for fast and reliable DNA storage.
At DNAalgo we believe that data "manipulation" is the only way to make DNA storage reliable and fast enough for the storage industry; without reliability and speed, DNA storage won't go far beyond today's proof-of-concept stage.

DNAalgo's role inside the DNA Storage Pipeline
[Diagram: the CODEC (ENCODER and DECODER) wraps the Synthesis, Storage and Sequencing stages, each of which introduces errors.]

What we offer: 3 pillars

Why DNA storage and Why Error Correction?

Why DNA Storage?
- Massive amounts of data are being generated every day
- A new archival storage layer is needed beyond tape
- DNA storage enables:
  - Longevity
  - Low power
  - Capacity

DNA Storage Creates Challenges with respect to Errors
Nothing comes for free, so the main DNA storage issues are:
- A huge amount of data is stored together
- Data are read without order
- IDS channel
- Erasure / PCR replicas: blocks of data may be missing
[Figure label: Polymerase Chain Reaction, n cycles]

Error Correction Codes (ECC)
- Error correction was born as a part of information theory for telecommunications
- The main purpose is to correct the errors that occur during transmission over a medium
[Diagram: Data = k bits, plus ECC = n-k bits, form a codeword or frame of length n (the transmitted bits).]

Code Rates & ECC
- Code rate CR is defined as the ratio k/n
- A high code rate means less overhead and therefore a monetary gain
- A low code rate gives high correctability
[Figure: CR versus cost ($)]

Example: ECC in Communications
Phone wires and the internet
- In the late 80s home phones were a commodity in almost every house. At that time, the internet was becoming popular in homes.
- How was it possible to send digital data to every house without changing the infrastructure?
- Through the use of ECC, high-speed communication over a medium not originally created for this purpose was made possible.

Application Example: Flash Storage (SSD)
- In the early 2000s the Flash market was dominated by NOR memories. At that time NAND Flash arrived.
- While NOR Flash was reliable, NAND was not, due to its intrinsic structure.
- But NAND is fast and very scalable. In the same voltage space where it was possible to discriminate between 2 digital values, within a few years it became possible to have 4, 8, 16, ... distributions.
- NAND Flash is low cost and is used in a lot of applications such as SSDs.
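
The scaling step described above (2, then 4, 8, 16 distributions in the same voltage space) can be put in numbers. The sketch below assumes, purely for illustration, that the distributions are evenly spaced across a normalized voltage window; the function name `cell_stats` and the normalization are not from the talk.

```python
import math
from typing import Tuple

def cell_stats(levels: int) -> Tuple[int, float]:
    """Return (bits stored per cell, share of the voltage window per level)."""
    bits = int(math.log2(levels))   # 2 levels -> 1 bit, 16 levels -> 4 bits
    window_share = 1.0 / levels     # assumption: evenly spaced distributions
    return bits, window_share

for levels in (2, 4, 8, 16):
    bits, share = cell_stats(levels)
    print(f"{levels:2d} distributions: {bits} bit(s)/cell, "
          f"{share:.4f} of the voltage window each")
```

Each extra bit per cell halves the room available to every distribution, which is why overlaps, and therefore errors, become harder to avoid as density grows.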

NAND and SSD: Also a challenging medium
- The error region is each overlapping region between distributions.
- Distributions overlap due to usage and retention.
- In a space with 16 or 32 distributions, they overlap even in a fresh device. The estimated error probability is 10^-2.
- How can we use such an error-prone medium for enterprise-grade applications that need error rates closer to 10^-14?
- This is another example where, through the use of ECC, a very poor medium can be used for an application requiring higher performance than the medium can provide on its own.

The mission of ECC in DNA storage
- When DNA is created for storage use, it is not necessarily a bad medium by itself; it is actually quite stable, resulting in strong data retention.
- The problem lies in the synthesis (writing) and sequencing (reading) processes, which generate errors.
- In order to reduce the noise, these processes employ expensive and time-consuming techniques to write and read the information.
- A storage system more resilient to errors could potentially tolerate a more approximate synthesis or sequencing process, hence decreasing the time and cost of the biological methods. Maybe.
- Through the use of a strong ECC approach, poorer but lower-cost and faster writing and reading processes can be used for enterprise-grade storage applications.

Error Sources in the DNA Channel

IDS errors & Erasures
- During synthesis, errors arise from incomplete capping and from DNA damage during the oxidation and deblocking steps.
- These errors can be an insertion, a deletion or a substitution (IDS) at the nucleotide level.
- During sequencing, PCR is applied, so each strand is read a variable number of times (possibly 0 times), creating possible erasures of entire strands.
[Figure label: Polymerase Chain Reaction, n cycles]

Information Channel example
[Diagram: a DNA sub-string produces a random number of replicas; each replica can suffer insertion, substitution or deletion errors; if no replica is produced, the information is lost (erasure).]
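
The channel pictured above can be mimicked with a few lines of Python. This is a toy model, not the DNAssim simulator: the function names, the per-nucleotide error probabilities and the uniform draw of the replica count are all assumptions chosen for illustration.

```python
import random

NUCLEOTIDES = "ACGT"

def ids_noise(strand, p_ins, p_del, p_sub, rng):
    """Apply independent insertion/deletion/substitution errors to one strand."""
    out = []
    for base in strand:
        if rng.random() < p_ins:      # insertion of a random base before this one
            out.append(rng.choice(NUCLEOTIDES))
        if rng.random() < p_del:      # deletion: the base is dropped
            continue
        if rng.random() < p_sub:      # substitution: replace with a different base
            base = rng.choice([b for b in NUCLEOTIDES if b != base])
        out.append(base)
    return "".join(out)

def dna_channel(strands, p_ins, p_del, p_sub, max_replicas, rng):
    """Return one list of noisy reads per stored strand (empty list = erasure)."""
    reads = []
    for strand in strands:
        # Assumption: replica count is uniform in [0, max_replicas]; 0 means the
        # strand is never read, i.e. an erasure of the whole strand.
        n_replicas = rng.randint(0, max_replicas)
        reads.append([ids_noise(strand, p_ins, p_del, p_sub, rng)
                      for _ in range(n_replicas)])
    return reads

rng = random.Random(0)
stored = ["AACGTTGACGTG", "TTAAGCTGGCAG"]
noisy = dna_channel(stored, p_ins=0.01, p_del=0.01, p_sub=0.02, max_replicas=3, rng=rng)
for original, replicas in zip(stored, noisy):
    print(original, "->", replicas if replicas else "erasure (no replica read)")
```

Note that insertions and deletions change the length of a read, which is what makes this channel different from the substitution-only channels that classical ECCs were designed for.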

Current approaches for CODECs in DNA storage and How to Evaluate

State-of-the-art ECCs/CODECs for DNA Storage
- Error Correction Codes are used mainly for substitution errors or erasure errors
- Known codes used so far in the industry are:
  - Reed Solomon (Organick, Lee, et al. Random access in large-scale DNA data storage. Nature Biotechnology 36.3 (2018): 242-248.)
  - LDPC (Chandak, Shubham, et al. Improved read/write cost tradeoff in DNA-based data storage using LDPC codes. 2019 57th Annual Allerton Conference on Communication, Control, and Computing (Allerton). IEEE, 2019.)
  - Fountain (Erlich, Yaniv, and Dina Zielinski. DNA Fountain enables a robust and efficient storage architecture. Science 355.6328 (2017): 950-954.)
  - HEDGES (Press, William H., et al. HEDGES error-correcting code for DNA storage corrects indels and allows sequence constraints. Proceedings of the National Academy of Sciences 117.31 (2020): 18489-18496.)

How do we compare ECCs?
- Reed Solomon and LDPC are standard ECCs described for a substitution channel or an erasure channel
- Fountain codes are very powerful but mainly for erasures
- HEDGES codes have a very low code rate, are very computationally intensive, and are also used for erasures

| Code           | Erasure   | Insertion | Deletion  | Substitution |
| No codec       | NO        | partially | partially | partially    |
| Reed Solomon   | partially | NO        | NO        | partially    |
| LDPC           | NO        | YES       | YES       | YES          |
| Fountain       | YES       | NO        | NO        | NO           |
| HEDGES         | NO        | YES       | YES       | YES          |
| DNAalgo DNAe2c | YES       | YES       | YES       | YES          |

FER vs BER?
- Codes are generally evaluated by plotting the Frame Error Rate (FER) against the Bit Error Rate (BER); the FER is the probability of having an uncorrectable frame at a specific BER.
- Given a target FER, the better ECC is the one that reaches the target at the highest possible BER. In other words, the further to the right the curve sits, the better the ECC.
- We can compare two codes by computing the percentage of BER we gain while reaching the same target.
[Plot: FER versus BER. The maximum FER is 1, i.e. all the frames are uncorrectable. When the BER increases, the probability of failed frames increases too. The target FER and the BER gain between two curves are marked.]
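
To make the comparison rule concrete, here is a minimal sketch under an idealized assumption: a frame of n bits fails when more than t independent bit errors occur (bounded-distance decoding on a binary symmetric channel). Real decoders, iterative ones in particular, do not follow this model exactly, and the two (n, t) pairs below are hypothetical. For each code the sketch finds the largest BER that still meets a target FER and then reports the BER gain as a percentage.

```python
import math

def frame_error_rate(n, t, ber):
    """P(more than t of n independent bits are wrong), i.e. a binomial tail."""
    return sum(math.comb(n, i) * ber**i * (1.0 - ber)**(n - i)
               for i in range(t + 1, n + 1))

def max_ber_for_target(n, t, target_fer):
    """Largest BER (found by bisection) whose FER does not exceed target_fer."""
    lo, hi = 0.0, 0.5
    for _ in range(60):               # FER grows monotonically with BER
        mid = (lo + hi) / 2
        if frame_error_rate(n, t, mid) <= target_fer:
            lo = mid
        else:
            hi = mid
    return lo

target = 1e-9
ber_a = max_ber_for_target(n=256, t=8, target_fer=target)    # hypothetical code A
ber_b = max_ber_for_target(n=256, t=16, target_fer=target)   # hypothetical code B
gain_percent = 100.0 * (ber_b / ber_a - 1.0)
print(f"code A tolerates BER {ber_a:.2e}, code B tolerates BER {ber_b:.2e}")
print(f"BER gain of B over A at FER {target:.0e}: {gain_percent:.0f}%")
```

The code with the larger tolerated BER is the one whose curve sits further to the right in the FER-versus-BER plot.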

FER in DNA data storage
[Figure: a strand in the DNA domain, AACGTTGACGTGTTAAGCTGGCAGGGCCTACATGAC, and its bit representation in the ECC domain, 0000000000000000000000100111110100000011100001, with frame boundaries marked.]
- In the ECC domain strands are longer than in the DNA domain; the length is 1.5-2 times bigger depending on the mapping used.
- A frame can coincide with the strand length, but it can also span strands in different ways depending on the coding strategy used.

FER vs SuER/IER/DER
- In the DNA channel we do not have a single BER, because errors can be created by insertion, deletion and substitution, and all those probabilities are independent.
- The analysis is split into 2 graphs:
  - FER vs SuER (Substitution Error Rate), where IER (Insertion Error Rate) and DER (Deletion Error Rate) are fixed and equal
  - FER vs DER/IER, where SuER is fixed
[Plots: FER versus SuER with IER = DER = x; FER versus IER/DER with SuER = x]

FER vs erasures
- In the DNA channel some strands can be lost; this is what we call erasures. In any case it is possible to scramble the erased nucleotides among all the frames.
- In addition, we may add erasures during trace reconstruction: for example, if we find that a strand does not have the correct length, we may decide to erase the whole strand.
- In order to evaluate performance against erasures we provide a graph of FER vs dropout rate/erasure rate.
- In this graph, for a fixed dropout rate, the better code is the one that shows the smaller FER.
[Plot: FER versus dropout rate/erasure rate, with the FER gain between two curves marked]

DNAe2c

The Goal
- Error sources can change a lot depending on the synthesizing or sequencing machines, so we want an ECC with some tuning parameters so that it can be targeted to a specific error channel.
- In order to understand performance we need to run simulations of known codes with many different parameters (SuER, erasure, PCR distribution, etc.), so we need to implement different ECCs on DNAssim and run many simulations against our solution, i.e. SW/HW co-simulations.
- We want to keep the CR as high as possible in order to avoid writing too much, and to keep computational complexity and power consumption as low as possible, so the solution must be implemented in HW.

DNAe2c Solution
The solution is based on a proprietary code which must be:
- Iterative, so that latency can be changed by changing the number of iterations
- Flexible in code rate, so that we can change the number of parity bits that must be written according to the synthesizing machine in use
- Equipped with tricks (Recovery Mechanisms), so that we can enable/disable different tricks depending on the error conditions
- HW implementable, so that it will be easier to deploy in data center solutions
[Logo: DNAe2c, Noise Aware errors & erasures cleaner]
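
The link between code rate, parity bits and the amount of DNA that has to be written can be sketched numerically. The snippet uses CR = k/n as defined earlier in the talk and reads the 1.5-2x ratio between ECC-domain and DNA-domain lengths as 1.5 to 2 bits per nucleotide; that reading, the function name `write_cost` and the example payload are assumptions made for illustration.

```python
import math

def write_cost(data_bits, code_rate, bits_per_nt):
    """Return (parity_bits, nucleotides) needed to store data_bits at a code rate."""
    n = math.ceil(data_bits / code_rate)       # codeword length n = k / CR, rounded up
    parity_bits = n - data_bits                # ECC part of the codeword: n - k
    nucleotides = math.ceil(n / bits_per_nt)   # ECC-domain bits mapped to the DNA domain
    return parity_bits, nucleotides

k = 8000  # hypothetical payload size in bits
for cr in (0.9, 0.8, 0.5):
    for bits_per_nt in (2.0, 1.5):
        parity, nts = write_cost(k, cr, bits_per_nt)
        print(f"CR={cr}, {bits_per_nt} bits/nt: {parity} parity bits, "
              f"{nts} nucleotides to synthesize")
```

Lowering the code rate buys correction capability at the price of more nucleotides to synthesize, which is the reason for keeping CR as high as the channel allows.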

Error Floor
- Iterative decoders exhibit an error floor.
- In order to avoid it we studied and verified different floor-breaker strategies that can be enabled standalone or in combination.
- An error floor is an abrupt change in the slope of the FER curve as DER/IER decreases. There is no way to compute it: it is sure to appear, but where it appears can be evaluated only by simulation.
- By using a combination of floor breakers 1+2 the slope is corrected.

How DNAe2c tuning works
[Flowchart:]
- Analyze experimental data to create a noise model
- Input the noise model into DNAssim to evaluate DNAe2c performance
- Tune the DNAe2c parameters
- Do we hit the FER target? If no, iterate again; if yes, the tuned DNAe2c is ready.

Enabling & Disabling Recovery Mechanisms
- DNAe2c is a complete set of solutions selected according to the error condition.
- If the number of erasures increases we can add ERM1 (Erasure Recovery Mechanism) or a combination of two different tricks.
- If the numbers of both IDS errors and erasures increase dramatically we can add ERM3.
[Chart: axes are #errors (IDS) and #erasures; regions are labeled DNAe2c, DNAe2c + ERM1, DNAe2c + ERM1+2 and DNAe2c + ERM3.]

HW/SW Implementation of CODEC Function
- In order to evaluate power consumption and computational effort we implemented DNAe2c on an FPGA (Xilinx Alveo U50).

DNAe2c comparison

Graph comparison of ECCs/CODECs
- In the following we will see comparison graphs under different error conditions.
- Error conditions can be determined by SuER, IER, DER and dropout rate.
- Error conditions can vary depending on the algorithms used in the pipeline (e.g. trace reconstruction).
- DNAe2c is a set composed of a proprietary ECC plus different ERMs, enabled or disabled by analyzing the set of errors in a particular environment.
- To obtain each curve, many simulations are performed by DNAssim, either in pure SW or as HW/SW co-simulations.

FER vs DER/IER
[Plot: FER versus DER/IER on a simulated error set; curve annotations DER=0.002, DER=0.027 and DER=0.006.]
- Large gain of DNAe2c against LDPC and Reed-Solomon.
- If we fix a target FER of 10^-9 (0 is not an option!), in other words if we accept losing one frame in a billion, DNAe2c tolerates a DER/IER 10x bigger than Reed-Solomon codes.
- Solid lines are pure software simulations, while dashed lines are SW/HW co-simulations.
- 10x improvement over RS codes.
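
What a FER target means for an archive can be checked with simple arithmetic: the expected number of uncorrectable frames is the FER times the number of frames stored. Assuming independent frame failures (an assumption made here, not a claim from the talk), the probability of losing at least one frame is 1 - (1 - FER)^N. The archive sizes below are hypothetical.

```python
import math

def expected_lost_frames(fer, n_frames):
    """Expected number of uncorrectable frames among n_frames."""
    return fer * n_frames

def prob_any_loss(fer, n_frames):
    """P(at least one uncorrectable frame), assuming independent frame failures."""
    # 1 - (1 - fer)**n_frames, written with log1p/expm1 to stay accurate for tiny fer
    return -math.expm1(n_frames * math.log1p(-fer))

target_fer = 1e-9
for n_frames in (10**6, 10**9, 10**12):   # hypothetical archive sizes, in frames
    print(f"{n_frames:.0e} frames: expect {expected_lost_frames(target_fer, n_frames):.3g} "
          f"lost, P(any loss) = {prob_any_loss(target_fer, n_frames):.3g}")
```

At a FER of 10^-9 an archive of a billion frames is expected to lose about one frame, which matches the "one frame in a billion" reading of the target.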

FER vs dropout rate/erasure rate
[Plot: FER versus dropout rate/erasure rate on a simulated error set]
- By fixing a dropout rate (e.g. 0.07, meaning that 7% of the strands are erased), we see a gain of around two orders of magnitude for DNAe2c compared to Reed-Solomon.

Conclusion
- Error Correction Codes can enable the use of poor media for high-performance system applications.
- DNA data storage is a different channel compared to standard storage or telecommunications channels.
- We need a way to compare different coding strategies in a DNA data storage environment.
- DNAe2c is a complete set of solutions that can be tuned based on a specific noise set. By using knowledge of the noise condition, the codec can be tuned in order to reach the target FER.
- DNAe2c has been implemented in HW and SW on a standalone accelerator card.
- Do you want to challenge DNAe2c on your particular error conditions (synthesizing machine + sequencing machine)? Reach us!