Big Data: From Theory to Systems
Wenfei Fan
Shenzhen Institute of Computing Sciences
University of Edinburgh
Beihang University

The 5 Vs of Big Data
The study has raised as many questions as it has answered.
- Volume: The size of data grows rapidly and continuously. China generated 23.9 ZB of business data in 2022; it is expected to reach 76.6 ZB in 2027.
- Velocity: "You cannot afford to make decisions based on yesterday's data." Healthcare, retail, financial services, cyber security.
- Variety: Relational database D, transaction graph G. Can we write a query across D and G in SQL?
- Veracity: The most challenging issue among the 5 Vs. Real-life data is dirty: semantic inconsistencies, duplicates, stale data, missing links.
- Value: Killer apps? What practical value can we get out of big data?
Big Data: Volume, Variety, Velocity, Veracity, Value

The challenges introduced by the digital economy
Digital currency:
- Heterogeneous queries on big data across different models
- Real-time transaction processing with consistency and reliability requirements
- Data-driven fraud detection and intelligent analysis
Challenges:
- How to query big data with limited resources? (Volume)
- How to answer queries across heterogeneous data models? (Variety)
- How to query dynamic data in response to updates? (Velocity)
- How to clean dirty data? (Veracity)
- What is the benefit of big data analytics? (Value)
Smart city:
- Fusion of data from various models (historical BIM/CIM, and newly collected data)
- Massive data from unreliable data sources
- Real-time analysis in response to updates
The need for both theory and systems for big data analytics

The challenges introduced by AIGC
- ChatGPT has led to a large number of AIGC startups.
- 73% of startups in China focus on application domains, and 14% on LLMs.
- Most LLMs are developed via fine-tuning of open-source pre-trained models.
- The next step: LLMs for specific application domains. But:
  - Where can we get high-quality data in a specific domain for LLM training?
  - How can we make LLMs accurate, fair and robust?
  - Can we interpret ML predictions after all?
To make practical use of AIGC

Shenzhen Institute of Computing Sciences
- 500+ people, 87% of whom are experienced engineers
- 3 systems and 5 products since 2019
- 95+ papers in TODS, VLDBJ, SIGMOD, VLDB, ICDE, etc.; 60% of the techniques proposed in the papers have been implemented in the systems
The systems developed at SICS:
- Rock: Data quality
- YashanDB: HTAP DBMS
- Fishing Fort: Graph analytics
- Products: MedHunter, Mirror, Dream Creak, Lemmon Grass, Dasan Pass
An end-to-end solution to big data management

Volume: The solution of YashanDB
- Theory: Bounded evaluation (BEAS)
- YashanDB: A database management system for hybrid workload

Traditional computational complexity theory of almost 60 years:
- The good: polynomial time computable (tractable, PTIME)
- The bad: NP-complete (intractable)
- The ugly: PSPACE-hard, EXPTIME-hard, undecidable
What happens when it comes to big data?
- Using an SSD of 12 GB/s, a linear scan of a 15 TB dataset takes 20 minutes.
- O(n) time is already beyond reach on big data in practice!
The good, the bad and the ugly: polynomial time queries may become "intractable" on big data!

Big data: Through the eyes of computation
- Computer science is the study of the computation of a function f(x).
- Big data: the data parameter x is large: PB or EB.
- Query answering Q(D): for a traditional database, D is in GB (10^9 B); for big data, D is in PB or even EB (10^15 B or 10^18 B).
- Fundamental challenges introduced by querying big data?

Tractability revisited for big data
- BD-tractable queries: properly contained in P unless P = NC.
- BD-tractable: parallel polylog time for online processing, after PTIME offline one-time preprocessing.
- Open, like P = NP.
[Figure: complexity classes: BD-tractable and not BD-tractable within P; NP and beyond]
W. Fan, F. Geerts, F. Neven. Making Queries Tractable on Big Data with Preprocessing. VLDB 2013.
A departure from classical theory and traditional techniques

Bounded evaluation: Make big data small
[Figure: query plans over f(p0/idx), d(pid, mm, yy, cid/idx), c(cid, city, price/FK); labels: c(price) 100, ×10^5]
- Assume 15 TB of friend and dine tables. (It is 300 PB for Meta.)
- Traditional query: 1.4 days (SCAN: 20 m per 15 TB; JOIN: 2000 m; JOIN: 1.5 PB).
- Bounded evaluation: start from PK(p0), then fetch via access constraints:
  - AC1 (5K): up to 5K friends
  - AC2 (31*5K): at most 31 days
  - AC3 (1*31*5K): one price for one id
- Bounded evaluation: Fetch [...]

Rock in action
An infrastructure for data transaction market
[Table columns: Domain, Pain Points, Rock, Feedback]
Pain points:
- Large data collection: 170K+ tables, 10M+ attributes. No data standardization across different departments.
- Rules are handcrafted by human experts: costly, error-prone, fragile. Not real-time.
- A lot of duplicates. Not scalable. Missing data (null).
Feedback:
- [...] =85%, far better than ML models.
- Performance: accuracy 97%, up from 81%. Manual effort reduced by 8X.
- Performance: found 450K duplicates in 4.5M entities. Accuracy 95.4%; 100X faster than ML models.

Rock4ML: Data cleaning for ML
- Input: A dataset D, and an ML model M (possibly an LLM).
- Output: A cleaned dataset Dc to maximize Accuracy(M, Dc), Fairness(M, Dc) and Robustness(M, Dc).
[Figure: ML pipeline: real-life data (e.g., relational tables, texts, images) → feature engineering → training data → ML model → inference on unseen data → predictions. Data may be dirty! Rock4ML: feedback, more accurate data]
New challenges:
- How to clean document data, the typical training data for LLMs?
- How to impute missing labels and correct mislabelled data?
- How to make black-box ML models more accurate? E.g., ER may do more harm than good.
- How to enrich data for ML, e.g., adding adversarial examples to prevent adversarial attacks?
Make ML models more accurate, fair, robust and practical

Value: Getting values out of big data
- Method: Machine learning or logic rules?
- Fishing Fort: A model of ML + logic deduction

Graph association rules (GARs)
- Predicates:
  - link predicates l(x, y);
  - logic predicates: x.id = y.id, x.A ⊕ y.B, x.A ⊕ c, including temporal orders;
  - ML predicates M(x, y): ER, similarity, link prediction.
- Q: graph pattern; x̄: a list of vertices in Q.
- X, Y: conjunctions of predicates.
- X → Y: dependency.
- Interpret ML predictions in terms of Q(X → M(x, y)). Possible for GNN-based models: FO2 with limited counting, for vertex classification and link prediction.
Fishing Fort: Big graph analytics
Unifying logic deduction and machine learning

MedHunter: Drug repurposing for Parkinson's disease
Pattern Q and conditions X: drug x0 may work for Parkinson's disease x1
because:
- x0 has known impact on an inborn genetic blood disease x2;
- x0 has known effect on skin cancer x6, which shares an effect pathway with x1;
- x0 interacts with gene x3, which shares an effect pathway x4 with x1;
- x0 interacts with gene x5, which has a predicted relationship with x1.
CTD (Comparative Toxicogenomics Database): identified 5 drugs for Parkinson's disease: 4 with published evidence, 1 under active lab investigation.
A new drug is costly: 10 years, $1 billion, success rate [...] 5h for batch processing

Summing up
- Volume: From the bounded evaluation theory to YashanDB
- Variety: From HER to SQL across relations and graphs in YashanDB
- Veracity: Rule learning, error detection and error correction with certainty
- Value: ML + logic to benefit from both, and interpret ML predictions
- Velocity: Algorithm incrementalization in Fishing Fort and Rock, to cope with dynamic data
More challenges and opportunities:
- Rock4ML: Data cleaning for ML
  - Improve downstream ML models for accuracy, fairness and robustness
  - Prepare data for LLM training
- YashanDB: Semantic joins across relations and data of other models
- Fishing Fort: Application domains
  - Quantitative trading, manufacturing industry
Big data: Theory and Practice
Invitation: Join forces to tackle the challenges together

Veni, vidi, vici (I came, I saw, I conquered)
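
The linear-scan estimate on "The good, the bad and the ugly" slide (12 GB/s SSD, 15 TB dataset, about 20 minutes) can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch: it assumes decimal units (1 TB = 10^12 B) and a sustained sequential bandwidth equal to the quoted figure; the function name `scan_minutes` and the 1 PB example are illustrative and do not come from the slides.

```python
# Back-of-the-envelope scan time, assuming decimal units and that the
# quoted bandwidth is sustained for the whole sequential scan.
GB = 10**9
TB = 10**12
PB = 10**15


def scan_minutes(dataset_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Minutes needed to read `dataset_bytes` once at the given bandwidth."""
    return dataset_bytes / bandwidth_bytes_per_s / 60


if __name__ == "__main__":
    # The slide's setting: 15 TB on a 12 GB/s SSD (roughly 20 minutes).
    print(f"15 TB: {scan_minutes(15 * TB, 12 * GB):.1f} minutes")
    # Illustrative extrapolation (not from the slides): one pass over 1 PB.
    print(f"1 PB: {scan_minutes(1 * PB, 12 * GB) / 60:.1f} hours")
```

Under these assumptions the scan time grows linearly with the dataset size, which is the point the slide makes about O(n) time on big data.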
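
The idea on the "Bounded evaluation: Make big data small" slide, fetching a bounded number of tuples through indexed access constraints instead of scanning, can be sketched on toy data. This is not the BEAS or YashanDB implementation. The query shape (cafes in a given city, under a price cap, where friends of p0 dined in a given month) is an assumption inferred from the figure labels, and the relations, the helper names `build_index`, `bounded_eval` and `scan_eval`, and all data values are hypothetical.

```python
from collections import defaultdict

# Hypothetical toy relations, loosely following the slide's schema.
friend = [("p0", "a"), ("p0", "b"), ("p1", "c")]            # (pid, fid)
dine = [("a", 5, 2015, "c1"), ("b", 5, 2015, "c2"),
        ("b", 6, 2015, "c3"), ("c", 5, 2015, "c1")]         # (pid, mm, yy, cid)
cafe = [("c1", "NYC", 80), ("c2", "NYC", 120), ("c3", "LA", 50)]  # (cid, city, price)


def build_index(rows, key_cols):
    """Index rows by the given columns; stands in for an access-constraint index."""
    index = defaultdict(list)
    for row in rows:
        index[tuple(row[i] for i in key_cols)].append(row)
    return index


# Assumed reading of the slide's access constraints:
friend_idx = build_index(friend, [0])      # AC1: pid -> at most 5K friends
dine_idx = build_index(dine, [0, 1, 2])    # AC2: (pid, mm, yy) -> at most 31 tuples
cafe_idx = build_index(cafe, [0])          # AC3: cid -> exactly one (city, price)

# Worst-case number of tuples fetched, matching the slide's labels
# AC1(5k), AC2(31*5k), AC3(1*31*5k); independent of the table sizes.
WORST_CASE_FETCH = 5_000 + 31 * 5_000 + 1 * 31 * 5_000


def bounded_eval(p0, mm, yy, city, max_price):
    """Answer the query by index lookups only; returns (answers, tuples fetched)."""
    fetched = 0
    answers = set()
    for _, fid in friend_idx.get((p0,), []):
        fetched += 1
        for _, _, _, cid in dine_idx.get((fid, mm, yy), []):
            fetched += 1
            for _, c_city, price in cafe_idx.get((cid,), []):
                fetched += 1
                if c_city == city and price <= max_price:
                    answers.add(cid)
    return answers, fetched


def scan_eval(p0, mm, yy, city, max_price):
    """Baseline that reads every tuple of all three relations once."""
    friends = {fid for pid, fid in friend if pid == p0}
    cids = {cid for pid, m, y, cid in dine
            if pid in friends and (m, y) == (mm, yy)}
    answers = {cid for cid, c_city, price in cafe
               if cid in cids and c_city == city and price <= max_price}
    return answers, len(friend) + len(dine) + len(cafe)


if __name__ == "__main__":
    print(bounded_eval("p0", 5, 2015, "NYC", 100))
    print(scan_eval("p0", 5, 2015, "NYC", 100))
    print("worst-case tuples fetched:", WORST_CASE_FETCH)
```

Both functions return the same answers; the difference is that the number of tuples touched by `bounded_eval` depends only on the constraint bounds, while the cost of `scan_eval` grows with the size of the tables.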