InsightPilot: Towards LLM-Empowered Automated Data Exploration
Rui Ding (丁锐), Microsoft

Speaker
Rui Ding, Principal Researcher, Microsoft
Rui Ding is a Principal Researcher in Microsoft's Data, Knowledge, and Intelligence (DKI) group. He has long worked on insights in data analysis, which are essential for understanding data and making effective decisions in business and daily life. His research focuses on two themes. The first is how to turn the concept of an insight into a computable data entity, which is the foundational problem of insight discovery (detection and mining). The second is the explainability of data analysis and the important role causality plays in it, which is key to making insights explainable, reliable, and generalizable. His work has been published mainly at conferences such as SIGMOD and SIGKDD. In addition, his insight-related research has been turned into data analysis features in Microsoft products, including Power BI QuickInsights, Excel Analyze Data, and Forms Insights.

Contents
1. Insight-Based Exploratory Data Analysis
2. InsightPilot: LLM-Empowered Automated Data Analysis Paradigm

PART 01
Insight-Based Exploratory Data Analysis

Outline
- The concept and formulation of insight
- The analysis space established from insight
- MetaInsight: Enriching the intension of insight
- XInsight: Expanding the extension of insight

What is Exploratory Data Analysis
Exploratory Data Analysis (EDA) is a process of analyzing data to summarize its main characteristics, for the purpose of
- Gaining knowledge from data
- Facilitating further in-depth data analysis

Importance of Interesting Pattern Discovery
- Increasingly popular in the era of big data
- Value of discovering interesting data patterns
  - Data understanding
  - Knowledge discovery
  - Further drill-down data analysis
- Common practice
  - Exploratory data analysis
  - Visual/interactive data analysis

Challenges of Interesting Pattern Discovery
- Existing work
  - Mainly focuses on individual types of interesting patterns
  - Lacks a unified formulation of "interesting patterns"
  - Lacks general mining frameworks
- The user's target is often broad or vague
  - Insights are hidden amongst subspaces across different semantic levels
    - E.g., Ford vs. Ford SUV vs. Ford SUV China
  - Insights are hidden amongst different measure columns
    - E.g., price, volume, revenue

Key problems to be solved
- Insight Definition: What are the general abstraction and tangible form of interesting patterns?
- Insight Scoring: What are the factors and criteria for quantifying insight?
- Insight Mining: How to discover desirable insights efficiently?

Intuition of insight concept (I)
An insight typically reflects the interestingness of a subject, or of a relationship among a set of subjects, from a certain perspective.
[Figure: sales by year for Ford, Honda, and Toyota]
Example: Ford's sales are increasing steadily over the years
- Subject(s): Sales of Ford, grouped by Year
- Perspective: trend
- Interestingness: increasing steadily

Intuition of insight concept (II)
- Analysis entity: abstraction of a subject or of a relationship among subjects
  - Specifies the content of interest for data analysis purposes
    - E.g., Brand=Ford, Measure=Sales, Group-by=Year
  - Corresponds to a specific raw data distribution
    - E.g., (2011, 15,000), (2012, 21,000), (2013, 27,000), (2014, 32,000), (2015, 38,000)
- Analysis semantic: facilitating analysis needs
  - Captures essential characteristics of raw data distributions
    - E.g., a trend appears in the time series
  - In the form of a symbolic representation
    - E.g., Increasing=True, Steadiness=True

Mapping to Multi-Dimensional Data Model
- Analysis entity: a 3-tuple (subspace, breakdown, measure)
  - Indicates a sibling group with corresponding aggregate values on the measure
  - Corresponds to a data cube, with the raw data distribution
- Analysis semantic
  - Perspective: materialized as different insight types
  - Essential characteristics: captured by insight properties
  - Interestingness: Evaluate [...] = true
Note: AE is short for Analysis Entity

Insight definition
In the multi-dimensional data model, an insight is a tuple of [...], where [...], with a non-trivial (significant) evaluation result.
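The definitions above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the QuickInsights implementation: the class name `AnalysisEntity`, the helpers `materialize` and `evaluate_trend`, and the "strictly increasing" test are all hypothetical stand-ins. The Ford figures are the ones quoted on the slide; the Honda row is made up only to show that the subspace filter has an effect.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class AnalysisEntity:
    """Hypothetical AE: (subspace, breakdown, measure)."""
    subspace: tuple   # filter as (column, value) pairs, e.g. (("Brand", "Ford"),)
    breakdown: str    # group-by dimension
    measure: str      # column that is aggregated


def materialize(rows, ae):
    """Filter by subspace, group by breakdown, sum the measure."""
    totals = defaultdict(float)
    for row in rows:
        if all(row.get(col) == val for col, val in ae.subspace):
            totals[row[ae.breakdown]] += row[ae.measure]
    return sorted(totals.items())


def evaluate_trend(distribution):
    """Toy 'trend' perspective: is the series strictly increasing?"""
    values = [value for _, value in distribution]
    increasing = len(values) >= 2 and all(a < b for a, b in zip(values, values[1:]))
    return {"Increasing": increasing}


ford_sales = [(2011, 15000), (2012, 21000), (2013, 27000), (2014, 32000), (2015, 38000)]
rows = [{"Brand": "Ford", "Year": y, "Sales": s} for y, s in ford_sales]
rows.append({"Brand": "Honda", "Year": 2011, "Sales": 9000})  # made-up row

ford_by_year = AnalysisEntity(subspace=(("Brand", "Ford"),), breakdown="Year", measure="Sales")
distribution = materialize(rows, ford_by_year)
properties = evaluate_trend(distribution)
print(distribution)
print(properties)
```

Keeping the AE separate from the evaluation mirrors the split between analysis entity and analysis semantic: the same distribution could be passed to other perspectives (outlier, dominance, and so on) without changing how it is materialized.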
The pivoting role of AE (Analysis Entity)
- It reflects the typical query operations in OLAP
  - Filtering / group-by / aggregation
- It acts as a bridge between data manipulation operations and insights
  - An insight is evaluated from an AE against a specific perspective
- It has a natural mapping to visual charts
  - X-axis values: values of the breakdown dimension
  - Y-axis values: aggregation values from the measure
  - Filter: the subspace

Example of insights

System: QuickInsight*
[Diagram labels: Dataset, Metadata, Data Driver, QuickInsights Miner, AS, SQL DB; insight types: Trend, Outliers, Change Points, Dominance, Correlation, Seasonality]
* Released into Microsoft Power BI (2015) and Excel Analyze Data (2019)

Establishing insight-based analysis space
[Diagram labels: Basic Insight, Operation enrichment, Semantics expansion, XInsight, MetaInsight, Analysis space, Scenario 1, Scenario 2, Scenario [...]]

MetaInsight: composition from basic insights
- Observation: there exist semantically meaningful operations over AE
[Diagram labels: AE (subspace, breakdown, measure); Siblings, Children, Parents; Inclusion/exclusion, Different granularity, Inclusion/exclusion; Homogeneous Operations]

A typical EDA iteration
House Sales in California. A real estate agent: Bob

  City          House Style   Month   Sales
  Los Angeles   2Story        Jan     208,500
  Los Angeles   1.5Fin        Dec     163,200
  Yuba          1.5Unf        Dec     118,000

Bob: what about the monthly sales in LA?
- Construct query:
    SELECT Month, SUM(Sales) FROM DATASET WHERE City = 'Los Angeles' GROUP BY Month
- Query result:

  City          Month   Sales
  Los Angeles   Jan     288,231,200
                Dec     269,562,200

- Visualization: [Chart: Sales (million $) by month, Jan to Dec] For Los Angeles, Apr has minimum Sales

Bob: what about the monthly sales in other cities?
[Charts: Sales (million $) by month, Jan to Dec, one per city]
- For San Francisco, Apr has minimum Sales
- For Amador, Apr has minimum Sales
- For Alameda, Apr has minimum Sales
- For San Diego, July has minimum Sales

Bob: did ALL cities have bad sales in April, or are there any exceptions?
[Charts repeated for San Francisco, Amador, and Alameda, each with minimum Sales in Apr]

MetaInsight: For most cities in California, Apr has minimum Sales; except San Diego
[Chart: Sales (million $) for San Diego, Alameda, Amador, San Francisco, Los Angeles]

Sensemaking mechanisms of human EDA iteration
- Mechanism #1 [1] knowledge extraction: essential characteristics of the raw data distribution
  e.g., April has BAD Sales.
- Mechanism #2 [2] inductive hypothesis: the generality of characteristics of a basic data pattern
  e.g., did ALL cities have bad sales in April?
- Mechanism #3 [2] validity inquiry: existence of unusual cases and how they differ from the general knowledge
  e.g., is there any city that does NOT have bad sales in April?
[1] Ding, Rui, et al. QuickInsights: Quick and automatic discovery of insights from multi-dimensional data. SIGMOD. 2019.
[2] Zhang, Pengyi, and Dagobert Soergel. Towards a comprehensive model of the cognitive process and mechanisms of individual sensemaking. Journal of the Association for Information Science and Technology 65.9 (2014): 1733-1756.

MetaInsight Overview
- MetaInsight is a structured representation of knowledge extracted from multi-dimensional data.

How to constitute a MetaInsight?
How to generate a data pattern?
How to extend a (basic) insight?
How to organize basic insights?

Organize Rules
- Rule #1: Insights with the same type and highlight are grouped to form a commonness if their ratio exceeds a threshold
- Rule #2: Remaining insights are marked as exceptions
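The two organize rules can be sketched as follows, using the per-city results from Bob's example. This is a rough illustration, not the MetaInsight algorithm itself: the function name `organize`, the tuple layout `(scope, insight_type, highlight)`, the type label "minimum", and the threshold value 0.5 are all assumptions made for the example.

```python
from collections import defaultdict


def organize(basic_insights, ratio_threshold=0.5):
    """Group basic insights into commonness sets and exceptions.

    basic_insights: list of (scope, insight_type, highlight) tuples.
    ratio_threshold: assumed value; the slide does not give one.
    """
    groups = defaultdict(list)
    for scope, insight_type, highlight in basic_insights:
        groups[(insight_type, highlight)].append(scope)

    total = len(basic_insights)
    commonness, exceptions = {}, []
    for key, scopes in groups.items():
        if len(scopes) / total > ratio_threshold:
            # Rule #1: same type and highlight, ratio above the threshold.
            commonness[key] = scopes
        else:
            # Rule #2: everything else is an exception.
            exceptions.extend((scope, *key) for scope in scopes)
    return commonness, exceptions


basic_insights = [
    ("Los Angeles", "minimum", "Apr"),
    ("San Francisco", "minimum", "Apr"),
    ("Amador", "minimum", "Apr"),
    ("Alameda", "minimum", "Apr"),
    ("San Diego", "minimum", "July"),
]
commonness, exceptions = organize(basic_insights)
print(commonness)
print(exceptions)
```

With these inputs, four of five cities share the highlight Apr (ratio 0.8), so they form the commonness, and San Diego is left as the exception, which matches the MetaInsight stated on the slide.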
How to score MetaInsight?

XInsight: eXplainable Data Analysis

The practical needs of eXplainable Data Analysis (XDA)
- What users want from EDA: to justify and rely on knowledge and conclusions
- XDA is proposed to deliberate on data facts and enhance user comprehension.
- XDA advances data analysis by providing users with effective explanations.
- By suggesting and justifying choices to alter outcomes, XDA helps users comprehend and trust phenomena emerging from data; as a result, it facilitates real-world decision making.

What Does the User Ask in Daily Data Analysis Tasks?
What Does the User Expect for an Answer?
Problematic Explanation Can Backfire
Human Cognition

Formulation of XInsight
- Why-Query: composition of two AEs, AE_1 = [...] and AE_2 = [...], where the two are within a sibling group
- [...]

An example of XInsight for XDA

Recap: Establishing insight-based analysis space
[Diagram labels: Basic Insight, Operation enrichment, Semantics expansion, XInsight, MetaInsight, Analysis space, Scenario 1, Scenario 2, Scenario [...]]

PART 02
InsightPilot: LLM-Empowered Automated Data Analysis Paradigm

Today's data analysis is human-centric
[Diagram labels: Analysis Engines, Analysis needs, Domain knowledge, Context/semantics, Analysis inquiry]

Automated data analysis: the holy grail
- With human-centric data analysis, automated data analysis can hardly happen
- An LLM has the potential to play the data analyst's role, so automated data analysis could happen!

The potential of LLM
An LLM can act at least as a modest data analyst to "drive" the analysis, because of
- Broad domain knowledge
- The ability to understand data context/semantics
- Fast response
A case study: ask an LLM to conduct data analysis on two tasks
- House price data, and weasel vs. stoat data

Our goal: unleash the power of LLM but also harness it
- LLM's risks: Unsoundness / Explainability / Hallucination
- An LLM could play the role of a data analyst, but cannot replace the role of analysis engines
- InsightPilot: let the LLM and the insight engine work synergistically!

Revisit the cognitive process of data analysis flow [1]
[1] Brehmer, M., & Munzner, T. (2013). A multi-level typology of abstract visualization tasks. IEEE Transactions on Visualization and Computer Graphics, 19(12), 2376-2385.

Framework of InsightPilot
[Diagram labels: Intent, Context, Analysis; legend: provided by Insight, provided by LLM/Human]

Maintenance of global intents and contexts
- Analysis to intent: use the current analysis result to detect the next analysis intent
- Intent to analysis: use the current intent to pick a suitable analysis engine to conduct the next-step analysis
- Context to analysis: use the current context to feed suitable parameters to the analysis engine

Analysis intent selection
- Problem: given the current analysis result (insight) and the analysis history (context), what is the next analysis intent?
- Domain of analysis intent
  - Understand, Summarize, Explain, Compare
  - Extensible to support broader scenarios
- Solution: [...]
  - E.g., given a dataset about the maths scores of each student in a school (metadata), an insight shows that ClassA has the highest average maths score (insight), and the maths scores have been explored by different classes (context). The LLM then suggests Intent = "Explain" as the next step, that is, to explain why ClassA has the highest score.

Extracting parameters to trigger insight engine
- Problem: given an analysis intent and the previous insight, how to parse suitable parameters to trigger the insight engine?
- Key observations
  - The parameters are structured due to the formulation of insight
  - The parameters are highly associated with the "property" tuple of the previous insight
  - It is easy to provide few-shot examples to the LLM
- Solution: [...]
  - E.g., the intent is "Explain", and the [...] = [...], AVG [...], [...] (ClassA has the highest average maths score among all classes); the suitable parameters can be
    - Module = XInsight
    - AE_1 = [...], AE_2 = [...], or AE_2 = [...]
    - Other: the other classes; ClassB: the class with the second-highest maths score

Summary of prompt engineering
- Initialization: data, metadata, initial insights
- Analysis intent selection: insight, context, metadata
- Parameter extraction: insight, intent, metadata
- Insight selection: a set of insights, context, metadata
- Report generation: a sequence of explored insights, metadata
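One way to read the prompt summary is as a table of which inputs each stage needs. The sketch below encodes that table and walks through the intent-selection step from the ClassA example. It is only an illustration under assumptions: the stage keys, `build_prompt`, `select_intent`, and the prompt wording are hypothetical, and `fake_llm` is a stand-in that always answers "Explain" rather than a call to any real model API.

```python
# Inputs per prompt stage, taken from the summary list; key names are assumed.
PROMPT_INPUTS = {
    "initialization": ("data", "metadata", "initial_insights"),
    "intent_selection": ("insight", "context", "metadata"),
    "parameter_extraction": ("insight", "intent", "metadata"),
    "insight_selection": ("insights", "context", "metadata"),
    "report_generation": ("explored_insights", "metadata"),
}
INTENTS = ("Understand", "Summarize", "Explain", "Compare")


def build_prompt(stage, **inputs):
    """Assemble a plain-text prompt, refusing to run with missing inputs."""
    required = PROMPT_INPUTS[stage]
    missing = [name for name in required if name not in inputs]
    if missing:
        raise ValueError(f"{stage} is missing inputs: {missing}")
    lines = [f"Stage: {stage}"]
    lines += [f"{name}: {inputs[name]}" for name in required]
    return "\n".join(lines)


def select_intent(llm, insight, context, metadata):
    """Ask the LLM for the next intent and check it against the known domain."""
    prompt = build_prompt("intent_selection", insight=insight,
                          context=context, metadata=metadata)
    prompt += "\nChoose one intent from: " + ", ".join(INTENTS)
    answer = llm(prompt).strip()
    if answer not in INTENTS:
        raise ValueError(f"unexpected intent: {answer!r}")
    return answer


def fake_llm(prompt):
    """Stand-in for a real model call; always answers 'Explain'."""
    return "Explain"


intent = select_intent(
    fake_llm,
    insight="ClassA has the highest average maths score",
    context="maths scores have been explored by different classes",
    metadata="maths scores of each student in a school",
)
print(intent)
```

Checking the answer against the fixed intent domain is one simple way to "harness" the LLM: a reply outside the domain is rejected before anything is passed on to the insight engine.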

Current implementation

Typical tasks between LLM and Insight Engine
- Intent selection: select the high-level intent of the next analysis step, based on the context and semantics of the insights seen so far
- Parameter extraction: extract proper parameters that are feasible to feed into the Insight Engine
- Insight selection: select the most suitable insight with which the analysis proceeds

A case study

Summary
- Insight-Based Exploratory Data Analysis
  - The concept and formulation of insight
  - The analysis space established from insight
  - Case studies: MetaInsight, XInsight
- InsightPilot: LLM-Empowered Automated Data Analysis Paradigm
  - The opportunities of unleashing the power of LLM
  - Synergistic integration of LLM with InsightEngine
  - Towards automated data analysis

Thank you for listening