5-1 Key Techniques for Fine-Grained Entity Recognition in TexSmart, Tencent's Text Understanding System


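蒋海云 (Haiyun Jiang), Senior Researcher, Tencent AI Lab
Key Techniques for Fine-Grained Entity Recognition in TexSmart, Tencent's Text Understanding System

Contents
1 TexSmart System Overview
2 Overview of Fine-Grained NER in TexSmart
3 Combined Methods Based on a Knowledge Base
4 Distant Supervision Based on Similar-Entity Inference
5 Zero-Shot Methods Based on Multi-Source Fusion

01 TexSmart System Overview

TexSmart is a toolkit and service for natural language understanding. It performs lexical, syntactic, and semantic analysis of Chinese and English text.
https:/

Two major challenges for fine-grained NER
- Scalability: from a dozen or so types to more than 1,000 types. For example, "person name" is refined into actor, singer, athlete, TV host, writer, and so on. So many types require a large amount of annotated training data, and annotating fine-grained training data is too expensive.
- Ambiguity: in “苹果 CEO 正在喝 苹果 汁” (the Apple CEO is drinking apple juice), is 苹果 a company or a fruit? In “李娜网球” (Li Na, tennis) vs. “李娜唱功” (Li Na, singing skill), is 李娜 an athlete or an actor?

Feature 1: fine-grained NER

Feature 2: semantic expansion (语义联想)
For a given entity in a sentence, predict a set of related entities. Example: 流浪地球 (The Wandering Earth) → 战狼二 (Wolf Warrior 2), 上海堡垒 (Shanghai Fortress), 悲伤逆流成河 (Cry Me a Sad River).

Feature 3: a multi-dimensional design philosophy, and how it is realized
1. The conflict between model accuracy and speed: multiple models and algorithms are implemented, some high-accuracy and some high-speed.
2. From closed to open test environments: models (word segmentation, fine-grained NER) are trained with unlabeled data.
3. Dynamic model updates: unlabeled data is collected incrementally and the models are updated periodically.

TexSmart basic functions
- Function type 1, text understanding: word segmentation, part-of-speech tagging, named entity recognition (NER), semantic expansion, syntactic parsing, semantic role labeling, text classification, keyword extraction.
- Function type 2, text matching: semantic similarity.
- Function type 3, text graph: similar words, synonyms, antonyms, hypernyms, hyponyms.

TexSmart Demo (screenshots): word segmentation and tagging, named entity recognition, text classification, syntactic parsing, semantic role labeling, text matching, text graph.

02 Overview of Fine-Grained NER in TexSmart

NER granularity requirements: coarse-grained, fine-grained, ultra-fine-grained.

Applications of ultra-fine-grained NER: understanding text better and supporting downstream tasks:
(1) understanding tasks such as relation extraction, knowledge base construction, and question answering;
(2) generation tasks such as text rewriting, dialogue generation, and question generation.

Examples of ultra-fine-grained NER types (figures): some of the ultra-fine-grained entity types under "person"; the ultra-fine-grained entity types under "location".
TexSmart ultra-fine-grained type hierarchy: https:/

03 Combined Methods Based on a Knowledge Base

Two approaches:
- the unsupervised method
- the combination of unsupervised and supervised methods

Fine-grained NER: unsupervised algorithm
Pipeline (figure): unstructured text data → extraction → is-a data → mapping → typed pairs → construction → term-to-type graph.
- Extracted is-a data: (苹果, 公司), (西瓜, 水果), (苹果, 水果), (微软, 公司)
- After mapping to types: (苹果, org.company), (苹果, food.fruit), (西瓜, food.fruit), (微软, org.company)
- Example: in “西瓜 很甜” (the watermelon is sweet), 西瓜 is typed food.fruit.
References: TexSmart: A Text Understanding System for Fine-Grained NER and Enhanced Semantic Analysis, arXiv preprint; TexSmart: A System for Enhanced Natural Language Understanding, ACL 2021.

Fine-grained NER: unsupervised algorithm, is-a data extraction
a) Write is-a templates by hand, e.g. “X1、X2等Y” (X1, X2 and other Y) and “Y诸如X1、X2等” (Y such as X1, X2, etc.).
b) Match the templates against a large amount of unstructured text. “苹果、西瓜等水果” yields (苹果, 水果) and (西瓜, 水果); “上市公司诸如苹果和微软” yields (苹果, 公司) and (微软, 公司).

Fine-grained NER: unsupervised algorithm, the ambiguity problem
The term-to-type graph maps 苹果 to both food.fruit and org.company. In “苹果 汁” (apple juice), should 苹果 be typed food.fruit or org.company?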
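The template matching and graph construction described above can be sketched as follows. This is a minimal illustration and not TexSmart's implementation: the function names, the `CATEGORY_TO_TYPE` table, and the suffix-based category normalization (so that 上市公司 maps to 公司) are assumptions made for the example, and only the two templates named on the slide are handled.

```python
import re
from collections import defaultdict

# Hypothetical mapping from Chinese category words to type names.
CATEGORY_TO_TYPE = {"水果": "food.fruit", "公司": "org.company"}

# Entities inside a template are separated by "、" or "和".
SPLIT_RE = re.compile(r"[、和]")


def normalize_category(text):
    """Map a category phrase such as 上市公司 to a known category word."""
    for word in CATEGORY_TO_TYPE:
        if text.endswith(word):
            return word
    return None


def extract_isa_pairs(sentence):
    """Return (entity, category) pairs matched by either is-a template."""
    if "诸如" in sentence:
        # Template "Y诸如X1、X2等"
        category_text, _, entity_text = sentence.partition("诸如")
        entity_text = entity_text.rstrip("等")
    elif "等" in sentence:
        # Template "X1、X2等Y"
        entity_text, _, category_text = sentence.rpartition("等")
    else:
        return []
    category = normalize_category(category_text)
    if category is None:
        return []
    return [(e, category) for e in SPLIT_RE.split(entity_text) if e]


def build_term_to_type(sentences):
    """Build a term-to-type graph as a dict: term -> set of type names."""
    graph = defaultdict(set)
    for sentence in sentences:
        for entity, category in extract_isa_pairs(sentence):
            graph[entity].add(CATEGORY_TO_TYPE[category])
    return graph


graph = build_term_to_type(["苹果、西瓜等水果", "上市公司诸如苹果和微软"])
for term in sorted(graph):
    print(term, sorted(graph[term]))
```

In this toy graph 苹果 ends up with two types, which is exactly the ambiguity the slide raises for “苹果 汁”.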

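Fine-grained NER: unsupervised algorithm, resolving the ambiguity
Offline: cluster terms using word vectors learned from a corpus together with the term-to-type graph, producing (entity set, type) clusters, for example:
C1: (苹果, 西瓜, food.fruit)
C2: (苹果, 微软, 谷歌, org.company)
C3: (C++, Java, Python, language.programming)
Online: for “苹果汁”, retrieve the candidate clusters C1 and C2, score them, and output the best type, here food.fruit.

Fine-grained NER: drawbacks of the unsupervised algorithm
Drawback 1: entities that are not in the dictionary cannot be recognized.
Drawback 2: a trade-off between dictionary size and coverage.
- Large dictionary: high entity coverage, but high memory consumption.
- Small dictionary: low memory consumption, but low-frequency entities cannot be recognized.

Fine-grained NER: combined algorithm
Basic idea:
- the unsupervised method predicts a distribution over fine-grained types;
- the supervised method predicts a distribution over coarse-grained types;
- joint inference selects the best fine-grained type.
Example (figure) for “苹果 CEO” (Apple CEO), with panels 无监督 (unsupervised), 有监督 (supervised), and 联合模型 (joint model): 水果 (fruit) 0.5, 机构名 (organization) 0.6, 公司 (company) 0.18 = 0.6 × 0.3.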

Joint inference formula (from the paper excerpt shown on the slide):

$$\arg\max_{t} \; P_U(t \mid x)\, P_S(m(t) \mid x) \qquad (1)$$

Here $P_U$ is the unsupervised model's distribution over fine-grained types, $P_S$ is the supervised model's distribution over coarse-grained types, and $m(t)$ maps a fine-grained type $t$ to its coarse-grained type.

Fine-grained NER: combined algorithm, example
Input: “上个月30号,南昌王青松在自己家里边看流浪地球边吃煲仔饭。” (On the 30th of last month, Wang Qingsong from Nanchang watched The Wandering Earth at home while eating claypot rice.)
- Unsupervised model: (上个月30号, time.generic), (南昌, loc.city), (煲仔饭, food.generic), (流浪地球, work.movie)
- Supervised model, trained on manually annotated data (12 types): (上个月30号, time.generic), (南昌, loc.generic), (王青松, person.generic), (煲仔饭, other), (流浪地球, work.generic)
- Joint inference: (上个月30号, time.generic), (南昌, loc.city), (王青松, person.generic), (煲仔饭, food.generic), (流浪地球, work.movie)
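Equation (1) can be sketched in a few lines. This is an illustration under stated assumptions, not the TexSmart code: the `FINE_TO_COARSE` table stands in for $m(t)$, and the probabilities are invented for the example, chosen so that the winning score reproduces the 0.18 = 0.6 × 0.3 shown on the “苹果 CEO” slide.

```python
# Hypothetical mapping m(t) from fine-grained to coarse-grained types.
FINE_TO_COARSE = {
    "org.company": "org",
    "food.fruit": "food",
}


def joint_inference(p_unsup, p_sup, fine_to_coarse=FINE_TO_COARSE):
    """Return (best_type, score) maximizing P_U(t|x) * P_S(m(t)|x)."""
    scores = {
        fine_type: p_u * p_sup.get(fine_to_coarse[fine_type], 0.0)
        for fine_type, p_u in p_unsup.items()
    }
    best_type = max(scores, key=scores.get)
    return best_type, scores[best_type]


# Assumed distributions for the mention 苹果 in "苹果 CEO".
p_unsup = {"food.fruit": 0.5, "org.company": 0.3}  # fine-grained, unsupervised
p_sup = {"org": 0.6, "food": 0.1}                  # coarse-grained, supervised

best_type, score = joint_inference(p_unsup, p_sup)
print(best_type, round(score, 2))
```

Although the unsupervised model alone prefers food.fruit, the coarse-grained evidence from the supervised model tips the product towards org.company.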

Fine-grained NER: experimental evaluation
Chinese dataset:
- unlabeled training data: about 400G
- coarse-grained annotated data: 29K sentences
Methods:
- Base: the unsupervised method
- Hybrid: the combination of the unsupervised and supervised methods

Methods   Precision   Recall   F-score
Base      56.26       55.02    55.68
Hybrid    72.80       58.88    65.10

04 Distant Supervision Based on Similar-Entity Inference

Motivation
It is challenging to learn effective representations for contextualized mentions in fine-grained entity typing (FGET), since the representations must distinguish fine-grained types whose semantics are similar but different.
Existing SOTA models perform poorly on a certain number of "hard" mentions, which limits overall performance.
First, the structure of some contexts surrounding the hard mentions is inherently too complex.
Second, the contexts of some hard mentions are ambiguous, so learning from their contexts alone is insufficient to handle them.
Source: Learning from Sibling Mentions with Scalable Graph Inference in Fine-Grained Entity Typing (Chen et al., ACL 2022)

Similar entities: sibling mentions
Sibling mentions are mentions that potentially share the same or semantically similar types (e.g., country and nation) with the target mention. To detect them, we propose two similarity metrics, on which we base an effective sibling selection algorithm.

Heterogeneous graph model
Two kinds of nodes: mentions and types.
Three kinds of edges:
- the sibling relationship between mentions
- the hierarchical relationship between types
- the isLabel relationship between mentions and types
Figure 2: Illustration of the graph model.

Method overview
First, a mention-type graph is constructed from the training samples. Then, the features of mentions and types are learned by an attentive graph neural module on top of this graph. At inference time, test mentions are added to the graph by connecting them to their sibling mentions in the training set.
Figure 1: Illustration of the proposed approach.

Sibling mention detection algorithm
Word distribution-based metric: based on the assumption that mentions sharing more contextual words tend to have more similar ground-truth types. We use TF-IDF to encode mentions as sparse feature vectors. The sibling similarity between any two mentions is then the cosine similarity of their vectors.
Typing distribution-based metric: we first derive prior score distributions over the type set Y for all mentions in the dataset from an extra base model (Lin and Ji, 2019) trained on the same dataset. Sibling mentions are then selected by the cosine similarity between their score distributions and that of the target mention.

Self-attentive graph neural network
We employ graph neural networks (GNNs) with multiple layers to aggregate the information of sibling mentions and types for learning mention representations. The slide gives three formulas, which are lost in the extracted text: the update of the type embedding, the update of the mention embedding, and the type prediction.
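The word distribution-based metric described above (TF-IDF vectors compared by cosine similarity) can be sketched as follows. This is a toy illustration rather than the authors' code: the smoothed IDF variant, the function names, the top-k selection rule, and the example contexts are all assumptions.

```python
import math
from collections import Counter


def tfidf_vectors(contexts):
    """Encode each tokenized context as a sparse TF-IDF dict (smoothed IDF)."""
    n_docs = len(contexts)
    doc_freq = Counter(word for ctx in contexts for word in set(ctx))
    vectors = []
    for ctx in contexts:
        term_freq = Counter(ctx)
        vectors.append({
            word: (count / len(ctx)) * (math.log((1 + n_docs) / (1 + doc_freq[word])) + 1)
            for word, count in term_freq.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def select_siblings(target_idx, vectors, k=2):
    """Return indices of the k mentions most similar to the target mention."""
    scored = [
        (cosine(vectors[target_idx], vec), idx)
        for idx, vec in enumerate(vectors)
        if idx != target_idx
    ]
    scored.sort(reverse=True)
    return [idx for _, idx in scored[:k]]


# Hypothetical mention contexts, already tokenized.
contexts = [
    ["the", "airline", "cancelled", "flights"],
    ["airline", "flights", "were", "delayed"],
    ["the", "actor", "won", "an", "award"],
    ["flights", "by", "the", "airline", "resumed"],
]
vectors = tfidf_vectors(contexts)
siblings = select_siblings(0, vectors, k=2)
print(siblings)
```

The typing distribution-based metric would reuse `cosine` and `select_siblings` unchanged, with each vector replaced by the base model's score distribution over the type set.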

Dropout of type neighbors
The mention representation incorporates information from the mention's ground-truth type neighbors, yet it is then used to predict those same ground-truth types. If the neighbor set contains all the ground-truth types, the model inevitably degenerates to focusing only on the type neighbors while totally ignoring the mention neighbors. To overcome this, each neighboring type is randomly discarded with a certain probability. Predicting a discarded type then forces the model to learn from the sibling mentions rather than directly from the type neighbors.

Scalable inference
Step 1: Given a batch of test mentions, we first obtain their sibling mentions.
Step 2: We add the test mentions as nodes to the mention-type graph, connecting each test mention to the sibling mentions selected in Step 1. Note that in the new graph the test mentions have no type neighbors, since their ground-truth types are not available. In addition, there is no edge between any two test mentions in the new graph.
Step 3: The representations of the test mentions are updated by aggregating the embeddings of their sibling mentions. After all layers of updates, the final representations are obtained.
Step 4: We predict the type score distribution of each test mention from its mention embedding and the type embeddings.

Main results
We evaluate the proposed model on two widely used datasets, OntoNotes and BBN, and consider both the original and the augmented OntoNotes. Sibling mentions are selected according to the typing distribution from Lin and Ji (2019). After aggregating sibling information through the attentive graph neural module, our model significantly outperforms Lin and Ji (2019) on both the original OntoNotes and the BBN dataset.

Effectiveness of the sibling mention detection algorithm
Measuring sibling quality: for each mention, the slide denotes its ground-truth types, its sibling mentions in the graph, and the ground-truth types of those siblings (the symbols are lost in the extracted text). To quantify the effect of different similarity metrics on the quality of siblings, we define Purity, Coverage, and Quality analogously to Precision, Recall, and F1.
Results: the typing-based metric performs better than the word-based metric. The scores of the gold typing-based and the random-based metrics give the upper bound and the lower bound of the scores for the typing-based metric.

05 Zero-Shot Methods Based on Multi-Source Fusion

Task definition
Fine-grained entity typing (FET): FET aims to detect the types of an entity mention given its context. The types usually form a hierarchy.
Zero-shot fine-grained entity typing (ZFET): the target types for training and testing are entirely disjoint.
Example:
- Sentence: Northwest and Midway are two of the five airlines with which Budget has agreements.
- Mention: Northwest
- Type: /organization/corporation
Type hierarchy (figure): Root; Organization (Government, Corporation); Person (Actor); Location (City).
Source: An Empirical Study on Multiple Information Sources for Zero-Shot Fine-Grained Entity Typing (Chen et al., EMNLP 2021)

Motivation
Existing methods introduce auxiliary information to build semantic connections between the seen and unseen types.
Challenge: the power of auxiliary information has not been sufficiently exploited, and the effect of each information source remains to be clearly understood.

Auxiliary information sources
Source 1, context consistency: a correct type should be semantically consistent with the context if the mention is replaced with the type name in the context.
- Consistent: "Corporation and Midway are two of the five airlines with which Budget has agreements."
- Inconsistent: "Drug and Midway are two of the five airlines with which Budget has agreements."
Source 2, type hierarchy: the ontology structure connecting the seen and unseen types, e.g. Organization with children Government and Corporation.
Source 3, background knowledge:
- Prototypes: carefully selected candidate mentions for a type, providing a mention-level summary of the type, e.g. western_union, quebecor, merrill, rtc, ...
- Descriptions: queried from WordNet glosses by type name, providing a brief high-level summary of each type, e.g. "a business firm whose articles of incorporation have been approved in some state."
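The substitution behind the context-consistency source can be sketched as below. This only builds the candidate sentences. How they are scored is left out, the helper names are invented for illustration, and the type path `/substance/drug` is a hypothetical placeholder for the slide's "Drug" example.

```python
def type_name(type_path):
    """Return the last component of a type path, e.g. 'corporation'."""
    return type_path.rsplit("/", 1)[-1]


def consistency_candidates(sentence, mention, type_paths):
    """Replace the first occurrence of the mention with each type name."""
    return {
        path: sentence.replace(mention, type_name(path).capitalize(), 1)
        for path in type_paths
    }


SENTENCE = (
    "Northwest and Midway are two of the five airlines "
    "with which Budget has agreements."
)
candidates = consistency_candidates(
    SENTENCE,
    "Northwest",
    ["/organization/corporation", "/substance/drug"],  # second path is hypothetical
)
for path, text in candidates.items():
    print(path, "->", text)
```

A model that judges how natural each rewritten sentence reads would then be expected to prefer the /organization/corporation candidate.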

Multi-source fusion model
- Context-Consistency-Aware (CA) module: measures context consistency with large-scale pre-trained language models, e.g., BERT.
- Type-Hierarchy-Aware (HA) module: uses a Transformer encoder to model the hierarchical dependency among types.
- Background-Knowledge-Aware (KA) module: models ZFET with knowledge as natural language inference, with a translation-based solution.

Context-Consistency-Aware Module (CA)
Figure: BERT input "[CLS] [MASK] and Midway are two ... of [SEP]", with candidate type names drug, corporation, government.
Loss function and prediction score: the formulas are lost in the extracted text.

Type-Hierarchy-Aware Module (HA)
Figure: type hierarchy over Organization, Person, Location, Corporation, Government, Company, Actor.
Hierarchy-aware type encoder: a type attends only to its parent type in the hierarchy and to itself, while the attention to the remaining types is masked. The formulas are garbled in the extracted text; reconstructed in the standard masked-attention form, they read

$$M_{ij} = \begin{cases} 0, & \text{if type } j \text{ is type } i \text{ or the parent of type } i \\ -\infty, & \text{otherwise} \end{cases} \qquad \mathrm{ATT}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}} + M\right) V$$

Mention-context encoder: the entity mention and its context are each represented as the weighted sum of their ELMo word representations. The mention representation $r_m$ and the context representation $r_c$ are then concatenated into the final representation $r_{mc} = r_m \oplus r_c$, where $\oplus$ denotes concatenation.
Prediction score: the formula is lost in the extracted text. Loss function: cross-entropy.

Background-Knowledge-Aware Module (KA)
Inference from background knowledge
Multiple premises:
- Context-based premise: "Northwest and Midway are two of the five airlines with which Budget has agreements."
- Prototypes-based premise: "/organization/corporation has the following prototypes: western_union, ..."
- Description-based premise: "/organization/corporation denotes a collection of business firms whose ..."
Hypothesis: "/organization/corporation is a correct type for the mention Northwest."
Encoding multiple premises and the hypothesis: the context-based premise, the prototypes-based premise, the description-based premise, and the hypothesis are each encoded into a vector representation.
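The hierarchy-aware masking of the HA module, in which a type attends only to itself and its parent, can be sketched with plain Python lists. The toy hierarchy, the function names, and the attention logits are assumptions for illustration. A real implementation would apply the same additive mask inside a Transformer encoder.

```python
import math

# Hypothetical toy hierarchy: type -> parent (None marks a top-level type).
PARENT = {
    "organization": None,
    "corporation": "organization",
    "government": "organization",
    "person": None,
    "actor": "person",
}
TYPES = list(PARENT)


def hierarchy_mask(types, parent):
    """mask[i][j] is 0.0 if type i may attend to type j, else -inf."""
    mask = []
    for t in types:
        allowed = {t, parent[t]}
        mask.append([0.0 if u in allowed else -math.inf for u in types])
    return mask


def masked_softmax(logits, mask_row):
    """Softmax over logits after adding the additive mask row."""
    shifted = [x + m for x, m in zip(logits, mask_row)]
    peak = max(shifted)  # finite, because a type always attends to itself
    exps = [math.exp(x - peak) for x in shifted]
    total = sum(exps)
    return [e / total for e in exps]


mask = hierarchy_mask(TYPES, PARENT)
row = mask[TYPES.index("corporation")]
weights = masked_softmax([0.2, 1.0, -0.5, 0.3, 0.7], row)
print(dict(zip(TYPES, (round(w, 3) for w in weights))))
```

For "corporation", all attention mass falls on "organization" and "corporation" itself. The other types receive exactly zero weight.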

Background-Knowledge-Aware Module (KA): prediction score
All representations are projected into the inference space. Write $\hat{p}_{mc}$, $\hat{p}_{p}$, $\hat{p}_{d}$, and $\hat{h}$ for the projected context-based premise, prototypes-based premise, description-based premise, and hypothesis (the original symbols are garbled, so these names are reconstructed).
When the hypothesis can be inferred from the premises, we hope that

$$\hat{p}_{mc} + \hat{p}_{p} + \hat{p}_{d} \approx \hat{h},$$

and therefore minimize their squared Euclidean distance

$$d = \lVert \hat{p}_{mc} + \hat{p}_{p} + \hat{p}_{d} - \hat{h} \rVert_2^2 .$$

The score for a type and the loss function over positive and negative types are defined on the slide, but the formulas are lost in the extracted text.

Multi-source fusion model
Overall loss function: $\mathcal{L} = \mathcal{L}_{CA} + \mathcal{L}_{HA} + \mathcal{L}_{KA}$
Overall prediction score: the scores from CA, HA, and KA are normalized with a sigmoid (the exact formula is garbled). The final decision score of the fusion model is

$$\mathrm{score} = \lambda_1 \hat{s}_{CA} + \lambda_2 \hat{s}_{HA} + \lambda_3 \hat{s}_{KA}, \qquad \lambda_1, \lambda_2, \lambda_3 \ge 0, \quad \lambda_1 + \lambda_2 + \lambda_3 = 1 .$$

Figure: Mention, Context, and Type feed the CA, HA, and KA modules, which feed Fusion and the Overall Loss.

Experimental settings
Datasets: statistics of the training and test datasets (table not recoverable). Types in both BBN and Wiki are organized into a 2-level hierarchy. There are 47 types in BBN and 113 types in Wiki.
Zero-shot setting: the coarse-grained (Level-1) types such as /organization are used for training (seen types), while the fine-grained (Level-2) types such as /organization/corporation are reserved for testing (unseen types).

Main results
Table: zero-shot performance on the unseen types.
Table: supervised performance on the seen types.

Ablation studies
Table: ablation results on BBN unseen types (Ma-F1).
Impact of context length: performance on the unseen types of BBN relative to the context length.
Complementarity among different information sources: Venn diagram of the test cases of unseen types correctly predicted by each module.

Thank you for watching.
