3-3 Translation between Molecules and Natural Language.pdf (PDF, 53 pages)

Translation between Molecules and Natural Language
Heng Ji (UIUC, Amazon Scholar)
Based on the wonderful work done by Hongwei Wang, Carl Edwards, Tuan Lai and Zixuan Zhang (UIUC)
Collaborations with Martin Burke (UIUC) and Kyunghyun Cho (NYU)
hengji@illinois.edu
University of Illinois Urbana-Champaign

Slide 2: Problem: Too Many Papers
- More than 500K papers are published at PubMed every year, and more than 1.2 million new papers were published in 2016 alone, bringing the total number of papers to over 26 million (Van Noorden, 2014)
- As of June 13, 2020, there are at least 140K papers about coronavirus
- Quality: given the rapid publication of preprints without peer review, many research results are redundant, complementary or even conflicting with each other
- Human reading ability stays almost the same across years: US scientists estimated that they read, on average, only 264 papers per year (1 out of 5000 available papers, the same across years)

Slide 3: How Modern Chemists Design Their Experiments
- Most of the current scientific experiments are still based on manual design and candidate ranking
- There are 500K+ possible reactions
- E.g., top 20 candidates manually selected for Suzuki coupling
- No literature search engines support cross-media retrieval
- BioNLP shared tasks mainly cover biomedical papers, but very few papers are about chemistry

Slide 4: How Modern Doctors Predict Cancer Today
- The classification features are extremely coarse-grained, generic and fragile
- Changing the number of biopsies from 1 to 2 will change the cancer risk level from 17% to 37%, regardless of the positive/negative results of the biopsies
- Precision medicine is only affordable for a tiny population
- Development cost is about $2.6 billion

Slide 5: Converting Unstructured Scientific Data to Structured Knowledge: Our Road Map
[Diagram. Stages: Data, Semantics, Knowledge Base. Components:]
- Scientific Literature
- Hierarchical Spherical Embedding
- Ontology Enriched Text Embedding
- Cross-media Structured Semantic Representation
- Generative Adversarial Networks for Data Augmentation and Distant Supervision
- Multimedia Search and Summarization
- Graph neural networks
- Joint entity/relation/event extraction and ontology construction
- Chemical Ontology & Existing Databases
- Multimedia Knowledge Base

Slide 6: Unique Challenges and Solutions
- Developed a multimodal, definition-powered entity representation
  - 2-D images of molecules, representing the underlying molecules or reactions
  - text-based molecule descriptors
  - chemical graph structure from 480K reactions
  - natural language definition and description
  - structured properties in external databases
- Incorporate external knowledge via entity linking
- Capture complex sentence structures
  - AMR parsing and graph neural networks

Slide 7: Challenge 1: What's in a chemical entity?
5,6-DIHYDROXY-1H-INDOLE-2-CARBOXYLIC ACID
Natural Language Definition: 5,6-dihydroxyindole-2-carboxylic acid is a dihydroxyindole that is indole-2-carboxylic acid substituted by hydroxy groups at positions 5 and 6. It has a role as a mouse metabolite. It is a conjugate acid of a 5,6-dihydroxyindole-2-carboxylate. It is a tautomer of a dopachrome.
Natural Language Context: Tautomerization of dopachrome to 5,6-dihydroxyindole-2-carboxylic acid (DHICA) is a biologically crucial reaction relevant to melanin synthesis, cellular antioxidation, and cross-talk among epidermal cells.
- Chemical entity mentions are essentially rare terms that cannot be learned well by a language model alone
- Chemical entities are often complex formula-like names
- Many chemicals have simply never been given a name in natural language

Slide 8: 5,6-Dihydroxy-1H-indole-2-carboxylic acid
Property Name                     Property Value
Molecular Weight                  193.16
XLogP3-AA                         1.2
Hydrogen Bond Donor Count         4
Hydrogen Bond Acceptor Count      4
Rotatable Bond Count              1
Exact Mass                        193.03750770
Monoisotopic Mass                 193.03750770
Topological Polar Surface Area    93.6 Å²
Heavy Atom Count                  14
Formal Charge                     0
Complexity                        245

Slide 9: Molecule Representation Learning
[Diagram: molecule graph -> encoder -> embedding -> downstream tasks]
Downstream tasks:
- Chemical reaction prediction
- Molecule property prediction
- Molecule generation
- Drug discovery
- Retrosynthesis planning
- Chemical text mining
- Chemical knowledge graph modeling
Ways to represent one molecule:
- IUPAC nomenclature: 2-hydroxypropanoic acid
- Molecular formula: C3H6O3
- Structural formula: CH3-CH(OH)-C(=O)-OH
- Space filling model [image]
- Ball-and-stick model [image]

Slide 10: SMILES Strings (Simplified molecular-input line-entry system)
Structure and SMILES pairs shown:
- N≡N: N#N
- CH3-N=C=O: CN=C=O
- (ring system drawn as an image): CCc(c1)ccc2[n+]1ccc3c2[nH]c4c3cccc4 or CCc1c[n+]2ccc3c4ccccc4[nH]c3c2cc1
[Figure: Visualization of 3-cyanoanisole as COc(c1)cccc1C#N.]

Slide 11: SMILES-based Representation
- SMILES-based MRL methods
  - take SMILES strings as input
  - use language models (BERT, Transformer) as their base models
  - output hidden layers as molecule embeddings
- [Figure: Illustration of SMILES-Transformer (Honda et al., 2019)]
- Examples:
  - MolBERT (Fabian et al., 2020)
  - ChemBERTa (Chithrananda et al., 2020)
  - SMILES-Transformer (Honda et al., 2019)
  - SMILES-BERT (Wang et al., 2019)
  - Molecule-Transformer (Shin et al., 2019)

Slide 12: GNN-based Methods
1. Propagate messages over the graph
2. Read out the molecule graph embedding

Slide 13: Limitation of SMILES-based and GNN-based Methods
- SMILES strings are 1D linearizations of molecule structures, which makes it hard to learn the original structural information of molecules
  - Example: CC(CCCCCCCO)=O. The two Os are close in the SMILES string, but in the molecule they are far from each other
- GNN-based methods focus on designing fresh and delicate GNN architectures, while ignoring generalization ability
- There is no specific GNN that performs universally best in all downstream tasks of MRL
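The remark about the two oxygen atoms can be checked in a few lines. The sketch below is not from the talk. It assumes a hypothetical, deliberately tiny SMILES reader (`parse_simple_smiles`) that handles only single-letter atoms, parentheses for branches, and bond symbols, with no rings or bracket atoms. That is just enough for the slide's example string. It compares how far apart the two O atoms are in the string with how many bonds separate them in the molecular graph.

```python
from collections import deque


def parse_simple_smiles(smiles):
    """Parse a tiny SMILES subset: one-letter atoms, branches, and -, =, # bonds.

    Returns (atoms, adjacency), where atoms is a list of
    (symbol, position_in_string) and adjacency maps atom index -> neighbours.
    Rings, charges, and bracket atoms are deliberately not handled.
    """
    atoms, adjacency = [], {}
    previous = None      # index of the atom the next atom bonds to
    branch_stack = []    # atoms to return to when a branch closes
    for position, char in enumerate(smiles):
        if char == "(":
            branch_stack.append(previous)
        elif char == ")":
            previous = branch_stack.pop()
        elif char in "-=#":
            continue     # bond order does not change graph distance
        elif char.isalpha():
            index = len(atoms)
            atoms.append((char, position))
            adjacency[index] = []
            if previous is not None:
                adjacency[index].append(previous)
                adjacency[previous].append(index)
            previous = index
        else:
            raise ValueError(f"unsupported SMILES character: {char!r}")
    return atoms, adjacency


def bond_distance(adjacency, start, goal):
    """Breadth-first search: number of bonds on the shortest path."""
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if node == goal:
            return depth
        for neighbour in adjacency[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, depth + 1))
    return None


smiles = "CC(CCCCCCCO)=O"  # the example string from the slide
atoms, adjacency = parse_simple_smiles(smiles)
oxygens = [i for i, (symbol, _) in enumerate(atoms) if symbol == "O"]
string_gap = atoms[oxygens[1]][1] - atoms[oxygens[0]][1]
graph_gap = bond_distance(adjacency, oxygens[0], oxygens[1])
print(f"characters apart: {string_gap}, bonds apart: {graph_gap}")
```

In the string the two oxygens sit only three characters apart, while the shortest path between them in the graph runs through every carbon of the chain. A real pipeline would use a full cheminformatics toolkit rather than this toy parser.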

Slide 14: Structural Molecule Encoder
- We use GNNs as the molecule encoder
- Each atom has an initial feature vector consisting of four parts:
  - Element type
  - Charge
  - Whether the atom is in an aromatic ring
  - The count of attached hydrogen atom(s)
- No edge feature (i.e., bond type) is considered
  - Bond type can be inferred from the features of its two associated atoms
- Example atom: Element type: C; Charge: 0; Whether this atom is in an aromatic ring: True; The count of attached hydrogen atom(s): 1
- [Figure: one-hot encoding of the four parts: Element type, Charge, Aromatic, #H atom(s)]

Slide 15: Preserving Chemical Reaction Equivalence [Wang et al., ICLR 2022]
- Several physical quantities remain constant before and after the reaction
  - Mass, energy, charge, etc.
- We aim to preserve such equivalence in the molecule embedding space
- [Figure: the reaction C2H5OH + O2 → CH3CHO in chemical reaction space, and the points C2H5OH, O2 and CH3CHO in molecule embedding space]

Slide 16: Experiments: Chemical Reaction Prediction
- USPTO
  - Training set: 409k, validation set: 30k, test set: 40k
  - Each reaction contains SMILES strings of up to five reactant(s) and exactly one product
  - Format: reactant_smiles>>product_smiles
    0  CC(C)C[Mg+].CON(C)C(=O)c1ccc(O)nc1>>CC(C)CC(=O)c1ccc(O)nc1
    1  CN.O=C(O)c1ccc(Cl)c([N+](=O)[O-])c1>>CNc1ccc(C(=O)O)cc1[N+](=O)[O-]
    2  CCn1cc(C(=O)O)c(=O)c2cc(F)c(-c3ccc(N)cc3)cc21.O=CO>>CCn1cc(C(=O)O)c(=O)c2cc(F)c(-c3ccc(NC=O)cc3)cc21
  - [Table caption: Dataset]

Slide 17: Experiments: Chemical Reaction Prediction
- Using products in the test set as candidates
- Improvement of MolR-TAG over the best baseline (values as extracted; column headers not recoverable): 14.2%, 390.2, 17.4%, 12.2%, 10.1%, 7.9%

Slide 18: Now Can We Translate between Molecules and Natural Language? [Edwards et al., 2022, arXiv]

Slide 19: Molecule Captioning
- There is an enormous number of possible molecules, and these can't all be tested in a lab. Can we describe what they do at a semantic level to help accelerate things like drug discovery?

Slide 20: Text-Guided de Novo Molecule Generation
- Can we do the same things with molecules and language? Given a description, generate a molecule which corresponds to the description.
- Example description: "Zeatin is a cytokinin derived from adenine, which occurs in the form of a cis- and a trans-isomer and conjugates. Zeatin was discovered in immature corn kernels from the genus Zea."

Slide 21: What are the challenges?
- Molecule captions are much harder to create: they require domain expertise.
- Data scarcity: it's easy to collect millions of examples of image-caption pairs from the internet. However, it's very difficult to do so for molecules. For example, recent work uses a dataset of size 33,000.
- So, can we leverage single-modal pretraining? In particular, our problem is almost like translating from a chemical language to English.

Slide 22: How do molecules and images differ?
- Images are a continuous domain.

- Molecules are a discrete space of graphs.
- In the molecule case, we could describe it with:
  - an IUPAC name
  - one of many different synthetic routes from known precursor molecules
  - properties or applications
  - functional groups
- This allows much more linguistic variety in molecule captions than in image captioning

Slide 23: Data Sources
- Text modality: C4 dataset (a collection of about 750GB of text crawled from the Internet).
- Molecule modality: ZINC (100 million SMILES strings used in Chemformer from the ZINC-15 dataset).
- Cross-modal: ChEBI-20 dataset (33,000 molecule-description pairs from Text2Mol).

Slide 24: Training [Edwards et al., 2022, arXiv]
[Figure]

Slide 25: Experiment Results
[Tables: Molecule Captioning Performance; Molecule Generation Performance]

Slides 26-27: Molecule Generation Results
[Figures: left, ground truth; right, predicted]

Slide 28: Molecule Generation Results: Different Models
[Figure columns as extracted: Transformer, RNN, T5, MolT5, Input, Ground Truth]
Input descriptions:
- "The molecule is a hydrate that is the dihydrate form of manganese(II) chloride. It has a role as a MRI contrast agent and a nutraceutical. It is a hydrate, an inorganic chloride and a manganese coordination entity."
- "The molecule is a member of the class of phenylureas that is urea in which one of the nitrogens is substituted by a p-chlorophenyl group while the other is substituted by two methyl groups. It has a role as a herbicide, a xenobiotic and an environmental contaminant. It is a member of monochlorobenzenes and a member of phenylureas."

Slide 29: Molecule Generation Results: Different Models
[Figure columns as extracted: Transformer, RNN, T5, MolT5, Input, Ground Truth]
Input descriptions:
- "The molecule is a monocarboxylic acid that is thyroacetic acid carrying four iodo substituents at positions 3, 3', 5 and 5'. It has a role as a thyroid hormone, a human metabolite and an apoptosis inducer. It is an iodophenol, a 2-halophenol, a monocarboxylic acid and an aromatic ether."
- "The molecule is a member of the class of chloroethanes that is ethane in which five of the six hydrogens are replaced by chlorines. A non-flammable, high-boiling liquid (b.p. 161-162 °C) with relative density 1.67 and an odour resembling that of chloroform, it is used as a solvent for oil and grease, in metal cleaning, and in the separation of coal from impurities. It has a role as a non-polar solvent."
Labels on generated molecules: Invalid, fixed

Slide 30: Molecule Generation Results: Different Models
[Figure columns as extracted: Transformer, RNN, T5, MolT5, Input, Ground Truth]
Input descriptions:
- "The molecule is an eighteen-membered homodetic cyclic peptide which is isolated from Oscillatoria sp. and exhibits antimalarial activity against the W2 chloroquine-resistant strain of the malarial parasite, Plasmodium falciparum. It has a role as a metabolite and an antimalarial. It is a homodetic cyclic peptide, a member of 1,3-oxazoles, a member of 1,3-thiazoles and a macrocycle."
- "The molecule is a methylindole carrying a methyl substituent at position 3. It is produced during the anoxic metabolism of L-tryptophan in the mammalian digestive tract. It has a role as a mammalian metabolite and a human metabolite."
Label on a generated molecule: Invalid

Slide 31: Molecule Generation Results: Different Models
[Figure columns as extracted: Transformer, RNN, T5, MolT5, Input, Ground Truth]
Input descriptions:
- "The molecule is a tripeptide composed of glycine, glycine and L-alanine residues joined in sequence. It has a role as a metabolite."
- "The molecule is a sulfonated xanthene dye of absorption wavelength 573 nm and emission wavelength 591 nm. It has a role as a fluorochrome."
Label on a generated molecule: Invalid

Slide 32: Molecule Captioning Results: Different Models
[Figure columns as extracted: Transformer, RNN, T5, MolT5, Input, Ground Truth]
Captions shown (the extracted text does not preserve which column each belongs to; repetitions are part of the model output):
- "The molecule is a member of the class of pyrazoles that is 1H-pyrazole that is substituted at positions 1, 3, 4, and 5 by 2,6-dichloro-4-(trifluoromethyl)phenyl, cyano, (trifluoromethyl)sulfanyl, and amino groups, respectively. It is a metabolite of the agrochemical fipronil. It has a role as a marine xenobiotic metabolite. It is a member of pyrazoles, a dichlorobenzene, a member of (trifluoromethyl)benzenes, an organic sulfide and a nitrile."
- "The molecule is a member of the class of pyrazoles that is 1H-pyrazole that is substituted at positions 1, 3, 4, and 5 by 2,6-dichloro-4-(trifluoro methyl)phenyl, cyano, (trifluoromethyl)sulfinyl, and amino groups, respectively. It is a nitrile, a dichlorobenzene, a primary amino compound, a member of pyrazoles, a sulfoxide and a member of (trifluoromethyl)benzenes" (shown twice)
- "the molecule is a deuterated compound that is is is is is an isotopologue of chloroform in which the four hydrogen atoms have been replaced by deuterium. it is a deuterated compound, a gamma-lactam and an aliphatic sulfide."
- "the molecule is an organofluorine compound that is 1,2,3,4-triazol-1h-1,2,4-triazole which is substituted at positions 2, 3, and 5 by a 2,3,5-triazol-1-yl group and at position 5 by a 2-(trifluoromethyl)-1,3,5-triazol-1-yl group. it is an organofluorine compound, an organofluorine compound, an organofluorine compound, an organofluorine compound, an organofluorine compound, an organofluorine compound, an organofluorine compound, an organofluorine compound, an organofluorine compound and a member of monochlorobenzenes."

Slide 33: Molecule Captioning Results: Different Models
[Figure columns as extracted: Transformer, RNN, T5, MolT5, Input, Ground Truth]
Captions shown:
- "the molecule is a cationic fluorescent dye having 2,3-dimethyl-1,2,3,4,6-tetrahydro-1h-1,2,3,4,6-tetrahydropyridin-1-yl amino amino group, respectively. it has a role as a fluorochrome."
- "the molecule is a deuterated compound that is is is is is an isotopologue of chloroform in which the four hydrogen atoms have been replaced by deuterium. it is a deuterated compound and an alpha,omega-dicarboxylic acid."
- "The molecule is a quaternary ammonium ion and a member of phenanthridines. It has a role as an intercalator and a fluorochrome."
- "The molecule is an organic cation that is phenoxazin-5-ium substituted by amino and methylamino groups at positions 3 and 7 respectively. The chloride salt is the histological dye azure C."
- "The molecule is an organic cation that is phenoxazin-5-ium substituted by methyl, amino and diethylamino groups at positions 2, 3 and 7 respectively. The tetrachlorozincate salt salt is the histological dye brilliant cresyl blue."

Slide 34: Molecule Captioning Results: Different Models
[Figure columns as extracted: Transformer, RNN, T5, MolT5, Input, Ground Truth]
Captions shown:
- "The molecule is a GDP-L-galactose having beta-configuration at the anomeric centre of the L-galactose fragment. It is a conjugate acid of a GDP-beta-L-galactose(2-)."
- "The molecule is a GDP-L-galactose in which the anomeric oxygen is on the same side of the fucose ring as the methyl substituent. It has a role as a plant metabolite and a mouse metabolite. It is a conjugate acid of a GDP-beta-L-galactose(2-)."
- "the molecule is a gdp-d-glucoside-a-"
- "the molecule is the stable isotope of helium with relative atomic mass 3.016029. the least abundant (0.000137 atom percent) isotope of naturally occurring helium."
- "The molecule is a GDP-D-glucose in which the anomeric centre of the pyranose fragment has alpha-configuration. It is a GDP-D-glucose and a ribonucleoside 5'-diphosphate-alpha-D-glucose. It is a conjugate acid of a GDP-alpha-D-glucose(2-)."

Slide 35: "The molecule is a blue dye."
[Figure]

Slide 36: "The molecule is an explosive."
[Figure]

Slide 37: Challenge 2: Sentence-level Context is Insufficient
- Most of the scientific concepts are abbreviated without explicit explanations of their meanings.
- The complex biomedical and chemical interactions between multifarious chemicals, genes, and proteins are even harder to understand.

Slide 38: Constructing an Extra Brain: Encoding External Knowledge via Entity Linking (Lai et al., ACL 2021)
[Figure]

Slide 39: Initial Span Graph Construction
- Enumerate all the spans up to a certain length and compute a representation for each span.
- Predict the type of each span and the relation between each span pair jointly. The predictions are then used to construct an initial span graph.
- Use a bidirectional graph convolutional network (GCN) to integrate initial relational information into each span representation.

Slide 40: Final Span Graph Prediction
- After building the two graphs, we soft-align the mentions and the candidate entities using an attention mechanism.
- With the extracted knowledge-aware span representations, we predict the final span graph.

Slide 41: Challenge 3: Wide Context due to Complex Sentence Structures
- Paper authors tend to write long sentences with clauses and appositions for better presentation.
- Example (with "Foxp3" marked as the Argument and "localization" as the Trigger): "Foxp3 contains a proline-rich amino-terminal domain reported to function as a nuclear factor of activated T cells (NF-AT) and nuclear factor-kappaB (NF-kappaB) binding domain, a central region containing a zinc finger and leucine zipper potentially important for protein-protein interactions, and a carboxyl-terminal forkhead (FKH) domain required for nuclear localization and DNA-binding activity [14-16]."
- Distance between event trigger and argument:
  Dataset                Average Distance    Maximal Distance
  ACE05-E (News)         0.212 sentence      56 words
  GENIA-2011 (Papers)    0.330 sentence      77 words

Slide 42: Compressing Long Context with Abstract Meaning Representation
- Use BioBERT as the sentence encoder, and adopt span enumeration and classification for identifying event triggers and entities
- Message passing using an edge-conditioned graph attention network
- Joint training and inference (trigger and arguments) for biomedical event extraction

Slide 43: Knowledge-enriched Abstract Meaning Representation [Zhang et al., 2021]
[Figure]
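Span enumeration, which both the span-graph construction and the AMR-based event extractor described above start from, can be sketched in a few lines. This is an illustrative sketch, not the authors' code: the function name `enumerate_spans`, the `max_width` parameter, and the shortened example sentence are all assumptions made here. The encoder and the classifier that would score each span are left out.

```python
def enumerate_spans(tokens, max_width):
    """Return every contiguous span of at most max_width tokens.

    Each span is (start, end, text) with an inclusive start and exclusive end.
    """
    spans = []
    for start in range(len(tokens)):
        last_end = min(start + max_width, len(tokens))
        for end in range(start + 1, last_end + 1):
            spans.append((start, end, " ".join(tokens[start:end])))
    return spans


# Shortened, hypothetical version of the slide's Foxp3 sentence.
tokens = "Foxp3 contains a forkhead domain".split()
spans = enumerate_spans(tokens, max_width=3)
for start, end, text in spans:
    print(start, end, text)
```

The number of candidates grows roughly as sentence length times `max_width`, which is why the slides cap the span length. In a full system each candidate would receive a representation from the sentence encoder and then be classified as a trigger, an entity, or neither.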

Slide 44: Entity and Relation Extraction Results
- BioRelEx dataset: detecting binding interactions between proteins and/or biomolecules
- ADE dataset: extracting drug-related adverse effects

Overall results (%) on the BioRelEx leaderboard:
Model                       Entity (Micro-F1)    Relation (Micro-F1)
SciIE (Luan et al. 2018)    73.56                50.15
Second Best Model           82.76                62.18
KECI (Ours)                 87.35                67.09

Overall results (%) on the ADE dataset:
Model                                 Entity (Macro-F1)    Relation (Macro-F1)
Relation-Metric (Tran et al. 2019)    87.11                77.29
SpBERT (Eberts et al. 2020)           89.28                78.84
SpanMulti-Head (Ji et al. 2020)       90.59                80.73
KnowBertAttention (Baseline)          90.08                79.95
KECI (Ours)                           90.67                81.74

Slide 45: Event Extraction Results
- GENIA 2011 (interaction between proteins and biomolecules)
- Extracting biomedical events (9 types) and proteins

Data Split    Train Set    Dev Set    Test Set
#Documents    908          259        231
#Sentences    8,620        2,846      3,348
#Proteins     11,625       4,690      5,301
#Events       10,310       3,250      4,487

Model              Prec     Rec      F1
String Matching    43.92    21.82    29.16
Tree LSTM          67.01    52.14    58.65
GEANet             64.61    56.11    60.06
BEESL              69.72    53.00    60.22
DeepEM             71.71    56.20    63.02
Bert-Flat          64.68    52.98    58.25
BERT-AMR           68.39    53.58    60.09
BERT-AMR-KG        72.74    55.62    63.04

Slide 46: Qualitative Analysis
- Explanation #1: AMR helps to link the trigger "bind" to entity "CAII" although they are located far away from each other in the sentence.
- Explanation #2: External KG helps to recognize "CAII" as the theme of "transcription" based on the additional KG link in the enriched AMR graph.

Slide 47: Qualitative Analysis
- Explanation #1: AMR helps to guide the model to find the correct arguments for the two binding events in the sentence.
- Explanation #2: External KG helps to identify the positive regulation effect on the generation of heterodimers.

Slide 48: A New Chemistry IE Benchmark
- A new benchmark, ChemET: includes 81 million molecules and 100 chemistry papers fully annotated with a new fine-grained Chemistry ontology
- Data comes from PubChem open access literature
- Developed a new entity ontology that includes 62 entity types (e.g., Chemistry > Organic Chemistry > Organic Compounds > Hydrocarbons > Other Hydrocarbons)
- Developed a new Chemistry Event Ontology with 59 types about results, methods and procedures
- Training set constructed by distant supervision (using Wikipedia page and PubChem synonyms)
- Evaluation set annotated by chemistry students
- [Table: Dataset Stats]

Slide 49: A New COVID-19 Benchmark
http://blender.cs.illinois.edu/covid19/
- Manually annotated 122 documents with 6,789 sentences in total
- Data sources:
  1. AI2 CORD-19 archive: index for COVID-19 related papers from PubMed, PubMed Central, Arxiv, and Bio-Arxiv
  2. NCBI PMC API: retrieve PubMed (PM) or PubMed Central (PMC) articles
  3. NCBI Pubtator API: retrieve pre-annotated PubMed or PubMed Central articles

Slide 50: Application in COVID-19 Drug Repurposing Report Generation
Figure 1. FDA approved drugs of most interest for repurposing as potential Ebola virus treatments.
[Diagram labels: Entity Grounding for Drug; Molecular Structure Image; KG from caption text (FDA Drugs, Ebola, approve, repurpose); Multimedia Knowledge Graph Expansion]

Slide 51: Application in COVID-19 Drug Repurposing Report Generation
Wang et al., NAACL 2021, Best Demo Paper Award

Slide 52: ReactionTracker: Literature Search Engine for Chemical Reactions
- Query the specific compounds or reactant groups
- Retrieve papers about reaction(s) involving the queried compounds or reactant groups
- Highlight word-level or entity-level matches
- Detect other entities from the same reactant group and involve them in ranking

Democratizing Drug Discovery
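As a consistency check on the GENIA 2011 event extraction table earlier in the deck, the F1 column can be recomputed from the precision and recall columns using the standard harmonic-mean definition. The sketch below is an addition, not part of the talk. The helper name `f1_score` is chosen here, and the numbers are copied from the table. Small differences in the last digit are expected, because the reported precision and recall are themselves rounded.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (model, precision, recall, reported F1) copied from the event extraction table.
rows = [
    ("String Matching", 43.92, 21.82, 29.16),
    ("Tree LSTM", 67.01, 52.14, 58.65),
    ("GEANet", 64.61, 56.11, 60.06),
    ("BEESL", 69.72, 53.00, 60.22),
    ("DeepEM", 71.71, 56.20, 63.02),
    ("Bert-Flat", 64.68, 52.98, 58.25),
    ("BERT-AMR", 68.39, 53.58, 60.09),
    ("BERT-AMR-KG", 72.74, 55.62, 63.04),
]

for model, precision, recall, reported in rows:
    computed = f1_score(precision, recall)
    print(f"{model:16s} computed={computed:.2f} reported={reported:.2f}")
```

Every recomputed value agrees with the reported F1 to within rounding, which also confirms that the flattened table was split into the right columns.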
