How Generative AI Accelerates Protein Research - Zaixiang Zheng.pdf

No. 164046 | PDF | 87 pages | 25.52 MB

How Generative AI Accelerates Protein Research
Zaixiang Zheng, ByteDance Research

We're doing AI for Science at ByteDance Research

AI Protein Modeling & Design
- Learning Harmonic Molecular Representations on Riemannian Manifold. In ICLR 2023.
- On Pre-training Language Model for Antibody. In ICLR 2023.
- Structure-informed Language Models Are Protein Designers. In ICML 2023 (oral).
- Diffusion Language Models Are Versatile Protein Learners. In ICML 2024.
- Protein Conformation Generation via Force-Guided SE(3) Diffusion Models. In ICML 2024.
- Antigen-Specific Antibody Design via Direct Energy-based Preference Optimization. Preprint, 2024.

Small Molecule Design
- Regularized Molecular Conformation Fields. In NeurIPS 2022.
- Zero-Shot 3D Drug Design by Sketching and Generating. In NeurIPS 2022.
- Diffusion Models with Decomposed Priors for Structure-Based Drug Design. In ICML 2023.
- DecompOpt: Controllable and Decomposed Diffusion Models for Structure-based Molecular Optimization. In ICLR 2024.

Cryo-EM
- CryoSTAR: Leveraging Structural Prior and Constraints for Cryo-EM Heterogeneous Reconstruction. Preprint, 2023.

LM-DESIGN: steering large protein LMs to design protein sequences as structure-conditioned sequence generative models
Ref: Structure-informed Language Models Are Protein Designers. In ICML 2023 (oral).

DPLM: A Versatile Protein Foundation Model
Ref: Diffusion Language Models Are Versatile Protein Learners. In ICML 2024.

AbDPO: designing antibodies with energy-based DPO
Ref: Antigen-Specific Antibody Design via Direct Energy-based Preference Optimization. 2024 (under review).

Small Molecule Drug Design: DecompDiff
Ref: Diffusion Models with Decomposed Priors for Structure-Based Drug Design. In ICML 2023.

ConDiff: Protein Dynamic Conformation Generation with Physics-guided SE(3) Diffusion Model
Ref: Protein Conformation Generation via Force-Guided SE(3) Diffusion Models. In ICML 2024.

Cryo-EM Heterogeneous Reconstruction with CryoSTAR
Ref: CryoSTAR: Leveraging Structural Prior and Constraints for Cryo-EM Heterogeneous Reconstruction. 2023 (under review).

Outline
- Background
  - Basics of Generative AI, LLM & Diffusion
  - Basics of Protein
- Generative AI x Protein
  - LLM & Diffusion in AI for Protein, AlphaFold & Protein Language Model
- Large-scale Generative Protein Modeling & Design in ByteDance Research
  - LM-DESIGN: Sequence design for given structure w/ protein LLMs
  - DPLM: A versatile protein foundation model w/ LLM + Diffusion
- One more thing: Towards next-gen multimodal protein foundation model?

Amazing things that generative AI can do
- AlphaFold learns protein folding
- Large LMs speak
- Vision AIs create arts

Deep generative modeling: learning to generate data
"Creating noise from data is easy; creating data from noise is generative modeling." (Dr. Yang Song, Score-based SDEs)

Deep Generative Models
- Implicit generative models: non-probabilistic
- Likelihood-based models: probabilistic, w/ or w/o latent variables
- Autoregressive models

Autoregressive Language Models
- Data; Model; Learning goal: Maximum Likelihood Estimation (MLE)
- AR-LMs generate data element by element

Transformer: Attention (over pairs) is all you need
- Data; Model; Learning goal: Maximum Likelihood Estimation (MLE)

Diffusion Models: Learning to generate by iterative denoising

Notable milestones of generative AI: Multimodal ALL-IN-ONE
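As an illustration of the autoregressive recipe named on the slides (fit by maximum likelihood, then generate element by element), here is a minimal sketch. It swaps the Transformer for a count-based bigram model, which is the simplest autoregressive factorization; the toy corpus, the `<s>`/`</s>` boundary tokens, and all function names are assumptions made for this example, not anything taken from the talk.

```python
import math
import random
from collections import Counter, defaultdict

BOS, EOS = "<s>", "</s>"  # hypothetical boundary tokens for this toy example


def fit_bigram(corpus):
    """MLE for a bigram model: p(next | prev) = count(prev, next) / count(prev)."""
    counts = defaultdict(Counter)
    for seq in corpus:
        tokens = [BOS] + list(seq) + [EOS]
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return {
        prev: {tok: c / sum(ctr.values()) for tok, c in ctr.items()}
        for prev, ctr in counts.items()
    }


def log_likelihood(model, seq):
    """Chain rule: log p(x) = sum_i log p(x_i | x_{i-1}).

    Raises KeyError for a transition never seen in training (no smoothing here).
    """
    tokens = [BOS] + list(seq) + [EOS]
    return sum(math.log(model[p][n]) for p, n in zip(tokens, tokens[1:]))


def sample(model, rng, max_len=50):
    """Generate one sequence element by element, left to right."""
    out, prev = [], BOS
    while len(out) < max_len:
        toks, probs = zip(*model[prev].items())
        nxt = rng.choices(toks, weights=probs)[0]
        if nxt == EOS:
            break
        out.append(nxt)
        prev = nxt
    return "".join(out)


corpus = ["MKTV", "MKSV", "MRTV"]  # toy "protein" sequences, made up
model = fit_bigram(corpus)
likelihood = math.exp(log_likelihood(model, "MKTV"))
generated = sample(model, random.Random(0))
```

A real AR-LM replaces the count table with a neural network that conditions on the whole prefix, but the training objective and the sampling loop have the same shape.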

AI is revolutionizing structural biology

Protein: The central dogma of molecular biology
Credit to Ellen Zhong: the content of the following slides introducing structural biology is mostly adapted from Ellen Zhong's keynote speech at the MLSB workshop.

Structural biology: the study of proteins and other biomolecules through their 3D structure
- All essential biological processes are carried out by proteins and protein complexes
- Many proteins are enzymes that catalyze chemical reactions
- Primary sequence/chain: amino acids
- Secondary structures: α-helix, β-pleated sheet
- Tertiary structures/folds
- Quaternary structures (2+ chains that interact): protein-protein interaction

Protein: data modalities (sequence, structure, function)
- A sequence over 20 amino acids (AAs)
- In solvent, the sequence folds into a unique 3D spatial structure with minimal free energy
- Structure determines protein function
- Protein folding (seq → struct): conformation energy landscape
- Sequence: sequence of amino acids
- Structure (mainly backbone):
  - 3D XYZ coordinates
  - Local reference frames (AF2 style): Cα coords + orientation
  - Torsion angles
  - Contact/distance map
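To make the "contact/distance map" representation concrete, the sketch below derives both maps from Cα coordinates with NumPy. The four collinear residues spaced 3.8 Å apart and the 8 Å contact cutoff are illustrative assumptions (cutoffs vary between papers), and the function names are made up for this example.

```python
import numpy as np


def distance_map(ca_coords: np.ndarray) -> np.ndarray:
    """Pairwise Euclidean distances between C-alpha atoms, shape (L, L)."""
    diff = ca_coords[:, None, :] - ca_coords[None, :, :]  # (L, L, 3)
    return np.linalg.norm(diff, axis=-1)


def contact_map(ca_coords: np.ndarray, threshold: float = 8.0) -> np.ndarray:
    """Boolean map: True where two residues are closer than `threshold` (in Å)."""
    return distance_map(ca_coords) < threshold


# Toy backbone: 4 residues on a straight line, 3.8 Å apart (assumed spacing).
coords = np.array([[i * 3.8, 0.0, 0.0] for i in range(4)])
dist = distance_map(coords)
contacts = contact_map(coords)
```

Both maps are symmetric and invariant to rotating or translating the structure, which is one reason they are a convenient structure representation.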

Protein modalities: sequence and structure and in-between
- Structure: atomic coordinates of protein 3D structures
- Folding (amino acid sequence → protein 3D structure): AlphaFold, RoseTTAFold, ESMFold, etc.
- Inverse folding, aka (structure-based) sequence design (protein 3D structure → amino acid sequence)
- Example: PDB ID 1IGT, from https://www.rcsb.org/structure/1IGT
  Amino acid sequence: DIVLTQSPSSLSASLGDTITITCHASQNINVWLSWYQQKPGNIPKLLIYKASNLHTGVPSRFSGSGSGTGFTLTISSLQPEDIATYYCQQGQSYPLTFGGGT.

Designing protein sequence and structure as generative modeling problems
- Inverse folding: conditional sequence generation
- Folding: conditional structure generation
- Sequence design, structure design, sequence-structure co-design

Discreteness in NLP and Biology
- Goal: learning the joint probability of a sequence of discrete tokens
- Factorization (w.r.t. the structures of data) needed!
Ref: Noelia Ferruz & Birte Höcker. 2022. Controllable protein design with language models. Nature Machine Intelligence.

Diffusion(-like) modeling for Protein Structure: AlphaFold

AlphaFold 2: A solution to a 50-year-old grand challenge in biology
Ref: John Jumper, et al. Highly accurate protein structure prediction with AlphaFold. Nature, 2021.

Delve into AF
- Sequence-conditional structure generation
- Homologous retrieval-augmented generation (RAG)
- Encoder-decoder & Transformers
- Diffusion(-like) recycling / iterative refinement

AlphaFold: summary
- AF: conditional structure generation w/ Transformer + RAG + Diffusion

Protein Sequence Modeling with LLMs

Recap: LLMs for natural language
- Simple & universal law of the scale: the larger the merrier
Ref: Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama et al. Emergent Abilities of Large Language Models. Transactions on Machine Learning Research, 2022.
Ref: Yao Fu, Hao Peng and Tushar Khot. How does GPT Obtain its Ability? Tracing Emergent Abilities of Language Models to their Sources. 2022. On Yao Fu's Notion.

Protein language models (pLMs): two types of commonly used protein LMs
- Protein sequence encoder: predictive models for classification and regression
  - Formulation: pseudo-likelihood p(a_i | a_{j≠i} ∈ seq) by MLM, DAE, etc.
  - Instances (BERT-like): ESM-1b, ESM-2 series
- Protein sequence decoder: generative models for learning distributions and synthesizing sequences
  - Formulation: likelihood p(a_i | a_{j<i} ∈ seq) by autoregressive/causal LM
  - Instances (GPT-like): ProGen2, ProGPT

ESM: Evolutionary Scale Modeling
- BERT analog for proteins
- Learned with MLM (15%, 8/1/1) on 250M sequences
- ESM-1b: 650M params
Ref: Rives, A., Meier, J., Sercu, T., Goyal, S., Lin, Z., Liu, J., Guo, D., Ott, M., Zitnick, C.L., Ma, J., Fergus, R., et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. 2020.

ProGen: next ChatGPT for proteins?
- GPT-like autoregressive model on sequences
Ref: Madani, A., Krause, B., Greene, E.R., Subramanian, S., Mohr, B.P., Holton, J.M., Olmos Jr, J.L., Xiong, C., Sun, Z.Z., Socher, R. and Fraser, J.S. Large language models generate functional protein sequences across diverse families. Nature Biotechnology, 2023.

ESM-2 series: scaling makes a difference
- Scaling is all you need: just as in LLMs for natural languages
- Emergent abilities: structural awareness
- Phase transition at a certain scale threshold
Ref: Lin, Z., Akin, H., Rao, R., Hie, B., Zhu, Z., Lu, W., Smetanin, N., Verkuil, R., Kabeli, O., Shmueli, Y. and dos Santos Costa, A., 2022. Evolutionary-scale prediction of atomic level protein structure with a language model. bioRxiv, pp. 2022-07.
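The ESM slide summarizes its training signal as "MLM (15%, 8/1/1)". A plausible reading, assumed here, is the BERT-style recipe: select 15% of positions as prediction targets, then replace 80% of them with a mask token, 10% with a random amino acid, and leave 10% unchanged. The sketch below shows only that corruption step; the `<mask>` token string, the function name, and the example sequence (a prefix of the 1IGT sequence shown earlier in the deck) are choices made for illustration.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
MASK = "<mask>"  # hypothetical mask token for this sketch


def mlm_corrupt(seq, mask_rate=0.15, rng=None):
    """Return (corrupted_tokens, targets) for masked language modeling.

    targets maps each selected position to its original residue; the model
    is trained to predict those residues from the corrupted sequence.
    """
    rng = rng or random.Random(0)
    tokens = list(seq)
    n_select = max(1, round(mask_rate * len(tokens)))
    positions = sorted(rng.sample(range(len(tokens)), n_select))
    targets = {}
    for pos in positions:
        targets[pos] = tokens[pos]
        r = rng.random()
        if r < 0.8:            # assumed 80%: replace with the mask token
            tokens[pos] = MASK
        elif r < 0.9:          # assumed 10%: replace with a random residue
            tokens[pos] = rng.choice(AMINO_ACIDS)
        # remaining assumed 10%: keep the original residue
    return tokens, targets


seq = "DIVLTQSPSSLSASLGDTITITC"
corrupted, targets = mlm_corrupt(seq, rng=random.Random(0))
```

The training loss is then the negative log-probability of each entry in `targets` given `corrupted`, which matches the pseudo-likelihood formulation listed for encoder-style pLMs.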

ESMFold: Protein folding using pLMs at scale
- pLMs at scale enable single-sequence structure prediction
- ESM-2 + structural module
- Comparable to AF2, but needing no homologs and 60x faster
Ref: Lin, Z., Akin, H., Rao, R., Hie, B., Zhu, Z., Lu, W., Smetanin, N., Verkuil, R., Kabeli, O., Shmueli, Y. and dos Santos Costa, A., 2022. Evolutionary-scale prediction of atomic level protein structure with a language model. bioRxiv, pp. 2022-07.

(LLM + Diffusion) x Protein: Large-scale Generative Protein Modeling & Design

Structure-informed Language Models Are Protein Designers
Zaixiang Zheng (1*), Yifan Deng (2*), Dongyu Xue (1), Yi Zhou (1), Fei Ye (1) and Quanquan Gu (1)
(1) ByteDance Research & (2) UW-Madison
ICML 2023 Oral

"NMT moment" for Structure-based Sequence Design

[Figure: Progress of DL-based protein sequence design. Axes: year vs. accuracy (sequence recovery); radius proportional to the model scale. Models shown: Structure Transformer, GVP-GNN, ESM-IF, ProteinMPNN, LM-DESIGN, PiFold, DenseCPD, GCA. Affiliations shown: ByteDance, Meta, U. Washington (David Baker), Westlake, MIT, Stanford, NYU.]

Structure-based protein sequence design / inverse folding
- Definition: to find an amino acid sequence that can fold into a desired protein backbone structure, by learning a probabilistic model over a certain amount of protein structure-sequence data
- Existing work: graph-to-sequence autoregressive modeling (StructTransformer [1], ProteinMPNN [2], ESM-IF [3], etc.)
[1] Ingraham, et al. Generative models for graph-based protein design. In NeurIPS 2019.
[2] Dauparas, et al. Robust deep learning-based protein sequence design using ProteinMPNN. Science 2022.
[3] Hsu, et al. Learning inverse folding from millions of predicted structures. In ICML 2022.

Challenges and motivations
- Sequential evolutionary knowledge should be better considered & utilized!
  - Limited experimentally determined structures: 0.1% known structures
  - Massive protein sequences: millions to billions
- Structurally non-deterministic regions are less informative and much harder
- Left-to-right autoregressive models are not necessarily the best fit for spatially structured data like proteins
- Large-scale protein language models can help: pLMs are such strong sequence learners
- Q: can pLMs be better structure-based protein designers?

LANGUAGE MODELS KNOW SEQUENCES THE BEST!
Protein LMs (pLMs, e.g., ESM-1b/ESM-2), learned from the universe of massive protein sequences, have demonstrated emergent evolutionary knowledge to enable amazing capabilities.
Ref (ESM): Rives, et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences.
Ref (ESM2/ESMFold): Lin et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 2023.

LM-DESIGN: reprogramming pLMs as structure-conditioned sequence generative models
- Structural surgery: implanting a lightweight structural adapter into a strong pretrained pLM
- We focus on BERT-like MLMs (e.g., ESM-1b/ESM2) for bidirectional receptive fields
- LM-DESIGN = pretrained pLM as sequence decoder + protein structure encoder + structural adapter
  - pLM as sequence decoder: strong sequence generative capability
  - Structure encoder: structure understanding
  - Structural adapter: structure-sequence aligner/translator
- Training: conditional masked language modeling (CMLM) with pLMs frozen
- Diffusion-like inference: full-sequence iterative refinement/denoising for 5 cycles (recycling for T times: iteratively refine)
[Figure: LM-DESIGN architecture. Structure encoder (GNNs, ProteinMPNN, GVP, IPA, etc.); structural adapter (N): multihead ATTN + FFN; sequence decoder: pLM (ESM series, etc.), Transformer layer: multihead ATTN + FFN. Token sequences shown: cls YKTVRAGRLGSISRSLER eos; cls MKTVRQERLKSIVRILER eos. Protein figure credit: RFDiffusion.]

LM-DESIGN improves SoTA results by a large margin (4%-12%)
- Non-AR modeling is a more proper probabilistic model for protein data
- Data- & parameter-efficient: outperforming without any additional data (2% trainable)
- Modularizable: further benefits from pretrained structure encoders
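The LM-DESIGN slides describe inference as full-sequence iterative refinement for a fixed number of cycles and report results as sequence recovery. The sketch below shows only the control flow of such a loop, under loud assumptions: `toy_denoiser` is a made-up stand-in for the structure-conditioned pLM (it peeks at the native sequence and fixes one more position per cycle), the `<mask>` token is hypothetical, and the 8-residue native sequence is a prefix of a sequence shown in the architecture figure. It is not the authors' implementation.

```python
MASK = "<mask>"  # hypothetical mask token for this sketch


def sequence_recovery(predicted, native):
    """Fraction of positions where the designed residue equals the native one."""
    matches = sum(p == n for p, n in zip(predicted, native))
    return matches / len(native)


def toy_denoiser(tokens, native):
    """Stand-in for a structure-conditioned model (NOT a real pLM).

    Position i is predicted correctly only if it is the first position or
    its left neighbour is already correct in the *input*, so each call can
    fix at most one additional position. This mimics predictions that
    improve as the surrounding context gets cleaner.
    """
    out = []
    for i, tok in enumerate(tokens):
        if i == 0 or tokens[i - 1] == native[i - 1]:
            out.append(native[i])
        else:
            out.append(tok)
    return out


def iterative_refinement(length, denoise, num_cycles=5):
    """Start fully masked, then re-predict the whole sequence each cycle."""
    tokens = [MASK] * length
    history = []
    for _ in range(num_cycles):
        tokens = denoise(tokens)  # feed the previous prediction back in
        history.append(list(tokens))
    return tokens, history


native = list("MKTVRQER")
final, history = iterative_refinement(
    len(native), lambda toks: toy_denoiser(toks, native), num_cycles=5
)
recoveries = [sequence_recovery(step, native) for step in history]
```

In a real system the denoiser would be the frozen pLM plus structural adapter, conditioned on the backbone, and the native sequence would be used only for evaluation.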

[...] 85), novel and diverse for unconditional protein sequence generation, suggesting that DPLM well captures the underlying distribution of protein sequence data.

Evaluation of protein representation learning on predictive tasks
- DPLM is a superior protein sequence representation learner, outperforming masked LM (ESM2) and AR-LM, while performance can improve with scaling.

Conditional generation of DPLM for various needs
- Sequence conditioning (motif-scaffolding): DPLM can generate reasonable scaffolds for given functional motifs at a high success rate
- Cross-modal conditioning (inverse folding): DPLM yields sequences that can accurately fold into the given backbone structure
- Controllable generation towards desired preference (secondary-structure-guided protein sampling): DPLM enjoys plug-and-play programmability, steered to synthesize proteins that satisfy arbitrary user-defined secondary structure annotations w/ plug-and-play classifier guidance

Takeaways (DPLM paper)
- We introduce diffusion protein LM (DPLM), a versatile protein LM that is capable of both protein sequence generation and representation learning, as well as various needs of conditional generation, including sequence conditioning, cross-modal conditioning, and programmable generation with plug-and-play discrete classifier guidance.
- Potential future directions:
  (1) Exploring DPLM's conditional generation for wider applications.
  (2) DPLM can further benefit from best practices of cutting-edge technical advancement in the vastness of large language models (LLMs).
  (3) It is imperative to integrate protein structure modeling into DPLM. Developing a universal protein language model with the next-generation DPLM, which accounts for both sequence and structure, is a particularly promising avenue.

What's next? "GPT-4 moment" for multimodal protein foundation models?

Towards Unified Multimodal Protein Foundation Models
- Inverse folding: conditional sequence generation
- Folding: conditional structure generation
- Sequence generation, structure generation, sequence-structure co-design

Multimodal-DPLM: One Model Can Do Whatever You Need for Proteins
- Unconditional structure design: noise → structure
- Folding: sequence → structure
- Co-design: noise → (sequence, structure)
- Applications: e.g., designing symmetric oligomers
