How Generative AI Accelerates Protein Research. Zaixiang Zheng, ByteDance Research. (Slide deck, 87 pages.)
We're doing AI for Science at ByteDance Research

AI Protein Modeling & Design
- Learning Harmonic Molecular Representations on Riemannian Manifold. In ICLR 2023.
- On Pre-training Language Model for Antibody. In ICLR 2023.
- Structure-informed Language Models Are Protein Designers. In ICML 2023 (oral).
- Diffusion Language Models Are Versatile Protein Learners. In ICML 2024.
- Protein Conformation Generation via Force-Guided SE(3) Diffusion Models. In ICML 2024.
- Antigen-Specific Antibody Design via Direct Energy-based Preference Optimization. Preprint, 2024.

Small Molecule Design
- Regularized Molecular Conformation Fields. In NeurIPS 2022.
- Zero-Shot 3D Drug Design by Sketching and Generating. In NeurIPS 2022.
- Diffusion Models with Decomposed Priors for Structure-Based Drug Design. In ICML 2023.
- DecompOpt: Controllable and Decomposed Diffusion Models for Structure-based Molecular Optimization. In ICLR 2024.

Cryo-EM
- CryoSTAR: Leveraging Structural Prior and Constraints for Cryo-EM Heterogeneous Reconstruction. Preprint, 2023.

Highlighted projects
- LM-DESIGN: steering large protein LMs to design protein sequences as structure-conditioned sequence generative models (Structure-informed Language Models Are Protein Designers, ICML 2023, oral).
- DPLM: a versatile protein foundation model (Diffusion Language Models Are Versatile Protein Learners, ICML 2024).
- AbDPO: designing antibodies with energy-based DPO (Antigen-Specific Antibody Design via Direct Energy-based Preference Optimization, 2024, under review).
- DecompDiff: small molecule drug design (Diffusion Models with Decomposed Priors for Structure-Based Drug Design, ICML 2023).
- ConDiff: protein dynamic conformation generation with a physics-guided SE(3) diffusion model (Protein Conformation Generation via Force-Guided SE(3) Diffusion Models, ICML 2024).
- CryoStar: cryo-EM heterogeneous reconstruction (CryoSTAR, 2023, under review).

Outline
- Background: basics of generative AI (LLMs & diffusion); basics of proteins.
- Generative AI x Protein: LLMs & diffusion in AI for protein; AlphaFold & protein language models.
- Large-scale generative protein modeling & design at ByteDance Research: LM-DESIGN (sequence design for a given structure with protein LLMs); DPLM (a versatile protein foundation model with LLM + diffusion).
- One more thing: towards a next-gen multimodal protein foundation model?
Amazing things that generative AI can do
- AlphaFold learns protein folding; large LMs speak; vision AIs create art.

Deep generative modeling: learning to generate data
- "Creating noise from data is easy; creating data from noise is generative modeling." (Yang Song, of score-based SDEs)
- A rough taxonomy of deep generative models: implicit generative models (non-probabilistic) vs. likelihood-based models (probabilistic, with or without latent variables), including autoregressive models and diffusion models.

Autoregressive language models
- Data: sequences of discrete tokens. Model: the left-to-right factorization p(x) = prod_i p(x_i | x_<i). Learning goal: maximum likelihood estimation (MLE).
- AR-LMs generate data element by element.

Transformer: attention (over pairs) is all you need; trained with the same MLE objective.

Diffusion models: learning to generate by iterative denoising.

Notable milestones of generative AI: multimodal, all-in-one.
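The autoregressive factorization and MLE objective sketched above can be made concrete with a toy model. The fixed bigram table here is a hypothetical stand-in for a Transformer's softmax outputs; real training minimizes the mean NLL over a corpus by gradient descent.

```python
import numpy as np

# Toy autoregressive LM over a 4-token vocabulary.
# log p(x) = sum_i log p(x_i | x_<i); the "model" is a fixed bigram
# table standing in for a Transformer's per-step softmax.
VOCAB = 4
rng = np.random.default_rng(0)
logits = rng.normal(size=(VOCAB, VOCAB))                  # logits[prev, next]
probs = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)

def sequence_nll(tokens, bos=0):
    """Negative log-likelihood under the left-to-right factorization."""
    nll, prev = 0.0, bos
    for t in tokens:
        nll -= np.log(probs[prev, t])
        prev = t
    return nll

# MLE training would minimize the average of sequence_nll over the data.
print(sequence_nll([1, 2, 3]))
```

Generation simply reverses this: sample x_1 from p(. | bos), then x_2 from p(. | x_1), and so on, element by element.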
AI is revolutionizing structural biology

Protein: the central dogma of molecular biology
_Credit to Ellen Zhong: the following introductory slides on structural biology are mostly adapted from Ellen Zhong's keynote speech at the MLSB workshop.

Structural biology: the study of proteins and other biomolecules through their 3D structures
- All essential biological processes are carried out by proteins and protein complexes; many proteins are enzymes that catalyze chemical reactions.
- Structural hierarchy: primary sequence/chain (amino acids); secondary structures (alpha-helices, beta-pleated sheets); tertiary structures/folds; quaternary structures (two or more chains that interact, protein-protein interaction).

Protein data modalities: sequence, structure, function
- A protein is a sequence over 20 amino acids (AAs); in solvent it folds into a unique 3D spatial structure with minimal free energy.
- Structure determines protein function; protein folding (seq -> struct) traverses a conformational energy landscape.

Representations
- Sequence: 20 types of amino acids.
- Structure (mainly backbone): 3D XYZ coordinates; local reference frames (AF2 style: Ca coordinates + orientation); torsion angles; pairwise contact/distance maps.
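The backbone representations listed above are partly interconvertible; the simplest derived one is the pairwise Ca distance/contact map. A minimal sketch, using synthetic coordinates (not real PDB data) and the common 8 A contact threshold:

```python
import numpy as np

# Synthetic Ca coordinates for a 4-residue toy chain (angstroms).
# Consecutive Ca-Ca distances in real proteins are ~3.8 A.
ca = np.array([[0.0,  0.0, 0.0],
               [3.8,  0.0, 0.0],
               [3.8,  3.8, 0.0],
               [16.0, 0.0, 0.0]])

# Pairwise distance map via broadcasting: shape (L, L).
dist = np.linalg.norm(ca[:, None, :] - ca[None, :, :], axis=-1)

# Binarize with a common Ca-Ca contact threshold.
contact = dist < 8.0
print(contact.astype(int))
```

Distance maps are rotation- and translation-invariant, which is one reason they are a popular intermediate target for structure models.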
Atomic coordinates of protein 3D structures.

Protein modalities: sequence, structure, and in-between
- Folding: amino acid sequence -> protein 3D structure (AlphaFold, RoseTTAFold, ESMFold, etc.).
- Inverse folding, i.e., (structure-based) sequence design: protein 3D structure -> amino acid sequence.
- Example: PDB 1IGT (https://www.rcsb.org/structure/1IGT), sequence DIVLTQSPSSLSASLGDTITITCHASQNINVWLSWYQQKPGNIPKLLIYKASNLHTGVPSRFSGSGSGTGFTLTISSLQPEDIATYYCQQGQSYPLTFGGGT...

Designing protein sequence and structure as generative modeling problems
- Sequence design (inverse folding) as conditional sequence generation.
- Structure design (folding) as conditional structure generation.
- Sequence-structure co-design.

Discreteness in NLP and biology
- Goal: learning the joint probability of a sequence of discrete tokens; a factorization (w.r.t. the structure of the data) is needed, for protein languages just as for natural languages.
- The same Transformer + MLE recipe (attention over pairs) applies to proteins.
_Noelia Ferruz & Birte Höcker. 2022. Controllable protein design with language models. Nature Machine Intelligence.

Diffusion(-like) modeling for protein structure: AlphaFold
- AlphaFold 2: a solution to a 50-year-old grand challenge in biology.
_Highly accurate protein structure prediction with AlphaFold. John Jumper, et al. Nature. 2021.

Delve into AF: sequence-conditional structure generation.
Delve into AF: homologous retrieval-augmented generation (RAG).
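The "RAG" analogy above refers to AF2 conditioning on retrieved homologs (an MSA). A toy sketch of the retrieval step, ranking database sequences by percent identity to the query; real pipelines use profile-HMM search tools (e.g., jackhmmer/HHblits) over large sequence databases, not naive identity:

```python
# Toy homolog retrieval: score by fraction of identical aligned positions.
def identity(a, b):
    n = min(len(a), len(b))
    return sum(x == y for x, y in zip(a, b)) / n

def retrieve_homologs(query, database, k=2):
    # Return the k database sequences most similar to the query;
    # these would then be stacked into an MSA and fed to the model.
    return sorted(database, key=lambda s: identity(query, s), reverse=True)[:k]

db = ["MKTVRQERLK", "MKTVAQERLK", "GGGGGGGGGG", "MATVRQDRLK"]
print(retrieve_homologs("MKTVRQERLK", db))
```

The retrieved set supplies evolutionary covariation signal that the network exploits to infer residue-residue contacts.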
Delve into AF: encoder-decoder & Transformers.
Delve into AF: diffusion(-like) recycling / iterative refinement.

AlphaFold summary: AF = conditional structure generation with Transformer + RAG + diffusion(-like) refinement.

Protein sequence modeling with LLMs
- Recap: LLMs for natural language. A simple and universal law of scale: the larger, the merrier.
_Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, et al. Emergent Abilities of Large Language Models. Transactions on Machine Learning Research. 2022.
_Yao Fu, Hao Peng and Tushar Khot. How does GPT Obtain its Ability? Tracing Emergent Abilities of Language Models to their Sources. 2022. On Yao Fu's Notion.

Protein language models (pLMs): two commonly-used types
- Protein sequence encoders: predictive models for classification and regression. Formulation: pseudo-likelihood p(a_i | a_{j != i}) via MLM, DAE, etc. Instances (BERT-like): ESM-1b, ESM-2 series.
- Protein sequence decoders: generative models for learning distributions and synthesizing sequences. Formulation: likelihood p(a_i | a_{j < i}) via autoregressive/causal LMs. Instances (GPT-like): ProGen2, ProtGPT2.

ESM: Evolutionary Scale Modeling
- A BERT analog for proteins, learned with MLM (15% of positions selected; 80/10/10 corruption) on 250M sequences. ESM-1b: 650M parameters.
_Rives, A., Meier, J., Sercu, T., Goyal, S., Lin, Z., Liu, J., Guo, D., Ott, M., Zitnick, C.L., Ma, J., Fergus, R., et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. 2020.
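The "15%, 80/10/10" MLM recipe mentioned above can be sketched directly. This is a generic BERT-style corruption routine, not ESM's actual tokenizer or code:

```python
import random

# BERT-style MLM corruption: select ~15% of positions; of those,
# 80% -> <mask>, 10% -> a random amino acid, 10% kept unchanged.
AAS = list("ACDEFGHIKLMNPQRSTVWY")
MASK = "<mask>"

def corrupt(seq, rng, p_select=0.15):
    tokens, targets = list(seq), []
    for i in range(len(tokens)):
        if rng.random() < p_select:
            targets.append((i, tokens[i]))   # the model must recover these
            r = rng.random()
            if r < 0.8:
                tokens[i] = MASK
            elif r < 0.9:
                tokens[i] = rng.choice(AAS)  # random substitution
            # else: keep the original token (but still predict it)
    return tokens, targets

rng = random.Random(0)
corrupted, targets = corrupt("MKTVRQERLKSIVRILE", rng)
```

Training then maximizes the model's probability of the original residues at the selected positions, given the corrupted sequence; this is the pseudo-likelihood objective p(a_i | a_{j != i}) noted above.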
ProGen: the next ChatGPT for proteins?
- A GPT-like autoregressive model on protein sequences.
_Madani, A., Krause, B., Greene, E.R., Subramanian, S., Mohr, B.P., Holton, J.M., Olmos Jr, J.L., Xiong, C., Sun, Z.Z., Socher, R. and Fraser, J.S. Large language models generate functional protein sequences across diverse families. Nature Biotechnology. 2023.

ESM-2 series: scaling makes a difference
- Scaling is all you need, just as in LLMs for natural languages.
- Emergent abilities: structural awareness, with a phase transition at a certain scale threshold.
_Lin, Z., Akin, H., Rao, R., Hie, B., Zhu, Z., Lu, W., Smetanin, N., Verkuil, R., Kabeli, O., Shmueli, Y. and dos Santos Costa, A., 2022. Evolutionary-scale prediction of atomic level protein structure with a language model. bioRxiv.

ESMFold: protein folding using pLMs at scale
- pLMs at scale enable single-sequence structure prediction: ESM-2 + a structure module.
- Comparable to AF2, but needing no homologs and about 60x faster.

(LLM + Diffusion) x Protein: large-scale generative protein modeling & design
Structure-informed Language Models Are Protein Designers (LM-DESIGN)
Zaixiang Zheng(1*), Yifan Deng(2*), Dongyu Xue(1), Yi Zhou(1), Fei Ye(1) and Quanquan Gu(1). (1) ByteDance Research, (2) UW-Madison. ICML 2023 Oral.

An "NMT moment" for structure-based sequence design
- Inverse folding as conditional sequence generation (1IGT amino acid sequence <-> protein 3D structure); folding as conditional structure generation; plus structure design and sequence-structure co-design.
[Figure: progress of DL-based protein sequence design, sequence recovery vs. year; methods include Structured Transformer, GVP-GNN, ESM-IF, ProteinMPNN, PiFold, DenseCPD, GCA, LM-DESIGN; groups include ByteDance, Meta, U. Washington (David Baker), Westlake, MIT, Stanford, NYU; marker radius proportional to model scale.]

Structure-based protein sequence design / inverse folding
- Definition: find an amino acid sequence that can fold into a desired protein backbone structure, by learning a probabilistic model over a certain amount of protein structure-sequence data.
- Existing work: graph-to-sequence autoregressive modeling (StructTransformer [1], ProteinMPNN [2], ESM-IF [3], etc.).
_[1] Ingraham, et al. Generative models for graph-based protein design. In NeurIPS 2019.
_[2] Dauparas, et al. Robust deep learning-based protein sequence design using ProteinMPNN. Science 2022.
_[3] Hsu, et al. Learning inverse folding from millions of predicted structures. In ICML 2022.

Challenges and motivations
- Sequential evolutionary knowledge should be better considered and utilized: experimentally determined structures are limited (roughly 0.1% of known sequences have structures), while protein sequences are massive (millions to billions).
- Structurally non-deterministic regions are less informative and much harder.
- Left-to-right autoregressive models are not necessarily the best fit for spatially structured data like proteins.
- Large-scale protein language models can help: pLMs are strong sequence learners. Q: can pLMs be better structure-based protein designers?

Language models know sequences the best!
- Protein LMs (pLMs, e.g., ESM-1b/ESM-2), learned from the universe of massive protein sequences, have demonstrated emergent evolutionary knowledge that enables amazing capabilities.
_[ESM] Rives, et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences.
_[ESM-2/ESMFold] Lin, et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 2023.
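One standard way such masked pLMs expose their "evolutionary knowledge" is pseudo-log-likelihood scoring: mask each position in turn and sum the log-probability of the true residue. A minimal sketch; `toy_mlm` is a hypothetical stand-in for a real masked protein LM like ESM:

```python
import math

def toy_mlm(tokens, i):
    # Hypothetical "model": assigns higher probability to residues that
    # match the unmasked context (a crude proxy for learned statistics).
    context = [t for j, t in enumerate(tokens) if j != i and t != "_"]
    return lambda aa: (context.count(aa) + 1) / (len(context) + 20)

def pseudo_log_likelihood(seq):
    # Mask each position, score the true residue, and accumulate.
    score = 0.0
    for i, aa in enumerate(seq):
        masked = list(seq)
        masked[i] = "_"
        score += math.log(toy_mlm(masked, i)(aa))
    return score

# A higher score means the sequence looks more "natural" to the model.
print(pseudo_log_likelihood("AAAA") > pseudo_log_likelihood("ACDE"))
```

With a real pLM, the same scheme is used for mutation-effect prediction: compare the pseudo-log-likelihood of a variant against the wild type.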
LM-DESIGN: reprogramming pLMs as structure-conditioned sequence generative models
- Structural surgery: implanting a lightweight structural adapter into a strong pretrained pLM. We focus on BERT-like MLMs (e.g., ESM-1b/ESM-2) for their bidirectional receptive fields.
- LM-DESIGN = pretrained pLM as sequence decoder (strong sequence generative capability) + protein structure encoder (structure understanding) + structural adapter (structure-sequence aligner/translator).
- Training: conditional masked language modeling (CMLM) with the pLM frozen.
- Diffusion-like inference: full-sequence iterative refinement/denoising, recycling for T cycles (e.g., 5).
_Protein figure credit: RFDiffusion.
[Figure: architecture. A structure encoder (GNNs: ProteinMPNN, GVP, IPA, etc.) feeds a structural adapter (multi-head attention + FFN) inserted after the Transformer layers of the pLM sequence decoder (ESM series, etc.); example input/output sequences shown.]

LM-DESIGN improves SoTA results by a large margin (4%-12%)
- Non-AR modeling is a more proper probabilistic model for protein data.
- Data- and parameter-efficient: outperforming without any additional data, with only about 2% trainable parameters.
- Modularizable: further benefits from pretrained structure encoders.
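The diffusion-like iterative refinement inference described above can be sketched as a mask-predict loop: predict all positions in parallel, then repeatedly re-mask the least-confident predictions and re-predict. `toy_predictor` is a hypothetical stand-in; the real model is the structure-conditioned pLM, whose predictions also depend on the backbone:

```python
import random

MASK = "<mask>"

def iterative_refine(predictor, length, cycles=5, remask_frac=0.3):
    # predictor(seq) -> [(amino_acid, confidence), ...] per position.
    seq = [MASK] * length
    for _ in range(cycles):
        preds = predictor(seq)
        seq = [aa for aa, _ in preds]                    # fill everything
        k = int(remask_frac * length)
        worst = sorted(range(length), key=lambda i: preds[i][1])[:k]
        for i in worst:
            seq[i] = MASK                                # re-mask low confidence
    return [aa for aa, _ in predictor(seq)]              # final fill

# Toy predictor that always proposes the same residues with random confidence.
target = "MKTVRQERLK"
rng = random.Random(0)
toy_predictor = lambda seq: [(target[i], rng.random()) for i in range(len(seq))]
designed = iterative_refine(toy_predictor, len(target))
print("".join(designed))
```

Because every position is predicted in parallel at each cycle, decoding cost is a small constant number of forward passes, independent of sequence length, unlike left-to-right AR decoding.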
DPLM: Diffusion Language Models Are Versatile Protein Learners (ICML 2024)

Evaluation of unconditional protein sequence generation
- DPLM generates sequences that are highly structurally plausible (pLDDT > 85), novel and diverse, suggesting that DPLM captures the underlying distribution of protein sequence data well.

Evaluation of protein representation learning on predictive tasks
- DPLM is a superior protein sequence representation learner, outperforming masked-LM (ESM-2) and AR-LM baselines, and performance improves with scaling.

Conditional generation with DPLM for various needs
- Sequence conditioning (motif scaffolding): DPLM can generate reasonable scaffolds for given functional motifs at a high success rate.
- Cross-modal conditioning (inverse folding): DPLM yields sequences that accurately fold into the given backbone structure.
- Controllable generation towards desired preferences (secondary-structure-guided sampling): DPLM enjoys plug-and-play programmability, steered to synthesize proteins that satisfy arbitrary user-defined secondary structure annotations with plug-and-play classifier guidance.
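Plug-and-play classifier guidance in the discrete setting can be sketched as Bayesian reweighting of the LM's per-position distribution: p(x | c) is proportional to p_LM(x) * p(c | x)^gamma. Both distributions below are toy stand-ins (DPLM applies this idea inside its diffusion denoising steps, with a learned classifier):

```python
import numpy as np

# Toy per-position distributions over a 4-token vocabulary.
p_lm = np.array([0.7, 0.1, 0.1, 0.1])          # LM prior favors token 0
p_c_given_x = np.array([0.05, 0.9, 0.2, 0.2])  # classifier favors token 1

def guided(p_lm, p_c, gamma=1.0):
    # Bayes rule up to normalization: p(x|c) ∝ p_LM(x) * p(c|x)^gamma.
    # gamma trades off fluency (LM) against the desired attribute.
    w = p_lm * p_c ** gamma
    return w / w.sum()

print(guided(p_lm, p_c_given_x).argmax())
```

Setting gamma = 0 recovers the unguided LM; increasing gamma steers samples more aggressively toward the classifier's preferred label, at some cost to sequence naturalness.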
Takeaways: the DPLM paper
- We introduce the diffusion protein LM (DPLM), a versatile protein LM capable of both protein sequence generation and representation learning, as well as various kinds of conditional generation: sequence conditioning, cross-modal conditioning, and programmable generation with plug-and-play discrete classifier guidance.
- Potential future directions: (1) exploring DPLM's conditional generation for wider applications; (2) DPLM can further benefit from best practices and cutting-edge technical advances from the vastness of large language models (LLMs); (3) it is imperative to integrate protein structure modeling into DPLM; developing a universal protein language model with a next-generation DPLM that accounts for both sequence and structure is a particularly promising avenue.

What's next? A "GPT-4 moment" for multimodal protein foundation models?

Towards unified multimodal protein foundation models
- Sequence generation (inverse folding: conditional sequence generation), structure generation (folding: conditional structure generation), and sequence-structure co-design, spanning amino acid sequences and protein 3D structures.

Multimodal-DPLM: one model can do whatever you need for proteins
- Unconditional structure design: noise -> structure.
- Folding: sequence -> structure.
- Co-design: noise -> (sequence, structure).
- Applications: e.g., designing symmetric oligomers.