Text to Audio Generation and Editing with Latent Diffusion Models
Yuancheng Wang

Text-to-Audio Generation
What is text-to-audio generation: generating sounds that are semantically in line with text descriptions.
Some examples:
A group of sheep are baaing. (animals)
Water flowing down a river. (environment)
Piano and violin play. (music)
A cat meowing and a young female speaking. (human speech, animals)

Text-to-Audio Generation
Some methods:
DiffSound [1]: a discrete diffusion model; uses a VQ-VAE to discretize the mel-spectrogram.
[1] DiffSound: Discrete diffusion model for text-to-sound generation. arXiv preprint arXiv:2207.09983, 2022.
Examples: 1. a horse galloping; 2. piano and violin plays; 3. drums and music playing with a man speaking.

Text-to-Audio Generation
Some methods:
AudioGen [2]: autoregressive, decoder-only; discretizes the waveform directly.
[2] AudioGen: Textually guided audio generation. arXiv preprint arXiv:2209.15352, 2022.
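Both DiffSound and AudioGen operate on discrete tokens rather than on continuous spectrogram or waveform values. As a rough illustration only (not the authors' implementations), the sketch below shows the nearest-neighbor codebook lookup a VQ-style encoder would use to turn mel-spectrogram frames into token indices; the codebook size, frame dimension, and tensor shapes are illustrative assumptions.

```python
# Minimal sketch of VQ-style discretization of a mel-spectrogram (illustrative only).
# Shapes and codebook size are assumptions, not taken from DiffSound or AudioGen.
import torch

def quantize_mel(mel_frames: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    """Map each mel frame to the index of its nearest codebook vector.

    mel_frames: (T, D) continuous frames (or features from a VQ-VAE encoder)
    codebook:   (K, D) learned code vectors
    returns:    (T,)   discrete token indices that a discrete diffusion or
                       autoregressive model can then generate over
    """
    dists = torch.cdist(mel_frames, codebook)  # (T, K) Euclidean distances
    return dists.argmin(dim=-1)                # (T,) nearest-code indices

# Toy usage: 624 frames of an 80-bin mel-spectrogram, 1024 codes.
tokens = quantize_mel(torch.randn(624, 80), torch.randn(1024, 80))
print(tokens.shape)  # torch.Size([624])
```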
Text-to-Audio with Latent Diffusion Model
Latent diffusion based methods: Make-an-Audio [3] and AudioLDM [4].
[3] Make-an-Audio: Text-to-audio generation with prompt-enhanced diffusion models. arXiv preprint arXiv:2301.12661, 2023.
[4] AudioLDM: Text-to-audio generation with latent diffusion models. arXiv preprint arXiv:2301.12503, 2023.
Examples: 1. a horse galloping; 2. piano and violin plays; 3. drums and music playing with a man speaking.

Text-to-Audio with Latent Diffusion Model
Some challenges:
Data: the most commonly used audio-caption dataset, AudioCaps [5], has only about 50K samples; AudioSet [6] has 2M audio-label pairs.
Variable-length and higher-quality audio generation.
[5] AudioCaps: Generating captions for audios in the wild. In Proceedings of NAACL-HLT, 2019.
[6] Audio Set: An ontology and human-labeled dataset for audio events. In ICASSP, 2017.

Text-to-Audio with Latent Diffusion Model
Some examples from our text-to-audio model:
A horse galloping.
Someone typing on a computer.
A young girl giving a speech while a group of people applaud.
Jazz music.
Wind blows and insects buzz while birds chirp.

Text-to-Audio with Latent Diffusion Model

Text-Guided Audio Editing
Audio editing tasks:
Speed, pitch
Style transform (human speech, music)
Singing or speech voice conversion
Inpainting
Super-resolution
Adding
Dropping
Replacement
Audio Editing by Following Human Instructions
Audio editing by following human instructions:
Inpainting: "Inpaint the audio!", "Inpaint: a baby is crying."
Super-resolution: "Increase resolution!"
Adding: "Add a jazz music in the background."
Dropping: "Drop the sound of birds buzzing."
Replacement: "Replace the sound of guitar with drum."

Zero-Shot Text-Guided Audio Editing
Diffusion based zero-shot editing:
SDEdit [7]: uses a pre-trained diffusion model. Add noise to the input audio ("A bird is singing"), then use the target caption of the audio to guide denoising ("A bird is singing, while a person is whistling").
[7] SDEdit: Guided image synthesis and editing with stochastic differential equations. arXiv preprint arXiv:2108.01073, 2021.

Zero-Shot Text-Guided Audio Editing
What are the problems with SDEdit?
If we want to add "a person whistling" to an input audio with caption "A bird is singing", we need to use a new target caption, "A bird is singing, while a person is whistling", to guide denoising. Without knowing the caption of the input audio, we would like to use only instructions like "Add a person whistling in the background" to edit the input audio.

Zero-Shot Text-Guided Audio Editing
What are the problems with SDEdit?
It cannot ensure good editing effects.
It can erroneously modify audio segments that do not require editing.
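For reference, the SDEdit procedure discussed above can be summarized in a few lines: partially noise the input, then denoise under the full target caption. The sketch below is a schematic outline under assumed interfaces (predict_noise, add_noise, step, and the editing strength are illustrative), not SDEdit's released code.

```python
# Schematic sketch of SDEdit-style zero-shot editing (interfaces are assumed).
import torch

def sdedit(latent_in: torch.Tensor, target_caption_emb: torch.Tensor,
           model, scheduler, strength: float = 0.6) -> torch.Tensor:
    """Edit `latent_in` by re-noising it partway and denoising under the target caption.

    strength in (0, 1]: how much noise to inject; larger = more freedom to change
    the audio, smaller = more faithful to the input (and less able to edit).
    """
    t_start = int(strength * scheduler.num_steps)        # intermediate timestep
    noise = torch.randn_like(latent_in)
    z = scheduler.add_noise(latent_in, noise, t_start)   # partially noised input

    # Denoise from t_start back to 0, guided only by the full target caption --
    # SDEdit has no notion of an edit instruction or of which regions to keep.
    for t in reversed(range(t_start)):
        eps = model.predict_noise(z, t, target_caption_emb)
        z = scheduler.step(eps, t, z)
    return z
```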
Audio Editing by Following Instructions with Latent Diffusion Models
Overview of AUDIT:
paper: https://arxiv.org/abs/2304.00830
demo: https://audit-demo.github.io/

Audio Editing Samples
Check the link: https://audit-demo.github.io/

Methods: Latent Diffusion Model System Overview
A high-level overview of the system: AUDIT consists of a VAE, a T5 text encoder, and a diffusion network. It accepts the mel-spectrogram of the input audio and the edit instruction as conditional inputs and generates the edited audio as output.
[System diagram: VAE encoder -> latent diffusion, conditioned on the T5-encoded instruction (e.g. "Drop the sound of a duck quacking") -> VAE decoder -> vocoder.]
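To make the data flow concrete, here is a minimal sketch of an inference pass through such a system. The module names (vae, t5, unet, vocoder), the sampling loop, and the way the instruction embedding is passed to the denoiser are illustrative assumptions about a generic latent-diffusion editor, not AUDIT's actual code; the mel and latent shapes follow the figures given later in the slides.

```python
# Minimal sketch of an instruction-guided latent-diffusion editing pass.
# Module names and the sampling loop are assumptions, not AUDIT's released code.
import torch

@torch.no_grad()
def edit_audio(mel_in, instruction, vae, t5, unet, scheduler, vocoder):
    """mel_in: (1, 1, 80, 624) mel-spectrogram of the input audio."""
    z_in = vae.encode(mel_in)            # e.g. (1, 4, 10, 78) input-audio latent
    text = t5(instruction)               # embedding of the edit instruction

    z_t = torch.randn_like(z_in)         # start the edited latent from pure noise
    for t in reversed(range(scheduler.num_steps)):
        # Condition on the input audio by channel-concatenating its latent with the
        # current noisy latent; the instruction embedding is an extra conditioning input.
        eps = unet(torch.cat([z_t, z_in], dim=1), t, text)
        z_t = scheduler.step(eps, t, z_t)

    mel_out = vae.decode(z_t)            # edited mel-spectrogram
    return vocoder(mel_out)              # edited waveform
```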
Methods: Generating Triplet Training Data for Each Editing Task
Data generation for different audio editing tasks:
- Inpainting. Template: "Inpaint" / "Inpaint: {}". Input audio: a cat meowing with a randomly masked segment; output audio: the unmasked clip. Instruction: "Inpaint".
- Super-resolution. Template: "Increase resolution". Input audio: a down-sampled clip of a bird singing; output audio: the full-resolution clip. Instruction: "Increase resolution".
- Adding. Template: "Add {} in the background". Input audio: bell ringing; output audio: bell ringing mixed with a baby crying. Instruction: "Add baby crying in the background".
- Dropping. Template: "Drop {}". Input audio: a car engine accelerating mixed with a dog barking; output audio: the car engine only. Instruction: "Drop dog barking".
- Replacement. Template: "Replace {} with {}". Input audio: someone clapping; output audio: the clapping replaced by the sound of a guitar. Instruction: "Replace clapping with guitar".
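A minimal sketch of how such (instruction, input audio, output audio) triplets could be assembled for the adding and dropping tasks from two labeled source clips. The template strings follow the slide's examples, but the mixing recipe (equal gain, same length and sample rate) and the label-to-phrase handling are assumptions.

```python
# Sketch of triplet construction for the "adding" and "dropping" tasks.
# Equal-gain mixing and the label handling are assumptions.
import numpy as np

def make_add_triplet(base_wav: np.ndarray, event_wav: np.ndarray, event_label: str):
    """Adding: input = base clip, output = base clip + event, instruction from a template."""
    mixed = base_wav + event_wav                       # both clips: 10 s, same sample rate
    instruction = f"Add {event_label} in the background"
    return instruction, base_wav, mixed

def make_drop_triplet(base_wav: np.ndarray, event_wav: np.ndarray, event_label: str):
    """Dropping: input = mixture, output = base clip with the event removed."""
    mixed = base_wav + event_wav
    instruction = f"Drop {event_label}"
    return instruction, mixed, base_wav

# Example with the slide's labels (random noise standing in for real clips):
sr, dur = 16000, 10
engine, barking = np.random.randn(sr * dur), np.random.randn(sr * dur)
instr, x_in, x_out = make_drop_triplet(engine, barking, "dog barking")
print(instr)  # Drop dog barking
```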
Methods: Generating Triplet Training Data for Each Editing Task
Data processing and datasets:
All audio clips have a length of 10 seconds.
We convert the input audio into an 80 × 624 mel-spectrogram (4 × 10 × 78 in the latent space).
What are the advantages of our method?
1. We generate triplet data (instruction, input audio, output audio) to train a text-guided audio editing model instead of performing zero-shot editing, which ensures good editing quality.
2. We directly use the input audio as a condition for supervised training of our diffusion model, allowing the model to automatically learn to preserve the parts that do not need to be edited. Specifically, we concatenate the latent representation of the input audio with the noisy latent of the output audio on the channel dimension and feed them into the latent diffusion model, so that the model can see the input audio during both training and inference (see the sketch after this list).
3. Instead of using a full description of the output audio, we use human instructions as the text input, which is more in line with real-world applications.
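A minimal sketch of the conditioning described in point 2, assuming a standard noise-prediction (epsilon) objective. Only the channel-dimension concatenation of the two latents comes from the slides; the module names and the loss are assumptions.

```python
# Sketch of one training step with the input-audio latent as a channel-wise condition.
# The epsilon-prediction loss and module names are assumptions.
import torch
import torch.nn.functional as F

def training_step(z_in, z_out, text_emb, unet, scheduler):
    """z_in:  latent of the input audio, e.g. (B, 4, 10, 78)
       z_out: latent of the edited (target) audio, same shape
       text_emb: encoded edit instruction."""
    t = torch.randint(0, scheduler.num_steps, (z_out.shape[0],), device=z_out.device)
    noise = torch.randn_like(z_out)
    z_t = scheduler.add_noise(z_out, noise, t)          # noisy target latent

    # Channel-wise concatenation: the denoiser always sees the clean input-audio
    # latent next to the noisy target latent, so it can learn to copy unedited regions.
    eps_pred = unet(torch.cat([z_t, z_in], dim=1), t, text_emb)
    return F.mse_loss(eps_pred, noise)
```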
Evaluation Results
Evaluation results of text-to-audio generation.

Evaluation Results
Evaluation results of the adding, dropping, and replacement tasks.

Evaluation Results
Evaluation results of the inpainting and super-resolution tasks.

Case Study
Some case studies:
The caption of the input audio is "A person is typing on a computer". While both AUDIT and the baseline method generate semantically correct results, the result generated by AUDIT is more natural and contextual.
The caption of the input audio is "The sound of machine gun", and the editing target is adding a bell ringing at the beginning. AUDIT performs the edit accurately in the correct region without modifying audio segments that do not need to be edited.

Future Work
Inference acceleration.
We will explore more audio editing tasks with our framework and achieve more precise control for audio editing.

Thanks for listening!