Speech Signal Improvement in Real-time Communication
Yannan Wang
Tencent Ethereal Audio Lab, Tencent, Shenzhen, China

Outline
1. Introduction
2. Speech Signal Improvement
3. Future work

Background
Real-time communication (RTC) systems are widely used:
- Teleconferencing systems
- Video calls
Factors limiting the speech quality of current RTC systems:
- Device robustness
- Acoustic capturing
- Noise/reverberation corruption
- Interfering speakers
- Network congestion

Introduction: device robustness

Outline
1. Introduction
2. Speech Signal Improvement
   I. Enhancement
   II. Restoration
3. Future work

Speech denoising
Typical interferences: keyboard typing, rain, WeChat message notifications, placing a cup on the desk, coughing.

Dereverberation
Reflections from the room's walls, ceiling, floor, and objects superimpose on the direct sound, degrading speech quality and intelligibility.
Drawbacks of traditional methods:
- It is hard to accurately estimate the nonlinear mapping between reverberant and clean speech
- The algorithms require considerable prior information and converge slowly
- Only part of the reverberation is removed, so the improvement is limited

Speaker extraction
Enrollment with user awareness vs. enrollment without user awareness.

Related works: our previous winning model, TEA-PSE
(Yukai Ju et al., ASLP@NPU & Tencent Ethereal Audio Lab, China)
- The 1st-stage network estimates the target speaker's magnitude and reuses the noisy phase
- The 2nd-stage network estimates the residual real and imaginary parts
- The speaker embedding is combined by simple concatenation

TEA-PSE 3.0: contributions
- Incorporates a residual LSTM after the squeezed temporal convolution network (S-TCN) to enhance sequence-modeling capability
- Introduces a local-global representation (LGR) structure to boost speaker information extraction
- Uses a multi-STFT resolution loss to effectively capture the time-frequency characteristics of the speech signals
- Employs retraining methods based on a freeze-training strategy to fine-tune the system
TEA-PSE 3.0 ranks 1st in both track 1 and track 2 of the ICASSP 2023 DNS Challenge.

TEA-PSE 3.0: network structure
- Same dual-stage framework as TEA-PSE
- A residual LSTM is added after every S-TCN module to further enhance sequence modeling
- Local-global representation (LGR) is adopted for better speaker information extraction

TEA-PSE 3.0: loss function
- Stage-wise losses for the 1st, 2nd, and 3rd training stages
- MAG-Net and COM-Net are trained sequentially as described above; the pre-trained models are then loaded and the entire system is retrained with L2

Experimental setup
- Dataset: ICASSP DNS 2022 (750 h speech, 181 h noise)
- Data augmentation: reverberation and different interference scenarios; training data are generated on the fly
- SNR range [-5, 20] dB, SIR range [-5, 20] dB, mixture scale range [-35, -15] dBFS
- STFT configuration: 20 ms frame length, 10 ms frame shift, 1024 FFT points
- Multi-STFT resolution loss: three settings with FFT lengths {512, 1024, 2048}, window lengths {480, 960, 1920}, and frame shifts {240, 480, 960}

Experimental results and analysis
- TEA-PSE 3.0 obtains the highest BAK and OVRL scores
- Compared with unprocessed speech, the SIG and WAcc of the submitted model decrease

Demo: noisy input vs. TEA-PSE 3.0 output (two examples)
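The multi-STFT resolution loss above averages a distance over several STFT analysis settings. Below is a minimal NumPy sketch assuming a common formulation (spectral convergence plus log-magnitude distance, which may differ in detail from the exact loss used in TEA-PSE 3.0), with the three resolutions listed in the setup:

```python
import numpy as np

# Three (FFT length, window length, frame shift) settings from the setup.
RESOLUTIONS = [(512, 480, 240), (1024, 960, 480), (2048, 1920, 960)]

def stft_mag(x, n_fft, win_len, hop):
    """Magnitude STFT with a Hann window; frames are zero-padded to n_fft."""
    win = np.hanning(win_len)
    n_frames = 1 + (len(x) - win_len) // hop
    frames = np.stack([x[i * hop : i * hop + win_len] * win
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, n=n_fft, axis=-1))

def multi_stft_loss(est, ref, eps=1e-8):
    """Spectral convergence + log-magnitude distance, averaged over resolutions."""
    total = 0.0
    for n_fft, win_len, hop in RESOLUTIONS:
        e = stft_mag(est, n_fft, win_len, hop)
        r = stft_mag(ref, n_fft, win_len, hop)
        sc = np.linalg.norm(r - e) / (np.linalg.norm(r) + eps)   # spectral convergence
        lm = np.mean(np.abs(np.log(r + eps) - np.log(e + eps)))  # log-magnitude term
        total += sc + lm
    return total / len(RESOLUTIONS)
```

Using several resolutions keeps the loss sensitive both to fine temporal detail (short windows) and to narrowband spectral structure (long windows).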
Speech Signal Improvement, Part II: Restoration

Motivation
Speech restoration modules:
- Residual noise components and artifacts still exist
- They affect the perceptual quality of the speech signal
Noise reduction modules:
- They distort the degraded speech signal
- This distortion makes restoring the desired speech signal harder

Methodology: Speech Signal Improvement Network (SSI-Net)
SSI-Net cascades a restoration module and an enhancement module: the restoration module (TRGAN) comes first, followed by STFT analysis, the enhancement module (MTFAA-Lite), and ISTFT resynthesis.
Restoration module (TRGAN):
- Speech distortion restoration
- Bandwidth expansion
- Preliminary denoising and dereverberation

TRGAN design:
- Time-domain mapping-based generator with a pseudo quadrature mirror filter bank (PQMF) for sub-band decomposition; phase information is utilized
- Discriminators: multi-resolution frequency discriminators (Liu & Qian, 2021) plus our proposed multi-band discriminators

Liu Z., Qian Y. Basis-MelGAN: Efficient Neural Vocoder Based on Audio Decomposition. arXiv preprint arXiv:2106.13419, 2021.

Enhancement module (MTFAA-Lite):
- Components: amplitude/phase encoder, ERB merging and ERB splitting, stacked frequency-attention and Conv-2D blocks, mask estimation, and signal resynthesis
- Eliminates residual noise components and artifacts
- A lite version of MTFAA-Net (Zhang et al., 2022): it retains the frequency down-sampling, frequency up-sampling, and T-F convolution modules, and drops the T-attention branch of the axial self-attention, which has high time complexity

Zhang G., Wang C., Yu L., Wei J. Multi-Scale Temporal Frequency Convolutional Network with Axial Attention for Multi-Channel Speech Enhancement. In ICASSP 2022 (pp. 9206-9210). IEEE, 2022.
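The ERB merging/splitting steps can be illustrated with a triangular filterbank on the ERB-rate scale: merging pools linear STFT bins into ERB-spaced bands, and splitting maps band values back to bins. This is a sketch under assumed choices (64 bands, triangular weights, transpose-based splitting), not the exact matrices of MTFAA-Net:

```python
import numpy as np

def hz_to_erb_scale(f):
    """Hz to ERB-number (Cam) scale."""
    return 21.4 * np.log10(1 + 0.00437 * f)

def erb_to_hz(e):
    """Inverse of hz_to_erb_scale."""
    return (10 ** (e / 21.4) - 1) / 0.00437

def erb_filterbank(n_bins, sr, n_bands):
    """Triangular filters with centers equally spaced on the ERB scale."""
    freqs = np.linspace(0, sr / 2, n_bins)
    edges = erb_to_hz(np.linspace(0, hz_to_erb_scale(sr / 2), n_bands + 2))
    fb = np.zeros((n_bands, n_bins))
    for b in range(n_bands):
        lo, mid, hi = edges[b], edges[b + 1], edges[b + 2]
        rise = (freqs - lo) / (mid - lo + 1e-9)
        fall = (hi - freqs) / (hi - mid + 1e-9)
        fb[b] = np.clip(np.minimum(rise, fall), 0, None)
    fb /= fb.sum(axis=1, keepdims=True) + 1e-9   # each band averages its bins
    return fb

n_bins, n_bands = 513, 64                         # e.g. 1024-point FFT, 16 kHz
merge = erb_filterbank(n_bins, 16000, n_bands)    # (64, 513): ERB merging
split = merge.T / (merge.T.sum(axis=1, keepdims=True) + 1e-9)  # ERB splitting
```

Working on 64 ERB bands instead of 513 linear bins shrinks the frequency axis that the attention and convolution blocks must process, which is one motivation for this kind of merging.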
Experiments: setup
- 100,000 room impulse responses (RIRs) simulated with the image method
- Clean speech and noise: subsets from the DNS Challenge (Dubey et al., 2022) plus some private datasets
- A simulated 1500-hour dataset covering coloration, discontinuity, loudness, noise, reverberation, and codec artifacts

Dubey H., Gopal V., Cutler R., et al. ICASSP 2022 Deep Noise Suppression Challenge. In ICASSP 2022 (pp. 9271-9275). IEEE, 2022.

Experiments: evaluation on the SSI Challenge blind test set
Table 1: partial results of the multi-dimensional subjective test
- SSI-Net yields a significant improvement in all metrics
- It effectively alleviates the main difficulties: noise, coloration, discontinuity, loudness, and reverberation

More detailed results are available on the website: https:/
Results for Track 1
Results for Track 2
Official competition results: https:/
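The on-the-fly mixture simulation used in both experimental setups reduces to scaling an interference source so the speech-to-interference power ratio hits a target drawn from the stated range. A minimal sketch (the reverberation, loudness, and codec stages are omitted; function names are illustrative):

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db, eps=1e-12):
    """Scale `noise` so that speech power / noise power equals snr_db, then mix."""
    ps = np.mean(speech ** 2)
    pn = np.mean(noise ** 2)
    gain = np.sqrt(ps / (pn * 10 ** (snr_db / 10) + eps))
    return speech + gain * noise

rng = np.random.default_rng(7)
speech = rng.standard_normal(16000)          # 1 s of "speech" at 16 kHz
noise = rng.standard_normal(16000)           # interference source
snr = rng.uniform(-5, 20)                    # per-mixture SNR, as in the setup
mix = mix_at_snr(speech, noise, snr)
```

Drawing a fresh SNR (and, analogously, SIR and mixture level) per example is what makes the on-the-fly generation an effectively unlimited augmentation stream.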