Progress in Multi-Speaker Diarization: Technology and Applications
洪青阳
Collaborators: 余洪涌, 姜跃猛, 李朝阳, 王捷, 李琳
Intelligent Speech Lab, Xiamen University, March 2024

Outline
1. Research Background
2. Industrial Modular System
3. Improvements
4. Deployed Applications

1. Research Background

Multi-speaker diarization (speaker diarization): given a recording in which several people speak in turn, the system must determine who is speaking in each time interval. The diarization system takes audio as input and outputs segmentation information.
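In practice the segmentation information is a list of (start, duration, speaker) turns; the RTTM format used in DIHARD-style evaluations is a common serialization. A minimal sketch with made-up segment values:

# Hypothetical diarization output: (start_sec, duration_sec, speaker) turns.
segments = [(0.0, 3.2, "spk1"), (3.2, 2.1, "spk2"), (4.8, 1.5, "spk1")]

# Serialize to RTTM "SPEAKER" records; overlapping turns (like the last two
# here) are simply emitted as overlapping time intervals.
for start, dur, spk in segments:
    print(f"SPEAKER rec1 1 {start:.2f} {dur:.2f} <NA> <NA> {spk} <NA> <NA>")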
1. Research Background: Application Scenarios

Application scenarios: meeting minutes, multi-speaker transcription, intelligent customer service, recording quality inspection, etc.
Terminal devices: smartphones, personal computers, voice recorders.
Vendors shipping support: iFLYTEK (smart office notebook), Huawei (AI meeting minutes), 声云 (speech transcription), etc.

1. Research Background: Architectures

Two families of architectures: modular and end-to-end.
Research trend: from simple scenarios toward complex scenarios.
[Timeline figure, 2000-2023, of competitions and datasets: Rich Transcription (RT), CALLHOME, AMI, MIXER6, DIHARD (I, II, III), CHiME-6, VoxSRC (2020-2023), M2MeT, AISHELL-4, AliMeeting, M2MeT 2.0, CHiME-7.]
Challenges: noise interference, unknown number of speakers, overlapping speech, etc.
Application directions: offline → online, single microphone → microphone array, adaptation to new scenarios.

1. Research Background: Modular Systems

Clustering methods: AHC [1], SC [2,3], VB/VBx [4,5], UIS-RNN [6], DNC [7].

[1] K. C. Gowda and G. Krishna, "Agglomerative Clustering Using the Concept of Mutual Nearest Neighbourhood," Pattern Recognition, vol. 10, pp. 105-112, 1978.
[2] U. von Luxburg, "A tutorial on spectral clustering," Statistics and Computing, vol. 17, pp. 395-416, 2007.
[3] T. Park, K. J. Han, M. Kumar, and S. S. Narayanan, "Auto-tuning Spectral Clustering for Speaker Diarization Using Normalized Maximum Eigengap," IEEE Signal Processing Letters, vol. 27, pp. 381-385, 2020.
[4] M. Diez, L. Burget, S. Wang, J. Rohdin, and H. Cernocky, "Bayesian HMM based x-vector Clustering for Speaker Diarization," Interspeech, 2019, pp. 346-350.
[5] M. Diez, L. Burget, F. Landini, and J. Cernocky, "Analysis of Speaker Diarization based on Bayesian HMM with Eigenvoice Priors," IEEE/ACM Transactions on Audio, Speech and Language Processing, vol. 28, pp. 355-368, 2020.
[6] A. Zhang, Q. Wang, Z. Zhu, J. Paisley, and C. Wang, "Fully Supervised Speaker Diarization," ICASSP, 2019.
[7] Q. J. Li, F. L. Kreyssig, C. Zhang, and P. C. Woodland, "Discriminative Neural Clustering for Speaker Diarisation," IEEE Spoken Language Technology Workshop (SLT), 2021.

1. Research Background: End-to-End Systems

EEND [1]: end-to-end model based on Bi-LSTM.
SA-EEND [2]: end-to-end model based on a Transformer encoder.
EDA-EEND [3]: EEND variant that can predict the number of speakers.
TS-VAD [4]: target-speaker voice activity detection model.

[1] Y. Fujita, N. Kanda, S. Horiguchi, K. Nagamatsu, and S. Watanabe, "End-to-end Neural Speaker Diarization with Permutation-free Objectives," Interspeech, 2019, pp. 4300-4304.
[2] Y. Fujita, N. Kanda, S. Horiguchi, Y. Xue, K. Nagamatsu, and S. Watanabe, "End-to-End Neural Speaker Diarization with Self-Attention," IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2019, pp. 296-303.
[3] S. Horiguchi, Y. Fujita, S. Watanabe, Y. Xue, and K. Nagamatsu, "End-to-end speaker diarization for an unknown number of speakers with encoder-decoder based attractors," Interspeech, 2020, pp. 269-273.
[4] I. Medennikov, M. Korenevsky, et al., "Target-speaker Voice Activity Detection: a Novel Approach for Multi-speaker Diarization in a Dinner Party Scenario," arXiv, vol. abs/2005.07272, 2020.
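The permutation-free objective behind EEND [1] scores the network's per-frame speaker-activity outputs against every permutation of the reference speakers and keeps the cheapest one, since the order of the output speakers is arbitrary. A minimal numpy sketch of that idea (shapes and values are illustrative, not taken from the paper's code):

import itertools
import numpy as np

def permutation_free_bce(probs, labels):
    """probs, labels: (T frames, S speakers) arrays; probs in (0, 1),
    labels in {0, 1}. Returns the BCE under the best speaker permutation."""
    eps = 1e-7
    T, S = labels.shape
    best = np.inf
    for perm in itertools.permutations(range(S)):
        p = probs[:, perm]
        bce = -np.mean(labels * np.log(p + eps) + (1 - labels) * np.log(1 - p + eps))
        best = min(best, bce)
    return best

# Two speakers, three frames: the prediction matches the labels up to a swap
# of the speaker columns, so the permutation-free loss stays small.
labels = np.array([[1, 0], [1, 1], [0, 1]], dtype=float)
probs = np.array([[0.1, 0.9], [0.8, 0.9], [0.9, 0.2]])
print(permutation_free_bce(probs, labels))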
1. Research Background: Summary of Clustering Algorithms

Algorithm | Training     | Input feature     | Overlap detection | Speaker-count prediction
AHC       | unsupervised | x-vector          | not supported     | threshold
VB        | unsupervised | i-vector          | not supported     | initialization tuning
VBx       | unsupervised | x-vector          | not supported     | initialization tuning
SC        | unsupervised | x-vector          | not supported     | threshold / NME
UIS-RNN   | supervised   | d-vector          | not supported     | suited to 2 speakers
DNC       | supervised   | d-vector          | supported         | output nodes
EEND      | supervised   | acoustic features | supported         | output nodes
TS-VAD    | supervised   | i-vector          | supported         | output nodes
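The "threshold / NME" entry for SC in the table above refers to estimating the speaker count from the spectrum of the graph Laplacian: the position of the largest eigengap indicates the number of clusters. A rough sketch of the eigengap part only (the affinity-matrix construction and the full NME auto-tuning of Park et al. are omitted):

import numpy as np

def estimate_num_speakers(affinity, max_speakers=8):
    """Estimate the cluster count from a symmetric N x N affinity matrix
    via the maximum eigengap of the normalized graph Laplacian."""
    d = affinity.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-10)))
    laplacian = np.eye(len(affinity)) - d_inv_sqrt @ affinity @ d_inv_sqrt
    eigvals = np.sort(np.linalg.eigvalsh(laplacian))  # ascending order
    gaps = np.diff(eigvals[: max_speakers + 1])
    return int(np.argmax(gaps) + 1)  # largest gap after the k-th eigenvalue -> k clusters

# Two well-separated blocks -> the largest eigengap sits after the 2nd eigenvalue.
A = np.array([[1.0, 0.9, 0.0, 0.0],
              [0.9, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.8],
              [0.0, 0.0, 0.8, 1.0]])
print(estimate_num_speakers(A))  # -> 2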
Online version: research has mainly focused on the EEND [1,2] and UIS-RNN [3,4] frameworks.
Microphone-array version: multi-channel TS-VAD [5] or joint front-end/back-end optimization.
Specific scenarios: different strategies for different scenarios [6].

[1] Y. Xue, S. Horiguchi, Y. Fujita, S. Watanabe, P. Garcia, and K. Nagamatsu, "Online end-to-end neural diarization with speaker-tracing buffer," IEEE Spoken Language Technology Workshop (SLT), 2021, pp. 841-848.
[2] E. Han, C. Lee, and A. Stolcke, "BW-EDA-EEND: Streaming end-to-end neural speaker diarization for a variable number of speakers," ICASSP, 2021.
[3] E. Fini and A. Brutti, "Supervised online diarization with sample mean loss for multi-domain data," ICASSP, 2020, pp. 7134-7138.
[4] X. Wan, K. Liu, and H. Zhou, "Online speaker diarization equipped with discriminative modeling and guided inference," Interspeech, 2021.
[5] I. Medennikov, M. Korenevsky, et al., "Target-speaker Voice Activity Detection: a Novel Approach for Multi-speaker Diarization in a Dinner Party Scenario," arXiv, vol. abs/2005.07272, 2020.
[6] Y.-X. Wang, J. Du, M.-K. He, S.-T. Niu, L. Sun, and C.-H. Lee, "Scenario-dependent speaker diarization for the DIHARD-III challenge," Interspeech, 2021.
2. Industrial Modular System

2.1 Audio segmentation
Function: turn diarization into a clustering problem by cutting the audio into short segments.

2.2 Speaker embedding extraction
Function: extract a segment-level speaker representation: i-vector, d-vector, or x-vector.

2.3 Clustering
Function: group the segments belonging to the same speaker, using agglomerative hierarchical clustering (AHC).

K. C. Gowda and G. Krishna, "Agglomerative Clustering Using the Concept of Mutual Nearest Neighbourhood," Pattern Recognition, vol. 10, pp. 105-112, 1978.
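A minimal sketch of the AHC step, assuming segment embeddings have already been extracted; scipy's hierarchy module does the linkage, and the distance threshold (0.5 here, illustrative) is what controls the resulting speaker count:

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# Placeholder segment embeddings (e.g., x-vectors), one row per segment;
# two synthetic "speakers" around opposite centers.
rng = np.random.default_rng(0)
embeddings = np.vstack([rng.normal(0, 1, (5, 128)) + 3,    # pseudo speaker A
                        rng.normal(0, 1, (5, 128)) - 3])   # pseudo speaker B

# Agglomerative clustering on cosine distances; average linkage is a common choice.
dists = pdist(embeddings, metric="cosine")
tree = linkage(dists, method="average")

# Cut the dendrogram at a tuned distance threshold; with no prior on the
# speaker count, this threshold decides how many speakers come out.
labels = fcluster(tree, t=0.5, criterion="distance")
print(labels)  # e.g., [1 1 1 1 1 2 2 2 2 2]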
2. Industrial Modular System

First-generation product (combined with ASV-Subtools*)
Pipeline: raw audio → voice activity detection (VAD) → speaker diarization (SD) → speech recognition (ASR) → recognition post-processing → per-speaker transcripts (speaker 1-4).
Algorithm flow: VAD → uniform segmentation → x-vector extraction with Subtools → PCA dimensionality reduction → cosine scoring → AHC clustering.
*https:/
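A condensed sketch of that algorithm flow; vad and extract_xvector are hypothetical stand-ins for the VAD module and the Subtools extractor, and the segment lengths, PCA dimension, and AHC threshold are illustrative:

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist
from sklearn.decomposition import PCA

SEG_LEN, SEG_SHIFT = 1.5, 0.75  # uniform segmentation, in seconds (illustrative)

def diarize(wav, vad, extract_xvector, threshold=0.6):
    """vad(wav) -> [(start, end)] speech regions; extract_xvector(wav, s, e) -> 1-D array."""
    segs, embs = [], []
    for start, end in vad(wav):
        t = start
        while t < end:                      # uniform segmentation inside speech regions
            s, e = t, min(t + SEG_LEN, end)
            segs.append((s, e))
            embs.append(extract_xvector(wav, s, e))
            t += SEG_SHIFT
    X = PCA(n_components=32).fit_transform(np.vstack(embs))      # PCA dimensionality reduction
    tree = linkage(pdist(X, metric="cosine"), method="average")  # cosine scoring
    labels = fcluster(tree, t=threshold, criterion="distance")   # AHC clustering
    return [(s, e, int(l)) for (s, e), l in zip(segs, labels)]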
3. Improvements: Pyannote Neural Segmentation

Problem: when extracting x-vectors, overlapping speech has to be removed and speech from the same speaker merged.
Solution: split the audio into segments and let a neural network (LSTM → feed-forward → classifier) decide which speakers are active in each segment, with at most 3 speakers per segment [1].

3. Improvements: Complete Pipeline

Pyannote neural segmentation (5 s windows) → x-vector extraction with Subtools → AHC clustering → diarization result.
Each speaker's audio is assigned according to the clustering result, including the handling of overlapping speech. Inference runs on an exported ONNX model [2].

[1] https:/ ; H. Bredin, "pyannote.audio 2.1 speaker diarization pipeline: principle, benchmark, and recipe," Interspeech, 2023.
[2] https:/

3. Improvements: C Rewrite

Problem: the Python code was too slow for practical use (real-time factor, RTF, far greater than 1).
The whole pipeline was reimplemented in C (RTF: 0.03-0.06).
Runtime environment: Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20 GHz.

4. Deployed Applications

声云 speech transcription (downloadable from the major app stores).
Effect of the new version in production: roughly 330-350 hours of audio transcribed per day, with zero complaints.

4. Deployed Applications

声云 speech transcription (downloadable from the major app stores).
Advantages of the self-developed engine:
- Mandarin transcription with speaker separation; one hour of audio takes about 4 minutes (a competitor needs 5 minutes and is often congested at night).
- Offline audio tasks longer than 5 hours are supported (the competitor caps tasks at 5 hours; longer files cannot be uploaded).

End of report. Comments and corrections are welcome!
Intelligent Speech Lab, Xiamen University