Table of Contents

VOICE RECOGNITION VS SPEECH RECOGNITION
How do speech recognition applications work?
WHICH TYPE OF AI IS USED IN SPEECH RECOGNITION?
WHAT IS IMPORTANT FOR SPEECH RECOGNITION TECHNOLOGY?
Automatic Speech Recognition process and components
Our 4 recommendations for improving quality of ASR
    1. PAY ATTENTION TO THE SAMPLE RATE
    2. NORMALIZE RECORDING VOLUME
    3. IMPROVE RECOGNITION OF SHORT WORDS
    4. USE NOISE SUPPRESSION METHODS ONLY WHEN NEEDED
Get the enhanced ASR system

Modern voice applications use AI algorithms to recognize different sounds, including human voice and speech. In technical terms,
most voice apps perform either voice recognition or speech recognition. And while there is little difference between the architectures and AI models that perform voice or speech recognition, the two actually relate to different business tasks. So first of all, let us elaborate on the difference between them.
VOICE RECOGNITION VS SPEECH RECOGNITION

Voice recognition is the ability to single out specific voices from other sounds and identify the owner's tone, to implement security features like voice biometrics.

Speech recognition is mostly responsible for extracting meaningful information from the audio: recognizing the words said and the context they are placed in. With this we can create systems like chatbots and virtual assistants for automated communication and precise understanding of voice commands.

Both terms can often be used interchangeably, because there is not much technical difference between the
algorithms that perform these functions. However, depending on what you need, the pipeline for voice or speech recognition may differ in terms of processing steps. If you are interested in voice recognition for security systems specifically, read our article on AI voice biometrics.

In this post, we'll focus on the general approach for speech recognition applications, and elaborate on some of the architectural principles we can apply to cover all of the possible functional requirements.

How do speech recognition applications work?

Speech recognition covers a large sphere of business applications, ranging from voice-driven user interfaces to virtual assistants like Alexa or Siri. Any speech recognition solution is based on Automatic Speech Recognition (ASR) technology, which extracts words and grammatical constructions from the audio so the system can process them and provide some type of response.

WHICH TYPE OF AI IS USED IN SPEECH RECOGNITION?

Speech recognition models can react to speech directly as an activation signal for any type of action. But since we're speaking about speech recognition, it is important to note that AI doesn't extract meaningful information straight from the audio, because there are many odd
sounds in it. This is where speech-to-text conversion comes in as an obligatory component, so that Natural Language Processing (NLP) can then be applied. The top-level scope of a speech recognition application can thus be represented as follows: the user's speech provides input to the AI algorithm, which helps to find the appropriate answer for the user.

High-level representation of an automatic speech recognition application

However, it is important to note that the model converting speech to text for further processing is only the most obvious component of the entire AI pipeline. Besides the conversion model, there will
be numerous components that ensure proper system performance.

So, approaching speech recognition system development, you must first decide on the scope of the desired application:

- What will the application do?
- Who will be the end users?
- What environmental conditions will it be used in?
- What are the features of the domain area?
- How will it scale in the future?

WHAT IS IMPORTANT FOR SPEECH RECOGNITION TECHNOLOGY?

When starting speech recognition system development, there are a number of basic audio properties we need to consider from the start:

1. Audio file format (MP3, WAV, FLAC, etc.)
2. Number of channels (stereo or mono)
3. Sample rate (8 kHz, 16 kHz, etc.)
4. Bitrate (32 kbit/s, 128 kbit/s, etc.)
5. Duration of the audio clips

The most important ones are audio file format and sample rate, so let's discuss them in detail. Input devices record audio in different file formats; most often audio is saved in lossy MP3, but there are also lossless formats like WAV or FLAC. Whenever we record a sound wave, we basically digitize the sound by sampling it at discrete intervals. This is what's called the sample rate, where each sample is the amplitude of the waveform at a particular instant in time.

Audio signal representation
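To make the sampling idea concrete, here is a minimal sketch in plain Python. The `digitize` helper is purely illustrative (not from any audio library); it samples a 440 Hz sine wave at a 16 kHz sample rate, producing one amplitude value per discrete time step:

```python
import math

def digitize(freq_hz, sample_rate_hz, duration_s):
    """Sample a sine wave at discrete intervals: one amplitude per sample."""
    n_samples = int(sample_rate_hz * duration_s)
    step = 1.0 / sample_rate_hz  # time between consecutive samples
    return [math.sin(2 * math.pi * freq_hz * i * step) for i in range(n_samples)]

# A 440 Hz tone sampled at 16 kHz for 0.1 s yields 1600 amplitude values.
samples = digitize(440, 16_000, 0.1)
print(len(samples))
```

Halving the sample rate would halve the number of amplitude values captured per second, which is one reason a model trained on 16 kHz audio can behave unpredictably on 8 kHz input.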
Some models are tolerant to format changes and sample rate variety, while others can take in only a fixed set of formats. In order to minimize this kind of inconsistency, we can use the methods available for working with audio in each programming language. For example, in Python, operations such as reading, transforming, and writing audio can be performed using libraries like Librosa, scipy.io.wavfile, and others.

Once we get into the specifics of audio processing, this will bring us to a more solid understanding of what data we'll need, and how much effort it
will take to process it. At this stage, consultancy from a data science team experienced in ASR and NLP is highly recommended, since gathering the wrong data and setting unrealistic objectives are the biggest risks at the beginning.

Automatic Speech Recognition process and components

Automatic speech recognition, speech-to-text, and NLP are some of the most obvious modules in the whole voice-based pipeline, but they cover only a very basic range of requirements. So now let's look at the common requirements for speech recognition, to understand what else we might include in our pipeline:

- The application has to
work in background mode, so it has to separate the user's speech from other sounds. For this feature, we'll need voice activity detection methods, which pass along only those frames that contain the target voice.
- The application is meant to be used in crowded places, which means there will be other voices and surrounding noise. Background noise suppression models are preferable here, especially neural networks, which can remove both low-frequency noise and loud high-frequency sounds such as other human voices. In cases where several people are talking, as in a call center, we also want to apply speaker diarization methods to divide the input voice stream into several speakers and find the required one.
- The application must display the result of voice recognition to the user. It should then take into account that speech-to-text (ASR) models may return text without punctuation marks, or with grammatical
mistakes. In this case, it is advisable to apply spelling correction models, which minimize the likelihood that the user will see a solid, unpunctuated block of text in front of them.
- The application will be used in a domain area where professional terms and abbreviations are common. In such cases, there is a risk that off-the-shelf speech-to-text models will not cope with the task correctly, and training a custom speech-to-text model will be required.

In this way, we can derive the following pipeline design, which includes multiple modules just to fetch the correct data and process it.

Automatic Speech Recognition (ASR) pipeline

Throughout the AI pipeline, there are blocks that are used by default: ASR and NLP methods (for example, intent classification models). Essentially, the AI algorithm takes sound as input, converts it to text using ASR models, and chooses a response for the user using a pre-trained NLP model. However, for
a qualitative result, stages such as pre-processing and post-processing are necessary. Now we'll move on to the advanced architecture.

Our 4 recommendations for improving quality of ASR

To optimize development planning and mitigate risks before you get into trouble, it is better to know about the existing problems within the standard approaches in advance. MobiDev ran an explicit test of the standard pipeline, so in this section we will share some of the insights that need to be considered.

1. PAY ATTENTION TO THE SAMPLE RATE

As we've mentioned before, audio has characteristics such as sample rate, number of channels, etc. These can significantly affect the result of voice recognition and the overall operation of the ASR model. To get the best possible results, we should consider that most pre-trained models were trained on datasets with a 16 kHz sample rate and only one channel, or in other words, mono audio. This brings some constraints on what data we can take for processing, and adds requirements to the data preparation stages.

2. NORMALIZE RECORDING VOLUME

Obviously, ASR methods are sensitive to audio containing a lot of extraneous noise, and suffer when trying to recognize atypical accents.
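Recording volume also matters, and evening it out can be illustrated with a rough plain-Python sketch that matches recordings to a common RMS level. This is a simplified stand-in for illustration only: the `rms_normalize` helper and the 0.1 target are assumptions, and a perceptual loudness tool like pyloudnorm (used in this section) does this job more accurately.

```python
import math

def rms(samples):
    """Root-mean-square level of a signal (floats in [-1, 1])."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def rms_normalize(samples, target_rms=0.1):
    """Scale a recording so its RMS level matches a common target,
    making quiet and loud takes comparable before they reach the ASR model."""
    gain = target_rms / max(rms(samples), 1e-9)
    return [max(-1.0, min(1.0, s * gain)) for s in samples]

# One quiet and one loud take of the same 220 Hz tone at 16 kHz.
quiet = [0.01 * math.sin(2 * math.pi * 220 * t / 16_000) for t in range(16_000)]
loud  = [0.8  * math.sin(2 * math.pi * 220 * t / 16_000) for t in range(16_000)]
# After normalization, both takes sit at the same RMS level.
print(round(rms(rms_normalize(quiet)), 3), round(rms(rms_normalize(loud)), 3))
```

Real loudness normalization works on perceptually weighted loudness units rather than raw RMS, but the principle of bringing every recording to a consistent level before recognition is the same.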
But what's more important, speech recognition results strongly depend on the sound volume. Sound recordings can often be inconsistent in volume due to the distance from the microphone, noise suppression effects, and natural volume fluctuations in speech. To avoid such inaccuracies, we can use the pyloudnorm Python library, which helps to determine the sound volume range and amplify the sound without distortion. This method is very similar to audio compression, but introduces fewer artifacts, improving the overall quality of the model's predictions.

NVIDIA QuartzNet 15x5 speech recognition results with and without volume normalization

Here you can see an example of voice recognition without volume normalization, and also with it. In the first case, the model struggled to recognize a simple word, but after the volume was restored, the results improved.

3. IMPROVE RECOGNITION OF SHORT WORDS

The majority of ASR models were trained on datasets that contain texts with proper semantic relations between sentences. This brings us to another problem: recognizing short phrases taken out of context. Below is a comparison of the performance of the ASR model on short words taken out of context and on a full sentence.

The result of recognizing short words in and out of context

To overcome this problem, it is necessary to consider preprocessing methods that let the model understand more accurately in which particular area a person wants to receive information.

Additionally, ASR models can generate non-existent words and other specific mistakes during the speech-to-text conversion. Spell correction methods may, in the best case, simply fail; otherwise they may correct the word to one that is close to the right choice, or even change it to a completely wrong one. This problem also applies to very short words taken out of context, and it should be foreseen in advance.

4. USE NOISE SUPPRESSION METHODS ONLY WHEN NEEDED

Background noise suppression methods can greatly help to separate a user's speech from the surrounding sounds. However, when loud noise is present, noise suppression can lead
to another problem: incorrect operation of the ASR model. Human speech tends to change in volume depending on the part of the sentence. For example, when we speak we naturally lower our voice at the end of a sentence, which leads to the voice blending with other sounds and being drowned out
by the noise suppression. This results in the ASR model not being able to recognize part of the message. Below you can see an example of noise suppression affecting only a part of a user's speech.

Noise suppression effect on the speech recognition

It is also worth considering that applying background noise suppression models distorts the original voice, which adversely affects the operation of the ASR model. Therefore, you should not apply background noise suppression without a specific need for it.

Get the enhanced ASR system

Based on the points above, the initial pipeline can bring more trouble with it than actual performance benefits. This is because some of the components that seem logical and obligatory may interrupt the work of other essential components. In other cases, there is a strict need to add layers of preprocessing before the actual AI model can interact with the data. We
can therefore come up with the following enhanced ASR system architecture:

Enhanced automatic speech recognition system pipeline

That is why, based on the above points, the noise suppression and spelling correction modules were removed. Instead, to solve the problem of removing noise and getting rid of errors in the recognized text, the ASR model has to be fine-tuned on real data. This data will fully reflect the actual environmental conditions and the features of the domain area.

As you can see, speech recognition application development has many pitfalls, and strong software engineering and machine learning expertise is required to handle them. At MobiDev, AI engineers have developed solid practical experience during projects and R&D on voice applications. If you have a project idea that involves speech recognition or any related technology, contact us to discuss the details.