Large Vocabulary Continuous Speech Recognition (LVCSR) System
A Large Vocabulary Continuous Speech Recognition (LVCSR) System is an Automatic Speech Recognition (ASR) System in which the audio file is split into phonemes that are matched against words and phrases in a dictionary to produce a full text transcript.
- AKA: Large Vocabulary Continuous Speech Recognizers, Text-based Indexing System.
- Example(s):
- Counter-Example(s):
- See: Audio Mining, Dictionary, Statistical Methods, Speech Recognition, Terminology.
References
2022
- (University of Sheffield, 2022) ⇒ http://spandh.dcs.shef.ac.uk/research/lvcsr.html Retrieved:2022-4-3.
- QUOTE: The search problem in LVCSR can be simply stated: find the most probable sequence of words given a sequence of acoustic observations, an acoustic model and a language model. This is a demanding problem since word boundary information is not available in continuous speech and each word in the dictionary may be hypothesized to start at each frame of acoustic data. The problem is further complicated by the vocabulary size (typically 65,000 words) and the structure imposed on the search space by the language model. Direct evaluation of all the possible word sequences is impossible (given the large vocabulary) and an efficient search algorithm will consider only a very small subset of all possible utterance models. Typically, the effective size of the search space is reduced through pruning of unlikely hypotheses and/or the elimination of repeated computations.
We developed a novel, efficient search strategy based on stack decoding, that we referred to as start-synchronous search. This is a single-pass algorithm that is naturally factored into time-asynchronous processing of the word sequence and time-synchronous processing of the HMM state sequence. The search architecture enables the search to be decoupled from the language model while still maintaining the computational benefits of time-synchronous processing.
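The pruned search described above can be illustrated with a minimal beam-search sketch. All model scores below are hypothetical toy values (a real system would use HMM acoustic scores and an n-gram or neural language model over tens of thousands of words); the sketch only shows how every vocabulary word is hypothesized at each step and how pruning keeps the effective search space small.

```python
# Toy log-probabilities for illustration only; these stand in for a
# real acoustic model P(observation | word) and bigram language model
# P(word | previous word).
ACOUSTIC_LOGP = {
    ("obs1", "the"): -0.5, ("obs1", "a"): -1.2,
    ("obs2", "cat"): -0.4, ("obs2", "cap"): -1.5,
}
LM_LOGP = {
    ("<s>", "the"): -0.3, ("<s>", "a"): -0.9,
    ("the", "cat"): -0.4, ("the", "cap"): -1.6,
    ("a", "cat"): -0.7, ("a", "cap"): -1.4,
}

def beam_search(observations, vocab, beam_width=2):
    """Return the best-scoring word sequence under the toy models.

    Each hypothesis is (log_score, word_sequence).  At every step all
    vocabulary words are hypothesized as continuations, then the beam
    is pruned so only the `beam_width` most probable partial
    hypotheses survive -- the "pruning of unlikely hypotheses" that
    makes LVCSR search tractable.
    """
    beam = [(0.0, ["<s>"])]
    for obs in observations:
        candidates = []
        for score, seq in beam:
            for word in vocab:
                s = (score
                     + ACOUSTIC_LOGP.get((obs, word), -10.0)  # acoustic score
                     + LM_LOGP.get((seq[-1], word), -10.0))   # language model score
                candidates.append((s, seq + [word]))
        # Pruning step: keep only the top-scoring hypotheses.
        beam = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
    best_score, best_seq = beam[0]
    return best_seq[1:], best_score

words, score = beam_search(["obs1", "obs2"], ["the", "a", "cat", "cap"])
print(words)  # -> ['the', 'cat']
```

With a beam width of 2 over a 4-word vocabulary, only 8 candidates are scored per frame instead of all 16 possible two-word sequences; at a realistic vocabulary size of 65,000 words the same pruning is what makes direct evaluation of all word sequences unnecessary.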
2021
- (Wikipedia, 2021) ⇒ https://en.wikipedia.org/wiki/Audio_mining#Large_Vocabulary_Continuous_Speech_Recognizers Retrieved:2021-6-20.
- In text-based indexing or large vocabulary continuous speech recognition (LVCSR), the audio file is first broken down into recognizable phonemes. It is then run through a dictionary that can contain several hundred thousand entries and matched with words and phrases to produce a full text transcript. A user can then simply search a desired word term and the relevant portion of the audio content will be returned.
If the text or word cannot be found in the dictionary, the system will choose the next most similar entry it can find. The system uses a language understanding model to create a confidence level for its matches. If the confidence level is below 100 percent, the system will provide options of all the found matches.
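The fallback-matching behavior described above can be sketched with standard-library string similarity. The dictionary, the similarity cutoff, and the use of `difflib` ratios as confidence scores are illustrative assumptions, not the mechanism of any particular LVCSR product.

```python
import difflib

# Hypothetical toy lexicon; a real LVCSR dictionary can contain
# several hundred thousand entries.
DICTIONARY = ["transcript", "transcribe", "recognition", "phoneme"]

def lookup(term, dictionary=DICTIONARY, cutoff=0.6):
    """Match a recognized term against the dictionary.

    An exact hit returns a single match at full confidence.  Otherwise
    the closest entries are returned with similarity-based confidence
    scores, mirroring how a system with sub-100% confidence presents
    all found matches as options.
    """
    if term in dictionary:
        return [(term, 1.0)]
    matches = difflib.get_close_matches(term, dictionary, n=3, cutoff=cutoff)
    return [(m, round(difflib.SequenceMatcher(None, term, m).ratio(), 2))
            for m in matches]

print(lookup("transcript"))  # exact match: [('transcript', 1.0)]
print(lookup("transcrip"))   # inexact: ranked candidates with confidences
```

The exact-match branch returns confidence 1.0; the inexact branch returns every candidate above the cutoff, best first, so the caller can surface all options to the user.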
2011
- (Dahl et al., 2011) ⇒ George E. Dahl, Dong Yu, Li Deng, and Alex Acero (2011). "Large vocabulary continuous speech recognition with context-dependent DBN-HMMs". In: 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 4688-4691). IEEE.