Speech Recognition System
A Speech Recognition System is a natural language system that implements a speech-to-text algorithm to solve a speech-to-text task.
- AKA: Speech-to-Text Engine.
- Context:
- It can range from being a Single-Speaker Speech-to-Text System to being a Multi-Speaker Speech-to-Text System.
- It can be supported by a Speech Recognition Service.
- It can be supported by a Speech Recognition Model.
- ...
- Example(s):
- Counter-Example(s):
- See: Computational Linguistics, Voice User Interface, Domotic, Word Processor, Email.
References
2017
- (Wikipedia, 2017) ⇒ https://en.wikipedia.org/wiki/Speech_recognition Retrieved:2017-4-5.
- Speech recognition (SR) is the inter-disciplinary sub-field of computational linguistics that develops methodologies and technologies that enable the recognition and translation of spoken language into text by computers. It is also known as "automatic speech recognition" (ASR), "computer speech recognition", or just "speech to text" (STT). It incorporates knowledge and research in the linguistics, computer science, and electrical engineering fields.
Some SR systems use "training" (also called "enrollment") where an individual speaker reads text or isolated vocabulary into the system. The system analyzes the person's specific voice and uses it to fine-tune the recognition of that person's speech, resulting in increased accuracy. Systems that do not use training are called "speaker independent" systems. Systems that use training are called "speaker dependent". Speech recognition applications include voice user interfaces such as voice dialing (e.g. "Call home"), call routing (e.g. "I would like to make a collect call"), domotic appliance control, search (e.g. find a podcast where particular words were spoken), simple data entry (e.g., entering a credit card number), preparation of structured documents (e.g. a radiology report), speech-to-text processing (e.g., word processors or emails), and aircraft (usually termed Direct Voice Input). The term voice recognition or speaker identification refers to identifying the speaker, rather than what they are saying. Recognizing the speaker can simplify the task of translating speech in systems that have been trained on a specific person's voice or it can be used to authenticate or verify the identity of a speaker as part of a security process.
From the technology perspective, speech recognition has a long history with several waves of major innovations. Most recently, the field has benefited from advances in deep learning and big data. The advances are evidenced not only by the surge of academic papers published in the field, but more importantly by the worldwide industry adoption of a variety of deep learning methods in designing and deploying speech recognition systems. These speech industry players include Google, Microsoft, IBM, Baidu, Apple, Amazon, Nuance, SoundHound, IflyTek, and CDAC, many of which have publicized the core technology in their speech recognition systems as being based on deep learning.
2011
- (Yu et al., 2011) ⇒ Dong Yu, Jinyu Li, and Li Deng. (2011). “Calibration of Confidence Measures in Speech Recognition.” In: IEEE Transactions on Audio, Speech, and Language Processing, 19(8). doi:10.1109/TASL.2011.2141988
2010
- (Vox Forge, 2010) ⇒ http://www.voxforge.org/home/docs/faq/faq/what-is-the-difference-between-a-speech-recognition-engine-and-a-speech-recognition-system
- Speech Recognition Engines ("SRE"s) are made up of the following components:
- Language Model or Grammar - Language Models contain a very large list of words and their probability of occurrence in a given sequence. They are used in dictation applications. Grammars are a much smaller file containing sets of predefined combinations of words. Grammars are used in IVR or desktop Command and Control applications. Each word in a Language Model or Grammar has an associated list of phonemes (which correspond to the distinct sounds that make up a word).
- Acoustic Model - Contains a statistical representation of the distinct sounds that make up each word in the Language Model or Grammar. Each distinct sound corresponds to a phoneme.
- Decoder - Software program that takes the sounds spoken by a user and searches the Acoustic Model for the equivalent sounds. When a match is made, the Decoder determines the phoneme corresponding to the sound. It keeps track of the matching phonemes until it reaches a pause in the user's speech. It then searches the Language Model or Grammar file for the equivalent series of phonemes. If a match is made, it returns the text of the corresponding word or phrase to the calling program.
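The decoder loop described above can be sketched in a few lines. This is a toy illustration, not a real SRE implementation: the lexicon, phoneme symbols, and function names are all hypothetical, and a real decoder searches a statistical acoustic model rather than matching exact symbols.

```python
# Toy grammar/lexicon: each phoneme sequence maps to a word
# (ARPAbet-like symbols are assumed for illustration).
PHONEME_LEXICON = {
    ("K", "AO", "L"): "call",
    ("HH", "OW", "M"): "home",
}

def decode(phoneme_stream):
    """Accumulate matched phonemes until a pause, then look up the word."""
    words, current = [], []
    for symbol in phoneme_stream:
        if symbol == "<pause>":          # pause in the user's speech
            word = PHONEME_LEXICON.get(tuple(current))
            if word is not None:         # series of phonemes found in the grammar
                words.append(word)
            current = []
        else:
            current.append(symbol)       # keep track of the matching phonemes
    return " ".join(words)

print(decode(["K", "AO", "L", "<pause>", "HH", "OW", "M", "<pause>"]))
# → call home
```

In a real engine the "match" at each step is probabilistic: the decoder scores many candidate phoneme and word sequences against the Acoustic Model and the Language Model, and returns the highest-scoring hypothesis.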
- A Speech Recognition System ('SRS') on a desktop computer does what a typical user of speech recognition would expect it to do: you speak a command into your microphone and the computer does something, or you dictate something to the computer and it types out the corresponding text on your screen.
An SRS typically includes a Speech Recognition Engine and a Dialog Manager (and may or may not include a Text to Speech Engine).
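The SRE-plus-Dialog-Manager composition can be sketched as follows. All class and method names here are illustrative assumptions, not the API of any real toolkit; the SRE is stubbed out since actual decoding is covered above.

```python
class SpeechRecognitionEngine:
    """Stub SRE: returns text for an audio input (a real engine decodes audio)."""
    def transcribe(self, audio):
        return audio["mock_transcript"]  # placeholder for actual decoding

class DialogManager:
    """Maps recognized text to an application action (command-and-control),
    falling back to dictation for unrecognized phrases."""
    COMMANDS = {"call home": "dialing home...", "open mail": "opening mail..."}
    def handle(self, text):
        return self.COMMANDS.get(text, f"typing: {text}")

class SpeechRecognitionSystem:
    """SRS = SRE + Dialog Manager (a Text to Speech Engine could be added)."""
    def __init__(self):
        self.sre = SpeechRecognitionEngine()
        self.dm = DialogManager()
    def process(self, audio):
        return self.dm.handle(self.sre.transcribe(audio))

srs = SpeechRecognitionSystem()
print(srs.process({"mock_transcript": "call home"}))    # → dialing home...
print(srs.process({"mock_transcript": "hello world"}))  # → typing: hello world
```

The split mirrors the VoxForge distinction: the SRE only turns speech into text, while the Dialog Manager decides whether that text is a command to execute or dictation to type out.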