Latent Sequence Decompositions (LSD) System

From GM-RKB

A Latent Sequence Decompositions (LSD) System is an End-To-End Automatic Speech Recognition System that can learn to decompose output sequences into sub-word units, where the segmentation is treated as a latent variable that is a function of both the input and the output sequence.
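The core idea is that a single target transcript admits many segmentations into sub-word tokens, and the model marginalizes over these latent decompositions rather than fixing one in advance. A minimal sketch of the enumeration step, assuming a hypothetical toy sub-word vocabulary (not the one used in the original work):

```python
def decompositions(word, vocab):
    """Enumerate every way to segment `word` into tokens drawn from `vocab`.

    Each returned list is one latent decomposition z of the target y;
    an LSD-style model would sum p(z | x) over all such z.
    """
    if not word:
        return [[]]  # the empty string has exactly one (empty) segmentation
    results = []
    for i in range(1, len(word) + 1):
        piece = word[:i]
        if piece in vocab:  # extend each segmentation of the remainder
            for rest in decompositions(word[i:], vocab):
                results.append([piece] + rest)
    return results

# Toy vocabulary for illustration only.
vocab = {"h", "e", "l", "o", "he", "ll", "lo"}
segs = decompositions("hello", vocab)
# e.g. ["he", "ll", "o"] and ["h", "e", "l", "l", "o"] are both valid decompositions
```

In practice the number of decompositions grows combinatorially with sequence length, so the original approach trains with sampled decompositions rather than exhaustive enumeration.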



References

2022

  • (Wikipedia, 2022) ⇒ https://en.wikipedia.org/wiki/Speech_recognition#End-to-end_automatic_speech_recognition Retrieved:2022-4-3.
    • QUOTE: (...) An alternative approach to CTC-based models are attention-based models. Attention-based ASR models were introduced simultaneously by Chan et al. of Carnegie Mellon University and Google Brain and Bahdanau et al. of the University of Montreal in 2016. The model named "Listen, Attend and Spell" (LAS), literally "listens" to the acoustic signal, pays "attention" to different parts of the signal and "spells" out the transcript one character at a time. Unlike CTC-based models, attention-based models do not have conditional-independence assumptions and can learn all the components of a speech recognizer including the pronunciation, acoustic and language model directly. This means, during deployment, there is no need to carry around a language model making it very practical for applications with limited memory. By the end of 2016, the attention-based models have seen considerable success including outperforming the CTC models (with or without an external language model). Various extensions have been proposed since the original LAS model. Latent Sequence Decompositions (LSD) was proposed by Carnegie Mellon University, MIT and Google Brain to directly emit sub-word units which are more natural than English characters; University of Oxford and Google DeepMind extended LAS to "Watch, Listen, Attend and Spell" (WLAS) to handle lip reading surpassing human-level performance.

2017