2012 ContextDependentPreTrainedDeepN
- (Dahl et al., 2012) ⇒ George E. Dahl, Dong Yu, Li Deng, and Alex Acero. (2012). “Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition.” In: IEEE Transactions on Audio, Speech, and Language Processing Journal, 20(1). doi:10.1109/TASL.2011.2134090
Subject Headings: Unsupervised Pre-Training Algorithm, Feature Detector.
Notes
Cited By
- http://scholar.google.com/scholar?q=%222012%22+Context-Dependent+Pre-Trained+Deep+Neural+Networks+for+Large-Vocabulary+Speech+Recognition
- http://dl.acm.org/citation.cfm?id=2335874.2336015&preflayout=flat#citedby
Quotes
Author Keywords
Speech recognition, deep belief network, context-dependent phone, LVSR, DNN-HMM, ANN-HMM
Abstract
We propose a novel context-dependent (CD) model for large-vocabulary speech recognition (LVSR) that leverages recent advances in using deep belief networks for phone recognition. We describe a pre-trained deep neural network hidden Markov model (DNN-HMM) hybrid architecture that trains the DNN to produce a distribution over senones (tied triphone states) as its output. The deep belief network pre-training algorithm is a robust and often helpful way to initialize deep neural networks generatively that can aid in optimization and reduce generalization error. We illustrate the key components of our model, describe the procedure for applying CD-DNN-HMMs to LVSR, and analyze the effects of various modeling choices on performance. Experiments on a challenging business search dataset demonstrate that CD-DNN-HMMs can significantly outperform the conventional context-dependent Gaussian mixture model (GMM)-HMMs, with an absolute sentence accuracy improvement of 5.8% and 9.2% (or relative error reduction of 16.0% and 23.2%) over the CD-GMM-HMMs trained using the minimum phone error rate (MPE) and maximum-likelihood (ML) criteria, respectively.
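The core of the hybrid architecture described in the abstract is that the DNN's senone posteriors stand in for GMM likelihoods during HMM decoding. The following is a minimal NumPy sketch of that conversion, not the authors' implementation; the toy dimensions, the prior values, and the helper name senone_scaled_likelihoods are assumptions for illustration.

```python
import numpy as np

def senone_scaled_likelihoods(posteriors, senone_priors, eps=1e-10):
    """Return log-domain scaled likelihoods log(p(s|x) / p(s)).

    In a hybrid DNN-HMM system, dividing the DNN senone posteriors p(s|x)
    by the senone priors p(s) yields quantities proportional to p(x|s),
    which the HMM decoder consumes in place of GMM likelihoods.
    """
    return np.log(posteriors + eps) - np.log(senone_priors + eps)

# Toy example: 3 acoustic frames, 4 senones (real systems use thousands).
rng = np.random.default_rng(0)
logits = rng.normal(size=(3, 4))                                          # stand-in for DNN outputs
posteriors = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)   # softmax over senones
senone_priors = np.array([0.4, 0.3, 0.2, 0.1])                            # estimated from training alignments

log_likes = senone_scaled_likelihoods(posteriors, senone_priors)
print(log_likes.shape)  # (3, 4): per-frame scores passed to Viterbi decoding
```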
Introduction
…
Recently, a major advance has been made in training densely connected, directed belief nets with many hidden layers. The resulting deep belief nets learn a hierarchy of nonlinear feature detectors that can capture complex statistical patterns in data. The deep belief net training algorithm suggested in [24] first initializes the weights of each layer individually in a purely unsupervised[1] way and then fine-tunes the entire network using labeled data. This semi-supervised approach using deep models has proved effective in a number of applications, including coding and data-driven classification for speech, audio, text, and image data ([25]–[29]).
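As a rough illustration of the initialize-then-fine-tune recipe described above, the sketch below pre-trains each hidden layer without labels and then fine-tunes the stacked network on labeled frames. It substitutes autoencoder-style layer-wise training for the RBM-based deep belief net procedure the paper actually uses, and the layer sizes, learning rates, and synthetic data are all assumptions for illustration.

```python
import torch
import torch.nn as nn

sizes = [39, 128, 128, 128]     # e.g. MFCC-like input -> three hidden layers
n_senones = 50                  # toy output size; real systems use thousands

x_unlabeled = torch.randn(256, sizes[0])           # unlabeled acoustic frames
x_labeled = torch.randn(128, sizes[0])             # labeled frames
y_labeled = torch.randint(0, n_senones, (128,))    # senone targets from forced alignment

# 1) Unsupervised phase: train each hidden layer to reconstruct its own input
#    (autoencoder stand-in for RBM pre-training), then freeze its features.
hidden_layers = []
h = x_unlabeled
for d_in, d_out in zip(sizes[:-1], sizes[1:]):
    encoder = nn.Sequential(nn.Linear(d_in, d_out), nn.Sigmoid())
    decoder = nn.Linear(d_out, d_in)
    opt = torch.optim.SGD(list(encoder.parameters()) + list(decoder.parameters()), lr=0.1)
    for _ in range(50):
        opt.zero_grad()
        loss = nn.functional.mse_loss(decoder(encoder(h)), h)
        loss.backward()
        opt.step()
    hidden_layers.append(encoder)
    h = encoder(h).detach()      # this layer's features feed the next layer

# 2) Supervised phase: stack the pre-trained layers, add a senone softmax layer,
#    and fine-tune the whole network on the labeled data.
network = nn.Sequential(*hidden_layers, nn.Linear(sizes[-1], n_senones))
opt = torch.optim.SGD(network.parameters(), lr=0.05)
for _ in range(100):
    opt.zero_grad()
    loss = nn.functional.cross_entropy(network(x_labeled), y_labeled)
    loss.backward()
    opt.step()
```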
…
Footnotes
References
| Author | volume | Date Value | title | type | journal | titleUrl | doi | note | year |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Dong Yu, Li Deng, George E. Dahl, Alex Acero | 20(1) | 2012 | Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition | | IEEE Transactions on Audio, Speech, and Language Processing | | 10.1109/TASL.2011.2134090 | | 2012 |