2011 CalibrationofConfidenceMeasures
- (Yu et al., 2011) ⇒ Dong Yu, Jinyu Li, and Li Deng. (2011). “Calibration of Confidence Measures in Speech Recognition.” In: IEEE Transactions on Audio, Speech, and Language Processing, 19(8). doi:10.1109/TASL.2011.2141988
Subject Headings: Confidence Measure; Speech Recognition
Notes
Cited By
Quotes
Author Keywords
- confidence calibration; confidence measure; maximum entropy; distribution constraint; word distribution; deep belief network
Abstract
Most speech recognition applications in use today rely heavily on confidence measures for making optimal decisions. In this paper, we aim to answer the question: what can be done to improve the quality of the confidence measure if we cannot modify the speech recognition engine? The answer provided in this paper is a post-processing step called confidence calibration, which can be viewed as a special adaptation technique applied to the confidence measure. Three confidence calibration methods have been developed in this work: the maximum entropy model with distribution constraints, the artificial neural network, and the deep belief network. We compare these approaches and demonstrate the importance of the key features exploited: the generic confidence score, the application-dependent word distribution, and the rule coverage ratio. We demonstrate the effectiveness of confidence calibration on a variety of tasks with significant normalized cross entropy increase and equal error rate reduction.
1. Introduction
Automatic speech recognition (ASR) technology has been widely deployed in such applications as spoken dialog systems, voice mail (VM) transcription, and voice search [2][3]. Even though ASR accuracy has been greatly improved over the past three decades, errors are still inevitable, especially under noisy conditions [1]. For this reason, most speech applications today rely heavily on a computable scalar quantity, called the confidence measure, to select optimal dialog strategies or to inform users what information can be trusted and what cannot. The quality of the confidence measure is thus one of the critical factors in determining the success of speech applications.
Based on the nature of a specific speech application, one or two types of confidence measures may be used. The word confidence measure (WCM) estimates the likelihood a word is correctly recognized. The semantic confidence measure (SCM), on the other hand, measures how likely semantic information is correctly extracted from an utterance. For example, in the VM transcription application, SCM is essential for the keyword slots such as the call-back phone number; WCM is important for transcribing a general message. In spoken dialog and voice search (VS) applications, SCM is more meaningful since the goal of these applications is to extract from users’ responses semantic information such as date/time, departure and destination cities, and business names.
Note that SCM has substantially different characteristics from WCM and requires distinct treatment, primarily because the same semantic information can be delivered in different ways. For instance, the number 1234 may be expressed as "one thousand two hundred and thirty four" or "twelve thirty four." In addition, it is not necessary to recognize all the words correctly to obtain correct semantic information. For example, it will not cause any semantic error when November seventh is misrecognized as November seven, and vice versa. This is especially true when irrelevant or redundant words, such as ma'am in "yes ma'am" and ah in "ah yes", are misrecognized or filtered out (e.g., using a garbage model [4][5]).
Numerous techniques have been developed to improve the quality of confidence measures. See [6] for a survey. Briefly, those prior techniques can be grouped into three categories. In the first category, a two-class (true or false) classifier is built based on features (e.g., acoustic and language model scores) obtained from the ASR engine and the classifier’s likelihood output is used as the confidence measure. The classification models reported in the literature include the linear discriminant function [7][8], generalized linear model [9][10], Gaussian mixture classifier [11], neural network [12][13][49], decision tree [14][15], boosting [16], and maximum entropy model [17]. The techniques in the second category take the posterior probability of a word (or semantic slot) given the acoustic signal as the confidence measure. This posterior probability is typically estimated from the ASR lattices [18][19][20][21] or N-best lists [20][22]. These techniques require special handling when the lattice is not sufficiently rich. However, they do not require an additional parametric model to estimate the confidence score. Techniques in the third category treat the confidence estimation problem as an utterance verification problem. They use the likelihood ratio between the null hypothesis (e.g., the word is correct) and the alternative hypothesis (e.g., the word is incorrect) as the confidence measure [8][23][24]. Discussions on the pros and cons of all three categories of techniques can be found in [6]. Note that the parametric techniques in the first and third categories often outperform the non-parametric techniques in the second category, because the parametric techniques can always include the posterior probability as one of the information sources and thus improve upon it.
No matter which parametric technique is used, the confidence measure is typically provided by the ASR engine and trained on a generic dataset. Therefore, it is a black box to the speech application developers. Using a generic training set can provide good average out-of-the-box performance across a variety of applications. However, this approach is obviously not optimal since the data used to train the confidence measure may differ vastly from the real data observed in a specific speech application. The disparity is typically caused by different language models used and different environments in which the applications are deployed. In addition, using the confidence model inside the ASR engine makes it difficult to exploit application-specific features such as the word distribution (see Section IV). These application-specific features are either external to the ASR engine or cannot be reliably estimated from the generic training set.
Currently, only a limited number of companies and institutions have the capability and resources to build real-time large vocabulary continuous ASR engines. Most speech application developers have no access to the internals of the engines and cannot modify the built-in confidence estimation algorithms. Thus, they often have to rely on the confidence measure provided by the engines. This situation can be painful for speech application developers, especially when a poor confidence model or feature set is used in the ASR engine and the model parameters are not well tuned.
In this paper we aim at answering the following question: what can be done to improve the quality of the confidence measures if we cannot modify the ASR engine? This problem has become increasingly important recently since more speech applications are built by application developers who know nothing about the ASR engines. In this paper we use a technique called confidence calibration to solve this problem. It is a post-processing step that tunes the confidence measure for each specific application using a small amount of transcribed calibration data collected under real usage scenarios. To show why confidence calibration would help, let us consider a simple speech application that only recognizes "yes" and "no". Let us further assume "yes" is correctly recognized 98% of the time and constitutes 80% of the responses, and "no" is correctly recognized 90% of the time and constitutes 20% of the responses. In this case, a confidence score of 0.5 for "yes" from the ASR engine may mean something quite different from the same score for "no." Thus an adjusted (calibrated) score using this information would help to improve the overall quality of the confidence score if done correctly.
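The yes/no example can be made concrete with a small sketch. Assume, purely for illustration (this is not the paper's calibration method; `word_accuracy` and `calibrate` are hypothetical names), that calibration combines the raw engine score with each word's empirical accuracy by adding their log-odds, a naive-Bayes-style combination of the two evidence sources:

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Per-word empirical accuracies from the yes/no example above.
word_accuracy = {"yes": 0.98, "no": 0.90}

def calibrate(raw_score, word):
    # Shift the raw score's log-odds by the word's accuracy log-odds.
    # At raw_score = 0.5 (log-odds 0) the calibrated score falls back
    # to the word's empirical accuracy.
    return sigmoid(logit(raw_score) + logit(word_accuracy[word]))

calibrate(0.5, "yes")  # 0.98: a raw 0.5 for "yes" is still very likely correct
calibrate(0.5, "no")   # 0.90: the same raw score means less for "no"
```

The point of the example is exactly this asymmetry: an identical raw score of 0.5 maps to different calibrated confidences for "yes" and "no" once application-specific statistics are taken into account.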
We propose and compare three approaches for confidence calibration: the maximum entropy model (MaxEnt) with distribution constraints (MaxEnt-DC), the conventional artificial neural network (ANN), and the deep belief network (DBN). To the best of our knowledge, this is the first time that MaxEnt-DC and DBNs are applied to confidence estimation and calibration. The contribution of this work also includes the discovery of effective yet non-obvious features such as the word distribution information and the rule coverage ratio in improving confidence measures.
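As a rough illustration of the first approach, a plain conditional MaxEnt model (i.e., logistic regression, without the distribution constraints that distinguish MaxEnt-DC) can be trained on calibration data carrying a raw-score feature and an application-dependent word-frequency feature. Everything below is an assumption for illustration: the feature names, the synthetic data, and the simple SGD training loop are not from the paper.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_maxent(data, features, lr=0.1, epochs=200):
    """Plain conditional MaxEnt calibrator trained by per-sample SGD.

    data: list of (feature_dict, correct_label) pairs from a calibration set.
    """
    w = {f: 0.0 for f in features}
    b = 0.0
    for _ in range(epochs):
        for feats, y in data:
            p = sigmoid(b + sum(w[f] * v for f, v in feats.items()))
            err = y - p  # gradient of the log-likelihood w.r.t. the logit
            b += lr * err
            for f, v in feats.items():
                w[f] += lr * err * v
    return w, b

# Synthetic calibration data: correctness correlates with both the raw
# engine score and a word-frequency feature (generative model is made up).
random.seed(0)
data = []
for _ in range(500):
    raw, freq = random.random(), random.random()
    y = 1 if random.random() < sigmoid(4 * raw + 2 * freq - 3) else 0
    data.append(({"raw": raw, "freq": freq}, y))

w, b = train_maxent(data, ["raw", "freq"])
calibrated = sigmoid(b + w["raw"] * 0.9 + w["freq"] * 0.8)
```

The ANN and DBN calibrators discussed in the paper play the same role as this model — mapping engine scores plus application-specific features to a better-calibrated confidence — but with nonlinear hidden layers instead of a single log-linear layer.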
References
[1] Y.-F. Gong, "Speech recognition in noisy environments: A survey", Speech Communication. 16, 261–291, 1995.
[2] Y.-Y. Wang, D. Yu, Y.-C. Ju, and A. Acero. “An Introduction to Voice Search," IEEE Signal Processing Magazine, vol. 25, no. 3, pp. 28-38, May 2008.
[3] D. Yu, Y.-C. Ju, Y.-Y. Wang, G. Zweig, A. Acero. “Automated directory assistance system - From theory to practice," in Proceedings of Interspeech, pp. 2709-2712, 2007.
[4] J. Wilpon, L. Rabiner, and C.-H. Lee, "Automatic recognition of keywords in unconstrained speech using hidden Markov models", IEEE Trans. ASSP, vol. 38, pp. 1870-1878, 1990.
[5] D. Yu, Y.-C. Ju, Y.-Y. Wang, A. Acero. “N-gram based filler model for robust grammar authoring", in Proceedings of ICASSP, vol. I, pp. 565-568, 2006.
[6] H. Jiang, "Confidence measures for speech recognition: a survey," Speech Communication, vol. 45, no. 4, pp. 455-470, Apr. 2005.
[7] R.A. Sukkar, "Rejection for connected digit recognition based on GPD segmental discrimination", in Proceedings of ICASSP, pp. I-393–I-396, 1994.
[8] R.A. Sukkar, C.-H. Lee, "Vocabulary independent discriminative utterance verification for nonkeyword rejection in subword based speech recognition", IEEE Trans. Speech Audio Process., vol. 4, no. 6, pp. 420–429, 1996.
[9] L. Gillick, Y. Ito, J. Young, "A probabilistic approach to confidence estimation and evaluation", in Proceedings of ICASSP, pp. 879–882, 1997.
[10] M. Siu, H. Gish, "Evaluation of word confidence for speech recognition systems", Computer Speech Language, vol. 13, pp. 299–319, 1999.
[11] B. Chigier, "Rejection and keyword spotting algorithms for a directory assistance city name recognition application", in Proceedings of ICASSP, pp. II-93–II-96, 1992.
[12] L. Mathan, L. Miclet, "Rejection of extraneous input in speech recognition applications, using multi-layer perceptrons and the trace of HMMs", in Proceedings of ICASSP, pp. 93–96, 1991.
[13] M. Weintraub, F. Beaufays, Z. Rivlin, Y. Konig, A. Stolcke, "Neural-network based measures of confidence for word recognition", in Proceedings of ICASSP, pp. 887–890, 1997.
[14] E. Eide, H. Gish, P. Jeanrenaud, A. Mielke, "Understanding and improving speech recognition performance through the use of diagnostic tools", in Proceedings of ICASSP, pp. 221–224, 1995.
[15] C.V. Neti, S. Roukos, E. Eide, "Word-based confidence measures as a guide for stack search in speech recognition", in Proceedings of ICASSP, pp. 883–886, 1997.
[16] P.J. Moreno, B. Logan, B. Raj, "A boosting approach for confidence scoring", in Proceedings of EuroSpeech, 2001.
[17] C. White, J. Droppo, A. Acero, and J. Odell, "Maximum entropy confidence estimation for speech recognition," in Proceedings of ICASSP, vol. IV, pp. 809-812, 2007.
[18] T. Kemp, T. Schaaf, "Estimating confidence using word lattices", in Proceedings of EuroSpeech, pp. 827–830, 1997.
[19] F. Wessel, K. Macherey, R. Schluter, "Using word probabilities as confidence measures", in Proceedings of ICASSP, pp. 225–228, 1998.
[20] F. Wessel, K. Macherey, H. Ney., "A comparison of word graph and N-best list based confidence measures", in Proceedings of EuroSpeech, pp. 315–318, 1999.
[21] F. Wessel, R. Schluter, K. Macherey, H. Ney, "Confidence measures for large vocabulary continuous speech recognition", IEEE Trans. Speech Audio Process, vol. 9, no. 3, pp. 288–298, 2001.
[22] B. Rueber, "Obtaining confidence measures from sentence probabilities", in Proceedings of EuroSpeech, 1997.
[23] R.C. Rose, B.H. Juang, C.H. Lee, "A training procedure for verifying string hypothesis in continuous speech recognition", in Proceedings of ICASSP, pp. 281–284, 1995.
[24] M.G. Rahim, C.-H. Lee, B.-H. Juang, "Discriminative utterance verification for connected digits recognition", IEEE Trans. on Speech and Audio Processing vol. 5, no. 3, pp. 266–277, 1997.
[25] D. Yu, L. Deng, and A. Acero, "Using continuous features in the maximum entropy model", Pattern Recognition Letters. vol. 30, no. 8, pp.1295-1300, June, 2009.
[26] A. Martin, G. Doddington, T. Kamm, M. Ordowski, and M. Przybocki, "The DET curve assessment of detection task performance," in Proceedings of EuroSpeech, vol. 4, pp. 1895-1898, 1997.
[27] S. Guiasu, and A. Shenitzer, "The principle of maximum entropy", The Mathematical Intelligencer, vol. 7, no. 1, 1985.
[28] A.L. Berger, S.A. Della Pietra, and V.J. Della Pietra, "A maximum entropy approach to natural language processing", Computational Linguistics, vol. 22, pp. 39-71, 1996
[29] C. Ma, P. Nguyen, and M. Mahajan, "Finding Speaker Identities with a Conditional Maximum Entropy Model", in Proceedings of ICASSP, vol. IV, pp. 261-264, 2007.
[30] R. Rosenfeld, "A maximum entropy approach to adaptive statistical language modeling", Computer Speech and Language, 10:187-228, 1996.
[31] D. Yu, M. Mahajan, P. Mau, and A. Acero, "Maximum entropy based generic filter for language model adaptation," in Proceedings of ICASSP, vol. I, pp. 597-600, 2005.
[32] F. J. Och, and H. Ney, "Discriminative training and maximum entropy models for statistical machine translation", in Proceedings of ACL, pp. 295-302, 2002.
[33] D. Yu, L. Deng, Y. Gong, and A. Acero, "A novel framework and training algorithm for variable-parameter hidden Markov models," IEEE trans. on Audio, Speech, and Language Processing, vol 17, no. 7, pp. 1348-1360, September 2009.
[34] D. Yu, and L. Deng, "Solving nonlinear estimation problems using Splines," IEEE Signal Processing Magazine, vol. 26, no. 4, pp.86-90, July, 2009.
[35] D. Yu, L. Deng, and A. Acero, "Hidden conditional random field with distribution constraints for phonetic classification," in Proceedings of Interspeech, pp. 676-679, 2009.
[36] M. Riedmiller and H. Braun. “A direct adaptive method for faster back-propagation learning: The RPROP algorithm," in Proceedings of IEEE ICNN, vol. 1, pp. 586-591. 1993.
[37] D. van Leeuwen and N. Brümmer. “On calibration of language recognition scores," in Proceedings of IEEE Odyssey: The Speaker and Language Recognition Workshop, 2006.
[38] R. Malouf, "A comparison of algorithms for maximum entropy parameter estimation", in Proceedings of CoNLL, vol. 2, pp. 1-7, 2002.
[39] J. Nocedal, "Updating quasi-Newton matrices with limited storage", Mathematics of Computation, vol. 35, pp. 773-782, 1980.
[40] S. F. Chen, and R. Rosenfeld, "A Gaussian prior for smoothing maximum entropy models", In Technical Report CMU-CS-99-108, Carnegie Mellon University, 1999.
[41] S. F. Chen, and R. Rosenfeld, "A survey of smoothing techniques for ME models," IEEE Transactions on Speech and Audio Processing, vol. 8, no. 1, pp. 37-50, 2000.
[42] J. Goodman, "Exponential priors for maximum entropy models", in Proceedings of HLT-NAACL. pp. 305-311, 2004.
[43] J. Kazama, "Improving maximum entropy natural language processing by uncertainty-aware extensions and unsupervised learning", Ph.D. thesis, University of Tokyo, 2004.
[44] J. Kazama, and J. Tsujii, "Maximum entropy models with inequality constraints: A case study on text categorization", Machine Learning, vol. 60, no. 1-3 , pp. 159 – 194, 2005.
[45] D. Yu, S. Wang, Z. Karam, L. Deng, "Language recognition using deep-structured conditional random fields", in Proceedings of ICASSP 2010.
[46] D. Yu, S. Wang, J. Li, L. Deng, "Word confidence calibration using a maximum entropy model with constraints on confidence and word distributions," in Proceedings of ICASSP 2010.
[47] D. Yu, L. Deng, "Semantic confidence calibration for spoken dialog applications", in Proceedings of ICASSP 2010.
[48] G. Evermann, and P. Woodland, "Large vocabulary decoding and confidence estimation using word posterior probabilities", In: Proceedings ICASSP 2000.
[49] A. Stolcke, H. Bratt, J. Butzberger, H. Franco, V. R. Rao Gadde, M. Plauché, C. Richey, E. Shriberg, K. Sonmez, J. Zheng, and F. Weng, "The SRI March 2000 Hub-5 conversational speech transcription system", in Proceedings of the Speech Transcription Workshop, 2000.
[50] http://www.icsi.berkeley.edu/Speech/docs/sctk-1.2/sclite.htm
[51] J. Xue and Y. Zhao, "Random forests-based confidence annotation using novel features from confusion network," In: Proceedings ICASSP, vol. I, pp. 1149-1152, 2006.
[52] D. Hillard and M. Ostendorf, "Compensating for Word Posterior Estimation Bias in Confusion Networks", in Proceedings of ICASSP, vol. I, pp. 1153-1156, 2006.
[53] R. Sarikaya, Y. Gao, M. Picheny, and H. Erdogan, "Semantic confidence measurement for spoken dialog systems," IEEE trans. on Audio, Speech, and Language Processing, vol. 13, no. 4, pp. 534-545, 2005.
[54] G. Hinton, S. Osindero and Y. Teh, "A Fast learning algorithm for deep belief nets," Neural Computation, vol. 18, 2006, pp. 1527-1554.
[55] G. Hinton, and R. Salakhutdinov, "Reducing the dimensionality of data with neural networks," Science, vol. 313. no. 5786, 2006, pp. 504 – 507.
[56] A.-R. Mohamed, G. Dahl, G. Hinton, "Deep Belief Networks for Phone Recognition," in Proceedings of NIPS Workshop, Dec. 2009.
[57] A.-R. Mohamed, D. Yu, and L. Deng, "Investigation of full-sequence training of deep belief networks for speech recognition," in Proceedings of Interspeech 2010.
[58] A. Acero, N. Bernstein, R. Chambers, Y. Ju, X. Li, J. Odell, P. Nguyen, O. Scholtz, and G. Zweig, "Live search for mobile: Web services by voice on the cellphone," in Proceedings of ICASSP, 2008, pp. 5256–5259.
Author | volume | Date Value | title | type | journal | titleUrl | doi | note | year
---|---|---|---|---|---|---|---|---|---
Dong Yu, Jinyu Li, Li Deng | 19(8) | 2011 | Calibration of Confidence Measures in Speech Recognition | | IEEE Transactions on Audio, Speech, and Language Processing | | 10.1109/TASL.2011.2141988 | | 2011