Iterative Scaling Algorithm
An Iterative Scaling Algorithm is an optimization algorithm that ...
- AKA: GIS.
- …
- Counter-Example(s):
- See: Maximum-Entropy Model, Optimization Algorithm, Scaling Algorithm, Limited-Memory Quasi-Newton/L-BFGS, Truncated Newton's Method.
References
2007
- (Minka, 2007) ⇒ Thomas P. Minka. (2007). “A Comparison of Numerical Optimizers for Logistic Regression.” Technical Report.
- QUOTE: Logistic regression is a workhorse of statistics and is closely related to methods used in Machine Learning, including the Perceptron and the Support Vector Machine. This note compares eight different algorithms for computing the maximum a-posteriori parameter estimate. A full derivation of each algorithm is given. In particular, a new derivation of Iterative Scaling is given which applies more generally than the conventional one. A new derivation is also given for the Modified Iterative Scaling algorithm of Collins et al. (2002). Most of the algorithms operate in the primal space, but can also work in dual space. All algorithms are compared in terms of computational complexity by experiments on large data sets. The fastest algorithms turn out to be conjugate gradient ascent and quasi-Newton algorithms, which far outstrip Iterative Scaling and its variants.
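For orientation, here is a minimal sketch of the setting the note studies: MAP estimation for logistic regression with a Gaussian prior, solved with a quasi-Newton (L-BFGS) optimizer, the family the note reports as fastest. The synthetic data, prior precision, and use of SciPy are assumptions for illustration, not Minka's implementation.

```python
# Minimal sketch (assumed data and settings, not Minka's code): MAP logistic
# regression with a Gaussian prior, optimized by the quasi-Newton method L-BFGS.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))            # synthetic design matrix (assumption)
w_true = rng.normal(size=5)
y = (rng.random(200) < 1.0 / (1.0 + np.exp(-X @ w_true))).astype(float)
alpha = 1.0                              # Gaussian prior precision (assumption)

def neg_log_posterior(w):
    z = X @ w
    # negative log-likelihood log(1 + e^z) - y*z, plus the Gaussian prior term
    return np.sum(np.logaddexp(0.0, z) - y * z) + 0.5 * alpha * w @ w

def gradient(w):
    p = 1.0 / (1.0 + np.exp(-(X @ w)))   # predicted probabilities
    return X.T @ (p - y) + alpha * w

result = minimize(neg_log_posterior, np.zeros(5), jac=gradient, method="L-BFGS-B")
print("MAP weight estimate:", result.x)
```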
2003
- (Jin et al., 2003) ⇒ Rong Jin, Rong Yan, Jian Zhang, and Alexander G. Hauptmann. (2003). “A Faster Iterative Scaling Algorithm for Conditional Exponential Model.” In: Proceedings of ICML 2003.
- QUOTE: To find the optimal conditional exponential model for given training data, two groups of approaches have been used in the past research. One is named iterative scaling approach (Brown, 1959), including the Generalized Iterative Scaling (GIS) (Darroch & Ratcliff, 1972) and the Improved Iterative Scaling (IIS) (Berger, 1997). The underlying idea for iterative scaling approaches is similar to the idea of Expectation-Maximization (EM) approach: by approximating the log-likelihood function of the conditional exponential model as some kind of ‘simple’ auxiliary function, the iterative scaling methods are able to decouple the correlation between the parameters and the search for the maximum point can be operated along many directions simultaneously. By carrying out this procedure iteratively, the approximated optimal point found over the ‘simplified’ function is guaranteed to converge to the true optimal point due to the convexity of the objective function. The distinction between GIS and IIS is that the GIS method requires the sum of input features to be a constant over all the examples while the IIS method doesn’t.
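The GIS idea summarized in the quote (matching empirical and model feature expectations via multiplicative updates, with a constant feature sum enforced by a slack feature) can be illustrated with a toy sketch. This is an assumed, unconditional log-linear example, not code from Jin et al.; the outcomes, features, and target expectations are hypothetical.

```python
# Toy GIS sketch (hypothetical data): fit a log-linear model
# p(x) proportional to exp(sum_i lambda_i * f_i(x)) over three outcomes so that
# model feature expectations match the empirical ones.
import numpy as np

F = np.array([[1.0, 0.0],          # feature values f_i(x) for each outcome x
              [0.0, 1.0],
              [1.0, 1.0]])
empirical = np.array([0.5, 0.7])   # assumed empirical expectations E~[f_i]

# GIS requires the total feature count to be the same constant C for every
# outcome, so pad each row with a slack feature up to C.
C = F.sum(axis=1).max()
F = np.column_stack([F, C - F.sum(axis=1)])
empirical = np.append(empirical, C - empirical.sum())

lam = np.zeros(F.shape[1])          # log-parameters, lambda_i = log(alpha_i)
for _ in range(500):
    p = np.exp(F @ lam)
    p /= p.sum()                    # current model distribution p(x)
    expected = F.T @ p              # model expectations E_p[f_i]
    lam += np.log(empirical / expected) / C   # GIS update, written in log space

p = np.exp(F @ lam); p /= p.sum()
print("fitted p(x):", p)
print("model expectations:", F.T @ p, "target:", empirical)
```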
2000
- (McCallum et al., 2000a) ⇒ Andrew McCallum, Dayne Freitag, and Fernando Pereira. (2000). “Maximum Entropy Markov Models for Information Extraction and Segmentation.” In: Proceedings of ICML-2000.
- QUOTE: … The exponential models follow from a maximum entropy argument, and are trained by generalized iterative scaling (GIS) (Darroch & Ratcliff, 1972), which is similar in form and computational cost to the expectation-maximization (EM) algorithm (Dempster, Laird, & Rubin, 1977).
1997
- (Berger, 1997) ⇒ Adam Berger. (1997). “The Improved Iterative Scaling Algorithm: A Gentle Introduction.” Unpublished manuscript.
- QUOTE: This note concerns the improved iterative scaling algorithm for computing maximum-likelihood estimates of the parameters of exponential models. The algorithm was invented by members of the machine translation group at IBM's T.J. Watson Research Center in the early 1990s. The goal here is to motivate the improved iterative scaling algorithm for conditional models in a way that is as complete and self-contained as possible yet minimizes the mathematical burden on the reader.
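For reference, one standard statement of the per-feature IIS update for a conditional exponential model p_λ(y|x) ∝ exp(∑_i λ_i f_i(x,y)) is sketched below in our own notation (not copied from the manuscript): each iteration solves the following equation for δ_i and then sets λ_i ← λ_i + δ_i.

```latex
% Sketch of a standard IIS per-feature update (our notation, not quoted from Berger, 1997).
% Solve for \delta_i:
\sum_{x,y} \tilde{p}(x)\, p_{\lambda}(y \mid x)\, f_i(x,y)\, e^{\delta_i f^{\#}(x,y)}
  = \sum_{x,y} \tilde{p}(x,y)\, f_i(x,y),
\qquad \text{where } f^{\#}(x,y) = \sum_{i} f_i(x,y);
% then set \lambda_i \leftarrow \lambda_i + \delta_i.
```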
1996
- (Ratnaparkhi, 1996) ⇒ Adwait Ratnaparkhi. (1996). “A Maximum Entropy Model for Part-of-Speech Tagging.” In: Proceedings of EMNLP Conference (EMNLP 1996).
- QUOTE: … It can be shown (Darroch and Ratcliff, 1972) that if p has the form (1) and satisfies the k constraints (2), it uniquely maximizes the entropy H(p) over distributions that satisfy (2), and uniquely maximizes the likelihood L(p) over distributions of the form (1). The model parameters for the distribution p are obtained via Generalized Iterative Scaling (Darroch and Ratcliff, 1972).
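The equation numbers (1) and (2) refer to Ratnaparkhi's paper and are not reproduced in the quote. As a hedged sketch in generic notation, a conditional maximum-entropy model of this kind and its expectation constraints typically take the form shown below.

```latex
% Generic sketch of the model form and constraints referenced as (1) and (2);
% notation is ours, not reproduced verbatim from Ratnaparkhi (1996).
p(y \mid x) = \frac{1}{Z(x)} \prod_{j=1}^{k} \alpha_j^{\, f_j(x,y)},
\qquad
Z(x) = \sum_{y'} \prod_{j=1}^{k} \alpha_j^{\, f_j(x,y')},
% subject to the k expectation constraints
\sum_{x,y} \tilde{p}(x,y)\, f_j(x,y) = \sum_{x} \tilde{p}(x) \sum_{y} p(y \mid x)\, f_j(x,y),
\qquad j = 1, \dots, k.
```

Generalized Iterative Scaling then adjusts each α_j multiplicatively until the constraints are satisfied, which is the log-space update illustrated in the earlier sketch.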
1972
- (Darroch & Ratcliff, 1972) ⇒ John N. Darroch, and Douglas Ratcliff. (1972). “Generalized Iterative Scaling for Log-Linear Models.” In: The Annals of Mathematical Statistics, 43(5).
1959
- (Brown, 1959) ⇒ D. Brown. (1959). “A Note on Approximations to Discrete Probability Distributions.” In: Information and Control.