Iterative Scaling Algorithm
An Iterative Scaling Algorithm is an optimization algorithm that ...
- AKA: GIS.
- …
- Counter-Example(s):
- See: Maximum-Entropy Model, Optimization Algorithm, Scaling Algorithm, Limited-Memory Quasi-Newton/L-BFGS, Truncated Newton's Method.
References
2007
- (Minka, 2007) ⇒ Thomas P. Minka. (2007). “A Comparison of Numerical Optimizers for Logistic Regression.” Technical Report.
- QUOTE: Logistic regression is a workhorse of statistics and is closely related to methods used in Machine Learning, including the Perceptron and the Support Vector Machine. This note compares eight different algorithms for computing the maximum a-posteriori parameter estimate. A full derivation of each algorithm is given. In particular, a new derivation of Iterative Scaling is given which applies more generally than the conventional one. A new derivation is also given for the Modified Iterative Scaling algorithm of Collins et al. (2002). Most of the algorithms operate in the primal space, but can also work in dual space. All algorithms are compared in terms of computational complexity by experiments on large data sets. The fastest algorithms turn out to be conjugate gradient ascent and quasi-Newton algorithms, which far outstrip Iterative Scaling and its variants.
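For orientation, here is a minimal sketch of the setting the note studies: MAP estimation for logistic regression with a Gaussian prior, solved with a quasi-Newton (L-BFGS) optimizer, the family the note reports as fastest. The synthetic data, prior precision, and use of SciPy are assumptions for illustration, not Minka's implementation.

```python
# Minimal sketch (assumed data and settings, not Minka's code): MAP logistic
# regression with a Gaussian prior, optimized by the quasi-Newton method L-BFGS.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))            # synthetic design matrix (assumption)
w_true = rng.normal(size=5)
y = (rng.random(200) < 1.0 / (1.0 + np.exp(-X @ w_true))).astype(float)
alpha = 1.0                              # Gaussian prior precision (assumption)

def neg_log_posterior(w):
    z = X @ w
    # negative log-likelihood log(1 + e^z) - y*z, plus the Gaussian prior term
    return np.sum(np.logaddexp(0.0, z) - y * z) + 0.5 * alpha * w @ w

def gradient(w):
    p = 1.0 / (1.0 + np.exp(-(X @ w)))   # predicted probabilities
    return X.T @ (p - y) + alpha * w

result = minimize(neg_log_posterior, np.zeros(5), jac=gradient, method="L-BFGS-B")
print("MAP weight estimate:", result.x)
```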
2003
- (Jin et al., 2003) ⇒ Rong Jin, Rong Yan, Jian Zhang, and Alexander G. Hauptmann. (2003). “A Faster Iterative Scaling Algorithm for Conditional Exponential Model.” In: Proceedings of ICML 2003.
- QUOTE: To find the optimal conditional exponential model for given training data, two groups of approaches have been used in the past research. One is named iterative scaling approach (Brown, 1959), including the Generalized Iterative Scaling (GIS) (Darroch & Ratcliff, 1972) and the Improved Iterative Scaling (IIS) (Berger, 1997). The underlying idea for iterative scaling approaches is similar to the idea of Expectation-Maximization (EM) approach: by approximating the log-likelihood function of the conditional exponential model as some kind of ‘simple’ auxiliary function, the iterative scaling methods are able to decouple the correlation between the parameters and the search for the maximum point can be operated along many directions simultaneously. By carrying out this procedure iteratively, the approximated optimal point found over the ‘simplified’ function is guaranteed to converge to the true optimal point due to the convexity of the objective function. The distinction between GIS and IIS is that the GIS method requires the sum of input features to be a constant over all the examples while the IIS method doesn’t.
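The GIS idea summarized in the quote (matching empirical and model feature expectations via multiplicative updates, with a constant feature sum enforced by a slack feature) can be illustrated with a toy sketch. This is an assumed, unconditional log-linear example, not code from Jin et al.; the outcomes, features, and target expectations are hypothetical.

```python
# Toy GIS sketch (hypothetical data): fit a log-linear model
# p(x) proportional to exp(sum_i lambda_i * f_i(x)) over three outcomes so that
# model feature expectations match the empirical ones.
import numpy as np

F = np.array([[1.0, 0.0],          # feature values f_i(x) for each outcome x
              [0.0, 1.0],
              [1.0, 1.0]])
empirical = np.array([0.5, 0.7])   # assumed empirical expectations E~[f_i]

# GIS requires the total feature count to be the same constant C for every
# outcome, so pad each row with a slack feature up to C.
C = F.sum(axis=1).max()
F = np.column_stack([F, C - F.sum(axis=1)])
empirical = np.append(empirical, C - empirical.sum())

lam = np.zeros(F.shape[1])          # log-parameters, lambda_i = log(alpha_i)
for _ in range(500):
    p = np.exp(F @ lam)
    p /= p.sum()                    # current model distribution p(x)
    expected = F.T @ p              # model expectations E_p[f_i]
    lam += np.log(empirical / expected) / C   # GIS update, written in log space

p = np.exp(F @ lam); p /= p.sum()
print("fitted p(x):", p)
print("model expectations:", F.T @ p, "target:", empirical)
```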
2000
- (McCallum et al., 2000a) ⇒ Andrew McCallum, Dayne Freitag, and Fernando Pereira. (2000). “Maximum Entropy Markov Models for Information Extraction and Segmentation.” In: Proceedings of ICML-2000.
- QUOTE: … The exponential models follow from a maximum entropy argument, and are trained by generalized iterative scaling (GIS) (Darroch & Ratcliff, 1972), which is similar in form and computational cost to the expectation-maximization (EM) algorithm (Dempster, Laird, & Rubin, 1977).
1997
- (Berger, 1997) ⇒ Adam Berger. (1997). “The Improved Iterative Scaling Algorithm: A Gentle Introduction.” Unpublished manuscript.
- QUOTE: This note concerns the improved iterative scaling algorithm for computing maximum-likelihood estimates of the parameters of exponential models. The algorithm was invented by members of the machine translation group at IBM's T.J. Watson Research Center in the early 1990s. The goal here is to motivate the improved iterative scaling algorithm for conditional models in a way that is as complete and self-contained as possible yet minimizes the mathematical burden on the reader.
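For reference, one standard statement of the per-feature IIS update for a conditional exponential model p_λ(y|x) ∝ exp(∑_i λ_i f_i(x,y)) is sketched below in our own notation (not copied from the manuscript): each iteration solves the following equation for δ_i and then sets λ_i ← λ_i + δ_i.

```latex
% Sketch of a standard IIS per-feature update (our notation, not quoted from Berger, 1997).
% Solve for \delta_i:
\sum_{x,y} \tilde{p}(x)\, p_{\lambda}(y \mid x)\, f_i(x,y)\, e^{\delta_i f^{\#}(x,y)}
  = \sum_{x,y} \tilde{p}(x,y)\, f_i(x,y),
\qquad \text{where } f^{\#}(x,y) = \sum_{i} f_i(x,y);
% then set \lambda_i \leftarrow \lambda_i + \delta_i.
```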
1996
- (Ratnaparkhi, 1996) ⇒ Adwait Ratnaparkhi. (1996). “A Maximum Entropy Model for Part-of-Speech Tagging.” In: Proceedings of EMNLP Conference (EMNLP 1996).
- QUOTE: … It can be shown (Darroch and Ratcliff, 1972) that if p has the form (1) and satisfies the k constraints (2), it uniquely maximizes the entropy H(p) over distributions that satisfy (2), and uniquely maximizes the likelihood L(p) over distributions of the form (1). The model parameters for the distribution p are obtained via Generalized Iterative Scaling (Darroch and Ratcliff, 1972).
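The equation numbers (1) and (2) refer to Ratnaparkhi's paper and are not reproduced in the quote. As a hedged sketch in generic notation, a conditional maximum-entropy model of this kind and its expectation constraints typically take the form shown below.

```latex
% Generic sketch of the model form and constraints referenced as (1) and (2);
% notation is ours, not reproduced verbatim from Ratnaparkhi (1996).
p(y \mid x) = \frac{1}{Z(x)} \prod_{j=1}^{k} \alpha_j^{\, f_j(x,y)},
\qquad
Z(x) = \sum_{y'} \prod_{j=1}^{k} \alpha_j^{\, f_j(x,y')},
% subject to the k expectation constraints
\sum_{x,y} \tilde{p}(x,y)\, f_j(x,y) = \sum_{x} \tilde{p}(x) \sum_{y} p(y \mid x)\, f_j(x,y),
\qquad j = 1, \dots, k.
```

Generalized Iterative Scaling then adjusts each α_j multiplicatively until the constraints are satisfied, which is the log-space update illustrated in the earlier sketch.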
1972
- (Darroch & Ratcliff, 1972) ⇒ John N. Darroch, and Douglas Ratcliff. (1972). “Generalized Iterative Scaling for Log-Linear Models.” In: The Annals of Mathematical Statistics, 43(5).
1959
- (Brown, 1959) ⇒ D. Brown. (1959). “A Note on Approximations to Discrete Probability Distributions.” In: Information and Control.