1997 MutualInformationMetricEntropya

From GM-RKB

Subject Headings: Cumulative Mutual Information, Cumulative Relative Entropy Risk

Notes

Cited By

Quotes

Author Keywords

Mutual information, Hellinger distance, relative entropy, metric entropy, minimax risk, Bayes risk, density estimation, Kullback-Leibler distance.

Abstract

Assume [math]\displaystyle{ \{P_θ:θ \in Θ\} }[/math] is a set of probability distributions with a common dominating measure on a complete separable metric space [math]\displaystyle{ Y }[/math]. A state [math]\displaystyle{ θ^∗ \in Θ }[/math] is chosen by Nature. A statistician obtains [math]\displaystyle{ n }[/math] independent observations [math]\displaystyle{ Y_1,…,Y_n }[/math] from [math]\displaystyle{ Y }[/math] distributed according to [math]\displaystyle{ P_{θ^∗} }[/math]. For each time t between 1 and n, based on the observations [math]\displaystyle{ Y_1,…,Y_{t−1} }[/math], the statistician produces an estimated distribution [math]\displaystyle{ \hat{P}_t }[/math] for [math]\displaystyle{ P_{θ^∗} }[/math] and suffers a loss [math]\displaystyle{ L(P_{θ^∗},\hat{P}_t) }[/math]. The cumulative risk for the statistician is the average total loss up to time n. Of special interest in information theory, data compression, mathematical finance, computational learning theory and statistical mechanics is the special case when the loss [math]\displaystyle{ L(P_{θ^∗},\hat{P}_t) }[/math] is the relative entropy between the true distribution [math]\displaystyle{ P_{θ^∗} }[/math] and the estimated distribution [math]\displaystyle{ \hat{P}_t }[/math]. Here the cumulative Bayes risk from time 1 to n is the mutual information between the random parameter [math]\displaystyle{ Θ^∗ }[/math] and the observations [math]\displaystyle{ Y_1,…,Y_n }[/math].
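
In this notation, with [math]\displaystyle{ D(\cdot\,\|\,\cdot) }[/math] denoting relative entropy (Kullback-Leibler distance), the cumulative risk of a strategy at a fixed parameter can be written as follows; the symbol [math]\displaystyle{ R_n }[/math] is introduced here only for convenience and is not notation from the quoted abstract:

[math]\displaystyle{ R_n(θ^*) = \sum_{t=1}^{n} E_{θ^*}\left[ D\left(P_{θ^*} \,\big\|\, \hat{P}_t\right) \right], }[/math]

and the identity stated for the cumulative Bayes risk under a prior [math]\displaystyle{ μ }[/math] on [math]\displaystyle{ Θ }[/math] reads

[math]\displaystyle{ \inf_{\hat{P}_1,\dots,\hat{P}_n} \int_{Θ} R_n(θ)\, dμ(θ) = I(Θ^*; Y_1,\dots,Y_n), }[/math]

where the infimum is over all sequential estimation strategies.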

New bounds on this mutual information are given in terms of the Laplace transform of the Hellinger distance between pairs of distributions indexed by parameters in [math]\displaystyle{ Θ }[/math]. From these, bounds on the cumulative minimax risk are given in terms of the metric entropy of [math]\displaystyle{ Θ }[/math] with respect to the Hellinger distance. The assumptions required for these bounds are very general and do not depend on the choice of the dominating measure. They apply to both finite- and infinite-dimensional [math]\displaystyle{ Θ }[/math]. They apply in some cases where [math]\displaystyle{ Y }[/math] is infinite-dimensional, in some cases where [math]\displaystyle{ Y }[/math] is not compact, in some cases where the distributions are not smooth and in some parametric cases where asymptotic normality of the posterior distribution fails.
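
For reference, one common normalization of the quantities named above (standard definitions, not the paper's specific bounds): if [math]\displaystyle{ p_θ }[/math] and [math]\displaystyle{ p_{θ'} }[/math] are densities of [math]\displaystyle{ P_θ }[/math] and [math]\displaystyle{ P_{θ'} }[/math] with respect to a dominating measure [math]\displaystyle{ λ }[/math], the Hellinger distance is

[math]\displaystyle{ d_H(P_θ, P_{θ'}) = \left( \int_Y \left( \sqrt{p_θ} - \sqrt{p_{θ'}} \right)^2 dλ \right)^{1/2}, }[/math]

whose value does not depend on the choice of [math]\displaystyle{ λ }[/math], and the metric entropy of [math]\displaystyle{ Θ }[/math] at scale [math]\displaystyle{ ε }[/math] is [math]\displaystyle{ \log N(ε) }[/math], where [math]\displaystyle{ N(ε) }[/math] is the smallest number of balls of radius [math]\displaystyle{ ε }[/math] under [math]\displaystyle{ d_H }[/math] needed to cover [math]\displaystyle{ Θ }[/math].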

1. Introduction

Much of classical statistics has been concerned with the estimation of probability distributions from independent and identically distributed observations drawn according to these distributions. If we let [math]\displaystyle{ P_{θ^*} }[/math] denote the true distribution generating the observations and [math]\displaystyle{ \hat{P}_t }[/math] the estimated distribution obtained after seeing [math]\displaystyle{ t−1 }[/math] independent observations, then the success of our statistical procedure can be defined in terms of a loss function that measures the difference between the true distribution [math]\displaystyle{ P_{θ^*} }[/math] and the estimated distribution [math]\displaystyle{ \hat{P}_t }[/math]. One such loss function has proven to be of importance in several fields, including information theory, data compression, mathematical finance, computational learning theory and statistical mechanics. This is the relative entropy function. Further, in these fields, special importance is given to the cumulative relative entropy loss suffered in a sequential estimation setting, in which there are [math]\displaystyle{ n }[/math] total observations, but these observations arrive one at a time, and at each time [math]\displaystyle{ t }[/math] a new, refined estimate [math]\displaystyle{ \hat{P}_t }[/math] is made for the unknown true distribution [math]\displaystyle{ P_{θ^*} }[/math], based on the [math]\displaystyle{ t−1 }[/math] previous observations. This is the setting that we study in this paper.
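
As a concrete illustration of this sequential setting, the following minimal sketch simulates the protocol for a one-parameter Bernoulli family. The add-one (Laplace) rule and all function names here are illustrative assumptions, not the estimation strategies analyzed in the paper.

<pre>
# Sketch of the sequential estimation protocol with cumulative relative
# entropy loss, for the Bernoulli family {P_theta : theta in (0, 1)}.
import numpy as np

def kl_bernoulli(p, q):
    """Relative entropy D(P_p || P_q) between two Bernoulli distributions."""
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def laplace_estimate(observations):
    """Add-one (Laplace) estimate of the success probability from past data."""
    return (sum(observations) + 1) / (len(observations) + 2)

def cumulative_loss(theta_star, n, rng):
    """Total loss: sum over t of D(P_theta* || P_hat_t) along one sample path."""
    ys, total = [], 0.0
    for t in range(1, n + 1):
        theta_hat = laplace_estimate(ys)      # estimate from Y_1, ..., Y_{t-1}
        total += kl_bernoulli(theta_star, theta_hat)
        ys.append(rng.random() < theta_star)  # then observe Y_t ~ P_theta*
    return total

rng = np.random.default_rng(0)
# Averaging over sample paths approximates the cumulative risk at theta* = 0.3.
print(np.mean([cumulative_loss(0.3, 200, rng) for _ in range(100)]))
</pre>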

The average of the cumulative loss over all sequences of [math]\displaystyle{ n }[/math] observations generated according to the true distribution is the ([[cumulative relative entropy]]) risk. For a given family [math]\displaystyle{ \{P_θ:θ \in Θ\} }[/math] of distributions, two types of risk are of interest in statistics. One is the minimax risk, which is the minimum worst-case risk over possible true distributions [math]\displaystyle{ P_{θ^*} }[/math], where [math]\displaystyle{ θ^* \in Θ }[/math], and the minimum is over all possible sequential estimation strategies. The other is the Bayes risk, which is the minimum average-case risk over possible true distributions [math]\displaystyle{ P_{θ^*} }[/math] drawn according to a prior distribution [math]\displaystyle{ μ }[/math] on [math]\displaystyle{ Θ }[/math], and the minimum is again over all possible sequential estimation strategies. For cumulative relative entropy loss, the Bayes risk has a fundamental information-theoretic interpretation: it is the mutual information between a random variable representing the choice of the parameter [math]\displaystyle{ θ^* }[/math] of the true distribution and the random variable given by the [math]\displaystyle{ n }[/math] observations; see [15], [23] and [35]. This provides a beautiful connection between information theory and statistics.
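
In symbols, again using the [math]\displaystyle{ R_n }[/math] notation introduced above as a sketch, the two risks are

[math]\displaystyle{ r_n^{\mathrm{minimax}} = \inf_{\hat{P}_1,\dots,\hat{P}_n} \sup_{θ \in Θ} R_n(θ), \qquad r_n^{\mathrm{Bayes}}(μ) = \inf_{\hat{P}_1,\dots,\hat{P}_n} \int_{Θ} R_n(θ)\, dμ(θ), }[/math]

and for cumulative relative entropy loss [math]\displaystyle{ r_n^{\mathrm{Bayes}}(μ) = I(Θ^*; Y_1,\dots,Y_n) }[/math] when [math]\displaystyle{ Θ^* }[/math] is drawn according to [math]\displaystyle{ μ }[/math].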

This connection also extends to other fields, as is discussed in [6] and [15]. In data compression, the cumulative relative entropy risk is the redundancy, which is the expected excess code length for the best adaptive coding method, as compared to the best coding method that has prior knowledge of the true distribution; see [15], [28] and [42]. The minimax risk is called the "information" channel capacity in [18], page 184. …
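
The coding interpretation can be made explicit under the usual idealization that a distribution [math]\displaystyle{ Q }[/math] assigns code length [math]\displaystyle{ -\log Q(y) }[/math] to outcome [math]\displaystyle{ y }[/math] (a standard observation rather than a quotation from the paper): if the adaptive code encodes each [math]\displaystyle{ Y_t }[/math] using the predictive distribution [math]\displaystyle{ \hat{P}_t }[/math], its expected excess code length over a code built from [math]\displaystyle{ P_{θ^*} }[/math] is

[math]\displaystyle{ \sum_{t=1}^{n} E_{θ^*}\left[ \log \frac{P_{θ^*}(Y_t)}{\hat{P}_t(Y_t)} \right] = \sum_{t=1}^{n} E_{θ^*}\left[ D\left(P_{θ^*} \,\big\|\, \hat{P}_t\right) \right] = R_n(θ^*), }[/math]

which is exactly the cumulative relative entropy risk.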

References

David Haussler, and Manfred Opper. (1997). "Mutual Information, Metric Entropy and Cumulative Relative Entropy Risk."