Growth Function
A Growth Function is a Statistical Learning Theory measure that quantifies the richness of a Set Family.
- AKA: Shattering Coefficient, Shattering Number.
- Context:
- It can be used to measure the complexity of a hypothesis class.
- …
- Example(s):
- "Let [math]\displaystyle{ X }[/math] be a straight line and let [math]\displaystyle{ S }[/math] be the set of all rays of the form [math]\displaystyle{ x \leq a }[/math]. In this case, the growth function is: [math]\displaystyle{ m^S(r) r + 1 }[/math].". (In: Vapnik & Chervonenkis, 1971).
- …
- Counter-Example(s):
- See: Machine Learning, Set Family, Statistical Learning Theory, Shattered Set, Vapnik-Chervonenkis Class, Vapnik-Chervonenkis Dimension, Sauer's Lemma, Massart's Lemma.
References
2019a
- (Wikipedia, 2019) ⇒ https://en.wikipedia.org/wiki/Growth_function Retrieved:2019-12-6.
- The growth function, also called the shatter coefficient or the shattering number, measures the richness of a set family. It is especially used in the context of statistical learning theory, where it measures the complexity of a hypothesis class.
The term 'growth function' was coined by Vapnik and Chervonenkis in their 1968 paper, where they also proved many of its properties.[1]
It is a basic concept in machine learning.[2]
- ↑ Vapnik, V. N.; Chervonenkis, A. Ya. (1971). "On the Uniform Convergence of Relative Frequencies of Events to Their Probabilities". Theory of Probability & Its Applications. 16 (2): 264. doi:10.1137/1116025. This is an English translation, by B. Seckler, of the Russian paper: "On the Uniform Convergence of Relative Frequencies of Events to Their Probabilities". Dokl. Akad. Nauk. 181 (4): 781. 1968. The translation was reproduced as: Vapnik, V. N.; Chervonenkis, A. Ya. (2015). "On the Uniform Convergence of Relative Frequencies of Events to Their Probabilities". Measures of Complexity. p. 11. doi:10.1007/978-3-319-21852-6_3. ISBN 978-3-319-21851-9.
- ↑ Mohri, Mehryar; Rostamizadeh, Afshin; Talwalkar, Ameet (2012). Foundations of Machine Learning. USA, Massachusetts: MIT Press. ISBN 9780262018258., especially Section 3.2
- ↑ Shalev-Shwartz, Shai; Ben-David, Shai (2014). Understanding Machine Learning – from Theory to Algorithms. Cambridge University Press. ISBN 9781107057135.
2019b
- (Wikipedia, 2019) ⇒ https://en.wikipedia.org/wiki/Shattered_set#Shatter_coefficient Retrieved:2019-12-6.
- To quantify the richness of a collection C of sets, we use the concept of shattering coefficients (also known as the growth function). For a collection C of sets [math]\displaystyle{ s \subset \Omega }[/math] , [math]\displaystyle{ \Omega }[/math] being any space, often a sample space, and [math]\displaystyle{ x_1,x_2,\dots,x_n \in \Omega }[/math] being any set of n points, we define
the nth shattering coefficient of C as: [math]\displaystyle{ S_C(n) = \max_{\forall x_1,x_2,\dots,x_n \in \Omega } \operatorname{card} \{\,\{\,x_1,x_2,\dots,x_n\}\cap s, s\in C \} }[/math] where [math]\displaystyle{ \operatorname{card} }[/math] denotes the cardinality of the set. [math]\displaystyle{ S_C(n) }[/math] is the largest number of subsets of any set A of n points that can be formed by intersecting A with the sets in collection C.
Here are some facts about [math]\displaystyle{ S_C(n) }[/math] :
- [math]\displaystyle{ S_C(n)\leq 2^n }[/math] for all n because [math]\displaystyle{ \{s\cap A|s\in C\}\subseteq P(A) }[/math] for any [math]\displaystyle{ A\subseteq \Omega }[/math] .
- If [math]\displaystyle{ S_C(n)=2^n }[/math], that means there is a set of cardinality n, which can be shattered by C.
- If [math]\displaystyle{ S_C(N)\lt 2^N }[/math] for some [math]\displaystyle{ N\gt 1 }[/math] then [math]\displaystyle{ S_C(n)\lt 2^n }[/math] for all [math]\displaystyle{ n\geq N }[/math] .
- The third property means that if C cannot shatter any set of cardinality N then it cannot shatter sets of larger cardinalities.
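The brute-force sketch below illustrates this definition (it is not from the quoted article; the finite ground set, the interval collection, and the function name are illustrative assumptions). It enumerates candidate n-point samples and counts the distinct intersections the collection induces, confirming the bound [math]\displaystyle{ S_C(n)\leq 2^n }[/math]:

```python
from itertools import combinations

def shattering_coefficient(collection, points, n):
    """Brute-force n-th shattering coefficient S_C(n) of a finite
    collection of sets, with candidate samples drawn from `points`."""
    best = 0
    for sample in combinations(points, n):          # candidate n-point samples
        traces = {frozenset(set(sample) & s) for s in collection}
        best = max(best, len(traces))               # distinct induced intersections
    return best

# Illustrative collection: all discrete intervals [a, b] over a small ground set.
points = range(6)
collection = [set(range(a, b + 1)) for a in points for b in points if a <= b]
for n in range(1, 5):
    S_n = shattering_coefficient(collection, points, n)
    print(n, S_n, S_n <= 2 ** n)   # S_C(n) <= 2^n holds; equality fails from n = 3 on
```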
2018a
- (de Mello et al., 2018) ⇒ Rodrigo Fernandes de Mello, Moacir Antonelli Ponti, and Carlos Henrique Grossi Ferreira (2018). "Computing the Shattering Coefficient of Supervised Learning Algorithms". arXiv preprint arXiv:1805.02627.
- QUOTE: The Shattering coefficient [math]\displaystyle{ \mathcal{N}(F, 2n) }[/math] of any h-dimensional Hilbert space [math]\displaystyle{ \mathcal{H} }[/math] being classified with a single [math]\displaystyle{ (h-1) }[/math]-dimensional hyperplane is: [math]\displaystyle{ \mathcal{N} (F, 2n) = \displaystyle 2\sum^h_{i=0} \binom{n - 1}{i}, \quad\quad (2) }[/math]
for a generalized data organization with sample size equal to [math]\displaystyle{ n }[/math].
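The closed form quoted above can be evaluated directly; the minimal sketch below is illustrative only (the function name and the example values of h and n are assumptions, not taken from the paper):

```python
from math import comb

def hyperplane_shattering_coefficient(h, n):
    """Closed form 2 * sum_{i=0}^{h} C(n-1, i) quoted above, for a single
    (h-1)-dimensional separating hyperplane in an h-dimensional space."""
    return 2 * sum(comb(n - 1, i) for i in range(h + 1))

# Example: linear separators in the plane (h = 2) on n = 10 sample points.
print(hyperplane_shattering_coefficient(2, 10))   # 2 * (1 + 9 + 36) = 92
```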
2018b
- (Rebeschini, 2018) ⇒ Patrick Rebeschini (2018). "VC Dimension. Covering and Packing Numbers - Lecture 4". In: Algorithmic Foundations of Learning. University of Oxford, Michaelmas Term 2018.
- QUOTE: Definition 4.2 The growth function of [math]\displaystyle{ \mathcal{A} }[/math] is defined as follows, for any integer [math]\displaystyle{ n \geq 1 }[/math]: [math]\displaystyle{ \tau_{\mathcal{A}}(n) := \underset{x\in \mathcal{X}^n}{\sup}{|\mathcal{A} \circ x|} }[/math]
The quantity [math]\displaystyle{ \tau_{\mathcal{A}}(n) }[/math] is the maximal cardinality of the set of distinct labelings of [math]\displaystyle{ n }[/math] points in [math]\displaystyle{ \mathcal{X} }[/math] that can be obtained using classifiers from [math]\displaystyle{ \mathcal{A} }[/math]. As the growth function of [math]\displaystyle{ \mathcal{A} }[/math] is finite, an immediate application of Massart's lemma, Lemma 2.9, shows that [math]\displaystyle{ \tau_{\mathcal{A}}(n) }[/math] can be used to upper bound the Rademacher complexity.
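Spelling out the Massart's Lemma step referred to above (a sketch under the assumption that classifiers take values in [math]\displaystyle{ \{-1,+1\} }[/math]; the constants follow the standard finite-class argument and may differ from the exact statement in the quoted notes): writing [math]\displaystyle{ \widehat{\mathrm{Rad}}(\mathcal{A} \circ x) }[/math] for the empirical Rademacher complexity of the set of labelings, since [math]\displaystyle{ |\mathcal{A} \circ x| \leq \tau_{\mathcal{A}}(n) }[/math] and every labeling vector has Euclidean norm [math]\displaystyle{ \sqrt{n} }[/math], Massart's lemma gives [math]\displaystyle{ \widehat{\mathrm{Rad}}(\mathcal{A} \circ x) \leq \sqrt{\dfrac{2 \log \tau_{\mathcal{A}}(n)}{n}} }[/math].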
2018c
- (Ma et al., 2018) ⇒ Tengyu Ma (Lecturer), Bo Liu, Zhaozhuo Xu (October 22, 2018). "CS229T/STATS231: Statistical Learning Theory - Lecture #9". In: CS229T/STATS231: Statistical Learning Theory, Stanford University, Autumn 2018-2019.
- QUOTE: Here [math]\displaystyle{ \mathcal{H} }[/math] is the hypothesis set with binary outcomes and [math]\displaystyle{ \mathcal{F} }[/math] is the family of loss functions associated to [math]\displaystyle{ \mathcal{H} }[/math]. Let [math]\displaystyle{ Q_{z_1,\cdots,z_n} = \{ \left(f(z_1), \cdots , f(z_n) \right) \in \R^n : f \in \mathcal{F}\} }[/math]
be all possible outputs from [math]\displaystyle{ \mathcal{F} }[/math] on the input sequence [math]\displaystyle{ z_1, z_2, \cdots , z_n }[/math], where [math]\displaystyle{ z_i = (x_i, y_i) }[/math].
(...)
Definition 2 (Growth Function). The growth function, a.k.a. the shattering coefficient, is
[math]\displaystyle{ S(\mathcal{F}, n) = \underset{z_1,\cdots,z_n}{\max}{|Q_{z_1,\cdots,z_n}|} }[/math] with [math]\displaystyle{ Q }[/math] defined as above.
2017a
- (Sammut & Webb, 2017) ⇒ (2017). "Shattering Coefficient". In: Sammut & Webb (2017).
- QUOTE: The shattering coefficient [math]\displaystyle{ \mathcal{S_F}(n) }[/math] is a function that measures the size of a function class [math]\displaystyle{ \mathcal{F} }[/math] when its functions [math]\displaystyle{ f:\mathcal{X} \to \R }[/math] are restricted to sets of points [math]\displaystyle{ \mathbf{x} = (x_1 , \cdots, x_n) \in \mathcal{X}^n }[/math] of size [math]\displaystyle{ n }[/math]. Specifically, for each [math]\displaystyle{ n \in \N }[/math] the shattering coefficient is the maximum size of the set of vectors [math]\displaystyle{ \mathcal{F}_{\mathbf{x}} = \{\left(f(x_1), \cdots, f (x_n)\right): f \in \mathcal{F} \} \subset \R^n }[/math] that can be realized for some choice of [math]\displaystyle{ \mathbf{x} \in \mathcal{X}^n }[/math]. That is, [math]\displaystyle{ \mathcal{S_F}(n)=\displaystyle\underset{\mathbf{x}\in \mathcal{X}^n}{\max}|\mathcal{F}_{\mathbf{x}}| }[/math]
The shattering coefficient of a hypothesis class [math]\displaystyle{ \mathcal{H} }[/math] is used in generalization bounds as an analogue to the class's size in the finite case.
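One standard way the shattering coefficient enters such generalization bounds is the Vapnik-Chervonenkis inequality; the form below is an illustrative sketch, not taken from the quoted encyclopedia entry, and the constants vary across references: for a class [math]\displaystyle{ \mathcal{H} }[/math] of binary classifiers with true risk [math]\displaystyle{ R(h) }[/math] and empirical risk [math]\displaystyle{ \hat{R}_n(h) }[/math] on a sample of size [math]\displaystyle{ n }[/math], [math]\displaystyle{ P\left( \underset{h \in \mathcal{H}}{\sup}\, |\hat{R}_n(h) - R(h)| \gt \epsilon \right) \leq 8\, \mathcal{S_H}(n)\, e^{-n\epsilon^2/32} }[/math], so the shattering coefficient [math]\displaystyle{ \mathcal{S_H}(n) }[/math] plays the role that [math]\displaystyle{ |\mathcal{H}| }[/math] plays for finite classes.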
2017b
- (Sammut & Webb, 2017) ⇒ (2017). "Growth Function". In: Sammut & Webb (2017).
- QUOTE: Shattering Coefficient.
2014
- (Scott et al., 2014) ⇒ Clayton Scott (Lecturer), Srinagesh Sharma, Scott Reed, and Petter Nilsson (2014). "Vapnik-Chervonenkis Theory". In: EECS 598: Statistical Learning Theory, Winter 2014.
- QUOTE: Let [math]\displaystyle{ \mathcal{H} \subseteq \{0, 1\}^{\mathcal{X}} }[/math]. For [math]\displaystyle{ x_1, x_2, \cdots , x_n \in \mathcal{X} }[/math] denote [math]\displaystyle{ N_{\mathcal{H}}(x_1, \cdots , x_n) := |\{(h(x_1), \cdots , h(x_n)) : h \in \mathcal{H}\}| }[/math]
Clearly [math]\displaystyle{ N_{\mathcal{H}}(x_1, \cdots, x_n) \leq 2^n }[/math]. The nth shatter coefficient is defined as
[math]\displaystyle{ S_{\mathcal{H}}(n) := \displaystyle \underset{x_1,\cdots,x_n \in \mathcal{X}}{\max}\, N_{\mathcal{H}}(x_1, \cdots , x_n) }[/math] If [math]\displaystyle{ S_{\mathcal{H}}(n) = 2^n }[/math], then [math]\displaystyle{ \exists\; x_1, \cdots , x_n }[/math] such that
[math]\displaystyle{ N_{\mathcal{H}}(x_1, \cdots , x_n) = 2^n }[/math] and we say that [math]\displaystyle{ \mathcal{H} }[/math] shatters [math]\displaystyle{ x_1, \cdots , x_n }[/math].
Note. The shatter coefficient is sometimes called the growth function in the literature.
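A direct check of this shattering condition can be sketched as follows (a minimal illustration; the threshold hypothesis class, the grid of thresholds, and the function name are assumptions for the example, not taken from the lecture notes):

```python
def shatters(hypotheses, points):
    """Return True iff the class realizes all 2^n labelings of `points`,
    i.e. N_H(x_1, ..., x_n) = 2^n in the notation above."""
    labelings = {tuple(h(x) for x in points) for h in hypotheses}
    return len(labelings) == 2 ** len(points)

# Threshold classifiers h_a(x) = 1 iff x <= a, over a small grid of thresholds a.
thresholds = [i / 10 for i in range(-10, 11)]
hypotheses = [lambda x, a=a: int(x <= a) for a in thresholds]

print(shatters(hypotheses, [0.5]))        # True: any single point can be shattered
print(shatters(hypotheses, [0.2, 0.7]))   # False: the labeling (0, 1) is unrealizable by thresholds
```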
2013
- (Bartlett, 2013) ⇒ Peter Bartlett (2013). "Theoretical Statistics. Lecture 10.". In: Stat 210B Spring 2013, UC Berkeley.
- QUOTE: Definition: For a class [math]\displaystyle{ F \subseteq \{0, 1\}^{\mathcal{X}} }[/math], the growth function is [math]\displaystyle{ \Pi_F (n) = \max\{|F(x^n_1 )| : x_1, \cdots , x_n \in \mathcal{X}\}. }[/math]
1971
- (Vapnik & Chervonenkis, 1971) ⇒ Vladimir N. Vapnik, and A. Ya. Chervonenkis (1971 - 2015). "On The Uniform Convergence Of Relative Frequencies Of Events To Their Probabilities". In: Theory Probab. Appl., 16(2) (1971). DOI:10.1137/1116025. Translated and Republished In: Measures of complexity. Springer, Cham (2015). DOI:10.1007/978-3-319-21852-6_3.
- QUOTE: Let [math]\displaystyle{ X_r= x_1,\cdots, x_r }[/math] be a finite sample of elements in [math]\displaystyle{ X }[/math]. Each set [math]\displaystyle{ A }[/math] in [math]\displaystyle{ S }[/math] determines in this sample a subsample [math]\displaystyle{ X_r^A = x_{i_1},\cdots, x_{i_k} }[/math] consisting of those terms of the sample [math]\displaystyle{ X_r }[/math] which belong to [math]\displaystyle{ A }[/math]. We shall say that the set [math]\displaystyle{ A }[/math] induces the subsample [math]\displaystyle{ X_r^A }[/math] in the sample [math]\displaystyle{ X_r }[/math]. We denote the set of all different subsamples induced by the sets of [math]\displaystyle{ S }[/math] in the sample [math]\displaystyle{ X_r }[/math] by [math]\displaystyle{ S(x_1,\cdots, x_r) }[/math] or [math]\displaystyle{ S(X_r) }[/math]. The number of different subsamples of the sample [math]\displaystyle{ X_r }[/math] induced by the sets in [math]\displaystyle{ S }[/math] will be termed the index of the system [math]\displaystyle{ S }[/math] with respect to the sample [math]\displaystyle{ x_1,\cdots, x_r }[/math] and will be denoted by [math]\displaystyle{ \bigtriangleup^S(x_1, \cdots, x_r) }[/math]. Obviously, [math]\displaystyle{ \bigtriangleup^S(x_1,\cdots, x_r) }[/math] is always at most [math]\displaystyle{ 2^r }[/math]. The function [math]\displaystyle{ m^S(r) = \max \bigtriangleup^S(x_1,\cdots, x_r) }[/math],
where the maximum is taken over all samples of size [math]\displaystyle{ r }[/math], will be called the growth function.
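The ray example quoted in the Example(s) section above can be checked numerically against this definition. The sketch below (the function name and the finite grid standing in for the real line are illustrative assumptions) enumerates the subsamples induced by rays [math]\displaystyle{ x \leq a }[/math] on samples of size [math]\displaystyle{ r }[/math] and recovers [math]\displaystyle{ m^S(r) = r + 1 }[/math]:

```python
from itertools import combinations

def ray_growth_function(r, grid):
    """Brute-force m^S(r) for the family of rays {x <= a}: the maximum, over
    r-point samples from `grid`, of the number of distinct induced subsamples."""
    best = 0
    for sample in combinations(sorted(grid), r):
        # Each threshold a induces the subsample {x in sample : x <= a};
        # sweeping a over the grid enumerates all non-empty induced subsamples.
        induced = {tuple(x for x in sample if x <= a) for a in grid}
        induced.add(())                      # a below all sample points induces the empty subsample
        best = max(best, len(induced))
    return best

grid = [i / 10 for i in range(20)]           # finite stand-in for the real line
for r in range(1, 6):
    print(r, ray_growth_function(r, grid))   # prints r + 1 for each r
```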