Gradient Boosting-based Tree (GBT) Learning Algorithm


A Gradient Boosting-based Tree (GBT) Learning Algorithm is a decision tree learning algorithm that is also a gradient boosting-based learning algorithm, and it can be implemented by a Gradient Boosted Decision Tree Learning System (to solve a gradient boosted decision tree learning task).



References

2017

  • http://scikit-learn.org/stable/auto_examples/ensemble/plot_feature_transformation.html
    • QUOTE: Transform your features into a higher dimensional, sparse space. Then train a linear model on these features.

      First fit an ensemble of trees (totally random trees, a random forest, or gradient boosted trees) on the training set. Then each leaf of each tree in the ensemble is assigned a fixed arbitrary feature index in a new feature space. These leaf indices are then encoded in a one-hot fashion.
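
      A minimal sketch of this pipeline, assuming scikit-learn's GradientBoostingClassifier, OneHotEncoder, and LogisticRegression; the dataset and parameter values below are illustrative, not taken from the cited example:

        from sklearn.datasets import make_classification
        from sklearn.ensemble import GradientBoostingClassifier
        from sklearn.linear_model import LogisticRegression
        from sklearn.model_selection import train_test_split
        from sklearn.preprocessing import OneHotEncoder

        X, y = make_classification(n_samples=2000, random_state=0)
        # Use separate splits for the tree ensemble and the linear model.
        X_tree, X_lin, y_tree, y_lin = train_test_split(X, y, test_size=0.5, random_state=0)

        # 1. Fit an ensemble of gradient boosted trees on the first split.
        gbt = GradientBoostingClassifier(n_estimators=50, max_depth=3, random_state=0)
        gbt.fit(X_tree, y_tree)

        # 2. apply() maps each sample to the index of the leaf it reaches in each tree;
        #    one-hot encoding those leaf indices gives a high-dimensional sparse space.
        enc = OneHotEncoder(handle_unknown="ignore")
        enc.fit(gbt.apply(X_tree)[:, :, 0])

        # 3. Train a linear model on the transformed leaf-index features.
        lin = LogisticRegression(max_iter=1000)
        lin.fit(enc.transform(gbt.apply(X_lin)[:, :, 0]), y_lin)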

2015

  • (Wikipedia, 2015) ⇒ http://en.wikipedia.org/wiki/Gradient_boosting#Gradient_tree_boosting Retrieved:2015-1-14.
    • Gradient boosting is typically used with decision trees (especially CART trees) of a fixed size as base learners. For this special case, Friedman proposes a modification to the gradient boosting method which improves the quality of fit of each base learner.

      Generic gradient boosting at the m-th step would fit a decision tree [math]\displaystyle{ \! h_m(x) }[/math] to pseudo-residuals. Let [math]\displaystyle{ \! J }[/math] be the number of its leaves. The tree partitions the input space into [math]\displaystyle{ \! J }[/math] disjoint regions [math]\displaystyle{ \! R_{1m}, \ldots, R_{Jm} }[/math] and predicts a constant value in each region. Using the indicator notation, the output of [math]\displaystyle{ \! h_m(x) }[/math] for input x can be written as the sum:

       : [math]\displaystyle{ h_m(x) = \sum_{j=1}^J b_{jm} I(x \in R_{jm}), }[/math]

      where [math]\displaystyle{ \! b_{jm} }[/math] is the value predicted in the region [math]\displaystyle{ \! R_{jm} }[/math]. [1] Then the coefficients [math]\displaystyle{ \! b_{jm} }[/math] are multiplied by some value [math]\displaystyle{ \! \gamma_m }[/math], chosen using line search so as to minimize the loss function, and the model is updated as follows:

       : [math]\displaystyle{ F_m(x) = F_{m-1}(x) + \gamma_m h_m(x), \quad \gamma_m = \underset{\gamma}{\operatorname{arg\,min}} \sum_{i=1}^n L(y_i, F_{m-1}(x_i) + \gamma h_m(x_i)). }[/math]
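
      The update above can be illustrated with a short sketch of one boosting step, assuming squared-error loss, so that the pseudo-residuals are simply [math]\displaystyle{ y_i - F_{m-1}(x_i) }[/math] and the line search for [math]\displaystyle{ \gamma_m }[/math] has a closed form; the function name and parameters are illustrative:

        import numpy as np
        from sklearn.tree import DecisionTreeRegressor

        def boosting_step(X, y, F_prev, max_leaf_nodes=8):
            """One generic gradient boosting step under squared-error loss."""
            # Pseudo-residuals: the negative gradient of 1/2 * (y - F)^2 w.r.t. F.
            residuals = y - F_prev

            # Fit a J-leaf regression tree h_m to the pseudo-residuals.
            h_m = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes).fit(X, residuals)
            h_vals = h_m.predict(X)

            # Line search for a single multiplier gamma_m; for squared error the
            # minimizer is available in closed form (other losses need a 1-D search).
            gamma_m = np.dot(residuals, h_vals) / np.dot(h_vals, h_vals)

            # Model update: F_m(x) = F_{m-1}(x) + gamma_m * h_m(x).
            return F_prev + gamma_m * h_vals, h_m, gamma_m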

      Friedman proposes to modify this algorithm so that it chooses a separate optimal value [math]\displaystyle{ \! \gamma_{jm} }[/math] for each of the tree's regions, instead of a single [math]\displaystyle{ \! \gamma_m }[/math] for the whole tree. He calls the modified algorithm "TreeBoost". The coefficients [math]\displaystyle{ \! b_{jm} }[/math] from the tree-fitting procedure can be then simply discarded and the model update rule becomes:

       : [math]\displaystyle{ F_m(x) = F_{m-1}(x) + \sum_{j=1}^J \gamma_{jm} I(x \in R_{jm}), \quad \gamma_{jm} = \underset{\gamma}{\operatorname{arg\,min}} \sum_{x_i \in R_{jm}} L(y_i, F_{m-1}(x_i) + \gamma h_m(x_i)). }[/math]
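
      A corresponding sketch of the TreeBoost update, again assuming squared-error loss, in which case the per-region optimum [math]\displaystyle{ \gamma_{jm} }[/math] is the mean pseudo-residual within each leaf; names and parameters are illustrative:

        import numpy as np
        from sklearn.tree import DecisionTreeRegressor

        def treeboost_step(X, y, F_prev, max_leaf_nodes=8):
            """One TreeBoost step: a separate optimal value for each leaf region."""
            residuals = y - F_prev
            h_m = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes).fit(X, residuals)

            # Region membership: the id of the leaf each training point falls into.
            leaf_ids = h_m.apply(X)

            F_new = F_prev.copy()
            for leaf in np.unique(leaf_ids):
                in_region = leaf_ids == leaf
                # For squared-error loss the arg-min over gamma in region R_jm is the
                # mean residual there; other losses need a per-leaf 1-D line search.
                gamma_jm = residuals[in_region].mean()
                F_new[in_region] += gamma_jm
            return F_new, h_m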

      === Size of trees ===

      [math]\displaystyle{ \! J }[/math], the number of terminal nodes in the trees, is the method's parameter that can be adjusted for the data set at hand. It controls the maximum allowed level of interaction between variables in the model. With [math]\displaystyle{ \! J = 2 }[/math] (decision stumps), no interaction between variables is allowed. With [math]\displaystyle{ \! J = 3 }[/math] the model may include effects of the interaction between up to two variables, and so on.

      Hastie et al. comment that typically [math]\displaystyle{ 4 \leq J \leq 8 }[/math] work well for boosting, that results are fairly insensitive to the choice of [math]\displaystyle{ J }[/math] in this range, that [math]\displaystyle{ J = 2 }[/math] is insufficient for many applications, and that [math]\displaystyle{ J \gt 10 }[/math] is unlikely to be required.
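
      In implementations, [math]\displaystyle{ \! J }[/math] is usually exposed as a tree-size limit. A minimal sketch, assuming scikit-learn's GradientBoostingRegressor, where max_leaf_nodes corresponds directly to the number of terminal nodes; the other parameter values are illustrative:

        from sklearn.ensemble import GradientBoostingRegressor

        # J = 6 terminal nodes per tree, inside the 4 <= J <= 8 range quoted above.
        # Note that the separate max_depth parameter (default 3) can also cap tree size.
        model = GradientBoostingRegressor(n_estimators=200, learning_rate=0.1,
                                          max_leaf_nodes=6, random_state=0)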

  1. Note: in the case of usual CART trees, the trees are fitted using least-squares loss, and so the coefficient [math]\displaystyle{ b_{jm} }[/math] for the region [math]\displaystyle{ R_{jm} }[/math] is simply the value of the output variable averaged over all training instances in [math]\displaystyle{ R_{jm} }[/math].
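
  A short check of this note, assuming scikit-learn's DecisionTreeRegressor (which fits CART-style trees with squared-error splitting by default); the data below are synthetic and illustrative:

     import numpy as np
     from sklearn.tree import DecisionTreeRegressor

     rng = np.random.default_rng(0)
     X = rng.normal(size=(200, 3))
     y = X[:, 0] + rng.normal(scale=0.1, size=200)

     tree = DecisionTreeRegressor(max_leaf_nodes=4).fit(X, y)
     leaf_ids = tree.apply(X)
     for leaf in np.unique(leaf_ids):
         in_region = leaf_ids == leaf
         # The tree's constant prediction in each region equals the mean of y there.
         assert np.isclose(tree.predict(X[in_region])[0], y[in_region].mean())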