Bayesian Locally Weighted Regression Task
A Bayesian Locally Weighted Regression Task is a Nonparametric Regression Task that applies a fully probabilistic treatment to locally weighted regression, placing prior distributions over the local model's sample weights and distance metrics and inferring them via variational approximations rather than cross-validation.
- AKA: Bayesian LWR, Bayesian Locally Weighted Regression.
- …
- Example(s):
- …
- Counter-Example(s):
- See: Initialism, Scattergram, Non-Parametric Regression, k-Nearest Neighbor Algorithm, Classical Statistics, Least Squares Regression, Nonlinear Regression.
References
2017
- (Ting et al., 2017) ⇒ Jo-Anne Ting, Franziska Meier, Sethu Vijayakumar, and Stefan Schaal (2017). "Locally Weighted Regression for Control." In: "Encyclopedia of Machine Learning and Data Mining", pp. 759-772.
- QUOTE: Ting et al. (2008) proposed a fully probabilistic treatment of LWR in an attempt to avoid cross-validation procedures and minimize any manual parameter tuning (e.g., gradient descent rates, kernel initialization, forgetting rates, etc.). The resulting Bayesian algorithm learns the distance metric of the local linear model probabilistically (for simplicity, a local linear model is assumed, although local polynomials can be used as well), can cope with high input dimensions, and rejects data outliers automatically. The main ideas of Bayesian LWR are listed below (please see Ting, 2009 for details):
- Introduce hidden variables [math]\displaystyle{ z }[/math] to the local linear model to decompose the statistical estimation problem into [math]\displaystyle{ d }[/math] individual estimation problems (where [math]\displaystyle{ d }[/math] is the number of input dimensions). The result is an iterative expectation-maximization (EM) algorithm that is of linear computational complexity in [math]\displaystyle{ d }[/math] and the number of training data samples [math]\displaystyle{ N }[/math], i.e., [math]\displaystyle{ O(Nd) }[/math].
- Associate a scalar weight [math]\displaystyle{ w_i }[/math] with each training data sample [math]\displaystyle{ \{\mathbf{x}_{i},t_{i}\} }[/math], placing a Bernoulli prior probability distribution over a weight [math]\displaystyle{ w_{im} }[/math] for each input dimension [math]\displaystyle{ m }[/math] so that the weights are positive and between 0 and 1:
[math]\displaystyle{ w_{i} = \prod_{m=1}^{d} w_{im}, \quad \mbox{where } w_{im} \sim \mbox{Bernoulli}\left(q_{im}\right) \mbox{ for } i = 1,\ldots,N;\; m = 1,\ldots,d \quad\quad }[/math](8)
The weight [math]\displaystyle{ w_i }[/math] indicates a training sample's contribution to the local model, and the form of the parameter [math]\displaystyle{ q_{im} }[/math] determines the shape of the weighting function applied to the local model. The weighting function used in Bayesian LWR is:
[math]\displaystyle{ q_{im} = \frac{1}{1 + \left(x_{im} - x_{qm}\right)^{2} h_{m}} \quad \mbox{for } i = 1,\ldots,N;\; m = 1,\ldots,d \quad\quad }[/math](9)
where [math]\displaystyle{ \mathbf{x}_{q} \in \mathfrak{R}^{d\times 1} }[/math] is the query input point and [math]\displaystyle{ h_m }[/math] is the bandwidth parameter/distance metric of the local model in the [math]\displaystyle{ m }[/math]-th input dimension.
- Place a gamma prior probability distribution over the distance metric [math]\displaystyle{ h_m }[/math]:
[math]\displaystyle{ h_{m} \sim \mbox{Gamma}\left(a_{hm0}, b_{hm0}\right) \quad\quad }[/math](10)
where [math]\displaystyle{ \{a_{hm0},b_{hm0}\} }[/math] are the prior parameter values of the gamma distribution.
- Treat the model as an EM-like regression problem, using variational approximations to achieve analytically tractable inference of the posterior probability distributions.
This Bayesian method can also be applied as a general kernel shaping algorithm for global kernel learning methods that are linear in the parameters, e.g., to realize nonstationary Gaussian processes (Ting et al., 2008), resulting in an augmented nonstationary Gaussian process.
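The steps quoted above lend themselves to a compact numerical illustration. The Python sketch below is not from the source: the toy data, the prior parameter values, and the plain weighted least-squares fit are assumptions for illustration, and a shape/rate convention is assumed for the gamma distribution in Eq. (10). It draws bandwidths [math]\displaystyle{ h_m }[/math] from the gamma prior, forms the expected sample weights from Eqs. (8)-(9), and fits a local linear model at a query point; the full Bayesian LWR algorithm would instead infer the weights and bandwidths via the variational EM procedure described above.
```python
import numpy as np

rng = np.random.default_rng(0)

# --- Toy data: a noisy 1-D sinusoid (purely illustrative). ---
N, d = 200, 1
X = rng.uniform(-3.0, 3.0, size=(N, d))             # training inputs x_i
t = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(N)  # training targets t_i
x_q = np.array([1.0])                               # query input point x_q

# --- Eq. (10): gamma prior over the bandwidths h_m. ---
# Prior parameters {a_hm0, b_hm0} are arbitrary here; assuming a
# shape/rate convention, NumPy's scale parameter is 1 / b_hm0.
a_h0, b_h0 = np.full(d, 2.0), np.full(d, 0.05)
h = rng.gamma(shape=a_h0, scale=1.0 / b_h0)         # one draw of h_m

# --- Eqs. (8)-(9): per-dimension weighting and sample weights. ---
# q_im = 1 / (1 + (x_im - x_qm)^2 h_m); since E[w_im] = q_im under the
# Bernoulli prior, we take w_i = prod_m q_im as the expected weight.
q = 1.0 / (1.0 + (X - x_q) ** 2 * h)                # shape (N, d)
w = np.prod(q, axis=1)                              # shape (N,)

# --- Local linear model at x_q via weighted least squares. ---
# This replaces the variational-EM inference of Bayesian LWR with a
# plain weighted fit, so it is a simplification of the full algorithm.
A = np.hstack([X, np.ones((N, 1))])                 # inputs plus bias column
WA = A * w[:, None]
beta = np.linalg.solve(WA.T @ A + 1e-8 * np.eye(d + 1), WA.T @ t)
print("prediction at x_q:", np.append(x_q, 1.0) @ beta)
print("ground truth     :", np.sin(x_q[0]))
```
Because [math]\displaystyle{ E[w_{im}] = q_{im} }[/math] under the Bernoulli prior, weighting each sample by [math]\displaystyle{ w_i = \prod_m q_{im} }[/math] uses its prior expected contribution; posterior inference in the full algorithm would refine these weights, which is how data outliers get down-weighted automatically.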