Gaussian Process Regression Task
A Gaussian Process Regression Task is a nonparametric regression task that is based on the Kernel Method and Gaussian Processes.
- AKA: Kriging Task, Universal Kriging Task, Wiener-Kolmogorov Prediction Task, GPR Task.
- Context:
- Task Input:
- an N-observed Numerically-Labeled Training Dataset [math]\displaystyle{ D=\{(x_1,y_1,z_1,\ldots),(x_2,y_2,z_2,\ldots),\cdots,(x_n,y_n,z_n,\ldots)\} }[/math] that can be represented by:
- [math]\displaystyle{ \mathbf{Y} }[/math], a continuous response variable dataset;
- [math]\displaystyle{ \mathbf{X} }[/math], a continuous predictor variables dataset.
- Task Output:
- [math]\displaystyle{ Y^* }[/math], the predicted response variable values;
- [math]\displaystyle{ \sigma_*^2 }[/math], the predictive variance at the test inputs;
- [math]\displaystyle{ E(\mathbf{Y}|\mathbf{X})=m(x) }[/math], the predicted mean function;
- [math]\displaystyle{ K_{ij}=\kappa(x_i,x_j) }[/math], the Gram matrix (optional).
- Task Requirements:
- It requires finding a nonparametric regression function [math]\displaystyle{ f(x) }[/math] that is distributed as a Gaussian process:
[math]\displaystyle{ f(x) \sim \mathcal{GP} (m(x), k(x_i, x)) }[/math]
and that can solve [math]\displaystyle{ y_i=f(x_i)+ \epsilon_i }[/math], where [math]\displaystyle{ m(x) }[/math] is a mean function and [math]\displaystyle{ k(x_i,x) }[/math] is a kernel function (a minimal numerical sketch is given after this Context list).
- It may require a regression diagnostic test to assess the goodness of fit of the regression model.
- It can be solved by a Gaussian Process Regression System that implements a Gaussian Process Regression Algorithm.
- It can range from being a Locally Weighted Regression Task to being a Nonlinear Regression Task.
- …
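The following is a minimal numerical sketch (not part of the cited sources) of how the task's inputs map to its outputs: assuming a zero mean function, Gaussian observation noise, and an RBF kernel, it computes the Gram matrix [math]\displaystyle{ K_{ij}=\kappa(x_i,x_j) }[/math], the predicted response values [math]\displaystyle{ Y^* }[/math], and the predictive variance [math]\displaystyle{ \sigma_*^2 }[/math] on hypothetical 1-D data.

```python
import numpy as np

def rbf_kernel(A, B, length_scale=1.0, signal_var=1.0):
    """RBF kernel k(x, x') = sigma_f^2 * exp(-||x - x'||^2 / (2 * lambda^2))."""
    sq_dists = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return signal_var * np.exp(-0.5 * sq_dists / length_scale**2)

def gp_posterior(X, y, X_star, noise_var=0.1, **kernel_params):
    """Return the predicted mean Y* and predictive variance sigma_*^2 at test inputs X_star."""
    K = rbf_kernel(X, X, **kernel_params)                    # Gram matrix K_ij = k(x_i, x_j)
    K_s = rbf_kernel(X, X_star, **kernel_params)             # train/test cross-covariances
    K_ss = rbf_kernel(X_star, X_star, **kernel_params)       # test/test covariances
    L = np.linalg.cholesky(K + noise_var * np.eye(len(X)))   # Cholesky for a stable solve
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mean = K_s.T @ alpha                                     # posterior mean m(x*) = E(Y|X)
    v = np.linalg.solve(L, K_s)
    cov = K_ss - v.T @ v                                     # posterior covariance
    return mean, np.diag(cov)

# Hypothetical 1-D training data and test grid.
X = np.array([[0.0], [1.0], [2.5], [4.0]])
y = np.sin(X).ravel()
X_star = np.linspace(0.0, 5.0, 50)[:, None]
Y_star, var_star = gp_posterior(X, y, X_star, noise_var=0.05, length_scale=1.0, signal_var=1.0)
```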
- Example(s)
- Counter-Example(s)
- See: Gaussian Process, Reinforcement Learning Task, Statistics, Geostatistics, Interpolation, Covariance, Smoothing Spline, Best Linear Unbiased Prediction, Spatial Analysis#Sampling, Computer Experiment, Norbert Wiener, Andrey Kolmogorov.
References
2017a
- (Wikipedia, 2017) ⇒ https://en.wikipedia.org/wiki/Kriging Retrieved:2017-8-27.
- In statistics, originally in geostatistics, kriging or Gaussian process regression is a method of interpolation for which the interpolated values are modeled by a Gaussian process governed by prior covariances, as opposed to a piecewise-polynomial spline chosen to optimize smoothness of the fitted values. Under suitable assumptions on the priors, kriging gives the best linear unbiased prediction of the intermediate values. Interpolating methods based on other criteria such as smoothness need not yield the most likely intermediate values. The method is widely used in the domain of spatial analysis and computer experiments. The technique is also known as Wiener–Kolmogorov prediction, after Norbert Wiener and Andrey Kolmogorov.
The theoretical basis for the method was developed by the French mathematician Georges Matheron in 1960, based on the Master's thesis of Danie G. Krige, the pioneering plotter of distance-weighted average gold grades at the Witwatersrand reef complex in South Africa. Krige sought to estimate the most likely distribution of gold based on samples from a few boreholes. The English verb is to krige and the most common noun is kriging; both are often pronounced with a hard "g", following the pronunciation of the name "Krige". The word is sometimes capitalized as Kriging in the literature.
2017b
- (Quadrianto & Buntine, 2017) ⇒ Novi Quadrianto, Wray L. Buntine (2017). "Regression" in "Encyclopedia of Machine Learning and Data Mining" (2017) pp 1075-1080
- QUOTE: Nonparametric Regression
In the parametric approach, an assumption on the mathematical form of the functional relationship between input [math]\displaystyle{ x }[/math] and output [math]\displaystyle{ y }[/math] such as linear, polynomial, exponential, or combination of them needs to be chosen a priori. Subsequently, parameters are placed on each of the chosen forms and the optimal values learned from the observed data. This is restrictive both in the fixed functional form and in the ability to vary the model complexity. Nonparametric approaches try to derive the functional relationship directly from the data, that is, they do not parameterize the regression function.
Gaussian Processes for regression, for instance, are well developed. Another approach is the kernel method, of which a rich variety exists (Hastie et al. 2003). These can be viewed as a regression variant of nearest neighbor classification where the function is made up of a local element for each data point:
[math]\displaystyle{ f(x)\ =\ \frac{\sum _{i}y_{i}K_{\lambda }(x_{i},x)} {\sum _{i}K_{\lambda }(x_{i},x)} \, }[/math]
where the function [math]\displaystyle{ K_\lambda(x_i, x) }[/math] is a nonnegative “bump” in [math]\displaystyle{ x }[/math] space centered at its first argument with diameter approximately given by [math]\displaystyle{ \lambda }[/math]. Thus, the function has a variable contribution from each data point and [math]\displaystyle{ \lambda }[/math] controls the bias-variance tradeoff.
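The local-averaging estimator above can be illustrated with a short sketch (an assumption-based illustration, not from the cited encyclopedia entry): it uses a Gaussian “bump” for [math]\displaystyle{ K_\lambda }[/math] on hypothetical 1-D data, which amounts to Nadaraya-Watson kernel smoothing.

```python
import numpy as np

def kernel_smoother(x_train, y_train, x_query, lam=0.5):
    """f(x) = sum_i y_i * K_lam(x_i, x) / sum_i K_lam(x_i, x) with a Gaussian bump K_lam."""
    # Each column holds the bump weights of all training points for one query point.
    weights = np.exp(-0.5 * ((x_train[:, None] - x_query[None, :]) / lam) ** 2)
    return (weights * y_train[:, None]).sum(axis=0) / weights.sum(axis=0)

# Hypothetical data: noisy samples from a sine curve.
rng = np.random.default_rng(0)
x_train = np.linspace(0.0, 6.0, 30)
y_train = np.sin(x_train) + 0.1 * rng.standard_normal(30)
x_query = np.linspace(0.0, 6.0, 100)
f_hat = kernel_smoother(x_train, y_train, x_query, lam=0.5)   # larger lam -> smoother fit
```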
2017c
- (Quadrianto et al., 2017) ⇒ Novi Quadrianto, Kristian Kersting, Zhao Xu (2017). "Gaussian Process" in "Encyclopedia of Machine Learning and Data Mining" (2017) pp 535-548
- QUOTE: In a Regression problem, we are interested to recover a functional dependency [math]\displaystyle{ y_i= f(x_i) + \epsilon_i }[/math] from [math]\displaystyle{ N }[/math] observed training data points [math]\displaystyle{ \{(x_i, y_i)\}^N_{i=1} }[/math], where [math]\displaystyle{ y_i \in \mathbb{R} }[/math] is the noisy observed output at input location [math]\displaystyle{ x_i \in \mathbb{R}^d }[/math]. Traditionally, in the Bayesian Linear Regression model, this regression problem is tackled by requiring us to parameterize the latent function [math]\displaystyle{ f }[/math] by a parameter [math]\displaystyle{ w \in \mathbb{R}^H ,\; f(x):=\langle\phi(x), w\rangle }[/math] for [math]\displaystyle{ H }[/math] fixed basis functions [math]\displaystyle{ \{\phi_h(x)\}^H_{h=1} }[/math]. A prior distribution is then defined over parameter [math]\displaystyle{ w }[/math]. The idea of the Gaussian process regression (in the geostatistical literature, this is also called kriging; see, e.g., Krige 1951; Matheron 1963) is to place a prior directly on the space of functions without parameterizing the function (...)
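To make the idea of placing a prior directly on the space of functions concrete, the sketch below (an illustration under assumed settings, not from the encyclopedia entry) draws sample functions from a zero-mean GP prior with an RBF covariance; no weight vector [math]\displaystyle{ w }[/math] or fixed basis expansion is ever parameterized.

```python
import numpy as np

def rbf_cov(x, length_scale=1.0):
    """Covariance k(x_i, x_j) = exp(-(x_i - x_j)^2 / (2 * length_scale^2)) on a 1-D grid."""
    return np.exp(-0.5 * ((x[:, None] - x[None, :]) / length_scale) ** 2)

x = np.linspace(0.0, 5.0, 100)
K = rbf_cov(x, length_scale=1.0) + 1e-8 * np.eye(len(x))   # jitter keeps K positive definite
# Each row is one function drawn from the GP prior, evaluated on the grid.
prior_samples = np.random.default_rng(0).multivariate_normal(np.zeros(len(x)), K, size=3)
```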
- (Scikit-Learn, 2017) ⇒ "1.7.1. Gaussian Process Regression (GPR)" in http://scikit-learn.org/stable/modules/gaussian_process.html Retrieved:2017-09-03
- QUOTE: The GaussianProcessRegressor implements Gaussian processes (GP) for regression purposes. For this, the prior of the GP needs to be specified. The prior mean is assumed to be constant and zero (for normalize_y=False) or the training data’s mean (for normalize_y=True). The prior’s covariance is specified by passing a kernel object. The hyperparameters of the kernel are optimized during fitting of GaussianProcessRegressor by maximizing the log-marginal-likelihood (LML) based on the passed optimizer. As the LML may have multiple local optima, the optimizer can be started repeatedly by specifying n_restarts_optimizer. The first run is always conducted starting from the initial hyperparameter values of the kernel; subsequent runs are conducted from hyperparameter values that have been chosen randomly from the range of allowed values. If the initial hyperparameters should be kept fixed, None can be passed as optimizer.
The noise level in the targets can be specified by passing it via the parameter alpha, either globally as a scalar or per datapoint. Note that a moderate noise level can also be helpful for dealing with numeric issues during fitting as it is effectively implemented as Tikhonov regularization, i.e., by adding it to the diagonal of the kernel matrix. An alternative to specifying the noise level explicitly is to include a WhiteKernel component into the kernel, which can estimate the global noise level from the data (see example below).
The implementation is based on Algorithm 2.1 of [RW2006]. In addition to the API of standard scikit-learn estimators, GaussianProcessRegressor:
- allows prediction without prior fitting (based on the GP prior)
- provides an additional method sample_y(X), which evaluates samples drawn from the GPR (prior or posterior) at given inputs
- exposes a method log_marginal_likelihood(theta), which can be used externally for other ways of selecting hyperparameters, e.g., via Markov chain Monte Carlo
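A brief usage example of the estimator described above (the data and hyperparameter values here are hypothetical assumptions; the classes and methods are the scikit-learn ones quoted above):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Hypothetical 1-D training data.
rng = np.random.default_rng(0)
X = np.linspace(0.0, 5.0, 20)[:, None]
y = np.sin(X).ravel() + 0.1 * rng.standard_normal(20)

# RBF kernel plus a WhiteKernel component that estimates the global noise level from the data.
kernel = 1.0 * RBF(length_scale=1.0) + WhiteKernel(noise_level=0.1)
gpr = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=5)
gpr.fit(X, y)

X_test = np.linspace(0.0, 6.0, 100)[:, None]
y_mean, y_std = gpr.predict(X_test, return_std=True)     # predictive mean and standard deviation
samples = gpr.sample_y(X_test, n_samples=3)              # draws from the GP posterior
lml = gpr.log_marginal_likelihood(gpr.kernel_.theta)     # LML at the fitted hyperparameters
```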
2017d
- (Schulz et al., 2017) ⇒ Schulz, E., Speekenbrink, M., & Krause, A. (2017). A tutorial on Gaussian process regression with a focus on exploration-exploitation scenarios. bioRxiv, 095190 DOI:10.1101/095190.
- QUOTE: In Gaussian process regression, we assume the output [math]\displaystyle{ y }[/math] of a function [math]\displaystyle{ f }[/math] at input [math]\displaystyle{ x }[/math] can be written as
[math]\displaystyle{ y = f (x) + \epsilon\quad }[/math](3)
with [math]\displaystyle{ \epsilon \sim \mathcal{N}(0, \sigma_\epsilon^2) }[/math]. Note that this is similar to the assumption made in linear regression, in that we assume an observation consists of an independent “signal” term [math]\displaystyle{ f (x) }[/math] and “noise” term [math]\displaystyle{ \epsilon }[/math]. New in Gaussian process regression, however, is that we assume that the signal term is also a random variable which follows a particular distribution. This distribution is subjective in the sense that the distribution reflects our uncertainty regarding the function. The uncertainty regarding [math]\displaystyle{ f }[/math] can be reduced by observing the output of the function at different input points. The noise term reflects the inherent randomness in the observations, which is always present no matter how many observations we make. In Gaussian process regression, we assume the function [math]\displaystyle{ f(x) }[/math] is distributed as a Gaussian process:
[math]\displaystyle{ f(x) \sim \mathcal{GP} (m(x), k(x, x' )) }[/math]
A Gaussian process GP is a distribution over functions and is defined by a mean and a covariance function. The mean function [math]\displaystyle{ m(x) }[/math] reflects the expected function value at input [math]\displaystyle{ x }[/math]:
[math]\displaystyle{ m(x) = E[f(x)] }[/math]
, i.e. the average of all functions in the distribution evaluated at input [math]\displaystyle{ x }[/math]. The prior mean function is often set to [math]\displaystyle{ m(x) = 0 }[/math] in order to avoid expensive posterior computations and only do inference via the covariance directly. The covariance function [math]\displaystyle{ k(x, x') }[/math] models the dependence between the function values at different input points [math]\displaystyle{ x }[/math] and [math]\displaystyle{ x' }[/math]:
[math]\displaystyle{ k(x, x' ) = E [(f (x) − m(x))(f (x') − m(x'))] }[/math].
The function [math]\displaystyle{ k }[/math] is commonly called the kernel of the Gaussian process (Jäkel, Schölkopf, & Wichmann, 2007). The choice of an appropriate kernel is based on assumptions such as smoothness and likely patterns to be expected in the data (more on this later). A sensible assumption is usually that the correlation between two points decays with the distance between the points according to a power function. This just means that closer points are expected to behave more similarly than points which are further away from each other. One very popular choice of a kernel fulfilling this assumption is the radial basis function kernel, which is defined as
[math]\displaystyle{ k(x, x') = \sigma_f^2\exp(-\frac{\parallel x-x'\parallel^2}{2 \lambda^2}) }[/math]
The radial basis function provides an expressive kernel to model smooth functions. The two hyper-parameters [math]\displaystyle{ \lambda }[/math] (called the length-scale) and [math]\displaystyle{ \sigma_f^2 }[/math] (the signal variance) can be varied to increase or reduce the correlation between points and consequentially the smoothness of the resulting function.
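As a small illustration (assumed values, not from the tutorial itself), the kernel above can be evaluated directly to see how the length-scale [math]\displaystyle{ \lambda }[/math] and signal variance [math]\displaystyle{ \sigma_f^2 }[/math] control how quickly the correlation between two inputs decays with their distance.

```python
import numpy as np

def rbf_kernel(x1, x2, length_scale=1.0, signal_var=1.0):
    """k(x, x') = sigma_f^2 * exp(-||x - x'||^2 / (2 * lambda^2)) for scalar inputs."""
    return signal_var * np.exp(-(x1 - x2) ** 2 / (2.0 * length_scale ** 2))

# Covariance between points one unit apart under different length-scales:
for lam in (0.5, 1.0, 2.0):
    print(f"lambda={lam}: k(0, 1) = {rbf_kernel(0.0, 1.0, length_scale=lam):.3f}")
# Smaller lambda -> correlation decays faster -> rougher functions under the GP prior.
```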