Relevance Vector Machine (RVM) Algorithm

A Relevance Vector Machine (RVM) Algorithm is a probabilistic supervised learning algorithm that uses Bayesian inference to produce sparse kernel-based models with probabilistic outputs, a Bayesian alternative to the support vector machine (SVM).
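The sparse Bayesian training loop behind the RVM can be sketched in a few lines of NumPy. The following is a minimal illustration of the hyperparameter re-estimation scheme described in the references below (Tipping-style type-II maximum likelihood); the helper names (<code>rbf</code>, <code>rvm_fit</code>), kernel width, iteration count, and pruning threshold are illustrative assumptions, not a reference implementation:

<syntaxhighlight lang="python">
import numpy as np

def rbf(a, b, gamma=2.0):
    # Gaussian (localized) basis functions centred on the points in b.
    return np.exp(-gamma * (a[:, None] - b[None, :]) ** 2)

def rvm_fit(X, t, gamma=2.0, n_iter=300, alpha_cap=1e8):
    """Sparse Bayesian regression, RVM-style: re-estimate one prior
    precision alpha_i per weight plus the noise precision beta, pruning
    weights whose precision diverges (their posterior is pinned at zero).
    Illustrative sketch, not Tipping's reference implementation."""
    N = len(X)
    keep = np.arange(N)            # indices of surviving basis functions
    alpha = np.ones(N)             # prior precision of each weight
    beta = 1.0 / np.var(t)         # noise precision, rough initial guess
    Phi = rbf(X, X, gamma)         # one basis function per training point
    for _ in range(n_iter):
        # Weight posterior given the current hyperparameters.
        Sigma = np.linalg.inv(beta * Phi.T @ Phi + np.diag(alpha))
        mu = beta * Sigma @ Phi.T @ t
        # Fixed-point (MacKay-style) evidence-maximization updates.
        g = 1.0 - alpha * np.diag(Sigma)   # well-determinedness of each weight
        alpha = g / (mu ** 2 + 1e-12)
        beta = (N - g.sum()) / (np.sum((t - Phi @ mu) ** 2) + 1e-12)
        # Prune basis functions whose weights have been driven to zero.
        m = alpha < alpha_cap
        alpha, Phi, keep = alpha[m], Phi[:, m], keep[m]
    Sigma = np.linalg.inv(beta * Phi.T @ Phi + np.diag(alpha))
    mu = beta * Sigma @ Phi.T @ t
    return keep, mu, Sigma, beta
</syntaxhighlight>

Most of the precisions alpha_i diverge during training, so most weights are pruned; the training points whose basis functions survive are the relevance vectors.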



== References ==

=== 2019 ===

=== 2017 ===

=== 2010 ===

=== 2006 ===

=== 2005 ===
* (Rasmussen & Quinonero-Candela, 2005) ⇒ [[Carl Edward Rasmussen]], and [[Joaquin Quinonero-Candela]] (2005, August). [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.363.6103&rep=rep1&type=pdf "Healing the Relevance Vector Machine Through Augmentation"]. In: Proceedings of the 22nd International Conference on Machine Learning (pp. 689-696). ACM.
** QUOTE: The [[Relevance Vector Machine (RVM) Algorithm|Relevance Vector Machine (RVM)]] introduced by [[#2001|Tipping (2001)]] produces [[sparse solution]]s using an improper [[hierarchical prior]] and optimizing over [[hyperparameter]]s. The [[RVM]] is exactly equivalent to a [[Gaussian Process]], where the [[RVM]] [[hyperparameter]]s are [[parameter]]s of the [[GP covariance function]] (more on this in the discussion section). However, the [[covariance function]] of the [[RVM]] seen as a [[GP]] is degenerate: its [[rank]] is at most equal to the number of [[relevance vector]]s of the [[RVM]]. As a consequence, for [[localized basis function]]s, the [[RVM]] produces [[predictive distribution]]s with properties opposite to what would be desirable. Indeed, the [[RVM]] is more certain about its [[prediction]]s the further one moves away from the [[data]] it has been trained on. One would wish the opposite behaviour, as is the case with [[non-degenerate GP]]s, where the [[uncertainty]] of the [[prediction]]s is [[minimal]] for [[test point]]s in the regions of the [[input space]] where [[Training Data|(training) data]] has been seen. For [[non-localized basis function]]s, the same undesired effect persists, although the intuition may be less clear, see the discussion.
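The degeneracy described in this quote is easy to check numerically. Below is a small sketch comparing the predictive variance of a finite RBF-basis (degenerate) model with that of a non-degenerate GP using the same kernel; the unit-variance weight prior, noise precision, and toy inputs are assumptions chosen for illustration:

<syntaxhighlight lang="python">
import numpy as np

def rbf(a, b, gamma=2.0):
    # Localized (Gaussian) basis / covariance function.
    return np.exp(-gamma * (a[:, None] - b[None, :]) ** 2)

beta = 100.0                               # noise precision, assumed known
X = np.linspace(-2.0, 2.0, 20)             # training inputs
Phi = rbf(X, X)                            # also the GP covariance matrix K
Sigma = np.linalg.inv(beta * Phi.T @ Phi + np.eye(len(X)))  # weight posterior cov.

for x_star in (0.0, 5.0, 50.0):
    phi = rbf(np.array([x_star]), X)       # basis responses at the test point
    # Degenerate (RVM-as-GP) predictive variance: collapses to 1/beta far away.
    var_deg = 1.0 / beta + (phi @ Sigma @ phi.T).item()
    # Non-degenerate GP predictive variance: reverts to the prior k(x*, x*) = 1.
    var_gp = 1.0 - (phi @ np.linalg.solve(Phi + np.eye(len(X)) / beta, phi.T)).item() + 1.0 / beta
    print(f"x*={x_star:5.1f}   degenerate var={var_deg:.4f}   full-GP var={var_gp:.4f}")
</syntaxhighlight>

At x* = 50 the localized basis responses are essentially zero, so the degenerate model reports variance 1/beta = 0.01 (maximally confident far from the data), while the full GP reverts to roughly its prior variance of 1 plus noise, which is the behaviour the quoted passage argues one should want.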

=== 2004 ===

* (Bishop, 2004) ⇒ [[Christopher M. Bishop]]. (2004). "Recent Advances in Bayesian Inference Techniques". Keynote Presentation at SIAM Conference on Data Mining.
** Relevance Vector Machine (Tipping, 1999)
*** Bayesian alternative to support vector machine (SVM)
*** Properties (see the usage sketch after this list):
**** comparable error rates to SVM on new data
**** no cross-validation to set complexity parameters
**** applicable to wide choice of basis function
**** multi-class classification
**** probabilistic outputs
**** dramatically fewer kernels (by an order of magnitude)
**** but, slower to train than SVM
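Two of these properties, probabilistic outputs and dramatically fewer kernels, can be seen directly with the illustrative <code>rvm_fit</code> helper sketched in the lead section; the toy data and test points below are assumptions:

<syntaxhighlight lang="python">
import numpy as np  # assumes the rbf and rvm_fit sketches from the lead section

rng = np.random.default_rng(0)
X = np.linspace(-5.0, 5.0, 60)
t = np.sinc(X) + 0.05 * rng.standard_normal(60)

keep, mu, Sigma, beta = rvm_fit(X, t)
print(f"{len(keep)} relevance vectors kept out of {len(X)} training points")

# Probabilistic outputs: a full predictive mean and variance per test point.
X_test = np.linspace(-6.0, 6.0, 5)
Phi_star = rbf(X_test, X[keep])            # basis centred on the relevance vectors
mean = Phi_star @ mu
var = 1.0 / beta + np.sum((Phi_star @ Sigma) * Phi_star, axis=1)
for x, m, v in zip(X_test, mean, var):
    print(f"x={x:+5.2f}  mean={m:+6.3f}  std={v ** 0.5:.3f}")
</syntaxhighlight>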

=== 2001 ===

=== 2000a ===
* ([[Bishop & Tipping, 2000]]) ⇒ [[Christopher M. Bishop]], and [[Michael E. Tipping]] (2000). [https://arxiv.org/pdf/1301.3838.pdf "Variational Relevance Vector Machines"]. In: Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence. Morgan Kaufmann Publishers Inc.
** QUOTE: Recently [[#2000b|Tipping &#91;8&#93;]] introduced the [[Relevance Vector Machine (RVM) Algorithm|Relevance Vector Machine (RVM)]] which makes [[probabilistic prediction]]s and yet which retains the excellent [[predictive performance]] of the [[support vector machine]]. It also preserves the [[sparseness property]] of the [[SVM]]. Indeed, for a wide variety of [[test problem]]s it actually leads to [[model]]s which are dramatically [[sparser]] than the corresponding [[SVM]], while sacrificing little if anything in the [[accuracy]] of [[prediction]] (...) <P> As we have seen, the standard [[relevance vector machine]] of [[#2000b|Tipping &#91;8&#93;]] [[estimate]]s [[point value]]s for the [[hyperparameter]]s. In this paper we seek a more complete [[Bayesian Theory|Bayesian treatment]] of the [[RVM]] through exploitation of [[variational method]]s.
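For context on the "point values for the hyperparameters" mentioned in this quote, the standard RVM setup can be written compactly (a sketch in the usual notation of Tipping's formulation, not taken from this page): a separate precision hyperparameter per weight, and a Gaussian marginal likelihood over the targets,

<math>
p(\mathbf{w} \mid \boldsymbol{\alpha}) \;=\; \prod_i \mathcal{N}\!\left(w_i \mid 0,\ \alpha_i^{-1}\right),
\qquad
p(\mathbf{t} \mid \boldsymbol{\alpha}, \beta) \;=\; \mathcal{N}\!\left(\mathbf{t} \mid \mathbf{0},\ \beta^{-1}\mathbf{I} + \boldsymbol{\Phi}\,\mathrm{diag}(\boldsymbol{\alpha})^{-1}\boldsymbol{\Phi}^{\top}\right).
</math>

The standard RVM maximizes the marginal likelihood on the right to obtain point estimates of <math>(\boldsymbol{\alpha}, \beta)</math>; the variational treatment proposed in the quoted paper instead maintains approximate posterior distributions over these hyperparameters.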

=== 2000b ===
