Kernel Matrix
A Kernel Matrix is a symmetric, positive semidefinite matrix whose entries are the pairwise kernel (inner product) values of a set of data points, and which thereby encodes the relative positions of all points in the embedding space.
- AKA: Gram Matrix, Gramian Matrix, Gramian.
- Context:
- It can (typically) be created by a Kernel-based System (by applying a Kernel Function to all pairs of data points).
- Example(s):
- kernelMatrix, a kernlab (R) implementation.
- sklearn.metrics.pairwise.pairwise_kernels, a SciKit-Learn implementation.
- …
- Counter-Example(s):
- See: Radial Basis Function, Kernel Function, Kernel Method, Kernel-based Learning Algorithm, Support Vector Machine, Gaussian Process, Kernel Target Alignment, Semidefinite Programming, Mercer Kernel.
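To make the definition above concrete, the following is a minimal Python/NumPy sketch (an illustration, not a reference implementation) that builds a linear-kernel Gram matrix from a small data matrix and numerically checks the symmetric and positive-semidefinite properties:
```python
import numpy as np

# Small data matrix: 4 points in R^2 (rows are data points).
X = np.array([[0.0, 1.0],
              [1.0, 0.0],
              [1.0, 1.0],
              [2.0, 1.0]])

# Linear-kernel Gram matrix: K[i, j] = <x_i, x_j>.
K = X @ X.T

# Symmetry holds exactly; positive semidefiniteness means all
# eigenvalues are >= 0 (up to numerical tolerance).
assert np.allclose(K, K.T)
assert np.linalg.eigvalsh(K).min() >= -1e-10
print(K)
```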
References
2018a
- (Wikipedia, 2018) ⇒ https://en.wikipedia.org/wiki/Gramian_matrix Retrieved:2018-8-10.
- In linear algebra, the Gram matrix (Gramian matrix or Gramian) of a set of vectors [math]\displaystyle{ v_1,\dots, v_n }[/math] in an inner product space is the Hermitian matrix of inner products, whose entries are given by [math]\displaystyle{ G_{ij}=\langle v_i, v_j \rangle }[/math] . [1]
An important application is to compute linear independence: a set of vectors are linearly independent if and only if the Gram determinant (the determinant of the Gram matrix) is non-zero.
It is named after Jørgen Pedersen Gram.
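As a small sketch of the linear-independence criterion mentioned above (the vectors and code are illustrative, not from the cited page), the Gram determinant is nonzero for independent vectors and zero for dependent ones:
```python
import numpy as np

def gram_det(vectors):
    """Determinant of the Gram matrix of the given vectors (rows)."""
    V = np.asarray(vectors, dtype=float)
    return np.linalg.det(V @ V.T)

print(gram_det([[1.0, 0.0], [1.0, 1.0]]))  # 1.0  -> linearly independent
print(gram_det([[1.0, 0.0], [2.0, 0.0]]))  # ~0.0 -> linearly dependent
```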
2018b
- (Wikipedia, 2018) ⇒ https://en.wikipedia.org/wiki/Kernel_method#Mathematics:_the_kernel_trick Retrieved:2018-8-10.
- Theoretically, a Gram matrix [math]\displaystyle{ \mathbf{K} \in \mathbb{R}^{n \times n} }[/math] with respect to [math]\displaystyle{ \{\mathbf{x}_1, \dotsc, \mathbf{x}_n\} }[/math] (sometimes also called a "kernel matrix" ), where [math]\displaystyle{ K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j) }[/math] , must be positive semi-definite (PSD). Empirically, for machine learning heuristics, choices of a function [math]\displaystyle{ k }[/math] that do not satisfy Mercer's condition may still perform reasonably if [math]\displaystyle{ k }[/math] at least approximates the intuitive idea of similarity. Regardless of whether [math]\displaystyle{ k }[/math] is a Mercer kernel, [math]\displaystyle{ k }[/math] may still be referred to as a "kernel". If the kernel function [math]\displaystyle{ k }[/math] is also a covariance function as used in Gaussian processes, then the Gram matrix [math]\displaystyle{ \mathbf{K} }[/math] can also be called a covariance matrix.
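A brief sketch of this construction (assuming an RBF kernel, which satisfies Mercer's condition; not taken from the cited article): build [math]\displaystyle{ K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j) }[/math] and check positive semi-definiteness via the eigenvalues.
```python
import numpy as np

def rbf_kernel(x, y, gamma=0.5):
    """Gaussian (RBF) kernel k(x, y) = exp(-gamma * ||x - y||^2)."""
    return np.exp(-gamma * np.sum((x - y) ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))   # 6 points in R^3

# Gram ("kernel") matrix with respect to {x_1, ..., x_n}.
K = np.array([[rbf_kernel(xi, xj) for xj in X] for xi in X])

# A Mercer kernel yields a PSD matrix: no eigenvalue should be
# meaningfully below zero.
print(np.linalg.eigvalsh(K).min())
```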
2018c
- (ML Wiki, 2018) ⇒ http://mlwiki.org/index.php/Gram_Matrices Retrieved:2018-8-10.
- QUOTE: A Gram matrix of vectors [math]\displaystyle{ a_1, \cdots, a_n }[/math] is a matrix [math]\displaystyle{ G }[/math]
- s.t. [math]\displaystyle{ G_{ij}=\langle a_i,a_j\rangle }[/math] for all [math]\displaystyle{ i,j }[/math]
- if vectors [math]\displaystyle{ a_1,\cdots ,a_n }[/math] are columns of a matrix [math]\displaystyle{ A }[/math], then [math]\displaystyle{ G=A^TA }[/math]
- a Gram matrix is Positive Definite and Symmetric.
- if vectors [math]\displaystyle{ a_1, \cdots ,a_n }[/math] are the rows of A (A would be the so-called “Data Matrix”), then [math]\displaystyle{ G=AA^T }[/math], and it's called the left Gram matrix.
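A short numerical sketch of the two conventions in the quote (vectors as columns of [math]\displaystyle{ A }[/math] versus rows of a data matrix); the variable names are illustrative:
```python
import numpy as np

rng = np.random.default_rng(1)
vectors = [rng.normal(size=4) for _ in range(3)]   # a_1, a_2, a_3 in R^4

A_cols = np.column_stack(vectors)   # vectors as the columns of A
A_rows = np.vstack(vectors)         # vectors as the rows of A (a "data matrix")

G = A_cols.T @ A_cols               # G_ij = <a_i, a_j>
G_left = A_rows @ A_rows.T          # the "left" Gram matrix

# Both constructions give the same 3x3 matrix of inner products.
assert np.allclose(G, G_left)
print(G)
```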
2018d
- (R Documentation, 2018) ⇒ Alexandros Karatzoglou. https://www.rdocumentation.org/packages/kernlab/versions/0.9-26/topics/kernelMatrix Retrieved:2018-8-10.
- QUOTE:
kernelMatrix calculates the kernel matrix [math]\displaystyle{ K_{ij} = k(x_i, x_j) }[/math] or [math]\displaystyle{ K_{ij} = k(x_i, y_j) }[/math].
kernelPol computes the quadratic kernel expression [math]\displaystyle{ H = z_i z_j k(x_i, x_j),\; H = z_i k_j k(x_i, y_j) }[/math].
kernelMult calculates the kernel expansion [math]\displaystyle{ f(x_i) = \sum^m_{i = 1} z_i k(x_i, x_j) }[/math].
kernelFast computes the kernel matrix, identical to kernelMatrix, except that it also requires the squared norm of the first argument as additional input, useful in iterative kernel matrix calculations.
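The functions above are R (kernlab). As a rough Python sketch of the quantities kernelMatrix and kernelMult compute (this is not kernlab's API; the RBF kernel and names are illustrative), the kernel matrix and the kernel expansion can be written as:
```python
import numpy as np

def rbf(x, y, sigma=1.0):
    """RBF kernel, analogous in spirit to kernlab's rbfdot (illustrative only)."""
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))

def kernel_matrix(k, X, Y=None):
    """K[i, j] = k(X[i], Y[j]); Y defaults to X, as in kernelMatrix."""
    Y = X if Y is None else Y
    return np.array([[k(xi, yj) for yj in Y] for xi in X])

def kernel_mult(k, X, z, x_new):
    """Kernel expansion sum_i z_i k(x_i, x_new), roughly what kernelMult evaluates."""
    return sum(zi * k(xi, x_new) for zi, xi in zip(z, X))

X = np.random.default_rng(2).normal(size=(5, 2))
z = np.ones(5)
print(kernel_matrix(rbf, X).shape)    # (5, 5)
print(kernel_mult(rbf, X, z, X[0]))
```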
2018e
- (sklearn, 2018) ⇒ http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.pairwise_kernels.html
- QUOTE:
sklearn.metrics.pairwise.pairwise_kernels(X, Y=None, metric='linear', filter_params=False, n_jobs=1, **kwds)
Compute the kernel between arrays X and optional array Y.
This method takes either a vector array or a kernel matrix, and returns a kernel matrix. If the input is a vector array, the kernels are computed. If the input is a kernel matrix, it is returned instead.
This method provides a safe way to take a kernel matrix as input, while preserving compatibility with many other algorithms that take a vector array.
If Y is given (default is None), then the returned matrix is the pairwise kernel between the arrays from both X and Y …
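A minimal usage example of this function (the kernel choice and parameter values are illustrative):
```python
import numpy as np
from sklearn.metrics.pairwise import pairwise_kernels

X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])

# 3x3 kernel matrix between the rows of X, using an RBF kernel;
# extra keyword arguments such as gamma are passed on to the kernel.
K = pairwise_kernels(X, metric='rbf', gamma=0.5)
print(K.shape)     # (3, 3)

# With Y given, the result is the 3x2 matrix of kernel values
# between the rows of X and the rows of Y.
Y = np.array([[0.0, 0.0], [2.0, 2.0]])
K_xy = pairwise_kernels(X, Y, metric='rbf', gamma=0.5)
print(K_xy.shape)  # (3, 2)
```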
2017a
- (Sammut & Webb, 2017) ⇒ Claude Sammut, and Geoffrey I. Webb. (2017). "Kernel Matrix." In: Encyclopedia of Machine Learning and Data Mining. Springer.
- QUOTE: Given a kernel function [math]\displaystyle{ k: X \times X \to \mathbb{R} }[/math] and patterns [math]\displaystyle{ x_1, \cdots, x_m \in X }[/math], the [math]\displaystyle{ m \times m }[/math] matrix [math]\displaystyle{ K }[/math] with elements [math]\displaystyle{ K_{ij} : = k(x_i, x_j) }[/math] is called the kernel matrix of [math]\displaystyle{ k }[/math] with respect to [math]\displaystyle{ x_1, \cdots, x_m }[/math].
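Since the definition allows an arbitrary pattern domain [math]\displaystyle{ X }[/math], here is a brief sketch with string patterns and a simple bag-of-characters kernel (an inner product of character-count vectors, hence a valid kernel); the kernel choice is illustrative:
```python
from collections import Counter

def char_count_kernel(s, t):
    """k(s, t) = <character-count vector of s, character-count vector of t>."""
    cs, ct = Counter(s), Counter(t)
    return sum(cs[c] * ct[c] for c in cs)

patterns = ["kernel", "matrix", "gram"]
K = [[char_count_kernel(si, sj) for sj in patterns] for si in patterns]
for row in K:
    print(row)
```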
2017b
- (Karimi, 2017) ⇒ Amir-Hossein Karimi (2017). "A Summary Of The Kernel Matrix, And How To Learn It Effectively Using Semidefinite Programming". arXiv preprint arXiv:1709.06557.
- QUOTE: Say [math]\displaystyle{ x, y \in X }[/math] are original points [math]\displaystyle{ \in \mathbb{R}^d }[/math] and [math]\displaystyle{ \phi(x), \phi(y) \in \mathcal{F} }[/math] are the projected points [math]\displaystyle{ \in \mathbb{R}^D }[/math]. Now, because our classifier depends on the similarity between the projected points (i.e., [math]\displaystyle{ \langle \phi(x), \phi(y)\rangle }[/math]), and because computing this is costly, we instead consider a non-linear mapping [math]\displaystyle{ \phi : \mathcal{X} \to \mathcal{F} }[/math], such that for all [math]\displaystyle{ x, y \in \mathbb{R}^d , \langle \phi(x), \phi(y)\rangle_\mathcal{F} = K(x, y) }[/math] for some kernel [math]\displaystyle{ K(x, y) }[/math]. From here, we can simply learn a classifier [math]\displaystyle{ H : x \mapsto w^T\phi(x) }[/math] for some [math]\displaystyle{ w \in \mathcal{F} }[/math].
The information specifying the inner products between each pair of points in the embedding space is contained in the so-called kernel matrix, which is symmetric (due to the commutative property of distance between two points) and positive semidefinite (positive definite if all points are linearly independent). This matrix essentially describes the geometry of the embedding space. The importance of this lies in the fact that since kernel-based learning algorithms extract all information needed from inner products of training data points in [math]\displaystyle{ \mathcal{F} }[/math], there is no need to learn a kernel function [math]\displaystyle{ \phi }[/math] over the entire sample space to specify the embedding of a finite training dataset. Instead, the finite-dimensional kernel matrix (also known as a Gram matrix) that contains the inner products of training points in [math]\displaystyle{ \mathcal{F} }[/math] is sufficient.
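A small sketch of this point, substituting the homogeneous quadratic kernel [math]\displaystyle{ K(x, y) = \langle x, y\rangle^2 }[/math] for the generic kernel in the quote (illustrative, not from the paper): the kernel value equals the inner product of explicit degree-2 feature maps, so the kernel matrix can be formed without ever materializing [math]\displaystyle{ \phi }[/math].
```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map for x in R^2."""
    x1, x2 = x
    return np.array([x1 * x1, x2 * x2, np.sqrt(2.0) * x1 * x2])

def k(x, y):
    """Homogeneous quadratic kernel k(x, y) = <x, y>^2."""
    return float(np.dot(x, y)) ** 2

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])

# Same value with and without the explicit embedding.
print(np.dot(phi(x), phi(y)))   # 1.0
print(k(x, y))                  # 1.0
```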
2016
- http://www.quora.com/What-are-the-major-factors-that-motivate-us-to-use-Neural-networks-over-Kernel-methods-for-large-datasets-in-layman-terms
- QUOTE: … Kernel methods, by contrast, work by comparing pairs of data points. This means that the matrix of comparisons (aka the kernel matrix) needs lots of memory … On the other hand, deep networks are rather compact function classes, hence you can get away with compact storage, at the expense of a lot more computation to train them. …
2015
- (Vilnis & McCallum, 2015) ⇒ Luke Vilnis, and Andrew McCallum. (2015). “Word Representations via Gaussian Embedding.” In: arXiv preprint arXiv:1412.6623, submitted to ICLR 2015.
- QUOTE: However, these Bayesian methods apply Bayes’ rule to observed data to infer the latent distributions, whereas our model works directly in the space of probability distributions and discriminatively trains them. This allows us to go beyond the Bayesian approach and use arbitrary (and even asymmetric) training criteria, and is more similar to methods that learn kernels (Lanckriet et al., 2004) or function-valued neural networks such as mixture density networks (Bishop, 1994).
2007
- (Nguyen & Ho, 2007) ⇒ Canh Hao Nguyen, and Tu Bao Ho (2007, January). "Kernel Matrix Evaluation". In IJCAI (pp. 987-992).
- QUOTE: Training example set [math]\displaystyle{ \{x_i\}_{i=1,\cdots, n} \subset X }[/math] with the corresponding target vector [math]\displaystyle{ y=\{ y_i \}^T_{i=1, \cdots, n} \in \{−1, 1\}^n }[/math]. Suppose that [math]\displaystyle{ y_1 = \cdots = y_{n_+} = 1 }[/math] and [math]\displaystyle{ y_{n_++1} = \cdots = y_{n_++n_−} = −1;\; n_+ }[/math] examples belong to class [math]\displaystyle{ 1,\; n_− }[/math] examples belong to class [math]\displaystyle{ −1,\; n_+ + n_− = n }[/math]. Under a feature map [math]\displaystyle{ \phi }[/math], the kernel matrix is defined as:
[math]\displaystyle{ K = \{k_{ij} = \langle \phi(x_i), \phi(x_j)\rangle\}_{i=1, \cdots, n,\;j=1,\cdots, n} }[/math]
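One common way to relate such a kernel matrix to the target vector is the kernel-target alignment [math]\displaystyle{ \langle K, yy^T\rangle_F / (\|K\|_F \|yy^T\|_F) }[/math] of Cristianini et al.; the sketch below uses it as an illustrative evaluation measure, which is not necessarily the measure proposed in this paper:
```python
import numpy as np

def alignment(K, y):
    """Kernel-target alignment <K, y y^T>_F / (||K||_F * ||y y^T||_F)."""
    Y = np.outer(y, y)
    return np.sum(K * Y) / (np.linalg.norm(K) * np.linalg.norm(Y))

# Toy labelled data: the first n_+ points belong to class +1, the rest to class -1.
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(+1.0, 1.0, size=(4, 2)),
               rng.normal(-1.0, 1.0, size=(4, 2))])
y = np.array([1] * 4 + [-1] * 4)

K = X @ X.T                 # kernel matrix under the identity feature map phi(x) = x
print(alignment(K, y))      # higher values indicate K matches the labels better
```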
2004
- (Lanckriet et al., 2004) ⇒ Gert R. G. Lanckriet, Nello Cristianini, Peter Bartlett, Laurent El Ghaoui, and Michael I. Jordan. (2004). “Learning the Kernel Matrix with Semidefinite Programming.” In: The Journal of Machine Learning Research, 5.
- QUOTE: Kernel-based learning algorithms work by embedding the data into a Euclidean space, and then searching for linear relations among the embedded data points. The embedding is performed implicitly, by specifying the inner products between each pair of points in the embedding space. This information is contained in the so-called kernel matrix, a symmetric and positive semidefinite matrix that encodes the relative positions of all points. Specifying this matrix amounts to specifying the geometry of the embedding space and inducing a notion of similarity in the input space --- classical model selection problems in machine learning. In this paper we show how the kernel matrix can be learned from data via semidefinite programming (SDP) techniques.
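A compressed sketch of one formulation discussed in the paper (maximizing alignment with the labels over [math]\displaystyle{ K = \sum_i \mu_i K_i }[/math] subject to [math]\displaystyle{ K \succeq 0 }[/math] and a trace constraint), written with the cvxpy modelling library as an assumed dependency; it illustrates the SDP structure rather than reproducing the paper's full algorithm:
```python
import numpy as np
import cvxpy as cp

# Toy data and labels (illustrative only).
rng = np.random.default_rng(4)
X = np.vstack([rng.normal(+1.0, 1.0, size=(10, 3)),
               rng.normal(-1.0, 1.0, size=(10, 3))])
y = np.array([1.0] * 10 + [-1.0] * 10)
n = len(y)

# Two fixed candidate kernel matrices: linear and RBF.
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
Ks = [X @ X.T, np.exp(-0.5 * sq_dists)]

mu = cp.Variable(len(Ks))            # mixture weights over the candidate kernels
K = cp.Variable((n, n), PSD=True)    # the learned kernel matrix, constrained PSD
c = float(n)                         # trace budget

constraints = [K == sum(mu[i] * Ks[i] for i in range(len(Ks))),
               cp.trace(K) == c]
objective = cp.Maximize(cp.trace(K @ np.outer(y, y)))   # <K, y y^T>
problem = cp.Problem(objective, constraints)
problem.solve()

print(mu.value)    # learned kernel weights
```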
- ↑
Theorem 7.2.10 Let [math]\displaystyle{ v_1,\ldots,v_m }[/math] be vectors in an inner product space with inner product [math]\displaystyle{ \langle{\cdot,\cdot}\rangle }[/math] and let [math]\displaystyle{ G = [\langle{v_j,v_i}\rangle]_{i,j=1}^m \in M_m }[/math] . Then
(a) [math]\displaystyle{ G }[/math] is Hermitian and positive-semidefinite
(b) [math]\displaystyle{ G }[/math] is positive-definite if and only if the vectors [math]\displaystyle{ v_1,\ldots,v_m }[/math] are linearly independent.
(c) [math]\displaystyle{ \operatorname{rank}G=\dim\operatorname{span}\{v_1,\ldots,v_m\} }[/math]