Linear Kernel-based Support Vector Machine Algorithm
A Linear Kernel-based Support Vector Machine Algorithm is an SVM training algorithm that can be implemented by a linear SVM training system to solve a linear SVM training task (to produce a linear SVM based on a linear kernel).
- Context:
- It can be implemented by a Linear SVM System (that solves a Linear SVM Task).
- It can be interpreted as maximizing the margin of a Maximum-Margin Hyperplane (so that the distance from the linear decision boundary to the closest training entity is as wide as possible); an illustrative training sketch is given below, just before the References.
- It can have a Linear SVM Computational Complexity, such as: [math]\displaystyle{ \mathcal{O}(m^2\cdot n) }[/math] (runtime computational complexity).
- Example(s):
- …
- Counter-Example(s):
- See: Kernel-based SVM Algorithm, Discriminative Linear Classifier Learning with a Convex Loss Function, Linearly Separable, Quadratic Programming, Lagrange Multiplier, Karush–Kuhn–Tucker Conditions, Bias of an Estimator, Dual Problem, Maximum-Margin Hyperplane.
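The sketch below (referenced from the Context above) is a minimal, illustrative linear SVM training loop in the style of a Pegasos stochastic sub-gradient solver for the soft-margin hinge loss. It is not taken from the cited sources: the function names and hyperparameter defaults are hypothetical, and the bias term is handled with a plain (unregularized) sub-gradient step, a simplification of the published Pegasos algorithm. Each epoch costs on the order of m·n operations for m training examples with n features.
```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, epochs=20, seed=0):
    """Pegasos-style stochastic sub-gradient training of a soft-margin
    linear SVM.  X is an (m, n) feature matrix, y holds labels in {-1, +1}.
    Returns (w, b) for the separating hyperplane w.x - b = 0."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    w, b, t = np.zeros(n), 0.0, 0
    for _ in range(epochs):
        for i in rng.permutation(m):
            t += 1
            eta = 1.0 / (lam * t)              # decaying step size
            if y[i] * (X[i] @ w - b) < 1:      # hinge loss active: margin violated
                w = (1 - eta * lam) * w + eta * y[i] * X[i]
                b -= eta * y[i]                # simple (unregularized) bias step
            else:                              # only the L2 regularizer contributes
                w = (1 - eta * lam) * w
    return w, b

def predict(X, w, b):
    """Classify rows of X with the learned hyperplane."""
    return np.sign(X @ w - b)
```
For example, `w, b = train_linear_svm(X_train, y_train)` followed by `predict(X_test, w, b)`; the regularization strength `lam` plays roughly the role of 1/C in the usual soft-margin formulation.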
References
2016
- https://www.quora.com/Why-is-kernelized-SVM-much-slower-than-linear-SVM
- QUOTE: Basically, a kernel-based SVM requires on the order of n^2 computation for training and on the order of n·d computation for classification, where n is the number of training examples and d the input dimension (and assuming that the number of support vectors ends up being a fraction of n, which is shown to be expected in theory and in practice). Instead, a 2-class linear SVM requires on the order of n·d computation for training (times the number of training iterations, which remains small even for large n) and on the order of d computations for classification.
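As a rough, non-authoritative illustration of this training-cost gap, the following sketch (assuming scikit-learn is available; the dataset sizes are hypothetical and the timings hardware-dependent) fits a linear and an RBF-kernel SVM on the same synthetic data:
```python
import time
from sklearn.datasets import make_classification
from sklearn.svm import SVC, LinearSVC

# Synthetic data with n examples and d features (names follow the quote above).
X, y = make_classification(n_samples=10_000, n_features=50, random_state=0)

t0 = time.perf_counter()
LinearSVC(dual=False).fit(X, y)          # linear SVM: cost grows roughly with n*d
print(f"LinearSVC fit: {time.perf_counter() - t0:.2f}s")

t0 = time.perf_counter()
SVC(kernel="rbf").fit(X, y)              # kernelized SVM: cost grows roughly with n^2
print(f"SVC(rbf) fit:  {time.perf_counter() - t0:.2f}s")
```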
2014
- (Wikipedia, 2014) ⇒ http://en.wikipedia.org/wiki/Support_vector_machine#Linear_SVM Retrieved:2014-11-28.
- Given some training data [math]\displaystyle{ \mathcal{D} }[/math], a set of n points of the form :[math]\displaystyle{ \mathcal{D} = \left\{ (\mathbf{x}_i, y_i)\mid\mathbf{x}_i \in \mathbb{R}^p,\, y_i \in \{-1,1\}\right\}_{i=1}^n }[/math] where the yi is either 1 or −1, indicating the class to which the point [math]\displaystyle{ \mathbf{x}_i }[/math] belongs. Each [math]\displaystyle{ \mathbf{x}_i }[/math] is a p-dimensional real vector. We want to find the maximum-margin hyperplane that divides the points having [math]\displaystyle{ y_i=1 }[/math] from those having [math]\displaystyle{ y_i=-1 }[/math]. Any hyperplane can be written as the set of points [math]\displaystyle{ \mathbf{x} }[/math] satisfying : [math]\displaystyle{ \mathbf{w}\cdot\mathbf{x} - b=0,\, }[/math] where [math]\displaystyle{ \cdot }[/math] denotes the dot product and [math]\displaystyle{ {\mathbf{w}} }[/math] the (not necessarily normalized) normal vector to the hyperplane. The parameter [math]\displaystyle{ \tfrac{b}{\|\mathbf{w}\|} }[/math] determines the offset of the hyperplane from the origin along the normal vector [math]\displaystyle{ {\mathbf{w}} }[/math]. If the training data are linearly separable, we can select two hyperplanes in a way that they separate the data and there are no points between them, and then try to maximize their distance. The region bounded by them is called "the margin". These hyperplanes can be described by the equations : [math]\displaystyle{ \mathbf{w}\cdot\mathbf{x} - b=1\, }[/math] and : [math]\displaystyle{ \mathbf{w}\cdot\mathbf{x} - b=-1.\, }[/math]
By using geometry, we find the distance between these two hyperplanes is [math]\displaystyle{ \tfrac{2}{\|\mathbf{w}\|} }[/math], so we want to minimize [math]\displaystyle{ \|\mathbf{w}\| }[/math]. As we also have to prevent data points from falling into the margin, we add the following constraint: for each [math]\displaystyle{ i }[/math] either [math]\displaystyle{ \mathbf{w}\cdot\mathbf{x}_i - b \ge 1\qquad\text{ for }\mathbf{x}_i }[/math] of the first class or : [math]\displaystyle{ \mathbf{w}\cdot\mathbf{x}_i - b \le -1\qquad\text{ for }\mathbf{x}_i }[/math] of the second.
This can be rewritten as: : [math]\displaystyle{ y_i(\mathbf{w}\cdot\mathbf{x}_i - b) \ge 1, \quad \text{ for all } 1 \le i \le n.\qquad\qquad(1) }[/math]
We can put this together to get the optimization problem:
Minimize (in [math]\displaystyle{ {\mathbf{w},b} }[/math]) : [math]\displaystyle{ \|\mathbf{w}\| }[/math]
subject to (for any [math]\displaystyle{ i = 1, \dots, n }[/math]) : [math]\displaystyle{ y_i(\mathbf{w}\cdot\mathbf{x_i} - b) \ge 1. \, }[/math] …
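The hard-margin primal problem quoted above is a convex quadratic program, so it can be handed directly to an off-the-shelf solver. The sketch below is an illustrative example only (it assumes the cvxpy library and linearly separable data; with non-separable data the constraints are infeasible), and it minimizes the equivalent objective [math]\displaystyle{ \tfrac{1}{2}\|\mathbf{w}\|^2 }[/math].
```python
import cvxpy as cp

def hard_margin_svm(X, y):
    """Solve the hard-margin primal QP:
        minimize (1/2)||w||^2   subject to   y_i (w . x_i - b) >= 1 for all i.
    X is an (n, p) array of points, y an (n,) array of labels in {-1, +1}.
    Assumes linear separability; otherwise the problem is infeasible."""
    n, p = X.shape
    w = cp.Variable(p)
    b = cp.Variable()
    constraints = [cp.multiply(y, X @ w - b) >= 1]   # margin constraints, elementwise
    prob = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)), constraints)
    prob.solve()
    return w.value, b.value
```
The margin width between the two supporting hyperplanes can then be recovered as [math]\displaystyle{ \tfrac{2}{\|\mathbf{w}\|} }[/math] (e.g., 2 / numpy.linalg.norm(w)).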
2006
- (Chu et al., 2006) ⇒ Cheng-Tao Chu, Sang Kyun Kim, Yi-An Lin, YuanYuan Yu, Gary Bradski, Andrew Y. Ng, and Kunle Olukotun. (2006). “Map-Reduce for Machine Learning on Multicore.” In: Proceedings of the 19th International Conference on Neural Information Processing Systems.
- QUOTE: ... Support Vector Machine (SVM) Linear SVM’s (Vapnik, 1981, Platt, 1999) primary goal is to optimize the following primal problem ...