Dataset Dimensionality Reduction Task
A Dataset Dimensionality Reduction Task is a data transformation task that requires the creation of a lower-dimensional data representation, i.e. a mapping from a dataset to another dataset with fewer data dimensions (into a lower-dimensional space).
- AKA: Feature Projection.
- Context:
- Input: a Database (typically simple).
- Input (optional): count of required Dimensions.
- Input (optional): Target Data Attribute.
- Output: a Low-Dimensionality Database.
- It can be solved by a Dimensionality Reduction System (that implements a Dimensionality Reduction algorithm).
- It can range from being a Space-Preserving Dimensionality Reduction Task (feature selection) to being a Feature-Space Compression Task (feature extraction).
- It can range from being an Unsupervised Dimensionality Reduction Task to being a Supervised Dimensionality Reduction Task (a minimal sketch of both variants follows the See list below).
- …
- Counter-Example(s):
- a Dataset Creation Task (such as a feature creation task).
- a Feature Weighting Task.
- See: Lossy Compression, Curse of Dimensionality, Linear Discriminant Analysis, Canonical Correlation Analysis, Embedding, Locality Sensitive Hashing.
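The unsupervised/supervised range noted in the Context section can be made concrete with a small sketch. The snippet below is illustrative only: it assumes scikit-learn's PCA and LinearDiscriminantAnalysis as stand-in Dimensionality Reduction Systems and the Iris dataset as the input Database; neither is prescribed by the task definition itself.

```python
# Minimal sketch (assumed tooling: scikit-learn, Iris data) contrasting an
# unsupervised and a supervised Dataset Dimensionality Reduction Task.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)   # input Database: 150 records x 4 dimensions

# Unsupervised variant: PCA ignores the Target Data Attribute y.
X_pca = PCA(n_components=2).fit_transform(X)

# Supervised variant: LDA uses y; n_components <= (number of classes - 1).
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)

print(X_pca.shape, X_lda.shape)     # (150, 2) (150, 2): Low-Dimensionality Databases
```

In both cases the optional input "count of required Dimensions" appears as the `n_components` parameter; only the supervised variant consumes the optional Target Data Attribute.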
References
2017
- (Vlachos, 2017) ⇒ Michail Vlachos. (2017). “Dimensionality Reduction.” In: Sammut, C., & Webb, G.I. (eds) Encyclopedia of Machine Learning and Data Mining. Springer, Boston, MA.
- ABSTRACT: Dimensionality reduction is an important data pre-processing step when dealing with Big Data. We explain how it can be used for speeding up search operations and show applications for time-series datasets.
2012
- http://en.wikipedia.org/wiki/Dimensionality_reduction
- QUOTE: In machine learning, dimension reduction is the process of reducing the number of random variables under consideration, and can be divided into feature selection and feature extraction.
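The two branches named in this quote can be sketched side by side: feature selection keeps a subset of the original dimensions, while feature extraction computes new ones. The scikit-learn classes and the Iris data below are illustrative assumptions, not part of the quoted definition.

```python
# Sketch of the feature selection / feature extraction split, assuming
# scikit-learn's SelectKBest and PCA as representative systems.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# Feature selection: the 2 retained columns are original attributes.
X_sel = SelectKBest(score_func=f_classif, k=2).fit_transform(X, y)

# Feature extraction: the 2 output columns are new linear combinations.
X_ext = PCA(n_components=2).fit_transform(X)
```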
2011
- (Vlachos, 2011a) ⇒ Michail Vlachos. (2011). “Dimensionality Reduction.” In: (Sammut & Webb, 2011) p.274
2009
- (Chen et al., 2009) ⇒ Bo Chen, Wai Lam, Ivor Tsang, and Tak-Lam Wong. (2009). “Extracting Discriminative Concepts for Domain Adaptation in Text Mining.” In: Proceedings of ACM SIGKDD Conference (KDD-2009). doi:10.1145/1557019.1557045
- QUOTE: … While the use of labels in linear discriminant analysis usually helps extract more discriminative features, the label information from the source domains may be also useful to learn kernels or extract features for a better domain adaptation.
2002
- (Fodor, 2002) ⇒ Imola K. Fodor. (2002). “A Survey of Dimension Reduction Techniques.” LLNL Technical Report, UCRL-ID-148494.
- QUOTE: In mathematical terms, the problem we investigate can be stated as follows: given the [math]\displaystyle{ p }[/math]-dimensional random variable [math]\displaystyle{ \mathbf{x} = (x_1, \ldots, x_p)^T }[/math], find a lower dimensional representation of it, [math]\displaystyle{ \mathbf{s} = (s_1, \ldots, s_k)^T }[/math] with [math]\displaystyle{ k \le p }[/math], that captures the content in the original data, according to some criterion. The components of [math]\displaystyle{ \mathbf{s} }[/math] are sometimes called the hidden components. Different fields use different names for the [math]\displaystyle{ p }[/math] multivariate vectors: the term "variable" is mostly used in statistics, while "feature" and "attribute" are alternatives commonly used in the computer science and machine learning literature.
Throughout this paper, we assume that we have [math]\displaystyle{ n }[/math] observations, each being a realization of the [math]\displaystyle{ p }[/math]-dimensional random variable [math]\displaystyle{ \mathbf{x} = (x_1, \ldots, x_p)^T }[/math] with mean [math]\displaystyle{ E(\mathbf{x}) = \boldsymbol{\mu} = (\mu_1, \ldots, \mu_p)^T }[/math] and covariance matrix [math]\displaystyle{ E\{(\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})^T\} = \Sigma_{p \times p} }[/math]. We denote such an observation matrix by [math]\displaystyle{ X = \{x_{i,j} : 1 \le i \le p, 1 \le j \le n\} }[/math]. If [math]\displaystyle{ \mu_i }[/math] and [math]\displaystyle{ \sigma_i = \sqrt{\Sigma_{ii}} }[/math] denote the mean and the standard deviation of the [math]\displaystyle{ i }[/math]th random variable, respectively, then we will often standardize the observations [math]\displaystyle{ x_{i,j} }[/math] by [math]\displaystyle{ (x_{i,j} - \mu_i)/\sigma_i }[/math], where …
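As a worked complement to the standardization step in this quote, the sketch below applies [math]\displaystyle{ (x_{i,j} - \mu_i)/\sigma_i }[/math] to a small synthetic matrix; the data and the NumPy usage are illustrative assumptions, not taken from Fodor (2002).

```python
# Numeric sketch of the quoted standardization, following the convention
# that X is p x n: variables in rows, observations in columns.
import numpy as np

rng = np.random.default_rng(0)     # illustrative synthetic data
p, n = 3, 5
X = rng.normal(loc=[[10.0], [0.0], [-5.0]],
               scale=[[2.0], [1.0], [0.5]], size=(p, n))

mu = X.mean(axis=1, keepdims=True)       # mu_i: one mean per variable (row)
sigma = X.std(axis=1, keepdims=True)     # sigma_i: one std per variable (row)
Z = (X - mu) / sigma                     # (x_ij - mu_i) / sigma_i

print(Z.mean(axis=1))   # ~0 for every variable
print(Z.std(axis=1))    # ~1 for every variable
```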