High-Dimensionality Dataset

A High-Dimensionality Dataset is a structured dataset whose data records each have a large number of attributes (dimensions).

Context:
- It can range from being a Sparse High-Dimensional Dataset to being a Dense High-Dimensional Dataset.
- It can be an input to: High-Dimensionality Clustering, Dimensionality Reduction, …
- It can represent a subset of a High-Dimensional Space.
- …
Example(s):
- a High-Dimensionality Learning Dataset.
- a vectorized image dataset.
- a High-Dimensional Sensory Input.
- a Record Set composed of Hypervectors.
- …
Counter-Example(s):
- a Low-Dimensionality Dataset.
See: Index Data Structure, High-Dimensionality Matrix.
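
The sparse-to-dense range noted under Context can be illustrated with a minimal sketch. It assumes a dense record is a plain list of attribute values and a sparse record is a dictionary keyed by attribute index; the helper names `to_sparse` and `fill_ratio` and the toy data are hypothetical, chosen only for illustration.

```python
from typing import Dict, List


def to_sparse(record: List[float]) -> Dict[int, float]:
    """Keep only the non-zero attributes of one dense record, keyed by attribute index."""
    return {i: v for i, v in enumerate(record) if v != 0}


def fill_ratio(records: List[List[float]]) -> float:
    """Fraction of attribute values that are non-zero across the whole dataset."""
    total = sum(len(r) for r in records)
    nonzero = sum(1 for r in records for v in r if v != 0)
    return nonzero / total if total else 0.0


# Hypothetical toy data: 3 records with 10 attributes each (real datasets
# of this kind would have far more attributes per record).
dense_records = [
    [0, 0, 3.0, 0, 0, 0, 0, 1.5, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 2.0],
    [4.0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
]
sparse_records = [to_sparse(r) for r in dense_records]
```

A low `fill_ratio` suggests the dataset sits toward the sparse end of the range, where the dictionary form stores far fewer values than the list form.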

References

1999

(Agrawal et al., 1999) ⇒ Rakesh Agrawal, Johannes Ernst Gehrke, Dimitrios Gunopulos, Prabhakar Raghavan. (1999). “Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications.” US Patent 6,003,029.
- QUOTE: Emerging data mining applications place special requirements on clustering techniques, such as the ability to handle high dimensionality, assimilation of cluster descriptions by users, description minimation, and scalability and usability. Regarding high dimensionality of data clustering, an object typically has dozens of attributes in which the domains of the attributes are large. Clusters formed in a high-dimensional data space are not likely to be meaningful clusters because the expected average density of points anywhere in the high-dimensional data space is low. The requirement for high dimensionality in a data mining application is conventionally addressed by requiring a user to specify the subspace for cluster analysis.
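
The quoted point about low expected density can be made concrete with a small sketch. It assumes points are drawn uniformly from a unit hypercube that is split into equal-width grid cells. That assumption is ours, not the patent's, and the function names are hypothetical.

```python
import random


def expected_points_per_cell(n_points: int, bins_per_dim: int, n_dims: int) -> float:
    """Average number of points per grid cell when the cube has bins_per_dim ** n_dims cells."""
    return n_points / (bins_per_dim ** n_dims)


def occupied_cell_fraction(n_points: int, bins_per_dim: int, n_dims: int, seed: int = 0) -> float:
    """Empirical fraction of grid cells that contain at least one uniformly drawn point."""
    rng = random.Random(seed)
    cells = set()
    for _ in range(n_points):
        # rng.random() lies in [0, 1), so each coordinate maps to a bin in 0..bins_per_dim-1.
        cell = tuple(int(rng.random() * bins_per_dim) for _ in range(n_dims))
        cells.add(cell)
    return len(cells) / (bins_per_dim ** n_dims)
```

With 1,000 points and 10 bins per dimension, the expected count per cell is 10 in two dimensions but 1e-7 in ten dimensions, so nearly every cell is empty. This is one way to see why the quote describes full-space clusters as unlikely to be meaningful.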

1996

(Berchtold et al., 1996) ⇒ Stefan Berchtold, Daniel A. Keim, and Hans-Peter Kriegel. (1996). “The X-tree: An Index Structure for High-Dimensional Data.” In: Proceedings of VLDB Conference (VLDB 1996).
- QUOTE: In many applications, indexing of high-dimensional data has become increasingly important. In multimedia databases, for example, the multimedia objects are usually mapped to feature vectors in some high-dimensional space and queries are processed against a database of those feature vectors [Fal 94]. Similar approaches are taken in many other areas including CAD [MG 93], molecular biology (for the docking of molecules) [SBK 92], string matching and sequence alignment [AGMM 90], etc. Examples of feature vectors are color histograms [SH 94], shape descriptors [Jag 91, MG 95], Fourier vectors [WW 80], text descriptors [Kuk 92], etc. In some applications, the mapping process does not yield point objects, but extended spatial objects in high-dimensional space [MN 95]. In many of the mentioned applications, the databases are very large and consist of millions of data objects with several tens to a few hundreds of dimensions. For querying these databases, it is essential to use appropriate indexing techniques which provide an efficient access to high-dimensional data. The goal of this paper is to demonstrate the limits of currently available index structures, and present a new index structure which considerably improves the performance in indexing high-dimensional data.
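
The mapping from multimedia objects to feature vectors described in the quote can be sketched as follows. This is an illustrative example, not the method of the paper: it assumes an image is a sequence of 8-bit RGB pixel tuples, builds a joint color histogram as the feature vector, and answers a nearest-neighbor query by a linear scan, which is the baseline that an index structure aims to improve on. All names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Pixel = Tuple[int, int, int]  # assumed 8-bit (r, g, b) values in 0..255


def color_histogram(pixels: Sequence[Pixel], bins_per_channel: int = 4) -> List[float]:
    """Map an image to a normalized joint RGB histogram with bins_per_channel ** 3 dimensions."""
    hist = [0.0] * (bins_per_channel ** 3)
    for r, g, b in pixels:
        # Scale each 0..255 channel value to a bin index in 0..bins_per_channel-1.
        rb = r * bins_per_channel // 256
        gb = g * bins_per_channel // 256
        bb = b * bins_per_channel // 256
        hist[(rb * bins_per_channel + gb) * bins_per_channel + bb] += 1.0
    n = len(pixels)
    return [h / n for h in hist] if n else hist


def linear_scan_nn(query: Sequence[float], database: Sequence[Sequence[float]]) -> int:
    """Return the position of the database vector closest to query (Euclidean distance)."""
    return min(range(len(database)), key=lambda i: math.dist(query, database[i]))


# Hypothetical usage: two single-color "images" and a dark-red query image.
red = color_histogram([(255, 0, 0)] * 4)
blue = color_histogram([(0, 0, 255)] * 4)
query = color_histogram([(200, 0, 0)] * 4)
nearest = linear_scan_nn(query, [red, blue])
```

The default of 4 bins per channel yields 64-dimensional vectors, within the range of several tens to a few hundreds of dimensions that the quote mentions. The linear scan compares the query against every stored vector, which is the cost that motivates index structures for databases of millions of objects.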