t-Distributed Stochastic Neighbor Embedding (t-SNE) Algorithm
A t-Distributed Stochastic Neighbor Embedding (t-SNE) Algorithm is a high-dimensional data visualization algorithm that minimizes the Kullback–Leibler divergence between a probability distribution defined over pairs of high-dimensional data points and a corresponding distribution over their low-dimensional representations.
- Context:
- It can be implemented into a t-SNE-based System.
- It can (typically) perform a Nonlinear Dimensionality Reduction.
- It can produce visualization results that are highly sensitive to parameter choices.
- It can have computational complexity and memory usage that scale quadratically with the number of data points, making it less suitable for very large datasets without optimization techniques or approximations.
- ...
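The context items above can be illustrated with a minimal sketch using scikit-learn's sklearn.manifold.TSNE. The data here is synthetic (two separated Gaussian blobs invented for illustration); the parameter values are illustrative defaults, not recommendations.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Synthetic example: two well-separated 50-dimensional Gaussian blobs.
X = np.vstack([
    rng.normal(0.0, 1.0, size=(100, 50)),
    rng.normal(5.0, 1.0, size=(100, 50)),
])

# Nonlinear dimensionality reduction from 50 dimensions down to 2.
tsne = TSNE(n_components=2, perplexity=30, random_state=0)
X_2d = tsne.fit_transform(X)
print(X_2d.shape)
```

Because the cost function is non-convex and sensitive to parameters such as `perplexity`, re-running with a different `random_state` or perplexity can yield a visibly different map of the same data.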
- Example(s):
- Visualization of gene expression profiles from multiple patients to identify patterns related to different cancer types.
- Analysis of cybersecurity data to identify patterns of attacks or anomalies within network traffic.
- Grouping of similar words or documents in natural language processing applications based on their contextual similarities.
- ...
- Counter-Example(s):
- Principal Component Analysis (PCA) Algorithm, which performs linear dimensionality reduction.
- Uniform Manifold Approximation and Projection (UMAP) Algorithm, a nonlinear dimensionality reduction technique with different mathematical foundations and performance characteristics.
- See: Kullback–Leibler Divergence, sklearn.manifold.TSNE, Dimensionality Reduction, High-Dimensional Data, Data Visualization.
References
- https://lvdmaaten.github.io/tsne/
- https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding
2024
- https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html
- QUOTE: T-distributed Stochastic Neighbor Embedding.
t-SNE [1] is a tool to visualize high-dimensional data. It converts similarities between data points to joint probabilities and tries to minimize the Kullback-Leibler divergence between the joint probabilities of the low-dimensional embedding and the high-dimensional data. t-SNE has a cost function that is not convex, i.e. with different initializations we can get different results.
It is highly recommended to use another dimensionality reduction method (e.g. PCA for dense data or TruncatedSVD for sparse data) to reduce the number of dimensions to a reasonable amount (e.g. 50) if the number of features is very high. This will suppress some noise and speed up the computation of pairwise distances between samples. For more tips see Laurens van der Maaten’s FAQ [2].
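The two-stage pipeline recommended in the quote above (PCA first, then t-SNE) can be sketched as follows; the data is synthetic and the dimension counts (1000 features reduced to 50, then to 2) follow the quote's suggested "reasonable amount".

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Synthetic dense high-dimensional data: 300 samples, 1000 features.
X = rng.normal(size=(300, 1000))

# Stage 1: PCA reduces to ~50 dimensions, suppressing noise and
# speeding up the pairwise-distance computations inside t-SNE.
X_50 = PCA(n_components=50, random_state=0).fit_transform(X)

# Stage 2: t-SNE embeds the 50-dimensional data into 2 dimensions.
X_2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X_50)
print(X_2d.shape)
```

For sparse inputs, the quote suggests substituting TruncatedSVD for PCA in stage 1, since PCA requires dense input.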
2024
- (Wikipedia, 2024) ⇒ https://en.wikipedia.org/wiki/t-distributed_stochastic_neighbor_embedding Retrieved:2024-3-26.
- t-distributed stochastic neighbor embedding (t-SNE) is a statistical method for visualizing high-dimensional data by giving each datapoint a location in a two or three-dimensional map. It is based on Stochastic Neighbor Embedding originally developed by Geoffrey Hinton and Sam Roweis, where Laurens van der Maaten proposed the t-distributed variant. It is a nonlinear dimensionality reduction technique for embedding high-dimensional data for visualization in a low-dimensional space of two or three dimensions. Specifically, it models each high-dimensional object by a two- or three-dimensional point in such a way that similar objects are modeled by nearby points and dissimilar objects are modeled by distant points with high probability.
The t-SNE algorithm comprises two main stages. First, t-SNE constructs a probability distribution over pairs of high-dimensional objects in such a way that similar objects are assigned a higher probability while dissimilar points are assigned a lower probability. Second, t-SNE defines a similar probability distribution over the points in the low-dimensional map, and it minimizes the Kullback–Leibler divergence (KL divergence) between the two distributions with respect to the locations of the points in the map. While the original algorithm uses the Euclidean distance between objects as the base of its similarity metric, this can be changed as appropriate. A Riemannian variant is UMAP.
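The two-stage objective described above can be written compactly. Following the standard formulation by van der Maaten and Hinton, t-SNE minimizes, with respect to the map points $y_i$, the KL divergence between the high-dimensional pairwise affinities $p_{ij}$ and the low-dimensional affinities $q_{ij}$, where the latter use a Student t-distribution (one degree of freedom) kernel:

```latex
C = \mathrm{KL}(P \,\|\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}},
\qquad
q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}
              {\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}}
```

The heavy tails of the t-distribution in the map allow dissimilar points to be placed far apart without incurring a large cost, which mitigates the "crowding problem" of the original SNE.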
t-SNE has been used for visualization in a wide range of applications, including genomics, computer security research, natural language processing, music analysis, cancer research, bioinformatics, geological domain interpretation, and biomedical signal processing. While t-SNE plots often seem to display clusters, the visual clusters can be influenced strongly by the chosen parameterization and therefore a good understanding of the parameters for t-SNE is necessary. Such "clusters" can be shown to even appear in non-clustered data, and thus may be false findings. Interactive exploration may thus be necessary to choose parameters and validate results. It has been demonstrated that t-SNE is often able to recover well-separated clusters, and with special parameter choices, approximates a simple form of spectral clustering. For a data set with n elements, t-SNE runs in O(n²) time and requires O(n²) space.
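The parameter sensitivity noted above can be probed directly: embedding the same data at several perplexity values and comparing the resulting maps. The sketch below uses purely random (non-clustered) synthetic data, so any apparent cluster structure in either embedding would be an artifact of the parameterization, not of the data.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Synthetic data with NO real cluster structure: 150 i.i.d. Gaussian points.
X = rng.normal(size=(150, 20))

# Embed the same data at two perplexity values; structure that appears
# at one setting but not the other is a parameterization artifact.
embeddings = {
    p: TSNE(n_components=2, perplexity=p, random_state=0).fit_transform(X)
    for p in (5, 30)
}
for p, emb in embeddings.items():
    print(p, emb.shape)
```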
2019
- (Wikipedia, 2019) ⇒ https://en.wikipedia.org/wiki/t-distributed_stochastic_neighbor_embedding Retrieved:2019-8-20.
- T-distributed Stochastic Neighbor Embedding (t-SNE) is a machine learning algorithm for visualization developed by Laurens van der Maaten and Geoffrey Hinton. It is a nonlinear dimensionality reduction technique well-suited for embedding high-dimensional data for visualization in a low-dimensional space of two or three dimensions. Specifically, it models each high-dimensional object by a two- or three-dimensional point in such a way that similar objects are modeled by nearby points and dissimilar objects are modeled by distant points with high probability. ...