Introduction to Dimensionality Reduction
- 🧮 Reduces high-dimensional data to lower dimensions.
- 👁️ Enables visualization of complex datasets.
- 🏃 Improves computational efficiency for ML algorithms.
- 🔍 Helps in feature selection and data exploration.
- 🧠 Preserves important structures in the data.
t-SNE Overview
- 📊 t-Distributed Stochastic Neighbor Embedding
- 🎯 Focuses on preserving local structures in data.
- 🔄 Uses probability distributions to map similarities.
- 🧮 Employs t-distribution in low-dimensional space.
- 👥 Excels at revealing clusters and patterns.
UMAP Overview
- 📊 Uniform Manifold Approximation and Projection
- 🌐 Based on topological data analysis and manifold learning.
- ⚡ Generally faster than t-SNE for large datasets.
- 🔗 Preserves both local and global data structures.
- 🔢 Can transform new, unseen points after fitting (out-of-sample support).
t-SNE Algorithm Steps
- 1️⃣ Compute pairwise similarities in high-dimensional space.
- 2️⃣ Initialize random points in low-dimensional space.
- 3️⃣ Compute pairwise similarities in low-dimensional space.
- 4️⃣ Minimize KL divergence between distributions.
- 5️⃣ Iterate until convergence or max iterations reached.
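The steps above can be sketched in NumPy. This is an illustrative simplification, not the real algorithm: it uses a single fixed Gaussian bandwidth instead of the per-point perplexity calibration, and plain gradient descent without momentum or early exaggeration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 10))        # high-dimensional data
Y = rng.standard_normal((50, 2)) * 1e-2  # Step 2: small random low-dim init

def affinities_high(X, sigma=1.0):
    # Step 1: Gaussian similarities in high-dimensional space
    d2 = np.square(X[:, None] - X[None, :]).sum(-1)
    P = np.exp(-d2 / (2 * sigma**2))
    np.fill_diagonal(P, 0.0)
    return P / P.sum()

def affinities_low(Y):
    # Step 3: Student-t (heavy-tailed) similarities in low-dimensional space
    d2 = np.square(Y[:, None] - Y[None, :]).sum(-1)
    Q = 1.0 / (1.0 + d2)
    np.fill_diagonal(Q, 0.0)
    return Q / Q.sum(), Q  # normalized and unnormalized kernels

P = affinities_high(X)
for _ in range(200):  # Step 5: iterate for a fixed number of steps
    Qn, Q = affinities_low(Y)
    # Step 4: gradient of KL(P || Q) with respect to each embedded point
    PQ = (P - Qn) * Q
    grad = 4.0 * (PQ[:, :, None] * (Y[:, None] - Y[None, :])).sum(axis=1)
    Y -= 50.0 * grad

kl = np.sum(P[P > 0] * np.log(P[P > 0] / Qn[P > 0]))
```

The heavy-tailed Student-t kernel in the low-dimensional space is what lets moderately distant points spread out, which is the "t" in t-SNE.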
UMAP Algorithm Steps
- 1️⃣ Construct fuzzy topological representation of data.
- 2️⃣ Create weighted graph from fuzzy union.
- 3️⃣ Initialize a low-dimensional layout (spectral embedding by default).
- 4️⃣ Optimize the layout with stochastic gradient descent, applying attractive and repulsive forces along graph edges.
- 5️⃣ Iterate until the layout balances local and global structure.
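The graph-construction phase (steps 1–2) can be illustrated with scikit-learn. This is a simplified stand-in: a plain k-nearest-neighbor connectivity graph takes the place of UMAP's calibrated fuzzy simplicial set, but the symmetrization uses the same probabilistic "fuzzy union" t-conorm a + b − ab that UMAP applies to edge weights.

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))

# Directed kNN graph: entry (i, j) is 1 if j is among i's 10 nearest neighbors
A = kneighbors_graph(X, n_neighbors=10, mode='connectivity').toarray()

# Step 2: symmetrize with the fuzzy union, giving an undirected weighted graph
W = A + A.T - A * A.T
```

Steps 3–5 then lay this graph out in low dimensions by minimizing a cross-entropy between graph edge weights and embedded-point similarities.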
t-SNE Parameters
- 🎛️ Perplexity: Effective number of neighbors; balances local vs. global structure (typically 5–50).
- 🔢 Number of iterations: Affects convergence quality.
- 🏃 Learning rate: Controls step size in optimization.
- 🌱 Random state: Seed for reproducibility.
- 📏 Metric: Distance measure (e.g., Euclidean, cosine).
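The parameters above map directly onto scikit-learn's TSNE constructor. A short example of setting them explicitly; the iteration count is left at its default because its keyword was renamed (n_iter to max_iter) across recent scikit-learn releases.

```python
import numpy as np
from sklearn.manifold import TSNE

X = np.random.RandomState(0).randn(200, 20)

tsne = TSNE(
    n_components=2,
    perplexity=30,        # local/global balance; typical range 5-50
    learning_rate=200.0,  # step size in the optimization
    metric='euclidean',   # distance measure; 'cosine' is also common
    random_state=42,      # seed for reproducibility
)
X_emb = tsne.fit_transform(X)
```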
UMAP Parameters
- 👥 n_neighbors: Size of local neighborhood.
- 📏 min_dist: Minimum spacing between embedded points; smaller values pack clusters more tightly.
- 🔢 n_components: Dimensionality of output.
- 📊 metric: Distance function to use.
- 🌱 random_state: Seed for reproducibility (fixing it disables parallelism).
t-SNE Code Example
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
# Generate sample data (labels are random, so colors will not form real clusters)
X = np.random.randn(1000, 50)
y = np.random.randint(0, 5, 1000)
# Apply t-SNE
tsne = TSNE(n_components=2, random_state=42)
X_tsne = tsne.fit_transform(X)
# Plot results
plt.figure(figsize=(10, 8))
scatter = plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y, cmap='viridis')
plt.colorbar(scatter)
plt.title('t-SNE visualization of random data')
plt.show()
UMAP Code Example
import numpy as np
import umap
import matplotlib.pyplot as plt
# Generate sample data
X = np.random.randn(1000, 50)
y = np.random.randint(0, 5, 1000)
# Apply UMAP
reducer = umap.UMAP(random_state=42)
X_umap = reducer.fit_transform(X)
# Plot results
plt.figure(figsize=(10, 8))
scatter = plt.scatter(X_umap[:, 0], X_umap[:, 1], c=y, cmap='viridis')
plt.colorbar(scatter)
plt.title('UMAP visualization of random data')
plt.show()
Comparing t-SNE and UMAP
import numpy as np
from sklearn.manifold import TSNE
import umap
import matplotlib.pyplot as plt
# Generate sample data
X = np.random.randn(1000, 50)
y = np.random.randint(0, 5, 1000)
# Apply t-SNE
tsne = TSNE(n_components=2, random_state=42)
X_tsne = tsne.fit_transform(X)
# Apply UMAP
reducer = umap.UMAP(random_state=42)
X_umap = reducer.fit_transform(X)
# Plot results
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(20, 8))
scatter1 = ax1.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y, cmap='viridis')
ax1.set_title('t-SNE visualization')
plt.colorbar(scatter1, ax=ax1)
scatter2 = ax2.scatter(X_umap[:, 0], X_umap[:, 1], c=y, cmap='viridis')
ax2.set_title('UMAP visualization')
plt.colorbar(scatter2, ax=ax2)
plt.show()
Advantages of t-SNE
- 🎯 Excellent at preserving local structures.
- 👥 Reveals clusters effectively in the data.
- 🔄 Handles non-linear relationships well.
- 🖼️ Produces visually appealing results.
- 🧠 Widely used in machine learning visualization.
Advantages of UMAP
- ⚡ Generally faster than t-SNE, especially for large datasets.
- 🌐 Better preserves global structure of the data.
- 🔢 Scales to much larger datasets (millions of points).
- 🔄 Supports supervised dimensionality reduction.
- 📈 Often provides more stable results across runs.
Limitations of t-SNE
- 🐢 Can be computationally expensive for large datasets.
- 🌐 May not preserve global structure well.
- 🔄 Results can vary with different random initializations.
- ⏱️ Cannot embed new points without refitting (no out-of-sample transform).
- 🧮 Struggles with very sparse data.
Limitations of UMAP
- 🧠 Can be more difficult to interpret than t-SNE.
- 🎛️ Sensitive to choice of hyperparameters.
- 📚 Mathematical foundations (fuzzy topology, Riemannian geometry) are harder to follow than t-SNE's probabilistic framing.
- 🔄 Results still vary across runs unless random_state is fixed.
- 🖼️ Visualizations may be less aesthetically pleasing than t-SNE.
When to Use t-SNE
- 👁️ For visualizing high-dimensional data.
- 🔍 When focusing on local structure preservation.
- 👥 For cluster analysis and pattern recognition.
- 🧠 In exploratory data analysis of complex datasets.
- 📊 When dealing with small-to-moderate datasets (roughly up to tens of thousands of points).
When to Use UMAP
- 📈 For large-scale data visualization.
- 🌐 When preserving global structure is important.
- ⚡ For faster dimensionality reduction on big data.
- 🔄 In supervised learning scenarios.
- 🔗 When needing to preserve relationships across multiple scales.