Introduction to Dimensionality Reduction
- 🧮 Reduces high-dimensional data to lower dimensions.
- 👁️ Enables visualization of complex datasets.
- 🏃 Improves computational efficiency for ML algorithms.
- 🔍 Helps in feature selection and data exploration.
- 🧠 Preserves important structures in the data.
t-SNE Overview
- 📊 t-Distributed Stochastic Neighbor Embedding
- 🎯 Focuses on preserving local structures in data.
- 🔄 Uses probability distributions to map similarities.
- 🧮 Employs t-distribution in low-dimensional space.
- 👥 Excels at revealing clusters and patterns.
UMAP Overview
- 📊 Uniform Manifold Approximation and Projection
- 🌐 Based on topological data analysis and manifold learning.
- ⚡ Generally faster than t-SNE for large datasets.
- 🔗 Preserves both local and global data structures.
- 🔢 Can transform new, unseen points after fitting (out-of-sample support).
t-SNE Algorithm Steps
- 1️⃣ Compute pairwise similarities in high-dimensional space.
- 2️⃣ Initialize random points in low-dimensional space.
- 3️⃣ Compute pairwise similarities in low-dimensional space.
- 4️⃣ Minimize KL divergence between distributions.
- 5️⃣ Iterate until convergence or max iterations reached.
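The steps above can be sketched in NumPy. This is an illustrative simplification, not the real algorithm: it uses a single fixed Gaussian bandwidth instead of the per-point perplexity calibration, and plain gradient descent without momentum or early exaggeration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 10))        # high-dimensional data
Y = rng.standard_normal((50, 2)) * 1e-2  # Step 2: small random low-dim init

def affinities_high(X, sigma=1.0):
    # Step 1: Gaussian similarities in high-dimensional space
    d2 = np.square(X[:, None] - X[None, :]).sum(-1)
    P = np.exp(-d2 / (2 * sigma**2))
    np.fill_diagonal(P, 0.0)
    return P / P.sum()

def affinities_low(Y):
    # Step 3: Student-t (heavy-tailed) similarities in low-dimensional space
    d2 = np.square(Y[:, None] - Y[None, :]).sum(-1)
    Q = 1.0 / (1.0 + d2)
    np.fill_diagonal(Q, 0.0)
    return Q / Q.sum(), Q  # normalized and unnormalized kernels

P = affinities_high(X)
for _ in range(200):  # Step 5: iterate for a fixed number of steps
    Qn, Q = affinities_low(Y)
    # Step 4: gradient of KL(P || Q) with respect to each embedded point
    PQ = (P - Qn) * Q
    grad = 4.0 * (PQ[:, :, None] * (Y[:, None] - Y[None, :])).sum(axis=1)
    Y -= 50.0 * grad

kl = np.sum(P[P > 0] * np.log(P[P > 0] / Qn[P > 0]))
```

The heavy-tailed Student-t kernel in the low-dimensional space is what lets moderately distant points spread out, which is the "t" in t-SNE.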
UMAP Algorithm Steps
- 1️⃣ Construct fuzzy topological representation of data.
- 2️⃣ Create weighted graph from fuzzy union.
- 3️⃣ Initialize a low-dimensional layout (spectral embedding by default).
- 4️⃣ Optimize the layout with stochastic gradient descent, applying attractive and repulsive forces along graph edges.
- 5️⃣ Iterate until the layout balances local and global structure.
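The graph-construction phase (steps 1–2) can be illustrated with scikit-learn. This is a simplified stand-in: a plain k-nearest-neighbor connectivity graph takes the place of UMAP's calibrated fuzzy simplicial set, but the symmetrization uses the same probabilistic "fuzzy union" t-conorm a + b − ab that UMAP applies to edge weights.

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))

# Directed kNN graph: entry (i, j) is 1 if j is among i's 10 nearest neighbors
A = kneighbors_graph(X, n_neighbors=10, mode='connectivity').toarray()

# Step 2: symmetrize with the fuzzy union, giving an undirected weighted graph
W = A + A.T - A * A.T
```

Steps 3–5 then lay this graph out in low dimensions by minimizing a cross-entropy between graph edge weights and embedded-point similarities.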
t-SNE Parameters
- 🎛️ Perplexity: Effective number of neighbors; balances local vs. global structure (typically 5–50).
- 🔢 Number of iterations: Affects convergence quality.
- 🏃 Learning rate: Controls step size in optimization.
- 🌱 Random state: Seed for reproducibility.
- 📏 Metric: Distance measure (e.g., Euclidean, cosine).
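The parameters above map directly onto scikit-learn's TSNE constructor. A short example of setting them explicitly; the iteration count is left at its default because its keyword was renamed (n_iter to max_iter) across recent scikit-learn releases.

```python
import numpy as np
from sklearn.manifold import TSNE

X = np.random.RandomState(0).randn(200, 20)

tsne = TSNE(
    n_components=2,
    perplexity=30,        # local/global balance; typical range 5-50
    learning_rate=200.0,  # step size in the optimization
    metric='euclidean',   # distance measure; 'cosine' is also common
    random_state=42,      # seed for reproducibility
)
X_emb = tsne.fit_transform(X)
```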
UMAP Parameters
- 👥 n_neighbors: Size of local neighborhood.
- 📏 min_dist: Minimum spacing between embedded points; smaller values pack clusters more tightly.
- 🔢 n_components: Dimensionality of output.
- 📊 metric: Distance function to use.
- 🌱 random_state: Seed for reproducibility (fixing it disables parallelism).
t-SNE Code Example
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
# Generate sample data (labels are random, so colors will not form real clusters)
X = np.random.randn(1000, 50)
y = np.random.randint(0, 5, 1000)
# Apply t-SNE
tsne = TSNE(n_components=2, random_state=42)
X_tsne = tsne.fit_transform(X)
# Plot results
plt.figure(figsize=(10, 8))
scatter = plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y, cmap='viridis')
plt.colorbar(scatter)
plt.title('t-SNE visualization of random data')
plt.show()
UMAP Code Example
import numpy as np
import umap
import matplotlib.pyplot as plt
# Generate sample data
X = np.random.randn(1000, 50)
y = np.random.randint(0, 5, 1000)
# Apply UMAP
reducer = umap.UMAP(random_state=42)
X_umap = reducer.fit_transform(X)
# Plot results
plt.figure(figsize=(10, 8))
scatter = plt.scatter(X_umap[:, 0], X_umap[:, 1], c=y, cmap='viridis')
plt.colorbar(scatter)
plt.title('UMAP visualization of random data')
plt.show()
Comparing t-SNE and UMAP
import numpy as np
from sklearn.manifold import TSNE
import umap
import matplotlib.pyplot as plt
# Generate sample data
X = np.random.randn(1000, 50)
y = np.random.randint(0, 5, 1000)
# Apply t-SNE
tsne = TSNE(n_components=2, random_state=42)
X_tsne = tsne.fit_transform(X)
# Apply UMAP
reducer = umap.UMAP(random_state=42)
X_umap = reducer.fit_transform(X)
# Plot results
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(20, 8))
scatter1 = ax1.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y, cmap='viridis')
ax1.set_title('t-SNE visualization')
plt.colorbar(scatter1, ax=ax1)
scatter2 = ax2.scatter(X_umap[:, 0], X_umap[:, 1], c=y, cmap='viridis')
ax2.set_title('UMAP visualization')
plt.colorbar(scatter2, ax=ax2)
plt.show()
Advantages of t-SNE
- 🎯 Excellent at preserving local structures.
- 👥 Reveals clusters effectively in the data.
- 🔄 Handles non-linear relationships well.
- 🖼️ Produces visually appealing results.
- 🧠 Widely used in machine learning visualization.
Advantages of UMAP
- ⚡ Generally faster than t-SNE, especially for large datasets.
- 🌐 Better preserves global structure of the data.
- 🔢 Scales to much larger datasets (millions of points).
- 🔄 Supports supervised dimensionality reduction.
- 📈 Often provides more stable results across runs.
Limitations of t-SNE
- 🐢 Can be computationally expensive for large datasets.
- 🌐 May not preserve global structure well.
- 🔄 Results can vary with different random initializations.
- ⏱️ Cannot embed new points without refitting (no out-of-sample transform).
- 🧮 Struggles with very sparse data.
Limitations of UMAP
- 🧠 Can be more difficult to interpret than t-SNE.
- 🎛️ Sensitive to choice of hyperparameters.
- 📚 Mathematical foundations (fuzzy topology, Riemannian geometry) are harder to follow than t-SNE's probabilistic framing.
- 🔄 Results still vary across runs unless random_state is fixed.
- 🖼️ Visualizations may be less aesthetically pleasing than t-SNE.
When to Use t-SNE
- 👁️ For visualizing high-dimensional data.
- 🔍 When focusing on local structure preservation.
- 👥 For cluster analysis and pattern recognition.
- 🧠 In exploratory data analysis of complex datasets.
- 📊 When dealing with small-to-moderate datasets (roughly up to tens of thousands of points).
When to Use UMAP
- 📈 For large-scale data visualization.
- 🌐 When preserving global structure is important.
- ⚡ For faster dimensionality reduction on big data.
- 🔄 In supervised learning scenarios.
- 🔗 When needing to preserve relationships across multiple scales.