Introduction to PCA
- Linear dimensionality reduction technique.
- Identifies the directions (principal components) of greatest variance in the data.
- Orthogonal transformation of the data into uncorrelated variables.
- Reduces dimensionality while preserving as much variance as possible.
- Based on eigenvalue decomposition of the covariance matrix.
PCA Algorithm Steps
- 1. Standardize the dataset (mean = 0, variance = 1).
- 2. Compute the covariance matrix.
- 3. Calculate its eigenvectors and eigenvalues.
- 4. Sort the eigenvectors by decreasing eigenvalue.
- 5. Choose the top k eigenvectors as principal components.
- 6. Project the data onto the new k-dimensional space (see the NumPy sketch below).
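A minimal NumPy sketch of these steps (illustrative only; X and k are placeholders for your data and the number of components to keep):
import numpy as np
X = np.random.rand(100, 5)   # placeholder data: 100 samples, 5 features
k = 2                        # number of components to keep
# 1. Standardize (mean 0, variance 1 per feature)
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
# 2. Covariance matrix of the standardized data
cov = np.cov(X_std, rowvar=False)
# 3. Eigenvectors and eigenvalues (eigh works on the symmetric covariance matrix)
eigvals, eigvecs = np.linalg.eigh(cov)
# 4. Sort by decreasing eigenvalue
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
# 5. Keep the top k eigenvectors as principal components
components = eigvecs[:, :k]
# 6. Project the data onto the new k-dimensional space
X_projected = X_std @ components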
PCA Applications
- Dimensionality reduction for high-dimensional data.
- Data visualization in 2D or 3D space.
- Feature extraction and selection.
- Noise reduction in datasets.
- Compression of large datasets.
Advantages of PCA
- Efficient computation, even for large datasets.
- Produces uncorrelated components, removing redundancy between correlated features.
- Can reduce overfitting by shrinking the number of input features (see the pipeline sketch below).
- Components are ranked by explained variance, making their relative importance easy to read.
- Unsupervised method (no target variable needed).
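A hedged sketch of how PCA is commonly slotted in before a model; the digits dataset and logistic regression are illustrative choices, not part of the notes above:
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
X, y = load_digits(return_X_y=True)   # 64 pixel features per sample
# Scale, reduce 64 features to 20 uncorrelated components, then classify
pipe = make_pipeline(StandardScaler(), PCA(n_components=20), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipe, X, y, cv=5)
print(f"Mean CV accuracy with 20 components: {scores.mean():.3f}")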
Limitations of PCA
- Assumes linear relationships in the data.
- Sensitive to feature scaling (standardize first; see the sketch below).
- May lose the interpretability of the original features.
- Can be strongly affected by outliers.
- May not capture complex, non-linear patterns in the data.
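A small sketch of the scaling sensitivity (the feature scales are made up for illustration): a feature measured on a much larger scale dominates the first component unless the data is standardized first.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
X[:, 1] *= 1000   # second feature on a much larger scale
pca_raw = PCA(n_components=2).fit(X)
pca_scaled = PCA(n_components=2).fit(StandardScaler().fit_transform(X))
print(pca_raw.explained_variance_ratio_)     # first component close to 1.0, dominated by the large-scale feature
print(pca_scaled.explained_variance_ratio_)  # variance split roughly evenly after scaling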
PCA Parameters
- n_components: Number of principal components to keep (an int, a float in (0, 1] for a target explained-variance fraction, or None for all).
- svd_solver: Algorithm used to compute the SVD ('auto', 'full', 'arpack', 'randomized').
- whiten: Whether to scale the projected components to unit variance.
- random_state: Seed for reproducibility (used by the 'arpack' and 'randomized' solvers).
- tol: Tolerance on singular values (used by the 'arpack' solver).
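How these parameters appear when constructing a scikit-learn PCA (the values below are illustrative, not recommendations):
from sklearn.decomposition import PCA
pca = PCA(
    n_components=5,            # keep 5 components (int, float in (0, 1], or None)
    svd_solver="randomized",   # 'auto', 'full', 'arpack', or 'randomized'
    whiten=True,               # scale components to unit variance
    random_state=42,           # reproducibility for 'arpack'/'randomized' solvers
    tol=0.0,                   # singular value tolerance (used by 'arpack')
)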
PCA in Python (sklearn)
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import numpy as np
# Generate sample data
X = np.random.rand(100, 10)
# Standardize the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Apply PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
# Access explained variance ratio
print(pca.explained_variance_ratio_)
# Access principal components
print(pca.components_)
Visualizing PCA Results
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 8))
plt.scatter(X_pca[:, 0], X_pca[:, 1])
plt.xlabel('First Principal Component')
plt.ylabel('Second Principal Component')
plt.title('PCA of Random Data')
plt.show()
# Cumulative explained variance (with n_components=2 this shows only two points;
# fit PCA without n_components to see the full curve)
plt.figure(figsize=(10, 6))
plt.plot(range(1, len(pca.explained_variance_ratio_) + 1),
         np.cumsum(pca.explained_variance_ratio_), marker='o')
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance')
plt.title('Cumulative Explained Variance')
plt.show()
Choosing Number of Components
# Determine number of components for 95% variance
pca = PCA(n_components=0.95)
X_pca = pca.fit_transform(X_scaled)
print(f"Number of components: {pca.n_components_}")
# Alternatively, inspect the cumulative variance of a PCA fit with all components
pca_full = PCA().fit(X_scaled)
total_variance = 0
for i, variance in enumerate(pca_full.explained_variance_ratio_):
    total_variance += variance
    if total_variance >= 0.95:
        print(f"95% variance reached at {i+1} components")
        break
PCA for Feature Selection
# Heuristic feature importance: sum of absolute loadings across the kept components
feature_importance = np.abs(pca.components_).sum(axis=0)
feature_names = [f"Feature_{i}" for i in range(X.shape[1])]
# Rank features from most to least important
sorted_idx = np.argsort(feature_importance)[::-1]
for idx in sorted_idx:
    print(f"{feature_names[idx]}: {feature_importance[idx]:.4f}")
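A possible continuation of the ranking above (keeping the top 3 features is an arbitrary choice for illustration): subset the original feature matrix to the highest-ranked features.
top_k = 3
selected = np.argsort(feature_importance)[::-1][:top_k]
X_selected = X[:, selected]
print("Top features:", [feature_names[i] for i in selected])
print("Reduced data shape:", X_selected.shape)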
PCA for Noise Reduction
# Add noise to data
X_noisy = X + np.random.normal(0, 0.1, X.shape)
# Fit PCA on the noisy data and reconstruct it from the kept components
# (the discarded low-variance directions carry much of the noise)
pca = PCA(n_components=0.95)
X_denoised = pca.inverse_transform(pca.fit_transform(X_noisy))
# Compare original, noisy, and denoised data
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(15, 5))
ax1.imshow(X[:10].T)
ax1.set_title("Original Data")
ax2.imshow(X_noisy[:10].T)
ax2.set_title("Noisy Data")
ax3.imshow(X_denoised[:10].T)
ax3.set_title("Denoised Data")
plt.show()
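A quick numeric sanity check of the reconstruction (a sketch only; since X here is pure random noise with no low-rank structure the improvement may be small, whereas on structured data the denoised error is typically much lower):
mse_noisy = np.mean((X - X_noisy) ** 2)
mse_denoised = np.mean((X - X_denoised) ** 2)
print(f"MSE of noisy data vs. original:    {mse_noisy:.5f}")
print(f"MSE of denoised data vs. original: {mse_denoised:.5f}")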
PCA vs. Other Techniques
- t-SNE / UMAP: PCA is faster but less effective at capturing non-linear structure.
- Factor Analysis: PCA treats all variance as signal, while factor analysis models a separate noise term per feature.
- ICA: PCA finds uncorrelated components; ICA finds statistically independent ones.
- LDA: PCA is unsupervised; LDA uses class labels to find discriminative directions (see the sketch below).
- Autoencoders: PCA is linear; autoencoders can capture non-linear relationships.
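A small sketch of the unsupervised/supervised contrast between PCA and LDA (Iris is used here only as a convenient labeled dataset):
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
X, y = load_iris(return_X_y=True)
# PCA ignores the labels; it only looks at variance in X
X_pca = PCA(n_components=2).fit_transform(X)
# LDA uses the labels to find directions that separate the classes
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)
print(X_pca.shape, X_lda.shape)   # both (150, 2)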