Omar Hosney
CatBoost Python Library Cheat Sheet 🐱🚀
1. Getting Started
- Installation:
pip install catboost
- Importing:
import catboost as cb
- Basic usage:
- Classifier:
model = cb.CatBoostClassifier()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
- Regressor:
model = cb.CatBoostRegressor()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
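For probability estimates instead of hard class labels, the classifier also exposes predict_proba:
probas = model.predict_proba(X_test)  # one column of probabilities per class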
2. Basic Concepts
- What is CatBoost? A gradient boosting library for machine learning, developed by Yandex.
- Differences: Handles categorical features natively, with no manual encoding (see the sketch after this list), and uses ordered boosting to reduce target leakage.
- Key features: Fast training, GPU support, built-in feature importance.
- Pros:
- Excellent performance on categorical data
- Less prone to overfitting
- Handles missing values automatically
- Fast prediction time
- Built-in GPU acceleration
- Cons:
- Can be slower to train than other boosting algorithms
- May require more memory for large datasets
- Less community support compared to some alternatives
- Fewer customization options for advanced users
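A minimal sketch of the categorical-handling pro above, with made-up column names: CatBoost consumes raw string categories directly, with no one-hot or label encoding required.
import pandas as pd
import catboost as cb
df = pd.DataFrame({
    'city': ['london', 'paris', 'paris', 'berlin'],  # raw strings, left unencoded
    'income': [30, 45, 50, 38],
    'target': [0, 1, 1, 0],
})
model = cb.CatBoostClassifier(iterations=50, verbose=0)  # task_type='GPU' would enable GPU training
model.fit(df[['city', 'income']], df['target'], cat_features=['city'])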
3. Model Training
- Training:
model = cb.CatBoostClassifier(iterations=1000, learning_rate=0.1)
model.fit(X_train, y_train, cat_features=['category1', 'category2'])
- Parameters: Set learning_rate, iterations, and depth.
- Categorical features: Specify with the cat_features parameter.
- Missing values: Handled automatically for numerical features (see the sketch below).
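A sketch tying the bullets above together (X_val/y_val are an assumed held-out split): numeric NaNs flow through untouched, while an eval_set plus early stopping reins in overfitting.
model = cb.CatBoostClassifier(iterations=1000, learning_rate=0.1)
model.fit(
    X_train, y_train,                        # numeric columns may contain np.nan
    cat_features=['category1', 'category2'],
    eval_set=(X_val, y_val),                 # validation data for the overfitting detector
    early_stopping_rounds=50,                # stop after 50 rounds without improvement
)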
4. Model Evaluation
- Metrics (see also the eval_metrics sketch after this section):
from sklearn.metrics import accuracy_score, f1_score
accuracy = accuracy_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred, average='weighted')
- Cross-validation: CatBoost models have no cross_validate method; use scikit-learn's helper (or catboost.cv):
from sklearn.model_selection import cross_val_score
scores = cross_val_score(model, X, y, cv=5)
- Hyperparameter tuning:
from sklearn.model_selection import GridSearchCV
params = {'depth': [4, 6, 8], 'learning_rate': [0.01, 0.1]}
grid_search = GridSearchCV(model, params, cv=3)
grid_search.fit(X, y)
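As referenced under Metrics: a fitted model can also score itself with CatBoost's own metric implementations via eval_metrics; the Pool below reuses the hypothetical categorical columns from section 3.
from catboost import Pool
test_pool = Pool(X_test, y_test, cat_features=['category1', 'category2'])
scores = model.eval_metrics(test_pool, metrics=['Accuracy', 'AUC'])  # per-iteration values
print(scores['Accuracy'][-1])  # metric at the final iteration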
5. Model Interpretation
- Feature importance:
importances = model.get_feature_importance()
- Partial dependence: calc_partial_dependence is not a CatBoost method; newer releases provide plot_partial_dependence for one or two features:
from catboost import Pool
model.plot_partial_dependence(Pool(X, y, cat_features=['category1', 'category2']), features=['feature1'])
- SHAP values:
import shap
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
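Two quick follow-ups for this section: shap.summary_plot gives a global view of the SHAP values, and CatBoost itself can produce SHAP values and name-labelled importances.
shap.summary_plot(shap_values, X)  # global per-feature impact plot
from catboost import Pool
shap_builtin = model.get_feature_importance(Pool(X, y), type='ShapValues')  # last column = expected value
importance_df = model.get_feature_importance(prettified=True)  # DataFrame with feature names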
6. Hyperparameter Tuning
- Grid search: See Model Evaluation section.
- Random search:
from sklearn.model_selection import RandomizedSearchCV
param_distributions = {'depth': [4, 6, 8, 10], 'learning_rate': [0.01, 0.05, 0.1]}
random_search = RandomizedSearchCV(model, param_distributions, n_iter=10, cv=3)
random_search.fit(X, y)
- Bayesian optimization:
import optuna
from sklearn.model_selection import cross_val_score

def objective(trial):
    params = {
        'depth': trial.suggest_int('depth', 4, 10),
        # suggest_loguniform is deprecated; suggest_float(..., log=True) replaces it
        'learning_rate': trial.suggest_float('learning_rate', 1e-3, 1.0, log=True),
        'verbose': 0,  # silence per-iteration logging inside the search
    }
    model = cb.CatBoostClassifier(**params)
    # CatBoost models have no cross_validate method; score with scikit-learn instead
    return cross_val_score(model, X, y, cv=3, scoring='accuracy').mean()

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100)
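Once the study finishes, study.best_params holds the winning configuration and can be passed straight back into the constructor:
best_model = cb.CatBoostClassifier(**study.best_params)
best_model.fit(X_train, y_train)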