Omar Hosney
XGBoost Cheatsheet
Introduction
- 🚀 XGBoost is an optimized gradient boosting library.
- 📈 Used for supervised learning problems.
- 🔍 Efficiently handles large datasets.
Core Concepts
- 🌳 Boosting: Ensemble technique combining weak learners.
- 🔥 Gradient Boosting: Minimizes loss by adding models sequentially.
- ⚙️ Regularization: Prevents overfitting with L1/L2 penalties (see the sketch below).
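- 🧪 Example (a minimal sketch; parameter values are illustrative, not recommendations):
import xgboost as xgb
# reg_alpha = L1 penalty, reg_lambda = L2 penalty on leaf weights
model = xgb.XGBRegressor(
    n_estimators=200,
    learning_rate=0.1,
    max_depth=4,
    reg_alpha=0.1,
    reg_lambda=1.0,
)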
Installation
- 💻 Install via pip:
pip install xgboost
- 📦 Install via conda:
conda install -c conda-forge xgboost
Xgboost Regression
- 📉 Used for predicting continuous values.
- 🔧 Create model:
model = xgb.XGBRegressor()
- ⚙️ Train:
model.fit(X_train, y_train)
- 🔍 Predict:
y_pred = model.predict(X_test)
Xgboost Classifier
- 🔢 Used for predicting categorical values.
- 🔧 Create model:
model = xgb.XGBClassifier()
- ⚙️ Train:
model.fit(X_train, y_train)
- 🔍 Predict:
y_pred = model.predict(X_test)
Multilabel Classifier
- 🏷️ Used for predicting multiple labels for each instance.
- 🔧 Import wrapper:
from sklearn.multioutput import MultiOutputClassifier
- ⚙️ Wrap the classifier:
model = MultiOutputClassifier(xgb.XGBClassifier())
- 🛠️ Train (y_train must be a 2D label matrix, one column per label):
model.fit(X_train, y_train)
- 🔍 Predict:
y_pred = model.predict(X_test)
Basic Usage
- 📝 Import library:
import xgboost as xgb
- 🔢 Create DMatrix:
dtrain = xgb.DMatrix(X_train, label=y_train)
- ⚡ Train model:
model = xgb.train(params, dtrain, num_boost_round=num_rounds)
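- 🔍 Predict (a minimal sketch; the native Booster API expects a DMatrix):
dtest = xgb.DMatrix(X_test)
y_pred = model.predict(dtest)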
Parameters
- 🔧 eta: Learning rate, default 0.3
- 🌲 max_depth: Max depth of tree, default 6
- 💡 objective: Defines the loss function to be minimized.
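- 🧪 Example params dict for the native API (a minimal sketch; values are illustrative):
params = {
    'eta': 0.1,                        # learning rate
    'max_depth': 6,                    # maximum tree depth
    'objective': 'reg:squarederror',   # squared-error regression loss
}
model = xgb.train(params, dtrain, num_boost_round=100)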
Evaluation
- 📊 Use cross-validation with xgb.cv() (see the example below).
- 🔍 Evaluate with metrics such as rmse and logloss.
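- 🧪 Example (a minimal sketch, reusing params and dtrain from the sections above):
cv_results = xgb.cv(
    params,
    dtrain,
    num_boost_round=100,
    nfold=5,
    metrics='rmse',
    early_stopping_rounds=10,
)
print(cv_results.tail())  # per-round train/test RMSE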
Hyperparameter Tuning
- 🔄 Use GridSearchCV for hyperparameter optimization.
from sklearn.model_selection import GridSearchCV
param_grid = {
    'max_depth': [3, 4, 5],
    'learning_rate': [0.01, 0.1, 0.2],  # 'eta' is spelled learning_rate in the sklearn wrapper
    'n_estimators': [100, 200, 300]
}
grid_search = GridSearchCV(estimator=xgb.XGBClassifier(),
                           param_grid=param_grid,
                           scoring='accuracy',
                           cv=3,
                           verbose=1)
grid_search.fit(X_train, y_train)
print(grid_search.best_params_)
- 🚀 Perform RandomizedSearchCV for faster tuning.
from sklearn.model_selection import RandomizedSearchCV
random_search = RandomizedSearchCV(estimator=xgb.XGBClassifier(),
                                   param_distributions=param_grid,
                                   n_iter=10,
                                   scoring='accuracy',
                                   cv=3,
                                   verbose=1)
random_search.fit(X_train, y_train)
print(random_search.best_params_)
Advanced Features
- 🚀 GPU acceleration: use tree_method='gpu_hist' (or device='cuda' with tree_method='hist' in XGBoost 2.0+).
- ⚙️ Custom objective: define your own loss function (a sketch follows this list).
- 🌐 Distributed training: scale XGBoost with Dask or Spark.
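- 🧪 Custom objective sketch (a minimal example that reimplements squared error; the callable returns the per-sample gradient and hessian of the loss):
import numpy as np
def squared_error_obj(predt, dtrain):
    # first and second derivatives of 0.5 * (predt - y)^2 w.r.t. predt
    y = dtrain.get_label()
    grad = predt - y
    hess = np.ones_like(predt)
    return grad, hess
model = xgb.train({'max_depth': 4}, dtrain, num_boost_round=100, obj=squared_error_obj)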
Tuning
- 🎯 Use GridSearchCV or RandomizedSearchCV (see Hyperparameter Tuning above).
- 🔄 Use early stopping to prevent overfitting (see the sketch below).
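- 🧪 Early stopping sketch (assumes a recent XGBoost version where early_stopping_rounds is a constructor argument; X_val/y_val are a held-out validation split):
model = xgb.XGBRegressor(n_estimators=1000, eval_metric='rmse', early_stopping_rounds=10)
model.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)
print(model.best_iteration)  # best boosting round found on the validation set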
Model Saving and Loading
- 💾 Save model:
model.save_model('model.json')
- 📂 Load model:
model = xgb.Booster()
model.load_model('model.json')
Feature Importance
- 🔍 Plot importance:
xgb.plot_importance(model)
- 📊 Get importance:
model.get_score()
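- 🧪 Importance by gain (a minimal sketch; get_score is a Booster method, so call get_booster() on a scikit-learn wrapper):
booster = model.get_booster() if hasattr(model, 'get_booster') else model
print(booster.get_score(importance_type='gain'))
xgb.plot_importance(booster, importance_type='gain')  # requires matplotlib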
Export Model as ONNX
- 📦 Install ONNX:
pip install onnxruntime onnxmltools skl2onnx
- 🔧 Convert model (a sketch using onnxmltools, which ships the XGBoost converter; skl2onnx's convert_sklearn does not handle XGBoost models out of the box):
from onnxmltools import convert_xgboost
from onnxmltools.convert.common.data_types import FloatTensorType
initial_type = [('float_input', FloatTensorType([None, X_train.shape[1]]))]
onnx_model = convert_xgboost(model, initial_types=initial_type)
with open("model.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())
- ⚙️ Load and run with ONNX Runtime:
import numpy as np
import onnxruntime as rt
sess = rt.InferenceSession("model.onnx")
input_name = sess.get_inputs()[0].name
label_name = sess.get_outputs()[0].name
pred_onx = sess.run([label_name], {input_name: X_test.astype(np.float32)})[0]