XGBoost Parameter Tuning Tutorial

XGBoost has many parameters that can be adjusted to achieve greater accuracy or better generalisation for our models. Here we'll look at a few of the most common and influential parameters that need the most attention. We'll build an intuition for these parameters by discussing how different values can impact the performance of our models, before demonstrating how to use grid search to find the best values in a given range for the model we're working on.

Before we discuss the parameters, let's quickly review how the XGBoost algorithm works so we can understand how changes in parameter values will impact the way our models are trained.

XGBoost Overview

XGBoost behaves similarly to a decision tree in that each tree is split based on value ranges in different columns, but unlike a plain decision tree, each node is given a weight. On each iteration a new tree is created and new node weights are assigned. For each tree, the training examples with the biggest error from the previous trees are given extra attention, so the next tree optimises more for these examples; this is the boosting part of the algorithm. Finally, the outputs of the trees are ensembled, by combining the weights each tree assigns to an instance, to derive predictions.
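To make that last ensembling step concrete, here is a minimal sketch (a simplification for intuition, not XGBoost's actual implementation) of how a binary classifier's prediction can be formed by summing the leaf weights from each tree and passing the total through a sigmoid:

import numpy as np

#Minimal sketch: each "tree" is assumed to be a callable that returns
#the leaf weight it assigns to example x
def ensemble_predict(trees, x, base_score=0.5):
    raw = np.log(base_score / (1 - base_score))  #start from the base score in log-odds
    raw += sum(tree(x) for tree in trees)        #add each tree's leaf weight
    return 1 / (1 + np.exp(-raw))                #sigmoid gives the predicted probability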

XGBoost Parameters

Now let’s look at some of the parameters we can adjust when training our model.

Subsample

Value Range: 0 - 1

Decrease to reduce overfitting

Each tree is trained on only a percentage of the training examples, controlled by a value between 0 and 1. Lowering this value stops particular subsets of training examples from dominating the model and allows greater generalisation.

Colsample_bytree

Value Range: 0 - 1

Decrease to reduce overfitting

Similar to subsample but for columns rather than rows. Again you can set values between 0 and 1, where lower values can make the model generalise better by stopping any one field from having too much prominence, a prominence that might not exist in the test data.
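As a quick illustration of how these two sampling parameters are set (the 0.75 values below are just an example, not a recommendation):

import xgboost as xgb
#Each tree sees 75% of the rows and 75% of the columns
model = xgb.XGBClassifier(subsample=0.75, colsample_bytree=0.75)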

Max_Depth

Value Range: 0 - infinity

Decrease to reduce overfitting

This limits the maximum depth of each tree, i.e. how many levels of splits a branch can have. Keeping this low stops the model from becoming too complex and creating splits that might only be relevant to the training data. However, if it is too low, the model might not be able to make use of all the information in your data.

Good values here depend largely on the complexity of the problem you are trying to predict and the richness of your data. The default is 6, which is generally a good place to start and work up from; however, for simple problems or small datasets the optimum value can be lower.

Min_Child_weight

Value Range: 0 - infinity

Increase to reduce overfitting

This means that the sum of the instance weights in a child node needs to be equal to or above the threshold set by this parameter for a split to be made. Good values to try are 1, 5, 15 and 200, but this often depends on the amount of data in your training set, as fewer examples will likely result in lower child weights.
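For instance, a deliberately conservative configuration that combines a shallow max_depth with a higher min_child_weight (values chosen purely for illustration) might look like this:

#Shallow trees and a higher minimum child weight both push towards simpler, more general splits
model = xgb.XGBClassifier(max_depth=3, min_child_weight=5)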

Learning_Rate

Learning rate, or eta, is similar to the learning rate you may have come across for things like gradient descent. In layman's terms, it is how much the weights are adjusted each time a tree is built. Set the learning rate too high and the algorithm might overshoot the optimum weights; set it too low and it may need far more trees to converge, potentially stopping at suboptimal values within the allowed number of iterations.

N_estimators

N_estimators is the number of boosting iterations the model will perform, or in other words, the number of trees that will be created. Often we set this to a large value and use early stopping to roll the model back to the iteration with the best performance.
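A common pattern, sketched below with placeholder training and validation sets (X_train, y_train, X_val and y_val are assumed to already exist), is to pair a lower learning rate with a generous n_estimators and let early stopping decide when to stop:

model = xgb.XGBClassifier(learning_rate=0.05, n_estimators=1000)
model.fit(X_train, y_train,
          eval_set=[(X_val, y_val)],
          eval_metric="error",
          early_stopping_rounds=10,  #stop if the validation error hasn't improved in 10 rounds
          verbose=0)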

It is worth noting that these parameters interact with each other, so adjusting one will often affect what happens when we adjust another. For example, increasing min_child_weight will reduce the impact of increasing max_depth, as the first parameter will limit how many splits can occur anyway.

XGBoost & GridSearchCV in Python

Now that we have got an intuition about what’s going on, let’s look at how we can tune our parameters using Grid Search CV with Python.

For our example we're going to use the Titanic dataset, so let's start by importing the dataset, creating dummy variables, selecting features, and then splitting our data into features and target for training and validation, as we would when approaching any machine learning problem.

import pandas as pd
from sklearn.model_selection import train_test_split
#Load the Titanic training data
df = pd.read_csv('data/titanic/train.csv')
#One-hot encode the categorical columns, dropping the first level of each
df = pd.get_dummies(df, columns=['Pclass','Sex','Embarked'], drop_first=True)
#Select the features and the target
X = df[['Age', 'SibSp', 'Parch', 'Fare', 'Pclass_2', 'Pclass_3',
        'Sex_male', 'Embarked_Q', 'Embarked_S']]
y = df['Survived']
#Hold out 30% of the data for validation
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=42)

Now let's train and evaluate a baseline model using the standard parameter settings as a comparison for the tuned model that we will create later. The model will be set to train for up to 100 iterations but will stop early if there has been no improvement after 10 rounds.

import xgboost as xgb
from sklearn.metrics import accuracy_score
#Declare the evaluation data set
eval_set = [(X_train, y_train), (X_val, y_val)]
#Initialise the model with the standard (default) parameter values
model = xgb.XGBClassifier(subsample=1,
                          colsample_bytree=1,
                          min_child_weight=1,
                          max_depth=6,
                          learning_rate=0.3,
                          n_estimators=100)
#Fit the model, stopping early if there has been no reduction in error after 10 rounds
model.fit(X_train, y_train, early_stopping_rounds=10,
          eval_metric="error", eval_set=eval_set, verbose=0)
#Make predictions for the validation set and evaluate
predictions = model.predict(X_val)
print('Accuracy:', accuracy_score(y_val, predictions))
> Accuracy: 0.80223

As you can see, we get an accuracy score of 80.2% against the validation set so now let’s use grid search to tune the parameters we discussed above to see if we can improve that score.

First we’ll import the GridSearchCV library and then define what values we’ll ask grid search to try. Grid search will train the model using every combination of these values to determine which combination gives us the most accurate model.

from sklearn.model_selection import GridSearchCV
PARAMETERS = {"subsample": [0.5, 0.75, 1],
              "colsample_bytree": [0.5, 0.75, 1],
              "max_depth": [2, 6, 12],
              "min_child_weight": [1, 5, 15],
              "learning_rate": [0.3, 0.1, 0.03],
              "n_estimators": [100]}

Now let’s fit the grid search model and print out what grid search determines are the best parameters for our model.

#Initialise the XGBoost model
model = xgb.XGBClassifier(n_estimators=100, n_jobs=-1)
#Wrap the XGBoost model in grid search, using 3 cross-validation
#folds per parameter combination and accuracy to score the models
model_gs = GridSearchCV(model, param_grid=PARAMETERS, cv=3, scoring="accuracy")
#Fit the model as we did previously
model_gs.fit(X_train, y_train, early_stopping_rounds=10,
             eval_metric="error", eval_set=eval_set, verbose=0)
print(model_gs.best_params_)
> {'colsample_bytree': 0.5, 'learning_rate': 0.3, 'max_depth': 6, 'min_child_weight': 1, 'n_estimators': 100, 'subsample': 0.5}
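Beyond the best parameters themselves, GridSearchCV also exposes the mean cross-validated score of the winning combination and a full results table, which can be useful for seeing how close the other combinations came:

#Mean cross-validated accuracy of the best parameter combination
print(model_gs.best_score_)
#Full results for every combination, best ranked first
results = pd.DataFrame(model_gs.cv_results_).sort_values("rank_test_score")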

Finally let’s get predictions for our validation data and evaluate the accuracy of our results.

predictions = model_gs.predict(X_val)
print('Accuracy:',accuracy_score(y_val, predictions))
> Accuracy: 0.82835

As we can see, we ended up with an accuracy of 82.8%, a 2.6 percentage point improvement over the baseline model, achieved by using grid search to tune our model parameters.

You have seen here that tuning parameters can give us better model performance. While the parameters we've tuned here are some of the most commonly tuned when training an XGBoost model, this list is not exhaustive, and tuning other parameters may also give good results depending on the use case. In addition, the values we chose here were ones we suspected from experience and knowledge of the dataset would give us good results, but again, good choices for these values will often depend on the nature of the data you are working with.