Catboost with Python: A Simple Tutorial

In this tutorial we will see how to implement the Catboost machine learning algorithm in Python. We will give a brief overview of what Catboost is and what it can be used for, before walking step by step through training a simple model, including how to tune parameters and analyse the model.

What is Catboost?

Catboost is a gradient boosted decision tree algorithm developed by Yandex. It works in the same way as other gradient boosting algorithms such as XGBoost, but provides out-of-the-box support for categorical variables, typically achieves strong accuracy without much parameter tuning, and also offers GPU support to speed up training.
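As a quick illustration of those last two points, here is a minimal, hypothetical sketch (the column names are placeholders and the GPU option assumes a CUDA-capable machine); the full worked example follows below.

import catboost as cb

sketch_model = cb.CatBoostClassifier(
    iterations=100,
    cat_features=['colour', 'country'],  # hypothetical categorical columns, passed directly with no encoding
    # task_type='GPU',                   # uncomment to train on a GPU if one is available
    verbose=False
)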

Use Cases

Catboost is used for a range of regression and classification tasks and has been shown to be a top performer on various Kaggle competitions that involve tabular data. Below are a couple of examples of where Catboost has been successfully implemented:

  • Cloudflare use Catboost to identify bots trying to target its users' websites. Full details here.
  • Ride hailing service Careem, based in Dubai, use Catboost to predict where its customers will travel to next. Full details here.

Implementation

For this short tutorial we are going to use the classic Titanic dataset to predict whether a passenger on the ship survived or not. The intention is to keep this tutorial simple by using a small dataset, but the principles apply to more complex datasets and problems you might be trying to solve.

Before we start, let's import the libraries we will need and load the Titanic dataset.

import pandas as pd
import numpy as np
from pandas.api.types import is_numeric_dtype
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
import catboost as cb
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv('titanic.csv')

Data Preparation

Initially we're simply going to drop any rows that contain NaN in the "survived" column, which is our target, as unlabelled rows can't be used to train or evaluate our model.

df.dropna(subset=['survived'],inplace=True)

Now, for this tutorial we are only going to make use of four features: pclass, sex, age and fare. Let's split our data into X and y to get our feature and target dataframes.

X = df[['pclass', 'sex', 'age', 'fare']].copy()  # copy so the modifications below don't trigger pandas warnings
y = df['survived']

Now we still need to treat some of the features. We need to convert the “pclass” column to a string data type as although it appears numeric, the values are discrete so it’s actually a categorical variable in this context. In addition, the “fare” and “age” columns contain some NaNs so we’ll replace these with zeros.

X['pclass'] = X['pclass'].astype('str')
X['fare'] = X['fare'].fillna(0)
X['age'] = X['age'].fillna(0)

Preparing Categorical Features

As mentioned above, Catboost supports categorical features natively, with no need to one-hot encode or create dummy columns. To enable this we need to do two things.

First we need to generate a list of column indices that contain the categorical data. This list will be passed to the model during training. Now, we only have two categorical variables so it’s easy for us to identify the column indices manually (in our case it’s the first two columns; “pclass” and “sex”) but in another project you could be working with any number of categorical columns. Given this, let’s create a function that takes a dataframe and returns indices of all non-numeric columns as a list.

def get_categorical_indicies(X):
    # Collect the names of all non-numeric columns
    cats = []
    for col in X.columns:
        if not is_numeric_dtype(X[col]):
            cats.append(col)
    # Convert the column names into positional indices
    cat_indicies = []
    for col in cats:
        cat_indicies.append(X.columns.get_loc(col))
    return cat_indicies

categorical_indicies = get_categorical_indicies(X)

Now we can reuse this function to get the indices of non-numerical columns for any dataframe when we use Catboost.
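As an aside, the same result can be achieved a little more compactly with pandas' select_dtypes; a rough equivalent (not used in the rest of this tutorial) would be:

def get_categorical_indicies_alt(X):
    # Find all non-numeric columns and return their positional indices
    cat_cols = X.select_dtypes(exclude='number').columns
    return [X.columns.get_loc(col) for col in cat_cols]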

The second thing we need to do is convert all categorical columns to the category data type which is required by Catboost. To do this we are going to use another function and similar logic to the previous step to identify non-numerical columns and convert them to the category data type.

def convert_cats(X):
    # Collect the names of all non-numeric columns
    cats = []
    for col in X.columns:
        if not is_numeric_dtype(X[col]):
            cats.append(col)
    # Convert each of those columns to the category data type in place
    for col in cats:
        X[col] = X[col].astype('category')

convert_cats(X)
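It's worth quickly checking the conversion has worked before moving on; the two categorical columns should now show the category data type:

print(X.dtypes)  # pclass and sex should be category, age and fare numeric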

Finally, before we begin training our model we need to split our data into two datasets for training and testing.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=101, stratify=y)

Now there is an additional complication: if we print out the survival rate of our test set we can see that the classes are imbalanced.

print('Test Survival Rate:',y_test.sum()/y_test.count())
Test Survival Rate: 0.3816793893129771
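If you print the same figure for the training split you should see a similarly low survival rate (the exact number will depend on the random split), which is what we want to address before training:

print('Train Survival Rate:', y_train.sum()/y_train.count())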

There are a few ways to handle this, but in our example we are simply going to undersample the majority class in the training data.

train_df = pd.concat([X_train, y_train], axis=1)
survived = train_df[train_df['survived']==1]
deceased = train_df[train_df['survived']==0]
deceased = deceased.sample(n=len(survived), random_state=101)
train_df = pd.concat([survived, deceased], axis=0)
X_train = train_df.drop('survived', axis=1)
y_train = train_df['survived']
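A quick sanity check confirms the undersampled training set is now balanced; by construction the rate should come out at exactly 0.5:

print('Balanced Train Survival Rate:', y_train.sum()/y_train.count())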

Training

To train our model we are going to wrap our train and test datasets in Catboost's Pool constructor. We can define our features, target and list of categorical features inside the Pool constructor and then pass each pool as a single object when training and evaluating our model.

train_dataset = cb.Pool(X_train, y_train, cat_features=categorical_indicies)
test_dataset = cb.Pool(X_test, y_test, cat_features=categorical_indicies)

Now let's instantiate the Catboost classifier.

model = cb.CatBoostClassifier(loss_function='Logloss', eval_metric='Accuracy')

As this is a binary classification problem we’ll use log loss as the loss function and evaluate based on accuracy.

Note: If you are looking for an intuitive explanation of log loss then check out this article from Daniel Godoy.
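For a rough feel of how log loss behaves, here is a tiny sketch with made-up labels and probabilities; note how the single confident wrong prediction dominates the score:

from sklearn.metrics import log_loss

y_true_example = [1, 0, 1, 0]          # made-up ground truth labels
y_prob_example = [0.9, 0.1, 0.6, 0.8]  # predicted probability of class 1; the last one is confidently wrong
print(log_loss(y_true_example, y_prob_example))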

To train the model we are going to use Catboost's inbuilt grid search method. If you have used scikit-learn's GridSearchCV then this works in much the same way. First we declare a dictionary of the hyperparameters we want to tune and lists of values to test. We have decided to tune just a few of the most influential parameters: learning rate, tree depth, L2 leaf regularisation and the number of iterations we will train the model for.

grid = {'learning_rate': [0.03, 0.1],
        'depth': [4, 6, 10],
        'l2_leaf_reg': [1, 3, 5],
        'iterations': [50, 100, 150]}

Now we can fit the model using the grid search method, passing the grid dictionary we declared above along with the training data pool. By default, grid search evaluates each parameter combination on an 80/20 split of the training data and uses three-fold cross-validation.

model.grid_search(grid,train_dataset)
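As far as I'm aware, grid_search also returns its results as a dictionary (with the best parameters under 'params' and the cross-validation metrics under 'cv_results'), so you can equally capture them directly:

grid_search_result = model.grid_search(grid, train_dataset)
print(grid_search_result['params'])  # best parameter combination found by the search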

The model has now been trained, and if you're interested you can print out the optimal parameters found by the grid search.

model.get_params()
{'loss_function': 'Logloss',
 'eval_metric': 'Accuracy',
 'depth': 10,
 'l2_leaf_reg': 1,
 'iterations': 100,
 'learning_rate': 0.1}

Evaluation

Now that we have trained our model we can evaluate how it performs on our test data and then briefly see what features are most influential.

To start with, we'll use our model to make predictions for our test set and then print out a classification report.

pred = model.predict(X_test)
print(classification_report(y_test, pred))
[Image: classification report for the test set]

As we can see, we got an accuracy of 79% on our test set, which isn't bad considering we are only using four features.
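If you want another view of the same predictions, a confusion matrix is a quick complement to the classification report (a small sketch using scikit-learn):

from sklearn.metrics import confusion_matrix

# Rows are the true classes (0 = deceased, 1 = survived), columns are the predicted classes
print(confusion_matrix(y_test, pred))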

To delve further under the hood of our model we can analyse what impact our features have had by plotting the feature importance.

def plot_feature_importance(importance, names, model_type):

    # Create arrays from feature importance and feature names
    feature_importance = np.array(importance)
    feature_names = np.array(names)

    # Create a DataFrame using a dictionary
    data = {'feature_names': feature_names, 'feature_importance': feature_importance}
    fi_df = pd.DataFrame(data)

    # Sort the DataFrame in order of decreasing feature importance
    fi_df.sort_values(by=['feature_importance'], ascending=False, inplace=True)

    # Define size of bar plot
    plt.figure(figsize=(10, 8))
    # Plot Seaborn bar chart
    sns.barplot(x=fi_df['feature_importance'], y=fi_df['feature_names'])
    # Add chart labels
    plt.title(model_type + ' FEATURE IMPORTANCE')
    plt.xlabel('FEATURE IMPORTANCE')
    plt.ylabel('FEATURE NAMES')

plot_feature_importance(model.get_feature_importance(), X_train.columns, 'CATBOOST')
[Image: feature importance bar chart]

As you can see, the "sex" feature was by far the most influential.

If you want a more detailed breakdown of the feature importance function then you can read about it here.

Finally, we can analyse how our model performed on our test data by breaking the performance down by feature values. Catboost comes with a function called calc_feature_statistics, which plots the average true target value and the average predicted value for each value of a given feature. Let's generate a feature statistics plot for the "sex" feature in our test data.

model.calc_feature_statistics(test_dataset, feature='sex', plot=True, prediction_type='Class')
[Image: feature statistics plot for the "sex" feature]

This tells us that there were far more male passengers in our test dataset, but that female passengers were much more likely to survive. The plot also indicates that our model predicted survival for female passengers at a higher rate than was actually the case in our test data.

So there we have it: a quick walk-through of how to implement Catboost using Python. Catboost contains many additional options for customising training and evaluating your model. For further detail, check out the Catboost documentation here.