Random Forest Feature Importance Plot

A big part of analysing our models post training is whether the features we used for training actually helped in predicting the target and by how much. Tree based machine learning algorithms such as Random Forest and XGBoost come with a feature importance attribute that outputs an array containing a value between 0 and 100 for each feature representing how useful the model found each feature in trying to predict the target. This gives us the opportunity to analyse what contributed to the accuracy of the model and what features were just noise. With this information we can check that the model is working as we would expect, discard features if we feel they are not adding any value and use it to hypothesis about new features that we could engineer for another iteration of the model.

View Our Profile on Datasnips.com to See Our Data Science Code Snippets

However, the problem with the feature importance attribute is that the output is an unlabelled, unordered array of values so looking at it in isolation won’t tell us much about our model. So we are going to take this array and create a function that plots the feature importance data on a labelled and ordered Seaborn bar chart that will give us a more intuitive understanding of which features our model has deemed useful.

Import Libraries

First let’s make sure we have imported all the required libraries. We are going to need Pandas, Numpy, Matplotlib and Seaborn. In this example we have already trained a Random Forest model using a data frame named “train_X” and named it “rf_model”.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

Declare Plot Feature Importance Function

Now to start with, we are going to declare the function “plot_feature_importance” and tell it what parameters we’re going to pass when calling. In this case we are going to pass in the feature importance values (importance), the feature names from training data (names) and also a string identifying the model type that we’ll use to title the bar chart.

def plot_feature_importance(importance,names,model_type):

Cast Numpy Arrays

Next we are going to cast the feature importance and feature names as Numpy arrays. This allows us to construct a two column data frame from the two arrays.

def plot_feature_importance(importance,names,model_type):

#Create arrays from feature importance and feature names

feature_importance = np.array(importance)

feature_names = np.array(names)

Construct Data Frame

To construct the data frame we will use a Dictionary containing the feature importance values and the feature names where the Dictionary key will be the column names. Once this has been created we can then sort the data frame by feature importance value giving us a labelled and ordered feature importance data frame.

def plot_feature_importance(importance,names,model_type):

#Create arrays from feature importance and feature names

feature_importance = np.array(importance)

feature_names = np.array(names)

#Create a DataFrame using a Dictionary

data={'feature_names':feature_names,'feature_importance':feature_importance}

fi_df = pd.DataFrame(data)

#Sort the DataFrame in order decreasing feature importance

fi_df.sort_values(by=['feature_importance'], ascending=False,inplace=True))

Plot Feature Importance Bar Chart

Finally we can use Matplotlib and Seaborn to plot the feature importance bar chart. Here we use the model_type parameter that we will pass to the function to give our plot it’s title.

def plot_feature_importance(importance,names,model_type):

#Create arrays from feature importance and feature names

feature_importance = np.array(importance)

feature_names = np.array(names)

#Create a DataFrame using a Dictionary

data={'feature_names':feature_names,'feature_importance':feature_importance}

fi_df = pd.DataFrame(data)

#Sort the DataFrame in order decreasing feature importance

fi_df.sort_values(by=['feature_importance'], ascending=False,inplace=True))

#Define size of bar plot

plt.figure(figsize=(10,8))

#Plot Searborn bar chart

sns.barplot(x=fi_df['feature_importance'], y=fi_df['feature_names'])

#Add chart labels

plt.title(model_type + 'FEATURE IMPORTANCE')

plt.xlabel('FEATURE IMPORTANCE')

plt.ylabel('FEATURE NAMES')