Machine Learning Data Preparation

Split Data Into Features & Target

Create a DataFrame that contains the features (X) used to predict the target, and a separate Series that contains just the target (y):

X = data[['neighbourhood_group', 'latitude', 'longitude', 'room_type']]
y = data['price']
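To make the result of this step concrete, here is a minimal sketch. It assumes a tiny, made-up DataFrame standing in for `data` (the real dataset is loaded elsewhere and is not reproduced here); only the column names come from the snippet above.

```python
import pandas as pd

# Hypothetical stand-in for `data`; the values are invented for illustration.
data = pd.DataFrame({
    "neighbourhood_group": ["Manhattan", "Brooklyn", "Queens"],
    "latitude": [40.75, 40.68, 40.73],
    "longitude": [-73.98, -73.95, -73.80],
    "room_type": ["Private room", "Entire home/apt", "Shared room"],
    "price": [150, 120, 60],
})

# Double brackets select several columns and return a DataFrame
X = data[["neighbourhood_group", "latitude", "longitude", "room_type"]]
# Single brackets select one column and return a Series
y = data["price"]

print(type(X).__name__, X.shape)
print(type(y).__name__, y.shape)
```

X and y share the same index, so each row of features still lines up with its target value.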

Create Dummy Variables Using Pandas get_dummies

Convert the categorical features "neighbourhood_group" and "room_type" into dummy variables, with the first level of each feature dropped:

X = pd.get_dummies(X, columns=['neighbourhood_group','room_type'], drop_first=True)
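The effect of drop_first=True is easiest to see on a small example. The sketch below assumes a made-up three-row DataFrame (the values are hypothetical); it shows which columns get_dummies produces and which level of each feature is left out.

```python
import pandas as pd

# Hypothetical feature table; the values are invented for illustration.
X_small = pd.DataFrame({
    "latitude": [40.75, 40.68, 40.73],
    "neighbourhood_group": ["Manhattan", "Brooklyn", "Queens"],
    "room_type": ["Private room", "Entire home/apt", "Private room"],
})

X_dummies = pd.get_dummies(
    X_small,
    columns=["neighbourhood_group", "room_type"],
    drop_first=True,
)

# The first level in sorted order ("Brooklyn", "Entire home/apt") is dropped
print(list(X_dummies.columns))
```

A row where every dummy for a feature is zero therefore belongs to the dropped level. Dropping one level avoids perfectly collinear columns, which matters for linear models.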

Split Data Into Train & Test Sets Using Sklearn train_test_split

Split the data into train and test sets, with the train set taking 70% of the data and the test set taking 30%:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)
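As a quick check of what the split returns, the sketch below runs the same call on a hypothetical ten-row dataset (the names `X_demo` and `y_demo` and their values are made up). It confirms the 70/30 row counts and that features and target stay aligned by index.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical ten-row dataset; the values are invented for illustration.
X_demo = pd.DataFrame({"latitude": [40.0 + i / 10 for i in range(10)]})
y_demo = pd.Series(range(100, 200, 10), name="price")

X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=101
)

# 7 rows go to the train set and 3 rows to the test set
print(len(X_train), len(X_test))
# Rows are shuffled, but X and y keep matching index labels
print(X_train.index.equals(y_train.index))
```

Setting random_state makes the shuffle reproducible, so repeated runs give the same split.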

Scale Data Using Standard Scaler from Sklearn

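Fit the standard scaler to the X_train data only, then transform both X_train and X_test using the fitted scaler. Fitting on the train set alone keeps information from the test set out of the scaling parameters. Because the transform function returns a NumPy array, we convert the result back to a Pandas DataFrame.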

from sklearn.preprocessing import StandardScaler
# Initialise the standard scaler
scaler = StandardScaler()
# Fit the scaler using X_train data only
scaler.fit(X_train)
# Transform X_train and X_test using the fitted scaler and convert back to DataFrames,
# keeping the original index so the rows stay aligned with y_train and y_test
X_train = pd.DataFrame(scaler.transform(X_train), columns=X_train.columns, index=X_train.index)
X_test = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns, index=X_test.index)
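The sketch below illustrates what the scaler does, using small made-up train and test frames (the values and the `_demo` and `_scaled` names are hypothetical). It passes the original index when rebuilding the DataFrames, which is an addition made here so that rows remain aligned with the target.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical train/test frames with shuffled index labels, as after a split
X_train_demo = pd.DataFrame(
    {"latitude": [40.6, 40.7, 40.8], "longitude": [-74.0, -73.9, -73.8]},
    index=[5, 2, 9],
)
X_test_demo = pd.DataFrame({"latitude": [40.9], "longitude": [-73.7]}, index=[7])

# Learn the mean and standard deviation from the train set only
scaler = StandardScaler().fit(X_train_demo)

X_train_scaled = pd.DataFrame(
    scaler.transform(X_train_demo),
    columns=X_train_demo.columns,
    index=X_train_demo.index,
)
X_test_scaled = pd.DataFrame(
    scaler.transform(X_test_demo),
    columns=X_test_demo.columns,
    index=X_test_demo.index,
)

# Train columns end up with mean ~0 and population standard deviation ~1
print(np.allclose(X_train_scaled.mean(), 0.0, atol=1e-8))
print(np.allclose(X_train_scaled.std(ddof=0), 1.0))
```

The test set is scaled with the train set's statistics, so its columns will generally not have a mean of exactly 0 or a standard deviation of exactly 1.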