Machine learning libraries such as scikit-learn generally require categorical variables to be converted into numerical dummy-coded variables, because they mostly work on NumPy arrays.
In this blog, let’s look at how we can convert a bunch of categorical variables into numerical dummy-coded variables using four different methods:
- scikit-learn preprocessing LabelEncoder
- pandas get_dummies
- Looping
- Mapping
We will work with a dataset from the IBM Watson blog, as it has plenty of categorical variables. You can find the data here. With this data, the goal is to build a model that predicts “churn”, which has two levels, “Yes” and “No”.
We will convert the dependent variable using scikit-learn's LabelEncoder and the independent categorical variables using pandas get_dummies. Please note that LabelEncoder does not create additional columns: it replaces each level with an integer code in the same column. get_dummies, on the other hand, adds one new indicator column per level. We will see that in the example below:
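As a minimal sketch of this step, the snippet below uses a tiny hand-made DataFrame rather than the actual churn data. The column names (`Contract`, `PaperlessBilling`, `tenure`, `Churn`) and the values are assumptions chosen for illustration, so adjust them to match the real file.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for the churn data; column names and values are assumptions.
df = pd.DataFrame({
    "Contract": ["Month-to-month", "One year", "Two year", "One year"],
    "PaperlessBilling": ["Yes", "No", "Yes", "Yes"],
    "tenure": [1, 34, 60, 12],
    "Churn": ["Yes", "No", "No", "Yes"],
})

# LabelEncoder replaces each class with an integer in the same column.
# Classes are sorted, so here "No" becomes 0 and "Yes" becomes 1.
le = LabelEncoder()
y = le.fit_transform(df["Churn"])

# get_dummies expands each listed categorical column into one
# indicator column per level; numeric columns are left untouched.
X = pd.get_dummies(
    df.drop(columns="Churn"),
    columns=["Contract", "PaperlessBilling"],
    dtype=int,
)

print(le.classes_)
print(X.columns.tolist())
```

Here `y` stays a single array of 0s and 1s, while `X` grows from three columns to six: `tenure` plus three `Contract_*` and two `PaperlessBilling_*` indicator columns.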
Here are a few other ways to do dummy coding:
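The looping and mapping approaches from the list above could look roughly like the sketch below. This is one plausible interpretation rather than a definitive recipe, and the DataFrame, its column names (`Partner`, `Contract`), and the derived column names are all made up for illustration.

```python
import pandas as pd

# Toy data; column names and values are assumptions.
df = pd.DataFrame({
    "Partner": ["Yes", "No", "No", "Yes"],
    "Contract": ["Month-to-month", "One year", "Two year", "One year"],
})

# Mapping: replace each level with a number using an explicit dictionary.
# Handy for two-level variables where you want to control which level is 1.
df["Partner_num"] = df["Partner"].map({"No": 0, "Yes": 1})

# Looping: build one 0/1 indicator column per level by hand.
for level in sorted(df["Contract"].unique()):
    df["Contract_" + level] = (df["Contract"] == level).astype(int)

print(df.head())
```

Note that `map` returns NaN for any level missing from the dictionary, so it is worth checking for missing values after mapping.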
Here is an excellent Kaggle Kernel for detailed feature engineering.
Cheers!