Categorical Variables Dummy Coding

Converting categorical variables into numerical dummy coded variable is generally a requirement in machine learning libraries such as Scikit as they mostly work on numpy arrays.

In this blog, let’s look at how we can convert bunch of categorical variables into numerical dummy coded variables using four different methods-

  1. Scikit learn preprocessing LabelEncoder
  2.  Pandas getdummies
  3. Looping
  4. Mapping

We will work with a dataset from IBM Watson blog as this has plenty of categorical variables. You can find the data here.  In this data, we are trying to build a model to predict “churn”, which has two levels “Yes” and “No”.

We will convert the dependent variable using Scikit LabelEncoder and the independent categorical variables using Pandas getdummies. Please note that LabelEncoder will not necessarily create additional columns, whereas the getdummies will create additional columns in the data. We will see that in the below example-

clf1clf2clf3clf4clf5clf6clf7

Here are few other ways to dummy coding-

dummy_coding1dummy_coding2dummy_coding3

Here is an excellent Kaggle Kernel for detailed feature engineering.

Cheers!