Converting categorical variables into numerical dummy-coded variables is generally required by machine learning libraries such as scikit-learn, because they mostly work on numpy arrays.
In this blog, let’s look at how we can convert a bunch of categorical variables into numerical dummy-coded variables using two different methods:
- Scikit-learn preprocessing LabelEncoder
- Pandas get_dummies
We will work with a dataset from the IBM Watson blog, as it has plenty of categorical variables. You can find the data here. In this data, we are trying to build a model to predict “churn”, which has two levels, “Yes” and “No”.
We will convert the dependent variable using the scikit-learn LabelEncoder and the independent categorical variables using Pandas get_dummies. Please note that LabelEncoder does not create additional columns (it replaces each level with an integer code in the same column), whereas get_dummies creates one additional column per level. We will see that in the example below:
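A minimal sketch of the two approaches, using a small made-up frame rather than the actual IBM Watson data. The column names `Contract`, `MonthlyCharges` and `Churn`, and all the values, are assumptions for illustration only.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-in for the churn data; names and values are illustrative only.
df = pd.DataFrame({
    "Contract": ["Monthly", "Yearly", "Monthly", "Two-year"],
    "MonthlyCharges": [29.85, 56.95, 53.85, 42.30],
    "Churn": ["No", "Yes", "No", "Yes"],
})

# LabelEncoder maps each class to an integer code: the result is still one column.
# Classes are sorted, so "No" becomes 0 and "Yes" becomes 1.
encoder = LabelEncoder()
y = encoder.fit_transform(df["Churn"])

# get_dummies expands each listed categorical column into one indicator
# column per level and leaves the numeric columns untouched.
X = pd.get_dummies(df.drop(columns="Churn"), columns=["Contract"])

print(list(encoder.classes_))
print(y)
print(X.columns.tolist())
```

The dependent variable stays a single vector of codes, while the one `Contract` column becomes three indicator columns.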
Here are a few other ways to do dummy coding:
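The alternatives the post originally listed are not reproduced here, so the following is only a sketch of some commonly used pandas options, not necessarily the ones the author had in mind. The `Contract` column and its values are hypothetical.

```python
import pandas as pd

# Hypothetical column; values are illustrative only.
contract = pd.Series(["Monthly", "Yearly", "Monthly", "Two-year"], name="Contract")

# factorize assigns integer codes in order of first appearance.
codes, levels = pd.factorize(contract)

# The category dtype assigns codes by sorted level order instead.
cat_codes = contract.astype("category").cat.codes

# drop_first=True omits the first level, giving k-1 indicator columns,
# which avoids perfectly collinear dummies in linear models.
dummies = pd.get_dummies(contract, prefix="Contract", drop_first=True)

print(codes, list(levels))
print(cat_codes.tolist())
print(dummies.columns.tolist())
```

Note that the two integer codings disagree on which level gets which number, so it is worth keeping the level-to-code mapping alongside the encoded data.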
Here is an excellent Kaggle Kernel for detailed feature engineering.
There are many types of joins, such as inner, outer, left and right, which can be easily done in Python. Let’s work with an example to go through them. More details on our example can be found here.
- Left join: use keys from left frame only
- Right join: use keys from right frame only
- Outer join: use union of keys from both frames
- Inner join: use intersection of keys from both frames
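The four join types can be sketched with `pd.merge` and its `how` parameter. The two frames below, with their `key`, `A` and `B` columns, are made up for illustration and are not the data from the linked example.

```python
import pandas as pd

# Two small hypothetical frames that share only the keys K1 and K2.
left = pd.DataFrame({"key": ["K0", "K1", "K2"], "A": [1, 2, 3]})
right = pd.DataFrame({"key": ["K1", "K2", "K3"], "B": [10, 20, 30]})

# Keep every key from the left frame; B is NaN where right has no match.
left_join = pd.merge(left, right, on="key", how="left")

# Keep every key from the right frame; A is NaN where left has no match.
right_join = pd.merge(left, right, on="key", how="right")

# Keep the union of keys from both frames.
outer_join = pd.merge(left, right, on="key", how="outer")

# Keep only the keys present in both frames.
inner_join = pd.merge(left, right, on="key", how="inner")

for name, frame in [("left", left_join), ("right", right_join),
                    ("outer", outer_join), ("inner", inner_join)]:
    print(name, frame["key"].tolist())
```

Rows without a match on the other side are kept with missing values, which is why the left and right joins return three rows each here while the inner join returns two.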
Just like in Excel, we can build pivot tables in Pandas as well. This is a very convenient feature for summarizing data. Let’s look at an example:
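A minimal sketch with `pd.pivot_table`, assuming a small made-up sales table. The `region`, `product` and `revenue` columns are hypothetical and only there to show the mechanics.

```python
import pandas as pd

# Hypothetical sales records; names and values are illustrative only.
sales = pd.DataFrame({
    "region": ["East", "East", "West", "West", "East"],
    "product": ["A", "B", "A", "B", "A"],
    "revenue": [100, 150, 200, 50, 300],
})

# Rows come from "region", columns from "product", and each cell holds
# the summed revenue for that region/product pair.
pivot = pd.pivot_table(
    sales,
    index="region",
    columns="product",
    values="revenue",
    aggfunc="sum",
    fill_value=0,
)

print(pivot)
```

As in Excel, changing `aggfunc` (for example to `"mean"` or `"count"`) changes the summary without touching the layout.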
Pandas is an open source Python library that creates dataframes similar to Excel tables and plays an instrumental role in data manipulation and data munging in any data science project. Generally speaking, the underlying data values in pandas are stored in the numpy array format, as you will see shortly.
Let’s look at some examples:
First, let’s import a file (using read_csv) to work on. Then we will begin data exploration. In particular, we will do the following in the example below:
- Import pandas and numpy
- Import csv file
- Check type, shape, index and values of the dataframe
- Display top 5 and bottom 5 rows of the data using head() and tail()
- Generate descriptive statistics such as mean, median, percentiles, etc.
- Transpose dataframe
- Sort data frame by rows and columns
- Indexing, slicing and dicing using loc and iloc. More on this is here
- Adding new columns
- Boolean indexing
- Inserting date time in the data frame
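The steps above can be sketched as follows. To keep the example self-contained, an inline CSV string stands in for a file on disk; the `name`, `age` and `score` columns and their values are assumptions, not the data used in the original post.

```python
import io

import numpy as np
import pandas as pd

# Inline CSV text replaces a real file path so the sketch runs as-is.
csv_text = """name,age,score
Ann,34,88.5
Bob,27,92.0
Cid,45,79.5
Dee,31,85.0
Eve,52,90.5
Fay,23,70.0
"""
df = pd.read_csv(io.StringIO(csv_text))

# Type, shape, index and underlying values (a numpy array).
print(type(df), df.shape, df.index)
print(isinstance(df.values, np.ndarray))

# Top 5 and bottom 5 rows.
print(df.head())
print(df.tail())

# Descriptive statistics for the numeric columns (the 50% row is the median).
stats = df.describe()

# Transpose: rows become columns and vice versa.
transposed = df.T

# Sort rows by a column's values, and sort columns by their names.
by_age = df.sort_values(by="age", ascending=False)
by_col_name = df.sort_index(axis=1)

# loc is label-based (end label included); iloc is position-based (end excluded).
first_two_names = df.loc[0:1, "name"]
first_two_rows = df.iloc[0:2, 0:2]

# Add a new column derived from an existing one.
df["passed"] = df["score"] >= 80

# Boolean indexing: keep only the rows where the condition holds.
over_30 = df[df["age"] > 30]

# Insert a datetime column, one consecutive day per row.
df["date"] = pd.date_range("2020-01-01", periods=len(df), freq="D")
print(df)
```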
Here are some examples of how you can check that the necessary packages are installed in the Python environment and check their versions before moving forward. These are some of the must-have packages. If any of the packages are not installed, you can install them with conda from the Anaconda prompt. Further directions are shown in the link.
You can search for any package in the Anaconda environment by using the following command:
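One way to run such a check from Python is sketched below. The package list is an assumption about what counts as "must have" here, and `installed_version` is a hypothetical helper written for this example, not part of any library.

```python
import importlib

# Assumed list of commonly needed packages; adjust it to your own project.
packages = ["numpy", "pandas", "scipy", "sklearn", "matplotlib", "seaborn"]


def installed_version(name):
    """Return the module's __version__ (or "unknown" if it has none),
    or None if the module cannot be imported."""
    try:
        module = importlib.import_module(name)
    except ImportError:
        return None
    return getattr(module, "__version__", "unknown")


for name in packages:
    version = installed_version(name)
    print(f"{name}: {version if version is not None else 'NOT INSTALLED'}")
```

Any package reported as not installed can then be added from the Anaconda prompt.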
anaconda search -t conda seaborn
Installing a package from the Anaconda prompt is as simple as the line shown below. In this case we are installing a package called Seaborn. You can open the Anaconda prompt by typing “anaconda prompt” in the search menu.
conda install seaborn
Please note that sometimes the Anaconda prompt may not let you install new packages and may display errors such as “access denied”. In that case, right-click the Anaconda prompt shortcut and run it as an administrator.
If your conda prompt screen is getting too cluttered, you can always clear the screen by typing the command “cls”.