Converting categorical variables into numerical dummy-coded variables is generally a requirement in machine learning libraries such as scikit-learn, as they mostly work on numpy arrays.
In this blog, let’s look at how we can convert a bunch of categorical variables into numerical dummy-coded variables using the following methods:
- scikit-learn’s preprocessing LabelEncoder
- Pandas get_dummies
We will work with a dataset from IBM Watson blog as this has plenty of categorical variables. You can find the data here. In this data, we are trying to build a model to predict “churn”, which has two levels “Yes” and “No”.
We will convert the dependent variable using scikit-learn’s LabelEncoder and the independent categorical variables using Pandas get_dummies. Please note that LabelEncoder will not create additional columns, whereas get_dummies will create additional columns in the data. We will see this in the example below.
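A minimal sketch of the two methods, using a tiny stand-in frame (the column names mimic the Telco churn data, but the rows are made up for illustration):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Small stand-in for the Telco churn data; rows are illustrative
df = pd.DataFrame({
    "gender": ["Female", "Male", "Male", "Female"],
    "Contract": ["Month-to-month", "One year", "Month-to-month", "Two year"],
    "Churn": ["Yes", "No", "No", "Yes"],
})

# LabelEncoder maps the two churn levels to 0/1 in the same single column
le = LabelEncoder()
df["Churn"] = le.fit_transform(df["Churn"])  # "No" -> 0, "Yes" -> 1

# get_dummies replaces each categorical predictor with indicator columns
df = pd.get_dummies(df, columns=["gender", "Contract"])
print(df.columns.tolist())
```

Note how `Churn` stays a single column while `gender` and `Contract` each expand into one indicator column per level.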
Here are a few other ways to do dummy coding:
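One other common option is scikit-learn's OneHotEncoder, which produces the same kind of indicator columns as get_dummies but as a numpy/scipy matrix. A hedged sketch with a made-up column:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy column standing in for one categorical predictor
df = pd.DataFrame({"Contract": ["Month-to-month", "One year", "Two year", "One year"]})

# OneHotEncoder returns a sparse matrix by default; .toarray() makes it dense
enc = OneHotEncoder()
encoded = enc.fit_transform(df[["Contract"]]).toarray()
print(enc.categories_)  # the levels, one indicator column per level
print(encoded)
```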
Here is an excellent Kaggle Kernel for detailed feature engineering.
There are many types of joins, such as inner, outer, left, and right, which can be done easily in Python. Let’s work through an example. More details on our example can be found here
- left: use keys from the left frame only
- right: use keys from the right frame only
- outer: use the union of keys from both frames
- inner: use the intersection of keys from both frames
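The four cases above map onto the `how` argument of `pandas.merge`. A minimal sketch with two made-up frames (the keys and column names are illustrative):

```python
import pandas as pd

left = pd.DataFrame({"key": ["K0", "K1", "K2"], "A": [1, 2, 3]})
right = pd.DataFrame({"key": ["K1", "K2", "K3"], "B": [4, 5, 6]})

left_join  = pd.merge(left, right, on="key", how="left")   # keys from left only
right_join = pd.merge(left, right, on="key", how="right")  # keys from right only
outer_join = pd.merge(left, right, on="key", how="outer")  # union of keys
inner_join = pd.merge(left, right, on="key", how="inner")  # intersection of keys

print(len(left_join), len(right_join), len(outer_join), len(inner_join))
```

With the frames above, the outer join has four rows (K0 through K3) and the inner join has two (K1 and K2); unmatched cells are filled with NaN.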
Just like in Excel, we can do pivot tables in Pandas as well. This is a very convenient feature when it comes to data summarization. Let’s look at an example:
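A minimal sketch using `pandas.pivot_table` on a toy sales table (the column names and values are made up for illustration):

```python
import pandas as pd

# Toy sales data; Region/Product/Sales are illustrative names
df = pd.DataFrame({
    "Region": ["East", "East", "West", "West"],
    "Product": ["A", "B", "A", "B"],
    "Sales": [100, 150, 200, 50],
})

# Rows become regions, columns become products, cells sum the sales
pivot = pd.pivot_table(df, values="Sales", index="Region",
                       columns="Product", aggfunc="sum")
print(pivot)
```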
Pandas is an open-source Python library which creates dataframes similar to Excel tables and plays an instrumental role in data manipulation and data munging in any data science project. Generally speaking, the underlying data values in pandas are stored in the numpy array format, as you will see shortly.
Let’s look at some examples-
First, let’s import a file (using read_csv) to work on. Then we will begin data exploration. In particular, we will be doing the following in the example below:
- Import pandas and numpy
- Import csv file
- Check type, shape, index and values of the dataframe
- Display top 5 and bottom 5 rows of the data using head() and tail()
- Generate descriptive statistics such as mean, median, and percentiles
- Transpose the dataframe
- Sort the data frame by index and by column values
- Indexing, slicing and dicing using loc and iloc. More on this is here
- Adding new columns
- Boolean indexing
- Inserting date time in the data frame
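The steps above can be sketched end to end. To keep the example self-contained, an inline CSV string stands in for the file you would normally pass to read_csv; the column names and rows are made up:

```python
import io
import numpy as np
import pandas as pd

# Inline CSV stands in for the file you would pass to read_csv
csv_data = io.StringIO(
    "name,age,score\n"
    "Alice,34,88.5\n"
    "Bob,28,92.0\n"
    "Carol,45,79.5\n"
)
df = pd.read_csv(csv_data)

# Type, shape, index and underlying numpy values
print(type(df), df.shape, df.index)
print(df.values)                      # data is held as a numpy array

# Top and bottom rows
print(df.head())
print(df.tail())

# Descriptive statistics (mean, percentiles, etc.) and transpose
print(df.describe())
print(df.T)

# Sorting by index and by a column's values
print(df.sort_index(ascending=False))
print(df.sort_values(by="score"))

# Label-based (loc) and position-based (iloc) indexing
print(df.loc[0, "name"])              # by label
print(df.iloc[0, 0])                  # by position

# Adding a new column and boolean indexing
df["passed"] = df["score"] > 80
print(df[df["passed"]])

# Inserting a datetime column
df["recorded"] = pd.Timestamp("2018-01-01")
print(df.dtypes)
```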
Here are some examples of how you can check that the necessary packages are installed in your Python environment, and check their versions, before moving forward. These are some of the must-have packages. If any of the packages are not installed, you can install them with conda from the Anaconda prompt. Further directions are shown in the link
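For instance, a quick way to check versions from Python itself (assuming numpy, pandas and scikit-learn are the packages you care about):

```python
# Import the core packages and print their versions; an ImportError here
# means the package is missing from the environment
import numpy
import pandas
import sklearn

print("numpy:", numpy.__version__)
print("pandas:", pandas.__version__)
print("scikit-learn:", sklearn.__version__)
```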
You can search for any package in anaconda environment by using the following code-
```shell
anaconda search -t conda seaborn
```