Converting categorical variables into numerical dummy-coded variables is generally a requirement in machine learning libraries such as scikit-learn, as they mostly work on NumPy arrays.
In this blog, let’s look at how we can convert a bunch of categorical variables into numerical dummy-coded variables using two different methods-
- Scikit-learn preprocessing LabelEncoder
- Pandas get_dummies
We will work with a dataset from the IBM Watson blog, as it has plenty of categorical variables. You can find the data here. In this data, we are trying to build a model to predict “churn”, which has two levels, “Yes” and “No”.
We will convert the dependent variable using scikit-learn’s LabelEncoder and the independent categorical variables using Pandas get_dummies. Please note that LabelEncoder will not create additional columns, whereas get_dummies will create one additional column per category level. We will see that in the example below-
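The two steps above can be sketched as follows. The toy frame here is a hypothetical stand-in for the IBM Watson churn data (the column names `gender`, `Contract`, and `Churn` are illustrative, not taken from the actual file):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy frame standing in for the IBM Watson churn data
df = pd.DataFrame({
    "gender": ["Male", "Female", "Female", "Male"],
    "Contract": ["Month-to-month", "One year", "Two year", "Month-to-month"],
    "Churn": ["Yes", "No", "No", "Yes"],
})

# Dependent variable: LabelEncoder replaces the levels in place (no new columns)
le = LabelEncoder()
df["Churn"] = le.fit_transform(df["Churn"])  # classes are sorted: "No" -> 0, "Yes" -> 1

# Independent categoricals: get_dummies adds one column per level
dummies = pd.get_dummies(df, columns=["gender", "Contract"])
print(dummies.columns.tolist())
```

Notice that `Churn` stays a single column, while `gender` and `Contract` expand into one indicator column per level.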
Similar to the hierarchical clustering that we did earlier, we will now build clusters on the same data. However, this time we will use the K-means technique.
So here we go-
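A minimal K-means sketch is below. Since the data from the earlier hierarchical-clustering post isn’t shown here, this uses synthetic two-dimensional data as a stand-in; with the real data, only the loading step changes. Scaling first matters because K-means is distance-based:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the clustering data: two well-separated groups
rng = np.random.RandomState(42)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])

# Standardize first, since K-means relies on Euclidean distances
X_scaled = StandardScaler().fit_transform(X)

km = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = km.fit_predict(X_scaled)
print(km.cluster_centers_)
```

In practice you would choose `n_clusters` with the elbow method or silhouette scores rather than fixing it at 2.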
Linear regression is one of the most fundamental machine learning techniques in Python. For more on linear regression fundamentals click here. In this blog, we will build a regression model to predict house prices by looking into independent variables such as crime rate, % lower-status population, quality of schools, etc. We will be leveraging the scikit-learn library and its built-in “Boston” dataset.
Let’s now jump onto how to build a multiple linear regression model in Python.
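The overall workflow looks like the sketch below. One caveat: the `load_boston` loader was removed from scikit-learn in version 1.2 over ethical concerns, so this sketch substitutes a synthetic regression dataset; with the Boston data the fitting and evaluation steps are identical:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in (load_boston was removed in scikit-learn 1.2)
X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=0)

# Hold out 30% of rows to evaluate the fit on unseen data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)
r2 = r2_score(y_test, pred)
print("R-squared on test set:", r2)
```

`model.coef_` and `model.intercept_` then give the fitted multiple-regression equation.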
You can see from the above metrics that this plain-vanilla regression model is doing a decent job overall. However, it can be significantly improved upon, either through feature engineering (such as binning) and fixes for multicollinearity and heteroscedasticity, or by leveraging more robust techniques such as Elastic Net, Ridge Regression, SGD Regression, or non-linear models.
Data standardization or normalization plays a critical role in most statistical analysis and modeling. Let’s spend some time talking about the difference between standardization and normalization first.
Standardization is when a variable is made to follow the standard normal distribution (mean = 0 and standard deviation = 1). On the other hand, normalization is when a variable is rescaled to fit within a certain range (generally between 0 and 1). Here are more details of the above.
Let’s now talk about why we need to do standardization or normalization before many statistical analyses-
- In a multivariate analysis, when variables have widely different scales, the variable(s) with the higher range may overshadow the others. For example, let’s say variable X has a range of 0-1000 and variable Y has a range of 0-10. In all likelihood, variable X will outweigh variable Y due to its higher range. However, if we standardize or normalize the variables, we can overcome this issue.
- Any algorithm based on distance computations, such as clustering, k-nearest neighbours (KNN), or principal component analysis (PCA), will be greatly affected if you don’t normalize the data
- Neural networks and deep learning networks also need the variables to be normalized to converge faster and give more accurate results
- Multivariate models may become more stable and the coefficients more reliable if you normalize the data
- It can reduce, though not eliminate, the distorting effect of outliers (min-max normalization in particular remains sensitive to extreme values)
Let’s look at a Python example on how we can normalize data-
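Here is one way to do both, sketched with a small made-up array whose two columns sit on very different scales, echoing the X/Y example above:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Two made-up variables on very different scales
X = np.array([[100.0, 1.0],
              [500.0, 5.0],
              [1000.0, 10.0]])

# Standardization: each column gets mean 0 and standard deviation 1
standardized = StandardScaler().fit_transform(X)

# Normalization: each column is rescaled to the [0, 1] range
normalized = MinMaxScaler().fit_transform(X)

print(standardized.mean(axis=0))                        # approximately [0, 0]
print(normalized.min(axis=0), normalized.max(axis=0))   # [0, 0] and [1, 1]
```

After either transform, the 0-1000 column no longer dominates the 0-10 column in distance-based computations.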
There are many types of joins, such as inner, outer, left, and right, all of which can be easily done in Python. Let’s work with an example to go through it. More details on our example can be found here
- left: use keys from the left frame only
- right: use keys from the right frame only
- outer: use the union of keys from both frames
- inner: use the intersection of keys from both frames
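The four join types above map directly onto the `how` parameter of `pd.merge`. A sketch with two hypothetical frames sharing a `key` column:

```python
import pandas as pd

# Two hypothetical frames that share a "key" column
left = pd.DataFrame({"key": ["a", "b", "c"], "x": [1, 2, 3]})
right = pd.DataFrame({"key": ["b", "c", "d"], "y": [20, 30, 40]})

left_join = pd.merge(left, right, on="key", how="left")    # keys from left only
right_join = pd.merge(left, right, on="key", how="right")  # keys from right only
outer_join = pd.merge(left, right, on="key", how="outer")  # union of keys
inner_join = pd.merge(left, right, on="key", how="inner")  # intersection of keys

print(outer_join["key"].tolist())  # a, b, c, d
```

Rows with no match on the other side (e.g. `"a"` in a left join) get `NaN` in the columns coming from that side.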
Just like in Excel, we can build Pivot Tables in Pandas as well. This is a very convenient feature when it comes to summarizing data. Let’s look at an example-
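A minimal sketch with made-up sales data (the `region`/`product`/`sales` columns are illustrative, not from a real dataset):

```python
import pandas as pd

# Hypothetical sales data
df = pd.DataFrame({
    "region": ["East", "East", "West", "West"],
    "product": ["A", "B", "A", "B"],
    "sales": [100, 150, 200, 250],
})

# Mean sales per region/product, much like an Excel pivot table:
# index -> rows, columns -> columns, aggfunc -> the summary statistic
pivot = pd.pivot_table(df, values="sales", index="region",
                       columns="product", aggfunc="mean")
print(pivot)
```

Swapping `aggfunc` for `"sum"`, `"count"`, or a list of functions changes the summary exactly as the Values settings do in Excel.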