Categorical Variables Dummy Coding

Converting categorical variables into numerical dummy-coded variables is generally a requirement in machine learning libraries such as scikit-learn, as they mostly work on NumPy arrays.

In this blog, let’s look at how we can convert a bunch of categorical variables into numerical dummy-coded variables using two different methods:

  1. Scikit-learn preprocessing LabelEncoder
  2. Pandas get_dummies

We will work with a dataset from the IBM Watson blog, as it has plenty of categorical variables. You can find the data here. In this data, we are trying to build a model to predict “churn”, which has two levels: “Yes” and “No”.

We will convert the dependent variable using scikit-learn’s LabelEncoder and the independent categorical variables using Pandas get_dummies. Please note that LabelEncoder replaces each level with an integer code in the same column and does not create additional columns, whereas get_dummies creates one additional column per level. We will see that in the example below:

[Images: clf1 to clf7]
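Here is a minimal sketch of the two methods described above. The small data frame and its column names are hypothetical stand-ins for the churn data, not the actual IBM Watson file.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-in for the churn data; column names are illustrative only.
df = pd.DataFrame({
    "Contract": ["Month-to-month", "One year", "Two year", "Month-to-month"],
    "InternetService": ["DSL", "Fiber optic", "DSL", "No"],
    "MonthlyCharges": [29.85, 56.95, 53.85, 70.70],
    "Churn": ["No", "No", "Yes", "Yes"],
})

# LabelEncoder maps each level of the target to an integer code: no new columns.
encoder = LabelEncoder()
y = encoder.fit_transform(df["Churn"])

# get_dummies expands each listed categorical column into one column per level.
X = pd.get_dummies(df.drop(columns="Churn"), columns=["Contract", "InternetService"])

print(encoder.classes_)    # levels in the order of their integer codes
print(y)                   # encoded target
print(X.columns.tolist())  # numeric column plus the new dummy columns
```

If the model is sensitive to redundant columns (plain linear regression, for example), passing `drop_first=True` to `get_dummies` drops one level per variable.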

Cheers!

Hierarchical Clustering with Python

As highlighted in the article, clustering and segmentation play an instrumental role in Data Science. In this blog, we will show you how to build a hierarchical clustering model with Python.

For this purpose, we will work with an R dataset called “cheese”. Please install the package called “bayesm” in R and export this dataset in CSV format to be imported into Python. More on this dataset can be found here.

Let’s begin with the clustering in Python then.

[Images: hclust1 to hclust8]
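As a rough sketch of the idea, here is agglomerative clustering with SciPy on synthetic data. The two made-up groups below are an assumption standing in for the exported cheese data, and Ward linkage is just one reasonable choice of method.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Synthetic stand-in for the exported data: two well-separated groups of points.
rng = np.random.default_rng(0)
group_a = rng.normal(loc=0.0, scale=0.5, size=(10, 2))
group_b = rng.normal(loc=10.0, scale=0.5, size=(10, 2))
X = np.vstack([group_a, group_b])

# Build the cluster tree bottom-up using Ward linkage on Euclidean distances.
Z = linkage(X, method="ward")

# Cut the tree so that at most two flat clusters remain.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```

`scipy.cluster.hierarchy.dendrogram(Z)` can then draw the tree (it needs matplotlib for plotting), which helps when choosing where to cut.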

Cheers!

Python Machine Learning Linear Regression with Scikit-learn

Linear regression is one of the most fundamental machine learning techniques. For more on linear regression fundamentals, click here. In this blog, we will build a regression model to predict house prices by looking into independent variables such as crime rate, % lower status population, quality of schools, etc. We will be leveraging the scikit-learn library and its built-in dataset called “Boston”.

Let’s now jump into how to build a multiple linear regression model in Python.

[Images: linear1 to linear6]
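Below is a minimal sketch of the workflow. Note that the `load_boston` loader was removed in scikit-learn 1.2, so this sketch uses synthetic data from `make_regression` as a stand-in (an assumption, not the dataset used in this post). The split ratio and noise level are arbitrary.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing data: 5 numeric predictors, noisy target.
X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=0)

# Hold out 30% of the rows to evaluate the fitted model.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Fit ordinary least squares on the training rows.
model = LinearRegression()
model.fit(X_train, y_train)

# Score the predictions on the held-out rows.
y_pred = model.predict(X_test)
r2 = r2_score(y_test, y_pred)
rmse = mean_squared_error(y_test, y_pred) ** 0.5

print(model.intercept_, model.coef_)
print(r2, rmse)
```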

You can see from the above metrics that, overall, this plain vanilla regression model is doing a decent job. However, it can be improved upon significantly, either through feature engineering (such as binning) and fixes for multicollinearity and heteroscedasticity, or by leveraging more robust techniques such as Elastic Net, Ridge Regression, SGD Regression or non-linear models.

Cheers!

Data Standardization or Normalization

Data standardization or normalization plays a critical role in most statistical analysis and modeling. Let’s spend some time talking about the difference between standardization and normalization first.

Standardization is when a variable is rescaled to have mean = 0 and standard deviation = 1, as in the standard normal distribution (rescaling does not change the shape of the variable’s distribution). On the other hand, normalization is when a variable is rescaled to fit within a certain range (generally between 0 and 1). Here are more details of the above.
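In formula terms, the two transformations are usually written as below. The symbols are our own notation: $\mu$ and $\sigma$ are the mean and standard deviation of the variable $x$, and $x_{\min}$, $x_{\max}$ are its smallest and largest values.

```latex
z = \frac{x - \mu}{\sigma}
\qquad\qquad
x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}
```

For example, for the values 2, 4 and 6, min-max normalization gives 0, 0.5 and 1, while standardization (using the population standard deviation, about 1.633) gives roughly -1.22, 0 and 1.22.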

Let’s now talk about why we need to standardize or normalize before many statistical analyses:

  1. In a multivariate analysis where variables have widely different scales, variable(s) with a higher range may overshadow the other variables in the analysis. For example, let’s say variable X has a range of 0-1000 and variable Y has a range of 0-10. In all likelihood, variable X will outweigh variable Y due to its higher range. However, if we standardize or normalize the variables, then we can overcome this issue.
  2. Any algorithm based on distance or variance computations, such as clustering, k nearest neighbour (KNN) or principal component analysis (PCA), will be greatly affected if you don’t normalize the data
  3. Neural networks and deep learning networks also need the variables to be normalized in order to converge faster and give more accurate results
  4. Multivariate models may become more stable and the coefficients more reliable if you normalize the data
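  5. It expresses extreme values on a common scale, which makes outliers easier to flag, although scaling alone does not remove their influence (min-max normalization in particular is sensitive to them)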
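To make the first two points concrete, here is a small sketch with two made-up observations, where X lies in 0-1000 and Y in 0-10. The numbers and the assumed ranges are purely illustrative.

```python
import numpy as np

# Two hypothetical observations as (X, Y): X ranges over 0-1000, Y over 0-10.
a = np.array([100.0, 1.0])
b = np.array([900.0, 9.0])

# Raw Euclidean distance between the two observations.
raw = np.linalg.norm(a - b)

# Divide each variable by its assumed range so both lie in 0-1, then measure again.
ranges = np.array([1000.0, 10.0])
scaled = np.linalg.norm(a / ranges - b / ranges)

# Share of the squared distance that comes from X, before and after scaling.
x_share_raw = (a[0] - b[0]) ** 2 / raw ** 2
x_share_scaled = ((a[0] - b[0]) / ranges[0]) ** 2 / scaled ** 2

print(x_share_raw, x_share_scaled)
```

Before scaling, almost all of the distance comes from X. After scaling, both variables contribute equally, since each observation sits at the same relative position on both variables.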

Let’s look at a Python example of how we can normalize data:

[Images: scaling1 to scaling6]
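A minimal sketch using scikit-learn’s `StandardScaler` and `MinMaxScaler` could look like this. The tiny array is made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Made-up data: the first column spans 0-1000, the second spans 1-9.
data = np.array([
    [0.0, 1.0],
    [500.0, 5.0],
    [1000.0, 9.0],
])

# Standardization: each column ends up with mean 0 and standard deviation 1.
standardized = StandardScaler().fit_transform(data)

# Normalization: each column is rescaled to the 0-1 range.
normalized = MinMaxScaler().fit_transform(data)

print(standardized)
print(normalized)
```

In a modeling workflow, it is common to call `fit` on the training data only and then `transform` both the training and test data, so that the test data does not leak into the scaling parameters.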

Cheers!

Basic Statistics and Data Visualization

Exploratory, diagnostic and descriptive statistics are the first, and a very crucial, part of any data analytics project.

Here are some more details on each of the steps involved in Exploratory Data Analysis (EDA).

Let’s now look at examples on how to accomplish these tasks in Python.

[Images: EDA1 to EDA28]
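As a small sketch of typical first steps, here are a few Pandas summaries on a made-up data frame. The columns and values are hypothetical.

```python
import pandas as pd

# Made-up data for illustration only.
df = pd.DataFrame({
    "age": [23, 35, 41, 29, 52, 46],
    "income": [32000, 54000, 61000, 45000, 88000, 72000],
    "segment": ["A", "B", "B", "A", "C", "B"],
})

# Descriptive statistics for numeric columns: count, mean, std, min, quartiles, max.
summary = df.describe()

# Frequency of each level of a categorical column.
counts = df["segment"].value_counts()

# Pearson correlation between the numeric columns.
corr = df[["age", "income"]].corr()

# Mean of a numeric column within each level of a categorical column.
by_segment = df.groupby("segment")["income"].mean()

print(summary)
print(counts)
print(corr)
print(by_segment)
```

For the visual side, `df.hist()` and `df.plot()` produce quick charts when matplotlib is installed.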

Cheers!

Missing Values Treatment

Data cleaning is a crucial part of any data science project, as unclean data may impact the results significantly. In this blog, we will look at how to deal with the missing values in our data. Let’s look at an example:

[Images: cleaning1 to cleaning9]
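Here is a minimal sketch of three common steps: counting missing values, imputing them, and dropping incomplete rows. The data frame is made up, and median/mode imputation is only one of several reasonable strategies.

```python
import numpy as np
import pandas as pd

# Made-up data with one missing value in each column.
df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 35.0],
    "city": ["Pune", "Delhi", None, "Delhi"],
})

# Count the missing values in each column.
missing_per_column = df.isnull().sum()

# Impute: median for the numeric column, most frequent level for the categorical one.
filled = df.copy()
filled["age"] = filled["age"].fillna(filled["age"].median())
filled["city"] = filled["city"].fillna(filled["city"].mode()[0])

# Alternative: drop every row that has at least one missing value.
dropped = df.dropna()

print(missing_per_column)
print(filled)
print(dropped)
```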

Cheers!

Learn Data Science using Python Step by Step

Here is how you can learn Python step by step:

  1. Setup Python environment
  2. How to start Jupyter Notebook
  3. Install and check packages
  4. Arithmetic operations
  5. Comparison or logical operations
  6. Assignment and augmented assignment in Python
  7. Variables naming conventions
  8. Types of variables in Python and typecasting
  9. Python Functions
  10. Exception handling in Python
  11. String manipulation and indexing
  12. Conditionals and loops in Python
  13. Python data structures and containers
  14. Introduction to Python NumPy
  15. Introduction to Python SciPy
  16. Introduction to Python Pandas
  17. Python pivot tables
  18. Pandas join tables
  19. Missing value treatment
  20. Dummy coding of categorical variables 
  21. Basic statistics and visualization
  22. Data standardization or normalization
  23. Linear Regression with scikit-learn (Machine Learning library)
  24. Logistic Regression with scikit-learn (Machine Learning library)
  25. Hierarchical clustering with Python
  26. K-means clustering with Scikit Python
  27. Decision trees using Scikit Python
  28. Principal Component Analysis (PCA) using Scikit Python- Dimension Reduction
  29. Linear Discriminant Analysis (LDA) using Scikit Python- Dimension Reduction and Classification
  30. Market Basket Analysis or Association Rules or Affinity Analysis or Apriori Algorithm
  31. Recommendation Engines using Scikit-Surprise
  32. Price Elasticity of Demand using Log-Log Ordinary Least Square (OLS) Model
  33. Other topics (coming soon)

Cheers!

Pandas Join Tables

There are many types of joins, such as inner, outer, left and right, which can be done easily in Python with Pandas. Let’s work through an example. More details on our example can be found here.

  left: Use keys from left frame only
  right: Use keys from right frame only
  outer: Use union of keys from both frames
  inner: Use intersection of keys from both frames

[Images: join1 to join6]
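The four join types can be sketched with `pd.merge` as follows. The two small frames and their key values are made up for illustration.

```python
import pandas as pd

# Two made-up frames that share a "key" column; K0 and K3 appear in only one of them.
left = pd.DataFrame({"key": ["K0", "K1", "K2"], "A": ["A0", "A1", "A2"]})
right = pd.DataFrame({"key": ["K1", "K2", "K3"], "B": ["B1", "B2", "B3"]})

# Run the same merge with each join type and keep the results by name.
joined = {
    how: pd.merge(left, right, on="key", how=how)
    for how in ["left", "right", "outer", "inner"]
}

# Show which keys survive each join.
for how, frame in joined.items():
    print(how, frame["key"].tolist())
```

Rows that have no match in the other frame get NaN in the columns coming from that frame, which is why the left, right and outer joins here each contain missing values.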

Cheers!