Recommender Engines

Recommendation engines (or recommender systems) are all around us. A few common examples are:

  • Amazon: “People who bought this also bought…” and “People who viewed this also viewed…”
  • Facebook: friend recommendations
  • LinkedIn: jobs that match you, network recommendations, and “People who viewed this profile also viewed…”
  • Netflix: movie recommendations
  • Google: news recommendations and YouTube video recommendations

and so on…

The main objectives of these recommendation systems are:

  • Customization or personalization
  • Cross-sell
  • Up-sell
  • Customer retention
  • Addressing the “Long Tail” phenomenon seen in online stores vs. brick-and-mortar stores


There are three main approaches for building any recommendation system-

  • Collaborative Filtering

A user-item matrix is built, typically from ratings or other interactions. This matrix is normally sparse, i.e. most of its cells are empty, because each user has interacted with only a small fraction of the items. The goal is to find similarities among users and among items, and to recommend the items a user has a high probability of liking given those similarities.

Similarity between users or between items can be assessed with several measures, such as correlation, cosine similarity, the Jaccard index, and Hamming distance. Cosine similarity and the Jaccard index are the most commonly used measures in recommendation engines.
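To make the two most common measures concrete, here is a minimal NumPy sketch. The two users and their ratings are made up for illustration, and a rating of 0 is assumed to mean "not rated".

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two rating vectors (0.0 if either is all zeros)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

def jaccard_index(a, b):
    """Size of the intersection divided by size of the union of two item sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical ratings of five items by two users (0 = not rated)
alice = [5, 3, 0, 0, 4]
bob = [4, 0, 0, 2, 5]

# Cosine similarity works on the rating values themselves
cos = cosine_similarity(alice, bob)

# The Jaccard index only looks at WHICH items each user rated
alice_items = {i for i, r in enumerate(alice) if r > 0}
bob_items = {i for i, r in enumerate(bob) if r > 0}
jac = jaccard_index(alice_items, bob_items)

print(f"cosine similarity: {cos:.3f}")
print(f"Jaccard index:     {jac:.3f}")
```

Cosine similarity uses how much each user liked an item, while the Jaccard index only uses whether the user interacted with it, so the latter suits implicit data such as views or purchases.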

  • Content Based

This type of recommendation engine focuses on the characteristics, attributes, tags, or features of the items and recommends other items that share some of those features. For example, it recommends another action movie to a viewer who likes action movies.
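The action-movie example can be sketched in a few lines. The movie titles, genre flags, and the recommend helper below are all hypothetical; the idea is simply to build a profile from the items a user liked and rank the remaining items by cosine similarity to that profile.

```python
import numpy as np

# Hypothetical catalogue: each movie is described by genre flags
# in the order [action, comedy, romance]
movies = {
    "Movie A": [1, 0, 0],
    "Movie B": [1, 1, 0],
    "Movie C": [0, 1, 1],
    "Movie D": [0, 0, 1],
}

def recommend(liked, catalogue, top_n=1):
    """Rank unseen items by cosine similarity to the average liked-item vector."""
    profile = np.mean([catalogue[title] for title in liked], axis=0)
    scores = {}
    for title, features in catalogue.items():
        if title in liked:
            continue  # do not recommend what the user has already seen
        features = np.asarray(features, dtype=float)
        denom = np.linalg.norm(profile) * np.linalg.norm(features)
        scores[title] = float(profile @ features / denom) if denom else 0.0
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# A viewer who liked a pure action movie gets the other movie with an action flag
top = recommend(["Movie A"], movies)
print(top)
```

No other users are involved here, which is the key difference from collaborative filtering: only item features and this one user's history are needed.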

  • Hybrid

These recommendation systems combine both of the above approaches.

Read more here

Build a Recommendation System in Python using “Scikit-Surprise”

Now let’s switch gears and see how we can build recommendation engines in Python using a special Python library called Surprise.

This library offers all the necessary tools: prediction algorithms (SVD and other matrix factorization methods, kNN), built-in datasets, similarity modules (cosine, MSD, Pearson), and sampling and model evaluation modules.

Here is how you can get started:

  • Step 1- Switch to a Python 2.7 kernel. I couldn’t make it work in 3.6, so I had to install 2.7 as well in my Jupyter notebook environment
  • Step 2- Make sure you have the Visual C++ compilers installed on your system, as this package needs them to build its Cython extensions. Here are a couple of links to help you with this

Please note that if you don’t do Step 2 correctly, you will get errors such as “Failed building wheel for scikit-surprise” or “Microsoft Visual C++ 14 is required”.

  • Step 3- Install scikit-surprise. Please make sure that you have NumPy installed before this

pip install numpy

pip install scikit-surprise

  • Step 4- Import scikit-surprise and make sure it’s correctly loaded

from surprise import Dataset

  • Step 5- Follow along with the examples below


Getting Started

Movie Example



Decision Tree using Python Scikit

If you are not familiar with Decision Trees, please read this article first.

First let’s look at a very simple example on the Iris data-

Decision Tree in Python
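The original notebook is not reproduced here, so the following is a minimal sketch of what a decision tree on the Iris data can look like with scikit-learn. The 70/30 split, max_depth=3, and the random seed are illustrative choices, not the settings of the original example.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load the built-in Iris data: 150 flowers, 4 features, 3 species
iris = load_iris()

# Hold out 30% of the rows for testing, keeping the class balance
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42, stratify=iris.target
)

# A shallow tree is easy to read and less prone to overfitting
clf = DecisionTreeClassifier(max_depth=3, random_state=42)
clf.fit(X_train, y_train)

# Evaluate on the held-out rows
acc = accuracy_score(y_test, clf.predict(X_test))
print(f"Test accuracy: {acc:.2f}")
```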


Now let’s look at slightly more complex data-

Let’s first build a logistic regression model in Python using the machine learning library Scikit-learn. Please read here about the dataset and dummy coding.






Categorical Variables Dummy Coding

Converting categorical variables into numerical dummy-coded variables is generally a requirement in machine learning libraries such as Scikit-learn, as they mostly work on NumPy arrays.

In this blog, let’s look at how we can convert a bunch of categorical variables into numerical dummy-coded variables using four different methods:

  1. Scikit-learn preprocessing LabelEncoder
  2. Pandas get_dummies
  3. Looping
  4. Mapping

We will work with a dataset from the IBM Watson blog, as it has plenty of categorical variables. You can find the data here. With this data, we are trying to build a model to predict “churn”, which has two levels, “Yes” and “No”.

We will convert the dependent variable using Scikit-learn’s LabelEncoder and the independent categorical variables using Pandas get_dummies. Please note that LabelEncoder replaces each category with an integer code and does not create additional columns, whereas get_dummies creates one additional column per category level. We will see that in the example below:
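Here is a minimal sketch of that combination on a made-up four-row frame. The column names Contract and Churn and their values are stand-ins chosen for illustration, not a copy of the actual dataset.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical miniature of a churn table
df = pd.DataFrame({
    "Contract": ["Month-to-month", "One year", "Two year", "One year"],
    "Churn": ["No", "Yes", "No", "Yes"],
})

# LabelEncoder maps each class to an integer in sorted order:
# "No" -> 0, "Yes" -> 1. The column is replaced, no new columns appear.
encoder = LabelEncoder()
df["Churn"] = encoder.fit_transform(df["Churn"])

# get_dummies replaces "Contract" with one indicator column per level
encoded = pd.get_dummies(df, columns=["Contract"])

print(encoded.columns.tolist())
print(encoded.shape)
```

If the dummies feed a linear model, passing drop_first=True to get_dummies drops one level per variable and avoids perfectly collinear columns.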


Here are a few other ways to do dummy coding:
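The looping and mapping methods from the list above can be sketched as follows. The frame and its column names are hypothetical, and the new columns get a _flag suffix only so the originals stay visible for comparison.

```python
import pandas as pd

# Hypothetical Yes/No columns
df = pd.DataFrame({
    "Partner": ["Yes", "No", "Yes"],
    "PaperlessBilling": ["No", "No", "Yes"],
    "Churn": ["No", "Yes", "No"],
})

# Mapping: translate each level through an explicit dictionary.
# Any level missing from the dictionary would become NaN.
df["Churn_flag"] = df["Churn"].map({"No": 0, "Yes": 1})

# Looping: apply the same rule to several columns in one pass
for col in ["Partner", "PaperlessBilling"]:
    df[col + "_flag"] = [1 if value == "Yes" else 0 for value in df[col]]

print(df)
```

Mapping makes the coding explicit and catches unexpected levels as missing values, which is handy when the category order matters.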


Here is an excellent Kaggle Kernel for detailed feature engineering.


Python Machine Learning Linear Regression with Scikit-learn

Linear regression is one of the most fundamental machine learning techniques. For more on linear regression fundamentals, click here. In this blog, we will build a regression model to predict house prices from independent variables such as crime rate, % lower-status population, quality of schools, etc. We will be leveraging the Scikit-learn library and its built-in dataset called “Boston”.

Let’s now jump into how to build a multiple linear regression model in Python.

Import packages and Boston dataset

Image 1- Importing Packages and Boston Dataset

Explore Boston Dataset

Image 2- Explore Boston Dataset

Creating Features and Labels and Running Correlations

Image 3- Creating Features and Labels and Running Correlations

Creating Features and Labels and Running Correlation Heatmap

Image 4- Creating Features and Labels and Running Correlation Heatmap

Test/Train Split, Linear Regression Model Fitting and Model Evaluation

Image 5- Test/Train Split, Linear Regression Model Fitting and Model Evaluation
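The code behind the image is not reproduced here, so the following is a minimal sketch of the same three steps: split, fit, evaluate. It uses synthetic data as a stand-in, because the load_boston loader was removed from scikit-learn in version 1.2; the coefficients, noise level, and 70/30 split are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a housing table: 200 rows, 3 numeric features
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(scale=0.5, size=200)

# Keep 30% of the rows aside for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Fit ordinary least squares on the training rows only
model = LinearRegression().fit(X_train, y_train)

# Evaluate on rows the model has never seen
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Coefficients:", model.coef_.round(2))
print(f"Test MSE: {mse:.3f}   Test R^2: {r2:.3f}")
```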

Appending Predicted Data and Plotting the Errors

Image 6- Appending Predicted Data and Plotting the Errors

You can see from the above metrics that, overall, this plain vanilla regression model is doing a decent job. However, it can be significantly improved upon, either through feature engineering (such as binning, or fixes for multicollinearity and heteroscedasticity) or by leveraging more robust techniques such as Elastic Net, Ridge regression, SGD regression, or non-linear models.

Mean Squared Error (MSE)

Image 7- Mean Squared Error (MSE) Definition

Mean Absolute Percent Error (MAPE)

Image 8- Mean Absolute Percent Error (MAPE)
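Both metrics are short enough to write out by hand. This sketch uses the standard definitions: MSE is the mean of the squared errors, and MAPE is the mean of the absolute errors divided by the actual values, expressed as a percentage. The three actual and predicted values are made up.

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

def mean_absolute_percent_error(y_true, y_pred):
    """Average absolute error relative to the actual value, in percent.

    Undefined when any actual value is zero.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100)

# Hypothetical actual vs. predicted prices
actual = [100, 200, 400]
predicted = [110, 180, 400]

mse_value = mean_squared_error(actual, predicted)
mape_value = mean_absolute_percent_error(actual, predicted)

print(f"MSE:  {mse_value:.2f}")
print(f"MAPE: {mape_value:.2f}%")
```

MSE is in squared units of the target and punishes large misses heavily, while MAPE is unit-free, which makes it easier to compare across targets of different scale.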

Model Evaluation Metrics

Fitting Linear Regression Model using Statsmodels

Image 9- Fitting Linear Regression Model using Statsmodels

OLS Regression Output

Image 10- OLS Regression Output

Fitting Linear Regression Model with Significant Variables

Image 11- Fitting Linear Regression Model with Significant Variables

Heteroscedasticity Consistent Linear Regression Estimates

Image 12- Heteroscedasticity Consistent Linear Regression Estimates

More details on the metrics can be found at the below links-


Here is a blog with an excellent explanation of all the metrics


Install and check Python Packages

Here are some examples of how you can check that the necessary packages are installed in the Python environment and check their versions before moving forward. These are some of the must-have packages. If any of them is not installed, you can install it from the Anaconda prompt using conda. Further directions are shown in the link.

You can search for any package in the Anaconda environment by using the following command:
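One way to run such a check from inside Python is sketched below, using the standard library's importlib.metadata (Python 3.8 or later). The package list and the installed_versions helper are illustrative; adjust the list to what your project needs.

```python
from importlib.metadata import PackageNotFoundError, version

def installed_versions(packages):
    """Return {package name: version string, or None if not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

# Distribution names as used by pip/conda, e.g. "scikit-learn" rather than "sklearn"
report = installed_versions(
    ["numpy", "pandas", "scikit-learn", "seaborn", "statsmodels"]
)

for name, ver in report.items():
    print(f"{name:15s} {ver if ver is not None else 'NOT INSTALLED'}")
```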

anaconda search -t conda seaborn