Recommender Engines using Sklearn-Surprise in Python


What is a Recommendation Engine?

Recommendation engines (or recommender systems) are machine learning algorithms that suggest relevant products and services to users, and they are all around us. A few common examples are:

  • Amazon: people who bought this also bought that, or people who viewed this also viewed that
  • Facebook: friend recommendations
  • LinkedIn: jobs that match you, network recommendations, or people who viewed this profile also viewed that profile
  • Netflix: movie recommendations
  • Google: news recommendations and YouTube video recommendations

Why do we have Recommendation Engines?

The main objectives of these recommendation systems are the following:

  • Customization or personalization
  • Cross sell
  • Up sell
  • Customer retention
  • Address the “Long Tail” phenomenon seen in Online stores vs Brick and Mortar stores

60% of video watch time on YouTube is driven by the recommendation engine.

How do we build a Recommendation Engine?

There are three main approaches for building any recommendation system-

  • Collaborative Filtering

A matrix of users and items is built, with ratings or other feedback as the cell values. This matrix is normally sparse, i.e. most of the cells are empty, so some form of matrix factorization (such as SVD) is used to reduce its dimensions. Matrix factorization is discussed further later in this article.
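As a rough illustration of the idea, the sketch below factorizes a tiny, made-up user-item matrix with NumPy's SVD and keeps only the two largest singular values. The ratings and the choice of k = 2 are assumptions for illustration, and empty cells are simply filled with zeros here, which dedicated recommender libraries avoid.

```python
import numpy as np

# Hypothetical ratings: rows are users, columns are items, 0 means "not rated".
ratings = np.array([
    [5, 4, 0, 1, 0],
    [4, 5, 1, 0, 0],
    [0, 1, 5, 4, 0],
    [1, 0, 4, 5, 3],
], dtype=float)

# Singular value decomposition: ratings = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(ratings, full_matrices=False)

k = 2  # number of latent factors to keep (an arbitrary choice here)
user_factors = U[:, :k] * s[:k]   # shape (n_users, k)
item_factors = Vt[:k, :].T        # shape (n_items, k)

# Low-rank reconstruction: every cell, including the empty ones, gets a score.
approx = user_factors @ item_factors.T
print(np.round(approx, 2))
```

Each user and each item is now described by only two numbers, and the product of the two factor matrices gives a score for every user-item pair.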

The goal of these recommendation systems is to find similarities among users and among items, and to recommend the items that a user is most likely to enjoy given those similarities.

Similarity between user and item embeddings can be assessed with several measures, such as correlation, cosine similarity, the Jaccard index and Hamming distance. The measures most commonly used in recommendation engines are the dot product, cosine similarity and the Jaccard index.
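To make these measures concrete, here is a minimal sketch of three of them in NumPy and plain Python. The two rating vectors and the item sets are made up for illustration, and the helper names are hypothetical.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def jaccard_index(a, b):
    """Size of the intersection divided by the size of the union of two sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

# Hypothetical rating vectors for two users over the same four items.
user_1 = np.array([5.0, 3.0, 0.0, 1.0])
user_2 = np.array([4.0, 2.0, 0.0, 1.0])

dot = float(np.dot(user_1, user_2))
cos = cosine_similarity(user_1, user_2)

# Jaccard works on sets, e.g. the items each user has interacted with.
jac = jaccard_index({"A", "B", "D"}, {"A", "B", "C"})

print("dot product:", dot)
print("cosine similarity:", round(cos, 3))
print("jaccard index:", jac)
```

The dot product grows with the size of the ratings, whereas cosine similarity only looks at direction, and the Jaccard index ignores rating values altogether.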

These algorithms don’t require any domain expertise (unlike content-based models), because they need only a user-item matrix and the related ratings or feedback. They can therefore recommend an item to a user as long as they can identify similar users and items in the matrix.

The flip side is that these algorithms may not be suitable for recommending a new item that was not in the user-item matrix on which the model was trained (often called the cold-start problem).

  • Content Based-

This type of recommendation engine focuses on the characteristics, attributes, tags or features of the items and recommends other items that share some of those features. For example, it recommends another action movie to a viewer who likes action movies.

Since this algorithm uses the features of a product or service to make recommendations, it has the advantage of being able to recommend unique or niche items, and it can be scaled to make recommendations for a wide array of users. On the other hand, defining product features accurately is key to the success of these algorithms.
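A content-based recommender can be sketched in a few lines. The catalogue, the tags and the `recommend_similar` helper below are hypothetical, and tag overlap (Jaccard index) is just one possible way to score feature similarity.

```python
# Hypothetical catalogue: each movie is described by a set of tags.
movie_tags = {
    "Movie A": {"action", "sci-fi"},
    "Movie B": {"action", "thriller"},
    "Movie C": {"romance", "comedy"},
    "Movie D": {"action", "sci-fi", "thriller"},
}

def recommend_similar(liked, catalogue, top_n=2):
    """Rank the other items by tag overlap (Jaccard index) with `liked`."""
    liked_tags = catalogue[liked]
    scores = {}
    for title, tags in catalogue.items():
        if title == liked:
            continue
        scores[title] = len(liked_tags & tags) / len(liked_tags | tags)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

recommended = recommend_similar("Movie A", movie_tags)
print(recommended)
```

Note that no ratings from other users are needed, which is why this approach can recommend niche or brand-new items as long as their features are known.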

  • Hybrid- 

These recommendation systems combine both of the above approaches.

Read more here

Build a Recommendation System in Python using “Scikit-Surprise”

Now let’s switch gears and see how we can build recommendation engines in Python using a dedicated library called Surprise. In this exercise, we will build a collaborative filtering model that uses Singular Value Decomposition (SVD) to reduce the dimensions of a large, sparse user-item matrix, which gives more robust recommendations while keeping the computation manageable.

Here is how you can get started


Please note that if you don’t complete Step 2 correctly, you will get errors such as “Failed building wheel for scikit-surprise” or “Microsoft Visual C++ 14 is required”.

  • Step 3- Install scikit-surprise. Please make sure that NumPy is installed first

pip install numpy

pip install scikit-surprise

  • Step 4- Import scikit-surprise and make sure it’s correctly loaded

For the sake of simplicity, you can also use Google Colab to work through the example below-

Let’s import the MovieLens small dataset to build a couple of recommendation engines using the KNN and SVD algorithms. Please note that the Surprise package offers many more algorithms to choose from. The data can be found at the link-

Download the zip files and you will see the following files that you can import in Python to explore. However, for the purpose of CF models, we only need the ratings.csv file.

Here are the key steps that we will follow to build a recommendation engine for this data

  • Install Scikit Surprise and Pandas Profiling Packages
  • Import necessary packages
  • Set the IPython interactive shell option so that a cell displays the output of every statement, not just the last one
  • Import all files to explore data
  • Explore datasets using Pandas Profiling Package
  • Use Reader class to parse the file correctly for Surprise package to read and process the file
  • Build SVD model using cross-validation methodology
  • Build SVD model using Train/Test methodology
  • Make predictions of Ratings for a particular user and movie
  • Build a KNN-based recommender and optimize hyperparameters using grid search
  • Find the best parameters and the best score with the optimized hyperparameters

Below are some other useful links from the Surprise Package.


Getting Started

Movie Example

Finally, here is a paper on Amazon Recommendation Engine.


Decision Tree using Python Scikit

If you are not familiar with Decision Trees, please read this article first.

First let’s look at a very simple example on the Iris data-
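Since the original screenshots are not reproduced here, the following is a minimal sketch of what such an example could look like. The split ratio, `max_depth=3` and the random seeds are assumptions, not the settings used in the original post.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# Iris ships with scikit-learn, so nothing needs to be downloaded.
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42, stratify=iris.target
)

# A shallow tree keeps the printed rules readable.
tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(X_train, y_train)

accuracy = tree.score(X_test, y_test)
print(f"Test accuracy: {accuracy:.2f}")

# Text rendering of the learned splits.
print(export_text(tree, feature_names=list(iris.feature_names)))
```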

Decision Tree in Python


Now let’s look at slightly more complex data-

Let’s first build a logistic regression model in Python using machine learning library Scikit. Please read here about the dataset and dummy coding.






Logistic Regression using Scikit Python

If you are not familiar with logistic regression, please read this article first. Moreover, if you are not familiar with the sklearn model building process, please read this article as well.

Assuming you are now familiar with both, this is how you can build a logistic regression model in Python using the scikit-learn machine learning library. Please read here about the dataset and dummy coding.
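The original notebook is not reproduced here, so the sketch below shows the general shape of such a model on made-up, churn-like data. The column names, the simulated churn rule and the model settings are all assumptions, and this is not the dataset the article refers to.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Made-up churn-like data (NOT the dataset used in the article).
rng = np.random.default_rng(1)
n = 400
df = pd.DataFrame({
    "contract": rng.choice(["monthly", "one_year", "two_year"], size=n),
    "monthly_charges": rng.uniform(20, 120, size=n),
})
# Simulated rule: monthly contracts and high charges churn more often.
logit = -3 + 2.0 * (df["contract"] == "monthly") + 0.03 * df["monthly_charges"]
df["churn"] = np.where(rng.random(n) < 1 / (1 + np.exp(-logit)), "Yes", "No")

# Dummy code the categorical feature and turn the label into 0/1.
X = pd.get_dummies(df[["contract", "monthly_charges"]], drop_first=True, dtype=float)
y = (df["churn"] == "Yes").astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

test_accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Test accuracy: {test_accuracy:.2f}")
print(dict(zip(X.columns, model.coef_[0].round(3))))
```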




Categorical Variables Dummy Coding

Converting categorical variables into numerical dummy-coded variables is generally a requirement in machine learning libraries such as scikit-learn, as they mostly work on NumPy arrays.

In this blog, let’s look at how we can convert a set of categorical variables into numerical dummy-coded variables using four different methods-

  1. Scikit-learn preprocessing LabelEncoder
  2. Pandas get_dummies
  3. Looping
  4. Mapping

We will work with a dataset from the IBM Watson blog, as it has plenty of categorical variables. You can find the data here.  In this data, we are trying to build a model to predict “churn”, which has two levels, “Yes” and “No”.

We will convert the dependent variable using the scikit-learn LabelEncoder and the independent categorical variables using pandas get_dummies. Please note that LabelEncoder does not create additional columns, because it replaces each label with an integer code in the same column, whereas get_dummies creates a new column for each level of a variable. We will see that in the below example-
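A minimal sketch of both methods on a toy table follows. The column names and values are made up for illustration and are not taken from the IBM dataset.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data with hypothetical column names (not the IBM dataset).
df = pd.DataFrame({
    "gender": ["Female", "Male", "Male", "Female"],
    "contract": ["monthly", "one_year", "two_year", "monthly"],
    "churn": ["No", "Yes", "No", "Yes"],
})

# LabelEncoder: same column, labels replaced by integers (No -> 0, Yes -> 1).
encoder = LabelEncoder()
df["churn"] = encoder.fit_transform(df["churn"])

# get_dummies: one new 0/1 column per level of each listed variable.
encoded = pd.get_dummies(df, columns=["gender", "contract"], dtype=int)
print(encoded.columns.tolist())
```

The table goes from three columns to six, while the churn column keeps its place and only changes its values. For regression models it is common to pass `drop_first=True` to get_dummies so that one level per variable acts as the reference category.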


Here are a few other ways to do dummy coding-
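The mapping and looping approaches from the list of methods could look roughly like this. The toy columns are hypothetical, and skipping the first level in the loop is a design choice that makes it the reference category.

```python
import pandas as pd

# Toy data with hypothetical column names.
df = pd.DataFrame({
    "partner": ["Yes", "No", "Yes"],
    "contract": ["monthly", "two_year", "one_year"],
})

# Mapping: translate each label through an explicit dictionary.
df["partner_flag"] = df["partner"].map({"No": 0, "Yes": 1})

# Looping: build one indicator column per level, skipping the first level
# so that it acts as the reference category.
levels = sorted(df["contract"].unique())
for level in levels[1:]:
    df[f"contract_{level}"] = (df["contract"] == level).astype(int)

print(df)
```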


Here is an excellent Kaggle Kernel for detailed feature engineering.


Python Machine Learning Linear Regression with Scikit-learn

What is “Linear Regression”?

Linear regression is one of the most powerful and yet simplest machine learning algorithms. It is used where the relationship between the dependent variable and one or more independent variables is assumed to be linear, in the following fashion-

Y = b0 + b1*X1 + b2*X2 + b3*X3 + …..

Here Y is the dependent variable and X1, X2, X3 etc. are independent variables. The purpose of building a linear regression model is to estimate the coefficients b0, b1, b2 etc. that give the smallest prediction error. More on the error will be discussed later in this article.

In the above equation, b0 is the intercept, b1 is the coefficient for variable X1, b2 is the coefficient for the variable X2 and so on…
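To see what estimating the coefficients means in practice, the sketch below simulates data from known values of b0, b1 and b2 and recovers them by ordinary least squares with NumPy. The true coefficients and the noise level are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))           # columns play the role of X1 and X2
true_b = np.array([1.5, 2.0, -0.5])     # b0, b1, b2 (made up)
Y = true_b[0] + X @ true_b[1:] + rng.normal(scale=0.1, size=200)

# Add a column of ones so that b0 is estimated like any other coefficient.
design = np.column_stack([np.ones(len(X)), X])

# Least squares picks the coefficients that minimise the squared error.
b_hat, *_ = np.linalg.lstsq(design, Y, rcond=None)
print(np.round(b_hat, 2))
```

The estimates land close to the values used to generate the data, with small differences caused by the random noise.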

What is a “Simple Linear Regression” and a “Multiple Linear Regression”?

When we have only one independent variable, the resulting regression is called a “Simple Linear Regression”. When we have two or more independent variables, the resulting regression is called a “Multiple Linear Regression”.

What are the requirements for the dependent and independent variables in the regression analysis?

The dependent variable in linear regression is generally numerical and continuous, such as sales in dollars, GDP, unemployment rate, pollution level, amount of rainfall etc. On the other hand, the independent variables can be either numeric or categorical. However, please note that categorical variables need to be dummy coded before we can use them to build a regression model in the sklearn library of Python.

What are some real-world uses of linear regression?

As we discussed earlier, this is one of the most commonly used algorithms in ML. Some of the use cases are listed below-

Example 1-

Predict sales amount of a car company as a function of the # of models, new models, price, discount, GDP, interest rate, unemployment rate, competitive prices etc.

Example 2-

Predict weight gain/loss of a person as a function of calories intake, junk food, genetics, exercise time and intensity, sleep, festival time, diet plans, medicines etc.

Example 3-

Predict house prices as a function of sqft, # of rooms, interest rate, parking, pollution level, distance from city center, population mix etc.

Example 4-

Predict GDP growth rate as a function of inflation, unemployment rate, investment, new business, weather pattern, resources, population etc.

How do we evaluate linear regression model’s performance? 

There are many metrics that can be used to evaluate a linear regression model’s performance and choose the best model.  Some of the most commonly used metrics are-

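Mean Square Error (MSE)- This is an error, so the lower the value the better. It is the average of the squared differences between the actual values Y and the predicted values Ŷ over the n observations-

MSE = (1/n) * Σ (Y - Ŷ)^2

Mean Absolute Percent Error (MAPE)- This is an error, so the lower the value the better. It is the average absolute difference between the actual and predicted values, expressed as a percentage of the actual values-

MAPE = (100/n) * Σ |(Y - Ŷ) / Y|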

R Square- This is called the coefficient of determination and gauges the model’s explanatory power. For example, a linear regression model with an R Square of 0.70 (or 70%) implies that 70% of the variation in the dependent variable can be explained by the model.
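The three metrics can be computed by hand in a few lines, which helps to check one's understanding of the definitions. The actual and predicted values below are made up, and the function names are hypothetical.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean of the squared errors."""
    return float(np.mean((y_true - y_pred) ** 2))

def mape(y_true, y_pred):
    """Mean absolute error as a percentage of the actual values."""
    return float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100)

def r_squared(y_true, y_pred):
    """1 minus (unexplained variation / total variation)."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1 - ss_res / ss_tot)

# Made-up actual and predicted values.
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 310.0, 380.0])

print("MSE :", mse(y_true, y_pred))                   # 175.0
print("MAPE:", round(mape(y_true, y_pred), 2), "%")   # 5.83 %
print("R2  :", round(r_squared(y_true, y_pred), 3))   # 0.986
```

Note that MAPE is undefined when any actual value is zero, which is one reason it is not used for every dataset.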

How do we build a linear regression model in Python?

In this exercise, we will build a linear regression model on the Boston housing data set, which is an inbuilt dataset in the scikit-learn library of Python. (Please note that this dataset was removed in scikit-learn 1.2, so the loader is only available in older versions.) However, before we go down the path of building a model, let’s talk about some of the basic steps in any machine learning model in Python

In most cases, any machine learning algorithm in the sklearn library will follow these steps-

  • Split original data into features and label. In other words,  create dependent variable and set of independent variables in two different arrays separately. Please note this requirement exists only for the supervised learning ( where a dependent variable is present). For unsupervised learning, we don’t have a dependent variable and hence there is no need to split the data into features and label
  • Scale or Normalize the features and label data. Please note that this is not a necessity for all algorithms and/or datasets. Also we are assuming that all the data cleaning and feature engineering  such as missing value treatment, outlier treatment, bogus values fixes and dummy coding of the categorical variables have been done before doing this step
  • Create training and test data sets from the original data. Training data set will be used for training the model whereas the test data set will be used for validating the accuracy or the prediction power of the model on a new dataset. We would need to split both the features and labels into the training and the test split.
  • Create an instance of the model object that will be used for the modelling exercise. This process is called “Instantiation”.  In simpler words, during this step we create the model object, with its chosen settings, that will later be trained.
  • “Fit” the model instance on the training data. During this step, the model is leveraging both the features and the label information provided in the training data to connect the features to label. Please note that we are going with all the default option during fitting of the model.  As you get more expertise you may want to play with some parameter optimization, however we are just going with the defaults for now.
  • “Predict” using the model instance on test data. During this step, the model is only using the features information to predict the label.
  • Based on the predictions generated on the test data, we generate key performance indicators of model performance. This generally includes metrics such as Precision, Recall, F score, Confusion Matrix, Accuracy, Mean Square Error (MSE), Root Mean Square Error (RMSE), Mean Absolute Error (MAE), Area Under the Curve (AUC), Mean Absolute Percentage Error (MAPE) etc.
  • Once the model performance is evaluated and it is deemed to be satisfactory for the purpose of the business uses, we implement the model for new unseen data
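The steps above can be sketched end to end on synthetic data. The dataset, the split ratio and the choice of StandardScaler are assumptions for illustration. One deliberate difference from the order in the list is that the split is done before scaling, so that the scaler only learns from the training data.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Features and label (synthetic data standing in for a real dataset).
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)

# Training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Scale the features using statistics from the training data only.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# Instantiate, fit on features + label, predict from features alone.
model = LinearRegression()
model.fit(X_train_s, y_train)
y_pred = model.predict(X_test_s)

# Evaluate on the held-out test data.
r2 = r2_score(y_test, y_pred)
print("R2 :", round(r2, 3))
print("MSE:", round(mean_squared_error(y_test, y_pred), 2))
print("MAE:", round(mean_absolute_error(y_test, y_pred), 2))
```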

So let’s get started with building this model-


  • Import the necessary packages, including the train_test_split function which will be used for splitting the data into the training and test samples


  • Set the interactive shell option that lets a notebook cell display the output of several statements rather than only the last one


  • Import the Boston Housing dataset from the sklearn library. Scikit-learn ships with several such inbuilt datasets for various purposes. Most of them are stored in a dictionary-like format.


  • Find out more about this data set by typing the below command
  • Let’s do some more exploratory analysis such as- printing the features,  the label shape of the data etc.


  • Convert the original array data into a dataframe and append the column names.
  • Add a new variable in the dataframe for the target ( or label) variable


  • Since we are building a linear regression model it may be helpful to generate the correlation matrix and then the correlation heatmap using the seaborn library



  • Create features and labels using Pandas  ‘.drop() ‘ method to drop certain variables. In this case we are dropping the house price as this is the label.


  • Split the data into the training and test datasets


  • Instantiate– import the model object and create an instance of the model


  • Fit – Fit the model instance on the training data using the ‘.fit()’ method. Note that we are passing both the features and the label here


  • Predict– Predict with the fitted model instance using the ‘.predict()’ method. Please note that here we are only passing the features and having the model predict the values of the label.


  • We can find out many important things such as the coefficients of the parameters using the fitted object methods. In the below case, we are getting the coefficient values for all the feature parameters in the model.





  • We can plot the feature importance in a bar chart format as well using the ‘.plot’ method of the Pandas dataframe.  Please note that we can also specify the figure size and the X and Y variables in the plot method under the different parameters possible




  • Let’s now generate some of the model performance metrics  such as R2, MSE and MAE. All of these model performance metrics can be generated using the scikit-learn inbuilt packages such as ‘metrics’.




  • In the last step we are appending the predicted house prices into the original data and computing the error in estimation for the test data.
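Because the original code screenshots are not reproduced here, and the Boston loader is no longer available in recent scikit-learn releases, the sketch below walks through the same DataFrame-based steps on a synthetic housing table. The column names and the price formula are made up, so the numbers will not match the original post.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a housing table; column names are hypothetical.
rng = np.random.default_rng(42)
n = 500
housing = pd.DataFrame({
    "rooms": rng.normal(6, 1, n),
    "age": rng.uniform(0, 100, n),
    "distance": rng.uniform(1, 12, n),
})
housing["price"] = (5 * housing["rooms"] - 0.05 * housing["age"]
                    - 0.8 * housing["distance"] + rng.normal(0, 2, n))

# Correlation matrix (the input a seaborn heatmap would visualise).
corr = housing.corr()
print(corr.round(2))

# Features and label via .drop().
X = housing.drop(columns="price")
y = housing["price"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Instantiate, fit and predict.
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Coefficients by feature name; coefficients.plot(kind="bar") would chart them.
coefficients = pd.Series(model.coef_, index=X.columns)
print(coefficients.round(3))

# Model performance metrics on the test data.
r2 = r2_score(y_test, y_pred)
print("R2 :", round(r2, 3))
print("MSE:", round(mean_squared_error(y_test, y_pred), 3))
print("MAE:", round(mean_absolute_error(y_test, y_pred), 3))

# Append predictions and errors to the test rows.
results = X_test.copy()
results["actual_price"] = y_test
results["predicted_price"] = y_pred
results["error"] = results["actual_price"] - results["predicted_price"]
print(results.head())
```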



As you can see from the above metrics, this plain vanilla regression model does a decent job overall. However, it can likely be improved upon, either through feature engineering (such as binning, and fixes for multicollinearity and heteroscedasticity) or by leveraging more robust techniques such as Elastic Net, Ridge Regression, SGD Regression or non-linear models.


Image 9- Fitting Linear Regression Model using Statsmodels

Image 10- OLS Regression Output

Image 11- Fitting Linear Regression Model with Significant Variables

Image 12- Heteroscedasticity Consistent Linear Regression Estimates

More details on the metrics can be found at the below links-


Here is a blog with excellent explanation of all metrics


Install and check Python Packages

Here are some examples of how you can check that the necessary packages are installed in your Python environment, and check their versions, before moving forward. These are some of the must-have packages. If any of them is not installed, you can install it from the Anaconda prompt using conda.  Further directions are shown in the link 
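One way to run such a check from inside Python is sketched below. It uses the standard library's `importlib.metadata` (Python 3.8 or later). The helper name and the list of packages are only examples.

```python
from importlib.metadata import PackageNotFoundError, version

def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None

# Package names as they appear on PyPI (e.g. "scikit-learn", not "sklearn").
for name in ["numpy", "pandas", "scikit-learn", "seaborn"]:
    print(f"{name}: {installed_version(name) or 'not installed'}")
```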

You can search for any package in anaconda environment by using the following code-

anaconda search -t conda seaborn

Installing a package using anaconda prompt is as simple as the line shown below. In this case we are installing a package called Seaborn on anaconda prompt. You can go to the anaconda prompt by typing anaconda prompt in the search menu.

conda install seaborn

Please note that sometimes the Anaconda prompt may not let you install new packages and may display errors such as “access denied”. In that case, right click on the Anaconda prompt shortcut and run it as an administrator.

If your conda prompt screen is getting too cluttered, you can always clear it by typing the command “cls”.