Python Machine Learning Linear Regression with Scikit-learn

What is “Linear Regression”?

Linear regression is one of the simplest and yet most powerful machine learning algorithms. It is used when the relationship between the dependent variable and one or more independent variables is assumed to be linear, in the following fashion-

Y = b0 + b1*X1 + b2*X2 + b3*X3 + …..

Here Y is the dependent variable and X1, X2, X3 etc. are the independent variables. The purpose of building a linear regression model is to estimate the coefficients b0, b1, b2 etc. that give the smallest prediction error. More on the error will be discussed later in this article.

In the above equation, b0 is the intercept, b1 is the coefficient for the variable X1, b2 is the coefficient for the variable X2, and so on.
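To make the idea of "estimating the coefficients" concrete, here is a minimal sketch using NumPy's least-squares solver. The data is hypothetical: it is generated from Y = 2 + 3*X1 - 1*X2 without noise, purely so that the recovered coefficients can be compared with known values.

```python
import numpy as np

# Hypothetical data generated from Y = 2 + 3*X1 - 1*X2 (no noise),
# so the estimated coefficients should match these values.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 2.0 + 3.0 * X[:, 0] - 1.0 * X[:, 1]

# Prepend a column of ones so that the intercept b0 is estimated as well.
X_design = np.column_stack([np.ones(len(X)), X])

# Least squares: the coefficients that minimise the sum of squared errors.
coefs, *_ = np.linalg.lstsq(X_design, y, rcond=None)
b0, b1, b2 = coefs
print("b0, b1, b2 =", b0, b1, b2)
```

With real, noisy data the estimates would only approximate the underlying relationship.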

What are “Simple Linear Regression” and “Multiple Linear Regression”?

When we have only one independent variable, the resulting regression is called a “Simple Linear Regression”. When we have two or more independent variables, the resulting regression is called a “Multiple Linear Regression”.

What are the requirements for the dependent and independent variables in the regression analysis?

The dependent variable in linear regression is generally numerical and continuous, such as sales in dollars, GDP, unemployment rate, pollution level, amount of rainfall etc. The independent variables, on the other hand, can be either numeric or categorical. However, please note that categorical variables need to be dummy coded before they can be used to build a regression model with the sklearn library in Python.
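As an illustration of dummy coding, the sketch below uses pandas' get_dummies on a small made-up data frame (the column names and values are hypothetical). Passing drop_first=True drops one category per variable, which avoids perfectly collinear dummy columns in a regression with an intercept.

```python
import pandas as pd

# Hypothetical data: one numeric and one categorical feature.
df = pd.DataFrame({
    "sqft": [850, 1200, 1500, 990],
    "city": ["Austin", "Boston", "Austin", "Chicago"],
})

# Replace the categorical column with 0/1 indicator columns.
# drop_first=True drops the first category ("Austin") as the baseline.
encoded = pd.get_dummies(df, columns=["city"], drop_first=True)
print(encoded.columns.tolist())
```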

What are some real-world uses of linear regression?

As we discussed earlier, this is one of the most commonly used algorithms in ML. Some of the use cases are listed below-

Example 1-

Predict the sales amount of a car company as a function of the # of models, new models, price, discount, GDP, interest rate, unemployment rate, competitive prices etc.

Example 2-

Predict the weight gain/loss of a person as a function of calorie intake, junk food, genetics, exercise time and intensity, sleep, festival time, diet plans, medicines etc.

Example 3-

Predict house prices as a function of sqft, # of rooms, interest rate, parking, pollution level, distance from city center, population mix etc.

Example 4-

Predict GDP growth rate as a function of inflation, unemployment rate, investment, new business, weather pattern, resources, population

How do we evaluate linear regression model’s performance? 

There are many metrics that can be used to evaluate a linear regression model’s performance and to choose the best model. Some of the most commonly used metrics are-

Mean Squared Error (MSE)- This is an error measure, so the lower the value the better. It is the average of the squared differences between the actual values Yi and the predicted values Ŷi over the n observations-

MSE = (1/n) * Σ (Yi - Ŷi)^2

Mean Absolute Percent Error (MAPE)- This is also an error measure, so the lower the value the better. It is the average of the absolute errors expressed as a percentage of the actual values-

MAPE = (100/n) * Σ |Yi - Ŷi| / |Yi|

R Square- This is called the coefficient of determination and provides a gauge of the model’s explanatory power. For example, a linear regression model with an R Square of 0.70 (or 70%) explains 70% of the variation in the dependent variable.
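The three metrics can be computed by hand in a few lines. The sketch below uses plain Python and the standard textbook definitions; the actual and predicted values are made up for illustration.

```python
def mse(actual, predicted):
    """Mean of the squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def mape(actual, predicted):
    """Mean absolute error as a percentage of the actual values."""
    return 100 * sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted)) / len(actual)


def r_square(actual, predicted):
    """1 - (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Hypothetical actual and predicted values.
actual = [100.0, 200.0, 300.0, 400.0]
predicted = [110.0, 190.0, 310.0, 380.0]

print("MSE :", mse(actual, predicted))        # 175.0
print("MAPE:", mape(actual, predicted))       # about 5.83 (percent)
print("R2  :", r_square(actual, predicted))   # 0.986
```

Note that MAPE is undefined when an actual value is zero, so it is best suited to strictly positive targets.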

How do we build a linear regression model in Python?

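In this exercise, we will build a linear regression model on the Boston housing data set, which has been bundled with the scikit-learn library of Python (note that the load_boston loader was deprecated in scikit-learn 1.0 and removed in version 1.2, so an older version is needed to load it this way). However, before we go down the path of building a model, let’s talk about some of the basic steps in building any machine learning model in Python.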

In most cases, any machine learning algorithm in the sklearn library will follow these steps-

  • Split the original data into features and label. In other words, create the dependent variable and the set of independent variables as two separate arrays. Please note that this requirement exists only for supervised learning (where a dependent variable is present). For unsupervised learning, we don’t have a dependent variable and hence there is no need to split the data into features and label
  • Scale or normalize the features and label data. Please note that this is not a necessity for all algorithms and/or datasets. We are also assuming that all the data cleaning and feature engineering, such as missing value treatment, outlier treatment, fixing bogus values and dummy coding of the categorical variables, have been done before this step
  • Create training and test data sets from the original data. The training data set will be used for training the model, whereas the test data set will be used for validating the accuracy or the predictive power of the model on new data. We need to split both the features and the labels into training and test sets.
  • Create an instance of the model object that will be used for the modelling exercise. This process is called “Instantiation”. In simpler words, during this step we are creating the model object, with its settings, that will later be trained.
  • “Fit” the model instance on the training data. During this step, the model uses both the features and the label provided in the training data to learn how the features relate to the label. Please note that we are going with all the default options when fitting the model. As you gain more expertise you may want to experiment with parameter optimization; however, we are just going with the defaults for now.
  • “Predict” using the model instance on the test data. During this step, the model uses only the features to predict the label.
  • Based on the predictions generated on the test data, we generate key indicators of model performance. For classification models this generally includes metrics such as Precision, Recall, F score, Confusion Matrix, Accuracy and Area Under the Curve (AUC); for regression models it includes Mean Square Error (MSE), Root Mean Square Error (RMSE), Mean Absolute Error (MAE), Mean Absolute Percentage Error (MAPE) etc.
  • Once the model performance has been evaluated and is deemed satisfactory for the business purpose, we implement the model for new, unseen data
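The steps above can be sketched end to end in a few lines of scikit-learn. This is a minimal sketch on synthetic data from make_regression, not the housing data used later. One deliberate choice here, which is an assumption of this sketch rather than something the list prescribes: the scaler is fitted on the training split only, so that no information from the test data leaks into training.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Features (X) and label (y) as separate arrays; synthetic data for illustration.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

# Training and test split of both the features and the label.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Scale the features (optional for plain linear regression).
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Instantiate, fit on the training data, predict on the test data.
model = LinearRegression()
model.fit(X_train_scaled, y_train)
y_pred = model.predict(X_test_scaled)

# Evaluate with regression metrics.
mse = mean_squared_error(y_test, y_pred)
rmse = mse ** 0.5
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"MSE={mse:.2f}  RMSE={rmse:.2f}  MAE={mae:.2f}  R2={r2:.3f}")
```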

So let’s get started with building this model-


  • Import the necessary packages, including the train_test_split function, which will be used for splitting the data into the training and test samples


  • Import the interactive shell setting, which lets us display the output of several statements from the same notebook cell


  • Import the Boston Housing dataset from the sklearn library. Scikit-learn ships several such inbuilt datasets for various purposes. Most of these data sets are stored as dictionary-like objects.


  • Find out more about this data set by typing the below command
  • Let’s do some more exploratory analysis, such as printing the feature names and the shapes of the feature and label arrays.


  • Convert the original array data into a dataframe and assign the column names.
  • Add a new variable in the dataframe for the target ( or label) variable


  • Since we are building a linear regression model it may be helpful to generate the correlation matrix and then the correlation heatmap using the seaborn library



  • Create features and labels using the Pandas ‘.drop()’ method to drop certain variables. In this case we are dropping the house price from the features, as this is the label.


  • Split the data into the training and test datasets


  • Instantiate- Import the model class and create an instance of the model


  • Fit- Fit the model instance on the training data using the ‘.fit()’ method. Note that we are passing both the features and the label here


  • Predict- Predict on the test data with the trained model instance using the ‘.predict()’ method. Please note that here we are passing only the features and having the model predict the values of the label.


  • We can find out many important things, such as the coefficients of the features, using the attributes of the fitted object. In the below case, we are getting the coefficient values for all the features in the model.





  • We can also plot the feature importance as a bar chart using the ‘.plot’ method of the Pandas dataframe. Please note that we can specify the figure size and the X and Y variables through the parameters of the plot method




  • Let’s now generate some of the model performance metrics, such as R2, MSE and MAE. All of these metrics can be generated using scikit-learn’s inbuilt ‘metrics’ module.




  • In the last step we are appending the predicted house prices into the original data and computing the error in estimation for the test data.
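The walkthrough above can be sketched as follows. This is not the original notebook code: because load_boston is no longer available in recent scikit-learn versions, this sketch swaps in the bundled diabetes dataset (load_diabetes) as a stand-in, so the column names and numbers differ from the housing example. The variable names are illustrative.

```python
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Load a bundled dataset (dictionary-like object) and explore it.
data = load_diabetes()
print(data.keys())
print(data.feature_names)
print(data.data.shape, data.target.shape)

# Convert the arrays into a dataframe and add the label as a new column.
df = pd.DataFrame(data.data, columns=data.feature_names)
df["target"] = data.target

# Correlation matrix; seaborn.heatmap(corr, annot=True) would draw the heatmap.
corr = df.corr()

# Features and label, then the training and test split.
X = df.drop("target", axis=1)
y = df["target"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Instantiate, fit and predict.
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Coefficients per feature; coefs.plot(kind="barh", figsize=(8, 5))
# would draw the bar chart (requires matplotlib).
coefs = pd.Series(model.coef_, index=X.columns).sort_values()
print(coefs)

# Model performance metrics on the test data.
print("R2 :", r2_score(y_test, y_pred))
print("MSE:", mean_squared_error(y_test, y_pred))
print("MAE:", mean_absolute_error(y_test, y_pred))

# Append the predictions to the test data and compute the estimation error.
results = X_test.copy()
results["actual"] = y_test
results["predicted"] = y_pred
results["error"] = results["actual"] - results["predicted"]
print(results.head())
```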



As you can see from the above metrics, this plain vanilla regression model is doing a decent job overall. However, it can be improved significantly, either through feature engineering (such as binning) and fixes for multicollinearity and heteroscedasticity, or by leveraging more robust techniques such as Elastic Net, Ridge Regression, SGD Regression or non-linear models.


Image 9- Fitting Linear Regression Model using Statsmodels

Image 10- OLS Regression Output

Image 11- Fitting Linear Regression Model with Significant Variables

Image 12- Heteroscedasticity Consistent Linear Regression Estimates

More details on the metrics can be found at the below links-


Here is a blog with excellent explanation of all metrics


Data Standardization or Normalization

Data standardization or normalization plays a critical role in most statistical analysis and modeling. Let’s first spend some time on the difference between standardization and normalization.

Standardization rescales a variable so that it has a mean of 0 and a standard deviation of 1 (it does not by itself make the variable normally distributed). Normalization, on the other hand, rescales a variable to fit within a certain range (generally between 0 and 1). Here are more details of the above.

Let’s now talk about why we need to standardize or normalize the data before many statistical analyses.

  1. In a multivariate analysis where variables have widely different scales, the variable(s) with the higher range may overshadow the other variables in the analysis. For example, let’s say variable X has a range of 0-1000 and variable Y has a range of 0-10. In all likelihood, variable X will outweigh variable Y due to its higher range. However, if we standardize or normalize the variables, we can overcome this issue.
  2. Any algorithm based on distance or variance computations, such as clustering, k nearest neighbour (KNN) or principal component analysis (PCA), will be greatly affected if you don’t normalize the data
  3. Neural networks and deep learning networks also need the variables to be normalized for converging faster and giving more accurate results
  4. Multivariate models may become more stable and the coefficients more reliable if you normalize the data
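  5. It can limit how much a single wide-ranged variable dominates the analysis; note, however, that scaling does not by itself remove outliers, and min-max normalization in particular is sensitive to them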

Let’s look at a Python example on how we can normalize data-
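A minimal sketch of both transformations with NumPy is shown below; the input values are made up. Scikit-learn's StandardScaler and MinMaxScaler apply the same formulas column by column.

```python
import numpy as np

# Hypothetical variable on an arbitrary scale.
x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

# Standardization: subtract the mean, divide by the standard deviation.
standardized = (x - x.mean()) / x.std()

# Normalization (min-max): rescale to the range 0 to 1.
normalized = (x - x.min()) / (x.max() - x.min())

print(standardized)
print(normalized)   # [0.   0.25 0.5  0.75 1.  ]
```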



Basic Statistics and Data Visualization

Exploratory, diagnostic and descriptive statistics form the first, and a very crucial, part of any data analytics project.

Here are some more details on each of the steps involved in Exploratory Data Analysis ( EDA)

Let’s now look at examples on how to accomplish these tasks in Python.
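As a starting point, here is a small sketch of descriptive statistics with pandas. The data frame is a hypothetical stand-in, typed in by hand so that the example runs without downloading anything.

```python
import pandas as pd

# Hypothetical restaurant bills; the values are made up for illustration.
df = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59, 25.29],
    "tip": [1.01, 1.66, 3.50, 3.31, 3.61, 4.71],
    "day": ["Sun", "Sun", "Sat", "Sat", "Sun", "Sat"],
})

# Count, mean, standard deviation, min, quartiles and max of numeric columns.
summary = df.describe()
print(summary)

# Frequency of each category.
day_counts = df["day"].value_counts()
print(day_counts)

# Group-wise averages and the correlation between numeric columns.
mean_tip_by_day = df.groupby("day")["tip"].mean()
corr = df[["total_bill", "tip"]].corr()
print(mean_tip_by_day)
print(corr)

# df.plot(kind="scatter", x="total_bill", y="tip") would draw a scatter plot
# (requires matplotlib).
```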

You can find all the inbuilt datasets in the seaborn library using the below command-


The following datasets are available-


Missing Values Treatment

Data cleaning is a crucial part of any data science project, as unclean data may impact the results significantly. In this blog, we will look at how to deal with missing values in our data. Let’s look at an example-
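The sketch below shows three common options in pandas: counting missing values, dropping incomplete rows, and filling the gaps. The data frame is hypothetical, and filling with the mean or the most frequent value is only one of several reasonable strategies.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in each column.
df = pd.DataFrame({
    "age": [25, np.nan, 40, 35],
    "city": ["Austin", "Boston", None, "Austin"],
})

# How many values are missing in each column?
missing_per_column = df.isna().sum()
print(missing_per_column)

# Option 1: drop every row that has at least one missing value.
dropped = df.dropna()

# Option 2: fill numeric gaps with the mean and categorical gaps with the mode.
filled = df.copy()
filled["age"] = filled["age"].fillna(filled["age"].mean())
filled["city"] = filled["city"].fillna(filled["city"].mode()[0])
print(filled)
```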



Learn Data Science using Python Step by Step

Here is how you can learn Data Science using Python step by step. Please feel free to reach out to me on my personal email id if you have any questions or comments related to any of the topics.


  1. Setup Python environment
  2. How to start Jupyter notebook
  3. Open Jupyter Notebook in Browser of your Choice
  4. Install and check Packages
  5. Arithmetic operations
  6. Comparison or logical operations
  7. Assignment and augmented assignment in Python
  8. Variables naming conventions
  9. Types of variables in Python and typecasting
  10. Python Functions
  11. Exception handling in Python
  12. String manipulation and indexing
  13. Conditional and loops in Python
  14. Python data structure and containers
  15. Introduction to Python Numpy
  16. Introduction to Python SciPy
  17. Conduct One Sample and Two Sample Equality of Means T Test in Python
  18. Introduction to Python Pandas
  19. Python pivot tables
  20. Pandas join tables
  21. Missing value treatment
  22. Dummy coding of categorical variables 
  23. Basic statistics and visualization
  24. Data standardization or normalization
  25. Linear Regression with scikit-learn (Machine Learning library)
  26. Lasso, Ridge and Elasticnet Regularization in GLM
  27. Classification Algorithm Evaluation Metrics
  28. Logistic Regression with scikit-learn (Machine Learning library)
  29. Hierarchical clustering with Python
  30. K-means clustering with Scikit Python
  31. Decision trees using Scikit Python
  32. Regression Decision Trees with Scikit Python
  33. Support Vector Machine using Scikit Python
  34. Hyperparameters Optimization using Gridsearch and Cross Validations
  35. Principal Component Analysis (PCA) using Scikit Python- Dimension Reduction
  36. Linear Discriminant Analysis (LDA) using Scikit Python- Dimension Reduction and Classification
  37. Market Basket Analysis or Association Rules or Affinity Analysis or Apriori Algorithm
  38. Recommendation Engines using Scikit-Surprise
  39. Price Elasticity of Demand using Log-Log Ordinary Least Square (OLS) Model
  40. Timeseries Forecasting using Facebook Prophet Package
  41. Model Persistence and Productionalization Using Python Pickle
  42. Deep Learning- Introduction to deep learning and environment setup
  43. Deep Learning- Multilayer perceptron (MLP) in Python
  44. Deep Learning- Convolution Neural Network (CNN) in Python
  45. Other topics (coming soon)



Pandas Join Tables

There are many types of joins, such as inner, outer, left and right, which can be done easily in pandas. Let’s work through an example. More details on our example can be found here


Left join- Use keys from left frame only


Right join- Use keys from right frame only


Outer join- Use union of keys from both frames


Inner join- Use intersection of keys from both frames
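The four join types map directly onto the how parameter of pandas' merge function. The two small frames below are hypothetical and share a column named key.

```python
import pandas as pd

# Hypothetical frames: keys B and C appear in both, A only left, D only right.
left = pd.DataFrame({"key": ["A", "B", "C"], "left_val": [1, 2, 3]})
right = pd.DataFrame({"key": ["B", "C", "D"], "right_val": [20, 30, 40]})

# Keys from the left frame only (A, B, C); right_val is NaN for A.
left_join = pd.merge(left, right, on="key", how="left")

# Keys from the right frame only (B, C, D); left_val is NaN for D.
right_join = pd.merge(left, right, on="key", how="right")

# Union of keys from both frames (A, B, C, D).
outer_join = pd.merge(left, right, on="key", how="outer")

# Intersection of keys from both frames (B, C).
inner_join = pd.merge(left, right, on="key", how="inner")

for name, frame in [("left", left_join), ("right", right_join),
                    ("outer", outer_join), ("inner", inner_join)]:
    print(name)
    print(frame)
```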



Python Pivot Tables

Just like in Excel, we can build pivot tables in Pandas as well. This is a very convenient feature when it comes to summarizing data. Let’s look at an example-
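A minimal sketch with pandas' pivot_table function follows; the sales data is made up for illustration.

```python
import pandas as pd

# Hypothetical sales records.
sales = pd.DataFrame({
    "region": ["East", "East", "West", "West", "East"],
    "product": ["A", "B", "A", "B", "A"],
    "amount": [100, 150, 200, 50, 300],
})

# Rows = region, columns = product, cells = total amount.
pivot = pd.pivot_table(
    sales, index="region", columns="product", values="amount", aggfunc="sum"
)
print(pivot)
```

Other aggregations, such as "mean" or "count", can be passed through the aggfunc parameter.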



Introduction to Python Pandas

Pandas is an open source Python library which provides dataframes, similar to Excel tables, and plays an instrumental role in data manipulation and data munging in any data science project. Generally speaking, the underlying data values in pandas are stored in the numpy array format, as you will see shortly.

Let’s look at some examples-

First, let’s import a file (using read_csv) to work on. Then we will begin data exploration.  Particularly, we will be doing following in the below example-

  • Import pandas and numpy
  • Import csv file
  • Check type, shape, index and values of the dataframe
  • Display top 5 and bottom 5 rows of the data using head() and tail()
  • Generate descriptive statistics such as mean, median, percentile etc
  • Transpose dataframe
  • Sort data frame by rows and columns
  • Indexing, slicing and dicing using loc and iloc. More on this is here
  • Adding new columns
  • Boolean indexing
  • Inserting date time in the data frame
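The steps listed above can be sketched as follows. To keep the example self-contained, the CSV content is a hypothetical in-memory string read through io.StringIO; with a real file you would pass its path to read_csv instead.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical CSV content standing in for a file on disk.
csv_text = """name,age,salary
Ann,34,52000
Bob,45,61000
Cara,29,48000
Dev,52,75000
Eli,38,58000
Fay,41,67000
"""
df = pd.read_csv(io.StringIO(csv_text))

# Type, shape, index and underlying values (a numpy array).
print(type(df), df.shape)
print(df.index)
print(type(df.values))

# Top and bottom rows, descriptive statistics, transpose.
print(df.head())
print(df.tail())
print(df.describe())
print(df.T)

# Sort by the values of a column, and sort the columns by name.
by_age = df.sort_values("age", ascending=False)
by_column_name = df.sort_index(axis=1)

# Position-based (iloc) and label-based (loc) selection.
first_two = df.iloc[:2, :2]
names_and_ages = df.loc[0:2, ["name", "age"]]  # loc includes the end label

# New column, boolean indexing and a datetime column.
df["salary_k"] = df["salary"] / 1000
over_40 = df[df["age"] > 40]
df["joined"] = pd.date_range("2020-01-01", periods=len(df), freq="D")
print(df)
```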







Introduction to Python SciPy

Scipy is an open source Python package used for scientific computing across many domains such as engineering, mathematics and the sciences. Here are some examples of Scipy.

Let’s say that the income of a company’s employees is normally distributed with a mean of 10,000 USD and a standard deviation of 1,000 USD. Approximately what percent of the employees will be earning a salary of 11,000 USD or less?

This can be easily accomplished using SciPy.  The answer is 84.1% of employees.


We can also say that 100% - 84.1% = 15.9%, or roughly 16%, of employees may be earning more than 11,000 USD.
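Both percentages can be reproduced with scipy.stats.norm. In this sketch, cdf gives the share at or below a value and sf (the survival function, 1 - cdf) gives the share above it.

```python
from scipy.stats import norm

mean, sd = 10_000, 1_000

# Share of employees earning 11,000 USD or less.
p_at_most = norm.cdf(11_000, loc=mean, scale=sd)

# Share of employees earning more than 11,000 USD.
p_above = norm.sf(11_000, loc=mean, scale=sd)

print(f"{p_at_most:.1%}")   # 84.1%
print(f"{p_above:.1%}")     # 15.9%
```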


Here is another example, showing how we can pick a random sample from a particular normal distribution.
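A minimal sketch using norm.rvs is shown below. The sample size and the random_state seed are arbitrary choices made here so that the result is reproducible.

```python
from scipy.stats import norm

# Draw 1,000 incomes from a normal distribution with mean 10,000 and sd 1,000.
sample = norm.rvs(loc=10_000, scale=1_000, size=1_000, random_state=42)

# The sample mean and standard deviation should be close to 10,000 and 1,000.
print(sample[:5])
print(sample.mean(), sample.std())
```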



Introduction to Python Numpy

Numpy is an open source Python package which makes numerical computing possible in Python using N-dimensional arrays. It forms the foundation of other data munging and manipulation packages such as Pandas.

Let’s look at why Numpy is needed. Assume that we want to add members of two lists as shown in the below example.
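A short sketch of the difference: with plain Python lists the + operator concatenates, whereas with Numpy arrays it adds element by element.

```python
import numpy as np

a = [1, 2, 3]
b = [4, 5, 6]

# Plain lists: "+" joins the two lists end to end.
concatenated = a + b
print(concatenated)   # [1, 2, 3, 4, 5, 6]

# Numpy arrays: "+" adds the corresponding members.
elementwise = np.array(a) + np.array(b)
print(elementwise)    # [5 7 9]
```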


As you can see from the above example, numerical computing is possible in Python largely due to Numpy.

Let’s dig deeper into other aspects of Numpy.