Python Machine Learning: Linear Regression with Scikit-learn

Linear regression is one of the most fundamental machine learning techniques in Python. In this blog, we will build a regression model to predict house prices by looking into independent variables such as crime rate, percentage of lower-status population, quality of schools, etc. We will be leveraging the Scikit-learn library and its built-in dataset called "Boston".

Let's now jump into building a multiple linear regression model in Python.

Import packages and Boston dataset

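The original screenshot isn't available, so here is a minimal sketch of this step. One caveat: `load_boston()` was removed from scikit-learn in version 1.2, so this sketch substitutes the bundled diabetes dataset; the workflow is identical for any tabular regression dataset.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_diabetes

bunch = load_diabetes(as_frame=True)  # Bunch holding data, target and a combined frame
df = bunch.frame                      # features plus a "target" column
print(df.shape)
```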

Explore Boston Dataset

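A sketch of the exploration step, again using the bundled diabetes dataset as a stand-in for Boston:

```python
from sklearn.datasets import load_diabetes

df = load_diabetes(as_frame=True).frame

print(df.head())      # first five rows
print(df.describe())  # count, mean, std, min, quartiles, max per numeric column
df.info()             # column dtypes and non-null counts
```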

Creating Features and Labels and Running Correlations


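A sketch of splitting the frame into features and a label and visualizing pairwise correlations as a heatmap (diabetes data standing in for Boston; the file name for the saved figure is arbitrary):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs headless
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_diabetes

df = load_diabetes(as_frame=True).frame
X = df.drop(columns="target")  # features
y = df["target"]               # label

corr = df.corr()  # pairwise Pearson correlations, including the target
sns.heatmap(corr, cmap="coolwarm", annot=True, fmt=".2f")
plt.tight_layout()
plt.savefig("correlation_heatmap.png")
```

Features with a strong correlation to the target are the first candidates for the model; pairs of features that are strongly correlated with each other hint at multicollinearity.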

Test/Train Split, Linear Regression Model Fitting and Model Evaluation

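A sketch of the split-fit-evaluate step. The 70/30 split and `random_state` value are illustrative choices, and the diabetes dataset again stands in for Boston:

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

df = load_diabetes(as_frame=True).frame
X, y = df.drop(columns="target"), df["target"]

# Hold out 30% of the rows for testing; fixing random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

r2 = r2_score(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
print("R^2:", r2)
print("MSE:", mse)
```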

Appending Predicted Data and Plotting the Errors

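A sketch of appending the predictions to the test set and plotting the errors (column names like "actual" and "error" are illustrative):

```python
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

df = load_diabetes(as_frame=True).frame
X, y = df.drop(columns="target"), df["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = LinearRegression().fit(X_train, y_train)

# Append actuals, predictions and residuals to the test frame for inspection.
results = X_test.copy()
results["actual"] = y_test
results["predicted"] = model.predict(X_test)
results["error"] = results["actual"] - results["predicted"]

# Residuals vs predicted values: a fan shape would suggest heteroscedasticity.
plt.scatter(results["predicted"], results["error"])
plt.axhline(0, color="red")
plt.xlabel("Predicted value")
plt.ylabel("Residual")
plt.savefig("residuals.png")
```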

You can see from the above metrics that, overall, this plain-vanilla regression model is doing a decent job. However, it can be significantly improved upon, either through feature engineering such as binning and fixes for multicollinearity and heteroscedasticity, or by leveraging more robust techniques such as Elastic Net, Ridge Regression, SGD Regression, or non-linear models.

Mean Squared Error (MSE)

MSE is the average of the squared differences between actual and predicted values: MSE = (1/n) * Σ (yᵢ - ŷᵢ)².
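A quick sketch of the definition on toy numbers, checked against scikit-learn's implementation:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

# MSE = (1/n) * sum((y_true - y_pred)**2)
mse_manual = np.mean((y_true - y_pred) ** 2)
print(mse_manual)                          # 0.875
print(mean_squared_error(y_true, y_pred))  # same value
```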

Mean Absolute Percent Error (MAPE)

MAPE expresses the average absolute error as a percentage of the actual values: MAPE = (100/n) * Σ |(yᵢ - ŷᵢ) / yᵢ|.
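A sketch of MAPE on toy numbers:

```python
import numpy as np

y_true = np.array([100.0, 200.0, 400.0])
y_pred = np.array([110.0, 190.0, 420.0])

# MAPE = (100/n) * sum(|y_true - y_pred| / |y_true|)
mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100
print(mape)  # about 6.67
```

Recent scikit-learn versions (0.24+) also ship `mean_absolute_percentage_error`, which returns a fraction rather than a percentage.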


Fitting Linear Regression Model using Statsmodels


OLS Regression Output

The summary() output reports the coefficient estimates along with their standard errors, t-statistics and p-values, plus overall fit measures such as R-squared and the F-statistic.

Fitting Linear Regression Model with Significant Variables


Heteroscedasticity Consistent Linear Regression Estimates



Data Standardization or Normalization

Data standardization or normalization plays a critical role in most statistical analysis and modeling. Let's spend some time talking about the difference between standardization and normalization first.

Standardization rescales a variable to have mean 0 and standard deviation 1 (note that this does not make the variable normally distributed; it only matches the mean and standard deviation of the standard normal). Normalization, on the other hand, fits a variable within a certain range (generally between 0 and 1).

Let's now talk about why we need to standardize or normalize data before many statistical analyses.

  1. In a multivariate analysis, when variables have widely different scales, the variables with a larger range may overshadow the others. For example, say variable X has a range of 0-1000 and variable Y has a range of 0-10. In all likelihood, variable X will outweigh variable Y due to its larger range. Standardizing or normalizing the variables overcomes this issue.
  2. Any algorithm based on distance computations, such as clustering, k-nearest neighbours (KNN) or principal component analysis (PCA), will be greatly affected if you don't normalize the data.
  3. Neural networks and deep learning models also need the variables to be normalized to converge faster and give more accurate results.
  4. Multivariate models may become more stable, and their coefficients more reliable, if you normalize the data.
  5. Some scaling methods, such as robust scaling based on medians and quartiles, can reduce (though not eliminate) the influence of outliers.

Let’s look at a Python example on how we can normalize data-
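A minimal sketch of both transforms using scikit-learn's scalers on a made-up column of values:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

data = np.array([[1.0], [5.0], [10.0], [100.0]])

# Standardization: rescale to mean 0 and standard deviation 1.
standardized = StandardScaler().fit_transform(data)
print(standardized.mean(), standardized.std())  # ~0.0 and 1.0

# Normalization: rescale values into the [0, 1] range.
normalized = MinMaxScaler().fit_transform(data)
print(normalized.min(), normalized.max())       # 0.0 and 1.0
```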



Basic Statistics and Data Visualization

Exploratory, diagnostic and descriptive statistics are the first, and a very crucial, part of any data analytics project. This process is commonly referred to as Exploratory Data Analysis (EDA).

Let’s now look at examples on how to accomplish these tasks in Python.

You can find all the inbuilt datasets in the seaborn library using the below command-
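The command in question is `get_dataset_names()`. Note that it queries seaborn's online data repository, so it needs an internet connection (hence the guard in this sketch):

```python
import seaborn as sns

# Lists the example datasets that sns.load_dataset() can fetch by name.
try:
    print(sns.get_dataset_names())
except Exception as exc:
    print("Could not reach the seaborn-data repository:", exc)
```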


The available datasets include anscombe, iris, tips, titanic, flights and planets, among others.
Missing Values Treatment

Data cleaning is a crucial part of any data science project, as uncleaned data may significantly impact the results. In this blog, we will look at how to deal with missing values in our data. Let's look at an example-
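A sketch on a small made-up frame, showing the two most common treatments: dropping incomplete rows and imputing with the column mean.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 35, 40],
                   "income": [50000, 60000, np.nan, 80000]})

print(df.isnull().sum())       # count of missing values per column

dropped = df.dropna()          # drop rows containing any missing value
filled = df.fillna(df.mean())  # impute missing values with the column mean
print(dropped)
print(filled)
```

Which treatment is appropriate depends on how much data is missing and why; mean imputation is only one simple option.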



Learn Data Science using Python Step by Step

Here is how you can learn Data Science using Python step by step. Please feel free to reach out to me on my personal email if you have any questions or comments related to any of the topics.


  1. Setup Python environment
  2. How to start jupyter notebook
  3. Open Jupyter Notebook in Browser of your Choice
  4. Install and check Packages
  5. Arithmetic operations
  6. Comparison or logical operations
  7. Assignment and augmented assignment in Python
  8. Variables naming conventions
  9. Types of variables in Python and typecasting
  10. Python Functions
  11. Exception handling in Python
  12. String manipulation and indexing
  13. Conditional and loops in Python
  14. Python data structure and containers
  15. Introduction to Python Numpy
  16. Introduction to Python SciPy
  17. Conduct One Sample and Two Sample Equality of Means T Test in Python
  18. Introduction to Python Pandas
  19. Python pivot tables
  20. Pandas join tables
  21. Missing value treatment
  22. Dummy coding of categorical variables 
  23. Basic statistics and visualization
  24. Data standardization or normalization
  25. Linear Regression with scikit-learn (Machine Learning library)
  26. Lasso, Ridge and Elasticnet Regularization in GLM
  27. Logistic Regression with scikit-learn (Machine Learning library)
  28. Hierarchical clustering with Python
  29. K-means clustering with Scikit Python
  30. Decision trees using Scikit Python
  31. Regression Decision Trees with Scikit Python
  32. Support Vector Machine using Scikit Python
  33. Hyperparameters Optimization using Gridsearch and Cross Validations
  34. Principal Component Analysis (PCA) using Scikit Python- Dimension Reduction
  35. Linear Discriminant Analysis (LDA) using Scikit Python- Dimension Reduction and Classification
  36. Market Basket Analysis or Association Rules or Affinity Analysis or Apriori Algorithm
  37. Recommendation Engines using Scikit-Surprise
  38. Price Elasticity of Demand using Log-Log Ordinary Least Square (OLS) Model
  39. Timeseries Forecasting using Facebook Prophet Package
  40. Model Persistence and Productionalization Using Python Pickle
  41. Deep Learning- Introduction to deep learning and environment setup
  42. Deep Learning- Multilayer perceptron (MLP) in Python
  43. Deep Learning- Convolution Neural Network (CNN) in Python
  44. Other topics (coming soon)



Pandas Join Tables

There are many types of joins, such as inner, outer, left and right, all of which can be easily done in Python. Let's work through an example.


Use keys from left frame only


Use keys from right frame only


Use union of keys from both frames


Use intersection of keys from both frames
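The four join types above can be sketched with `pd.merge` on two small made-up frames sharing a `key` column:

```python
import pandas as pd

left = pd.DataFrame({"key": ["A", "B", "C"], "x": [1, 2, 3]})
right = pd.DataFrame({"key": ["B", "C", "D"], "y": [4, 5, 6]})

left_join = pd.merge(left, right, on="key", how="left")    # keys from left frame only
right_join = pd.merge(left, right, on="key", how="right")  # keys from right frame only
outer = pd.merge(left, right, on="key", how="outer")       # union of keys: A, B, C, D
inner = pd.merge(left, right, on="key", how="inner")       # intersection of keys: B, C
print(outer)  # unmatched rows get NaN in the missing columns
```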



Python Pivot Tables

Just like in Excel, we can create pivot tables in Pandas as well. This is a very convenient feature when it comes to summarizing data. Let's look at an example-
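A sketch on a made-up sales frame, pivoting region against product with summed revenue in the cells:

```python
import pandas as pd

sales = pd.DataFrame({
    "region":  ["East", "East", "West", "West", "East"],
    "product": ["A", "B", "A", "B", "A"],
    "revenue": [100, 150, 200, 250, 300],
})

# Rows = region, columns = product, cells = summed revenue (like an Excel pivot).
pivot = sales.pivot_table(values="revenue", index="region",
                          columns="product", aggfunc="sum")
print(pivot)
```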



Introduction to Python Pandas

Pandas is an open source Python library which creates DataFrames, similar to Excel tables, and plays an instrumental role in data manipulation and data munging in any data science project. Generally speaking, the underlying data values in pandas are stored as NumPy arrays, as you will see shortly.

Let’s look at some examples-

First, let's import a file (using read_csv) to work on. Then we will begin data exploration. In particular, we will be doing the following in the example below-

  • Import pandas and numpy
  • Import csv file
  • Check type, shape, index and values of the dataframe
  • Display top 5 and bottom 5 rows of the data using head() and tail()
  • Generate descriptive statistics such as mean, median, percentile etc
  • Transpose dataframe
  • Sort data frame by rows and columns
  • Indexing, slicing and dicing using loc and iloc
  • Adding new columns
  • Boolean indexing
  • Inserting date time in the data frame
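The steps above can be sketched as follows. Since the original CSV file isn't available, a small made-up frame stands in for the result of `read_csv` (the column names and values are illustrative):

```python
import numpy as np
import pandas as pd

# Stand-in for: df = pd.read_csv("some_file.csv")
df = pd.DataFrame({"city": ["NY", "LA", "SF", "CHI", "BOS", "SEA"],
                   "price": [450, 380, 700, 300, 520, 480]})

print(type(df), df.shape)       # DataFrame and (rows, columns)
print(df.index, df.columns)     # row and column labels
print(df.head())                # top five rows; df.tail() gives the bottom five
print(df.describe())            # mean, std, percentiles for numeric columns
print(df.values)                # underlying values come back as a NumPy array
print(df.T)                     # transposed frame
print(df.sort_values("price"))  # sort by a column
print(df.loc[0, "city"])        # label-based indexing
print(df.iloc[0, 1])            # position-based indexing
print(df[df["price"] > 400])    # boolean indexing

df["price_rank"] = df["price"].rank()    # adding a new derived column
df["date"] = pd.Timestamp("2024-01-01")  # inserting a datetime column
```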







Introduction to Python SciPy

SciPy is an open source Python package used for scientific computing across many domains such as engineering, mathematics and the sciences. Here are some examples of SciPy in action.

Let's say that the income of a company's employees is normally distributed with a mean of 10,000 USD and a standard deviation of 1,000 USD. Approximately what percent of the employees will be earning a salary of 11,000 USD or less?

This can be easily accomplished using SciPy. The answer is about 84.1% of employees.
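This is the cumulative distribution function of the normal distribution evaluated at 11,000:

```python
from scipy import stats

# P(income <= 11,000) for income ~ Normal(mean=10_000, sd=1_000)
p = stats.norm.cdf(11_000, loc=10_000, scale=1_000)
print(round(p, 3))  # 0.841
```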


We can also say that 100% - 84.1%, or roughly 16%, of employees may be earning more than 11,000 USD.


Here is another example of how we can pick a random sample from a particular normal distribution.
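A sketch using the same salary distribution; the sample size and `random_state` are illustrative choices (the latter just makes the draw reproducible):

```python
from scipy import stats

# Draw 5 random salaries from Normal(mean=10_000, sd=1_000).
sample = stats.norm.rvs(loc=10_000, scale=1_000, size=5, random_state=42)
print(sample)
```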



Introduction to Python Numpy

NumPy is an open source Python package which makes numerical computing possible in Python using N-dimensional arrays. It forms the foundation of other data munging and manipulation packages such as Pandas.

Let's look at why NumPy is needed. Assume that we want to add the members of two lists element-wise, as shown in the example below.
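A sketch of the difference between plain lists and NumPy arrays:

```python
import numpy as np

a = [1, 2, 3]
b = [4, 5, 6]

# With plain lists, + concatenates rather than adding element-wise.
print(a + b)                      # [1, 2, 3, 4, 5, 6]

# NumPy arrays support true element-wise arithmetic.
print(np.array(a) + np.array(b))  # [5 7 9]
```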


As you can see from the above example, element-wise numerical computing is possible in Python largely thanks to NumPy.

Let's dig deeper into other aspects of NumPy.