Data Standardization or Normalization

Data standardization or normalization plays a critical role in most statistical analysis and modeling. Let’s spend some time on the difference between standardization and normalization first.

Standardization rescales a variable so that it has a mean of 0 and a standard deviation of 1. It does not change the shape of the distribution, so the result follows the standard normal distribution only if the variable was normally distributed to begin with. Normalization, on the other hand, rescales a variable to fit within a certain range (generally between 0 and 1).
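To make the two definitions concrete, here is a minimal sketch using only the Python standard library. The function names `standardize` and `min_max_normalize` and the sample data are hypothetical, chosen for illustration, and the sketch assumes the population standard deviation is the one wanted.

```python
from statistics import mean, pstdev

def standardize(values):
    """Rescale values to mean 0 and standard deviation 1 (z-scores)."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    return [(v - mu) / sigma for v in values]

def min_max_normalize(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

data = [10, 20, 30, 40, 50]  # hypothetical sample
z_scores = standardize(data)
scaled = min_max_normalize(data)

print(scaled)  # [0.0, 0.25, 0.5, 0.75, 1.0]
```

Both functions assume the input is not constant; a constant variable would cause a division by zero and needs separate handling.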

Now let’s talk about why we need to standardize or normalize the data before many statistical analyses:

In a multivariate analysis where variables have widely different scales, the variable(s) with a higher range may overshadow the other variables. For example, let’s say variable X has a range of 0-1000 and variable Y has a range of 0-10. In all likelihood, variable X will outweigh variable Y due to its higher range. However, if we standardize or normalize the variables, we can overcome this issue.
Algorithms based on distance or variance computations, such as clustering, k-nearest neighbours (KNN), and principal component analysis (PCA), will be greatly affected if you don’t normalize the data.
Neural networks and deep learning networks also benefit from normalized variables: they tend to converge faster and give more accurate results.
Multivariate models may become more stable and the coefficients more reliable if you normalize the data.
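Scaling can dampen the effect of extreme values on some analyses, but it does not make the data immune to outliers. Min-max normalization in particular is sensitive to them, because a single extreme value sets the range.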
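The first two points above can be illustrated with a small sketch. The three points, the `rescale` helper, and the assumed ranges (X in 0-1000, Y in 0-10, as in the example above) are all hypothetical. It compares Euclidean distances before and after min-max scaling.

```python
import math

# Hypothetical points (X, Y): X ranges over 0-1000, Y over 0-10
a = (100.0, 1.0)
b = (110.0, 9.0)  # close to a in X, but differs by 80% of Y's range
c = (400.0, 1.0)  # same Y as a, but differs by 30% of X's range

# On the raw data, X dominates the distance: c looks much farther from a than b does
raw_ab = math.dist(a, b)
raw_ac = math.dist(a, c)

def rescale(point, x_range=(0.0, 1000.0), y_range=(0.0, 10.0)):
    """Min-max scale both coordinates to the 0-1 range (ranges are assumed known)."""
    x, y = point
    return (
        (x - x_range[0]) / (x_range[1] - x_range[0]),
        (y - y_range[0]) / (y_range[1] - y_range[0]),
    )

# After scaling, both variables contribute on the same footing
scaled_ab = math.dist(rescale(a), rescale(b))
scaled_ac = math.dist(rescale(a), rescale(c))

print(raw_ab < raw_ac)        # True: X overshadows Y on the raw data
print(scaled_ab > scaled_ac)  # True: the ordering flips once both are scaled
```

A nearest-neighbour search on the raw data would treat b as the closer point to a almost regardless of Y, which is the overshadowing effect described above.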

Let’s look at a Python example of how we can normalize data:

[Images scaling1 through scaling6: the Python example, not available in this text version]
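As a minimal sketch of what such an example might look like (this is not the original code from the post), the snippet below scales each column of a small table with the Python standard library only. The dataset, the column names, and the helpers `min_max` and `z_score` are hypothetical.

```python
from statistics import mean, pstdev

# Hypothetical dataset: each column is a variable on a very different scale
dataset = {
    "income": [25000, 40000, 58000, 72000, 120000],
    "age": [22, 35, 41, 53, 64],
}

def min_max(column):
    """Normalize a column to the 0-1 range."""
    lo, hi = min(column), max(column)
    if hi == lo:  # constant column: nothing to scale
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]

def z_score(column):
    """Standardize a column to mean 0 and standard deviation 1."""
    mu, sigma = mean(column), pstdev(column)
    if sigma == 0:  # constant column: nothing to scale
        return [0.0 for _ in column]
    return [(v - mu) / sigma for v in column]

normalized = {name: min_max(col) for name, col in dataset.items()}
standardized = {name: z_score(col) for name, col in dataset.items()}

for name in dataset:
    print(name, "normalized:  ", [round(v, 3) for v in normalized[name]])
    print(name, "standardized:", [round(v, 3) for v in standardized[name]])
```

After scaling, income and age sit on comparable scales, so neither dominates purely because of its units. In practice, a library such as scikit-learn is commonly used for this step; the plain-Python version here is only meant to show the arithmetic.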

Cheers!

RP’s Blog on AI

Connect with RP- https://www.linkedin.com/in/ratnakarpandey/
