Data Standardization or Normalization

Data standardization or normalization plays a critical role in most statistical analysis and modeling. Let’s spend some time on the difference between standardization and normalization first.

Standardization rescales a variable so that it has a mean of 0 and a standard deviation of 1, the same mean and spread as the standard normal distribution. Note that the shape of the distribution does not change: a skewed variable stays skewed. Normalization, on the other hand, rescales a variable to fit within a fixed range (generally between 0 and 1).
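In formula form, the two rescalings are usually written as follows. The symbols are my own notation rather than anything fixed by this post: x is an original value, μ and σ are the mean and standard deviation of the variable, and x_min and x_max are its smallest and largest observed values.

```latex
z = \frac{x - \mu}{\sigma}
\qquad
x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}
```

The first gives values centred on 0 with a standard deviation of 1; the second maps the smallest value to 0 and the largest to 1. Both break down for a constant variable, since the denominator is then 0.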

Now let’s talk about why we need to standardize or normalize data before many statistical analyses.

  1. In a multivariate analysis where variables have widely different scales, the variables with a larger range may overshadow the others. For example, let’s say variable X has a range of 0-1000 and variable Y has a range of 0-10. In all likelihood, variable X will outweigh variable Y simply because of its larger range. If we standardize or normalize the variables, we can overcome this issue.
  2. Algorithms that are based on distance or variance computations, such as clustering, k-nearest neighbours (KNN) and principal component analysis (PCA), will be greatly affected if you don’t normalize the data.
  3. Neural networks and deep learning networks generally converge faster, and often give more accurate results, when the input variables are normalized.
  4. Multivariate models may become more stable and their coefficients more reliable if you normalize the data.
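  5. It can limit how much a variable with extreme values dominates the others, but it does not by itself give immunity from outliers. Min-max normalization is particularly sensitive to them, since a single extreme value sets the range.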
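To make the first two points concrete, here is a small sketch of how scale changes a distance calculation. The three points and the divide-by-range rescaling are made up for illustration, using the 0-1000 and 0-10 ranges from the X and Y example above.

```python
from math import dist  # Euclidean distance, available in Python 3.8+

# Hypothetical points as (X, Y): X ranges over 0-1000, Y over 0-10.
a = (100.0, 1.0)
b = (110.0, 9.0)  # close to a in X, far from a in Y
c = (300.0, 1.0)  # far from a in X, identical to a in Y

# On the raw scale, X differences swamp Y differences.
raw_ab = dist(a, b)
raw_ac = dist(a, c)

def rescale(point):
    """Min-max scale each coordinate using the assumed ranges."""
    return (point[0] / 1000.0, point[1] / 10.0)

# After rescaling, both variables contribute on the same footing.
scaled_ab = dist(rescale(a), rescale(b))
scaled_ac = dist(rescale(a), rescale(c))

print("raw:   ", raw_ab < raw_ac)        # is b the nearer neighbour of a?
print("scaled:", scaled_ab < scaled_ac)  # is b still the nearer neighbour?
```

On the raw values b looks like the nearer neighbour of a, because the large gap in Y barely registers next to X. Once both variables are on a 0-1 scale the answer flips, which is exactly the kind of change that would alter a KNN or clustering result.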

Let’s look at a Python example of how we can normalize data:

[Images: scaling1 to scaling6]
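As a text version of the idea, here is a minimal sketch using only the Python standard library. It is not the code from the images above: the column names, the values, and the helper functions standardize and normalize are all made up for illustration, and the standard deviation used is the population version (an assumption; the sample version is also common).

```python
from statistics import mean, pstdev

# Made-up data: each column is a variable on a very different scale.
data = {
    "income": [25000.0, 40000.0, 55000.0, 80000.0],
    "age": [22.0, 35.0, 41.0, 58.0],
}

def standardize(values):
    """Rescale to mean 0 and standard deviation 1 (z-scores)."""
    mu, sigma = mean(values), pstdev(values)  # population std dev (assumed)
    return [(v - mu) / sigma for v in values]

def normalize(values):
    """Rescale to the range 0 to 1 (min-max scaling)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Scale each column independently of the others.
standardized = {name: standardize(col) for name, col in data.items()}
normalized = {name: normalize(col) for name, col in data.items()}

for name in data:
    print(name, "standardized:", [round(v, 2) for v in standardized[name]])
    print(name, "normalized:  ", [round(v, 2) for v in normalized[name]])
```

After scaling, income and age sit on comparable scales, so neither dominates purely because of its units. In practice you would more likely reach for a library: scikit-learn, for example, provides StandardScaler and MinMaxScaler for the same two operations.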

Cheers!
