Data standardization or normalization plays a critical role in most statistical analysis and modeling. Let’s first spend some time on the difference between standardization and normalization.
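Standardization rescales a variable so that it has a mean of 0 and a standard deviation of 1: z = (x - mean) / standard deviation. It does not change the shape of the distribution, so the result follows the standard normal distribution only if the original variable was normally distributed. Normalization, on the other hand, rescales a variable to fit within a certain range (generally between 0 and 1): x' = (x - min) / (max - min).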
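To make the two definitions concrete, here is a minimal sketch using only the Python standard library. The function names `standardize` and `min_max_normalize` and the sample values are made up for illustration, and the population standard deviation is an assumption (some tools divide by n - 1 instead).

```python
from statistics import mean, pstdev


def standardize(values):
    """Rescale values to mean 0 and standard deviation 1 (z-scores)."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    if sigma == 0:
        raise ValueError("cannot standardize a constant variable")
    return [(v - mu) / sigma for v in values]


def min_max_normalize(values, low=0.0, high=1.0):
    """Rescale values linearly into the range [low, high]."""
    v_min, v_max = min(values), max(values)
    if v_max == v_min:
        raise ValueError("cannot normalize a constant variable")
    return [low + (v - v_min) * (high - low) / (v_max - v_min) for v in values]


# Hypothetical sample values, chosen only for illustration
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

z = standardize(data)            # mean 0, standard deviation 1
scaled = min_max_normalize(data)  # smallest value maps to 0, largest to 1

print(z)
print(scaled)
```

Note that both transformations are linear, so the ordering and relative spacing of the values are preserved; only the location and scale change.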
Let’s now talk about why we need to standardize or normalize data before many statistical analyses.
- In a multivariate analysis where variables have widely different scales, the variable(s) with the larger range may overshadow the other variables. For example, let’s say variable X has a range of 0-1000 and variable Y has a range of 0-10. In all likelihood, variable X will outweigh variable Y due to its larger range. However, if we standardize or normalize the variables, we can overcome this issue.
- Algorithms that are based on distance or variance computations, such as clustering, k-nearest neighbours (KNN), and principal component analysis (PCA), will be greatly affected if you don’t normalize the data
- Neural networks and deep learning networks generally converge faster, and often give more accurate results, when the input variables are normalized
- Multivariate models may become more stable and their coefficients more reliable if you normalize the data
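- It can reduce the influence of extreme values on an analysis, but it does not provide immunity from outliers: min-max normalization in particular is determined by the smallest and largest values, so a single outlier can squeeze the rest of the data into a narrow band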
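The scale effect on distance-based methods can be sketched with a few made-up points. In this illustration X ranges over 0-1000 and Y over 0-10, as in the example above; the helper names `euclidean` and `min_max_columns` and all point values are hypothetical.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def min_max_columns(rows):
    """Min-max normalize each column of a list of rows to [0, 1]."""
    scaled_cols = []
    for col in zip(*rows):
        lo, hi = min(col), max(col)
        scaled_cols.append([(v - lo) / (hi - lo) for v in col])
    return [tuple(row) for row in zip(*scaled_cols)]


# Hypothetical (X, Y) points; the first two pin the ranges to 0-1000 and 0-10
points = [
    (0.0, 0.0),
    (1000.0, 10.0),
    (100.0, 9.0),  # reference point p
    (150.0, 0.0),  # q1: close to p in X, far from p in Y
    (300.0, 9.0),  # q2: farther from p in X, identical in Y
]

p, q1, q2 = points[2], points[3], points[4]
raw_d1, raw_d2 = euclidean(p, q1), euclidean(p, q2)

scaled = min_max_columns(points)
sp, sq1, sq2 = scaled[2], scaled[3], scaled[4]
scaled_d1, scaled_d2 = euclidean(sp, sq1), euclidean(sp, sq2)

# On the raw data X dominates, so q1 looks nearer to p.
print("raw:   ", raw_d1, raw_d2)
# After normalization both variables contribute equally, and q2 is nearer.
print("scaled:", scaled_d1, scaled_d2)
```

The nearest neighbour of the reference point changes once both variables are on the same scale, which is why a method such as KNN can give different answers on raw and normalized data.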
Let’s look at a Python example of how we can normalize data: