Principal Component Analysis (PCA) is generally used as an unsupervised algorithm for reducing data dimensionality to address the curse of dimensionality, and it also finds use in detecting outliers, removing noise, speech recognition, and similar areas.
The underlying algorithm in PCA is a linear algebra technique called Singular Value Decomposition (SVD). PCA takes the original data and creates orthogonal (uncorrelated) components that capture the information contained in the original data using significantly fewer components.
Either the components themselves or the key loadings of the components can then be plugged into any further modeling work in place of the original data, minimizing information redundancy and noise.
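As a minimal sketch of this workflow, assuming scikit-learn and using the built-in wine dataset as a stand-in for the original data (this is not the post's case study), fitting PCA and retrieving the component scores and loadings looks like this:

```python
# A minimal sketch, assuming scikit-learn; the wine dataset stands in for
# the original data (the post's own case-study data is not shown here).
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA

X = load_wine().data                 # 178 samples, 13 features

pca = PCA(n_components=4)            # keep 4 orthogonal components
X_reduced = pca.fit_transform(X)     # uncorrelated scores for further modeling

print(X_reduced.shape)               # (178, 4)
print(pca.components_.shape)         # loadings: (4, 13)
```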
There are three main ways to select the right number of components:
- The selected components should explain at least 80% of the original data variance or information [preferred]
- The eigenvalue of each retained component should be greater than or equal to 1, meaning each component expresses at least one variable's worth of information
- Elbow or scree method: plot the percentage of variance explained by each component and select the number of components where an elbow or kink is visible
You can use any one of the above criteria, or a combination of them, to select the right number of components; a sketch of all three follows below. It is critical to standardize or normalize the data before conducting PCA.
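Here is a hedged sketch of the three rules, assuming scikit-learn and matplotlib, again with the wine dataset standing in for the case-study data. Standardizing first means the eigenvalues are those of the correlation matrix, which is what the eigenvalue-greater-than-1 rule assumes:

```python
# A sketch of the three selection rules, assuming scikit-learn and
# matplotlib, with the wine dataset standing in for the case-study data.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize first, so each variable contributes one unit of variance.
X_std = StandardScaler().fit_transform(load_wine().data)
pca = PCA().fit(X_std)                  # keep all components for diagnostics

# Rule 1: smallest number of components whose cumulative variance >= 80%
cum_var = np.cumsum(pca.explained_variance_ratio_)
k_var = int(np.argmax(cum_var >= 0.80)) + 1

# Rule 2: eigenvalues >= 1 (each component worth at least one variable)
k_eigen = int(np.sum(pca.explained_variance_ >= 1))

# Rule 3: scree plot -- pick the component count at the visible elbow
plt.plot(range(1, len(cum_var) + 1), pca.explained_variance_ratio_, "o-")
plt.xlabel("Component")
plt.ylabel("Explained variance ratio")
plt.title("Scree plot")
plt.show()

print("80% variance rule:", k_var, "| eigenvalue rule:", k_eigen)
```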
In the case study below we will use the first criterion shown above, i.e. the selected components should explain 80% or more of the original data variance.
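As a convenience, scikit-learn implements this first criterion directly: passing a float between 0 and 1 as `n_components` keeps just enough components to explain that fraction of the variance. A short sketch, with the wine dataset again standing in for the case-study data:

```python
# A sketch of the first criterion, assuming scikit-learn: a float
# n_components keeps just enough components to explain that share
# of the variance. The wine dataset stands in for the case-study data.
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_std = StandardScaler().fit_transform(load_wine().data)
pca_80 = PCA(n_components=0.80)       # retain components up to 80% variance
X_reduced = pca_80.fit_transform(X_std)

print(pca_80.n_components_)           # number of components actually kept
print(X_reduced.shape)
```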