Are you wondering when you should use principal component analysis (PCA)? Or maybe you want to hear more about how PCA compares to similar dimension reduction techniques? Well then you are in the right place!
In this article, we tell you everything you need to know to understand whether PCA is right for you. We start out with a discussion of what kinds of datasets PCA can be used on. After that, we discuss some of the main advantages and disadvantages you should keep in mind when deciding whether to use PCA. Finally, we provide specific examples of scenarios where you should and should not use PCA.
Datasets for principal component analysis
What kind of dataset should PCA be used on? PCA is an unsupervised algorithm, which means it does not require a specific outcome variable that you are trying to predict. Instead, PCA is used when you have a set of features and want to reduce the dimensionality of that feature set. This simply means that you want to condense as much of the information in your input features as possible into a smaller set of transformed features. In particular, PCA is intended to be used when the features you want to condense are numeric.
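For instance, here is a minimal sketch of condensing a set of numeric features with scikit-learn (the synthetic data and the choice of two components are purely illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic dataset: 100 observations of 5 numeric features
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X[:, 3] = X[:, 0] + 0.1 * rng.normal(size=100)  # make one feature correlated

# Condense the 5 input features into 2 transformed features
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)  # (100, 2)
```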
Advantages and disadvantages of PCA
What are the main advantages and disadvantages of PCA? Here are the main points to keep in mind when deciding whether it is the right choice for your data.
Advantages of principal component analysis
- Guaranteed to produce uncorrelated features. No matter how highly correlated the input features that go into your PCA model are, the transformed features that come out are guaranteed to be uncorrelated. This is a big advantage, as correlated features tend to cause problems for many machine learning algorithms.
- Relatively fast. Another advantage of PCA is that it is relatively fast compared to other dimensionality reduction techniques. PCA makes use of simple linear algebra computations that are easy for computers to handle. That means it is a good option when you have a large dataset with many observations.
- Not sensitive to choice of seed. Another advantage of PCA is that it is not sensitive to the choice of seed or any other initialization conditions. PCA is a deterministic algorithm, which means that it will always produce the same result when applied to the same dataset.
- No hyperparameters. Another advantage of PCA is that there are no hyperparameters that need to be tuned. This means that you do not have to go through the additional step of hyperparameter tuning when applying PCA to your data.
- Popular and well studied. PCA is one of the most common dimensionality reduction techniques out there, which means that many data scientists are familiar with it. This means that it will be easier for collaborators to contribute to projects that use PCA than it would be for them to contribute to projects that use more obscure algorithms.
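The first and third advantages above are easy to verify directly. Here is a small sketch on synthetic data (scikit-learn and NumPy assumed) that checks that the transformed features are uncorrelated and that refitting produces identical results:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: two highly correlated features plus one independent feature
rng = np.random.default_rng(42)
x1 = rng.normal(size=200)
X = np.column_stack([x1, x1 + 0.05 * rng.normal(size=200), rng.normal(size=200)])

pca = PCA()
Z = pca.fit_transform(X)

# The transformed features are uncorrelated with one another
corr = np.corrcoef(Z, rowvar=False)
print(np.abs(corr - np.eye(3)).max() < 1e-6)  # True

# PCA is deterministic: refitting on the same data gives the same result
Z2 = PCA().fit_transform(X)
print(np.allclose(Z, Z2))  # True
```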
Disadvantages of principal component analysis
- Assumes relationships between features are linear. One of the main disadvantages of PCA is that it assumes the relationships between the features in the input data are linear. This means it may not perform well in situations where the relationships between features are non-linear.
- Does not necessarily preserve local structure of data. PCA does not necessarily preserve the local structure of your data. This means that observations that are close together in the original feature space will not necessarily be close together in the transformed feature space. This can be a problem if you want to apply techniques like clustering or visualization to the transformed data.
- Need to rescale features. Another disadvantage of PCA is that it is sensitive to scale. That means that you may need to rescale your features before you apply PCA to them.
- Sensitive to outliers. Another disadvantage of PCA is that it is sensitive to outliers. If there are outliers in your dataset, they may have an outsized effect on the model, leaving you with transformed features that are more representative of a few outlying points than of the bulk of the data.
- Cannot handle missing values. Another disadvantage of traditional PCA is that it cannot handle missing data. This means you may have to preprocess your data to handle any missing values. There are some extensions of PCA that can handle missing values, but they may or may not be available in common machine learning libraries.
- Only suitable for continuous data. Another disadvantage of PCA is that it is only suitable for continuous variables. If you have a mixture of continuous and categorical variables in your dataset, you may want to consider other dimensionality reduction methods.
- Does not perform well when input features are not correlated. PCA does not perform well in situations where none of the input features are correlated with one another. If the features share no information, there is nothing for the algorithm to compress into a smaller number of features.
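To see the scale-sensitivity issue from the list above in practice, here is a sketch on synthetic data (scikit-learn assumed) comparing PCA with and without standardization. When one feature is measured on a much larger scale, it dominates the first component unless you rescale first:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X = np.column_stack([
    rng.normal(scale=1.0, size=100),     # feature measured on a small scale
    rng.normal(scale=1000.0, size=100),  # feature measured on a much larger scale
])

# Without rescaling, the large-scale feature dominates the first component
raw_ratio = PCA().fit(X).explained_variance_ratio_[0]
print(raw_ratio > 0.99)  # True -- nearly all variance comes from one feature

# After standardization, both features contribute comparably
X_scaled = StandardScaler().fit_transform(X)
scaled_ratio = PCA().fit(X_scaled).explained_variance_ratio_[0]
print(scaled_ratio < raw_ratio)  # True
```

Note that in a real modeling workflow you would fit the scaler on the training data only and then apply it to new data, rather than fitting on the full dataset.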
When to use principal component analysis
When should you use principal component analysis rather than another dimensionality reduction technique? Here are some examples of situations where you should use principal component analysis.
- Many correlated features. If you have many correlated features in your dataset and you want to apply an algorithm that does not perform well on correlated features, this is a great use case for PCA. All you have to do is apply PCA to the set of correlated features and replace the input features with the transformed features produced by the PCA model. The transformed features are guaranteed to be uncorrelated with one another no matter how highly correlated the input features were.
- Quick and easy dimension reduction. PCA is a great model to use if you need to apply a quick and easy dimension reduction technique for something like a prototype or proof-of-concept. The model is deterministic and there are no hyperparameters to tune, so you only have to apply the model to your data one time and you are done.
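As a concrete sketch of the first scenario, the snippet below (synthetic data, scikit-learn assumed) drops PCA into a pipeline ahead of a linear regression, so the regression only ever sees uncorrelated inputs:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

# Four features that are all noisy copies of the same underlying signal
rng = np.random.default_rng(7)
base = rng.normal(size=(200, 1))
X = base + 0.05 * rng.normal(size=(200, 4))
y = base.ravel() + 0.1 * rng.normal(size=200)

# PCA condenses the four correlated features into a single component,
# so the downstream regression never sees correlated inputs
model = make_pipeline(StandardScaler(), PCA(n_components=1), LinearRegression())
model.fit(X, y)
print(model.score(X, y) > 0.8)  # True
```

Using a pipeline also keeps the scaling and PCA steps bundled with the model, so they are applied consistently to any new data you predict on.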
When not to use principal component analysis
When should you avoid using principal component analysis? Here are some examples of situations where you should avoid using principal component analysis.
- Features are not linearly related. Principal component analysis performs best when it is applied to a dataset where the features are linearly related. If you do not think the features in your dataset are linearly related, you may be better off using a dimensionality reduction technique that makes fewer assumptions about the data. For example, t-SNE is a non-parametric algorithm that makes fewer assumptions about the structure of the data.
- Visualizing data. If the primary reason you want to reduce the number of dimensions in your data is so that you can visualize it, you are generally better off using an algorithm like t-SNE that preserves local relationships in the data. Algorithms that preserve local relationships try to ensure that observations that are close together in the input feature space are also close together in the transformed feature space, which is what you want when visualizing data. PCA focuses more on preserving global trends in the data and less on preserving local relationships between specific points.
- Need interpretable features. Most dimension reduction techniques produce features that do not have a straightforward interpretation. If you need all of the features in your dataset to be directly interpretable, you may be better off using feature selection techniques instead of traditional dimensionality reduction techniques to reduce the size of your dataset.
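As one illustration of the interpretability point, univariate feature selection keeps a subset of the original columns rather than mixing them together, so the surviving features retain their original meaning. Here is a sketch using scikit-learn's SelectKBest (the synthetic data and the choice of scoring function are just one possible setup, not the only way to do feature selection):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic data where only columns 1 and 4 drive the outcome
rng = np.random.default_rng(3)
X = rng.normal(size=(150, 6))
y = 2.0 * X[:, 1] + 1.0 * X[:, 4] + 0.1 * rng.normal(size=150)

# Keep the two original features most strongly associated with y
selector = SelectKBest(f_regression, k=2).fit(X, y)
print(sorted(selector.get_support(indices=True).tolist()))  # [1, 4]
```

The selected columns are still the original measurements, which makes downstream results much easier to explain than PCA components.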
Are you trying to figure out which machine learning model is best for your next data science project? Check out our comprehensive guide on how to choose the right machine learning model.