Are you wondering when you should use DBSCAN? Or maybe you want to hear more about the practical differences between DBSCAN and other clustering algorithms? Well, either way, you are in the right place!
In this article, we tell you everything you need to know to understand when to use DBSCAN. We start out by discussing the types of datasets that DBSCAN should be used for. After that, we talk about some of the advantages and disadvantages of DBSCAN. Finally, we provide specific examples of scenarios where you should and should not use DBSCAN.
What kind of data can you use DBSCAN for?
What kind of datasets are suitable for DBSCAN? In general, DBSCAN is an unsupervised clustering algorithm that should be used when you do not have a particular outcome variable you want to predict. Instead, you should have a set of features you want to use to identify patterns across your dataset. DBSCAN is intended to be used in cases where all of your features are numeric.
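To make this concrete, here is a minimal sketch of a DBSCAN run on purely numeric data. It assumes scikit-learn is installed; the dataset is synthetic and the eps and min_samples values are illustrative, not recommendations:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Small synthetic numeric dataset: two dense groups of 2-D points.
rng = np.random.default_rng(0)
group_a = rng.normal(loc=(0, 0), scale=0.3, size=(20, 2))
group_b = rng.normal(loc=(5, 5), scale=0.3, size=(20, 2))
X = np.vstack([group_a, group_b])

# eps and min_samples are illustrative values; tune them for your own data.
labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(X)
print(labels)  # cluster index per point; -1 would mark noise
```

Note that the only inputs are the numeric feature matrix and two density parameters; there is no outcome variable anywhere in the call.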
Advantages and disadvantages of DBSCAN
What are the main advantages and disadvantages of DBSCAN? Here are some of the main advantages and disadvantages you should keep in mind when deciding whether to use DBSCAN.
Advantages of DBSCAN
- Handles irregularly shaped and sized clusters. One of the main advantages of DBSCAN is its ability to detect clusters that are irregularly shaped. Of all the common clustering algorithms out there, DBSCAN is one of the algorithms that makes the fewest assumptions about the shape of your clusters. That means that DBSCAN can be used to detect clusters that are oddly or irregularly shaped, such as clusters that are ring-shaped.
- Robust to outliers. Another big advantage of DBSCAN is that it is able to detect outliers and exclude them from the clusters entirely. That means that DBSCAN is very robust to outliers and a great choice for datasets that contain many outliers.
- Does not require the number of clusters to be specified. Yet another advantage of DBSCAN is that it does not require the user to specify the number of clusters. Instead, DBSCAN can automatically detect the number of clusters that exist in the data. This is great for cases where you do not have much intuition on how many clusters there should be.
- Less sensitive to initialization conditions. DBSCAN is less sensitive to initialization conditions like the order of the observations in the dataset and the seed that is used than some other clustering algorithms. Some points that are on the borders between clusters may shift around when initialization conditions change, but the majority of the observations should remain in the same cluster.
- Relatively fast. While DBSCAN is not the fastest clustering algorithm out there, it is certainly not the slowest either. There are multiple implementations of DBSCAN that aim to optimize the time complexity of the algorithm. DBSCAN is generally slower than k-means clustering but faster than hierarchical clustering and spectral clustering.
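The first three advantages above can be seen in a single short sketch. Assuming scikit-learn, the example below builds two ring-shaped clusters (a shape k-means cannot separate) and lets DBSCAN recover them without being told how many clusters to look for; the parameter values are illustrative:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_circles

# Two concentric rings -- irregularly shaped clusters.
X, _ = make_circles(n_samples=500, factor=0.4, noise=0.03, random_state=0)

# No number of clusters is passed; DBSCAN infers it from density.
labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)

# Points labeled -1 are treated as noise rather than forced into a cluster.
n_clusters = len(set(labels) - {-1})
print(n_clusters)  # expect 2: the inner and outer ring
```

Because cluster membership is driven purely by density reachability, the ring shapes pose no problem, and the cluster count falls out of the data rather than being a user input.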
Disadvantages of DBSCAN
- Difficult to incorporate categorical features. One of the main disadvantages of DBSCAN is that it does not perform well on datasets with categorical features. That means that you are best off using DBSCAN in cases where most of your features are numeric.
- Requires a drop in density to detect cluster borders. With DBSCAN, there must be a drop in the density of the data points between clusters in order for the algorithm to be able to detect the boundaries between clusters. If there are multiple clusters that are overlapping without a drop in data density between them, they may get grouped into a single cluster.
- Struggles with clusters of varying density. DBSCAN also has difficulty detecting clusters of varying density. This is because DBSCAN determines where clusters start and stop by looking at places where the density of data points drops below a certain threshold. It may be difficult to find a single threshold that captures all of the points in the less dense cluster without also sweeping nearby outliers into the more dense cluster.
- Sensitive to scale. Like many other clustering algorithms, DBSCAN is sensitive to the scale of your variables. That means that you may need to rescale your variables if they are on very different scales.
- Struggles with high dimensional data. Like many clustering algorithms, the performance of DBSCAN tends to degrade in situations where there are many features. In general, you are better off using dimensionality reduction or feature selection techniques to reduce the number of features if you have a high-dimensional dataset.
- Not as well known. Another disadvantage of DBSCAN is that it is not as popular and well-studied as other clustering algorithms like k-means clustering and hierarchical clustering. It may not be as easy for collaborators that are not familiar with the algorithm to contribute to a project that uses DBSCAN.
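The scale-sensitivity disadvantage is the easiest one to work around: standardize the features before clustering so that each one contributes comparably to the distance computation. Below is a hedged sketch assuming scikit-learn, with invented income and age features chosen purely to show the scale mismatch:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# Hypothetical features on wildly different scales:
# income in tens of thousands vs. age in years.
rng = np.random.default_rng(1)
income = np.concatenate([rng.normal(30_000, 2_000, 50),
                         rng.normal(90_000, 2_000, 50)])
age = np.concatenate([rng.normal(30, 5, 50),
                      rng.normal(55, 5, 50)])
X = np.column_stack([income, age])

# Without rescaling, no single eps suits both feature scales;
# standardizing first puts the features on comparable footing.
X_scaled = StandardScaler().fit_transform(X)
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X_scaled)
print(len(set(labels) - {-1}))  # two groups after rescaling
```

On the raw data, an eps tuned for the income axis would be thousands of units wide and would swallow every age difference; rescaling removes that mismatch.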
When to use DBSCAN
When should you use DBSCAN over another clustering algorithm? Here are some examples of scenarios where you should use DBSCAN.
- You suspect there may be irregularly shaped clusters. If you have reason to expect that the clusters in your dataset may be irregularly shaped, DBSCAN is a great option. DBSCAN will be able to identify clusters that are spherical or ellipsoidal as well as clusters that have more irregular shapes.
- Data has outliers. DBSCAN is also a great option for cases where there are many outliers in your dataset. DBSCAN is able to detect outlying data points that do not belong to any cluster and exclude those data points from the clusters.
- Anomaly detection. Since DBSCAN automatically detects outliers and excludes them from all clusters, it is also a good option in cases where you want to flag anomalous observations in your dataset.
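The anomaly-detection use case falls out of DBSCAN's noise label directly: points assigned the label -1 belong to no cluster and can be treated as anomaly candidates. A minimal sketch, assuming scikit-learn and a synthetic dataset with three injected anomalies:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# A dense blob plus three far-away points that should be flagged.
rng = np.random.default_rng(2)
blob = rng.normal(0, 0.5, size=(100, 2))
anomalies = np.array([[8.0, 8.0], [-7.0, 9.0], [10.0, -6.0]])
X = np.vstack([blob, anomalies])

labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(X)

# The noise label -1 doubles as a simple anomaly flag.
outliers = X[labels == -1]
print(len(outliers))
```

No separate anomaly-detection step is needed; the same fit that produces the clusters produces the outlier flags.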
When not to use DBSCAN
When should you avoid using DBSCAN? Here are some examples of scenarios where you should avoid using DBSCAN.
- No drop in density between clusters. In general, DBSCAN requires there to be a drop in the density of data points in order to detect boundaries between clusters. That means that you should not use DBSCAN if you do not expect there to be much of a drop in density between different clusters. For example, if you expect many of your clusters to overlap, multiple clusters might get grouped together into one large cluster.
- Many categorical features. DBSCAN is generally intended to be used in scenarios where the majority of your features are numeric. That means that you should avoid using DBSCAN in cases where you have many categorical features. In these scenarios, you may be better off using hierarchical clustering with an appropriate distance metric or an extension of k-means clustering like k-modes or k-prototypes.
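If you do have many categorical features, the hierarchical-clustering alternative mentioned above can be sketched with SciPy. The example below hand-rolls a simple Gower-style mixed distance (an illustrative toy, not a library implementation) over an invented numeric column and an invented categorical column, then feeds the precomputed distances to average-linkage hierarchical clustering:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Toy mixed data: one numeric feature (age) and one categorical (plan).
ages = np.array([22.0, 25.0, 24.0, 60.0, 58.0, 63.0])
plans = np.array(["basic", "basic", "basic",
                  "premium", "premium", "premium"])

# Gower-style distance: range-normalized numeric difference averaged
# with a 0/1 categorical mismatch. Purely illustrative.
age_range = ages.max() - ages.min()
n = len(ages)
dist = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        num_part = abs(ages[i] - ages[j]) / age_range
        cat_part = 0.0 if plans[i] == plans[j] else 1.0
        dist[i, j] = (num_part + cat_part) / 2

# Average-linkage clustering on the precomputed distance matrix.
Z = linkage(squareform(dist), method="average")
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```

The point of the sketch is that hierarchical clustering only needs a pairwise distance matrix, so any metric that handles categorical mismatches sensibly can be plugged in, which DBSCAN's density machinery does not accommodate as naturally.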
Extensions of DBSCAN
What are some common extensions of the DBSCAN algorithm? Here are some other clustering algorithms that can be viewed as extensions of DBSCAN.
- HDBSCAN. HDBSCAN is an extension of DBSCAN that combines aspects of DBSCAN and hierarchical clustering. HDBSCAN performs better when there are clusters of varying density in the data and is less sensitive to parameter choice.
- OPTICS. OPTICS is another extension of DBSCAN that performs better on datasets that have clusters of varying densities.
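To illustrate the varying-density point, here is a hedged sketch using scikit-learn's OPTICS implementation on two synthetic blobs of very different density. A single OPTICS run computes a reachability ordering, from which DBSCAN-style clusterings at several eps values can be extracted without re-running the algorithm; all parameter values are illustrative:

```python
import numpy as np
from sklearn.cluster import OPTICS, cluster_optics_dbscan

# Two blobs of very different density: no single DBSCAN eps suits both.
rng = np.random.default_rng(3)
dense = rng.normal(0, 0.2, size=(100, 2))
sparse = rng.normal(10, 2.5, size=(100, 2))
X = np.vstack([dense, sparse])

# One OPTICS run computes the reachability structure once...
opt = OPTICS(min_samples=10).fit(X)

# ...and DBSCAN-style labelings at different eps values are then
# extracted cheaply from that structure.
tight = cluster_optics_dbscan(reachability=opt.reachability_,
                              core_distances=opt.core_distances_,
                              ordering=opt.ordering_, eps=0.5)
loose = cluster_optics_dbscan(reachability=opt.reachability_,
                              core_distances=opt.core_distances_,
                              ordering=opt.ordering_, eps=3.0)

print(len(set(tight) - {-1}))  # at eps=0.5, only the dense blob is found
print(len(set(loose) - {-1}))  # at eps=3.0, both blobs are found
```

A plain DBSCAN run would have to commit to one of those eps values up front; OPTICS defers that choice until after the density structure has been computed.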
Related articles
- When to use hierarchical clustering
- When to use k-means clustering
- When to use Gaussian mixture model
- When to use spectral clustering
Are you trying to figure out which machine learning model is best for your next data science project? Check out our comprehensive guide on how to choose the right machine learning model.