Are you wondering whether you should use k-means clustering for your next data science project? Well then you are in the right place! In this article, we tell you everything you need to know to determine when to use k-means clustering.
This article starts out with a discussion of what types of problems k-means clustering can be used for. After that, we discuss some of the main advantages and disadvantages of k-means clustering that you should keep in mind when determining whether to use k-means clustering. Finally, we provide examples of scenarios where you should and should not use k-means clustering.
Data for k-means clustering
What kind of data can k-means clustering be used for? K-means clustering is generally used when you do not have a specific outcome variable that you are trying to predict. Instead, it is used when you have a set of features you want to use to find collections of observations that share similar characteristics.
Specifically, k-means is intended to be used when all of your features are numeric. There are ways to adapt your data if you have a few categorical features, but in general the majority of your features should be numeric. If all of your features are categorical, or you have a mixture of numeric and categorical features, there are extensions of the k-means algorithm that are designed to handle these situations. These extensions are discussed at the end of this article.
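For example, here is a minimal sketch of fitting k-means to a purely numeric feature matrix with scikit-learn (the data here is randomly generated purely for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical numeric feature matrix: 200 observations, 3 features.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))

# Fit k-means with 4 clusters; n_init runs the algorithm from several
# random starting points and keeps the best result.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)

print(kmeans.labels_[:10])      # cluster assignment for each observation
print(kmeans.cluster_centers_)  # coordinates of the 4 cluster centers
```

Notice that there is no outcome variable anywhere in this snippet. The only inputs are the features themselves, and the only output is a cluster assignment for each observation.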
Advantages and disadvantages of k-means clustering
What are the main advantages and disadvantages of k-means clustering? Here are the main advantages and disadvantages you should keep in mind when deciding whether to use k-means clustering.
Advantages of k-means clustering
- Many common implementations. One of the main advantages of k-means clustering is that it has many common implementations across a variety of different machine learning libraries. No matter what language or library you are using to implement your clustering model, k-means is the most likely clustering model to be available. In some cases, k-means clustering may even be the only option that is available.
- Popular and well-studied. The reason that k-means clustering has so many implementations across a variety of languages and libraries is that it is probably the most popular and well-studied clustering algorithm out there. This popularity confers some benefits of its own, as it makes it easier for other contributors to jump in to assist or even take over an ongoing project. If the model is going to be used to score data repeatedly, using a well-studied algorithm will also reduce the burden of maintenance.
- Comparatively fast. While clustering algorithms as a group are known to be relatively slow, the k-means algorithm is comparatively fast. K-means is an iterative algorithm that involves calculating the distance between each point in your data and the center of each cluster. Unlike many other clustering algorithms, it does not require you to calculate the pairwise distance between every pair of points in your dataset, so its performance scales linearly with the number of data points. A minimal sketch of one iteration follows this list.
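To make that cost structure concrete, here is a bare-bones NumPy sketch of a single k-means iteration. It is illustrative only (production implementations also handle initialization, empty clusters, and convergence checks), but it shows that each pass computes distances from every point to each of the k centers rather than a full pairwise distance matrix between points:

```python
import numpy as np

def kmeans_step(X, centers):
    """One assignment-and-update step of k-means."""
    # Distance from every point to every center: shape (n_points, k).
    # This is O(n * k) distances, not the O(n^2) pairwise distances
    # between points that many other clustering algorithms require.
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)

    # Assignment step: each point joins its nearest center.
    labels = dists.argmin(axis=1)

    # Update step: each center moves to the mean of its assigned points
    # (keeping the old center if a cluster ends up empty).
    new_centers = np.array([
        X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
        for j in range(len(centers))
    ])
    return labels, new_centers
```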
Disadvantages of k-means clustering
- Assumes spherical clusters. One of the main disadvantages of k-means clustering is that it implicitly assumes all clusters are roughly spherical and similar in size, because it assigns each point to the nearest cluster center. This means that k-means clustering does not perform as well in situations where clusters naturally have irregular shapes. This is a relatively strict assumption that is not made by all clustering algorithms.
- Sensitive to scale. Since k-means clustering works by calculating the distance between your data points and the centers of your clusters, it can be thrown off when your variables have different scales. If one of your variables is on a much larger scale than the others, for example, that variable will have an outsized effect on the calculated distances. This means that you generally need to re-scale your data before using k-means clustering, as shown in the sketch after this list.
- Difficult to incorporate categorical variables. As is common with many clustering algorithms, k-means is intended for situations where all of your features are numeric. As such, it does not perform as well in cases where you need to incorporate categorical features in your dataset.
- Sensitive to outliers. Unlike some other clustering algorithms that are able to identify and exclude outliers, k-means clustering includes every data point in a cluster. That means that the algorithm is somewhat sensitive to large outliers.
- Sensitive to choice of seed. K-means clustering is relatively sensitive to the starting conditions used to initialize the algorithm, such as the choice of seed or the order of the data points. This means that you may not get the same results if the initialization conditions change. Many implementations mitigate this by running the algorithm several times from different starting points and keeping the best result.
- Have to choose the number of clusters. Like many other clustering algorithms, k-means clustering requires you to specify the number of clusters ahead of time. This can be difficult in cases where the true number of clusters is unknown, although heuristics such as the elbow method or silhouette analysis can help.
- Struggles with high dimensional data. Like many other clustering algorithms, k-means clustering starts to struggle when many features are included in the model. If you have many potential features, you should consider applying feature selection or dimensionality reduction algorithms to your data before creating your clusters.
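Several of these disadvantages have standard mitigations. Here is a sketch, using scikit-learn on synthetic data, that standardizes the features before clustering and then scans candidate values of k with the silhouette score (higher is better); a fixed random_state and multiple restarts also reduce sensitivity to initialization:

```python
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data with one feature on a much larger scale than the others.
X, _ = make_blobs(n_samples=500, centers=4, n_features=3, random_state=0)
X[:, 0] *= 100

# Re-scale so every feature contributes comparably to the distances.
X_scaled = StandardScaler().fit_transform(X)

# Scan a range of cluster counts and compare silhouette scores.
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_scaled)
    print(k, round(silhouette_score(X_scaled, labels), 3))
```

If you have many features, the same idea extends naturally: insert a dimensionality reduction step such as PCA between the scaling step and the clustering step.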
When to use k-means clustering
When should you use k-means clustering in place of another type of clustering algorithm? Here are some examples of cases where you should consider using k-means clustering.
- Project will be touched by multiple contributors. If you are working on a project that you expect to be touched by many data scientists over the course of its life, you may be better off using k-means clustering. If you use a well known algorithm, it will be easier for new contributors to jump in and help.
- Hesitant coworkers. If you are running an analysis for coworkers who are skeptical of machine learning algorithms, you may be better off using a simple, well-known algorithm like k-means. There are many beginner-friendly resources available to explain the k-means algorithm to non-technical audiences, which can help you build trust in your model.
- Large datasets. If you are working with a large dataset with many observations, you may be better off using k-means clustering than other clustering algorithms. K-means clustering is relatively fast compared to some other clustering algorithms.
When not to use k-means clustering
When should you avoid using k-means clustering? Here are some examples of situations where you should avoid using k-means clustering.
- Want to identify most similar observations. When you use k-means clustering to examine a given observation, the main output of the algorithm is the name of the cluster that observation falls in. No additional information is given about which other observations are most similar to it. If you are interested in identifying the observations that are most similar to a given observation, you are better off using an algorithm like hierarchical clustering that provides more granular information about which observations are most similar to one another.
- Irregularly shaped clusters. If you have reason to expect that your data has irregularly shaped or sized clusters, you should avoid using k-means clustering. If it is reasonable to assume the clusters will be ellipsoidal, you can use Gaussian mixture models instead. If you do not want to make any assumptions about the shapes of your clusters, you can use a more flexible algorithm like DBSCAN, as the sketch after this list illustrates.
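To see the irregular-shape failure mode in action, here is a sketch on scikit-learn's classic two-moons data, where k-means cuts straight across the crescents while DBSCAN recovers them (the eps value here is illustrative and generally needs tuning for your own data):

```python
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans, DBSCAN
from sklearn.metrics import adjusted_rand_score

# Two interleaved crescent-shaped clusters.
X, y_true = make_moons(n_samples=400, noise=0.05, random_state=0)

km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
db_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)

# Agreement with the true crescents (1.0 is a perfect match).
print("k-means:", adjusted_rand_score(y_true, km_labels))
print("DBSCAN: ", adjusted_rand_score(y_true, db_labels))
```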
Extensions of k-means clustering
There are a few common extensions of the k-means algorithm that can be used when your dataset has characteristics that make it difficult to use k-means. Here are some of the most common extensions of the k-means algorithm.
- K-medians. K-medians is an extension of k-means that is less sensitive to outliers than k-means. It is a great option to reach for in cases where your data has many outliers, but it does constrain you to using specific distance metrics.
- K-medoids. K-medoids is a more flexible extension of k-means that is also robust to outliers. Unlike k-medians, k-medoids can be used with a wide variety of distance metrics. It can even be used with distance metrics that are not appropriate for k-means.
- K-modes. K-modes is an extension to k-means that can be used when you have categorical features rather than numeric features. Specifically, k-modes should be used when all of your data is categorical.
- K-prototypes. K-prototypes is an extension of k-means that can be used when you have a mixture of numeric and categorical features.
- Mini batch k-means. Mini batch k-means is an extension of k-means that performs better on very large datasets. The algorithm is generally faster and requires less memory than standard k-means, which is already one of the faster clustering algorithms; a sketch follows this list.
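As one example of these extensions in practice, here is a sketch of mini batch k-means in scikit-learn, which updates the cluster centers from small random batches of the data instead of the full dataset on each iteration (the dataset and batch size here are illustrative):

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import MiniBatchKMeans

# A larger synthetic dataset where full k-means would start to slow down.
X, _ = make_blobs(n_samples=100_000, centers=8, random_state=0)

# Each iteration fits on a random mini batch, which cuts both runtime
# and memory usage compared to standard k-means.
mbk = MiniBatchKMeans(n_clusters=8, batch_size=2048, random_state=0).fit(X)

print(mbk.cluster_centers_.shape)  # (8, n_features)
```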
Related articles
- When to use hierarchical clustering
- When to use DBSCAN
- When to use Gaussian mixture models
- When to use spectral clustering
Are you trying to figure out which machine learning model is best for your next data science project? Check out our comprehensive guide on how to choose the right machine learning model.