Are you wondering what active learning is? Or maybe you want to know more about when you should use active learning? Well, either way, you are in the right place! In this article, we tell you everything you need to know to understand when to use active learning.
We start by discussing what active learning is and how it is implemented. After that, we discuss what type of data is required in order to use active learning. Next, we discuss some of the main advantages and disadvantages of active learning. These sections provide the context for the final sections on when active learning should and should not be used.
What is active learning?
Active learning is a family of techniques that falls under the umbrella of weakly supervised learning. Weakly supervised learning is a large family of techniques that can be applied when you have labeled data that is imperfect, such as when the labels in your data are noisy or when not all of your data is labeled. Active learning is specifically a family of weakly supervised learning techniques that can be used when you have incomplete labels, meaning that not all of your data is labeled.
Semi-supervised learning is another family of techniques that can be used for incomplete labels. The main thing that differentiates semi-supervised learning from active learning is whether a human must give iterative feedback throughout the modeling process. With semi-supervised learning, no incremental feedback from a human is needed. With active learning, someone must give incremental feedback by providing additional labels for some of the unlabeled data throughout the model training process.
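To make the iterative feedback loop concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the data is one-dimensional, the "model" is a simple threshold placed halfway between the two classes, and the hypothetical ask_human function stands in for a real annotator (its true boundary of 6.2 is invented). A real project would use an actual model and an actual labeling workflow.

```python
def fit_threshold(labeled):
    """Fit a toy 1-D classifier: the midpoint between the two classes."""
    negatives = [x for x, y in labeled if y == 0]
    positives = [x for x, y in labeled if y == 1]
    return (max(negatives) + min(positives)) / 2


def ask_human(x):
    """Hypothetical stand-in for a human annotator (true boundary at 6.2)."""
    return int(x >= 6.2)


def active_learning_loop(labeled, unlabeled, rounds):
    """Repeatedly train, pick one record to label, and ask the human."""
    labeled, unlabeled = list(labeled), list(unlabeled)
    for _ in range(rounds):
        if not unlabeled:
            break
        threshold = fit_threshold(labeled)
        # Query the record the model is least sure about:
        # the unlabeled point closest to the current decision boundary.
        query = min(unlabeled, key=lambda x: abs(x - threshold))
        unlabeled.remove(query)
        labeled.append((query, ask_human(query)))
    return fit_threshold(labeled), labeled


# A small labeled seed set plus a larger unlabeled pool.
seed = [(0.0, 0), (10.0, 1)]
pool = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]

threshold, labeled = active_learning_loop(seed, pool, rounds=3)
print(threshold, len(labeled))
```

In this toy setup, three human labels are enough to narrow the boundary down to the interval between 6.0 and 7.0, whereas labeling the whole pool would have taken nine. A semi-supervised approach would skip the ask_human step entirely and work only with the two seed labels.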
Types of active learning techniques
In this section, we will provide some examples of common types of active learning techniques. These examples give concrete details about how active learning techniques are applied. The main difference between these techniques is the strategy used to determine when to ask a human for feedback (in the form of a label for a selected data point).
- Stream-based selective sampling. Stream-based sampling is a family of techniques that look at unlabeled records one at a time and decide whether to request a label for each one. These techniques do not consider the broader distribution of unlabeled records. Instead, they base each decision on how confident the model is about the label for the single data point in front of it.
- Pool-based sampling. With pool-based sampling techniques, the whole pool of unlabeled records is considered before a decision is made. The model aims to rank the informativeness of each record to select the best records to send back for human feedback.
- Membership query synthesis. Query synthesis techniques take a different approach from the previous two. Rather than selecting existing records from the dataset and sending them to a human for feedback, they apply generative techniques to create synthetic records that are representative of a large batch of real data. These synthetic examples are then labeled by a human and used to inform future labeling decisions.
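The three strategies above can be sketched side by side. This is a simplified illustration, not a reference implementation: it assumes the model already outputs class probabilities for each record, it uses entropy as the measure of uncertainty (one common choice among several), and the function names, the 0.65 threshold, and the example probabilities are all hypothetical. The synthesize_query function uses simple midpoint interpolation as a stand-in for the generative techniques that real query synthesis methods use.

```python
import math


def entropy(probs):
    """Shannon entropy of a predicted class distribution (higher = less certain)."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def stream_based_query(probs, threshold=0.65):
    """Stream-based: look at ONE record and decide whether to request a label."""
    return entropy(probs) > threshold


def pool_based_query(pool_probs, batch_size=2):
    """Pool-based: rank the WHOLE pool and return the most uncertain indices."""
    ranked = sorted(
        range(len(pool_probs)),
        key=lambda i: entropy(pool_probs[i]),
        reverse=True,
    )
    return ranked[:batch_size]


def synthesize_query(positive_example, negative_example):
    """Query synthesis stand-in: build a NEW record between two known classes."""
    return [(a + b) / 2 for a, b in zip(positive_example, negative_example)]


# Hypothetical predicted probabilities for four unlabeled records.
pool_probs = [
    [0.98, 0.02],  # model is confident
    [0.55, 0.45],  # model is unsure
    [0.70, 0.30],  # model is somewhat sure
    [0.50, 0.50],  # model has no idea
]

stream_decisions = [stream_based_query(p) for p in pool_probs]
pool_selection = pool_based_query(pool_probs, batch_size=2)
synthetic_record = synthesize_query([0.0, 2.0], [2.0, 4.0])

print(stream_decisions)  # [False, True, False, True]
print(pool_selection)    # [3, 1]
print(synthetic_record)  # [1.0, 3.0]
```

Note the difference in what each function receives. The stream-based function sees a single record and needs a fixed threshold, while the pool-based function sees every record and only needs a labeling budget.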
What data is needed for active learning?
Active learning techniques are fairly complex and require a few different types of data. Specifically, active learning requires an initial batch of data to start the modeling process with, as well as incremental pieces of data that are added throughout the modeling process. The initial batch should contain a small subset of data with high quality labels and a larger set of data that is not labeled at all. The incremental pieces of data should consist of new labels that are manually added to some of the unlabeled data.
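One way to picture these data requirements is as a small container that tracks which records have trusted labels and which do not. The sketch below is only an illustration; the ActiveLearningData class, its field names, and the example record ids are hypothetical and do not come from any particular library.

```python
from dataclasses import dataclass, field


@dataclass
class ActiveLearningData:
    """Holds the initial batch: a small labeled set and a larger unlabeled pool."""

    labeled: dict = field(default_factory=dict)  # record id -> trusted label
    unlabeled: set = field(default_factory=set)  # record ids with no label yet

    def add_label(self, record_id, label):
        """Incremental data: a human labels one record from the unlabeled pool."""
        self.unlabeled.remove(record_id)
        self.labeled[record_id] = label

    def labeled_fraction(self):
        """Share of all records that currently have a label."""
        total = len(self.labeled) + len(self.unlabeled)
        return len(self.labeled) / total if total else 0.0


# Initial batch: two labeled records and six unlabeled ones.
data = ActiveLearningData(
    labeled={"a": 1, "b": 0},
    unlabeled={"c", "d", "e", "f", "g", "h"},
)
fraction_before = data.labeled_fraction()

# Later in the modeling process, a human supplies one more label.
data.add_label("c", 1)
fraction_after = data.labeled_fraction()

print(fraction_before, fraction_after)  # 0.25 0.375
```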
Advantages and disadvantages of active learning
What are some of the main advantages and disadvantages of active learning? We discuss both in this section.
Advantages of active learning
Here are some of the main advantages of active learning.
- Only requires a subset of data to be labeled. One of the main advantages of active learning is that only a subset of your training data needs to be labeled. While this can provide benefits in many situations, it is most beneficial when it is not feasible to label all of the data you want to use to train your model.
- Cost efficient. It often takes a lot of time and resources to manually label data in order to train a model. If you are in a situation where generating labels for your training data is costly, then reducing the amount of data that needs to be labeled via active learning can provide some nice cost benefits.
Disadvantages of active learning
Here are some of the main disadvantages of active learning.
- Does require some labeled data. One example of a disadvantage of active learning is that it does require some data to be labeled. If labeling data is very costly and resource intensive, there will still be some expenses associated with generating labels for data.
- Requires iterative human input. Another disadvantage of active learning is that it requires a human to give iterative feedback and provide additional labels for examples in your dataset throughout the learning process. This can be a large blocker if you need to retain external talent in order to generate labeled data because you may not be able to reach back out to them for additional labels later on.
- Lower accuracy than some other learning paradigms. Another disadvantage of active learning is that the resulting models generally have lower accuracy than they would if they were trained using another paradigm, such as supervised learning on a fully labeled dataset. If it is feasible to label all of your data and high accuracy is important, then it may make sense to label all of your data and use supervised learning instead.
- Not commonly understood. Active learning is a relatively niche field, which means that you cannot expect all of your teammates to understand what active learning is or how it should be used. That means that it will take longer to onboard teammates onto your project. It may also mean that it will be more difficult to find someone who can give you meaningful feedback on your project.
When to use active learning
When should you use active learning in your machine learning project? In this section, we will provide examples of situations where it makes sense to use active learning.
- When getting a small amount of labeled data is easy, but scaling up labeling efforts is difficult. In general, it makes sense to use paradigms like active learning or semi-supervised learning when getting a small amount of labeled data is easy, but scaling up data labeling efforts to cover all of your data is difficult. Paradigms like active learning allow you to take advantage of all of the data that is available to you even if that data is not completely labeled.
- When getting iterative feedback from humans is easy. As we stated before, the main difference between semi-supervised learning techniques and active learning techniques is that active learning techniques require iterative feedback from a human throughout the modeling process. That means that active learning should be used when it is easy to get additional human feedback throughout the modeling process. If this is possible, then models trained with active learning generally perform better than models trained with semi-supervised learning.
When not to use active learning
When should you avoid using active learning in your machine learning project? In this section, we will provide examples of situations where it does not make sense to use active learning.
- When getting any amount of labeled data is difficult. When you are using active learning techniques, it is important that you have a sample of labeled data with high quality labels that you trust. If it is difficult to get any labels for your data, you may be better off opting for techniques like self-supervised learning.
- When there is not a lot of unlabeled data available. Active learning techniques are intended to be used in situations where there is a lot of unlabeled data available, but only a small amount of labeled data available. If you are not in a situation where there is a lot of unlabeled data available, you are generally best off labeling all of your data and using supervised learning.
- When labeled data is easy to obtain. If you are in a situation where labeled data is easy to obtain, then there is no need to use active learning. In these situations, you are better off labeling all of your data and using supervised learning.