Are you wondering when you should use student teacher networks for a machine learning project? Or maybe you want to hear more about the advantages and disadvantages of using student teacher networks? Well either way, you are in the right place! In this article, we tell you everything you need to know to understand when you should and should not use student teacher networks.
We start with some high level context on what student teacher networks are and what types of situations they are designed for. Next, we discuss the type of data that is required to train a student teacher network. Then we cover the main advantages and disadvantages of student teacher networks, which provides useful context for the final sections on when to use student teacher networks and when to avoid them.
What are student teacher networks?
What are student teacher networks? When you use student teacher networks, there are two different neural networks involved. First, you have a large network with many parameters that performs the task very well. This is your teacher network. Second, you have a much smaller network with fewer parameters that, trained on its own, would not perform the task as well because of its limited size. This is your student network. The idea is that you use the large teacher network to teach the smaller student network to complete the task better than it could learn to on its own. In essence, you transfer the knowledge captured by the teacher network into the student network. This process is typically called knowledge distillation.
And how do student teacher networks work? How does the student network effectively learn from the teacher network? Instead of learning only from the ground-truth labels, the student network is trained to replicate the behavior of the teacher network. In the most common form of knowledge distillation, the student matches the teacher's full output distribution over the possible answers (its soft predictions), which carries more information than a single hard label. In feature-based variants, the student goes further and tries to replicate the internal thought process of the teacher network by matching the output of the teacher network at intermediate layers as well, rather than only the output of the final layer that provides the final answer.
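The idea described above can be sketched in plain Python. This is a minimal illustration, not a definitive implementation: the function names, the `temperature` and `alpha` parameters, and the choice of a KL divergence term on softened outputs plus a mean squared error on layer outputs are assumptions drawn from common knowledge distillation practice rather than details given in this article. The sketch also assumes that each pair of student and teacher layers being compared has the same width, which in practice usually requires an extra projection because the student is smaller.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; a higher temperature softens them."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) between two discrete distributions over the same classes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def output_distillation_loss(student_logits, teacher_logits, label,
                             temperature=2.0, alpha=0.5):
    """Blend 'match the teacher' with 'match the true label'.

    temperature and alpha are illustrative hyperparameters (assumptions).
    """
    soft_teacher = softmax(teacher_logits, temperature)
    soft_student = softmax(student_logits, temperature)
    # Scaling by temperature**2 is a common convention that keeps the soft
    # term comparable in size across different temperatures.
    soft_term = kl_divergence(soft_teacher, soft_student) * temperature ** 2
    # Ordinary cross-entropy against the true label, at temperature 1.
    hard_term = -math.log(softmax(student_logits)[label])
    return alpha * soft_term + (1 - alpha) * hard_term


def layer_matching_loss(student_layers, teacher_layers):
    """Mean squared error between paired layer outputs, averaged over layers.

    Assumes each paired layer has the same width (a simplification).
    """
    per_layer = []
    for s_out, t_out in zip(student_layers, teacher_layers):
        squared = [(s - t) ** 2 for s, t in zip(s_out, t_out)]
        per_layer.append(sum(squared) / len(squared))
    return sum(per_layer) / len(per_layer)


# Hypothetical logits for one 3-class example whose true label is class 0.
teacher_logits = [4.0, 1.0, 0.5]
close_student = [3.5, 1.2, 0.4]
far_student = [0.2, 3.0, 1.0]

print(output_distillation_loss(close_student, teacher_logits, label=0))
print(output_distillation_loss(far_student, teacher_logits, label=0))
```

With these made-up numbers, the student whose logits are close to the teacher's receives the lower loss. In a real project this loss would be computed inside a deep learning framework and minimized with gradient descent while the teacher's weights stay fixed.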
What type of data is needed for student teacher networks?
What type of data is needed in order to train student teacher networks? Student teacher networks are generally used when you have a supervised task that you want to teach the student network to complete. They can be used with standard supervised learning as well as with other learning paradigms that create supervised tasks from unlabeled data, such as self supervised learning. In the end though, there is generally a label that the model is trying to predict.
The exact type of data that you will need is going to depend on the type of learning paradigm you are operating within. If you are using a standard supervised learning paradigm where your data comes with labels, then you will need labeled data that can be used to train both your student network and your teacher network. If you are operating in a paradigm like self supervised learning, then you only need to have unlabeled data that has some natural structure that can be exploited to convert it into labeled data after the fact.
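To make "natural structure that can be exploited" concrete, here is a small sketch of one such structure: the order of words in text. It turns an unlabeled sentence into (context, label) pairs by treating each word as the label for the words that precede it. Next-word prediction is just one example chosen for illustration, and the function name `make_next_token_examples` and the `context_size` parameter are hypothetical.

```python
def make_next_token_examples(tokens, context_size=3):
    """Turn an unlabeled token sequence into (context, label) pairs.

    Each label is simply the token that follows its context window,
    so no human annotation is needed.
    """
    examples = []
    for i in range(context_size, len(tokens)):
        context = tuple(tokens[i - context_size:i])
        label = tokens[i]
        examples.append((context, label))
    return examples


# A made-up unlabeled sentence, split into words.
sentence = "the student learns from the teacher".split()
pairs = make_next_token_examples(sentence, context_size=2)

for context, label in pairs:
    print(context, "->", label)
```

Pairs built this way can be used to train the teacher network, and then the student network, in the same way as data that arrived with labels.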
The amount of data that is required may also depend on the learning paradigm that is used to train the initial teacher model. If you are using a paradigm like transfer learning or semi supervised learning that does not require a large amount of labeled data to train your teacher network, then you will need less labeled data. If you are training your teacher network from scratch (which is more standard), then you will need a large amount of data that can be used to train your teacher network.
Advantages and disadvantages of student teacher networks
What are the main advantages and disadvantages of using student teacher networks? We cover each in turn below.
Advantages of student teacher networks
Here are some of the main advantages of using student teacher networks.
- Fewer compute resources required for inference. When you use student teacher networks, the smaller student network is the one that is ultimately used for inference, so you only need compute resources that can run the student network. The teacher network often requires far more compute, so this can be a large advantage if you have limited resources available for model serving or are deploying to something like embedded devices where the available compute is constrained.
- Cheaper inference. If you are running a smaller model on smaller compute resources in order to serve your model, it will also be less expensive to run inference. You only have to pay for compute resources that are appropriately sized for the student network.
- Faster inference. If you are using a smaller model for inference, your model will likely be able to return predictions more quickly, because fewer computations need to run before it returns an answer.
- Increased performance compared to a standard student-sized model. When you use student teacher networks, the student model is generally able to achieve better predictive performance than it would if you trained the same model directly on the dataset without trying to replicate the inner thought process of the teacher network.
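The size gap behind the inference advantages above can be estimated with a quick parameter count. The sketch below assumes simple fully connected networks, and the layer widths are hypothetical numbers picked for illustration. Parameter count is only a rough proxy for memory and compute at inference time, so treat the ratio as indicative rather than as a measured speedup.

```python
def dense_param_count(layer_sizes):
    """Count weights plus biases in a fully connected network.

    layer_sizes lists the width of every layer, input first and output last.
    """
    pairs = zip(layer_sizes, layer_sizes[1:])
    return sum(n_in * n_out + n_out for n_in, n_out in pairs)


# Hypothetical architectures: a wide teacher and a narrow student.
teacher_layers = [784, 2048, 2048, 10]
student_layers = [784, 128, 10]

teacher_params = dense_param_count(teacher_layers)
student_params = dense_param_count(student_layers)

# Assuming 4 bytes per parameter (32-bit floats).
bytes_per_param = 4
print(f"teacher parameters: {teacher_params:,}")
print(f"student parameters: {student_params:,}")
print(f"teacher size in MB: {teacher_params * bytes_per_param / 1e6:.1f}")
print(f"student size in MB: {student_params * bytes_per_param / 1e6:.1f}")
print(f"size ratio: {teacher_params / student_params:.1f}x")
```

A check like this is a cheap first step when deciding whether a student network is small enough for the hardware you plan to serve it on.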
Disadvantages of student teacher networks
Here are some of the main disadvantages of student teacher networks.
- May require a lot of labeled training data. If you train your teacher model with a standard supervised training paradigm, you will need a lot of labeled training data available to train it. This can be a large disadvantage if it is costly to obtain labeled data.
- Large compute resources required for training. While you do not need large compute resources to run inference with your student network, you do need large compute resources to train your teacher network. This can be prohibitive if you are working in an environment where sufficiently large compute resources are not available.
- More money spent training models. Training a large teacher model on large compute resources with a lot of data is expensive, and training two different models rather than one adds to the cost.
- More time spent training models. A large teacher model has many parameters that need to be trained, so it takes longer to train. This is exacerbated by the fact that two different models need to be trained.
- Decreased performance compared to a standard teacher-sized model. While your student model will generally perform better than a similar model that was trained only on the data (and not on the inner thought process of a teacher network), it may not perform as well as the larger teacher model.
When to use student teacher networks
When should you use student teacher networks for your machine learning projects? In this section, we will discuss situations where it makes sense to use student teacher networks.
- When you have a good model, but you are running into operational constraints. The main situation where it makes sense to use student teacher networks is when you have a good model, but you are running into operational constraints that prevent you from deploying that model. For example, the model may be too slow, too expensive, or require compute resources that are excessively large. In this situation, you can use the large teacher network to train a smaller student network that avoids these issues.
When not to use student teacher networks
When should you avoid using student teacher networks for your machine learning project? Here are some examples of situations where it may not make sense to use student teacher networks.
- When you have a good model and resources are plentiful. If you have a large model that is working well for you and you are not running into any operational constraints, such as the model being excessively expensive to serve, then it likely does not make sense to use student teacher networks. It can be tricky to get these networks to work well, and it may not be worth the investment to create a smaller network if the large network is working for you.