What is knowledge distillation?

Knowledge distillation, also known as model distillation, is a machine learning technique for transferring knowledge from a large model (the "teacher") to a smaller one (the "student"). The goal is a compact model that approaches the teacher's accuracy at a fraction of the computational cost, making it suitable for deployment on less powerful hardware, such as mobile devices.

The student is trained on a transfer set, which may be the teacher's original training data or a separate (even unlabeled) dataset. The loss function is typically the cross-entropy between the student's output and the output the teacher produces on the same example, with both softmax distributions computed at a high temperature so that the teacher's "soft targets" convey more information than hard labels alone. Distillation techniques differ in how knowledge is transferred across the teacher-student network, but in each case the student learns to mimic the teacher, sometimes matching or even exceeding its accuracy. Once the large deep neural network has been compressed this way, the student can be deployed on low-grade hardware to run real-world inference.

Knowledge distillation has been applied successfully across machine learning, including object detection, acoustic models, natural language processing, and graph neural networks for non-grid data. It is particularly popular in natural language processing (NLP) for producing fast, lightweight models that are cheaper to train and serve; other NLP use cases include neural machine translation and text generation.
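The temperature-scaled loss described above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: the function names, the temperature `T=4.0`, and the mixing weight `alpha` are illustrative choices, and real training code would use an autodiff framework such as PyTorch.

```python
import numpy as np

def softmax(logits, T=1.0):
    # Temperature-scaled softmax: higher T yields a softer distribution.
    z = logits / T
    z = z - z.max()  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, hard_label,
                      T=4.0, alpha=0.5):
    # Soft targets: both teacher and student evaluated at high temperature.
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    # Cross-entropy between teacher soft targets and student predictions.
    soft_loss = -np.sum(p_teacher * np.log(p_student + 1e-12))
    # Standard cross-entropy against the true label at T = 1.
    q = softmax(student_logits, 1.0)
    hard_loss = -np.log(q[hard_label] + 1e-12)
    # Soft-target gradients scale as 1/T^2, so the soft term is
    # multiplied by T^2 to keep the two terms comparable.
    return alpha * (T ** 2) * soft_loss + (1 - alpha) * hard_loss
```

A student whose logits already match the teacher's incurs the minimum possible soft loss, which is what drives the student to mimic the teacher's full output distribution rather than only its top prediction.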