AI Model Distillation Technique

From GM-RKB

An AI Model Distillation Technique is an AI model compression technique that transfers knowledge from a large, complex AI model (the teacher) to a smaller, more efficient model (the student) by training the student on both hard labels and the teacher's soft outputs (probability distributions).

  • Context:
    • It can (often) use the soft targets from the teacher model, which are full output probability distributions over the classes rather than hard (one-hot) labels, to train the student model.
    • It can (often) combine a distillation loss (the difference between the student and teacher soft outputs) with a classification loss (the difference between the student output and the actual labels), weighted against each other, to train the student model.
    • ...
    • It can range from being applied to relatively small models with millions of parameters to being applied to large-scale models, such as transformer-based architectures, that are distilled into smaller versions.
    • ...
    • It can be combined with Model Quantization Techniques.
    • It can be combined with Model Pruning Techniques.
    • ...
  • Example(s):
    • When applied to a Convolutional Neural Network to create a smaller version for mobile devices, retaining its high image classification accuracy.
    • When applied to an NLP Transformer Model to create a smaller student model while keeping the core functionality of text understanding intact.
    • ...
  • Counter-Example(s):
    • Model Pruning Techniques, which directly remove redundant parts of the model to reduce its size, rather than transferring knowledge.
  • See: Ensemble Averaging (Machine Learning), Statistical Model Validation, DistilBERT, Synthetically-Generated Text.
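
The balance of distillation loss and classification loss described in the Context above can be sketched as follows. This is a minimal illustration for a single training example, not a reference implementation: the function names, the `temperature` and `alpha` parameters, the use of KL divergence for the soft term, and the temperature-squared scaling are all assumptions chosen for illustration (they follow a commonly used convention, but the entry itself does not specify them).

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; a higher temperature gives a softer distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete distributions over the same classes."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def cross_entropy(probs, true_index, eps=1e-12):
    """Classification loss against a hard label given as a class index."""
    return -math.log(probs[true_index] + eps)

def distillation_loss(student_logits, teacher_logits, true_index,
                      temperature=2.0, alpha=0.5):
    """Weighted sum of a soft-target term and a hard-label term (hypothetical helper)."""
    # Soft targets: teacher and student distributions at the same temperature.
    soft_teacher = softmax(teacher_logits, temperature)
    soft_student = softmax(student_logits, temperature)
    # Assumed convention: scale the soft term by temperature squared.
    soft_term = (temperature ** 2) * kl_divergence(soft_teacher, soft_student)
    # Hard-label term uses the student's ordinary (temperature 1) predictions.
    hard_term = cross_entropy(softmax(student_logits), true_index)
    return alpha * soft_term + (1.0 - alpha) * hard_term

if __name__ == "__main__":
    teacher = [4.0, 1.5, 0.2]   # made-up teacher logits for three classes
    student = [2.0, 1.0, 0.5]   # made-up student logits
    print(distillation_loss(student, teacher, true_index=0))
```

Setting `alpha` closer to 1 makes the student follow the teacher's soft outputs more closely, while setting it closer to 0 makes training rely mainly on the actual labels.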


References

2024

  • (Wikipedia, 2024) ⇒ https://en.wikipedia.org/wiki/Knowledge_distillation Retrieved:2024-9-6.
    • In machine learning, knowledge distillation or model distillation is the process of transferring knowledge from a large model to a smaller one. While large models (such as very deep neural networks or ensembles of many models) have higher knowledge capacity than small models, this capacity might not be fully utilized. It can be just as computationally expensive to evaluate a model even if it utilizes little of its knowledge capacity. Knowledge distillation transfers knowledge from a large model to a smaller model without loss of validity. As smaller models are less expensive to evaluate, they can be deployed on less powerful hardware (such as a mobile device).

      Knowledge distillation has been successfully used in several applications of machine learning such as object detection, acoustic models, and natural language processing. Recently, it has also been introduced to graph neural networks applicable to non-grid data.


2019