DistilBERT Model
A DistilBERT Model is a transformer-based language model that is obtained by compressing BERT through knowledge distillation.
- Context:
- It can retain approximately 97% of BERT's performance on the General Language Understanding Evaluation (GLUE) benchmark, despite having 40% fewer parameters.
- It can perform comparably to BERT on downstream tasks such as IMDb sentiment classification and SQuAD v1.1 question answering, while significantly reducing model size and inference time.
- It can be trained with a triple loss that combines a language modeling loss, a distillation loss, and a cosine-distance loss during the pre-training phase (see the loss sketch after this list).
- It can be an efficient option for edge applications, as demonstrated by its substantially faster inference times on mobile devices compared to BERT-base (see the usage sketch further below).
- ...
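
The triple loss mentioned above can be made concrete with a short PyTorch sketch. The function name `distilbert_triple_loss`, the default temperature, and the equal loss weights `alpha`, `beta`, `gamma` are illustrative assumptions; Sanh et al. (2019) describe the three loss components, but the exact weighting and hyperparameters shown here are not taken from the paper.

```python
import torch
import torch.nn.functional as F

def distilbert_triple_loss(student_logits, teacher_logits,
                           student_hidden, teacher_hidden,
                           mlm_labels, temperature=2.0,
                           alpha=1.0, beta=1.0, gamma=1.0):
    # (1) Distillation loss: KL divergence between temperature-softened
    #     student and teacher output distributions.
    kd_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)

    # (2) Masked language modeling loss on the student's own predictions;
    #     positions labeled -100 are ignored, as in standard MLM training.
    mlm_loss = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        mlm_labels.view(-1),
        ignore_index=-100,
    )

    # (3) Cosine-distance loss aligning the directions of the student's
    #     and teacher's hidden-state vectors.
    flat_student = student_hidden.view(-1, student_hidden.size(-1))
    flat_teacher = teacher_hidden.view(-1, teacher_hidden.size(-1))
    cos_loss = F.cosine_embedding_loss(
        flat_student, flat_teacher,
        flat_student.new_ones(flat_student.size(0)),
    )

    # Weighted sum of the three terms (weights here are illustrative).
    return alpha * kd_loss + beta * mlm_loss + gamma * cos_loss

# Dummy shapes: batch of 2 sequences of length 8, vocab 30522, hidden 768.
s_logits = torch.randn(2, 8, 30522)
t_logits = torch.randn(2, 8, 30522)
s_hidden = torch.randn(2, 8, 768)
t_hidden = torch.randn(2, 8, 768)
labels = torch.randint(0, 30522, (2, 8))
loss = distilbert_triple_loss(s_logits, t_logits, s_hidden, t_hidden, labels)
```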
- Example(s):
- `distilbert-base-uncased`, the pre-trained checkpoint released with the Hugging Face Transformers library.
- ...
- Counter-Example(s):
- See: Transformer Architecture, BERT, Knowledge Distillation, GLUE Benchmark.
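
For the edge/inference use case, the following is a minimal sketch of running a distilled model with the Hugging Face Transformers library, assuming its publicly hosted `distilbert-base-uncased-finetuned-sst-2-english` sentiment checkpoint. It illustrates how a DistilBERT Model slots into a standard inference pipeline rather than reproducing the paper's benchmark setup.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Public DistilBERT checkpoint fine-tuned for binary sentiment classification.
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

# Tokenize a single input sentence and run a forward pass without gradients.
inputs = tokenizer("DistilBERT is small enough to run on a phone.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Map the highest-scoring class index back to its label name.
predicted_label = model.config.id2label[logits.argmax(dim=-1).item()]
print(predicted_label)  # e.g. "POSITIVE"
```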
References
2019
- (Sanh et al., 2019) ⇒ Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. (2019). “DistilBERT, a Distilled Version of BERT: Smaller, Faster, Cheaper and Lighter.” arXiv preprint arXiv:1910.01108