Transformer-based Vision Model
A Transformer-based Vision Model is a vision model that is a transformer-based model (i.e., one that processes visual input with transformer self-attention layers rather than a convolutional backbone).
- Counter-Example(s):
- a Convolutional Neural Network (CNN)-based Vision Model, such as a ResNet image classifier.
- See: Pre-Trained Visual Encoder, Language-Vision Multimodal Model.
References
2023
- GBard
- A Vision Transformer-based Model is a deep learning model that is inspired by the Transformer architecture, which was originally designed for natural language processing (NLP) tasks.
- Vision Transformers (ViTs) represent images as a sequence of patches, just as NLP models represent text as a sequence of tokens (see the minimal sketch after this list). This allows ViTs to learn long-range dependencies between different parts of an image, which is essential for many computer vision tasks, such as image classification and object detection.
- ViTs have achieved state-of-the-art results on a variety of computer vision benchmarks, and they are quickly becoming the preferred model architecture for many tasks.
- Here are some examples of Vision Transformer-based Models:
- DeiT (Data-efficient Image Transformer)
- ViLT (Vision and Language Transformer)
- Swin Transformer
- ViT (the original Vision Transformer)
- Pyramid Vision Transformer
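The patch-as-token idea described in the quoted passage can be illustrated with a minimal, self-contained sketch. The code below is not taken from the cited source; the class name, layer sizes, and hyperparameters are illustrative assumptions. It embeds non-overlapping image patches with a strided convolution, prepends a class token, and runs the resulting token sequence through PyTorch's built-in transformer encoder:

```python
# Minimal sketch of a transformer-based vision model (illustrative assumptions only).
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    """Patch-as-token image classifier; sizes chosen for illustration."""
    def __init__(self, image_size=224, patch_size=16, dim=192, depth=4, heads=3, num_classes=1000):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2            # e.g., 14 * 14 = 196 patches
        # Patch embedding: a strided convolution cuts the image into non-overlapping
        # patches and projects each patch to a dim-dimensional token.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))    # classification token
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=4 * dim, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, images):                        # images: (B, 3, H, W)
        x = self.patch_embed(images)                  # (B, dim, H/ps, W/ps)
        x = x.flatten(2).transpose(1, 2)              # (B, num_patches, dim) patch tokens
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed
        x = self.encoder(x)                           # self-attention across all patch tokens
        return self.head(x[:, 0])                     # classify from the class token

logits = TinyViT()(torch.randn(2, 3, 224, 224))       # logits.shape == (2, 1000)
```

Published models such as ViT, DeiT, and the Swin Transformer follow the same patch-embed-then-encode structure, but differ in details such as pre-norm transformer blocks, position-embedding schemes, attention windowing, and training recipes.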