Transformer-based Vision Model
Latest revision as of 12:30, 16 October 2023
A Transformer-based Vision Model is a vision model that is a transformer-based model.
- Counter-Example(s):
  - Transformer-based Language Model.
- See: Pre-Trained Visual Encoder, Language-Vision Multimodal Model.
References
2023
- GBard
- A Vision Transformer-based Model is a deep learning model that is inspired by the Transformer architecture, which was originally designed for natural language processing (NLP) tasks.
- Vision Transformers (ViTs) represent images as a sequence of patches, just like NLP models represent text as a sequence of words. This allows ViTs to learn long-range dependencies between different parts of an image, which is essential for many computer vision tasks, such as image classification and object detection.
- ViTs have achieved state-of-the-art results on a variety of computer vision benchmarks, and they have become a preferred model architecture for many vision tasks.
- Here are some examples of Vision Transformer-based Models:
- DeiT (Data-efficient Image Transformer)
- ViLT (Vision and Language Transformer)
- Swin Transformer
- ConvMixer
- Pyramid Vision Transformer
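The patch-sequence idea described above can be sketched in a few lines. This is a minimal illustration, not taken from the source: it splits an image into non-overlapping patches and flattens each one into a vector, which is how a ViT turns an image into a token sequence before linear projection and positional embedding. The `patchify` function name and the 224×224 image with 16×16 patches (the standard ViT-Base input setup) are illustrative assumptions.

```python
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W, C) image into a sequence of flattened patches,
    the way a Vision Transformer tokenizes its input.
    Returns an array of shape (num_patches, patch_size * patch_size * C)."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0, "image must tile evenly"
    rows, cols = h // patch_size, w // patch_size
    # Carve the image into a (rows, cols) grid of patches...
    patches = image.reshape(rows, patch_size, cols, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)  # (rows, cols, p, p, c)
    # ...then flatten each patch into one token vector.
    return patches.reshape(rows * cols, patch_size * patch_size * c)

# A 224x224 RGB image with 16x16 patches yields 14*14 = 196 tokens,
# each of dimension 16*16*3 = 768 (illustrative example values).
image = np.zeros((224, 224, 3))
tokens = patchify(image, 16)
print(tokens.shape)  # (196, 768)
```

In a full ViT, each of these 768-dimensional patch vectors would then be linearly projected to the model dimension, prepended with a class token, and fed through standard Transformer encoder layers with self-attention over the patch sequence.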