KOSMOS-1 Architecture
A KOSMOS-1 Architecture is a decoder-only architecture that processes both text and visual tokens, following the approach introduced by the Kosmos-1 model (Huang et al., 2023).
- Context:
- It can (typically) be used to build performant Multimodal Large Language Models (MLLMs).
- It can (often) serve as a base for integrating visual capabilities into Large Language Models, enhancing their ability to understand and generate content that combines language and imagery.
- It can leverage Transformer Models to process a single input sequence that interleaves textual and visual tokens (see the sketch after this list).
- ...
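The following minimal sketch (not the released KOSMOS-1 code) illustrates the pattern described in the context items above, under the assumption of a simple linear connector: features from an image encoder are projected into the language model's embedding space, concatenated with the text token embeddings into one sequence, and processed by a Transformer stack under a causal mask so the model predicts the next token autoregressively. All class names, dimensions, and the toy usage at the end are illustrative assumptions.

```python
import torch
import torch.nn as nn


class ToyKosmosStyleDecoder(nn.Module):
    """Illustrative decoder-only LM over interleaved visual and text tokens."""

    def __init__(self, vocab_size=32000, d_model=512, n_heads=8,
                 n_layers=4, vision_dim=768, max_len=2048):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, d_model)
        self.pos_embed = nn.Embedding(max_len, d_model)
        # Connector: projects image-encoder features into the LM embedding space.
        self.vision_proj = nn.Linear(vision_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                           batch_first=True)
        # With a causal attention mask, this stack behaves as a decoder-only LM.
        self.blocks = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, text_ids, image_feats):
        # text_ids: (B, T_txt) token ids; image_feats: (B, T_img, vision_dim)
        # patch features assumed to come from a separate image encoder (e.g., a ViT).
        txt = self.text_embed(text_ids)            # (B, T_txt, d_model)
        img = self.vision_proj(image_feats)        # (B, T_img, d_model)
        x = torch.cat([img, txt], dim=1)           # one combined token sequence
        pos = torch.arange(x.size(1), device=x.device)
        x = x + self.pos_embed(pos)
        # Causal mask: each position attends only to itself and earlier positions.
        causal = torch.triu(torch.full((x.size(1), x.size(1)), float("-inf"),
                                       device=x.device), diagonal=1)
        h = self.blocks(x, mask=causal)
        return self.lm_head(h)                     # next-token logits


# Usage: 49 visual tokens (one image) followed by a 16-token text prompt.
model = ToyKosmosStyleDecoder()
logits = model(torch.randint(0, 32000, (1, 16)), torch.randn(1, 49, 768))
print(logits.shape)  # torch.Size([1, 65, 32000])
```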
- Example(s):
- the original Kosmos-1 model (Huang et al., 2023), which integrates text and image understanding in a unified framework.
- the MM1 models (McKinzie et al., 2024), which build on a decoder-only architecture akin to Kosmos-1.
- ...
- Counter-Example(s):
- A Bidirectional Encoder Representations from Transformers (BERT) model, which exclusively processes textual data.
- A Vision Transformer (ViT) model, designed solely for image classification tasks.
- See: Large Language Model, Visual Tokens, Transformer Architecture.
References
2024
- (McKinzie et al., 2024) ⇒ Brandon McKinzie, Zhe Gan, Jean-Philippe Fauconnier, Sam Dodge, Bowen Zhang, Philipp Dufter, Dhruti Shah et al. (2024). “MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training.” arXiv preprint arXiv:2403.09611
- QUOTE: "The type of MLLMs concerned in this work build upon a strong pre-trained autoregressive LLM that consumes both text and visual tokens, the latter obtained via an image encoder. Our approach is based on a decoder-only architecture akin to Kosmos-1 (Huang et al., 2023)."
2023
- (Huang et al., 2023) ⇒ Shaohan Huang, Li Dong, Wenhui Wang, Yaru Hao, Saksham Singhal, Shuming Ma, Tengchao Lv et al. (2023). “Language is Not all You Need: Aligning Perception with Language Models.” Advances in Neural Information Processing Systems, 36.
- ABSTRACT: A big convergence of language, multimodal perception, action, and world modeling is a key step toward artificial general intelligence. In this work, we introduce KOSMOS-1, a Multimodal Large Language Model (MLLM) that can perceive general modalities, learn in context (i.e., few-shot), and follow instructions (i.e., zero-shot). Specifically, we train KOSMOS-1 from scratch on web-scale multi-modal corpora, including arbitrarily interleaved text and images, image-caption pairs, and text data. We evaluate various settings, including zero-shot, few-shot, and multimodal chain-of-thought prompting, on a wide range of tasks without any gradient updates or finetuning. Experimental results show that KOSMOS-1 achieves impressive performance on (i) language understanding, language generation, and even OCR-free NLP (directly fed with document images), (ii) perception-language tasks, including multimodal dialogue, image captioning, visual question answering, and (iii) vision tasks, such as image recognition with descriptions (specifying classification via text instructions). We also show that MLLMs can benefit from cross-modal transfer, i.e., transfer knowledge from language to multimodal, and from multimodal to language.