Multimodal Language Model
A Multimodal Language Model is a language model that can process inputs and generate outputs in multiple modalities (such as text, image, audio, and video).
- Context:
- It can (typically) be a Foundation Model through a pre-training process.
- It can (typically) be a Large Language Model in terms of its neural network architecture.
- It can process Multiple Input Types through multimodal encoders.
- It can generate Multiple Output Types through multimodal decoders.
- It can incorporate Cross-Modal Attention and Modal Fusion mechanisms to combine representations from different modalities (see the sketch after this list).
- It can range from being a Dual-Modal Model to being an Omni-Modal Model, depending on its supported modality count.
- ...
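A minimal sketch of how the pieces above can fit together: a per-modality encoder for images, a token embedding for text, cross-modal attention as the fusion mechanism, and a language-model head as the decoder. This is an illustrative toy, not the architecture of any particular model; all module names, sizes, and the patch-embedding stand-in for a vision encoder are assumptions.

```python
# Toy dual-modal (image + text) language model: illustrative only.
import torch
import torch.nn as nn


class ToyMultimodalLM(nn.Module):
    def __init__(self, vocab_size=32000, d_model=256, n_heads=4):
        super().__init__()
        # Multimodal encoders: one per input modality.
        self.text_embed = nn.Embedding(vocab_size, d_model)
        self.image_encoder = nn.Sequential(            # stand-in for a ViT/CNN encoder
            nn.Conv2d(3, d_model, kernel_size=16, stride=16),
            nn.Flatten(2),                             # -> (B, d_model, num_patches)
        )
        # Modal fusion via cross-modal attention: text tokens attend to image patches.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Decoder head producing next-token logits over the text vocabulary.
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, token_ids, images):
        txt = self.text_embed(token_ids)                    # (B, T, d_model)
        img = self.image_encoder(images).transpose(1, 2)    # (B, num_patches, d_model)
        fused, _ = self.cross_attn(query=txt, key=img, value=img)
        return self.lm_head(txt + fused)                    # (B, T, vocab_size)


# Usage with random inputs: 2 samples, 8 text tokens, 224x224 RGB images.
model = ToyMultimodalLM()
logits = model(torch.randint(0, 32000, (2, 8)), torch.randn(2, 3, 224, 224))
print(logits.shape)  # torch.Size([2, 8, 32000])
```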
- Example(s):
- Vision-Language Models, such as: GPT-4V, LLaVA, and Flamingo.
- Audio-Language Models, such as: AudioLM and Qwen-Audio.
- Multi-Modal Models (spanning three or more modalities), such as: GPT-4o and Gemini.
- ...
- Counter-Example(s):
- Text-Only Language Model, which processes only text input.
- Vision-Only Model, which processes only image input.
- Audio-Only Model, which processes only audio input.
- See: Multimodal Encoder, Cross-Modal Attention, Modal Fusion, Multimodal Training Data.