Multi-Modal Large Language Model (MLLM)
A Multi-Modal Large Language Model (MLLM) is a large language model that can process, and in some cases generate, information in multiple modalities, such as text, images, audio, and video, rather than text alone.
- Context:
- It can (typically) be developed using advanced ML Techniques and Deep Learning Techniques.
- It can (often) surpass traditional text-only LLMs by integrating and processing information from various modalities, enhancing its understanding and response generation capabilities.
- It can (often) be a Fine-Tuned Multi-Modal LLM.
- It can be applied in a wide range of NLP Tasks and AI applications, including but not limited to multimodal interaction, content generation, and intelligent search engines.
- It can be instrumental in driving innovations in AI-driven customer support, educational technologies, healthcare, and entertainment.
- It can leverage Transformer Architectures and Neural Network models that are specifically designed to handle multimodal data inputs (a minimal architecture sketch follows this list).
- It can be a focus of current AI Research and development, aiming to create models that better mimic human-like understanding of complex, multimodal information.
- ...
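A common design pattern for such models pairs a pretrained vision encoder with a pretrained LLM through a small learned projection that maps visual features into the LLM's token-embedding space. The following is a minimal PyTorch sketch of that pattern; the component interfaces, the dimensions, and the assumption that only the projection is trained are illustrative placeholders, not a specific published implementation.

```python
import torch
import torch.nn as nn

class MultiModalLLM(nn.Module):
    """Minimal sketch: a vision encoder feeds a language model through a
    trainable linear projection (all component modules are placeholders)."""

    def __init__(self, vision_encoder, llm, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.vision_encoder = vision_encoder   # e.g. a ViT; assumed frozen here
        self.llm = llm                         # a decoder-only LLM; assumed frozen here
        # The only trainable part in this sketch: map image features
        # into the LLM's token-embedding space.
        self.projection = nn.Linear(vision_dim, llm_dim)

    def forward(self, pixel_values, text_embeddings):
        # (batch, num_patches, vision_dim) -> (batch, num_patches, llm_dim)
        image_features = self.vision_encoder(pixel_values)
        image_tokens = self.projection(image_features)
        # Prepend the projected image "tokens" to the text token embeddings
        # and let the language model attend over the combined sequence.
        inputs = torch.cat([image_tokens, text_embeddings], dim=1)
        return self.llm(inputs_embeds=inputs)
```

In practice, published systems differ mainly in which components are frozen and in how visual features are inserted into the token sequence (prefix tokens, interleaved spans, or cross-attention layers).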
- Example(s):
- Google Gemini LLM, which demonstrates state-of-the-art multimodal capabilities with its models like Gemini Ultra, Gemini Pro, and Gemini Nano.
- OpenAI's DALL·E, capable of generating images from textual descriptions.
- CLIP by OpenAI, designed to understand images in the context of natural language descriptions, e.g. for zero-shot image-text matching (a usage sketch follows this list).
- ...
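For instance, a CLIP-style model can score how well candidate captions match an image. The snippet below is a minimal usage sketch with the Hugging Face transformers library, assuming the openai/clip-vit-base-patch32 checkpoint and a hypothetical local image file photo.jpg.

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("photo.jpg")  # hypothetical local image
captions = ["a photo of a dog", "a photo of a cat", "a diagram of a transformer"]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
# logits_per_image holds image-text similarity scores, one column per caption.
probs = outputs.logits_per_image.softmax(dim=1)
print(dict(zip(captions, probs[0].tolist())))
```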
- Counter-Example(s):
- Purely text-based LLMs like GPT-3 or BERT, which do not natively process audio, image, or video data.
- Single-Modal LLMs that are designed for specific tasks within a single modality, such as text or image processing only.
- See: AI Safety, Machine Learning Techniques, Deep Learning Architecture, Transformer Architecture.
References
2023
- (Koh et al., 2023) ⇒ JY Koh, R Salakhutdinov, D Fried. (2023). “Grounding Language Models to Images for Multimodal Generation." In: arXiv preprint arXiv:2301.13823.
- QUOTE: “… language models learnt from large scale text-only pretraining, such as in-context learning and free-form text generation. We keep the language model … This allows our model to process …”
- NOTE: It discusses grounding language models to images, exploring techniques in large-scale text-only pretraining for multimodal generation.
2023
- (Driess et al., 2023) ⇒ D Driess, F Xia, MSM Sajjadi, C Lynch, and others. (2023). “PaLM-E: An Embodied Multimodal Language Model." In: arXiv preprint.
- QUOTE: “… language models to directly incorporate real-world continuous sensor modalities into language models … Input to our embodied language model are multi-modal sentences that interleave …”
- NOTE: It introduces "PaLM-E," an embodied multimodal language model, which incorporates real-world continuous sensor modalities into language models.
2023
- (Zang et al., 2023) ⇒ Y Zang, W Li, J Han, K Zhou, CC Loy. (2023). “Contextual Object Detection with Multimodal Large Language Models." In: arXiv preprint arXiv:2305.18279.
- QUOTE: “… Three representative scenarios are investigated, including the language cloze test, visual … multimodal model that is capable of end-to-end differentiable modeling of visual language …”
- NOTE: It explores contextual object detection with multimodal large language models, including the application of visual language modeling.
2023
- (Fu et al., 2023) ⇒ C Fu, P Chen, Y Shen, Y Qin, M Zhang, X Lin, and others. (2023). “MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models." In: arXiv preprint.
- QUOTE: “… Large Language Model (LLM) has paved a new road to the multimodal field, ie, Multimodal Large Language Model … It refers to using LLM as a brain to process multimodal information …”
- NOTE: It presents an evaluation benchmark named "MME" for multimodal large language models, examining the use of Large Language Model (LLM) in processing multimodal information.
2023
- (Wang et al., 2023) ⇒ Dong Wang, Kavé Salamatian, Yunqing Xia, Weiwei Deng, and Qi Zhang. (2023). “BERT4CTR: An Efficient Framework to Combine Pre-trained Language Model with Non-textual Features for CTR Prediction.” In: Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD-2023).
2022
- (Kwon et al., 2022) ⇒ G. Kwon, Z. Cai, A. Ravichandran, E. Bas, and others. (2022). “Masked Vision and Language Modeling for Multi-Modal Representation Learning." In: arXiv preprint.
- QUOTE: “… modeling in vision and language (V+L) representation learning. Instead of developing masked language modeling (MLM) and masked image modeling (… vision and language modeling, …”
- NOTE: It introduces a method for multi-modal representation learning that combines masked language modeling (MLM) and masked image modeling, focusing on vision and language (V+L) representation.
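As a rough illustration of such a joint objective (not the paper's actual implementation), the sketch below combines a masked-text prediction loss with a masked-image-patch reconstruction loss in a single training step; the joint_model interface and the masking ratios are hypothetical.

```python
import torch
import torch.nn.functional as F

def masked_vl_step(joint_model, text_ids, image_patches, mask_token_id,
                   text_mask_ratio=0.15, image_mask_ratio=0.6):
    """One hedged training step for joint masked vision-and-language modeling."""
    # Randomly mask a fraction of text tokens.
    text_mask = torch.rand(text_ids.shape, device=text_ids.device) < text_mask_ratio
    masked_ids = text_ids.masked_fill(text_mask, mask_token_id)

    # Randomly blank out a fraction of image patches.
    patch_mask = torch.rand(image_patches.shape[:2],
                            device=image_patches.device) < image_mask_ratio
    masked_patches = image_patches.masked_fill(patch_mask.unsqueeze(-1), 0.0)

    # Hypothetical model returns token logits and reconstructed patches.
    text_logits, patch_recon = joint_model(masked_ids, masked_patches)

    mlm_loss = F.cross_entropy(text_logits[text_mask], text_ids[text_mask])
    mim_loss = F.mse_loss(patch_recon[patch_mask], image_patches[patch_mask])
    return mlm_loss + mim_loss
```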
2021
- (Tsimpoukelli et al., 2021) ⇒ M Tsimpoukelli, JL Menick, S Cabi, and others. (2021). “Multimodal Few-Shot Learning with Frozen Language Models." In: Advances in NeurIPS.
- QUOTE: “… We have presented a method for transforming large language models into multimodal few-… preserving text prompting abilities of the language model. Our experiments confirm that the …”
- NOTE: It discusses a method for transforming large language models into multimodal few-shot learning, focusing on preserving text prompting abilities of the language model.
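The core idea can be illustrated as follows: a trainable vision encoder turns each image into a short sequence of prefix embeddings, which are interleaved with caption token embeddings to build a few-shot prompt for a language model whose weights remain frozen. The sketch below is a hedged illustration; vision_prefix, embed_tokens, and frozen_llm are hypothetical placeholders rather than the paper's code.

```python
import torch

def build_fewshot_prompt(vision_prefix, embed_tokens, examples, query_image):
    """Interleave (image, caption) support examples and a query image
    into one embedding sequence for a frozen language model."""
    segments = []
    for image, caption_ids in examples:
        segments.append(vision_prefix(image))        # (1, k, d) visual prefix embeddings
        segments.append(embed_tokens(caption_ids))   # (1, t, d) caption token embeddings
    segments.append(vision_prefix(query_image))      # query image; its caption is to be generated
    return torch.cat(segments, dim=1)

# Hypothetical usage:
# prompt = build_fewshot_prompt(prefix_net, frozen_llm.get_input_embeddings(),
#                               support_set, new_image)
# generated = frozen_llm.generate(inputs_embeds=prompt)
```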
2021
- (Rahmani et al., 2021) ⇒ K. Rahmani, M. Raza, S. Gulwani, V. Le, D. Morris, and others. (2021). “Multi-Modal Program Inference: A Marriage of Pre-Trained Language Models and Component-Based Synthesis." In: Proceedings of the ACM on Programming Languages.
- QUOTE: “Our multi-modal program synthesis algorithm is not designed for a particular programming domain and is parameterized by an arbitrary domain-specific language (DSL) …”
- NOTE: It explores the concept of multi-modal program synthesis, describing an algorithm that is not confined to a particular programming domain and can be parameterized by an arbitrary domain-specific language (DSL).
2014
- (Chen et al., 2014) ⇒ H. Chen, M. Cooper, D. Joshi, B. Girod. (2014). “Multi-Modal Language Models for Lecture Video Retrieval." In: Proceedings of the 22nd ACM International Conference on Multimedia.
- QUOTE: “… Multi-modal Language Models (MLMs), which adapt latent variable techniques for document analysis to exploring co-occurrence relationships in multi-modal … a multi-modal probabilistic …”
- NOTE: It presents Multi-modal Language Models (MLMs) that adapt latent variable techniques for document analysis. These models explore co-occurrence relationships in multi-modal contexts, focusing on lecture video retrieval.
2014
- (Kiros et al., 2014) ⇒ R Kiros, R Salakhutdinov, and others. (2014). “Multimodal Neural Language Models." In: Proceedings of the 31st International Conference on Machine Learning (ICML 2014).
- QUOTE: “… This work takes a first step towards generating image descriptions with a multimodal language model and sets a baseline when no additional structures are used. For future work …”