2023 The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision)

Subject Headings: Large Multimodal Model, Visual Understanding.

Notes

Cited By

Quotes

Abstract

Large multimodal models (LMMs) extend large language models (LLMs) with multi-sensory skills, such as visual understanding, to achieve stronger generic intelligence. In this paper, we analyze the latest model, GPT-4V(ision), to deepen the understanding of LMMs. The analysis focuses on the intriguing tasks that GPT-4V can perform, and contains test samples to probe the quality and genericity of GPT-4V's capabilities, its supported inputs and working modes, and the effective ways to prompt the model. In our approach to exploring GPT-4V, we curate and organize a collection of carefully designed qualitative samples spanning a variety of domains and tasks. Observations from these samples demonstrate that GPT-4V's unprecedented ability to process arbitrarily interleaved multimodal inputs and the genericity of its capabilities together make GPT-4V a powerful multimodal generalist system. Furthermore, GPT-4V's unique capability of understanding visual markers drawn on input images can give rise to new human-computer interaction methods such as visual referring prompting. We conclude the report with in-depth discussions on the emerging application scenarios and the future research directions for GPT-4V-based systems. We hope that this preliminary exploration will inspire future research on next-generation multimodal task formulation, on new ways to exploit and enhance LMMs to solve real-world problems, and on gaining a better understanding of multimodal foundation models.
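
The two capabilities the abstract highlights, arbitrarily interleaved image-text inputs and visual referring prompting (markers drawn directly on the input image), map naturally onto today's multimodal chat APIs. Below is a minimal sketch in Python, assuming the `openai` client library (v1+) and a GPT-4V-class chat model; the model name, file names, and marker coordinates are illustrative assumptions, not details taken from the paper.

```python
# Sketch of (1) visual referring prompting: draw a marker on the image itself,
# and (2) an interleaved image-text prompt, as described in the abstract.
# Assumes Pillow and the openai Python package; OPENAI_API_KEY is read from
# the environment. "scene.jpg", the ellipse coordinates, and the model name
# are all hypothetical placeholders.
import base64

from PIL import Image, ImageDraw
from openai import OpenAI

# (1) Mark the region of interest with a red ellipse drawn on the image.
img = Image.open("scene.jpg").convert("RGB")
draw = ImageDraw.Draw(img)
draw.ellipse((120, 80, 220, 180), outline="red", width=5)  # hypothetical region
img.save("scene_marked.jpg")


def to_data_url(path: str) -> str:
    """Encode a local JPEG as a base64 data URL accepted by the chat API."""
    with open(path, "rb") as f:
        return "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()


# (2) Interleave text and the marked image in a single user turn.
client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",  # any GPT-4V-class multimodal chat model
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Here is a photo with one object circled in red."},
            {"type": "image_url",
             "image_url": {"url": to_data_url("scene_marked.jpg")}},
            {"type": "text",
             "text": "What is the circled object, and what is it used for?"},
        ],
    }],
    max_tokens=300,
)
print(response.choices[0].message.content)
```

Because the marker lives in pixel space rather than in the text, the model must ground the phrase "circled in red" visually; the paper's observation is that GPT-4V handles such drawn markers reliably enough to serve as a pointing mechanism.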

References

Kevin Lin, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Chung-Ching Lin, Zicheng Liu, and Lijuan Wang. (2023). "The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision)." arXiv preprint. doi:10.48550/arXiv.2309.17421.