2024 Visualization-of-Thought Elicits Spatial Reasoning in Large Language Models


Subject Headings: Visualization-of-Thought Prompting, Spatial Reasoning, Mental Image, Multimodal Large Language Model, Human Cognition and AI Modeling.

Notes

Cited By

Quotes

Abstract

Large language models (LLMs) have exhibited impressive performance in language comprehension and various reasoning tasks. However, their abilities in spatial reasoning, a crucial aspect of human cognition, remain relatively unexplored. Humans possess a remarkable ability to create mental images of unseen objects and actions through a process known as the Mind's Eye, enabling the imagination of the unseen world. Inspired by this cognitive capacity, we propose Visualization-of-Thought (VoT) prompting. VoT aims to elicit spatial reasoning in LLMs by visualizing their reasoning traces, thereby guiding subsequent reasoning steps. We employed VoT for multi-hop spatial reasoning tasks, including natural language navigation, visual navigation, and visual tiling in 2D grid worlds. Experimental results demonstrated that VoT significantly enhances the spatial reasoning abilities of LLMs. Notably, VoT outperformed existing multimodal large language models (MLLMs) in these tasks. While VoT works surprisingly well on LLMs, the ability to generate mental images to facilitate spatial reasoning resembles the mind's eye process, suggesting its potential viability in MLLMs.
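
NOTE: The abstract describes VoT prompting only at a high level. The sketch below is a minimal, assumption-based illustration of the idea in Python: the model is instructed to emit a textual visualization of the 2D grid state after each reasoning step, and that interleaved state guides the next step. The names vot_prompt and query_llm are hypothetical placeholders, not the authors' released implementation.

    # Minimal sketch of Visualization-of-Thought (VoT)-style prompting for a
    # 2D grid-world navigation task. Illustrative only; query_llm stands in
    # for whatever chat-completion client you actually use.

    VOT_INSTRUCTION = (
        "Solve the task step by step. After each reasoning step, draw the "
        "current state of the grid as ASCII art (one character per cell) "
        "before continuing to the next step."
    )

    def vot_prompt(task_description: str) -> str:
        """Compose a VoT-style prompt: the task plus an instruction to
        visualize the intermediate state, which guides the next step."""
        return f"{VOT_INSTRUCTION}\n\nTask:\n{task_description}"

    def query_llm(prompt: str) -> str:
        """Placeholder for an actual LLM call; swap in a real client."""
        raise NotImplementedError("plug in your preferred LLM client here")

    if __name__ == "__main__":
        task = (
            "You start at the top-left cell of a 3x3 grid. "
            "Move right 2 cells, then down 2 cells. "
            "Report the coordinates of your final cell."
        )
        print(vot_prompt(task))
        # A VoT-style completion interleaves reasoning with grid sketches, e.g.:
        #   Step 1 (moved right 2):      Step 2 (moved down 2):
        #   . . X                        . . .
        #   . . .                        . . .
        #   . . .                        . . E
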

References

  • (Radford et al., 2021) ⇒ Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever. (2021). "Learning Transferable Visual Models From Natural Language Supervision." In: Proceedings of Machine Learning Research, PMLR 139:8748-8763.
    • NOTE: "The paper discusses the development of computer vision systems trained through natural language supervision using a large dataset of image-text pairs. This method enables models to perform zero-shot learning tasks by predicting which caption goes with which image, showing robust performance across various computer vision datasets without the need for dataset-specific training."
  • (Rozanova et al., 2021) ⇒ Julia Rozanova, et al. (2021). "Grounding Natural Language Instructions: Can Large Language Models Capture Spatial Information?" arXiv preprint arXiv:2104.10120.
    • NOTE: "This paper explores the ability of large language models to understand spatial information through grounding natural language instructions to UI elements. It evaluates various models, like BERT and RoBERTa, for their capacity to perform spatial reasoning, necessary for accurately identifying interface elements from natural language commands."
  • (Yamada et al., 2023) ⇒ Yutaro Yamada, Yihan Bao, Andrew K. Lampinen, Jungo Kasai, and Ilker Yildirim. (2023). "Evaluating Spatial Understanding of Large Language Models." arXiv preprint arXiv:2301.00485.
  • (Shepard, 1978) ⇒ Roger N. Shepard. (1978). "The Mental Image." In: American Psychologist, 33(2).
    • NOTE: "Roger Shepard's paper discusses how humans process and manipulate mental images. His insights are foundational for understanding visual cognition and mental imagery, shaping further research on how mental representations of visual information are formed and used by the brain."
  • (Moulton & Kosslyn, 2009) ⇒ Samuel T. Moulton, Stephen M. Kosslyn. (2009). "Imagining Predictions: Mental Imagery as Mental Emulation." In: Philosophical Transactions of the Royal Society B: Biological Sciences, vol. 364, no. 1521, pp. 1273-1280.
    • NOTE: "The paper discusses the cognitive processes behind mental imagery, suggesting that imagining an event involves simulating the sensory and motor characteristics of the situation. It provides a detailed examination of mental imagery as a form of mental emulation, offering insights into predictive and perceptual functions of the brain.";


Author: Wenshan Wu, Shaoguang Mao, Yadong Zhang, Yan Xia, Li Dong, Lei Cui, Furu Wei
Title: Visualization-of-Thought Elicits Spatial Reasoning in Large Language Models
DOI: 10.48550/arXiv.2404.03622
Year: 2024