2024 OmniParserforPureVisionBasedGUI
- (Lu et al., 2024) ⇒ Yadong Lu, Jianwei Yang, Yelong Shen, and Ahmed Awadallah. (2024). “OmniParser for Pure Vision Based GUI Agent.” doi:10.48550/arXiv.2408.00203
Subject Headings: OmniParser, Vision-based AI Model, Screen Parsing, Multimodal AI, UI Element Semantics, Interactivity Detection, Functional Semantics Extraction, Cross-Platform AI Agents, Benchmarks for Vision Models.
Notes
- The publication introduces OmniParser, a vision-based method for parsing UI screenshots into structured elements without requiring HTML or DOM information.
- The publication addresses two main challenges: reliable identification of interactable icons and understanding the semantics of UI elements.
- The publication demonstrates that GPT-4V's capabilities were previously underestimated due to inadequate screen parsing techniques.
- The publication curated a dataset of 67k unique screenshots with labeled interactable regions from popular webpage DOM trees.
- The publication developed a fine-tuned YOLOv8 model for detecting interactable regions in UI screenshots.
- The publication created a dataset of 7k icon-description pairs using GPT-4 for training an icon description model.
- The publication fine-tuned a BLIP-2 model to generate functional descriptions of detected UI elements.
- The publication combines outputs from three components: interactable region detection, icon description, and OCR (a pipeline sketch follows this list).
- The publication evaluates performance on three major benchmarks: ScreenSpot, Mind2Web, and AITW.
- The publication shows significant improvement in GPT-4V's performance when using local semantics (93.8% vs 70.5% on the SeeAssign task).
- The publication achieves state-of-the-art results on the ScreenSpot benchmark, outperforming models that use HTML.
- The publication demonstrates effectiveness across multiple platforms: mobile, desktop, and web interfaces.
- The publication identifies three main limitations: handling repeated icons/texts, coarse bounding box prediction, and icon misinterpretation.
- The publication proposes future improvements, including context-aware icon description and combined OCR/region detection.
- The publication outperforms the GPT-4V baseline by 4.7% on the AITW mobile navigation benchmark.
- The publication overlays the detected bounding boxes with unique numeric IDs on the screenshot, following Set-of-Marks prompting (see the overlay sketch after this list).
- The publication demonstrates superior cross-platform generalization compared to platform-specific approaches.
- The publication shows that incorporating local semantics significantly reduces hallucination in GPT-4V's responses.
- The publication provides a foundation for developing more general-purpose GUI agents that can work across different platforms.
- The publication represents a significant step toward pure vision-based UI understanding without relying on platform-specific metadata.
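Below is a minimal Python sketch of the screen-parsing pipeline summarized in the notes above: an interactable-region detector, an icon-description model, and an OCR module whose outputs are merged into one structured element list. The checkpoint paths, thresholds, and the `parse_screenshot` helper are illustrative assumptions rather than the authors' released code, and the captioner here is an off-the-shelf BLIP-2 standing in for the paper's fine-tuned model.
```python
# Illustrative sketch of an OmniParser-style screen-parsing pipeline.
# Assumptions (not the paper's released code): checkpoint paths, the use of
# off-the-shelf BLIP-2 and EasyOCR, and the parse_screenshot helper itself.
from PIL import Image
import easyocr
from ultralytics import YOLO
from transformers import Blip2Processor, Blip2ForConditionalGeneration

# 1) Interactable-region detector (the paper fine-tunes YOLOv8 on 67k
#    web screenshots; here we simply load some detection checkpoint).
detector = YOLO("weights/icon_detect.pt")  # hypothetical path

# 2) Icon-description model (the paper fine-tunes BLIP-2 on 7k
#    icon-description pairs; an off-the-shelf BLIP-2 stands in here).
cap_proc = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
cap_model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")

# 3) OCR module for visible text.
ocr = easyocr.Reader(["en"])

def parse_screenshot(path: str):
    """Return a list of structured UI elements: bounding box, type, description."""
    image = Image.open(path).convert("RGB")
    elements = []

    # Interactable regions -> crop each box and caption its *function*.
    for box in detector(image)[0].boxes.xyxy.tolist():
        x1, y1, x2, y2 = map(int, box)
        crop = image.crop((x1, y1, x2, y2))
        inputs = cap_proc(images=crop, return_tensors="pt")
        out = cap_model.generate(**inputs, max_new_tokens=20)
        desc = cap_proc.batch_decode(out, skip_special_tokens=True)[0].strip()
        elements.append({"box": (x1, y1, x2, y2), "type": "icon", "text": desc})

    # OCR boxes for plain text regions (the paper also merges boxes that
    # overlap heavily with detected icons; that step is omitted here).
    for quad, text, conf in ocr.readtext(path):
        xs = [p[0] for p in quad]
        ys = [p[1] for p in quad]
        elements.append({"box": (int(min(xs)), int(min(ys)), int(max(xs)), int(max(ys))),
                         "type": "text", "text": text})

    return elements
```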
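The structured elements are then rendered back onto the screenshot as numbered bounding boxes in the Set-of-Marks style, and the ID-to-description mapping is supplied to GPT-4V as local semantics so the model can answer with an element ID rather than raw coordinates. The helper below is a minimal sketch of that overlay step under the same assumptions as above; `parse_screenshot` is the hypothetical function from the previous sketch.
```python
# Minimal Set-of-Marks-style overlay: draw each element's box with a unique
# numeric ID and build the ID -> functional-description list that accompanies
# the marked screenshot in the GPT-4V prompt. Illustrative sketch only.
from PIL import Image, ImageDraw

def overlay_set_of_marks(path: str, elements):
    image = Image.open(path).convert("RGB")
    draw = ImageDraw.Draw(image)
    lines = []
    for idx, el in enumerate(elements):
        x1, y1, x2, y2 = el["box"]
        draw.rectangle((x1, y1, x2, y2), outline="red", width=2)
        draw.text((x1 + 2, y1 + 2), str(idx), fill="red")  # unique ID label
        lines.append(f"ID {idx} ({el['type']}): {el['text']}")
    return image, "\n".join(lines)

# Usage (hypothetical): the marked image plus the text lines form the
# "local semantics" passed to GPT-4V alongside the task instruction.
marked, semantics = overlay_set_of_marks("screen.png", parse_screenshot("screen.png"))
marked.save("screen_marked.png")
```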
Cited By
Quotes
Abstract
The recent success of large vision language models shows great potential in driving the agent system operating on user interfaces. However, we argue that the power of multimodal models like GPT-4V as a general agent on multiple operating systems across different applications is largely underestimated due to the lack of a robust screen parsing technique capable of: 1) reliably identifying interactable icons within the user interface, and 2) understanding the semantics of various elements in a screenshot and accurately associating the intended action with the corresponding region on the screen. To fill these gaps, we introduce OmniParser, a comprehensive method for parsing user interface screenshots into structured elements, which significantly enhances the ability of GPT-4V to generate actions that can be accurately grounded in the corresponding regions of the interface. We first curated an interactable icon detection dataset using popular webpages and an icon description dataset. These datasets were utilized to fine-tune specialized models: a detection model to parse interactable regions on the screen and a caption model to extract the functional semantics of the detected elements. OmniParser significantly improves GPT-4V's performance on the ScreenSpot benchmark. And on Mind2Web and AITW, OmniParser with screenshot-only input outperforms the GPT-4V baselines requiring additional information outside of the screenshot.
References
| | Author | volume | Date Value | title | type | journal | titleUrl | doi | note | year |
|---|---|---|---|---|---|---|---|---|---|---|
| 2024 OmniParserforPureVisionBasedGUI | Yelong Shen, Ahmed Awadallah, Jianwei Yang, Yadong Lu | | | OmniParser for Pure Vision Based GUI Agent | | | | 10.48550/arXiv.2408.00203 | | 2024 |