Visual Question Answering (VQA) Task
A Visual Question Answering (VQA) Task is a QA task that is also a vision-and-language task, in which a system must produce a natural language answer to a natural language question about a given image.
- Context:
- It can be solved by a Visual QA System (that typically integrates computer vision and natural language processing capabilities to analyze an image and generate a text-based answer), as in the inference sketch after this list.
- It can range from being a Simple VQA Task with direct questions about visible objects to being a Complex VQA Task involving abstract concepts or contextual reasoning.
- It can require Datasets of image-question-answer triples, such as those built on COCO and Visual Genome images or the VQA dataset itself.
- It can leverage techniques such as Attention Mechanisms, Transformer Models, and Object Detection to focus on the parts of the image relevant to the question when formulating an answer (see the cross-attention sketch after this list).
- ...
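A minimal inference sketch of such a Visual QA System, assuming the Hugging Face transformers and Pillow libraries and the publicly released dandelin/vilt-b32-finetuned-vqa checkpoint (any comparable image-question-answering model could be swapped in); the image URL and question are placeholders:

```python
import requests
from PIL import Image
from transformers import ViltProcessor, ViltForQuestionAnswering

# Load one publicly available VQA checkpoint (assumption: ViLT fine-tuned on VQA v2).
processor = ViltProcessor.from_pretrained("dandelin/vilt-b32-finetuned-vqa")
model = ViltForQuestionAnswering.from_pretrained("dandelin/vilt-b32-finetuned-vqa")

# Placeholder image-question pair.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
question = "How many cats are there?"

# The processor prepares both modalities: pixel values for the vision encoder
# and token ids for the language side.
encoding = processor(image, question, return_tensors="pt")
outputs = model(**encoding)

# This model treats answering as classification over a fixed answer vocabulary.
predicted_idx = outputs.logits.argmax(-1).item()
print("Predicted answer:", model.config.id2label[predicted_idx])
```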
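The attention-based focusing mentioned above can be illustrated with a short PyTorch sketch of cross-attention, in which question tokens act as queries over image-region features; the dimensions (36 regions, 12 tokens, 512-d embeddings) and random features are arbitrary assumptions for illustration:

```python
import torch
import torch.nn as nn

# Assumed dimensions: 36 region features (e.g., from an object detector) and
# 12 question-token embeddings, projected into a shared 512-d space.
num_regions, num_tokens, dim = 36, 12, 512
region_feats = torch.randn(1, num_regions, dim)    # image side (keys/values)
question_feats = torch.randn(1, num_tokens, dim)   # language side (queries)

# Each question token attends over the image regions, letting the model focus on
# the parts of the image relevant to the question.
cross_attn = nn.MultiheadAttention(embed_dim=dim, num_heads=8, batch_first=True)
attended, weights = cross_attn(query=question_feats, key=region_feats, value=region_feats)

print(attended.shape)  # torch.Size([1, 12, 512]) - question tokens enriched with visual context
print(weights.shape)   # torch.Size([1, 12, 36])  - per-token attention over regions
```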
- Example(s):
- a COCO Dataset image of a street scene where the model answers the question "How many cars are in the image?" by detecting and counting cars (a detection-based sketch follows these examples).
- a Visual Genome Dataset image where the model answers "What is the color of the shirt of the man standing on the right?" by identifying the man and analyzing the visual properties of his clothing.
- ...
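A hedged sketch of the car-counting example above, using an off-the-shelf object detector rather than a dedicated VQA model; the torchvision Faster R-CNN checkpoint, the street_scene.jpg path, and the 0.7 confidence threshold are illustrative assumptions:

```python
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

# Off-the-shelf detector trained on COCO (assumption: torchvision >= 0.13 weights API).
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = Image.open("street_scene.jpg").convert("RGB")  # placeholder path
with torch.no_grad():
    prediction = model([to_tensor(image)])[0]

# In torchvision's COCO label map, category id 3 is "car".
CAR_CLASS_ID = 3
num_cars = sum(
    1
    for label, score in zip(prediction["labels"], prediction["scores"])
    if label.item() == CAR_CLASS_ID and score.item() > 0.7  # assumed confidence threshold
)
print("Answer:", num_cars)
```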
- Counter-Example(s):
- an Image Captioning Task, where the goal is to generate descriptive text for an image as a whole, without any specific question being asked.
- a Text QA Task, which relies solely on textual input for generating answers.
- See: Multimodal Learning, Image Recognition, Natural Language Processing, VQA Challenge.
References
2022
- (Schwenk et al., 2022) ⇒ Dustin Schwenk, Apoorv Khandelwal, Christopher Clark, Kenneth Marino, and Roozbeh Mottaghi. (2022). “A-OKVQA: A Benchmark for Visual Question Answering using World Knowledge." In: European Conference on Computer Vision, pages 146–162. Cham: Springer Nature Switzerland.
- QUOTE: "The Visual Question Answering (VQA) task aspires to provide a meaningful testbed for the development of AI models that can jointly reason over visual and natural language inputs. Despite a proliferation of VQA datasets, this goal is hindered by a set of common limitations. These include a reliance on relatively simplistic questions that are repetitive in both concepts and linguistic structure, little world knowledge needed outside of the paired image, and limited reasoning required to arrive at the correct answer."
2015
- (Antol et al., 2015) ⇒ Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. (2015). “VQA: Visual Question Answering." In: Proceedings of the IEEE International Conference on Computer Vision, pages 2425–2433.
- QUOTE: "We propose the task of free-form and open-ended Visual Question Answering (VQA). Given an image and a natural language question about the image, the task is to provide an accurate natural language answer. Mirroring real-world scenarios, such as helping the visually impaired, both the questions and answers are open-ended."