Visual Genome Dataset
A Visual Genome Dataset is an annotated image dataset and knowledge base that connects structured image concepts to language.
- Context:
- It can include Image Data.
- It can include Image Region Graph Data.
- It can include Image Scene Graph Data.
- …
- Example(s):
- Version 1.4 of the dataset, completed as of July 12, 2017.
- Version 1.2 of the dataset, completed as of August 29, 2016.
- Version 1.0 of the dataset, completed as of December 10, 2015.
- See: CLEVR, Image-based Question Answering.
References
2018
- https://visualgenome.org/
- QUOTE: Visual Genome is a dataset, a knowledge base, an ongoing effort to connect structured image concepts to language.
- 108,077 Images
- 5.4 Million Region Descriptions
- 1.7 Million Visual Question Answers
- 3.8 Million Object Instances
- 2.8 Million Attributes
- 2.3 Million Relationships
- Everything Mapped to Wordnet Synsets
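The rounded totals quoted above can be turned into rough per-image averages with simple division. The following is a minimal sketch, assuming the totals are accurate to the two significant figures shown on the site; the names `IMAGES`, `COUNTS`, and `per_image_averages` are hypothetical and are not part of any Visual Genome tooling.

```python
# Rounded totals as quoted from visualgenome.org in the list above.
# They are approximations, so the derived averages are approximate too.
IMAGES = 108_077
COUNTS = {
    "region descriptions": 5.4e6,
    "visual question answers": 1.7e6,
    "object instances": 3.8e6,
    "attributes": 2.8e6,
    "relationships": 2.3e6,
}


def per_image_averages(counts, images):
    """Divide each total by the number of images (hypothetical helper)."""
    return {name: total / images for name, total in counts.items()}


averages = per_image_averages(COUNTS, IMAGES)
for name, value in averages.items():
    print(f"{name}: about {value:.1f} per image")
```

Rounded to whole numbers, these totals give about 35 object instances, 26 attributes, and 21 relationships per image.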
2017
- (Krishna et al., 2017) ⇒ Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, Michael S. Bernstein, and Li Fei-Fei. (2017). “Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations.” In: International Journal of Computer Vision, 123(1). doi:10.1007/s11263-016-0981-7
- QUOTE: Despite progress in perceptual tasks such as image classification, computers still perform poorly on cognitive tasks such as image description and question answering. Cognition is core to tasks that involve not just recognizing, but reasoning about our visual world. However, models used to tackle the rich content in images for cognitive tasks are still being trained using the same datasets designed for perceptual tasks. To achieve success at cognitive tasks, models need to understand the interactions and relationships between objects in an image. When asked "What vehicle is the person riding?", computers will need to identify the objects in an image as well as the relationships riding (man, carriage) and pulling (horse, carriage) to answer correctly that "the person is riding a horse-drawn carriage." In this paper, we present the Visual Genome dataset to enable the modeling of such relationships. We collect dense annotations of objects, attributes, and relationships within each image to learn these models. Specifically, our dataset contains over 108K images where each image has an average of 35 objects, 26 attributes, and 21 pairwise relationships between objects. We canonicalize the objects, attributes, relationships, and noun phrases in region descriptions and questions answer pairs to WordNet synsets. Together, these annotations represent the densest and largest dataset of image descriptions, objects, attributes, relationships, and question answer pairs.
Fig. 1 An overview of the data needed to move from perceptual awareness to cognitive understanding of images. We present a dataset of images densely annotated with numerous region descriptions, objects, attributes, and relationships. Some examples of region descriptions (e.g. “girl feeding large elephant” and “a man taking a picture behind girl”) are shown (top). The objects (e.g. elephant), attributes (e.g. large) and relationships (e.g. feeding) are shown (bottom). Our dataset also contains image related question answer pairs (not shown)
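The relationships named in the quoted abstract, riding (man, carriage) and pulling (horse, carriage), can be illustrated as a small scene graph of predicate triples. The following is a minimal sketch for illustration only: the `Relationship` and `SceneGraph` classes and their methods are hypothetical, and they do not reproduce the dataset's actual file format or any official Visual Genome API.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Relationship:
    """One directed edge: predicate(subject, target). Hypothetical type."""
    predicate: str
    subject: str
    target: str


@dataclass
class SceneGraph:
    """Toy scene graph: object names, their attributes, and relationships."""
    objects: set = field(default_factory=set)
    attributes: dict = field(default_factory=dict)  # object name -> set of attribute words
    relationships: list = field(default_factory=list)

    def add_attribute(self, obj, attribute):
        self.objects.add(obj)
        self.attributes.setdefault(obj, set()).add(attribute)

    def add_relationship(self, predicate, subject, target):
        self.objects.update((subject, target))
        self.relationships.append(Relationship(predicate, subject, target))

    def targets_of(self, predicate, subject):
        """What is `subject` doing `predicate` to?"""
        return [r.target for r in self.relationships
                if r.predicate == predicate and r.subject == subject]

    def subjects_of(self, predicate, target):
        """Who or what is doing `predicate` to `target`?"""
        return [r.subject for r in self.relationships
                if r.predicate == predicate and r.target == target]


# The example from the abstract: a person riding a horse-drawn carriage.
graph = SceneGraph()
graph.add_relationship("riding", "man", "carriage")
graph.add_relationship("pulling", "horse", "carriage")

ridden = graph.targets_of("riding", "man")          # ['carriage']
pullers = graph.subjects_of("pulling", "carriage")  # ['horse']
print(f"The man is riding a {pullers[0]}-drawn {ridden[0]}.")
```

Answering "What vehicle is the person riding?" in this toy model needs both edges: the riding edge identifies the carriage, and the pulling edge supplies the detail that it is horse-drawn.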