2017 VisualGenomeConnectingLanguagea

From GM-RKB

Subject Headings: Visual Genome Dataset, Vision Task.

Notes

Cited By

Quotes

Abstract

Despite progress in perceptual tasks such as image classification, computers still perform poorly on cognitive tasks such as image description and question answering. Cognition is core to tasks that involve not just recognizing, but reasoning about our visual world. However, models used to tackle the rich content in images for cognitive tasks are still being trained using the same datasets designed for perceptual tasks. To achieve success at cognitive tasks, models need to understand the interactions and relationships between objects in an image. When asked "What vehicle is the person riding?", computers will need to identify the objects in an image as well as the relationships riding(man, carriage) and pulling(horse, carriage) to answer correctly that "the person is riding a horse-drawn carriage." In this paper, we present the Visual Genome dataset to enable the modeling of such relationships. We collect dense annotations of objects, attributes, and relationships within each image to learn these models. Specifically, our dataset contains over 108K images where each image has an average of [math]\displaystyle{ 35 }[/math] objects, [math]\displaystyle{ 26 }[/math] attributes, and [math]\displaystyle{ 21 }[/math] pairwise relationships between objects. We canonicalize the objects, attributes, relationships, and noun phrases in region descriptions and question answer pairs to WordNet synsets. Together, these annotations represent the densest and largest dataset of image descriptions, objects, attributes, relationships, and question answer pairs.
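As a rough illustration of the canonicalization idea (and not the authors' actual pipeline), the sketch below uses NLTK's WordNet interface to look up candidate synsets for a phrase's head noun; the candidate_synsets helper and its last-token heuristic are assumptions made purely for illustration.

```python
# Illustrative sketch of canonicalizing a phrase to a WordNet synset (assumed helper,
# not the paper's actual pipeline). Requires NLTK and its WordNet corpus:
#   pip install nltk && python -c "import nltk; nltk.download('wordnet')"
from nltk.corpus import wordnet as wn

def candidate_synsets(phrase):
    """Return candidate noun synsets for a phrase's head word.

    The head word is taken to be the last token -- a crude heuristic
    used here purely for illustration.
    """
    head = phrase.lower().split()[-1]
    return wn.synsets(head, pos=wn.NOUN)

for phrase in ["horse-drawn carriage", "man", "horse"]:
    synsets = candidate_synsets(phrase)
    if synsets:
        # Print the first candidate's synset ID and its gloss.
        print(phrase, "->", synsets[0].name(), "|", synsets[0].definition())
```

Mapping free-form phrases onto a shared synset inventory is what lets annotations such as riding(man, carriage) be matched across differently worded descriptions.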

1. Introduction


Fig. 1 An overview of the data needed to move from perceptual awareness to cognitive understanding of images. We present a dataset of images densely annotated with numerous region descriptions, objects, attributes, and relationships. Some examples of region descriptions (e.g. “girl feeding large elephant” and “a man taking a picture behind girl”) are shown (top). The objects (e.g. elephant), attributes (e.g. large) and relationships (e.g. feeding) are shown (bottom). Our dataset also contains image-related question answer pairs (not shown)

A holy grail of computer vision is the complete understanding of visual scenes: a model that is able to name and detect objects, describe their attributes, and recognize their relationships. Understanding scenes would enable important applications such as image search, question answering, and robotic interactions. Much progress has been made in recent years towards this goal, including image classification (Perronnin et al. 2010; Simonyan and Zisserman 2014; Krizhevsky et al. 2012; Szegedy et al. 2015) and object detection (Girshick et al. 2014; Sermanet et al. 2013; Girshick 2015; Ren et al. 2015b). An important contributing factor is the availability of a large amount of data that drives the statistical models that underpin today’s advances in computational visual understanding. While the progress is exciting, we are still far from reaching the goal of comprehensive scene understanding. As Fig. 1 shows, existing models would be able to detect discrete objects in a photo but would not be able to explain their interactions or the relationships between them. Such explanations tend to be cognitive in nature, integrating perceptual information into conclusions about the relationships between objects in a scene (Bruner 1990; Firestone and Scholl 2015). A cognitive understanding of our visual world thus requires that we complement computers’ ability to detect objects with abilities to describe those objects (Isola et al. 2015) and understand their interactions within a scene (Sadeghi and Farhadi 2011).

There is an increasing effort to put together the next generation of datasets to serve as training and benchmarking datasets for these deeper, cognitive scene understanding and reasoning tasks, the most notable being MS-COCO (Lin et al. 2014) and VQA (Antol et al. 2015). The MS-COCO dataset consists of 300K real-world photos collected from Flickr. For each image, there is pixel-level segmentation of 80 object classes (when present) and 5 independent, user-generated sentences describing the scene. VQA adds to this a set of 614K question answer pairs related to the visual contents of each image (see more details in Sect. 3.1). With this information, MS-COCO and VQA provide a fertile training and testing ground for models aimed at tasks for accurate object detection, segmentation, and summary-level image captioning (Kiros et al. 2014; Mao et al. 2014; Karpathy and Fei-Fei 2015) as well as basic QA (Ren et al. 2015a; Malinowski et al. 2015; Gao et al. 2015; Malinowski and Fritz 2014). For example, a state-of-the-art model (Karpathy and Fei-Fei 2015) provides a description of one MS-COCO image in Fig. 1 as “two men are standing next to an elephant.” But what is missing is the further understanding of where each object is, what each person is doing, what the relationship between the person and elephant is, etc. Without such relationships, these models fail to differentiate this image from other images of people next to elephants.

To understand images thoroughly, we believe three key elements need to be added to existing datasets: a grounding of visual concepts to language (Kiros et al. 2014), a more complete set of descriptions and QAs for each image based on multiple image regions (Johnson et al. 2015), and a formalized representation of the components of an image (Hayes 1978). In the spirit of mapping out this complete information of the visual world, we introduce the Visual Genome dataset. The first release of the Visual Genome dataset uses 108,077 images from the intersection of the YFCC100M (Thomee et al. 2016) and MS-COCO (Lin et al. 2014) datasets. Section 5 provides a more detailed description of the dataset. We highlight below the motivation and contributions of the three key elements that set Visual Genome apart from existing datasets.

The Visual Genome dataset regards relationships and attributes as first-class citizens of the annotation space, in addition to the traditional focus on objects. Recognition of relationships and attributes is an important part of the complete understanding of the visual scene, and in many cases, these elements are key to the story of a scene (e.g., the difference between “a dog chasing a man” versus “a man chasing a dog”). The Visual Genome dataset is among the first to provide a detailed labeling of object interactions and attributes, grounding visual concepts to language.

An image is often a rich scenery that cannot be fully described in one summarizing sentence. The scene in Fig. 1 contains multiple “stories”: “a man taking a photo of elephants,” “a woman feeding an elephant,” “a river in the background of lush grounds,” etc. Existing datasets such as Flickr 30K (Young et al. 2014) and MS-COCO (Lin et al. 2014) focus on high-level descriptions of an image. Instead, for each image in the Visual Genome dataset, we collect more than 50 descriptions for different regions in the image, providing a much denser and more complete set of descriptions of the scene. In addition, inspired by VQA (Antol et al. 2015), we also collect an average of 17 question answer pairs based on the descriptions for each image. Region-based question answers can be used to jointly develop NLP and vision models that can answer questions from either the description or the image, or both.
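To make the pairing of region descriptions with region-based QAs concrete, here is a minimal sketch of how such records might be linked; the field names and ID values are illustrative assumptions, not the dataset's actual schema.

```python
# Minimal illustrative pairing of a region description with a region-based QA.
# Field names and values are assumptions for illustration, not the official schema.
region = {
    "region_id": 42,
    "image_id": 1,
    "bbox": {"x": 421, "y": 180, "w": 260, "h": 310},  # region bounding box in pixels
    "phrase": "girl feeding large elephant",           # one of the 50+ descriptions per image
}

region_qa = {
    "qa_id": 7,
    "region_id": region["region_id"],                  # grounds the QA in a specific region
    "question": "What is the girl feeding?",
    "answer": "A large elephant.",
}

# A QA model can then be trained to answer from the region crop, the phrase, or both.
```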

With a set of dense descriptions of an image and the explicit correspondences between visual pixels (i.e. bounding boxes of objects) and textual descriptors (i.e. relationships, attributes), the Visual Genome dataset is poised to be the first image dataset that is capable of providing a structured formalized representation of an image, in the form that is widely used in knowledge base representations in NLP (Zhou et al. 2007; GuoDong et al. 2005; Culotta and Sorensen 2004; Socher et al. 2012). For example, in Fig. 1, we can formally express the relationship holding between the woman and food as holding(woman, food). Putting together all the objects and relations in a scene, we can represent each image as a scene graph (Johnson et al. 2015). The scene graph representation has been shown to improve semantic image retrieval (Johnson et al. 2015; Schuster et al. 2015) and image captioning (Farhadi et al. 2009; Chang et al. 2014; Gupta and Davis 2008). Furthermore, all objects, attributes, and relationships in each image in the Visual Genome dataset are canonicalized to their corresponding WordNet (Miller 1995) IDs (called synset IDs). This mapping connects all images in Visual Genome and provides an effective way to consistently query the same concept (object, attribute, or relationship) in the dataset. It can also potentially help train models that can learn from contextual information from multiple images (Figs. 2, 3).
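As a concrete reading of this formalized representation, the following sketch encodes relationships such as holding(woman, food) as (subject, predicate, object) triples, each tagged with an assumed WordNet synset ID, and shows how canonicalization allows the same concept to be queried across images. The data structures and synset assignments here are illustrative assumptions, not the dataset's release format.

```python
# Illustrative scene-graph fragment: (subject, predicate, object) triples, each tagged
# with an assumed WordNet synset ID. Mirrors the paper's holding(woman, food) notation;
# the exact structures and synset choices are assumptions for illustration only.
from collections import namedtuple

Object = namedtuple("Object", ["name", "synset"])
Relationship = namedtuple("Relationship", ["subject", "predicate", "object", "synset"])

woman = Object("woman", "woman.n.01")
food = Object("food", "food.n.01")
elephant = Object("elephant", "elephant.n.01")

scene_graph = [
    Relationship(woman, "holding", food, "hold.v.02"),
    Relationship(woman, "feeding", elephant, "feed.v.01"),
]

def find_relationships(graphs, predicate_synset):
    """Return every relationship, across all scene graphs, whose predicate has the given synset."""
    return [rel for graph in graphs for rel in graph if rel.synset == predicate_synset]

# Because every image is mapped to the same synset inventory, one canonical query
# retrieves the concept regardless of how annotators worded it.
print(find_relationships([scene_graph], "hold.v.02"))
```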

In this paper, we introduce the Visual Genome dataset with the aim of training and benchmarking the next generation of computer models for comprehensive scene understanding. The paper proceeds as follows: In Sect. 2, we provide a detailed description of each component of the dataset. Section 3 provides a literature review of related datasets as well as related recognition tasks. Section 4 discusses the crowdsourcing strategies we deployed in the ongoing effort of collecting this dataset. Section 5 is a collection of data analysis statistics, showcasing the key properties of the Visual Genome dataset. Last but not least, Sect. 6 provides a set of experimental results that use Visual Genome as a benchmark. Further visualizations, API, and additional information on the Visual Genome dataset can be found online.

Fig. 2 An example image from the Visual Genome dataset. We show 3 region descriptions and their corresponding region graphs. We also show the connected scene graph collected by combining all of the image’s region graphs. The top region description is “a man and a woman sit on a park bench along a river.” It contains the objects: man, woman, bench and river. The relationships that connect these objects are: sits_on(man, bench), in_front_of(man, river), and sits_on(woman, bench)

Fig. 3 An example image from our dataset along with its scene graph representation. The scene graph contains objects (child, instructor, helmet, etc.) that are localized in the image as bounding boxes (not shown). These objects also have attributes: large, green, behind, etc. Finally, objects are connected to each other through relationships: wears(child, helmet), wears(instructor, jacket), etc.

Fig. 4 A representation of the Visual Genome dataset. Each image contains region descriptions that describe a localized portion of the image. We collect two types of question answer pairs (QAs): freeform QAs and region-based QAs. Each region is converted to a region graph representation of objects, attributes, and pairwise relationships. Finally, each of these region graphs is combined to form a scene graph with all the objects grounded to the image. Best viewed in color

2. Visual Genome Data Representation

The Visual Genome dataset consists of seven main components: region descriptions, objects, attributes, relationships, region graphs, scene graphs, and question answer pairs. Figure 4 shows examples of each component for one image. To enable research on comprehensive understanding of images, we begin by collecting descriptions and question answers. These are raw texts without any restrictions on length or vocabulary. Next, we extract objects, attributes, and relationships from our descriptions. Together, objects, attributes, and relationships comprise our scene graphs, which form a formal representation of an image. In this section, we break down Fig. 4 and explain each of the seven components. In Sect. 4, we will describe in more detail how data from each component is collected through a crowdsourcing platform.
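The sketch below lays out, under an assumed and deliberately simplified schema (the released files may differ), how one image's components might nest and how the per-region graphs can be merged into a single scene graph.

```python
# Assumed, deliberately simplified per-image layout of the main components
# (region descriptions, objects, attributes, relationships, region graphs, QAs).
# The released files may use a different schema; this only illustrates the nesting.
image_record = {
    "image_id": 1,
    "regions": [
        {
            "phrase": "a man taking a picture behind girl",   # region description
            "bbox": {"x": 50, "y": 60, "w": 300, "h": 240},   # localized portion of the image
            "region_graph": {
                "objects": ["man", "camera", "girl"],
                "attributes": {"camera": ["black"]},          # illustrative attribute
                "relationships": [("man", "holding", "camera"),
                                  ("man", "behind", "girl")],
            },
        },
        # ... one such entry per region description
    ],
    "qas": [
        {"question": "What is the man holding?", "answer": "A camera."},  # illustrative QA
    ],
}

def build_scene_graph(record):
    """Union the per-region graphs into one image-level scene graph."""
    objects, relationships, attributes = set(), set(), {}
    for region in record["regions"]:
        graph = region["region_graph"]
        objects.update(graph["objects"])
        relationships.update(graph["relationships"])
        for obj, attrs in graph["attributes"].items():
            attributes.setdefault(obj, set()).update(attrs)
    return {"objects": objects, "attributes": attributes, "relationships": relationships}

print(build_scene_graph(image_record))
```

Merging by simple set union is only an illustration; in practice, duplicate objects mentioned in different regions would need to be resolved to the same grounded instance.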

2.1 Multiple Regions and Their Descriptions

References

  • 1. Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, Devi Parikh, VQA: Visual Question Answering, Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), p.2425-2433, December 07-13, 2015
  • 2. Antol, S., Zitnick, C. L., & Parikh, D. (2014). Zero-shot Learning via Visual Abstraction. In European Conference on Computer Vision (pp. 401-416). Springer.
  • 3. Collin F. Baker, Charles J. Fillmore, John B. Lowe, The Berkeley FrameNet Project, Proceedings of the 17th International Conference on Computational Linguistics, August 10-14, 1998, Montreal, Quebec, Canada
  • 4. Betteridge, J., Carlson, A., Hong, S. A., Hruschka, E. R., Jr., Law, E. L., Mitchell, T. M., et al. (2009). Toward Never Ending Language Learning. In AAAI Spring Symposium: Learning by Reading and Learning to Read (pp. 1-2).
  • 5. Steven Bird, NLTK: The Natural Language Toolkit, Proceedings of the COLING/ACL on Interactive Presentation Sessions, p.69-72, July 17-18, 2006, Sydney, Australia
  • 6. Bruner, J. (1990). Culture and Human Development: A New Look. Human Development, 33(6), 344-355.
  • 7. Razvan C. Bunescu, Raymond J. Mooney, A Shortest Path Dependency Kernel for Relation Extraction, Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, p.724-731, October 06-08, 2005, Vancouver, British Columbia, Canada
  • 8. Chang, A. X., Savva, M., & Manning, C. D. (2014). Semantic Parsing for Text to 3D Scene Generation. In ACL 2014 (p. 17).
  • 9. Chen, X., Fang, H., Lin, T.-Y., Vedantam, R., Gupta, S., Dollar, P., et al. (2015). Microsoft COCO Captions: Data Collection and Evaluation Server. ArXiv:1504.00325.
  • 10. Chen, X., & Lawrence Zitnick, C. (2015). Mind's Eye: A Recurrent Visual Representation for Image Caption Generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 2422-2431).
  • 11. Chen, X., Liu, Z., & Sun, M. (2014). A Unified Model for Word Sense Representation and Disambiguation. In EMNLP (pp. 1025-1035). Citeseer.
  • 12. Xinlei Chen, Abhinav Shrivastava, Abhinav Gupta, NEIL: Extracting Visual Knowledge from Web Data, Proceedings of the 2013 IEEE International Conference on Computer Vision, p.1409-1416, December 01-08, 2013
  • 13. Wongun Choi, Yu-Wei Chao, Caroline Pantofaru, Silvio Savarese, Understanding Indoor Scenes Using 3D Geometric Phrases, Proceedings of the 2013 IEEE Conference on Computer Vision and Pattern Recognition, p.33-40, June 23-28, 2013
  • 14. Aron Culotta, Jeffrey Sorensen, Dependency Tree Kernels for Relation Extraction, Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics, p.423-es, July 21-26, 2004, Barcelona, Spain
  • 15. Yann N. Dauphin, Harm De Vries, Yoshua Bengio, Equilibrated Adaptive Learning Rates for Non-convex Optimization, Proceedings of the 28th International Conference on Neural Information Processing Systems, p.1504-1512, December 07-12, 2015, Montreal, Canada
  • 16. Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., & Fei-Fei, L. (2009). Imagenet: A Large-scale Hierarchical Image Database. In IEEE Conference on Computer Vision and Pattern Recognition, 2009 (CVPR 2009) (pp. 248-255). IEEE.
  • 17. Denkowski, M., & Lavie, A. (2014). Meteor Universal: Language Specific Translation Evaluation for Any Target Language. In Proceedings of the Ninth Workshop on Statistical Machine Translation. Citeseer.
  • 18. Piotr Dollar, Christian Wojek, Bernt Schiele, Pietro Perona, Pedestrian Detection: An Evaluation of the State of the Art, IEEE Transactions on Pattern Analysis and Machine Intelligence, v.34 n.4, p.743-761, April 2012
  • 19. Donahue, J., Anne Hendricks, L., Guadarrama, S., Rohrbach, M., Venugopalan, S., Saenko, K., et al. (2015). Long-term Recurrent Convolutional Networks for Visual Recognition and Description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 2625-2634).
  • 20. Mark Everingham, Luc Gool, Christopher K. Williams, John Winn, Andrew Zisserman, The Pascal Visual Object Classes (VOC) Challenge, International Journal of Computer Vision, v.88 n.2, p.303-338, June 2010
  • 21. Fang, H., Gupta, S., Iandola, F., Srivastava, R. K., Deng, L., Dollár, P., et al. (2015). From Captions to Visual Concepts and Back. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 1473-1482).
  • 22. Farhadi, A., Endres, I., Hoiem, D., & Forsyth, D. (2009). Describing Objects by their Attributes. In IEEE Conference on Computer Vision and Pattern Recognition, 2009 (CVPR 2009) (pp. 1778-1785). IEEE.
  • 23. Ali Farhadi, Mohsen Hejrati, Mohammad Amin Sadeghi, Peter Young, Cyrus Rashtchian, Julia Hockenmaier, David Forsyth, Every Picture Tells a Story: Generating Sentences from Images, Proceedings of the 11th European Conference on Computer Vision: Part IV, September 05-11, 2010, Heraklion, Crete, Greece
  • 24. Li Fei-Fei, Rob Fergus, Pietro Perona, Learning Generative Visual Models from Few Training Examples: An Incremental Bayesian Approach Tested on 101 Object Categories, Computer Vision and Image Understanding, v.106 n.1, p.59-70, April, 2007
  • 25. Vittorio Ferrari, Andrew Zisserman, Learning Visual Attributes, Proceedings of the 20th International Conference on Neural Information Processing Systems, p.433-440, December 03-06, 2007, Vancouver, British Columbia, Canada
  • 26. Ferrucci, D., Brown, E., Chu-Carroll, J., Fan, J., Gondek, D., Kalyanpur, A. A., et al. (2010). Building Watson: An Overview of the Deepqa Project. AI Magazine, 31(3), 59-79.
  • 27. Firestone, C., & Scholl, B. J. (2015). Cognition Does Not Affect Perception: Evaluating the Evidence for Top-down Effects. Behavioral and Brain Sciences (pp. 1-72).
  • 28. Kenneth D. Forbus, Qualitative Process Theory, Artificial Intelligence, v.24 n.1-3, p.85-168, Dec. 1984
  • 29. Haoyuan Gao, Junhua Mao, Jie Zhou, Zhiheng Huang, Lei Wang, Wei Xu, Are You Talking to a Machine? Dataset and Methods for Multilingual Image Question Answering, Proceedings of the 28th International Conference on Neural Information Processing Systems, p.2296-2304, December 07-12, 2015, Montreal, Canada
  • 30. Geman, D., Geman, S., Hallonquist, N., & Younes, L. (2015). Visual Turing Test for Computer Vision Systems. Proceedings of the National Academy of Sciences, 112(12), 3618-3623.
  • 31. Ross Girshick, Fast R-CNN, Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), p.1440-1448, December 07-13, 2015
  • 32. Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik, Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation, Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, p.580-587, June 23-28, 2014
  • 33. Christoph Göering, Erik Rodner, Alexander Freytag, Joachim Denzler, Nonparametric Part Transfer for Fine-Grained Recognition, Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, p.2489-2496, June 23-28, 2014
  • 34. Griffin, G., Holub, A., & Perona, P. (2007). Caltech-256 Object Category Dataset. Technical Report 7694.
  • 35. Zhou GuoDong, Su Jian, Zhang Jie, Zhang Min, Exploring Various Knowledge in Relation Extraction, Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, p.427-434, June 25-30, 2005, Ann Arbor, Michigan
  • 36. Abhinav Gupta, Larry S. Davis, Beyond Nouns: Exploiting Prepositions and Comparative Adjectives for Learning Visual Classifiers, Proceedings of the 10th European Conference on Computer Vision: Part I, October 12-18, 2008, Marseille, France
  • 37. Abhinav Gupta, Aniruddha Kembhavi, Larry S. Davis, Observing Human-Object Interactions: Using Spatial and Functional Compatibility for Recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence, v.31 n.10, p.1775-1789, October 2009
  • 38. Hayes, P. J. (1978). The Naive Physics Manifesto. Geneva: Institut Pour Les Études Sémantiques Et Cognitives/Université De Genève.
  • 39. Hayes, P. J. (1985). The Second Naive Physics Manifesto. Theories of the Commonsense World (pp. 1-36).
  • 40. Marti A. Hearst, Support Vector Machines, IEEE Intelligent Systems, v.13 n.4, p.18-28, July 1998
  • 41. Sepp Hochreiter, Jürgen Schmidhuber, Long Short-Term Memory, Neural Computation, v.9 n.8, p.1735-1780, November 15, 1997
  • 42. Micah Hodosh, Peter Young, Julia Hockenmaier, Framing Image Description As a Ranking Task: Data, Models and Evaluation Metrics, Journal of Artificial Intelligence Research, v.47 n.1, p.853-899, May 2013
  • 43. Chih-Sheng Johnson Hou, Natalya Fridman Noy, Mark A. Musen, A Template-Based Approach Toward Acquisition of Logical Sentences, Proceedings of the IFIP 17th World Computer Congress - TC12 Stream on Intelligent Information Processing, p.77-89, August 25-30, 2002
  • 44. Huang, G. B., Mattar, M., Berg, T., & Learned-Miller, E. (2008). Labeled Faces in the Wild: A Database for Studying Face Recognition in Unconstrained Environments. In Workshop on Faces in 'Real-Life' Images: Detection, Alignment, and Recognition.
  • 45. Isola, P., Lim, J. J., & Adelson, E. H. (2015). Discovering States and Transformations in Image Collections. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 1383-1391).
  • 46. Hamid Izadinia, Fereshteh Sadeghi, Ali Farhadi, Incorporating Scene Context and Object Layout Into Appearance Modeling, Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, p.232-239, June 23-28, 2014
  • 47. Johnson, J., Krishna, R., Stark, M., Li, L.-J., Shamma, D. A., Bernstein, M., et al. (2015). Image Retrieval Using Scene Graphs. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • 48. Karpathy, A., & Fei-Fei, L. (2015). Deep Visual-semantic Alignments for Generating Image Descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 3128-3137).
  • 49. Ryan Kiros, Ruslan Salakhutdinov, Richard Zemel, Multimodal Neural Language Models, Proceedings of the 31st International Conference on International Conference on Machine Learning, June 21-26, 2014, Beijing, China
  • 50. Ranjay A. Krishna, Kenji Hata, Stephanie Chen, Joshua Kravitz, David A. Shamma, Li Fei-Fei, Michael S. Bernstein, Embracing Error to Enable Rapid Crowdsourcing, Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems, May 07-12, 2016, Santa Clara, California, USA
  • 51. Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton, ImageNet Classification with Deep Convolutional Neural Networks, Proceedings of the 25th International Conference on Neural Information Processing Systems, p.1097-1105, December 03-06, 2012, Lake Tahoe, Nevada
  • 52. Lampert, C. H., Nickisch, H., & Harmeling, S. (2009). Learning to Detect Unseen Object Classes by Between-class Attribute Transfer. In IEEE Conference on Computer Vision and Pattern Recognition, 2009 (CVPR 2009) (pp. 951-958). IEEE.
  • 53. Claudia Leacock, George A. Miller, Martin Chodorow, Using Corpus Statistics and WordNet Relations for Sense Identification, Computational Linguistics, v.24 n.1, March 1998
  • 54. Lebret, R., Pinheiro, P. O., & Collobert, R. (2015). Phrase-based Image Captioning. ArXiv:1502.03671.
  • 55. Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., et al. (2014). Microsoft COCO: Common Objects in Context. In Computer Vision-ECCV 2014 (pp. 740-755). Springer.
  • 56. Lu, C., Krishna, R., Bernstein, M., & Fei-Fei, L. (2016). Visual Relationship Detection Using Language Priors. In European Conference on Computer Vision (ECCV). IEEE.
  • 57. Lin Ma, Zhengdong Lu, Hang Li, Learning to Answer Questions from Image Using Convolutional Neural Network, Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, February 12-17, 2016, Phoenix, Arizona
  • 58. Mateusz Malinowski, Mario Fritz, A Multi-world Approach to Question Answering About Real-world Scenes based on Uncertain Input, Proceedings of the 27th International Conference on Neural Information Processing Systems, p.1682-1690, December 08-13, 2014, Montreal, Canada
  • 59. Mateusz Malinowski, Marcus Rohrbach, Mario Fritz, Ask Your Neurons: A Neural-Based Approach to Answering Questions About Images, Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), p.1-9, December 07-13, 2015
  • 60. Malisiewicz, T., Efros, A., et al. (2008). Recognition by Association via Learning Per-exemplar Distances. In IEEE Conference on Computer Vision and Pattern Recognition, 2008 (CVPR 2008) (pp. 1-8). IEEE.
  • 61. Manning, C. D., Surdeanu, M., Bauer, J., Finkel, J., Bethard, S. J., & McClosky, D. (2014). The Stanford CoreNLP Natural Language Processing Toolkit. In Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations (pp. 55-60).
  • 62. Mao, J., Xu, W., Yang, Y., Wang, J., & Yuille, A. L. (2014). Explain Images with Multimodal Recurrent Neural Networks. ArXiv:1410.1090.
  • 63. Mihalcea, R., Chklovski, T. A., & Kilgarriff, A. (2004). The Senseval-3 English Lexical Sample Task. Association for Computational Linguistics, UNT Digital Library.
  • 64. Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space. ArXiv:1301.3781.
  • 65. George A. Miller, WordNet: A Lexical Database for English, Communications of the ACM, v.38 n.11, p.39-41, Nov. 1995
  • 66. Feng Niu, Ce Zhang, Christopher Ré, Jude Shavlik, Elementary: Large-Scale Knowledge-Base Construction via Machine Learning and Statistical Inference, International Journal on Semantic Web & Information Systems, v.8 n.3, p.42-73, July 2012
  • 67. Vicente Ordonez, Girish Kulkarni, Tamara L Berg, Im2Text: Describing Images Using 1 Million Captioned Photographs, Proceedings of the 24th International Conference on Neural Information Processing Systems, p.1143-1151, December 12-15, 2011, Granada, Spain
  • 68. Pal, A. R., & Saha, D. (2015). Word Sense Disambiguation: A Survey. ArXiv:1508.01346.
  • 69. Kishore Papineni, Salim Roukos, Todd Ward, Wei-Jing Zhu, BLEU: A Method for Automatic Evaluation of Machine Translation, Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, July 07-12, 2002, Philadelphia, Pennsylvania
  • 70. Genevieve Patterson, Chen Xu, Hang Su, James Hays, The SUN Attribute Database: Beyond Categories for Deeper Scene Understanding, International Journal of Computer Vision, v.108 n.1-2, p.59-81, May 2014
  • 71. Florent Perronnin, Jorge Sánchez, Thomas Mensink, Improving the Fisher Kernel for Large-scale Image Classification, Proceedings of the 11th European Conference on Computer Vision: Part IV, September 05-11, 2010, Heraklion, Crete, Greece
  • 72. Alessandro Prest, Cordelia Schmid, Vittorio Ferrari, Weakly Supervised Learning of Interactions Between Humans and Objects, IEEE Transactions on Pattern Analysis and Machine Intelligence, v.34 n.3, p.601-614, March 2012
  • 73. Ramanathan, V., Li, C., Deng, J., Han, W., Li, Z., Gu, K., et al. (2015). Learning Semantic Relationships for Better Action Retrieval in Images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 1100-1109).
  • 74. Ren, M., Kiros, R., & Zemel, R. (2015a). Image Question Answering: A Visual Semantic Embedding Model and a New Dataset. ArXiv:1505.02074.
  • 75. Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun, Faster R-CNN: Towards Real-time Object Detection with Region Proposal Networks, Proceedings of the 28th International Conference on Neural Information Processing Systems, p.91-99, December 07-12, 2015, Montreal, Canada
  • 76. Ronchi, M. R., & Perona, P. (2015). Describing Common Human Visual Actions in Images. In X. Xie, M.W. Jones, & G. K. L. Tam (Eds.), Proceedings of the British Machine Vision Conference (BMVC 2015) (pp. 52.1-52.12). BMVA Press.
  • 77. Rothe, S., & Schütze, H. (2015). Autoextend: Extending Word Embeddings to Embeddings for Synsets and Lexemes. ArXiv:1507.01127.
  • 78. Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, & Li Fei-Fei (2015). ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV). DOI:10.1007/s11263-015-0816-y.
  • 79. Bryan C. Russell, Antonio Torralba, Kevin P. Murphy, William T. Freeman, LabelMe: A Database and Web-Based Tool for Image Annotation, International Journal of Computer Vision, v.77 n.1-3, p.157-173, May 2008
  • 80. Sadeghi, F., Divvala, S. K., & Farhadi, A. (2015). Viske: Visual Knowledge Extraction and Question Answering by Visual Verification of Relation Phrases. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 1456-1464).
  • 81. M. A. Sadeghi, A. Farhadi, Recognition Using Visual Phrases, Proceedings of the 2011 IEEE Conference on Computer Vision and Pattern Recognition, p.1745-1752, June 20-25, 2011
  • 82. Niloufar Salehi, Lilly C. Irani, Michael S. Bernstein, Ali Alkhatib, Eva Ogbe, Kristy Milland, Clickhappier, We Are Dynamo: Overcoming Stalling and Friction in Collective Action for Crowd Workers, Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, April 18-23, 2015, Seoul, Republic of Korea
  • 83. Schank, R. C., & Abelson, R. P. (2013). Scripts, Plans, Goals, and Understanding: An Inquiry Into Human Knowledge Structures. Hove: Psychology Press.
  • 84. Karin Kipper Schuler, Martha S. Palmer, Verbnet: A Broad-coverage, Comprehensive Verb Lexicon, University of Pennsylvania, Philadelphia, PA, 2005
  • 85. Schuster, S., Krishna, R., Chang, A., Fei-Fei, L., & Manning, C. D. (2015). Generating Semantically Precise Scene Graphs from Textual Descriptions for Improved Image Retrieval. In Proceedings of the Fourth Workshop on Vision and Language (pp. 70-80). Citeseer.
  • 86. Sermanet, P., Eigen, D., Zhang, X., Mathieu, M., Fergus, R., & LeCun, Y. (2013). Overfeat: Integrated Recognition, Localization and Detection Using Convolutional Networks. ArXiv:1312.6229.
  • 87. Nathan Silberman, Derek Hoiem, Pushmeet Kohli, Rob Fergus, Indoor Segmentation and Support Inference from RGBD Images, Proceedings of the 12th European Conference on Computer Vision, October 07-13, 2012, Florence, Italy
  • 88. Simonyan, K., & Zisserman, A. (2014). Very Deep Convolutional Networks for Large-scale Image Recognition. ArXiv:1409.1556.
  • 89. Rion Snow, Brendan O'Connor, Daniel Jurafsky, Andrew Y. Ng, Cheap and Fast---but is It Good?: Evaluating Non-expert Annotations for Natural Language Tasks, Proceedings of the Conference on Empirical Methods in Natural Language Processing, October 25-27, 2008, Honolulu, Hawaii
  • 90. Richard Socher, Brody Huval, Christopher D. Manning, Andrew Y. Ng, Semantic Compositionality through Recursive Matrix-vector Spaces, Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, July 12-14, 2012, Jeju Island, Korea
  • 91. Steinbach, M., Karypis, G., Kumar, V., et al. (2000). A Comparison of Document Clustering Techniques. In KDD Workshop on Text Mining, Boston (Vol. 400, pp. 525-526).
  • 92. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., et al. (2015). Going Deeper with Convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 1-9).
  • 93. Bart Thomee, David A. Shamma, Gerald Friedland, Benjamin Elizalde, Karl Ni, Douglas Poland, Damian Borth, Li-Jia Li, YFCC100M: The New Data in Multimedia Research, Communications of the ACM, v.59 n.2, February 2016
  • 94. Antonio Torralba, Rob Fergus, William T. Freeman, 80 Million Tiny Images: A Large Data Set for Nonparametric Object and Scene Recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence, v.30 n.11, p.1958-1970, November 2008
  • 95. Manik Varma, Andrew Zisserman, A Statistical Approach to Texture Classification from Single Images, International Journal of Computer Vision, v.62 n.1-2, p.61-81, April-May 2005
  • 96. Vedantam, R., Lawrence Zitnick, C., & Parikh, D. (2015a). Cider: Consensus-based Image Description Evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 4566-4575).
  • 97. Ramakrishna Vedantam, Xiao Lin, Tanmay Batra, C. Lawrence Zitnick, Devi Parikh, Learning Common sense through Visual Abstraction, Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), p.2542-2550, December 07-13, 2015
  • 98. Vinyals, O., Toshev, A., Bengio, S., & Erhan, D. (2015). Show and Tell: A Neural Image Caption Generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 3156-3164).
  • 99. Wah, C., Branson, S., Welinder, P., Perona, P., & Belongie, S. (2011). The Caltech-UCSD Birds-200-2011 Dataset. Technical Report CNS-TR-2011-001, California Institute of Technology.
  • 100. Xiao, J., Hays, J., Ehinger, K., Oliva, A., Torralba, A., et al. (2010). Sun Database: Large-scale Scene Recognition from Abbey to Zoo. In 2010 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 3485-3492). IEEE.
  • 101. Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A. C., Salakhutdinov, R., Zemel, R. S., and Bengio, Y. (2015). Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. CoRR. ArXiv:1502.03044.
  • 102. Deva Ramanan, Recognizing Proxemics in Personal Photos, Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), p.3522-3529, June 16-21, 2012
  • 103. Yao, B., & Fei-Fei, L. (2010). Modeling Mutual Context of Object and Human Pose in Human-object Interaction Activities. In 2010 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 17-24). IEEE.
  • 104. Benjamin Yao, Xiong Yang, Song-Chun Zhu, Introduction to a Large-scale General Purpose Ground Truth Database: Methodology, Annotation Tool and Benchmarks, Proceedings of the 6th International Conference on Energy Minimization Methods in Computer Vision and Pattern Recognition, August 27-29, 2007, Ezhou, China
  • 105. Young, P., Lai, A., Hodosh, M., & Hockenmaier, J. (2014). From Image Descriptions to Visual Denotations: New Similarity Metrics for Semantic Inference over Event Descriptions. Transactions of the Association for Computational Linguistics, 2, 67-78.
  • 106. Yu, L., Park, E., Berg, A. C., & Berg, T. L. (2015). Visual Madlibs: Fill in the Blank Image Generation and Question Answering. ArXiv:1506.00278.
  • 107. Zeng, D., Liu, K., Lai, S., Zhou, G., & Zhao, J. (2014). Relation Classification via Convolutional Deep Neural Network. In Proceedings of COLING (pp. 2335-2344).
  • 108. Zhou, G., Zhang, M., Ji, D. H., & Zhu, Q. (2007). Tree Kernel-based Relation Extraction with Context-sensitive Structured Parse Tree Information. In EMNLP-CoNLL 2007 (p. 728).
  • 109. Jun Zhu, Zaiqing Nie, Xiaojiang Liu, Bo Zhang, Ji-Rong Wen, StatSnowball: A Statistical Approach to Extracting Entity Relationships, Proceedings of the 18th International Conference on World Wide Web, April 20-24, 2009, Madrid, Spain
  • 110. Zhu, Y., Fathi, A., & Fei-Fei, L. (2014). Reasoning About Object Affordances in a Knowledge Base Representation. In European Conference on Computer Vision.
  • 111. Zhu, Y., Zhang, C., Ré, C., & Fei-Fei, L. (2015). Building a Large-scale Multimodal Knowledge Base System for Answering Visual Queries. ArXiv:1507.05670.
  • 112. C. Lawrence Zitnick, Devi Parikh, Bringing Semantics Into Focus Using Visual Abstraction, Proceedings of the 2013 IEEE Conference on Computer Vision and Pattern Recognition, p.3009-3016, June 23-28, 2013



2017 VisualGenomeConnectingLanguagea
Author(s): Li Fei-Fei, Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, Michael S. Bernstein, Kenji Hata
Title: Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations
DOI: 10.1007/s11263-016-0981-7
Year: 2017