2016 ADatasetforJointNounNounCompoun
- (Fares, 2016) ⇒ Murhaf Fares. (2016). “A Dataset for Joint Noun-Noun Compound Bracketing and Interpretation.” In: Proceedings of 54th Annual Meeting of the Association for Computational Linguistics - ACL 2016 Student Research Workshop.
Subject Headings: Noun Compound Bracketing Algorithm; Noun Compound Bracketing Task.
Notes
- This is an open-access article distributed under a Creative Commons Attribution 4.0 International License.
Cited By
Quotes
Abstract
We present a new, sizeable dataset of noun-noun compounds with their syntactic analysis (bracketing) and semantic relations. Derived from several established linguistic resources, such as the Penn Treebank, our dataset enables experimenting with new approaches towards a holistic analysis of noun-noun compounds, such as joint learning of noun-noun compounds bracketing and interpretation, as well as integrating compound analysis with other tasks such as syntactic parsing.
1 Introduction
Noun–noun compounds are abundant in many languages, and English is no exception. According to Ó Séaghdha (2008), three percent of all words in the British National Corpus (Burnard, 2000, BNC) are part of nominal compounds. Therefore, in addition to being an interesting linguistic phenomenon per se, the analysis of noun–noun compounds is important to other natural language processing (NLP) tasks such as machine translation and information extraction. Indeed, there is already a nontrivial amount of research on noun–noun compounds within the field of computational linguistics (Lauer, 1995; Nakov, 2007; Ó Séaghdha, 2008; Tratz, 2011, inter alios).
As Lauer and Dras (1994) point out, the treatment of noun–noun compounds involves three tasks: identification, bracketing and semantic interpretation. With a few exceptions (Girju et al., 2005; Kim and Baldwin, 2013), most studies on noun–noun compounds focus on one of the aforementioned tasks in isolation, but these tasks are of course not fully independent and therefore might benefit from a joint-learning approach, especially bracketing and semantic interpretation.
Reflecting previous lines of research, most of the existing datasets on noun–noun compounds include either bracketing information or semantic relations, rarely both. In this article we present a fairly large dataset for noun–noun compound bracketing as well as semantic interpretation. Furthermore, most of the available datasets list the compounds out of context. Hence they implicitly assume that the semantics of noun–noun compounds is type-based, meaning that the same compound will always have the same semantic relation. To test this assumption of type-based vs. token-based semantic relations, we incorporate the context of the compounds in our dataset and treat compounds as tokens rather than types. Lastly, to study the effect of noun–noun compound bracketing and interpretation on other NLP tasks, we derive our dataset from well-established resources that annotate noun–noun compounds as part of other linguistic structures, viz. the Wall Street Journal Section of the Penn Treebank (Marcus et al., 1993, PTB), PTB noun phrase annotation by Vadas and Curran (2007), DeepBank (Flickinger et al., 2012), the Prague Czech–English Dependency Treebank 2.0 (Hajič et al., 2012, PCEDT) and NomBank (Meyers et al., 2004). We can therefore quantify the effect of compound bracketing on syntactic parsing using the PTB, for example.
In the following section, we review some of the existing noun compound datasets. In § 3, we present the process of constructing a dataset of noun–noun compounds with bracketing information and semantic relations. In § 4, we explain how we construct the bracketing of noun–noun compounds from three resources and report ‘inter-resource’ agreement levels. In § 5, we present the semantic relations extracted from two resources and the correlation between the two sets of relations. In § 6, we conclude the article and present an outlook for future work.
2 Background
The syntax and semantics of noun–noun compounds have been under focus for years, in linguistics and computational linguistics. Levi (1978) presents one of the early and influential studies on noun–noun compounds as a subset of so-called complex nominals. Levi (1978) defines a set of nine “recoverably deletable predicates” which express the “semantic relationship between head nouns and prenominal modifiers” in complex nominals. Finin (1980) presented one of the earliest studies on nominal compounds in computational linguistics, but Lauer (1995) was among the first to study statistical methods for noun compound analysis. Lauer (1995) used the Grolier encyclopedia to estimate word probabilities, and tested his models on a dataset of 244 three-word bracketed compounds and 282 two-word compounds. The compounds were annotated with eight prepositions which Lauer takes to approximate the semantics of noun–noun compounds.
Table 1 shows an overview of some of the existing datasets for nominal compounds. The datasets by Nastase and Szpakowicz (2003) and Girju et al. (2005) are not limited to noun–noun compounds; the former includes compounds with adjectival and adverbial modifiers, and the latter has many noun-preposition-noun constructions. The semantic relations in Ó Séaghdha and Copestake (2007) and Kim and Baldwin (2008) are based on the relations introduced by Levi (1978) and Barker and Szpakowicz (1998), respectively. All of the datasets in Table 1 list the compounds out of context. In addition, the dataset by Girju et al. (2005) includes three-word bracketed compounds, whereas the rest include two-word compounds only. On the other hand, the dataset by Girju et al. (2005) is the only one in Table 1 that is not publicly available.
3 Framework
This section gives an overview of our method to automatically construct a bracketed and semantically annotated dataset of noun–noun compounds from four different linguistic resources. The construction method consists of three steps that correspond to the tasks defined by Lauer and Dras (1994): identification, bracketing and semantic interpretation.
Firstly, we identify the noun–noun compounds in the PTB WSJ Section using two of the compound identification heuristics introduced by Fares et al. (2015), namely the so-called syntax-based NNPh heuristic, which includes compounds that contain common and proper nouns but excludes the ones headed by proper nouns, and the syntax-based NNP0 heuristic, which excludes all compounds that contain proper nouns, be it in the head position or the modifier position. Table 2 shows the number of compounds and compound types we identified using the NNPh and NNP0 heuristics. Note that the number of compounds will vary in the following sections depending on the resources we use.
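The difference between the two heuristics can be sketched as a filter over the POS tags of a candidate noun sequence. This is a simplified illustration, not the implementation of Fares et al. (2015): the actual heuristics are syntax-based and rely on NP structure, whereas the function names, the flat list of (token, tag) pairs, and the assumption that the head is the right-most noun are all introduced here for the example.

```python
# Simplified sketch: the real heuristics are syntax-based (they use NP structure).
# Here a candidate compound is just a list of (token, PTB POS tag) pairs.
COMMON = {"NN", "NNS"}
PROPER = {"NNP", "NNPS"}

def is_noun_sequence(candidate):
    """True if there are at least two tokens and all of them are nouns."""
    return len(candidate) >= 2 and all(tag in COMMON | PROPER for _, tag in candidate)

def passes_nnp_h(candidate):
    """NNPh-style filter: proper nouns are allowed, but not as the head.
    Assumption: the head is the right-most noun."""
    return is_noun_sequence(candidate) and candidate[-1][1] in COMMON

def passes_nnp_0(candidate):
    """NNP0-style filter: no proper nouns anywhere in the compound."""
    return is_noun_sequence(candidate) and all(tag in COMMON for _, tag in candidate)

new_york_streets = [("New", "NNP"), ("York", "NNP"), ("streets", "NNS")]
consumer_food_prices = [("consumer", "NN"), ("food", "NN"), ("prices", "NNS")]

print(passes_nnp_h(new_york_streets), passes_nnp_0(new_york_streets))
print(passes_nnp_h(consumer_food_prices), passes_nnp_0(consumer_food_prices))
```

Under these assumptions, a compound such as New York streets is kept by the NNPh-style filter but dropped by the NNP0-style one, which is why the NNP0 set is a subset of the NNPh set.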
Secondly, we extract the bracketing of the identified compounds from three resources: PTB noun phrase annotation by Vadas and Curran (2007), DeepBank and PCEDT. Vadas and Curran (2007) manually annotated the internal structure of noun phrases (NPs) in PTB which were originally left unannotated. However, as is the case with other resources, the annotation by Vadas and Curran (2007) is not completely error-free, as shown by Fares et al. (2015). We therefore crosscheck their bracketing by comparing it to those of DeepBank and PCEDT. The latter two, however, do not contain explicit annotation of noun–noun compound bracketing, but we can ‘reconstruct’ the bracketing based on the dependency relations assigned in both resources, i.e. the logical form meaning representation in DeepBank and the tectogrammatical layer (t-layer) in PCEDT. Based on the bracketing extracted from the three resources, we define the subset of compounds that are bracketed similarly in the three resources. Lastly, we extract the semantic relations of two-word compounds as well as multi-word bracketed compounds from two resources: PCEDT and NomBank.
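How a bracketing might be ‘reconstructed’ from dependency relations can be illustrated with a small sketch. The source does not spell out the exact mapping used for DeepBank and PCEDT, so the following is only one plausible reading: the input format (a list of head indices restricted to the compound) and both function names are assumptions, as is the premise that every modifier precedes its head and attaches nearest-first.

```python
def bracket(tokens, heads):
    """Rebuild a nested bracketing from dependency heads inside one compound.

    tokens: the nouns of the compound, in surface order.
    heads:  heads[i] is the index of token i's head, or None for the compound head.
    Assumptions: every modifier precedes its head, and modifiers attach
    to their head nearest-first.
    """
    def build(i):
        node = tokens[i]
        # Dependents of token i, processed from the closest to the farthest.
        dependents = [d for d, h in enumerate(heads) if h == i]
        for d in sorted(dependents, reverse=True):
            node = (build(d), node)
        return node

    return build(heads.index(None))

def show(tree):
    """Render a nested tuple as a bracketed string."""
    if isinstance(tree, str):
        return tree
    return "[" + " ".join(show(t) for t in tree) + "]"

nouns = ["noon", "fashion", "show"]
print(show(bracket(nouns, [2, 2, None])))  # both modifiers depend on "show"
print(show(bracket(nouns, [1, 2, None])))  # "noon" depends on "fashion"
```

In this sketch, a modifier that depends directly on the compound head yields a right-bracketed structure, while a modifier that depends on another modifier yields a left-bracketed one.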
On a more technical level, we use the so-called phrase-structure layer (p-layer) in PCEDT to identify noun–noun compounds, because it includes the NP annotation by Vadas and Curran (2007), which is required to apply the noun–noun compound identification heuristics by Fares et al. (2015). For bracketing, we also use the PCEDT p-layer, in addition to the dataset prepared by Oepen et al. (2016) which includes DeepBank and the PCEDT tectogrammatical layer. We opted for the dataset by Oepen et al. (2016) because they converted the tectogrammatical annotation in PCEDT to dependency representation in which the “set of graph nodes is equivalent to the set of surface tokens.” For semantic relations, we also use the dataset by Oepen et al. (2016) for PCEDT relations and the original NomBank files for NomBank relations.
Throughout the whole process we store the data in a relational database with a schema that represents the different types of information, and the different resources from which they are derived. As we will show in § 4 and § 5, this set-up allows us to combine information in different ways and therefore create ‘different’ datasets.
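The source does not describe the actual database schema, so the following is a hypothetical sketch of how such a set-up could look, using Python's built-in sqlite3 module. All table names, column names, and the sentence identifier are invented for illustration; only the example compound and its bracketings come from § 4.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE compound (
    id INTEGER PRIMARY KEY,
    sentence_id TEXT NOT NULL,   -- ties the compound to its context (token, not type)
    text TEXT NOT NULL
);
CREATE TABLE bracketing (
    compound_id INTEGER REFERENCES compound(id),
    resource TEXT NOT NULL,      -- e.g. 'VC-PTB', 'DeepBank', 'PCEDT'
    brackets TEXT NOT NULL,
    PRIMARY KEY (compound_id, resource)
);
CREATE TABLE relation (
    compound_id INTEGER REFERENCES compound(id),
    resource TEXT NOT NULL,      -- e.g. 'PCEDT', 'NomBank'
    label TEXT NOT NULL
);
""")

# 'placeholder_sentence' is a made-up identifier, not a real corpus reference.
conn.execute("INSERT INTO compound VALUES (1, 'placeholder_sentence', 'consumer food prices')")
conn.executemany("INSERT INTO bracketing VALUES (?, ?, ?)", [
    (1, "VC-PTB", "[[consumer food] prices]"),
    (1, "DeepBank", "[consumer [food prices]]"),
    (1, "PCEDT", "[consumer [food prices]]"),
])

# Compounds on which all three resources give the same bracketing.
agreed = conn.execute("""
    SELECT compound_id FROM bracketing
    GROUP BY compound_id
    HAVING COUNT(*) = 3 AND COUNT(DISTINCT brackets) = 1
""").fetchall()

# Compounds with at least one bracketing disagreement.
disputed = conn.execute("""
    SELECT c.text FROM compound c JOIN bracketing b ON b.compound_id = c.id
    GROUP BY c.id
    HAVING COUNT(DISTINCT b.brackets) > 1
""").fetchall()
print(agreed, disputed)
```

Keeping one row per compound and resource is what would make it cheap to derive ‘different’ datasets, for instance the subset on which all resources agree, with a single query.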
4 Bracketing
Noun-noun compound bracketing can be defined as the disambiguation of the internal structure of compounds with three nouns or more. For example, we can bracket the compound noon fashion show in two ways:
- 1. Left-bracketing:
[[noon fashion] show]
- 2. Right-bracketing:
[noon [fashion show]]
In this example, the right-bracketing interpretation (a fashion show happening at noon) is more likely than the left-bracketing one (a show of noon fashion). However, the correct bracketing need not always be as obvious; some compounds can be subtler to bracket, e.g. car radio equipment (Girju et al., 2005).
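To make the search space concrete, the candidate bracketings of a compound can be enumerated recursively. The sketch below is an illustration added here, not part of the original work; the function name is an assumption, and only binary bracketings are considered.

```python
def bracketings(nouns):
    """Return every binary bracketing of a noun sequence as bracketed strings."""
    if len(nouns) == 1:
        return [nouns[0]]
    results = []
    # Try every position at which the sequence can be split in two.
    for split in range(1, len(nouns)):
        for left in bracketings(nouns[:split]):
            for right in bracketings(nouns[split:]):
                results.append(f"[{left} {right}]")
    return results

for candidate in bracketings(["noon", "fashion", "show"]):
    print(candidate)
```

A three-noun compound has two binary bracketings and a four-noun compound has five, so the ambiguity grows quickly with compound length.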
4.1 Data & Results
As explained in § 3, we first identify noun–noun compounds in the WSJ Corpus, then we extract and map their bracketing from three linguistic resources: PCEDT, DeepBank and noun phrase annotation by Vadas and Curran (2007) (VC-PTB, henceforth). Even though we can identify 38,917 noun–noun compounds in the full WSJ Corpus (cf. Table 2), the set of compounds that constitutes the basis for bracketing analysis (i.e. the set of compounds that occur in the three resources) is smaller, for two reasons. First, DeepBank only annotates the first 22 sections of the WSJ Corpus. Second, not all the noun sequences identified as compounds in VC-PTB are treated as such in DeepBank and PCEDT. Hence, the number of compounds that occur in the three resources is 26,500. Furthermore, three-quarters (76%) of these compounds consist of two nouns only, meaning that they do not require bracketing, which leaves us with a subset of 6,244 multi-word compounds. We will refer to this subset as the bracketing subset.
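The size of the bracketing subset can be checked with a few lines of arithmetic. The two counts are taken from the paragraph above; the variable names are only illustrative.

```python
total = 26_500       # compounds that occur in all three resources
multi_word = 6_244   # compounds with three or more nouns (the bracketing subset)

two_word = total - multi_word
share_two_word = two_word / total

print(two_word)
print(f"{share_two_word:.1%}")
```

The two-noun share comes out at roughly 76%, which matches the rounded figure reported in the text.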
After mapping the bracketings from the three resources we find that they agree on the bracketing of almost 75% of the compounds in the bracketing subset. Such an agreement level is relatively good compared to previously reported agreement levels on much smaller datasets, e.g. Girju et al. (2005) report a bracketing agreement of 87% on a set of 362 three-word compounds. Inspecting the disagreement among the three resources reveals two things. First, noun–noun compounds which contain proper nouns (NNP) constitute 45% of the compounds that are bracketed differently. Second, 41% of the differently bracketed compounds are actually sub-compounds of larger compounds. For example, the compound consumer food prices is left-bracketed in VC-PTB, i.e. [[consumer food] prices], whereas in PCEDT and DeepBank it is right-bracketed. This difference in bracketing leads to two different sub-compounds, namely consumer food in VC-PTB and food prices in PCEDT and DeepBank.
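The way a bracketing determines its sub-compounds can be shown with a short sketch that parses the bracket notation used in this section and lists every bracketed constituent inside the full compound. The function names and the nested-list representation are assumptions made for this illustration.

```python
def parse(brackets):
    """Parse a string such as '[[consumer food] prices]' into nested lists."""
    tokens = brackets.replace("[", " [ ").replace("]", " ] ").split()
    stack = [[]]
    for tok in tokens:
        if tok == "[":
            stack.append([])
        elif tok == "]":
            done = stack.pop()
            stack[-1].append(done)
        else:
            stack[-1].append(tok)
    return stack[0][0]

def leaves(tree):
    """Flatten a nested list back into its nouns, in surface order."""
    return [tree] if isinstance(tree, str) else [w for t in tree for w in leaves(t)]

def sub_compounds(tree):
    """Collect every bracketed constituent strictly inside the compound."""
    found = []
    for child in tree:
        if not isinstance(child, str):
            found.append(" ".join(leaves(child)))
            found.extend(sub_compounds(child))
    return found

print(sub_compounds(parse("[[consumer food] prices]")))
print(sub_compounds(parse("[consumer [food prices]]")))
```

The two bracketings of consumer food prices yield different sub-compounds, so a single disagreement on the larger compound also changes which shorter compounds the resources contain.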
It is noteworthy that those two observations do not reflect the properties of compounds containing proper nouns or sub-compounds; they only tell us their percentages in the set of differently bracketed compounds. In order to study their properties, we need to look at the number of sub-compounds and compounds containing NNPs in the set of compounds where the three resources agree. As it turns out, 72% of the compounds containing proper nouns and 76% of the sub-compounds are bracketed similarly. Therefore when we exclude them from the bracketing subset we do not see a significant change in bracketing agreement among the three resources, as shown in the right-most column in Table 3.
We report pairwise bracketing agreement among the three resources in Table 3. We observe a higher agreement level between PCEDT and VC-PTB than for the other two pairs; we speculate that the annotation of the t-layer in PCEDT might have been influenced by the so-called phrase-structure layer (p-layer) which in turn uses VC-PTB annotation. Further, PCEDT and VC-PTB seem to disagree more on the bracketing of noun–noun compounds containing NNPs, because when proper nouns are excluded (NNP0), the agreement level between PCEDT and VC-PTB increases, but it decreases for the other two pairs.
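Pairwise and three-way agreement of this kind can be computed as the share of compounds that receive identical bracketings. The sketch below is a minimal illustration on toy data; it is not the authors' evaluation code, the function name and data layout are assumptions, and the entry for compound 2 is invented rather than taken from any of the resources.

```python
from itertools import combinations

def agreement(a, b):
    """Share of compounds present in both resources that are bracketed identically."""
    shared = a.keys() & b.keys()
    if not shared:
        return 0.0
    same = sum(1 for key in shared if a[key] == b[key])
    return same / len(shared)

# Toy data only: compound id -> bracketing string.
resources = {
    "VC-PTB": {1: "[[consumer food] prices]", 2: "[noon [fashion show]]"},
    "DeepBank": {1: "[consumer [food prices]]", 2: "[noon [fashion show]]"},
    "PCEDT": {1: "[consumer [food prices]]", 2: "[noon [fashion show]]"},
}

pairwise = {
    (x, y): agreement(resources[x], resources[y])
    for x, y in combinations(resources, 2)
}

# Three-way agreement: all resources assign the same bracketing.
common_ids = set.intersection(*(set(r) for r in resources.values()))
all_agree = sum(
    1 for key in common_ids
    if len({r[key] for r in resources.values()}) == 1
) / len(common_ids)

print(pairwise)
print(all_agree)
```

Restricting the input dictionaries to compounds without proper nouns before calling the function would give the NNP0 variant of the same comparison.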
As we look closer at the compound instances where at least two of the three resources disagree, we find that some instances are easy to classify as annotation errors. For example, the compound New York streets is bracketed as right-branching in VC-PTB, but we can confidently say that this is a left-bracketing compound. Not all bracketing disagreements are that easy to resolve though; one example where both left- and right-bracketing can be accepted is European Common Market approach, which is bracketed as follows in DeepBank (1) and in PCEDT and VC-PTB (2):
- 1.
[[European [Common Market]] approach]
- 2.
[European [[Common Market] approach]]
Even though this work does not aim to resolve or correct the bracketing disagreement between the three resources, we will publish a tool that allows resource creators to inspect the bracketing disagreement and possibly correct it.
5 Relations
5.1 Data & Results
6 Conclusion & Future Work
References
- Ken Barker and Stan Szpakowicz. 1998. Semi-Automatic Recognition of Noun Modifier Relationships. In: Proceedings of the 17th International Conference on Computational Linguistics and the 36th Meeting of the Association for Computational Linguistics, page 96 – 102, Montreal, Quebec, Canada.
- Lou Burnard. 2000. Reference guide for the British National Corpus version 1.0.
- Silvie Cinková, Jan Hajič, Marie Mikulová, Lucie Mladová, Anja Nedolužko, Petr Pajas, Jarmila Panevová, Jiří Semecký, Jana Šindlerová, Josef Toman, Zdeňka Urešová, and Zdeněk Žabokrtský. 2006. Annotation of English on the tectogrammatical level: reference book. Technical report, Charles University, Prague. version 1.0.1.
- Murhaf Fares, Stephan Oepen, and Erik Velldal. 2015. Identifying Compounds: On The Role of Syntax. In International Workshop on Treebanks and Linguistic Theories, page 273 – 283, Warsaw, Poland.
- Timothy Wilking Finin. 1980. The Semantic Interpretation of Compound Nominals. PhD thesis, University of Illinois at Urbana-Champaign.
- Dan Flickinger, Yi Zhang, and Valia Kordoni. 2012. DeepBank. A dynamically annotated treebank of the Wall Street Journal. In: Proceedings of the 11th International Workshop on Treebanks and Linguistic Theories, page 85 – 96, Lisbon, Portugal. Edições Colibri.
- Roxana Girju, Dan Moldovan, Marta Tatu, and Daniel Antohe. 2005. On the semantics of noun compounds. Computer Speech & Language, 19(4):479 – 496.
- Jan Hajič, Eva Hajičová, Jarmila Panevová, Petr Sgall, Ondřej Bojar, Silvie Cinková, Eva Fučíková, Marie Mikulová, Petr Pajas, Jan Popelka, Jiří Semecký, Jana Šindlerová, Jan Štěpánek, Josef Toman, Zdeňka Urešová, and Zdeněk Žabokrtský. 2012. Announcing Prague Czech-English Dependency Treebank 2.0. In: Proceedings of the 8th International Conference on Language Resources and Evaluation, page 3153 – 3160, Istanbul, Turkey.
- Su Nam Kim and Timothy Baldwin. 2008. Standardised Evaluation of English Noun Compound Interpretation. In: Proceedings of the LREC Workshop: Towards a Shared Task for Multiword Expressions, page 39 – 42, Marrakech, Morocco.
- Su Nam Kim and Timothy Baldwin. 2013. A lexical semantic approach to interpreting and bracketing English noun compounds. Natural Language Engineering, 19(03):385 – 407.
- Paul Kingsbury and Martha Palmer. 2002. From TreeBank to PropBank. In: Proceedings of the 3rd International Conference on Language Resources and Evaluation, page 1989 – 1993, Las Palmas, Spain.
- Mark Lauer and Mark Dras. 1994. A probabilistic model of compound nouns. In: Proceedings of the 7th Australian Joint Conference on AI, page 474 – 481, Armidale, Australia.
- Mark Lauer. 1995. Designing Statistical Language Learners. Experiments on Noun Compounds. Doctoral dissertation, Macquarie University, Sydney, Australia.
- Judith N Levi. 1978. The syntax and semantics of complex nominals. Academic Press.
- Mitchell Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of English. The Penn Treebank. Computational Linguistics, 19:313 – 330.
- Adam Meyers, Ruth Reeves, Catherine Macleod, Rachel Szekely, Veronika Zielinska, Brian Young, and Ralph Grishman. 2004. Annotating noun argument structure for NomBank. In: Proceedings of the 4th International Conference on Language Resources and Evaluation, page 803 – 806, Lisbon, Portugal.
- Adam Meyers. 2007. Annotation guidelines for NomBank-noun argument structure for PropBank. Technical report, New York University.
- Preslav Ivanov Nakov. 2007. Using the Web as an Implicit Training Set: Application to Noun Compound Syntax and Semantics. Doctoral dissertation, EECS Department, University of California, Berkeley.
- Vivi Nastase and Stan Szpakowicz. 2003. Exploring Noun-Modifier Semantic Relations. In Fifth International Workshop on Computational Semantics, page 285 – 301.
- Diarmuid Ó Séaghdha and Ann Copestake. 2007. Co-occurrence Contexts for Noun Compound Interpretation. In: Proceedings of the Workshop on A Broader Perspective on Multiword Expressions, page 57 – 64, Prague, Czech Republic. Association for Computational Linguistics.
- Diarmuid Ó Séaghdha. 2008. Learning compound noun semantics. Technical Report UCAM-CL-TR-735, University of Cambridge, Computer Laboratory, Cambridge, UK.
- Stephan Oepen, Marco Kuhlmann, Yusuke Miyao, Daniel Zeman, Silvie Cinková, Dan Flickinger, Jan Hajič, Angelina Ivanova, and Zdeňka Urešová. 2016. Towards Comparability of Linguistic Graph Banks for Semantic Parsing. In: Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC), page 3991 – 3995, Portorož, Slovenia. European Language Resources Association.
- Stephen Tratz and Eduard Hovy. 2010. A taxonomy, dataset, and classifier for automatic noun compound interpretation. In: Proceedings of the 48th Meeting of the Association for Computational Linguistics, page 678 – 687, Uppsala, Sweden.
- Stephen Tratz. 2011. Semantically-enriched parsing for natural language understanding. Doctoral dissertation, University of Southern California.
- David Vadas and James Curran. 2007. Adding Noun Phrase Structure to the Penn Treebank. In: Proceedings of the 45th Meeting of the Association for Computational Linguistics, page 240 – 247, Prague, Czech Republic.
 | Author | volume | Date Value | title | type | journal | titleUrl | doi | note | year |
---|---|---|---|---|---|---|---|---|---|---|
2016 ADatasetforJointNounNounCompoun | Murhaf Fares | | | A Dataset for Joint Noun-Noun Compound Bracketing and Interpretation | | | | | | 2016 |