CORA Citation Matching Benchmark Task


A CORA Citation Matching Benchmark Task is a Benchmark Task for the citation matching task that is based on the CORA corpus of research paper citations.



References

Sample labeled records from the CORA corpus are shown below (the <NEWREFERENCE> markers delimit records, each pair of records below cites the same paper, and raw-data noise such as "Tech--nical" and "Mas-sachusetts" is preserved verbatim):

  • brodley1992 <author> Brodley, C. E. & Utgoff, P. E. </author> <year> (1992), </year> <title> Multivariate versus univariate decision trees, </title> <type> Tech--nical Report COINS TR 92-8, </type> <institution> Department of Computer Science, University of Mas-sachusetts,</institution><address> Amherst, MA, </address>
  • <NEWREFERENCE>1 brodley1992 <author> Brodley, C. E. & Utgoff, P. E. </author> <year> (1992), </year> <title> Multivariate versus univariate decision trees, </title> <type> Technical Report COINS TR 92-8, </type> <institution> Department of Computer Science, University of Massachusetts,</institution>,<address> Amherst, MA, </address>
  • fahl-labeled
    • <NEWREFERENCE>0 aha1987 <author> Kibler, D. & Aha, D. W. </author> <year> (1987). </year> <title> Learning Representative Exemplars of Concepts: An Initial Case Study. </title> <booktitle> Proceedings of the Fourth International Workshop on Machine Learning</booktitle> <pages> (pp. 24-30). </pages> <address> Irvine, CA: </address> <publisher> Morgan Kaufmann. </publisher>
    • <NEWREFERENCE>1 aha1987 <author> Kibler, D., & Aha, D. W. </author> <year> (1987). </year> <title> Learning representative exemplars of concepts: An initial case study. </title> <booktitle> In: Proceedings of the Fourth International Workshop on Machine Learning</booktitle> <pages> (pp.24-30. </pages> <address> Irvine, CA: </address> <publisher> Morgan Kaufmann. </publisher>
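The tagged format above is regular enough to pull apart with a regular expression. Below is a minimal sketch, assuming each record is available as one string; the parse_cora_record helper and its handling of the untagged citation key are illustrative assumptions, not part of any official CORA tooling.

```python
import re

# Matches one <tag> value </tag> span; the tag set (author, year, title, ...)
# comes from the sample records above.
TAG_RE = re.compile(r"<(?P<tag>\w+)>(?P<value>.*?)</(?P=tag)>", re.DOTALL)

def parse_cora_record(record: str) -> dict:
    """Return a {field: value} dict for one tagged citation string.

    Unclosed markers such as <NEWREFERENCE> are simply skipped.
    """
    return {m.group("tag"): m.group("value").strip()
            for m in TAG_RE.finditer(record)}

record = ("brodley1992 <author> Brodley, C. E. & Utgoff, P. E. </author> "
          "<year> (1992), </year> <title> Multivariate versus univariate "
          "decision trees, </title> <type> Tech--nical Report COINS TR 92-8, </type>")

fields = parse_cora_record(record)
print(fields["title"])  # -> "Multivariate versus univariate decision trees,"
```

Note that the segmentation noise ("Tech--nical") survives parsing: the fields come out exactly as labeled, which is precisely what makes matching these records non-trivial.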

2009

  • (Wick et al., 2009) ⇒ Michael Wick, Aron Culotta, Khashayar Rohanimanesh, and Andrew McCallum. (2009). “An Entity Based Model for Coreference Resolution.” In: Proceedings of the SIAM International Conference on Data Mining (SDM 2009).
    • For our citation matching experiments, we use the CORA corpus, a collection of research paper citations and authors, to evaluate our approach. The corpus contains 1295 citations referring to 134 different research papers for an average cluster size of roughly ten citations per cluster.
    • We focus our experiments on the citation matching task using the following attributes of a citation:
      • venue; publication date; publisher; publication title; volume; page numbers
    • The attribute values in CORA are imperfect and contain a variety of errors including human-introduced typos, as well as extraction errors from automated segmentation algorithms. For example, a researcher’s name may be incorrectly segmented and become part of the title instead of being contained in the list of authors.
    • Furthermore, the citations were originally created by different authors and come from a variety of publication venues with different citation formats. For example, page numbers may be written “pp 22-33” or “pages 22-33”, and dates “2003”, “jan 03”, or “01/03” (see the normalization sketch after this excerpt). Additionally, there is a wide variety of ways to include information about the venue: some citations contain “in the proceedings of the 23rd...” or “In: Proceedings of twenty-third annual...”, while others omit that information entirely and do not include the annual conference number.
    • These various sources of error and heterogeneity make CORA an ideal and realistic testing ground for coreference resolution.
    • For our experiments, we performed three-fold cross-validation using the same splits provided by Poon and Domingos [13].
    • The current state-of-the-art model by Poon and Domingos [13] on the CORA dataset jointly models segmentation and coreference, achieving 95.6% pairwise F1, which is only slightly higher than our 94.7% (a sketch of the pairwise F1 computation follows this excerpt).
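As a concrete illustration of the field heterogeneity quoted above, the sketch below maps the page-number variants (“pp 22-33” vs. “pages 22-33”) and date variants (“2003”, “jan 03”, “01/03”) to canonical forms. The normalization rules are illustrative assumptions, not the features actually used by Wick et al. or by Poon and Domingos.

```python
import re
from typing import Optional

PAGES_RE = re.compile(r"(?:pp\.?|pages)\s*(\d+)\s*-+\s*(\d+)", re.IGNORECASE)

def normalize_pages(text: str) -> Optional[str]:
    """Map 'pp 22-33' or 'pages 22-33' to the canonical form '22-33'."""
    m = PAGES_RE.search(text)
    return f"{m.group(1)}-{m.group(2)}" if m else None

def normalize_year(text: str) -> Optional[int]:
    """Map '2003', 'jan 03', or '01/03' to a four-digit year."""
    m = re.search(r"\b(19|20)\d{2}\b", text)  # already four digits
    if m:
        return int(m.group(0))
    # Two-digit year after a month name or an 'MM/' prefix; the century
    # cutoff below is an arbitrary assumption for this sketch.
    m = re.search(r"(?:[a-z]{3}\s+|\d{1,2}/)(\d{2})\b", text.lower())
    if m:
        yy = int(m.group(1))
        return 1900 + yy if yy > 30 else 2000 + yy
    return None

assert normalize_pages("pp 22-33") == normalize_pages("pages 22-33") == "22-33"
assert normalize_year("2003") == normalize_year("jan 03") == normalize_year("01/03") == 2003
```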
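Pairwise F1, the metric behind the 95.6% and 94.7% figures above, scores a clustering by the set of unordered citation pairs it places in the same cluster, compared against the pairs implied by the gold clustering. A minimal sketch follows; the cluster contents are made up for illustration, and this is not the authors' evaluation code.

```python
from itertools import combinations

def coreferent_pairs(clusters):
    """All unordered pairs of items that share a cluster."""
    return {pair for cluster in clusters
            for pair in combinations(sorted(cluster), 2)}

def pairwise_f1(gold_clusters, pred_clusters):
    """Harmonic mean of pairwise precision and recall."""
    gold = coreferent_pairs(gold_clusters)
    pred = coreferent_pairs(pred_clusters)
    tp = len(gold & pred)                        # correctly linked pairs
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0

gold = [{"c1", "c2", "c3"}, {"c4", "c5"}]    # two papers, five citations
pred = [{"c1", "c2"}, {"c3"}, {"c4", "c5"}]  # prediction splits one cluster
print(round(pairwise_f1(gold, pred), 3))     # 0.667 (P = 1.0, R = 0.5)
```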

2007