ART Corpus
See: ART Corpus, Annotated Corpus, SAPIENT System, Scientific Research Paper, Physical Chemistry Research, Biochemistry Research.
References
2010
- (Liakata et al., 2010) ⇒ Maria Liakata, Simone Teufel, Advaith Siddharthan, Colin R. Batchelor. (2010). “Corpora for the Conceptualisation and Zoning of Scientific Papers.” In: Proceedings of LREC Conference (LREC 2010).
- QUOTE: We used the CoreSC annotation scheme and the semantic annotation tool SAPIENT (Liakata et al., 2009) to construct a corpus of 265 annotated papers (Liakata and Soldatova, 2009) from physical chemistry and biochemistry. The CoreSC corpus was developed in two different phases
2009
- (Liakata & Soldatova, 2009) ⇒ Maria Liakata and Larisa N. Soldatova. (2009). “The ART Corpus.” Technical report, Aberystwyth University.
- QUOTE: The ART corpus consists of 225 papers manually annotated with the CISP labels (i.e. “Goal”, “Method”, “Result”). The ART Corpus is >1 million words, 35,040 sentences. These papers cover topics in physical chemistry and biochemistry and were provided by the Royal Society of Chemistry (RSC) Publishing. The Corpus was developed primarily to add value to scientific papers, through semantic markup that would make it easier for natural language processing and semantic web applications to automatically extract information pertaining to core scientific concepts. The ART corpus can also be used as a training set for machine learning algorithms, in order to automate the annotation of papers with CISP meta-data. The corpus is available as a collection of 225 .xml files, where each file corresponds to a separate paper whose sentences have been annotated individually with core scientific concepts.
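Because the corpus is distributed as one .xml file per paper with sentence-level annotations, a per-paper label tally is a natural first step before using it as training data. The following is a minimal sketch only: the element names `s` and `annotation`, the attribute `type`, and the inline sample document are all hypothetical placeholders chosen for illustration, not the actual ART schema, and should be adjusted after inspecting a real corpus file.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical names: the real ART corpus schema may use different
# element and attribute names. Inspect a corpus file and adjust these.
SENTENCE_TAG = "s"
ANNOTATION_TAG = "annotation"
LABEL_ATTR = "type"


def count_labels(xml_text,
                 sentence_tag=SENTENCE_TAG,
                 annotation_tag=ANNOTATION_TAG,
                 label_attr=LABEL_ATTR):
    """Count how many annotated sentences carry each concept label."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    for sentence in root.iter(sentence_tag):
        # Use the first annotation element found inside the sentence.
        annotation = sentence.find(f".//{annotation_tag}")
        if annotation is None:
            continue  # sentence carries no annotation
        label = annotation.get(label_attr)
        if label:
            counts[label] += 1
    return counts


# Invented sample document standing in for one annotated paper.
SAMPLE = """<paper>
  <s sid="1"><annotation type="Goal">We aim to measure the rate.</annotation></s>
  <s sid="2"><annotation type="Method">Samples were heated.</annotation></s>
  <s sid="3"><annotation type="Method">Spectra were recorded.</annotation></s>
  <s sid="4"><annotation type="Result">The rate doubled.</annotation></s>
  <s sid="5">An unannotated sentence.</s>
</paper>"""

if __name__ == "__main__":
    for label, n in sorted(count_labels(SAMPLE).items()):
        print(label, n)
```

To process a file on disk instead of the inline sample, read its contents and pass the text to `count_labels`, supplying the tag and attribute names that the corpus files actually use.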