2008 TheITITXMCorpora

Subject Headings: Information Extraction, ITI TXM Corpora

Notes

  • This paper reports on a thorough annotation project of PubMed papers with several annotators working for over a year.
  • The project resulted in two different corpora: one for protein-protein interactions (PPI), the other for tissue expression (TE).
  • The resulting corpus consists of 217 documents, 133 selected from PubMedCentral and 84 documents selected from the whole of PubMed. Document selection for the TE corpus was performed against PubMed.
  • Annotation was performed by a group of nine biologists, all qualified to PhD level in biology, working under the supervision of an annotation manager (also a biologist) and collaborating with a team of NLP researchers.
  • They provide a useful characterization of relation attributes and properties (e.g. IsPositive, IsDirect, IsProven).
  • Before I publish the PPLRE dataset I may borrow their XML markup approach.
  • I asked for the proportion of inter-sentential relations, and they provided an anecdotal estimate of 10% for PPI and 30% for TE.
  • The paper claims to present the first sizeable corpus for tissue expression (TE).
  • The ITI TXM corpora were created as part of an ITI Life Sciences Scotland (http://www.itilifesciences.com) research programme with Cognia EU and the University of Edinburgh.

Quotes

Abstract

We report on two large corpora of semantically annotated full-text biomedical research papers created in order to develop information extraction (IE) tools for the TXM project. Both corpora have been annotated with a range of entities (CellLine, Complex, DevelopmentalStage, Disease, DrugCompound, ExperimentalMethod, Fragment, Fusion, GOMOP, Gene, Modification, mRNAcDNA, Mutant, Protein, Tissue), normalisations of selected entities to the NCBI Taxonomy, RefSeq, EntrezGene, ChEBI and MeSH and enriched relations (protein-protein interactions, tissue expressions and fragment- or mutant-protein relations). While one corpus targets protein-protein interactions (PPIs), the focus of the other is on tissue expressions (TEs). This paper describes the selected markables and the annotation process of the ITI TXM corpora, and provides a detailed breakdown of the inter-annotator agreement (IAA).

In a sentence such as “Protein A interacts with B in the presence of Drug C but not D.”, the annotators would mark two PPI relations between “A” and “B”, one Positive with “C” as a DrugCompound attribute, and the other Negative with “D” as a DrugCompound attribute.
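The example sentence above can be sketched as a small data structure. This is only an illustration of the idea of two relations that share arguments but differ in polarity and attribute. The `Relation` class and its field names are hypothetical and are not the corpus's actual XML schema.

```python
from dataclasses import dataclass, field


@dataclass
class Relation:
    """One annotated relation (hypothetical layout, not the ITI TXM markup)."""
    rel_type: str                 # relation type, e.g. "PPI"
    args: tuple                   # the two related entity mentions
    is_positive: bool             # IsPositive property: Positive vs Negative
    attributes: dict = field(default_factory=dict)  # attribute type -> entity mention


# "Protein A interacts with B in the presence of Drug C but not D."
relations = [
    Relation("PPI", ("A", "B"), True, {"DrugCompound": "C"}),
    Relation("PPI", ("A", "B"), False, {"DrugCompound": "D"}),
]

for r in relations:
    polarity = "Positive" if r.is_positive else "Negative"
    print(r.rel_type, r.args, polarity, r.attributes)
```

Keeping the drug as an attribute of each relation, rather than as a third argument, is what lets the same protein pair carry one Positive and one Negative annotation.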


Table 2: Entity types and counts in each corpus. A long dash indicates that the entity was not marked in that corpus.

  Entity type          PPI      TE
  CellLine             7,676    —
  Complex              7,668    4,033
  DevelopmentalStage   —        1,754
  Disease              —        2,432
  DrugCompound         11,886   16,131
  ExperimentalMethod   15,311   9,803
  Fragment             13,412   4,466
  Fusion               4,344    1,459
  GOMOP                —        4,647
  Gene                 —        12,059
  Modification         6,706    —
  mRNAcDNA             —        8,446
  Mutant               4,829    1,607
  Protein              88,607   60,782
  Tissue               —        36,029


Table 4: Relation types in each corpus.

  Corpus  Relation type  Count   Description
  PPI     PPI()          11,523  Indicates that the text is referring to an interaction between Proteins, Fragments, Mutants, Complexes or Fusions.
  PPI     FRAG()         16,002  Connects Fragment or Mutant to its parent Protein.
  TE      TE()           12,426  Links a gene or gene product to a Tissue, indicating that the text is stating that the gene or gene product is expressed in that Tissue.
  TE      FRAG()         4,735   Connects Fragment or Mutant to its parent Protein.


Table 5: Property names, values and counts in each corpus. A long dash indicates that the property was not marked in this corpus.

  Name        Value        PPI      TE
  IsPositive  Positive     10,718   10,243
  IsPositive  Negative     836      2,067
  IsDirect    Direct       7,599    —
  IsDirect    NotDirect    3,977    —
  IsProven    Proven       7,562    9,694
  IsProven    Referenced   2,894    1,837
  IsProven    Unspecified  1,096    736
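As a small worked example, the IsPositive counts in Table 5 give the share of negated relations in each corpus. The sketch below only does arithmetic on the table's numbers; the variable and function names are illustrative.

```python
# IsPositive counts copied from Table 5.
counts = {
    "PPI": {"Positive": 10718, "Negative": 836},
    "TE": {"Positive": 10243, "Negative": 2067},
}


def negative_share(corpus):
    """Fraction of IsPositive-marked relations in a corpus that are Negative."""
    c = counts[corpus]
    return c["Negative"] / (c["Positive"] + c["Negative"])


for corpus in counts:
    print(f"{corpus}: {negative_share(corpus):.1%} negative")
# PPI: 7.2% negative
# TE: 16.8% negative
```

By this count, negative statements are more than twice as frequent in the TE corpus as in the PPI corpus.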


Table 12: IAA of attributes (in F1) in the PPI corpus. The total number of true positives is shown in brackets.

  Name                       IAA
  ModificationBeforeEntity   65.3 (31)
  ModificationAfterEntity    86.7 (248)
  DrugTreatmentEntity        45.4 (61)
  CellLineEntity             64.0 (244)
  ExperimentalMethodEntity   36.9 (94)
  MethodEntity               55.4 (274)
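To make the F1-based IAA figures concrete, here is a minimal sketch of how agreement between two annotators could be scored by treating one annotator's output as the reference. The exact-match criterion, the tuple layout and the sample annotations are all assumptions made for illustration; the matching rules the authors actually used are not described on this page.

```python
def f1_agreement(reference, other):
    """Return (F1, true positives) for two annotators' sets of annotations.

    Assumes an annotation counts as agreed only if both annotators
    produced exactly the same tuple.
    """
    ref, oth = set(reference), set(other)
    tp = len(ref & oth)
    if tp == 0:
        return 0.0, 0
    precision = tp / len(oth)
    recall = tp / len(ref)
    return 2 * precision * recall / (precision + recall), tp


# Hypothetical attribute annotations: (relation id, attribute name, entity id).
annotator_a = {
    ("r1", "CellLineEntity", "e7"),
    ("r2", "CellLineEntity", "e9"),
    ("r3", "MethodEntity", "e4"),
    ("r4", "MethodEntity", "e5"),
}
annotator_b = {
    ("r1", "CellLineEntity", "e7"),
    ("r3", "MethodEntity", "e4"),
    ("r5", "MethodEntity", "e2"),
}

score, true_positives = f1_agreement(annotator_a, annotator_b)
print(f"F1 = {score * 100:.1f} ({true_positives})")
```

F1 is symmetric in the two annotators, so the choice of which one serves as the reference does not change the score.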



2008 TheITITXMCorpora

  Authors: Claire Grover, Bea Alex, Barry Haddow, Mijail Kabadjov, Ewan Klein, Michael Matthews, Stuart Roebuck, Richard Tobin, Xinglong Wang
  Title: The ITI TXM Corpora: Tissue Expressions and Protein-Protein Interactions
  Venue: Proceedings of the Workshop on Building and Evaluating Resources for Biomedical Text Mining, collocated with LREC-2008
  URL: http://www.ltg.ed.ac.uk/np/publications/ltg/papers/Alex2008Corpora.pdf
  Year: 2008