2010 BuildingASemAnnCorpusOfClinicRecs

(Roberts, Gaizauskas et al., 2010) ⇒ Angus Roberts, Robert Gaizauskas, Mark Hepple, George Demetriou, Yikun Guo, Ian Roberts, Andrea Setzer. (2010). “Building a Semantically Annotated Corpus of Clinical Texts.” In: Journal of Biomedical Informatics, 42 (5). doi:10.1016/j.jbi.2008.12.013

Subject Headings: Semantically Annotated Corpus, Clinical Record, Annotation Methodology.

Notes

It references UMLS.
It extends the CLEF Corpus.

Quotes

Author Key words

Corpora; Semantic annotation; Clinical text; Natural language processing; Gold standards; Evaluation; Information Extraction; Text mining; Temporal annotation; Annotation guidelines

Abstract

In this paper we describe the construction of a semantically annotated corpus of clinical texts for use in the development and evaluation of systems for automatically extracting clinically significant information from the textual component of patient records. The paper details the sampling of textual material from a collection of 20,000 cancer patient records, the development of a semantic annotation scheme, the annotation methodology, the distribution of annotations in the final corpus, and the use of the corpus for development of an adaptive information extraction system. The resulting corpus is the most richly semantically annotated resource for clinical text processing built to date, whose value has been demonstrated through its use in developing an effective information extraction system. The detailed presentation of our corpus construction and annotation methodology will be of value to others seeking to build high-quality semantically annotated corpora in biomedical domains.

1. Introduction

We describe the creation of a semantically annotated corpus of clinical texts. The documents of this corpus are drawn from the free text component of patient records, and the annotations capture clinically significant information communicated by these texts. The corpus is intended for use in developing and evaluating systems that can automatically extract this kind of clinically significant information from the textual component of patient records. The corpus has been created within the context of the CLinical E-Science Framework (CLEF) project [1]: a multi-site research project that has been developing the technology and techniques required for a high quality repository of electronic patient records. Such a repository must meet high standards of security and interoperability, and should enable ethical and user-friendly access to patient information, so as to facilitate both clinical care and biomedical research. CLEF has chosen to work in the area of cancer informatics, as one of the project partners

References

[1] Rector A, Rogers J, Taweel A, Ingram D, Kalra D, Milan J, et al. CLEF — joining up healthcare with clinical and post-genomic research. In: Proceedings of UK e-Science All Hands Meeting 2003. Nottingham, UK; 2003. p. 264–267.
[2] Grishman R. Information Extraction. In: Mitkov R, editor. The Oxford Handbook of Computational Linguistics; 2003. Chapter 30.
[3] Harkema H, Roberts I, Gaizauskas R, Hepple M. Information Extraction from Clinical Records. In: Cox SJ, Walker DW, editors. Proceedings of the UK e-Science All Hands Meeting 2005. Nottingham, UK; 2005. p. 254–258.
[4] Riloff E. Automatically Generating Extraction Patterns from Untagged Text. In: AAAI/IAAI, Vol. 2; 1996. p. 1044–1049.
[5] Roberts A, Gaizauskas R, Hepple M, Davis N, Demetriou G, Guo Y, et al. The CLEF Corpus: Semantic Annotation of Clinical Text. In: Proc AMIA Symp. Chicago, IL, USA; 2007. p. 625–629.
[6] Roberts A, Gaizauskas R, Hepple M, Demetriou G, Guo Y, Setzer A, et al. Semantic Annotation of Clinical Text: The CLEF Corpus. In: Proceedings of Building and evaluating resources for biomedical text mining: workshop at Sixth International Conference on Language Resources and Evaluation, LREC 2008. Marrakech, Morocco: ELRA; 2008.
[7] Kim JD, Ohta T, Tateisi Y, Tsujii J. GENIA corpus — a semantically annotated corpus for bio-textmining. Bioinformatics. 2003;19(1):i180–i182.
[8] Kim JD, Ohta T, Tsujii J. Corpus annotation for mining biomedical events from literature. BMC Bioinformatics. 2008;9(1).
[9] Kulick S, Bies A, Liberman M, Mandel M, McDonald R, Palmer M, et al. Integrated Annotation for Biomedical Information Extraction. In: Hirschman L, Pustejovsky J, editors. HLT-NAACL 2004 Workshop: BioLINK 2004, Linking Biological Literature, Ontologies and Databases. Boston, Massachusetts, USA: Association for Computational Linguistics; 2004. p. 61–68.
[10] Franzén K, Eriksson G, Olsson F, Asker L, Lidén P, et al. Protein names and how to find them. Int J Med Inform. 2002;67(1–3):49–61.
[11] Rosario B, Hearst MA. Classifying semantic relations in bioscience texts. In: Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics. Morristown, NJ, USA: Association for Computational Linguistics; 2004. p. 430.
[12] Rosario B, Hearst MA. Multi-way relation classification: application to protein-protein interactions. In: Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing. Morristown, NJ, USA: Association for Computational Linguistics; 2005. p. 732–739.
[13] Alex B, Grover C, Haddow B, Kabadjov M, Klein E, Matthews M, et al. The ITI TXM Corpora: Tissue Expressions and Protein-Protein Interactions. In: Proceedings of Building and evaluating resources for biomedical text mining: Workshop at Sixth International Conference on Language Resources and Evaluation, LREC 2008. Marrakech, Morocco; 2008. p. 11–18. In press.
[14] Tanabe L, Xie N, Thom LH, Matten W, Wilbur WJ. GENETAG: a tagged corpus for gene/protein named entity recognition. BMC Bioinformatics. 2005;6(Suppl 1)(S3).
[15] Nédellec C. Learning Language in Logic - Genic Interaction Extraction Challenge. In: Proceedings of the ICML05 Workshop on Learning Language in Logic. Bonn, Germany; 2005. p. 31–37.
[16] TREC Genomics Track. [cited 6 June 2008]; Available from http://ir.ohsu.edu/genomics.
[17] Pestian JP, Brew C, Matykiewicz P, Hovermale D, Johnson N, Cohen KB, et al. A shared task involving multi-label classification of clinical free text. In: Biological, translational, and clinical language processing. Prague, Czech Republic: Association for Computational Linguistics; 2007. p. 97–104.
[18] Hersh WR, Muller H, Jensen JR, Yang J, Gorman PN, Ruch P. Advancing Biomedical Image Retrieval: Development and Analysis of a Test Collection. J Am Med Inform Assoc. 2006;13(5):488–496.
[19] Müller H, Deselaers T, Lehmann TM, Clough PD, Hersh W. Overview of the ImageCLEFmed 2006 medical retrieval and annotation tasks. In: Cross Language Evaluation Forum (CLEF) Workshop 2006. vol. 4730. Alicante, Spain: Springer; 2007. p. 595–608.
[20] i2b2 NLP shared task. [cited 6 June 2008]; Available from http://ir.ohsu.edu/genomics/.
[21] Ogren PV, Savova G, Buntrock JD, Chute CG. Building and Evaluating Annotated Corpora for Medical NLP Systems. In: Proc AMIA Symp; 2006. p. 1050.
[22] Meystre S, Haug PJ. Natural language processing to extract medical problems from electronic clinical documents: Performance evaluation. Journal of Biomedical Informatics. 2006;39(6):589–599.
[23] Denny JC, Smithers JD, Miller RA, Spickard A. “Understanding” Medical School Curriculum Content Using KnowledgeMap. Journal of the American Medical Informatics Association. 2003;10(4):351–362.
[24] Elkin PL, Brown SH, Bauer BA, Husser CS, Carruth W, Bergstrom LR, et al. A controlled trial of automated classification of negation from clinical notes. BMC Medical Informatics and Decision Making. 2005;5(13).
[25] Friedman C, Hripcsak G. Evaluating natural language processors in the clinical domain. Methods of Information in Medicine. 1998;37(4-5):334–44.
[26] International Classification of Diseases (ICD). [cited 6 June 2008]; Available from http://www.who.int/classifications/icd.
[27] Rogers J, Puleston C, Rector A. The CLEF Chronicle: Patient Histories Derived from Electronic Health Records. Data Engineering Workshops, 2006 Proceedings 22nd International Conference on. 2006;p. x109–x109.
[28] Hallett C, Power R, Scott D. Summarisation and Visualisation of e-Health Data Repositories. In: Proceedings of the UK e-Science All Hands Meeting. Nottingham, UK; 2006. p. 69–77.
[29] Gennari JH, Musen MA, Fergerson RW, Grosso WE, Crubézy M, Eriksson H, et al. The evolution of Protégé: an environment for knowledge-based systems development. International Journal of Human-Computer Studies. 2003;58(1):89–123.
[30] Ogren PV. Knowtator: a Protégé plug-in for annotated corpus construction. In: Proceedings of the 2006 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology. Morristown, NJ, USA: Association for Computational Linguistics; 2006. p. 273–275.
[31] Defense Advanced Research Projects Agency. Proceedings of the Seventh Message Understanding Conference (MUC-7); 1998. Available at http://www.itl.nist.gov/iaui/894.02/related_projects/muc/.
[32] Boisen S, Crystal MR, Schwartz R, Stone R, Weischedel R. Annotating resources for information extraction. In: Proceedings of the Second Language Resources and Evaluation, LREC 2000; 2000. p. 1211–1214.
[33] Demetriou G, Gaizauskas R, Sun H, Roberts A. ANNALIST – ANNotation ALIgnment and Scoring Tool. In: Proceedings of the Sixth International Conference on Language Resources and Evaluation, LREC 2008. Marrakech, Morocco: ELRA; 2008. In press.
[34] Hripcsak G, Rothschild A. Agreement, F-measure and reliability in information retrieval. J Am Med Inform Assoc. 2005 May-June;12(3):296–298.
[35] Cunningham H, Maynard D, Bontcheva K, Tablan V. GATE: A Framework and Graphical Development Environment for Robust NLP Tools and Applications. In: Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics. Philadelphia, PA, USA; 2002. p. 168–175.
[36] GATE – General Architecture for Text Engineering. [cited 6 June 2008]; Available from http://gate.ac.uk.
[37] UMLS Knowledge Sources, 2007AB; 2007.
[38] Pustejovsky J, Castaño J, Ingria R, Saurí R, Gaizauskas R, Setzer A, et al. TimeML: Robust Specification of Event and Temporal Expressions in Text. In: Proceedings of the Fifth International Workshop on Computational Semantics (IWCS-5). Tilburg; 2003.
[39] Verhagen M, Gaizauskas R, Schilder F, Hepple M, Katz G, Pustejovsky J. SemEval-2007 Task 15: TempEval Temporal Relation Identification. In: Proceedings of the 4th International Workshop on Semantic Evaluations. Prague; 2007. p. 75–80.
[40] Mani I, Wilson G. Robust temporal processing of news. In: Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics (ACL 2000). New Brunswick, New Jersey; 2000. p. 69–76.
[41] Harkema H, Gaizauskas R, Hepple M, Davis N, Guo Y, Roberts A, et al. A Large-Scale Resource for Storing and Recognizing Technical Terminology. In: Proceedings of 4th International Conference on Language Resources and Evaluation. Lisbon, Portugal; 2004. p. 83–86.
[42] Lindberg D, Humphreys B, McCray A. The Unified Medical Language System. Methods Inf Med. 1993;32(4):281–291.
[43] Li Y, Bontcheva K, Cunningham H. SVM Based Learning System for Information Extraction. In: Deterministic and statistical methods in machine learning: first international workshop. No. 3635 in Lecture Notes in Computer Science. Springer; 2005. p. 319–339.
[44] Roberts A, Gaizauskas R, Hepple M, Guo Y. Combining terminology resources and statistical methods for entity recognition: an evaluation. In: Proceedings of the Sixth International Conference on Language Resources and Evaluation, LREC 2008. Marrakech, Morocco; 2008.
[45] Roberts A, Gaizauskas R, Hepple M. Extracting Clinical Relationships from Patient Narratives. In: Proceedings of the Workshop on BioNLP 2008. Columbus, OH, USA: Association for Computational Linguistics; 2008.
[46] Thompson CA, Califf ME, Mooney RJ. Active learning for natural language parsing and information extraction. In: Proceedings of the 16th International Conference on Machine Learning. Morgan Kaufmann, San Francisco, CA; 1999. p. 406–414.
[47] Ghani R, Jones R, Mitchell T, Riloff E. Active Learning For Information Extraction With Multiple View Feature Sets. In: Proceedings of the 20th International Conference on Machine Learning (ICML 2003) Workshop on Adaptive Text Extraction and Mining; 2003.
[48] SAFE, the Semantic Annotation Factory Environment. [cited 2 October 2008]; Available from http://gate.ac.uk/safe/.
[49] BioNotate. [cited 2 October 2008]; Available from http://sourceforge.net/projects/bionotate/.
[50] Clinical E-Science Framework: Sheffield NLP. [cited 2 October 2008]; Available from http://nlp.shef.ac.uk/clef/.


	Author	volume	Date Value	title	type	journal	titleUrl	doi	note	year
2010 BuildingASemAnnCorpusOfClinicRecs	Angus Roberts Robert Gaizauskas Mark Hepple George Demetriou Yikun Guo Ian Roberts Andrea Setzer			Building a Semantically Annotated Corpus of Clinical Texts			http://eprints.whiterose.ac.uk/10186/	10.1016/j.jbi.2008.12.013
