GENETAG Corpus
A GENETAG Corpus is a benchmark annotated corpus for Biomedical NER.
- Context:
- It can be downloaded from ftp://ftp.ncbi.nlm.nih.gov/pub/lsmith/MedTag/medtag.tar.gz
- See: Genia Corpus.
References
2005
- (2005_GENETAG) ⇒ Lorraine Tanabe, Natalie Xie, Lynne H. Thom, Wayne Matten, and W. John Wilbur. (2005). “GENETAG: a tagged corpus for gene/protein named entity recognition.” In: BMC Bioinformatics, 6(Suppl 1):S3. doi:10.1186/1471-2105-6-S1-S3
- QUOTE: We describe the construction and annotation of GENETAG, a corpus of 20K MEDLINE® sentences for gene/protein NER. 15K GENETAG sentences were used for the BioCreAtIvE Task 1A Competition. … Each sentence in GENETAG was annotated with acceptable alternatives to the gene/protein names it contained, allowing for partial matching with semantic constraints. … The annotation of GENETAG required intricate manual judgments by annotators which hindered tagging consistency. The data were pre-segmented into words, to provide indices supporting comparison of system responses to the “gold standard”. However, character-based indices would have been more robust than word-based indices. GENETAG Train, Test and Round1 data and ancillary programs are freely available at ftp://ftp.ncbi.nlm.nih.gov/pub/tanabe/GENETAG.tar.gz. A newer version of GENETAG-05, will be released later this year.
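The quote's remark about word-based versus character-based indices can be illustrated with a minimal sketch. It assumes whitespace tokenization, 0-based inclusive word indices, and a made-up example sentence; the function name `word_span_to_char_span` is hypothetical, and none of this reflects the actual GENETAG file format.

```python
def word_span_to_char_span(sentence, first_word, last_word):
    """Map an inclusive word-index span to a (start, end) character span.

    Assumptions for illustration only (not the documented GENETAG format):
    tokens are separated by whitespace and word indices are 0-based.
    The returned end offset is exclusive.
    """
    offsets = []
    pos = 0
    for token in sentence.split():
        # Locate each token at or after the end of the previous one.
        start = sentence.index(token, pos)
        end = start + len(token)
        offsets.append((start, end))
        pos = end
    return offsets[first_word][0], offsets[last_word][1]


# Hypothetical sentence, not taken from the corpus.
sentence = "Expression of the p53 tumor suppressor gene was reduced ."
start, end = word_span_to_char_span(sentence, 3, 6)
mention = sentence[start:end]
print(start, end, mention)
```

A word-index span such as (3, 6) only identifies the mention if every system tokenizes the sentence identically, whereas the character span still points at the same text if the tokenization changes.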