BioCreative Benchmark Task
A BioCreative Benchmark Task is a biomedical text mining benchmark task from the BioCreative Research Program.
- Context:
- It can (typically) be associated with a BioCreative Workshop.
- It can assess the performance of Information Extraction Systems on a Biomedical Corpus.
- It can include a Named Entity Recognition Task, for Biomedical Entities.
- It can include an Entity Mention Coreference Resolution Task, for Biomedical Entities.
- It can include an Entity Mention Normalization Task, for Biomedical Entities (a BioCreative Gene Normalization Benchmark Task).
- It can include a Semantic Relation Extraction Task, for Biomedical Entities.
- …
- Example(s):
- BioCreative I (2004)
- BioCreAtIvE I Evaluation Series: Task 1A, Task 1B, and Task 2.
- Task 2 was to assign Gene Ontology terms to human proteins and to select relevant evidence from full-text documents.
- BioCreative II (2006)
- BioCreative III (2010)
- …
- Counter-Example(s):
- See: Biomedical Entity, Information Extraction Task.
References
2017a
- (Wikipedia, 2017) ⇒ https://en.wikipedia.org/wiki/BioCreative Retrieved:2017-7-16.
- BioCreAtIvE (A critical assessment of text mining methods in molecular biology) consists of a community-wide effort for evaluating information extraction and text mining developments in the biological domain. Three main tasks were posed at the first BioCreAtIvE challenge: the entity extraction task, the gene name normalization task, and the functional annotation of gene products task. The data sets produced by this contest serve as a Gold Standard training and test set to evaluate and train Bio-NER tools and annotation extraction tools.
The second BioCreAtIvE included three tasks organized by Lynette Hirschman and Alex Morgan of MITRE; Alfonso Valencia and Martin Krallinger of CNIO in Spain; and W. John Wilbur, Lorrie Tanabe and Larry Smith of NIH.
BioCreative V will have 5 different tracks.
2017b
- (NCBI) ⇒ https://www.ncbi.nlm.nih.gov/research/bionlp/Research#BioCreative Retrieved:2017-07-16
- QUOTE: Critical Assessment of Information Extraction in Biology (BioCreative) is a community effort for evaluating text mining and information extraction systems applied to the biological domain. Since 2004, the BioCreative Evaluation series has included over ten different tasks such as ranking of relevant documents ("document triage"), extraction of genes and proteins ("gene mention") and their linkage to database identifiers ("gene normalization"), as well as creation of functional annotations in standard ontologies (e.g., GO) and extraction of entity-relations (e.g., protein-protein interaction). As part of the BioCreative executive committee, we have led the organization of multiple shared tasks in recent years such as:
- Chemical-Disease Relation Extraction - BioCreative 2015
- BioC: The BioCreative Interoperability Initiative - BioCreative 2015 & 2013
- Automatic Gene Ontology (GO) Annotation - BioCreative 2013
- Multi-species Gene Normalization (GN) - BioCreative 2010
2012
- http://www.biocreative.org/events/biocreative-iv/CFP/
- QUOTE: BioCreative: Critical Assessment of Information Extraction in Biology is a community-wide effort for evaluating text mining and information extraction systems applied to the biological domain. Built on the success of the previous BioCreative Challenge Evaluations and Workshops (BioCreative I, II, II.5, III, and the 2012 workshop) [1-5], the BioCreative Organizing Committee will host the BioCreative IV Challenge (http://www.biocreative.org/events/biocreative-iv/workshop/) at NCBI, National Institutes of Health, Bethesda, Maryland, on October 7-9, 2013. One key goal of BioCreative is the active involvement of the text mining user community in the design of the tracks, the preparation of corpora, and the testing of interactive systems. For BioCreative IV, the selection of the tracks has been driven in part by suggestions from the biocuration community during the BioCreative workshop 2012, and by our goal of addressing interoperability -- a major barrier to the adoption of text mining tools.
2011
- http://biocreative.sourceforge.net/biocreative_glossary.html
- BioCreAtIvE: Critical Assessment of Information Extraction systems in Biology challenge evaluation consists of a community-wide effort for evaluating text mining and information extraction systems applied to the biological domain.
2009
- (Wermter et al., 2009) ⇒ Joachim Wermter, Katrin Tomanek, and Udo Hahn. (2009). "High-performance gene name normalization with GeNo." In: Bioinformatics, 25(6).
- It obtains an F-measure of 86.4% (precision: 87.8%, recall: 85.0%) on the BioCreative II test set.
- It employs a carefully crafted suite of symbolic and statistical methods.
- It relies on publicly available software and data resources, including extensive background knowledge based on semantic profiling.
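The reported score follows from the standard F-measure formula, the harmonic mean of precision and recall; a minimal Python sketch of the arithmetic, using only the figures quoted above:

    # F-measure as the harmonic mean of precision and recall,
    # checked against the GeNo figures reported above.
    precision, recall = 0.878, 0.850
    f_measure = 2 * precision * recall / (precision + recall)
    print(f"F = {f_measure:.3f}")  # -> F = 0.864, i.e., the reported 86.4%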
2008
- (Morgan, 2008) ⇒ Alex Morgan. (2008). "Human gene/protein normalization." Published online at: http://biocreative.sourceforge.net/biocreative_2_gn.html
- QUOTE: Premise - Systems will be required to return the EntrezGene (formerly Locus Link) identifiers corresponding to the human genes and direct gene products appearing in a given MEDLINE abstract. This has relevance to improving document indexing and retrieval, and to linking text mentions to database identifiers in support of more sophisticated information extraction tasks. It is similar to Task 1B of BioCreAtIvE I [1].
System Input - Participating groups will be given a master list of human EntrezGene identifiers with some common gene and protein names (synonyms) for each identifier in the master list. For the evaluation task, the input is a collection of plain text abstracts.
System Output - For each abstract, the system will return a list of the EntrezGene identifiers and corresponding text excerpts for each human gene or gene product mentioned in the abstract. The excerpt required is a single mention of the gene's 'name' found in the abstract. Even if a gene is mentioned in several different places in an abstract with alternate names being used, only a single excerpt/mention is to be returned by the system. If desired, groups may also include a fourth column which contains a confidence measure that ranges from 0 (no confidence) to 1 (absolute confidence). This is not a part of the main evaluation, and is included as an option for interested groups at the request of some participants. The return format is a single file, with each entry on one line and the fields delimited by tabs. The columns should then be: PUBMED ID, EntrezGene (LocusLink) ID, Mention Text, and optionally Confidence. There should be no column headers or line numbers, and the fields should all be separated with tabs. Although the hand-annotated training file contains multiple text excerpts for each identifier, that is just meant to aid in training and only one would be expected from a participating system (any one of the set would be 'correct', although getting the right text is not the main part of the evaluation). An example line with made-up identifiers follows:
123456 987 foobar
If interested in the optional confidence numbers:
123456 987 foobar .87
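The tab-delimited return format above is straightforward to emit programmatically. Below is a minimal, hypothetical Python sketch (not the official BioCreative scorer or any actual participating system) of a dictionary-lookup normalizer that matches gene synonyms against an abstract and prints rows in the PUBMED ID / EntrezGene ID / Mention Text layout; the synonym list, identifiers, and abstract are illustrative only:

    # Hypothetical sketch of a dictionary-lookup gene normalizer emitting
    # the tab-delimited format described above. The synonym list and the
    # abstract are made-up illustrations, not BioCreative data.
    SYNONYMS = {
        "BRCA1": "672",            # gene name -> EntrezGene identifier
        "breast cancer 1": "672",
        "TP53": "7157",
        "p53": "7157",
    }

    def normalize(pubmed_id, abstract):
        """Return one (pubmed_id, gene_id, mention) row per gene found,
        keeping a single mention per identifier as the task requires."""
        rows, seen = [], set()
        for name, gene_id in SYNONYMS.items():
            if gene_id not in seen and name in abstract:
                rows.append((pubmed_id, gene_id, name))
                seen.add(gene_id)
        return rows

    abstract = "Mutations in BRCA1 interact with p53 in tumor suppression."
    for row in normalize("123456", abstract):
        # PUBMED ID <tab> EntrezGene ID <tab> Mention Text; an optional
        # fourth tab-separated Confidence column could be appended here.
        print("\t".join(row))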
2006
- (Krallinger, 2006) ⇒ Martin Krallinger. (2006). "BioCreAtIvE challenge evaluation." Published online at: http://biocreative.sourceforge.net/
- QUOTE: Description - The BioCreAtIvE (Critical Assessment of Information Extraction systems in Biology) challenge evaluation consists of a community-wide effort for evaluating text mining and information extraction systems applied to the biological domain.
The organization of BioCreAtIvE was motivated by the increasing number of groups working in the area of text mining. However, despite increased activity in this area, there were no common standards or shared evaluation criteria to enable comparison among the different approaches. The various groups were addressing different problems, often using private data sets, and as a result, it was impossible to determine how good the existing systems were, whether they would scale to real applications, and what performance could be expected.
The main emphasis of BioCreAtIvE is on the comparison of methods and the community assessment of scientific progress, rather than on the purely competitive aspects.
There is a considerable difficulty in constructing suitable “gold standard” data for training and testing new information extraction systems which handle life science literature. Thus the data sets derived from the BioCreAtIvE challenge - because they have been examined by biological database curators and domain experts - serve as useful resources for the development of new applications as well as helping to improve existing ones.
Two main issues are addressed at BioCreAtIvE, both concerned with the extraction of biologically relevant and useful information from the literature. The first one is concerned with the detection of biologically significant entities (names) such as gene and protein names and their association to existing database entries. The second one is concerned with the detection of entity-fact associations (e.g. protein - functional term associations).
The first BioCreAtIvE challenge evaluation in 2003-2004 attracted considerable attention within the bioinformatics and biomedical text mining community. Overall, 27 groups from some 10 countries participated in the evaluation. The first BioCreAtIvE was organized through collaborations between text mining and NLP groups, biological database curators and bioinformatics researchers and has served as the promoting force for the organization of the second BioCreAtIvE challenge.
- ↑ Hirschman, L., et al. (2005). "Overview of BioCreAtIvE task 1B: normalized gene lists." BMC Bioinformatics, 6(Suppl 1): S11.