Gene Mention Normalization Task

A Gene Mention Normalization Task is a domain-specific entity mention normalization task that is restricted to the mapping of gene mentions to canonical gene records.

AKA: Protein Mention Normalization Task.
Context:
- It can be solved by a Gene Mention Normalization System (that implements a Gene Mention Normalization Algorithm).
- It can (often) be intended to apply to mapping Gene Product Mentions (Protein Mentions) as well.
Example(s):
- The BioCreAtIvE II - Gene Normalization Task.
- (from PPLRE).
- …
Counter-Example(s):
- a Gene Record Normalization Task.
- a Product Mention Normalization Task.
See: Protein NER, Organism Mention Normalization Task, PPLRE Project.
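The mapping described above can be illustrated with a minimal dictionary-lookup sketch. It is not the method of any system cited on this page: the lexicon, the function names `build_index` and `normalize`, and the case-insensitive matching rule are assumptions made for illustration. The two yeast entries mirror the shared-synonym case (SWC3) quoted from Cusick et al. (2009) in the references.

```python
from collections import defaultdict

# Toy lexicon (illustrative assumption): canonical record id -> known names.
# The entries mirror the SWC3 shared-synonym case quoted from Cusick et al. (2009).
TOY_LEXICON = {
    "YAL011w": ["SWC3"],
    "YOL130w": ["ALR1", "SWC3"],
}


def build_index(lexicon):
    """Invert the lexicon into a lower-cased name -> set of record ids index."""
    index = defaultdict(set)
    for record_id, names in lexicon.items():
        for name in names:
            index[name.lower()].add(record_id)
    return index


def normalize(mention, index):
    """Return the sorted candidate record ids for a mention (empty if unknown)."""
    return sorted(index.get(mention.strip().lower(), set()))


index = build_index(TOY_LEXICON)
print(normalize("Swc3", index))  # ['YAL011w', 'YOL130w'] (ambiguous mention)
print(normalize("ALR1", index))  # ['YOL130w']
```

A mention that returns more than one candidate still has to be disambiguated (for example by organism or by surrounding context), which is the hard part of the task and is not attempted here.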

References

2011

(Chatr-aryamontri et al., 2011) ⇒ Andrew Chatr-aryamontri, Andrew Winter, Livia Perfetto, Leonardo Briganti, Luana Licata, Marta Iannuccelli, Luisa Castagnoli, Gianni Cesareni, and Mike Tyers. (2011). “Benchmarking of the 2010 BioCreative Challenge III text-mining competition by the BioGRID and MINT interaction databases.” In: BMC Bioinformatics 2011, 12(Suppl 8):S8. doi:10.1186/1471-2105-12-S8-S8
- … Gene normalization is the process of linking genes or proteins to stable database identifiers and as such is a crucial step in the annotation of biological interactions. Expert curators from both BioGRID and MINT participated with curators from other databases in the annotation of the test set for the gene normalization task. Curation specifications were set by the BioCreative III organizers and, for each gene mentioned in the full-text, required the annotation of taxon and Entrez Gene identifier. If either of these conditions could not be met, the gene was not annotated.

2009

(Cusick et al., 2009) ⇒ Michael E Cusick, Haiyuan Yu, Alex Smolyar, Kavitha Venkatesan, Anne-Ruxandra Carvunis, Nicolas Simonis, Jean-François Rual, Heather Borick, Pascal Braun, Matija Dreze, Jean Vandenhaute, Mary Galli, Junshi Yazaki, David E Hill, Joseph R Ecker, Frederick P Roth, and Marc Vidal. (2009). “Literature-Curated Protein Interaction Datasets.” In: Nature Methods 6, 39 - 46 (2009)
- Why is reliability of literature curation so low? Our findings of large error rates in curated protein interaction databases, at least for yeast and human, are consistent with recent hints that the quality of literature-curated datasets may not be as high as widely perceived [23,29,43–45]. Perhaps occasionally curator error is responsible. However, we suggest that the errors are due not so much to curators but to the simple reality that extracting accurate information from a long free-text document can be extremely difficult. Gene name confusion is particularly thorny [30,46]. An example from our curated yeast sample illustrates the difficulties. A purification with a tandem affinity purification tag with Vps71/Swc6 (slash separates synonymous approved names) as bait [47] pulls down a protein named Swc3, but double-checking this finds that the corresponding open reading frame is actually SWC3 (locus name YAL011w), and not the ALR1/SWC3 (locus name YOL130w) open reading frame curated in the database. A shared synonym thoroughly muddled the curation.

2008

(Morgan et al., 2008) ⇒ Alexander A Morgan, Zhiyong Lu, Xinglong Wang, Aaron M Cohen, Juliane Fluck, Patrick Ruch, Anna Divoli, Katrin Fundel, Robert Leaman, Jörg Hakenberg, Chengjie Sun, Heng-hui Liu, Rafael Torres, Michael Krauthammer, William W Lau, Hongfang Liu, Chun-Nan Hsu, Martijn Schuemie, K Bretonnel Cohen, and Lynette Hirschman. (2008). “Overview of BioCreative II gene normalization.” In: Genome Biology 2008, 9(Suppl 2):S3. doi:10.1186/gb-2008-9-s2-s3.
- QUOTE: The goal of the gene normalization task is to link genes or gene products mentioned in the literature to biological databases. This is a key step in an accurate search of the biological literature. It is a challenging task, even for the human expert; genes are often described rather than referred to by gene symbol and, confusingly, one gene name may refer to different genes (often from different organisms). For BioCreative II, the task was to list the Entrez Gene identifiers for human genes or gene products mentioned in PubMed/MEDLINE abstracts. We selected abstracts associated with articles previously curated for human genes. We provided 281 expert-annotated abstracts containing 684 gene identifiers for training, and a blind test set of 262 documents containing 785 identifiers, with a gold standard created by expert annotators. Inter-annotator agreement was measured at over 90%. … Twenty groups submitted one to three runs each, for a total of 54 runs. Three systems achieved F-measures (balanced precision and recall) between 0.80 and 0.81. Combining the system outputs using simple voting schemes and classifiers obtained improved results; the best composite system achieved an F-measure of 0.92 with 10-fold cross-validation. A 'maximum recall' system based on the pooled responses of all participants gave a recall of 0.97 (with precision 0.23), identifying 763 out of 785 identifiers. … Major advances for the BioCreative II gene normalization task include broader participation (20 versus 8 teams) and a pooled system performance comparable to human experts, at over 90% agreement. These results show promise as tools to link the literature with biological databases.
(Farkas, 2008) ⇒ Richárd Farkas. (2008). “The strength of co-authorship in gene name disambiguation.” In: BMC Bioinformatics 2008, 9:69. doi:10.1186/1471-2105-9-69
- QUOTE: Taken one step further, the goal of Gene Name Normalisation (GN) [2] is to assign a unique identifier to each gene name found in a text.
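
The precision, recall, and F-measure figures quoted from Morgan et al. (2008) can be made concrete with a short sketch. It assumes that system output and gold standard are both sets of (document id, gene identifier) pairs and that the F-measure is the balanced harmonic mean of precision and recall; the function name `precision_recall_f1` and the toy identifiers are hypothetical and are not taken from the BioCreative data or scoring tools.

```python
def precision_recall_f1(gold, predicted):
    """Score predicted (doc_id, gene_id) pairs against gold-standard pairs."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy data with hypothetical identifiers (not from any real test set).
gold = {("doc1", "101"), ("doc1", "202"), ("doc2", "303")}
predicted = {("doc1", "101"), ("doc2", "303"), ("doc2", "404")}

p, r, f = precision_recall_f1(gold, predicted)
print(round(p, 2), round(r, 2), round(f, 2))  # 0.67 0.67 0.67

# The pooled 'maximum recall' figure quoted above: 763 of 785 identifiers found.
print(round(763 / 785, 2))  # 0.97
```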

2005

ACL-ISMB Workshop on Linking Biological Literature, Ontologies and Databases: Mining Biological Semantics

2004

(Morgan et al., 2004) ⇒ Alexander A. Morgan, Lynette Hirschman, Marc E. Colosimo, Alexander S. Yeh, and Jeff B. Colombe. (2004). “Gene Name Identification and Normalization Using a Model Organism Database.” In: Journal of Biomedical Informatics 37(6). doi:10.1016/j.jbi.2004.08.010

2002

(Cohen et al., 2002) ⇒ K. Bretonnel Cohen, Andrew Dolbey, George Acquaah-Mensah, and Lawrence Hunter. (2002). “Contrast and Variability in Gene Names.” In: Proceedings of the ACL-2002 Workshop on Natural Language Processing in the Biomedical Domain.