2008 OverviewOfBioCreativeIIGeneNorml
- (Morgan et al., 2008) ⇒ Alexander A. Morgan, Zhiyong Lu, Xinglong Wang, Aaron M. Cohen, Juliane Fluck, Patrick Ruch, Anna Divoli, Katrin Fundel, Robert Leaman, Jörg Hakenberg, Chengjie Sun, Heng-hui Liu, Rafael Torres, Michael Krauthammer, William W Lau, Hongfang Liu, Chun-Nan Hsu, Martijn Schuemie, K. Bretonnel Cohen, and Lynette Hirschman. (2008). “Overview of BioCreative II gene normalization.” In: Genome Biology, 9(Suppl 2):S3. doi:10.1186/gb-2008-9-s2-s3.
Subject Headings: BioCreAtIvE II - Gene Normalization Task, Gene Normalization Task, Entity Mention Normalization Task, Biomedical Text Mining.
Notes
Cited By
Quotes
Abstract
- Background
The goal of the gene normalization task is to link genes or gene products mentioned in the literature to biological databases. This is a key step in an accurate search of the biological literature. It is a challenging task, even for the human expert; genes are often described rather than referred to by gene symbol and, confusingly, one gene name may refer to different genes (often from different organisms). For BioCreative II, the task was to list the Entrez Gene identifiers for human genes or gene products mentioned in PubMed/MEDLINE abstracts. We selected abstracts associated with articles previously curated for human genes. We provided 281 expert-annotated abstracts containing 684 gene identifiers for training, and a blind test set of 262 documents containing 785 identifiers, with a gold standard created by expert annotators. Inter-annotator agreement was measured at over 90%.
- Results
Twenty groups submitted one to three runs each, for a total of 54 runs. Three systems achieved F-measures (balanced precision and recall) between 0.80 and 0.81. Combining the system outputs using simple voting schemes and classifiers obtained improved results; the best composite system achieved an F-measure of 0.92 with 10-fold cross-validation. A 'maximum recall' system based on the pooled responses of all participants gave a recall of 0.97 (with precision 0.23), identifying 763 out of 785 identifiers.
- Conclusion
Major advances for the BioCreative II gene normalization task include broader participation (20 versus 8 teams) and a pooled system performance comparable to human experts, at over 90% agreement. These results show promise as tools to link the literature with biological databases.
Background
The goal of the gene normalization (GN) task is to determine the unique identifiers of genes and proteins mentioned in scientific literature. For the BioCreative II GN task, the identifiers are Entrez Gene IDs, the genes and proteins are associated with humans, and the targeted literature is a collection of abstracts taken from PubMed/MEDLINE. Gene normalization is a challenging task even for the human expert; despite the existence of various standards bodies, there is great variability in how genes and gene products are mentioned in the literature. There are two problems. First, genes are often described, rather than referred to by gene name or symbol, as in 'p65 subunit of NF-kappaB' or 'light chain-3 of microtubule-associated proteins 1A and 1B.' This can make correct association with the Entrez Gene identifier difficult. Second, gene mentions can be ambiguous.
Discussion
The different teams approached the GN task from a variety of angles. The Materials and methods section (below) contains brief overviews of the specific approaches of the different teams; the workshop proceedings contain more extended descriptions of the individual systems. In this section, we highlight some of the commonalities and some of the successful approaches that were used in the evaluation.
For overview purposes, we can break down the GN task into three basic tasks.
- Preprocessing of the text to regularize it and to identify linguistic units such as words and sentences, and even categories of words and phrases, such as mentions of genes and gene products. This step could also include special handling for prefixes, suffixes, and enumerations or conjunctions.
- Generation of candidate gene identifiers, generally by associating text strings (sequences of words in the text) with identifiers, using a lexicon.
- Pruning of the list of candidate identifiers to remove false positives and to disambiguate in cases where a mention could be mapped to more than one identifier.
Not all teams followed this approach. For example, Team 14 [9] did no linguistic preprocessing; they generated candidate gene identifiers using a text categorization approach, and followed this step by identifying text evidence for the selected gene/protein identifiers. Other teams (7 and 42) avoided the tokenization step and relied on matching against a lexicon to find candidate gene identifiers.
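The three-step breakdown above can be illustrated with a minimal sketch. Everything in it is hypothetical: the toy lexicon, the made-up identifiers, the context cues, and the function names are assumptions for illustration only and do not reproduce any participating system.

```python
import re

# Hypothetical toy lexicon: normalized name -> set of made-up gene identifiers.
TOY_LEXICON = {
    "abc1": {"ID:1"},
    "xyz": {"ID:2", "ID:3"},  # an ambiguous name
}

# Hypothetical context cues used to choose among identifiers of an ambiguous name.
CONTEXT_CUES = {"ID:2": {"kinase"}, "ID:3": {"receptor"}}


def preprocess(text):
    """Step 1: lower-case the text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def generate_candidates(tokens, lexicon):
    """Step 2: look up each token in the lexicon to collect candidate identifiers."""
    return {tok: lexicon[tok] for tok in tokens if tok in lexicon}


def prune(candidates, tokens, cues):
    """Step 3: keep unambiguous matches; resolve ambiguous ones by context cues."""
    context = set(tokens)
    selected = set()
    for ids in candidates.values():
        if len(ids) == 1:
            selected |= ids
            continue
        supported = {i for i in ids if cues.get(i, set()) & context}
        if len(supported) == 1:  # drop the mention if the context does not decide
            selected |= supported
    return selected


def normalize(text):
    """Run the three steps and return the set of selected identifiers."""
    tokens = preprocess(text)
    candidates = generate_candidates(tokens, TOY_LEXICON)
    return prune(candidates, tokens, CONTEXT_CUES)
```

Single-token lookup keeps the sketch short; the systems described in this overview generally matched multi-word strings and used much richer evidence for pruning.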
Conclusion
Performance on the BioCreative II GN task demonstrates progress since the first BioCreative workshop in 2004. The results obtained for human gene/protein identification are comparable to results obtained earlier for mouse and fly; three teams achieved an F-measure of 0.80 or above for one of their runs. However, there is significant progress along several new dimensions. First, the assessment involved 20 groups, as compared with eight groups for BioCreative I. The results achieved by combining input from all of the participating systems outperformed any single system, achieving F-measures from 0.85 to 0.92, depending on the method of combination.
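As a rough illustration of how combining system outputs can trade precision against recall, the sketch below applies a simple vote threshold to sets of predicted identifiers and scores the result with the balanced F-measure. The data are invented toy values and the function names are assumptions; the combination methods actually evaluated in the paper may differ.

```python
from collections import Counter


def vote(system_outputs, min_votes):
    """Return the identifiers proposed by at least `min_votes` systems."""
    counts = Counter(i for output in system_outputs for i in set(output))
    return {i for i, n in counts.items() if n >= min_votes}


def precision_recall_f(predicted, gold):
    """Precision, recall and balanced F-measure of a predicted identifier set."""
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f


# Toy data, invented for illustration only.
gold = {"g1", "g2", "g3"}
systems = [{"g1", "g2", "x1"}, {"g1", "g3", "x2"}, {"g1", "g2", "g3", "x3"}]

pooled = vote(systems, min_votes=1)    # union of all outputs: favors recall
majority = vote(systems, min_votes=2)  # majority vote: fewer false positives
```

With `min_votes=1` the function returns the pooled union, which corresponds to the idea behind a 'maximum recall' system; raising the threshold removes identifiers that only one system proposed.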
The participating teams explored the 'solution space' for this challenge evaluation well. Four teams incorporated explicit handling of conjunction and enumeration; this no longer seems to be a significant cause of loss in recall. The 'maximum recall' system achieved a recall of 97.2% (precision 23.1%), consistent with the 763 of 785 identifiers reported in the abstract. A number of groups did contrastive studies on the utility of adding lexical resources and contextual resources, and on the benefits of lexicon curation. The participants also explored novel approaches to the matching of mentions and lexical resources, and there was significant exploration of contextual models for disambiguation and removal of false positives. An interesting finding was that many groups did not feel the need for large training corpora, especially those using lexicon-based approaches.
What does this mean in terms of practical performance? Performance depends on a number of factors: the quality and completeness of the lexical resources; the selection criteria of the articles, including date, journal, domain, and whether they are likely to contain curatable information; the amount of both intra-species and inter-species gene symbol ambiguity; the types of textual input (abstract, full text) and the types of preprocessing required in particular for full text articles; and quantity of data to be handled (all of PubMed/MEDLINE versus specialized subsets).
The formulation of the BioCreative II GN task is still quite artificial. A more realistic task would be to extract and normalize protein names across multiple species, from full text articles, such as was required for the PPI task.
Despite these limitations, normalization technology is making rapid progress. It has the potential to provide improved annotation consistency for gene mention linkages to databases, more efficient updating of existing annotations, and, when applied across large collections, more focused gene-centric literature search.
It is always important to evaluate the evaluation. Criteria for a successful evaluation include participation, progress, diversity of approaches, exchange of scientific information, and emergence of standards. We can see all of these happening in the BioCreative evaluation. There is enthusiastic participation across the entire range of BioCreative tasks. The research community is making significant progress, as shown by the larger number of high-performing systems. There are more groups engaged, and more teams are emerging that combine skills from multiple disciplines, including biology, bioinformatics, linguistics, machine learning, natural language understanding, and information retrieval. There is a healthy variety of approaches represented. We are seeing exploration of ideas developed in the first BioCreative, such as use of a high-recall gene mention 'nomination' process, followed by a filtering stage. Also, although the GN task was designed to leverage existing standards, such as Entrez Gene identifiers, we are seeing the emergence of reusable component-ware, and a number of high-performing systems that are taking advantage of this. As we go forward, the BioCreative Workshop will provide an opportunity to exchange insights and to define the next set of challenges for this community to tackle.
Team 109 (Hongfang Liu, Georgetown University Medical Center)
The base gene/protein name normalization system included three modules. The first module was lexicon-lookup, where the lexicon consisted of terms associated with human Entrez Gene records. The second module used machine learning to integrate the results of the gene/protein name mention tagger [32], name sources, name ambiguity, false positive rates, popularity, and token shape information. The third module used a similarity-based method to associate Entrez Gene records with long phrases detected by the gene/protein name mention tagger.
The lexicon was compiled from terms for human genes from Entrez Gene, Online Mendelian Inheritance in Man, HGNC, and BioThesaurus [33]. The synonymy relationship was based on rich cross-reference information provided by Entrez Gene and UniProtKB. All terms were then normalized by changing to lower case, ignoring punctuation marks, and transferring words to their base forms according to the UMLS Specialist Lexicon. The same normalization procedure was applied to each document, followed by longest string matching lookup. If the string contained specialized patterns, which usually were abbreviated forms for several entities from the same family (for example, 'HAP2, 3, 4' or 'HAP2-4', 'HAP-2, -3, and -4', or 'HAP2/4'), then they were separated and reassembled into distinct strings with their own mappings. For example, 'HAP2/4' would become two strings, 'HAP2' and 'HAP4'. This stage returned a list of pairs (Phrase, EGID), in which Phrase was a text string mapped to a lexicon entry and EGID was the Entrez Gene identifier. Each pair (Phrase, EGID) was then transformed into a feature vector, and machine learning was used to classify the pair as valid or invalid. The features included the following:
- Phrase-specific features: the gene/protein mention tagger result, the ambiguity of Phrase, the number of occurrences of Phrase in the document, the number of occurrences of Phrase in the top one million words provided by MedPost, and some typographic features.
- EGID-specific features: the number of different strings mapped to EGID and their occurrences in the text.
- (Phrase, EGID)-specific features: a metric measuring the association power between Phrase and EGID, a Boolean feature indicating whether Phrase could be mapped to EGID through exact string matching, and the false positive rate of the pair in the training set.
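The splitting of family abbreviations described above for Team 109 (for example, 'HAP2/4' becoming 'HAP2' and 'HAP4') can be sketched as follows. The regular expressions and the function name are assumptions made for illustration. In particular, treating 'HAP2-4' as the range 2 to 4 is an assumption, since the text does not state how the hyphenated form was expanded.

```python
import re

# A stem of letters, an optional hyphen, a first number, then the rest of the string.
HEAD = re.compile(r"([A-Za-z]+)-?(\d+)(.*)")
# Tail of the form "-4" (assumed here to denote a numeric range).
RANGE_TAIL = re.compile(r"-(\d+)")
# Tail listing further numbers separated by ",", "/", "and" or ", and".
LIST_TAIL = re.compile(r"(?:\s*(?:,\s*and|,|/|and)\s*-?\d+)+")


def expand_enumeration(phrase):
    """Split an abbreviated family mention into one string per member.

    Hyphens between stem and number are dropped, in line with the
    punctuation-insensitive normalization described in the text.
    Phrases that do not fit a known pattern are returned unchanged.
    """
    head = HEAD.fullmatch(phrase.strip())
    if head is None:
        return [phrase]
    stem, first, tail = head.group(1), int(head.group(2)), head.group(3)
    range_tail = RANGE_TAIL.fullmatch(tail)
    if range_tail and int(range_tail.group(1)) > first:
        numbers = range(first, int(range_tail.group(1)) + 1)
    elif LIST_TAIL.fullmatch(tail):
        numbers = [first] + [int(n) for n in re.findall(r"\d+", tail)]
    else:
        return [phrase]
    return [f"{stem}{n}" for n in numbers]
```

Each resulting string would then be looked up in the lexicon separately, so that every family member receives its own mapping.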
Names with multiple words in a lexicon may appear in the text with some of the words missing, or in different word orders or forms. The system incorporated a similarity-based method for normalizing names detected by the gene/protein mention tagger. The number of overlapping words was counted between phrases detected as entity names in text and names in the dictionary. If over 90% of the words in a name from the dictionary were found in the names detected by the gene/protein name tagger, the names in the text were normalized to associated record(s) of the name.
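The word-overlap criterion in the preceding paragraph can be written down in a few lines. This is a minimal sketch under simple assumptions: whitespace tokenization, case-insensitive comparison, and hypothetical function names; the original system applied its own term normalization before counting.

```python
def word_overlap(dictionary_name, detected_phrase):
    """Fraction of the dictionary name's words that occur in the detected phrase."""
    name_words = dictionary_name.lower().split()
    phrase_words = set(detected_phrase.lower().split())
    if not name_words:
        return 0.0
    return sum(w in phrase_words for w in name_words) / len(name_words)


def matches(dictionary_name, detected_phrase, threshold=0.9):
    """True when more than `threshold` of the dictionary words are covered."""
    return word_overlap(dictionary_name, detected_phrase) > threshold
```

Because the ratio is computed over the dictionary name rather than the detected phrase, extra words in the text do not lower the score, and word order is ignored.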
Team 6 (Xinglong Wang, University of Edinburgh)
Team 6 adapted a GN system used in its NLP pipeline [47] for extracting protein-protein interactions from biomedical texts. The system was developed for normalizing proteins but it can also normalize other biological entities (drug compounds, disease types, and experimental methods) without requiring extensive knowledge of the new domain.
The system first uses a gene mention named entity component to mark up entities of types gene and gene product. A string-distance-based fuzzy matcher then searches the gene lexicon and calculates scores of string similarity between the mentions and lexicon entries using a formula similar to Jaro-Winkler. The matcher takes into account the commonality and differences in string suffixes, such as Arabic and Roman numbers. Sets of equivalent suffixes are defined (e.g., Roman I = Arabic 1). Strings with common suffixes are rewarded whilst those with different ones are penalized. The value is finally normalized by the length of the string. At the end of the fuzzy-matching stage, each mention recognized by named entity recognition is associated with the single highest-scoring match from the gene lexicon in terms of the string similarity measure, where each match is associated with one or more identifiers (in cases where ambiguity occurs).
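The suffix-aware matching idea can be sketched as below. Note the substitution: instead of the Jaro-Winkler-like formula used by Team 6, which is not specified here, the sketch uses `difflib.SequenceMatcher.ratio()` from the Python standard library, which is already normalized by string length. The bonus and penalty weights, the Roman numeral table, and all function names are assumptions chosen for illustration.

```python
import re
from difflib import SequenceMatcher

# Equivalent suffix forms: Roman numerals are mapped to Arabic digits.
ROMAN_TO_ARABIC = {"i": "1", "ii": "2", "iii": "3", "iv": "4", "v": "5"}

# Illustrative weights, not taken from the source.
SUFFIX_BONUS = 0.1
SUFFIX_PENALTY = 0.3

# Digits may attach directly to the stem; Roman numerals need a separator.
SUFFIX = re.compile(r"(.+?)(?:[\s-]*(\d+)|[\s-]+(iv|i{1,3}|v))")


def split_suffix(name):
    """Split a name into (stem, canonical suffix); the suffix is '' if absent."""
    name = name.lower().strip()
    m = SUFFIX.fullmatch(name)
    if m is None:
        return name, ""
    stem, digits, roman = m.groups()
    return stem, digits or ROMAN_TO_ARABIC[roman]


def similarity(mention, entry):
    """Stem similarity, rewarded for equal suffixes and penalized for different ones."""
    m_stem, m_suffix = split_suffix(mention)
    e_stem, e_suffix = split_suffix(entry)
    score = SequenceMatcher(None, m_stem, e_stem).ratio()
    if m_suffix and e_suffix:
        score += SUFFIX_BONUS if m_suffix == e_suffix else -SUFFIX_PENALTY
    return max(0.0, min(1.0, score))


def best_match(mention, lexicon):
    """Return the single highest-scoring lexicon entry for a mention."""
    return max(lexicon, key=lambda entry: similarity(mention, entry))
```

For example, with a toy lexicon of "factor 1", "factor 2", and "unrelated", the mention "Factor II" is matched to "factor 2", because the Roman and Arabic suffixes are treated as equivalent while "factor 1" is penalized for its differing suffix.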
To resolve ambiguity, a machine learning algorithm learns a model to predict the most probable identifier out of a pool of candidates returned by the fuzzy matcher. The machine learning algorithm uses contextual properties surrounding gene mentions such as adjacent words, their part-of-speech tags, and so on, as well as complex features such as NER confidence and string similarity scores between all the mentions in the document and the description associated with the gene identifier. An SVM model was then trained to predict the most probable identifiers for gene mentions.
References
- 1. Lynette Hirschman, Marc E. Colosimo, Morgan A, Yeh A: Overview of BioCreAtIvE task 1B: Normalized Gene Lists. BMC Bioinformatics 2005, 6(Suppl 1):S11.
- 2. Marc E. Colosimo, Morgan A, Yeh A, Colombe J, Lynette Hirschman: Data Preparation and Interannotator Agreement: BioCreAtIvE Task 1B. BMC Bioinformatics 2005, 6(Suppl 1):S12.
- 3. Cohen A: Unsupervised gene/protein entity normalization using automatically extracted dictionaries. In: Proceedings of the BioLINK2005 Workshop Linking Biological Literature, Ontologies and Databases: Mining Biological Semantics. Detroit, MI: Association for Computational Linguistics; 2005:17-24.
- 4. Xu H, Fan J-W, Hripcsak G, Mendonça EA, Markatou M, Friedman C: Gene symbol disambiguation using knowledge-based profiles. Bioinformatics 2007, 23(8):1015-1022.
- 5. Sehgal AK, Srinivasan P: Retrieval with gene queries. BMC Bioinformatics 2006, 7:220.
- 6. Fang H-R, Kevin Murphy, Jin Y, Kim J, White P: Human gene name normalization using text matching with automatically extracted synonym dictionaries. In: Proceedings of the BioNLP workshop on linking natural language processing and biology. Association for Computational Linguistics; 2006:41-48.
- 7. Krallinger M, Florian Leitner, Rodriguez-Penagos C, Valencia A: Overview of the protein-protein interaction annotation extraction task of BioCreative II. Genome Biology 2008.
- 8. Wilbur W, Smith L, L T: BioCreative 2. Gene Mention Task. In: Proceedings of the Second BioCreative Challenge Workshop. Madrid, Spain: CNIO; 2007:7-16.
- 9. Ehrler F, Gobeill J, Tbahriti I, Ruch P: GeneTeam site report for BioCreative II: Customizing a simple toolkit for text mining in molecular biology. Proc of the Second BioCreative Challenge Evaluation Workshop: Madrid, Spain 2007, 199-207.
- 10. Carpenter R: Phrasal queries with LingPipe and Lucene: ad hoc genomics text retrieval. 13th Annual Text Retrieval Conference: Gaithersburg, MD 2004.
- 11. Settles B: ABNER: An open source tool for automatically tagging genes, proteins and other entity names in text. Bioinformatics 2005, 21(14):3191-3192.
- 12. Baumgartner W, Lu Z, Johnson H, Caporaso J, Paquette J, Lindemann A, White E, Medvedeva O, Fox L, Cohen K, et al.: An integrated approach to concept recognition in biomedical text. In: Proceedings of the Second BioCreative Challenge Evaluation Workshop. Madrid, Spain: CNIO; 2007:257-271.
- 13. Crim J, McDonald R, Pereira F: Automatically annotating documents with normalized gene lists. BMC Bioinformatics 2005, 6(Suppl 1):S13.
- 14. Morgan A, Ben Wellner, Colombe J, Arens R, Marc E. Colosimo, Lynette Hirschman: Evaluating the Automatic Mapping of Human Gene and Protein Mentions to Unique Identifiers. Pacific Symposium on Biocomputing: Maui 2007, 281-291.
- 15. Morgan A, Lynette Hirschman: Overview of BioCreative II Gene Normalization. In: Proceedings of the Second BioCreative Challenge Evaluation Workshop. Madrid, Spain: CNIO; 2007:17-27.
- 16. Gene Ontology Homepage [1]
- 17. NCBI FTP site [2]
- 18. Wu C, Apweiler R, Bairoch A, Natale D, Barker W, Boeckmann B, Ferro S, Gasteiger E, Huang H, Lopez R, et al.: The Universal Protein Resource (UniProt): an expanding universe of protein information. Nucleic Acids Res 2006, (32 Database):D187-191.
- 19. Wain HM, Lush MJ, Ducluzeau F, Khodiyar VK, Povey S: Genew: the Human Gene Nomenclature Database, 2004 updates. Nucleic Acids Res 2004, (32 Database):D255-257.
- 20. BioCreAtIvE 2 Homepage. [3]
- 21. Hakenberg J, Royer L, Plake C, Strobelt H, Schroeder M: Me and my friends: gene name normalization using background knowledge. Proc 2nd BioCreative Challenge Evaluation: Madrid, Spain 2007, 141-144.
- 22. Doms A, Schroeder M: GoPubMed: exploring PubMed with the Gene Ontology. Nucleic Acids Res 2005, 33:W783-786.
- 23. Fundel K, Güttler D, Zimmer R, Apostolakis J: A simple approach for protein name identification: prospects and limits. BMC Bioinformatics 2005, 6(Suppl 1):S15.
- 24. Fundel K, Zimmer R: Human Gene Normalization by an Integrated Approach including Abbreviation Resolution and Disambiguation. In: Proceedings of the Second BioCreAtIvE Challenge Workshop - Critical Assessment of Information Extraction in Molecular Biology. Madrid, Spain: CNIO; 2007:153-156.
- 25. Hanisch D, Fundel K, Mevissen H, Zimmer R, J F: ProMiner: Rule based protein and gene entity recognition. BMC Bioinformatics 2005, 6(Suppl 1):S14.
- 26. Fluck J, Mevissen T, Dach H, Oster M, Hoffmann-Apitius M: ProMiner: Recognition of Human Gene and Protein Names using regularly updated Dictionaries. Second BioCreAtIvE Challenge Workshop: Critical Assessment of Information Extraction in Molecular Biology; Madrid Spain 2007, 149-152.
- 27. The Open Biomedical Ontologies [4]
- 28. Fukuda K, Tsunoda T, Tamura A, Takagi T: Toward Information Extraction: Identifying Protein Names from Biological Papers. Pacific Symposium on Biocomputing 1998, 705-716.
- 29. Tanabe L, Wilbur J: Tagging gene and protein names in biomedical text. Bioinformatics 2002, 18:1124-1132.
- 30. Kinoshita S, Cohen K, Ogren P, Hunter L: BioCreAtIvE Task 1A: Entity Identification with a Stochastic Tagger. BMC Bioinformatics 2005.
- 31. Schwartz A, Hearst M: A simple algorithm for identifying abbreviation definitions in biomedical text. Pacific Symposium for Biocomputing 2003, 451-462.
- 32. Liu H, Torii M, Hu ZZ, Wu C: Gene mention and gene normalization based on machine learning and online resources. In: Proceedings of the Second BioCreative Challenge Workshop. Madrid, Spain: CNIO; 2007:135-140.
- 33. Liu H, Hu ZZ, Zhang J, C W: BioThesaurus: a web-based thesaurus of protein and gene names. Bioinformatics 22(1):103-105.
- 34. Torres R, Sánchez PD, Pascual L, Blaschke C: Text Detective: Gene/proteins annotation tool by Alma Bioinformatics. Proceedings of the Second BioCreative Challenge Evaluation Workshop: Madrid, Spain 2007, 125-130.
- 35. Chiang JH, Liu HH: A Hybrid Normalization approach with capability of disambiguation. Proceedings of the Second BioCreative Challenge Evaluation Workshop: Madrid, Spain 2007, 157-159.
- 36. Luong T, Tran N, Krauthammer M: Context-Aware Mapping of Gene Names using Trigrams. Proc of the Second BioCreative Challenge Evaluation Workshop: Madrid, Spain 2007, 145-148.
- 37. LingPipe homepage [5]
- 38. Schuemie MJ, Jelier R, Kors JA: Peregrine: Lightweight gene name normalization by dictionary lookup. In: Proceedings of the Second BioCreative Challenge Evaluation Workshop. Madrid, Spain: CNIO; 2007:131-133.
- 39. Kors JA, Schuemie MJ, Schijvenaars BJA, Weeber M, Mons B: Combination of genetic databases for improving identification of genes and proteins in text. BioLINK, Detroit 2005.
- 40. Lau W, C J: Rule-based gene normalization with a statistical and heuristic confidence measure. Proc of the Second BioCreative Challenge Evaluation Workshop: Madrid, Spain 2007, 165-168.
- 41. Cohen AM: Automatically Expanded Dictionaries with Exclusion Rules and Support Vector Machine Text Classifiers: Approaches to the BioCreAtIve 2 GN and PPI-IAS Tasks. In: Proceedings of the Second BioCreative Challenge Evaluation Workshop. Madrid, Spain: CNIO Centro Nacional de Investigaciones Oncológicas; 2007:169-174.
- 42. Kuo C-J, Chang Y-M, Huang H-S, Lin K-T, Yang B-H, Lin Y-S, Hsu C-N, Chung I-F: Exploring Match Scores to Boost Precision of Gene Normalization. Proc of the Second BioCreative Challenge Evaluation Workshop (BioCreative II): Madrid, Spain 2007.
- 43. Kuo C-J, Chang Y-M, Huang H-S, Lin K-T, Yang B-H, Lin Y-S, Hsu C-N, Chung I-F: Rich Feature Set, Unification of Bidirectional Parsing and Dictionary Filtering for High F-Score Gene Mention Tagging. Proc of the Second BioCreative Challenge Evaluation Workshop (BioCreative II) Madrid, Spain 2007.
- 44. Freund Y, Schapire R: A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences 1997, 55(1):119-139.
- 45. Nakov P, Divoli A: BioText Report for the Second BioCreAtIvE Challenge. Proceedings of the Second BioCreative Challenge Evaluation Workshop: Madrid, Spain 2007, 297-306.
- 46. David Yarowsky: Unsupervised word sense disambiguation rivaling supervised methods. Proceedings of the Annual Meeting of the Association for Computational Linguistics 1995, 189-196.
- 47. Grover C, Haddow B, Klein E, Matthews M, Neilsen L, Tobin R, Wang X: Adapting a relation extraction pipeline for the BioCreative II tasks. In: Proceedings of the Second BioCreative Challenge Evaluation Workshop. Madrid, Spain: CNIO; 2007:273-286.
- 48. Gonzalez G, Tari L, Gitter A, Leaman R, Nikkila S, Wendt R, Zeigler A, Baral C: Integrating knowledge from biomedical literature. In: Proceedings Second BioCreative Challenge Evaluation Workshop. CNIO; 2007:227-236.
- 49. Ruch P: Automatic assignment of biomedical categories: toward a generic approach. Bioinformatics 6(22):658-664.
- 50. Aronson A, Demner-Fushman D, Humphrey S, Lin J, Liu H, Ruch P, Ruiz M, Smith L, Tanabe L, Wilbur W: Fusion of knowledge-intensive and statistical approaches for retrieving and annotating textual genomics documents. In TREC Proceedings. TREC Gaithersburg, MD, USA; 2005.
- 51. Sun C, Lin L, Wang X, Guan Y: Study for Application of Discriminative Models in Biomedical Literature Mining. Proceedings of the Second BioCreative Challenge Evaluation Workshop Madrid, Spain 2007, 319-321.