2008 IntroducingMetaServicesForBioMedExtraction
- (Leitner et al., 2008) ⇒ Florian Leitner, Martin Krallinger, Carlos Rodriguez-Penagos, Jörg Hakenberg, Conrad Plake, Cheng-Ju Kuo, Chun-Nan Hsu, Richard Tzong-Han Tsai, Hsi-Chuan Hung, William W Lau, Calvin A Johnson, Rune Sætre, Kazuhiro Yoshida, Yan Hua Chen, Sun Kim, Soo-Yong Shin, Byoung-Tak Zhang, William A. Baumgartner Jr, Lawrence Hunter, Barry Haddow, Michael Matthews, Xinglong Wang, Patrick Ruch, Frédéric Ehrler, Arzucan Özgür, Güneş Erkan, Dragomir Radev, Michael Krauthammer, ThaiBinh Luong, Robert Hoffmann, Chris Sander, Alfonso Valencia. (2008). “Introducing Meta-Services for Biomedical Information Extraction.” In: Genome Biology, 9(Suppl 2):S6. doi:10.1186/gb-2008-9-s2-s6
Subject Headings: Gene Mention Recognition System, Gene Mention Normalization System, PubMed, UniProt, BCMS, BioCreative Meta-Services, Entity Mention Normalization Task.
Notes
Cited By
Quotes
Abstract
We introduce the first meta-service for information extraction in molecular biology, the BioCreative MetaServer (BCMS; http://bcms.bioinfo.cnio.es/). This prototype platform is a joint effort of 13 research groups and provides automatically generated annotations for PubMed/Medline abstracts. Annotation types cover gene names, gene IDs, species, and protein-protein interactions. The annotations are distributed by the meta-server in both human- and machine-readable formats (HTML/XML). This service is intended to be used by biomedical researchers and database annotators, and in biomedical language processing. The platform allows direct comparison, unified access, and result aggregation of the annotations.
Background
Information retrieval (IR), information extraction (IE), and text mining have become integral parts of computational biology over the past decade [1]. However, these services are dispersed, integrated in specific packages, and include proprietary software. Therefore, progress in the field requires offering better access to the tools, methods, and their results [2]. Other areas, such as sequence analysis, genome analysis, or protein structure prediction, have benefited greatly from enhanced access to services and tools for the community of biologists, bioinformaticians (through web servers and portals), and developers (by providing free, open source academic software) [3].
Web services, widely used throughout the internet to provide the functionality for distributed systems, are becoming a common part of bioinformatics tools; for example, one of the most used text mining applications, namely iHOP (Information Hyperlinked Over Proteins), provides such an infrastructure to access its data [4]. Meta-services, too, are a ubiquitous component of the World Wide Web, found as meta-search engines, in business-to-business and business-to-consumer transactions (for example, for flight booking systems), and are used in scientific research (for example, for protein structure prediction) [5]. Another example of a distributed meta-service is BioDAS (Distributed Annotation System), a platform to exchange biological sequence annotations between independent resources [6].
This publication describes the development of the BioCreative MetaServer (BCMS) prototype. The Results section (below) provides an overview of the system design and introduces the basic components, followed by short descriptions of the IE systems currently available through the platform prototype. The Discussion section (below) reviews what problems are solved and what issues need further investigation. The Conclusions section (below) closes with current and future utilities of this platform for the biomedical community. Technical details on the platform and implementation aspects can be found in the Materials and methods section (below).
Results
The fundamental aim of the BCMS platform is to provide users with annotations on biomedical texts from different systems. At the current prototype level, the dataset is restricted to a fixed number of approximately 22,800 PubMed/Medline abstracts. The available annotations consist of marking passages that are detected as gene or protein name mentions, annotating the articles with the gene/protein and taxonomic IDs (providing hyperlinks to the corresponding database entries), and a confidence score for whether the text contains protein-protein interaction information. Expanding on standalone IE systems, this platform gathers the results of several systems developed by various research groups, unifies them, and allows the user to access abstracts and annotations in a combined view. It is conceivable that collating classification results will often enhance performance, simply because an annotation that several systems produce identically is more likely to be correct. The gathered data are accessible to the user both as human-readable hypertext and as machine-processable XML in the form of XML-RPC requests.
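The collation idea in the paragraph above can be illustrated with a minimal sketch. It is not the BCMS implementation: the function name `aggregate_annotations`, the `min_votes` threshold, the system names, and the `(start, end, label)` annotation tuples are all assumptions made for illustration. The sketch simply counts how many systems produced each identical annotation and keeps those that reach the threshold.

```python
from collections import Counter


def aggregate_annotations(system_outputs, min_votes=2):
    """Keep annotations reported by at least `min_votes` systems.

    `system_outputs` maps a (hypothetical) system name to the set of
    annotations it produced for one abstract. Each annotation is a
    (start, end, label) tuple; this layout is an assumption.
    """
    votes = Counter()
    for annotations in system_outputs.values():
        # A system votes at most once per annotation.
        votes.update(set(annotations))
    # Return each surviving annotation with its number of supporting systems.
    return {ann: n for ann, n in votes.items() if n >= min_votes}


# Hypothetical outputs of three annotation systems for one abstract.
outputs = {
    "system_a": {(0, 5, "gene"), (20, 24, "gene")},
    "system_b": {(0, 5, "gene")},
    "system_c": {(0, 5, "gene"), (30, 41, "gene")},
}

consensus = aggregate_annotations(outputs)
print(consensus)
```

Here only the annotation at offsets 0 to 5 is supported by more than one system, so it is the only one kept. Lowering `min_votes` to 1 would return the union of all annotations instead.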
Annotation Systems
Gene/protein normalization (GN): detect which genes or proteins are mentioned, assigning sequence database identifiers to the text.
Biotec TU Dresden and Humboldt-Universität zu Berlin (JH, CP) [Hakenberg]
The annotations we currently provide are gene mention normalization (32,795 human genes from EntrezGene).
Entity mention normalization is based on a large lexicon of known names and synonyms, which are kept in main memory at all times for efficiency. Once a potential named entity has been found, we further identify it using context profiles in case multiple entities share the same name [15].
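A rough sketch of lexicon-based normalization with context-based disambiguation follows. It is not the method of [15]: a plain word-overlap score stands in for the context profiles, and the lexicon entries, the placeholder identifiers (`GENE:1` and so on), and the function name `normalize` are hypothetical values chosen for illustration.

```python
import re

# Hypothetical in-memory lexicon: lower-cased name -> candidate identifiers.
LEXICON = {
    "p53": ["GENE:1"],
    "cat": ["GENE:2", "GENE:3"],  # ambiguous name shared by two entries
}

# Hypothetical context profiles: identifier -> words typical of its context.
PROFILES = {
    "GENE:1": {"tumor", "apoptosis"},
    "GENE:2": {"peroxide", "oxidative"},
    "GENE:3": {"acetyltransferase", "chloramphenicol"},
}


def normalize(text):
    """Map each lexicon name found in `text` to one identifier.

    Ambiguous names are resolved by choosing the candidate whose profile
    shares the most words with the text; ties go to the first candidate.
    """
    tokens = re.findall(r"[A-Za-z0-9]+", text.lower())
    context = set(tokens)
    results = {}
    for token in tokens:
        candidates = LEXICON.get(token)
        if not candidates:
            continue  # token is not a known name or synonym
        results[token] = max(
            candidates, key=lambda gid: len(PROFILES[gid] & context)
        )
    return results


print(normalize("CAT protects cells from hydrogen peroxide and oxidative stress"))
```

Keeping the lexicon as a dictionary mirrors the in-memory lookup described above; a production system would need multi-word names, spelling variants, and a far richer context model.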
References
- Krallinger M, Valencia A: Text-mining and information-retrieval services for molecular biology. Genome Biol 2005, 6:224.
- Cohen A, Hersh W: A survey of current work in biomedical text mining. Brief Bioinform 2005, 6:57-71.
- Labarga A, Valentin F, Anderson M, Lopez R: Web Services at the European Bioinformatics Institute. Nucleic Acids Res 2007, (35 Web server):W6-W11.
- Fernández J, Hoffmann R, Valencia A: iHOP web services. Nucleic Acids Res 2007, (35 Web server):W21-W26.
- Bujnicki JM, Elofsson A, Fischer D, Rychlewski L: Structure prediction meta server. Bioinformatics 2001, 17:750-751.
- Dowell RD, Jokerst RM, Day A, Eddy SR, Stein L: The distributed annotation system. BMC Bioinformatics 2001, 2:7.
- BioCreative Homepage
- XML-RPC Specification
- BioCreative MetaServer
- BioCreative XML-RPC MetaService
- Krallinger M, Morgan A, Smith L, Leitner F, Tanabe L, Wilbur J, Hirschman L, Valencia A: Evaluation of text mining systems for biology: overview of the Second BioCreative community challenge. Genome Biol 2008, 9(Suppl 2):S1.
- Smith L, Tanabe LK, Johnson nee Ando R, Kuo C-J, Chung I-F, Hsu C-N, Lin Y-S, Klinger R, Friedrich CM, Ganchev K, Torii M, Liu H, Haddow B, Struble CA, Povinelli RJ, Vlachos A, Baumgartner WA Jr, Hunter L, Carpenter B, Tsai RT-H, Dai H-J, Liu F, Chen Y, Sun C, Katrenko S, Adriaans P, Blaschke C, Torres R, Neves M, Nakov P, et al.: Overview of BioCreative II gene mention recognition. Genome Biol 2008, 9(Suppl 2):S2.
- Morgan AA, Lu Z, Wang X, Cohen AM, Fluck J, Ruch P, Divoli A, Fundel K, Leaman R, Hakenberg J, Sun C, Liu H-h, Torres R, Krauthammer M, Lau WW, Liu H, Hsu C-N, Schuemie M, Cohen KB, Hirschman L: Overview of BioCreative II gene normalization. Genome Biol 2008, 9(Suppl 2):S3.
- Krallinger M, Leitner F, Rodriguez-Penagos C, Valencia A: Overview of the protein-protein interaction annotation extraction task of BioCreative II. Genome Biol 2008, 9(Suppl 2):S4.
- Hakenberg J, Plake C, Royer L, Strobelt H, Leser U, Schroeder M: Gene mention normalization and interaction extraction with context models and sentence motifs. Genome Biol 2008, 9(Suppl 2):S14.
- Plake C, Schiemann T, Pankalla M, Hakenberg J, Leser U: AliBaba: PubMed as a graph. Bioinformatics 2006, 22:2444-2445.
- Kuo CJ, Chang YM, Huang HS, Lin KT, Yang BH, Lin YS, Hsu CN, Chung IF: Rich feature set, unification of bidirectional parsing and dictionary filtering for high F-score gene mention tagging. In: Proceedings of the Second BioCreative Challenge Workshop. Madrid, Spain. CNIO; (2007).
- Mallet: A machine learning for language toolkit
- Tsuruoka Y, Tateishi Y, Kim JD, Ohta T, McNaught J, Ananiadou S, Tsujii J: Developing a robust part-of-speech tagger for biomedical text. In Advances in Informatics, 10th Panhellenic Conference on Informatics; 11-13 November (2005). Volos, Greece. Springer; 2005:382-392.
- Dai HJ, Hung HC, Tsai RTH, Hsu WL: IASL systems in the gene mention tagging task and protein interaction article subtask. In: Proceedings of the Second BioCreative Challenge Workshop. Madrid, Spain. CNIO; (2007).
- Tsai RTH, Sung CL, Dai HJ, Hung HC, Sung TY, Hsu WL: NERBio: using selected word conjunctions, term normalization, and global patterns to improve biomedical named entity recognition. BMC Bioinformatics 2006, 7(suppl 5):S11.
- Tsai RTH, Hung HC, Dai HJ, Hsu WL: Exploiting likely-positive and unlabeled data to improve the identification of protein-protein interaction articles. Proceedings of the 6th International Conference on Bioinformatics; HongKong-Hanoi-Nansha; 27-31 August 2007.
- Sinica Annotation Server - Web Service
- Lau WW, Johnson CA: Rule-based human gene normalization in biomedical text with confidence estimation. Comput Syst Bioinformatics Conf 2007, 6:371-379.
- Nelder J, Mead R: A simplex method for function minimization. Computer J 1965, 7:308-313.
- Sætre R, Sagae K, Tsujii J: Syntactic features for protein-protein interaction extraction. Short Paper Proceedings of the 2nd International Symposium on Languages in Biology and Medicine (LBM-2007); 6-7 December 2007; Singapore.
- Sætre R, Yoshida K, Yakushiji A, Miyao Y, Matsubyashi Y, Ohta T: AKANE system: protein-protein interaction pairs in BioCreAtIvE2 challenge, PPI-IPS subtask. In: Proceedings of the Second BioCreative Challenge Workshop. Madrid, Spain. CNIO; 2007:209-212.
- Chen YH, Ramampiaro H, Lægreid A, Sætre R: ProtIR prototype: abstract relevance for protein-protein interaction in BioCreAtIvE2 challenge, PPI-IAS subtask. In: Proceedings of the Second BioCreative Challenge Workshop. Madrid, Spain. CNIO; 2007:179-182.
- Jang H, Lim J, Lim JH, Park SJ, Lee KC, Park SH: Finding the evidence for protein-protein interactions from PubMed abstracts. Bioinformatics 2006, 22:e220-e226.
- Fan W, Stolfo S, Zhang J, Chan P: AdaCost: misclassification cost-sensitive boosting. Proceedings of the 16th International Conference on Machine Learning; 27-30 (1999). Bled, Slovenia 1999, 97-105.
- PIE: Protein Interaction Information Extraction
- Kinoshita S, Cohen KB, Ogren PV, Hunter L: BioCreAtIvE Task1A: entity identification with a stochastic tagger. BMC Bioinformatics 2005, 6(suppl 1):S4.
- Baumgartner WA Jr, Lu Z, Johnson HL, Caporaso JG, Paquette J, Lindemann A, White EK, Medvedeva O, Cohen KB, Hunter L: Concept recognition for extracting protein interaction relations from biomedical text. Genome Biol 2008, 9(Suppl 2):S9.
- Alex B, Grover C, Haddow B, Kabadjov M, Klein E, Matthews M, Tobin R, Wang X: Automating curation using a natural language processing pipeline. Genome Biol 2008, 9(Suppl 2):S10.
- Grover C, Haddow B, Klein E, Matthews M, Nielsen LA, Tobin R, Wang X: Adapting a relation extraction pipeline for the BioCreAtIvE II task. In: Proceedings of the Second BioCreative Challenge Workshop. Madrid, Spain. CNIO; (2007).
- Alex B, Haddow B, Grover C: Recognising nested named entities in biomedical text. Proceedings of BioNLP; June 2007; Prague, Czech Republic 2007, 65-72.
- Wang X: Rule-based protein term identification with help from automatic species tagging. Proceedings of CICLING; Mexico City, Mexico 2007, 288-298.
- Nielsen LA: Extracting protein-protein interactions using simple contextual features. Proceedings of BioNLP; New York 2006, 120-121.
- Matthews M: Improving biomedical text categorization with NLP. Proceedings of the SIGs, The Joint BioLINK-Bio-Ontologies Meeting 2006, 93-96.
- Ehrler F, Geissbuhler A, Jimeno A, Ruch P: Data-poor categorization and passage retrieval for gene ontology annotation in Swiss-Prot. BMC Bioinformatics 2005, 6(suppl 1):S23.
- Ruch P: Automatic assignment of biomedical categories: toward a generic approach. Bioinformatics 2006, 22:658-664.
- Pillet V, Zehnder M, Seewald AK, Veuthey AL, Petrak J: GPSDB: a new database for synonyms expansion of gene and protein names. Bioinformatics 2005, 21:1743-1744.
- Genia Tagger
- de Marneffe MC, MacCartney B, Manning CD: Generating typed dependency parses from phrase structure parses. Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC 2006).
- Erkan G, Özgür A, Radev DR: Semi-supervised classification for extracting protein interaction sentences using dependency parsing. Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL); Prague, Czech Republic 2007, 1:228-237.
- Erkan G, Özgür A, Radev DR: Extracting interacting protein pairs and evidence sentences by using dependency parsing and machine learning techniques. In: Proceedings of the Second BioCreative Challenge Workshop. Madrid, Spain. CNIO; (2007).
- Krauthammer M, Nenadic G: Term identification in the biomedical literature. J Biomed Inform 2004, 37:512-526.
- Settles B: ABNER: an open source tool for automatically tagging genes, proteins and other entity names in text. Bioinformatics 2005, 21:3191-3192.
- Luong T, Tran N, Krauthammer M: Context-aware mapping of gene names using trigrams. In: Proceedings of the Second BioCreative Challenge Workshop. Madrid, Spain. CNIO; 2007:145-148.
- Hoffmann R, Valencia A: A gene network for navigating the literature. Nat Genet 2004, 36:664.
- Hoffmann R, Valencia A: Implementing the iHOP concept for navigation of biomedical literature. Bioinformatics 2005, 21(suppl 2):ii252-ii258.
- MEDLINE/PubMed update charts
- Valencia A: Meta, Meta(N) and cyber servers. Bioinformatics 2003, 19:795.
- eUtils SOAP API
- PostgreSQL Open Source Database
- Django Web Development Framework
- jQuery JavaScript and AJAX library
- LingPipe - Java Text Mining Library and Medline Importer
- Python Programming Language
- ITI Life Sciences Homepage
- Cognia EU Homepage
- Instituto Nacional de Bioinformática