2004 Textpresso

From GM-RKB

Subject Headings: Textpresso System, Wormbase.

Notes

Cited By

2008

Quotes

Abstract

We have developed Textpresso, a new text-mining system for scientific literature whose capabilities go far beyond those of a simple keyword search engine. Textpresso's two major elements are a collection of the full text of scientific articles split into individual sentences, and the implementation of categories of terms for which a database of articles and individual sentences can be searched. The categories are classes of biological concepts (e.g., gene, allele, cell or cell group, phenotype, etc.) and classes that relate two objects (e.g., association, regulation, etc.) or describe one (e.g., biological process, etc.). Together they form a catalog of types of objects and concepts called an ontology. After this ontology is populated with terms, the whole corpus of articles and abstracts is marked up to identify terms of these categories. The current ontology comprises 33 categories of terms. A search engine enables the user to search for one or a combination of these tags and/or keywords within a sentence or document, and as the ontology allows word meaning to be queried, it is possible to formulate semantic queries. Full text access increases recall of biological data types from 45% to 95%. Extraction of particular biological facts, such as gene-gene interactions, can be accelerated significantly by ontologies, with Textpresso automatically performing nearly as well as expert curators to identify sentences; in searches for two uniquely named genes and an interaction term, the ontology confers a 3-fold increase of search efficiency. Textpresso currently focuses on Caenorhabditis elegans literature, with 3,800 full text articles and 16,000 abstracts. The lexicon of the ontology contains 14,500 entries, each of which includes all versions of a specific word or phrase, and it includes all categories of the Gene Ontology database. 
Textpresso is a useful curation tool, as well as search engine for researchers, and can readily be extended to other organism-specific corpora of text. Textpresso can be accessed at http://www.textpresso.org or via WormBase at http://www.wormbase.org.
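
The abstract's core mechanism (sentence-split text, a lexicon of categorized terms, and searches that combine category tags with keywords inside one sentence) can be illustrated with a minimal Python sketch. This is not Textpresso's implementation. The lexicon, the corpus sentences, and the names `LEXICON`, `CORPUS`, `split_sentences`, `tag`, `build_index`, and `search` are all hypothetical, and the sentence splitter is deliberately naive.

```python
import re
from dataclasses import dataclass, field

# Hypothetical toy lexicon (category -> terms). The abstract reports about
# 14,500 lexicon entries across 33 categories; this is only an illustration.
LEXICON = {
    "gene": ["lin-11", "daf-2", "lon-1"],
    "regulation": ["regulates", "regulate", "inhibits"],
    "phenotype": ["body size"],
}

# Made-up sentences for demonstration only; they are not claims about biology.
CORPUS = {
    "doc1": "lon-1 regulates body size. The worms were grown at 20 degrees.",
    "doc2": "daf-2 inhibits lin-11 in this made-up sentence. Nothing else is tagged here.",
}


@dataclass
class Sentence:
    doc_id: str
    text: str
    tags: dict = field(default_factory=dict)  # category -> matched terms


def split_sentences(text):
    """Naive splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tag(sentence_text):
    """Return {category: [terms]} for lexicon terms found as whole tokens."""
    lowered = sentence_text.lower()
    tags = {}
    for category, terms in LEXICON.items():
        hits = [
            t for t in terms
            if re.search(r"(?<![\w-])" + re.escape(t) + r"(?![\w-])", lowered)
        ]
        if hits:
            tags[category] = hits
    return tags


def build_index(corpus):
    """Split every document into sentences and mark each one up with tags."""
    return [
        Sentence(doc_id, s, tag(s))
        for doc_id, text in corpus.items()
        for s in split_sentences(text)
    ]


def search(index, categories=None, keywords=()):
    """Find sentences with at least n distinct terms per requested category
    and containing every keyword (case-insensitive substring match)."""
    categories = categories or {}
    hits = []
    for s in index:
        has_categories = all(
            len(s.tags.get(c, [])) >= n for c, n in categories.items()
        )
        has_keywords = all(k.lower() in s.text.lower() for k in keywords)
        if has_categories and has_keywords:
            hits.append(s)
    return hits


if __name__ == "__main__":
    index = build_index(CORPUS)
    # Two distinct gene terms plus one regulation term in the same sentence.
    for s in search(index, {"gene": 2, "regulation": 1}):
        print(s.doc_id, "|", s.text)
```

Requiring two distinct `gene` terms and one `regulation` term in a single sentence loosely mirrors the abstract's gene-gene interaction search. A category query mixed with a plain keyword, such as `search(index, {"gene": 1}, keywords=["body size"])`, corresponds to the combined tag and keyword searches it describes.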


References

  • Alper S, Kenyon C. (2002). The zinc finger protein REF-2 functions with the Hox genes to inhibit cell fusion in the ventral epidermis of C. elegans. Development 129: 3335–3348.
  • Andrade MA, Bork P. (2000) Automated extraction of information in molecular biology. FEBS Lett 476: 12–17.
  • Bei Y, Hogan J, Berkowitz LA, Soto M, Rocheleau CE, et al. (2002). SRC-1 and Wnt signaling act together to specify endoderm and to control cleavage orientation in early C. elegans embryos. Dev Cell 3: 113–125.
  • Blaschke C, Valencia A. (2001) Can bibliographic pointers for known biological data be found automatically? Protein interactions as a case study. Comp Funct Genomics 2: 196–206.
  • Blaschke C, Valencia A. (2002). Molecular biology nomenclature thwarts information-extraction progress. IEEE Intell Syst 17: 73–76.
  • Boxem M, van den Heuvel S. (2002). C. elegans class B synthetic multivulva genes act in G(1) regulation. Curr Biol 12: 906–911.
  • Brill E (1992) A simple rule-based part of speech tagger. In: Proceedings of the third conference on applied natural language processing. Trento (Italy): ACL. pp. 152–155.
  • de Bruijn B, Martin J. (2002). Getting to the (c)ore of knowledge: Mining biomedical literature. Int J Med Inf 67: 7–18.
  • Donaldson I, Martin J, de Bruijn B, Wolting C, Lay V, et al. (2003). PreBIND and Textomy: mining the biomedical literature for protein-protein interactions using a support vector machine. BMC Bioinformatics 4: 11.
  • Francis R, McGrath G, Zhang J, Ruddy DA, Sym M, et al. (2002). aph-1 and pen-2 are required for Notch pathway signaling, gamma-secretase cleavage of beta-APP, and presenilin protein accumulation. Dev Cell 3: 85–97.
  • Friedman C, Kra P, Hong Y, Krauthammer M, Rzhetsky A. (2001) GENIES: A natural-language processing system for the extraction of molecular pathways from journal articles. Bioinformatics 17: S74–S82.
  • Fukuda K, Tsunoda T, Tamura A, Takagi T (1998) Towards information extraction: Identifying protein names from biological papers. Pac Symp Biocomput 1998: 707–718.
  • The Gene Ontology Consortium (2000) Gene Ontology: Tool for the unification of biology. Nat Genet 25: 25–29.
  • Gupta BP, Sternberg PW. (2002). Tissue-specific regulation of the LIM homeobox gene lin-11 during development of the Caenorhabditis elegans egg-laying system. Dev Biol 247: 102–115.
  • Hanisch D, Fluck J, Mevissen HT, Zimmer R. (2003). Playing biology’s name game: Identifying protein names in scientific text. Pac Symp Biocomput 2003: 403–414.
  • Huang NN, Mootz DE, Vidal M, Hunter CP, Walhout AJ. (2002). MEX3 interacting proteins link cell polarity to asymmetric gene expression in Caenorhabditis elegans. Development 129: 747–759.
  • Jenssen TK, Lægreid A, Komorowski J, Hovig E. (2001) A literature network of human genes for high-throughput analysis of gene expression. Nat Genet 28: 21–28.
  • Maduzia LL, Gumienny TL, Zimmerman CM, Wang H, Shetgiri P, et al. (2002). lon-1 regulates Caenorhabditis elegans body size downstream of the dbl-1 TGF beta signaling pathway. Dev Biol 246: 418–428.
  • Marcotte EM, Xenarios I, Eisenberg D. (2001) Mining literature for protein-protein interactions. Bioinformatics 17(Suppl 1): 359–363.
  • Norman KR, Moerman DG. (2002). Alpha spectrin is essential for morphogenesis and body wall muscle formation in Caenorhabditis elegans. J Cell Biol 157: 665–677.
  • Ono T, Hishigaki H, Tanigami A, Takagi T. (2001) Automated extraction of information on protein-protein interactions from the biological literature. Bioinformatics 17: 155–161.
  • Piekny AJ, Mains PE. (2002). Rho-binding kinase (LET-502) and myosin phosphatase (MEL-11) regulate cytokinesis in the early Caenorhabditis elegans embryo. J Cell Sci 115: 2271–2282.
  • Proux D, Rechenmann F, Julliard L, Pillet V, Jacq B (1998) Detecting gene symbols and names in biological texts: A first step toward pertinent information extraction. Genome Inform Ser Workshop Genome Inform 9: 72–80.
  • Rindflesch TC, Tanabe L, Weinstein JN, Hunter L. (2000) EDGAR: Extraction of drugs, genes and relations from the biomedical literature. Pac Symp Biocomput 2000: 515–524.
  • Scott BA, Avidan MS, Crowder CM (2002) Regulation of hypoxic death in C. elegans by the insulin/IGF receptor homolog DAF-2. Science 296: 2388–2391.
  • Sekimizu T, Park HS, Jun’ichi T (1998) Identifying the interaction between genes and gene products based on frequently seen verbs in Medline abstracts. Genome Inform Ser Workshop Genome Inform 9: 62–71.
  • Staab S, editor. (2002). Mining information for functional genomics. IEEE Intell Syst 17: 66.
  • Stapley BJ, Benoit G. (2000) Biobibliometrics: Information retrieval and visualization from co-occurrences of gene names in Medline abstracts. Pac Symp Biocomput 2000: 529–540.
  • Stein L, Sternberg P, Durbin R, Thierry-Mieg J, Spieth J. (2001) WormBase: Network access to the genome and biology of Caenorhabditis elegans. Nucleic Acids Res 29: 82–86.
  • Thomas J, Milward D, Ouzounis C, Pulman S, Carroll M. (2000) Automatic extraction of protein interactions from scientific abstracts. Pac Symp Biocomput 2000: 502–513.

2004 Textpresso
Author: Hans-Michael Müller, Eimear E. Kenny, Paul W. Sternberg
Title: Textpresso: an ontology-based information retrieval and extraction system for biological literature
Journal: PLoS Biol
URL: http://www.plosbiology.org/article/fetchObjectAttachment.action?uri=info:doi/10.1371/journal.pbio.0020309&representation=PDF
DOI: 10.1371/journal.pbio.0020309
Year: 2004