2003 ExtractingSynonymousGeneProteinTerms


Subject Headings: Relation Recognition, Word Normalization.

Notes

Cited By

2004

Quotes

Abstract

Motivation: Genes and proteins are often associated with multiple names. More names are added as new functional or structural information is discovered. Because authors can use any one of the known names for a gene or protein, information retrieval and extraction would benefit from identifying the gene and protein terms that are synonyms of the same substance.

Results: We have explored four complementary approaches for extracting gene and protein synonyms from text, namely the unsupervised, partially supervised, and supervised machine-learning techniques, and the manual knowledge-based approach. We report results of a large scale evaluation of these alternatives over an archive of biological journal articles. Our evaluation shows that our extraction techniques could be a valuable supplement to resources such as SWISSPROT, as our systems were able to capture gene and protein synonyms not listed in the SWISSPROT database.

Data Availability: The extracted gene and protein synonyms are available at http://synonyms.cs.columbia.edu/
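As an illustration of the retrieval motivation above, the following is a minimal sketch of how extracted synonym pairs might be grouped into synonym sets and used to expand a query term. It is not the authors' method. The function names `build_synonym_sets` and `expand_query` and the placeholder terms are hypothetical, and matching is assumed to be exact and case-sensitive.

```python
from collections import defaultdict


def build_synonym_sets(pairs):
    """Group pairwise synonym links into disjoint sets using union-find."""
    parent = {}

    def find(term):
        # Register unseen terms as their own root, then walk up to the root.
        parent.setdefault(term, term)
        while parent[term] != term:
            parent[term] = parent[parent[term]]  # path halving
            term = parent[term]
        return term

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two synonym sets

    groups = defaultdict(set)
    for term in list(parent):
        groups[find(term)].add(term)
    return list(groups.values())


def expand_query(term, synonym_sets):
    """Return all known names for a term, or the term itself if unknown."""
    for group in synonym_sets:
        if term in group:
            return sorted(group)
    return [term]


# Placeholder pairs (hypothetical names, not taken from the paper's output).
PAIRS = [("geneA", "GA-1"), ("GA-1", "protein alpha"), ("geneB", "GB-2")]
SETS = build_synonym_sets(PAIRS)
print(expand_query("geneA", SETS))
```

Treating synonymy as transitive, as this sketch does, is a simplifying assumption: a single wrong pair would merge two unrelated sets, so a real system would likely filter pairs by confidence first.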

References

  • Agichtein,E. and Gravano,L. (2000) Snowball: extracting relations from large plain-text collections. In: Proceedings of the ACM International Conference on Digital Libraries.
  • Blum,A. and Mitchell,T. (1998) Combining Labeled and Unlabeled Data with Co-training. In: Proceedings of ICML.
  • Brin,S. (1998) Extracting patterns and relations from the World-Wide-Web. In: Proceedings of the SIGMOD Workshop on the Web and Databases (WebDB).
  • Califf,M.E. and Mooney,R.J. (1998) Relational learning of pattern-match rules for information extraction. In: Proceedings of the AAAI Symposium on Applying Machine Learning to Discourse Processing.
  • Collins,M. and Singer,Y. (1999) Unsupervised models for named entity classification. In: Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora.
  • Dagan,I., Marcus,S. and Markovitch,S. (1995) Contextual word similarity and estimation from sparse data. Computer Speech and Language.
  • Dietterich,T.G. (2000) Ensemble methods in machine learning. LNCS.
  • Fellbaum,C. (1999) WordNet: an Electronic Lexical Database. MIT Press.
  • Friedman,C., Kra,P., Yu,H., Krauthammer,M. and Rzhetsky,A. (2001) Genies: a natural-language processing system for the extraction of molecular pathways from complete journal articles. Bioinformatics, 17 (Suppl 1), S74–S82.
  • Fukuda,K., Tamura,A., Tsunoda,T. and Takagi,T. (1998) Toward information extraction: identifying protein names from biological papers. In: Proceedings of the Pacific Symposium on Biocomputing. pp. 707–718.
  • Hatzivassiloglou,V., Duboue,P. and Rzhetsky,A. (2001) Disambiguating proteins, genes, and RNA in text: a machine learning approach. Bioinformatics, 17 (Suppl 1), S97–S106.
  • Hearst,M. (1992) Automatic acquisition of hyponyms from large text corpora. In: Proceedings of COLING.
  • Hole,W. and Srinivasan,S. (2000) Discovering missed synonyms in a large concept-oriented metathesaurus. In: Proceedings of the AMIA Symposium. pp. 354–358.
  • Humphreys,B. and Lindberg,D. (1993) The UMLS project: making the conceptual connection between users and the information they need. Bull. Med. Lib. Assoc., 81, 170–177.
  • Joachims,T. (1998) Making large-scale support vector machine learning practical. Advances in Kernel Methods: Support Vector Machines. MIT Press, Cambridge, MA.
  • Krauthammer,M., Kra,P., Iossifov,I., Gomez,S., Hripcsak,G., Hatzivassiloglou,V., Friedman,C. and Rzhetsky,A. (2002). Of truth and pathways: chasing bits of information through myriads of articles. Bioinformatics, 18 (Suppl 1), S249–S257.
  • Kushmerick,N., Weld,D.S. and Doorenbos,R.B. (1997) Wrapper induction for information extraction. In: Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI).
  • Li,H. and Abe,N. (1998) Word clustering and disambiguation based on co-occurrence data. In: Proceedings of COLING.
  • Lin,D. (1998) Automatic retrieval and clustering of similar words. In: Proceedings of ACL.
  • Liu,H. and Friedman,C. (2003). Mining terminological knowledge in large biomedical corpora. In: Proceedings of the Pacific Symposium on Biocomputing.
  • Magnini,B., Negri,M., Prevete,R. and Tanev,H. (2002). Is it the right answer? exploiting web redundancy for answer validation. In: Proceedings of ACL.
  • Muslea,I., Minton,S. and Knoblock,C. (1998) STALKER: learning extraction rules for semistructured web-based information sources. In: Proceedings of AAAI-98 Workshop on AI and Information Integration.
  • Pakhomov,S. (2002). Semi-supervised maximum entropy based approach to acronym and abbreviation normalization in medical text. In: Proceedings of ACL.
  • Park,Y. and Byrd,R. (2001) Hybrid text mining for finding abbreviations and their definitions. In: Proceedings of EMNLP.
  • Proux,D., Rechenmann,F., Julliard,L., Pillet,V. and Jacq,B. (1998) Detecting gene symbols and names in biological texts: a first step toward pertinent information extraction. In: Proceedings of Workshop on Genome Informatics.
  • Resnik,P. (1995) Using information content to evaluate semantic similarity in a taxonomy. In: Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI).
  • Riloff,E. (1996) Automatically generating extraction patterns from untagged text. In: Proceedings of the Thirteenth National Conference on Artificial Intelligence. pp. 1044–1049.
  • Riloff,E. and Jones,R. (1999) Learning dictionaries for information extraction by multi-level bootstrapping. In: Proceedings of the Sixteenth National Conference on Artificial Intelligence.
  • Rindflesch,T., Tanabe,L., Weinstein,J. and Hunter,L. (2000) EDGAR: extraction of drugs, genes and relations from the biomedical literature. In: Proceedings of the Pacific Symposium on Biocomputing.
  • Schwartz,A. and Hearst,M. (2003). A simple algorithm for identifying abbreviation definitions in biomedical text. In: Proceedings of the Pacific Symposium on Biocomputing.
  • Soderland,S. (1999) Learning information extraction rules for semistructured and free text. Machine Learning, 34.
  • Tanabe,L. and Wilbur,W. (2002). Tagging gene and protein names in biomedical text. Bioinformatics, 18, 1124–1132.
  • Thomas,J., Milward,D., Ouzounis,C., Pulman,S. and Carroll,M. (2000) Automatic extraction of protein interactions from scientific abstracts. In: Proceedings of the Pacific Symposium on Biocomputing.
  • Wilbur,W. and Kim,W. (2001) Flexible phrase-based query handling algorithms. In: Proceedings of the ASIST.
  • Yangarber,R. and Grishman,R. (1998) NYU: description of the Proteus/PET system as used for MUC-7. In: Proceedings of the Seventh Message Understanding Conference (MUC-7).
  • Yangarber,R., Grishman,R., Tapanainen,P. and Huttunen,S. (2000) Unsupervised discovery of scenario-level patterns for information extraction. In: Proceedings of Conference on Applied Natural Language Processing ANLP-NAACL.
  • Yarowsky,D. (1995) Unsupervised word sense disambiguation rivaling supervised methods. In: Proceedings of the ACL.
  • Yoshida,M., Fukuda,K. and Takagi,T. (2000) PNAD-CSS: a workbench for constructing a protein name abbreviation dictionary. Bioinformatics, 16, 169–175.
  • Yu,H., Friedman,C. and Hripcsak,G. (2002). Mapping abbreviations to full forms in biomedical articles. J. Amer. Med. Inform. Assoc., 9, 262–272.
  • Yu,H., Hatzivassiloglou,V., Friedman,C., Rzhetsky,A. and Wilbur,W.J. (2002). Automatic extraction of gene and protein synonyms from MEDLINE and journal articles. In: Proceedings of the AMIA Symposium. pp. 413–423.

BibTeX

@inproceedings{DBLP:conf/ismb/YuA03,

 author    = {Hong Yu and
              Eugene Agichtein},
 title     = {Extracting synonymous gene and protein terms from biological
              literature.},
 booktitle = {ISMB (Supplement of Bioinformatics)},
 year      = {2003},
 pages     = {340-349},
 ee        = {http://bioinformatics.oupjournals.org/cgi/content/abstract/19/suppl_1/i340?etoc},
 crossref  = {DBLP:conf/ismb/2003},
 bibsource = {DBLP, http://dblp.uni-trier.de}

}

@proceedings{DBLP:conf/ismb/2003,

 title     = {Proceedings of the Eleventh International Conference on
              Intelligent Systems for Molecular Biology, June 29 - July
              3, 2003, Brisbane, Australia},
 booktitle = {ISMB (Supplement of Bioinformatics)},
 year      = {2003},
 bibsource = {DBLP, http://dblp.uni-trier.de}

}



2003 ExtractingSynonymousGeneProteinTerms
Author: Hong Yu, Eugene Agichtein
Title: Extracting Synonymous Gene and Protein Terms from Biological Literature
Journal: Proceedings of the 11th Int
Title URL: http://www.cs.columbia.edu/~eugene/papers/ismb2003.pdf
Year: 2003