2005 AutoAnnotDocsWithNormalizedGeneLists

From GM-RKB

Subject Headings: Entity Mention Normalization Algorithm, BioCreAtIvE Benchmark Task.

Notes

Cited By

Quotes

Abstract

  • Background: Document gene normalization is the problem of creating a list of unique identifiers for genes that are mentioned within a document. Automating this process has many potential applications in both information extraction and database curation systems. Here we present two separate solutions to this problem. The first is primarily based on standard pattern matching and information extraction techniques. The second and more novel solution uses a statistical classifier to recognize valid gene matches from a list of known gene synonyms.
  • Results: We compare the results of the two systems, analyze their merits, and argue that the classification-based system is preferable for many reasons, including performance, simplicity and robustness. Our best systems attain a balanced precision and recall in the range of 74%–92%, depending on the organism.
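
The classification-based approach in the abstract can be read as two stages: extract candidate matches against a synonym list, then let a statistical classifier decide which matches are valid. The following Python sketch illustrates that reading only; it is not the authors' implementation. The synonym list, gene identifiers, feature names, and weights are all hypothetical, and the weights are set by hand where a real system would learn them from training data.

```python
import math
import re

# Hypothetical synonym list: surface form -> gene identifier (illustrative only).
SYNONYMS = {
    "wingless": "GENE:0001",
    "wg": "GENE:0001",
    "hedgehog": "GENE:0002",
    "a": "GENE:0003",  # an ambiguous one-letter synonym
}

# Hand-set weights standing in for parameters that training would estimate.
WEIGHTS = {"bias": -1.0, "long_synonym": 3.0, "single_char": -3.0, "repeated": 1.5}


def extract_candidates(text):
    """Return (token, gene_id) pairs for every token found in the synonym list."""
    tokens = re.findall(r"[A-Za-z0-9]+", text.lower())
    return [(tok, SYNONYMS[tok]) for tok in tokens if tok in SYNONYMS]


def features(token, matched_tokens):
    """Purely textual features of one match; no domain knowledge is used."""
    return {
        "bias": 1.0,
        "long_synonym": 1.0 if len(token) >= 4 else 0.0,
        "single_char": 1.0 if len(token) == 1 else 0.0,
        "repeated": 1.0 if matched_tokens.count(token) > 1 else 0.0,
    }


def p_valid(feats):
    """Binary maximum entropy (logistic) probability that a match is valid."""
    score = sum(WEIGHTS[name] * value for name, value in feats.items())
    return 1.0 / (1.0 + math.exp(-score))


def normalize(text, threshold=0.5):
    """Return the sorted unique gene identifiers whose matches are accepted."""
    candidates = extract_candidates(text)
    matched_tokens = [tok for tok, _ in candidates]
    accepted = {
        gene_id
        for tok, gene_id in candidates
        if p_valid(features(tok, matched_tokens)) >= threshold
    }
    return sorted(accepted)


def precision_recall(predicted, gold):
    """Precision and recall of a predicted identifier list against a gold list."""
    predicted, gold = set(predicted), set(gold)
    true_pos = len(predicted & gold)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    return precision, recall


# Toy document and a made-up gold list for illustration.
TEXT = "Expression of wingless and hedgehog in a wing disc; wg is required."
GOLD = ["GENE:0001", "GENE:0002", "GENE:0004"]
```

Calling normalize(TEXT) yields a list of unique identifiers, and precision_recall(normalize(TEXT), GOLD) compares that list with the gold list, which is the level at which the reported precision and recall are measured. In this toy setup the short synonyms "a" and "wg" are rejected by the hand-set weights, which shows how the classifier is meant to filter out unreliable matches after extraction.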

Conclusion

  • Comparing Tables 3 and 4, we see that maximum entropy classification does as well as or better than the pattern matching system. A primary advantage of maximum entropy classification over pattern matching is that the system is uniform across organisms; hence the method is more likely to perform well when extended to different organisms.
  • The maximum entropy model can also be improved in many ways, the most obvious of which is to incorporate more expert knowledge into the model. Maximum entropy models are widely used because they allow such expert knowledge to be integrated easily through the definition of new features. For extracting gene mentions from text, these features generally take the form of lexical resources and indicative regular expressions [2]. For gene normalization, it may be possible to have experts additionally curate the synonym list to indicate which synonyms should be trusted and which should not. This could greatly improve performance, particularly for synonym matches not seen in training: if a match carries a feature indicating that its synonym is trustworthy, that feature could provide evidence for classifying the match as valid. Currently, the model's features are based primarily on textual matching and contain no domain-specific information. It may also be possible to improve performance by introducing more context or some syntactic features from the extracted matches. However, preliminary experiments on the development data suggested that additional context had a negligible effect on accuracy and only increased the time needed to train the model.
  • Another potential improvement would be to relax the criteria used when extracting matches. Under perfect conditions, we should be able to extract all good matches and use the classifier to eliminate the bad ones. Currently, our matching criteria extract as few as 79% of all good matches, which bounds the recall of the system. We are experimenting with different string distance metrics proposed by Cohen et al. [7] to try to raise the number of good matches returned.
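
The last point, that the extraction step bounds recall, can be illustrated with a small Python sketch. It is an assumption-laden illustration, not the method used in the paper: the standard-library difflib.SequenceMatcher ratio stands in for the string distance metrics of Cohen et al. [7], and the synonym list, mentions, gene identifiers, and thresholds are hypothetical.

```python
from difflib import SequenceMatcher

# Hypothetical synonym list and document mentions (illustrative only).
SYNONYM_TO_GENE = {
    "wingless": "GENE:0001",
    "hedgehog": "GENE:0002",
    "decapentaplegic": "GENE:0003",
}
MENTIONS = ["wingless", "hedge-hog", "decapentaplegic", "wing"]

# Made-up gold standard: the (mention, gene) pairs that are good matches.
GOOD_MATCHES = {
    ("wingless", "GENE:0001"),
    ("hedge-hog", "GENE:0002"),
    ("decapentaplegic", "GENE:0003"),
}


def similarity(a, b):
    """Case-insensitive similarity in [0, 1]; a stand-in for other metrics."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def extract_matches(mentions, synonym_to_gene, threshold):
    """Return every (mention, gene_id) pair whose similarity meets the threshold."""
    matches = set()
    for mention in mentions:
        for synonym, gene_id in synonym_to_gene.items():
            if similarity(mention, synonym) >= threshold:
                matches.add((mention, gene_id))
    return matches


def recall_ceiling(matches, good_matches):
    """Fraction of good matches that survive extraction.

    A downstream classifier can only reject matches, so this fraction is an
    upper bound on the recall of the whole system.
    """
    return len(matches & good_matches) / len(good_matches)


exact = extract_matches(MENTIONS, SYNONYM_TO_GENE, threshold=1.0)
relaxed = extract_matches(MENTIONS, SYNONYM_TO_GENE, threshold=0.9)
loose = extract_matches(MENTIONS, SYNONYM_TO_GENE, threshold=0.6)
```

With exact matching the hyphenated variant "hedge-hog" is missed, so the ceiling stays below 1; relaxing the threshold recovers it. Relaxing further also admits the spurious pair for "wing", which is the kind of bad match the classifier is then expected to eliminate.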

References

  • 1. Kazama J, Makino T, Ohta Y, Jun'ichi Tsujii: Tuning Support Vector Machines for Biomedical Named Entity Recognition. Proceedings of Natural Language Processing in the Biomedical Domain, ACL (2002).
  • 2. McDonald R, Pereira F: Identifying gene mentions in text using conditional random fields. BMC Bioinformatics 2005, 6(Suppl 1):S6.
  • 3. Narayanaswamy M, Ravikumar KE, Vijay-Shanker K: A Biological Named Entity Recognizer. Proceedings of Pacific Symposium on Biocomputing (2003).
  • 4. A critical assessment of text mining methods in molecular biology workshop (2004).
  • 5. Morgan AA, Lynette Hirschman, Marc E. Colosimo, Yeh A, Colombe J: Gene Name Identification and Normalization Using a Model Organism Database. To appear in Journal of Biomedical Informatics (2004).
  • 6. Lynette Hirschman, Marc E. Colosimo, Morgan A, Yeh A: Overview of BioCreAtIvE task 1B: Normalized Gene Lists. BMC Bioinformatics 2005, 6(Suppl 1):S11.
  • 7. W. Cohen, Ravikumar P, Fienberg S: Comparison of String Distance Metrics for Name-Matching Tasks. Proceedings of IIWeb workshop (2003).
  • 8. Porter MF: An algorithm for suffix stripping. Program 1980, 14(3):130-137.
  • 9. A. McCallum: MALLET: A Machine Learning for Language Toolkit. (2002).
  • 10. Berger AL, Della Pietra SA, Della Pietra VJ: A maximum entropy approach to natural language processing. Computational Linguistics 1996, 22(1).
  • 11. Chen SF, Rosenfeld R: A Gaussian prior for smoothing maximum entropy models. (1999).
  • 12. Malouf R: A comparison of algorithms for maximum entropy parameter estimation. Proceedings of Sixth Conference on Natural Language Learning (2002).
  • 13. Sha F, F. Pereira: Shallow parsing with conditional random fields. Proceedings of HLT-NAACL 2003, 213-220.
  • 14. Yoshimasa Tsuruoka, Jun'ichi Tsujii: Boosting Precision and Recall of Dictionary-based Protein Name Recognition. Proceedings of the ACL-03 Workshop on Natural Language Processing in Biomedicine 2003, 41-48.
  • 15. Yu H, E. Agichtein: Extracting synonymous gene and protein terms from biological literature. Bioinformatics 2003, 19(ISMB supplement):340-349.


2005 AutoAnnotDocsWithNormalizedGeneLists

  • Author: Ryan T. McDonald, Fernando Pereira, Jeremiah Crim
  • Title: Automatically Annotating Documents with Normalized Gene Lists
  • URL: http://www.biomedcentral.com/1471-2105/6/s1/s13