070309 DrAmosBairoch
- (Bairoch, 2007) ⇒ Amos Bairoch. (2007). “Interview with Amos Bairoch.” personal interview by Gabor Melli.
Subject Headings: SwissProt.
Quotes
Overview
I met with Dr. Amos Bairoch, a Swiss-Prot expert, on 2007-March-09 to discuss the application of Swiss-Prot to the PPLRE project. Topics discussed included: Swiss-Prot, PPLRE, NER - Protein, Swissknife, IE - Life Sciences
Outcome tasks included:
- Download the latest version of Swiss-Prot data (DONE)
- Analyze the quantity of SUBCELLULAR LOCATION entries in Swiss-Prot. (DONE)
- Follow up with Fiona about creating a mapping between PSortdb's and Swiss-Prot's SCL terminology.
- Analyze the overlap between ePSORTdb and Swiss-Prot.
- Extract the relevant information from Swiss-Prot.
- Decide on how best to use the information (team meeting).
- Move Swiss-Prot processing code to use Swissknife.
- Research the NER - Protein task again.
Subcellular Location Data in Swiss-Prot
Controlled Vocabulary for SUBCELLULAR LOCATION
- By happy coincidence Swiss-Prot is about to update the SUBCELLULAR LOCATION data field to align to a Controlled Vocabulary.
- He met earlier with Fiona and began discussions on creating a mapping to their terminology. (Task: chat with Fiona about this)
- The SUBCELLULAR LOCATION field will still contain free form comments.
Swiss-Prot and ePSORTdb
I spent some time with him working through an example in order to be clear about how to align ePSORTdb data with Swiss-Prot data.
The Example: Ubiquinol oxidase
- The protein was from E. coli: Ubiquinol oxidase polypeptide II precursor.
- 1) It is in ePSORTdb and has a PMID reference: http://db.psort.org/php/e/annotation.php?gi=118071
- 2) It has a Swiss-Prot entry with a comment on "SUBCELLULAR LOCATION": http://expasy.org/uniprot/P0ABJ1
Location/Localization Terminology
- Notice how the "COMMENTS" section of the Swiss-Prot record contains the entry:
- SUBCELLULAR LOCATION: Cell inner membrane; multi-pass membrane protein.
- On ePSORTdb the experimental_scl property reads:
- CytoplasmicMembrane
- This is an example where we need the mapping between the two projects' vocabularies.
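A minimal sketch of what such a mapping lookup could look like. The only pair taken from these notes is "Cell inner membrane" to "CytoplasmicMembrane"; the function name map_scl and the UNMAPPED fallback are made up for illustration, and the real table would come out of the terminology discussion with Fiona.

```shell
# Hypothetical helper: map the leading term of a Swiss-Prot
# SUBCELLULAR LOCATION comment to an ePSORTdb experimental_scl value.
map_scl() {
  # Keep only the text before the first ";" and trim spaces and dots.
  term=$(printf '%s' "$1" | sed -e 's/;.*$//' -e 's/^ *//' -e 's/[ .]*$//')
  case "$term" in
    # The single pair observed in the Ubiquinol oxidase example.
    "Cell inner membrane") echo "CytoplasmicMembrane" ;;
    # Everything else still needs a curated mapping.
    *) echo "UNMAPPED" ;;
  esac
}

map_scl "Cell inner membrane; multi-pass membrane protein."
```

Terms that fall through to UNMAPPED give a quick list of vocabulary that the mapping does not yet cover.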
Matched References
- Interestingly the single PMID reference in the ePSORTdb record (11017202) also exists in the Swiss-Prot record.
- This is likely because few papers report on experiments on this protein.
SUBCELLULAR LOCATION - Comments
- Notice that one of the comments in the Swiss-Prot record mentions a "SUBCELLULAR LOCATION" of Cell inner membrane multi-pass membrane protein
- This differs from ePSORTdb's entry of CytoplasmicMembrane.
- This is an example where the mapping between the two vocabularies is required.
- Dr. Bairoch mentioned that when the location name is followed by a qualifier in parentheses, for example "(By Similarity)" or "(Predicted)", the entry has not been experimentally validated.
- A quick test through Swiss-Prot for entries withOUT parentheses suggests that there are
- 25,607 experimentally validated Bacteria proteins
- 1,343 experimentally validated Archaea proteins.
% grep "SUBCELLULAR LOCATION" uniprot_sprot_bacteria.dat | grep '\-!-' | grep -v "(" | wc -l
25607
% grep "SUBCELLULAR LOCATION" uniprot_sprot_archaea.dat | grep '\-!-' | grep -v "(" | wc -l
1343
- A quick test through Swiss-Prot for entries WITH parentheses suggests that there are
- 31,689 NON-experimentally tested Bacteria proteins
- 1,603 NON-experimentally tested Archaea proteins.
% grep "SUBCELLULAR LOCATION" uniprot_sprot_archaea.dat | grep '\-!-' | grep "(" | wc -l
1603
% grep "SUBCELLULAR LOCATION" uniprot_sprot_bacteria.dat | grep '\-!-' | grep "(" | wc -l
31689
- NOTE: I may have overlooked something significant, but my guess is that at worst the actual numbers are four times (4x) smaller, i.e. roughly 6,000 validated proteins, which is still a good size.
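One thing the line-based grep can miss: it counts comment lines rather than entries, and a qualifier such as "(By similarity)" that wraps onto a continuation line is not seen. A rough per-entry count could look like the sketch below. It assumes the standard flat-file layout ("CC   -!- " opens a comment, following "CC" lines continue it, "//" ends an entry); the function name count_scl_entries is made up for illustration.

```shell
# Prints "<entries with a SUBCELLULAR LOCATION comment without parentheses>
# <entries whose SUBCELLULAR LOCATION comments all carry parentheses>".
count_scl_entries() {
  awk '
    # Classify the comment collected so far, then reset the buffer.
    function end_comment() {
      if (in_scl) {
        if (comment ~ /\(/) flagged = 1; else plain = 1
      }
      in_scl = 0; comment = ""
    }
    # A new comment block starts here.
    /^CC +-!- / {
      end_comment()
      if ($0 ~ /^CC +-!- SUBCELLULAR LOCATION:/) { in_scl = 1; comment = $0 }
      next
    }
    # Continuation line of the current comment block.
    /^CC / { if (in_scl) comment = comment " " $0; next }
    # End of entry: count it once, preferring the unqualified case.
    /^\/\// {
      end_comment()
      if (plain) n_plain++
      else if (flagged) n_flagged++
      plain = 0; flagged = 0
      next
    }
    # Any other line type closes an open comment block.
    { end_comment() }
    END { printf "%d %d\n", n_plain, n_flagged }
  '
}
```

Usage would be along the lines of `count_scl_entries < uniprot_sprot_bacteria.dat`. The parenthesis test is the same heuristic as above, so it inherits the same caveats.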
SUBCELLULAR LOCATION - References
- The Swiss-Prot record has many references for this protein: thirteen (13) of them.
- Each reference is labelled with an indicator of the type of information that was extracted from the paper.
- Notice how reference number ten (10) to PMID=16079137 is labeled with "SUBUNIT, AND SUBCELLULAR LOCATION".
- Notice that this PMID differs from the one referenced in ePSORTdb. The reason for this is likely that it is a newer publication (2005).
- We can use this label ourselves to get papers that contain experimentally validated OPLs.
- A quick test through Swiss-Prot for entries WITH references that are labeled as "SUBCELLULAR LOCATION" suggests that:
- There are 543 references with experimentally validated Bacteria proteins
- There are 13 references with experimentally validated Archaea proteins
% grep "SUBCELLULAR LOCATION" uniprot_sprot_bacteria.dat | grep RP | wc -l
543
% grep "SUBCELLULAR LOCATION" uniprot_sprot_archaea.dat | grep RP | wc -l
13
- Note: the number of papers is significantly smaller than the number of proteins (543 vs. 25607) because the labeling of papers began later.
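To go from counting these labels to actually collecting the papers, something like the following sketch could pull the PubMed identifiers of references whose RP line mentions SUBCELLULAR LOCATION. It assumes the usual line order within a reference (RN, then RP, then RX) and an RX line of the form "PubMed=<digits>;". The function name scl_pmids is made up for illustration.

```shell
# Prints the distinct PMIDs of references labeled SUBCELLULAR LOCATION.
scl_pmids() {
  awk '
    # RN opens a new reference block.
    /^RN / { rp = ""; next }
    # RP text may wrap, so join all RP lines of the reference.
    /^RP / { line = $0; sub(/^RP +/, "", line); rp = rp " " line; next }
    # RX carries the cross-reference to PubMed.
    /^RX / {
      if (rp ~ /SUBCELLULAR LOCATION/ && match($0, /PubMed=[0-9]+/))
        print substr($0, RSTART + 7, RLENGTH - 7)
      next
    }
    # End of entry.
    /^\/\// { rp = "" }
  ' | sort -u
}
```

Usage would be along the lines of `scl_pmids < uniprot_sprot_bacteria.dat > scl_pmids.txt`. References without a PubMed cross-reference are silently skipped in this sketch.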
New Accession Number
- Notice that the accession number used in ePSORTdb differs from the one in the Swiss-Prot record. This is an example where the Swiss-Prot record was split into two records two years ago, when they decided to have each entry be specific to one organism. In this case the protein was formerly shared between "E. coli" and "E. coli O6".
- It is possible to perform the join by looking through Swiss-Prot's old accession numbers.
- He took an action item down that he would contact Fiona about updating ePSORTdb's Swiss-Prot reference.
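Until ePSORTdb is updated, the join through old accession numbers could be sketched as below. It assumes that each entry's AC lines list the primary accession first, followed by the older (secondary) ones, separated by ";". The function name ac_map is made up for illustration, and X00000 in the usage line is a placeholder rather than a real accession.

```shell
# Prints one "accession<TAB>primary accession" line for every
# accession (primary or secondary) found in a Swiss-Prot flat file.
ac_map() {
  awk '
    /^AC / {
      line = $0; sub(/^AC +/, "", line); gsub(/;/, "", line)
      n = split(line, acc, " ")
      for (i = 1; i <= n; i++) {
        # The first accession seen in an entry is the primary one.
        if (primary == "") primary = acc[i]
        print acc[i] "\t" primary
      }
      next
    }
    # End of entry: forget the primary accession.
    /^\/\// { primary = "" }
  '
}

# Look up the current accession for an old one.
printf '%s\n' 'AC   P0ABJ1; X00000;' '//' | ac_map | awk -F'\t' '$1 == "X00000" { print $2 }'
```

Because a split record leaves the same old accession on several entries, one old accession can map to more than one primary accession; the organism would then be needed to pick the right one.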
Whole Paper
- The paper reference in Swiss-Prot for the localization of this protein (PMID=16079137) is interesting in that its abstract does not mention the protein.
- The reason for this absence is that the paper contains a multitude of results: ~43 proteins in total.
- Notice the wonderfully suggestive title: Protein Complexes of the Escherichia coli Cell Envelope
- http://www.jbc.org/cgi/reprint/280/41/34409.pdf
- (BTW, I manually experimented with the whole document. The first challenge is that the PDF is not in PubMed Central, the second challenge is that even with Adobe's latest version of Acrobat the extraction of text is still very noisy. Many sentences are chopped up and spaces missing. I.e. not a discouraging result)
General Recommendations for IE from BioMed papers
- One way to improve performance will be to use the whole document not just the abstract.
- His experience with text mining however also suggests that PDF to text conversion is problematic.
Other Candidate IE Tasks
His comments on other candidate IE tasks that came to mind:
Post Translational Modification (PTM)
- (Wikipedia, 2009) ⇒ http://en.wikipedia.org/wiki/Posttranslational_modification
- Pros: This relation is clearly annotated in Swiss-Prot
- Cons: This relation is complicated and unlikely to be in one sentence or in the abstract
- Cons: There are other groups working on this.
Mutation and Variations
- Another area that he thought relevant is "mutation and variations".
- He pointed me to one of the earlier papers on the application of information extraction to this domain:
- (Horn et al., 2004) ⇒ F. Horn, A. L. Lau and F. E. Cohen. (2004). “Automated extraction of mutation data from the literature: application of MuteXt to G protein-coupled receptors and nuclear hormone receptors.” In: Bioinformatics
- A quick review of the paper suggests that a manual surface+NER pattern was used in that experiment. Here is a quote: The pattern must start with one amino acid in the one- or three-letter code followed by a number, and optionally by another amino acid encoded with the same letter code format as the first one. The regular expression we use is: ([A-Z][1-9][0-9]+$)|([A-Z][1-9][0-9]*[A-Z]$)|([A-Z][a-z][a-z][1-9][0-9]*$)|([A-Z][a-z][a-z][1-9][0-9]*[A-Z][a-z][a-z]$)
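To get a feel for what the quoted pattern accepts, here is a small sketch that applies it token by token with grep -E. The ASCII hyphens and the "+" and "*" quantifiers are my reading of the expression as printed, and the function name find_mutation_tokens is made up for illustration; this is not the MuteXt implementation.

```shell
# The quoted expression, transcribed with ASCII ranges and quantifiers.
MUTEXT_RE='([A-Z][1-9][0-9]+$)|([A-Z][1-9][0-9]*[A-Z]$)|([A-Z][a-z][a-z][1-9][0-9]*$)|([A-Z][a-z][a-z][1-9][0-9]*[A-Z][a-z][a-z]$)'

# Split stdin on whitespace and keep the tokens that match.
find_mutation_tokens() {
  tr -s '[:space:]' '\n' | grep -E "$MUTEXT_RE"
}

printf 'The R25K and Ala123Gly variants of E coli protein\n' | find_mutation_tokens
```

The expression is anchored only at the end of the token, so a token followed directly by punctuation (for example "R25K,") would not match without stripping the punctuation first.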
Transcription Regulation
- The final IE task he suggested is the extraction of data on Transcription Regulation.
- Pros: The Wasserman Lab at UBC (http://www.cisreg.ca/) is interested in this data.
Miscellaneous Notes
- He uses http://www.crisp.com to (impressively) navigate through biomedical data
- Swiss-Prot has ~60 +/-10 annotators
- Swiss-Prot will likely move over to an XML based repository within the year.
- He pointed me to the Swissknife tool
- A Perl-based tool to process files in Swiss-Prot Accession Format
- ftp://ftp.ebi.ac.uk/pub/software/swissprot/Swissknife/
- http://swissknife.sourceforge.net/docs/