PPLRE Corpus Stats

From GM-RKB

This page contains summary statistics of the PPLRE Corpus that may help to characterize its contents.


Version 2.3 of the corpus

  • There are ____ (~20,000) abstracts in the collection.
  • Below is an estimate of the distribution of sentences per abstract:

Sents/Abstract  Abstracts  Proportion  ~* = 1%
             1         86        0.8%  *
             2        232        2.1%
             3        438        3.9%
             4        524        4.7%  *
             5        805        7.2%  ***
             6       1194       10.6%  ******
             7       1479       13.2%  *********
             8       1552       13.8%  **********
             9       1482       13.2%  *********
            10       1212       10.8%  ******
            11        909        8.1%  ****
            12        586        5.2%  *
            13        327        2.9%  *
            14        183        1.6%
            15        115        1.0%  *
            16         55        0.5%  *
            17         32        0.3%
            18         12        0.1%
            19          9        0.1%
            20          2        0.0%
            21          2        0.0%
            22          1        0.0%
            23          2        0.0%
            25          1        0.0%
            35          1        0.0%
         TOTAL     11,241
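
Each value in the Proportion column is the row's abstract count divided by the 11,241 total, shown to one decimal place. A minimal sketch for re-checking a row, assuming a POSIX shell with awk (the helper name `proportion` is hypothetical and not part of the corpus tooling):

```shell
# proportion COUNT TOTAL
# Hypothetical helper: prints COUNT as a percentage of TOTAL, one decimal place.
proportion() {
    awk -v c="$1" -v t="$2" 'BEGIN { printf "%.1f%%\n", 100 * c / t }'
}

proportion 86 11241     # abstracts with 1 sentence; prints 0.8%
proportion 1552 11241   # abstracts with 8 sentences; prints 13.8%
```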


Number of Organisms per Document

  • ~50% of the documents in the curated set contain only one (1) organism.

Number of Location (GO) Classes per Document

  • ~50% of the documents contain only one localization.

Number of Protein (Instances) per Document


Version 2.5 of the corpus

  • Currently under construction
  • There are ____ (~260,000) abstracts in the collection.

Misc

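# For each PSID in the concordance (skipping comment lines and REMOVED entries),
# append "<PSID> <sentence count> <path>" to /tmp/PPLREsentences.txt.
grep -v "#" ../concordance.tab | grep -v REMOVED | awk '{print $1}' |
while read -r PSID; do
    printf '%s ' "$PSID"
    wc -l "$PSID/2_AnnotatorFiles/v2.2/AbstractDir/sentences.txt"
done >> /tmp/PPLREsentences.txt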
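
The loop above writes one line per abstract in the form "<PSID> <sentence count> <path>". A table like the sentences-per-abstract distribution could then be tabulated from that file roughly as follows. This is only a sketch, not necessarily how the published table was produced: the function name `sentence_histogram` is hypothetical, and the bar column assumes one `*` per whole percent, rounded.

```shell
# sentence_histogram FILE
# Hypothetical helper: reads lines of the form "<PSID> <count> <path>" and
# prints tab-separated rows: sentences/abstract, abstracts, proportion, bar.
sentence_histogram() {
    awk '
        NF >= 2 { count[$2]++; total++ }      # tally abstracts per sentence count
        END {
            for (n in count) {
                pct = 100 * count[n] / total
                bar = ""
                # Assumption: one "*" per percent, rounded to the nearest integer.
                for (i = 0; i < int(pct + 0.5); i++) bar = bar "*"
                printf "%d\t%d\t%.1f%%\t%s\n", n, count[n], pct, bar
            }
        }
    ' "$1" | sort -n                          # order rows by sentences/abstract
}
```

It would be called as `sentence_histogram /tmp/PPLREsentences.txt`. Sorting is done outside awk because the iteration order of `for (n in count)` is unspecified.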