PPLRE Corpus Stats
This page contains summary statistics of the PPLRE Corpus that may help to characterize its contents.
- See: PPLRE Curated Data.
Version 2.3 of the corpus
- There are ____ (~20,000) abstracts in the collection.
- Below is an estimate of the distribution of sentences per abstract:
Sentences/Abstract  Abstracts  Proportion  Histogram (* ~ 1%)
1                   86         0.8%        *
2                   232        2.1%        **
3                   438        3.9%        ****
4                   524        4.7%        *****
5                   805        7.2%        *******
6                   1194       10.6%       ***********
7                   1479       13.2%       *************
8                   1552       13.8%       **************
9                   1482       13.2%       *************
10                  1212       10.8%       ***********
11                  909        8.1%        ********
12                  586        5.2%        *****
13                  327        2.9%        ***
14                  183        1.6%        **
15                  115        1.0%        *
16                  55         0.5%        *
17                  32         0.3%
18                  12         0.1%
19                  9          0.1%
20                  2          0.0%
21                  2          0.0%
22                  1          0.0%
23                  2          0.0%
25                  1          0.0%
35                  1          0.0%
TOTAL               11,241
Number of Organisms per Document
- ~50% of the documents in the curated set contain only one (1) organism.
Number of Location (GO) Classes per Document
- ~50% of the documents contain only one localization.
Number of Protein (Instances) per Document
Version 2.5 of the corpus
- Currently under construction
- There are ____ (~260,000) abstracts in the collection.
Misc
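- Shell loop that appends each PSID and the line count of its sentences.txt file to /tmp/PPLREsentences.txt:
  for PSID in $(grep -v "#" ../concordance.tab | grep -v REMOVED | awk '{print $1}'); do (printf '%s ' "$PSID"; wc -l "$PSID/2_AnnotatorFiles/v2.2/AbstractDir/sentences.txt") >> /tmp/PPLREsentences.txt; done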
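The loop above leaves one line per abstract in /tmp/PPLREsentences.txt, of the form "PSID COUNT PATH". A table like the sentences-per-abstract distribution could be tabulated from that file roughly as follows. This is a minimal sketch, not necessarily the procedure actually used: the function name sentence_histogram is hypothetical, and the rule of one star per percentage point, rounded from the displayed proportion, is an assumption.

```shell
#!/bin/sh
# Hypothetical helper: tabulate sentences per abstract.
# Assumes each input line looks like "PSID COUNT PATH", i.e. field 2 is
# the `wc -l` count written by the loop above.
# Usage: sentence_histogram /tmp/PPLREsentences.txt
sentence_histogram() {
    awk '
        # Keep only lines whose second field is a non-negative integer.
        NF >= 2 && $2 ~ /^[0-9]+$/ {
            n[$2]++
            total++
            if ($2 + 0 > max) max = $2 + 0
        }
        END {
            # Walk counts in ascending order, skipping values never seen.
            for (s = 0; s <= max; s++) {
                if (!(s in n)) continue
                pct = 100 * n[s] / total
                shown = sprintf("%.1f", pct)
                # Assumption: one star per percentage point, rounded to nearest.
                stars = int(shown + 0.5)
                bar = ""
                for (i = 0; i < stars; i++) bar = bar "*"
                printf "%d\t%d\t%s%%\t%s\n", s, n[s], shown, bar
            }
            printf "TOTAL\t%d\n", total
        }
    ' "$1"
}
```

The output is tab-separated (sentences, abstracts, proportion, bar) with a final TOTAL row, so it can be compared directly against the Version 2.3 table.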