KDD 2009 Abstracts Analysis
A KDD-2009 Abstracts Analysis is an Analysis Report for a Research Paper Abstract Analysis Task applied to the Research Papers presented at KDD-2009.
- Context:
- It makes use of a Technical Term Identification Algorithm.
- See: Technical Term.
Ngram-based Analysis
- This report is based on a Word-level Semantic Analysis Task (based on N-grams of Stemmed Words).
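As a rough illustration of the n-gram step (a minimal sketch, not the report's own scripts), the shell function below lowercases its input, splits it on non-alphanumeric characters, and prints every run of N consecutive words. The name `ngrams` is hypothetical. Stemming is left out, and punctuation is dropped rather than kept as separate tokens; both are simplifications.

```shell
# ngrams N: read text on standard input, print one lowercase word N-gram per line.
# Hypothetical helper for illustration; it does no stemming.
ngrams() {
  tr 'A-Z' 'a-z' |
    tr -cs 'a-z0-9' '\n' |
    awk -v n="$1" '
      NF {
        buf[++count] = $1                      # remember every word seen so far
        if (count >= n) {
          out = buf[count - n + 1]             # first word of the current window
          for (i = count - n + 2; i <= count; i++)
            out = out " " buf[i]
          print out
        }
      }'
}
```

For example, `echo "Mining frequent itemsets" | ngrams 2` prints `mining frequent` and `frequent itemsets` on separate lines.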
Two-Word Terms
| Papers | Term |
|---|---|
| 18 | data mining |
| 15 | experimental result |
| 13 | social network |
| 7 | search engine |
| 7 | recommender system |
| 7 | machine learning |
| 7 | learning algorithm |
| 6 | training data |
| 6 | learning method |
| 6 | labeled data |
| 5 | training dataset |
| 5 | synthetic dataset |
| 5 | data point |
| 5 | collaborative filtering |
| 4 | unlabeled data |
| 4 | training example |
| 4 | time series |
| 4 | optimization problem |
| 4 | learning problem |
| 4 | frequent itemset |
| 4 | empirical result |
| 4 | benchmark dataset |
One-Word Terms
| Papers | Term |
|---|---|
| 87 | Data |
| 69 | Paper |
| 56 | Algorithm |
| 51 | Result |
| 47 | Dataset |
| 46 | Problem |
| 43 | Model |
| 39 | Information |
| 36 | Application |
| 36 | Approach |
| 34 | Product |
| 32 | Network |
| 32 | Number |
| 28 | Framework |
| 28 | Feature |
| 26 | Analysis |
| 26 | Time |
| 24 | Experiment |
| 24 | Knowledge |
Code
This section contains the basic code used to identify 1-grams, 2-grams, and 3-grams.
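```bash
#!/bin/bash
# Download the conference site, then extract each talk's abstract and its n-grams.
wget -r http://kdd09.crowdvine.com/
cd .
for file in $(cd ..; /bin/ls 4??? 5???)
do
  fileLoc="../$file"

  # Determine talk type: truncate the track name after its first S, P, W, D, or T
  track=$(grep ">Track:<" "$fileLoc" | awk '{print $2}' | sed "s/://")
  trackG=$(echo "$track" | perl -ne 'chomp; s/([SPWDT]).*/$1/g; print $_')
  mkdir -p "$trackG"

  # Extract the abstract: the line that follows the first class="body match
  start=$(grep -n 'class="body' "$fileLoc" | head -n 1 | cut -d: -f1)
  start=$((start + 1))
  # Strip the paragraph tags and replace non-breaking-space entities
  abstract=$(head -n "$start" "$fileLoc" | tail -n 1 | sed "s/<p>//" | sed "s/<\/p>//" | sed "s/&nbsp;/ /g")
  # Put spaces around punctuation so that each mark becomes its own token
  abstractP=$(echo "$abstract" | perl -ne 's/([\.\,\:\;\"])/ $1 /g; print $_')
  echo "$abstractP" > "$trackG/$file"

  # Write one de-duplicated list of stemmed n-grams per paper
  ../unigram.pl < "$trackG/$file" | ../stemmer.pl | sort -u > "${file}.unigram"
  ../bigram.pl  < "$trackG/$file" | ../stemmer.pl | sort -u > "${file}.bigram"
  ../trigram.pl < "$trackG/$file" | ../stemmer.pl | sort -u > "${file}.trigram"
done
```

bigram.pl

Based on

```perl
#!/usr/bin/perl
# Print every word bigram read from standard input, most frequent first.
use strict;
use warnings;

my %count;
my $word2 = "";    # previous word; empty until the first word has been read

while (<>) {
    chomp;
    tr/A-Z/a-z/;   # lowercase the line
    foreach my $word1 (split) {
        # The very first word has no predecessor, so it starts no bigram.
        $count{"$word2 $word1"}++ if $word2 ne "";
        $word2 = $word1;    # carried across lines, so bigrams may span line breaks
    }
}

# Decreasing count; ties are broken alphabetically to keep the output stable.
foreach my $bigram (sort { $count{$b} <=> $count{$a} || $a cmp $b } keys %count) {
    print "$bigram\n";
}
```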
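The script above leaves one de-duplicated term file per paper (`*.unigram`, `*.bigram`, `*.trigram`) but does not show how the Papers columns in the tables were tallied. One plausible way, sketched here as an assumption rather than as the report's actual procedure, is to count how many per-paper files contain each term. The function name `paper_counts` is hypothetical.

```shell
# paper_counts FILE...: each FILE holds the terms of one paper, one term per line.
# Prints "<number of papers><TAB><term>", most widespread terms first.
# Hypothetical helper; assumes the Papers column is a per-paper (document) count.
paper_counts() {
  for f in "$@"; do
    sort -u "$f"              # count each term at most once per paper
  done |
    sort | uniq -c |          # number of papers in which each term occurs
    sort -k1,1nr -k2 |        # highest counts first, then alphabetical
    awk '{ n = $1; $1 = ""; sub(/^ +/, ""); print n "\t" $0 }'
}
```

Running `paper_counts *.bigram | head -n 20` in the directory holding the per-paper files would then list the twenty most widespread two-word terms.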
Semantic Analysis of Technical Terms
- coming in Dec 2009