Word Frequency List

From GM-RKB

A Word Frequency List is a frequency list that records a word frequency value for each word in some corpus.



References

2012

  • http://www.monlp.com/2012/03/26/calculating-word-frequency-tables/
    • Now that we can segment words and sentences, it is possible to produce word and tuple frequency tables. Here I show you how to create a word frequency table for a large collection of text files.

       Word frequency tables are useful for a wide range of applications, including collocation detection, spelling correction, and n-gram modeling. This article concentrates on simple word frequencies, but the code can (and will) be extended to also calculate n-grams.

      Applications that require word frequency tables need the frequencies across the problem domain. Typically the domain is a subset of texts (e.g. news articles), but it is frequently a larger domain (e.g. English literature). Either way, the sample text is probably going to be large and consist of many (possibly thousands or tens of thousands of) texts. The script below was written to create a word frequency table for all of the text files in a directory. It has been used with both the entire English language Gutenberg library, and the entire set of Wikipedia English pages. Preparing these texts will be covered in future articles.

      Frequency tables are stored using NLTK’s FreqDist class. This derives from a standard Python Dictionary, but stores a word count for each word (key). This allows the resulting table to be used by NLTK, if so desired.

      Words are segmented using my own word and sentence segmenter. Punctuation and numbers are dropped from the word counts, but these could be easily included if your application requires it. Acronyms, abbreviations, and acronyms that include numeric digits are included. E.g. “1970” is dropped because it is a number, but “1970s” (an abbreviation) is not.
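
The excerpt above describes the approach but does not reproduce the script. The following is a minimal sketch of the same idea, not the author's code. It makes several assumptions: `collections.Counter` stands in for NLTK's FreqDist (which, like it, maps each word to a count), a simple regular expression stands in for the author's own segmenter, and the names `TOKEN_RE`, `tokenize`, and `word_frequencies` are hypothetical. Unlike a real segmenter, the regular expression splits dotted acronyms such as "U.S." into separate letters.

```python
import re
from collections import Counter
from pathlib import Path

# Assumed token pattern (not the author's segmenter): a run of letters and
# digits, optionally followed by an apostrophe suffix such as "'t" or "'s".
TOKEN_RE = re.compile(r"[A-Za-z0-9]+(?:'[A-Za-z]+)?")


def tokenize(text):
    """Return lowercase tokens, dropping punctuation and purely numeric tokens."""
    tokens = TOKEN_RE.findall(text)
    # "1970" is all digits and is dropped; "1970s" contains a letter and is kept.
    return [t.lower() for t in tokens if not t.isdigit()]


def word_frequencies(directory, pattern="*.txt"):
    """Count words across every file matching `pattern` under `directory`."""
    counts = Counter()  # stand-in for a FreqDist-style word -> count table
    for path in sorted(Path(directory).rglob(pattern)):
        text = path.read_text(encoding="utf-8", errors="ignore")
        counts.update(tokenize(text))
    return counts
```

Calling `word_frequencies("corpus")` on a hypothetical `corpus` directory returns the table, and `most_common(10)` on the result lists the ten most frequent words. Lowercasing is a design choice of this sketch; the excerpt does not say whether the original script folds case.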