Word Frequency List
A Word Frequency List is a frequency list that records a word frequency value for each word in some corpus.
- AKA: Term Frequency Table.
- Context:
- It can range from being an Absolute Word Frequency List (word count table) to being a Relative Word Frequency List.
- It can range from being an English Word Frequency List to being a Chinese Word Frequency List, ...
- It can be produced by a Word Frequency List Generator.
- It can be a Temporal Word Frequency List, such as Google Books Ngram Viewer http://books.google.com/ngrams/
- Example(s):
- Counter-Example(s):
- a Ranked Word List (with word ranks), such as the Dutch Word Frequency List at http://nl.wiktionary.org/wiki/WikiWoordenboek:Lijst_met_1000_basiswoorden
- a Letter Frequency List.
- See: Relative Word Frequency Value, Word Absolute Frequency Value.
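The distinction between an absolute and a relative word frequency list can be illustrated with a minimal Python sketch. The function names and the whitespace tokenization are assumptions made for illustration only; a real list would be built from a properly segmented corpus.

```python
from collections import Counter


def absolute_frequencies(tokens):
    """Map each word to its raw count (an absolute word frequency list)."""
    return Counter(tokens)


def relative_frequencies(tokens):
    """Map each word to count / total tokens (a relative word frequency list)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: count / total for word, count in counts.items()}


# Toy corpus, split on whitespace (an assumption; real corpora need a tokenizer).
tokens = "the cat sat on the mat the end".split()
abs_list = absolute_frequencies(tokens)
rel_list = relative_frequencies(tokens)
```

Here the relative values sum to 1 across the vocabulary, which makes lists from corpora of different sizes comparable, while the absolute counts preserve the raw evidence.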
References
2012
- http://www.monlp.com/2012/03/26/calculating-word-frequency-tables/
- Now that we can segment words and sentences, it is possible to produce word and tuple frequency tables. Here I show you how to create a word frequency table for a large collection of text files.
Word frequency tables are useful for a wide range of applications, including collocation detection, spelling correction, and n-gram modeling. This article concentrates on simple word frequencies, but the code can (and will) be extended to also calculate n-grams.
Applications that require word frequency tables require the frequencies across the problem domain. Typically the domain is a subset of texts (e.g. news articles), but it is frequently a larger domain (e.g. English literature). Either way, the sample text is probably going to be large and consist of many (possibly thousands or tens of thousands of) texts. The script below was written to create a word frequency table for all of the text files in a directory. It has been used with both the entire English language Gutenberg library, and the entire set of Wikipedia English pages. Preparing these texts will be covered in future articles.
Frequency tables are stored using NLTK’s FreqDist class. This derives from a standard Python Dictionary, but stores a word count for each word (key). This allows the resulting table to be used by NLTK, if so desired.
Words are segmented using my own word and sentence segmenter. Punctuation and numbers are dropped from the word counts, but these could be easily included if your application requires it. Acronyms, abbreviations, and acronyms that include numeric digits are included. E.g. “1970” is dropped because it is a number, but “1970s” (an abbreviation) is not.
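The approach described in the quoted article (one table accumulated over every text file in a directory, with punctuation and pure numbers dropped) might look roughly like the sketch below. This is not the article's script: the regular expression is a hypothetical stand-in for the author's own segmenter, `collections.Counter` is used in place of NLTK's FreqDist so that the example needs no third-party install, and the function names and the `*.txt` pattern are assumptions.

```python
import re
import tempfile
from collections import Counter
from pathlib import Path

# Hypothetical stand-in for the article's segmenter: runs of ASCII letters/digits.
# Punctuation never matches, so it is dropped implicitly.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+")


def count_words(text):
    """Count lowercased word tokens, skipping tokens made only of digits."""
    counts = Counter()
    for token in TOKEN_RE.findall(text.lower()):
        if token.isdigit():  # "1970" is dropped, "1970s" is kept
            continue
        counts[token] += 1
    return counts


def frequency_table_for_directory(directory, pattern="*.txt"):
    """Accumulate one word frequency table over all matching files."""
    table = Counter()
    for path in sorted(Path(directory).glob(pattern)):
        text = path.read_text(encoding="utf-8", errors="ignore")
        table.update(count_words(text))  # Counter.update adds the counts
    return table


# Small demonstration on two temporary files.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.txt").write_text("In 1970, music changed.", encoding="utf-8")
    Path(tmp, "b.txt").write_text("The 1970s changed music.", encoding="utf-8")
    table = frequency_table_for_directory(tmp)
```

Processing one file at a time keeps memory use proportional to the vocabulary rather than to the corpus size, which matters for collections on the scale the article mentions.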