nltk.tokenize.punkt Tokenizer
An nltk.tokenize.punkt Tokenizer is a sentence tokenizer included in NLTK that splits a text into sentences using a model trained by an unsupervised algorithm.
- See: NLTK Stemmer.
References
2014
- http://www.nltk.org/_modules/nltk/tokenize/punkt.html
- QUOTE: This tokenizer divides a text into a list of sentences, by using an unsupervised algorithm to build a model for abbreviation words, collocations, and words that start sentences. It must be trained on a large collection of plaintext in the target language before it can be used.
The NLTK data package includes a pre-trained Punkt tokenizer for English.
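The train-then-tokenize workflow described in the quote can be sketched as follows. This is a minimal illustration, not the recommended setup: `training_text` is a hypothetical stand-in for the large plaintext collection the quote calls for, and a model trained on so little text will not learn useful abbreviations or collocations. It assumes NLTK is installed and relies on `PunktSentenceTokenizer` accepting training text in its constructor.

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer

# Hypothetical stand-in for a large plaintext corpus in the target language.
training_text = (
    "The committee met on Monday. It reviewed three proposals. "
    "Each proposal was discussed at length. A vote followed the discussion."
)

# Passing text to the constructor trains the unsupervised model on it.
tokenizer = PunktSentenceTokenizer(training_text)

# tokenize() returns the input text split into a list of sentence strings.
sentences = tokenizer.tokenize("Hello world. This is a test.")
for sentence in sentences:
    print(sentence)
```

For English, the pre-trained model from the NLTK data package is normally a better choice than a model trained on a small sample like the one above.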