AutoPhrase Text Segmenter
An AutoPhrase Text Segmenter is an informative phrase text segmenter.
- See: SegPhrase.
References
2018.03.04
2018
- https://github.com/shangjingbo1226/AutoPhrase
- QUOTE:
- Fix a few bugs during the pre-processing and post-processing, i.e., Tokenizer.java. Previously, when the corpus contains characters like /, the results could be wrong or errors may occur.
- When the phrasal segmentation is serving new text, for the phrases (every token is seen in the training corpus) provided in the knowledge base (wiki_quality.txt), the score is set as 1.0. Previously, it was kind of infinite.
- Support extremely large corpus (e.g., 100GB or more). Please comment out the // define LARGE in the beginning of src/utils/parameters.h before you run AutoPhrase on such a large corpus.
- Quality phrases (every token is seen in the raw corpus) provided in the knowledge base will be incorporated during the phrasal segmentation, even if their frequencies are smaller than MIN_SUP.
- Stopwords will be treated as low quality single-word phrases.
- Model files are saved separately. Please check the variable MODEL in both auto_phrase.sh and phrasal_segmentation.sh.
- The end of line is also a separator for sentence splitting.
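The rules in the release notes above can be illustrated with a minimal Python sketch. This is not AutoPhrase's actual code (the tool itself is written in C++ and Java); the function names, the MIN_SUP value, the punctuation set, and the 0.0 score for stopwords are all assumptions chosen for illustration.

```python
import re

MIN_SUP = 10  # hypothetical minimum support threshold; the real value is configurable


def split_sentences(text):
    """Split on line ends as well as on ., ! and ? (the punctuation set is an assumption)."""
    return [s.strip() for s in re.split(r"[.!?\n]+", text) if s.strip()]


def in_knowledge_base(phrase, kb_phrases, vocab):
    """True if the phrase is listed in the knowledge base and every token was seen in the corpus."""
    return phrase in kb_phrases and all(tok in vocab for tok in phrase.split())


def is_candidate(phrase, freq, kb_phrases, vocab):
    """Knowledge-base phrases are kept even when their frequency is below MIN_SUP."""
    return in_knowledge_base(phrase, kb_phrases, vocab) or freq >= MIN_SUP


def quality_score(phrase, kb_phrases, vocab, stopwords, learned_scores):
    """Return an illustrative quality score in [0, 1]."""
    if len(phrase.split()) == 1 and phrase in stopwords:
        return 0.0  # stopwords are low-quality single-word phrases (0.0 is illustrative)
    if in_knowledge_base(phrase, kb_phrases, vocab):
        return 1.0  # capped at 1.0 rather than being effectively infinite
    return learned_scores.get(phrase, 0.0)  # stand-in for the learned classifier output
```

For example, with kb_phrases = {"data mining"} and vocab = {"data", "mining"}, the phrase "data mining" passes is_candidate even at frequency 2 and receives a score of 1.0, while a phrase absent from the knowledge base at the same frequency is filtered out.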
2018b
- (Shang et al., 2018) ⇒ Jingbo Shang, Jialu Liu, Meng Jiang, Xiang Ren, Clare R Voss, and Jiawei Han. (2018). “Automated Phrase Mining from Massive Text Corpora.” In: IEEE Transactions on Knowledge and Data Engineering Journal, PP(99). doi:10.1109/TKDE.2018.2812203
- QUOTE: … In this paper, we propose a novel framework for automated phrase mining, AutoPhrase, which supports any language as long as a general knowledge base (e.g., Wikipedia) in that language is available, while benefiting from, but not requiring, a POS tagger. Compared to the state-of-the-art methods, AutoPhrase has shown significant improvements in both effectiveness and efficiency on five real-world datasets across different domains and languages. Besides, AutoPhrase can be extended to model single-word quality phrases.
2015
- (Liu et al., 2015) ⇒ Jialu Liu, Jingbo Shang, Chi Wang, Xiang Ren, and Jiawei Han. (2015). “Mining Quality Phrases from Massive Text Corpora.” In: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data. ISBN:978-1-4503-2758-9 doi:10.1145/2723372.2751523