2006 TFICFANewTermWeightingSchemefor
- (Reed et al., 2006) ⇒ Joel W. Reed, Yu Jiao, Thomas E. Potok, Brian A. Klump, Mark T. Elmore, and Ali R. Hurson. (2006). “TF-ICF: A New Term Weighting Scheme for Clustering Dynamic Data Streams.” In: Proceedings of the 5th International Conference on Machine Learning and Applications (ICMLA'06).
Subject Headings: Text Item Clustering Algorithm, TF-ICF.
Notes
Cited By
Quotes
Abstract
In this paper, we propose a new term weighting scheme called term frequency-inverse corpus frequency (TF-ICF). It does not require term frequency information from other documents within the document collection and thus enables us to generate the document vectors of N streaming documents in linear time. In the context of a machine learning application, unsupervised document clustering, we evaluated the effectiveness of the proposed approach in comparison to five widely used term weighting schemes through extensive experimentation. Our results show that TF-ICF can produce document clusters that are comparable in quality to those generated by the widely recognized term weighting schemes, and that it is significantly faster than those methods.
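The property described in the abstract can be illustrated with a minimal sketch. The excerpt does not give the weighting formula, so the form used here, log(1 + tf) × log((N + 1) / (n + 1)) with N and n taken from a fixed reference corpus, is an assumption, as are the names `tf_icf_vector`, `REFERENCE_DOC_COUNT`, and `REFERENCE_DOC_FREQ` and the made-up corpus statistics. The point is only that each document is weighted without looking at any other document in the stream.

```python
import math
from collections import Counter

# Hypothetical statistics from a static reference corpus, computed once
# ahead of time (illustrative numbers, not taken from the paper).
REFERENCE_DOC_COUNT = 1000
REFERENCE_DOC_FREQ = {"the": 990, "stream": 40, "cluster": 25}


def tf_icf_vector(tokens, ref_df=REFERENCE_DOC_FREQ, ref_n=REFERENCE_DOC_COUNT):
    """Weight one document using only its own term counts and the fixed
    reference-corpus statistics (assumed formula, see lead-in)."""
    counts = Counter(tokens)
    return {
        term: math.log(1 + tf) * math.log((ref_n + 1) / (ref_df.get(term, 0) + 1))
        for term, tf in counts.items()
    }


# Each streaming document is processed independently of the others.
doc = "the stream the cluster stream novel".split()
weights = tf_icf_vector(doc)
```

Because the reference statistics never change, processing N streaming documents requires one pass per document, which is consistent with the linear-time claim in the abstract.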
Introduction
Document clustering is an enabling technique for many other machine learning applications, such as information classification, filtering, routing, topic tracking, and new event detection [2]. Today, dynamic data stream clustering poses significant challenges to traditional methods.
Typically, clustering algorithms use the Vector Space Model (VSM) [17] to encode documents. The VSM relates terms to documents, and since different terms have different importance in a given document, a term weight is associated with every term [18]. These term weights are often derived from the frequency of a term within a document or set of documents. Many term weighting schemes have been proposed [5,9,18]. Most of these existing methods work under the assumption that the whole data set is available and static. For instance, in order to use the popular Term Frequency - Inverse Document Frequency (TF-IDF) approach and its variants, one needs to know the number of documents in which a term occurred at least once (document frequency). This requires a priori knowledge of the data, and that the data set does not change during the calculation of term weights.
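The dependence of TF-IDF on the whole collection can be seen in a small sketch. It assumes the plain variant tf × log(N / df), one of several TF-IDF forms in use; the function name `tf_idf_vectors` and the toy documents are illustrative only. When a new document arrives, the document frequencies change, so the weights of documents already seen change too.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Compute tf * log(N / df) for every document in a static collection.

    Document frequency (df) is counted over the whole collection, so the
    collection must be known before any weight can be computed.
    """
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return [
        {term: tf * math.log(n / df[term]) for term, tf in Counter(doc).items()}
        for doc in docs
    ]


docs = [["data", "stream"], ["data", "cluster"]]
before = tf_idf_vectors(docs)

# A third document arrives: every earlier vector has to be recomputed.
after = tf_idf_vectors(docs + [["stream", "cluster"]])
```

Here the first document's vector differs between `before` and `after`, even though its own text is unchanged, which is the limitation for dynamic data streams that the passage describes.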
…
References
| | Author | title | year |
|---|---|---|---|
| 2006 TFICFANewTermWeightingSchemefor | Joel W. Reed, Yu Jiao, Thomas E. Potok, Brian A. Klump, Mark T. Elmore, Ali R. Hurson | TF-ICF: A New Term Weighting Scheme for Clustering Dynamic Data Streams | 2006 |