2005 SummarizingItemsetPatterns
Jump to navigation
Jump to search
- (Yan et al., 2005) ⇒ Xifeng Yan, Hong Cheng, Jiawei Han, Dong Xin. (2005). “Summarizing Itemset Patterns: a profile-based approach.” In: Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining. doi:10.1145/1081870.1081907
Subject Headings: Frequent Itemset Summarization Task, Frequent Itemset Mining Task, Pattern Summarization Task, Hierarchical Agglomerative Clustering Task, K-Means Clustering Task.
Notes
Cited By
- Google Scholar: ~100 Citations.
- ACM DL: ~ 79 Citations
Quotes
Abstract
- Frequent-pattern mining has been studied extensively on scalable methods for mining various kinds of patterns including itemsets, sequences, and graphs. However, the bottleneck of frequent-pattern mining is not at the efficiency but at the interpretability, due to the huge number of patterns generated by the mining process. In this paper, we examine how to summarize an collection of itemset patterns using only K representatives, a small number of patterns that a user can handle easily. The K representatives should not only cover most of the frequent patterns but also approximate their supports. A generative model is built to extract and profile these representatives, under which the supports of the patterns can be easily recovered without consulting the original dataset. Based on the restoration error, we propose a quality measure function to determine the optimal value of parameter K. Polynomial time algorithms are developed together with several optimization heuristics for efficiency improvement. Empirical studies indicate that we can obtain compact summarization in real datasets.
1. Introduction
2. Pattern Profile
3. Pattern Summarization
3.1 Hierarchical Agglomerative Clustering
3.2 K-means Clustering
3.3 Optimization Heuristics
3.3.1 Closed Itemsets vs. Frequent Itemsets
3.3.2 Approximate Profiles
3.4 Quality Evaluation
4. Empirical Study
4.1 Real Datasets
4.2 Synthetic Datasets
5. Related Work
6. Conclusions
References
- [1] F. Afrati, A. Gionis, and H. Mannila, Approximating a collection of frequent sets. In Proc. of 2004 ACM Int. Conf. on Knowledge Discovery in Databases (KDD’04), pages 12 : 19, 2004.
- [2] R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases, In Proc. of 1993 Int. Conf. on Management of Data (SIGMOD’93), pages 207:216, 1993.
- [3] R. Agrawal and R. Srikant, Mining sequential patterns, In Proc. 0f1995 Int. Conf. on Data Engineering ([CDE’95}, pages 3:14, 1995.
- [4] L. Baker and A. McCallum, Distributional clustering of words for text classification, In Proc. of 1998 ACM Int. Conf. on Research and Development in Information Retrieval (SIGIR’98), pages 96:103, 1998.
- [5] R. Bayardo, Efficiently mining long patterns from databases, In Proc. of 1998 Int. Conf. on Management of Data (SIGMOD’98), pages 85:93, 1998.
- [6] T. Calders and B. Goethals, Mining all non-derivable frequent itemsets, In Proc. of 2002 European Conf. on Principles of Data Mining and Knowledge Discovery (PKDD’02), pages 74:85, 2002.
- [7] L. Dehaspe, H. Toivonen, and R. King, Finding frequent substructures in chemical compounds, In Proc. of 1998 Int. Conf. on Knowledge Discovery and Data Mining (KDD’98), pages 30:36, 1998.
- [8] M. Deshpande, M. Kuramochi, and G. Karypis, Frequent sub-structure-based approaches for classifying chemical compounds, In Proc. of 2003 Int. Conf. on Data Mining ([CDM’OB}, pages 35:42, 2003.
- [9] I. Dhillon, S. Mallela, and R. Kumar, A divisive information-theoretic feature clustering algorithm for text classification, J. of Machine Learning Research, 3:1265:1287, 2003.
- [10] D. Gunopulos, H. Mannila, R. Khardon, and H. Toivonen, Data mining, hypergraph transversals, and machine learning, In Proc. 1997 ACM SIGACT—SIGMOD-SIGART Symposium on Principles of Database Systems (PODS’97), pages 209 : 216, 1997.
- [11] J. Han, J. Wang, Y. Lu, and P. Tzvetkov, Mining top—k frequent closed patterns without minimum support, In Proc. of 2002 Int. Conf. on Data Mining ([CDM’02), pages 211:218, 2002.
- [12] W. Hoeffding, Probability inequalities for sums of bounded random variables, J. American Statistical Associations, 58:13:30, 1963.
- [13] L. Holder, D. Cook, and S. Djoko, Substructure discovery in the subdue system, In Proc. AAAI94 Workshop on Knowledge Discovery in Databases (KDD94), page 169180, 1994.
- [14] J. Huan, W. Wang, D. Bandyopadhyay, J. Snoeyink, J. Prins, and A. Tropsha, Mining spatial motifs from protein structure graphs, In Proc. of 8th Amt. Int. Conf. on Research in Computational Molecular Biology (RECOMB’W), pages 308:315, 2004.
- [15] M. Hutchins, H. Foster, T. Goradia, and T. Ostrand, Experiments 0f the effectiveness of datafiow— and controlfiow—based test adequacy criteria, In Proc. of 16th Int. Conf. on Software engineering ([CSE’94), pages 191:200, 1994.
- [16] R. Kohavi, C. Brodley, B. Frasca, L. Mason, and Z. Zheng, KDD—Cup 2000 organizers7 report: Feeling the onion, SIGKDD Explorations, 2:86:98, 2000.
- [17] T, Mielikainen and H. Mannila, The pattern ordering problem, In Prof. 7th European Conf. on Principles and Practice of Knowledge Discovery in Databases (PKDD’03), pages 327:338, 2003.
- [18] E. Omiecinski, Alternative interest measures for mining associations, [EEE Trans. Knowledge and Data Engineering, 15:57:69, 2003.
- [19] N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal Discovering frequent closed itemsets for association rules, In Proc. of 7th Int. Conf. on Database Theory ([CDT’99}, pages 398:416, 1999.
- [20] D. Pavlov, H. Mannila, and P. Smyth, Beyond independence: Probabilistic models for query approximation on binary transaction data, [EEE Trans. Knowledge and Data Engineering, 15:1409:1421, 2003.
- [21] J. Pei, G. Dong, W. Zou, and J. Han, On computing condensed frequent pattern bases, In Proc. 2002 Int. Conf. on Data Mining ([CDM’02), pages 378:385, 2002.
- [22] J. Pei, A. Tung, and J. Han, Fault-tolerant frequent pattern mining: Problems and challenges, In Proc. of 2001 ACM Int. Workshop Data Mining and Knowledge Discovery (DMKD’01), pages 7:12, 2001.
- [23] M. Steinbach, P. Tan, and V. Kumar, Support envelopes: a technique for exploring the structure of association patterns, In Proc. of 2002 ACM Int. Conf. on Knowledge Discovery in Databases (KDD’04), pages 296 : 305, 2004.
- [24] P. Tan, V. Kumar, and J. Srivastava, Selecting the right interestingness measure for association patterns, In Proc. of 2002 ACM Int. Conf. on Knowledge Discovery in Databases (KDD’02), pages 32:41, 2002.
- [25] K. Wang, C. Xu, and B. Lin, Clustering transactions using large items, In Proc. of 8th Int. Conf. on Information and Knowledge Management (CIKM’99), pages 483 7 490, 1999.
- [26] X. Yan, P. Yu, and J. Han, Graph indexing: A frequent structure-based approach, In Proc. of 2004 ACM Int. Conf. on Management of Data (SIGMOD’04), pages 335:346, 2004.
- [27] C. Yang, U. Fayyad, and P. S. Bradley, Efficient discovery of error-tolerant frequent itemsets in high dimensions, In Proc. of 2001 ACM Int. Conf. on Knowledge Discovery in Databases (KDD’01), pages 194:203, 2001.,
Author | volume | Date Value | title | type | journal | titleUrl | doi | note | year | |
---|---|---|---|---|---|---|---|---|---|---|
2005 SummarizingItemsetPatterns | Hong Cheng Dong Xin Xifeng Yan Jiawei Han | Summarizing Itemset Patterns: a profile-based approach | http://portal.acm.org/citation.cfm?id=1081907 | 10.1145/1081870.1081907 |