2005 SummarizingItemsetPatterns

(Yan et al., 2005) ⇒ Xifeng Yan, Hong Cheng, Jiawei Han, Dong Xin. (2005). “Summarizing Itemset Patterns: a profile-based approach.” In: Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining. doi:10.1145/1081870.1081907

Subject Headings: Frequent Itemset Summarization Task, Frequent Itemset Mining Task, Pattern Summarization Task, Hierarchical Agglomerative Clustering Task, K-Means Clustering Task.

Notes

Cited By

Google Scholar: ~100 Citations.
ACM DL: ~ 79 Citations

Quotes

Abstract

Frequent-pattern mining has been studied extensively on scalable methods for mining various kinds of patterns including itemsets, sequences, and graphs. However, the bottleneck of frequent-pattern mining is not at the efficiency but at the interpretability, due to the huge number of patterns generated by the mining process. In this paper, we examine how to summarize an collection of itemset patterns using only K representatives, a small number of patterns that a user can handle easily. The K representatives should not only cover most of the frequent patterns but also approximate their supports. A generative model is built to extract and profile these representatives, under which the supports of the patterns can be easily recovered without consulting the original dataset. Based on the restoration error, we propose a quality measure function to determine the optimal value of parameter K. Polynomial time algorithms are developed together with several optimization heuristics for efficiency improvement. Empirical studies indicate that we can obtain compact summarization in real datasets.

1. Introduction

2. Pattern Profile

3. Pattern Summarization

3.1 Hierarchical Agglomerative Clustering

3.2 K-means Clustering

3.3 Optimization Heuristics

3.3.1 Closed Itemsets vs. Frequent Itemsets

3.3.2 Approximate Proﬁles

3.4 Quality Evaluation

4. Empirical Study

4.1 Real Datasets

4.2 Synthetic Datasets

5. Related Work

6. Conclusions

References

[1] F. Afrati, A. Gionis, and H. Mannila, Approximating a collection of frequent sets. In Proc. of 2004 ACM Int. Conf. on Knowledge Discovery in Databases (KDD’04), pages 12 : 19, 2004.

[2] R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases, In Proc. of 1993 Int. Conf. on Management of Data (SIGMOD’93), pages 207:216, 1993.

[3] R. Agrawal and R. Srikant, Mining sequential patterns, In Proc. 0f1995 Int. Conf. on Data Engineering ([CDE’95}, pages 3:14, 1995.

[4] L. Baker and A. McCallum, Distributional clustering of words for text classiﬁcation, In Proc. of 1998 ACM Int. Conf. on Research and Development in Information Retrieval (SIGIR’98), pages 96:103, 1998.

[5] R. Bayardo, Efﬁciently mining long patterns from databases, In Proc. of 1998 Int. Conf. on Management of Data (SIGMOD’98), pages 85:93, 1998.

[6] T. Calders and B. Goethals, Mining all non-derivable frequent itemsets, In Proc. of 2002 European Conf. on Principles of Data Mining and Knowledge Discovery (PKDD’02), pages 74:85, 2002.

[7] L. Dehaspe, H. Toivonen, and R. King, Finding frequent substructures in chemical compounds, In Proc. of 1998 Int. Conf. on Knowledge Discovery and Data Mining (KDD’98), pages 30:36, 1998.

[8] M. Deshpande, M. Kuramochi, and G. Karypis, Frequent sub-structure-based approaches for classifying chemical compounds, In Proc. of 2003 Int. Conf. on Data Mining ([CDM’OB}, pages 35:42, 2003.

[9] I. Dhillon, S. Mallela, and R. Kumar, A divisive information-theoretic feature clustering algorithm for text classiﬁcation, J. of Machine Learning Research, 3:1265:1287, 2003.

[10] D. Gunopulos, H. Mannila, R. Khardon, and H. Toivonen, Data mining, hypergraph transversals, and machine learning, In Proc. 1997 ACM SIGACT—SIGMOD-SIGART Symposium on Principles of Database Systems (PODS’97), pages 209 : 216, 1997.

[11] J. Han, J. Wang, Y. Lu, and P. Tzvetkov, Mining top—k frequent closed patterns without minimum support, In Proc. of 2002 Int. Conf. on Data Mining ([CDM’02), pages 211:218, 2002.

[12] W. Hoeffding, Probability inequalities for sums of bounded random variables, J. American Statistical Associations, 58:13:30, 1963.

[13] L. Holder, D. Cook, and S. Djoko, Substructure discovery in the subdue system, In Proc. AAAI94 Workshop on Knowledge Discovery in Databases (KDD94), page 169180, 1994.

[14] J. Huan, W. Wang, D. Bandyopadhyay, J. Snoeyink, J. Prins, and A. Tropsha, Mining spatial motifs from protein structure graphs, In Proc. of 8th Amt. Int. Conf. on Research in Computational Molecular Biology (RECOMB’W), pages 308:315, 2004.

[15] M. Hutchins, H. Foster, T. Goradia, and T. Ostrand, Experiments 0f the effectiveness of dataﬁow— and controlﬁow—based test adequacy criteria, In Proc. of 16th Int. Conf. on Software engineering ([CSE’94), pages 191:200, 1994.

[16] R. Kohavi, C. Brodley, B. Frasca, L. Mason, and Z. Zheng, KDD—Cup 2000 organizers7 report: Feeling the onion, SIGKDD Explorations, 2:86:98, 2000.

[17] T, Mielikainen and H. Mannila, The pattern ordering problem, In Prof. 7th European Conf. on Principles and Practice of Knowledge Discovery in Databases (PKDD’03), pages 327:338, 2003.

[18] E. Omiecinski, Alternative interest measures for mining associations, [EEE Trans. Knowledge and Data Engineering, 15:57:69, 2003.

[19] N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal Discovering frequent closed itemsets for association rules, In Proc. of 7th Int. Conf. on Database Theory ([CDT’99}, pages 398:416, 1999.

[20] D. Pavlov, H. Mannila, and P. Smyth, Beyond independence: Probabilistic models for query approximation on binary transaction data, [EEE Trans. Knowledge and Data Engineering, 15:1409:1421, 2003.

[21] J. Pei, G. Dong, W. Zou, and J. Han, On computing condensed frequent pattern bases, In Proc. 2002 Int. Conf. on Data Mining ([CDM’02), pages 378:385, 2002.

[22] J. Pei, A. Tung, and J. Han, Fault-tolerant frequent pattern mining: Problems and challenges, In Proc. of 2001 ACM Int. Workshop Data Mining and Knowledge Discovery (DMKD’01), pages 7:12, 2001.

[23] M. Steinbach, P. Tan, and V. Kumar, Support envelopes: a technique for exploring the structure of association patterns, In Proc. of 2002 ACM Int. Conf. on Knowledge Discovery in Databases (KDD’04), pages 296 : 305, 2004.

[24] P. Tan, V. Kumar, and J. Srivastava, Selecting the right interestingness measure for association patterns, In Proc. of 2002 ACM Int. Conf. on Knowledge Discovery in Databases (KDD’02), pages 32:41, 2002.

[25] K. Wang, C. Xu, and B. Lin, Clustering transactions using large items, In Proc. of 8th Int. Conf. on Information and Knowledge Management (CIKM’99), pages 483 7 490, 1999.

[26] X. Yan, P. Yu, and J. Han, Graph indexing: A frequent structure-based approach, In Proc. of 2004 ACM Int. Conf. on Management of Data (SIGMOD’04), pages 335:346, 2004.

[27] C. Yang, U. Fayyad, and P. S. Bradley, Efﬁcient discovery of error-tolerant frequent itemsets in high dimensions, In Proc. of 2001 ACM Int. Conf. on Knowledge Discovery in Databases (KDD’01), pages 194:203, 2001.,

	Author	volume	Date Value	title	type	journal	titleUrl	doi	note	year
2005 SummarizingItemsetPatterns	Hong Cheng Dong Xin Xifeng Yan Jiawei Han			Summarizing Itemset Patterns: a profile-based approach			http://portal.acm.org/citation.cfm?id=1081907	10.1145/1081870.1081907

2005 SummarizingItemsetPatterns

Notes

Cited By

Quotes

Abstract

1. Introduction

2. Pattern Profile

3. Pattern Summarization

3.1 Hierarchical Agglomerative Clustering

3.2 K-means Clustering

3.3 Optimization Heuristics

3.3.1 Closed Itemsets vs. Frequent Itemsets

3.3.2 Approximate Proﬁles

3.4 Quality Evaluation

4. Empirical Study

4.1 Real Datasets

4.2 Synthetic Datasets

5. Related Work

6. Conclusions

References

Navigation menu

Search