2001 DataMiningAtTheIntOfCompSciAndStats

Subject Headings: Data Mining, statistics, pattern recognition, transaction data, correlation.

Notes

This chapter is written for computer scientists, engineers, mathematicians, and scientists who wish to gain a better understanding of the role of statistical thinking in modern data mining. Data mining has attracted considerable attention in both the research and commercial arenas in recent years, involving the application of a variety of techniques from both computer science and statistics. The chapter discusses how computer scientists and statisticians approach data from different but complementary viewpoints and highlights the fundamental differences between statistical and computational views of data mining. In doing so, we review the historical importance of statistical contributions to machine learning and data mining, including neural networks, graphical models, and flexible predictive modeling. The primary conclusion is that closer integration of computational methods with statistical thinking is likely to become increasingly important in data mining applications.

Is data mining as currently practiced substantially different from conventional applied statistics? Certainly if one looks at the published commercial applications of data mining, such as the case studies presented in [BL00], one sees a heavy reliance on techniques that have their lineage in applied statistics. For example, decision trees are perhaps the single most widely used modeling technique in commercial predictive data mining applications [Joh99, Koh00]. They are particularly popular because of their ability both to deal with heterogeneous data types (they can easily handle both categorical and real-valued variables) and to find relatively low-dimensional parsimonious predictors for high-dimensional problems.
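The two properties named in the passage above can be illustrated with a minimal sketch. It is not the chapter's method or any commercial tool's algorithm: it assumes a greedy, CART-style tree that scores splits by Gini impurity, and the customer records, the feature names (`plan`, `monthly_spend`, `region`, `age`), and the churn labels are all hypothetical, invented for illustration. Numeric features get threshold tests, categorical features get equality tests, and the fitted tree ends up using only a subset of the available features.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def candidate_tests(rows, feature):
    """Threshold tests for numeric features, equality tests otherwise."""
    values = {row[feature] for row in rows}
    if all(isinstance(v, (int, float)) for v in values):
        ordered = sorted(values)
        # Midpoints between consecutive distinct values.
        return [("<=", (a + b) / 2) for a, b in zip(ordered, ordered[1:])]
    return [("==", v) for v in sorted(values)]

def passes(row, feature, op, value):
    return row[feature] <= value if op == "<=" else row[feature] == value

def best_split(rows, labels):
    """Return (weighted_gini, feature, op, value), or None if no split exists."""
    best = None
    for feature in rows[0]:
        for op, value in candidate_tests(rows, feature):
            yes = [y for r, y in zip(rows, labels) if passes(r, feature, op, value)]
            no = [y for r, y in zip(rows, labels) if not passes(r, feature, op, value)]
            if not yes or not no:
                continue
            score = (len(yes) * gini(yes) + len(no) * gini(no)) / len(rows)
            if best is None or score < best[0]:
                best = (score, feature, op, value)
    return best

def build_tree(rows, labels, max_depth=2):
    """Grow a small tree greedily; a leaf is the majority class label."""
    split = best_split(rows, labels) if max_depth > 0 and gini(labels) > 0 else None
    if split is None:
        return Counter(labels).most_common(1)[0][0]
    _, feature, op, value = split
    yes = [(r, y) for r, y in zip(rows, labels) if passes(r, feature, op, value)]
    no = [(r, y) for r, y in zip(rows, labels) if not passes(r, feature, op, value)]
    return {
        "test": (feature, op, value),
        "yes": build_tree([r for r, _ in yes], [y for _, y in yes], max_depth - 1),
        "no": build_tree([r for r, _ in no], [y for _, y in no], max_depth - 1),
    }

def predict(tree, row):
    """Follow the tests from the root down to a leaf label."""
    while isinstance(tree, dict):
        feature, op, value = tree["test"]
        tree = tree["yes"] if passes(row, feature, op, value) else tree["no"]
    return tree

def features_used(tree):
    """Set of feature names that appear in any test of the tree."""
    if not isinstance(tree, dict):
        return set()
    return {tree["test"][0]} | features_used(tree["yes"]) | features_used(tree["no"])

# Hypothetical records: two categorical and two numeric features.
rows = [
    {"plan": "basic", "monthly_spend": 20.0, "region": "north", "age": 34},
    {"plan": "basic", "monthly_spend": 25.0, "region": "south", "age": 51},
    {"plan": "basic", "monthly_spend": 45.0, "region": "north", "age": 29},
    {"plan": "basic", "monthly_spend": 50.0, "region": "south", "age": 42},
    {"plan": "premium", "monthly_spend": 22.0, "region": "north", "age": 38},
    {"plan": "premium", "monthly_spend": 28.0, "region": "south", "age": 47},
    {"plan": "premium", "monthly_spend": 60.0, "region": "north", "age": 55},
    {"plan": "premium", "monthly_spend": 48.0, "region": "south", "age": 31},
]
# Hypothetical churn labels, constructed to depend on plan and spend only.
labels = ["yes", "yes", "no", "no", "no", "no", "no", "no"]

tree = build_tree(rows, labels, max_depth=2)
print(sorted(features_used(tree)))
```

On this toy data the tree mixes a numeric threshold test with a categorical equality test and never consults `region` or `age`, which is the sense in which a tree can yield a parsimonious predictor. A production setting would add pruning and validation on held-out data, which this sketch omits.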

M. J. A. Berry and G. Linoff. (2000). “Mastering Data Mining: The Art and Science of Customer Relationship Management.” John Wiley and Sons.

Author: Padhraic Smyth
Title: Data Mining at the Interface of Computer Science and Statistics
Title URL: http://www.datalab.uci.edu/papers/dmchap.pdf
Year: 2001