2015 DataMiningTheTextbook
- (Aggarwal, 2015) ⇒ Charu Aggarwal. (2015). “Data Mining: The Textbook.” Springer. ISBN: 978-3-319-14142-8
Subject Headings: Data Mining Textbook; AMS Sketch; Active Learning; Co-Clustering; k-Means Clustering; Aggregate Change Points as Outliers; Algorithm Variations and Refinements; Similarity-based Application; Approximate Frequent Patterns; Approximation in Terms of Itemsets; Approximation in Terms of Transactions; Assessing Model Effectiveness; Association Pattern Mining; Association Rule Generation Framework; Associative Classifiers; Autoregressive Model; Autoregressive Moving Average Model; BIRCH; BOAT; Bag-of-Words Kernel; Bagging; Bayes Classifiers; Betweenness Centrality; Bibliographic Notes; Binary and Set Data; Binary and Set-Valued Data; Bioinformatics; Biological and Medical Applications; Bisecting k-Means; Bloom Filter; Boosting; Bootstrap; Bottom-up Agglomerative Method; CLARANS; CLIQUE; CURE; Categorical Data; Categorical and Mixed Attribute Data; Categorical to Numeric Data: Binarization; Categorization by Component Independence; Categorization by Constituent Components; Centroid-based Classification; Data-Driven Classification; Classification of Shapes; Classification with Complex Data Types; Classifier Evaluation; Closed Patterns; Closeness Centrality and Proximity Prestige; CluStream Algorithm; Cluster Ensembles; Cluster Validation; Clustering Categorical Data; Clustering Data Streams; Clustering Method; Clustering Shapes; Clustering and Distance-based Method; Clustering for Outlier Detection; Clustering with Complex Data Types; Co-Training; Co-clustering; Co-location Patterns; Collective Classification; Collective Strength; Combatting Spider Traps; Combination Outliers; Combining Different Ensemble Components; Common Mistakes; Community Detection; Comparing Various Linear Model; Comparison with Singular Value Decomposition; Comparison with other Linear Model; Computational Considerations; Computing Similarity between Trajectories; Connections with Random-Walk Method; Constrained Sequential Pattern Mining; Content-based Recommendations; Converting 
Trajectories to Multidimensional Data; Cosine Coefficient on Columns; Count-Min Sketch; Cross-Validation; Customer Recommendations; Customer Segmentation and Collaborative Filtering; DBSCAN; DENCLUE; Data Classification; Data Cleaning; Data Clustering; Data Partitioned Ensemble; Data Preprocessing; Data Reduction and Transformation; Data Summarization; Data Transformation and Reduction; Data Type Portability; Data-centered Ensemble; Decision Trees; Degree Centrality and Prestige; Demographic and Profile Analysis; Density-based Method; Dependency Oriented Data; Depth-based Method; Design Variations of Nearest Neighbor Classifiers; Dimensionality Reduction; Dimensionality Reduction with Axis Rotation; Dimensionality Reduction with Type Transformation; Discrete Fourier Transform; Discrete Sequence Similarity Measure; Discrete Sequence to Numeric Data; Discrete Sequences and Strings; Discrete Wavelet Transform; Discussion of Semisupervised Learning; Distance-based Method; Distance-based Model; Distance-based Motifs; Distance-based Outlier Detection; Distributed Privacy; Document Normalization and Similarity Computation; Document Preparation and Similarity Computation; Dynamic Time Warping Distance; Dynamics of Network Formation; Early Termination Trick with Nested Loops; Earth Science Applications; Edge-based Join Growth; Edit Distance; Efficiency Issues: Probabilistic Suffix Trees; Efficient Support Counting; Embedded Model; Ensemble Method; Entropy; Enumeration-Tree Algorithms; Enumeration-Tree-based Interpretation of Apriori; Equivalence of Trajectories and Multivariate Time Series; Evaluation: Computing the Fit Probability for Observed Sequence; Example Re-weighting; Exercises; Expected Error Reduction; Expected Model Change; Expected Variance Reduction; Explanation: Determining the Most Likely State Sequence for Observed Sequence; External Validation Criteria; Extreme Value Analysis; Feature Extraction; Feature Extraction and Portability; Feature Selection for 
Classification; Feature Selection for Clustering; Feature Subset Selection; Filter Model; Financial Fraud and Anomalous Events; Fisher Score; Fisher’s Linear Discriminant; Flajolet-Martin Algorithm for Distinct Element Counting; Formal Definition and Techniques for HMMs; Formal Statement of Bias-Variance Trade-off; Frequency-based Model; Frequent Itemset Mining Algorithm; Frequent Pattern Mining in Data Streams; Frequent Pattern Mining to Graph Pattern Mining; Frequent Patterns to Frequent Sequences; Frequent Substructure Mining in Graphs; Frequent Substructure-based Method; Frequent Substructure-based Transformation and Distance Computation; Frequent Trajectory Paths; From Decision Trees to Regression Trees; General Comments; Generalization to Other Data Types; Generalized Linear Model; Generic Meta-Algorithm; Generic Transformational Approach; Gini Index; Girvan-Newman Algorithm; Graph Classification; Graph Clustering; Graph Edit Distance; Graph Matching Methods for Distance Computation; Graph Regularization Approach; Graph Similarity Measure; Graph-based Algorithm; Graph-based Method; Graph-based Semisupervised Learning; Graphs to Numeric Data; Grid Search for Subspace Outliers; Grid-based Method; Grid-based Rare Subspace Exploration; Group-based Statistics; HITS; Haar Wavelet Transform; Handling Concept Drift; Handling Incorrect and Inconsistent Entries; Handling Missing Entries; Handling Missing Values; Heterogeneity-based Model; Hidden Markov Model; Hierarchical Clustering Algorithm; Hierarchical Method; High-Dimensional Clustering; High-Dimensional Outlier Detection; Histogram- and Grid-based Techniques; Holdout; Homophily; Hopkins Statistic; Human and Visually Supervised Clustering; Hypergraph Partitioning Algorithm; Impact of Behavioral Attribute Normalization; Impact of Complex Data Types on Problem Definitions; Impact of Data Distribution; Impact of Different Lp-norm; Impact of Domain-specific Relevance; Impact of High Dimensionality; Impact of Local 
Data Distribution; Impact of Locally Irrelevant Feature; Implementation with Arrays but no Pointers; Implementation with Pointers and FP-Tree; Implementation with Pointers but no FP-Tree; Important Observations and Intuitions; Incognito; Independent Cascade Model; Independent Ensemble; Individual Data Points as Outliers; Influence Function Evaluation; Information-Theoretic Model; Instance-based Classifiers; Instance-based Learning; Instance-specific Mahalanobis Distance; Interest Ratio; Internal Validation Criteria; Intrusion Detection Applications; Item-based Similarity with Ratings; Iterative Classification Algorithm; Iterative Label Propagation: The Spectral Interpretation; Jaccard Coefficient and the Min-hash Trick; Katz Measure; Kernel Density Estimation; Kernel Support Vector Machine; Kernel-based Transformations and Computation; Kernighan-Lin Algorithm; Label Propagation with Random Walks; Latent Factor Model; Latent Semantic Analysis; Learn-One-Rule; Leveraging Aggregate Distributions for Data Mining; Leveraging Data Structures for Querying; Leveraging Latent Semantic Analysis; Leveraging Synopsis Structure; Leveraging the Itemset Lattice; Limitations of PLSA; Linear Regression; Linear Threshold Model; Link Prediction; Link Prediction as a Classification Problem; Link Prediction as a Missing Value Estimation Problem; Local Distance Correction Method; Local Outlier Factor (LOF); Logistic Regression; Longest Common Subsequence; Lossy Counting Algorithm; Lp-norm; Market Basket Analysis; Markovian Similarity-based Algorithm: CLUSEQ; Massive-Domain Stream Clustering; Massive-Domain Streaming Classification; Match-based Similarity Computation; Matching and Distance Computation in Graphs; Matrix Factorization; Maximal Patterns; Maximum Common Subgraph Problem; Maximum Common Subgraph-based Distances; Measures of Centrality; Measures of Prestige; Medical Diagnosis; Meta-clustering Algorithm; Micro-clustering Algorithm; Micro-clustering Method; Mining with 
Contextual Spatial Attributes; Mixed Quantitative and Categorical Data; Mixture of Hidden Markov Model; Model-centered Ensemble; Modeling Abnormal Lower Dimensional Projections; Mondrian Multidimensional k-Anonymity; Multiclass Learning; Multidimensional Data; Multidimensional Scaling; Multilayer Neural Networks; Multilevel Graph Partitioning: METIS; Multimedia Applications; Multinomial Bayes Model; Multiple Threads; Multivariate Extreme Values; Multivariate Forecasting with Hidden Variables; Naive Bayes Classifier; Nearest Neighbor Classifier; Nearest Neighbors with Linear Discriminant Analysis; Neighborhood-based Measure; Neighborhood-based Methods for Collaborative Filtering; Network and Graph Data; Neural Networks; Node-based Join Growth; Noise Removal; Non-dependency Oriented Data; Nonlinear Distributions: ISOMAP; Nonlinear Support Vector Machine; Nonlinear and Polynomial Regression; Nonnegative Matrix Factorization; Normalization; Normalization and Combination; Novelty and First-Story Detection; Numeric to Categorical Data: Discretization; ORCLUS; Online Clustering of Co-evolving Series; Other Applications for Complex Data Types; Other Applications of Kernel Method; Outlier Analysis; Outlier Detection; Outlier Detection in Sequences; Outlier Detection with Categorical Data; Outlier Detection with Complex Data Types; Outlier Ensembles; Outlier Validity; Output Privacy; Output as Class Labels; Output as Numerical Score; PROCLUS; PageRank; Pairwise Supervision; Parameter Tuning with Internal Measure; Pattern Mining with Complex Data Types; Pattern Querying; Pattern Summarization; Periodic Patterns; Point Outliers; Pointwise Supervision; Position Outliers; Power-Law Degree Distributions; Practical Issues; Predictive Attribute Dependence; Preferential Crawlers; Preprocess-once Query-many Paradigm; Principal Component Analysis; Principal Component Regression; Privacy during Data Collection; Privacy-Preserving Data Publishing; Probabilistic Classifiers; 
Probabilistic Clustering; Hidden Markov Model; Probabilistic Model-based Algorithm; Probabilistic Model; Pruning Method; Pushing Constraints into Pattern Mining; Putting Associations to Work: Applications; Putting Clustering to Work: Applications; Putting Outliers to Work: Applications; Pyramidal Time Frame; Quality Control and Fault Detection; Quantification Issues; Quantitative Data; Quantitative Multidimensional Data; Query-by-Committee; ROCK; RainForest; Random Forests; Random Subspace Sampling; Random Walk-based Measure; Random Walk-based Similarity; Random-Walk Kernels; Rank Centrality and Prestige; Ranking Algorithm; Rare Class Learning; Receiver Operating Characteristic; Recommendations and Collaborative Filtering; Recommender System; Reconstructing Aggregate Distributions; Recursive Suffix-based Pattern Growth Method; Regression Modeling with Numeric Classes; Representative-based Algorithm; Representativeness-based Model; Reservoir Sampling; Reservoir Sampling for Data Streams; Rocchio Classification; Rule Generation from Decision Trees; Rule Pruning; Rule-based Classifiers; Rule-based Method; STREAM Algorithm; SVM Classifiers for High-dimensional and Sparse Data; Samarati’s Algorithm; Sampling; Sampling Method; Sampling for Static Data; Scalability Issues and the Streaming Scenario; Scalable Classification; Scalable Data Clustering; Scalable Decision Trees; Scalable Support Vector Machine; Scaling and Normalization; Scatter/Gather Approach; Search Engine Indexing and Query Processing; Selecting Different Ensemble Components; Self-Training; Semisupervised Bayes Classification with EM; Semisupervised Clustering; Semisupervised Learning; Sequence Classification; Sequence Clustering; Sequence-based Method; Sequential Covering Algorithm; Sequential Ensembles; Sequential Pattern Mining; Shape Outliers; Shape to Time-Series Transformation; Shape-based Clustering; Shingling for Near Duplicate Detection; Shortest-Path Kernels; SimRank; Similarity Search and 
Indexing; Similarity between Two Graphs; Similarity between Two Nodes in a Single Graph; Similarity-based Clustering Method; Simultaneous Document and Word Cluster Discovery; Single-Layer Neural Network: The Perceptron; Singular Value Decomposition; Sketches; Social Influence Analysis; Social Network Analysis; Social Networks: Preliminaries and Properties; Solving the Lagrangian Dual; Spatial Co-location Patterns; Spatial Data; Spatial to Multidimensional Transformation with Wavelets; Spatial to Numeric Data; Specialized Classification Methods for Text; Specialized Clustering Methods for Text; Specialized Preprocessing for Web Documents; Spectral Clustering; Spectral Transformation and Embedding of Graphs; Spectrum Kernel; Speeding up Kernighan-Lin; Split Criteria; Stacking; Statistical Coefficient of Correlation; Stopping Criterion and Pruning; Store Product Placement; Streaming Classification; Streaming Outlier Detection; Structural Distance-based Measure; Subsequence-based Clustering; Supervised Event Detection; Supervised Feature Generation with Spectral Embedding; Supervised Micro-cluster Approach; Supervised Similarity Functions; Supervised Spectral Method; Support Vector Machine; Support Vector Machines for Linearly Separable Data; Support Vector Machines with Soft Margin for Nonseparable Data; Symbolic Aggregate Approximation (SAX); Symmetric Confidence Measure; Synopsis Data Structures for Streams; Synopsis Structures for the Massive-Domain Scenario; Synthetic Data Generation: Condensation-based Approach; Synthetic Over-sampling: SMOTE; Temporal Similarity Measure; Term Strength; Text Applications; Text Data; Text Similarity Measure; ℓ-diversity Model; The Analytical Phase; The Apriori Algorithm; The Basic Data Types; The Curse of Dimensionality; The Data Mining Process; The Data Preprocessing Phase; The Frequent Pattern Mining Model; The Kernel Trick; The Kernel k-Means Algorithm; The Ranking Model for Classification; The k-Means Algorithm; The k-Medians Algorithm; The k-Medoids Algorithm; 
k-anonymity Model; t-closeness Model; Time Series to Discrete Sequence Data; Time Series to Numeric Data; Time-Series Classification; Time-Series Clustering; Time-Series Data; Time-Series Forecasting; Time-Series Motifs; Time-Series Outlier Detection; Time-Series Preparation and Similarity; Time-Series Similarity Measure; Top-down Divisive Method; Topic Modeling; Topic-Sensitive PageRank; Topological Descriptors; Trade-offs with Different Data Structure; Training a Logistic Regression Classifier; Training: Baum-Welch Algorithm; Trajectory Classification; Trajectory Clustering; Trajectory Clustering as a Sequence Clustering Problem; Trajectory Mining; Trajectory Outlier Detection; Trajectory Pattern Mining; Transductive Support Vector Machine; Transformation to Sequential Pattern Mining; Transformation-based Distance Computation; Tree Projection and DepthProject; Triadic Closure and Clustering Coefficient; Ullman’s Algorithm for Subgraph Isomorphism; Uncertainty Sampling; Univariate Extreme Value Analysis; Unsupervised Mahalanobis Metric; User-based Similarity with Ratings; VFDT Family; Vertical Counting Method; Visual Clustering; Wavelet-based Rules; Web Crawling and Resource Discovery; Web Log Analytics; Web Log Anomalies; Web Usage Mining; Weighted Degree Kernel; Whole-Series Classification; Why does Ensemble Analysis Work?; Window-based Method; Wrapper Model; XProj: Direct Clustering with Frequent Subgraph Discovery; XRules; k-Means; k-Medoids; k-Medoids Clustering; k-Modes Clustering;
Notes
Cited By
Quotes
Abstract
This textbook explores the different aspects of data mining from the fundamentals to the complex data types and their applications, capturing the wide diversity of problem domains for data mining issues. It goes beyond the traditional focus on data mining problems to introduce advanced data types such as text, time series, discrete sequences, spatial data, graph data, and social networks. Until now, no single book has addressed all these topics in a comprehensive and integrated way. The chapters of this book fall into one of three categories: Fundamental chapters: Data mining has four main problems, which correspond to clustering, classification, association pattern mining, and outlier analysis. These chapters comprehensively discuss a wide variety of methods for these problems. Domain chapters: These chapters discuss the specific methods used for different domains of data such as text data, time-series data, sequence data, graph data, and spatial data. Application chapters: These chapters study important applications such as stream mining, Web mining, ranking, recommendations, social networks, and privacy preservation. The domain chapters also have an applied flavor. Appropriate for both introductory and advanced data mining courses, Data Mining: The Textbook balances mathematical details and intuition. It contains the necessary mathematical details for professors and researchers, but it is presented in a simple and intuitive style to improve accessibility for students and industrial practitioners (including those with a limited mathematical background). Numerous illustrations, examples, and exercises are included, with an emphasis on semantically interpretable examples.
- Table of Contents
1 An Introduction to Data Mining
1.1 Introduction p.1 1.2 The Data Mining Process p.3 1.2.1 The Data Preprocessing Phase p.5 1.2.2 The Analytical Phase p.6 1.3 The Basic Data Types p.6 1.3.1 Non-dependency Oriented Data p.6 1.3.1.1 Quantitative Multidimensional Data p.7 1.3.1.2 Categorical and Mixed Attribute Data p.7 1.3.1.3 Binary and Set Data p.8 1.3.1.4 Text Data p.8 1.3.2 Dependency Oriented Data p.9 1.3.2.1 Time-Series Data p.9 1.3.2.2 Discrete Sequences and Strings p.10 1.3.2.3 Spatial Data p.11 1.3.2.4 Network and Graph Data p.12 1.4 The Major Building Blocks: A Bird’s Eye View p.13 1.4.1 Association Pattern Mining p.14 1.4.2 Data Clustering p.15 1.4.3 Outlier Detection p.16 1.4.4 Data Classification p.17 1.4.5 Impact of Complex Data Types on Problem Definitions p.18 1.4.5.1 Pattern Mining with Complex Data Types p.18 1.4.5.2 Clustering with Complex Data Types p.19 1.4.5.3 Outlier Detection with Complex Data Types p.19 1.4.5.4 Classification with Complex Data Types p.20 1.5 Scalability Issues and the Streaming Scenario p.20 1.6 A Stroll through some Application Scenarios p.21 1.6.1 Store Product Placement p.21 1.6.2 Customer Recommendations p.21 1.6.3 Medical Diagnosis p.22 1.6.4 Web Log Anomalies p.22 1.7 Summary p.23 1.8 Bibliographic Notes p.23 1.9 Exercises p.24
2 Data Preparation
2.1 Introduction p.25 2.2 Feature Extraction and Portability p.26 2.2.1 Feature Extraction p.26 2.2.2 Data Type Portability p.27 2.2.2.1 Numeric to Categorical Data: Discretization p.28 2.2.2.2 Categorical to Numeric Data: Binarization p.29 2.2.2.3 Text to Numeric Data p.29 2.2.2.4 Time Series to Discrete Sequence Data p.30 2.2.2.5 Time Series to Numeric Data p.30 2.2.2.6 Discrete Sequence to Numeric Data p.30 2.2.2.7 Spatial to Numeric Data p.31 2.2.2.8 Graphs to Numeric Data p.31 2.2.2.9 Any Type to Graphs for Similarity-based Applications p.31 2.3 Data Cleaning p.32 2.3.1 Handling Missing Entries p.33 2.3.2 Handling Incorrect and Inconsistent Entries p.33 2.3.3 Scaling and Normalization p.34 2.4 Data Reduction and Transformation p.35 2.4.1 Sampling p.35 2.4.1.1 Sampling for Static Data p.36 2.4.1.2 Reservoir Sampling for Data Streams p.36 2.4.2 Feature Subset Selection p.38 2.4.3 Dimensionality Reduction with Axis Rotation p.38 2.4.3.1 Principal Component Analysis p.39 2.4.3.2 Singular Value Decomposition p.41 2.4.3.3 Latent Semantic Analysis p.45 2.4.3.4 Applications of PCA and SVD p.45 2.4.4 Dimensionality Reduction with Type Transformation p.46 2.4.4.1 Haar Wavelet Transform p.47 2.4.4.2 Multidimensional Scaling p.52 2.4.4.3 Spectral Transformation and Embedding of Graphs p.54 2.5 Summary p.56 2.6 Bibliographic Notes p.57 2.7 Exercises p.57
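Section 2.4.1.2 covers reservoir sampling for data streams, which maintains a uniform random sample of fixed size from a stream of unknown length. A minimal Python sketch of the classic algorithm (illustrative only; the function name and parameters are not from the book):

```python
import random

def reservoir_sample(stream, k, seed=0):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)      # fill the reservoir with the first k items
        else:
            j = rng.randint(0, i)       # item i replaces a slot with probability k/(i+1)
            if j < k:
                reservoir[j] = item
    return reservoir
```

Each item ends up in the reservoir with probability k/n, without knowing n in advance.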
3 Similarity and Distances
3.1 Introduction p.59 3.2 Multidimensional Data p.60 3.2.1 Quantitative Data p.60 3.2.1.1 Impact of Domain-specific Relevance p.61 3.2.1.2 Impact of High Dimensionality p.61 3.2.1.3 Impact of Locally Irrelevant Features p.62 3.2.1.4 Impact of Different Lp-norms p.63 3.2.1.5 Match-based Similarity Computation p.64 3.2.1.6 Impact of Data Distribution p.65 3.2.1.7 Nonlinear Distributions: ISOMAP p.66 3.2.1.8 Impact of Local Data Distribution p.67 3.2.1.9 Computational Considerations p.69 3.2.2 Categorical Data p.69 3.2.3 Mixed Quantitative and Categorical Data p.70 3.3 Text Similarity Measures p.71 3.3.1 Binary and Set Data p.72 3.4 Temporal Similarity Measures p.72 3.4.1 Time-Series Similarity Measures p.73 3.4.1.1 Impact of Behavioral Attribute Normalization p.74 3.4.1.2 Lp-norm p.74 3.4.1.3 Dynamic Time Warping Distance p.74 3.4.1.4 Window-based Methods p.77 3.4.2 Discrete Sequence Similarity Measures p.77 3.4.2.1 Edit Distance p.77 3.4.2.2 Longest Common Subsequence p.79 3.5 Graph Similarity Measures p.80 3.5.1 Similarity between Two Nodes in a Single Graph p.80 3.5.1.1 Structural Distance-based Measure p.80 3.5.1.2 Random Walk-based Similarity p.81 3.5.2 Similarity between Two Graphs p.81 3.6 Supervised Similarity Functions p.82 3.7 Summary p.83 3.8 Bibliographic Notes p.84 3.9 Exercises p.85
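Section 3.4.1.3 covers the dynamic time warping (DTW) distance, which allows one series to stretch or compress relative to the other before costs are compared. A minimal dynamic-programming sketch (illustrative, not taken from the book):

```python
def dtw_distance(x, y):
    """Dynamic time warping distance between two numeric series, O(len(x)*len(y))."""
    n, m = len(x), len(y)
    INF = float("inf")
    # D[i][j] = cost of best warping path aligning x[:i] with y[:j]
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            # extend by matching, or by repeating an element of x or of y
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]
```

Note that `dtw_distance([1, 2, 2, 3], [1, 2, 3])` is 0: the repeated element is absorbed by the warping, which the Lp-norm of Section 3.4.1.2 cannot do.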
4 Association Pattern Mining
4.1 Introduction p.87 4.2 The Frequent Pattern Mining Model p.88 4.3 Association Rule Generation Framework p.91 4.4 Frequent Itemset Mining Algorithms p.92 4.4.1 Brute Force Algorithms p.93 4.4.2 The Apriori Algorithm p.94 4.4.2.1 Efficient Support Counting p.95 4.4.3 Enumeration-Tree Algorithms p.96 4.4.3.1 Enumeration-Tree-based Interpretation of Apriori p.99 4.4.3.2 Tree Projection and DepthProject p.99 4.4.3.3 Vertical Counting Methods p.104 4.4.4 Recursive Suffix-based Pattern Growth Methods p.106 4.4.4.1 Implementation with Arrays but no Pointers p.107 4.4.4.2 Implementation with Pointers but no FP-Tree p.108 4.4.4.3 Implementation with Pointers and FP-Tree p.109 4.4.4.4 Trade-offs with Different Data Structures p.112 4.4.4.5 Relationship between FP-growth and Enumeration-Tree Methods p.113 4.5 Alternative Models: Interesting Patterns p.115 4.5.1 Statistical Coefficient of Correlation p.116 4.5.2 χ2 Measure p.116 4.5.3 Interest Ratio p.117 4.5.4 Symmetric Confidence Measures p.117 4.5.5 Cosine Coefficient on Columns p.118 4.5.6 Jaccard Coefficient and the Min-hash Trick p.118 4.5.7 Collective Strength p.119 4.5.8 Relationship to Negative Pattern Mining p.120 4.6 Useful Meta-Algorithms p.120 4.6.1 Sampling Methods p.120 4.6.2 Data Partitioned Ensembles p.121 4.6.3 Generalization to Other Data Types p.121 4.6.3.1 Quantitative Data p.122 4.6.3.2 Categorical Data p.122 4.7 Summary p.122 4.8 Bibliographic Notes p.123 4.9 Exercises p.124
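Section 4.4.2 covers the Apriori algorithm, whose key idea is downward closure: every subset of a frequent itemset is itself frequent, so level k+1 candidates can be generated and pruned from the frequent itemsets of level k. A compact level-wise sketch (illustrative only; the book discusses much more efficient counting and data structures):

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Level-wise frequent itemset mining using the downward-closure property."""
    transactions = [frozenset(t) for t in transactions]
    items = {i for t in transactions for i in t}
    freq = {}                                    # itemset -> support count
    candidates = [frozenset([i]) for i in sorted(items)]
    k = 1
    while candidates:
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= minsup}
        freq.update(level)
        prev = list(level)
        k += 1
        # join frequent (k-1)-itemsets, prune candidates with an infrequent subset
        candidates, seen = [], set()
        for a, b in combinations(prev, 2):
            u = a | b
            if len(u) == k and u not in seen and \
               all(frozenset(s) in level for s in combinations(u, k - 1)):
                seen.add(u)
                candidates.append(u)
    return freq
```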
5 Association Pattern Mining: Advanced Concepts
5.1 Introduction p.127 5.2 Pattern Summarization p.128 5.2.1 Maximal Patterns p.128 5.2.2 Closed Patterns p.129 5.2.3 Approximate Frequent Patterns p.131 5.2.3.1 Approximation in Terms of Transactions p.131 5.2.3.2 Approximation in Terms of Itemsets p.132 5.3 Pattern Querying p.133 5.3.1 Preprocess-once Query-many Paradigm p.133 5.3.1.1 Leveraging the Itemset Lattice p.133 5.3.1.2 Leveraging Data Structures for Querying p.134 5.3.2 Pushing Constraints into Pattern Mining p.138 5.4 Putting Associations to Work: Applications p.139 5.4.1 Relationship to Other Data Mining Problems p.139 5.4.1.1 Application to Classification p.139 5.4.1.2 Application to Clustering p.139 5.4.1.3 Applications to Outlier Detection p.139 5.4.2 Market Basket Analysis p.140 5.4.3 Demographic and Profile Analysis p.140 5.4.4 Recommendations and Collaborative Filtering p.140 5.4.5 Web Log Analysis p.140 5.4.6 Bioinformatics p.141 5.4.7 Other Applications for Complex Data Types p.141 5.5 Summary p.141 5.6 Bibliographic Notes p.142 5.7 Exercises p.143
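Section 5.2.1 defines maximal patterns: frequent itemsets that have no frequent superset, a compact summary of the full frequent-pattern collection. Given the output of any frequent itemset miner, the filter is short (a hypothetical helper, not from the book):

```python
def maximal_patterns(frequent):
    """Return the frequent itemsets that have no frequent proper superset."""
    fs = set(frequent)
    # p < q is the proper-subset test on frozensets
    return [p for p in fs if not any(p < q for q in fs)]
```

Closed patterns (Sect. 5.2.2) are the analogous filter on support: keep an itemset only if no superset has the same support.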
6 Cluster Analysis
6.1 Introduction p.145 6.2 Feature Selection for Clustering p.146 6.2.1 Filter Models p.147 6.2.1.1 Term Strength p.147 6.2.1.2 Predictive Attribute Dependence p.147 6.2.1.3 Entropy p.148 6.2.1.4 Hopkins Statistic p.149 6.2.2 Wrapper Models p.150 6.3 Representative-based Algorithms p.151 6.3.1 The k-Means Algorithm p.154 6.3.2 The Kernel k-Means Algorithm p.155 6.3.3 The k-Medians Algorithm p.155 6.3.4 The k-Medoids Algorithm p.155 6.4 Hierarchical Clustering Algorithms p.158 6.4.1 Bottom-up Agglomerative Methods p.159 6.4.1.1 Group-based Statistics p.160 6.4.2 Top-down Divisive Methods p.163 6.4.2.1 Bisecting k-Means p.164 6.5 Probabilistic Model-based Algorithms p.164 6.5.1 Relationship of EM to k-means and other Representative Methods p.167 6.6 Grid-based and Density-based Algorithms p.169 6.6.1 Grid-based Methods p.170 6.6.2 DBSCAN p.172 6.6.3 DENCLUE p.174 6.7 Graph-based Algorithms p.177 6.7.1 Properties of Graph-based Algorithms p.180 6.8 Nonnegative Matrix Factorization p.181 6.8.1 Comparison with Singular Value Decomposition p.185 6.9 Cluster Validation p.186 6.9.1 Internal Validation Criteria p.186 6.9.1.1 Parameter Tuning with Internal Measures p.188 6.9.2 External Validation Criteria p.189 6.9.3 General Comments p.191 6.10 Summary p.191 6.11 Bibliographic Notes p.192 6.12 Exercises p.192
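Section 6.3.1 covers the k-Means algorithm, which alternates between assigning each point to its nearest centroid and recomputing each centroid as its cluster mean. A minimal sketch of this iteration (illustrative only; initialization and convergence handling are simplified relative to the book's discussion):

```python
import random

def k_means(points, k, iters=50, seed=0):
    """Lloyd's iteration: assign points to nearest centroid, then recompute means."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)            # naive initialization from the data
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            clusters[j].append(p)
        # recompute each centroid as the mean of its cluster (keep old if empty)
        centroids = [tuple(sum(d) / len(c) for d in zip(*c)) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters
```

Replacing the mean with the coordinate-wise median gives k-Medians (Sect. 6.3.3); restricting representatives to data points gives k-Medoids (Sect. 6.3.4).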
7 Cluster Analysis: Advanced Concepts
7.1 Introduction p.195 7.2 Clustering Categorical Data p.196 7.2.1 Representative-based Algorithms p.197 7.2.1.1 k-Modes Clustering p.198 7.2.1.2 k-Medoids Clustering p.198 7.2.2 Hierarchical Algorithms p.199 7.2.2.1 ROCK p.199 7.2.3 Probabilistic Algorithms p.200 7.2.4 Graph-based Algorithms p.202 7.3 Scalable Data Clustering p.202 7.3.1 CLARANS p.202 7.3.2 BIRCH p.203 7.3.3 CURE p.205 7.4 High-Dimensional Clustering p.207 7.4.1 CLIQUE p.208 7.4.2 PROCLUS p.209 7.4.3 ORCLUS p.212 7.5 Semisupervised Clustering p.214 7.5.1 Pointwise Supervision p.214 7.5.2 Pairwise Supervision p.215 7.6 Human and Visually Supervised Clustering p.216 7.6.1 Modifications of Existing Clustering Algorithms p.217 7.6.2 Visual Clustering p.217 7.7 Cluster Ensembles p.220 7.7.1 Selecting Different Ensemble Components p.221 7.7.2 Combining Different Ensemble Components p.221 7.7.2.1 Hypergraph Partitioning Algorithm p.221 7.7.2.2 Meta-clustering Algorithm p.222 7.8 Putting Clustering to Work: Applications p.222 7.8.1 Applications to Other Data Mining Problems p.222 7.8.1.1 Data Summarization p.222 7.8.1.2 Outlier Analysis p.222 7.8.1.3 Classification p.223 7.8.1.4 Dimensionality Reduction p.223 7.8.1.5 Similarity Search and Indexing p.223 7.8.2 Customer Segmentation and Collaborative Filtering p.223 7.8.3 Text Applications p.223 7.8.4 Multimedia Applications p.224 7.8.5 Temporal and Sequence Applications p.224 7.8.6 Social Network Analysis p.224 7.9 Summary p.224 7.10 Bibliographic Notes p.224 7.11 Exercises p.225
8 Outlier Analysis
8.1 Introduction p.227 8.2 Extreme Value Analysis p.229 8.2.1 Univariate Extreme Value Analysis p.230 8.2.2 Multivariate Extreme Values p.231 8.2.3 Depth-based Methods p.233 8.3 Probabilistic Models p.234 8.4 Clustering for Outlier Detection p.236 8.5 Distance-based Outlier Detection p.238 8.5.1 Pruning Methods p.239 8.5.1.1 Sampling Methods p.239 8.5.1.2 Early Termination Trick with Nested Loops p.239 8.5.2 Local Distance Correction Methods p.240 8.5.2.1 Local Outlier Factor (LOF) p.242 8.5.2.2 Instance-specific Mahalanobis Distance p.243 8.6 Density-based Methods p.244 8.6.1 Histogram- and Grid-based Techniques p.244 8.6.2 Kernel Density Estimation p.245 8.7 Information-Theoretic Models p.246 8.8 Outlier Validity p.247 8.8.1 Methodological Challenges p.248 8.8.2 Receiver Operating Characteristic p.249 8.8.3 Common Mistakes p.250 8.9 Summary p.250 8.10 Bibliographic Notes p.251 8.11 Exercises p.251
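Section 8.5 covers distance-based outlier detection, where a common score for a point is its distance to its k-th nearest neighbor: points far from all neighbors score high. A brute-force sketch (illustrative; the chapter's pruning methods avoid this quadratic cost):

```python
def knn_outlier_scores(points, k):
    """Score each point by its distance to its k-th nearest neighbor."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(
            sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
            for j, q in enumerate(points) if j != i
        )
        scores.append(dists[k - 1])   # larger score = more outlying
    return scores
```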
9 Outlier Analysis: Advanced Concepts
9.1 Introduction p.253 9.2 Outlier Detection with Categorical Data p.254 9.2.1 Probabilistic Models p.254 9.2.2 Clustering and Distance-based Methods p.255 9.2.3 Binary and Set-Valued Data p.256 9.3 High-Dimensional Outlier Detection p.256 9.3.1 Grid-based Rare Subspace Exploration p.258 9.3.1.1 Modeling Abnormal Lower Dimensional Projections p.258 9.3.1.2 Grid Search for Subspace Outliers p.259 9.3.2 Random Subspace Sampling p.261 9.4 Outlier Ensembles p.262 9.4.1 Categorization by Component Independence p.263 9.4.1.1 Sequential Ensembles p.263 9.4.1.2 Independent Ensembles p.264 9.4.2 Categorization by Constituent Components p.264 9.4.2.1 Model-centered Ensembles p.265 9.4.2.2 Data-centered Ensembles p.265 9.4.3 Normalization and Combination p.265 9.5 Putting Outliers to Work: Applications p.267 9.5.1 Quality Control and Fault Detection p.267 9.5.2 Financial Fraud and Anomalous Events p.267 9.5.3 Web Log Analytics p.268 9.5.4 Intrusion Detection Applications p.268 9.5.5 Biological and Medical Applications p.268 9.5.6 Earth Science Applications p.268 9.6 Summary p.269 9.7 Bibliographic Notes p.269 9.8 Exercises p.270
10 Data Classification
10.1 Introduction p.271 10.2 Feature Selection for Classification p.273 10.2.1 Filter Models p.274 10.2.1.1 Gini Index p.274 10.2.1.2 Entropy p.275 10.2.1.3 Fisher Score p.275 10.2.1.4 Fisher’s Linear Discriminant p.276 10.2.2 Wrapper Models p.277 10.2.3 Embedded Models p.278 10.3 Decision Trees p.278 10.3.1 Split Criteria p.281 10.3.2 Stopping Criterion and Pruning p.283 10.3.3 Practical Issues p.284 10.4 Rule-based Classifiers p.284 10.4.1 Rule Generation from Decision Trees p.286 10.4.2 Sequential Covering Algorithms p.287 10.4.2.1 Learn-One-Rule p.287 10.4.3 Rule Pruning p.290 10.4.4 Associative Classifiers p.290 10.5 Probabilistic Classifiers p.291 10.5.1 Naive Bayes Classifier p.291 10.5.1.1 The Ranking Model for Classification p.294 10.5.1.2 Discussion of the Naive Assumption p.295 10.5.2 Logistic Regression p.295 10.5.2.1 Training a Logistic Regression Classifier p.297 10.5.2.2 Relationship with Other Linear Models p.298 10.6 Support Vector Machines p.298 10.6.1 Support Vector Machines for Linearly Separable Data p.298 10.6.1.1 Solving the Lagrangian Dual p.303 10.6.2 Support Vector Machines with Soft Margin for Nonseparable Data p.304 10.6.2.1 Comparison with other Linear Models p.306 10.6.3 Nonlinear Support Vector Machines p.306 10.6.4 The Kernel Trick p.307 10.6.4.1 Other Applications of Kernel Methods p.309 10.7 Neural Networks p.311 10.7.1 Single-Layer Neural Network: The Perceptron p.311 10.7.2 Multilayer Neural Networks p.313 10.7.3 Comparing Various Linear Models p.315 10.8 Instance-based Learning p.316 10.8.1 Design Variations of Nearest Neighbor Classifiers p.316 10.8.1.1 Unsupervised Mahalanobis Metric p.317 10.8.1.2 Nearest Neighbors with Linear Discriminant Analysis p.317 10.9 Classifier Evaluation p.319 10.9.1 Methodological Issues p.319 10.9.1.1 Holdout p.320 10.9.1.2 Cross-Validation p.320 10.9.1.3 Bootstrap p.321 10.9.2 Quantification Issues p.322 10.9.2.1 Output as Class Labels p.322 10.9.2.2 Output as Numerical Score p.323 10.10 Summary 
p.326 10.11 Bibliographic Notes p.326 10.12 Exercises p.327
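Section 10.3.1 discusses split criteria for decision trees; the Gini index of Section 10.2.1.1 measures the impurity of a set of class labels, and a split is evaluated by the weighted impurity of its children. A small sketch of both quantities (illustrative, not the book's code):

```python
def gini(labels):
    """Gini index: 1 minus the sum of squared class fractions (0 = pure)."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def gini_split(left, right):
    """Weighted Gini of a binary split; a tree picks the split minimizing this."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)
```

A pure node has Gini index 0; a 50/50 binary node has 0.5, the worst case for two classes.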
11 Data Classification: Advanced Concepts
11.1 Introduction p.331 11.2 Multiclass Learning p.332 11.3 Rare Class Learning p.333 11.3.1 Example Re-weighting p.334 11.3.2 Sampling Methods p.335 11.3.2.1 Relationship between Weighting and Sampling p.336 11.3.2.2 Synthetic Over-sampling: SMOTE p.336 11.4 Scalable Classification p.336 11.4.1 Scalable Decision Trees p.337 11.4.1.1 RainForest p.337 11.4.1.2 BOAT p.337 11.4.2 Scalable Support Vector Machines p.337 11.5 Regression Modeling with Numeric Classes p.339 11.5.1 Linear Regression p.339 11.5.1.1 Relationship with Fisher’s Linear Discriminant p.341 11.5.2 Principal Component Regression p.342 11.5.3 Generalized Linear Models p.343 11.5.4 Nonlinear and Polynomial Regression p.344 11.5.5 From Decision Trees to Regression Trees p.345 11.5.6 Assessing Model Effectiveness p.346 11.6 Semisupervised Learning p.346 11.6.1 Generic Meta-Algorithms p.348 11.6.1.1 Self-Training p.348 11.6.1.2 Co-Training p.348 11.6.2 Specific Variations of Classification Algorithms p.349 11.6.2.1 Semisupervised Bayes Classification with EM p.349 11.6.2.2 Transductive Support Vector Machines p.351 11.6.3 Graph-based Semisupervised Learning p.352 11.6.4 Discussion of Semisupervised Learning p.353 11.7 Active Learning p.353 11.7.1 Heterogeneity-based Models p.355 11.7.1.1 Uncertainty Sampling p.355 11.7.1.2 Query-by-Committee p.356 11.7.1.3 Expected Model Change p.356 11.7.2 Performance-based Models p.357 11.7.2.1 Expected Error Reduction p.357 11.7.2.2 Expected Variance Reduction p.358 11.7.3 Representativeness-based Models p.358 11.8 Ensemble Methods p.358 11.8.1 Why does Ensemble Analysis Work? p.360 11.8.2 Formal Statement of Bias-Variance Trade-off p.362 11.8.3 Specific Instantiations of Ensemble Learning p.364 11.8.3.1 Bagging p.364 11.8.3.2 Random Forests p.365 11.8.3.3 Boosting p.366 11.8.3.4 Bucket of Models p.368 11.8.3.5 Stacking p.368 11.9 Summary p.369 11.10 Bibliographic Notes p.370 11.11 Exercises p.370
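Section 11.8.3.1 covers bagging: train each base model on a bootstrap sample (drawn with replacement) and combine predictions by majority vote, which reduces variance. A toy sketch using a 1-nearest-neighbor base learner (the base learner and all names here are illustrative choices, not from the book):

```python
import random
from collections import Counter

def nn_label(sample, point):
    """1-nearest-neighbor base learner over (coords, label) pairs."""
    return min(sample,
               key=lambda xy: sum((a - b) ** 2 for a, b in zip(xy[0], point)))[1]

def bagging_predict(train, point, n_models=25, seed=0):
    """Train each base model on a bootstrap sample; predict by majority vote."""
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(n_models):
        sample = [rng.choice(train) for _ in train]   # bootstrap: with replacement
        votes[nn_label(sample, point)] += 1
    return votes.most_common(1)[0][0]
```

Random forests (Sect. 11.8.3.2) follow the same recipe with decision trees as base learners plus random feature selection at each split.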
12 Mining Data Streams
12.1 Introduction p.373 12.2 Synopsis Data Structures for Streams p.375 12.2.1 Reservoir Sampling p.375 12.2.1.1 Handling Concept Drift p.376 12.2.1.2 Useful Theoretical Bounds for Sampling p.377 12.2.2 Synopsis Structures for the Massive-Domain Scenario p.381 12.2.2.1 Bloom Filter p.382 12.2.2.2 Count-Min Sketch p.386 12.2.2.3 AMS Sketch p.389 12.2.2.4 Flajolet-Martin Algorithm for Distinct Element Counting p.391 12.3 Frequent Pattern Mining in Data Streams p.392 12.3.1 Leveraging Synopsis Structures p.392 12.3.1.1 Reservoir Sampling p.393 12.3.1.2 Sketches p.393 12.3.2 Lossy Counting Algorithm p.393 12.4 Clustering Data Streams p.394 12.4.1 STREAM Algorithm p.394 12.4.2 CluStream Algorithm p.396 12.4.2.1 Micro-cluster Definition p.396 12.4.2.2 Micro-clustering Algorithm p.397 12.4.2.3 Pyramidal Time Frame p.398 12.4.3 Massive-Domain Stream Clustering p.400 12.5 Streaming Outlier Detection p.400 12.5.1 Individual Data Points as Outliers p.401 12.5.2 Aggregate Change Points as Outliers p.402 12.6 Streaming Classification p.404 12.6.1 VFDT Family p.404 12.6.2 Supervised Micro-cluster Approach p.406 12.6.3 Ensemble Method p.407 12.6.4 Massive-Domain Streaming Classification p.407 12.7 Summary p.408 12.8 Bibliographic Notes p.408 12.9 Exercises p.408
13 Mining Text Data
13.1 Introduction p.411 13.2 Document Preparation and Similarity Computation p.413 13.2.1 Document Normalization and Similarity Computation p.413 13.2.2 Specialized Preprocessing for Web Documents p.415 13.3 Specialized Clustering Methods for Text p.416 13.3.1 Representative-based Algorithms p.416 13.3.1.1 Scatter/Gather Approach p.416 13.3.2 Probabilistic Algorithms p.418 13.3.3 Simultaneous Document and Word Cluster Discovery p.419 13.3.3.1 Co-clustering p.420 13.4 Topic Modeling p.422 13.4.1 Use in Dimensionality Reduction and Comparison with Latent Semantic Analysis p.425 13.4.2 Use in Clustering and Comparison with Probabilistic Clustering p.427 13.4.3 Limitations of PLSA p.427 13.5 Specialized Classification Methods for Text p.428 13.5.1 Instance-based Classifiers p.428 13.5.1.1 Leveraging Latent Semantic Analysis p.428 13.5.1.2 Centroid-based Classification p.429 13.5.1.3 Rocchio Classification p.429 13.5.2 Bayes Classifiers p.430 13.5.2.1 Multinomial Bayes Model p.430 13.5.3 SVM Classifiers for High-dimensional and Sparse Data p.432 13.6 Novelty and First-Story Detection p.434 13.6.1 Micro-clustering Method p.435 13.7 Summary p.435 13.8 Bibliographic Notes p.436 13.9 Exercises p.436
14 Mining Time-Series Data
14.1 Introduction p.439 14.2 Time-Series Preparation and Similarity p.441 14.2.1 Handling Missing Values p.441 14.2.2 Noise Removal p.441 14.2.3 Normalization p.443 14.2.4 Data Transformation and Reduction p.444 14.2.4.1 Discrete Wavelet Transform p.444 14.2.4.2 Discrete Fourier Transform p.444 14.2.4.3 Symbolic Aggregate Approximation (SAX) p.445 14.2.5 Time-Series Similarity Measures p.446 14.3 Time-Series Forecasting p.446 14.3.1 Autoregressive Models p.448 14.3.2 Autoregressive Moving Average Models p.450 14.3.3 Multivariate Forecasting with Hidden Variables p.451 14.4 Time-Series Motifs p.453 14.4.1 Distance-based Motifs p.454 14.4.2 Transformation to Sequential Pattern Mining p.456 14.4.3 Periodic Patterns p.457 14.5 Time-Series Clustering p.458 14.5.1 Online Clustering of Co-evolving Series p.458 14.5.2 Shape-based Clustering p.460 14.5.2.1 k-Means p.461 14.5.2.2 k-Medoids p.462 14.5.2.3 Hierarchical Methods p.462 14.5.2.4 Graph-based Methods p.462 14.6 Time-Series Outlier Detection p.462 14.6.1 Point Outliers p.463 14.6.2 Shape Outliers p.464 14.7 Time-Series Classification p.465 14.7.1 Supervised Event Detection p.466 14.7.2 Whole-Series Classification p.468 14.7.2.1 Wavelet-based Rules p.469 14.7.2.2 Nearest Neighbor Classifier p.469 14.7.2.3 Graph-based Methods p.470 14.8 Summary p.470 14.9 Bibliographic Notes p.470 14.10 Exercises p.471
15 Mining Discrete Sequences
15.1 Introduction p.473 15.2 Sequential Pattern Mining p.474 15.2.1 Frequent Patterns to Frequent Sequences p.477 15.2.2 Constrained Sequential Pattern Mining p.480 15.3 Sequence Clustering p.481 15.3.1 Distance-based Methods p.482 15.3.2 Graph-based Methods p.482 15.3.3 Subsequence-based Clustering p.482 15.3.4 Probabilistic Clustering p.483 15.3.4.1 Markovian Similarity-based Algorithm: CLUSEQ p.483 15.3.4.2 Mixture of Hidden Markov Models p.486 15.4 Outlier Detection in Sequences p.487 15.4.1 Position Outliers p.487 15.4.1.1 Efficiency Issues: Probabilistic Suffix Trees p.490 15.4.2 Combination Outliers p.491 15.4.2.1 Distance-based Models p.492 15.4.2.2 Frequency-based Models p.493 15.5 Hidden Markov Models p.494 15.5.1 Formal Definition and Techniques for HMMs p.496 15.5.2 Evaluation: Computing the Fit Probability for Observed Sequence p.497 15.5.3 Explanation: Determining the Most Likely State Sequence for Observed Sequence p.498 15.5.4 Training: Baum-Welch Algorithm p.499 15.5.5 Applications p.500 15.6 Sequence Classification p.501 15.6.1 Nearest Neighbor Classifier p.501 15.6.2 Graph-based Methods p.501 15.6.3 Rule-based Methods p.502 15.6.4 Kernel Support Vector Machines p.503 15.6.4.1 Bag-of-Words Kernel p.503 15.6.4.2 Spectrum Kernel p.503 15.6.4.3 Weighted Degree Kernel p.504 15.6.5 Probabilistic Methods: Hidden Markov Models p.504 15.7 Summary p.505 15.8 Bibliographic Notes p.505 15.9 Exercises p.506
16 Mining Spatial Data
16.1 Introduction p.509 16.2 Mining with Contextual Spatial Attributes p.510 16.2.1 Shape to Time-Series Transformation p.512 16.2.2 Spatial to Multidimensional Transformation with Wavelets p.515 16.2.3 Spatial Co-location Patterns p.516 16.2.4 Clustering Shapes p.517 16.2.5 Outlier Detection p.518 16.2.5.1 Point Outliers p.518 16.2.5.2 Shape Outliers p.520 16.2.6 Classification of Shapes p.521 16.3 Trajectory Mining p.522 16.3.1 Equivalence of Trajectories and Multivariate Time Series p.522 16.3.2 Converting Trajectories to Multidimensional Data p.523 16.3.3 Trajectory Pattern Mining p.524 16.3.3.1 Frequent Trajectory Paths p.524 16.3.3.2 Co-location Patterns p.526 16.3.4 Trajectory Clustering p.526 16.3.4.1 Computing Similarity between Trajectories p.526 16.3.4.2 Similarity-based Clustering Methods p.527 16.3.4.3 Trajectory Clustering as a Sequence Clustering Problem p.528 16.3.5 Trajectory Outlier Detection p.528 16.3.5.1 Distance-based Methods p.529 16.3.5.2 Sequence-based Methods p.529 16.3.6 Trajectory Classification p.530 16.3.6.1 Distance-based Methods p.530 16.3.6.2 Sequence-based Methods p.530 16.4 Summary p.531 16.5 Bibliographic Notes p.531 16.6 Exercises p.532
17 Mining Graph Data
17.1 Introduction p.533 17.2 Matching and Distance Computation in Graphs p.535 17.2.1 Ullman’s Algorithm for Subgraph Isomorphism p.537 17.2.1.1 Algorithm Variations and Refinements p.539 17.2.2 Maximum Common Subgraph Problem p.540 17.2.3 Graph Matching Methods for Distance Computation p.541 17.2.3.1 Maximum Common Subgraph-based Distances p.541 17.2.3.2 Graph Edit Distance p.542 17.3 Transformation-based Distance Computation p.546 17.3.1 Frequent Substructure-based Transformation and Distance Computation p.546 17.3.2 Topological Descriptors p.547 17.3.3 Kernel-based Transformations and Computation p.549 17.3.3.1 Random-Walk Kernels p.549 17.3.3.2 Shortest-Path Kernels p.550 17.4 Frequent Substructure Mining in Graphs p.550 17.4.1 Node-based Join Growth p.552 17.4.2 Edge-based Join Growth p.554 17.4.3 Frequent Pattern Mining to Graph Pattern Mining p.554 17.5 Graph Clustering p.554 17.5.1 Distance-based Methods p.555 17.5.2 Frequent Substructure-based Methods p.555 17.5.2.1 Generic Transformational Approach p.556 17.5.2.2 XProj: Direct Clustering with Frequent Subgraph Discovery p.556 17.6 Graph Classification p.558 17.6.1 Distance-based Methods p.558 17.6.2 Frequent Substructure-based Methods p.558 17.6.2.1 Generic Transformational Approach p.559 17.6.2.2 XRules: A Rule-based Approach p.559 17.6.3 Kernel Support Vector Machines p.560 17.7 Summary p.560 17.8 Bibliographic Notes p.561 17.9 Exercises p.562
18 Mining Web Data
18.1 Introduction p.565 18.2 Web Crawling and Resource Discovery p.567 18.2.1 A Basic Crawler Algorithm p.567 18.2.2 Preferential Crawlers p.569 18.2.3 Multiple Threads p.569 18.2.4 Combatting Spider Traps p.569 18.2.5 Shingling for Near Duplicate Detection p.570 18.3 Search Engine Indexing and Query Processing p.570 18.4 Ranking Algorithms p.573 18.4.1 PageRank p.573 18.4.1.1 Topic-Sensitive PageRank p.576 18.4.1.2 SimRank p.577 18.4.2 HITS p.578 18.5 Recommender Systems p.579 18.5.1 Content-based Recommendations p.581 18.5.2 Neighborhood-based Methods for Collaborative Filtering p.582 18.5.2.1 User-based Similarity with Ratings p.582 18.5.2.2 Item-based Similarity with Ratings p.583 18.5.3 Graph-based Methods p.583 18.5.4 Clustering Methods p.585 18.5.4.1 Adapting k-Means Clustering p.585 18.5.4.2 Adapting Co-Clustering p.585 18.5.5 Latent Factor Models p.586 18.5.5.1 Singular Value Decomposition p.587 18.5.5.2 Matrix Factorization p.587 18.6 Web Usage Mining p.588 18.6.1 Data Preprocessing p.589 18.6.2 Applications p.589 18.7 Summary p.590 18.8 Bibliographic Notes p.591 18.9 Exercises p.591
19 Social Network Analysis
19.1 Introduction p.593 19.2 Social Networks: Preliminaries and Properties p.594 19.2.1 Homophily p.595 19.2.2 Triadic Closure and Clustering Coefficient p.595 19.2.3 Dynamics of Network Formation p.595 19.2.4 Power-Law Degree Distributions p.597 19.2.5 Measures of Centrality and Prestige p.597 19.2.5.1 Degree Centrality and Prestige p.597 19.2.5.2 Closeness Centrality and Proximity Prestige p.598 19.2.5.3 Betweenness Centrality p.600 19.2.5.4 Rank Centrality and Prestige p.600 19.3 Community Detection p.601 19.3.1 Kernighan-Lin Algorithm p.602 19.3.1.1 Speeding up Kernighan-Lin p.604 19.3.2 Girvan-Newman Algorithm p.604 19.3.3 Multilevel Graph Partitioning: METIS p.607 19.3.4 Spectral Clustering p.610 19.3.4.1 Important Observations and Intuitions p.613 19.4 Collective Classification p.614 19.4.1 Iterative Classification Algorithm p.615 19.4.2 Label Propagation with Random Walks p.616 19.4.2.1 Iterative Label Propagation: The Spectral Interpretation p.619 19.4.3 Supervised Spectral Methods p.619 19.4.3.1 Supervised Feature Generation with Spectral Embedding p.620 19.4.3.2 Graph Regularization Approach p.620 19.4.3.3 Connections with Random-Walk Methods p.622 19.5 Link Prediction p.623 19.5.1 Neighborhood-based Measures p.623 19.5.2 Katz Measure p.624 19.5.3 Random Walk-based Measures p.625 19.5.4 Link Prediction as a Classification Problem p.626 19.5.5 Link Prediction as a Missing Value Estimation Problem p.627 19.5.6 Discussion p.627 19.6 Social Influence Analysis p.627 19.6.1 Linear Threshold Model p.629 19.6.2 Independent Cascade Model p.629 19.6.3 Influence Function Evaluation p.630 19.7 Summary p.630 19.8 Bibliographic Notes p.631 19.9 Exercises p.632
20 Privacy-Preserving Data Mining
20.1 Introduction p.635 20.2 Privacy during Data Collection p.636 20.2.1 Reconstructing Aggregate Distributions p.637 20.2.2 Leveraging Aggregate Distributions for Data Mining p.639 20.3 Privacy-Preserving Data Publishing p.639 20.3.1 The k-anonymity Model p.641 20.3.1.1 Samarati’s Algorithm p.645 20.3.1.2 Incognito p.646 20.3.1.3 Mondrian Multidimensional k-Anonymity p.649 20.3.1.4 Synthetic Data Generation: Condensation-based Approach p.651 20.3.2 The ℓ-diversity Model p.653 20.3.3 The t-closeness Model p.655 20.3.4 The Curse of Dimensionality p.658 20.4 Output Privacy p.658 20.5 Distributed Privacy p.659 20.6 Summary p.661 20.7 Bibliographic Notes p.661 20.8 Exercises p.663
References
2015 DataMiningTheTextbook

Author | volume | Date Value | title | type | journal | titleUrl | doi | note | year
---|---|---|---|---|---|---|---|---|---
Charu C. Aggarwal | | | Data Mining: The Textbook | | | | | | 2015