Outlier Detection Algorithm
An outlier detection algorithm is a detection algorithm that can be applied by an outlier detection system (to solve an outlier detection task).
- Context:
- It can range from being a Univariate Outlier Detection Algorithm to being a Multivariate Outlier Detection Algorithm (such as a bivariate outlier detection algorithm).
- It can range from being a Continuous Outlier Detection Algorithm to being a Categorical Outlier Detection Algorithm.
- It can range from being a Local-based Outlier Detection Algorithm to being a Global-based Outlier Detection Algorithm.
- It can range from being a Unordered-Data Outlier Detection Algorithm to being a Sequential-Data Outlier Detection Algorithm (such as a temporal outlier detection algorithm).
- It can be:
- a Distribution-based Outlier Detection Algorithm (parametric algorithm?)
- a Depth-based Outlier Detection Algorithm.
- a Distance-based Outlier Detection Algorithm.
- Example(s):
- a Feature Bagging-based Outlier Detection Algorithm (using feature bagging).
- a Extreme Value Analysis-based Outlier Detection Algorithm, that determines the statistical tails of the underlying distribution of the data. For example, statistical methods like the z-scores on univariate data.
- a Statistical Model-based Outlier Detection Algorithm (using feature bagging), that determines unlikely instances from a probabilistic model of the data. For example, gaussian mixture models optimized using expectation-maximization.
- a Linear Model-based Outlier Detection Algorithm, that model the data into lower dimensions using linear correlations. For example, principle component analysis and data with large residual errors may be outliers.
- a Proximity-based Outlier Detection Algorithm, that isolates data instances from the mass of the data as determined by cluster, density or nearest neighbor analysis. (using feature bagging).
- a Information Theoretic-based Outlier Detection Algorithm, that identify instances that increase the complexity (minimum code length) of the dataset.
- a High-Dimensional Outlier Detection Algorithm, that search subspaces for outliers give the breakdown of distance based measures in higher dimensions.
- a Graph-based Outlier Detection Algorithm, such as an outlying path detection algorithm.
- a Neural-based Outlier Detection Algorithm, such as MSCRED.
- …
- Counter-Example(s):
- See: Fraud Detection Algorithm, Extreme Value Analysis.
References
2015
- (Wikipedia, 2015) ⇒ http://en.wikipedia.org/wiki/anomaly_detection#Popular_techniques Retrieved:2015-10-5.
- Several anomaly detection techniques have been proposed in literature. Some of the popular techniques are:
- Density-based techniques (k-nearest neighbor, local outlier factor, and many more variations of this concept ).
- Subspace- and correlation-based outlier detection for high-dimensional data.
- One class support vector machines.
- Replicator neural networks.
- Cluster analysis-based outlier detection.
- Deviations from association rules and frequent itemsets.
- Fuzzy logic based outlier detection.
- Ensemble techniques, using feature bagging, score normalization and different sources of diversity.
- Several anomaly detection techniques have been proposed in literature. Some of the popular techniques are:
2013
- (Aggarwal, 2013) ⇒ Charu C. Aggarwal. (2013). “Outlier Analysis." Springer Publishing Company, Incorporated. ISBN:1461463955, 9781461463955 doi:10.1007/978-1-4614-6396-2
- QUOTE: The classical books relevant to outlier analysis are as follows:
- P. Rousseeuw and A. Leroy. Robust Regression and Outlier Detection. Wiley, 2003.
- V. Barnett and T. Lewis. Outliers in Statistical Data, Wiley, 1994.
- D. Hawkins. Identification of Outliers, Chapman and Hall, 1980.
- We note that these books are quite outdated, and the most recent among them is a decade old. Furthermore, this (most recent) book is really focussed on the relationship between regression and outlier analysis, rather than the latter. Outlier analysis is a much broader area, in which regression analysis is only a small part.
- QUOTE: The classical books relevant to outlier analysis are as follows:
- (Hauskrecht et al., 2013) ⇒ Milos Hauskrecht, Iyad Batal, Michal Valko, Shyam Visweswaran, Gregory F Cooper, and Gilles Clermont. (2013). “Outlier Detection for Patient Monitoring and Alerting.” In: Journal of Biomedical Informatics, 46(1). doi:10.1016/j.jbi.2012.08.004
2009
- (Chandola et al., 2009) ⇒ Varun Chandola, Arindam Banerjee, and Vipin Kumar. (2009). “Anomaly Detection: A survey.” In: ACM Computing Surveys, 41(3) doi:10.1145/1541880.1541882
- QUOTE: Anomaly detection is an important problem that has been researched within diverse research areas and application domains. Many anomaly detection techniques have been specifically developed for certain application domains, while others are more generic. This survey tries to provide a structured and comprehensive overview of the research on anomaly detection.
2005
- (Angiulli & Pizzuti, 2005) ⇒ F. Angiulli, and C. Pizzuti. (2005). “Outlier Mining in Large High-Dimensional Data Sets.” In: IEEE Transactions on Knowledge and Data Engineering, 17(2). [doi:10.1109/TKDE.2005.31].
- (Lazarevic & Kumar, 2005) ⇒ Aleksandar Lazarevic, and Vipin Kumar. (2005). “Feature Bagging for Outlier Detection.” In: Proceedings of the eleventh ACM SIGKDD international Conference on Knowledge Discovery in Data Mining (KDD-2005). doi:10.1145/1081870.1081891
- (Ben-Gal, 2005) ⇒ Irad E. Ben-Gal. (2005). “Outlier Detection.” In: Maimon O. and Rockach L. (Eds.) Data Mining and Knowledge Discovery Handbook: A Complete Guide for Practitioners and Researchers," Kluwer Academic Publishers. ISBN:0387244352.
- QUOTE: ABSTRACT: Outlier detection is a primary step in many data-mining applications. We present several methods for outlier detection, while distinguishing between [[Univariate Outlier Detection Algorithm}univariate]] vs. multivariate techniques and parametric vs. nonparametric procedures. In presence of outliers, special attention should be taken to assure the robustness of the used estimators. Outlier detection for data mining is often based on distance measures, clustering and spatial methods.
2004
- (Hodge & Austin, 2004) ⇒ Victoria Hodge, and Jim Austin. (2004). “A Survey of Outlier Detection Methodologies.” In: Artificial Intelligence Review, 22(2). doi:10.1023/B:AIRE.0000045502.10941.a9
2003
- (Rousseeuw & Leroy, 2003) ⇒ Peter J. Rousseeuw, and Annick M. Leroy. (2003). “Robust Regression and Outlier Detection." Wiley-IEEE. ISBN:0471488550
- (Papadimitriou et al., 2003) ⇒ Spiros Papadimitriou, Hiroyuki Kitagawa, Phillip B. Gibbons, and Christos Faloutsos. (2003). “LOCI: Fast Outlier Detection Using the Local Correlation Integral.” In: Proceedings of the International Conference on Data Engineering (ICDE 2003). doi:10.1109/ICDE.2003.1260802.
2001
- (Jin et al., 2001) ⇒ Wen Jin, Anthony K. H. Tung, and Jiawei Han. (2001). “Mining Top-n Local Outliers in Large Databases.” In: Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. doi:10.1145/502512.502554
2000
- (Breunig et al., 2000) ⇒ Markus M. Breunig, Hans-Peter Kriegel, Raymond T. Ng, and Jörg Sander. (2000). “LOF: identifying density-based local outliers.” In: Proceedings of ACM SIGMOD International Conference on Management of Data (SIGMOD 2000). doi:10.1145/335191.335388
- QUOTE: Most of the previous studies on outlier detection were conducted in the field of statistics. These studies can be broadly classified into two categories.
The first category is distribution-based, where a standard distribution (e.g. Normal, Poisson, etc.) is used to fit the data best. Outliers are defined based on the probability distribution. Over one hundred tests of this category, called discordancy tests, have been developed for different scenarios (see [5]). A key drawback of this category of tests is that most of the distributions used are univariate. There are some tests that are multivariate (e.g. multivariate normal outliers). But for many KDD applications, the underlying distribution is unknown. Fitting the data with standard distributions is costly, and may not produce satisfactory results.
The second category of outlier studies in statistics is depth-based. Each data object is represented as a point in a k-d space, and is assigned a depth. With respect to outlier detection, outliers are more likely to be data objects with smaller depths. There are many definitions of depth that have been proposed (e.g. [20], [16]). In theory, depth-based approaches could work for large values of k. However, in practice, while there exist efficient algorithms for k = 2 or 3 ([16], [18], [12]), depth-based approaches become inefficient for large datasets for k ³ 4. This is because depth-based approaches rely on the computation of k-d convex hulls which has a lower bound complexity of W(nk/2) for n objects. …
… Definition 1: (Hawkins-Outlier) An outlier is an observation that deviates so much from other observations as to arouse suspicion that it was generated by a different mechanism.
- QUOTE: Most of the previous studies on outlier detection were conducted in the field of statistics. These studies can be broadly classified into two categories.
- (Ramaswamy et al., 2000) ⇒ Sridhar Ramaswamy, Rajeev Rastogi, and Kyuseok Shim. (2000). “Efficient Algorithms for Mining Outliers from Large Data Sets.” In: SIGMOD Record, 29(20). doi:10.1145/335191.335437.
1998
- (Knorr & Ng, 1998) ⇒ E. Knorr, and Raymond Ng. (1998). “Algorithms for Mining Distance-based Outliers in Large Data Sets.” In: Proceedings of the 24th International Conference on Very Large Databases (VLDB 1998).
1994
- (Barnett et al., 1994) ⇒ Vic Bamnett., and Toby Lewis. (1994). “Outliers in Statistical Data, 3rd edition." Wiley.
1980
- (Hawkins, 1980) ⇒ Douglas M. Hawkins. (1980). “Identification of Outliers." Chapman and Hall,