sklearn.ensemble.IsolationForest
A sklearn.ensemble.IsolationForest is an Outlier Detection Algorithm, based on Random Forests and well suited to high-dimensional datasets, within the sklearn.ensemble module.
- AKA: IsolationForest.
- Context
- Usage:
- 1) Import the Isolation Forest Algorithm from scikit-learn:
from sklearn.ensemble import IsolationForest
- 2) Generate training input sample:
X
- 3) Create Isolation Forest Algorithm object:
clf = IsolationForest([n_estimators=100, max_samples='auto', contamination=0.1, max_features=1.0, bootstrap=False, n_jobs=1, random_state=None, verbose=0])
- 4) Choose method(s) (see the consolidated sketch after this list):
decision_function(X), returns the average anomaly score of X of the base classifiers.
fit(X[, y, sample_weight]), fits the estimator.
get_params([deep]), gets parameters for this estimator.
predict(X), predicts if a particular sample is an outlier or not.
set_params(**params), sets the parameters of this estimator.
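A minimal end-to-end sketch tying steps 1) through 4) together; the synthetic training data and the specific parameter values below are illustrative assumptions, not prescribed by the scikit-learn documentation:
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
# 2) Generate a training input sample: two Gaussian clusters.
X = np.concatenate([rng.randn(100, 2) + 2, rng.randn(100, 2) - 2])
# 3) Create the Isolation Forest Algorithm object.
clf = IsolationForest(n_estimators=100, max_samples='auto', contamination=0.1, random_state=rng)
# 4) Fit the estimator, then score and label the samples.
clf.fit(X)
scores = clf.decision_function(X)  # average anomaly score of each sample
labels = clf.predict(X)            # +1 for inliers, -1 for outliers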
- Example(s):
- Counter-Example(s):
sklearn.ensemble.ExtraTreesRegressor,
sklearn.ensemble.ExtraTreesClassifier,
sklearn.ensemble.AdaBoostClassifier,
sklearn.ensemble.AdaBoostRegressor,
sklearn.ensemble.BaggingClassifier,
sklearn.ensemble.BaggingRegressor,
sklearn.ensemble.GradientBoostingClassifier,
sklearn.ensemble.GradientBoostingRegressor,
sklearn.ensemble.RandomForestClassifier,
sklearn.ensemble.RandomForestRegressor,
sklearn.ensemble.RandomTreesEmbedding,
sklearn.ensemble.VotingClassifier.
- See: Decision Tree, Decision Tree Ensemble Learning System, Regression System, Regularization Task, Ridge Regression Task, Random Forests System, Regression Algorithm.
References
2017a
- (Scikit-Learn, 2017a) ⇒ http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.IsolationForest.html
- QUOTE:
class sklearn.ensemble.IsolationForest(n_estimators=100, max_samples='auto', contamination=0.1, max_features=1.0, bootstrap=False, n_jobs=1, random_state=None, verbose=0)
Isolation Forest Algorithm. Return the anomaly score of each sample using the IsolationForest algorithm.
The IsolationForest ‘isolates’ observations by randomly selecting a feature and then randomly selecting a split value between the maximum and minimum values of the selected feature. Since recursive partitioning can be represented by a tree structure, the number of splittings required to isolate a sample is equivalent to the path length from the root node to the terminating node.
This path length, averaged over a forest of such random trees, is a measure of normality and our decision function.
Random partitioning produces noticeably shorter paths for anomalies. Hence, when a forest of random trees collectively produce shorter path lengths for particular samples, they are highly likely to be anomalies.
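The quoted behaviour can be checked with a small, hedged sketch (the data below are made up for illustration): points far from the training distribution should be easier to isolate, and therefore receive lower decision_function values than inliers.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
X_train = 0.3 * rng.randn(200, 2)                       # dense "normal" cluster
X_outliers = rng.uniform(low=-4, high=4, size=(20, 2))  # scattered points
clf = IsolationForest(n_estimators=100, random_state=rng).fit(X_train)
# Outliers are expected to score lower (more negative) than inliers.
print(clf.decision_function(X_train).mean())
print(clf.decision_function(X_outliers).mean())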
2017b
- (Scikit-Learn, 2017b) ⇒ http://scikit-learn.org/stable/modules/outlier_detection.html#isolation-forest
- QUOTE: One efficient way of performing outlier detection in high-dimensional datasets is to use random forests. The ensemble.IsolationForest ‘isolates’ observations by randomly selecting a feature and then randomly selecting a split value between the maximum and minimum values of the selected feature. Since recursive partitioning can be represented by a tree structure, the number of splittings required to isolate a sample is equivalent to the path length from the root node to the terminating node.
This path length, averaged over a forest of such random trees, is a measure of normality and our decision function.
Random partitioning produces noticeably shorter paths for anomalies. Hence, when a forest of random trees collectively produce shorter path lengths for particular samples, they are highly likely to be anomalies.
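The normalization behind this decision function is given in Liu et al. (2008): the expected path length E[h(x)] is divided by c(n), the average path length of an unsuccessful search in a binary search tree of n points, and mapped to the score s(x, n) = 2^(-E[h(x)] / c(n)). The sketch below implements the paper's formula directly; it illustrates the score construction, not scikit-learn's internal code (whose decision_function is a shifted variant of this score).
import math

def c(n):
    # Average path length of an unsuccessful BST search over n points:
    # c(n) = 2 * H(n - 1) - 2 * (n - 1) / n, with H(i) ~ ln(i) + 0.5772156649.
    if n <= 1:
        return 0.0
    harmonic = math.log(n - 1) + 0.5772156649
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def anomaly_score(expected_path_length, n):
    # Scores near 1 indicate anomalies; scores around 0.5 or below look normal.
    return 2.0 ** (-expected_path_length / c(n))

print(anomaly_score(4.0, 256))     # short path -> high score (~0.76)
print(anomaly_score(c(256), 256))  # average path -> exactly 0.5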
2008
- (Liu et al., 2008) ⇒ Fei Tony Liu, Kai Ming Ting, and Zhi-Hua Zhou. (2008, December). “Isolation Forest.” In: Proceedings of the Eighth IEEE International Conference on Data Mining (ICDM'08), pp. 413-422. DOI: 10.1109/ICDM.2008.17
- ABSTRACT: Most existing model-based approaches to anomaly detection construct a profile of normal instances, then identify instances that do not conform to the normal profile as anomalies. This paper proposes a fundamentally different model-based method that explicitly isolates anomalies instead of profiles normal points. To our best knowledge, the concept of isolation has not been explored in current literature. The use of isolation enables the proposed method, iForest, to exploit sub-sampling to an extent that is not feasible in existing methods, creating an algorithm which has a linear time complexity with a low constant and a low memory requirement. Our empirical evaluation shows that iForest performs favourably to ORCA, a near-linear time complexity distance-based method, LOF and Random Forests in terms of AUC and processing time, and especially in large data sets. iForest also works well in high dimensional problems which have a large number of irrelevant attributes, and in situations where training set does not contain any anomalies.