Stratified K-Fold Cross-Validation Task
A Stratified K-Fold Cross-Validation Task is a K-Fold Cross-Validation Task in which the class distribution is kept approximately the same across all the folds.
- AKA: Stratified Cross Validation.
- Context:
- It can be solved by a Stratified K-Fold Cross-Validation System that implements a Stratified K-Fold Cross-Validation Algorithm.
- Example(s):
- a Stratified Tenfold Cross-Validation Task, such as the one evaluated in (Purushotham & Tripathy, 2011).
- Counter-Example(s):
- an unstratified K-Fold Cross-Validation Task, whose folds are random partitions that need not preserve the class distribution;
- a Leave-One-Out Cross-Validation Task.
- See: Cross-Validation Task, Exhaustive Cross-Validation Task, Non-exhaustive Cross-validation Task, Nested Cross-validation Task.
References
2020a
- (Wikipedia, 2020) ⇒ https://www.wikiwand.com/en/Cross-validation_(statistics)#/Non-exhaustive_cross-validation Retrieved:2020-2-14.
- QUOTE: In k-fold cross-validation, the original sample is randomly partitioned into k equal sized subsamples. Of the k subsamples, a single subsample is retained as the validation data for testing the model, and the remaining k − 1 subsamples are used as training data. The cross-validation process is then repeated k times, with each of the k subsamples used exactly once as the validation data. The k results can then be averaged to produce a single estimation. The advantage of this method over repeated random sub-sampling (see below) is that all observations are used for both training and validation, and each observation is used for validation exactly once. 10-fold cross-validation is commonly used,[1] but in general k remains an unfixed parameter.
For example, setting k = 2 results in 2-fold cross-validation. In 2-fold cross-validation, we randomly shuffle the dataset into two sets d0 and d1, so that both sets are equal size (this is usually implemented by shuffling the data array and then splitting it in two). We then train on d0 and validate on d1, followed by training on d1 and validating on d0.
When k = n (the number of observations), k-fold cross-validation is equivalent to leave-one-out cross-validation[2]. In stratified k-fold cross-validation, the partitions are selected so that the mean response value is approximately equal in all the partitions. In the case of binary classification, this means that each partition contains roughly the same proportions of the two types of class labels. In repeated cross-validation the data is randomly split into k partitions several times. The performance of the model can thereby be averaged over several runs, but this is rarely desirable in practice.
- ↑ McLachlan, Geoffrey J.; Do, Kim-Anh; Ambroise, Christophe (2004). Analyzing microarray gene expression data. Wiley.
- ↑ "Elements of Statistical Learning: data mining, inference, and prediction. 2nd Edition". web.stanford.edu. Retrieved 2019-04-04.
2020b
- (SciKit-Learn, 2020) ⇒ https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html Retrieved:2020-2-14.
- QUOTE: Stratified K-Folds Cross-validator
Provides train/test indices to split data in train/test sets.
This cross-validation object is a variation of KFold that returns stratified folds. The folds are made by preserving the percentage of samples for each class.
Read more in the User Guide.
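The class documented above can be exercised on a small imbalanced sample; a minimal sketch, assuming scikit-learn is installed:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy imbalanced binary dataset: 8 samples, 25% positives.
X = np.arange(16).reshape(8, 2)
y = np.array([0, 0, 0, 0, 0, 0, 1, 1])

skf = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)
for fold, (train_idx, test_idx) in enumerate(skf.split(X, y)):
    # Each fold preserves the full sample's 3:1 class ratio.
    print(f"fold {fold}: test={test_idx}, positives in test={int(y[test_idx].sum())}")
```

Each of the two test folds receives three negatives and one positive, mirroring the 75/25 class balance of the whole dataset.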
2020c
- (Bey et al., 2020) ⇒ R. Bey, R. Goussault, M. Benchoufi, and R. Porcher (2020). "Stratified Cross-Validation For Unbiased And Privacy-Preserving Federated Learning". ArXiv:2001.08090
- QUOTE: Stratified cross-validation complements cross-validation with an initial stratification of EHR in folds containing similar patients, thus ensuring that duplicates of a record are jointly present either in training or in validation folds. Monte Carlo simulations are performed to investigate the properties of stratified cross-validation in the case of a model data analysis.
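The paper's constraint (duplicates of a record jointly present in either training or validation folds) resembles what scikit-learn's StratifiedGroupKFold (available since scikit-learn 1.0) enforces. This is not the authors' algorithm, just an illustrative sketch of the same joint-membership idea:

```python
import numpy as np
from sklearn.model_selection import StratifiedGroupKFold

# Records sharing a group id (e.g. duplicates of one patient's EHR) are
# kept in the same fold, while folds remain approximately class-stratified.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
groups = np.array([0, 0, 1, 1, 2, 2, 3, 3, 4, 4])  # duplicate-record ids

sgkf = StratifiedGroupKFold(n_splits=2)
for train_idx, test_idx in sgkf.split(X, y, groups):
    print("test groups:", sorted(set(groups[test_idx])))  # no group spans folds
```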
2017
- (Sammut & Webb, 2017) ⇒ Claude Sammut, and Geoffrey I. Webb. (2017). "Stratified Cross Validation". In: (Sammut & Webb, 2017).DOI:10.1007/978-1-4899-7687-1_788
- QUOTE: Stratified Cross Validation is a form of cross validation in which the class distribution is kept as close as possible to being the same across all folds.
2013
- (del Pozo et al., 2013) ⇒ Juan A. Fernandez del Pozo, Pedro Larranaga, and Concha Bielza (2013). "Stratified Cross-Validation in Multi-Label Classification Using Genetic Algorithms". In: TIN2010-20900-C04, Computational Intelligence Group, Universidad Politecnica de Madrid.
- QUOTE: Stratified cross-validation reduces the variance of the estimates and improves the estimation of the generalization performance of classifier algorithms.
However, how to stratify a data set in a multi-label supervised classification setting is a hard problem, since each fold should try to mimic the joint probability distribution of the whole set of class variables.
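One simple baseline for the multi-label case, not the genetic-algorithm method of the paper, is to collapse each row of the label matrix into a single "label powerset" class and stratify on that; a minimal sketch, assuming scikit-learn:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Label-powerset stratification: treat each distinct label combination as
# one class. This only works when every combination occurs at least
# n_splits times; rarer combinations make StratifiedKFold raise an error.
def powerset_stratified_folds(Y, n_splits=5, seed=0):
    _, combo_id = np.unique(Y, axis=0, return_inverse=True)  # class id per label combination
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    return list(skf.split(Y, combo_id))  # split() uses only len(Y) from its X argument
```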
2011
- (Purushotham & Tripathy, 2011) ⇒ Swarnalatha Purushotham, and B. K. Tripathy (2011, December). "Evaluation of Classifier Models Using Stratified Tenfold Cross Validation Techniques". In: International Conference on Computing and Communication Systems. DOI:10.1007/978-3-642-29216-3_74
1995
- (Kohavi, 1995) ⇒ Ron Kohavi (1995). "A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection". In: Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence (IJCAI 95).
- QUOTE: Formally, let $\mathcal{D}_{(i)}$ be the test set that includes instance $x_i = \langle v_i, y_i\rangle$; then the cross-validation estimate of accuracy is [math]\displaystyle{ acc_{cv} = \dfrac{1}{n} \displaystyle \sum_{\langle v_i, y_i\rangle \in \mathcal{D}} \delta\left(\mathcal{I}\left(\mathcal{D} \backslash \mathcal{D}_{(i)}, v_i\right), y_i\right) }[/math]
The cross-validation estimate is a random number that depends on the division into folds. Complete cross-validation is the average of all $\binom{m}{m/k}$ possibilities for choosing $m/k$ instances out of $m$, but it is usually too expensive. Except for leave-one-out (n-fold cross-validation), which is always complete, k-fold cross-validation is estimating complete k-fold cross-validation using a single split of the data into the folds. Repeating cross-validation multiple times using different splits into folds provides a better Monte Carlo estimate of the complete cross-validation at an added cost. In stratified cross-validation the folds are stratified so that they contain approximately the same proportions of labels as the original dataset.
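Kohavi's estimator can be written down directly. A minimal Python sketch, assuming scikit-learn and using LogisticRegression as an illustrative stand-in for the inducer $\mathcal{I}$:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

# Sketch of acc_cv: each instance x_i is classified by the inducer trained
# on D \ D_(i) (the data minus the fold containing x_i); delta contributes
# 1 when the prediction matches y_i, and the hits are averaged over n.
def cv_accuracy(X, y, n_splits=10, seed=0):
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    correct = 0
    for train_idx, test_idx in skf.split(X, y):
        model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
        correct += int((model.predict(X[test_idx]) == y[test_idx]).sum())
    return correct / len(y)  # (1/n) * sum of delta(I(D \ D_(i), v_i), y_i)
```

The stratified fold assignment used here matches the paper's recommendation that folds contain approximately the same label proportions as the original dataset.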