In-Sample Evaluation Task
An In-Sample Evaluation Task is a prediction system evaluation task in which the learned model is evaluated on the same data that was used to train it.
- Example(s):
  - measuring a decision tree's predictive accuracy on the same observations used to grow it.
- Counter-Example(s):
  - an Out-of-Sample Evaluation Task, such as a Holdout Evaluation Task or a Cross-Validation Task.
- See: Algorithm Evaluation, Model Evaluation, Machine Learning Algorithm Evaluation, Sampling Task, In-group Bias, Learning Performance, Holdout Data, Machine Learning Algorithm, Bootstrap Sampling, Cross-Validation, Prospective Evaluation.
References
2017a
- (Sammut & Webb, 2017) ⇒ (2017). "In-Sample Evaluation". In: Sammut C., Webb G.I. (eds) "Encyclopedia of Machine Learning and Data Mining". Springer, Boston, MA.
- QUOTE: In-sample evaluation is an approach to algorithm evaluation whereby the learned model is evaluated on the data from which it was learned. This provides a biased estimate of learning performance, in contrast to holdout evaluation.
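The following is a minimal illustrative sketch of the contrast above (assuming Python with NumPy and scikit-learn, which the quoted source does not use; the synthetic data and variable names are invented for this sketch): the same model is scored on its own training data (in-sample) and on withheld data (holdout).

```python
# Minimal sketch: in-sample evaluation (scoring a model on its training data)
# versus holdout evaluation (scoring it on data withheld from training).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))                         # synthetic features (illustrative only)
y = (X[:, 0] + 0.5 * rng.normal(size=500) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

in_sample_acc = accuracy_score(y_train, model.predict(X_train))   # biased (optimistic) estimate
holdout_acc = accuracy_score(y_test, model.predict(X_test))       # less biased estimate

print(f"in-sample accuracy: {in_sample_acc:.3f}")
print(f"holdout accuracy:   {holdout_acc:.3f}")
```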
2017b
- (Mullainathan & Spiess, 2017) ⇒ Sendhil Mullainathan, and Jann Spiess (2017). "Machine Learning: An Applied Econometric Approach". Journal of Economic Perspectives, 31(2), 87-106.
- QUOTE: So how does machine learning manage to do out-of-sample prediction? The first part of the solution is regularization. In the tree case, instead of choosing the “best” overall tree, we could choose the best tree among those of a certain depth. The shallower the tree, the worse the in-sample fit: with many observations in each leaf, no one observation will be fit very well. But this also means there will be less overfit: the idiosyncratic noise of each observation is averaged out. Tree depth is an example of a regularizer. It measures the complexity of a function. As we regularize less, we do a better job at approximating the in-sample variation, but for the same reason, the wedge between in-sample and out-of-sample fit will typically increase. Machine learning algorithms typically have a regularizer associated with them. By choosing the level of regularization appropriately, we can have some benefits of flexible functional forms without having those benefits be overwhelmed by overfit …
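As a hedged illustration of the quoted point (again assuming Python with NumPy and scikit-learn; the data-generating process is invented for this sketch), the snippet below varies a tree's max_depth regularizer and reports how the wedge between in-sample and out-of-sample fit typically widens as regularization decreases.

```python
# Sketch of tree depth acting as a regularizer: deeper trees improve in-sample
# fit but widen the gap between in-sample and out-of-sample performance.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(1000, 1))
y = np.sin(X[:, 0]) + 0.5 * rng.normal(size=1000)      # signal plus idiosyncratic noise

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=1)

for depth in (2, 5, 10, None):                          # None = fully grown (least regularized)
    tree = DecisionTreeRegressor(max_depth=depth, random_state=1).fit(X_tr, y_tr)
    r2_in = r2_score(y_tr, tree.predict(X_tr))
    r2_out = r2_score(y_te, tree.predict(X_te))
    print(f"max_depth={depth!s:>4}  in-sample R^2={r2_in:.3f}  out-of-sample R^2={r2_out:.3f}")
```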
2005
- (Inoue & Kilian, 2005) ⇒ Atsushi Inoue, and Lutz Kilian (2005). "In-Sample or Out-of-Sample Tests of Predictability: Which One Should We Use?". Econometric Reviews, 23(4), 371-402.
- QUOTE: Predictability tests can be conducted based on the in-sample fit of a model or they can be based on the out-of-sample fit obtained from a sequence of recursive or rolling regressions. In the former case, we use the full sample in fitting the models of interest. Examples of in-sample tests are standard t-tests or F-tests. In the latter case we attempt to mimic the data constraints faced by a real-time forecaster. Examples of out-of-sample tests are tests of equal predictive accuracy and tests of forecast encompassing. If these alternative tests tended to give the same answer, when applied to the same data set, it would not matter much, which one we use. In practice, however, in-sample tests tend to reject the null hypothesis of no predictability more often than out-of-sample tests. It is important to understand why. One possible explanation that is widely accepted among applied researchers is that in-sample tests are biased in favor of detecting spurious predictability. This perception has led to a tendency to discount significant evidence in favor of predictability based on in-sample tests, if this evidence cannot also be supported by out-of-sample tests (...)
The literature is replete with warnings about unreliable in-sample inference. The two main concerns are that in-sample tests of predictability will tend to be unreliable in the presence of unmodelled structural change and as a result of individual or collective data mining (...) It is important to be clear about what we mean by unreliable inference. In the context of predictive inference, the prevailing concern is that in-sample tests of predictability may spuriously indicate predictability when there is none. In this context, a predictability test would be considered unreliable if it has a tendency to reject the no predictability null hypothesis more often than it should at the chosen significance level. Formally, we define a test to be unreliable if its effective size exceeds its nominal size. It is important to note that the mere inclusion of irrelevant variables, although it inflates in-sample fit, does not affect the reliability of in-sample tests of predictability. By construction, a t-test of predictability is designed to mimic the distribution of the test statistic under the null that the regressor is irrelevant. Similarly, as more and more irrelevant variables are included, the critical values of the F-test will increase to account for this fact. Thus, the possible inclusion of irrelevant variables has no effect on the asymptotic size of predictability tests. This point is important because it means that under standard assumptions there is no reason to expect that in-sample tests offer any less protection against overfitting than do out-of-sample tests.
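A hedged sketch of the two approaches described above (assuming Python with NumPy and statsmodels; the data-generating process and sample split are invented for illustration): an in-sample t-test on a predictor in a full-sample regression, and a simplified out-of-sample comparison of recursive one-step-ahead forecasts against a no-predictability benchmark (standing in for formal tests of equal predictive accuracy).

```python
# In-sample versus out-of-sample predictability checks (simplified sketch).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
T = 300
x = rng.normal(size=T)                      # candidate predictor (illustrative only)
y = 0.1 * x + rng.normal(size=T)            # target with weak predictability

# In-sample test: fit the predictive regression on the full sample and read off
# the t-statistic on the predictor.
full_fit = sm.OLS(y, sm.add_constant(x)).fit()
print("in-sample t-stat on predictor:", round(full_fit.tvalues[1], 3))

# Out-of-sample check (simplified): recursively re-estimate the model, forecast
# one step ahead, and compare squared forecast errors against a recursive-mean
# benchmark that assumes no predictability.
start = 100
e_model, e_bench = [], []
for t in range(start, T):
    fit_t = sm.OLS(y[:t], sm.add_constant(x[:t])).fit()
    yhat = fit_t.params[0] + fit_t.params[1] * x[t]
    e_model.append((y[t] - yhat) ** 2)
    e_bench.append((y[t] - y[:t].mean()) ** 2)

print("out-of-sample MSPE (model):    ", round(float(np.mean(e_model)), 3))
print("out-of-sample MSPE (benchmark):", round(float(np.mean(e_bench)), 3))
```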