DFFITS
A DFFITS is a regression diagnostic statistic that measures the influence of a single observation on its fitted value in a statistical regression.
- See: Statistical Modeling Algorithm, Statistical Regression, Influential Observation, Cook's Distance, Factorial Design, Student's t-Distribution.
References
2016
- (Wikipedia, 2016) ⇒ https://www.wikiwand.com/en/DFFITS Retrieved 2016-07-24
- DFFITS is a diagnostic meant to show how influential a point is in a statistical regression. It was proposed in 1980. It is defined as the Studentized DFFIT, where the latter is the change in the predicted value for a point, obtained when that point is left out of the regression; Studentization is achieved by dividing by the estimated standard deviation of the fit at that point:
- [math]\displaystyle{ \text{DFFITS} = {\widehat{y_i} - \widehat{y_{i(i)}} \over s_{(i)} \sqrt{h_{ii}}} }[/math]
- where [math]\displaystyle{ \widehat{y_i} }[/math] and [math]\displaystyle{ \widehat{y_{i(i)}} }[/math] are the prediction for point i with and without point i included in the regression, [math]\displaystyle{ s_{(i)} }[/math] is the standard error estimated without the point in question, and [math]\displaystyle{ h_{ii} }[/math] is the leverage for the point.
- DFFITS is very similar to the externally Studentized residual, and is in fact equal to the latter times [math]\displaystyle{ \sqrt{h_{ii}/(1-h_{ii})} }[/math].
- Since, when the errors are Gaussian, the externally Studentized residual is distributed as Student's t (with a number of degrees of freedom equal to the number of residual degrees of freedom minus one), the DFFITS value for a particular point is distributed as that same Student's t distribution multiplied by the leverage factor [math]\displaystyle{ \sqrt{h_{ii}/(1-h_{ii})} }[/math] for that point. Thus, for low-leverage points DFFITS is expected to be small, whereas as the leverage goes to 1 the distribution of the DFFITS value widens without bound.
- For a perfectly balanced experimental design (such as a factorial design or balanced partial factorial design), the leverage for each point is p/n, the number of parameters divided by the number of points. This means that the DFFITS values will be distributed (in the Gaussian case) as [math]\displaystyle{ \sqrt{p \over n-p} \approx \sqrt{p \over n} }[/math] times a t variate. Therefore, the authors suggest investigating those points with DFFITS greater than [math]\displaystyle{ 2\sqrt{p \over n} }[/math].
- Although the raw values resulting from the equations are different, Cook's distance and DFFITS are conceptually identical and there is a closed-form formula to convert one value to the other.
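The definition above can be checked numerically: the closed form via the externally Studentized residual must agree with the brute-force leave-one-out computation. The following NumPy sketch (illustrative only, not from the cited sources; function names and the synthetic data are my own) computes DFFITS both ways and flags points exceeding the suggested 2√(p/n) cutoff:

```python
import numpy as np

def dffits_closed_form(X, y):
    """DFFITS via the closed form t_i * sqrt(h_ii / (1 - h_ii)),
    where t_i is the externally Studentized residual (OLS assumed)."""
    n, p = X.shape
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    e = y - X @ beta
    h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)     # leverages h_ii
    s2 = e @ e / (n - p)                               # full-sample s^2
    # leave-one-out variance estimate s_(i)^2
    s2_i = ((n - p) * s2 - e**2 / (1 - h)) / (n - p - 1)
    t_ext = e / np.sqrt(s2_i * (1 - h))                # externally Studentized
    return t_ext * np.sqrt(h / (1 - h))

def dffits_leave_one_out(X, y):
    """DFFITS from the definition: (yhat_i - yhat_i(i)) / (s_(i) * sqrt(h_ii)),
    refitting the regression with each point deleted in turn."""
    n, p = X.shape
    h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)
    yhat = X @ np.linalg.lstsq(X, y, rcond=None)[0]
    out = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        beta_i = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
        resid_i = y[keep] - X[keep] @ beta_i
        s_i = np.sqrt(resid_i @ resid_i / (n - 1 - p))  # s_(i)
        out[i] = (yhat[i] - X[i] @ beta_i) / (s_i * np.sqrt(h[i]))
    return out

# Synthetic example with an intercept column.
rng = np.random.default_rng(0)
n, p = 40, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

d = dffits_closed_form(X, y)
flagged = np.where(np.abs(d) > 2 * np.sqrt(p / n))[0]  # suggested cutoff
```

The two functions agree to numerical precision, which is exactly the identity DFFITS = (externally Studentized residual) × √(h_ii/(1−h_ii)) stated above.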
2009
- (Jahufer and Jianbao, 2009) ⇒ Jahufer, A., and Jianbao, C. (2009). Assessing global influential observations in modified ridge regression. Statistics & Probability Letters, 79(4), 513-518. [1]
- Among the most popular single-case influence measures is the difference in fit standardized (DFFITS) (Belsley et al., 1980), which evaluated at the i-th case is given by
- [math]\displaystyle{ \text{DFFITS}(i) = \left(x_i \beta - x_i \beta(i)\right) / SE(x_i\beta) }[/math]
- where [math]\displaystyle{ \beta(i) }[/math] is the least squares estimator of [math]\displaystyle{ \beta }[/math] without the i-th case and [math]\displaystyle{ SE(x_i\beta) }[/math] is an estimator of the standard error (SE) of the fitted values. DFFITS(i) is the standardized change in the fitted value of a case when it is deleted. Thus it can be considered a measure of influence on individual fitted values.
- Another useful measure of influence is Cook’s D (Cook and Weisberg, 1982), which evaluated at the ith case is given by
- [math]\displaystyle{ D_i = \left(\beta - \beta(i)\right)' X'X \left(\beta - \beta(i)\right)/(ps^2) }[/math]
- [math]\displaystyle{ D_i }[/math] is a measure of the change in all of the fitted values when a case is deleted. Even though [math]\displaystyle{ D_i }[/math] is based on a different theoretical consideration, it is closely related to DFFITS.
- It is important to mention that these measures are useful for detecting single cases with an unduly high influence. These indexes, however, suffer from the problem of masking; that is, the presence of some influential cases can disguise or mask the potential influence of other cases.
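The close relation between Cook's D and DFFITS noted above has an exact closed form: D_i = DFFITS(i)² · s_(i)²/(p·s²), where s² is the full-sample residual variance and s_(i)² its leave-one-out counterpart. The following NumPy sketch (illustrative only; names and data are my own, not from the cited paper) computes both measures and the conversion factor:

```python
import numpy as np

def cooks_d_and_dffits(X, y):
    """Compute Cook's D and DFFITS for an OLS fit, plus the factor
    s_(i)^2 / (p * s^2) linking them: D_i = DFFITS_i^2 * factor_i."""
    n, p = X.shape
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    e = y - X @ beta
    h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)      # leverages h_ii
    s2 = e @ e / (n - p)                                # full-sample s^2
    s2_i = ((n - p) * s2 - e**2 / (1 - h)) / (n - p - 1)  # s_(i)^2
    t_ext = e / np.sqrt(s2_i * (1 - h))                 # externally Studentized
    dffits = t_ext * np.sqrt(h / (1 - h))
    r = e / np.sqrt(s2 * (1 - h))                       # internally Studentized
    cooks_d = r**2 / p * h / (1 - h)                    # Cook's D, closed form
    return cooks_d, dffits, s2_i / (p * s2)

# Synthetic example with an intercept column.
rng = np.random.default_rng(1)
n, p = 30, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = 2.0 + 3.0 * X[:, 1] + rng.normal(size=n)

D, dff, factor = cooks_d_and_dffits(X, y)
```

Here `D` equals `dff**2 * factor` element-wise, illustrating why the two diagnostics rank cases almost identically: they differ only by the scale factor s_(i)²/(p·s²), which is close to 1/p when no single case dominates the residual variance.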
1980
- (Belsley et al., 1980) ⇒ Belsley, D. A., Kuh, E., Welsch, R. E. (1980). Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. Wiley Series in Probability and Mathematical Statistics.