Influential Observation

From GM-RKB

An Influential Observation is an observation whose removal or omission from the dataset significantly alters the outcome of the parameter estimation task.



References

2015

Assessment

Various methods have been proposed for measuring influence. Assume an estimated regression [math]\displaystyle{ \mathbf{y} = \mathbf{X} \mathbf{b} + \mathbf{e} }[/math], where [math]\displaystyle{ \mathbf{y} }[/math] is an n×1 column vector for the response variable, [math]\displaystyle{ \mathbf{X} }[/math] is the n×k design matrix of explanatory variables (including a constant), [math]\displaystyle{ \mathbf{e} }[/math] is the n×1 residual vector, and [math]\displaystyle{ \mathbf{b} }[/math] is a k×1 vector of estimates of some population parameter [math]\displaystyle{ \mathbf{\beta} \in \mathbb{R}^{k} }[/math]. Also define [math]\displaystyle{ \mathbf{H} \equiv \mathbf{X} \left(\mathbf{X}^{\mathsf{T}} \mathbf{X} \right)^{-1} \mathbf{X}^{\mathsf{T}} }[/math], the projection matrix of [math]\displaystyle{ \mathbf{X} }[/math]. Then we have the following measures of influence:
  1. [math]\displaystyle{ \text{DFBETA}_{i} \equiv \mathbf{b} - \mathbf{b}_{(-i)} = \frac{\left( \mathbf{X}^{\mathsf{T}} \mathbf{X} \right)^{-1} \mathbf{x}_{i}^{\mathsf{T}} e_{i}}{1 - h_{i}} }[/math], where [math]\displaystyle{ \mathbf{b}_{(-i)} }[/math] denotes the coefficients estimated with the i-th row [math]\displaystyle{ \mathbf{x}_{i} }[/math] of [math]\displaystyle{ \mathbf{X} }[/math] deleted, and [math]\displaystyle{ h_{i} = \mathbf{x}_{i} \left( \mathbf{X}^{\mathsf{T}} \mathbf{X} \right)^{-1} \mathbf{x}_{i}^{\mathsf{T}} }[/math] denotes the i-th diagonal element of [math]\displaystyle{ \mathbf{H} }[/math]. Thus DFBETA measures the difference in each parameter estimate with and without the i-th observation. There is a DFBETA for each variable and each observation (if there are N observations and k variables there are N·k DFBETAs).
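  2. DFFITS measures the change in the fitted value for the i-th observation when that observation is deleted, scaled by an estimate of the fitted value's standard error.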
  3. Cook's D measures the effect of removing a data point on all the parameters combined.
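
The three measures above can be sketched numerically. The following is a minimal NumPy illustration, not a reference implementation: the function name `influence_measures` and the demo data are hypothetical, and the DFFITS and Cook's D expressions are the usual textbook closed forms, which the passage above does not state. DFBETA follows the formula in item 1, so no model is refitted.

```python
import numpy as np

def influence_measures(X, y):
    """Return (dfbeta, h, dffits, cooks_d) for an OLS fit of y on X.

    X is the n x k design matrix (include a column of ones for the constant),
    y is the length-n response. Row i of dfbeta holds b - b_(-i).
    """
    n, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y            # OLS coefficient estimates
    e = y - X @ b                    # residual vector
    # Leverages h_i = x_i (X'X)^{-1} x_i', i.e. the diagonal of H
    h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)
    # DFBETA_i = (X'X)^{-1} x_i' e_i / (1 - h_i), stacked row by row
    dfbeta = (X @ XtX_inv) * (e / (1.0 - h))[:, None]
    # Residual variance with all rows, and with row i left out
    s2 = e @ e / (n - k)
    s2_loo = ((n - k) * s2 - e**2 / (1.0 - h)) / (n - k - 1)
    # Assumed textbook forms (not given in the text above)
    dffits = e * np.sqrt(h) / (np.sqrt(s2_loo) * (1.0 - h))
    cooks_d = e**2 * h / (k * s2 * (1.0 - h) ** 2)
    return dfbeta, h, dffits, cooks_d

# Hypothetical demo data: a clean linear trend plus one planted outlier
rng = np.random.default_rng(0)
n = 20
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=n)
x[-1], y[-1] = 8.0, -10.0            # far from the other x values and off the trend
X = np.column_stack([np.ones(n), x])

dfbeta, h, dffits, cooks_d = influence_measures(X, y)
print("row with largest Cook's D:", int(np.argmax(cooks_d)))
print("DFBETA for the last row (intercept, slope):", dfbeta[-1])
```

Because `dfbeta` has one row per observation and one column per coefficient, its shape is n×k, matching the N·k count given in item 1. The closed form can be checked against a brute-force refit that drops each row in turn, which is slower but makes no use of the leverage identity.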