sklearn.linear_model.TheilSenRegressor
A sklearn.linear_model.TheilSenRegressor is a Theil-Sen Regression System within the sklearn.linear_model module.
- Context:
- Usage:
- 1) Import TheilSenRegressor model from scikit-learn: from sklearn.linear_model import TheilSenRegressor
- 2) Create design matrix X and response vector y
- 3) Create TheilSenRegressor object: model = TheilSenRegressor([fit_intercept=True, copy_X=True, max_subpopulation=10000.0, n_subsamples=None, ...])
- 4) Choose method(s):
- fit(X, y), fits linear model.
- get_params([deep]), gets parameters for this estimator.
- predict(X), predicts using the linear model.
- score(X, y[, sample_weight]), returns the coefficient of determination R^2 of the prediction.
- set_params(**params), sets the parameters of this estimator.
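The usage steps above can be sketched end to end as follows. This is a minimal illustration rather than a reference workflow: the synthetic data (a line with slope 3 and intercept 2 plus small noise), the seed, and the variable names are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import TheilSenRegressor

# Synthetic data (assumed for illustration): y = 3*x + 2 plus small Gaussian noise.
rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(50, 1))  # design matrix, shape (n_samples, n_features)
y = 3.0 * X[:, 0] + 2.0 + rng.normal(scale=0.1, size=50)  # response vector

# Create the estimator; parameters not listed keep their defaults.
model = TheilSenRegressor(random_state=0)

model.fit(X, y)                 # fit the linear model
y_pred = model.predict(X)       # predict with the fitted model
r2 = model.score(X, y)          # coefficient of determination R^2
params = model.get_params()     # dictionary of estimator parameters
model.set_params(max_iter=500)  # update a parameter in place

print(model.coef_, model.intercept_, r2)
```

After fitting, the estimated slope and intercept are available as model.coef_ and model.intercept_.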
- Example(s)
- Counter-Example(s):
- See: Regression System, L1 Norm, L2 Norm, Cross-Validation Task, Ridge Regression Task, Bayesian Analysis.
References
2017A
- (scikit-learn.org, 2017) ⇒ http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.TheilSenRegressor.html
- QUOTE: class sklearn.linear_model.TheilSenRegressor(fit_intercept=True, copy_X=True, max_subpopulation=10000.0, n_subsamples=None, max_iter=300, tol=0.001, random_state=None, n_jobs=1, verbose=False)
- Theil-Sen Estimator: robust multivariate regression model.
- The algorithm calculates least square solutions on subsets with size n_subsamples of the samples in X. Any value of n_subsamples between the number of features and samples leads to an estimator with a compromise between robustness and efficiency. Since the number of least square solutions is “n_samples choose n_subsamples”, it can be extremely large and can therefore be limited with max_subpopulation. If this limit is reached, the subsets are chosen randomly. In a final step, the spatial median (or L1 median) is calculated of all least square solutions.
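The procedure described in the quote can be sketched in plain Python for the simplest case: one feature plus an intercept, with subsets of size 2. This is not scikit-learn's implementation. The helper names pairwise_line_fits and spatial_median are hypothetical, the spatial median is approximated with a Weiszfeld-style iteration chosen for this sketch, and the toy data (the line y = 2x + 1 with two corrupted points) is invented for illustration.

```python
from itertools import combinations
from math import hypot

def pairwise_line_fits(xs, ys):
    """Fit a line to every 2-point subset; with two points the least-squares fit is exact."""
    fits = []
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        if x1 == x2:
            continue  # slope undefined for a vertical pair; skipped in this sketch
        slope = (y2 - y1) / (x2 - x1)
        fits.append((y1 - slope * x1, slope))  # (intercept, slope)
    return fits

def spatial_median(points, max_iter=1000, tol=1e-9):
    """Weiszfeld-style iteration toward the point minimizing the sum of Euclidean distances."""
    n = len(points)
    cx = sum(p[0] for p in points) / n  # start from the coordinate-wise mean
    cy = sum(p[1] for p in points) / n
    for _ in range(max_iter):
        wsum = wx = wy = 0.0
        for px, py in points:
            d = hypot(px - cx, py - cy)
            if d < 1e-12:
                continue  # skip coincident points to avoid division by zero
            wsum += 1.0 / d
            wx += px / d
            wy += py / d
        if wsum == 0.0:
            break  # every point coincides with the current estimate
        nx, ny = wx / wsum, wy / wsum
        step = hypot(nx - cx, ny - cy)
        cx, cy = nx, ny
        if step < tol:
            break
    return cx, cy

# Toy data (assumed): y = 2*x + 1 with two corrupted responses.
xs = [float(i) for i in range(10)]
ys = [2.0 * x + 1.0 for x in xs]
ys[3] = 30.0
ys[7] = -10.0

fits = pairwise_line_fits(xs, ys)
intercept, slope = spatial_median(fits)
print(intercept, slope)
```

Because more than half of the pairwise fits come from uncorrupted pairs and agree exactly, the spatial median settles on the uncorrupted line despite the two bad points.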
2017B
- (scikit-learn.org, 2017) ⇒ http://scikit-learn.org/stable/modules/linear_model.html#theil-sen-regression
- QUOTE: The TheilSenRegressor estimator uses a generalization of the median in multiple dimensions. It is thus robust to multivariate outliers. Note however that the robustness of the estimator decreases quickly with the dimensionality of the problem. It loses its robustness properties and becomes no better than an ordinary least squares in high dimension.
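The robustness claim can be checked informally in one dimension by fitting both estimators to the same corrupted data. The data below is an assumption made for the sketch: a line with slope 3 and intercept 2, where the 15 responses at the largest x values are overwritten with -50.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, TheilSenRegressor

# Synthetic data (assumed for illustration): y = 3*x + 2 plus noise.
rng = np.random.RandomState(42)
x = np.linspace(0, 10, 100)
y = 3.0 * x + 2.0 + rng.normal(scale=0.5, size=x.size)

# Corrupt 15% of the responses, all at the high end of x.
y[-15:] = -50.0
X = x.reshape(-1, 1)

ols = LinearRegression().fit(X, y)
ts = TheilSenRegressor(random_state=42).fit(X, y)

ols_slope = ols.coef_[0]
ts_slope = ts.coef_[0]
print("true slope: 3.0")
print("OLS slope:", ols_slope)
print("Theil-Sen slope:", ts_slope)
```

In this univariate setting the Theil-Sen slope should stay much closer to the true value than the OLS slope; the quote warns that this advantage shrinks as the number of features grows.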
- (...)
TheilSenRegressor is comparable to the Ordinary Least Squares (OLS) in terms of asymptotic efficiency and as an unbiased estimator. In contrast to OLS, Theil-Sen is a non-parametric method, which means it makes no assumption about the underlying distribution of the data. Since Theil-Sen is a median-based estimator, it is more robust against corrupted data, aka outliers. In the univariate setting, Theil-Sen has a breakdown point of about 29.3% in the case of a simple linear regression, which means that it can tolerate arbitrarily corrupted data of up to 29.3%.
The implementation of TheilSenRegressor in scikit-learn follows a generalization to a multivariate linear regression model [8] using the spatial median, which is a generalization of the median to multiple dimensions [9]. In terms of time and space complexity, Theil-Sen scales according to
- $$\binom{n_{\text{samples}}}{n_{\text{subsamples}}}$$
- which makes it infeasible to apply exhaustively to problems with a large number of samples and features. Therefore, the magnitude of a subpopulation can be chosen to limit the time and space complexity by considering only a random subset of all possible combinations.
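To get a feel for how quickly the number of subsets grows, the binomial coefficient can be evaluated directly. The helper n_subsets is a hypothetical name, and taking the minimum with max_subpopulation is a simplified stand-in for the cap described in the quotes, not scikit-learn's internal logic.

```python
from math import comb

def n_subsets(n_samples, n_subsamples):
    """Number of distinct subsets of size n_subsamples: 'n_samples choose n_subsamples'."""
    return comb(n_samples, n_subsamples)

max_subpopulation = 10_000  # default value quoted in the class signature

for n_samples in (50, 100, 1000):
    total = n_subsets(n_samples, 2)        # subsets of size 2 (one feature plus intercept)
    used = min(total, max_subpopulation)   # simplified view of the cap
    print(n_samples, total, used)
```

With subsets of size 2, 100 samples give 4950 subsets, which is below the default cap, while 1000 samples give 499500, so only a random subset of at most 10000 would be considered.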