Statistical Regression Analysis Task
A Statistical Regression Analysis Task is a model-based supervised estimation task that applies statistical theory (typically in the form of a statistical regression algorithm) to fit a function between one or more independent variables and a dependent variable.
- AKA: Regression Task, Statistical Regression Task.
- Context:
- Input: Regression Task Input, such as:
- [math]\displaystyle{ X_{indep}=\{x_{i1},\,x_{i2},\,\ldots ,\,x_{ip}\} }[/math] for [math]\displaystyle{ i=1,\ldots,n }[/math], a continuous dataset of observed values of one or more independent variables, variously called the predictor variables, regressors, exogenous variables, explanatory variables, covariates, or input variables.
- [math]\displaystyle{ Y_{dep}=\{y_i\}^n_{i=1}=\{y_1,\,\ldots,\,y_n\} }[/math], a continuous dataset of observed values of the dependent variable, variously called the response variable, regressand, endogenous variable, measured variable, criterion variable, or target variable.
- Output:
- [math]\displaystyle{ \hat{y}(x_s) }[/math], predicted values/estimated values for input dataset [math]\displaystyle{ x_s }[/math].
- [math]\displaystyle{ \beta_k }[/math], ([math]\displaystyle{ p+1 }[/math])-dimensional parameter vector, also called regression coefficients or effects.
- [math]\displaystyle{ |\hat{y} - y|^2, \sigma, \ldots }[/math], squared error, mean squared error, standard deviation, bias, and other statistical information about the fitted parameters.
- [math]\displaystyle{ \alpha_j }[/math], regularization parameters or complexity parameters vector.
- [math]\displaystyle{ \epsilon_i }[/math], the error term estimates (residuals).
- Task Requirements:
- It requires solving the equation:
[math]\displaystyle{ y_i = f(x_i) + \epsilon_i }[/math] for [math]\displaystyle{ i=1,\ldots,n }[/math]
where the pairs [math]\displaystyle{ \{x_i,y_i\}_{i=1}^n }[/math] correspond to an observed dataset (the training data), [math]\displaystyle{ f(x) }[/math] is the regression function, and [math]\displaystyle{ \epsilon_i }[/math] are the regression residuals, also called the error term, disturbance term, or noise. A minimal code sketch of this setup follows the Context list below.
- A regression diagnostic test to determine the goodness of fit of the regression model and the statistical significance of the estimated parameters.
- It can be solved by a Regression Analysis System (that implements a regression analysis algorithm).
- It can range from (often) being a Least-Squares Regression Task to being a Non-Least-Squares Regression Task.
- It can range from being a Single Variable Regression Analysis Task to being a Multivariate Regression Analysis Task.
- It can range from being a Linear Regression Task to being a Nonlinear Regression Task.
- It can range from being a Parametric Regression Task to being a Nonparametric Regression Task.
- It can be used to estimate Empirical Relationships among Data Variables.
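The following is a minimal sketch of this input/output contract, fitting an ordinary least-squares model to synthetic data with NumPy; the names (X_indep, Y_dep, beta_hat) simply mirror the notation above, and the data are invented for illustration:

```python
import numpy as np

# Synthetic observed dataset: n observations of p independent variables.
rng = np.random.default_rng(0)
n, p = 100, 2
X_indep = rng.normal(size=(n, p))                    # predictor variables
beta_true = np.array([1.5, -0.7])
eps = rng.normal(scale=0.3, size=n)                  # error term
Y_dep = 2.0 + X_indep @ beta_true + eps              # response variable

# Solve y_i = f(x_i) + eps_i with f linear, minimizing |y - X beta|^2.
X_design = np.column_stack([np.ones(n), X_indep])    # prepend intercept column
beta_hat, *_ = np.linalg.lstsq(X_design, Y_dep, rcond=None)  # (p+1)-dim coefficients

y_hat = X_design @ beta_hat                          # predicted values
residuals = Y_dep - y_hat                            # regression residuals
mse = np.mean(residuals ** 2)                        # mean squared error
print(beta_hat, mse)
```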
- Example(s):
- A Linear Least-Squares Regression Task.
- A Bayesian Linear Regression Task.
- A Nonparametric Regression Task, such as Kernel Regression Task.
- A Robust Regression Task.
- A Percentage Regression Task.
- A Quantile Regression Task.
- A Logistic Regression Task.
- A Distance Metric Learning Task.
- A Supervised Point Estimation Task (with a dependent variable).
- A Geospatial Regression Analysis Task.
- …
- Counter-Example(s):
- A Classification Task, which predicts a categorical rather than a continuous dependent variable.
- See: Linking Function, Predictive Modeling, Conditional Expectation, Average Value, Location Parameter, Function (Mathematics), Probability Distribution, Prediction, Forecasting, Machine Learning, Causality.
References
2017
- (Wikipedia, 2017) ⇒ https://en.wikipedia.org/wiki/Regression_analysis#History Retrieved:2017-8-20.
- The earliest form of regression was the method of least squares, which was published by Legendre in 1805,[1] and by Gauss in 1809.[2] Legendre and Gauss both applied the method to the problem of determining, from astronomical observations, the orbits of bodies about the Sun (mostly comets, but also later the then newly discovered minor planets). Gauss published a further development of the theory of least squares in 1821,[3] including a version of the Gauss–Markov theorem.
The term "regression" was coined by Francis Galton in the nineteenth century to describe a biological phenomenon. The phenomenon was that the heights of descendants of tall ancestors tend to regress down towards a normal average (a phenomenon also known as regression toward the mean). For Galton, regression had only this biological meaning, [4] [5] but his work was later extended by Udny Yule and Karl Pearson to a more general statistical context. In the work of Yule and Pearson, the joint distribution of the response and explanatory variables is assumed to be Gaussian. This assumption was weakened by R.A. Fisher in his works of 1922 and 1925. [6] Fisher assumed that the conditional distribution of the response variable is Gaussian, but the joint distribution need not be. In this respect, Fisher's assumption is closer to Gauss's formulation of 1821. In the 1950s and 1960s, economists used electromechanical desk calculators to calculate regressions. Before 1970, it sometimes took up to 24 hours to receive the result from one regression. [7]
Regression methods continue to be an area of active research. In recent decades, new methods have been developed for robust regression, regression involving correlated responses such as time series and growth curves, regression in which the predictor (independent variable) or response variables are curves, images, graphs, or other complex data objects, regression methods accommodating various types of missing data, nonparametric regression, Bayesian methods for regression, regression in which the predictor variables are measured with error, regression with more predictor variables than observations, and causal inference with regression.
2015
- (Wikipedia, 2015) ⇒ http://en.wikipedia.org/wiki/Regression_analysis Retrieved:2015-1-14.
- In statistics, regression analysis is a statistical process for estimating the relationships among variables. It includes many techniques for modeling and analyzing several variables, when the focus is on the relationship between a dependent variable and one or more independent variables. More specifically, regression analysis helps one understand how the typical value of the dependent variable (or 'criterion variable') changes when any one of the independent variables is varied, while the other independent variables are held fixed. Most commonly, regression analysis estimates the conditional expectation of the dependent variable given the independent variables – that is, the average value of the dependent variable when the independent variables are fixed. Less commonly, the focus is on a quantile, or other location parameter of the conditional distribution of the dependent variable given the independent variables. In all cases, the estimation target is a function of the independent variables called the regression function. In regression analysis, it is also of interest to characterize the variation of the dependent variable around the regression function which can be described by a probability distribution.
Regression analysis is widely used for prediction and forecasting, where its use has substantial overlap with the field of machine learning. Regression analysis is also used to understand which among the independent variables are related to the dependent variable, and to explore the forms of these relationships. In restricted circumstances, regression analysis can be used to infer causal relationships between the independent and dependent variables. However this can lead to illusions or false relationships, so caution is advisable; for example, correlation does not imply causation. Many techniques for carrying out regression analysis have been developed. Familiar methods such as linear regression and ordinary least squares regression are parametric, in that the regression function is defined in terms of a finite number of unknown parameters that are estimated from the data. Nonparametric regression refers to techniques that allow the regression function to lie in a specified set of functions, which may be infinite-dimensional. The performance of regression analysis methods in practice depends on the form of the data generating process, and how it relates to the regression approach being used. Since the true form of the data-generating process is generally not known, regression analysis often depends to some extent on making assumptions about this process. These assumptions are sometimes testable if a sufficient quantity of data is available. Regression models for prediction are often useful even when the assumptions are moderately violated, although they may not perform optimally. However, in many applications, especially with small effects or questions of causality based on observational data, regression methods can give misleading results. [8] [9]
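The contrast this passage draws between the conditional expectation and other location parameters (such as quantiles) can be made concrete; the sketch below is an illustrative example, assuming statsmodels' OLS and QuantReg on invented right-skewed data, where the two estimation targets visibly differ:

```python
import numpy as np
import statsmodels.api as sm

# Right-skewed noise makes the conditional mean and conditional median differ.
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=200)
y = 3.0 + 0.5 * x + rng.exponential(scale=2.0, size=200)

X = sm.add_constant(x)                      # design matrix with intercept
mean_fit = sm.OLS(y, X).fit()               # estimates E[Y | X]
median_fit = sm.QuantReg(y, X).fit(q=0.5)   # estimates the conditional median

print("conditional mean coefficients:  ", mean_fit.params)
print("conditional median coefficients:", median_fit.params)
```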
2011A
- (Quadrianto & Buntine, 2011) ⇒ Novi Quadrianto and Wray L. Buntine (2011). "Regression" In: (Sammut & Webb, 2011) pp. 1075-1080.
- QUOTE: Regression is a fundamental problem in statistics and machine learning. In regression studies, we are typically interested in inferring a real-valued function (called a regression function) whose values correspond to the mean of a dependent (or response or output) variable conditioned on one or more independent (or input) variables. Many different techniques for estimating this regression function have been developed, including parametric, semi-parametric, and nonparametric methods.
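One way to make the nonparametric option mentioned in this quote concrete is a Nadaraya–Watson kernel smoother, which estimates the regression function as a kernel-weighted average of observed responses; the sketch below is a minimal NumPy illustration, with the Gaussian kernel and the bandwidth value being arbitrary assumptions:

```python
import numpy as np

def nadaraya_watson(x_train, y_train, x_query, bandwidth=0.5):
    """Nonparametric estimate of the regression function E[Y | X = x]
    as a Gaussian-kernel-weighted average of observed responses."""
    diffs = (x_query[:, None] - x_train[None, :]) / bandwidth
    weights = np.exp(-0.5 * diffs ** 2)          # kernel weight of each training point
    return (weights @ y_train) / weights.sum(axis=1)

# Illustrative data drawn around a nonlinear regression function.
rng = np.random.default_rng(2)
x = np.sort(rng.uniform(0, 2 * np.pi, 150))
y = np.sin(x) + rng.normal(scale=0.2, size=x.size)

x_grid = np.linspace(0.5, 2 * np.pi - 0.5, 5)
print(nadaraya_watson(x, y, x_grid))             # should roughly track sin(x_grid)
```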
1998
- (Johnson & Wichern, 1998) ⇒ Richard A. Johnson, and Dean W. Wichern. (1998). “Applied Multivariate Statistical Analysis, 4th edition." Prentice Hall. ISBN:013834194X
- QUOTE: Regression analysis is the statistical methodology for predicting values of one or more response (dependent) variables from a collection of predictor (independent) variable values. It can also be used for assessing the effects of the predictor variables on the responses. Unfortunately, the name regression, culled from the title of the first paper on the subject by F. Galton [13], in no way reflects either the importance or breadth of application of this methodology. ... Let [math]\displaystyle{ z_1, z_2, ..., z_r }[/math] be [math]\displaystyle{ r }[/math] predictor variables thought to be related to a response variable [math]\displaystyle{ Y }[/math] ...