2012 Chapter12LogisticRegression
- (Shalizi, 2012) ⇒ Cosma Shalizi. (2012). “Chapter 12 - Logistic Regression.” In: Carnegie Mellon University, 36-402, Undergraduate Advanced Data Analysis.
Subject Headings: Logistic Regression Algorithm; Iteratively Re-Weighted Least Squares; Supervised Binary Prediction.
Notes
Cited By
Quotes
12.1 Modeling Conditional Probabilities
So far, we have either looked at estimating the conditional expectations of continuous variables (as in regression) or at estimating distributions. There are many situations, however, where we are interested in input-output relationships, as in regression, but the output variable is discrete rather than continuous. In particular, there are many situations where we have binary outcomes (it snows in Pittsburgh on a given day, or it doesn’t; this squirrel carries plague, or it doesn’t; this loan will be paid back, or it won’t; this person will get heart disease in the next five years, or they won’t). In addition to the binary outcome, we have some input variables, which may or may not be continuous. How could we model and analyze such data?
We could try to come up with a rule which guesses the binary output from the input variables. This is called classification, and is an important topic in statistics and machine learning. However, simply guessing “yes” or “no” is pretty crude — especially if there is no perfect rule. (Why should there be?) Something which takes noise into account, and doesn’t just give a binary answer, will often be useful. In short, we want probabilities — which means we need to fit a stochastic model.
What would be nice, in fact, would be to have the conditional distribution of the response Y, given the input variables, Pr(Y|X). This would tell us how precise our predictions are. If our model says that there’s a 51% chance of snow and it doesn’t snow, that’s better than if it had said there was a 99% chance of snow (though even a 99% chance is not a sure thing). We have seen how to estimate conditional probabilities non-parametrically, and could do this using the kernels for discrete variables from lecture 6. While there are a lot of merits to this approach, it does involve coming up with a model for the joint distribution of outputs Y and inputs X, which can be quite time-consuming.
Let’s pick one of the classes and call it “1” and the other “0”. (It doesn’t matter which is which.) Then Y becomes an indicator variable, and you can convince yourself that Pr(Y = 1) = E[Y]. Similarly, Pr(Y = 1|X = x) = E[Y|X = x]. (In a phrase, “conditional probability is the conditional expectation of the indicator”.) This helps us because by this point we know all about estimating conditional expectations. The most straightforward thing for us to do at this point would be to pick out our favorite smoother and estimate the regression function for the indicator variable; this will be an estimate of the conditional probability function.
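To make this concrete, here is a minimal sketch (not from the chapter) of estimating Pr(Y = 1|X = x) by smoothing the 0/1 indicator with a Nadaraya-Watson kernel regression; the synthetic data, the Gaussian kernel, and the bandwidth h are all illustrative assumptions.

```python
import numpy as np

def kernel_smooth_probability(x_train, y_train, x_query, h=0.5):
    """Nadaraya-Watson estimate of Pr(Y = 1 | X = x): a locally weighted
    average of the 0/1 responses, using Gaussian kernel weights with
    bandwidth h (kernel and bandwidth are illustrative choices)."""
    diffs = x_query[:, None] - x_train[None, :]   # shape (n_query, n_train)
    weights = np.exp(-0.5 * (diffs / h) ** 2)     # Gaussian kernel weights
    # Weighted average of the indicator = estimated conditional probability.
    return (weights * y_train).sum(axis=1) / weights.sum(axis=1)

# Illustrative synthetic data: binary outcomes whose probability rises with x.
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=500)
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-2.0 * x)))  # true Pr(Y=1|X=x) happens to be logistic

grid = np.linspace(-3, 3, 7)
print(np.round(kernel_smooth_probability(x, y, grid), 2))
```

Because this estimate is a weighted average of 0/1 values with non-negative weights, it automatically stays between 0 and 1; as the next paragraph notes, not every smoother shares that property.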
There are two reasons not to just plunge ahead with that idea. One is that probabilities must be between 0 and 1, but our smoothers will not necessarily respect that, even if all the observed y_i they get are either 0 or 1. The other is that we might be better off making more use of the fact that we are trying to estimate probabilities, by more explicitly modeling the probability.
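As a small illustration of the first point (again mine, not the chapter's), fitting an ordinary least-squares line, itself a linear smoother, to the same kind of 0/1 responses can return fitted “probabilities” below 0 or above 1 near the edges of the data:

```python
import numpy as np

# Illustrative synthetic 0/1 data, as in the earlier sketch.
rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, size=500)
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-2.0 * x)))

# Ordinary least-squares line through the indicator values (a linear smoother).
slope, intercept = np.polyfit(x, y, deg=1)
fitted = intercept + slope * np.array([-3.0, 0.0, 3.0])
print(np.round(fitted, 2))  # the end values typically fall outside [0, 1]
```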
Assume that Pr(Y = 1|X = x) = p(x; θ), for some function p parameterized by θ, and further assume that observations are independent of each other. The (conditional) likelihood …
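For reference, under these assumptions each response is a Bernoulli draw with success probability p(x_i; θ), so the conditional likelihood takes the standard product form below; this is the usual textbook expression, written out here rather than quoted from the chapter.

```latex
L(\theta) \;=\; \prod_{i=1}^{n} \Pr\left(Y = y_i \mid X = x_i\right)
          \;=\; \prod_{i=1}^{n} p(x_i;\theta)^{\,y_i}\,\bigl(1 - p(x_i;\theta)\bigr)^{\,1-y_i}
```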
References