Logistic Regression
Log likelihood

Consider a binary classification problem. Let $\hat p(x)$ denote an estimate for the conditional probability that $Y = 1$ given that $X = x$. Let $p(x)$ denote the true conditional probability. True or false:

$$\mathbb{E}\bigl[Y \log p(X) + (1 - Y)\log(1 - p(X))\bigr] \;\ge\; \mathbb{E}\bigl[Y \log \hat p(X) + (1 - Y)\log(1 - \hat p(X))\bigr],$$

where $Y \in \{0, 1\}$ and the expectation is over a draw of $(X, Y)$ from the population.
Solutions
True. The expected log likelihood under the true conditional probabilities is always at least as large as the expected log likelihood under any estimate. That's part of why methods like logistic regression estimate coefficients to maximize the average log likelihood on training data.
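To see why (a sketch using the notation above): conditional on $X = x$, the gap between the two expected log likelihoods is a Kullback–Leibler divergence between Bernoulli distributions, which is never negative:

$$\begin{aligned}
&\mathbb{E}\bigl[Y \log p(X) + (1-Y)\log(1-p(X)) \mid X = x\bigr] - \mathbb{E}\bigl[Y \log \hat p(X) + (1-Y)\log(1-\hat p(X)) \mid X = x\bigr] \\
&\qquad = p(x)\log\frac{p(x)}{\hat p(x)} + \bigl(1-p(x)\bigr)\log\frac{1-p(x)}{1-\hat p(x)} = \mathrm{KL}\bigl(\mathrm{Bernoulli}(p(x)) \,\big\|\, \mathrm{Bernoulli}(\hat p(x))\bigr) \;\ge\; 0,
\end{aligned}$$

and averaging over $X$ preserves the inequality.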
Consistency of logistic regression

Consider a binary classification problem. Let $p(x)$ denote the true conditional probability that $Y = 1$ given that $X = x$. Assume that

$$\log\frac{p(x)}{1 - p(x)} = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p$$

for some coefficients $\beta_0, \beta_1, \dots, \beta_p$. True or false: in typical settings, given enough data, the logistic regression estimate of the log odds will become arbitrarily close to the truth.
Solutions
True. If the true log odds are linear in the predictors, logistic regression will do just fine: in typical settings its estimate of the log odds converges to the truth as the sample size grows.
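A minimal simulation sketch of this (the coefficients, sample sizes, and data-generating process below are all made up for illustration): as $n$ grows, the fitted coefficients approach the truth.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
beta0, beta = -0.5, np.array([1.0, -2.0])  # hypothetical "true" coefficients

for n in [100, 10_000, 1_000_000]:
    X = rng.normal(size=(n, 2))
    p = 1 / (1 + np.exp(-(beta0 + X @ beta)))   # true conditional probabilities
    y = rng.binomial(1, p)                      # binary responses
    # penalty=None needs scikit-learn >= 1.2 (use penalty='none' in older versions)
    fit = LogisticRegression(penalty=None).fit(X, y)
    print(n, fit.intercept_, fit.coef_)         # approaches (-0.5, [1.0, -2.0])
```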
Softmax redundancy

Consider the "softmax" representation for multiclass logistic regression with one predictor feature,

$$\hat P(Y = k \mid X = x) = \frac{\exp(\beta_{k,0} + \beta_{k,1} x)}{\sum_{l=1}^{K} \exp(\beta_{l,0} + \beta_{l,1} x)}, \qquad k = 1, \dots, K.$$

True or false: this representation is redundant, i.e., one can find at least two different sets of numerical values for the coefficients $\beta$ that lead to the same conditional probabilities.
Solutions
True. This redundancy is not a problem in most cases, but it is important to be aware of: because different coefficient values can produce exactly the same probabilities, you can't interpret the coefficients very easily in this case.
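To see the redundancy concretely (a short check based on the softmax form written above): shifting every class's coefficients by the same constants leaves all the conditional probabilities unchanged, because the common factor cancels from the numerator and denominator,

$$\frac{\exp\bigl((\beta_{k,0} + c_0) + (\beta_{k,1} + c_1)x\bigr)}{\sum_{l=1}^{K}\exp\bigl((\beta_{l,0} + c_0) + (\beta_{l,1} + c_1)x\bigr)} = \frac{e^{c_0 + c_1 x}\,\exp(\beta_{k,0} + \beta_{k,1}x)}{e^{c_0 + c_1 x}\sum_{l=1}^{K}\exp(\beta_{l,0} + \beta_{l,1}x)} = \hat P(Y = k \mid X = x)$$

for any constants $c_0$ and $c_1$. This is why implementations often fix one class's coefficients to zero.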
Logistic regression and LASSO/ridge
True or false: penalties such as LASSO and ridge are rarely used when training logistic regression.
Solutions
False. In fact, sklearn uses a mild ridge penalty by default.
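For example (a small sketch on made-up data): scikit-learn's LogisticRegression applies an L2 (ridge) penalty with C=1.0 by default, and you have to turn it off explicitly to get a plain maximum-likelihood fit.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))   # toy response driven by the first feature

ridge_fit = LogisticRegression().fit(X, y)              # default: penalty='l2', C=1.0
plain_fit = LogisticRegression(penalty=None).fit(X, y)  # unpenalized (scikit-learn >= 1.2)
print(ridge_fit.coef_)  # coefficients shrunk slightly toward zero
print(plain_fit.coef_)  # plain maximum-likelihood coefficients
```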
Alzheimer's and bilingualism

You are interested in understanding the relationship between bilingualism (being fluent in at least two languages) and Alzheimer's disease. You obtain a dataset with 50,000 older adults living in the United States, of whom 5,000 have Alzheimer's. For each patient you know whether they are fluent in more than one language and what their taxable yearly income is.

You first fit a logistic regression model to predict whether a patient has Alzheimer's disease ($y = 0$ if they don't, $y = 1$ if they do) given their bilingualism status ($x_1 = 0$ for only one language, $x_1 = 1$ for at least two). The model suggests that bilingual individuals are less likely to get Alzheimer's disease (note we are making no causal statements here). You then realize that your approach was probably not the right way to go. Instead, given your dataset, you should have fit a model with two features (including both bilingualism status and income in the model). Why?
Solutions

Two things here. First, roughly speaking, income may be a "confound." We won't define that rigorously in this class, but the point is that bringing that extra predictor in may lead to a more complete picture. Second, you have plenty of data. If you only had 10 datapoints, it might have been hard to fit a full model using both predictors (variance might be too high). However, with $n = 50{,}000$ data points, you can definitely fit a model with two predictors.
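As a sketch of what the two-predictor fit could look like (all data and coefficients below are made up purely for illustration; statsmodels is used here just to get a coefficient table):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Entirely made-up data with the same structure as the study (illustration only).
rng = np.random.default_rng(0)
n = 50_000
bilingual = rng.binomial(1, 0.3, size=n)              # 0 = one language, 1 = at least two
income_k = rng.gamma(shape=2.0, scale=30.0, size=n)   # yearly income, thousands of dollars
logit = -2.5 - 0.1 * bilingual - 0.003 * income_k     # arbitrary coefficients
alz = rng.binomial(1, 1 / (1 + np.exp(-logit)))       # Alzheimer's status (0/1)

X = sm.add_constant(pd.DataFrame({"bilingual": bilingual, "income_k": income_k}))
fit = sm.Logit(alz, X).fit()
print(fit.summary())  # log-odds coefficients for bilingualism and income
```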
You next fit a logistic regression model to predict whether a patient has Alzheimer's disease ($y = 0$ if they don't, $y = 1$ if they do) given their bilingualism status ($x_1 = 0$ for only one language, $x_1 = 1$ for at least two) and their income ($x_2$, in thousands of dollars). We estimate the log odds using a generalized additive model (GAM) with an intercept $\beta_0$ and two component functions, $f_1$ and $f_2$. We fit $\beta_0$, $f_1$, and $f_2$ to the data, using splines for $f_1$ and $f_2$.

Under this model, consider the difference in estimated log odds of Alzheimer's between a bilingual individual and a monolingual individual, both with the same income $x_2$. How should this quantity be interpreted?
Solutions

This difference in estimated log odds is not just a combination of ratios and logs: it is an important quantity for interpretation. It indicates the estimated change in the log odds of getting Alzheimer's associated with bilingualism, among those with some particular income $x_2$. If it is positive, it means that among those with income $x_2$, those who are bilingual are more likely to get Alzheimer's. If it is negative, it means that among those with income $x_2$, those who are bilingual are less likely to get Alzheimer's. This interpretation comes with two caveats. First, we are not making any causal statements here. Second, all of our statements are only as valid as our model, which might have been estimated badly.
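As a purely hypothetical worked example of that interpretation (the numbers are made up): if the estimated log-odds difference at some income $x_2$ were $-0.3$, the corresponding estimated odds ratio would be

$$e^{-0.3} \approx 0.74,$$

i.e., among those with that income, the estimated odds of Alzheimer's for bilingual individuals would be about 74% of the estimated odds for monolingual individuals (again, with no causal claim implied).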
Data with irrelevant features

You are given a dataset with $n$ data points concerning $p$ input features and one binary response. You suspect that most of the features are completely irrelevant to the response. Of the following, which would be the most typical estimator to use in this case?
Logistic regression with no penalty
Logistic regression with a ridge penalty
Logistic regression with a LASSO penalty
Solutions
The last one. If you think most of the features are irrelevant, LASSO can help you ignore them. Of course, if you know which features are irrelevant, you can just delete them, leading to lower-variance estimates in most cases; but LASSO is useful if you aren't sure which ones are irrelevant.
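A minimal sketch of the LASSO option on made-up data in which only the first two of twenty features matter (the regularization strength C=0.1 here is arbitrary; in practice you would pick it by cross-validation):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, p = 500, 20
X = rng.normal(size=(n, p))
logit = 1.5 * X[:, 0] - 1.0 * X[:, 1]         # only the first two features matter
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))

lasso_fit = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
print(np.flatnonzero(lasso_fit.coef_[0]))     # indices of the features LASSO kept
```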
Interpreting LASSO

You are given a dataset concerning $p$ input features and one binary response. You estimate log odds using LASSO-regularized logistic regression (using cross-validation to pick the regularization strength). You find that the estimated coefficients are zero for all but the first two input features (i.e., $\hat\beta_3 = \hat\beta_4 = \cdots = \hat\beta_p = 0$). Which of the following would be the most appropriate interpretation of this finding?
Cross-validation probably failed to identify a good choice for the regularization strength.
In future efforts to predict the response, it may be sensible to focus research on features $x_1$ and $x_2$.
The response appears to be independent of the features $x_3, \dots, x_p$.
The response appears to be independent of the features $x_3, \dots, x_p$, given $x_1$ and $x_2$.
Solutions
The second one. LASSO doesn’t say anything conclusive about independence (except perhaps in some very special cases). And there’s no reason to suppose cross-validation failed just because you get a bunch of zero coefficients.
Logistic regression coefficients under two different models

You are given a dataset with two predictors ($x_1$ and $x_2$) and one binary response. You first fit logistic regression using only $x_1$ (call that Model I). Then you fit logistic regression with both predictors (call that Model II). True or false: if the coefficient for $x_1$ was positive in Model I, then it must be positive in Model II.
Solutions
False. When you bring in a new predictor, anything can happen: the coefficient for $x_1$ in Model II describes its association with the response holding $x_2$ fixed, which can even have the opposite sign from its association in Model I.
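A made-up simulation sketch of such a reversal (all coefficients below are arbitrary): $x_1$ is positively associated with the response on its own, but negatively associated once $x_2$ enters the model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 100_000
x2 = rng.normal(size=n)
x1 = x2 + 0.5 * rng.normal(size=n)             # x1 is strongly correlated with x2
logit = 2.0 * x2 - 1.0 * x1                    # holding x2 fixed, x1 lowers the log odds
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))

model_i = LogisticRegression(penalty=None).fit(x1.reshape(-1, 1), y)
model_ii = LogisticRegression(penalty=None).fit(np.column_stack([x1, x2]), y)
print(model_i.coef_)    # coefficient for x1 alone: positive (x1 stands in for x2)
print(model_ii.coef_)   # coefficient for x1 with x2 included: negative
```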