
Logistic Regression

Log likelihood

Consider a binary classification problem. Let $\hat p(y|x)$ denote an estimate for the conditional probability that $Y=y$ given that $X=x$. Let $p^*(y|x)$ denote the true conditional probability. True or false:

$$\mathbb{E}[\log p^*(Y|X)] \geq \mathbb{E}[\log \hat p(Y|X)].$$
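
One way to build intuition here is a quick numerical experiment. The sketch below (all numbers made up for illustration) fixes a small discrete joint distribution for $(X, Y)$, picks an arbitrary estimate $\hat p$, and compares the two expected log likelihoods:

```python
# A toy numerical check (not a proof): fix a small discrete joint
# distribution for (X, Y), pick an arbitrary estimate p_hat, and compare
# the two expected log likelihoods. All numbers are made up.
import numpy as np

p_x = np.array([0.6, 0.4])       # P(X=0), P(X=1)
p_star = np.array([0.8, 0.3])    # true P(Y=1 | X=0), P(Y=1 | X=1)
p_hat = np.array([0.6, 0.5])     # an arbitrary estimate of the same

def expected_log_lik(q):
    """E[log q(Y|X)], with the expectation under the true joint law."""
    return np.sum(p_x * (p_star * np.log(q) + (1 - p_star) * np.log(1 - q)))

print("E[log p*(Y|X)]    =", expected_log_lik(p_star))
print("E[log p_hat(Y|X)] =", expected_log_lik(p_hat))
```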

Consistency of logistic regression

Consider a binary classification problem. Let $p^*(y|x)$ denote the true conditional probability that $Y=y$ given that $X=x$. Assume that

$$\log \frac{p^*(+|x)}{p^*(-|x)} = \beta_0^* + \sum_{j=1}^p \beta_j^* x_j$$

for some coefficients $\beta^*$. True or false: in typical settings, given enough data, the logistic regression estimate of the log odds will become arbitrarily close to the truth.
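
To explore this empirically, one could simulate from a known logistic model and refit at increasing sample sizes, watching how close the estimates get. A sketch (coefficients and sample sizes made up; `penalty=None` assumes scikit-learn 1.2+):

```python
# Simulate from a known logistic model, refit at growing n, and watch
# whether the estimated coefficients approach the truth.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
beta_true = np.array([1.0, -2.0])   # stand-in for beta* (no intercept here)

for n in [100, 1_000, 10_000, 100_000]:
    X = rng.normal(size=(n, 2))
    y = rng.binomial(1, 1 / (1 + np.exp(-(X @ beta_true))))
    fit = LogisticRegression(penalty=None).fit(X, y)  # plain maximum likelihood
    print(n, fit.coef_.ravel())
```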

Softmax redundancy

Consider the “softmax” representation for multiclass logistic regression with one predictor feature,

$$\hat p(y|x) = \frac{\exp(\beta_{y,0} + \beta_{y,1} x)}{\sum_{y'} \exp(\beta_{y',0} + \beta_{y',1} x)}.$$

True or false: this representation is redundant, i.e., one can find at least two different numerical coefficients for $\beta$ that lead to the same conditional probabilities.
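
A quick way to experiment with this question is to evaluate the softmax probabilities under one coefficient setting, then again after changing the coefficients, and compare. A sketch with arbitrary made-up values:

```python
# Evaluate the softmax probabilities for one setting of the coefficients,
# then again after modifying them, and compare. All values are arbitrary.
import numpy as np

def softmax_probs(beta0, beta1, x):
    """p_hat(y|x) for each class y under the softmax representation."""
    scores = beta0 + beta1 * x
    e = np.exp(scores - scores.max())   # subtract max for numerical stability
    return e / e.sum()

beta0 = np.array([0.5, -1.0, 2.0])   # beta_{y,0} for three classes
beta1 = np.array([1.0, 0.0, -0.5])   # beta_{y,1} for three classes

x = 1.7
print(softmax_probs(beta0, beta1, x))
# Same constants added to every class's coefficients:
print(softmax_probs(beta0 + 3.0, beta1 - 2.0, x))
```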

Logistic regression and LASSO/ridge

True or false: penalties such as LASSO and ridge are rarely used when training logistic regression.
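
For reference, here is how such penalties are specified for logistic regression in scikit-learn (the `C` values below are arbitrary; `penalty="l1"` requires a compatible solver such as `liblinear` or `saga`):

```python
# Specifying ridge (L2) and LASSO (L1) penalties for logistic regression
# in scikit-learn; C is the inverse regularization strength.
# Note that LogisticRegression applies an L2 penalty by default.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X, y = rng.normal(size=(200, 5)), rng.integers(0, 2, size=200)  # toy data

ridge_fit = LogisticRegression(penalty="l2", C=1.0).fit(X, y)
lasso_fit = LogisticRegression(penalty="l1", C=1.0, solver="liblinear").fit(X, y)
print(ridge_fit.coef_)
print(lasso_fit.coef_)
```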

Alzheimer’s and bilingualism

You are interested in understanding the relationship between bilingualism (being fluent in at least two languages) and Alzheimer’s disease. You obtain a dataset with 50,000 older adults living in the United States, of whom 5,000 have Alzheimer’s. For each patient you know whether they are fluent in more than one language and what their taxable yearly income is.

  1. You first fit a logistic regression model to predict whether a patient has Alzheimer’s disease ($Y=0$ if they don’t, $Y=1$ if they do) given their bilingualism status ($X_1=0$ for only one language, $X_1=1$ for at least two). The model suggests that bilingual individuals are less likely to get Alzheimer’s disease (note we are making no causal statements here). You then realize that your approach was probably not the right way to go. Instead, given your dataset, you should have fit a model with $p=2$ features (including both bilingualism status and income in the model). Why?

  2. You next fit a logistic regression model to predict whether a patient has Alzheimer’s disease ($Y=0$ if they don’t, $Y=1$ if they do) given their bilingualism status ($X_1=0$ for only one language, $X_1=1$ for at least two) and their income ($X_2$, in thousands of dollars). We estimate log odds using a generalized additive model (GAM) with the following form:

    $$\hat f(x) = \hat \beta_0 + \hat f_1(x_2) + \mathbb{I}(x_1=1)\,\hat f_2(x_2).$$

    Here $\hat f_1$ and $\hat f_2$ are the two component functions of the GAM. We fit $\hat \beta_0$, $\hat f_1$, and $\hat f_2$ to data, using splines for $\hat f_1$ and $\hat f_2$. (A numerical sketch for exploring this setup appears after the answer choices below.)

    Under this model, which is the correct expression for

    $$\Delta(x_2) = \log \left( \frac{\mathbb{P}(Y=1|X_1=1,X_2=x_2)}{\mathbb{P}(Y=0|X_1=1,X_2=x_2)} \,/\, \frac{\mathbb{P}(Y=1|X_1=0,X_2=x_2)}{\mathbb{P}(Y=0|X_1=0,X_2=x_2)} \right)?$$
    1. $\hat f_2(x_2)$

    2. $\hat f_2(x_2) - \hat f_1(x_2)$

    3. $\exp(\hat f_2(x_2))$

    4. $\exp(\hat f_2(x_2) - \hat f_1(x_2))$
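
One way to explore part 2 numerically is to plug arbitrary stand-in functions for $\hat f_1$ and $\hat f_2$ into the fitted log odds and compute $\Delta(x_2)$ directly from its definition. The sketch below uses made-up component functions:

```python
# Plug arbitrary stand-in functions for f_hat_1 and f_hat_2 into the
# fitted log odds and compute Delta(x2) from its definition. Under the
# GAM, f_hat(x) is the log odds, so the log of the ratio of odds is a
# difference of log odds. Everything here is made up for illustration.
beta0 = -2.0
f1 = lambda x2: 0.01 * x2 - 1e-5 * x2**2    # stand-in for f_hat_1
f2 = lambda x2: -0.3 + 0.002 * x2           # stand-in for f_hat_2

def log_odds(x1, x2):
    return beta0 + f1(x2) + (x1 == 1) * f2(x2)

x2 = 75.0                                    # income in thousands of dollars
delta = log_odds(1, x2) - log_odds(0, x2)    # Delta(x2) by definition
print(delta)
```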

Data with irrelevant features

You are given a dataset with $n=400$ data points concerning $p=1000$ input features and one binary response. You suspect that most of the features are completely irrelevant to the response. Of the following, which would be the most typical estimator to use in this case? (A simulation sketch of this setting appears after the choices.)

  1. Logistic regression with no penalty

  2. Logistic regression with a ridge penalty

  3. Logistic regression with a LASSO penalty
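
To get a feel for this regime, one might simulate data matching the setup ($n=400$, $p=1000$, only a few relevant features; all data-generating details made up) and fit a penalized model:

```python
# Simulate the regime described above: n=400 points, p=1000 features,
# only a few of which matter, then fit an L1-penalized model and count
# the surviving coefficients. The constants are made up.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, p = 400, 1000
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:5] = 2.0                                # only 5 relevant features
y = rng.binomial(1, 1 / (1 + np.exp(-(X @ beta))))

fit = LogisticRegression(penalty="l1", C=0.1, solver="liblinear").fit(X, y)
print("nonzero coefficients:", int(np.sum(fit.coef_ != 0)))
```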

Interpreting LASSO

You are given a dataset concerning $p=100$ input features and one binary response. You estimate log odds using LASSO-regularized logistic regression (using cross-validation to pick the regularization strength). You find that the estimated coefficients are zero for all but the first two input features (i.e., $\beta_3 = \beta_4 = \ldots = \beta_{100} = 0$). Which of the following would be the most appropriate interpretation of this finding?

  1. Cross-validation probably failed to identify a good choice for the regularization strength.

  2. In future efforts to predict the response, it may be sensible to focus research on features $X_1$ and $X_2$.

  3. The response appears to be independent of the features $X_1, X_2$.

  4. The response appears to be independent of features $X_3, X_4, \ldots, X_{100}$.

Logistic regression coefficients under two different models

You are given a dataset with two predictors ($X_1, X_2$) and one binary response. You first fit logistic regression using only $X_1$ (call that Model I). Then you fit logistic regression with both predictors (call that Model II). True or false: if the coefficient for $X_1$ was positive in Model I, then it must be positive in Model II.
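
This can also be explored by simulation: generate data where $X_1$ and $X_2$ are correlated, fit both models, and compare the sign of the $X_1$ coefficient. All numbers below are made up, and `penalty=None` assumes scikit-learn 1.2+:

```python
# Generate data in which X1 and X2 are strongly correlated and X2 drives
# the response, then compare the X1 coefficient across the two models.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 100_000
x2 = rng.normal(size=n)
x1 = x2 + 0.5 * rng.normal(size=n)        # X1 correlated with X2
y = rng.binomial(1, 1 / (1 + np.exp(-(-1.0 * x1 + 3.0 * x2))))

model_1 = LogisticRegression(penalty=None).fit(x1.reshape(-1, 1), y)
model_2 = LogisticRegression(penalty=None).fit(np.column_stack([x1, x2]), y)
print("Model I  X1 coefficient:", model_1.coef_[0, 0])
print("Model II X1 coefficient:", model_2.coef_[0, 0])
```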