Misc
Picking estimators
Given a dataset and a collection of 10 candidate estimation procedures, which would be the more typical way to decide among them: cross-validation or the bootstrap?
Solutions
Cross-validation. It directly estimates each procedure's out-of-sample prediction error, which is exactly the quantity you want to compare when choosing among candidates; the bootstrap is more commonly used to assess the variability of a single estimator.
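As a rough sketch of how this is typically done in practice — score every candidate by cross-validation on the same data and keep the best one — assuming scikit-learn-style estimators and made-up data (the three candidates below stand in for the ten):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

# Made-up data and a small stand-in list of candidate estimation procedures.
X, y = make_regression(n_samples=200, n_features=20, noise=5.0, random_state=0)
candidates = {
    "ols": LinearRegression(),
    "ridge": Ridge(alpha=1.0),
    "lasso": Lasso(alpha=0.1),
}

# Estimate each candidate's out-of-sample error by 5-fold cross-validation
# and pick the procedure with the smallest mean squared error.
cv_mse = {
    name: -cross_val_score(est, X, y, cv=5, scoring="neg_mean_squared_error").mean()
    for name, est in candidates.items()
}
best = min(cv_mse, key=cv_mse.get)
print(cv_mse, "-> pick:", best)
```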
Case-control adjustment
A team is developing a predictive model to estimate the probability that an email is spam, based on predictor features $x$. During the training phase, a model is trained on a dataset where 70% of the emails are labeled as “not spam” and 30% as “spam.” This model produces an estimate $\hat{f}(x)$ for the log odds that an email with features $x$ is spam. However, when the model is deployed in a real-world scenario, the distribution of emails changes significantly, with the test data comprising 50% “spam” and 50% “not spam” emails. Assuming that the class-conditional distributions remain consistent between the training and test datasets, which of the following would be a suitable estimate for the log odds that an email is spam in the real-world setting?
None of the above
Solutions
First option is correct. The label-shift adjustment would usually be written as $\hat{f}(x) + \log\frac{0.5}{0.5} - \log\frac{0.3}{0.7}$, but this simplifies to $\hat{f}(x) + \log\frac{7}{3}$.
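A short numeric check of that adjustment, assuming the model reports the training-set log odds $\hat{f}(x)$ for a given email (the value below is made up):

```python
import numpy as np

f_hat = -0.4      # hypothetical training-set log odds for one email
pi_train = 0.30   # P(spam) in the training data
pi_test = 0.50    # P(spam) at deployment time

# Under label shift with fixed class-conditional distributions, swap the
# training prior log odds for the deployment prior log odds.
adjusted = f_hat - np.log(pi_train / (1 - pi_train)) + np.log(pi_test / (1 - pi_test))
print(adjusted)                # equals f_hat + log(7/3)
print(f_hat + np.log(7 / 3))   # same number
```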
A choice of two datasets
You would like to be able to predict whether a fishing boat will become inoperative in the next six months. You are considering two different datasets about boat longevity. The first dataset contains records for 10000 boats, including 2 boats that become inoperative during the six months for which they were monitored. The second dataset contains records for 1000 boats, including 20 that become inoperative during the six months for which they were being monitored. All else equal, if you had to pick one dataset, which would you prefer to use?
Solutions
The second one. If you have only two examples of a boat becoming inoperative, it will be almost impossible to build a good predictor (“class imbalance”). There is another question you should be asking yourself here, though: why do the datasets have such different proportions of broken boats? They are clearly not sampled from the same population! Regardless of which dataset you use, it may be important to figure out the overall rate at which the boats you are interested in break, so you can perform some form of label-shift adjustment.
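To make the mismatch concrete, here is a minimal numeric sketch (the 0.5% population rate below is a made-up figure used only for illustration):

```python
import numpy as np

rate_1 = 2 / 10000     # dataset 1: 0.02% of boats became inoperative
rate_2 = 20 / 1000     # dataset 2: 2% of boats became inoperative
print(rate_1, rate_2)  # a 100-fold difference: not the same population

# Hypothetical label-shift correction: if you train on dataset 2 but the true
# population rate is, say, 0.5%, shift the model's log-odds output by the
# difference in prior log odds before deploying it.
pi_train, pi_pop = rate_2, 0.005
shift = np.log(pi_pop / (1 - pi_pop)) - np.log(pi_train / (1 - pi_train))
print(shift)  # add this constant to the predicted log odds
```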
Ridge penalties
Let $n$ be the number of observations and $p$ be the dimension of the predictors. Suppose we estimate the regression coefficients in a linear regression model by minimizing

$$\sum_{i=1}^{n}\left(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\right)^2 + \lambda \sum_{j=1}^{p}\beta_j^2$$
for a particular value of λ. For parts (a) through (e), indicate which of i. through v. is correct.
a. As we increase λ from 0, the training error will:
i. Increase initially, and then eventually start decreasing in an inverted U shape.
ii. Decrease initially, and then eventually start increasing in a U shape.
iii. Steadily increase.
iv. Steadily decrease.
v. Remain constant.
b. Repeat (a) for test error.
c. Repeat (a) for variance.
d. Repeat (a) for (squared) bias.
e. Repeat (a) for the irreducible error.
Solutions
(a) iii; (b) ii; (c) iv; (d) iii; (e) v. As λ increases, the coefficients are shrunk toward zero and the fit becomes less flexible: training error steadily increases, variance steadily decreases, and squared bias steadily increases. Test error first falls (while the drop in variance dominates) and then rises (once the growth in bias dominates), tracing out a U shape, and the irreducible error does not depend on the fitted model at all.
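A small simulation illustrating (a) and (b), assuming a ridge fit on synthetic data with scikit-learn (all settings below are made up):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Made-up data with many predictors relative to the sample size, so the
# nearly unpenalized fit overfits and shrinkage has room to help.
X, y = make_regression(n_samples=100, n_features=50, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Training MSE increases (weakly) as lambda grows; test MSE typically
# falls and then rises again, tracing out a U shape.
for lam in [0.01, 0.1, 1, 10, 100, 1000]:
    model = Ridge(alpha=lam).fit(X_tr, y_tr)
    train_mse = mean_squared_error(y_tr, model.predict(X_tr))
    test_mse = mean_squared_error(y_te, model.predict(X_te))
    print(f"lambda={lam:>7}: train MSE {train_mse:9.1f}, test MSE {test_mse:9.1f}")
```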