Misc
Picking estimators
Given a dataset and a collection of 10 candidate estimation procedures, which would be the more typical way to decide among them: cross-validation or the bootstrap?
Solutions
Cross-validation. It directly estimates each procedure's out-of-sample prediction error, which is exactly the quantity you want to compare when choosing among candidates; the bootstrap is more commonly used to assess the variability of a single estimator.
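As a rough sketch of how this is typically done in practice — score every candidate by cross-validation on the same data and keep the best one — assuming scikit-learn-style estimators and made-up data (the three candidates below stand in for the ten):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

# Made-up data and a small stand-in list of candidate estimation procedures.
X, y = make_regression(n_samples=200, n_features=20, noise=5.0, random_state=0)
candidates = {
    "ols": LinearRegression(),
    "ridge": Ridge(alpha=1.0),
    "lasso": Lasso(alpha=0.1),
}

# Estimate each candidate's out-of-sample error by 5-fold cross-validation
# and pick the procedure with the smallest mean squared error.
cv_mse = {
    name: -cross_val_score(est, X, y, cv=5, scoring="neg_mean_squared_error").mean()
    for name, est in candidates.items()
}
best = min(cv_mse, key=cv_mse.get)
print(cv_mse, "-> pick:", best)
```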
Case-control adjustment
A team is developing a predictive model to estimate the probability that an email is spam, based on predictor features $x$. During the training phase, a model is trained on a dataset where 70% of the emails are labeled as “not spam” and 30% as “spam.” This model produces an estimate $\hat{f}(x)$ for the log odds that an email with features $x$ is spam. However, when the model is deployed in a real-world scenario, the distribution of emails changes significantly, with the test data comprising 50% “spam” and 50% “not spam” emails. Assuming that the class-conditional distributions remain consistent between the training and test datasets, which of the following would be a suitable estimate for the log odds that an email is spam in the real-world setting?
None of the above
Solutions
First option is correct. The label-shift adjustment would usually be written as $\hat{f}(x) + \log\frac{0.5}{0.5} - \log\frac{0.3}{0.7}$, but this simplifies to $\hat{f}(x) + \log\frac{7}{3}$.
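A short numeric check of that adjustment, assuming the model reports the training-set log odds $\hat{f}(x)$ for a given email (the value below is made up):

```python
import numpy as np

f_hat = -0.4      # hypothetical training-set log odds for one email
pi_train = 0.30   # P(spam) in the training data
pi_test = 0.50    # P(spam) at deployment time

# Under label shift with fixed class-conditional distributions, swap the
# training prior log odds for the deployment prior log odds.
adjusted = f_hat - np.log(pi_train / (1 - pi_train)) + np.log(pi_test / (1 - pi_test))
print(adjusted)                # equals f_hat + log(7/3)
print(f_hat + np.log(7 / 3))   # same number
```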
A choice of two datasets
You would like to be able to predict whether a fishing boat will become inoperative in the next six months. You are considering two different datasets about boat longevity. The first dataset contains records for 10000 boats, including 2 boats that become inoperative during the six months for which they were monitored. The second dataset contains records for 1000 boats, including 20 that become inoperative during the six months for which they were being monitored. All else equal, if you had to pick one dataset, which would you prefer to use?
Solutions
The second one. If you have only two examples of a boat becoming inoperative, it will be almost impossible to build a good predictor (“class imbalance”). There is another question you should be asking yourself here, though: why do the datasets have such different proportions of broken boats? They are clearly not sampled from the same population! Regardless of which dataset you use, it may be important to figure out the overall rate at which the boats you are interested in break, so you can perform some form of label-shift adjustment.
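To make the mismatch concrete, here is a minimal numeric sketch (the 0.5% population rate below is a made-up figure used only for illustration):

```python
import numpy as np

rate_1 = 2 / 10000     # dataset 1: 0.02% of boats became inoperative
rate_2 = 20 / 1000     # dataset 2: 2% of boats became inoperative
print(rate_1, rate_2)  # a 100-fold difference: not the same population

# Hypothetical label-shift correction: if you train on dataset 2 but the true
# population rate is, say, 0.5%, shift the model's log-odds output by the
# difference in prior log odds before deploying it.
pi_train, pi_pop = rate_2, 0.005
shift = np.log(pi_pop / (1 - pi_pop)) - np.log(pi_train / (1 - pi_train))
print(shift)  # add this constant to the predicted log odds
```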
Ridge penalties
Let $n$ be the number of observations and $p$ be the dimension of the predictors. Suppose we estimate the regression coefficients in a linear regression model by minimizing

$$\sum_{i=1}^{n}\left(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\right)^2 + \lambda \sum_{j=1}^{p}\beta_j^2$$
for a particular value of λ. For parts (a) through (e), indicate which of i. through v. is correct.
a. As we increase λ from 0, the training error will:
i. Increase initially, and then eventually start decreasing in an inverted U shape.
ii. Decrease initially, and then eventually start increasing in a U shape.
iii. Steadily increase.
iv. Steadily decrease.
v. Remain constant.
b. Repeat (a) for test error.
c. Repeat (a) for variance.
d. Repeat (a) for (squared) bias.
e. Repeat (a) for the irreducible error.
Solutions
(a) iii; (b) ii; (c) iv; (d) iii; (e) v. As λ increases, the coefficients are shrunk toward zero and the fit becomes less flexible: training error steadily increases, variance steadily decreases, and squared bias steadily increases. Test error first falls (while the drop in variance dominates) and then rises (once the growth in bias dominates), tracing out a U shape, and the irreducible error does not depend on the fitted model at all.
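A small simulation illustrating (a) and (b), assuming a ridge fit on synthetic data with scikit-learn (all settings below are made up):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Made-up data with many predictors relative to the sample size, so the
# nearly unpenalized fit overfits and shrinkage has room to help.
X, y = make_regression(n_samples=100, n_features=50, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Training MSE increases (weakly) as lambda grows; test MSE typically
# falls and then rises again, tracing out a U shape.
for lam in [0.01, 0.1, 1, 10, 100, 1000]:
    model = Ridge(alpha=lam).fit(X_tr, y_tr)
    train_mse = mean_squared_error(y_tr, model.predict(X_tr))
    test_mse = mean_squared_error(y_te, model.predict(X_te))
    print(f"lambda={lam:>7}: train MSE {train_mse:9.1f}, test MSE {test_mse:9.1f}")
```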