Featurization
Estimator variance with polynomial features¶
Consider the following distribution on the predictor (it places essentially no mass near zero).
Imagine we gathered a dataset sampled from this distribution and used it to produce an estimate of the regression function. Assume we used least squares with polynomial features of degree 9. True or false: our estimates will be unreliable for predictor values near zero.
Solutions
True! We will never see points where the predictor is close to zero. The result is that our estimates of the regression function near zero will be quite bad.
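As a concrete illustration (the distribution and true regression function below are made up, not the exercise's actual setup), the following sketch samples many datasets from a distribution that avoids a neighborhood of zero, fits a degree-9 polynomial by least squares each time, and compares how much the fitted value fluctuates at a point with no nearby data versus a point in the middle of the data.

```python
# Hypothetical simulation: x never falls near zero; compare variability of the
# degree-9 polynomial fit at x = 0 (no data) versus x = 2 (plenty of data).
import numpy as np

rng = np.random.default_rng(0)
preds_at_zero, preds_at_two = [], []
for _ in range(200):
    # assumed distribution: uniform on [-3, -1] union [1, 3], so no points near 0
    x = rng.uniform(1, 3, size=100) * rng.choice([-1, 1], size=100)
    y = np.sin(x) + rng.normal(scale=0.1, size=100)      # assumed true regression function
    p = np.polynomial.Polynomial.fit(x, y, deg=9)        # least squares, degree-9 polynomial features
    preds_at_zero.append(p(0.0))                         # x = 0 lies in a region with no data
    preds_at_two.append(p(2.0))                          # x = 2 lies in the middle of the observed data

print("std of fitted values at x=0:", np.std(preds_at_zero))
print("std of fitted values at x=2:", np.std(preds_at_two))
# The spread at x=0 is typically far larger: the estimate is unreliable where no data is observed.
```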
Estimator variance with polynomial features, II¶
Consider a regression problem with one predictor and one numerical response. Assume the predictor is drawn from some fixed distribution. Let one estimator be the least squares estimator for the regression function based on natural cubic spline features, and let the other be the least squares estimator based on some cubic spline features (including features that are not natural cubic splines). Of the following, which is the best explanation for why the natural-spline estimator might be preferred?
The natural-spline estimator will have less variance than the other, especially when the predictor value is large.
The natural-spline estimator will have less variance than the other, especially when the predictor value is small.
The natural-spline estimator will have less bias than the other, especially when the predictor value is large.
The natural-spline estimator will have less bias than the other, especially when the predictor value is small.
Solutions
The first answer. Natural cubic splines and unconstrained cubic splines generally give the same answers near the “middle” of the data. But for values that are far away from the observed data, unconstrained cubic spline features (e.g. B-spline features) lead to very large variance. So we expect the natural-spline estimator to have lower variance, especially when the predictor value is large.
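A rough way to see this numerically: scikit-learn's SplineTransformer can continue its cubic pieces past the boundary knots (extrapolation="continue", roughly an unconstrained cubic spline) or extrapolate linearly (extrapolation="linear", which mimics the natural-spline boundary behavior). This is only a stand-in for the exercise's exact bases, and the data-generating process below is made up.

```python
# Compare the variance of predictions far outside the data for two extrapolation schemes.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import SplineTransformer
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
x_far = np.array([[4.0]])                 # a point well outside the bulk of the training data

preds = {"continue": [], "linear": []}
for _ in range(200):
    x = rng.normal(size=(100, 1))         # assumed predictor distribution
    y = np.sin(x[:, 0]) + rng.normal(scale=0.3, size=100)
    for extrap in preds:
        model = make_pipeline(
            SplineTransformer(degree=3, n_knots=6, extrapolation=extrap),
            LinearRegression(),
        )
        model.fit(x, y)
        preds[extrap].append(model.predict(x_far)[0])

for extrap, vals in preds.items():
    print(extrap, "extrapolation, std of prediction at x=4:", np.std(vals))
# The cubic ("continue") extrapolation typically shows much larger variance far from the data.
```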
Estimator variance with derived features¶
In typical cases, does using additional derived features lead to more variance or less variance?
Solutions
More.
8th order polynomials vs splines with 8 knots¶
True or false: in the context of linear regression, using polynomial features of degree 8 always leads to the same predictions as using cubic spline features with 8 evenly spaced knots.
Solutions
False. If you use polynomial features of degree 8 you’ll get estimated regression functions with a nonzero eighth derivative. Predictions made using cubic spline features have a zero eighth derivative (away from the knots). Indeed, the cubic spline is made of cubic pieces, and so the fourth derivative, fifth derivative, and all higher derivatives are zero away from the knots.
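A quick numerical sanity check of the derivative claim (the dataset here is made up; any dataset would do):

```python
# A degree-8 polynomial fit has a nonzero constant eighth derivative, while any cubic
# piece of a spline has zero fourth and higher derivatives.
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(-1, 1, size=50)
y = np.cos(3 * x) + rng.normal(scale=0.1, size=50)     # hypothetical data

p8 = np.poly1d(np.polyfit(x, y, deg=8))                # degree-8 polynomial fit
print("8th derivative of the degree-8 fit:", p8.deriv(8).coeffs)   # 8! times the leading coefficient

cubic_piece = np.poly1d([1.0, -2.0, 0.5, 3.0])         # any cubic piece of a spline
print("3rd derivative of a cubic piece:", cubic_piece.deriv(3).coeffs)  # a constant...
# ...so the 4th (and every higher) derivative of each cubic piece is identically zero.
```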
Splines vs polynomials¶
True or false: splines are typically much better than polynomials, especially if the predictor is uniformly distributed throughout its range.
Solutions
False. They’re often pretty comparable in this regime.
Splines vs polynomials, part II¶
True or false: spline features with carefully placed knots can often outperform polynomial features.
Solutions
True.
Placing knots by eye¶
If you want to place knots by looking at the data and choosing them yourself, while still getting a good assessment of how your method will perform on future test data, which of the following would be preferred?
Use all the data to place knots, to ensure best possible placement. Then make a test/train split, train the spline model on the training data, and assess its performance on the test data.
Make a test/train split. Use the training data to decide where to place the knots, and the same training data to fit the model. Then assess performance on the test data.
Solutions
Second option is better. If you use all of the data to pick the knots, you are “cheating” to some extent, which means you may mis-assess how well your estimate will perform on future test data.
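A sketch of the second workflow on a made-up 1-D dataset: split first, choose the knots while looking only at the training data, fit on the training data, and report error on the held-out test data. The knot locations below stand in for knots “placed by eye” and are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import SplineTransformer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, size=(200, 1))
y = np.sin(x[:, 0]) + rng.normal(scale=0.3, size=200)   # hypothetical data

# Split before looking at anything.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)

# "Place knots by eye" using the training data only (these particular values are made up).
knots = np.array([[2.0], [4.0], [6.0], [8.0]])

model = make_pipeline(SplineTransformer(degree=3, knots=knots), LinearRegression())
model.fit(x_train, y_train)
print("test MSE:", mean_squared_error(y_test, model.predict(x_test)))
```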
Picking knots¶
Among the following choices, which is the most typical way to pick the knots for spline features?
Uniformly spaced inside the range of the observed features.
Uniformly at random inside the range of the observed features.
Place a knot at each data point (and use lasso regularization to control curvature).
Solutions
First option is most common. Third option would be plausible except for that bit about lasso regularization to control curvature. It is perhaps conceivable that one could control curvature using some form of lasso, but it is certainly not common practice. Instead, if you want to control curvature it would be typical to use a penalty on the integral of the squared second derivative.
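For reference, uniformly spaced knots are the default in scikit-learn's SplineTransformer (knots="uniform"); knots="quantile" is another common choice. The skewed predictor below is made up, just to show how the two placements differ.

```python
import numpy as np
from sklearn.preprocessing import SplineTransformer

rng = np.random.default_rng(4)
x = rng.exponential(scale=2.0, size=(200, 1))   # a skewed, hypothetical predictor

uniform = SplineTransformer(degree=3, n_knots=8, knots="uniform").fit(x)
quantile = SplineTransformer(degree=3, n_knots=8, knots="quantile").fit(x)

# bsplines_[0].t is the full knot vector; the slice drops the padded boundary knots.
print("uniform knots: ", uniform.bsplines_[0].t[3:-3])    # evenly spaced over [min(x), max(x)]
print("quantile knots:", quantile.bsplines_[0].t[3:-3])   # denser where the data is denser
```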
Roughness penalties with quadratic models¶
Consider the parametric regression model $f_\beta(x) = \beta_0 + \beta_1 x + \beta_2 x^2$. Suppose we are given a dataset $(x_1, y_1), \dots, (x_n, y_n)$ and we estimate the parameters β by solving
$$\hat\beta = \operatorname*{argmin}_\beta \; \sum_{i=1}^n \bigl(y_i - f_\beta(x_i)\bigr)^2 + \lambda \int_{-\infty}^{\infty} \bigl(f_\beta''(x)\bigr)^2 \, dx$$
for some $\lambda > 0$. True or false: our estimated value of $\beta_2$ will be zero.
Solutions
True. In this case the fitted function is quadratic, so its second derivative is the constant $f_\beta''(x) = 2\beta_2$, and the roughness penalty is $\lambda \int_{-\infty}^{\infty} (2\beta_2)^2 \, dx$. Thus if we take $\beta_2 \neq 0$ we will incur an infinite penalty. That means the solution that minimizes the objective must have $\hat\beta_2 = 0$.
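Spelled out, the penalty calculation is a one-liner (a sketch in the notation of the statement above):

```latex
% Assumes the quadratic model and whole-line roughness penalty as in the statement above.
\[
  f_\beta''(x) = 2\beta_2
  \quad\Longrightarrow\quad
  \int_{-\infty}^{\infty} \bigl(f_\beta''(x)\bigr)^2 \, dx
  = \int_{-\infty}^{\infty} 4\beta_2^2 \, dx
  = \begin{cases}
      0       & \text{if } \beta_2 = 0, \\
      +\infty & \text{if } \beta_2 \neq 0,
    \end{cases}
\]
so for any $\lambda > 0$ the penalized objective is infinite unless $\beta_2 = 0$.
```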
Roughness penalties with quadratic models, II¶
Let $f_\beta(x) = \beta_0 + \beta_1 x + \beta_2 x^2$ and consider estimating the coefficients by solving the same penalized minimization problem as in the previous exercise. Let $\hat\beta(\lambda)$ denote the coefficients estimated for any given choice of regularization strength λ. True or false: $\hat\beta_2(\lambda)$ is nonzero for sufficiently small $\lambda > 0$, shrinking to zero only as $\lambda \to \infty$.
Solutions
False. For any nonzero λ, we will find that $\hat\beta_2$ must be exactly zero. Indeed, the penalty will be of the form $\lambda \int_{-\infty}^{\infty} (2\beta_2)^2 \, dx$, which is infinite whenever $\beta_2 \neq 0$. So if $\lambda > 0$, the only way to control this penalty is to set $\beta_2 = 0$. For $\beta_1$, by contrast, it will be possible to obtain some nonzero value of $\hat\beta_1$, since the penalty does not involve $\beta_1$ at all.
Knots vs training error¶
In a cubic regression spline, adding more knots will in general increase the training error. True or false?
Solutions
False. Adding more knots usually decreases the training error (but may increase the test error).
Polynomial regression and validation error¶
You are performing least-squares polynomial regression. As the degree of your polynomials increases, the validation error is commonly seen to go down at first but then go up. True or false?
Solutions
True.
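A sketch of the usual pattern on a made-up dataset: hold out a validation set, fit least-squares polynomial regressions of increasing degree, and print the validation error for each degree. It typically falls at first and then rises.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(5)
x = rng.uniform(-3, 3, size=(120, 1))
y = np.sin(x[:, 0]) + rng.normal(scale=0.3, size=120)   # hypothetical data

x_tr, x_val, y_tr, y_val = train_test_split(x, y, test_size=0.3, random_state=0)
for degree in range(1, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression()).fit(x_tr, y_tr)
    print(degree, mean_squared_error(y_val, model.predict(x_val)))
# Exact numbers vary, but the validation error usually decreases and then increases with degree.
```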
Preprocessing binary features¶
You are given a dataset concerning binary features (coded as zeros and ones) and one numerical response. True or false: before proceeding with most estimators, it is essential to preprocess your data into a new form with two indicator features per original binary feature (this is referred to as “one-hot encoding,” and the new features are sometimes called “dummy features”).
Solutions
False. You don’t need one-hot encoding for binary features. You can just code them as 0s and 1s.
Preprocessing binary features, II¶
You are given a dataset concerning categorical features and one numerical response. Assume each categorical feature can take on one of four unique values. Before proceeding with most estimators, it is usually helpful to preprocess your data into a new form using a one-hot encoding. After this encoding process, you will have a new dataset that a wide variety of methods can be directly applied to. How many predictor features will this new dataset have?
Either two or three new features per original categorical feature (either is fine; depends on whether you drop one of the categories).
Either three or four new features per original categorical feature (either is fine; depends on whether you drop one of the categories).
Either four or five new features per original categorical feature (either is fine; depends on whether you drop one of the categories).
Solutions
Second option is correct. You need at least three features for each raw categorical feature (if that feature takes on four unique values). You can also use four; using four introduces redundancy, but that doesn’t usually matter for most estimators.
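A small sketch of the two choices using scikit-learn's OneHotEncoder on a single made-up categorical column with four levels: keeping all categories gives four indicator columns per feature, while dropping one category gives three.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical column with four unique values.
colors = np.array([["red"], ["green"], ["blue"], ["yellow"], ["green"]])

full = OneHotEncoder().fit_transform(colors).toarray()                 # 4 columns (redundant but usually harmless)
dropped = OneHotEncoder(drop="first").fit_transform(colors).toarray()  # 3 columns (one category dropped)

print(full.shape, dropped.shape)   # (5, 4) (5, 3)
```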