
Featurization

Estimator variance with polynomial features

Consider the following distribution on $(X, Y)$:

$$
\begin{aligned}
X & \sim \mathrm{Uniform}\left([-10,-8]\cup[8,10]\right)\\
\left[Y \mid X=x\right] & \sim \mathcal{N}\left(x^{2},1\right).
\end{aligned}
$$

Imagine we gathered a dataset $\mathscr{D}$ sampled from this distribution and used it to produce an estimate $\hat{f}(x;\mathscr{D}) \approx \mathbb{E}\left[Y \mid X=x\right]$. Assume we used least squares with polynomial features of degree 9. True or false:

$$
\mathrm{var}\left(\hat{f}(0;\mathscr{D})\right) > \mathrm{var}\left(\hat{f}(9;\mathscr{D})\right)
$$
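One way to probe a claim like this is by direct Monte Carlo simulation over repeated datasets. The sketch below is my own illustration, not part of the exercise: the sample size, the number of replicates, and the use of `numpy.polyfit` are all assumptions.

```python
# Monte Carlo sketch: estimate var(f_hat(0; D)) and var(f_hat(9; D)) by refitting
# a degree-9 polynomial on many independent datasets drawn from the stated model.
import numpy as np

rng = np.random.default_rng(0)
n, reps, degree = 100, 2000, 9            # assumed sample size and replicate count
preds_at_0, preds_at_9 = [], []

for _ in range(reps):
    # X ~ Uniform([-10, -8] U [8, 10]): draw a sign and a magnitude separately.
    sign = rng.choice([-1.0, 1.0], size=n)
    x = sign * rng.uniform(8.0, 10.0, size=n)
    y = x**2 + rng.normal(0.0, 1.0, size=n)        # Y | X = x ~ N(x^2, 1)

    # Fit in the rescaled variable x/10: degree-9 polynomials in x/10 span the
    # same function space as degree-9 polynomials in x, but the least squares
    # problem is far better conditioned numerically.
    coefs = np.polyfit(x / 10.0, y, deg=degree)
    preds_at_0.append(np.polyval(coefs, 0.0))      # x = 0 on the rescaled axis
    preds_at_9.append(np.polyval(coefs, 0.9))      # x = 9 on the rescaled axis

print("estimated var of f_hat at x = 0:", np.var(preds_at_0))
print("estimated var of f_hat at x = 9:", np.var(preds_at_9))
```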

Estimator variance with polynomial features, II

Consider a regression problem with one predictor $X$ and one numerical response $Y$. Assume the predictor is distributed as $X \sim \mathcal{N}(0,1)$. Let $\hat f_A(x)$ denote the least squares estimator for the regression function, based on natural cubic spline features. Let $\hat f_B(x)$ denote the least squares estimator for the regression function, based on some cubic spline features (including features that are not natural cubic splines). Of the following, which is the best explanation for why $\hat f_A$ might be preferred? (A simulation sketch follows the list of options.)

  1. $\hat f_A(x)$ will have less variance than $\hat f_B(x)$, especially when $|x|$ is large.

  2. $\hat f_A(x)$ will have less variance than $\hat f_B(x)$, especially when $|x|$ is small.

  3. $\hat f_A(x)$ will have less bias than $\hat f_B(x)$, especially when $|x|$ is large.

  4. $\hat f_A(x)$ will have less bias than $\hat f_B(x)$, especially when $|x|$ is small.
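A Monte Carlo sketch along these lines can make the comparison concrete. It is my own construction, not part of the exercise: the true regression function, noise level, sample size, degrees of freedom, and the use of patsy's `cr()` (natural cubic spline) and `bs()` (cubic B-spline) bases are all assumptions.

```python
# Estimate the variance of the two spline fits at a central point (x = 0) and at a
# point out in the tail of X ~ N(0, 1) (x = 2) by refitting on many datasets.
import numpy as np
from patsy import dmatrix, build_design_matrices

rng = np.random.default_rng(0)
n, reps = 1000, 500                       # assumed sample size and replicate count
x_eval = np.array([0.0, 2.0])             # a central point and a point in the tail
bases = {"natural cubic (cr)": "cr(x, df=6)", "ordinary cubic (bs)": "bs(x, df=6)"}
preds = {name: [] for name in bases}

for _ in range(reps):
    x = rng.normal(0.0, 1.0, size=n)
    y = np.sin(2.0 * x) + rng.normal(0.0, 0.5, size=n)   # assumed true regression function
    for name, formula in bases.items():
        design = dmatrix(formula, {"x": x})
        beta, *_ = np.linalg.lstsq(np.asarray(design), y, rcond=None)
        basis_at_eval = build_design_matrices([design.design_info], {"x": x_eval})[0]
        preds[name].append(np.asarray(basis_at_eval) @ beta)

for name, vals in preds.items():
    var_0, var_2 = np.var(np.array(vals), axis=0)
    print(name, "-> variance at x=0:", var_0, " variance at x=2:", var_2)
```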

Estimator variance with derived features

In typical cases, does using additional derived features lead to more variance or less variance?

8th order polynomials vs splines with 8 knots

True or false: in the context of linear regression, using polynomial features of degree 8 always leads to the same predictions as using cubic spline features with 8 evenly spaced knots.
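If you want to probe this empirically, a sketch along the following lines fits both feature sets on the same simulated data and compares their predictions on a grid. The data-generating process and the use of scikit-learn's `PolynomialFeatures` and `SplineTransformer` are my own choices, not part of the question.

```python
# Compare a degree-8 polynomial fit with a cubic spline fit using 8 evenly spaced
# knots, both estimated by ordinary least squares on the same data.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, SplineTransformer
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.uniform(-3.0, 3.0, size=(200, 1))
y = np.sin(2.0 * x).ravel() + rng.normal(0.0, 0.1, size=200)   # assumed truth

poly_fit = make_pipeline(PolynomialFeatures(degree=8), LinearRegression()).fit(x, y)
spline_fit = make_pipeline(
    SplineTransformer(degree=3, n_knots=8, knots="uniform"), LinearRegression()
).fit(x, y)

grid = np.linspace(-3.0, 3.0, 200).reshape(-1, 1)
gap = np.max(np.abs(poly_fit.predict(grid) - spline_fit.predict(grid)))
print("largest prediction difference on the grid:", gap)
```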

Splines vs polynomials

True or false: splines are typically much better than polynomials, especially if the predictor is uniformly distributed throughout its range.

Splines vs polynomials, part II

True or false: spline features with carefully placed knots can often outperform polynomial features.

Placing knots by eye

If you want to place knots by looking at the data and choosing them yourself, while still getting a good assessment of how your method will perform on future test data, which of the following would be preferred?

  1. Use all the data to place knots, to ensure best possible placement. Then make a test/train split, train the spline model on the training data, and assess its performance on the test data.

  2. Make a test/train split. Use the training data to decide where to place the knots, and the same training data to fit the model. Then assess performance on the test data. (A code sketch of this workflow appears after the list.)
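For concreteness, here is a minimal sketch of the second workflow; whether it is the preferable one is what the question asks. The tooling (scikit-learn), the simulated data, and using training-set quantiles as a stand-in for knots "placed by eye" are all my own choices.

```python
# Split first, then choose knots by looking only at the training data, fit on the
# training data, and reserve the test data purely for the final assessment.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import SplineTransformer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 10.0, size=(300, 1))
y = np.sin(x).ravel() + rng.normal(0.0, 0.3, size=300)   # assumed data-generating process

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)

# Stand-in for "placing knots by eye": any rule is fine as long as it only looks at
# the training data. Here we use training-set quantiles.
knots = np.quantile(x_train, [0.1, 0.3, 0.5, 0.7, 0.9]).reshape(-1, 1)

model = make_pipeline(
    SplineTransformer(degree=3, knots=knots, extrapolation="linear"),
    LinearRegression(),
).fit(x_train, y_train)

print("held-out test MSE:", mean_squared_error(y_test, model.predict(x_test)))
```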

Picking knots

Among the following choices, which is the most typical way to pick the knots for spline features?

  1. Uniformly spaced inside the range of the observed features.

  2. Uniformly at random inside the range of the observed features.

  3. Place a knot at each data point (and use lasso regularization to control curvature).

Roughness penalties with quadratic models

Consider the parametric regression model $Y=\beta_{0}+\beta_{1}x+\beta_{2}x^{2}+\epsilon$, and write $f(x;\beta)=\beta_{0}+\beta_{1}x+\beta_{2}x^{2}$ for the corresponding regression function. Suppose we are given a dataset $\mathscr{D}=\{(x_1,y_1),\ldots,(x_n,y_n)\}$ and we estimate the parameters $\beta$ by solving

$$
\arg\min_{\hat\beta}\;\sum_{i=1}^{n}\left(y_{i}-f(x_i;\hat\beta)\right)^{2}+\int_{-\infty}^{\infty}f''(x;\hat\beta)^{2}\,dx.
$$

True or false: our estimated value of $\beta_{2}$ will be zero.
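A step worth writing out explicitly (my addition, not part of the original statement): for this model the second derivative does not depend on $x$, so the penalty integrates a constant over the entire real line,

$$
f''(x;\hat\beta)=2\hat\beta_{2}
\qquad\Longrightarrow\qquad
\int_{-\infty}^{\infty}f''(x;\hat\beta)^{2}\,dx=\int_{-\infty}^{\infty}4\hat\beta_{2}^{2}\,dx,
$$

which is zero when $\hat\beta_{2}=0$ and infinite otherwise.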

Roughness penalties with quadratic models, II

Let $f(x;\beta)=\beta_{0}+\beta_{1}x+\beta_{2}x^{2}$ and consider estimating the coefficients by solving the minimization problem

$$
\arg\min_{\beta}\;\sum_{i=1}^{n}\left(y_{i}-f(x_i;\beta)\right)^{2}+\lambda\int_{-\infty}^{\infty}f''(x;\beta)^{2}\,dx.
$$

Let $\hat{\beta}(\lambda)$ denote the coefficients estimated for any given choice of regularization strength $\lambda$. True or false: $\lim_{\lambda\downarrow 0}\hat{\beta}_{2}(\lambda)=\hat{\beta}_{2}(0)$.

Knots vs training error

For a cubic regression spline, adding more knots will in general increase the training error. True or false?
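A quick empirical check (the simulated data, knot counts, and scikit-learn tooling are my assumptions) is to track the training error as the number of knots grows:

```python
# Training MSE of a cubic regression spline as a function of the number of knots.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import SplineTransformer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 10.0, size=(200, 1))
y = np.sin(x).ravel() + rng.normal(0.0, 0.3, size=200)   # assumed data-generating process

for n_knots in [2, 4, 8, 16, 32]:
    model = make_pipeline(
        SplineTransformer(degree=3, n_knots=n_knots), LinearRegression()
    ).fit(x, y)
    print(n_knots, "knots -> training MSE:", mean_squared_error(y, model.predict(x)))
```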

Polynomial regression and validation error

You are performing least-squares polynomial regression. As the degree of the polynomial increases, the validation error is commonly seen to decrease at first and then increase. True or false?
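A sketch one could run to see the typical pattern (the data-generating process, the split, and the degree range are my own choices):

```python
# Validation MSE of least-squares polynomial regression as the degree increases.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
x = rng.uniform(-3.0, 3.0, size=(120, 1))
y = np.sin(2.0 * x).ravel() + rng.normal(0.0, 0.4, size=120)   # assumed truth

x_tr, x_val, y_tr, y_val = train_test_split(x, y, test_size=0.3, random_state=0)
for degree in range(1, 16):
    model = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
    model.fit(x_tr, y_tr)
    print("degree", degree, "-> validation MSE:",
          mean_squared_error(y_val, model.predict(x_val)))
```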

Preprocessing binary features

You are given a dataset concerning $p$ binary features (coded as zeros and ones) and one numerical response. True or false: before proceeding with most estimators, it is essential to preprocess your data into a new form with $2p$ features (this is referred to as “one-hot encoding” and the new features are sometimes called “dummy features”).

Preprocessing binary features, II

You are given a dataset concerning $p$ categorical features and one numerical response. Assume each categorical feature can take on one of four unique values (e.g., $X_i \in \{1,2,3,4\}$ for each feature $i$). Before proceeding with most estimators, it is usually helpful to preprocess your data into a new form using a one-hot encoding. After this encoding process, you will have a new dataset that a wide variety of methods can be directly applied to. How many predictor features will this new dataset have? (A small code check follows the list of options.)

  • Either $5p$ or $4p$ (either is fine; depends on whether you drop one of the categories).

  • Either $3p$ or $4p$ (either is fine; depends on whether you drop one of the categories).

  • Either $2p$ or $3p$ (either is fine; depends on whether you drop one of the categories).
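A small code check of the column counts (scikit-learn's `OneHotEncoder` and the simulated categorical data are my choices, not something the question prescribes):

```python
# Count the columns produced by one-hot encoding p categorical features, each with
# four levels, with and without dropping one category per feature.
import numpy as np
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)
p = 3
X = rng.integers(1, 5, size=(100, p))          # p features, values in {1, 2, 3, 4}

full = OneHotEncoder().fit(X)                  # keep all four indicators per feature
dropped = OneHotEncoder(drop="first").fit(X)   # drop one category per feature

print("without dropping:", full.transform(X).shape[1], "columns")
print("dropping one category per feature:", dropped.transform(X).shape[1], "columns")
```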