Featurization
Estimator variance with polynomial features¶
Consider the following distribution on the predictor (it places essentially no mass near zero).
Imagine we gathered a dataset sampled from this distribution and used it to produce an estimate of the regression function. Assume we used least squares with polynomial features of degree 9. True or false: our estimates will be unreliable for predictor values near zero.
Solutions
True! We will never see points where the predictor is close to zero. The result is that our estimates of the regression function near zero will be quite bad.
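As a concrete illustration (the distribution and true regression function below are made up, not the exercise's actual setup), the following sketch samples many datasets from a distribution that avoids a neighborhood of zero, fits a degree-9 polynomial by least squares each time, and compares how much the fitted value fluctuates at a point with no nearby data versus a point in the middle of the data.

```python
# Hypothetical simulation: x never falls near zero; compare variability of the
# degree-9 polynomial fit at x = 0 (no data) versus x = 2 (plenty of data).
import numpy as np

rng = np.random.default_rng(0)
preds_at_zero, preds_at_two = [], []
for _ in range(200):
    # assumed distribution: uniform on [-3, -1] union [1, 3], so no points near 0
    x = rng.uniform(1, 3, size=100) * rng.choice([-1, 1], size=100)
    y = np.sin(x) + rng.normal(scale=0.1, size=100)      # assumed true regression function
    p = np.polynomial.Polynomial.fit(x, y, deg=9)        # least squares, degree-9 polynomial features
    preds_at_zero.append(p(0.0))                         # x = 0 lies in a region with no data
    preds_at_two.append(p(2.0))                          # x = 2 lies in the middle of the observed data

print("std of fitted values at x=0:", np.std(preds_at_zero))
print("std of fitted values at x=2:", np.std(preds_at_two))
# The spread at x=0 is typically far larger: the estimate is unreliable where no data is observed.
```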
Estimator variance with polynomial features, II¶
Consider a regression problem with one predictor and one numerical response. Assume the predictor is drawn from some fixed distribution. Let one estimator be the least squares estimator for the regression function based on natural cubic spline features, and let the other be the least squares estimator based on some cubic spline features (including features that are not natural cubic splines). Of the following, which is the best explanation for why the natural-spline estimator might be preferred?
The natural-spline estimator will have less variance than the other, especially when the predictor value is large.
The natural-spline estimator will have less variance than the other, especially when the predictor value is small.
The natural-spline estimator will have less bias than the other, especially when the predictor value is large.
The natural-spline estimator will have less bias than the other, especially when the predictor value is small.
Solutions
The first answer. Natural cubic splines and unconstrained cubic splines generally give the same answers near the “middle” of the data. But for values that are far away from the observed data, unconstrained cubic spline features (e.g. B-spline features) lead to very large variance. So we expect the natural-spline estimator to have lower variance, especially when the predictor value is large.
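A rough way to see this numerically: scikit-learn's SplineTransformer can continue its cubic pieces past the boundary knots (extrapolation="continue", roughly an unconstrained cubic spline) or extrapolate linearly (extrapolation="linear", which mimics the natural-spline boundary behavior). This is only a stand-in for the exercise's exact bases, and the data-generating process below is made up.

```python
# Compare the variance of predictions far outside the data for two extrapolation schemes.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import SplineTransformer
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
x_far = np.array([[4.0]])                 # a point well outside the bulk of the training data

preds = {"continue": [], "linear": []}
for _ in range(200):
    x = rng.normal(size=(100, 1))         # assumed predictor distribution
    y = np.sin(x[:, 0]) + rng.normal(scale=0.3, size=100)
    for extrap in preds:
        model = make_pipeline(
            SplineTransformer(degree=3, n_knots=6, extrapolation=extrap),
            LinearRegression(),
        )
        model.fit(x, y)
        preds[extrap].append(model.predict(x_far)[0])

for extrap, vals in preds.items():
    print(extrap, "extrapolation, std of prediction at x=4:", np.std(vals))
# The cubic ("continue") extrapolation typically shows much larger variance far from the data.
```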
Estimator variance with derived features¶
In typical cases, does using additional derived features lead to more variance or less variance?
Solutions
More.
8th order polynomials vs splines with 8 knots¶
True or false: in the context of linear regression, using polynomial features of degree 8 always leads to the same predictions as using cubic spline features with 8 evenly spaced knots.
Solutions
False. If you use polynomial features of degree 8 you’ll get estimated regression functions with a nonzero eighth derivative. Predictions made using cubic spline features have a zero eighth derivative (away from the knots). Indeed, the cubic spline is made of cubic pieces, and so the fourth derivative, fifth derivative, and all higher derivatives are zero away from the knots.
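A quick numerical sanity check of the derivative claim (the dataset here is made up; any dataset would do):

```python
# A degree-8 polynomial fit has a nonzero constant eighth derivative, while any cubic
# piece of a spline has zero fourth and higher derivatives.
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(-1, 1, size=50)
y = np.cos(3 * x) + rng.normal(scale=0.1, size=50)     # hypothetical data

p8 = np.poly1d(np.polyfit(x, y, deg=8))                # degree-8 polynomial fit
print("8th derivative of the degree-8 fit:", p8.deriv(8).coeffs)   # 8! times the leading coefficient

cubic_piece = np.poly1d([1.0, -2.0, 0.5, 3.0])         # any cubic piece of a spline
print("3rd derivative of a cubic piece:", cubic_piece.deriv(3).coeffs)  # a constant...
# ...so the 4th (and every higher) derivative of each cubic piece is identically zero.
```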
Splines vs polynomials¶
True or false: splines are typically much better than polynomials, especially if the predictor is uniformly distributed throughout its range.
Solutions
False. They’re often pretty comparable in this regime.
Splines vs polynomials, part II¶
True or false: spline features with carefully placed knots can often outperform polynomial features.
Solutions
True.
Placing knots by eye¶
If you want to place knots by looking at the data and choosing them yourself, while still getting a good assessment of how your method will perform on future test data, which of the following would be preferred?
Use all the data to place knots, to ensure best possible placement. Then make a test/train split, train the spline model on the training data, and assess its performance on the test data.
Make a test/train split. Use the training data to decide where to place the knots, and the same training data to fit the model. Then assess performance on the test data.
Solutions
Second option is better. If you use all of the data to pick the knots, you are “cheating” to some extent, which means you may mis-assess how well your estimate will perform on future test data.
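A sketch of the second workflow on a made-up 1-D dataset: split first, choose the knots while looking only at the training data, fit on the training data, and report error on the held-out test data. The knot locations below stand in for knots “placed by eye” and are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import SplineTransformer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, size=(200, 1))
y = np.sin(x[:, 0]) + rng.normal(scale=0.3, size=200)   # hypothetical data

# Split before looking at anything.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)

# "Place knots by eye" using the training data only (these particular values are made up).
knots = np.array([[2.0], [4.0], [6.0], [8.0]])

model = make_pipeline(SplineTransformer(degree=3, knots=knots), LinearRegression())
model.fit(x_train, y_train)
print("test MSE:", mean_squared_error(y_test, model.predict(x_test)))
```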
Picking knots¶
Among the following choices, which is the most typical way to pick the knots for spline features?
Uniformly spaced inside the range of the observed features.
Uniformly at random inside the range of the observed features.
Place a knot at each data point (and use lasso regularization to control curvature).
Solutions
First option is most common. Third option would be plausible except for that bit about lasso regularization to control curvature. It is perhaps conceivable that one could control curvature using some form of lasso, but it is certainly not common practice. Instead, if you want to control curvature it would be typical to use a penalty on the integral of the squared second derivative.
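For reference, uniformly spaced knots are the default in scikit-learn's SplineTransformer (knots="uniform"); knots="quantile" is another common choice. The skewed predictor below is made up, just to show how the two placements differ.

```python
import numpy as np
from sklearn.preprocessing import SplineTransformer

rng = np.random.default_rng(4)
x = rng.exponential(scale=2.0, size=(200, 1))   # a skewed, hypothetical predictor

uniform = SplineTransformer(degree=3, n_knots=8, knots="uniform").fit(x)
quantile = SplineTransformer(degree=3, n_knots=8, knots="quantile").fit(x)

# bsplines_[0].t is the full knot vector; the slice drops the padded boundary knots.
print("uniform knots: ", uniform.bsplines_[0].t[3:-3])    # evenly spaced over [min(x), max(x)]
print("quantile knots:", quantile.bsplines_[0].t[3:-3])   # denser where the data is denser
```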
Roughness penalties with quadratic models¶
Consider the parametric regression model $f_\beta(x) = \beta_0 + \beta_1 x + \beta_2 x^2$. Suppose we are given a dataset $(x_1, y_1), \dots, (x_n, y_n)$ and we estimate the parameters β by solving
$$\hat\beta = \operatorname*{argmin}_\beta \; \sum_{i=1}^n \bigl(y_i - f_\beta(x_i)\bigr)^2 + \lambda \int_{-\infty}^{\infty} \bigl(f_\beta''(x)\bigr)^2 \, dx$$
for some $\lambda > 0$. True or false: our estimated value of $\beta_2$ will be zero.
Solutions
True. In this case the fitted function is quadratic, so its second derivative is the constant $f_\beta''(x) = 2\beta_2$, and the roughness penalty is $\lambda \int_{-\infty}^{\infty} (2\beta_2)^2 \, dx$. Thus if we take $\beta_2 \neq 0$ we will incur an infinite penalty. That means the solution that minimizes the objective must have $\hat\beta_2 = 0$.
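Spelled out, the penalty calculation is a one-liner (a sketch in the notation of the statement above):

```latex
% Assumes the quadratic model and whole-line roughness penalty as in the statement above.
\[
  f_\beta''(x) = 2\beta_2
  \quad\Longrightarrow\quad
  \int_{-\infty}^{\infty} \bigl(f_\beta''(x)\bigr)^2 \, dx
  = \int_{-\infty}^{\infty} 4\beta_2^2 \, dx
  = \begin{cases}
      0       & \text{if } \beta_2 = 0, \\
      +\infty & \text{if } \beta_2 \neq 0,
    \end{cases}
\]
so for any $\lambda > 0$ the penalized objective is infinite unless $\beta_2 = 0$.
```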
Roughness penalties with quadratic models, II¶
Let $f_\beta(x) = \beta_0 + \beta_1 x + \beta_2 x^2$ and consider estimating the coefficients by solving the same penalized minimization problem as in the previous exercise. Let $\hat\beta(\lambda)$ denote the coefficients estimated for any given choice of regularization strength λ. True or false: $\hat\beta_2(\lambda)$ is nonzero for sufficiently small $\lambda > 0$, shrinking to zero only as $\lambda \to \infty$.
Solutions
False. For any nonzero λ, we will find that $\hat\beta_2$ must be exactly zero. Indeed, the penalty will be of the form $\lambda \int_{-\infty}^{\infty} (2\beta_2)^2 \, dx$, which is infinite whenever $\beta_2 \neq 0$. So if $\lambda > 0$, the only way to control this penalty is to set $\beta_2 = 0$. For $\beta_1$, by contrast, it will be possible to obtain some nonzero value of $\hat\beta_1$, since the penalty does not involve $\beta_1$ at all.
Knots vs training error¶
In a cubic regression spline, adding more knots will in general increase the training error. True or false?
Solutions
False. Adding more knots usually decreases the training error (but may increase the test error).
Polynomial regression and validation error¶
You are performing least-squares polynomial regression. As the degree of your polynomials increases, the validation error is commonly seen to go down at first but then go up. True or false?
Solutions
True.
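A sketch of the usual pattern on a made-up dataset: hold out a validation set, fit least-squares polynomial regressions of increasing degree, and print the validation error for each degree. It typically falls at first and then rises.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(5)
x = rng.uniform(-3, 3, size=(120, 1))
y = np.sin(x[:, 0]) + rng.normal(scale=0.3, size=120)   # hypothetical data

x_tr, x_val, y_tr, y_val = train_test_split(x, y, test_size=0.3, random_state=0)
for degree in range(1, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression()).fit(x_tr, y_tr)
    print(degree, mean_squared_error(y_val, model.predict(x_val)))
# Exact numbers vary, but the validation error usually decreases and then increases with degree.
```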
Preprocessing binary features¶
You are given a dataset concerning binary features (coded as zeros and ones) and one numerical response. True or false: before proceeding with most estimators, it is essential to preprocess your data into a new form with two indicator features per original binary feature (this is referred to as “one-hot encoding,” and the new features are sometimes called “dummy features”).
Solutions
False. You don’t need one-hot encoding for binary features. You can just code them as 0s and 1s.
Preprocessing binary features, II¶
You are given a dataset concerning categorical features and one numerical response. Assume each categorical feature can take on one of four unique values. Before proceeding with most estimators, it is usually helpful to preprocess your data into a new form using a one-hot encoding. After this encoding process, you will have a new dataset that a wide variety of methods can be directly applied to. How many predictor features will this new dataset have?
Either two or three new features per original categorical feature (either is fine; depends on whether you drop one of the categories).
Either three or four new features per original categorical feature (either is fine; depends on whether you drop one of the categories).
Either four or five new features per original categorical feature (either is fine; depends on whether you drop one of the categories).
Solutions
Second option is correct. You need at least three features for each raw categorical feature (if that feature takes on four unique values). You can also use four; using four introduces redundancy, but that doesn’t usually matter for most estimators.
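A small sketch of the two choices using scikit-learn's OneHotEncoder on a single made-up categorical column with four levels: keeping all categories gives four indicator columns per feature, while dropping one category gives three.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical column with four unique values.
colors = np.array([["red"], ["green"], ["blue"], ["yellow"], ["green"]])

full = OneHotEncoder().fit_transform(colors).toarray()                 # 4 columns (redundant but usually harmless)
dropped = OneHotEncoder(drop="first").fit_transform(colors).toarray()  # 3 columns (one category dropped)

print(full.shape, dropped.shape)   # (5, 4) (5, 3)
```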