Trees
Sunlight cafe, part I¶
Imagine you run a small cafe, and you’re trying to predict the daily revenue ($Y$) based on the number of hours of sunlight for the day ($X$). You believe that more sunlight encourages more people to go out, thereby potentially increasing your revenue.
You’d like to approximate $\mathbb{E}[Y \mid X]$, and you’re going to do so by fitting a regression stump (i.e., a regression tree with one internal node). Which of the following describes the class of estimators that you could obtain?
a. $\hat{f}(x) = c$, with $c \in \mathbb{R}$
b. $\hat{f}(x) = a + bx$, with $a, b \in \mathbb{R}$
c. $\hat{f}(x) = a\,\mathbb{1}\{x \le c\} + b\,\mathbb{1}\{x > c\}$, for $a, b, c \in \mathbb{R}$
d. $\hat{f}(x) = (a_1 + b_1 x)\,\mathbb{1}\{x \le c\} + (a_2 + b_2 x)\,\mathbb{1}\{x > c\}$, with $a_1, b_1, a_2, b_2 \in \mathbb{R}$ and $c \in \mathbb{R}$
Solutions
c. Though it is a bit of an unconventional way of writing it, we would more often think of the stump as being defined through $a$ (the mean in the left child), $b$ (the mean in the right child), and $c$ (the split point). The family of possible functions would then be
$$\hat{f}(x) = \begin{cases} a & \text{if } x \le c, \\ b & \text{if } x > c, \end{cases} \qquad a, b, c \in \mathbb{R}.$$
However, the family of functions you can represent this way is the same as the family you can represent with (c).
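In code, each member of this family is just a piecewise-constant function with a single split. Here is a minimal sketch (the function and parameter names are ours, not from the problem):

```python
def stump(x, a, b, c):
    """A regression stump: predict a to the left of the split point c, b to the right."""
    return a if x <= c else b

# One member of the family: a = 2.0, b = 5.0, split at c = 6.0.
print(stump(4.0, a=2.0, b=5.0, c=6.0))  # 2.0
print(stump(7.0, a=2.0, b=5.0, c=6.0))  # 5.0
```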
Sunlight cafe, part II¶
Consider the dataset
| Sample | Hours of Sunlight | Revenue |
|--------|-------------------|---------|
| #1     | 3                 | 4.5     |
| #2     | 7                 | 5.0     |
| #3     | 8                 | 7.0     |
Fit a regression stump to this dataset to predict revenue from hours of sunlight. What is the resulting estimate $\hat{f}$?
Solutions
It would be
$$\hat{f}(x) = \begin{cases} 4.75 & \text{if } x \le 7.5, \\ 7.0 & \text{if } x > 7.5. \end{cases}$$
This solution has the smallest possible MSPE among all possible decision stumps, because it groups together the two samples whose responses are closest (#1 and #2) and predicts the mean response within each group.
Note that any other split point strictly between 7 and 8 would be just as good. The choice of 7.5 (the midpoint between the observed datapoints) is conventional.
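If you want to check this numerically, here is a minimal sketch using scikit-learn (not part of the original problem); `max_depth=1` restricts the tree to a single split, and scikit-learn also follows the midpoint convention for split points:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# The dataset from the problem: hours of sunlight -> revenue.
X = np.array([[3.0], [7.0], [8.0]])
y = np.array([4.5, 5.0, 7.0])

# max_depth=1 forces a single internal node, i.e. a regression stump.
stump = DecisionTreeRegressor(max_depth=1).fit(X, y)

print(stump.tree_.threshold[0])        # 7.5 (midpoint between 7 and 8)
print(stump.predict([[5.0], [9.0]]))   # [4.75  7.  ]
```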
Woodchuck behavior, part I¶
Imagine you run a small woodchuck sanctuary, and you’re trying to predict whether a woodchuck will come out ($Y$) based on the number of hours of sunlight for the day ($X$). You believe that more sunlight encourages woodchucks to come out more.
You’d like to estimate $\mathbb{P}(Y = 1 \mid X = x)$, and you’re going to estimate it by fitting a decision tree.
In fitting this tree to a set of training data $\{(x_i, y_i)\}_{i=1}^n$, we must search among different trees in search of one that best fits the training data. In this search, which of the following would be a typical loss? That is, which of the following would be the most typical way to characterize how poorly a given estimate fits the data?
Solutions
The last answer, the negative log likelihood, is correct.
There are actually other forms of loss that you’ll see (e.g., ones based on the Gini impurity), but of the listed answers only (d) is commonly used.
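Concretely, the negative log likelihood of binary outcomes $y_i \in \{0, 1\}$ under an estimate $\hat{p}(x) \approx \mathbb{P}(Y = 1 \mid X = x)$ is $-\sum_i \left[ y_i \log \hat{p}(x_i) + (1 - y_i) \log(1 - \hat{p}(x_i)) \right]$. A minimal numpy sketch (the function and variable names are ours):

```python
import numpy as np

def negative_log_likelihood(p_hat, y):
    """NLL of binary labels y under predicted probabilities p_hat = P(Y=1|X=x)."""
    p_hat = np.clip(p_hat, 1e-12, 1 - 1e-12)  # avoid log(0) at hard 0/1 estimates
    return -np.sum(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))

print(negative_log_likelihood(np.array([0.1, 0.2, 0.9]), np.array([0, 0, 1])))
```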
Woodchuck behavior, part II¶
Consider the dataset
| Sample | Hours of Sunlight | Woodchuck appears |
|--------|-------------------|-------------------|
| #1     | 3                 | No                |
| #2     | 7                 | No                |
| #3     | 8                 | Yes               |
If we fit a classification stump to this dataset to predict whether a woodchuck appears from hours of sunlight, what would the final estimate of $\mathbb{P}(Y = 1 \mid X = x)$ be?
Solutions
We would end up with
$$\hat{\mathbb{P}}(Y = 1 \mid X = x) = \begin{cases} 0 & \text{if } x \le 7.5, \\ 1 & \text{if } x > 7.5. \end{cases}$$
Note that this answer is probably not ideal. Given that we only have three training samples, it is probably pretty unwise to declare $\hat{\mathbb{P}}(Y = 1 \mid X = x) = 1$ with certainty!
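You can see the same hard 0/1 probability estimates from a fitted stump in scikit-learn (a sketch we added, not part of the problem):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# The dataset from the problem: 0 = woodchuck does not appear, 1 = it appears.
X = np.array([[3.0], [7.0], [8.0]])
y = np.array([0, 0, 1])

stump = DecisionTreeClassifier(max_depth=1).fit(X, y)

print(stump.tree_.threshold[0])             # 7.5
print(stump.predict_proba([[5.0], [9.0]]))  # [[1. 0.] [0. 1.]] -- hard 0/1 estimates
```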
A tree: test and train¶
Say you fit a decision tree to predict a numerical response variable $Y$ from predictor variables $X$, based on a dataset $\{(x_i, y_i)\}_{i=1}^n$. You fit the tree with the “max leaves” parameter set to 9. You also have a held-out dataset. Which will typically be true?
a. Among trees with 9 leaves, your estimate will have the smallest mean squared predictive error on the training data (but not the testing data).
b. Among trees with 9 leaves, your estimate will have the smallest mean squared predictive error on the training data and the testing data.
c. Among trees with 9 leaves, your estimate will have the smallest mean squared predictive error on the testing data (but not the training data).
d. None of the above.
Solutions
d, none of the above. Trees are fit greedily to try to minimize mean squared predictive error on the training data, but the greedy fitting procedure doesn’t typically find the tree that achieves the minimum training error among all trees with 9 leaves.
Tree augmentation¶
You are considering a regression problem with 10 binary predictor variables. You have been given a decision tree with 20 leaf nodes and a dataset with 300 samples. You would like to augment the tree to make it a little bit bigger and reduce the mean squared predictive error on the dataset.
To perform the augmentation, you will search among a set of possible trees, each a bit larger than the tree you started with. How many trees will you consider?
Solutions
200: for each of the 10 predictors and each of the 20 leaf nodes, you’ll consider a new tree formed by splitting that leaf node on that predictor, giving $10 \times 20 = 200$ candidates. 201 would also be an acceptable answer (if you include the original tree in the list of trees you consider).
Max depth¶
In fitting a decision tree, the “max depth” hyperparameter sets the maximum possible depth for the tree you fit. Which is true about how this hyperparameter for fitting decision trees affects the quality of the estimator?
More depth typically leads to higher bias and higher variance.
More depth typically leads to higher bias and lower variance.
More depth typically leads to lower bias and higher variance.
More depth typically leads to lower bias and lower variance.
Solutions
Lower bias, higher variance. More depth means more flexibility (and more parameters!).
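To make this concrete, here is a small simulation sketch (the true function $\sin(3x)$, the noise level, and all other settings are invented for illustration): at a fixed test point, shallow trees are stable across training sets but badly biased, while deep trees are nearly unbiased on average but vary much more.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
f = lambda x: np.sin(3 * x)       # assumed "true" regression function
x0 = np.array([[0.5]])            # a fixed test point

for depth in [1, 8]:
    preds = []
    for _ in range(500):          # draw many training sets from the same distribution
        X = rng.uniform(0, 2, size=(50, 1))
        y = f(X.ravel()) + rng.normal(0, 0.3, size=50)
        preds.append(DecisionTreeRegressor(max_depth=depth).fit(X, y).predict(x0)[0])
    preds = np.array(preds)
    bias2 = (preds.mean() - f(0.5)) ** 2
    print(f"max_depth={depth}: bias^2 = {bias2:.4f}, variance = {preds.var():.4f}")
```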
Min leaf samples¶
“Min leaf samples” is a hyperparameter sometimes used in fitting decision trees. If set, it ensures that we only consider trees with a particular relationship to the training data. Specifically, if the parameter value is set to $m$, it requires that for each leaf there are at least $m$ training points inside the region associated with that leaf. Which is true about how this hyperparameter for fitting decision trees affects the quality of the estimator?
Higher $m$ typically leads to higher bias and higher variance.
Higher $m$ typically leads to higher bias and lower variance.
Higher $m$ typically leads to lower bias and higher variance.
Higher $m$ typically leads to lower bias and lower variance.
Solutions
Higher bias, lower variance. Higher $m$ means there are more samples in each leaf, so we get relatively low-variance estimates of what’s going on in each leaf. But it also means that we cannot have too many leaves (indeed, with $n$ training points we can have at most $n / m$ leaves). And fewer leaves means less flexibility.
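scikit-learn calls this hyperparameter `min_samples_leaf`. A quick sketch (the synthetic data is invented for illustration) showing how the number of leaves shrinks as $m$ grows:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 2, size=(100, 1))
y = np.sin(3 * X.ravel()) + rng.normal(0, 0.3, size=100)

for m in [1, 5, 25]:
    tree = DecisionTreeRegressor(min_samples_leaf=m).fit(X, y)
    print(m, tree.get_n_leaves())  # leaf count drops; it can never exceed n/m = 100/m
```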
Trees versus logistic regression¶
Compared with the logistic regression estimator, which best characterizes decision tree classification estimators?
Trees are higher bias and higher variance
Trees are higher bias and lower variance
Trees are lower bias and higher variance
Trees are lower bias and lower variance
None of the above
Solutions
None of the above. Depending on how you set the hyperparameters, trees can be very high variance or very low variance. The amount of bias also depends a lot on the true underlying distribution you are trying to estimate. A tree estimator typically has a lot of bias if the real relationship between the predictors and the response is linear. On the other hand, trees are capable of being very low bias if you allow enough leaves.
Leaf nodes¶
Consider a decision tree for a problem with $p$ numerical predictors. Each leaf node in a decision tree is associated with a subset $R_\ell \subseteq \mathbb{R}^p$. Assume the tree has $L$ leaf nodes. Which is true?
$\bigcup_{\ell=1}^{L} R_\ell = \mathbb{R}^p$, but $R_\ell \cap R_{\ell'} = \emptyset$ whenever $\ell \neq \ell'$
$\bigcup_{\ell=1}^{L} R_\ell = \mathbb{R}^p$, but possibly $R_\ell \cap R_{\ell'} \neq \emptyset$ for some $\ell \neq \ell'$
Solutions
The first answer is correct. The leaf regions partition $\mathbb{R}^p$: there is a unique leaf associated with every possible value of $x$.
An internal node¶
Each internal node of a decision tree is associated with a rule. The rule is used to decide whether a given query point $x$ should go to the left child or the right child. In the context of numerical predictors, which of the following best describes the set of possible rules that might be associated with an internal node?
LEFT if $x_1 \le c$ and RIGHT otherwise (for some $c \in \mathbb{R}$)
LEFT if $x_j \le c$ and RIGHT otherwise (for some feature $j$ and some $c \in \mathbb{R}$)
LEFT if $x_j + x_k \le c$ and RIGHT otherwise (for some features $j, k$ and some $c \in \mathbb{R}$)
Solutions
Middle answer is correct.
There are some tree-fitting packages with native support for categorical predictors that can consider other kinds of rules (e.g., LEFT if $x_j \in S$ and RIGHT otherwise, for some subset $S$ of the categories). But for numerical predictors this is pretty much all you will see.
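You can see this rule form directly in a fitted scikit-learn tree (a sketch on synthetic data we made up): every internal node prints as a comparison of one feature against one threshold.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 1] > 0.5).astype(float)  # response driven by feature x1 alone

tree = DecisionTreeRegressor(max_depth=2).fit(X, y)
print(export_text(tree, feature_names=["x0", "x1", "x2"]))
# Every internal node is of the form "x_j <= c", i.e. the middle rule above.
```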