
Trees

Sunlight cafe, part I

Imagine you run a small cafe, and you’re trying to predict the daily revenue ($Y$) based on the number of hours of sunlight for the day ($X$). You believe that more sunlight encourages more people to go out, thereby potentially increasing your revenue.

You’d like to approximate $f^*(x)=\mathbb{E}[Y|X=x]$ and you’re going to do so by fitting a regression stump (i.e. a regression tree with one internal node). Which of the following describes the class of estimators that you could obtain?

a. $f(x;\beta) = \beta_0 + \beta_1 x$, with $\beta \in \mathbb{R}^2$

b. $f(x;\beta,s) = \beta_0 + \beta_1 x + \mathbb{I}(x\geq s)\beta_2$, with $\beta \in \mathbb{R}^3$ and $s \in \mathbb{R}$

c. $f(x;a,b,s) = a + \mathbb{I}(x\geq s)b$, for $a,b,s \in \mathbb{R}$

d. $f(x;\beta,\gamma,s) = (\beta_0 + \beta_1 x)\mathbb{I}(x<s) + (\gamma_0 + \gamma_1 x)\mathbb{I}(x\geq s)$, with $\beta,\gamma \in \mathbb{R}^2$ and $s \in \mathbb{R}$

Sunlight cafe, part II

Consider the dataset

| **Sample** | **Hours of Sunlight** | **Revenue** |
|------------|-----------------------|-------------|
| #1         | 3                     | 4.5         |
| #2         | 7                     | 5.0         |
| #3         | 8                     | 7.0         |

Fit a regression stump to this dataset to predict revenue from hours of sunlight. What is the estimate of $\mathbb{E}[Y|X=x]$?
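
If you want to check a by-hand fit numerically, here is a minimal sketch assuming scikit-learn (the exercise itself does not prescribe any library); capping the tree at two leaves yields a stump.

```python
# Minimal sketch (assumes scikit-learn): a regression stump is a tree with one
# internal node, which we can get by capping the tree at two leaves.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.array([[3.0], [7.0], [8.0]])  # hours of sunlight
y = np.array([4.5, 5.0, 7.0])        # revenue

stump = DecisionTreeRegressor(max_leaf_nodes=2).fit(X, y)

# The stump predicts a constant on each side of its single split.
print(stump.predict(np.array([[4.0], [7.5]])))
```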

Woodchuck behavior, part I

Imagine you run a small woodchuck sanctuary, and you’re trying to predict whether a woodchuck will come out ($Y \in \{\mathrm{no},\mathrm{yes}\}$) based on the number of hours of sunlight for the day ($X$). You believe that more sunlight encourages woodchucks to come out more.

You’d like to estimate $p(y|x)=\mathbb{P}(Y=\mathrm{yes}|X=x)$, and you’re going to do so by fitting a decision tree.

In fitting this tree to a set of training data $\mathcal{D}=\{(X_1,Y_1),\ldots,(X_n,Y_n)\}$, we must search among different trees for one that best fits the training data. In this search, which of the following would be a typical loss? That is, which of the following would be the most typical way to characterize how poorly a given estimate $\hat p(y|x)$ fits the data?

  1. $L = \sum_i \hat p(y_i|x_i)$

  2. $L = -\sum_i \hat p(y_i|x_i)$

  3. $L = \sum_i \log \hat p(y_i|x_i)$

  4. $L = -\sum_i \log \hat p(y_i|x_i)$

Woodchuck behavior, part II

Consider the dataset

| **Sample** | **Hours of Sunlight** | **Woodchuck appears** |
|------------|-----------------------|-----------------------|
| #1         | 3                     | No                    |
| #2         | 7                     | No                    |
| #3         | 8                     | Yes                   |

If we fit a classification stump to this dataset to predict whether a woodchuck appears from hours of sunlight, what would the final estimate be?
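
As before, a by-hand fit can be checked numerically; a minimal sketch assuming scikit-learn, again capping the tree at two leaves and reading off the leaf proportions with predict_proba:

```python
# Minimal sketch (assumes scikit-learn): a classification stump via max_leaf_nodes=2.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

X = np.array([[3.0], [7.0], [8.0]])   # hours of sunlight
y = np.array(["no", "no", "yes"])     # woodchuck appears

stump = DecisionTreeClassifier(max_leaf_nodes=2).fit(X, y)

# Each leaf stores the empirical class proportions of the training points it
# contains, which is what predict_proba reports.
print(stump.classes_)
print(stump.predict_proba(np.array([[5.0], [7.5]])))
```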

A tree: test and train

Say you fit a decision tree to predict a numerical response variable $Y$ from predictor variables $X$, based on a dataset $\mathcal{D}^{(tr)}$. You fit the tree with the “max leaves” parameter set to 9. You also have a held-out dataset $\mathcal{D}^{(te)}$. Which will typically be true?

a. Among trees with 9 leaves, your estimate will have the smallest mean squared predictive error on the training data (but not the testing data).

b. Among trees with 9 leaves, your estimate will have the smallest mean squared predictive error on the training data and the testing data.

c. Among trees with 9 leaves, your estimate will have the smallest mean squared predictive error on the testing data (but not the training data).

d. None of the above.
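
One way to explore this empirically is to fit such a tree on synthetic stand-ins for $\mathcal{D}^{(tr)}$ and $\mathcal{D}^{(te)}$ and compare the two errors. A minimal sketch, assuming scikit-learn and made-up data:

```python
# Minimal sketch (assumes scikit-learn; the data are made up to stand in for the
# training and held-out sets): fit a tree with at most 9 leaves and compare errors.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X_tr = rng.uniform(0, 10, size=(200, 1))
y_tr = np.sin(X_tr[:, 0]) + rng.normal(scale=0.3, size=200)
X_te = rng.uniform(0, 10, size=(200, 1))
y_te = np.sin(X_te[:, 0]) + rng.normal(scale=0.3, size=200)

tree = DecisionTreeRegressor(max_leaf_nodes=9).fit(X_tr, y_tr)
print("train MSE:", mean_squared_error(y_tr, tree.predict(X_tr)))
print("test MSE: ", mean_squared_error(y_te, tree.predict(X_te)))
```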

Tree augmentation

You are considering a regression problem with 10 binary predictor variables. You have been given a decision tree with 20 leaf nodes and a dataset $\mathcal{D}$ with 300 samples. You would like to augment the tree to make it a little bit bigger and reduce the mean squared predictive error on the dataset.

To perform the augmentation, you will search among a set of possible trees, each a bit larger than the tree you started with. How many trees will you consider?

Max depth

In fitting a decision tree, the “max depth” hyperparameter sets the maximum possible depth for the tree you fit. Which is true about how this hyperparameter affects the quality of the estimator?

  1. More depth typically leads to higher bias and higher variance.

  2. More depth typically leads to higher bias and lower variance.

  3. More depth typically leads to lower bias and higher variance.

  4. More depth typically leads to lower bias and lower variance.
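
To probe this empirically, one can sweep the hyperparameter on made-up data; a minimal sketch assuming scikit-learn (the depth grid below is an arbitrary choice):

```python
# Minimal sketch (assumes scikit-learn and made-up data): sweep max_depth and
# compare training and held-out mean squared error.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X_tr = rng.uniform(0, 10, size=(300, 1))
y_tr = np.sin(X_tr[:, 0]) + rng.normal(scale=0.3, size=300)
X_te = rng.uniform(0, 10, size=(300, 1))
y_te = np.sin(X_te[:, 0]) + rng.normal(scale=0.3, size=300)

for depth in [1, 2, 4, 8, 16]:
    tree = DecisionTreeRegressor(max_depth=depth).fit(X_tr, y_tr)
    print(depth,
          round(mean_squared_error(y_tr, tree.predict(X_tr)), 3),
          round(mean_squared_error(y_te, tree.predict(X_te)), 3))
```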

Min leaf samples

The “min leaf samples” hyperparameter is sometimes used in fitting decision trees. If set, it ensures that we only consider trees with a particular relationship to the training data. Specifically, if the parameter value is set to $\tilde n$, it requires that each leaf have at least $\tilde n$ training points inside the region associated with that leaf. Which is true about how this hyperparameter affects the quality of the estimator?

  1. Higher $\tilde n$ typically leads to higher bias and higher variance.

  2. Higher $\tilde n$ typically leads to higher bias and lower variance.

  3. Higher $\tilde n$ typically leads to lower bias and higher variance.

  4. Higher $\tilde n$ typically leads to lower bias and lower variance.
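
In scikit-learn this hyperparameter is called min_samples_leaf (an implementation detail, not part of the exercise); a minimal sketch of an empirical sweep on made-up data:

```python
# Minimal sketch (assumes scikit-learn, where this hyperparameter is min_samples_leaf):
# sweep it and compare training and held-out mean squared error on made-up data.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)
X_tr = rng.uniform(0, 10, size=(300, 1))
y_tr = np.cos(X_tr[:, 0]) + rng.normal(scale=0.3, size=300)
X_te = rng.uniform(0, 10, size=(300, 1))
y_te = np.cos(X_te[:, 0]) + rng.normal(scale=0.3, size=300)

for n_tilde in [1, 5, 20, 80]:
    tree = DecisionTreeRegressor(min_samples_leaf=n_tilde).fit(X_tr, y_tr)
    print(n_tilde,
          round(mean_squared_error(y_tr, tree.predict(X_tr)), 3),
          round(mean_squared_error(y_te, tree.predict(X_te)), 3))
```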

Trees versus logistic regression

Compared with the logistic regression estimator, which best characterizes decision tree classification estimators?

  1. Trees are higher bias and higher variance

  2. Trees are higher bias and lower variance

  3. Trees are lower bias and higher variance

  4. Trees are lower bias and lower variance

  5. None of the above

Leaf nodes

Consider a decision tree for a problem with $p$ numerical predictors. Each leaf node $i$ in a decision tree is associated with a subset $S_i \subset \mathbb{R}^p$. Assume the tree has $k$ leaf nodes. Which is true?

  1. $\bigcup_{i=1}^k S_i = \mathbb{R}^p$

  2. $\bigcup_{i=1}^k S_i \neq \mathbb{R}^p$, but $\bigcup_{i=1}^k S_i \subset \mathbb{R}^p$

  3. $\bigcup_{i=1}^k S_i \neq \mathbb{R}^p$, but $\bigcup_{i=1}^k S_i \supset \mathbb{R}^p$

An internal node

Each internal node of a decision tree is associated with a rule. The rule is used to decide whether a given query point $x$ should go to the left child or the right child. In the context of $p$ numerical predictors, which of the following best describes the set of possible rules that might be associated with an internal node?

  1. LEFT if $\beta_0 + \sum_{i=1}^p x_i \beta_i > 0$ and RIGHT otherwise (for some $\beta \in \mathbb{R}^{p+1}$)

  2. LEFT if $x_i < s$ and RIGHT otherwise (for some feature $i \in \{1,\ldots,p\}$ and some $s \in \mathbb{R}$)

  3. LEFT if $x_i > x_j + s$ and RIGHT otherwise (for some features $i,j \in \{1,\ldots,p\}$ and some $s \in \mathbb{R}$)
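
For reference, one common implementation (scikit-learn, used here only as an illustration) exposes the rule stored at each internal node of a fitted tree like this:

```python
# Minimal sketch (assumes scikit-learn): inspect the rule stored at each internal
# node of a fitted tree via tree_.feature and tree_.threshold.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))            # p = 3 numerical predictors
y = (X[:, 0] + X[:, 2] > 0).astype(int)  # made-up labels

t = DecisionTreeClassifier(max_depth=2).fit(X, y).tree_
for node in range(t.node_count):
    if t.children_left[node] != -1:      # internal nodes; leaves have child = -1
        print(f"node {node}: LEFT if x[{t.feature[node]}] <= {t.threshold[node]:.3f}")
```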