
Bayesian classifiers

Gaussian posteriors vs logistic regression

In a two-class classification problem in which the class-conditional distributions $f(X|Y=0)$ and $f(X|Y=1)$ are modelled as Gaussians with equal covariance matrices, the posterior probabilities always turn out to be the same as those predicted by logistic regression. True or false?
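
To explore this claim empirically, here is a minimal sketch (assuming NumPy and scikit-learn are available; the class means, shared covariance, and sample sizes are arbitrary illustrative choices): it simulates two Gaussian classes with a common covariance matrix and compares the posterior probabilities estimated by LDA and by (nearly unpenalized) logistic regression.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 5000
cov = np.array([[1.0, 0.3], [0.3, 1.0]])          # shared covariance matrix
X0 = rng.multivariate_normal([0.0, 0.0], cov, n)  # class 0
X1 = rng.multivariate_normal([1.0, 1.0], cov, n)  # class 1
X = np.vstack([X0, X1])
y = np.concatenate([np.zeros(n), np.ones(n)])

lda = LinearDiscriminantAnalysis().fit(X, y)
logreg = LogisticRegression(C=1e6, max_iter=10_000).fit(X, y)  # ~unpenalized

# compare estimated posteriors P(Y=1 | x) at a few query points
query = rng.multivariate_normal([0.5, 0.5], cov, 10)
print(np.c_[lda.predict_proba(query)[:, 1], logreg.predict_proba(query)[:, 1]])
```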

LDA vs logistic regression

You fit LDA on a classification dataset with one predictor, $X$. Assume the response $Y$ takes on two possible values, $Y \in \{-1,+1\}$. Let $\hat p(y|x)$ denote the estimated conditional probability density function. True or false: there exist $\beta_0,\beta_1$ such that

$$\log \frac{\hat p(+|x)}{\hat p(-|x)} = \beta_0 + \beta_1 x$$

QDA vs logistic regression with derived features

You fit QDA on a classification dataset with one predictor, $X$. Assume the response $Y$ takes on two possible values, $Y \in \{-1,+1\}$. Let $\hat p(y|x)$ denote the estimated conditional probability density function. True or false: there exist $\beta_0,\beta_1,\beta_2$ such that

$$\log \frac{\hat p(+|x)}{\hat p(-|x)} = \beta_0 + \beta_1 x + \beta_2 x^2$$

QDA vs logistic regression with derived features, II

You fit QDA on a classification dataset with two predictors, $X_1,X_2$. Assume the response $Y$ takes on two possible values, $Y \in \{-1,+1\}$. Let $\hat p(y|x)$ denote the estimated conditional probability density function. True or false: there must exist $\beta_0,\beta_1,\beta_2,\beta_3,\beta_4$ such that

$$\log \frac{\hat p(+|x)}{\hat p(-|x)} = \beta_0 + \beta_1 x_1 + \beta_2 x_1^2 + \beta_3 x_2 + \beta_4 x_2^2$$

LDA vs logistic regression with derived features

You fit LDA on a classification dataset with one predictor, $X$. Assume the response $Y$ takes on two possible values, $Y \in \{-1,+1\}$. Let $\hat p(y|x)$ denote the estimated conditional probability density function. True or false: there exist $\beta_0,\beta_1,\beta_2$ such that

$$\log \frac{\hat p(+|x)}{\hat p(-|x)} = \beta_0 + \beta_1 x + \beta_2 x^2$$

Consistent classification with quadratic decision rule

Consider a binary classification problem in which it is possible to perfectly predict the response from the predictors. In particular, assume there are two predictors and

$$y = \begin{cases} 1 & \mathrm{if}\ x_1^2 + x_2^2 > 10 \\ 0 & \mathrm{otherwise} \end{cases}$$

Which of these approaches can make the misclassification rate arbitrarily close to zero on held-out test data, if we assume that the training dataset can be as large as needed?

  1. LDA

  2. QDA

  3. KNN (with $K$ chosen by 5-fold cross-validation)

  4. Logistic regression
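
As a way to experiment with this setup, here is a rough sketch (assuming NumPy and scikit-learn; the sample size, feature range, and candidate values of $K$ are arbitrary choices): it simulates data whose label is 1 exactly when $x_1^2 + x_2^2 > 10$ and reports held-out misclassification rates for the four candidate methods.

```python
import numpy as np
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.uniform(-5, 5, size=(20_000, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 10).astype(int)  # quadratic decision rule
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

models = {
    "LDA": LinearDiscriminantAnalysis(),
    "QDA": QuadraticDiscriminantAnalysis(),
    "KNN (5-fold CV)": GridSearchCV(KNeighborsClassifier(),
                                    {"n_neighbors": [1, 5, 15, 51]}, cv=5),
    "Logistic regression": LogisticRegression(max_iter=10_000),
}
for name, model in models.items():
    err = 1 - model.fit(X_tr, y_tr).score(X_te, y_te)
    print(f"{name:20s} test misclassification rate: {err:.3f}")
```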

Consistent classification with quadratic decision rule and derived features

Consider a binary classification problem in which it is possible to perfectly predict the response from the predictors. In particular, assume there are two predictors and

$$y = \begin{cases} 1 & \mathrm{if}\ x_1^2 + x_2^2 > 10 \\ 0 & \mathrm{otherwise} \end{cases}$$

To obtain flexible models, a data scientist augments the feature space with a derived feature $x_3 = x_1^2 + x_2^2$. Using the new data (with three features), which of these approaches can make the misclassification rate arbitrarily close to zero on held-out test data, if we assume that the training dataset can be as large as needed?

  1. LDA

  2. QDA

  3. KNN (with $K$ chosen by 5-fold cross-validation)

  4. Logistic regression
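
A companion sketch (same assumptions as the simulation above: NumPy, scikit-learn, and arbitrary simulation settings) repeats the experiment on the augmented design matrix that includes the derived feature $x_3 = x_1^2 + x_2^2$.

```python
import numpy as np
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.uniform(-5, 5, size=(20_000, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 10).astype(int)
X_aug = np.c_[X, X[:, 0] ** 2 + X[:, 1] ** 2]  # derived feature x3 = x1^2 + x2^2
X_tr, X_te, y_tr, y_te = train_test_split(X_aug, y, test_size=0.5, random_state=0)

models = {
    "LDA": LinearDiscriminantAnalysis(),
    "QDA": QuadraticDiscriminantAnalysis(),
    "KNN (5-fold CV)": GridSearchCV(KNeighborsClassifier(),
                                    {"n_neighbors": [1, 5, 15, 51]}, cv=5),
    "Logistic regression": LogisticRegression(max_iter=10_000),
}
for name, model in models.items():
    err = 1 - model.fit(X_tr, y_tr).score(X_te, y_te)
    print(f"{name:20s} test misclassification rate: {err:.3f}")
```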

LDA by hand

Suppose that we wish to predict whether a given stock will issue a dividend this year (“Yes” or “No”) based on $X$, last year’s percentage profit. We examine a large number of companies and discover that the mean value of $X$ for companies that issued a dividend was $\bar{X} = 20$, while the mean for those that didn’t was $\bar{X} = 10$. In addition, the variance of $X$ for each of these two sets of companies was $\sigma^2 = 25$. Finally, 75% of companies issued dividends. Assuming that $X$ follows a normal distribution, predict the probability that a company will issue a dividend this year given that its percentage profit was $X = 14$ last year.

Assume $\phi(x;\mu,\sigma^{2})$ denotes the Gaussian density for a value $x$, with mean $\mu$ and variance $\sigma^{2}$. That is,

$$\phi(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi \sigma^2}} \exp\left\{-\frac{(x-\mu)^2}{2\sigma^2}\right\}.$$

You may write your answer using $\phi$ (the answer does not need to be reduced to a number).
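
If you want to sanity-check a hand computation, here is a small plain-Python sketch (the function name `phi` and the printed quantity are just illustrative): it plugs the problem’s class means, shared variance, and prior into the Bayes-rule posterior.

```python
from math import exp, pi, sqrt

def phi(x, mu, sigma2):
    """Gaussian density with mean mu and variance sigma2."""
    return exp(-(x - mu) ** 2 / (2 * sigma2)) / sqrt(2 * pi * sigma2)

prior_yes, prior_no = 0.75, 0.25
numerator = prior_yes * phi(14, 20, 25)               # dividend ("Yes") class
denominator = numerator + prior_no * phi(14, 10, 25)  # both classes
print(numerator / denominator)  # estimated P(dividend = Yes | X = 14)
```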

LDA by hand, II

In the fictional region of Plantasia, agronomists are working to distinguish between two species of magical herbs, Gillyweed and Silverleaf, based on the level of a magical essence, Enchantol, found in their stems. They’ve decided to employ Linear Discriminant Analysis (LDA) for this task, leveraging the fact that Enchantol levels follow a normal distribution in both herb species. The concentration of Enchantol is recorded in enchantment units (EU). The researchers have documented the following details:

  1. The prior probability of encountering Gillyweed is 0.65, and Silverleaf is 0.35.

  2. The mean Enchantol concentration in Gillyweed is 15 EU, and in Silverleaf, it is 30 EU.

  3. The variance of Enchantol concentration is consistently 20 EU$^2$ for both species.

What is the LDA estimate of the Bayes optimal decision rule? Assume $\phi(x;\mu,\sigma^{2})$ denotes the Gaussian density for a concentration $x$, with mean $\mu$ and variance $\sigma^{2}$.

  1. Classify as Gillyweed if $\phi(x;15,20)\times 0.65 > \phi(x;30,20)\times 0.35$

  2. Classify as Gillyweed if $\log(\phi(x;15,20)+0.65) > \log(\phi(x;30,20)+0.35)$

  3. Classify as Gillyweed if $\log(\phi(x;15,20))\times\log(0.65) > \log(\phi(x;30,20))\times\log(0.35)$

  4. Classify as Gillyweed if $\phi(x;15,20)\times\log(0.65) > \phi(x;30,20)\times\log(0.35)$

Bayes rule

Imagine machine learning researchers have developed a new binary classification method called Octonian Discriminant Analysis (ODA). The method first estimates a class-conditional probability density function $f_{y}(x)$ and a prior probability $\pi_{y}$ for positive and negative classes, $y\in\{0,1\}$. Unlike in LDA or QDA, the class-conditional densities are not Gaussian. ODA uses Bayes rule to predict the conditional distribution of $Y$ given $X$. Which is true about the ODA estimate, $\hat{p}$?

  1. $\hat{p}(1|x)=\frac{\pi_{1}f_{1}(x)}{\pi_{0}f_{0}(x)}$

  2. $\hat{p}(1|x)=\pi_{1}f_{1}(x)$

  3. $\log\frac{\hat{p}(1|x)}{\hat{p}(0|x)}=\log\pi_{1}+\log f_{1}(x)-\log\pi_{0}-\log f_{0}(x)$

  4. $\frac{\hat{p}(1|x)}{\hat{p}(0|x)}=\pi_{1}f_{1}(x)-\pi_{0}f_{0}(x)$

  5. None of the above

LDA with binary features

Consider a dataset with eight binary predictor variables and one binary response. Assume the binary predictor variables are coded using 0s and 1s. Which of the following is most true?

  1. There is reason to hope that LDA will perform well in this context.

  2. We cannot apply LDA in this context, because the covariance of a binary random vector is not uniquely defined.

  3. We probably should not apply LDA in this context, because it is unlikely that the class-conditional covariances will be exactly the same for the two classes.

  4. We probably should not apply LDA in this context, because the class-conditional distributions do not admit probability density functions.

Fisher’s discriminant plot

Fisher’s discriminant plot gives us a way to visualize our data even if we have more than two predictor features. The key idea is to project our feature space down to a lower-dimensional representation that we can actually look at. For example, given a classification problem with $K=3$ possible values for the response, Fisher’s discriminant plot is designed to identify a two-dimensional representation of our input space such that...

  1. Between-class variability appears small while within-class variability appears large.

  2. Between-class variability appears large while within-class variability appears small.

  3. The class-conditional variance appears isotropic (equal variance in all directions).

  4. The between-class variance appears isotropic (equal variance in all directions).
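
For concreteness, here is a minimal sketch (assuming scikit-learn and matplotlib; the iris data is just a convenient built-in example with $K=3$ classes and four features) of how such a plot is typically produced: project the features onto the two leading discriminant directions and scatter-plot the projected points coloured by class.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)  # 4 predictor features, 3 classes
Z = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)

plt.scatter(Z[:, 0], Z[:, 1], c=y)
plt.xlabel("discriminant direction 1")
plt.ylabel("discriminant direction 2")
plt.title("Fisher's discriminant plot")
plt.show()
```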

Naive Bayes assumption

Consider a classification setting in which one seeks to predict a response YY from pp predictor features XX. Naive Bayes is a classification method that builds its predictions using a certain independence assumption. If this assumption holds, which of the following is guaranteed to be true?

  1. $\mathbb{E}[\prod_{i=1}^p X_i] = \prod_{i=1}^p \mathbb{E}[X_i]$.

  2. $\mathbb{E}[\prod_{i=1}^p X_i|Y=y] = \prod_{i=1}^p \mathbb{E}[X_i|Y=y]$ for each $y$.

  3. $\prod_{i=1}^p \mathbb{E}[Y|X_i=x_i] = \mathbb{E}[Y|X=x]$ for any $x$.

  4. $\prod_{i=1}^p \mathbb{P}(Y=y|X_i=x_i) = \mathbb{P}(Y=y|X=x)$ for any $y,x$.

Class conditional for categorical predictor

Consider a binary classification task with two categorical predictors. You would like to apply a naive Bayes estimator. Your first step is to estimate the class-conditional distribution for each predictor and each class. Let

$$\big((X_{1,1},X_{1,2},Y_1),\ (X_{2,1},X_{2,2},Y_2),\ \ldots,\ (X_{n,1},X_{n,2},Y_n)\big)$$

denote your dataset of $n$ samples. Assume that $X_{i,1} \in \{1,2,3,\ldots, n_1\}$ and $X_{i,2} \in \{1,2,3,\ldots, n_2\}$.

Which would be the most suitable estimator for $\mathbb{P}(X_{i,1}=x|Y=y)$?

  1. $\sum_i \mathbb{I}(y_i=y\ \mathrm{and}\ x_{i,1}=x) \,/\, \sum_i \mathbb{I}(y_i=y)$

  2. $\sum_i \mathbb{I}(y_i=y) \,/\, \sum_i \mathbb{I}(y_i=y,\ x_{i,1}=x)$

  3. $\sum_i \mathbb{I}(y_i=y\ \mathrm{and}\ x_{i,1}=x) \,/\, \sum_{i,y'} \mathbb{I}(y_i=y')$

  4. $\sum_i \big(\mathbb{I}(y_i=y\ \mathrm{and}\ x_{i,1}=x) / \sum_{y'} \mathbb{I}(y_i=y')\big)$
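
As a reference point, here is a short sketch (assuming NumPy; the arrays `x1` and `y` are hypothetical stand-ins for the first predictor column and the responses) of how an empirical class-conditional frequency for a categorical predictor can be computed from counts.

```python
import numpy as np

def cond_freq(x1, y, x, y_val):
    """Fraction of samples with X_1 = x among the samples with Y = y_val."""
    in_class = (y == y_val)
    return np.sum((x1 == x) & in_class) / np.sum(in_class)

# toy usage with hypothetical data
x1 = np.array([1, 2, 1, 3, 2, 1])
y = np.array([0, 0, 1, 1, 1, 0])
print(cond_freq(x1, y, x=1, y_val=0))  # 2 of the 3 samples with y = 0 have x1 = 1
```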