Bayesian classifiers
Gaussian posteriors vs logistic regression
In a two-class classification problem where the class-conditional distributions $p(x \mid y = 0)$ and $p(x \mid y = 1)$ are modelled as Gaussians with equal covariance matrices, the posterior probabilities always turn out to be the same as those predicted by logistic regression. True or false?
Solutions
False. The problem essentially describes LDA. The final predictions of LDA have the same parametric form as those produced by logistic regression, but LDA is not equivalent to logistic regression! The parameters get estimated differently by the two methods. LDA will tend to work better if the class-conditionals really are Gaussians with a common covariance; otherwise, it is hard to say which will do better.
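The distinction can be seen directly in code. The sketch below is illustrative only (it assumes scikit-learn and simulated data, neither of which appears in the question): it fits both methods to data that actually satisfy the LDA assumptions and prints the two sets of coefficients, which share a parametric form but do not coincide.

```python
# Compare LDA and logistic regression on data drawn from the LDA model:
# two Gaussian class-conditionals with a shared covariance matrix.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 500
cov = [[1.0, 0.3], [0.3, 1.0]]
X0 = rng.multivariate_normal(mean=[0, 0], cov=cov, size=n)
X1 = rng.multivariate_normal(mean=[1, 1], cov=cov, size=n)
X = np.vstack([X0, X1])
y = np.r_[np.zeros(n), np.ones(n)]

lda = LinearDiscriminantAnalysis().fit(X, y)
logreg = LogisticRegression().fit(X, y)

# Same parametric form (an intercept plus a linear term), different estimates.
print("LDA    :", lda.intercept_, lda.coef_)
print("LogReg :", logreg.intercept_, logreg.coef_)
```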
LDA vs logistic regression
You fit LDA on a classification dataset with one predictor, $x$. Assume the response takes on two possible values, $y \in \{0, 1\}$. Let $\hat{p}(y \mid x)$ denote the estimated conditional probability density function. True or false: there exist $\beta_0, \beta_1$ such that

$$\log \frac{\hat{p}(1 \mid x)}{\hat{p}(0 \mid x)} = \beta_0 + \beta_1 x \quad \text{for every } x.$$
Solutions
True.
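To see why, write out the log-odds under the fitted LDA model. Using illustrative notation, let $\hat\pi_0, \hat\pi_1$ denote the estimated priors, $\hat\mu_0, \hat\mu_1$ the estimated class means, $\hat\sigma^2$ the shared variance estimate, and $\phi(x; \mu, \sigma^2)$ the Gaussian density. Because both classes share the same variance, the quadratic terms in $x$ cancel:

$$\log \frac{\hat{p}(1 \mid x)}{\hat{p}(0 \mid x)} = \log \frac{\hat\pi_1 \, \phi(x; \hat\mu_1, \hat\sigma^2)}{\hat\pi_0 \, \phi(x; \hat\mu_0, \hat\sigma^2)} = \underbrace{\log\frac{\hat\pi_1}{\hat\pi_0} + \frac{\hat\mu_0^2 - \hat\mu_1^2}{2\hat\sigma^2}}_{\beta_0} + \underbrace{\frac{\hat\mu_1 - \hat\mu_0}{\hat\sigma^2}}_{\beta_1}\, x.$$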
QDA vs logistic regression with derived features
You fit QDA on a classification dataset with one predictor, $x$. Assume the response takes on two possible values, $y \in \{0, 1\}$. Let $\hat{p}(y \mid x)$ denote the estimated conditional probability density function. True or false: there exist $\beta_0, \beta_1, \beta_2$ such that

$$\log \frac{\hat{p}(1 \mid x)}{\hat{p}(0 \mid x)} = \beta_0 + \beta_1 x + \beta_2 x^2 \quad \text{for every } x.$$
Solutions
True.
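With the same notation as in the previous solution, but allowing each class its own variance estimate $\hat\sigma_0^2 \neq \hat\sigma_1^2$, the quadratic terms no longer cancel:

$$\log \frac{\hat{p}(1 \mid x)}{\hat{p}(0 \mid x)} = \log \frac{\hat\pi_1 \hat\sigma_0}{\hat\pi_0 \hat\sigma_1} + \frac{(x - \hat\mu_0)^2}{2\hat\sigma_0^2} - \frac{(x - \hat\mu_1)^2}{2\hat\sigma_1^2},$$

which is a quadratic polynomial in $x$, so suitable $\beta_0, \beta_1, \beta_2$ always exist.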
QDA vs logistic regression with derived features, II
You fit QDA on a classification dataset with two predictors, $x_1$ and $x_2$. Assume the response takes on two possible values, $y \in \{0, 1\}$. Let $\hat{p}(y \mid x_1, x_2)$ denote the estimated conditional probability density function. True or false: there must exist coefficients such that
Solutions
LDA vs logistic regression with derived features
You fit LDA on a classification dataset with one predictor, $x$. Assume the response takes on two possible values, $y \in \{0, 1\}$. Let $\hat{p}(y \mid x)$ denote the estimated conditional probability density function. True or false: there exist $\beta_0, \beta_1, \beta_2$ such that

$$\log \frac{\hat{p}(1 \mid x)}{\hat{p}(0 \mid x)} = \beta_0 + \beta_1 x + \beta_2 x^2 \quad \text{for every } x.$$
Solutions
True. But this is sort of a trick question. You can indeed find such $\beta_0, \beta_1, \beta_2$. But it will always be the case that $\beta_2 = 0$, since LDA's log-odds are exactly linear in $x$.
Consistent classification with quadratic decision rule
Consider a binary classification problem in which it is possible to perfectly predict the response from the predictors. In particular, assume there are two predictors, $x_1$ and $x_2$, and the response is determined by a quadratic decision rule in those predictors.
Which of these approaches can make the misclassification rate arbitrarily close to zero on held-out test data, if we assume that the training dataset can be as large as needed?
LDA
QDA
KNN (with $k$ chosen by 5-fold cross-validation)
Logistic regression
Solutions
LDA and logistic regression will fail badly. QDA and KNN would both be fine.
Aside. You wouldn’t actually have to choose $k$ by cross-validation, due to some special properties of this problem. In this very special case, because the relationship between $x$ and $y$ is deterministic, you could still get a consistent estimator by setting $k = 1$. However, this fact is beyond the scope of what you will be expected to consider for the midterm.
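A simulation can make the contrast concrete. The sketch below is illustrative only: the problem's exact quadratic rule is not reproduced above, so it assumes $y = \mathbb{1}\{x_1 x_2 > 0\}$ as a stand-in deterministic quadratic rule, and it uses scikit-learn, which the question does not mention. With a large training set, QDA and KNN drive the test error toward zero, while LDA and logistic regression (whose decision boundaries are linear, and which see identical class means here) do not.

```python
# Compare the four methods on a deterministic quadratic decision rule.
import numpy as np
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

def make_data(n):
    X = rng.uniform(-1, 1, size=(n, 2))
    y = (X[:, 0] * X[:, 1] > 0).astype(int)  # labels follow a deterministic quadratic rule
    return X, y

X_train, y_train = make_data(20_000)
X_test, y_test = make_data(5_000)

models = {
    "LDA": LinearDiscriminantAnalysis(),
    "QDA": QuadraticDiscriminantAnalysis(),
    "KNN (k=10)": KNeighborsClassifier(n_neighbors=10),
    "Logistic regression": LogisticRegression(max_iter=1000),
}
for name, model in models.items():
    err = 1 - model.fit(X_train, y_train).score(X_test, y_test)
    print(f"{name:>20s}: test error = {err:.3f}")
```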
Consistent classification with quadratic decision rule and derived features
Consider a binary classification problem in which it is possible to perfectly predict the response from the predictors. In particular, assume there are two predictors, $x_1$ and $x_2$, and the response is determined by a quadratic decision rule in those predictors.
To obtain flexible models, a data scientist augments the feature space with a derived feature. Using the new data (with three features), which of these approaches can make the misclassification rate arbitrarily close to zero on held-out test data, if we assume that the training dataset can be as large as needed?
LDA
QDA
KNN (with $k$ chosen by 5-fold cross-validation)
Logistic regression
Solutions
All of them could be fine. KNN with derived features is slightly unusual, but it will still work.
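Continuing the illustrative simulation from the previous problem (and assuming, since the original definition is not shown, that the derived feature is $x_1 x_2$, which matches the stand-in rule $y = \mathbb{1}\{x_1 x_2 > 0\}$), the linear methods also succeed once the derived feature is appended:

```python
# Append the (assumed) derived feature x1 * x2. In the augmented space the
# classes are linearly separable, so LDA and logistic regression work as well.
X_train_aug = np.column_stack([X_train, X_train[:, 0] * X_train[:, 1]])
X_test_aug = np.column_stack([X_test, X_test[:, 0] * X_test[:, 1]])

for name, model in models.items():
    err = 1 - model.fit(X_train_aug, y_train).score(X_test_aug, y_test)
    print(f"{name:>20s}: test error with derived feature = {err:.3f}")
```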
LDA by hand
Suppose that we wish to predict whether a given stock will issue a dividend this year (“Yes” or “No”) based on X, last year’s percent profit. We examine a large number of companies and discover that the mean value of X for companies that issued a dividend was , while the mean for those that didn’t was . In addition, the variance of X for each of these two sets of companies was . Finally, 75% of companies issued dividends.
Assuming that X follows a normal distribution, predict the probability that a company will issue a dividend this year given that its percentage profit was last year.
Assume $\phi(x; \mu, \sigma^2)$ denotes the Gaussian density for a value $x$, with mean $\mu$ and variance $\sigma^2$. That is,

$$\phi(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x - \mu)^2}{2\sigma^2}\right).$$
You may write your answer using ϕ (answer does not need to be reduced to a number).
Solutions
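The general form of the answer follows from Bayes' rule with the Gaussian class-conditional estimates. Writing $\hat\mu_{\text{Yes}}$ and $\hat\mu_{\text{No}}$ for the estimated class means and $\hat\sigma^2$ for the shared variance estimate (symbols chosen here for illustration), and using the prior $\hat\pi_{\text{Yes}} = 0.75$,

$$\hat{\mathbb{P}}(\text{Yes} \mid X = x) = \frac{0.75\,\phi(x; \hat\mu_{\text{Yes}}, \hat\sigma^2)}{0.75\,\phi(x; \hat\mu_{\text{Yes}}, \hat\sigma^2) + 0.25\,\phi(x; \hat\mu_{\text{No}}, \hat\sigma^2)}.$$

Plugging in the stated means, variance, and last year's percent profit for $x$ gives the requested probability, which may be left in terms of $\phi$.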
LDA by hand, II
In the fictional region of Plantasia, agronomists are working to distinguish between two species of magical herbs, Gillyweed and Silverleaf, based on the level of a magical essence, Enchantol, found in their stems. They’ve decided to employ Linear Discriminant Analysis (LDA) for this task, leveraging the fact that Enchantol levels follow a normal distribution in both herb species. The concentration of Enchantol is recorded in enchantment units (EU). The researchers have documented the following details:
The prior probability of encountering Gillyweed is 0.65, and Silverleaf is 0.35.
The mean Enchantol concentration in Gillyweed is 15 EU, and in Silverleaf, it is 30 EU.
The variance of Enchantol concentration is consistently 20 EU² for both species.
What is the LDA estimate of the Bayes optimal decision rule? Assume $\phi(x; \mu, \sigma^2)$ denotes the Gaussian density for a concentration $x$, with mean $\mu$ and variance $\sigma^2$.
Classify as Gillyweed if
Classify as Gillyweed if
Classify as Gillyweed if
Classify as Gillyweed if
Solutions
First option.
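As a quick numerical check of the rule (a sketch assuming NumPy and SciPy, not part of the original problem statement), one can compare the prior-weighted densities $0.65\,\phi(x; 15, 20)$ and $0.35\,\phi(x; 30, 20)$ directly; they cross at roughly 23.3 EU, so lower Enchantol concentrations are classified as Gillyweed.

```python
# Locate the Enchantol level at which the two prior-weighted densities cross.
import numpy as np
from scipy.stats import norm

def weighted_density_gap(x):
    # Positive => classify as Gillyweed, negative => classify as Silverleaf.
    return (0.65 * norm.pdf(x, loc=15, scale=np.sqrt(20))
            - 0.35 * norm.pdf(x, loc=30, scale=np.sqrt(20)))

grid = np.linspace(0, 45, 9001)
crossing = grid[np.argmin(np.abs(weighted_density_gap(grid)))]
print(f"Classify as Gillyweed when the concentration is below about {crossing:.2f} EU")
```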
Bayes rule
Imagine machine learning researchers have developed a new binary classification method called Octonian Discriminant Analysis (ODA). The method first estimates a class-conditional probability density function and a prior probability for each of the positive and negative classes. Unlike in LDA or QDA, the class-conditional densities are not Gaussian. ODA uses Bayes rule to predict the conditional distribution of $y$ given $x$. Which is true about the ODA estimate of $\mathbb{P}(y \mid x)$?
None of the above
Solutions
Third option.
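For reference, writing $\hat{f}_{+}$ and $\hat{f}_{-}$ for the estimated class-conditional densities and $\hat\pi_{+}$, $\hat\pi_{-}$ for the estimated priors (symbols chosen here for illustration), Bayes rule gives

$$\hat{\mathbb{P}}(y = +1 \mid x) = \frac{\hat\pi_{+}\,\hat{f}_{+}(x)}{\hat\pi_{+}\,\hat{f}_{+}(x) + \hat\pi_{-}\,\hat{f}_{-}(x)}.$$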
Aside. It is in fact possible to find very strange distributions where this sort of formula doesn’t work: there are certain distributions which cannot be expressed as a probability density function and also cannot be expressed as a probability mass function (they are neither discrete nor continuous). In such a case, the class-conditional distributions cannot be expressed by a simple function like a density or a mass function, so you have to do some extra clever stuff. But these sorts of distributions are all out-of-scope for this class.
LDA with binary features
Consider a dataset with eight binary predictor variables and one binary response. Assume the binary predictor variables are coded using 0s and 1s. Which of the following is most true?
There is reason to hope that LDA will perform well in this context.
We cannot apply LDA in this context, because the covariance of a binary random vector is not uniquely defined.
We probably should not apply LDA in this context, because it is unlikely that the class-conditional covariances will be exactly the same for the two classes.
We probably should not apply LDA in this context, because the class-conditional distributions do not admit probability density functions.
Solutions
The last one is correct. The covariance is perfectly well defined (because we have coded the predictors using 0s and 1s, so all the features can be interpreted as numbers). The fact that the features are binary does not preclude the possibility that the class-conditional covariances would be the same. The strongest argument here is that LDA models the class-conditionals as Gaussians, with nice smooth probability density functions. But here we have categorical predictors, and these are not going to look Gaussian in the least.
Fisher’s discriminant plot
Fisher’s discriminant plot gives us a way to visualize our data even if we have more than two predictor features. The key idea is to project our feature space down to a lower-dimensional representation that we can actually look at. For example, given a classification problem with $K$ possible values for the response, Fisher’s discriminant plot is designed to identify a two-dimensional representation of our input space such that...
Between-class variability appears small while within-class variability appears large.
Between-class variability appears large while within-class variability appears small.
The class-conditional variance appears isotropic (equal variance in all directions).
The between-class variance appears isotropic (equal variance in all directions).
Solutions
Second option.
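A minimal sketch of producing such a plot (assuming scikit-learn and matplotlib, with the bundled wine dataset used purely for illustration): projecting onto the top two discriminant directions spreads the class means apart while keeping each class relatively tight.

```python
# Project a multi-class dataset onto its first two discriminant directions.
import matplotlib.pyplot as plt
from sklearn.datasets import load_wine
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_wine(return_X_y=True)  # 13 features, 3 classes
Z = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)

plt.scatter(Z[:, 0], Z[:, 1], c=y)
plt.xlabel("discriminant 1")
plt.ylabel("discriminant 2")
plt.title("Fisher's discriminant plot")
plt.show()
```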
Naive Bayes assumption
Consider a classification setting in which one seeks to predict a response $y$ from predictor features $x_1, \ldots, x_p$. Naive Bayes is a classification method that builds its predictions using a certain independence assumption. If this assumption holds, which of the following is guaranteed to be true?
.
for each .
any .
any .
Solutions
The second answer is correct. The assumption is that the $x_j$'s are independent of each other after conditioning on $y$. This is also called “conditional independence”: for any $j \neq k$ and any functions $f$ and $g$, we have that $\mathbb{E}[f(x_j)\,g(x_k) \mid y] = \mathbb{E}[f(x_j) \mid y]\,\mathbb{E}[g(x_k) \mid y]$.
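Concretely, the assumption lets the joint class-conditional factor into one-dimensional pieces, which is what makes naive Bayes cheap to estimate (notation here is illustrative):

$$p(x_1, \ldots, x_p \mid y) = \prod_{j=1}^{p} p(x_j \mid y), \qquad\text{so}\qquad \hat{\mathbb{P}}(y \mid x_1, \ldots, x_p) \propto \hat\pi_y \prod_{j=1}^{p} \hat{p}(x_j \mid y).$$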
Class conditional for categorical predictor
Consider a binary classification task with two categorical predictors. You would like to apply a naive Bayes estimator. Your first step is to estimate the class-conditional for each predictor and each class. Let $\{(x_i, y_i)\}_{i=1}^n$ denote your dataset of $n$ samples. Assume that and .
Which would be the most suitable estimator for ?
Solutions
First option. The quantity in question means this: among cases where the class label takes the given value, how often does the first feature take on the discrete value of interest? The first answer gives an estimate of this by counting the corresponding proportion in the given dataset.
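A sketch of this counting estimator in code (the arrays below are hypothetical stand-ins for the dataset, not taken from the problem):

```python
import numpy as np

# Hypothetical categorical data: x1 takes values in {0, 1, 2}, y is binary.
x1 = np.array([0, 2, 1, 0, 2, 2, 1, 0, 2, 1])
y = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 0])

def class_conditional(x_feature, y_labels, value, label):
    """Estimate P(X1 = value | Y = label) by counting proportions."""
    in_class = x_feature[y_labels == label]
    return np.mean(in_class == value)

# Proportion of class-1 samples whose first feature equals 2.
print(class_conditional(x1, y, value=2, label=1))
```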