Support vector machines
Slacks, part I
Consider a classification task with one predictor $X$ and a binary response $Y$. After fitting an SVM, you find the following decision function:
How would the point be classified? As positive or negative?
Solutions
The decision function is positive at this point, so it would be classified as positive.
Slacks, part II
There are different conventions for how the decision function is written and what it says about the margin. But let us say that the decision function is $f(x) = 4 - 2x$ and the margin is equal to $1/2$. What would be the slack for each of the following points?
Sample    $X$    $Y$
--------  -----  -----
#1        1.5    +1
#2        1.5    -1
#3        2      +1
#4        2      -1
#5        3      +1
#6        3      -1
Solutions
Let’s sketch out what’s going on here with the decision function. The key things to track are (1) what’s the decision boundary, (2) how big is the margin radius, and (3) what is the orientation of the decision boundary (i.e. which side classifies as negative and which side classifies as positive). We can sketch all that out here.
                    margin radius is 1/2
                      |------|------|
0.0----0.5----1.0----1.5----2.0----2.5----3.0
 |      |      |      |      |      |      |
                      |  decision   |
   f(x)>0 here        |  boundary   |     f(x)<0 here
                      |  is here    |
                      |             |
                   margin       margin boundary
                  boundary      for points
                 for points     with y=-
                 with y=+
For the samples with $Y = -1$, we compute the distance to the margin boundary at $x = 2.5$, divided by the margin radius of $1/2$ (and assign zero slack to points on the correctly classified side of the margin boundary).
If y=-, slacks will be...
slack     5      4      3      2      1      0      0
    x    0.0----0.5----1.0----1.5----2.0----2.5----3.0
          |      |      |      |      |      |      |
                                   decision
       f(x)>0 here                 boundary       f(x)<0 here
                                   is here
For the samples with $Y = +1$, we compute the distance to the margin boundary at $x = 1.5$, divided by the margin radius of $1/2$ (and assign zero slack to points on the correctly classified side of the margin boundary).
If y=+, slacks will be...
slack     0      0      0      0      1      2      3
    x    0.0----0.5----1.0----1.5----2.0----2.5----3.0
          |      |      |      |      |      |      |
                                   decision
       f(x)>0 here                 boundary       f(x)<0 here
                                   is here
This is one way to get the slacks. The other way is to use the formula. Let $d$ denote the signed distance to the decision boundary. Let $M$ denote the margin. Then the slack for a point with response $y$ is given by
$$\xi = \max\!\left(0,\; 1 - \frac{y\,d}{M}\right).$$
So, for example, if $X = 3$ and $Y = +1$, then the signed distance is $d = -1$ (the point is one unit from the decision boundary at $x = 2$, on the side where $f(x) < 0$), and so the slack will be
$$\max\!\left(0,\; 1 - \frac{(+1)(-1)}{1/2}\right) = \max(0,\, 1 + 2) = 3.$$
In this problem I followed the convention that the margin is defined as the inverse of the square root of the sum of the squares of the non-intercept coefficients of the decision function, i.e., in this case, $1/\sqrt{(-2)^2} = 1/2$. When the margin is defined this way, there is a cancellation, and the formula for the slack takes on a slightly simpler form, namely $\xi = \max(0,\, 1 - y\,f(x))$. In my exams I will always adopt this convention. In any case, the formula $\xi = \max(0,\, 1 - y\,d/M)$ should always hold, even if you're not taking one of my exams.
The final answers are
Sample    $X$    $Y$    slack
--------  -----  -----  -------
#1        1.5    +1     0
#2        1.5    -1     2
#3        2      +1     1
#4        2      -1     1
#5        3      +1     3
#6        3      -1     0
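As a quick numerical check, here is a minimal Python sketch of the slack computation for this problem's decision function $f(x) = 4 - 2x$ and margin $1/2$; it reproduces the slack column using both the general formula and the simplified one.

```python
# Minimal sketch: verify the slack values in the table above, using the
# decision function f(x) = 4 - 2x with margin M = 1/2 (so ||w|| = 2).
import numpy as np

X = np.array([1.5, 1.5, 2.0, 2.0, 3.0, 3.0])
Y = np.array([+1, -1, +1, -1, +1, -1])

f = 4 - 2 * X        # decision function values
M = 0.5              # margin
d = f / 2            # signed distance to the decision boundary, f(x) / ||w||

general = np.maximum(0, 1 - Y * d / M)   # general slack formula
simple = np.maximum(0, 1 - Y * f)        # simplified formula (valid when M = 1/||w||)

print(general)  # [0. 2. 1. 1. 3. 0.]
print(simple)   # same values
```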
More slacks
Now let’s fit another SVM. This time we get
Assume the margin is $5$. What is the slack for the point?
Solutions
The decision boundary occurs where $f(x) = 0$, which here is at $x = -10$. Points are classified as positive when $f(x) > 0$, which happens for $x > -10$. The picture is like this:
                  margin radius is 5
               |-----------|-----------|
             -15.0       -10.0       -5.0         0.0
               |           |           |           |
               |       decision        |
 f(x)<0 here   |       boundary        |     f(x)>0 here
               |       is here         |
               |                       |
            margin                 margin boundary
           boundary                for points
          for points               with y=+
          with y=-
The point is correctly classified (it is on the correct side of the decision boundary), but it isn't on the right side of its margin boundary, so it has nonzero slack. The slack is given by the distance from the point to its margin boundary, divided by the margin of $5$.
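Here is a minimal Python sketch of that computation. The specific query point isn't restated here, so the numbers below are only a hypothetical illustration: decision boundary at $x = -10$, margin $5$, and a point at $x = -7$ with $y = +1$ (correctly classified, but inside the margin).

```python
# Minimal sketch of the slack computation for this SVM. The query point below
# is hypothetical: decision boundary at x = -10, margin M = 5, and a point at
# x = -7 with label y = +1 (correctly classified, but inside the margin).
boundary = -10.0
M = 5.0
x, y = -7.0, +1

d = x - boundary                 # signed distance; positive on the f(x) > 0 side
slack = max(0.0, 1 - y * d / M)  # general slack formula
print(slack)                     # 0.4
```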
Features to kernels
We have a dataset and we would like to learn to predict $Y$ from $X$. To build more expressive prediction models, we will use the following featurization of our data:
Under this featurization, we now have 3 predictive features, which we will call $\phi_1(x), \phi_2(x), \phi_3(x)$. We will then fit an SVM to this data. Our final decision function will take the form
$$f(x) = w_0 + w_1\,\phi_1(x) + w_2\,\phi_2(x) + w_3\,\phi_3(x)$$
for some coefficients $w_0, w_1, w_2, w_3$. Due to the kernel trick, it will also be possible to express the final decision function in a different form:
$$f(x) = b + \sum_{i=1}^{n} \alpha_i\, k(X_i, x)$$
for some coefficients $b, \alpha_1, \dots, \alpha_n$. What is the formula for $k(x, x')$ in this case?
Solutions
$k$ is the kernel here, and it calculates dot products in the featurized space. Specifically, $k(x, x') = \phi(x) \cdot \phi(x')$, where $\phi(x) = (\phi_1(x), \phi_2(x), \phi_3(x))$ denotes the vector of the three features computed from $x$.
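As a quick illustration, here is a minimal Python sketch. The featurization from the problem isn't restated here, so the sketch uses a hypothetical three-feature map $\phi(x) = (x, x^2, x^3)$; the point is just that the kernel has a closed-form formula that equals the dot product in the featurized space.

```python
# Minimal sketch: for a featurization phi, the kernel k(x, x') = phi(x) . phi(x')
# can be computed either from the features or from a closed-form formula.
# The three-feature map below is hypothetical, chosen only for illustration.
import numpy as np

def phi(x):
    # hypothetical featurization with 3 predictive features
    return np.array([x, x**2, x**3])

def k_formula(x, xp):
    # closed-form kernel for this featurization:
    # k(x, x') = x x' + x^2 x'^2 + x^3 x'^3
    return x * xp + x**2 * xp**2 + x**3 * xp**3

x, xp = 1.5, -0.7
print(np.dot(phi(x), phi(xp)))   # dot product in featurized space
print(k_formula(x, xp))          # same value from the closed-form kernel
```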
Kernel trick with linear SVMs
Say you have a dataset about $X$ and $Y$. The dataset has $n$ samples and $p$ features. You'd like to try to learn to predict $Y$ from $X$ using a linear SVM. Which is true?
The kernel trick is not useful in this setting.
The kernel trick could be useful if $n$ is small and $p$ is very large.
The kernel trick could be useful if $n$ is very large and $p$ is small.
Solutions
Second answer is correct. The kernel trick allows us to perform the optimization in $n$-dimensional space instead of $p$-dimensional space. That is advantageous when $p$ is very large. The same trick can work for linear regression with ridge penalties (sklearn.kernel_ridge.KernelRidge) or logistic regression with ridge penalties (though sklearn doesn't have software for it).
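Here is a minimal Python sketch of this idea for ridge regression (a hypothetical example, not part of the original problem): with a linear kernel, KernelRidge solves an $n \times n$ problem in the dual, while Ridge solves the same problem in the primal, and the fitted predictions agree.

```python
# Minimal sketch: the kernel trick for ridge regression. With a linear kernel,
# KernelRidge works with the n x n kernel matrix (the dual), while Ridge works
# in the p-dimensional feature space (the primal). When p >> n the dual is the
# cheaper route, but both give the same fitted model (no intercept, same alpha).
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.kernel_ridge import KernelRidge

rng = np.random.default_rng(0)
n, p = 50, 2000                    # few samples, many features (p >> n)
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

primal = Ridge(alpha=1.0, fit_intercept=False).fit(X, y)
dual = KernelRidge(alpha=1.0, kernel="linear").fit(X, y)

# Same model, computed two different ways.
print(np.allclose(primal.predict(X), dual.predict(X), atol=1e-6))  # True
```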
Linear SVMs vs polynomial SVMs
You have a dataset with one predictor feature $X$ and one binary response $Y$. You would like to be able to predict the response from the predictor, and you are considering two options.
Option I is to preprocess the data to include additional polynomial features of $X$ and then apply a linear SVM. Option II is to use an SVM with a polynomial kernel. Which is true?
True or false: Option I and option II will yield identical results.
Solutions
False. The results will be similar, but slightly different. Although both methods consider the same set of features, the features are scaled differently.
Option I is an SVM fit on the explicitly constructed polynomial features, exactly as listed.
Option II is equivalent to a linear SVM fit on the featurization implied by the polynomial kernel, which includes a constant feature and differently scaled versions of the polynomial features. Because SVMs always include an offset term, the constant feature is redundant, so Option II is really equivalent to a linear SVM on the remaining, differently scaled features.
This problem is pretty hard, and wouldn't be on an exam in exactly this form. But an easier version of this problem might be (e.g., one that explicitly guides you through the process of figuring out what featurization corresponds to the kernel).
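To make the "same features, different scaling" point concrete, here is a minimal Python sketch. It assumes the degree-2 polynomial kernel $k(x, x') = (1 + x x')^2$ (the kernel in the original problem may differ), whose implied featurization is $(1, \sqrt{2}\,x, x^2)$.

```python
# Minimal sketch (assumed kernel: k(x, x') = (1 + x x')^2, implied features
# (1, sqrt(2) x, x^2)). A linear SVM on (sqrt(2) x, x^2) should essentially
# match the kernel SVM, because the constant feature is absorbed by the offset.
# A linear SVM on (x, x^2) uses the same features but scaled differently, so
# it is similar but not identical.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=80)
y = np.where(x**2 + 0.3 * rng.normal(size=80) > 1.5, 1, -1)

X = x.reshape(-1, 1)
kernel_svm = SVC(C=1.0, kernel="poly", degree=2, gamma=1.0, coef0=1.0).fit(X, y)

scaled_feats = np.column_stack([np.sqrt(2) * x, x**2])  # matches the kernel's scaling
plain_feats = np.column_stack([x, x**2])                # "Option I" style features

svm_scaled = SVC(C=1.0, kernel="linear").fit(scaled_feats, y)
svm_plain = SVC(C=1.0, kernel="linear").fit(plain_feats, y)

grid = np.linspace(-2, 2, 7)
print(kernel_svm.decision_function(grid.reshape(-1, 1)))                        # kernel SVM
print(svm_scaled.decision_function(np.column_stack([np.sqrt(2) * grid, grid**2])))  # nearly identical
print(svm_plain.decision_function(np.column_stack([grid, grid**2])))                # similar, not identical
```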
Kernels to features
Consider a classification task with one predictor $X$ and one binary response $Y$. You are thinking about fitting a kernel SVM with the kernel $k$. You realize that this is equivalent to fitting a linear SVM with a certain featurization. What featurization is it?
Solutions
Positive definite, part I
Consider a classification task with one predictor $X$ and one binary response $Y$. You have decided to preprocess the data with a featurization $\phi$. Which best describes the sign of $\phi(x) \cdot \phi(x)$ for an input query $x$?
$\phi(x) \cdot \phi(x) \ge 0$ for all $x$ (but it may be that $\phi(x) \cdot \phi(x) = 0$ for some $x$)
$\phi(x) \cdot \phi(x) > 0$ for all $x$
$\phi(x) \cdot \phi(x) \le 0$ for all $x$ (but it may be that $\phi(x) \cdot \phi(x) = 0$ for some $x$)
$\phi(x) \cdot \phi(x) < 0$ for all $x$
Solutions
First answer is best. For any input $x$, $\phi(x) \cdot \phi(x) = \lVert \phi(x) \rVert^2 \ge 0$, and it can equal $0$ if $\phi(x)$ happens to be the zero vector, so a strict inequality is not guaranteed.
Positive definite, part II
For any featurization $\phi$, one can show that the kernel defined by $k(x, x') = \phi(x) \cdot \phi(x')$ is positive semi-definite (also called nonnegative definite). What does that mean?
Solutions
It means this. Pick any set of points $x_1, \dots, x_n$. Construct the $n \times n$ matrix $K$ whose entries are $K_{ij} = k(x_i, x_j)$. Then all eigenvalues of the matrix $K$ are nonnegative.
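Here is a small numerical illustration of that statement, using an arbitrary featurization chosen just for the example.

```python
# Minimal sketch: build a Gram matrix K with K_ij = phi(x_i) . phi(x_j) and
# check that its eigenvalues are all nonnegative (up to floating-point error).
# The featurization below is arbitrary, chosen only for illustration.
import numpy as np

def phi(x):
    return np.array([1.0, x, np.sin(x)])

rng = np.random.default_rng(0)
xs = rng.uniform(-3, 3, size=8)

Phi = np.stack([phi(x) for x in xs])   # n x 3 matrix of featurized points
K = Phi @ Phi.T                        # Gram matrix, K_ij = k(x_i, x_j)

eigvals = np.linalg.eigvalsh(K)        # eigenvalues of the symmetric matrix K
print(np.round(eigvals, 6))            # all >= 0 (some may be numerically ~0)
```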