
Support vector machines

Slacks, part I

Consider a classification task with one predictor $X \in \mathbb{R}$ and a binary response $Y \in \{-1,+1\}$. After fitting an SVM, you find the following decision function:

$$f(x) = 4 - 2x.$$

How would the point $x=0$ be classified? As positive or negative?
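
For reference, here is a minimal sketch in Python, assuming the usual convention that an SVM predicts the sign of the decision function; the test points in the code are illustrative and are not part of the problem.

```python
import numpy as np

# Minimal sketch, assuming the usual convention: the SVM predicts sign(f(x)),
# so f(x) > 0 means the positive class and f(x) < 0 means the negative class.
def f(x):
    return 4 - 2 * x

def classify(x):
    return np.sign(f(x))

print(classify(np.array([-1.0, 3.0])))  # [ 1. -1.]
```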

Slacks, part II

There are different conventions for how the decision function is written, and what it says about the margin. But let us say that the decision function is $f(x) = 4 - 2x$ and the margin is equal to $1/2$. What would be the slack for the following points?

     Sample     $X$   $Y$
  ------------ ----- -----
       #1       1.5   +1
       #2       1.5   -1
       #3        2    +1
       #4        2    -1
       #5        3    +1
       #6        3    -1
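
For concreteness, here is a minimal sketch under one common convention, in which the stated margin equals $1/\lvert \beta_1 \rvert$ (here $1/2$, since $f(x) = 4 - 2x$) and the slack of a point is $\xi = \max(0,\, 1 - y\, f(x))$; if your course uses a rescaled convention, adjust accordingly. The sample points in the code are illustrative and are not rows of the table.

```python
# Sketch under one common convention: the margin is 1/|beta_1| (here 1/2), the
# margin boundaries are where y*f(x) = 1, and the slack is xi = max(0, 1 - y*f(x)).
def f(x):
    return 4 - 2 * x

def slack(x, y):
    return max(0.0, 1.0 - y * f(x))

# Illustrative points (not rows of the table above):
print(slack(x=0.0, y=+1))   # 0.0 : correct side of the boundary, outside the margin
print(slack(x=1.75, y=+1))  # 0.5 : correct side, but inside the margin
print(slack(x=2.5, y=+1))   # 2.0 : wrong side of the decision boundary
```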

More slacks

Now let’s fit another SVM. This time we get

$$f(x) = 2 + x/5.$$

Assume the margin is $m=5$. What is the slack for the point $X=-6$, $Y=+1$?

Features to kernels

We have a dataset $\mathcal{D}=\{(X_1,Y_1),\ldots,(X_n,Y_n)\}$ and we would like to learn to predict $Y_i \in \{-1,+1\}$ from $X_i \in \mathbb{R}$. To build more expressive prediction models, we will use the following featurization of our data:

$$\phi(x) = \begin{pmatrix} x \\ \cos x \\ \sin x \end{pmatrix}.$$

Under this featurization, we now have 3 predictive features. We will then fit an SVM to this data. Our final decision function will take the form

$$f(x) = \beta_0 + \beta_1 x + \beta_2 \cos x + \beta_3 \sin x$$

for some coefficients $\beta \in \mathbb{R}^4$. Due to the kernel trick, it will also be possible to express the final decision function in a different form:

$$f(x) = \gamma_0 + \sum_{i=1}^n \gamma_i K(x,X_i)$$

for some coefficients $\gamma \in \mathbb{R}^{n+1}$. What is the formula for $K$ in this case?
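
A quick way to sanity-check a proposed formula for $K$ is to compare it numerically against the inner product of the features; a minimal sketch (the input values are arbitrary):

```python
import numpy as np

def phi(x):
    # The featurization defined above
    return np.array([x, np.cos(x), np.sin(x)])

def K(x, y):
    # By construction, the kernel is the inner product of the features.
    return phi(x) @ phi(y)

# Whatever closed-form expression you derive for K should match these values:
print(K(0.3, 1.2))
print(K(-2.0, 0.5))
```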

Kernel trick with linear SVMs

Say you have a dataset about $X \in \mathbb{R}^p$ and $Y \in \{-1,+1\}$. The dataset has $n$ samples. You'd like to try to learn to predict $Y$ from $X$ using a linear SVM. Which of the following is true?

  1. The kernel trick is not useful in this setting.

  2. The kernel trick could be useful if $p = 1{,}000{,}000$ and $n = 100$.

  3. The kernel trick could be useful if $n = 1{,}000{,}000$ and $p = 100$.
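
If it helps to see the two representations side by side, here is a minimal sketch with hypothetical data sizes: the primal form of a linear SVM stores a length-$p$ weight vector, while the kernelized form only ever touches inner products between the $n$ training samples.

```python
import numpy as np

# Hypothetical sizes, chosen only to illustrate the shapes involved.
rng = np.random.default_rng(0)
n, p = 50, 2000

X = rng.normal(size=(n, p))
w = rng.normal(size=p)   # primal view: one coefficient per feature
G = X @ X.T              # kernel view: one inner product per pair of samples

print(w.shape)  # (2000,)
print(G.shape)  # (50, 50)
```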

Linear SVMs vs polynomial SVMs

You have a dataset with one predictor feature $X$ and one binary response $Y$. You would like to be able to predict the response from the predictor, and you are considering two options.

Option I is to preprocess the data to include the features $X, X^2$ and then apply a linear SVM. Option II is to use an SVM with a polynomial kernel $K(x,y)=(1+xy)^2$.

True or false: Option I and Option II will yield identical results.
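
One way to probe this is to compare the kernel implied by Option I, $\phi(x)^\top\phi(y)$ with $\phi(x) = (x, x^2)$, against the Option II kernel on a few inputs; a minimal sketch (the test pairs are arbitrary):

```python
import numpy as np

def k_option1(x, y):
    # Option I: linear kernel applied to the features (x, x^2)
    phi = lambda t: np.array([t, t**2])
    return phi(x) @ phi(y)

def k_option2(x, y):
    # Option II: the degree-2 polynomial kernel
    return (1 + x * y) ** 2

for x, y in [(0.5, 1.0), (2.0, -1.0), (0.0, 3.0)]:
    print(k_option1(x, y), k_option2(x, y))
```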

Kernels to features

Consider a classification task with one predictor $X$ and one binary response $Y$. You are thinking about fitting a kernel SVM with the kernel $K(x,y) = xy + \exp(x+y)$. You realize that this is equivalent to fitting a linear SVM with a certain featurization. What featurization is it?

Positive definite, part I

Consider a classification task with one predictor $X$ and one binary response $Y$. You have decided to preprocess the data with a featurization $\phi: \mathbb{R} \rightarrow \mathbb{R}^{\tilde p}$. Which best describes the sign of $\phi(x)^\top \phi(x)$ for an input query $x$?

  1. $\phi(x)^\top \phi(x) \geq 0$ for all $x$ (but it may be that $\phi(x)^\top \phi(x) = 0$ for some $x$)

  2. $\phi(x)^\top \phi(x) > 0$ for all $x$

  3. $\phi(x)^\top \phi(x) \leq 0$ for all $x$ (but it may be that $\phi(x)^\top \phi(x) = 0$ for some $x$)

  4. $\phi(x)^\top \phi(x) < 0$ for all $x$

Positive definite, part II

For any featurization $\phi$, one can show that the kernel defined by $K(x,y) = \phi(x)^\top \phi(y)$ is positive semi-definite (also called nonnegative definite). What does that mean?
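
As a numerical illustration of the definition, here is a minimal sketch that builds the Gram matrix $K_{ij} = \phi(X_i)^\top \phi(X_j)$ for an arbitrary featurization (the particular $\phi$ and the random data are just for illustration) and checks the quadratic-form and eigenvalue characterizations:

```python
import numpy as np

rng = np.random.default_rng(0)

def phi(x):
    # Any featurization would do; this one is just for illustration.
    return np.array([x, np.cos(x), np.sin(x)])

X = rng.normal(size=10)
Phi = np.stack([phi(x) for x in X])   # n x 3 matrix of features
K = Phi @ Phi.T                       # n x n Gram matrix, K_ij = phi(X_i)^T phi(X_j)

z = rng.normal(size=10)
print(z @ K @ z >= -1e-12)                      # quadratic forms are nonnegative
print(np.all(np.linalg.eigvalsh(K) >= -1e-12))  # eigenvalues are nonnegative
```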