Partitional clustering

Clustering by hand

Consider this dataset.

     Sample     $X$  
  ------------ ----- 
       #1        1   
       #2        2   
       #3        5   
       #4        6  

Run one iteration of K-means with two clusters initialized with centroids $\mu_1=1$ and $\mu_2=6$. Report the two new centroid values.
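As a way to check a by-hand computation, one K-means iteration (assign each point to its nearest centroid, then recompute each centroid as the mean of its points) can be sketched as below. This is a minimal illustration for 1-D data, not a reference implementation: the function name `kmeans_step`, the empty-cluster convention, and the toy data (deliberately different from the table above) are all assumptions made for this example.

```python
def kmeans_step(points, centroids):
    """Run one assignment step and one update step of K-means on 1-D data."""
    # Assignment step: each point joins the cluster of its nearest centroid.
    labels = [
        min(range(len(centroids)), key=lambda k: (x - centroids[k]) ** 2)
        for x in points
    ]
    # Update step: each centroid moves to the mean of its assigned points.
    new_centroids = []
    for k, old in enumerate(centroids):
        members = [x for x, label in zip(points, labels) if label == k]
        # Assumed convention: an empty cluster keeps its old centroid.
        new_centroids.append(sum(members) / len(members) if members else old)
    return labels, new_centroids


# Hypothetical toy data, not the dataset from the exercise.
toy_points = [0.0, 1.0, 2.0, 10.0, 12.0]
toy_labels, toy_centroids = kmeans_step(toy_points, [0.0, 12.0])
print(toy_labels, toy_centroids)
```

Calling `kmeans_step` repeatedly, feeding the returned centroids back in, gives the full iterative procedure; here a single call is enough to mirror the exercise.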

WSS, BSS, TSS, part I

Consider a dataset $\mathcal{D}=\{X_1,\ldots,X_n\}$ that has been clustered into 3 groups. Let $Y_i \in \{1,2,3\}$ denote the cluster assignments. Recall that the total sum of squares (TSS), between-cluster sum of squares (BSS), and within-cluster sum of squares (WSS) are defined by the following series of formulae.

$$
\begin{aligned}
\bar\mu &= \sum_{i=1}^n X_i / n\\
\mu_k &= \sum_{i:\ Y_i=k} X_i / |\{i:\ Y_i=k\}| \qquad k \in \{1,2,3\} \\
\mathrm{WSS} &= \sum_i \Vert X_i - \mu_{Y_i}\Vert^2\\
\mathrm{BSS} &= \sum_{k=1}^3 |\{i:\ Y_i=k\}| \, \Vert \mu_k - \bar\mu\Vert^2\\
\mathrm{TSS} &= \sum_i \Vert X_i - \bar\mu \Vert^2
\end{aligned}
$$
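To make these definitions concrete, the sketch below computes the three quantities for a small labelled dataset. It assumes 1-D data, so each squared norm reduces to a squared difference. The function name `sums_of_squares` and the toy points and labels are hypothetical, chosen only for illustration.

```python
def sums_of_squares(points, labels):
    """Return (WSS, BSS, TSS) for 1-D points with the given cluster labels."""
    grand_mean = sum(points) / len(points)
    # Group the points by cluster label.
    clusters = {}
    for x, y in zip(points, labels):
        clusters.setdefault(y, []).append(x)
    # Per-cluster means (mu_k in the formulae).
    means = {k: sum(xs) / len(xs) for k, xs in clusters.items()}
    # WSS: squared distance from each point to its own cluster mean.
    wss = sum((x - means[y]) ** 2 for x, y in zip(points, labels))
    # BSS: cluster-size-weighted squared distance from cluster mean to grand mean.
    bss = sum(len(xs) * (means[k] - grand_mean) ** 2 for k, xs in clusters.items())
    # TSS: squared distance from each point to the grand mean.
    tss = sum((x - grand_mean) ** 2 for x in points)
    return wss, bss, tss


# Hypothetical toy data with three clusters of two points each.
toy_x = [0.0, 2.0, 4.0, 6.0, 10.0, 14.0]
toy_y = [1, 1, 2, 2, 3, 3]
wss, bss, tss = sums_of_squares(toy_x, toy_y)
print("WSS:", wss, "BSS:", bss, "TSS:", tss)
```

Running this with a few different labelings of the same points is one way to explore how the three quantities move relative to one another.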

Which of the following gives the relationship between WSS, BSS, and TSS?

  1. TSS = WSS + BSS

  2. WSS = TSS + BSS

  3. BSS = TSS + WSS

WSS, BSS, TSS, part II

At each iteration of K-means clustering, which of the following occurs?

  1. TSS won’t increase (and might decrease)

  2. WSS won’t increase (and might decrease)

  3. BSS won’t increase (and might decrease)

What does K-means do?

Which is the most accurate description of what the K-means clustering algorithm does?

  1. K-means attempts to find clusters so that all points in the same cluster are close to each other.

  2. K-means finds an optimal clustering, in the sense that it identifies a clustering so that the sum of distances between points within each cluster is as small as possible.

  3. K-means attempts to find clusters so that all points in the same cluster are far from each other.

  4. K-means finds an optimal clustering, in the sense that it identifies a clustering so that the sum of distances between points within each cluster is as large as possible.

Silhouette

Consider a dataset $\mathcal{D}=\{X_1,\ldots,X_n\}$ that has been clustered into 3 groups. Let $Y_i \in \{1,2,3\}$ denote the cluster assignments. Let $C_k = \{i:\ Y_i=k\}$. Recall that the Silhouette coefficient $s_i$ for each point is defined by the following series of formulae.

$$
\begin{aligned}
a_i &= \sum_{j \in C_{Y_i} \setminus \{i\}} \Vert X_i - X_j \Vert / (|C_{Y_i}|-1) \\
b_i &= \min_{y \neq Y_i} \sum_{j \in C_{y}} \Vert X_i - X_j \Vert / |C_y| \\
s_i &= (b_i-a_i) / \max(a_i,b_i)
\end{aligned}
$$
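The formulae above can be traced numerically with a short sketch. It assumes 1-D data (so each norm is an absolute difference), that every cluster contains at least two points (otherwise $a_i$ divides by zero), and that $\max(a_i,b_i)>0$. The function name `silhouette` and the toy data are hypothetical.

```python
def silhouette(points, labels):
    """Return lists (a, b, s) of per-point silhouette quantities for 1-D data."""
    # Map each cluster label to the indices of its members (C_k in the text).
    clusters = {}
    for i, y in enumerate(labels):
        clusters.setdefault(y, []).append(i)
    a, b, s = [], [], []
    for i, (x, y) in enumerate(zip(points, labels)):
        # a_i: mean distance from point i to the other members of its own cluster.
        own = [j for j in clusters[y] if j != i]
        a_i = sum(abs(x - points[j]) for j in own) / len(own)
        # b_i: smallest mean distance from point i to the members of another cluster.
        b_i = min(
            sum(abs(x - points[j]) for j in idx) / len(idx)
            for k, idx in clusters.items()
            if k != y
        )
        a.append(a_i)
        b.append(b_i)
        s.append((b_i - a_i) / max(a_i, b_i))
    return a, b, s


# Hypothetical toy data with three clusters of two points each.
toy_x = [0.0, 2.0, 10.0, 12.0, 20.0, 22.0]
toy_y = [1, 1, 2, 2, 3, 3]
a, b, s = silhouette(toy_x, toy_y)
for i in range(len(toy_x)):
    print(i, a[i], b[i], round(s[i], 3))
```

Changing the labels in `toy_y` so that nearby points land in different clusters, and rerunning, shows how each of the three quantities responds to a worse grouping.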

Which of the following statements best describes $a_i$, $b_i$, and $s_i$?

a. $-a_i$ is a measure of cohesion, $b_i$ is a measure of separation, and large values of $s_i$ indicate the clustering is pretty good.

b. $a_i$ is a measure of separation, $-b_i$ is a measure of cohesion, and large values of $s_i$ indicate the clustering is pretty good.

c. $-a_i$ is a measure of cohesion, $b_i$ is a measure of separation, and large values of $s_i$ indicate the clustering is pretty bad.

d. $a_i$ is a measure of separation, $-b_i$ is a measure of cohesion, and large values of $s_i$ indicate the clustering is pretty bad.