Partitional clustering - Some Data Science Questions

Clustering by hand

Consider this dataset.

     Sample     $X$  
  ------------ ----- 
       #1        1   
       #2        2   
       #3        5   
       #4        6

Run one iteration of K-means with two clusters, initialized at centroids $\mu_1=1$ and $\mu_2=6$. Report the two new centroid values.
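To check a hand computation, a single K-means iteration can be sketched in a few lines of plain Python. This is only an illustrative sketch: the helper name `kmeans_step` is hypothetical, and the tie-breaking and empty-cluster rules are assumptions that do not come from the question.

```python
def kmeans_step(points, centroids):
    """One K-means iteration on 1-D data: assign, then recompute the means."""
    clusters = [[] for _ in centroids]
    for x in points:
        # Index of the nearest centroid (assumption: ties go to the lower index).
        nearest = min(range(len(centroids)), key=lambda k: abs(x - centroids[k]))
        clusters[nearest].append(x)
    # New centroid = mean of the assigned points
    # (assumption: an empty cluster keeps its old centroid).
    return [sum(c) / len(c) if c else mu for c, mu in zip(clusters, centroids)]


points = [1, 2, 5, 6]          # the X column of the table above
new_centroids = kmeans_step(points, [1, 6])
print(new_centroids)
```

Running it prints the two updated centroids, which can be compared with the values worked out by hand.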

WSS, BSS, TSS, part I

Consider a dataset $\mathcal{D}=\{X_1,\ldots, X_n\}$ that has been clustered into 3 groups. Let $Y_i \in \{1,2,3\}$ denote the cluster assignments. Recall that the TSS, BSS, and WSS are defined by the following series of formulae.

\begin{aligned} \bar\mu &= \sum_{i=1}^n X_i / n\\ \mu_k &= \sum_{i:\ Y_i=k} X_i / |\{i:\ Y_i=k\}| \qquad k \in \{1,2,3\} \\ \mathrm{WSS} &= \sum_i \Vert X_i - \mu_{Y_i}\Vert^2\\ \mathrm{BSS} &= \sum_{k=1}^3 |\{i:\ Y_i=k\}| \, \Vert \mu_k - \bar\mu\Vert^2\\ \mathrm{TSS} &= \sum_i \Vert X_i - \bar\mu \Vert^2 \end{aligned}

(1)

Which of the following gives the relationship between WSS, BSS, and TSS?

TSS = WSS + BSS
WSS = TSS + BSS
BSS = TSS + WSS
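A numerical check can help when deciding between the candidate identities. The sketch below computes the three quantities exactly as defined above. The points `X`, the labels `Y`, and the helper names are made up for illustration and are not part of the question.

```python
def mean(vectors):
    """Component-wise mean of a list of equal-length tuples."""
    n = len(vectors)
    return tuple(sum(v[d] for v in vectors) / n for d in range(len(vectors[0])))


def sq_dist(u, v):
    """Squared Euclidean distance ||u - v||^2."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def sums_of_squares(X, Y):
    """Return (WSS, BSS, TSS) for points X with cluster labels Y."""
    grand_mean = mean(X)                                   # \bar\mu
    labels = sorted(set(Y))
    mu = {k: mean([x for x, y in zip(X, Y) if y == k]) for k in labels}
    size = {k: Y.count(k) for k in labels}                 # |{i : Y_i = k}|
    wss = sum(sq_dist(x, mu[y]) for x, y in zip(X, Y))
    bss = sum(size[k] * sq_dist(mu[k], grand_mean) for k in labels)
    tss = sum(sq_dist(x, grand_mean) for x in X)
    return wss, bss, tss


# Hypothetical toy data: six 2-D points in three clusters.
X = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0), (6.0, 5.0), (0.0, 9.0), (1.0, 8.0)]
Y = [1, 1, 2, 2, 3, 3]
wss, bss, tss = sums_of_squares(X, Y)
print(f"WSS={wss:.3f}  BSS={bss:.3f}  TSS={tss:.3f}")
```

Changing `X` or `Y` and re-running shows whether the relationship that holds for one labelling also holds for others.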

WSS, BSS, TSS, part II

At each iteration of K-means clustering, which of the following occurs?

TSS won’t increase (and might decrease)
WSS won’t increase (and might decrease)
BSS won’t increase (and might decrease)
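One way to explore the options above is to record all three quantities after every iteration of a small run. The following is a minimal 1-D sketch. The data, the starting centroids, and the function names are assumptions chosen only so that the run takes more than one iteration to settle.

```python
def assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    return [min(range(len(centroids)), key=lambda k: (x - centroids[k]) ** 2)
            for x in points]


def update(points, labels, centroids):
    """Recompute each centroid as the mean of its assigned points."""
    new = []
    for k, old in enumerate(centroids):
        members = [x for x, y in zip(points, labels) if y == k]
        # Assumption: an empty cluster keeps its previous centroid.
        new.append(sum(members) / len(members) if members else old)
    return new


def wss_bss_tss(points, labels, centroids):
    """WSS, BSS and TSS for the current labels and centroids."""
    grand = sum(points) / len(points)
    wss = sum((x - centroids[y]) ** 2 for x, y in zip(points, labels))
    bss = sum(labels.count(k) * (c - grand) ** 2 for k, c in enumerate(centroids))
    tss = sum((x - grand) ** 2 for x in points)
    return wss, bss, tss


points = [1, 2, 3, 8, 9, 10, 25]   # hypothetical data
centroids = [1.0, 2.0]             # deliberately poor starting centroids
history = []
for it in range(1, 5):
    labels = assign(points, centroids)
    centroids = update(points, labels, centroids)
    history.append(wss_bss_tss(points, labels, centroids))
    wss, bss, tss = history[-1]
    print(f"iter {it}: WSS={wss:.2f}  BSS={bss:.2f}  TSS={tss:.2f}")
```

Reading the printed columns from top to bottom shows which quantities move between iterations and in which direction.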

What does K-means do?

Which is the most accurate description of what the K-means clustering algorithm does?

K-means attempts to find clusters so that all points in the same cluster are close to each other.
K-means finds an optimal clustering, in the sense that it identifies a clustering so that the sum of distances between points within each cluster is as small as possible.
K-means attempts to find clusters so that all points in the same cluster are far from each other.
K-means finds an optimal clustering, in the sense that it identifies a clustering so that the sum of distances between points within each cluster is as large as possible.
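A small experiment can help weigh these descriptions against each other: run the algorithm on the same data from two different starting points and compare the results. The sketch below assumes the standard assign-then-update loop (Lloyd's algorithm). The four points, both initializations, and the function names are hypothetical.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two tuples."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def lloyd(points, init_centroids, max_iter=100):
    """Run K-means until the assignments stop changing; return (labels, WSS)."""
    centroids = list(init_centroids)   # copy so the caller's list is untouched
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(len(centroids)),
                          key=lambda k: sq_dist(p, centroids[k]))
                      for p in points]
        if new_labels == labels:       # converged: no point changed cluster
            break
        labels = new_labels
        for k in range(len(centroids)):
            members = [p for p, y in zip(points, labels) if y == k]
            if members:                # assumption: empty clusters keep their centroid
                centroids[k] = tuple(sum(c) / len(members) for c in zip(*members))
    wss = sum(sq_dist(p, centroids[y]) for p, y in zip(points, labels))
    return labels, wss


# Four corners of a wide rectangle.
points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]

labels_a, wss_a = lloyd(points, [(2.0, 0.0), (2.0, 1.0)])   # start on the long edges
labels_b, wss_b = lloyd(points, [(0.0, 0.5), (4.0, 0.5)])   # start on the short edges
print("init A:", labels_a, "WSS =", wss_a)
print("init B:", labels_b, "WSS =", wss_b)
```

Comparing the two printed lines shows whether the final clustering depends on where the centroids start.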

Silhouette

Consider a dataset $\mathcal{D}=\{X_1,\ldots, X_n\}$ that has been clustered into 3 groups. Let $Y_i \in \{1,2,3\}$ denote the cluster assignments. Let $C_k = \{i:\ Y_i=k\}$. Recall that the Silhouette coefficient $s_i$ for each point is defined by the following series of formulae.

\begin{aligned} a_i &= \sum_{j \in C_{Y_i} \setminus \{i\}} \Vert X_i - X_j \Vert / (|C_{Y_i}| - 1) \\ b_i &= \min_{y \neq Y_i} \sum_{j \in C_{y}} \Vert X_i - X_j \Vert / |C_y| \\ s_i &= (b_i-a_i) / \max(a_i,b_i) \end{aligned}

(3)

Which of the following statements best describes $a_i$, $b_i$, and $s_i$?

a. $-a_i$ is a measure of cohesion, $b_i$ is a measure of separation, and large values of $s_i$ indicate the clustering is pretty good.

b. $a_i$ is a measure of separation, $-b_i$ is a measure of cohesion, and large values of $s_i$ indicate the clustering is pretty good.

c. $-a_i$ is a measure of cohesion, $b_i$ is a measure of separation, and large values of $s_i$ indicate the clustering is pretty bad.

d. $a_i$ is a measure of separation, $-b_i$ is a measure of cohesion, and large values of $s_i$ indicate the clustering is pretty bad.
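To get a feel for how $a_i$, $b_i$, and $s_i$ behave, the definitions can be evaluated on a toy example under two different labellings of the same points. This is a rough sketch: the data, both labellings, and the function name `silhouette` are made up, and returning 0 for a point that is alone in its cluster is an assumed convention, since the formula for $a_i$ is undefined in that case.

```python
import math


def silhouette(X, Y):
    """Per-point silhouette coefficients s_i from the a_i, b_i definitions."""
    clusters = {}
    for i, y in enumerate(Y):
        clusters.setdefault(y, []).append(i)

    scores = []
    for i, y in enumerate(Y):
        own = [j for j in clusters[y] if j != i]
        if not own:
            scores.append(0.0)   # assumed convention for singleton clusters
            continue
        # a_i: mean distance to the other points in the same cluster.
        a = sum(math.dist(X[i], X[j]) for j in own) / len(own)
        # b_i: smallest mean distance to the points of any other cluster.
        b = min(sum(math.dist(X[i], X[j]) for j in members) / len(members)
                for k, members in clusters.items() if k != y)
        scores.append((b - a) / max(a, b))
    return scores


# Hypothetical data: three visually obvious pairs of points.
X = [(0, 0), (0, 1), (10, 0), (10, 1), (5, 9), (5, 10)]
good = silhouette(X, [1, 1, 2, 2, 3, 3])   # labels follow the visible groups
bad = silhouette(X, [1, 2, 1, 2, 3, 3])    # labels cut across the groups

print("labelling 1:", [round(s, 2) for s in good])
print("labelling 2:", [round(s, 2) for s in bad])
```

Comparing the two printed lists, and printing `a` and `b` inside the loop if desired, shows how each quantity responds when the labels match or cut across the visible groups.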
