Partitional clustering
Clustering by hand
Consider this dataset.
Sample        $X$
------------ -----
#1            1
#2            2
#3            5
#4            6
Run one iteration of K-means with two clusters initialized with centroids and . Report the two new centroid values.
Solutions
Points #1 and #2 are assigned to cluster 1, and points #3 and #4 are assigned to cluster 2. So the new means are $(1 + 2)/2 = 1.5$ and $(5 + 6)/2 = 5.5$.
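The two steps of a K-means iteration (assign each point to its nearest centroid, then recompute each centroid as a cluster mean) can be sketched in a few lines of plain Python. The starting centroids `0.0` and `7.0` below are hypothetical values picked only so that the assignment matches the solution above; they are an assumption of this sketch, not the values from the exercise.

```python
# One K-means iteration on the four 1-D samples from the exercise.
x = [1.0, 2.0, 5.0, 6.0]

# Hypothetical starting centroids (an assumption made for this sketch only).
centroids = [0.0, 7.0]

# Assignment step: each point gets the index of its nearest centroid.
labels = [
    min(range(len(centroids)), key=lambda k: abs(xi - centroids[k]))
    for xi in x
]

# Update step: each centroid becomes the mean of the points assigned to it.
new_centroids = []
for k in range(len(centroids)):
    members = [xi for xi, lab in zip(x, labels) if lab == k]
    new_centroids.append(sum(members) / len(members))

print(labels)         # [0, 0, 1, 1]
print(new_centroids)  # [1.5, 5.5]
```

Any pair of starting centroids that splits the data between 2 and 5 would give the same assignment and therefore the same new centroids.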
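WSS, BSS, TSS, part I
Consider a dataset $x_1, \dots, x_n$ that has been clustered into 3 groups. Let $z_i \in \{1, 2, 3\}$ denote the cluster assignments, let $n_k$ and $\mu_k$ be the size and mean of cluster $k$, and let $\bar{x}$ be the mean of all points. Recall that the TSS, BSS, and WSS are defined by the following series of formulae.

$$\mathrm{TSS} = \sum_{i=1}^{n} \|x_i - \bar{x}\|^2, \qquad \mathrm{BSS} = \sum_{k=1}^{3} n_k \, \|\mu_k - \bar{x}\|^2, \qquad \mathrm{WSS} = \sum_{k=1}^{3} \; \sum_{i:\, z_i = k} \|x_i - \mu_k\|^2$$
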
Which of the following gives the relationship between WSS, BSS, and TSS?
TSS = WSS + BSS
WSS = TSS + BSS
BSS = TSS + WSS
Solutions
The first answer is correct: TSS = WSS + BSS.
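The decomposition can be checked numerically on a toy example. The sketch below assumes the usual squared-distance definitions (TSS measured around the overall mean, WSS around each cluster mean, BSS as the size-weighted squared distance of each cluster mean from the overall mean); the data, the labels, and the helper name `sums_of_squares` are made up for illustration.

```python
def mean(values):
    return sum(values) / len(values)


def sums_of_squares(x, labels):
    """Return (TSS, WSS, BSS) for 1-D data x and integer cluster labels."""
    grand_mean = mean(x)
    tss = sum((xi - grand_mean) ** 2 for xi in x)
    wss = 0.0
    bss = 0.0
    for k in sorted(set(labels)):
        members = [xi for xi, lab in zip(x, labels) if lab == k]
        centroid = mean(members)
        # Spread of the cluster around its own mean.
        wss += sum((xi - centroid) ** 2 for xi in members)
        # Size-weighted squared distance of the cluster mean from the overall mean.
        bss += len(members) * (centroid - grand_mean) ** 2
    return tss, wss, bss


# Illustrative data: three well-separated pairs of points.
x = [1.0, 2.0, 5.0, 6.0, 10.0, 11.0]
labels = [0, 0, 1, 1, 2, 2]
tss, wss, bss = sums_of_squares(x, labels)
print(tss, wss + bss)  # the two numbers agree up to rounding
```

Changing `labels` changes WSS and BSS individually, but their sum stays equal to TSS.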
WSS, BSS, TSS, part II
At each iteration of K-means clustering, which of the following occurs?
TSS won’t increase (and might decrease)
WSS won’t increase (and might decrease)
BSS won’t increase (and might decrease)
Solutions
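The second option is correct: reassigning points to their nearest centroid and then recomputing the centroids can each only lower the WSS or leave it unchanged. TSS will stay the same (it doesn’t depend on cluster labels), and since TSS = WSS + BSS, BSS won’t decrease (and might increase).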
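This behaviour is easy to observe by recording the three quantities after every iteration. The following is a minimal sketch of plain K-means on 1-D data, with a deliberately poor, made-up initialization; `kmeans_history` is a hypothetical helper written for this illustration, and it keeps the old centroid if a cluster ever becomes empty.

```python
def mean(values):
    return sum(values) / len(values)


def kmeans_history(x, centroids, n_iter=5):
    """Run n_iter K-means iterations on 1-D data; return a list of (WSS, BSS, TSS)."""
    grand_mean = mean(x)
    tss = sum((xi - grand_mean) ** 2 for xi in x)  # never depends on the labels
    centroids = list(centroids)
    history = []
    for _ in range(n_iter):
        # Assignment step: nearest centroid for every point.
        labels = [
            min(range(len(centroids)), key=lambda k: abs(xi - centroids[k]))
            for xi in x
        ]
        # Update step: cluster means (an empty cluster keeps its old centroid).
        for k in range(len(centroids)):
            members = [xi for xi, lab in zip(x, labels) if lab == k]
            if members:
                centroids[k] = mean(members)
        wss = sum((xi - centroids[lab]) ** 2 for xi, lab in zip(x, labels))
        bss = sum((centroids[lab] - grand_mean) ** 2 for lab in labels)
        history.append((wss, bss, tss))
    return history


# Illustrative data and a deliberately poor starting point.
x = [1.0, 2.0, 5.0, 6.0, 9.0, 10.0]
history = kmeans_history(x, centroids=[1.0, 2.0])
for wss, bss, tss in history:
    print(round(wss, 3), round(bss, 3), round(tss, 3))
```

In the printed rows the first column never goes up, the second never goes down, and the third is constant.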
What does K-means do?
Which is the most accurate description of what the K-means clustering algorithm does?
K-means attempts to find clusters so that all points in the same cluster are close to each other.
K-means finds an optimal clustering, in the sense that it identifies a clustering so that the sum of distances between points within each cluster is as small as possible.
K-means attempts to find clusters so that all points in the same cluster are far from each other.
K-means finds an optimal clustering, in the sense that it identifies a clustering so that the sum of distances between points within each cluster is as large as possible.
Solutions
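The first option is correct. K-means is greedy and may not find an optimal clustering, which rules out the second option. Moreover, “sum of distances between points within each cluster” is a bit vague. It is true that each K-means iteration works to reduce the WSS, and this can be related to a quantity involving the squared distances between all pairs of points in each cluster. Specifically, for a cluster $k$ with $n_k$ points and mean $\mu_k$,

$$\sum_{i:\, z_i = k} \|x_i - \mu_k\|^2 = \frac{1}{2 n_k} \sum_{i:\, z_i = k} \; \sum_{j:\, z_j = k} \|x_i - x_j\|^2,$$

where $z_i$ denotes the cluster of point $i$.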
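To illustrate the point that K-means is greedy, the sketch below runs plain K-means on the same 1-D data from two different starting points and compares the final WSS. The data, the starting centroids, and the helper `kmeans_1d` are all made up for this example.

```python
def kmeans_1d(x, centroids, n_iter=20):
    """Plain K-means on 1-D data; returns (final centroids, final WSS)."""
    centroids = list(centroids)

    def nearest(xi):
        return min(range(len(centroids)), key=lambda k: abs(xi - centroids[k]))

    for _ in range(n_iter):
        labels = [nearest(xi) for xi in x]
        for k in range(len(centroids)):
            members = [xi for xi, lab in zip(x, labels) if lab == k]
            if members:  # an empty cluster keeps its old centroid
                centroids[k] = sum(members) / len(members)
    wss = sum((xi - centroids[nearest(xi)]) ** 2 for xi in x)
    return centroids, wss


# Three obvious groups: {0, 1}, {10, 11}, {20, 21}.
x = [0.0, 1.0, 10.0, 11.0, 20.0, 21.0]

# Poor start: two centroids sit inside the first group.
_, wss_bad = kmeans_1d(x, [0.0, 1.0, 15.0])
# Good start: one centroid near each group.
_, wss_good = kmeans_1d(x, [0.0, 10.0, 20.0])

print(wss_bad, wss_good)  # 101.0 1.5
```

Both runs converge, but the first one is stuck in a clustering whose WSS is far from the best achievable. This is why implementations commonly restart K-means from several initializations and keep the run with the lowest WSS.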
Silhouette
Consider a dataset that has been clustered into 3 groups. Let $z_i \in \{1, 2, 3\}$ denote the cluster assignments. Let $n_k$ be the number of points in cluster $k$ and let $d(i, j)$ be the distance between points $i$ and $j$. Recall that the Silhouette coefficient for each point $i$ is defined by the following series of formulae.

$$a_i = \frac{1}{n_{z_i} - 1} \sum_{j \neq i:\, z_j = z_i} d(i, j), \qquad b_i = \min_{k \neq z_i} \frac{1}{n_k} \sum_{j:\, z_j = k} d(i, j), \qquad s_i = \frac{b_i - a_i}{\max(a_i, b_i)}$$

Which of the following statements best describes $a_i$, $b_i$, and $s_i$?
a. $a_i$ is a measure of cohesion, $b_i$ is a measure of separation, and large values of $s_i$ indicate the clustering is pretty good.
b. $a_i$ is a measure of separation, $b_i$ is a measure of cohesion, and large values of $s_i$ indicate the clustering is pretty good.
c. $a_i$ is a measure of cohesion, $b_i$ is a measure of separation, and large values of $s_i$ indicate the clustering is pretty bad.
d. $a_i$ is a measure of separation, $b_i$ is a measure of cohesion, and large values of $s_i$ indicate the clustering is pretty bad.
Solutions
(a) is correct. Higher values of $a_i$ mean less cohesion (so higher values of $-a_i$ mean more cohesion). Higher values of $b_i$ mean more separation. We want lots of cohesion and lots of separation, so we want $b_i - a_i$, and hence $s_i$, to be large and positive.
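The Silhouette computation can also be carried out by hand on a small example. The sketch below assumes the standard definition: for each point, the mean distance to the other points in its own cluster (cohesion), the smallest mean distance to the points of any other cluster (separation), and their difference divided by the larger of the two. The 1-D data, both labelings, and the helper `silhouette_values` are made up for illustration, and singleton clusters are given a coefficient of 0 by convention.

```python
def mean(values):
    return sum(values) / len(values)


def silhouette_values(x, labels):
    """Silhouette coefficient of each 1-D point, using |x_i - x_j| as the distance."""
    values = []
    for i, (xi, li) in enumerate(zip(x, labels)):
        # Distances to the other members of the point's own cluster.
        own = [
            abs(xi - xj)
            for j, (xj, lj) in enumerate(zip(x, labels))
            if lj == li and j != i
        ]
        if not own:
            values.append(0.0)  # convention for a singleton cluster
            continue
        cohesion = mean(own)
        # Smallest mean distance to the members of any other cluster.
        separation = min(
            mean([abs(xi - xj) for xj, lj in zip(x, labels) if lj == k])
            for k in set(labels)
            if k != li
        )
        denom = max(cohesion, separation)
        values.append((separation - cohesion) / denom if denom > 0 else 0.0)
    return values


x = [1.0, 2.0, 5.0, 6.0, 10.0, 11.0]
good = silhouette_values(x, [0, 0, 1, 1, 2, 2])  # labels follow the visible groups
bad = silhouette_values(x, [0, 1, 0, 1, 2, 2])   # first two groups are mixed up

print(round(mean(good), 3), round(mean(bad), 3))
```

The sensible labeling gives coefficients close to 1, while the mixed-up labeling pushes several of them below 0, matching the reading that large positive values indicate a good clustering.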