Original link: tecdat.cn/?p=9997
Original source: Tuoduan Data Tribe Official Account
Introduction to k-medoids clustering
k-medoids is another clustering algorithm that can be used to find groups in a data set. k-medoids clustering is very similar to k-means clustering, except for a few differences: in particular, the optimization function of k-medoids is slightly different from that of k-means. In this section, we will study k-medoids clustering.
The k-medoids clustering algorithm
There are many different algorithms that can perform k-medoids clustering, of which the simplest and most effective is PAM (Partitioning Around Medoids). In PAM, we perform the following steps to find the cluster centers:
1. Select k data points from the scatter plot as the starting cluster centers.
2. Calculate their distance to all the points in the scatter plot.
3. Assign each point to the cluster whose center is closest to it.
4. In each cluster, choose a new point that minimizes the sum of the distances between itself and all the other points in that cluster.
5. Repeat from step 2 until the centers stop changing.
It can be seen that, except for steps 1 and 4, the PAM algorithm is the same as the k-means clustering algorithm. For most practical purposes, k-medoids clustering gives almost the same results as k-means clustering. But in some special cases, where a data set contains outliers, k-medoids clustering is preferred because it is more robust to outliers.
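To make these five steps concrete, here is a minimal from-scratch sketch of PAM in R. It is illustrative only, assumes no empty clusters or ties, and pam_sketch is a name introduced here; in practice you would use cluster::pam(), as in the exercise below.

# Minimal illustrative sketch of the PAM steps above (not production code)
pam_sketch <- function(x, k, max_iter = 10) {
  d <- as.matrix(dist(x))          # pairwise distances between all points
  medoids <- sample(nrow(x), k)    # step 1: pick k points as starting centers
  for (i in seq_len(max_iter)) {
    # steps 2 and 3: assign every point to its nearest medoid
    assignment <- apply(d[, medoids, drop = FALSE], 1, which.min)
    # step 4: in each cluster, pick the member minimizing total distance
    medoids_new <- sapply(seq_len(k), function(j) {
      members <- which(assignment == j)
      members[which.min(rowSums(d[members, members, drop = FALSE]))]
    })
    if (all(medoids_new == medoids)) break  # step 5: centers stopped changing
    medoids <- medoids_new
  }
  list(medoids = x[medoids, ], clustering = assignment)
}
pam_sketch(iris[, 1:2], 3)   # same data and k as the exercise below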
k-medoids clustering code
In this section, we will use the same Iris data set used in the previous two sections and compare the results to see whether they differ significantly from those obtained last time.
Implementing k-medoids clustering
In this exercise, we will use R's prebuilt libraries to perform k-medoids clustering:

Store the first two columns of the data set in the iris_data variable:
iris_data <- iris[, 1:2]
Install the software package:
install.packages("cluster")
Import the software package:
library("cluster")
Store the PAM clustering results in the km.res variable:
km.res <- pam(iris_data, 3)
Import library:
library("factoextra")
Plot the PAM clustering results:
fviz_cluster(km.res, data = iris_data, palette = "jco", ggtheme = theme_minimal())
The output is as follows:
Figure: Results of k-medoids clustering
The results of k-medoids clustering are not much different from those of the k-means clustering we did in the previous section.
Therefore, we can see that the PAM algorithm divides our data set into three clusters that are similar to the clusters we obtained through k-means clustering.
Figure: Results of k-medoids clustering and k-means clustering
In the preceding figure, observe how close the centers of the k-means and k-medoids clusters are to each other; however, the k-medoids cluster centers directly overlap points that already exist in the data, while the k-means cluster centers do not.
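This is by construction: a medoid is always an actual observation, whereas a k-means center is a coordinate average. A quick, hedged way to check this at the console, reusing the km.res and iris_data objects from above:

km.res$medoids                # pam() medoids: rows that exist in the data
kmeans(iris_data, 3)$centers  # kmeans() centers: means, generally not
                              # actual data points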
k-means clustering versus k-medoids clustering
Now that we have studied k-means and k-medoids clustering and seen that they are almost identical, we will look at the differences between them and at when to use each type of clustering:

Computational complexity: Of the two methods, k-medoids clustering is the more computationally expensive. When our data set is large (> 10,000 points) and we want to save computation time, we prefer k-means clustering to k-medoids clustering.
Whether a data set counts as large depends entirely on the available computing power.

The presence of outliers: k-means clustering is more sensitive to outliers than k-medoids clustering, as the short demo after this list illustrates.

Cluster centers: The k-means and k-medoids algorithms find their cluster centers in different ways.
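As a hedged illustration of the outlier point above, using synthetic data made up for this demo, compare where the two algorithms place their centers when a single extreme point is added:

library(cluster)
set.seed(1)
x <- rbind(matrix(rnorm(50), ncol = 2),             # cluster around (0, 0)
           matrix(rnorm(50, mean = 10), ncol = 2),  # cluster around (10, 10)
           c(30, 30))                               # a single outlier
kmeans(x, 2, nstart = 10)$centers  # the mean of the outlier's cluster is
                                   # pulled toward the outlier
pam(x, 2)$medoids                  # the medoid stays on a typical data point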
Using k-medoids clustering for customer segmentation
Use the customer data set to perform k-means and k-medoids clustering, and then compare the results.
Steps:
1. Select only two columns, Grocery and Frozen, so the clusters are easy to visualize in two dimensions.
2. Use k-medoids clustering to draw a chart showing the four clusters of the data.
3. Use k-means clustering to draw a four-cluster chart.
4. Compare the two charts to comment on how the results of the two methods differ (a sketch of these steps follows the list).
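A hedged sketch of these steps, assuming a local copy of the UCI Wholesale customers data set saved as wholesale.csv with its usual column names (the file name is our assumption):

library(cluster)
library(factoextra)
ws <- read.csv("wholesale.csv")
ws_data <- ws[, c("Grocery", "Frozen")]  # two columns, easy to plot in 2-D
fviz_cluster(pam(ws_data, 4), data = ws_data)                  # k-medoids
fviz_cluster(kmeans(ws_data, 4, nstart = 10), data = ws_data)  # k-means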
The result will be a k-means plot of the clusters, as shown below:
Figure: The expected k-means plot of the clusters
Determining the optimal number of clusters
So far, we have been working with the Iris data set, where we know how many species of flowers there are, and based on that knowledge we chose to divide the data set into three clusters. However, in unsupervised learning, our main task is to work with data about which we have no such information, for example, how many natural clusters or categories a data set contains. Clustering can therefore also serve as a form of exploratory data analysis.
Types of clustering metrics
There is more than one way to determine the optimal number of clusters in unsupervised learning. The following are the methods we will study in this chapter:
 Silhouette score
 Elbow method/WSS
 Gap statistic
Silhouette score
The silhouette score, or average silhouette score, is used to quantify the quality of the clustering achieved by a clustering algorithm.
The silhouette score lies between -1 and 1. If the silhouette score of a cluster is low (close to 0 or negative), it means that the cluster is spread out, or that the distances between the points of the cluster are large. If the silhouette score of a cluster is high (close to 1), it means that the cluster is well defined: the distances between the points within the cluster are small and their distances to points of other clusters are large. Therefore, the ideal silhouette score is close to 1.
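For reference, the per-point silhouette score behind this average is the standard definition
s(i) = (b(i) - a(i)) / max(a(i), b(i)),
where a(i) is the average distance from point i to the other points of its own cluster and b(i) is the average distance from point i to the points of the nearest other cluster.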
Calculating the silhouette score
We will learn how to calculate the silhouette score of a data set with a fixed number of clusters:

Put the first two columns of the Iris data set (sepal length and sepal width) in the iris_data variable:

Perform k-means clustering:

Store the k-means clustering results in the km.res variable:

Store the pairwise distance matrix of all data points in the pair_dis variable:

Calculate the silhouette score of each point in the data set:

Plot the silhouette scores:
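The code for these steps is not reproduced in this excerpt; the following is a hedged reconstruction using the cluster package's daisy() and silhouette() functions:

library(cluster)
iris_data <- iris[, 1:2]                    # sepal length and sepal width
km.res <- kmeans(iris_data, 3)              # k-means with three clusters
pair_dis <- daisy(iris_data)                # pairwise distance matrix
sc <- silhouette(km.res$cluster, pair_dis)  # per-point silhouette scores
plot(sc)                                    # one bar per point, by cluster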
The output is as follows:

Figure: The silhouette score of each point in each cluster is represented by a single bar
The preceding figure shows that the average silhouette score of the data set is 0.45. It also shows the average silhouette score per cluster and per point.
We calculated the silhouette score for three clusters. However, to determine how many clusters to use, we must calculate the silhouette score for several different numbers of clusters in the data set.
Determining the optimal number of clusters
Calculate the silhouette score for each value of k to determine the optimal number of clusters:

Put the first two columns of the Iris data set (sepal length and sepal width) in the iris_data variable:

Import the library:

Draw a graph of the silhouette score against the number of clusters (up to 20):
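The plotting code is missing from this excerpt; a hedged sketch using factoextra's fviz_nbclust() would be:

library(factoextra)
iris_data <- iris[, 1:2]
fviz_nbclust(iris_data, kmeans, method = "silhouette", k.max = 20)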
Note
In the second parameter, kmeans can be changed to cluster::pam (k-medoids) or any other type of clustering function.
The output is as follows:
Figure: Number of clusters versus average silhouette score
From the preceding figure, select the value of k with the highest score, which here is 2. According to the silhouette score, the optimal number of clusters is 2.
The WSS/elbow method
To identify clusters in a data set, we try to minimize the distance between the points within a cluster, and the within-cluster sum of squares (WSS) method measures exactly that. The WSS score is the sum of the squared distances of all points in a cluster from the cluster center.
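In symbols, for clusters C_1, ..., C_k with centers mu_1, ..., mu_k, this is
WSS = sum over i of sum over x in C_i of ||x - mu_i||^2,
which keeps decreasing as k grows, so we look for the point where the decrease slows down sharply (the elbow).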
Using WSS to determine the number of clusters
In this exercise, we will see how to use WSS to determine the number of clusters. Perform the following steps:

Put the first two columns of the Iris data set (sepal length and sepal width) in the iris_data variable:

Import the library:

Plot WSS against the number of clusters:
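Again, the code is missing from this excerpt; a hedged sketch using fviz_nbclust() with the WSS method (the k.max value is our assumption):

library(factoextra)
iris_data <- iris[, 1:2]
fviz_nbclust(iris_data, kmeans, method = "wss", k.max = 20)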
The output is as follows:

Figure: WSS versus the number of clusters
In the preceding graph, we can choose k = 3 at the elbow of the graph, because the value of WSS starts to decrease more slowly after k = 3. Choosing the elbow of a chart is always somewhat subjective; sometimes you might choose k = 4 or k = 2 instead of k = 3. For this chart, however, it is clear that values of k > 5 are unsuitable, because they do not lie at the elbow, which is where the slope of the graph changes sharply.
The gap statistic
The gap statistic is one of the most effective methods for finding the optimal number of clusters in a data set, and it is applicable to any type of clustering method. The gap statistic is calculated by comparing the WSS values of the clusters generated on the observed data set with those on a reference data set that has no obvious clustering.
So, in short, the gap statistic measures how far the WSS of the observed data set deviates from that of a random reference data set. To find the ideal number of clusters, we choose the value of k that gives the maximum value of the gap statistic.
Using the gap statistic to calculate the ideal number of clusters
In this exercise, we will use the gap statistic to calculate the ideal number of clusters:

Put the first two columns of the Iris data set (sepal length and sepal width) in the iris_data variable:

Import the factoextra library:

Plot the gap statistic against the number of clusters (up to 20):
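A hedged sketch of the missing plotting code, using the gap-statistic method of fviz_nbclust():

library(factoextra)
iris_data <- iris[, 1:2]
fviz_nbclust(iris_data, kmeans, method = "gap_stat", k.max = 20)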
Figure: Gap statistic versus the number of clusters
As shown in the figure above, the gap statistic reaches its maximum at k = 3. Therefore, the ideal number of clusters in this data set is 3.
Finding the ideal number of market segments
Use all three methods above to find the optimal number of clusters in the customer data set:
1. Load the fifth and sixth columns of the wholesale customers data set into a variable.
2. Use the silhouette score to calculate the optimal number of clusters for k-means clustering.
3. Use the WSS score to calculate the optimal number of clusters for k-means clustering.
4. Use the gap statistic to calculate the optimal number of clusters for k-means clustering.
The result will be three graphs representing the optimal number of clusters according to the silhouette score, the WSS score, and the gap statistic.
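A hedged sketch of this activity, again assuming a local wholesale.csv with the UCI wholesale-customers column layout, in which columns 5 and 6 are Grocery and Frozen:

library(factoextra)
ws_data <- read.csv("wholesale.csv")[, 5:6]           # Grocery and Frozen
fviz_nbclust(ws_data, kmeans, method = "silhouette")  # silhouette score
fviz_nbclust(ws_data, kmeans, method = "wss")         # WSS/elbow
fviz_nbclust(ws_data, kmeans, method = "gap_stat")    # gap statistic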