k-medoids clustering in R with the optimal number of clusters for customer segmentation

Original link: tecdat.cn/?p=9997

Original source: Tuoduan Data Tribe Official Account


Introduction to k-medoids clustering

k-medoids is another clustering algorithm that can be used to find groups in a data set. k-medoids clustering is very similar to k-means clustering; the main difference is that its optimization function is slightly different from that of k-means. In this section, we will study k-medoids clustering.

k-medoids clustering algorithm

There are many different types of algorithms that can perform k-medoids clustering, of which the simplest and most efficient is PAM (Partitioning Around Medoids). In PAM, we perform the following steps to find the cluster centers:

  1. Select k data points from the scatter plot as starting points for the cluster centers.
  2. Calculate their distance from all the points in the scatter plot.
  3. Assign each point to the cluster whose center is closest to it.
  4. In each cluster, choose a new point that minimizes the sum of the distances between itself and all the other points in the cluster.
  5. Repeat from step 2 until the centers stop changing.

It can be seen that, except for steps 1 and 4, the PAM algorithm is the same as the k-means clustering algorithm. For most practical purposes, k-medoids clustering gives almost the same results as k-means clustering. But in some special cases, where the data set contains outliers, k-medoids clustering is preferred because it is more robust to outliers.
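To make the steps concrete, here is a from-scratch sketch of PAM in R. It is illustrative only and is not the cluster package's implementation; the function name simple_pam is our own, and edge cases (such as a medoid losing all of its points) are not handled.

    # Illustrative sketch of the PAM steps above; not the cluster
    # package's implementation, and edge cases are not handled
    simple_pam <- function(data, k, max_iter = 100) {
      d <- as.matrix(dist(data))        # distances between all points (step 2)
      medoids <- sample(nrow(data), k)  # step 1: random starting centers
      clusters <- rep(1, nrow(data))
      for (i in seq_len(max_iter)) {
        # Step 3: assign each point to the cluster with the closest center
        clusters <- apply(d[, medoids, drop = FALSE], 1, which.min)
        # Step 4: in each cluster, pick the point that minimizes the sum
        # of distances to all the other points in the cluster
        new_medoids <- sapply(seq_len(k), function(j) {
          members <- which(clusters == j)
          members[which.min(rowSums(d[members, members, drop = FALSE]))]
        })
        # Step 5: stop once the centers no longer change
        if (all(new_medoids == medoids)) break
        medoids <- new_medoids
      }
      list(medoids = medoids, clusters = clusters)
    }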

k-medoids clustering code

In this section, we will use the same iris data set used in the previous two sections and compare the results to see whether they differ significantly from those obtained last time.

Implement k-medoids clustering

In this exercise, we will use a pre-built R package to perform k-medoids clustering:

  1. Store the first two columns of the data set in the  iris_data  variable:

     

    iris_data <- iris[, 1:2]
  2. Install the software package:

     

    install.packages("cluster")
  3. Import the software package:

     

    library("cluster") Copy code
  4. Store the PAM clustering results in the  km.res  variable:

     

    km.res <- pam(iris_data, 3)
  5. Import library:

     

    library("factoextra") Copy code
  6. Plot the PAM clustering results in the figure:

     

    fviz_cluster(km.res, data = iris_data, palette = "jco", ggtheme = theme_minimal())

    The output is as follows:

    Figure: Results of k-medoids clustering

The results of k-medoids clustering are not much different from those of the k-means clustering we performed in the previous section: the PAM algorithm divides our data set into three clusters that are similar to the clusters we obtained through k-means clustering.

 

Figure: Results of k-medoids clustering and k-means clustering

In the previous figure, observe how close the k-means and k-medoids cluster centers are, but note that the k-medoids cluster centers directly overlap points that already exist in the data, while the k-means cluster centers do not.

k-means clustering and k-medoids clustering

Now that we have studied k-means and k-medoids clustering and seen that they are almost identical, let us look at the differences between them and when to use each type of clustering:

  • Computational complexity: Of the two methods, k-medoids clustering is more computationally expensive. When our data set is too large (> 10,000 points) and we want to save computation time, we prefer k-means clustering over k-medoids clustering.

    Whether the data set is large depends entirely on the available computing power.

  • The existence of outliers: k-means clustering is more sensitive to outliers than k-medoids clustering.

  • Cluster centers: The k-means and k-medoids algorithms find cluster centers in different ways: a k-medoids cluster center is always an actual data point, while a k-means cluster center need not be.
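A quick way to see this last difference in practice, reusing the iris example from above (a minimal sketch):

    # Centroids vs. medoids on the iris data (a minimal sketch)
    library(cluster)

    iris_data <- iris[, 1:2]
    kmeans(iris_data, 3)$centers  # centroids: averages, usually not actual data points
    pam(iris_data, 3)$medoids     # medoids: always actual rows of the data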

Use k-medoids clustering for customer segmentation

Use the customer data set to perform k-means and k-medoids clustering, and then compare the results.

Steps:

  1. Select only two columns, Grocery and Frozen, so that the clusters are easy to visualize in two dimensions.
  2. Use k-medoids clustering to draw a chart showing the four clusters of the data.
  3. Use k-means clustering to draw a four-cluster graph.
  4. Compare the two graphs and comment on how the results of the two methods differ (a code sketch follows this list).
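A minimal sketch of these steps, assuming the wholesale customers data has been saved as a CSV file named wholesale.csv (a hypothetical path) with columns named Grocery and Frozen:

    # Step 1: keep only the Grocery and Frozen columns
    # ("wholesale.csv" is an assumed file name)
    library(cluster)
    library(factoextra)

    ws_data <- read.csv("wholesale.csv")[, c("Grocery", "Frozen")]

    # Step 2: k-medoids (PAM) with four clusters
    pam.res <- pam(ws_data, 4)
    fviz_cluster(pam.res, data = ws_data, palette = "jco", ggtheme = theme_minimal())

    # Step 3: k-means with four clusters
    km.res <- kmeans(ws_data, 4)
    fviz_cluster(km.res, data = ws_data, palette = "jco", ggtheme = theme_minimal())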

The result will be a plot of the k-means clusters, as shown below:

 

Figure: The expected k-means graph of the cluster

Determine the optimal number of clusters

So far, we have been studying the iris flower data set, in which we know how many species of flowers there are, and based on this knowledge we chose to divide the data set into three clusters. However, in unsupervised learning, our main task is to work with data about which we have no information, such as how many natural clusters or categories a data set contains. Clustering can also be a form of exploratory data analysis.

Types of clustering metrics

There is more than one way of determining the optimal number of clusters in unsupervised learning. Here is what we will study in this chapter:

  • Silhouette score
  • Elbow method/WSS
  • Gap statistic

Silhouette score

The silhouette score, or average silhouette score, is used to quantify the quality of the clustering achieved by a clustering algorithm.

The silhouette score lies between 1 and -1. If the silhouette score of a cluster is low (between 0 and -1), it means that the cluster is spread out or that the distances between the points of the cluster are high. If the silhouette score of a cluster is high (close to 1), it means that the cluster is well defined, the distances between points within the cluster are low, and their distances to points in other clusters are high. Therefore, the ideal silhouette score is close to 1.

 

Calculate the silhouette score

Let us learn how to calculate the silhouette score of a data set with a fixed number of clusters (a consolidated code sketch follows these steps):

  1. Store the first two columns of the iris data set (sepal length and sepal width) in the  iris_data  variable.

  2. Execute k-means clustering and store the result in the  km.res  variable.

  3. Store the pairwise distance matrix of all data points in the  pair_dis  variable.

  4. Calculate the silhouette score of each point in the data set.

  5. Plot the silhouette scores.

    The output is as follows:

    Figure: The silhouette score of each point in each cluster, represented by a single bar
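Here is a consolidated sketch of the steps above, assuming the cluster and factoextra packages are installed:

    library(cluster)
    library(factoextra)

    # Step 1: first two columns of iris (sepal length and sepal width)
    iris_data <- iris[, 1:2]

    # Step 2: k-means clustering with three clusters
    km.res <- kmeans(iris_data, 3)

    # Step 3: pairwise distance matrix of all data points
    pair_dis <- dist(iris_data)

    # Step 4: silhouette score of each point
    sil <- silhouette(km.res$cluster, pair_dis)

    # Step 5: plot the silhouette scores
    fviz_silhouette(sil)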

The silhouette plot shows that the average silhouette score of the data set is 0.45. It also shows the average silhouette score of each cluster and the score of each individual point.

We calculated the silhouette score for three clusters. However, to determine how many clusters to use, the silhouette score must be calculated for several different numbers of clusters in the data set.

Determine the optimal number of clusters

Calculate the silhouette score for each value of k to determine the optimal number of clusters (a code sketch follows these steps):

  1. Put the first two columns (sepal length and sepal width) of the iris data set in the  iris_data  variable.

  2. Import the  factoextra  library.

  3. Draw a graph of the average silhouette score against the number of clusters (up to 20):

    note

    In the second parameter, k-means can be changed to k-medoids or any other type of clustering.

    The output is as follows:

    Figure: Number of clusters and average silhouette score

From the figure, select the value of k with the highest average silhouette score; that is k = 2. According to the silhouette score, the optimal number of clusters is 2.
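A minimal sketch of this exercise; fviz_nbclust from factoextra computes and plots the average silhouette score for each value of k:

    library(factoextra)

    iris_data <- iris[, 1:2]

    # Average silhouette score for up to 20 clusters; the second argument
    # (kmeans) can be swapped for pam or another clustering function
    fviz_nbclust(iris_data, kmeans, method = "silhouette", k.max = 20)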

WSS/elbow method

In order to identify clusters in a data set, we try to minimize the distance between the points within a cluster, and the within-cluster sum of squares (WSS) method measures exactly that. The WSS score is the sum of the squared distances between each point and the center of its cluster.

Use WSS to determine the number of clusters

In this exercise, we will see how to use WSS to determine the number of clusters (a code sketch follows these steps):

  1. Put the first two columns of the iris data set (sepal length and sepal width) in the  iris_data  variable.

  2. Import the  factoextra  library.

  3. Plot WSS against the number of clusters:

    The output is as follows:

    Figure: WSS and the number of clusters
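A minimal sketch using the same fviz_nbclust workflow, this time with the WSS method:

    library(factoextra)

    iris_data <- iris[, 1:2]

    # WSS (within-cluster sum of squares) for up to 20 clusters
    fviz_nbclust(iris_data, kmeans, method = "wss", k.max = 20)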

In the WSS graph, we can choose k = 3 as the elbow of the graph, because the value of WSS starts to decrease more slowly after k = 3. Choosing the elbow of the graph is always a subjective choice; sometimes you might choose k = 4 or k = 2 instead of k = 3. But for this graph, it is clear that values of k > 5 are unsuitable, because they do not lie at the elbow, the place where the slope of the graph changes sharply.

Gap statistic

The gap statistic is one of the most effective methods for finding the optimal number of clusters in a data set, and it is applicable to any type of clustering method. The gap statistic is calculated by comparing the WSS values of the clusters generated from the observed data set with those of a reference data set in which there is no apparent clustering.

So, in short, the gap statistic measures the WSS values of both the observed data set and a random reference data set, and finds the deviation of the observed data set from the random one. To find the ideal number of clusters, we choose the value of k that gives the maximum value of the gap statistic.

Use the gap statistic to calculate the ideal number of clusters

In this exercise, we will use the gap statistic to calculate the ideal number of clusters (a code sketch follows these steps):

  1. Put the first two columns (sepal length and sepal width) of the iris data set in the  iris_data  variable.

  2. Import the  factoextra  library.

  3. Plot the gap statistic against the number of clusters (up to 20):

    The output is as follows:

    Figure 1.35: Gap statistic and number of clusters
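A minimal sketch of this exercise; method = "gap_stat" asks fviz_nbclust to compute the gap statistic:

    library(factoextra)

    iris_data <- iris[, 1:2]

    # Gap statistic for up to 20 clusters
    fviz_nbclust(iris_data, kmeans, method = "gap_stat", k.max = 20)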

As the plot shows, the gap statistic reaches its maximum at k = 3. Therefore, the ideal number of clusters in the data set is 3.

Find the ideal number of market segments

Use all three of the methods above to find the optimal number of clusters in the customer data set:

Load the fifth and sixth columns (Grocery and Frozen) of the wholesale customers data set into a variable.

  1. Use the silhouette score to calculate the optimal number of clusters for k-means clustering.
  2. Use the WSS score to calculate the optimal number of clusters for k-means clustering.
  3. Use the gap statistic to calculate the optimal number of clusters for k-means clustering.

The result will be three graphs representing the optimal number of clusters according to the silhouette score, the WSS score, and the gap statistic. A code sketch follows.
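A minimal sketch of the activity, again assuming the data has been saved as wholesale.csv (a hypothetical path):

    library(factoextra)

    # Fifth and sixth columns of the wholesale customers data: Grocery and Frozen
    ws_data <- read.csv("wholesale.csv")[, 5:6]

    fviz_nbclust(ws_data, kmeans, method = "silhouette", k.max = 20)  # silhouette score
    fviz_nbclust(ws_data, kmeans, method = "wss", k.max = 20)         # elbow/WSS
    fviz_nbclust(ws_data, kmeans, method = "gap_stat", k.max = 20)    # gap statistic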

