Reproduce the classic: "Statistical Learning Methods" Chapter 14 Clustering Methods

Chapter 14 Clustering Methods

This article is a code reproduction of the book "Statistical Learning Methods" by Li Hang. Code author: Huang Haiguang

Note: the code can be downloaded on GitHub. It will be published piece by piece on the official account "Machine Learning Beginner" and can be read online in this collection.

1. Clustering is the data analysis problem of grouping given samples into several "classes" or "clusters" according to the similarity or distance between their attributes. A class is a subset of the samples. Intuitively, similar samples gather in the same class, and dissimilar samples fall into different classes.

2. Distance or similarity measures play an important role in clustering.

Commonly used distance measures include the Minkowski distance, whose special cases are the Euclidean, Manhattan, and Chebyshev distances, as well as the Mahalanobis distance. Commonly used similarity measures include the correlation coefficient and the angle cosine. When distance is used to measure similarity, the smaller the distance, the more similar the samples; when the correlation coefficient is used, the larger the coefficient, the more similar the samples.
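As a quick illustration (a minimal numpy sketch with made-up sample points, not part of the book's code), these measures can be computed as follows:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])

def minkowski(a, b, p):
    # Minkowski distance of order p: p=2 gives Euclidean, p=1 Manhattan
    return np.sum(np.abs(a - b) ** p) ** (1.0 / p)

euclidean = minkowski(x, y, 2)
manhattan = minkowski(x, y, 1)
chebyshev = np.max(np.abs(x - y))   # limit of the Minkowski distance as p grows

# similarity measures: correlation coefficient and angle cosine
corr = np.corrcoef(x, y)[0, 1]
cosine = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
```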

3. A class is a subset of samples. One basic definition is the following: let $G$ denote a class or cluster, let $x_i, x_j$, etc. denote samples in the class, and let $d_{ij}$ denote the distance between samples $x_i$ and $x_j$. If for any $x_i, x_j \in G$,

$$d_{ij} \le T,$$

where $T$ is a given positive threshold, then $G$ is called a class or cluster.

The indicators describing the characteristics of a class include its center, diameter, scatter matrix, and covariance matrix.
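For reference, with a class $G$ containing $n_G$ samples $x_i$, these indicators are usually defined as follows (the standard definitions, stated here as a reminder rather than quoted from the book):

$$\bar{x}_G = \frac{1}{n_G}\sum_{i=1}^{n_G} x_i, \qquad D_G = \max_{x_i, x_j \in G} d_{ij}, \qquad A_G = \sum_{i=1}^{n_G} (x_i - \bar{x}_G)(x_i - \bar{x}_G)^T, \qquad S_G = \frac{1}{n_G - 1} A_G$$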

4. The between-class distances used in clustering, also called linkage distances, include the shortest distance, the longest distance, the center distance, and the average distance.
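A minimal numpy sketch of these four between-class distances, using two made-up clusters `Gp` and `Gq` (the names and sample values are only for illustration):

```python
import numpy as np

Gp = np.array([[0.0, 0.0], [1.0, 0.0]])   # samples of one class
Gq = np.array([[4.0, 3.0], [5.0, 3.0]])   # samples of another class

# pairwise Euclidean distances between every sample of Gp and every sample of Gq
pairwise = np.linalg.norm(Gp[:, None, :] - Gq[None, :, :], axis=-1)

single   = pairwise.min()                                      # shortest distance (single linkage)
complete = pairwise.max()                                      # longest distance (complete linkage)
centroid = np.linalg.norm(Gp.mean(axis=0) - Gq.mean(axis=0))   # center distance
average  = pairwise.mean()                                     # average distance
```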

5. Hierarchical clustering assumes a hierarchical structure among the classes and clusters the samples into hierarchical classes. There are two approaches: agglomerative (bottom-up) and divisive (top-down).

Agglomerative clustering starts by placing each sample in its own class; the two closest classes are then merged into a new class, and this operation is repeated until a stopping condition is met, yielding a hierarchy of classes. Divisive clustering starts by placing all samples in one class; the samples farthest apart in an existing class are then split into two new classes, and this operation is repeated until a stopping condition is met, yielding a hierarchy of classes.

Agglomerative clustering requires three elements to be determined in advance:

(1) the distance or similarity measure; (2) the merging rule; (3) the stopping condition.

Different combinations of these elements yield different clustering methods.

6. k-means clustering is a commonly used clustering algorithm with the following characteristics: it is a partition-based clustering method; the number of classes k is specified in advance; the distance between samples is measured by the squared Euclidean distance; each class is represented by its center, i.e. the mean of its samples; the sum of the distances between the samples and the centers of the classes they belong to is the objective function being optimized; the resulting classes are flat and non-hierarchical; the algorithm is iterative and cannot guarantee a global optimum.
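In symbols, writing $C(i) = l$ for "sample $x_i$ is assigned to class $G_l$" and $\bar{x}_l$ for the mean of class $G_l$, the objective minimized by k-means takes the standard form

$$W(C) = \sum_{l=1}^{k} \sum_{C(i) = l} \left\| x_i - \bar{x}_l \right\|^2 .$$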

The k-means algorithm first selects k class centers and assigns each sample to the class whose center is nearest, giving a clustering result; it then recomputes the mean of the samples in each class as the new class center; these steps are repeated until convergence.

Hierarchical clustering

  1. Agglomerative (bottom-up): the agglomerative method starts by placing each sample in its own class, then merges the two closest classes into a new class, and repeats this operation until the stopping condition is met, yielding a hierarchy of classes.
  2. Divisive (top-down): the divisive method starts by placing all samples in one class, then splits the samples farthest apart in an existing class into two new classes, and repeats this operation until the stopping condition is met, yielding a hierarchy of classes.

k-means clustering

k-means clustering is a center-based clustering method. Through iteration, the samples are divided into k classes so that each sample is closest to the center, or mean, of the class it belongs to; the result is k flat, non-hierarchical classes, i.e. a partition of the space.

```python
import math
import random
import numpy as np
from sklearn import datasets, cluster
import matplotlib.pyplot as plt
```

```python
iris = datasets.load_iris()
```

```python
gt = iris['target']
gt
```

```
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
```

There are 3 classes.

```python
iris['data'][:, :2].shape
```

```
(150, 2)
```

```python
data = iris['data'][:, :2]
```

```python
x = data[:, 0]
y = data[:, 1]
```

```python
plt.scatter(x, y, color='green')
plt.xlim(4, 8)
plt.ylim(1, 5)
plt.show()
```

```python
# Define the node of the cluster tree
class ClusterNode:
    def __init__(self, vec, left=None, right=None, distance=-1, id=None, count=1):
        """
        :param vec: center of the new cluster formed by merging two clusters
        :param left: left child node
        :param right: right child node
        :param distance: distance between the two merged nodes
        :param id: used to mark which nodes have been processed
        :param count: number of leaf nodes under this node
        """
        self.vec = vec
        self.left = left
        self.right = right
        self.distance = distance
        self.id = id
        self.count = count
```

```python
def euler_distance(point1: np.ndarray, point2: list) -> float:
    """
    Compute the Euclidean distance between two points (any dimension).
    """
    distance = 0.0
    for a, b in zip(point1, point2):
        distance += math.pow(a - b, 2)
    return math.sqrt(distance)
```

```python
# Hierarchical clustering (agglomerative method)
class Hierarchical:
    def __init__(self, k):
        self.k = k
        self.labels = None

    def fit(self, x):
        nodes = [ClusterNode(vec=v, id=i) for i, v in enumerate(x)]
        distances = {}
        point_num, feature_num = x.shape
        self.labels = [-1] * point_num
        currentclustid = -1
        while len(nodes) > self.k:
            min_dist = math.inf
            nodes_len = len(nodes)
            closest_part = None
            # find the two closest clusters
            for i in range(nodes_len - 1):
                for j in range(i + 1, nodes_len):
                    d_key = (nodes[i].id, nodes[j].id)
                    if d_key not in distances:
                        distances[d_key] = euler_distance(nodes[i].vec, nodes[j].vec)
                    d = distances[d_key]
                    if d < min_dist:
                        min_dist = d
                        closest_part = (i, j)

            # merge them into a new node whose center is the weighted mean
            part1, part2 = closest_part
            node1, node2 = nodes[part1], nodes[part2]
            new_vec = [(node1.vec[i] * node1.count + node2.vec[i] * node2.count) / (node1.count + node2.count)
                       for i in range(feature_num)]
            new_node = ClusterNode(vec=new_vec,
                                   left=node1,
                                   right=node2,
                                   distance=min_dist,
                                   id=currentclustid,
                                   count=node1.count + node2.count)
            currentclustid -= 1
            del nodes[part2], nodes[part1]
            nodes.append(new_node)

        self.nodes = nodes
        self.calc_label()

    def calc_label(self):
        """
        Collect the clustering result.
        """
        for i, node in enumerate(self.nodes):
            # assign all leaf nodes of this node to cluster i
            self.leaf_traversal(node, i)

    def leaf_traversal(self, node: ClusterNode, label):
        """
        Recursively traverse the leaf nodes.
        """
        if node.left is None and node.right is None:
            self.labels[node.id] = label
        if node.left:
            self.leaf_traversal(node.left, label)
        if node.right:
            self.leaf_traversal(node.right, label)

# reference: https://zhuanlan.zhihu.com/p/32438294
```
```python
my = Hierarchical(3)
my.fit(data)
labels = np.array(my.labels)
print(labels)
```

```
[2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 0 0 0 0 0 0 0 2 0 2 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 2 0 0 0 1 0 0 1 2 1 0 1 0 0 0 0 0 0 0 1 1 0 0 0 1 0 0 1 0 0 0 1 1 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
```

```python
# visualize result
cat1 = data[np.where(labels == 0)]
cat2 = data[np.where(labels == 1)]
cat3 = data[np.where(labels == 2)]

plt.scatter(cat1[:, 0], cat1[:, 1], color='green')
plt.scatter(cat2[:, 0], cat2[:, 1], color='red')
plt.scatter(cat3[:, 0], cat3[:, 1], color='blue')
plt.title('Hierarchical clustering with k=3')
plt.xlim(4, 8)
plt.ylim(1, 5)
plt.show()
```

```python
sk = cluster.AgglomerativeClustering(3)
sk.fit(data)
labels_ = sk.labels_
print(labels_)
```

```
[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 2 0 2 0 1 0 1 1 0 2 0 2 0 2 2 2 2 0 0 2 0 0 0 0 0 0 2 2 2 2 0 2 0 0 2 2 2 2 0 2 1 2 2 2 0 1 2 0 2 0 0 0 0 1 0 0 0 0 0 0 2 2 0 0 0 0 2 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 2 0 0 0]
```

```python
# visualize result of sklearn
cat1_ = data[np.where(labels_ == 0)]
cat2_ = data[np.where(labels_ == 1)]
cat3_ = data[np.where(labels_ == 2)]

plt.scatter(cat1_[:, 0], cat1_[:, 1], color='green')
plt.scatter(cat2_[:, 0], cat2_[:, 1], color='red')
plt.scatter(cat3_[:, 0], cat3_[:, 1], color='blue')
plt.title('Hierarchical clustering with k=3')
plt.xlim(4, 8)
plt.ylim(1, 5)
plt.show()
```

```python
# kmeans
class MyKmeans:
    def __init__(self, k, n=20):
        self.k = k
        self.n = n

    def fit(self, x, centers=None):
        # step 1: pick k points at random as initial centers, or use the given ones
        if centers is None:
            idx = np.random.randint(low=0, high=len(x), size=self.k)
            centers = x[idx]
        # print(centers)

        inters = 0
        while inters < self.n:
            # print(inters)
            # print(centers)
            points_set = {key: [] for key in range(self.k)}

            # step 2: assign every point p to the set of its nearest center
            for p in x:
                nearest_index = np.argmin(np.sum((centers - p) ** 2, axis=1) ** 0.5)
                points_set[nearest_index].append(p)

            # step 3: recompute each cluster center as the mean of its points
            for i_k in range(self.k):
                centers[i_k] = sum(points_set[i_k]) / len(points_set[i_k])

            inters += 1

        return points_set, centers
```

```python
m = MyKmeans(3)
points_set, centers = m.fit(data)
```

```python
centers
```

```
array([[5.006     , 3.428     ],
       [6.81276596, 3.07446809],
       [5.77358491, 2.69245283]])
```
```python
# visualize result
cat1 = np.asarray(points_set[0])
cat2 = np.asarray(points_set[1])
cat3 = np.asarray(points_set[2])

for ix, p in enumerate(centers):
    plt.scatter(p[0], p[1], color='C{}'.format(ix), marker='^', edgecolor='black', s=256)

plt.scatter(cat1[:, 0], cat1[:, 1], color='green')
plt.scatter(cat2[:, 0], cat2[:, 1], color='red')
plt.scatter(cat3[:, 0], cat3[:, 1], color='blue')
plt.title('Kmeans clustering with k=3')
plt.xlim(4, 8)
plt.ylim(1, 5)
plt.show()
```

```python
# using sklearn
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=3, max_iter=100).fit(data)
gt_labels__ = kmeans.labels_
centers__ = kmeans.cluster_centers_
```

```python
gt_labels__
```

```
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 2, 0, 2, 0, 2,
       0, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 0, 0, 0, 0, 2, 2, 2, 2, 2, 2, 2, 2, 0,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 2, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 2, 2, 0,
       0, 0, 0, 2, 0, 2, 0, 2, 0, 0, 2, 2, 0, 0, 0, 0, 0, 2, 2, 0, 0, 0, 2, 0, 0, 0, 2, 0, 0,
       0, 2, 0, 0, 2])
```

```python
centers__
```

```
array([[6.81276596, 3.07446809],
       [5.006     , 3.428     ],
       [5.77358491, 2.69245283]])
```
```python
# visualize result
cat1 = data[gt_labels__ == 0]
cat2 = data[gt_labels__ == 1]
cat3 = data[gt_labels__ == 2]

for ix, p in enumerate(centers__):
    plt.scatter(p[0], p[1], color='C{}'.format(ix), marker='^', edgecolor='black', s=256)

plt.scatter(cat1[:, 0], cat1[:, 1], color='green')
plt.scatter(cat2[:, 0], cat2[:, 1], color='red')
plt.scatter(cat3[:, 0], cat3[:, 1], color='blue')
plt.title('kmeans using sklearn with k=3')
plt.xlim(4, 8)
plt.ylim(1, 5)
plt.show()
```

Finding the value of k

```python
from sklearn.cluster import KMeans

loss = []
for i in range(1, 10):
    kmeans = KMeans(n_clusters=i, max_iter=100).fit(data)
    loss.append(kmeans.inertia_ / len(data) / 3)

plt.title('K with loss')
plt.plot(range(1, 10), loss)
plt.show()
```

Example 14.2
```python
X = [[0, 2], [0, 0], [1, 0], [5, 0], [5, 2]]
```

```python
np.asarray(X)
```

```
array([[0, 2],
       [0, 0],
       [1, 0],
       [5, 0],
       [5, 2]])
```
```python
m = MyKmeans(2, 100)
points_set, centers = m.fit(np.asarray(X))
```

```python
points_set
```

```
{0: [array([0, 2]), array([0, 0]), array([1, 0])],
 1: [array([5, 0]), array([5, 2])]}
```

```python
centers
```

```
array([[0, 0],
       [5, 1]])
```
```python
kmeans = KMeans(n_clusters=2, max_iter=100).fit(np.asarray(X))
```

```python
kmeans.labels_
```

```
array([0, 0, 0, 1, 1])
```

```python
kmeans.cluster_centers_
```

```
array([[0.33333333, 0.66666667],
       [5.        , 1.        ]])
```

Code source for this chapter: github.com/hktxt/Learn...

download link

github.com/fengdu78/li...

Reference materials:

[1] "Statistical Learning Method": baike.baidu.com/item/Statistical Learning Method...

[2] Huang Haiguang : github.com/fengdu78

[3] github: github.com/fengdu78/li...

[4] wzyonggege: github.com/wzyonggege/

[5] WenDesi: github.com/WenDesi/lih...

[6] Hot and hot: blog.csdn.net/tudaodiaozh...

[7]  hktxt: github.com/hktxt/Learn