Chapter 14 Clustering Methods

Note: The code can be downloaded from GitHub. I will publish the code piece by piece on the official account "Machine Learning Beginner", where it can be read online in this album.

1. Clustering is the data analysis problem of grouping given samples into several "classes" or "clusters" according to the similarity or distance between their attributes. A class is a subset of the samples. Intuitively, similar samples are gathered into the same class, and dissimilar samples are scattered into different classes.

2. Distance or similarity measures play an important role in clustering.

Commonly used distance measures include the Minkowski distance (of which the Euclidean, Manhattan, and Chebyshev distances are special cases) and the Mahalanobis distance. Commonly used similarity measures include the correlation coefficient and the angle cosine. When distance is used to measure similarity, the smaller the distance, the more similar the samples; when the correlation coefficient is used, the larger the coefficient, the more similar the samples. A few of these measures are sketched below.
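As a minimal sketch (the sample vectors x and y are made up for illustration), these measures can be computed with numpy and scipy.spatial.distance:

```
import numpy as np
from scipy.spatial import distance

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])

print(distance.minkowski(x, y, p=2))  # Euclidean distance (Minkowski with p=2)
print(distance.cityblock(x, y))       # Manhattan distance (Minkowski with p=1)
print(distance.chebyshev(x, y))       # Chebyshev distance (Minkowski as p -> infinity)

# Similarity measures: larger values mean more similar samples
print(np.corrcoef(x, y)[0, 1])                          # correlation coefficient
print(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))  # angle cosine
```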

3. A class is a subset of the samples. A basic definition is the following: let $G$ denote a class or cluster, let $x_i, x_j$ denote samples in the class, and let $d_{ij}$ denote the distance between samples $x_i$ and $x_j$. If, for a given positive number $T$, every pair $x_i, x_j \in G$ satisfies

$$d_{ij} \le T,$$

then $G$ is called a class or cluster.

The indicators describing the characteristics of a class include its center, diameter, scatter matrix, and covariance matrix.
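As a rough sketch (G below is a made-up class of three samples), these indicators can be computed directly with numpy; the covariance here follows the usual sample convention of dividing by n - 1:

```
import numpy as np

G = np.array([[0.0, 2.0], [0.0, 0.0], [1.0, 0.0]])  # one class of 3 samples

center = G.mean(axis=0)                        # class center: the mean vector
diameter = max(np.linalg.norm(a - b)           # diameter: largest pairwise distance
               for a in G for b in G)
scatter = (G - center).T @ (G - center)        # scatter matrix
covariance = scatter / (len(G) - 1)            # sample covariance matrix
```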

4. The distance between classes used in the clustering process, also called the linkage between classes, includes the shortest distance, the longest distance, the center distance, and the average distance, as sketched below.
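A minimal sketch of the four definitions, computed over the pairwise distances between two made-up classes Gp and Gq:

```
import numpy as np
from scipy.spatial.distance import cdist

Gp = np.array([[0.0, 0.0], [1.0, 0.0]])
Gq = np.array([[5.0, 0.0], [5.0, 2.0]])

D = cdist(Gp, Gq)                                           # all pairwise distances
shortest = D.min()                                          # shortest (single linkage)
longest = D.max()                                           # longest (complete linkage)
center = np.linalg.norm(Gp.mean(axis=0) - Gq.mean(axis=0))  # center distance
average = D.mean()                                          # average distance
```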

5. Hierarchical clustering assumes a hierarchical structure among the classes. There are two approaches to building the hierarchy: agglomerative (bottom-up) and divisive (top-down).

Agglomerative clustering starts with each sample as its own class; it then merges the two closest classes into a new class and repeats this operation until a stopping condition is met, yielding a hierarchy of classes. Divisive clustering starts with all samples in one class; it then splits the samples farthest apart in an existing class into two new classes and repeats this operation until a stopping condition is met, likewise yielding a hierarchy of classes.

Agglomerative clustering requires three elements to be determined in advance:

(1) a distance or similarity measure; (2) a merging rule; (3) a stopping condition.

Different combinations of these elements yield different clustering methods.

6. k-means clustering is a commonly used clustering algorithm with the following characteristics: it is a partition-based clustering method; the number of classes k is specified in advance; the distance between samples is measured by the squared Euclidean distance, and a class is represented by its center, i.e. the mean of its samples; the objective function to minimize is the sum of the distances between each sample and the center of the class it belongs to; the resulting classes are flat and non-hierarchical; and the algorithm is iterative and cannot guarantee a global optimum.

The k-means algorithm first selects k class centers and assigns each sample to the class whose center is closest, producing a clustering result; it then recomputes the mean of the samples in each class as the new class center, and repeats these steps until convergence.

Hierarchical clustering

1. Agglomerative (bottom-up): the agglomerative method starts with each sample as its own class, then merges the two closest classes into a new class, repeating this operation until the stopping condition is met and a hierarchy of classes is obtained.
2. Divisive (top-down): the divisive method starts with all samples in one class, then splits the samples farthest apart in an existing class into two new classes, repeating this operation until the stopping condition is met and a hierarchy of classes is obtained.

k-means clustering

k-means clustering is a center-based clustering method. Through iteration, it divides the samples into k classes such that each sample is closest to the center, or mean, of the class it belongs to, yielding k flat, non-hierarchical classes that form a partition of the sample space.

```
import math
import random
import numpy as np
from sklearn import datasets, cluster
import matplotlib.pyplot as plt
```
```
iris = datasets.load_iris()
```
```
gt = iris['target']; gt
```
```
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
```

There are 3 classes.

```
iris['data'][:, :2].shape
```
```
(150, 2)
```
```
data = iris['data'][:, :2]
```
```
x = data[:, 0]
y = data[:, 1]
```
```
plt.scatter(x, y, color='green')
plt.xlim(4, 8)
plt.ylim(1, 5)
plt.show()
```

```
# Define a node in the hierarchical cluster tree

class ClusterNode:
    def __init__(self, vec, left=None, right=None, distance=-1, id=None, count=1):
        """
        :param vec: center vector of the new cluster formed by merging two clusters
        :param left: left child node
        :param right: right child node
        :param distance: distance between the two child nodes
        :param id: marks which nodes have been processed
        :param count: number of leaf nodes under this node
        """
        self.vec = vec
        self.left = left
        self.right = right
        self.distance = distance
        self.id = id
        self.count = count
```
```
def euler_distance(point1: np.ndarray, point2: list) -> float:
    """
    Compute the Euclidean distance between two points; supports any dimension
    """
    distance = 0.0
    for a, b in zip(point1, point2):
        distance += math.pow(a - b, 2)
    return math.sqrt(distance)
```
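For reference, the same Euclidean distance can be computed with a single vectorized numpy call; this hypothetical euler_distance_np helper (not part of the original code) is equivalent to the loop above:

```
def euler_distance_np(point1, point2) -> float:
    # Vectorized Euclidean distance, equivalent to euler_distance above
    return float(np.linalg.norm(np.asarray(point1) - np.asarray(point2)))
```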
```
# Hierarchical clustering (agglomerative method)

class Hierarchical:
    def __init__(self, k):
        self.k = k
        self.labels = None

    def fit(self, x):
        nodes = [ClusterNode(vec=v, id=i) for i, v in enumerate(x)]
        distances = {}
        point_num, feature_num = x.shape
        self.labels = [-1] * point_num
        currentclustid = -1
        while len(nodes) > self.k:
            min_dist = math.inf
            nodes_len = len(nodes)
            closest_part = None
            # Find the pair of clusters with the smallest distance
            for i in range(nodes_len - 1):
                for j in range(i + 1, nodes_len):
                    d_key = (nodes[i].id, nodes[j].id)
                    if d_key not in distances:
                        distances[d_key] = euler_distance(nodes[i].vec, nodes[j].vec)
                    d = distances[d_key]
                    if d < min_dist:
                        min_dist = d
                        closest_part = (i, j)

            # Merge the two closest clusters into a new node whose center is
            # the weighted mean of the two child centers
            part1, part2 = closest_part
            node1, node2 = nodes[part1], nodes[part2]
            new_vec = [(node1.vec[i] * node1.count + node2.vec[i] * node2.count) / (node1.count + node2.count)
                       for i in range(feature_num)]
            new_node = ClusterNode(vec=new_vec,
                                   left=node1,
                                   right=node2,
                                   distance=min_dist,
                                   id=currentclustid,
                                   count=node1.count + node2.count)
            currentclustid -= 1
            del nodes[part2], nodes[part1]
            nodes.append(new_node)

        self.nodes = nodes
        self.calc_label()

    def calc_label(self):
        """
        Retrieve the clustering result
        """
        for i, node in enumerate(self.nodes):
            # Assign label i to all leaf nodes under this node
            self.leaf_traversal(node, i)

    def leaf_traversal(self, node: ClusterNode, label):
        """
        Recursively traverse the leaf nodes
        """
        if node.left is None and node.right is None:
            self.labels[node.id] = label
        if node.left:
            self.leaf_traversal(node.left, label)
        if node.right:
            self.leaf_traversal(node.right, label)

# Reference: https://zhuanlan.zhihu.com/p/32438294
```
```
my = Hierarchical(3)
my.fit(data)
labels = np.array(my.labels)
print(labels)
```
```
[2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 0 0 0 0 0 0 0 2 0 2 2 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 2 0 0 0 1 0 0 1 2 1 0 1 0
 0 0 0 0 0 0 1 1 0 0 0 1 0 0 1 0 0 0 1 1 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
 0 0]
```
```
# visualize result

cat1 = data[np.where(labels == 0)]
cat2 = data[np.where(labels == 1)]
cat3 = data[np.where(labels == 2)]

plt.scatter(cat1[:, 0], cat1[:, 1], color='green')
plt.scatter(cat2[:, 0], cat2[:, 1], color='red')
plt.scatter(cat3[:, 0], cat3[:, 1], color='blue')
plt.title('Hierarchical clustering with k=3')
plt.xlim(4, 8)
plt.ylim(1, 5)
plt.show()
```

```
sk = cluster.AgglomerativeClustering(3)
sk.fit(data)
labels_ = sk.labels_
print(labels_)
```
```
[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 2 0 2 0 1 0 1 1 0 2 0 2 0 2 2 2 2 0 0 2 0
 0 0 0 0 0 2 2 2 2 0 2 0 0 2 2 2 2 0 2 1 2 2 2 0 1 2 0 2 0 0 0 0 1 0 0 0 0
 0 0 2 2 0 0 0 0 2 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 2 0
 0 0]
```
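The labels need not match our implementation exactly: our Hierarchical class merges clusters by their weighted centers, while sklearn's AgglomerativeClustering defaults to Ward linkage. Passing a different linkage, e.g. 'average', is one way to probe the difference (a sketch, not part of the original experiment):

```
sk_avg = cluster.AgglomerativeClustering(n_clusters=3, linkage='average')
sk_avg.fit(data)
print(sk_avg.labels_)
```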
```
# visualize result of sklearn

cat1_ = data[np.where(labels_ == 0)]
cat2_ = data[np.where(labels_ == 1)]
cat3_ = data[np.where(labels_ == 2)]

plt.scatter(cat1_[:, 0], cat1_[:, 1], color='green')
plt.scatter(cat2_[:, 0], cat2_[:, 1], color='red')
plt.scatter(cat3_[:, 0], cat3_[:, 1], color='blue')
plt.title('Hierarchical clustering with k=3')
plt.xlim(4, 8)
plt.ylim(1, 5)
plt.show()
```

```
# kmeans

class MyKmeans:
    def __init__(self, k, n=20):
        self.k = k
        self.n = n  # number of iterations

    def fit(self, x, centers=None):
        # Step 1: randomly select k points as initial centers, or use the given ones
        if centers is None:
            idx = np.random.randint(low=0, high=len(x), size=self.k)
            centers = x[idx]
        # print(centers)

        inters = 0
        while inters < self.n:
            # print(inters)
            # print(centers)
            points_set = {key: [] for key in range(self.k)}

            # Step 2: assign each point p to the set of its nearest cluster center
            for p in x:
                nearest_index = np.argmin(np.sum((centers - p) ** 2, axis=1) ** 0.5)
                points_set[nearest_index].append(p)

            # Step 3: recompute each cluster center as the mean of its points
            # (assumes no cluster becomes empty)
            for i_k in range(self.k):
                centers[i_k] = sum(points_set[i_k]) / len(points_set[i_k])

            inters += 1

        return points_set, centers
```
```
m = MyKmeans(3)
points_set, centers = m.fit(data)
```
```
centers
```
```
array([[5.006     , 3.428     ],
       [6.81276596, 3.07446809],
       [5.77358491, 2.69245283]])
```
```
# visualize result

cat1 = np.asarray(points_set[0])
cat2 = np.asarray(points_set[1])
cat3 = np.asarray(points_set[2])

for ix, p in enumerate(centers):
    plt.scatter(p[0], p[1], color='C{}'.format(ix), marker='^', edgecolor='black', s=256)

plt.scatter(cat1[:, 0], cat1[:, 1], color='green')
plt.scatter(cat2[:, 0], cat2[:, 1], color='red')
plt.scatter(cat3[:, 0], cat3[:, 1], color='blue')
plt.title('kmeans with k=3')
plt.xlim(4, 8)
plt.ylim(1, 5)
plt.show()
```

```
# using sklearn
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=3, max_iter=100).fit(data)
gt_labels__ = kmeans.labels_
centers__ = kmeans.cluster_centers_
```
```
gt_labels__
```
```
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 0, 0, 0, 2, 0, 2, 0, 2, 0, 2, 2, 2, 2, 2, 2, 0,
       2, 2, 2, 2, 2, 2, 2, 2, 0, 0, 0, 0, 2, 2, 2, 2, 2, 2, 2, 2, 0, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 2, 0, 0, 0, 0, 2, 0, 0, 0,
       0, 0, 0, 2, 2, 0, 0, 0, 0, 2, 0, 2, 0, 2, 0, 0, 2, 2, 0, 0, 0, 0,
       0, 2, 2, 0, 0, 0, 2, 0, 0, 0, 2, 0, 0, 0, 2, 0, 0, 2])
```
```
centers__
```
```
array([[6.81276596, 3.07446809],
       [5.006     , 3.428     ],
       [5.77358491, 2.69245283]])
```
```
# visualize result

cat1 = data[gt_labels__ == 0]
cat2 = data[gt_labels__ == 1]
cat3 = data[gt_labels__ == 2]

for ix, p in enumerate(centers__):
    plt.scatter(p[0], p[1], color='C{}'.format(ix), marker='^', edgecolor='black', s=256)

plt.scatter(cat1[:, 0], cat1[:, 1], color='green')
plt.scatter(cat2[:, 0], cat2[:, 1], color='red')
plt.scatter(cat3[:, 0], cat3[:, 1], color='blue')
plt.title('kmeans using sklearn with k=3')
plt.xlim(4, 8)
plt.ylim(1, 5)
plt.show()
```

Finding the value of k

To choose k, a common heuristic is the elbow method: run k-means for a range of k values, plot the loss (the within-cluster sum of squares, kmeans.inertia_, here normalized) against k, and pick the k at the "elbow" where the loss stops dropping sharply.

```
from sklearn.cluster import KMeans

loss = []

for i in range(1, 10):
    kmeans = KMeans(n_clusters=i, max_iter=100).fit(data)
    loss.append(kmeans.inertia_ / len(data) / 3)

plt.title('K with loss')
plt.plot(range(1, 10), loss)
plt.show()
```

Example 14.2
```
X = [[0, 2], [0, 0], [1, 0], [5, 0], [5, 2]]
```
```
np.asarray(X)
```
```
array([[0, 2],
       [0, 0],
       [1, 0],
       [5, 0],
       [5, 2]])
```
```
m = MyKmeans(2, 100)
points_set, centers = m.fit(np.asarray(X))
```
```
points_set
```
```
{0: [array([0, 2]), array([0, 0]), array([1, 0])],
 1: [array([5, 0]), array([5, 2])]}
```
```
centers
```
```
array([[0, 0],
       [5, 1]])
```
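Note that these centers are truncated to integers: np.asarray(X) produces an integer array, and MyKmeans writes the new means back into it, so the true mean [1/3, 2/3] becomes [0, 0]. Casting the input to float avoids this; a suggested fix, not part of the original run, assuming the random initialization lands on distinct clusters:

```
points_set, centers = m.fit(np.asarray(X, dtype=float))
# centers is now approximately array([[0.333, 0.667], [5., 1.]]),
# matching the sklearn result below
```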
```
kmeans = KMeans(n_clusters=2, max_iter=100).fit(np.asarray(X))
```
```
kmeans.labels_
```
```
array([0, 0, 0, 1, 1])
```
```
kmeans.cluster_centers_
```
```
array([[0.33333333, 0.66666667],
       [5.        , 1.        ]])
```

Code source for this chapter: github.com/hktxt/Learn...

github.com/fengdu78/li...

Reference materials:

[1] "Statistical Learning Method": baike.baidu.com/item/Statistical Learning Method...

[2] Huang Haiguang : github.com/fengdu78

[3] github: github.com/fengdu78/li...

[4] wzyonggege: github.com/wzyonggege/

[5] WenDesi: github.com/WenDesi/lih...

[6] Hot and hot: blog.csdn.net/tudaodiaozh...

[7]  hktxt: github.com/hktxt/Learn