kHMC algorithm

kHMC algorithm

Written in front: The algorithm is a top-k mining algorithm designed based on the utility-list structure of HUI-Miner. Many of the threshold self-increment strategies adopted are worthy of our study and research. The purpose of writing this note is also to introduce these strategies And personal thoughts

An efficient algorithm for mining the top-k high utility itemsets, using novel threshold raising and pruning strategies


External utility of items

Example database


For most of the content, please refer to the HUI-Miner algorithm notes, here are only some additional supplements

Estimated Utility Co-occurrence Structure (EUCS)

The paper defines EUCS as a tuple ({(a; b; TWU(ab))\{(a;/b;/TWU(ab)) ,a,bAI a,b/in I^* ,TWU(ab)AR+}TWU(ab)/in R^+\}), whereaa andbb are all items (not repeated by default)R+R^+ Explain that the algorithm does not consider the situation of negative utility. In order to facilitate understanding, the author drew several matrix diagrams to illustrate (Ps. actually uses hashmap to store this information, not a matrix structure). The specific construction process is as follows:

Utility list construction

  • Itemset utility value: In the utility list that has been constructed, the formula for finding the utility value of an itemset is ulAUL(X)ul.iutil\sum_{\rm ul/in UL(X)}ul.iutil , whereUL(X)UL(X) refers to the utility list of item set X (refer toHUI-Miner forhow to construct it)

Transitive extension upper-bound utility

Given any set of items XX and its expandable itemsxx , the new boundary value calculation formula isteu(X,y)teu(X, y) = ulAUL(X)ul.iutil\sum_{\rm ul/in UL(X)}ul.iutil + ulAUL(y)ul.iutil\sum_{\rm ul/in UL(y)}ul.iutil +u(y)u(y)


When two items ii andjj has a relationshipg(i) g(j)g(i)/subseteq g(j) , we call the termjj contains (cover) itemsii , you can deduce the relevant itemsii 's coverageC(i)={g(i) g(x) xAI}C(i) =/{g(i)/subseteq g(x)/land x/in I\} , whereg(x)g(x) means to include itemsxxThe serial number (TID)of all transaction items of , namelyg(x)={tid x Ttid}g(x) =/{tid/mid x/subseteq T_{\rm tid}\}

Further expansion, there is also a coverage relationship between itemsets, when itemsetsX=x1x2 xnX = x_1x_2/ldots x_{\rm n},xiAC(x1)(2 i n)x_{\rm i}/in C(x_1) (2/le i/le n) established, it is calledXX iscoverage itemset

Is this actually a closure concept? Use a small number of items to replace the extended item set that has the same nature as it later. Next, we will continue to introduce other useful properties based on the concept of coverage . The proof process has been introduced in detail in the original text, and you can derive it manually.

  • Nature 1 : Order itemii andjj hasjAC(i)j/in C(i) relation, thenu(ij) u(i)+ g(i) p(j)>0u(ij)/ge u(i) +/mid g(i)/mid/times p(j)> 0 , where the itemsetijij is obtained by combining two items
  • Nature 2 : Order itemii andjj hasjAC(i)j/in C(i) relation, thenTWU(i)=TWU(ij)TWU(i) = TWU(ij)
  • Nature 3 : Let the item setX=x1x2 xnX = x_1x_2/ldots x_{\rm n},then u(X)= 2 i nu(x1xi) (n 1) u(x1)u(X) =/sum_{2/le i/le n}u(x_1x_{\rm i})-(n-1)/times u(x_1)
  • Property 4 : The properties of 3 , we can get a new way to calculate the utility value: as itemxx ,xix_i And itemsets YY hasxiAC(x)x_i/in C(x) ,Y C(x)Y/subset C(x) andxi Yx_i/not\in Y , availableu(xYxi)u(xYx_i) =u(xY)+u(xxi) u(x)u(xY) + u(xx_i)-u(x) , at the same time we knowu(xYxi)>u(xY)u(xYx_i)> u(xY) (refer toproperties of apushable Syndrome)
  • Nature 5 : According to the nature of 4 , it is easy to think ofminUtil u(xY)<u(xYxi)minUtil/le u(xY) <u(xYx_i) Is always established (can be explained by closures?), the normative expression in the paper is: set itemsjj , the item setXX andYY hasY C(j)Y/subseteq C(j) ,X Y= X/cap Y =/varnothing andj Xj/not\in X and other relations are always established, thenminUtil u(jX)<u(jXY)minUtil/le u(jX) <u(jXY) must hold


The estimated utility co-occurrence pruning strategy (EUCP)

After the completion of the EUCS construction according to the definition content , delete thoseTWU(xy)<minUtilTWU(xy) <minUtilAfter sorting the binary terms of , you can get aEUCSTstructure diagram, as shown in the following figure:

It can be clearly seen that the number of binary itemsets is much less than the original ones. Later, on the basis of this binary itemsets, higher-order high-efficiency itemsets will be mined. The core of this pruning strategy is to use the traditional TWUTWU property, but most of the previous algorithms do the first-order item set pruning, and this algorithm can successfully use this property on the second-order item set, which has great reference value for other algorithms

The efficient utility-list pruning strategy

According to the two definition formulas of itemset utility value calculation and itemset remaining utility value calculation, a new pruning strategy can be obtained, namely ulAUL(X)ul.iutil+ X iU(i)<minUtil\sum_{\rm ul/in UL(X)}ul.iutil +/sum_{X/succ i}U(i) <minUtil , the itemsetXXNeither nor its extended set can be a high-efficiency item set. The reason is that in a sorted transaction item (transaction), it is ranked in the item setXXElements after will becomeXX the extension items of , if the utility value of all the extension items added up is smaller than the threshold value, then the utility value of a single combination will naturally be smaller than the threshold value, so it can be cut directly (the proof processis discussed in detailinHUI-Miner)

The early abandoning additional strategy (EA)

  • LA property: Given two itemsets XX andYY when TiADU(X,Ti)+RU(X,Ti) TjAD,X Tj Y TjU(X,Tj)+RU(X,Tj)<minUtil\sum_{\rm/forall T_i/in D)U(X, T_{\rm i}) + RU(X, T_{\rm i})-\sum_{\rm/forall T_j/in D, X/subseteq T_{\rm j}/land Y/not\subseteq T_{\rm j)}U(X, T_{\rm j)) + RU(X, T_{\rm j)) <minUtil X X Y Y X Y HUIs\forall X/subseteq X^\prime/land Y/subseteq Y^\prime/Rightarrow X^\prime Y^\prime/not\in HUIs

EA LA property utility list

Transitive extension pruning strategy (TEP)

Transitive extension upper-bound utility

  • teu(X,y) u(Xy)teu(X, y)/ge u(Xy)
  • teu(X,y) ul UL(Xy)(ul.iutil+ul.rutil)teu(X, y)/ge/sum_{\rm ul/in UL(Xy)}(ul.iutil + ul.rutil)

teu(X,y)<minUtilteu(X, y) < minUtil XyXy utility list The efficient utility-list pruning strategy

Real item utilities threshold raising strategy (RIU)

utility utility hashmap

The co-occurrence threshold raising strategy (CUD)

EUCST 2-itemsets CUDM

The coverage threshold raising strategy (COV)

COV List

construct utility list


iConstruct algorithm

The above utility list construction algorithm is actually very inefficient. Each time you need to traverse (although it is a dichotomy) two lists, find itemsets with the same prefix and then join them together. This process is time-consuming and memory-intensive. kHMC designed a new construction method, using the EA strategy to avoid constructing those inefficient itemsets

kHMC algorithm

The pseudo code of the entire algorithm is as follows:


The kHMC algorithm uses a lot of threshold raising strategies, and only revolves around the utility-list structure. The first time I encountered this paper, I only scanned it roughly, and I didn't have a detailed understanding of the methods and features used. The author has also done a lot of comparative experiments to test the superiority of the kHMC algorithm from various aspects, and also mentioned at the end that the concept of coverage can actually be used in utility mining in other fields. I personally think that this algorithm still has a relatively large reference value and is worth studying. The only downside is that the relevant code could not be found on spmf ...