Written in front: The algorithm is a top-k mining algorithm designed based on the utility-list structure of HUI-Miner. Many of the threshold self-increment strategies adopted are worthy of our study and research. The purpose of writing this note is also to introduce these strategies And personal thoughts

# An efficient algorithm for mining the top-k high utility itemsets, using novel threshold raising and pruning strategies

## Sample

**External utility of items**

**Example database**

## definition

For most of the content, please refer to the HUI-Miner algorithm notes, here are only some additional supplements

### Estimated Utility Co-occurrence Structure (EUCS)

The paper defines **EUCS** as a tuple ($\{(a;/b;/TWU(ab))$ ,$a,b/in I^*$ ,$TWU(ab)/in R^+\}$), where$a$ and$b$ are all items (not repeated by default)$R^+$ Explain that the algorithm does not consider the situation of negative utility. In order to facilitate understanding, the author drew several matrix diagrams to illustrate (Ps. actually uses hashmap to store this information, not a matrix structure). The specific construction process is as follows:

### Utility list construction

- Itemset utility value: In the utility list that has been constructed, the formula for finding the utility value of an itemset is $\sum_{\rm ul/in UL(X)}ul.iutil$ , where$UL(X)$ refers to the utility list of item set X (refer toHUI-Miner forhow to construct it)

### Transitive extension upper-bound utility

Given any set of items $X$ and its expandable items$x$ , the new boundary value calculation formula is$teu(X, y)$ =$\sum_{\rm ul/in UL(X)}ul.iutil$ +$\sum_{\rm ul/in UL(y)}ul.iutil$ +$u(y)$

### Coverage

When two items $i$ and$j$ has a relationship$g(i)/subseteq g(j)$ , we call the term$j$ contains (**cover**) items$i$ , you can deduce the relevant items$i$ 's coverage$C(i) =/{g(i)/subseteq g(x)/land x/in I\}$ , where$g(x)$ means to include items$x$The serial number (**TID**)of all transaction items of$x$ , namely$g(x) =/{tid/mid x/subseteq T_{\rm tid}\}$

Further expansion, there is also a **coverage** relationship between itemsets, when itemsets$X = x_1x_2/ldots x_{\rm n}$,$x_{\rm i}/in C(x_1) (2/le i/le n)$ established, it is called$X$ is**coverage itemset**

Is this actually a closure concept? Use a small number of items to replace the extended item set that has the same nature as it later. Next, we will continue to introduce other useful properties based on the concept of **coverage** . The proof process has been introduced in detail in the original text, and you can derive it manually.

**Nature 1**: Order item$i$ and$j$ has$j/in C(i)$ relation, then$u(ij)/ge u(i) +/mid g(i)/mid/times p(j)> 0$ , where the itemset$ij$ is obtained by combining two items**Nature 2**: Order item$i$ and$j$ has$j/in C(i)$ relation, then$TWU(i) = TWU(ij)$**Nature 3**: Let the item set$X = x_1x_2/ldots x_{\rm n}$,then $u(X) =/sum_{2/le i/le n}u(x_1x_{\rm i})-(n-1)/times u(x_1)$**Property 4**: The**properties of 3**, we can get a new way to calculate the utility value: as item$x$ ,$x_i$ And itemsets $Y$ has$x_i/in C(x)$ ,$Y/subset C(x)$ and$x_i/not\in Y$ , available$u(xYx_i)$ =$u(xY) + u(xx_i)-u(x)$ , at the same time we know$u(xYx_i)> u(xY)$ (refer to**properties of a**pushable Syndrome)**Nature 5**: According to**the nature of 4**, it is easy to think of$minUtil/le u(xY) <u(xYx_i)$ Is always established (can be explained by closures?), the normative expression in the paper is: set items$j$ , the item set$X$ and$Y$ has$Y/subseteq C(j)$ ,$X/cap Y =/varnothing$ and$j/not\in X$ and other relations are always established, then$minUtil/le u(jX) <u(jXY)$ must hold

## Strategy

### The estimated utility co-occurrence pruning strategy (EUCP)

After the completion of the **EUCS** construction according to the definition content , delete those$TWU(xy) <minUtil$After sorting the binary terms of$m$$i$$n$$U$$t$$i$$l$ , you can get a**EUCST**structure diagram, as shown in the following figure:

It can be clearly seen that the number of binary itemsets is much less than the original ones. Later, on the basis of this binary itemsets, higher-order high-efficiency itemsets will be mined. The core of this pruning strategy is to use the traditional $TWU$ property, but most of the previous algorithms do the first-order item set pruning, and this algorithm can successfully use this property on the second-order item set, which has great reference value for other algorithms

### The efficient utility-list pruning strategy

According to the two definition formulas of itemset utility value calculation and itemset remaining utility value calculation, a new pruning strategy can be obtained, namely $\sum_{\rm ul/in UL(X)}ul.iutil +/sum_{X/succ i}U(i) <minUtil$ , the itemset$X$Neither$X$ nor its extended set can be a high-efficiency item set. The reason is that in a sorted transaction item (transaction), it is ranked in the item set$X$Elements after$X$ will become$X$ the extension items of$X$ , if the utility value of all the extension items added up is smaller than the threshold value, then the utility value of a single combination will naturally be smaller than the threshold value, so it can be cut directly (the proof processis discussed in detailinHUI-Miner)

### The early abandoning additional strategy (EA)

- LA property: Given two itemsets $X$ and$Y$ when$\sum_{\rm/forall T_i/in D)U(X, T_{\rm i}) + RU(X, T_{\rm i})-\sum_{\rm/forall T_j/in D, X/subseteq T_{\rm j}/land Y/not\subseteq T_{\rm j)}U(X, T_{\rm j)) + RU(X, T_{\rm j)) <minUtil$ $\forall X/subseteq X^\prime/land Y/subseteq Y^\prime/Rightarrow X^\prime Y^\prime/not\in HUIs$

**EA** LA property utility list

### Transitive extension pruning strategy (TEP)

**Transitive extension upper-bound utility**

- $teu(X, y)/ge u(Xy)$
- $teu(X, y)/ge/sum_{\rm ul/in UL(Xy)}(ul.iutil + ul.rutil)$

$teu(X, y) < minUtil$ $Xy$ utility list **The efficient utility-list pruning strategy**

### Real item utilities threshold raising strategy (RIU)

utility utility hashmap

### The co-occurrence threshold raising strategy (CUD)

EUCST 2-itemsets CUDM

### The coverage threshold raising strategy (COV)

COV List

### construct utility list

### iConstruct algorithm

The above utility list construction algorithm is actually very inefficient. Each time you need to traverse (although it is a dichotomy) two lists, find itemsets with the same prefix and then join them together. This process is time-consuming and memory-intensive. kHMC designed a new construction method, using the EA strategy to avoid constructing those inefficient itemsets

### kHMC algorithm

The pseudo code of the entire algorithm is as follows:

# summary

**The kHMC** algorithm uses a lot of threshold raising strategies, and only revolves around the utility-list structure. The first time I encountered this paper, I only scanned it roughly, and I didn't have a detailed understanding of the methods and features used. The author has also done a lot of comparative experiments to test the superiority of the kHMC algorithm from various aspects, and also mentioned at the end that the concept of coverage can actually be used in utility mining in other fields. I personally think that this algorithm still has a relatively large reference value and is worth studying. The only downside is that the relevant code could not be found on spmf ...