Association Mining in a nutshell

(Last updated: 17. April. 2020)

This blog post is a supplement for Data Mining instruction at Business Process Intelligence, RWTH-Aachen.

Concept

Association rule mining aims to discover interesting relations between variables (mostly sets of variables) in large databases. A typical example is that customers who purchase beer are also likely to buy diapers. It is important to distinguish frequent itemsets from association rules. In essence, a frequent itemset corresponds to a joint probability, e.g., $P(beer,diaper)$, while an association rule corresponds to a conditional probability, e.g., $P(diaper|beer)$. An association rule therefore captures the direction of the relation (what follows from what), which a frequent itemset alone does not.

In fact, frequent itemsets are part of the calculation of association rules. Frequent itemsets are informally defined as itemsets having high support: $support(X)= \frac{N_{X}}{N}$, where $N$ is the total number of instances and $N_X$ is the number of instances that contain every item in $X$. (You may understand it as the joint probability of the elements in $X$.)

Association rules are informally defined as relations between two itemsets having high confidence: $confidence(X \Rightarrow Y)= \frac{N_{X \cup Y}}{N_X}=\frac{support(X \cup Y)}{support(X)}$, where $N_X$ is the number of instances covering $X$ and $N_{X \cup Y}$ is the number of instances covering both $X$ and $Y$. (You may understand it as the conditional probability of $Y$ given $X$.)
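To make the two definitions concrete, here is a minimal Python sketch. The transaction list and the function names `support` and `confidence` are illustrative assumptions, not part of the course material; each transaction is modelled as a set of items.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset` (N_X / N)."""
    itemset = set(itemset)
    covered = sum(1 for t in transactions if itemset <= t)
    return covered / len(transactions)


def confidence(transactions, antecedent, consequent):
    """support(X u Y) / support(X); undefined if X never occurs."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Hypothetical toy data, made up for illustration only.
transactions = [
    {"beer", "diaper", "chips"},
    {"beer", "diaper"},
    {"beer", "milk"},
    {"milk", "bread"},
    {"diaper", "milk"},
]

print(support(transactions, {"beer", "diaper"}))       # joint: 2 of 5 transactions
print(confidence(transactions, {"beer"}, {"diaper"}))  # conditional: 2 of the 3 beer transactions
```

Note that confidence is not symmetric: swapping antecedent and consequent changes the denominator, which is exactly the difference between a joint and a conditional probability.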

An association rule is considered “good” if it has high support, confidence close to 1, and lift greater than 1.

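Lift compares the confidence of a rule with the support of its consequent: $lift(X \Rightarrow Y)= \frac{confidence(X \Rightarrow Y)}{support(Y)}=\frac{support(X \cup Y)}{support(X) \cdot support(Y)}$. A lift above 1 means that $X$ and $Y$ occur together more often than expected if they were independent, a lift of 1 means they are independent, and a lift below 1 means they occur together less often than expected.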

Association Rule Exercise

Given the example below, let’s evaluate the association rule, $Tea \Rightarrow Coffee$.

IMAGE

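It has support of $0.15$, confidence of $0.75$, and lift of $0.83$. Although the confidence looks high, the support is low and the lift is less than $1$, meaning that buying tea makes buying coffee slightly less likely than it is overall. We can therefore say that this rule is not a desirable one.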
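The three numbers can be reproduced with a short Python sketch. The counts below are an assumption: they are not read from the figure but back-derived so that they are consistent with the stated results (100 instances, 20 with tea, 90 with coffee, 15 with both). The helper `rule_metrics` is a hypothetical name, and lift is computed with the usual definition, confidence divided by the support of the consequent.

```python
def rule_metrics(n_total, n_x, n_y, n_xy):
    """Return (support, confidence, lift) of the rule X => Y from raw counts."""
    support_xy = n_xy / n_total          # N_{X u Y} / N
    confidence = n_xy / n_x              # N_{X u Y} / N_X
    lift = confidence / (n_y / n_total)  # confidence / support(Y)
    return support_xy, confidence, lift


# Assumed counts for Tea => Coffee (hypothetical, chosen to match the text).
support_xy, conf, lift = rule_metrics(n_total=100, n_x=20, n_y=90, n_xy=15)
print(round(support_xy, 2), round(conf, 2), round(lift, 2))  # 0.15 0.75 0.83
```

Under these assumed counts, 90% of all customers buy coffee but only 75% of tea buyers do, which is why the lift falls below 1 despite the seemingly high confidence.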
