# Association Rule Learning
# 1. Data preparation
groceries <- read.csv(file = 'mlwr/groceries.csv')
> str(groceries)
'data.frame': 15295 obs. of 4 variables:
$ citrus.fruit : Factor w/ 167 levels "abrasive cleaner",..: 156 165 109 102 165 1 121 102 83 113 ...
$ semi.finished.bread: Factor w/ 162 levels "","abrasive cleaner",..: 161 1 161 160 14 1 1 153 1 1 ...
$ margarine : Factor w/ 164 levels "","abrasive cleaner",..: 34 1 39 35 163 1 1 120 1 1 ...
$ ready.soups : Factor w/ 162 levels "","abrasive cleaner",..: 1 1 87 83 116 1 1 12 1 1 ...
> head(groceries)
      citrus.fruit semi.finished.bread      margarine              ready.soups
1   tropical fruit              yogurt         coffee
2       whole milk
3        pip fruit              yogurt   cream cheese             meat spreads
4 other vegetables          whole milk condensed milk long life bakery product
5       whole milk              butter         yogurt                     rice
6 abrasive cleaner
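The file actually holds 9835 transactions, but reading it as a data frame yields 15295 observations.
read.csv() settled on 4 columns, so any receipt with more than 4 items has its extra items pushed onto the next row. The first receipt was also consumed as the column names.
# Each row of the CSV file lists the items purchased on one receipt.
# The number of items differs from receipt to receipt,
# so the number of columns is not constant.
# -> Solution: use a sparse matrix.
# arules package: a package for mining association rules.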
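The notes go from the data-frame problem straight to a summary of a transactions object, so the loading step itself is not shown. A minimal sketch of that step, assuming arules is installed and using its read.transactions() on a small hypothetical file (the three receipts below are made up for illustration and are not taken from groceries.csv):

```r
library(arules)  # assumed to be installed

# Hypothetical three-receipt file standing in for groceries.csv.
tmp <- tempfile(fileext = ".csv")
writeLines(c("citrus fruit,margarine",
             "whole milk",
             "whole milk,butter,yogurt,rice,abrasive cleaner"), tmp)

# Each line becomes one transaction; sep = "," splits the items.
toy <- read.transactions(tmp, sep = ",")

n_receipts <- length(toy)  # number of transactions
n_items    <- ncol(toy)    # number of distinct items (matrix columns)
sizes      <- size(toy)    # number of items on each receipt

inspect(toy)
```

Unlike the data frame, the five-item receipt stays a single transaction. For the real data the analogous call would presumably be read.transactions('mlwr/groceries.csv', sep = ','), which is consistent with the 9835 transactions reported in the summary.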
> summary(groceries)
transactions as itemMatrix in sparse format with
9835 rows (elements/itemsets/transactions) and
169 columns (items) and a density of 0.02609146
most frequent items:
      whole milk other vegetables       rolls/buns             soda           yogurt          (Other)
            2513             1903             1809             1715             1372            34055
element (itemset/transaction) length distribution: # number of items per receipt, e.g. from receipts with 1 item up to receipts with 32 items
   1    2    3    4    5    6    7    8    9   10   11   12   13   14   15
2159 1643 1299 1005  855  645  545  438  350  246  182  117   78   77   55
  16   17   18   19   20   21   22   23   24   26   27   28   29   32
  46   29   14   14    9   11    4    6    1    1    1    1    3    1
Summary of the number of items per receipt
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  1.000   2.000   3.000   4.409   6.000  32.000
includes extended item information - examples:
1 abrasive cleaner
2 artif. sweetener
3 baby cosmetics
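The density figure ties the other numbers in this summary together: it is the fraction of non-empty cells in the 9835 x 169 matrix. A quick check in base R, using only values printed in the summary (the variable names are mine):

```r
n_tx    <- 9835        # rows (transactions)
n_items <- 169         # columns (items)
density <- 0.02609146  # share of matrix cells that are non-empty

# Total number of item purchases recorded in the matrix.
total_purchases <- n_tx * n_items * density
print(round(total_purchases))

# The same total, recovered from the "most frequent items" counts.
total_from_counts <- sum(c(2513, 1903, 1809, 1715, 1372, 34055))
print(total_from_counts)

# Average basket size; compare with Mean = 4.409 in the summary.
mean_basket <- total_purchases / n_tx
print(mean_basket)
```

Both totals agree, and dividing by the number of transactions gives back the mean basket size, so the three parts of the summary are mutually consistent.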
inspect(groceries[1:5]) # show the items on the first 5 receipts (valid indices run from 1 to 9835)
[1] {citrus fruit,margarine,ready soups,semi-finished bread}
[2] {coffee,tropical fruit,yogurt}
[3] {whole milk}
[4] {cream cheese,meat spreads,pip fruit,yogurt}
[5] {condensed milk,long life bakery product,other vegetables,whole milk}
> # Proportion of receipts in which each item appears (item frequency)
> itemFrequency(groceries[, 1:5])
abrasive cleaner artif. sweetener   baby cosmetics        baby food             bags
    0.0035587189     0.0032536858     0.0006100661     0.0001016777     0.0004067107
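What itemFrequency() reports can be reproduced by hand on a small example. The sketch below uses base R only and a made-up set of four receipts (the item names and data are hypothetical): it builds the receipt-by-item incidence matrix that a sparse transactions object represents, then takes column means.

```r
# Hypothetical receipts, one character vector per receipt.
receipts <- list(c("milk", "bread"),
                 c("milk"),
                 c("bread", "butter", "milk"),
                 c("soda"))

# All distinct items, in alphabetical order.
items <- sort(unique(unlist(receipts)))

# Rows = receipts, columns = items; TRUE where the receipt contains the item.
incidence <- t(sapply(receipts, function(r) items %in% r))
colnames(incidence) <- items

# Item frequency = share of receipts containing each item.
freq <- colMeans(incidence)
print(freq)
```

Read the same way, the 0.0035587189 printed for abrasive cleaner corresponds to 35 of the 9835 receipts.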
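# Distribution of item frequencies
itemFrequencyPlot(groceries, support = 0.1)
# support: the proportion of receipts that contain an item; support = 0.1 plots only items that appear in at least 10% of receipts
itemFrequencyPlot(groceries, topN = 20)
# topN: plot only the N most frequent items, in decreasing order of frequency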
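The two filters can be imitated in base R. This is only a sketch of the selection logic, not of how itemFrequencyPlot() is implemented; it uses the five item counts printed by summary(groceries), and the variable names are mine.

```r
n_tx <- 9835

# Counts of the five most frequent items, as printed by summary(groceries).
counts <- c("whole milk" = 2513, "other vegetables" = 1903,
            "rolls/buns" = 1809, "soda" = 1715, "yogurt" = 1372)

# Support = proportion of receipts containing the item.
support <- counts / n_tx

# Equivalent of support = 0.1: keep items found in at least 10% of receipts.
kept <- support[support >= 0.1]

# Equivalent of topN = 3: the three most frequent items, highest first.
top3 <- head(sort(support, decreasing = TRUE), 3)

print(round(kept, 4))
print(names(top3))
```

All five of these items clear the 10% threshold, so each of them would appear in the support = 0.1 plot.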
# Plot the sparse matrix
image(groceries[1:100])
Columns: item index; rows: transaction index (id)
# Useful for spotting outliers or patterns in the data.
# 3. Model training - the Apriori algorithm, a type of unsupervised learning
grocery_rules <- apriori(data = groceries)
Parameter specification: # minlen, maxlen: minimum/maximum number of items in an itemset
 confidence minval smax arem  aval originalSupport maxtime support minlen maxlen target   ext
        0.8    0.1    1 none FALSE            TRUE       5     0.1      1     10  rules FALSE
Algorithmic control:
filter tree heap memopt load sort verbose
Absolute minimum support count: 983
set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
sorting and recoding items ... [8 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2 done [0.00s].
writing ... [0 rule(s)] done [0.00s].
creating S4 object ... done [0.00s].
> summary(grocery_rules) # no rules were found!
set of 0 rules
# The default thresholds of apriori() are
# support = 0.1 (10%) and confidence = 0.8 (80%).
# These are too strict for this data, so no rule satisfies them.
# Valid parameter names: confidence, minval, smax, arem, aval, originalSupport, maxtime, support, minlen, maxlen, target, ext
Support of an item sold 10 times a day over 30 days: 10 * 30 / 9835 transactions = 0.0305, roughly 0.03
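The arithmetic behind the threshold, and its link to the "Absolute minimum support count" lines in the apriori() output, can be checked in base R. Rounding down with floor() is my assumption; it is inferred from the printed counts (983 and 295), not taken from the arules source.

```r
n_tx <- 9835       # transactions in the data
daily_sales <- 10  # an item sold 10 times a day ...
days <- 30         # ... over the 30 days of data

# Support of such an item: 300 purchases out of 9835 receipts.
support_threshold <- daily_sales * days / n_tx
print(round(support_threshold, 4))

# Minimum number of receipts an itemset must appear in (assumed: rounded down).
count_default <- floor(0.1 * n_tx)   # default support = 0.1
count_custom  <- floor(0.03 * n_tx)  # support = 0.03
print(c(count_default, count_custom))
```

With the default threshold an itemset has to occur on 983 receipts, which only 8 of the 169 items manage on their own; lowering support to 0.03 drops the bar to 295 receipts.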
> grocery_rules2 <- apriori(data = groceries,
+ parameter = list(support = 0.03,
+ confidence = 0.25,
+ minlen = 2))
Parameter specification:
 confidence minval smax arem  aval originalSupport maxtime support minlen maxlen target   ext
       0.25    0.1    1 none FALSE            TRUE       5    0.03      2     10  rules FALSE
Algorithmic control:
filter tree heap memopt load sort verbose
Absolute minimum support count: 295
set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
sorting and recoding items ... [44 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2 3 done [0.00s].
writing ... [15 rule(s)] done [0.00s].
creating S4 object ... done [0.00s].
> summary(grocery_rules2)
set of 15 rules
rule length distribution (lhs + rhs): sizes # all 15 rules have length 2 (lhs + rhs = 2)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
      2       2       2       2       2       2
summary of quality measures:
    support          confidence          lift           count
 Min.   :0.03010   Min.   :0.2929   Min.   :1.205   Min.   :296.0
 1st Qu.:0.03274   1st Qu.:0.3185   1st Qu.:1.488   1st Qu.:322.0
 Median :0.04230   Median :0.3737   Median :1.572   Median :416.0
 Mean   :0.04475   Mean   :0.3704   Mean   :1.598   Mean   :440.1
 3rd Qu.:0.05247   3rd Qu.:0.4024   3rd Qu.:1.758   3rd Qu.:516.0
 Max.   :0.07483   Max.   :0.4496   Max.   :2.247   Max.   :736.0
mining info:
      data ntransactions support confidence
 groceries          9835    0.03       0.25
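The quality measures in this summary can also be used to rank the rules. A hedged sketch, assuming arules is installed and that its bundled Groceries dataset (9835 transactions, 169 items) matches the groceries.csv used in these notes; the names rules and top are mine.

```r
library(arules)  # assumed to be installed

# Bundled dataset, used here in place of mlwr/groceries.csv (assumption).
data("Groceries", package = "arules")

# Same thresholds as grocery_rules2; verbose = FALSE hides the progress log.
rules <- apriori(Groceries,
                 parameter = list(support = 0.03,
                                  confidence = 0.25,
                                  minlen = 2),
                 control = list(verbose = FALSE))

# Order the rules from highest to lowest lift and show the strongest three.
top <- sort(rules, by = "lift")
inspect(head(top, 3))
```

Sorting by "support" or "confidence" works the same way; lift is chosen here because it measures how much more often the two sides occur together than they would if they were independent.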
> inspect(grocery_rules2)
# lift(x->y) = confidence(x->y) / support(y)
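A worked example of the lift formula in base R, on five made-up receipts (the data and the helper has() are hypothetical and only serve to show the arithmetic for the rule {bread} -> {milk}):

```r
# Hypothetical receipts.
receipts <- list(c("milk", "bread"),
                 c("milk", "bread", "butter"),
                 c("milk"),
                 c("bread"),
                 c("soda"))

# TRUE for each receipt that contains the given item.
has <- function(item) sapply(receipts, function(r) item %in% r)

support_x  <- mean(has("bread"))                # support(x)
support_y  <- mean(has("milk"))                 # support(y)
support_xy <- mean(has("bread") & has("milk"))  # support(x and y together)

# confidence(x -> y) = support(x, y) / support(x)
confidence <- support_xy / support_x

# lift(x -> y) = confidence(x -> y) / support(y)
lift <- confidence / support_y

print(c(support = support_xy, confidence = confidence, lift = lift))
```

A lift above 1 means that receipts containing x include y more often than receipts in general; a lift of exactly 1 means x tells us nothing about y.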