R29_Finding Patterns: Market Basket Analysis Using Association Rules


Codezoy 2019. 11. 13. 11:57

# Learning association rules

# 1. Preparing the data

groceries <- read.csv(file = 'mlwr/groceries.csv')


> str(groceries)

'data.frame': 15295 obs. of 4 variables:

$ citrus.fruit : Factor w/ 167 levels "abrasive cleaner",..: 156 165 109 102 165 1 121 102 83 113 ...

$ semi.finished.bread: Factor w/ 162 levels "","abrasive cleaner",..: 161 1 161 160 14 1 1 153 1 1 ...

$ margarine : Factor w/ 164 levels "","abrasive cleaner",..: 34 1 39 35 163 1 1 120 1 1 ...

$ ready.soups : Factor w/ 162 levels "","abrasive cleaner",..: 1 1 87 83 116 1 1 12 1 1 ...

> head(groceries)
      citrus.fruit semi.finished.bread      margarine              ready.soups
1   tropical fruit              yogurt         coffee
2       whole milk
3        pip fruit              yogurt   cream cheese             meat spreads
4 other vegetables          whole milk condensed milk long life bakery product
5       whole milk              butter         yogurt                     rice
6 abrasive cleaner

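The file actually contains 9,835 transactions, but reading it as a data frame produces 15,295 observations.

read.csv() fixes the number of columns from the first few lines of the file (here 4, with the first receipt consumed as the header), so any receipt with more than 4 items spills over onto the following rows.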
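The wrapping can be reproduced on a small scale. The sketch below writes a hypothetical six-line file (the contents are made up for illustration) in which only the last line is longer than the first five, then reads it with read.csv().

```r
# Hypothetical toy file: the first five lines have at most 2 fields,
# the sixth line has 5.
toy_lines <- c("a,b", "c", "d,e", "f", "g,h", "i,j,k,l,m")
toy_file <- tempfile(fileext = ".csv")
writeLines(toy_lines, toy_file)

# read.csv() settles on 2 columns after looking at the first five lines,
# so the 5-field line is spread over several rows.
wrapped <- read.csv(toy_file, header = FALSE)
print(wrapped)
nrow(wrapped)  # more rows than the 6 lines in the file

# count.fields() reports how many items each line really has.
fields_per_line <- count.fields(toy_file, sep = ",")
print(fields_per_line)
```

Checking count.fields() before calling read.csv() is a cheap way to notice that a file has a varying number of fields per line.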

# Each row of the csv file lists the items purchased on one receipt.

# Receipts contain different numbers of items,

# so the number of columns is not constant.

# -> Solution: use a sparse matrix
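To make the idea concrete, here is a minimal base R sketch of the receipt-by-item layout that a sparse matrix represents. The three receipts are hypothetical, and the matrix here is an ordinary dense logical matrix; a real sparse format stores only the TRUE cells.

```r
# Hypothetical toy receipts, each with a different number of items.
receipts <- c("citrus fruit,margarine,ready soups",
              "whole milk",
              "whole milk,yogurt,butter,rice")

# Split every receipt into its items and collect the distinct item names.
baskets <- strsplit(receipts, ",", fixed = TRUE)
items <- sort(unique(unlist(baskets)))

# One row per receipt, one column per item; TRUE marks a purchased item.
item_matrix <- t(vapply(baskets,
                        function(b) items %in% b,
                        logical(length(items))))
colnames(item_matrix) <- items
print(item_matrix)

# Density: the share of cells that are TRUE.
toy_density <- mean(item_matrix)
print(toy_density)
```

With only a handful of items the matrix is fairly full, but with 169 items and a few items per receipt almost every cell is FALSE, which is why a sparse representation pays off.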

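# arules package: a package for association rules

install.packages('arules')

library(arules)

# Read the file again as transactions (a sparse matrix) instead of a data frame

groceries <- read.transactions(file = 'mlwr/groceries.csv', sep = ',')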

> summary(groceries)

transactions as itemMatrix in sparse format with

9835 rows (elements/itemsets/transactions) and

169 columns (items) and a density of 0.02609146

most frequent items:

      whole milk other vegetables       rolls/buns             soda           yogurt          (Other)
            2513             1903             1809             1715             1372            34055

element (itemset/transaction) length distribution: the number of items on each receipt, from receipts with 1 item up to a receipt with 32 items

sizes
   1    2    3    4    5    6    7    8    9   10   11   12   13   14   15
2159 1643 1299 1005  855  645  545  438  350  246  182  117   78   77   55
  16   17   18   19   20   21   22   23   24   26   27   28   29   32
  46   29   14   14    9   11    4    6    1    1    1    1    3    1

Summary of the number of items per receipt:

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  1.000   2.000   3.000   4.409   6.000  32.000

includes extended item information - examples:

labels

1 abrasive cleaner

2 artif. sweetener

3 baby cosmetics
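The numbers in this summary are consistent with each other, which can be checked with a little arithmetic. The sketch below uses only values copied from the output above; the variable names are chosen for illustration.

```r
# Values copied from the summary() output.
n_transactions <- 9835
n_items <- 169
density <- 0.02609146

# Density is the share of non-empty cells, so rows * columns * density
# gives the total number of items purchased across all receipts.
total_purchased <- n_transactions * n_items * density
print(round(total_purchased))

# Dividing by the number of receipts gives the mean basket size.
mean_basket <- total_purchased / n_transactions
print(round(mean_basket, 3))

# The "most frequent items" counts, including (Other), add up to the same total.
top_counts <- c(2513, 1903, 1809, 1715, 1372, 34055)
print(sum(top_counts))
```

The mean basket size obtained this way agrees with the Mean column of the length summary.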

inspect(groceries[1:5]) # show the items on the first 5 of the 9,835 receipts

items

[1] {citrus fruit,margarine,ready soups,semi-finished bread}

[2] {coffee,tropical fruit,yogurt}

[3] {whole milk}

[4] {cream cheese,meat spreads,pip fruit,yogurt}

[5] {condensed milk,long life bakery product,other vegetables,whole milk}

> # Relative frequency of each item across the receipts

> itemFrequency(groceries[,1:5])
abrasive cleaner artif. sweetener   baby cosmetics        baby food             bags
    0.0035587189     0.0032536858     0.0006100661     0.0001016777     0.0004067107
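Each relative frequency is simply the number of receipts containing the item divided by 9,835. As a quick check, multiplying the values above by the number of receipts recovers the raw counts. This is plain arithmetic on the printed output; arules can also return counts directly via itemFrequency(groceries, type = "absolute").

```r
n_transactions <- 9835

# Relative frequencies copied from the itemFrequency() output.
freq <- c("abrasive cleaner" = 0.0035587189,
          "artif. sweetener" = 0.0032536858,
          "baby cosmetics"   = 0.0006100661,
          "baby food"        = 0.0001016777,
          "bags"             = 0.0004067107)

# Convert proportions back to the number of receipts containing each item.
item_counts <- round(freq * n_transactions)
print(item_counts)
```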

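# Distribution of item frequencies

itemFrequencyPlot(groceries, support = 0.1)

# support: the proportion of receipts that contain the item; 0.1 keeps only items that appear on at least 10% of receipts

itemFrequencyPlot(groceries, topN = 20)

# topN: plot only the N most frequent items, in descending order of frequency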
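The two arguments select items in different ways, which the following base R sketch tries to mimic on a handful of items. The counts for the five most frequent items come from the summary output, and the count of 35 for abrasive cleaner is derived from its relative frequency; the filtering code is an illustration, not what itemFrequencyPlot() does internally.

```r
n_transactions <- 9835

# Counts taken from the summary output (abrasive cleaner derived from
# its relative frequency of 0.0035587189).
counts <- c("whole milk" = 2513, "other vegetables" = 1903,
            "rolls/buns" = 1809, "soda" = 1715,
            "yogurt" = 1372, "abrasive cleaner" = 35)
freq <- counts / n_transactions

# Like support = 0.1: keep items found on at least 10% of receipts.
frequent <- freq[freq >= 0.1]
print(round(frequent, 4))

# Like topN = 3: keep the 3 most frequent items, highest first.
top3 <- head(sort(freq, decreasing = TRUE), 3)
print(round(top3, 4))
```

A support threshold lets the number of bars vary with the data, while topN fixes the number of bars regardless of how frequent the items are.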

# Plot the sparse matrix

image(groceries[1:100])

Columns: item index; rows: transaction number (id)

# The plot helps spot outliers and trends in the data.

# 3. Training the model: the Apriori algorithm, a form of unsupervised learning

grocery_rules <- apriori(data = groceries)

Apriori

Parameter specification: # minlen, maxlen: the minimum/maximum number of items in an itemset

 confidence minval smax arem  aval originalSupport maxtime support minlen maxlen target   ext
        0.8    0.1    1 none FALSE            TRUE       5     0.1      1     10  rules FALSE

Algorithmic control:

 filter tree heap memopt load sort verbose
    0.1 TRUE TRUE  FALSE TRUE    2    TRUE

Absolute minimum support count : 983

set item appearances ...[0 item(s)] done [0.00s].

set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].

sorting and recoding items ... [8 item(s)] done [0.00s].

creating transaction tree ... done [0.00s].

checking subsets of size 1 2 done [0.00s].

writing ... [0 rule(s)] done [0.00s].

creating S4 object ... done [0.00s].

> summary(grocery_rules) # no rules were found!

set of 0 rules

# The default thresholds of apriori() are

# support = 0.1 (10%) and confidence = 0.8 (80%).

# These requirements are too strict for this data, so no rule satisfies them.

# Valid parameter names: confidence, minval, smax, arem, aval, originalSupport, maxtime, support, minlen, maxlen, target, ext

Support of an item that sells 10 times a day: 10 * 30 days / 9835 transactions ≈ 0.03
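The threshold arithmetic can be written out as below. The "Absolute minimum support count" lines in the apriori() logs (983 and 295) appear to be the support threshold times the number of transactions, truncated to a whole number; that reading is an inference from the printed values, not something the log states.

```r
n_transactions <- 9835

# An item bought 10 times a day over a 30-day month.
sales_per_day <- 10
days <- 30
target_support <- sales_per_day * days / n_transactions
print(round(target_support, 4))

# Support thresholds expressed as a number of receipts.
default_count <- floor(0.1 * n_transactions)   # default support = 0.1
relaxed_count <- floor(0.03 * n_transactions)  # support = 0.03
print(c(default_count, relaxed_count))
```

Put differently, the default setting requires an itemset to appear on nearly a thousand receipts, while the relaxed setting needs roughly three hundred.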

> grocery_rules2 <- apriori(data = groceries,

+ parameter = list(support = 0.03,

+ confidence = 0.25,

+ minlen = 2))

Apriori

Parameter specification:

 confidence minval smax arem  aval originalSupport maxtime support minlen maxlen target   ext
       0.25    0.1    1 none FALSE            TRUE       5    0.03      2     10  rules FALSE

Algorithmic control:

 filter tree heap memopt load sort verbose
    0.1 TRUE TRUE  FALSE TRUE    2    TRUE

Absolute minimum support count: 295

set item appearances ...[0 item(s)] done [0.00s].

set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].

sorting and recoding items ... [44 item(s)] done [0.00s].

creating transaction tree ... done [0.00s].

checking subsets of size 1 2 3 done [0.00s].

writing ... [15 rule(s)] done [0.00s].

creating S4 object ... done [0.00s].

> summary(grocery_rules2)

set of 15 rules

rule length distribution (lhs + rhs): sizes # all 15 generated rules have length 2 (lhs + rhs = 2)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
      2       2       2       2       2       2

summary of quality measures:

    support          confidence          lift           count
 Min.   :0.03010   Min.   :0.2929   Min.   :1.205   Min.   :296.0
 1st Qu.:0.03274   1st Qu.:0.3185   1st Qu.:1.488   1st Qu.:322.0
 Median :0.04230   Median :0.3737   Median :1.572   Median :416.0
 Mean   :0.04475   Mean   :0.3704   Mean   :1.598   Mean   :440.1
 3rd Qu.:0.05247   3rd Qu.:0.4024   3rd Qu.:1.758   3rd Qu.:516.0
 Max.   :0.07483   Max.   :0.4496   Max.   :2.247   Max.   :736.0

mining info:

      data ntransactions support confidence
 groceries          9835    0.03       0.25

> inspect(grocery_rules2)
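Printing every rule is manageable with 15 rules, but it is usually easier to rank or filter them first. The sketch below is one way to do that with arules. It assumes that the Groceries data set bundled with the package can stand in for the groceries.csv file used in this post (both have 9,835 transactions and 169 items); the object names are made up for the example.

```r
library(arules)

# Assumption: the Groceries data set shipped with arules is used here
# in place of the csv file, so the example runs on its own.
data("Groceries")

rules_sketch <- apriori(Groceries,
                        parameter = list(support = 0.03,
                                         confidence = 0.25,
                                         minlen = 2),
                        control = list(verbose = FALSE))

# Rank the rules by lift and show the strongest few.
top_by_lift <- sort(rules_sketch, by = "lift")
inspect(head(top_by_lift, n = 5))

# Keep only the rules with whole milk on the right-hand side.
milk_rules <- subset(rules_sketch, subset = rhs %in% "whole milk")
inspect(milk_rules)
```

Sorting by "confidence" or "support" works the same way, and subset() also accepts conditions on the quality measures, for example lift > 1.5.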

# lift(x->y) = confidence(x->y) / support(y)
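To see how the three measures relate, the formula can be worked through by hand on a tiny example. The five baskets below are hypothetical, and support_of() is a helper written for this sketch rather than an arules function.

```r
# Hypothetical toy baskets.
baskets <- list(c("milk", "bread"),
                c("milk", "yogurt"),
                c("bread", "butter"),
                c("milk", "bread", "yogurt"),
                c("yogurt"))

# Support: the share of baskets that contain all of the given items.
support_of <- function(items) {
  mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

# Rule {milk} -> {yogurt}
support_xy <- support_of(c("milk", "yogurt"))
confidence_xy <- support_xy / support_of("milk")
lift_xy <- confidence_xy / support_of("yogurt")

print(c(support = support_xy, confidence = confidence_xy, lift = lift_xy))
```

A lift above 1 means yogurt shows up in baskets with milk more often than its overall frequency would suggest; a lift close to 1 means the two items look independent.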
