Support Vector Machine: SVM
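○ An SVM creates a boundary, called a hyperplane, that divides the space so that the partitions on either side are highly homogeneous.
○ Maximum Margin Hyperplane (MMH): the hyperplane that separates the two classes in the space by the widest margin.
○ The goal of an SVM is to find the MMH.
○ Support vectors: the points in each class that lie closest to the MMH.
○ Once the support vectors are found, the MMH can be defined.
○ Kernel trick: a method that separates the data by adding new dimensions to it.
○ With the kernel trick, a nonlinear relationship can appear linear.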
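The idea of adding a dimension can be illustrated with a tiny base-R sketch. The data, the feature map z = x^2, and the threshold are all made up for illustration. A real kernel evaluates similarities in the higher-dimensional space implicitly rather than building the new column by hand.

```r
# Toy 1-D data (made up for illustration): class 1 lies outside [-1, 1].
x <- c(-3, -2, -0.5, 0, 0.5, 2, 3)
y <- ifelse(abs(x) > 1, 1, -1)

# No single threshold on x separates the classes, because class 1 sits on both sides.
# Add a new dimension z = x^2 (an explicit feature map, chosen by hand here).
z <- x^2

# In the new dimension a single cut-off separates the two classes.
threshold <- 1
predicted <- ifelse(z > threshold, 1, -1)
all(predicted == y)
```

The nonlinear rule "far from zero" in x becomes the linear rule "z above a cut-off" once the squared term is available.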
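# Classification with an SVM (Support Vector Machine)
# 1. Prepare the data
# stringsAsFactors = TRUE keeps letter as a factor (no longer the default since R 4.0)
letters <- read.csv('mlwr/letterdata.csv', stringsAsFactors = TRUE)
# 2. Inspect and preprocess the data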
str(letters)
table(letters$letter)
# Training set (80%) / test set (20%)
letters_train <- letters[1:16000,]
letters_test <- letters[16001:20000,]
table(letters_train$letter)
table(letters_test$letter)
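Taking the first 16,000 rows as the training set only works if the rows are already in random order. When that is not known, a random split is safer. The sketch below uses a hypothetical stand-in data frame `df` and an arbitrary seed; neither comes from the original script.

```r
set.seed(123)  # arbitrary seed, only for reproducibility

# Hypothetical stand-in for the real data: 100 rows with a label column.
df <- data.frame(label = factor(sample(LETTERS[1:4], 100, replace = TRUE)),
                 value = rnorm(100))

# Draw 80% of the row indices at random for training; the rest form the test set.
train_idx <- sample(nrow(df), size = round(0.8 * nrow(df)))
df_train <- df[train_idx, ]
df_test  <- df[-train_idx, ]

nrow(df_train)
nrow(df_test)
```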
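# 3. Build the model - SVM
# kernlab package: provides ksvm()
install.packages('kernlab')  # one-time install
library(kernlab)
# detach("package:kernlab")  # unloads the package; run only after you are done with ksvm()
search()  # list attached packages to confirm kernlab is loaded
# Train the SVM model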
letter_classifier <- ksvm(letter ~.,
data = letters_train,
kernel = 'vanilladot')
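# kernel accepts, for example: 'vanilladot' (linear), 'rbfdot' (Gaussian RBF),
# 'polydot' (polynomial), 'tanhdot' (hyperbolic tangent).
# 4. Evaluate the model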
letters_predict <- predict(letter_classifier, letters_test)
head(letters_predict)
table(letters_predict, letters_test$letter)  # table(rows, columns): predicted vs. actual
letters_predict A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
A 144 0 0 0 0 0 0 0 0 1 0 0 1 2 2 0 5 0 1 1 1 0 1 0 0 1
B 0 121 0 5 2 0 1 2 0 0 1 0 1 0 0 2 2 3 5 0 0 2 0 1 0 0
C 0 0 120 0 4 0 10 2 2 0 1 3 0 0 2 0 0 0 0 0 0 0 0 0 0 0
D 2 2 0 156 0 1 3 10 4 3 4 3 0 5 5 3 1 4 0 0 0 0 0 3 3 1
E 0 0 5 0 127 3 1 1 0 0 3 4 0 0 0 0 2 0 10 0 0 0 0 2 0 3
F 0 0 0 0 0 138 2 2 6 0 0 0 0 0 0 16 0 0 3 0 0 1 0 1 2 0
G 1 1 2 1 9 2 123 2 0 0 1 2 1 0 1 2 8 2 4 3 0 0 0 1 0 0
H 0 0 0 1 0 1 0 102 0 2 3 2 3 4 20 0 2 3 0 3 0 2 0 0 1 0
I 0 1 0 0 0 1 0 0 141 8 0 0 0 0 0 1 0 0 3 0 0 0 0 5 1 1
J 0 1 0 0 0 1 0 2 5 128 0 0 0 0 1 1 3 0 2 0 0 0 0 1 0 6
K 1 1 9 0 0 0 2 5 0 0 118 0 0 2 0 1 0 7 0 1 3 0 0 5 0 0
L 0 0 0 0 2 0 1 1 0 0 0 133 0 0 0 0 1 0 5 0 0 0 0 0 0 1
M 0 0 1 1 0 0 1 1 0 0 0 0 135 4 0 0 0 0 0 0 3 0 8 0 0 0
N 0 0 0 0 0 1 0 1 0 0 0 0 0 145 0 0 0 3 0 0 1 0 2 0 0 0
O 1 0 2 1 0 0 1 2 0 1 0 0 0 1 99 3 3 0 0 0 3 0 0 0 0 0
P 0 0 0 1 0 2 1 0 0 0 0 0 0 0 2 130 0 0 0 0 0 0 0 0 1 0
Q 0 0 0 0 0 0 8 2 0 0 0 3 0 0 3 1 124 0 5 0 0 0 0 0 2 0
R 0 7 0 0 1 0 3 8 0 0 13 0 0 1 1 1 0 138 0 1 0 1 0 0 0 0
S 1 1 0 0 1 0 3 0 1 1 0 1 0 0 0 0 14 0 101 3 0 0 0 2 0 10
T 0 0 0 0 3 2 0 0 0 0 1 0 0 0 0 0 0 0 3 133 1 0 0 0 2 2
U 1 0 3 1 0 0 0 2 0 0 0 0 0 0 1 0 0 0 0 0 152 0 0 1 1 0
V 0 0 0 0 0 1 3 4 0 0 0 0 1 2 1 0 3 1 0 0 0 126 1 0 4 0
W 0 0 0 0 0 0 1 0 0 0 0 0 2 0 0 0 0 0 0 0 4 4 127 0 0 0
X 0 1 0 0 2 0 0 1 3 0 1 6 0 0 1 0 0 0 1 0 0 0 0 137 1 1
Y 3 0 0 0 0 0 0 1 0 0 0 0 0 0 0 7 0 0 0 3 0 0 0 0 127 0
Z 2 0 0 0 1 0 0 0 3 4 0 0 0 0 0 0 0 0 18 3 0 0 0 0 0 132
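In a table like this, the diagonal counts the correct predictions and the off-diagonal cells show which letters get confused with each other. As a minimal sketch, the toy vectors `actual` and `pred` below are made up to stand in for the real test labels and predictions. They show how overall accuracy and per-class recall can be read off such a table.

```r
# Hypothetical toy labels standing in for letters_test$letter and letters_predict.
actual <- factor(c("A", "A", "B", "B", "B", "C"), levels = c("A", "B", "C"))
pred   <- factor(c("A", "B", "B", "B", "C", "C"), levels = c("A", "B", "C"))

confusion <- table(pred, actual)  # rows = predicted, columns = actual

# Overall accuracy: correct predictions (diagonal) over all predictions.
accuracy <- sum(diag(confusion)) / sum(confusion)

# Per-class recall: diagonal divided by the column totals (actual counts).
recall <- diag(confusion) / colSums(confusion)

accuracy
recall
```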
> correct <- ifelse(letters_predict == letters_test$letter, 1, 0)
> correct
[1] 1 1 1 0 1 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1 1 0 1 1 0 1 1 1 1 1 1 1 1 1 1 0 1 0 1 0 1 1 1 1 0 1 1 0 1 1 1 1 1 1 1 1
[61] 1 0 0 1 1 0 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 0 1 1 1 1 0 0 1 1 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 0
[121] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 0 1 1 1 0 1 1 1 1 1 1 1 0
[181] 1 0 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 0 1 1 0 1 1 1 1 1 1 0 1 1 1 1
[241] 1 1 1 1 1 0 1 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 0 0 1 1 1 1 0 1 1 1 1 1 1 1 0 1 1 1 1 1 1 0 1 1 0 1 0 0 0 0 1 1 0
[301] 1 1 0 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 0 1 1 1 0 1 1 0 1 1 1 1 0 1 1 1 0 1 1 0 1 1 1 1 1 0 1 1 0 1 1 1 1 1 0 1 1 1 0 0 1 1
[361] 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 0 0 0 1 0 0 1 1 1 1 1 1 1 0 1 0 1 1 1 1 1 1 0 1 0 1 0 1 1 0 0 0 1 1 1 1 1 1 0 1 1
[421] 1 1 1 0 1 1 0 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 0 1 0 1 0 1 1 1 1 0 1 1 1 1 1 1 0 0 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1
[481] 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 0 1 1 0 0 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 0 1 1 1
[541] 1 1 1 1 1 1 1 0 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 0 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 0 1 1 1 1 1 1 1
[601] 1 1 1 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 0
[661] 1 1 1 1 0 1 1 1 0 1 0 1 1 1 1 1 1 1 1 0 1 1 1 1 1 0 1 1 1 1 1 0 0 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1
[721] 1 1 1 0 1 1 0 1 1 0 1 1 1 1 0 1 1 1 1 0 0 1 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1 1 0 1 1 1 1 0 1 1 1 1 1
[781] 1 1 1 1 1 1 0 1 1 1 1 0 0 1 1 1 0 1 1 1 1 0 1 1 1 1 0 0 1 1 0 1 1 1 0 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 0 0
[841] 1 1 1 1 1 1 1 1 1 0 1 1 1 1 0 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 0 1 1 1 1
[901] 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 0 1 1 0 0 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1
[961] 1 0 0 0 0 1 1 1 1 1 1 0 1 0 1 1 1 1 0 1 0 1 1 0 0 1 1 1 0 1 1 1 1 1 1 1 1 0 1 1
[ reached getOption("max.print") -- omitted 3000 entries ]
> correct_count<-sum(correct)
> correct_count  # number of letters the SVM model classified correctly
[1] 3357
> correct_ratio <- correct_count / 4000
> correct_ratio
[1] 0.83925
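The same ratio can be computed without hard-coding the test-set size, because the mean of a logical vector is the proportion of TRUE values. A minimal sketch with made-up vectors `actual` and `pred` (hypothetical stand-ins for the real labels and predictions):

```r
# Hypothetical toy vectors standing in for letters_test$letter and letters_predict.
actual <- c("A", "B", "C", "D")
pred   <- c("A", "B", "C", "A")

agreement <- pred == actual       # TRUE where the prediction matches
correct_ratio <- mean(agreement)  # proportion of TRUE values
table(agreement)                  # counts of FALSE and TRUE
correct_ratio
```

This avoids both printing the full 0/1 vector and dividing by a literal row count.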
# Revise the model -> re-evaluate -> improve performance by changing the kernel
classifier2 <- ksvm(letter ~.,
data = letters_train,
kernel = 'rbfdot')
# rbfdot: use the Gaussian radial basis function (a bell-shaped, normal-distribution-like function) as the kernel.
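For intuition, the Gaussian RBF kernel has the form k(x, y) = exp(-sigma * ||x - y||^2), so it returns 1 for identical points and decays toward 0 as they move apart. The base-R function below is a hand-written sketch, not kernlab's implementation. The name `rbf_kernel` and the value sigma = 0.5 are illustrative assumptions; by default ksvm() estimates sigma from the data.

```r
# Gaussian RBF kernel: k(x, y) = exp(-sigma * ||x - y||^2).
# sigma = 0.5 is an arbitrary illustrative value, not the one ksvm() would estimate.
rbf_kernel <- function(x, y, sigma = 0.5) {
  exp(-sigma * sum((x - y)^2))
}

same <- rbf_kernel(c(1, 2), c(1, 2))  # identical points: similarity of 1
far  <- rbf_kernel(c(0, 0), c(3, 4))  # squared distance 25: similarity near 0
same
far
```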
predict2 <- predict(classifier2, letters_test)
> head(predict2,n=10)
[1] U N V I N H E Y G E
Levels: A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
> head(letters_test$letter,n=10)
[1] U N V I N H E Y G E
Levels: A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
> table(predict2, letters_test$letter)  # rows = predicted, columns = actual
predict2 A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
A 151 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 0 0 1 0 0 0 0 0 0
B 0 128 0 3 0 1 0 2 0 0 0 1 2 1 0 2 1 3 3 0 0 4 1 1 0 0
C 0 0 132 0 3 0 1 0 2 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
D 1 1 0 161 0 0 2 8 2 3 1 0 0 1 1 3 1 3 0 2 0 0 0 2 3 0
E 0 0 3 0 137 2 0 0 0 1 0 4 0 0 0 1 0 0 2 1 0 0 0 0 0 2
F 0 0 0 0 0 148 0 0 3 0 0 0 0 0 0 11 0 0 1 0 0 1 0 0 0 0
G 0 0 2 0 8 0 154 2 0 0 0 2 2 0 2 1 0 0 0 2 0 0 0 0 0 0
H 0 1 0 1 0 0 2 125 0 1 2 1 1 3 0 1 1 0 0 2 0 0 0 0 0 0
I 0 0 0 0 0 0 0 0 151 3 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
J 0 0 0 0 0 0 0 0 3 136 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3
K 0 0 1 0 0 0 0 5 0 0 132 0 0 1 0 0 0 3 0 0 0 0 0 2 0 0
L 0 0 0 0 0 0 1 0 0 0 0 141 0 0 0 0 0 0 1 0 0 0 0 0 0 0
M 0 0 0 0 0 0 1 1 0 0 0 0 138 1 0 0 0 0 0 0 1 0 2 0 0 0
N 0 0 0 0 0 2 0 0 0 0 0 0 0 150 0 0 0 2 0 0 0 0 1 0 0 0
O 0 0 2 0 0 0 0 0 0 1 0 0 0 5 129 2 4 0 0 0 1 0 0 0 0 0
P 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 141 0 0 0 0 0 0 0 0 0 0
Q 0 0 0 0 0 0 0 1 0 0 0 0 0 0 3 3 158 0 0 0 0 0 0 0 0 0
R 0 3 1 1 0 0 2 5 0 0 9 1 0 3 2 1 0 150 0 1 0 0 0 0 0 0
S 0 2 0 0 0 0 0 0 1 2 0 2 0 0 0 0 0 0 152 0 0 0 0 0 0 2
T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 140 0 0 0 0 1 0
U 0 0 1 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 161 0 0 0 1 0
V 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 2 131 0 0 1 0
W 0 0 0 0 0 0 1 0 0 0 0 0 0 0 2 0 0 0 0 0 3 0 135 0 0 0
X 0 1 0 0 1 0 0 0 0 0 2 4 0 0 0 0 0 0 1 1 0 0 0 153 1 1
Y 4 0 0 0 0 0 0 1 0 0 0 0 0 0 0 2 0 0 0 1 0 0 0 0 138 0
Z 0 0 0 0 3 0 0 0 2 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 150
> correct2 <- ifelse(predict2 == letters_test$letter,1,0)
> correct2_count<- sum(correct2)
> correct2_ratio<- correct2_count/4000
> correct2_ratio
[1] 0.9305
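To put the two results side by side, the reported accuracies can be converted back into counts on the 4,000 test letters. This is a small arithmetic sketch using only the figures printed above; the variable names are illustrative.

```r
# Accuracies reported above on the 4,000 test letters.
acc <- c(vanilladot = 0.83925, rbfdot = 0.9305)
n_test <- 4000

# Correctly classified letters per kernel (rounded to guard against floating-point noise).
n_correct <- round(acc * n_test)

# Additional letters classified correctly after switching to the RBF kernel.
gain <- n_correct["rbfdot"] - n_correct["vanilladot"]
n_correct
gain
```

On this split, changing the kernel from linear to RBF raises accuracy by about 9 percentage points.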