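Measuring Performance with a Confusion Matrix
● TP (True Positive): a positive case correctly predicted as positive
● TN (True Negative): a negative case correctly predicted as negative
● FP (False Positive): a negative case wrongly predicted as positive
● FN (False Negative): a positive case wrongly predicted as negative
● Accuracy = (TP + TN) / (TP + TN + FP + FN); error rate = 1 - accuracy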
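# Naive Bayes spam classification results: actual type, predicted type, class probabilities
# stringsAsFactors = TRUE keeps the type columns as factors, which the caret functions used later expect
sms_results <- read.csv(file = 'mlwr/sms_results.csv', stringsAsFactors = TRUE)
library(dplyr)
# Messages the model was least certain about (spam probability close to 0.5)
sms_results %>%
  filter(prob_spam > 0.4 & prob_spam < 0.6)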
  actual_type predict_type prob_spam prob_ham
1        spam          ham   0.47536  0.52464
2         ham         spam   0.56188  0.43812
3         ham         spam   0.57917  0.42083
# Rows: actual type; columns: predicted type
> table(sms_results$actual_type, sms_results$predict_type)
       ham spam
  ham 1203    4
  spam  31  152
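> library(gmodels)  # provides CrossTable()
> CrossTable(sms_results$actual_type, sms_results$predict_type)
※ In the CrossTable output, cells of the same color sum to 1.
Accuracy = 0.865 + 0.109 = 0.974 (table proportions of TN and TP)
Error rate = 0.022 + 0.003 = 0.025 (table proportions of FN and FP)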
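The same two figures can also be computed from the raw counts in the table() output instead of from rounded proportions. The following is a minimal base R sketch, assuming spam is the positive class; the variable names are illustrative and do not come from the original code.

```r
# Counts copied from the table() output (rows = actual, columns = predicted)
TN <- 1203  # actual ham,  predicted ham
FP <- 4     # actual ham,  predicted spam
FN <- 31    # actual spam, predicted ham
TP <- 152   # actual spam, predicted spam

n <- TP + TN + FP + FN        # total number of messages
accuracy   <- (TP + TN) / n   # share of correct predictions
error_rate <- (FP + FN) / n   # share of wrong predictions, i.e. 1 - accuracy

round(c(accuracy = accuracy, error_rate = error_rate), 4)
```

Working from counts avoids the rounding gap in the hand calculation, where 0.974 + 0.025 adds up to 0.999 rather than 1.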
※ With three or more elements x_1, ..., x_n, the harmonic mean is n / (1/x_1 + ... + 1/x_n).
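# Computing the kappa statistic
# Pr(a): proportion of actual agreement between predicted and true labels
# TN + TP
pr_a <- 0.865 + 0.109
# Pr(e): proportion of expected agreement
# Assuming the predicted and actual labels are independent:
# P(actual ham) * P(predicted ham) + P(actual spam) * P(predicted spam)
pr_e <- 0.868 * 0.888 + 0.132 * 0.112
kappa <- (pr_a - pr_e) / (1 - pr_e)  # about 0.879 with these rounded inputs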
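The kappa calculation can be repeated with unrounded inputs by starting from the counts in the table() output. This is a base R sketch under that assumption; `cm` and `kappa_stat` are illustrative names, and `kappa_stat` is used so that base R's own `kappa()` function is not masked.

```r
# Confusion matrix from the table() output: rows = actual, columns = predicted
# (matrix() fills column by column, so the first column is "predicted ham")
cm <- matrix(c(1203, 31, 4, 152), nrow = 2,
             dimnames = list(actual = c('ham', 'spam'),
                             predicted = c('ham', 'spam')))

n <- sum(cm)
pr_a <- sum(diag(cm)) / n                      # observed agreement
pr_e <- sum(rowSums(cm) * colSums(cm)) / n^2   # agreement expected by chance
kappa_stat <- (pr_a - pr_e) / (1 - pr_e)

round(kappa_stat, 4)
```

With full precision the result should line up with the Kappa value that caret's confusionMatrix() reports for the same data, which suggests the small difference in the hand calculation comes from rounding the proportions to three decimals.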
# caret package: Classification And REgression Training
install.packages('caret')
library(caret)
> confusionMatrix(data = sms_results$predict_type, reference = sms_results$actual_type,
+ positive = 'spam')
Confusion Matrix and Statistics

          Reference
Prediction  ham spam
      ham  1203   31
      spam    4  152
Accuracy : 0.9748
95% CI : (0.9652, 0.9824)
No Information Rate : 0.8683
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.8825
Mcnemar's Test P-Value : 1.109e-05
Sensitivity : 0.8306
Specificity : 0.9967
Pos Pred Value : 0.9744
Neg Pred Value : 0.9749
Prevalence : 0.1317
Detection Rate : 0.1094
Detection Prevalence : 0.1122
Balanced Accuracy : 0.9136
'Positive' Class : spam
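# CrossTable() has no 'positive' argument; dnn only labels the two dimensions
CrossTable(sms_results$actual_type, sms_results$predict_type,
           dnn = c('actual', 'predicted'))
Comparing the statistics with the table (spam is the positive class)
Sensitivity (recall) = TP / (TP + FN) : 0.831
Specificity = TN / (TN + FP) : 0.997
Pos Pred Value = Positive Predictive Value = precision = TP / (TP + FP) : 0.974
Neg Pred Value = Negative Predictive Value = TN / (TN + FN) : 0.975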
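To check these values against the table by hand, the four statistics can be computed directly from the cell counts. This is a base R sketch, assuming spam is the positive class; the short names (`sens`, `spec`, `ppv`, `npv`) are illustrative and chosen so they do not clash with caret's functions.

```r
# Cell counts from the confusion matrix (positive class: spam)
TN <- 1203; FP <- 4; FN <- 31; TP <- 152

sens <- TP / (TP + FN)   # sensitivity: share of actual spam that was caught
spec <- TN / (TN + FP)   # specificity: share of actual ham left alone
ppv  <- TP / (TP + FP)   # positive predictive value (precision)
npv  <- TN / (TN + FN)   # negative predictive value
balanced_accuracy <- (sens + spec) / 2

round(c(sens = sens, spec = spec, ppv = ppv, npv = npv,
        balanced = balanced_accuracy), 4)
```

Each denominator is a row or column total of the confusion matrix: sensitivity and specificity divide by the actual class totals, while the predictive values divide by the predicted class totals.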
# Sensitivity (recall, true positive rate)
> sensitivity(data = sms_results$predict_type,
+ reference = sms_results$actual_type,
+ positive = 'spam')
[1] 0.8306011
# Specificity (true negative rate)
> specificity(data = sms_results$predict_type,
+ reference = sms_results$actual_type,
+ negative = 'ham')
[1] 0.996686
# Precision (positive predictive value)
> precision(data = sms_results$predict_type,
+ reference = sms_results$actual_type,
+ relevant = 'spam')
[1] 0.974359
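# F-measure = (2 * precision * recall) / (precision + recall), the harmonic mean of precision and recall
> F_meas(data = sms_results$predict_type,
+ reference = sms_results$actual_type,
+ relevant = 'spam')
[1] 0.8967552
# Same value by hand from the precision and recall computed above
f <- (2 * 0.974359 * 0.8306011) / (0.974359 + 0.8306011) # 0.896755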
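The F-measure can also be written in terms of the confusion matrix counts, and generalized with a weight beta that shifts the emphasis between precision and recall. The sketch below uses base R only; `f_beta` is a hypothetical helper written for illustration, not a caret function.

```r
# Cell counts from the confusion matrix (positive class: spam)
TP <- 152; FP <- 4; FN <- 31

precision <- TP / (TP + FP)
recall    <- TP / (TP + FN)

# Hypothetical helper: weighted harmonic mean of precision and recall.
# beta = 1 gives the usual F-measure; beta > 1 weights recall more heavily.
f_beta <- function(precision, recall, beta = 1) {
  (1 + beta^2) * precision * recall / (beta^2 * precision + recall)
}

f1_from_rates  <- f_beta(precision, recall)        # from precision and recall
f1_from_counts <- 2 * TP / (2 * TP + FP + FN)      # same quantity from counts

round(c(f1_from_rates, f1_from_counts), 6)
```

Both expressions give the same number because substituting the definitions of precision and recall into the harmonic mean cancels the common factors.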
# ROC (Receiver Operating Characteristic) curve
install.packages('pROC')
library(pROC)
sms_roc <- roc(response = sms_results$actual_type,
predictor = sms_results$prob_spam)
plot(sms_roc, col = 'blue', lwd = 3)
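# Overlay the ROC curve of a kNN model for comparison
# (assumes the rows of sms_knn line up with those of sms_results)
sms_knn <- read.csv(file = 'mlwr/sms_results_knn.csv')
head(sms_knn)
sms_knn_roc <- roc(response = sms_results$actual_type,
                   predictor = sms_knn$p_spam)
plot(sms_knn_roc, col = 'red', lwd = 3, add = TRUE)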
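Two overlaid ROC curves are usually compared through the area under the curve (AUC). As a rough illustration of what that number measures, the base R sketch below computes AUC with the rank-sum formulation: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. The function `auc_rank` and the four toy values are hypothetical and are not taken from the SMS data.

```r
# Hypothetical helper: AUC from the rank-sum (Mann-Whitney) formulation
auc_rank <- function(labels, scores, positive) {
  is_pos <- labels == positive
  n_pos <- sum(is_pos)
  n_neg <- sum(!is_pos)
  r <- rank(scores)  # tied scores receive average ranks
  (sum(r[is_pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy data for illustration only (not from sms_results)
toy_labels <- c('spam', 'spam', 'ham', 'ham')
toy_scores <- c(0.9, 0.4, 0.6, 0.1)

toy_auc <- auc_rank(toy_labels, toy_scores, positive = 'spam')
toy_auc  # 3 of the 4 spam/ham pairs are ordered correctly
```

For the actual curves, pROC's own `auc()` function can be applied to the roc objects, for example `auc(sms_roc)` and `auc(sms_knn_roc)`.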