Classification is a form of predictive analysis, like Regression. However, it differs from Regression in that it is used to predict categorical variables, whereas Regression is used to predict numerical variables.
In this post, I will use an iterative process with the K Nearest Neighbors (K-NN) algorithm to find the value of k that gives the best score for the Area Under the Curve (the higher the score, the more predictive the Classification model). More on this later.
Agenda:
1. I will synthesize my learning on Classification by using the k-NN algorithm.
2. We will partition the data into two sets: a training set (used for learning) and a test set (used to evaluate the model).
3. I will discuss the results via a Confusion Matrix and explain its components.
I will be using the data set provided in class lectures at DePaul University Kellstadt Graduate School of Business. It contains credit card information: 10,000 observations and only 3 variables. I will use a common process to predict default on credit card debt.
Let’s load the appropriate packages. Tidyverse contains our visualization sub-package, ggplot2. As usual, the text that is highlighted in BLUE is the code that you run in RStudio:
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.4     v dplyr   1.0.7
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   2.0.1     v forcats 0.5.1
## Warning: package 'stringr' was built under R version 4.1.2
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
Let’s set our working directory. This is unique to the
user, but here is the function that we will use.
setwd("C:/Users/firstnamelastinitial/OneDrive/Documents/R")
Now, we will import the default.csv data set.
credit <- read.csv("default.csv", header = TRUE)
I named the data frame “credit” just to be rebellious. You can be a rebel too! Name the object whatever you desire.
Let’s use some basic exploratory functions to help us
learn the type of data we are working with.
dim(credit)
## [1] 10000     3
str(credit)
## 'data.frame':    10000 obs. of  3 variables:
##  $ default: chr  "No" "No" "No" "No" ...
##  $ balance: num  730 817 1074 529 786 ...
##  $ income : num  44362 12106 31767 35704 38463 ...
colnames(credit)
## [1] "default" "balance"
"income"
There are 10,000 observations (or rows), and 3
variables (or columns).
Of the three variables, only “default” is a character type. We will change this variable into a factor later.
We have three columns called "default," "balance," and "income."
You can use the head() and tail() functions if you wish. I find them a bit redundant here, so I won’t dwell on them.
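If you do want a quick peek at the first and last rows anyway, the calls are simply:
head(credit)  # prints the first six rows
tail(credit)  # prints the last six rows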
Next, it would be prudent to convert the default variable
into a Factor type.
table(credit$default)
##
##   No  Yes
## 9667  333
credit$default <- as.factor(credit$default)
Next, we will rearrange the class categories, with Yes
being first.
credit$default <- relevel(credit$default, "Yes")
table(credit$default)
##
##  Yes   No
##  333 9667
Yes now comes before No as noted in the output.
Let’s get some descriptive statistics for our predictor
variables. Now, there are only two of them so this part is relatively painless.
summary(credit$balance)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##     0.0   481.7   823.6   835.4  1166.3  2654.3
sd(credit$balance)
## [1] 483.715
The summary() function tells us the min, max, median, mean, and 1st and 3rd quartiles of the data of interest. The sd() function tells us the standard deviation of the data of interest. We are currently looking at the balance column.
summary(credit$income)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##     772   21340   34553   33517   43808   73554
sd(credit$income)
## [1] 13336.64
We can see summary statistics and the standard
deviation for the income now.
We can even use a distribution table to summarize
categorical variables like default.
table(credit$default)
##
##  Yes   No
##  333 9667
Ok, so that was the easy part. Next, we need to
partition our data into training and test sets. Recall that the training set
allows for our model to learn the best way to fit the data. And the test set
allows us to evaluate our model. To Train, to test!
We are first going to set a seed so that we can randomize our sets and still reproduce the results.
set.seed(1234)
Let’s split our data using an 80:20 ratio. Meaning 80%
of our data will be in the training set, and the remaining 20% will reside in
our testing set. The process I will use is a common one:
smp_size <- floor(0.80 * nrow(credit))
train_ind <- sample(seq_len(nrow(credit)), size = smp_size)
train <- credit[train_ind, ]
test <- credit[-train_ind, ]
Notice in the Environment 8000 observations are in the
train set (80% of the data) and 2000 observations are in the test set (20% of
the data).
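If you’d rather confirm the split from the console instead of the Environment pane, a quick sanity check looks like this:
nrow(train)
## [1] 8000
nrow(test)
## [1] 2000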
Let’s use ggplot2 to visualize the data. This. Will. Be.
Super. Awesome! For good measure, let’s label our x-axis (balance), and y-axis
(income).
ggplot(train, mapping = aes(x = balance, y = income, color = default)) +
  geom_point() +
  ggtitle("Balance and Income by default") +
  xlab("balance") +
  ylab("income")
Isn’t that a pretty graph? Lovely!
Let’s do box plots of income and balance while we’re on a roll! Here are both of them below:
ggplot(train, aes(x = default, y = balance, color = default)) +
  geom_boxplot() +
  ggtitle("Balance by default") +
  xlab("default") +
  ylab("balance")
We can observe that those who default tend to carry higher balances than those who do not.
ggplot(train, aes(x = default, y = income, color = default)) +
  geom_boxplot() +
  ggtitle("Income by default") +
  xlab("default") +
  ylab("income")
Further, we can see that those who default tend to have lower incomes. This makes sense: those with lower incomes might have a harder time paying off a balance, which would result in a default.
Now, we can run a k-NN algorithm. So what do we need?
1. We will need data from the training and testing sets.
2. We will need a predictor variable (x) that impacts the response.
3. We will need a response variable (y) that is impacted by the predictor (x).
4. Finally, we need a package that contains our knn() function.
Let’s set k = 1.
This will be an iterative process,
and we will choose different values for k in the following lines:
library(class)
## Warning: package 'class' was built under R version 4.1.2
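# knn() takes: the training predictors, the test predictors, the true training labels, and k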
knn_1 <- knn(train[,2:3], test[,2:3], train$default, k = 1)
table(knn_1)
## knn_1
##  Yes   No
##   56 1944
Ok, so there are 56 representing Yes (those who
defaulted), and 1944 representing No (those who did NOT default).
Let’s evaluate our model using a Confusion Matrix.
table(test$default, knn_1)
##       knn_1
##        Yes   No
##   Yes   18   47
##   No    38 1897
This is how we will interpret it:
True Positives = 18
True Negatives = 1897
False Positives = 38
False Negatives = 47
These numbers are very important in determining
sensitivity and specificity. Let’s calculate each of these metrics one at a time:
Sensitivity = TP / (TP+FN) = 18 / (18 + 47) = .27692 (True
Positive Rate)
Specificity = TN / (TN+FP) = 1897 / (1897 + 38) = .98036
(True Negative Rate)
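If you’d rather not run these through a calculator each time, here is a small helper function of my own (not part of any package) that pulls both rates straight out of the confusion matrix. It assumes the table is built as table(actual, predicted) with Yes releveled first, as ours is:
# Hypothetical helper: sensitivity and specificity from a confusion matrix
rates <- function(cm) {
  TP <- cm["Yes", "Yes"]; FN <- cm["Yes", "No"]
  FP <- cm["No", "Yes"];  TN <- cm["No", "No"]
  c(sensitivity = TP / (TP + FN), specificity = TN / (TN + FP))
}
rates(table(test$default, knn_1))
We can reuse this helper for every value of k we try below.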
Sensitivity and specificity will both be used to plot a visualization. We want both values to be HIGH, as in, close to 1.
library(ROSE)
## Warning: package 'ROSE' was built under R version 4.1.2
## Loaded ROSE 0.0-4
roc.curve(test$default, knn_1)
## Area under the curve (AUC): 0.629
As you can see, the True Positive Rate (sensitivity) is presented on the y-axis, and the False Positive Rate (which is 1 minus specificity) is positioned along the x-axis. The greater the area under the curve (AUC), the better the performance of the model. So far, we have a score of 0.629.
Let’s pick another k and see if we can get higher results.
How about when k is equal to 3?
knn_3 <- knn(train[,2:3], test[,2:3], train$default, k = 3)
table(test$default, knn_3)
##       knn_3
##        Yes   No
##   Yes   11   54
##   No    16 1919
We can see that:
True positive: 11
True negative: 1919
False positive: 16
False negative: 54
So let’s calculate sensitivity and specificity.
Sensitivity = TP / (TP+FN) = 11 / (11 + 54) = .1692308
(True Positive Rate)
Specificity = TN / (TN+FP) = 1919 / (1919 + 16) = .9917313
(True Negative Rate)
With k equaling 3 our sensitivity dropped, but our
specificity rose.
roc.curve(test$default, knn_3)
## Area under the curve (AUC): 0.580
The area under the curve (AUC) is .580, which is less than when k equaled 1. So far, k = 1 fills more area under the curve.
Let’s try this when k equals to 6 to see if our AUC
improves.
knn_6 <- knn(train[,2:3], test[,2:3], train$default, k = 6)
table(test$default, knn_6)
##       knn_6
##        Yes   No
##   Yes    6   59
##   No     9 1926
We can see that:
True positive: 6
True negative: 1926
False positive: 9
False negative: 59
So let’s calculate sensitivity and specificity when k is equal to 6.
Sensitivity = TP / (TP+FN) = 6 / (6 + 59) = .0923077 (True Positive Rate)
Specificity = TN / (TN+FP) = 1926 / (1926 + 9) = .9953488 (True Negative Rate)
The same pattern occurred: sensitivity decreased, and specificity increased.
Let’s see our AUC:
roc.curve(test$default, knn_6)
## Area under the curve (AUC): 0.544
The area under the curve (AUC) keeps decreasing as we increase the value of k. When k is equal to 6, the AUC is equal to .544.
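Rather than trying one k at a time, you could also loop over several candidates and compare AUC scores in one pass. Here is a minimal sketch of my own (the candidate k values are arbitrary, and I pass plotit = FALSE to roc.curve() to skip the plots):
# Sketch: compare AUC across several candidate values of k
for (k in c(1, 3, 6, 12, 25)) {
  pred <- knn(train[,2:3], test[,2:3], train$default, k = k)
  auc <- roc.curve(test$default, pred, plotit = FALSE)$auc
  cat("k =", k, "AUC =", round(auc, 3), "\n")
}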
Conclusion: I am going to keep my k value equal to 1 to keep my AUC score the highest.
But let’s say that I didn’t see the writing on the wall. Let’s try this one more time. This time I am going to set my k value to the square root of the number of instances in my training set, which is roughly 89. Why not? This is one commonly suggested heuristic for choosing a k value. Will it help improve our AUC?
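For what it’s worth, computing that heuristic in R looks like this (sqrt(8000) is about 89.4):
round(sqrt(nrow(train)))
## [1] 89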
knn_89 <- knn(train[,2:3], test[,2:3], train$default, k = 89)
table(test$default, knn_89)
##       knn_89
##        Yes   No
##   Yes    0   65
##   No     0 1935
roc.curve(test$default, knn_89)
## Area under the curve (AUC): 0.500
The area under the curve (AUC) is exactly .500. I smell trouble!! With k this large, the majority class (No) dominates every neighborhood, so the model never predicts a default at all. I am going to call it quits, actually. I am going to keep my k value equal to 1 for this example.
THANK YOU!
As you can see, Classification analysis is wildly different from Regression and Clustering. In my opinion, its outputs are more difficult to interpret. But it is a vital method in machine learning and data analytics when you are trying to make a binary decision (Yes or No).
Thank you so much for visiting my blog. I hope this post didn't confuse you or discourage you from your goal of transitioning from the Sex Work Industry. Just keep practicing, and use the data sets that I have posted on my Google Drive.
Don't forget to enjoy the journey and process of learning something new! Thanks again.