Tuesday, August 9, 2022

Classification Algorithm Using K-NN (K-Nearest Neighbors)

Classification is a form of predictive analysis, like Regression. But it differs from Regression in that it is used to predict categorical variables, while Regression is used to predict numerical variables.


In this post, I will use an iterative process (K Nearest Neighbors, or K-NN) to find the optimal score for the Area Under the Curve--the higher the score, the more predictive the Classification model. More on this later.


Agenda:


1.         I will synthesize my learning on Classification by using the k-NN algorithm.


2.         We will partition the data into two sets: a training set (used for learning) and a test set (used to evaluate the model).


3.         I will discuss the results via a Confusion Matrix, and explain its components.


I will be using the data set provided in class lectures at DePaul University Kellstadt Graduate School of Business. It contains credit card information: 10,000 observations and only 3 variables. I will use a common process to predict default on credit card debt.


Let’s load the appropriate packages. Tidyverse contains our visualization sub-package, ggplot2. As usual, the text highlighted in BLUE is the code that you run in RStudio:


library(tidyverse)

## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --

## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.4     v dplyr   1.0.7
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   2.0.1     v forcats 0.5.1

## Warning: package 'stringr' was built under R version 4.1.2

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()


Let’s set our working directory. The path is unique to each user, but here is the function we will use.


setwd("C:/Users/firstnamelastinitial/OneDrive/Documents/R")



Now, we will import the default.csv data set.


credit <- read.csv("default.csv", header = TRUE)


I named the data frame “credit” just to be rebellious. You can be a rebel too! Name the object whatever you desire.


Let’s use some basic exploratory functions to help us learn the type of data we are working with.


dim(credit)

## [1] 10000     3


str(credit)

## 'data.frame':    10000 obs. of  3 variables:
##  $ default: chr  "No" "No" "No" "No" ...
##  $ balance: num  730 817 1074 529 786 ...
##  $ income : num  44362 12106 31767 35704 38463 ...


colnames(credit)

## [1] "default" "balance" "income"


There are 10,000 observations (or rows), and 3 variables (or columns).


Of the three variables, only “default” is a character type. We will convert this variable into a factor later.


We have three columns called "default," "balance," and "income."


You can use the head() and tail() functions if you wish. I find them redundant here, so I won’t use them.


Next, it would be prudent to convert the default variable into a Factor type.


table(credit$default)

##
##   No  Yes
## 9667  333


credit$default<- as.factor(credit$default)


Next, we will reorder the factor levels so that Yes comes first.


credit$default <- relevel(credit$default, "Yes")
table(credit$default)

##
##  Yes   No
##  333 9667


Yes now comes before No, as noted in the output. This matters because the first factor level is treated as the class of interest in the tables and ROC curves below.


Let’s get some descriptive statistics for our predictor variables. Now, there are only two of them so this part is relatively painless.


summary(credit$balance)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##     0.0   481.7   823.6   835.4  1166.3  2654.3

sd(credit$balance)

## [1] 483.715


The summary() function tells us the min, max, median, mean, and 1st and 3rd quartiles of the data of interest. The sd() function tells us the standard deviation. We are currently looking at the balance column.


summary(credit$income)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##     772   21340   34553   33517   43808   73554

sd(credit$income)

## [1] 13336.64


We can see summary statistics and the standard deviation for the income now.


We can even use a distribution table to summarize categorical variables like default.


table(credit$default)

##
##  Yes   No
##  333 9667


Ok, so that was the easy part. Next, we need to partition our data into training and test sets. Recall that the training set allows our model to learn the best fit to the data, and the test set allows us to evaluate the model. To train, to test!


We are first going to set a seed so that our random split is reproducible.


set.seed(1234)


Let’s split our data using an 80:20 ratio, meaning 80% of our data will be in the training set and the remaining 20% will reside in our test set. The process I will use is a common one:


smp_size <- floor(0.80 * nrow(credit))
train_ind <- sample(seq_len(nrow(credit)), size = smp_size)

train <- credit[train_ind, ]
test <- credit[-train_ind, ]


Notice in the Environment pane that 8,000 observations are in the train set (80% of the data) and 2,000 observations are in the test set (20% of the data).
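You can also confirm the split sizes from the console. This is a quick check that assumes the train and test objects we just created:

```r
# Confirm the 80:20 split
nrow(train)
## [1] 8000
nrow(test)
## [1] 2000
```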


Let’s use ggplot2 to visualize the data. This. Will. Be. Super. Awesome! For good measure, let’s label our x-axis (balance), and y-axis (income).


ggplot(train, mapping = aes(x = balance, y = income, color = default)) +
  geom_point() + 
  ggtitle("Balance and Income by default") +
  xlab("balance") +
  ylab("income")


Classification Plot




Isn’t that a pretty graph? Lovely!


Let’s do box plots of income and balance while we’re on a roll! Here are both of them below:


ggplot(train, aes(x = default, y = balance, color = default)) +
  geom_boxplot() + 
  ggtitle("Balance by default") +
  xlab("default") +
  ylab("balance")


Box Plot 1

We can observe that those who default tend to carry higher balances than those who do not.


ggplot(train, aes(x = default, y = income, color = default)) +
  geom_boxplot() +
  ggtitle("Income by default") +
  xlab("default") +
  ylab("income")


Box Plot 2


Further, we can determine that those who default tend to have lower incomes. This makes sense: those with lower incomes might have a harder time paying off a balance, which can result in a default.


Now, we can run a k-NN algorithm. So what do we need?


1.         We will need data from the training and testing sets.


2.         We should probably have an independent variable (x) that will impact the response variable (y).


3.         We will also need a dependent variable (y) that will be impacted by the predictor variable (x).


4.         Finally, we need a package that contains the knn() function.


Let’s set k = 1. 


This will be an iterative process, and we will choose different values for k in the following lines:


library(class)

## Warning: package 'class' was built under R version 4.1.2

knn_1 <- knn(train[, 2:3], test[, 2:3], train$default, k = 1)
table(knn_1)

## knn_1
##  Yes   No
##   56 1944


Ok, so the model predicted Yes (defaulted) for 56 observations, and No (did NOT default) for 1,944 observations.


Let’s evaluate our model using a Confusion Matrix.


table(test$default, knn_1)

##      knn_1
##        Yes   No
##   Yes   18   47
##   No    38 1897


This is how we will interpret it:


True Positives = 18

True Negatives = 1897

False Positives = 38

False Negatives = 47


These numbers are very important in determining sensitivity and specificity. Let’s calculate each of these metrics one at a time:


Sensitivity = TP / (TP+FN) = 18 / (18 + 47) = .27692 (True Positive Rate)


Specificity = TN / (TN+FP) = 1897 / (1897 + 38) = .98036 (True Negative Rate)


Sensitivity and specificity will both be used to plot a visualization. We want both values to be HIGH, as in, close to 1.
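The by-hand arithmetic above can also be done in R. Here is a small sketch that rebuilds the k = 1 confusion matrix from the output we just saw and computes both rates:

```r
# Rebuild the k = 1 confusion matrix (rows = actual, columns = predicted)
cm <- matrix(c(18,   47,
               38, 1897),
             nrow = 2, byrow = TRUE,
             dimnames = list(actual = c("Yes", "No"),
                             predicted = c("Yes", "No")))

TP <- cm["Yes", "Yes"]   # true positives
FN <- cm["Yes", "No"]    # false negatives
FP <- cm["No", "Yes"]    # false positives
TN <- cm["No", "No"]     # true negatives

sensitivity <- TP / (TP + FN)   # 18 / 65
specificity <- TN / (TN + FP)   # 1897 / 1935

round(sensitivity, 5)
## [1] 0.27692
round(specificity, 5)
## [1] 0.98036
```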


library(ROSE)

## Warning: package 'ROSE' was built under R version 4.1.2

## Loaded ROSE 0.0-4

roc.curve(test$default, knn_1)


ROC Curve 1

## Area under the curve (AUC): 0.629


As you can see, the True Positive Rate (sensitivity) is presented on the y-axis, and the False Positive Rate (1 - specificity) is positioned along the x-axis. The greater the area under the curve (AUC), the better the performance of the model. So far, we have a score of 0.629.


Let’s pick another k and see if we can get a higher score. How about k equal to 3?


knn_3 <- knn(train[,2:3], test[,2:3], train$default, k = 3)
table(test$default, knn_3)

##      knn_3
##        Yes   No
##   Yes   11   54
##   No    16 1919


We can see that:


True positive: 11

True negative: 1919

False positive: 16

False negative: 54


So let’s calculate sensitivity and specificity.


Sensitivity = TP / (TP+FN) = 11 / (11 + 54) = .1692308 (True Positive Rate)

Specificity = TN / (TN+FP) = 1919 / (1919 + 16) = .9917313 (True Negative Rate)


With k equal to 3, our sensitivity dropped, but our specificity rose.


roc.curve(test$default, knn_3)


ROC Curve 2


## Area under the curve (AUC): 0.580


Area under the curve (AUC) is .580, which is less than when k equaled 1. So far, k = 1 has covered the most area under the curve.


Let’s try k equal to 6 to see if our AUC improves.


knn_6<-knn(train[,2:3], test[,2:3], train$default, k = 6)

table(test$default, knn_6)

##      knn_6
##        Yes   No
##   Yes    6   59
##   No     9 1926


We can see that:


True positive: 6

True negative: 1926

False positive: 9

False negative: 59


So let’s calculate sensitivity and specificity when k is equal to 6.


Sensitivity = TP / (TP+FN) = 6 / (6 + 59) = .0923077 (True Positive Rate)

Specificity = TN / (TN+FP) = 1926 / (1926 + 9) = .9953488 (True Negative Rate)


The same pattern occurred: sensitivity decreased, and specificity increased.


Let’s see our AUC:


roc.curve(test$default, knn_6)


ROC Curve 3

## Area under the curve (AUC): 0.544


The area under the curve (AUC) keeps decreasing as we increase the value of k. When k is equal to 6, the AUC is equal to .544.


Conclusion: I am going to keep k equal to 1, to keep my AUC score the highest.


But let’s say that I didn’t see the writing on the wall. Let’s try this one more time. This time I am going to set my k value to the square root of the number of instances in my training set, which is approximately 89. Why not?


Supposedly, this is one valid rule of thumb for choosing a k value. Will it help improve our AUC?
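A quick sanity check on that rule of thumb, using the 8,000-row training set from the split above:

```r
# One rule of thumb: set k near the square root of the training-set size
k_rule <- floor(sqrt(8000))   # the training set has 8,000 rows
k_rule
## [1] 89
```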


knn_89<-knn(train[,2:3], test[,2:3], train$default, k = 89)

table(test$default, knn_89)

##      knn_89
##        Yes   No
##   Yes    0   65
##   No     0 1935

roc.curve(test$default, knn_89)


ROC Curve 4

## Area under the curve (AUC): 0.500


The area under the curve (AUC) is exactly .500, meaning the model does no better than random guessing: as the confusion matrix shows, with k = 89 every observation is predicted No. I smell trouble!! I am going to call it quits, actually. I am going to keep my k value equal to 1 for this example.
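If you would rather not try each k by hand, the whole search can be sketched as a loop. This is just a sketch: it assumes the train and test objects from the split above, and uses ROSE’s roc.curve() with plotit = FALSE so we only collect the AUC scores:

```r
library(class)
library(ROSE)

# Candidate k values from this post
k_values   <- c(1, 3, 6, 89)
auc_scores <- numeric(length(k_values))

for (i in seq_along(k_values)) {
  pred <- knn(train[, 2:3], test[, 2:3], train$default, k = k_values[i])
  auc_scores[i] <- roc.curve(test$default, pred, plotit = FALSE)$auc
}

# One row per candidate k, with its AUC
data.frame(k = k_values, AUC = auc_scores)
```

One caveat: knn() breaks distance ties at random, so re-running the loop can shift the AUC values slightly.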



THANK YOU!


As you can see, Classification analysis is wildly different from Regression and Clustering. In my opinion, its outputs are more difficult to interpret. But it is a vital method in machine learning and data analytics when you are trying to make a binary decision (Yes or No).


Thank you so much for visiting my blog. I hope this post didn't confuse you or discourage you from your goal of transitioning from the Sex Work Industry. Just keep practicing, and use the data sets that I have posted on my Google Drive. 


Don't forget to enjoy the journey and process of learning something new! Thanks again. 

