Tuesday, August 9, 2022

Classification Analysis: What Is It? And How Is It Different From Regression?

For those who are interested in Business Analytics and Data Science, there are three main analyses that are considered industry standard. 


One is regression analysis, which is used to predict a response variable that is of numerical nature. 


The second one is clustering, which is a type of “unsupervised” machine learning model. 


And the third, the topic of this post, is classification: a "supervised" machine learning model that is used to predict which group an observation belongs to, based on a categorical response variable (often one of two groups, such as yes or no).

 

Thus far in my blog, I have discussed Regression and Clustering algorithms. So, I thought it would be appropriate to begin a discussion on Classification. 


So, when would an analyst decide to use classification? The common use case is when the variable that you are trying to predict is not numerical. In this case, it is "categorical," as in whether or not a customer decides to make a purchase (yes or no), or whether a banking customer defaults on a loan or payment (default or no default).


Categorical, in essence, means that the variable in question is not a number. Location and gender can be considered categorical, as examples. (Age, on the other hand, is numerical, though it can be treated as categorical if you bin it into groups like "under 30" and "30 and over.")

 

There are several algorithms that we can use, but for my purpose, I am interested in using the k-NN algorithm.

 

Here is a graphical representation of the k-NN (k-Nearest Neighbors):


K-Nearest Neighbors


The k-NN algorithm is powerful and relatively simple to interpret and visualize. These are its main advantages: it's simple and effective at determining groupings of observations; since it is non-parametric, it makes no assumptions about the underlying distribution of the data; finally, it has a fast training phase, because it simply stores the training observations rather than fitting a model up front.

 

The process for determining the outcome of a k-NN algorithm is as follows:


1. Clean the data


2. Determine appropriate labels for training sets


3. Pick an appropriate k-value


4. Test the data with predictor variables of interest


5. For each new observation, find the k observations in the training set that are closest to it (by a distance measure such as Euclidean distance)


6. Use a "majority vote" among those neighbors to assign the new observation to a class (yes or no, cancerous or not cancerous, default on a loan or no default)

 

(Fear not! This process will become more tangible in my next post, when I run through the code of the algorithm on a data set.)
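In the meantime, steps 5 and 6 can be sketched in just a few lines. Below is a minimal, from-scratch illustration in Python (the function name, the toy data, and the choice of Euclidean distance are my own for this sketch; the full walkthrough in the next post will be in R):

```python
from collections import Counter
import math

def knn_predict(train_X, train_y, new_point, k=3):
    """Classify new_point by majority vote among its k nearest training points."""
    # Step 5: Euclidean distance from new_point to every training observation
    distances = [
        (math.dist(x, new_point), label) for x, label in zip(train_X, train_y)
    ]
    # Keep only the k closest neighbors
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Step 6: majority vote among the neighbors' labels
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy example: two predictor variables, a yes/no response
train_X = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
train_y = ["no", "no", "no", "yes", "yes", "yes"]
print(knn_predict(train_X, train_y, (2, 2), k=3))  # prints "no"
```

The new point (2, 2) sits inside the "no" cluster, so all three of its nearest neighbors vote "no."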

 

So how do you choose an appropriate k-value? This is the most challenging part of the algorithm, as it requires trial and error; it is an iterative process. 


The basic rule is that the noisier and more ambiguous the data, the larger the k-value you need, since averaging over more neighbors smooths out that noise. However, you don't want a k-value that is too small, like 1, or you risk overfitting the data; and you don't want one that is too large, or the model will underfit and miss real structure. 


Overfitting is NOT ideal if you are trying to predict future outcomes, as the model won't perform well on newer data. As a starting point, you can set the k-value equal to the square root of the number of observations in your training set. That will work as well. From there, some use of diagnostics will enable you to pick an appropriate k-value later on in the process.
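As a rough sketch of that square-root rule of thumb, here is one way you might generate a handful of candidate k-values to try (Python; the function name and defaults are my own invention, and odd values are preferred in two-class problems so the majority vote can't tie):

```python
import math

def candidate_k_values(n_train, n_candidates=5):
    """Generate odd k-values centered near sqrt(n_train) to try in turn."""
    # Start from the square root of the training-set size, rounded to odd
    base = max(1, round(math.sqrt(n_train)))
    if base % 2 == 0:
        base += 1
    ks = []
    k = base - 2 * (n_candidates // 2)  # step down to the smallest candidate
    while len(ks) < n_candidates:
        if k >= 1:                      # k must be at least 1
            ks.append(k)
        k += 2                          # odd values only
    return ks

print(candidate_k_values(100))  # → [7, 9, 11, 13, 15]
```

Each candidate would then be evaluated on held-out data, and the k with the best performance wins.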

 

You will also want to run a "confusion matrix," which tells you how well your model performed: how many True Positives (TP) and True Negatives (TN) it predicted. Type I errors (false positives) and Type II errors (false negatives) are also counted in this matrix, and from it you can compute other metrics like "sensitivity" and "specificity." 


(Again, this will become clearer when I post the code for Classification.)
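In the meantime, here is a rough Python sketch of what a confusion matrix boils down to (the function name and toy labels are my own; sensitivity is the true positive rate, specificity the true negative rate):

```python
def confusion_metrics(actual, predicted, positive="yes"):
    """Count TP/TN/FP/FN and derive sensitivity and specificity."""
    tp = sum(a == positive and p == positive for a, p in zip(actual, predicted))
    tn = sum(a != positive and p != positive for a, p in zip(actual, predicted))
    fp = sum(a != positive and p == positive for a, p in zip(actual, predicted))  # Type I error
    fn = sum(a == positive and p != positive for a, p in zip(actual, predicted))  # Type II error
    return {
        "TP": tp, "TN": tn, "FP": fp, "FN": fn,
        "sensitivity": tp / (tp + fn),  # share of actual positives caught
        "specificity": tn / (tn + fp),  # share of actual negatives caught
    }

# Toy actual vs. predicted labels
actual    = ["yes", "yes", "no", "no", "yes", "no"]
predicted = ["yes", "no",  "no", "yes", "yes", "no"]
print(confusion_metrics(actual, predicted))
```

On this toy data the model catches 2 of 3 actual "yes" cases (sensitivity of about 0.67) and 2 of 3 actual "no" cases (specificity of about 0.67).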

 

You will need an ROC Curve to visualize how well your model performed. Here is what that looks like:



ROC Curve



This curve plots the True Positive Rate (the sensitivity metric) on the y-axis as a function of the False Positive Rate (1 − specificity) on the x-axis. 


The AUC, or Area Under the Curve, is important in our final assessment of the model. The larger this area, the greater the predictive power of the model: an AUC of 0.5 is no better than random guessing, while an AUC of 1.0 is a perfect classifier.
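As a sketch of the idea, the AUC can also be computed directly as the probability that a randomly chosen positive case receives a higher predicted score than a randomly chosen negative one (Python, with made-up scores and labels for illustration):

```python
def auc(scores, labels, positive="yes"):
    """AUC as the probability that a positive case outscores a negative case
    (ties count as half a win)."""
    pos = [s for s, l in zip(scores, labels) if l == positive]
    neg = [s for s, l in zip(scores, labels) if l != positive]
    # Compare every positive score against every negative score
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy predicted probabilities of "yes" alongside the true labels
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
labels = ["yes", "yes", "no", "yes", "no", "no"]
print(auc(scores, labels))  # prints 0.8888888888888888
```

Here the positives outscore the negatives in 8 of 9 pairings, so the AUC is 8/9: a strong model, well above the 0.5 of random guessing.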

 


THANK YOU!


This post is meant as a 30,000-foot overview of classification, and of the k-NN algorithm used to sort observations into the groups of a categorical response variable. This information is abstract in written form but will become clearer in my next post, where I will run the code in RStudio. 


Thank you so much for reading.


Location: Chicago, IL, USA