For those who are interested in Business Analytics and Data Science, there are three main analyses that are considered industry standard.
The first is regression analysis, which is used to predict a numerical response variable.
The second is clustering, a type of “unsupervised” machine learning.
And the third, the topic of this post, is classification: a “supervised” machine learning method used to predict which of two groups an observation belongs to, based on a categorical response variable.
So, when would an analyst decide to use classification? The common use case is when the variable you are trying to predict is not numerical but “categorical,” as in whether or not a customer decides to make a purchase (yes or no), or whether a banking customer defaults on a loan or payment (default or no default).
Categorical, in essence, means that the variable in question is not a number. Location, gender, and age bracket, for example, can be considered categorical.
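For instance, in R (the language I will use in the next post), a categorical variable is stored as a factor; the values below are hypothetical:

```r
# A hypothetical response variable recording whether each customer
# made a purchase. In R, categorical variables are stored as factors.
purchase <- factor(c("yes", "no", "no", "yes"), levels = c("no", "yes"))
levels(purchase)  # the two possible categories: "no" and "yes"
```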
There are several algorithms we can use, but for my purpose, I am interested in the k-NN (k-Nearest Neighbors) algorithm. Here is a graphical representation of k-NN:
The k-NN algorithm is powerful and relatively simple to interpret and visualize. These are its main advantages: it is simple and effective at separating two groups of observations; because it is non-parametric, it makes no assumptions about the underlying distribution of the data; and, finally, it has a fast training phase, since the algorithm simply stores the training observations.
The process for determining the outcome of a k-NN algorithm is as follows:
1. Clean the data
2. Determine appropriate labels for the training set
3. Pick an appropriate k-value
4. Test the data with the predictor variables of interest
5. Select the k observations that have the closest proximity to each new observation in the model
6. Use a “majority vote” among those neighbors to assign the value of the response variable (yes or no, cancerous or not cancerous, default on a loan or no default)
(Fear not! This process will become more tangible in my next post, when I run through the code of the algorithm on a data set.)
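In the meantime, here is a minimal sketch of steps 1 through 6 in R, using the knn() function from the class package. The data frame df, its predictor columns x1 and x2, and its response column label are hypothetical placeholders, not real data:

```r
library(class)  # provides the knn() function

set.seed(42)

# 1. Clean the data: k-NN is distance-based, so scale the predictors.
#    (df, x1, x2, and label are hypothetical placeholders.)
predictors <- scale(df[, c("x1", "x2")])
labels     <- df$label

# 2. Split the observations and their labels into training and test sets.
train_idx <- sample(nrow(df), size = floor(0.8 * nrow(df)))
train_x   <- predictors[train_idx, ]
test_x    <- predictors[-train_idx, ]
train_y   <- labels[train_idx]

# 3-6. Pick a k, find the k training observations closest to each new
#      observation, and assign its class by majority vote.
pred <- knn(train = train_x, test = test_x, cl = train_y, k = 5)
```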
So how do you choose an appropriate k-value? This is the most challenging part of the algorithm, as it requires trial and error; it is an iterative process.
The basic rule is that the more complex and ambiguous the data, the smaller the k-value you should use. However, a k-value that is too small (like 1) risks overfitting the data, while one that is too large will smooth over real structure in the data.
Overfitting is NOT ideal if you are trying to predict future outcomes, as the model won’t perform well with newer data. As a rule of thumb, you can set the k-value equal to the square root of the number of observations in your training set; that will work as well. And diagnostics such as cross-validation will enable you to refine the k-value later on in the process.
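Continuing the sketch above, the square-root rule of thumb looks like this:

```r
# Rule-of-thumb starting point: k = square root of the training-set size.
k_start <- round(sqrt(length(train_y)))

# An odd k avoids tied votes when there are two classes.
if (k_start %% 2 == 0) k_start <- k_start + 1
```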
You will almost certainly run a “confusion matrix,” which will inform you of how well your model predicted True Positive (TP) and True Negative (TN) values. Type I errors (false positives) and Type II errors (false negatives) are also captured in this matrix, and other metrics, like “sensitivity” and “specificity,” are derived from it.
(Again, this will become clearer when I post the code for classification.)
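As a preview, the confusion matrix for the sketch above can be built with R’s table() function; the "yes"/"no" labels are assumed from the hypothetical example:

```r
# Cross-tabulate predicted classes against the held-out test labels.
test_y <- labels[-train_idx]
conf   <- table(Predicted = pred, Actual = test_y)
print(conf)

# Sensitivity (True Positive Rate) and specificity, assuming "yes" is
# the positive class in this hypothetical example.
sensitivity <- conf["yes", "yes"] / sum(conf[, "yes"])
specificity <- conf["no",  "no"]  / sum(conf[, "no"])
```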
You will need an ROC curve to visualize how well your model performed. Here is what that looks like:
This curve plots the True Positive Rate (the sensitivity metric) on the y-axis as a function of the False Positive Rate (1 − specificity) on the x-axis.
The AUC, or Area Under the Curve, is important in our final assessment of the model: the larger this area, the greater the predictive power of the model.
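A sketch of that assessment in R, assuming the pROC package is installed: knn() can return each prediction’s winning vote share via prob = TRUE, which serves as the score an ROC curve needs.

```r
library(pROC)  # provides roc() and auc()

# Re-run k-NN, asking for the proportion of neighbor votes won.
pred_prob <- knn(train_x, test_x, cl = train_y, k = k_start, prob = TRUE)
votes     <- attr(pred_prob, "prob")

# Convert the winning-class vote share into a score for the "yes" class.
score <- ifelse(pred_prob == "yes", votes, 1 - votes)

roc_obj <- roc(response = test_y, predictor = score)
plot(roc_obj)  # draws the ROC curve
auc(roc_obj)   # the larger the area, the greater the predictive power
```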
This post is meant as a 30,000-foot overview of classification as it is used to separate a categorical response variable into two groups. This information is abstract in written form, but it will become clearer in my next post, where I will run the code in RStudio.
Thank you so much for reading.