Thank you for checking out my post on Clustering earlier last week. I wanted to expand on the method with an exercise on clustering universities in the states of Alabama, California, and Florida. I hope you have fun with this one. I know I did!
Let’s begin by loading packages, setting our working directory, and loading our data set. We will be using the “college” data set, created by the US Department of Education and cleaned by Fred Nwanganga and Mike Chapple. As always, if you wish to follow along, you can download the data set from the Google Drive by clicking on the "Data Sets" tab on the homepage. And like all of my other posts, the code that you can replicate is highlighted in BLUE.
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.4     v dplyr   1.0.7
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   2.0.1     v forcats 0.5.1
## Warning: package 'stringr' was built under R version 4.1.2
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
setwd("C:/Users/firstnamelastinitial/OneDrive/Documents/R")
college <- read_csv("college.csv", col_types = "nccfffffnnnnnnnnn")
Let’s take the time to do some fundamental exploratory
data analysis:
dim(college)
## [1] 1270   17
str(college)
## spec_tbl_df [1,270 x 17] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ id                : num [1:1270] 102669 101648 100830 101879 100858 ...
##  $ name              : chr [1:1270] "Alaska Pacific University" "Marion Military Institute" "Auburn University at Montgomery" "University of North Alabama" ...
##  $ city              : chr [1:1270] "Anchorage" "Marion" "Montgomery" "Florence" ...
##  $ state             : Factor w/ 51 levels "AK","AL","AR",..: 1 2 2 2 2 2 2 2 2 2 ...
##  $ region            : Factor w/ 4 levels "West","South",..: 1 2 2 2 2 2 2 2 2 2 ...
##  $ highest_degree    : Factor w/ 4 levels "Graduate","Associate",..: 1 2 1 1 1 1 1 1 1 1 ...
##  $ control           : Factor w/ 2 levels "Private","Public": 1 2 2 2 2 2 2 1 2 2 ...
##  $ gender            : Factor w/ 3 levels "CoEd","Women",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ admission_rate    : num [1:1270] 0.421 0.614 0.802 0.679 0.835 ...
##  $ sat_avg           : num [1:1270] 1054 1055 1009 1029 1215 ...
##  $ undergrads        : num [1:1270] 275 433 4304 5485 20514 ...
##  $ tuition           : num [1:1270] 19610 8778 9080 7412 10200 ...
##  $ faculty_salary_avg: num [1:1270] 5804 5916 7255 7424 9487 ...
##  $ loan_default_rate : num [1:1270] 0.077 0.136 0.106 0.111 0.045 0.062 0.096 0.007 0.103 0.063 ...
##  $ median_debt       : num [1:1270] 23250 11500 21335 21500 21831 ...
##  $ lon               : num [1:1270] -149.9 -87.3 -86.3 -87.7 -85.5 ...
##  $ lat               : num [1:1270] 61.2 32.6 32.4 34.8 32.6 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   id = col_number(),
##   ..   name = col_character(),
##   ..   city = col_character(),
##   ..   state = col_factor(levels = NULL, ordered = FALSE, include_na = FALSE),
##   ..   region = col_factor(levels = NULL, ordered = FALSE, include_na = FALSE),
##   ..   highest_degree = col_factor(levels = NULL, ordered = FALSE, include_na = FALSE),
##   ..   control = col_factor(levels = NULL, ordered = FALSE, include_na = FALSE),
##   ..   gender = col_factor(levels = NULL, ordered = FALSE, include_na = FALSE),
##   ..   admission_rate = col_number(),
##   ..   sat_avg = col_number(),
##   ..   undergrads = col_number(),
##   ..   tuition = col_number(),
##   ..   faculty_salary_avg = col_number(),
##   ..   loan_default_rate = col_number(),
##   ..   median_debt = col_number(),
##   ..   lon = col_number(),
##   ..   lat = col_number()
##   .. )
##  - attr(*, "problems")=<externalptr>
colnames(college)
##  [1] "id"                 "name"               "city"
##  [4] "state"              "region"             "highest_degree"
##  [7] "control"            "gender"             "admission_rate"
## [10] "sat_avg"            "undergrads"         "tuition"
## [13] "faculty_salary_avg" "loan_default_rate"  "median_debt"
## [16] "lon"                "lat"
There are a total of 1,270 rows and 17 columns. The str() output above confirms the data type of each of our 17 variables, and colnames() lists their names.
This time, we will do our clustering analysis on the state
of Alabama. So let’s create a new data set with ONLY schools from this state.
alabama_schools <- college %>%
filter(state == "AL") %>%
column_to_rownames(var = "name")
After we pass the state code “AL” to the filter() function, we use the column_to_rownames() function, which lets us keep the name of each school as the row name for our observations. Let’s view the new data set:
View(alabama_schools)
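View() opens the data in a spreadsheet-style tab. If you would rather confirm the dimensions right in the console, the dim() call we used earlier works here as well (a small aside on my part):
dim(alabama_schools)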
We now have 24 observations, or schools, and 16 variables in our new data set with only schools from Alabama. Now, let’s use the select() function to specify which features we are interested in.
For this case, we are interested in admission_rate and sat_avg. Sorry for making that decision for you!!
alabama_schools %>%
select(admission_rate, sat_avg) %>%
summary()
##  admission_rate      sat_avg
##  Min.   :0.4414   Min.   : 811
##  1st Qu.:0.5309   1st Qu.: 969
##  Median :0.5927   Median :1035
##  Mean   :0.6523   Mean   :1033
##  3rd Qu.:0.8064   3rd Qu.:1109
##  Max.   :1.0000   Max.   :1219
Before moving on, I would like to point out that we are using the pipe (%>%) operator, which lets us compose cleaner code by reading the steps from top to bottom.
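To make that concrete, the piped summary above could also be written as one nested call; the output is identical, it is just harder to read from the inside out:
summary(select(alabama_schools, admission_rate, sat_avg))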
As expected, the two variables sit on vastly different scales. As a reminder, we MUST normalize our variables BEFORE building our model.
alabama_schools_scaled <-
alabama_schools %>%
select(admission_rate, sat_avg) %>%
scale()
We create a new object, using our alabama_schools data set, that utilizes the scale() function to “normalize” our data to z-scores. Because k-means groups observations by distance, putting both variables on the same scale is exactly what our clustering model needs.
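If you want to convince yourself of what scale() is doing, here is a quick sketch (my own aside) that computes the same z-scores by hand for a single column; the result should match the admission_rate column of alabama_schools_scaled:
(alabama_schools$admission_rate - mean(alabama_schools$admission_rate)) /
  sd(alabama_schools$admission_rate)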
Let’s see the new values for each variable by piping the
summary() function into our new data object.
alabama_schools_scaled %>%
summary()
##  admission_rate       sat_avg
##  Min.   :-1.3758   Min.   :-1.92418
##  1st Qu.:-0.7922   1st Qu.:-0.55343
##  Median :-0.3884   Median : 0.01916
##  Mean   : 0.0000   Mean   : 0.00000
##  3rd Qu.: 1.0051   3rd Qu.: 0.66332
##  Max.   : 2.2685   Max.   : 1.61547
Our variables have been standardized to z-scores and our data is officially normalized. Remember, by normalizing our data we don’t need to worry about extreme differences in scale between our variables. Let’s load the stats package (it ships with base R, so this call is really just being explicit) so we can run the algorithm.
Let’s set the seed so we can have reproducible results.
library(stats)
set.seed(1234)
Notice that we have set our seed. This does not change the data; it pins down the random starting points so that the results are reproducible. Simply remember the number that you use in the argument so that you can reproduce the results every time you run this algorithm.
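If you have never worked with seeds before, here is a tiny illustration (my own aside, unrelated to the clustering itself): resetting the same seed makes a random draw repeat exactly.
set.seed(1234)
sample(1:10, 3)
set.seed(1234)
sample(1:10, 3) # identical to the first draw
set.seed(1234) # reset once more so the clustering below starts from the same state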
Reminder:
The kmeans() function takes three arguments here. Our scaled data set, which recall is alabama_schools_scaled, is the first argument. centers determines how many cluster centers we will have; this is essentially our k-value, and we are setting it to 3 because we want three clusters. nstart will be set to 25, which is the number of random starting configurations the algorithm will try before keeping the best one. Let’s run it!
k_3 <- kmeans(alabama_schools_scaled, centers = 3, nstart = 25)
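As an aside (my own addition, not part of the original walkthrough), nstart = 25 means kmeans() is run from 25 random starting configurations and the solution with the lowest total within-cluster sum of squares is kept. You can peek at that quantity and compare it against a single random start:
set.seed(1234)
kmeans(alabama_schools_scaled, centers = 3, nstart = 1)$tot.withinss # one random start
k_3$tot.withinss # best of 25 random starts (on only 24 schools the two may well agree)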
Great! Let’s start to explore the clusters. Let’s see how many observations are in each of the three clusters.
k_3$size
## [1] 8 6 10
This output tells us that one cluster has 8
observations, another has 6 observations, and the third cluster has the remaining
10 observations.
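If you are curious which specific schools ended up together, the cluster element of the model holds an assignment for every row, so splitting the row names by it gives a quick roster per cluster (a small aside on my part):
split(rownames(alabama_schools_scaled), k_3$cluster)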
We can also get a sense of the values of the three cluster
centers by using the centers attribute.
k_3$centers
##   admission_rate    sat_avg
## 1     -0.6824578  0.5743977
## 2     -0.8282133 -1.2431427
## 3      1.0428942  0.2863675
Let’s break this down per cluster:
The first cluster center has a value of roughly -.68 for
admission_rate and .57 for sat_avg.
The second cluster center has a value of roughly -.82 for
admission_rate and -1.24 for sat_avg.
The third cluster center has a value of roughly 1.04 for
admission_rate and approximately .28 for sat_avg.
THESE VALUES ARE NORMALIZED VALUES OF THE ORIGINAL
DATASET!!!
With this in mind, we have lost some ability to interpret
this part of the data. This is fine as our objective is to group these schools
as accurately as possible.
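That said, if you ever want the centers back in the original units (admission rate as a proportion, SAT points), scale() stores the column means and standard deviations it used as attributes, and you can reverse the transformation. A minimal sketch, assuming the default centering and scaling we applied above:
t(t(k_3$centers) * attr(alabama_schools_scaled, "scaled:scale") +
    attr(alabama_schools_scaled, "scaled:center"))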
What good is clustering if we can’t have a visualization of our model? Let’s load another package called factoextra.
library(factoextra)
## Warning: package 'factoextra' was built under R version 4.1.2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
Next, we will pass three arguments into a very powerful function called fviz_cluster(). The first argument is the object where our model is stored, k_3. The second is the data set, which is the scaled data set, alabama_schools_scaled. The last element, repel = TRUE, helps keep the point labels from overlapping.
fviz_cluster(k_3, data = alabama_schools_scaled, repel = TRUE)
## Warning: ggrepel: 1 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
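That warning is just ggrepel, the labeling engine behind repel = TRUE, telling us it dropped one label to avoid overlaps. If you want every school labeled, one option suggested by the warning is to raise ggrepel’s max.overlaps limit before plotting (an aside on my part):
options(ggrepel.max.overlaps = Inf)
fviz_cluster(k_3, data = alabama_schools_scaled, repel = TRUE)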
There is a lot to unpack here, but for brevity, I will
describe the clusters as simply as possible.
Cluster 1 is labeled in red, and as you can notice, it contains schools that have the highest SAT averages and low admission rates. I would argue that the 8 schools in this cluster are the most elite schools in Alabama, as measured by the two features we have used.
Cluster 2 is labeled in green. Universities in this cluster have the lowest SAT averages and also have low admission rates.
Cluster 3 is labeled in blue. Schools in this cluster have high admission rates and SAT averages a bit above the state average, so with regard to SAT scores this cluster sits closer to cluster 1 than to cluster 2.
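By the way, I picked k = 3 up front, but if you would like a data-driven sanity check on the number of clusters, factoextra (which we already loaded) ships a helper for the classic elbow plot. A quick sketch, purely as an optional extra:
fviz_nbclust(alabama_schools_scaled, kmeans, method = "wss")
You then look for the bend in the curve, the point where adding another cluster stops buying much reduction in within-cluster variation.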
Stepping back, I suspect the clusters for any state will have a similar arrangement in two-dimensional space as long as they are built from the same two features, sat_avg and admission_rate.
Let’s see how the three clusters look for the states of California and Florida, just to test out my hypothesis.
california_schools <- college %>%
filter(state == "CA") %>%
column_to_rownames(var = "name")
california_schools %>%
select(admission_rate, sat_avg) %>%
summary()
##  admission_rate      sat_avg
##  Min.   :0.0509   Min.   : 871.0
##  1st Qu.:0.4051   1st Qu.: 993.5
##  Median :0.5800   Median :1086.0
##  Mean   :0.5477   Mean   :1113.7
##  3rd Qu.:0.7425   3rd Qu.:1222.0
##  Max.   :0.8750   Max.   :1545.0
california_schools_scaled <-
california_schools %>%
select(admission_rate, sat_avg) %>%
scale()
set.seed(1234)
k_3 <- kmeans(california_schools_scaled, centers = 3, nstart = 25)
fviz_cluster(k_3, data = california_schools_scaled, repel = TRUE)
## Warning: ggrepel: 58 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
Now for Florida:
florida_schools <- college %>%
filter(state == "FL") %>%
column_to_rownames(var = "name")
florida_schools %>%
select(admission_rate, sat_avg) %>%
summary()
##  admission_rate      sat_avg
##  Min.   :0.2286   Min.   : 803
##  1st Qu.:0.4481   1st Qu.: 948
##  Median :0.5188   Median :1065
##  Mean   :0.5436   Mean   :1057
##  3rd Qu.:0.6157   3rd Qu.:1153
##  Max.   :1.0000   Max.   :1330
florida_schools_scaled <-
florida_schools %>%
select(admission_rate, sat_avg) %>%
scale()
set.seed(1234)
k_3 <- kmeans(florida_schools_scaled, centers = 3, nstart = 25)
fviz_cluster(k_3, data = florida_schools_scaled, repel = TRUE)
## Warning: ggrepel: 19 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
One observation I would like to make is that the clusters are labeled differently from graphic to graphic. For example, Alabama’s elite schools are labeled red, California’s elite schools are labeled green, and Florida’s top schools are labeled blue.
Yet, regardless of how a cluster happens to be colored, we can deduce that the elite-level schools are positioned in the upper-left portion of the plot.
THANK YOU!
Thank you so much for reading this different take on Clustering. This machine learning method is used widely in a Marketing context, namely when managers are looking to categorize, or "segment," groups of customers.
I am proud of all of you! I know that the steps you are taking towards a safer, healthier life are tedious, and at times cumbersome.
But you know what?
Your dedication to ridding yourselves of a toxic profession will pay off. You will live happier lives, filled with newfound love from family and friends, and even acquaintances who are drawn to your hard work and professional ethic.
Thank you for allowing me to be part of your journey to becoming data analysts, and data entrepreneurs.