Wednesday, August 10, 2022

Clustering Analysis of Universities: Schools in Alabama, California, and Florida According to SAT Averages and Admission Rates

Thank you for checking out my post on Clustering last week. I wanted to expand on the method with an exercise clustering universities in the states of Alabama, California, and Florida. I hope you have fun with this one. I know I did!


Let’s begin by loading packages, setting our directory and loading our data set. We will be using the “colleges” data set created by the US Department of Education, and cleaned by Fred Nwanganga and Mike Chapple.


As always, if you wish to follow along, you can download the data set on the Google Drive by clicking on the "Data Sets" tab on the homepage. And like all of my other posts, the code that you can replicate is highlighted in BLUE.


library(tidyverse)

## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --

## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.4     v dplyr   1.0.7
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   2.0.1     v forcats 0.5.1

## Warning: package 'stringr' was built under R version 4.1.2

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

setwd("C:/Users/firstnamelastinitial/OneDrive/Documents/R")

college <- read_csv("college.csv", col_types = "nccfffffnnnnnnnnn")


Let’s take the time to do some fundamental exploratory data analysis:


dim(college)

## [1] 1270   17

str(college)

## spec_tbl_df [1,270 x 17] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ id                : num [1:1270] 102669 101648 100830 101879 100858 ...
##  $ name              : chr [1:1270] "Alaska Pacific University" "Marion Military Institute" "Auburn University at Montgomery" "University of North Alabama" ...
##  $ city              : chr [1:1270] "Anchorage" "Marion" "Montgomery" "Florence" ...
##  $ state             : Factor w/ 51 levels "AK","AL","AR",..: 1 2 2 2 2 2 2 2 2 2 ...
##  $ region            : Factor w/ 4 levels "West","South",..: 1 2 2 2 2 2 2 2 2 2 ...
##  $ highest_degree    : Factor w/ 4 levels "Graduate","Associate",..: 1 2 1 1 1 1 1 1 1 1 ...
##  $ control           : Factor w/ 2 levels "Private","Public": 1 2 2 2 2 2 2 1 2 2 ...
##  $ gender            : Factor w/ 3 levels "CoEd","Women",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ admission_rate    : num [1:1270] 0.421 0.614 0.802 0.679 0.835 ...
##  $ sat_avg           : num [1:1270] 1054 1055 1009 1029 1215 ...
##  $ undergrads        : num [1:1270] 275 433 4304 5485 20514 ...
##  $ tuition           : num [1:1270] 19610 8778 9080 7412 10200 ...
##  $ faculty_salary_avg: num [1:1270] 5804 5916 7255 7424 9487 ...
##  $ loan_default_rate : num [1:1270] 0.077 0.136 0.106 0.111 0.045 0.062 0.096 0.007 0.103 0.063 ...
##  $ median_debt       : num [1:1270] 23250 11500 21335 21500 21831 ...
##  $ lon               : num [1:1270] -149.9 -87.3 -86.3 -87.7 -85.5 ...
##  $ lat               : num [1:1270] 61.2 32.6 32.4 34.8 32.6 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   id = col_number(),
##   ..   name = col_character(),
##   ..   city = col_character(),
##   ..   state = col_factor(levels = NULL, ordered = FALSE, include_na = FALSE),
##   ..   region = col_factor(levels = NULL, ordered = FALSE, include_na = FALSE),
##   ..   highest_degree = col_factor(levels = NULL, ordered = FALSE, include_na = FALSE),
##   ..   control = col_factor(levels = NULL, ordered = FALSE, include_na = FALSE),
##   ..   gender = col_factor(levels = NULL, ordered = FALSE, include_na = FALSE),
##   ..   admission_rate = col_number(),
##   ..   sat_avg = col_number(),
##   ..   undergrads = col_number(),
##   ..   tuition = col_number(),
##   ..   faculty_salary_avg = col_number(),
##   ..   loan_default_rate = col_number(),
##   ..   median_debt = col_number(),
##   ..   lon = col_number(),
##   ..   lat = col_number()
##   .. )
##  - attr(*, "problems")=<externalptr>


colnames(college)

##  [1] "id"                 "name"               "city"             
##  [4] "state"              "region"             "highest_degree"   
##  [7] "control"            "gender"             "admission_rate"   
## [10] "sat_avg"            "undergrads"         "tuition"          
## [13] "faculty_salary_avg" "loan_default_rate"  "median_debt"      
## [16] "lon"                "lat"


There are a total of 1,270 rows and 17 columns. The str() output confirms the data type of each of the 17 variables, and colnames() lists their names.
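One extra check worth doing before clustering (my own addition, not part of the original walkthrough) is a quick scan for missing values, since kmeans() cannot handle NAs:

```r
# Count missing values per column; columns with NAs would need
# filtering or imputation before clustering
colSums(is.na(college))
```

If any of the columns we plan to cluster on show non-zero counts, drop or impute those rows first.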


This time, we will do our clustering analysis on the state of Alabama. So let’s create a new data set with ONLY schools from this state.


alabama_schools <- college %>%
  filter(state == "AL") %>%
  column_to_rownames(var = "name")


After we pass the state code “AL” to the filter() function, we use the column_to_rownames() function, which lets us see the name of each school for our observations. Let’s view the new data set:


View(alabama_schools)


We now have 24 observations, or schools, and 16 variables in our new data set with only schools from Alabama. Now, let’s use the select() function to specify which features we are interested in. 


For this case, we are interested in admission_rate and sat_avg. Sorry for making that decision for you!! 


alabama_schools %>%
  select(admission_rate, sat_avg) %>%
  summary()

##  admission_rate      sat_avg   
##  Min.   :0.4414   Min.   : 811 
##  1st Qu.:0.5309   1st Qu.: 969 
##  Median :0.5927   Median :1035 
##  Mean   :0.6523   Mean   :1033 
##  3rd Qu.:0.8064   3rd Qu.:1109 
##  Max.   :1.0000   Max.   :1219


Before moving on, note that we are using the pipe (%>%) operator, which allows us to compose cleaner code. Let’s move on:


As expected, the two variables have vastly different ranges. As a reminder, we MUST normalize our variables BEFORE building our model.


alabama_schools_scaled <- alabama_schools %>%
  select(admission_rate, sat_avg) %>%
  scale()


We create a new object, using our alabama_schools data set, that utilizes the scale() function to “normalize” our data to z-scores. This creates an optimal scenario to enable us to use a clustering model.
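As a sanity check (a sketch of my own, not part of the original walkthrough), scale() is just the familiar z-score computed column by column, so we can reproduce it by hand for one variable:

```r
# scale() standardizes each column: z = (x - mean(x)) / sd(x)
# Recomputing the z-score manually for admission_rate should match
# the corresponding column of the scaled matrix
x <- alabama_schools$admission_rate
manual_z <- (x - mean(x)) / sd(x)

all.equal(as.numeric(manual_z),
          as.numeric(alabama_schools_scaled[, "admission_rate"]))
```

If the two disagree, something went wrong in the scaling step.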


Let’s see the new values for each variable by piping the summary() function into our new data object.


alabama_schools_scaled %>%
  summary()

##  admission_rate       sat_avg       
##  Min.   :-1.3758   Min.   :-1.92418 
##  1st Qu.:-0.7922   1st Qu.:-0.55343 
##  Median :-0.3884   Median : 0.01916 
##  Mean   : 0.0000   Mean   : 0.00000 
##  3rd Qu.: 1.0051   3rd Qu.: 0.66332 
##  Max.   : 2.2685   Max.   : 1.61547


Our variables have been standardized to z-scores, and our data is officially normalized. Remember, by normalizing our data we don’t need to worry about extreme differences in scale between our variables. The kmeans() function lives in the stats package, which ships with base R, but we will attach it explicitly anyway.


Let’s set the seed so we can have reproducible results.


library(stats)

set.seed(1234)


Notice that we have set our seed so that the random cluster initializations are reproducible. Simply remember the number that you use in the argument so that you get the same results every time you run this algorithm.


Reminder:


The kmeans() function takes three arguments. Our scaled data set, which recall is alabama_schools_scaled, is the first argument. centers determines how many cluster centers we will have; this is essentially our k-value. We are going to set centers (our k-value) to 3, because we want three clusters. nstart will be set to 25, meaning kmeans() will try 25 random starting configurations and keep the best one. Let’s run it!


k_3 <- kmeans(alabama_schools_scaled, centers = 3, nstart = 25)
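As an aside, if you would rather not fix k = 3 up front, the factoextra package (which we load later for plotting) provides fviz_nbclust(), which plots the total within-cluster sum of squares across a range of k values; this "elbow method" check is an optional addition of mine, not part of the walkthrough above:

```r
library(factoextra)

# Elbow plot: total within-cluster sum of squares for k = 1..10;
# look for the "elbow" where adding clusters stops helping much
fviz_nbclust(alabama_schools_scaled, kmeans, method = "wss")
```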


Great! Let’s start to explore this cluster. Let’s see how many observations are in each of the three clusters.


k_3$size

## [1]  8  6 10


This output tells us that one cluster has 8 observations, another has 6 observations, and the third cluster has the remaining 10 observations.
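Beyond the sizes, we can also see which school landed in which cluster, since k_3$cluster is a named vector of assignments. Attaching it to the data (alabama_clustered is a name I made up for illustration) makes the groups easy to inspect:

```r
# Cluster assignment for each school (names come from the row names)
head(k_3$cluster)

# Attach the assignments to the two features for side-by-side inspection
alabama_clustered <- alabama_schools %>%
  select(admission_rate, sat_avg) %>%
  mutate(cluster = k_3$cluster)

head(alabama_clustered)
```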


We can also get a sense of the values of the three cluster centers by using the centers attribute.


k_3$centers

##   admission_rate    sat_avg
## 1     -0.6824578  0.5743977
## 2     -0.8282133 -1.2431427
## 3      1.0428942  0.2863675


Let’s break this down per cluster:


The first cluster center has a value of roughly -.68 for admission_rate and .57 for sat_avg.


The second cluster center has a value of roughly -.82 for admission_rate and -1.24 for sat_avg.


The third cluster center has a value of roughly 1.04 for admission_rate and approximately .28 for sat_avg.


THESE VALUES ARE NORMALIZED VALUES OF THE ORIGINAL DATASET!!!


With this in mind, we have lost some ability to interpret this part of the data. This is fine as our objective is to group these schools as accurately as possible.
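That said, the interpretability is not entirely lost: scale() stores each column’s mean and standard deviation as attributes, so the centers can be translated back into original units. This un-scaling step is my own addition, not part of the original walkthrough:

```r
# scale() records the centering and scaling values as attributes
ctr <- attr(alabama_schools_scaled, "scaled:center")
sds <- attr(alabama_schools_scaled, "scaled:scale")

# Reverse the z-score: multiply by the sd, then add back the mean,
# column by column, to recover raw admission rates and SAT averages
centers_raw <- sweep(k_3$centers, 2, sds, "*")
centers_raw <- sweep(centers_raw, 2, ctr, "+")
centers_raw
```

The resulting table shows each cluster center as an actual admission rate and SAT average.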


What good is clustering if we can’t visualize our model? Let’s load another package called factoextra.


library(factoextra)

## Warning: package 'factoextra' was built under R version 4.1.2

## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa


Next, we will pass three arguments into a very powerful function called fviz_cluster(). The first argument is the object where our model is stored, k_3. The second is the data set, which is the scaled data set, alabama_schools_scaled. The last, repel = TRUE, nudges overlapping labels apart so the plot stays readable.


fviz_cluster(k_3, data = alabama_schools_scaled, repel = TRUE)

## Warning: ggrepel: 1 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps


Alabama Schools Cluster


There is a lot to unpack here, but for brevity, I will describe the clusters as simply as possible.


Cluster 1 is labeled in red, and as you can see it contains the schools with the highest SAT averages and lowest admission rates. I would argue that the 8 schools in this cluster are the most elite schools in Alabama, as measured by the two metrics we have used.


Cluster 2 is labeled in green. Universities in this cluster have the lowest SAT averages and also relatively low admission rates.


Cluster 3 is labeled in blue. Schools in this cluster have high admission rates and above-average SAT scores; on the SAT dimension this cluster resembles cluster 1, though less extreme.


I suppose the clusters for any state will take a similar arrangement in two-dimensional space when they are built from the same two features, sat_avg and admission_rate. 


Let’s see how three clusters look for the states of California and Florida, just to test my hypothesis.


california_schools <- college %>%
  filter(state == "CA") %>%
  column_to_rownames(var = "name")



california_schools %>%
  select(admission_rate, sat_avg) %>%
  summary()

##  admission_rate      sat_avg     
##  Min.   :0.0509   Min.   : 871.0 
##  1st Qu.:0.4051   1st Qu.: 993.5 
##  Median :0.5800   Median :1086.0 
##  Mean   :0.5477   Mean   :1113.7 
##  3rd Qu.:0.7425   3rd Qu.:1222.0 
##  Max.   :0.8750   Max.   :1545.0

california_schools_scaled <- california_schools %>%
  select(admission_rate, sat_avg) %>%
  scale()



set.seed(1234)


k_3 <- kmeans(california_schools_scaled, centers = 3, nstart = 25)


fviz_cluster(k_3, data = california_schools_scaled, repel = TRUE)

## Warning: ggrepel: 58 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps


California Schools Cluster

Now for Florida:


florida_schools <- college %>%
  filter(state == "FL") %>%
  column_to_rownames(var = "name")


florida_schools %>%
  select(admission_rate, sat_avg) %>%
  summary()

##  admission_rate      sat_avg   
##  Min.   :0.2286   Min.   : 803 
##  1st Qu.:0.4481   1st Qu.: 948 
##  Median :0.5188   Median :1065 
##  Mean   :0.5436   Mean   :1057 
##  3rd Qu.:0.6157   3rd Qu.:1153 
##  Max.   :1.0000   Max.   :1330

florida_schools_scaled <- florida_schools %>%
  select(admission_rate, sat_avg) %>%
  scale()



set.seed(1234)


k_3 <- kmeans(florida_schools_scaled, centers = 3, nstart = 25)


fviz_cluster(k_3, data = florida_schools_scaled, repel = TRUE)

## Warning: ggrepel: 19 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps


Florida Schools Cluster


One observation I would like to make is that the clusters are labeled differently from graphic to graphic. For example, Alabama’s elite schools are labeled red, California’s elite schools are labeled green, and Florida’s top schools are labeled blue.


Yet, regardless of the color a cluster is assigned, we can deduce that the elite-level schools are positioned in the upper-left portion of each plot: low admission rates and high SAT averages.
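Since we ran the same pipeline three times, the per-state steps could be wrapped into one helper function; cluster_state() is a name I made up for illustration, and it simply chains together the steps we used above:

```r
# Hypothetical helper: filter one state, scale the two features,
# run k-means, and return the fviz_cluster() plot
cluster_state <- function(data, state_code, k = 3, seed = 1234) {
  scaled <- data %>%
    filter(state == state_code) %>%
    column_to_rownames(var = "name") %>%
    select(admission_rate, sat_avg) %>%
    scale()
  set.seed(seed)
  km <- kmeans(scaled, centers = k, nstart = 25)
  fviz_cluster(km, data = scaled, repel = TRUE)
}

# One call per state reproduces each plot above
cluster_state(college, "FL")
```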



THANK YOU!


Thank you so much for reading this different take on Clustering. This machine learning method is used widely in a Marketing context, namely when managers are looking to categorize, or "segment," groups of customers.


I am proud of all of you! I know that the steps that you are taking towards a safer, healthier life, are tedious, and at times cumbersome. 


But you know what? 


Your dedication to ridding yourselves of a toxic profession will pay off. You will live happier lives, filled with newfound love from family and friends, and even acquaintances who are drawn to your hard work and professional ethic. 


Thank you for allowing me to be part of your journey to becoming data analysts and data entrepreneurs.


Location: Chicago, IL, USA