Wednesday, August 3, 2022

Covid Analysis for 2020: A Basic Clustering Method to Inform Leaders... Porn Stars Can Inform Leaders Too!

 *** Introduction:

 

Hello ladies, thank you so much for coming to this page, and for continuing to be so supportive. Today's topic is an UNSUPERVISED MACHINE LEARNING method called CLUSTERING.


This method detects patterns in the data that a normal person could not see on their own. Well, maybe Wonder Woman could, but she's not normal!


This method is useful for Marketing departments in organizations as they can use Clustering to see if there are any segments, or groups, inside their data that they can target in upcoming sales campaigns. 


Clustering is also helpful for other use cases, like in the Healthcare Industry, for instance. Today, we are going to go over COVID-19 data from two years ago, and group states into Mild, Moderate, and High Priority categories in terms of vaccine supplies.


ALERT: I don't mean for this post to be political at all. Covid has been a touchy subject for everyone on the Planet, and especially for the United States. The subject of covid, and its dangers, has been politicized by both political parties, conservatives and progressives. 


The reason why I chose Covid for this Clustering analysis is that the data is easy to obtain. Further, you will be able to see how each state, or observation, is grouped by its number of cases and deaths, from low to high. 


I end this post with a sample letter to then-President Donald J. Trump, urging him to allocate Covid resources effectively, as not all states are suffering equally from this deadly disease. ANYONE, especially Sex Workers, can and should have the inspiration to speak up when they see a problem! In fact, it is our civic duty to do just that! Having a voice should not be reserved for the privileged folk!


Again, this post is not meant to be political. If you love former President Trump, it is irrelevant! If you hate former President Trump, it is irrelevant! What is important is that this post will give you another tool for tackling Machine Learning and Data Analytics.


With that, I hope you enjoy this one. I had so much fun putting it together. I hope it helps all of you on your journey!
 

*** Idea in Brief:

 

I will be analyzing two cleaned data sets related to COVID cases and deaths, and the population of each US state. I will take the joined data set on a journey through the Modified Analytics Process (MAP), a process that was introduced to me by one of my professors at the Kellstadt Graduate School of Business. The 5 stages are typically used sequentially, though at times out of order. They can be summarized as follows:

 

1. Plan: “Define Goals”

2. Collect: “Lock, Load and Clean”

3. Explore: “Learn Your Data”

4. Exploit: “Use Statistical Approaches for Problem Solving”

5. Report: “Persuade Management”

 

So then, I will begin this project with the PLAN phase of MAP.

 

*** PLAN: “Define Goals”

 

This part of the analysis is qualitative and might be iterated upon as the process goes on. As I learn more about the data set, its constraints, and how those constraints are related to my goals, I might need to modify my objective(s).

 

Initially, I would like to see if this data is compatible with techniques learned in class, specifically clustering. This is my main objective. Along with fitting this algorithm to the data, I will focus on using as many relevant functions as I can. Visualizations and other Exploratory Data Analysis (EDA) tools will be a vital component of the overarching analysis.

 

That is all I wish to contribute to the PLAN phase for now. Allow me to move on to the next phase.

 

*** COLLECT: “Lock, Load, and Clean”

 

I will now load the necessary libraries into our environment. Additionally, I will set my working directory and create objects to store the data sets of interest; I will call them “covid” and “population.”


If you wish to follow along, I will post the data sets on the Google Drive that you can access from the home page menu. Click on "Data Sets" and you will be brought to the page.


Lastly, the code that you need to type is highlighted in blue. Please remember to press CTRL and ENTER after typing the code highlighted in blue. 


library(tidyverse)

## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --

## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.4     v dplyr   1.0.7
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   2.0.1     v forcats 0.5.1

## Warning: package 'stringr' was built under R version 4.1.2

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

setwd("C:/Users/firstnamelastinitial/OneDrive/Documents/R")

covid <- read.csv("covid.csv", header = TRUE)
population <- read.csv("population.csv", header = TRUE)

 

In the Idea in Brief, I mentioned that I would be doing a “join” on the covid data and the population data. Normally, I feel that one would do this near the end of the analysis, after exploring the data more thoroughly. However, I already have an idea of what the data will look like graphically due to prior exposure to these sets, so I am going to create an object that combines certain features of both sets right now. I will call my object “scale,” and it will be referenced later in my analysis.

 

I want to treat the “scale” data frame differently than the data frames labeled “covid” and “population.” My intention is to create more columns that better represent the information about the states present in the original data.

 

scale <- covid %>% inner_join(population, by = "state")
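
An inner join keeps only the rows that appear in both tables, so any localities present in one file but not the other will be dropped silently. A quick sanity check (a minimal sketch, assuming both data frames are loaded as above) is dplyr's anti_join(), which lists the covid rows that found no match in population:

anti_join(covid, population, by = "state")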

 

Since my data has been cleaned prior to loading, I can move on to the next phase of MAP, where I will learn more about my data to determine if I can run some, if not all, of the statistical models listed in the plan phase.

 

*** EXPLORE: “Learn Your Data”

 

I will use some basic exploration functions to do a preliminary analysis of the data set and determine which direction to take. For the sake of brevity in this report, I will explore my two initial data sets in turn.

 

dim(covid)

## [1] 55  4

str(covid)

## 'data.frame':    55 obs. of  4 variables:
##  $ ï..date: chr  "10/20/20" "10/20/20" "10/20/20" "10/20/20" ...
##  $ state  : chr  "Alabama" "Alaska" "Arizona" "Arkansas" ...
##  $ cases  : int  174528 12349 232939 100441 887093 87966 64455 23325 16445 760381 ...
##  $ deaths : int  2805 63 5837 1728 17067 2212 4559 668 642 16104 ...

colnames(covid)

## [1] "ï..date" "state"   "cases"   "deaths"

head(covid)

##    ï..date      state  cases deaths
## 1 10/20/20    Alabama 174528   2805
## 2 10/20/20     Alaska  12349     63
## 3 10/20/20    Arizona 232939   5837
## 4 10/20/20   Arkansas 100441   1728
## 5 10/20/20 California 887093  17067
## 6 10/20/20   Colorado  87966   2212

 

We can see that the covid data frame is 55 observations (rows) by 4 variables (columns); these are likely the 50 states plus the District of Columbia and several territories. The four columns are date and state, which are both character types (non-numeric), and cases and deaths, which are both integer types (numeric). 
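
A quick aside: the first column prints as “ï..date” because the CSV file was saved with a UTF-8 byte-order mark (BOM), which read.csv() folds into the column name. Here is a hedged sketch of two common fixes, either declaring the encoding when reading or renaming the column after loading; I will simply leave the name as-is, since I never use the date column in this analysis.

covid <- read.csv("covid.csv", header = TRUE, fileEncoding = "UTF-8-BOM")

colnames(covid)[1] <- "date"   # or: rename the mangled column directly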


These column names are also the names of our variables: date, state, cases, and deaths. The head() function shows us the first 6 observations, which are the states of Alabama, Alaska, Arizona, Arkansas, California, and Colorado. Let’s do the same exploration of the population data frame.

 

dim(population)

## [1] 52  2

str(population)

## 'data.frame':    52 obs. of  2 variables:
##  $ state     : chr  "Alabama" "Alaska" "Arizona" "Arkansas" ...
##  $ population: int  4903185 731545 7278717 3017804 39512223 5758736 3565287 973764 705749 21477737 ...

colnames(population)

## [1] "state"      "population"

head(population)

##        state population
## 1    Alabama    4903185
## 2     Alaska     731545
## 3    Arizona    7278717
## 4   Arkansas    3017804
## 5 California   39512223
## 6   Colorado    5758736

 

The population data frame is composed of 52 observations (rows) and 2 variables (columns). This data frame is much smaller: it has only two columns, state and population, which are of character and integer data types respectively. The head() function tells us the population of Alabama, Alaska, Arizona, Arkansas, California, and Colorado.

 

Now that I have some idea of what the data means, I am going to circle back to my objectives in the PLAN phase.


First, I would like to see if I can run a clustering algorithm on the data frame entitled “covid.” Then, as I already have an idea of the output, I will use the data frame “scale” to get a more accurate representation of how certain variables relate to the states.

 

I will take a K-means clustering approach, and I will set k equal to 4 (the number of clusters I wish to see) when I get the opportunity. I am going to set my seed so that I can reproduce my models exactly when needed.

 

set.seed(1234)

 

*** EXPLOIT: “Use Statistical Approaches for Problem Solving”

 

Now that my seed is set, I will build the model using the kmeans() function. I will pass the “covid” data frame, indexed to all rows of the two columns cases and deaths. I will ask for 4 clusters, and nstart = 20 tells kmeans() to try 20 random sets of starting centroids and keep the best result.

 

# 4 clusters; nstart = 20 random starting centroid sets, best result kept
km_covid <- kmeans(covid[, c("cases","deaths")], 4, nstart = 20)

 

I can now interpret the output of the model that has been stored in the object called “km_covid.”

 

km_covid

## K-means clustering with 4 clusters of sizes 17, 3, 32, 3
##
## Cluster means:
##       cases     deaths
## 1 181126.29  5309.8235
## 2 391839.00 16684.6667
## 3  47952.19   932.8438
## 4 842297.67 16938.3333
##
## Clustering vector:
##  [1] 1 3 1 3 4 3 3 3 3 4 2 3 3 3 2 1 3 3 3 1 3 1 1 1 1 3 1 3 3 3 3 1 3 2 1 3 3 1
## [39] 3 3 1 3 3 1 3 1 4 3 3 3 1 3 3 1 3
##
## Within cluster sum of squares by cluster:
## [1] 19901981416 15607035773 45933578173 10096166187
##  (between_SS / total_SS =  95.5 %)
##
## Available components:
##
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

 

Cluster means for cases and deaths are listed for 4 total clusters. Note also the line “between_SS / total_SS = 95.5 %”: roughly speaking, the cluster assignments account for 95.5% of the total variance in cases and deaths, which suggests the clusters are well separated.

 

km_covid$cluster

##  [1] 1 3 1 3 4 3 3 3 3 4 2 3 3 3 2 1 3 3 3 1 3 1 1 1 1 3 1 3 3 3 3 1 3 2 1 3 3 1
## [39] 3 3 1 3 3 1 3 1 4 3 3 3 1 3 3 1 3

 

This output gives me the cluster assigned to each observation. For instance, the first observation in the “covid” data frame, Alabama, is grouped into cluster 1. The second observation, Alaska, is assigned to cluster 3. One can continue matching observations to cluster numbers this way for all the states.
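
Rather than eyeballing the positions, here is a small sketch (using the objects already in our environment) that pairs each state with its assigned cluster:

state_clusters <- data.frame(state = covid$state,
                             cluster = km_covid$cluster)
head(state_clusters)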

 

table(km_covid$cluster)

##
##  1  2  3  4
## 17  3 32  3

 

This table shows us how many localities were assigned to each cluster; the assignments are based on similarities between the observations with regard to cases and deaths. Cluster 1 has 17 localities, Cluster 2 has 3, Cluster 3 has 32, and Cluster 4 has 3.

 

km_covid$centers

##       cases     deaths
## 1 181126.29  5309.8235
## 2 391839.00 16684.6667
## 3  47952.19   932.8438
## 4 842297.67 16938.3333

 

This output shows us the coordinates of each cluster's centroid in our model. I will now visualize the data, but first, I need to use the as.factor() function so that ggplot treats the cluster labels as discrete categories, which makes the model easier to graph.

 

km_covid$cluster <- as.factor(km_covid$cluster)

 

I will use ggplot to graph our clusters, and give the plot a title as well.

 

ggplot(covid, aes(cases,deaths, color = km_covid$cluster)) +
  geom_point() +
  geom_text(aes(label=state)) +
  ggtitle("Clusters of US State based on number of covid cases and deaths") +
  labs(colour = "Clusters")


Cluster of Covid Cases and Deaths
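
One practical note: with 55 labeled points, geom_text() labels often overlap. If that happens on your machine, the ggrepel package offers geom_text_repel() as a drop-in replacement (an assumption here: you would need to install ggrepel separately, as it is not part of the tidyverse):

library(ggrepel)   # assumes install.packages("ggrepel") has been run

ggplot(covid, aes(cases, deaths, color = km_covid$cluster)) +
  geom_point() +
  geom_text_repel(aes(label = state)) +   # nudges labels away from each other
  ggtitle("Clusters of US State based on number of covid cases and deaths") +
  labs(colour = "Clusters")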



This is the graphical output of our model, and I feel that something is off. I think that New York should have its own cluster, ideally, since it is not at all similar to the other states in its cluster, Illinois and Georgia. Since the starting centroids are chosen at random in my 4-cluster model, there is no way for me to control where the final centroids end up. 


I am going to update our model with 5 clusters to see if I can get a more sensible cluster model. Further, I will use 30 random starts instead of 20. If this fails, I will choose a k of 3. My fingers are crossed!
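
(An aside before I re-run the model: rather than guessing, a common heuristic for picking k, one I won't lean on here, is the elbow method. You fit kmeans() across a range of k values and look for the bend, or “elbow,” in the total within-cluster sum of squares. A minimal sketch:)

wss <- sapply(1:8, function(k) {
  kmeans(covid[, c("cases", "deaths")], k, nstart = 20)$tot.withinss
})
plot(1:8, wss, type = "b",
     xlab = "Number of clusters (k)",
     ylab = "Total within-cluster sum of squares",
     main = "Elbow plot")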

 

km_covid2 <- kmeans(covid[, c("cases","deaths")], 5, nstart = 30)
km_covid2$cluster <- as.factor(km_covid2$cluster)

 

Let’s see the graphical representation of the new model called km_covid2.

 

ggplot(covid, aes(cases,deaths, color = km_covid2$cluster)) +
  geom_point() +
  geom_text(aes(label=state)) +
  ggtitle("Clusters of US State based on number of covid cases and deaths") +
  labs(colour = "Clusters")


Second Cluster Model of Covid Cases VS. Deaths




In trying to pick a cluster number that would assign New York to its own cluster, I made the cluster map more complex. This is not ideal! I will now try a k of 3 in our model to see if I can rectify my error. Below is what my newer model will look like:

 

km_covid3 <- kmeans(covid[, c("cases","deaths")], 3, nstart = 30)
km_covid3$cluster <- as.factor(km_covid3$cluster)

 

Let’s visualize the newer model:

 

ggplot(covid, aes(cases,deaths, color = km_covid3$cluster)) +
  geom_point() +
  geom_text(aes(label=state)) +
  ggtitle("Clusters of US State based on number of covid cases and deaths") +
  labs(colour = "Clusters")


Cluster Model 3 of Covid Cases VS. Deaths



 

I suppose this is the best I can do for this data frame. Ideally, I would prefer to have New York in its own cluster, as I don’t think it is similar to the other observations in its cluster, which is Cluster 3 in this case. It is clearly an outlier! But there is an issue that I haven’t addressed yet. Here is the issue:


If we look at the states in the green cluster (cluster 2), we notice, based on prior knowledge, that these states have enormous populations. Florida, Texas, and California are three of the most populated states in the country, so how can we compare all observations appropriately knowing that this model is somewhat biased?

 

I can address this issue by scaling the data. I have already created a data frame, “scale,” that begins to address the conflict. What if I could create new variables of interest that would scale the data? 


I might create two columns: cases per 100,000 residents and deaths per 100,000 residents. This would make more sense, and we might even begin to see different clusters based on proportion rather than raw deaths and cases. So, let’s begin!

 

final <- scale %>%
  mutate(cases100k = (cases/population)*100000, deaths100k = (deaths/population)*100000)
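
(A side note: dividing by population is one way to put states on a common footing. Another common approach is z-score standardization with base R's scale() function. Because my data frame is also named “scale,” a more distinct name would have been safer, although R will still find the base function when scale is used as a call. A hedged sketch of the alternative, which I won't use here:)

zscored <- scale(covid[, c("cases", "deaths")])   # center and scale each column
km_z <- kmeans(zscored, 3, nstart = 30)           # cluster the standardized values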

 

What does the top of this data frame look like?

 

head(final)

##    ï..date      state  cases deaths population cases100k deaths100k
## 1 10/20/20    Alabama 174528   2805    4903185  3559.482   57.20771
## 2 10/20/20     Alaska  12349     63     731545  1688.071    8.61191
## 3 10/20/20    Arizona 232939   5837    7278717  3200.276   80.19270
## 4 10/20/20   Arkansas 100441   1728    3017804  3328.281   57.26018
## 5 10/20/20 California 887093  17067   39512223  2245.110   43.19423
## 6 10/20/20   Colorado  87966   2212    5758736  1527.523   38.41121

 

Now I can see the two new columns, “cases100k” and “deaths100k,” in the “final” data frame. Let me set the seed once more.

 

set.seed(1234)

 

I can create my final model, called “km_final.” I will set k equal to 3 and use 30 random starts before settling on the final model. I will index our “final” data frame to include all rows and the two columns “cases100k” and “deaths100k.” Let me run it:


km_final <- kmeans(final[, c("cases100k","deaths100k")], 3, nstart = 30)

 

I will use the as.factor() function to make the necessary adjustments before I visualize the new model.

 

km_final$cluster <- as.factor(km_final$cluster)

 

… and the visualization, with labels changed, will look like such:

 

ggfinal <- ggplot(final, aes(cases100k,deaths100k, color = km_final$cluster)) +
  geom_point() +
  geom_text(aes(label=state)) +
  ggtitle("Clusters of US State based on number of covid cases and deaths per 100k") +
  labs(colour = "Clusters")
ggfinal


Cluster Model 4 Covid Cases VS. Deaths


You know what?! Fine! I will submit! I surrender! Setting the model to 3 clusters looks good, although I still feel that it is not perfect. I will leave it the way it is. Let me interpret this final model in my own words.

 

The first thing I would like to take notice of is that this new model has shifted some of the states in terms of proportionality rather than total deaths and cases alone. In our first model, Florida, Texas, and California were positioned all the wayyyyyyyyy on the right side of the plot. But this is only because those three states are the largest by population. In our new scaled model, we have a more accurate representation of covid impact: cases and deaths PER 100,000 residents.

 

This is a more meaningful metric, as state governments can get a better idea of how to implement safety guidelines to reduce covid cases and deaths relative to their populations. The severity of Covid-19 is widespread, BUT it is DIFFERENT from state to state. In our new model, Florida is still at a critical level, but so is North Dakota, which is a much smaller state with regard to its population.

 

Notice how Pennsylvania, a heavily populated state, has a different status than Florida, another heavily populated state. These two states give a bigger-picture overview: Covid safety measures should be SPECIFIC to each state’s population. Meaning, Pennsylvania might, and should, have a different strategy of case and death reduction initiatives than Florida.
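
If you want to verify these comparisons yourself, here is a quick sketch that pulls the per-100k figures for the three states just mentioned:

final %>%
  filter(state %in% c("Florida", "North Dakota", "Pennsylvania")) %>%
  select(state, cases100k, deaths100k)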

 

*** REPORT: “Persuade Management”

 

In conclusion, the final cluster model will do a better job of informing the local, state, and federal governments of where, why, and how they should divert resources from one state to another. 


One state might require more ICU assistance, PPE aid, and financial resources to combat cases and deaths, and this is due to the proportion of covid cases and deaths relative to its population.

 

(Let’s say, hypothetically, that I wanted to persuade then-President Trump to act on covid relief measures and detail the situation in plain English. I might write him the following letter.)

 

Greetings, Mr. President,


I am writing, briefly and humbly, to urge you to act on the Covid situation that is greatly impacting our nation. As you know, resources are in short supply, as covid is a new disease. Our supply chains, as you have mentioned in a previous press conference, are stretched thin. Below is a graphical representation of the states in the following three clusters. These clusters can be described as:

 

Cluster 1 (RED): Less Critical 

Cluster 2 (GREEN): Moderately Critical 

Cluster 3 (BLUE): Highly Critical
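
To list exactly which states fall into each of these clusters, one quick way (using the objects from my analysis above) is to split the state names by their cluster assignments:

split(final$state, km_final$cluster)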

 

ggfinal <- ggplot(final, aes(cases100k,deaths100k, color = km_final$cluster)) +
  geom_point() +
  geom_text(aes(label=state)) +
  ggtitle("Clusters of US State based on number of covid cases and deaths per 100k") +
  labs(colour = "Clusters")
ggfinal



Final Cluster Model of Covid Cases VS. Deaths


States that are in the BLUE cluster should be given the highest priority access to covid relief, as these are the states that have the most cases and deaths for every 100,000 citizens, on average. Access to such care is vital for these states. 


Conversely, states that are in the RED cluster should have the lowest priority in your allocation of covid resources. 


And finally, once you have appropriately allocated the highest priority to the BLUE states, you can allocate a moderate amount of resources to the states in the GREEN cluster, as they are your second highest priority.

 

 

What might this allocation ratio look like? I will leave this to your discretion. However, a basic ratio might be 60:30:10, in which:

 

The Highly Critical cluster (BLUE) might receive 60% of covid relief. 

The Moderately Critical cluster (GREEN) might receive 30% of covid relief. 

The Less Critical cluster (RED) might receive 10% of covid relief.
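
As a toy illustration, with a completely made-up total budget figure, the split could be computed like so:

budget <- 10e9   # hypothetical $10 billion in covid relief (illustrative only)
allocation <- budget * c(high = 0.60, moderate = 0.30, low = 0.10)
allocation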

 

Again, I will leave this to the discretion of you and your team, as you are the group that will be held accountable for this decision.

 

Thank you so much for your attention, and for hearing out a concerned citizen. You and your team have reacted as well as possible, considering how unprepared we were to handle this black swan event.


I hope you, your family, your staff, and the rest of the world, can remain healthy during this pandemic.

 

Best regards,

Concerned Citizen

 

*** CONCLUSION: “Be Persuasive In Your Report”

 

The goal of the report is NOT to be complicated. The visualizations and the analysis should be clear enough for a layperson to understand. If the implications are clear, what are the most IMMEDIATE actions that a leader should take based on your report? 


Try to tell a story with the data if you can, so that leaders can be emotionally tied to the report. This will enable such leaders to be more involved in the decision-making process AND more effective.


THANK YOU!!


This post was more complicated than my previous ones on Regression analysis. Clustering resides in a different branch of Machine Learning called Unsupervised Machine Learning, which is why the code is different and the interpretation of the output is different as well. The algorithm detects patterns on its own that you can't see; it just needs enough data to do so!


I am so proud of you for making it this far into the blog. I hope that your journey is going well, and that you are having fun learning about this subject. It is always so much fun for me to write these posts, and I have y'all to thank for that. Thank you!


