*** Introduction:
*** Idea in Brief:
I will be analyzing two cleaned data sets related to COVID-19 cases and deaths and the population of each US state. I will take the joined data set on a journey through the Modified Analytics Process (MAP), a process introduced to me by one of my professors at the Kellstadt Graduate School of Business. The five stages are generally used in sequence, though at times I will revisit them out of order. They can be summarized as follows:
1. Plan- “Define Goals”
2. Collect- “Lock, Load and Clean”
3. Explore- “Learn Your Data”
4. Exploit- “Use Statistical Approaches for Problem Solving”
5. Report- “Persuade Management”
So then, I will begin this project with the PLAN phase
of MAP.
*** PLAN: “Define Goals”
This part of the analysis is qualitative and might be
iterated upon as the process goes on. As I learn more about the data set, its
constraints, and how those constraints are related to my goals, I might need to
modify my objective(s).
Initially, I would like to see whether this data is compatible with a technique we learned in class, specifically clustering. This is my main objective. Along with fitting this algorithm to the data, I will focus on using as many of the relevant functions as I can. Visualizations and other exploratory data analysis (EDA) tools will be treated as a vital component of the overarching analysis.
This is as far as I wish to take the PLAN phase for now. Allow me to move on to the next phase.
*** COLLECT: “Lock, Load, and Clean”
I will now load the necessary libraries into our environment. Additionally, I will set my working directory and create objects to store the data sets of interest; I will call these objects “covid” and “population.”
Lastly, the code that you need to type appears in the code blocks below. Please remember to press CTRL+ENTER after typing each block of code.
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.4     v dplyr   1.0.7
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   2.0.1     v forcats 0.5.1
## Warning: package 'stringr' was built under R version 4.1.2
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
setwd("C:/Users/firstnamelastinitial/OneDrive/Documents/R")
covid <- read.csv("covid.csv", header = TRUE)
population <- read.csv("population.csv", header = TRUE)
Earlier, I mentioned that I would be doing a “join” on the covid data and the population data. Normally, I feel that one would do this nearer the end of the analysis, after exploring the data more thoroughly. However, I already have an idea of what the data will look like graphically due to prior exposure to these sets, so I am going to create an object that combines certain features of both sets right now. I will call this object “scale,” and it will be referenced later in my analysis.
I want to treat the “scale” data frame differently than the data frames labeled “covid” and “population.” My intention is to create additional columns that better represent the information about the states contained in the original data.
scale <- covid %>% inner_join(population, by = "state")
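Before moving on, here is a small optional check (my own addition, not part of the original workflow). An inner join silently drops any rows that have no match in the other table, so anti_join() can reveal which jurisdictions, if any, were dropped:
covid %>% anti_join(population, by = "state")        # rows in covid with no match in population
population %>% anti_join(covid, by = "state")        # rows in population with no match in covid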
Since my data was cleaned prior to loading, I can move on to the next phase of MAP, where I will learn more about my data and determine whether I can run the statistical techniques listed in the PLAN phase.
*** EXPLORE: “Learn Your Data”
I will use some basic exploration functions to do a preliminary analysis of the data and determine which direction to take. For the sake of brevity in this report, I will explore my two initial data sets one after the other.
dim(covid)
## [1] 55 4
str(covid)
## 'data.frame':    55 obs. of  4 variables:
##  $ ï..date: chr  "10/20/20" "10/20/20" "10/20/20" "10/20/20" ...
##  $ state  : chr  "Alabama" "Alaska" "Arizona" "Arkansas" ...
##  $ cases  : int  174528 12349 232939 100441 887093 87966 64455 23325 16445 760381 ...
##  $ deaths : int  2805 63 5837 1728 17067 2212 4559 668 642 16104 ...
colnames(covid)
## [1] "ï..date" "state" "cases" "deaths"
head(covid)
##    ï..date      state  cases deaths
## 1 10/20/20    Alabama 174528   2805
## 2 10/20/20     Alaska  12349     63
## 3 10/20/20    Arizona 232939   5837
## 4 10/20/20   Arkansas 100441   1728
## 5 10/20/20 California 887093  17067
## 6 10/20/20   Colorado  87966   2212
We can see that the covid data frame has 55 observations (rows) and 4 variables (columns). The four columns are date and state, which are both character (non-numeric) types, and cases and deaths, which are both integer (numeric) types. These column names are also the names of the variables: date, state, cases, and deaths. (There are 55 rows rather than 50 because the data also covers jurisdictions beyond the states, such as the District of Columbia and US territories.) The head() function shows us the first 6 observations, which are the states of Alabama, Alaska, Arizona, Arkansas, California, and Colorado. Let’s do the same exploration of the population data frame.
dim(population)
## [1] 52 2
str(population)
## 'data.frame':    52 obs. of  2 variables:
##  $ state     : chr  "Alabama" "Alaska" "Arizona" "Arkansas" ...
##  $ population: int  4903185 731545 7278717 3017804 39512223 5758736 3565287 973764 705749 21477737 ...
colnames(population)
## [1] "state" "population"
head(population)
##        state population
## 1    Alabama    4903185
## 2     Alaska     731545
## 3    Arizona    7278717
## 4   Arkansas    3017804
## 5 California   39512223
## 6   Colorado    5758736
The population data frame is composed of 52
observations (rows) and 2 variables (columns). This data frame is much smaller.
It only has two columns, state and population, which are of character and integer
data types respectively. The head() function tells us the population for
Alabama, Alaska, Arizona, Arkansas, California, and Colorado.
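As an optional extra step (my own addition), summary() shows the ranges of the numeric columns at a glance, which can be helpful before clustering:
summary(covid[, c("cases", "deaths")])    # ranges of the columns used later for clustering
summary(population$population)            # range of state populations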
Now that I have some idea of what the data means, I am going to circle back to my objectives in the plan phase.
First, I would like to
see if I can run a clustering algorithm on the data frame entitled “covid.”
Then, as I already have an idea of the output, I will use the data frame
“scale” to get a more accurate representation of how certain variables relate
to the states.
I will take a K-means clustering approach, setting k equal to 4 (the number of clusters I wish to see). First, I am going to set my seed so that I can reproduce my models exactly when needed.
set.seed(1234)
*** EXPLOIT: “Use Statistical Approaches for Problem Solving”
Now that my seed is set, I will build the model using the kmeans() function. I will use the “covid” data frame, indexing it to keep all rows of the two columns cases and deaths. The number of clusters will be set to 4, and the algorithm will be run from 20 random starting configurations (nstart = 20) to find the best set of centroids.
km_covid <- kmeans(covid[, c("cases","deaths")], 4, nstart = 20)
I can now interpret the output of the model that has
been stored in the object called “km_covid.”
km_covid
## K-means clustering with 4 clusters of sizes 17, 3, 32, 3
##
## Cluster means:
##       cases     deaths
## 1 181126.29  5309.8235
## 2 391839.00 16684.6667
## 3  47952.19   932.8438
## 4 842297.67 16938.3333
##
## Clustering vector:
##  [1] 1 3 1 3 4 3 3 3 3 4 2 3 3 3 2 1 3 3 3 1 3 1 1 1 1 3 1 3 3 3 3 1 3 2 1 3 3 1
## [39] 3 3 1 3 3 1 3 1 4 3 3 3 1 3 3 1 3
##
## Within cluster sum of squares by cluster:
## [1] 19901981416 15607035773 45933578173 10096166187
##  (between_SS / total_SS =  95.5 %)
##
## Available components:
##
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"
Cluster means for cases and deaths are listed for 4
total clusters.
km_covid$cluster
##  [1] 1 3 1 3 4 3 3 3 3 4 2 3 3 3 2 1 3 3 3 1 3 1 1 1 1 3 1 3 3 3 3 1 3 2 1 3 3 1
## [39] 3 3 1 3 3 1 3 1 4 3 3 3 1 3 3 1 3
This output shows the cluster to which each observation has been assigned. For instance, the first observation in the “covid” data frame, Alabama, is grouped in cluster 1. The second observation, Alaska, is assigned to cluster 3. One can continue matching each observation in the data frame to its cluster number for all of the states, as in the sketch below.
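Rather than reading the clustering vector by eye, a small optional snippet (my own addition) pairs each state with its assigned cluster:
covid %>%
  mutate(cluster = km_covid$cluster) %>%   # attach the cluster assignments
  select(state, cases, deaths, cluster) %>%
  head()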
table(km_covid$cluster)
##
##  1  2  3  4
## 17  3 32  3
This table shows how many observations, or localities, were assigned to each cluster; the assignment reflects similarities between the observations with regard to cases and deaths. Cluster 1 has 17 localities, Cluster 2 has 3, Cluster 3 has 32, and Cluster 4 has 3.
km_covid$centers
##       cases     deaths
## 1 181126.29  5309.8235
## 2 391839.00 16684.6667
## 3  47952.19   932.8438
## 4 842297.67 16938.3333
This output shows the centroid coordinates (the mean cases and deaths) for each cluster in our model. I will now visualize the data, but first I need to use the as.factor() function to convert the cluster assignments into a factor so that ggplot2 treats them as discrete groups.
km_covid$cluster <- as.factor(km_covid$cluster)
I will use ggplot to graph the clusters and give the plot a title as well.
ggplot(covid, aes(cases, deaths, color = km_covid$cluster)) +
  geom_point() +
  geom_text(aes(label = state)) +
  ggtitle("Clusters of US states based on number of covid cases and deaths") +
  labs(colour = "Clusters")
This is the graphical output of our model, and I feel that something is off. Ideally, I think New York should have its own cluster, since it is not at all similar to the other states in its cluster, such as Illinois and Georgia. Since the initial centroids are chosen randomly in my 4-cluster model, there is no way for me to control where those centroids end up.
I am going to update our model with 5 clusters to see if I can get a more sensible cluster model. Further, I will use 30 random starts (nstart = 30) instead of 20. If this fails, I will choose k = 3. My fingers are crossed!
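As an aside, rather than guessing at k, a common sanity check (my own addition, not part of the original workflow) is an elbow plot: fit the model over a range of k values and plot the total within-cluster sum of squares, looking for the point where the improvement levels off.
wss <- sapply(1:8, function(k) {
  kmeans(covid[, c("cases", "deaths")], centers = k, nstart = 20)$tot.withinss
})
plot(1:8, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
With that aside noted, here is the 5-cluster attempt: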
km_covid2 <- kmeans(covid[, c("cases","deaths")], 5, nstart = 30)
km_covid2$cluster <- as.factor(km_covid2$cluster)
Let’s see the graphical representation of the new model called km_covid2.
ggplot(covid, aes(cases, deaths, color = km_covid2$cluster)) +
  geom_point() +
  geom_text(aes(label = state)) +
  ggtitle("Clusters of US states based on number of covid cases and deaths") +
  labs(colour = "Clusters")
In trying to pick a number of clusters that would put New York in its own cluster, I have made the cluster map more complex. This is not ideal! I will now try k = 3 to see if I can rectify my error. Below is what the newer model looks like:
km_covid3 <- kmeans(covid[, c("cases","deaths")], 3, nstart = 30)
km_covid3$cluster <- as.factor(km_covid3$cluster)
Let’s visualize the newer model:
ggplot(covid, aes(cases, deaths, color = km_covid3$cluster)) +
  geom_point() +
  geom_text(aes(label = state)) +
  ggtitle("Clusters of US states based on number of covid cases and deaths") +
  labs(colour = "Clusters")
I suppose this is the best I can do for this data frame. Ideally, I would prefer to have New York in its own cluster, as I don’t think it is similar to the other observations in its cluster (Cluster 3, in this case). It is clearly an outlier! But there is an issue that I haven’t addressed yet. Here is the issue:
If we look at the states in the green cluster (cluster 2),
we notice, based on prior knowledge, that these states have enormous
populations. Florida, Texas, and California are three of the most populated
states in the country, so how can we compare all observations appropriately
knowing that this model is somewhat biased?
I can address this issue by scaling the data. I have already created a data frame, “scale,” that begins to address the conflict. What if I could create new variables of interest that would scale the data?
I might create two columns: cases per 100,000 population and deaths per 100,000 population. This would make more sense, and we might even begin to see different clusters that are based on proportion rather than raw deaths and cases. So, let’s begin!
final <- scale %>%
  mutate(cases100k = (cases / population) * 100000,
         deaths100k = (deaths / population) * 100000)
What does the top of this data frame look like?
head(final)
##    ï..date      state  cases deaths population cases100k deaths100k
## 1 10/20/20    Alabama 174528   2805    4903185  3559.482   57.20771
## 2 10/20/20     Alaska  12349     63     731545  1688.071    8.61191
## 3 10/20/20    Arizona 232939   5837    7278717  3200.276   80.19270
## 4 10/20/20   Arkansas 100441   1728    3017804  3328.281   57.26018
## 5 10/20/20 California 887093  17067   39512223  2245.110   43.19423
## 6 10/20/20   Colorado  87966   2212    5758736  1527.523   38.41121
Now I can see two more columns, “cases100k” and “deaths100k,” in the new data frame called “final.” Let me set the seed once more.
set.seed(1234)
I can now create my final model, called “km_final.” I will set k equal to 3 and use 30 random starts (nstart = 30) before settling on the final model. I will index the “final” data frame to include all rows and the two columns “cases100k” and “deaths100k.” Let me run it:
km_final <- kmeans(final[, c("cases100k","deaths100k")], 3, nstart = 30)
I will use the as.factor() function to make the
necessary adjustments before I visualize the new model.
km_final$cluster <- as.factor(km_final$cluster)
… and the visualization, with labels changed, looks like this:
ggfinal <- ggplot(final, aes(cases100k, deaths100k, color = km_final$cluster)) +
  geom_point() +
  geom_text(aes(label = state)) +
  ggtitle("Clusters of US states based on number of covid cases and deaths per 100k") +
  labs(colour = "Clusters")
ggfinal
You know what?! Fine! I will submit! I surrender! Setting the model to 3 clusters looks good, although I still feel that it is not perfect. I will leave this the way it is. Let me interpret this final model in my own words.
The first observation I would like to take note of is that this new model has shifted some of the states, because it reflects proportionality rather than total deaths and cases alone. In our first model, Florida, Texas, and California were positioned all the way on the right side of the plot, but only because those three states are the largest by population. In our new scaled model, we have a more accurate representation of covid cases and deaths PER 100,000 citizens.
This is a more meaningful metric, as state governments can get a better idea of how to implement safety guidelines to reduce covid cases and deaths relative to their populations. The severity of Covid-19 is widespread, BUT it is DIFFERENT from state to state. In our new model, Florida is still at a critical level, but so is North Dakota, which is a much smaller state with regard to its population.
Notice how Pennsylvania, a heavily populated state, has a
different status than Florida, another heavily populated state. These two
states give a "bigger-picture-overview" that Covid safety measures should be
SPECIFIC to the state’s population. Meaning, Pennsylvania might, and should,
have a different strategy of case and death reduction initiatives than Florida.
*** REPORT: “Persuade Management”
In conclusion, the final cluster model will do a better job of informing the local, state, and federal governments of where, why, and how they should divert resources from one state to another.
One state might require more ICU assistance, PPE aid, and financial resources than another to combat cases and deaths, and this is driven by the proportion of covid cases and deaths relative to its population.
(Let’s say hypothetically, I wanted to persuade then
President Trump to act on covid relief measures, and detail the situation in
plain English. I might write to him the following letter)
Greetings Mr. President Trump,
I am writing, briefly and humbly, to ask you to act on the Covid situation that is greatly impacting our nation. As you know, we face a shortage of resources, as covid is a new disease. Our supply chains, as you have mentioned in a previous press conference, are stretched thin. Here is a graphical representation of the states grouped into the following three clusters. These clusters can be labeled as:
Cluster 1 (RED): Less Critical
Cluster 2 (GREEN): Moderately Critical
Cluster 3 (BLUE): Highly Critical
ggfinal <- ggplot(final, aes(cases100k, deaths100k, color = km_final$cluster)) +
  geom_point() +
  geom_text(aes(label = state)) +
  ggtitle("Clusters of US states based on number of covid cases and deaths per 100k") +
  labs(colour = "Clusters")
ggfinal
States that are in the BLUE cluster should be given the highest priority access to covid relief, as these are the states that have the most cases and deaths for every 100,000 citizens, on average. Access to such care is vital for these states.
Conversely, states that are in the RED cluster should have the lowest priority in your allocation of covid resources.
And finally, once you have appropriately allocated the highest priority to the BLUE states, you can allocate a moderate amount of resources to the states in the GREEN cluster, as they are your second highest priority.
What might this allocation ratio look like? I will leave this to your discretion. However, a basic ratio might be 60:30:10 (sketched in code after the list below), in which:
The Highly Critical cluster (BLUE) might receive 60% of covid relief.
The Moderately Critical cluster (GREEN) might receive 30% of covid relief.
The Less Critical cluster (RED) might receive 10% of covid relief.
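As a purely hypothetical illustration (my own sketch; the total budget figure is made up, and the cluster-number-to-color mapping is assumed to match the labels above), such a split could be computed directly from the cluster assignments, dividing each cluster's share evenly among its states:
relief_total  <- 100e9                                   # hypothetical total, in dollars
cluster_share <- c("1" = 0.10, "2" = 0.30, "3" = 0.60)   # assumed mapping: 1 = red, 2 = green, 3 = blue
allocation <- final %>%
  mutate(cluster = km_final$cluster) %>%
  group_by(cluster) %>%
  mutate(per_state = relief_total * cluster_share[as.character(cluster)] / n()) %>%
  ungroup() %>%
  select(state, cluster, per_state)
head(allocation)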
Again, I will leave this to the discretion of you and your
team, as you are the group that will be held accountable for this decision.
Thank you so much for your attention, and for hearing out a concerned citizen. You and your team have reacted as well as possible considering we have been unprepared to handle this black swan event.
I hope you, your family, your staff, and the rest of the
world, can remain healthy during this pandemic.
Best regards,
Concerned Citizen
*** CONCLUSION: “Be Persuasive In Your Report”
The goal of the report is NOT to be complicated. The visualizations and the analysis should be clear enough for a layperson to understand. If the implications are clear, what are the most IMMEDIATE actions that a leader should take with your report?
Try to tell a story with the data
if you can, so that leaders can be emotionally tied to the report. This will
enable such leaders to be more involved in the decision-making process AND more effective.
THANK YOU!!
This post was more complicated than my previous ones on regression analysis. Clustering resides in a different branch of machine learning called unsupervised machine learning, which is why the code and the interpretation of the output are both different. The algorithm detects patterns on its own that you might not be able to see... it just needs enough data to do this!
I am so proud of you for reaching this far into the blog. I hope that your journey is going well, and that you are having fun learning about this subject. It is always so much fun for me to write these posts, so I have y'all to thank for that. Thank you!