INTRODUCTION
Hello Porn Stars and all who are employed in the Sex Work Industry. Thank you as always for coming to this page. You have come such a long way in your learning and you should be proud of yourselves. Great work!
I recently posted about clustering analysis, which is an UNSUPERVISED MACHINE LEARNING method.
I wanted this post to give you an idea of how I might complete a report to a supervisor or manager. This includes how I would structure the report, and how I would deliver the results of the report.
I will use an analytical framework called MAP (or, Modified Analytical Process) to determine the allocation of COVID-19 preventative measures across the state of Florida.
The process will look like this:
1. Plan: “Define Goals”
2. Collect: “Lock, Load and Clean”
3. Explore: “Learn Your Data”
4. Exploit: “Use Statistical Approaches for Problem Solving”
5. Report: “Persuade Management”
So, let’s begin the analysis:
***
PLAN: “DEFINE GOALS”
I would like to do a cluster analysis of the state of Florida involving all the counties in the state. Using the most current data from the New York Times, my goal is to determine which counties are in the most critical condition during the pandemic, so that I can offer guidance on which counties should be prioritized for state-wide COVID-19 prevention measures.
This goal might change depending on what I discover.
(One important caveat is that this data was collected just months after the COVID-19 outbreak. Since then, we have had access to vaccines and other worthy medical solutions for the better part of a year.)
I will use the K-Means algorithm for my clustering analysis, and I intend to separate counties into 3 groups: Lowest Priority (Not Critical), Moderate Priority (Somewhat Critical), and High Priority (Extremely Critical).
***
COLLECT: “LOCK, LOAD, and CLEAN”
I will load tidyverse into my environment, set my working directory, and store my data in an object called “covid.” The code to execute is as follows:
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.4     v dplyr   1.0.7
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   2.0.1     v forcats 0.5.1
## Warning: package 'stringr' was built under R version 4.1.2
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
setwd("C:/Users/firstnamelastinitial/Documents/R")
covid <- read.csv("us-counties-recent.csv", header = TRUE)
***
EXPLORE: “LEARN YOUR DATA”
I will perform some preliminary exploratory techniques to
get a superficial sense of the data.
dim(covid)
## [1] 97576     6
str(covid)
## 'data.frame':    97576 obs. of  6 variables:
##  $ date  : chr  "2022-01-27" "2022-01-27" "2022-01-27" "2022-01-27" ...
##  $ county: chr  "Autauga" "Baldwin" "Barbour" "Bibb" ...
##  $ state : chr  "Alabama" "Alabama" "Alabama" "Alabama" ...
##  $ fips  : int  1001 1003 1005 1007 1009 1011 1013 1015 1017 1019 ...
##  $ cases : int  13251 50313 5054 5795 13427 2149 4519 28519 7882 4529 ...
##  $ deaths: int  163 608 83 95 204 46 106 544 147 67 ...
colnames(covid)
## [1] "date" "county" "state" "fips" "cases" "deaths"
The data set that I have stored in a data frame called “covid” has 97,576 observations (rows) and 6 variables (columns). These variables are divided into character types (date, county, and state) and integer types (fips, cases, and deaths).
Understand that character types are treated as text, not as numbers. Integer types, by contrast, are treated as numbers.
Next, I am going to create an object that focuses on the
state of Florida, and its counties, specifically. I will refer to this object
as “florida_counties.”
florida_counties <- covid %>%
filter(state == "Florida")
In order to further clean the data, I will remove two
columns that are irrelevant for this analysis, date and fips.
florida_counties <-
florida_counties %>%
select(-date, -fips)
Finally, I will remove all duplicated information, keeping only one row per county name. The code below keeps the first row that appears in the file for each county, along with its respective statistics.
florida_counties <-
florida_counties[!duplicated(florida_counties$county), ]
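Because !duplicated() keeps the first row it meets for each county, which date survives depends on how the file is ordered. If the most recent figures are wanted, one possible approach is sketched below. It is only an illustration on made-up toy data, and it assumes the date column is still present (that is, it would run before date is dropped) and that dates are stored as "YYYY-MM-DD" text, as in the str() output.

```r
# Toy data in the same shape as the New York Times file (all values are made up)
toy <- data.frame(
  date   = c("2022-01-26", "2022-01-26", "2022-01-27", "2022-01-27"),
  county = c("Alachua", "Baker", "Alachua", "Baker"),
  cases  = c(100, 10, 110, 12),
  deaths = c(5, 1, 6, 1)
)

# Sort newest first; "YYYY-MM-DD" text sorts in calendar order
toy_sorted <- toy[order(toy$date, decreasing = TRUE), ]

# Keep the first row seen for each county, which is now the newest one
latest <- toy_sorted[!duplicated(toy_sorted$county), ]
latest
```

The same two lines could be applied to the Florida subset in place of the toy data frame.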
From prior exercises in clustering analysis, I know that I need to scale the data before clustering. Not all counties are equal in population, so their raw counts cover a very wide range.
For instance, some counties like Miami-Dade have enormous populations. Other counties, such as Baker County, are much smaller.
So, before I create my 3 clusters, I would like to put both cases and deaths into a scaled form. Keep in mind that scale() standardizes each column to a mean of 0 and a standard deviation of 1; it does not convert the counts into per-capita rates. The code expressing these changes might look like this:
florida_counties_scaled <-
florida_counties %>%
select(cases, deaths) %>%
scale()
florida_counties_scaled %>%
  summary()
##      cases              deaths
##  Min.   :-0.48881   Min.   :-0.62228
##  1st Qu.:-0.45243   1st Qu.:-0.56647
##  Median :-0.35627   Median :-0.29518
##  Mean   : 0.00000   Mean   : 0.00000
##  3rd Qu.:-0.02328   3rd Qu.: 0.09143
##  Max.   : 6.35542   Max.   : 5.79201
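To see why both means in the summary are 0, it may help to reproduce what scale() does by hand. The short sketch below uses a made-up vector of counts (not the Florida data) and checks that subtracting the mean and dividing by the standard deviation gives the same result as scale() with its default settings.

```r
# Made-up counts, only for illustration
x <- c(10, 20, 30, 40, 100)

# Standardize by hand: subtract the mean, divide by the standard deviation
manual <- (x - mean(x)) / sd(x)

# scale() returns a one-column matrix, so flatten it for comparison
auto <- as.numeric(scale(x))

all.equal(manual, auto)
```

After this step every column is measured in standard deviations from its own mean, which keeps one variable from dominating the distance calculations in K-Means.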
Now let’s load the stats library and set our seed so that the results are reproducible. The kmeans() function itself is run in the next step.
library(stats)
set.seed(1234)
***
EXPLOIT: “USE STATISTICAL APPROACHES FOR PROBLEM SOLVING”
Let’s build our model by creating a new object called “k_3” as I will be using 3 clusters in this model. With nstart = 30, kmeans() tries 30 random sets of starting centroids and keeps the run with the lowest total within-cluster sum of squares.
k_3 <- kmeans(florida_counties_scaled, centers = 3, nstart = 30)
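The choice of 3 clusters here comes from the three priority levels defined in the plan. One common, informal check on such a choice is the elbow method: fit the model for several values of k and look at how the total within-cluster sum of squares falls. The sketch below is only an illustration on simulated toy data with three loose groups; the object names toy and wss are my own and are not part of the analysis above.

```r
set.seed(1234)

# Simulated, already-scaled toy data with three loose groups (not the Florida data)
toy <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = 3, sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = 6, sd = 0.3), ncol = 2)
)
colnames(toy) <- c("cases", "deaths")

# Total within-cluster sum of squares for k = 1 to 6
wss <- sapply(1:6, function(k) {
  kmeans(toy, centers = k, nstart = 30)$tot.withinss
})

# A sharp drop that flattens out after some k suggests that k is a reasonable choice
print(round(wss, 2))
```

To try this on the real data, florida_counties_scaled would take the place of toy.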
Let’s explore our model:
k_3$size
## [1]  1 59  8
This output tells us that, of the 3 clusters that I have specified, there is 1 observation in cluster 1, 59 observations in cluster 2, and 8 observations in cluster 3. Keep in mind that when I use the word observation I am referring to the counties. Meaning, each unique county is a single observation.
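Since each observation is a county, it can be handy to see which county names fall into which cluster. A minimal sketch with placeholder names follows; the vectors counties and cluster are stand-ins that I made up for florida_counties$county and k_3$cluster.

```r
# Placeholder county names and cluster labels (stand-ins, not real results)
counties <- c("County A", "County B", "County C", "County D")
cluster  <- c(1, 2, 2, 3)

# split() groups the names by cluster label and returns a named list
by_cluster <- split(counties, cluster)
by_cluster
```

With the real objects, the equivalent call would be split(florida_counties$county, k_3$cluster), which works because the rows of florida_counties are in the same order as the rows passed to kmeans().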
k_3$centers
##        cases     deaths
## 1  6.3554163  5.7920092
## 2 -0.2862105 -0.3002287
## 3  1.3163750  1.4901853
The “centers” component tells us the location of our 3 centroids, in scaled units. Each centroid is the mean of the counties assigned to its cluster; only the starting positions are chosen at random by the algorithm.
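To make the idea of a centroid concrete, the sketch below fits K-Means on a tiny made-up data set and checks that each reported center is simply the column means of the points assigned to that cluster. The names toy, fit, and manual_centers are my own, chosen for illustration.

```r
set.seed(1234)

# Tiny made-up data set with two obvious groups, standardized with scale()
toy <- scale(cbind(
  cases  = c(1, 2, 3, 50, 52, 55),
  deaths = c(0, 1, 1, 9, 10, 12)
))

fit <- kmeans(toy, centers = 2, nstart = 10)

# For each column, take the mean within each cluster
manual_centers <- apply(toy, 2, function(col) tapply(col, fit$cluster, mean))

# Compare with the centers reported by kmeans(), ignoring row and column names
all.equal(unname(manual_centers), unname(fit$centers))
```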
We can begin to visualize our model with the
factoextra package. This would be ideal as clusters are meant to be visualized.
library(factoextra)
## Warning: package 'factoextra' was built under R version 4.1.2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
We can ignore the warning!
fviz_cluster(k_3, data = florida_counties_scaled, repel = TRUE)
## Warning: ggrepel: 58 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
Let’s see if we can put the county names on a ggplot,
along with some labels:
k_3$cluster <- as.factor(k_3$cluster)
ggplot(florida_counties, aes(cases, deaths, color = k_3$cluster)) +
  geom_point() +
  geom_text(aes(label = county)) +
  ggtitle("Clusters of Florida Counties based on number of covid cases and deaths") +
  labs(colour = "Clusters") +
  xlab("Cases")
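The cases and deaths columns in the New York Times file are raw counts, and the file has no population column. If rates per 10,000 residents are wanted, population figures would have to be joined in from another source. The sketch below shows only the arithmetic, on a made-up data frame; the population column and every number in it are hypothetical.

```r
# Made-up counties with hypothetical population figures
toy <- data.frame(
  county     = c("County A", "County B"),
  cases      = c(50000, 5000),
  deaths     = c(500, 50),
  population = c(1000000, 50000)  # hypothetical, not from the NYT file
)

# Rate per 10,000 residents = count / population * 10,000
toy$cases_per_10k  <- toy$cases  / toy$population * 10000
toy$deaths_per_10k <- toy$deaths / toy$population * 10000
toy
```

In this toy example the smaller county has the higher rate (1,000 cases per 10,000 against 500), even though its raw count is ten times lower, which is why the choice between counts and rates can change which cluster a county lands in.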
***
REPORT: “PERSUADE MANAGEMENT”
I would like to conclude my research at this point. I have enough information to determine which clusters of counties in the state of Florida are most at risk under current COVID-19 conditions.
As you can see from the plot above there are 3 clusters of
counties, and I will label them accordingly:
RED CLUSTER:
Extremely Critical, has a high number of cases and a high number of deaths relative to the other clusters.
BLUE CLUSTER:
Somewhat Critical, has a moderate number of cases and a moderate number of deaths compared to the red cluster.
GREEN CLUSTER:
Not Critical, has a lower number of cases and a low number of deaths compared to both the red and blue clusters.
There is only one county that is in the RED CLUSTER: Miami-Dade County.
There are 8 counties in the BLUE CLUSTER, including Broward, Palm Beach, and Orange, among others.
And the remaining 59 counties in the GREEN CLUSTER have a status of “Not Critical.”
**From the information that I have gathered, it is clear that Miami-Dade County is in need of the most resources, whether in the form of ICU assistance, PPE, testing kits, or vaccine allocation.
This conclusion is not, by any means, meant to be a political statement, as it deals mainly with numbers: cold, unemotional numbers. If this county is not given thoughtful consideration, COVID-19 cases and deaths will decline at a much slower rate, causing an unnecessary amount of turmoil in the region.**
MANAGERIAL IMPLICATIONS
I hope that the contents of this report can guide leaders in Miami-Dade County to be more proactive about pursuing aid from the state and federal government. If there is ONE TAKEAWAY from this analysis, it is that Miami-Dade’s status with regard to COVID-19 is EXTREMELY CRITICAL.
Excess resources should be divided equally among the counties in the next most critical cluster, which are labeled in blue. And finally, mild attention is needed for counties listed in the green cluster.
Thank you so much for reading my analysis. I hope that we, the citizens of the world, can use data in a way that brings about change within our communities. We can begin to have a conversation about the future of data, and about its usefulness and effectiveness when it is interpreted ethically. I hope this project can be of use to leaders in the Miami-Dade medical community.
THANK YOU
Thank you so much for reading this post. I hope it gives you a simple and concise breakdown of how to conduct a basic report. Depending on the topic and the guidelines given to you by your manager, the report can be much lengthier.
I am so proud of you for taking these crucial and necessary steps to change the trajectory of your lives by becoming proficient in data analysis and machine learning. Your hard work will pay off in multiples!