Monday, August 8, 2022

Unsupervised Machine Learning For Porn Stars: COVID-19 Cluster Analysis for Miami-Dade County, Critical Status

on August 08, 2022 in Blog, Clustering

INTRODUCTION

Hello Porn Stars and all who are employed in the Sex Work Industry. Thank you as always for coming to this page. You have come such a long way in your learning and you should be proud of yourselves. Great work!

I posted, recently, about clustering analysis, which is an UNSUPERVISED MACHINE LEARNING method.

I wanted this post to give you an idea of how I might complete a report to a supervisor or manager. This includes how I would structure the report, and how I would deliver the results of the report.

I will use an analytical framework called MAP (or, Modified Analytical Process) to determine the allocation of COVID-19 preventative measures across the state of Florida.

The process will look like this:

1. Plan: “Define Goals”

2. Collect: “Lock, Load and Clean”

3. Explore: “Learn Your Data”

4. Exploit: “Use Statistical Approaches for Problem Solving”

5. Report: “Persuade Management”

So, let’s begin the analysis:

*** PLAN: “DEFINE GOALS”

I would like to do a cluster analysis of the state of Florida involving all the counties in the state. Using recent data from the New York Times, my goal is to determine which counties are in the most critical condition and, from that, to offer guidance on which counties should be prioritized for state-wide COVID-19 prevention measures.

This goal might change depending on what I discover.

(One important caveat is that this data is a snapshot in time. The rows shown later in this post are dated 2022-01-27, by which point we had already had access to vaccines and other worthy medical solutions for the better part of a year.)

I will use the K-Means algorithm for my Clustering analysis, and I intend to separate counties into 3 groups called: Lowest Priority (or Not Critical), Moderate Priority (Somewhat Critical), and High Priority (Extremely Critical).

*** COLLECT: “LOCK, LOAD, and CLEAN”

I will load tidyverse into my environment, set my working directory, and store my data in an object called “covid.” Please observe the following code and its output (lines beginning with ## are output):

library(tidyverse)

## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --

## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble 3.1.4     v dplyr   1.0.7
## v tidyr 1.1.3     v stringr 1.4.0
## v readr   2.0.1     v forcats 0.5.1

## Warning: package 'stringr' was built under R version 4.1.2

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()

setwd("C:/Users/firstnamelastinitial/Documents/R")

covid <- read.csv("us-counties-recent.csv", header = TRUE)

*** EXPLORE: “LEARN YOUR DATA”

I will perform some preliminary exploratory techniques to get a superficial sense of the data.

dim(covid)

## [1] 97576 6

str(covid)

## 'data.frame': 97576 obs. of 6 variables:
## $ date : chr "2022-01-27" "2022-01-27" "2022-01-27" "2022-01-27" ...
## $ county: chr "Autauga" "Baldwin" "Barbour" "Bibb" ...
## $ state : chr "Alabama" "Alabama" "Alabama" "Alabama" ...
## $ fips : int 1001 1003 1005 1007 1009 1011 1013 1015 1017 1019 ...
## $ cases : int 13251 50313 5054 5795 13427 2149 4519 28519 7882 4529 ...
## $ deaths: int 163 608 83 95 204 46 106 544 147 67 ...

colnames(covid)

## [1] "date" "county" "state" "fips" "cases" "deaths"

The data set that I have stored in a data frame called “covid” has 97,576 observations (rows) and 6 variables (columns). These variables are divided into character types (date, county, and state) and integer types (fips, cases, and deaths).

Understand that character types are not considered numerical. Conversely, integer types are considered as numbers.
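A quick way to confirm the difference is to ask R directly. The sketch below is only an illustration: the object name toy is my own, and its values are copied from the first rows of the str() output above.

```r
# Toy data frame for illustration; values copied from the str() output above
toy <- data.frame(
  county = c("Autauga", "Baldwin", "Barbour"),
  cases  = c(13251L, 50313L, 5054L),
  stringsAsFactors = FALSE
)

# sapply() applies class() to every column and returns a named vector
col_classes <- sapply(toy, class)
print(col_classes)

# Arithmetic works on integer columns
total_cases <- sum(toy$cases)
print(total_cases)

# is.numeric() is TRUE for integer columns and FALSE for character columns
print(is.numeric(toy$cases))
print(is.numeric(toy$county))
```

This matters later on, because kmeans() only accepts numeric columns.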

Next, I am going to create an object that focuses on the state of Florida, and its counties, specifically. I will refer to this object as “florida_counties.”

florida_counties <- covid %>%
  filter(state == "Florida")

In order to further clean the data, I will remove two columns that are irrelevant for this analysis, date and fips.

florida_counties <- florida_counties %>%
  select(-date, -fips)

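Finally, I will remove the duplicated rows so that each county appears only once. Note that !duplicated() keeps the first row it finds for each county. The file appears to be ordered by date (the first rows shown above are all dated 2022-01-27), so this keeps the earliest snapshot in the file for each county, along with its statistics.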

florida_counties <- florida_counties[!duplicated(florida_counties$county), ]
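To see which row survives the duplicated() step, here is a minimal sketch on made-up data. The toy data frame and its numbers are invented for illustration, and the fromLast = TRUE variant is an alternative I am showing for comparison, not something used in this analysis.

```r
# Toy data in the same shape as the NYT file: one row per county per date,
# sorted by date (all values here are made up for illustration)
toy <- data.frame(
  date   = c("2022-01-27", "2022-01-27", "2022-01-28", "2022-01-28"),
  county = c("Baker", "Miami-Dade", "Baker", "Miami-Dade"),
  cases  = c(100L, 1000L, 110L, 1050L),
  stringsAsFactors = FALSE
)

# duplicated() flags every repeat after the first occurrence,
# so !duplicated() keeps the FIRST (earliest) row per county
first_rows <- toy[!duplicated(toy$county), ]
print(first_rows)

# fromLast = TRUE scans from the bottom, so the LAST (latest) row
# per county is kept instead
last_rows <- toy[!duplicated(toy$county, fromLast = TRUE), ]
print(last_rows)
```

If the most recent figures matter for a report, the second form may be the better fit, provided the rows really are sorted by date.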

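From prior exercises in clustering analysis, I know that I need to scale the data before clustering. K-Means groups observations by distance, and the cases column holds far larger numbers than the deaths column, so without scaling the cases would dominate the result.

Keep in mind that not all counties are equal in population. For instance, some counties like Miami-Dade have enormous populations. Other counties, like Baker County, are much smaller. The scale() function does not correct for this: it standardizes each column to a mean of 0 and a standard deviation of 1, but the values are still based on raw counts rather than per-capita rates.

So, I would like to put both cases and deaths into a scaled form, which I will then use to build my 3 clusters. The code expressing these changes might look like this: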

florida_counties_scaled <- florida_counties %>%
  select(cases, deaths) %>%
  scale()

florida_counties_scaled %>%
  summary()

##      cases              deaths
## Min.   :-0.48881   Min.   :-0.62228
## 1st Qu.:-0.45243   1st Qu.:-0.56647
## Median :-0.35627   Median :-0.29518
## Mean   : 0.00000   Mean   : 0.00000
## 3rd Qu.:-0.02328   3rd Qu.: 0.09143
## Max.   : 6.35542   Max.   : 5.79201
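To make the summary above less mysterious, here is a small sketch of what scale() computes, using made-up numbers. Everything in it is an assumption for illustration: the toy counts, and especially the population vector, since the NYT file used in this post has no population column.

```r
# Made-up counts for three hypothetical counties
toy <- data.frame(
  cases  = c(1000, 5000, 90000),
  deaths = c(20, 80, 1500)
)

# scale() centers each column on its mean and divides by its standard deviation
scaled <- scale(toy)

# The same z-score computed by hand for the cases column
manual_cases <- (toy$cases - mean(toy$cases)) / sd(toy$cases)
print(all.equal(as.numeric(scaled[, "cases"]), manual_cases))

# After scaling, every column has mean 0 and standard deviation 1
print(round(colMeans(scaled), 10))
print(apply(scaled, 2, sd))

# scale() does not account for population. A per-capita rate needs a
# population column; the figures below are invented purely for illustration.
population <- c(30000, 200000, 2700000)
cases_per_10k <- toy$cases / population * 10000
print(cases_per_10k)
```

This is why the mean of both columns in the summary is 0, and why one very large county can sit more than 6 standard deviations above it.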

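Now let’s load the stats library (it is attached by default in R, so this line is optional) and set our seed so that the results are reproducible. We will run the kmeans() function in the next step.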

library(stats)
set.seed(1234)

*** EXPLOIT: “USE STATISTICAL APPROACHES FOR PROBLEM SOLVING”

Let’s build our model by creating a new object called “k_3,” as I will be using 3 clusters in this model. The nstart = 30 argument tells kmeans() to try 30 different random sets of starting centroids and keep the run with the lowest total within-cluster sum of squares.

k_3 <- kmeans(florida_counties_scaled, centers = 3, nstart = 30)
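For readers who want to try the same call without downloading the NYT file, here is a minimal sketch on simulated data. The toy matrix is invented for illustration; only the kmeans() arguments mirror the call above.

```r
set.seed(1234)

# Three made-up, well-separated groups of 2-D points (20 points each)
toy <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(40, mean = 4, sd = 0.5), ncol = 2),
  matrix(rnorm(40, mean = 8, sd = 0.5), ncol = 2)
)
colnames(toy) <- c("cases", "deaths")

# nstart = 30 tries 30 random sets of starting centroids and keeps the run
# with the lowest total within-cluster sum of squares
fit <- kmeans(toy, centers = 3, nstart = 30)

print(fit$size)          # number of points assigned to each cluster
print(fit$tot.withinss)  # the quantity used to pick the best of the 30 runs

# tot.withinss is the sum of the per-cluster within sums of squares
print(all.equal(fit$tot.withinss, sum(fit$withinss)))
```

A single random start can settle on a poor grouping, which is why I prefer a larger nstart even though it takes a little longer to run.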

Let’s explore our model:

k_3$size

## [1] 1 59 8

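This output tells us that in the 3 clusters that I have specified, there is 1 observation in cluster 1, 59 observations in cluster 2, and 8 observations in cluster 3. Keep in mind that when I use the word observation I am generally referring to the counties. Meaning, each unique county is a single observation. (Florida has 67 counties, so the total of 68 suggests one extra row, most likely an “Unknown” entry that the NYT data uses for cases not assigned to a county.)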

k_3$centers

##        cases     deaths
## 1  6.3554163  5.7920092
## 2 -0.2862105 -0.3002287
## 3  1.3163750  1.4901853

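The “centers” component tells us the location of our 3 centroids, in scaled units. Each centroid is the mean of the observations assigned to its cluster. Only the starting positions are chosen at random; the algorithm then moves the centroids until the assignments settle.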
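One way to convince yourself of what a centroid is: recompute it by hand. The sketch below uses simulated data (the toy matrix is my own invention) and compares the centers reported by kmeans() with the column means of each cluster’s points.

```r
set.seed(1234)

# Two made-up groups of 2-D points
toy <- rbind(
  matrix(rnorm(30, mean = 0), ncol = 2),
  matrix(rnorm(30, mean = 6), ncol = 2)
)
colnames(toy) <- c("cases", "deaths")

fit <- kmeans(toy, centers = 2, nstart = 30)

# Recompute each centroid by hand: the column means of the points
# that ended up in that cluster
manual_centers <- rbind(
  colMeans(toy[fit$cluster == 1, , drop = FALSE]),
  colMeans(toy[fit$cluster == 2, , drop = FALSE])
)

print(fit$centers)
print(manual_centers)

# The two should agree up to floating-point rounding
print(all.equal(unname(fit$centers), unname(manual_centers)))
```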

We can begin to visualize our model with the factoextra package. This would be ideal as clusters are meant to be visualized.

library(factoextra)

## Warning: package 'factoextra' was built under R version 4.1.2

## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa

We can ignore the warning!

fviz_cluster(k_3, data = florida_counties_scaled, repel = TRUE)

## Warning: ggrepel: 58 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

Cluster Plot of Florida Counties

Let’s see if we can put the county names on a ggplot, along with some labels:

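# Store the cluster assignments as a factor so ggplot uses discrete colours
florida_counties$cluster <- as.factor(k_3$cluster)

# Plot the raw (unscaled) counts, coloured by cluster and labelled by county
ggplot(florida_counties, aes(cases, deaths, color = cluster)) +
  geom_point() +
  geom_text(aes(label = county)) +
  ggtitle("Clusters of Florida Counties based on number of covid cases and deaths") +
  labs(colour = "Clusters") +
  xlab("Cases") +
  ylab("Deaths")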

Cluster Plot 2 on Florida Counties

*** REPORT: “PERSUADE MANAGEMENT”

I would like to conclude my research at this point. I have enough information to determine which clusters of counties in the state of Florida are currently at the greatest risk from COVID-19.

As you can see from the plot above there are 3 clusters of counties, and I will label them accordingly:

RED CLUSTER: Extremely Critical. It has a very high number of cases and a very high number of deaths relative to the other clusters. (These are raw counts, not rates per 10,000 citizens.)

BLUE CLUSTER: Somewhat Critical. It has a moderate number of cases and a moderate number of deaths, lower than the red cluster.

GREEN CLUSTER: Not Critical. It has a lower number of cases and a lower number of deaths than both the red and blue clusters.

There is only one county that is in the RED CLUSTER: Miami-Dade County.

There are 8 counties in the BLUE CLUSTER: Including Broward, Palm Beach, Orange, among others.

And the remaining 59 counties in the GREEN CLUSTER have a status of “Not Critical.”
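One practical note on labels like these: the cluster numbers that kmeans() returns are arbitrary, so it can be safer to assign the priority names from the centroids instead of by eye. The following is a minimal sketch on simulated data. The toy matrix, the ranking rule (sum of the scaled coordinates), and the object names are all my own assumptions, not part of the analysis above.

```r
set.seed(1234)

# Made-up scaled data: 12 low points, 5 moderate points, 1 extreme point
toy <- rbind(
  matrix(rnorm(24, mean = -0.5, sd = 0.1), ncol = 2),
  matrix(rnorm(10, mean = 1.5, sd = 0.1), ncol = 2),
  matrix(c(6, 6), ncol = 2)
)
colnames(toy) <- c("cases", "deaths")

fit <- kmeans(toy, centers = 3, nstart = 30)

# Cluster numbers are arbitrary, so rank clusters by the size of their
# centroid (here: the sum of the scaled cases and deaths coordinates)
severity_rank <- rank(rowSums(fit$centers))

# Label names taken from the three groups defined in the plan
priority_names <- c("Not Critical", "Somewhat Critical", "Extremely Critical")
priority <- factor(priority_names[severity_rank[fit$cluster]],
                   levels = priority_names)

print(table(priority))
```

With this approach the labels stay correct even if a different seed shuffles which cluster is called 1, 2, or 3.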

**From the information that I have gathered, it is clear that Miami-Dade County is in need of the most resources, whether in the form of ICU assistance, PPE, testing kits, or vaccine allocation.

This conclusion is not, by any means, meant to be a political statement, as it mainly deals with numbers: cold, unemotional numbers. If this county is not given thoughtful consideration, COVID-19 cases and deaths will decline at a much slower rate, causing an unnecessary amount of turmoil in the region.**

MANAGERIAL IMPLICATIONS

I hope that the contents of this report can guide leaders in Miami-Dade County to be more proactive about pursuing aid from the state and federal government. If there is ONE TAKEAWAY from this analysis, it is that Miami-Dade’s status with regard to COVID-19 is EXTREMELY CRITICAL.

Excess resources should be divided equally among the counties in the next most critical cluster, those labeled in blue. And finally, mild attention is needed for counties in the green cluster.

Thank you so much for reading my analysis. I hope that we, the citizens of the world, can use data in a way that can impact change within our communities. We can begin to have a conversation about the future of data, its usefulness, and effectiveness when it is interpreted ethically. I hope this project can be of use to leaders in the Miami-Dade medical community.

THANK YOU

Thank you so much for reading this post. I hope it gives you a simple and concise breakdown for how to conduct a basic report. Depending on the topic, and guidelines given to you by your manager, the report can be much lengthier.

I am so proud of you for taking these crucial and necessary steps to change the trajectory of your lives by becoming proficient in data analysis and machine learning. Your hard work will pay off in multiples!


Location: Chicago, IL, USA