Sunday, August 21, 2022

Computer Skills for Retiring Porn Stars: A Basic Overview of Tools That You NEED To Thrive As An Entrepreneur

Today, I was on Alura Jenson’s (Elizabeth Marie Spraggins) Instagram, and she had a very comical—yet reasonable—rant about how much she hates computers.

 

Although it was quite humorous, her rant highlights a key issue that many sex workers face: a lack of computer skills. This is completely understandable, as using a computer was never part of her job as an Adult Entertainer.

 

There is nothing for her, or you, to be ashamed of.

 

Hello ladies! And once again, thank you for visiting my website for your educational needs. I am going to use this post to address an issue that you might be facing as retired Porn Stars or Sex Workers. Specifically, I am going to talk about the computer skills you should focus on in order to be successful in your entrepreneurial ventures.

 

 

Luckily, we are living in the third decade of the twenty-first century, and there are many tools for learning about computers, software, hardware, and the other skills you might need to make a smooth transition from sex work to entrepreneurship.

 


This post is meant to be a very basic overview of some key terms and concepts that you should search on Google, so that you can learn the skills you need to survive in the world of entrepreneurship.

 

Let’s start with the first key skill that will be valuable to you for the rest of your life.

 

TOUCH TYPING

 

To get the most out of a computer, it helps if your typing is efficient. You should be able to type on the keyboard WITHOUT looking down at your fingers. Further, you should be able to type fluidly at roughly 60 words per minute. That is one complete word per second.

 

At 60 words per minute, you should be able to type a 4-page document in less than 20 minutes, assuming you are fully organized with your thoughts. (A typical page runs roughly 250 to 300 words, so 4 pages is about 1,000 to 1,200 words, which at 60 words per minute works out to roughly 17 to 20 minutes of typing.)

 

Typing expeditiously is a life skill that will serve you through a majority of tasks that you will need to thrive as an entrepreneur—like composing emails, writing business or marketing plans, and creating reports and other forms of content.

 

And since the keyboard is your most direct interface with the computer, it is that much more important to learn how to type well.

 

You can find touch typing websites that let you use their base service for free. I like using a website called TypingClub.

 

 

They claim that after 2 or 3 weeks you will be able to type quickly and without looking at your hands. You will achieve this because you will learn to type using the HOME ROW KEYS, which builds muscle memory.


Practice your touch typing for at least 45 minutes each day for a month. The payoff is worth so much more than the effort!

 

Let’s discuss the next skill that you should learn.

 

BE FAMILIAR WITH CHROME OR OTHER WEB BROWSERS

 

A web browser is simply a piece of software that allows you to “browse” content on the Internet. I don’t want to assume, but you may already be familiar with Google Chrome.

 

Chrome is the most widely used browser in the world, holding well over half of the global browser market. Ya! It’s popular. A big part of the reason is that it is built around Google Search, which uses a sophisticated algorithm to “rank” and sort websites so that you can find relevant pages with ease.

 

However, Google Chrome, as well as other browsers, lets you search for MORE than just websites. You can use the browser to find specific locations on maps, images, reviews of services, and even other search engines, if you don’t trust Google.

 

Think of a web browser as your passport to the digital world!

 

Some of the more popular web browsers and search engines are:

 

- Google Chrome (a browser)

 

- Microsoft Edge (a browser)

 

- Bing (a search engine)

 

- Yahoo (a search engine)

 

- DuckDuckGo (a search engine, which also offers its own browser app)

 

- Amazon (not a browser or a search engine, but an e-commerce site with its own built-in search for shopping)

 


There are some tasks that you should be able to do when using or operating a web browser. You should:

 

- Be able to “bookmark” or “add favorites”

 

- Clear browsing history

 

- Disable/Enable pop-ups and disable/enable cookies

 

- Use search filters and tools to refine and narrow down your results (Google indexes billions of pages, so this is useful to know)

 

 

Let’s discuss Microsoft Office.

 


MICROSOFT OFFICE: WORD PROCESSING, POWERPOINT, AND SPREADSHEETS

 


Microsoft Word will be your main tool for typing reports, email drafts, and pretty much anything that requires text. It is an impressive piece of software that has been around for decades.

 


Although it has a dizzying array of functions and tools, there are some basic operations you should focus on in order to create a basic Word document. Useful tasks include:

 


- Create, edit, and save a “document”

 

- Add tables and graphs (Microsoft Excel can do this too. More on this soon.)

 

- Adjust page margins and line spacing (for example, single spaced or double spaced)

 

- Check the character count and word count of the document

 

- Insert headers at the top and footers at the bottom of the document

 

- Copy the document

 

- And email the document when needed

 


Microsoft Excel is a spreadsheet application that allows you to organize your data into tables. Like Microsoft Word, it has a vast range of functionality, and it would go beyond the scope of this blog post to get into all of it. But you can find many tutorials on YouTube, a video platform with its own search engine dedicated to video content.

 


You might do a YouTube search like this, “Microsoft Excel Tutorial,” or “Microsoft Excel Basics.” Like Google, YouTube ranks video content by relevance, so searching for a video near the top of the page is usually adequate.

 


You can do this same strategy for learning how to use Microsoft Word. YouTube is a great tool for learning, PERIOD!

 


Microsoft PowerPoint is the last tool you should focus on in the Microsoft Office Suite. It is a tool for creating presentations that you can show to your clients. PowerPoint can produce professional-looking slides that will make you look like a total pro!

 


Like the previous software—Word and Excel—there is a lot of functionality, way too much to go over in this post. But there are so many content creators on YouTube who have tutorials for you to learn PowerPoint. It is well worth the hour you spend watching a well-articulated video if you can learn a new skill.

 

Let’s move on to File Management and Organization.

 

FILE MANAGEMENT

 


As you begin and complete more projects for your clients, you will need to have an organized computer in which you can find your project files easily. Therefore, you should be able to:

 


- Create, rename, and delete “folders” to store your documents in

 

- Add multiple “folders” within a main folder (For example, you might have a main folder named “Clients,” and within it, folders named “High Profile Clients” and “Lower Profile Clients”; a small script version of this example appears right after this list)

 

- Extend your computer’s storage capacity by using USB flash drives or external hard drives

 

- Use cloud storage services, like Dropbox and Google Drive.
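If you prefer doing things with a script, here is a tiny sketch in R (the language I use elsewhere on this blog) that builds the hypothetical “Clients” folder structure from the example above. This is just an illustration; clicking through File Explorer (Windows) or Finder (Mac) accomplishes exactly the same thing.


# Hypothetical folder layout from the example above; change the base path
# to wherever you keep your work.
base <- file.path("~", "Clients")

dir.create(base, showWarnings = FALSE)
dir.create(file.path(base, "High Profile Clients"), showWarnings = FALSE)
dir.create(file.path(base, "Lower Profile Clients"), showWarnings = FALSE)

# Confirm that the folders now exist.
list.dirs(base, recursive = TRUE)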

 


These are just the basics. You might try a free computer basics course on YouTube. The one I am referring to has nearly 40,000 views and runs about 12 hours. It is called:

 

“Basic Computer Skills for the Workplace in 2021 – 12 Hours of Free Tech Training”

 

 

Let’s briefly discuss Operating Systems.

 


WINDOWS OPERATING SYSTEM AND MAC

 

There are two very popular operating systems, supplied by Apple (macOS) and Microsoft (Windows). There is another operating system called Linux that is gaining popularity because of the ecosystem of free and open-source software that accompanies it.


However, you should strive to be comfortable with Windows 11 and macOS, as they are the two most commonly used.

 

I will say this again, and again, and again, and again! Use YouTube to get comfortable with these systems, and with ANYTHING else you don’t know. If there is something you don’t know, on any subject or concept, the chances are very likely that someone has created educational content for that topic.

 

Let’s talk about Email Formalities.

 

EMAIL FORMALITIES

 

As an entrepreneur, you will be using email A LOT! So, you should be able to use it effectively. Here are some quick tips that you can consider when creating an email to a client:

 

- Your email should open with a greeting or salutation (like “Dear Mr. So-and-So” or “Greetings, Miss So-and-So”)

 

- Include your name and a brief, concise introduction of who you are and why you are emailing them

 

- Emails should be short and to the point if you are emailing a stranger, but can be longer when emailing someone who is familiar with you

 

- Use complete and well-thought-out sentences. This is NOT a text message, so abbreviations are not allowed

 

- Be respectful ALWAYS

 

- Be able to attach documents to your email, like Word, Excel, and PowerPoint files

 

- Use a sign-off like “Best regards” or “Sincerely.” Then use the next line for your name and appropriate contact information, like your mobile number or email address

 

- Gmail (from Google) and Outlook (from Microsoft) are two of the most popular email services

 

 

THANK YOU

 

I hope that this post can help you become more adept at using computers, or at least give you the fodder for finding resources that will help you with that. There was no talk about machine learning and data analytics in this post, because I wanted to provide you all with a more practical post that you can act on and implement immediately in your business.

 

Thank you so much for using my website as your source for learning to become the best versions of yourselves. Transformation and cutting out toxicity from your life is so difficult to do, and you have my respect and admiration! You are all HEROES!


Wednesday, August 10, 2022

Essay: My Love For Linear Regression, LASSO, And Business Education

Two years ago, during the summer of 2020, I was enrolled in a master’s program in management at Georgetown University. It is a generalist degree where students learn about the broader aspects of management.


Topics like Finance, Accounting, Economics, and Supply Chain Management were covered, as well as Consulting, and Business Communication. I didn’t end up finishing the program; I withdrew due to COVID health concerns.

 

Yet, despite what by all accounts would have been a negative experience, there was one course that resonated with me: Business Analytics. It was a discipline I had read about in Harvard Business Review, but I never had a concrete sense of what exactly the topic was.


The course itself, on a fundamental level, is a statistics course combined with the R programming language. The more I learned, the broader my perspective grew. The course is so much MORE than another math class; the skills it teaches can aid business managers in making high-impact decisions within their organizations.

 

Now, this class was tough. Homework assignments were 3, no, 4 hours long. Students needed to assimilate the statistical material learned in class and apply it in the R programming language.


In fact, coding was the main component of the class. My instructor, Professor Jose, had one goal for the entire class: to facilitate a learning environment in which students would be able to run a linear regression model from scratch. This is a goal worthy of anyone learning analytics, and I am optimistic about learning even more advanced methods for making predictions from any data set.

 

Since Georgetown, I have taken two sequential semesters of statistics in my hometown, Miami (at FIU), to broaden my skill set and to become familiar with the mathematical concepts I had difficulty understanding while I was at Georgetown.


I am so much better at statistics, having done this. As a side note, I recommend continuing education at the college level during the evening, in conjunction with work during the day. You can use what you learn in the classroom immediately the next day in your work.

 

At this point, I am looking into internships, training and development programs at management consulting firms, and even research positions, that will help me acquire more analytical skills. 


I am now enrolled in a Business Analytics program at DePaul University. For the next year or so, I will be learning how to use data to make predictions about certain business questions and problems. I couldn’t be more excited.

 

Now that you are acquainted with my situation, I will get to the point of this essay—Multivariate Linear Regression.

 

Since the summer of 2021, I have been studying and learning R programming, and my experience couldn’t be more satisfying. There is something empowering about typing in a line of code and NOT seeing an error message. Those of you who have been frustrated with programming and coding will know what I am referring to.

 

At DePaul, I am learning how to code linear regression models, and there are two fundamental ways of model building that have caught my interest. 


The first way to create a linear model is to build it from the bottom up. That is, you pick a response variable, the variable that you wish to predict, and predictor variables, the variables that you believe impact the response.


The method for standard linear modeling is straightforward: you begin with a response and build the model with predictors that have a high correlation with that response. You then build out your model, adding a single predictor at a time, until additional predictors no longer improve the amount of variation explained, as measured by Adjusted R-Squared.
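To make that concrete, here is a minimal sketch of the bottom-up approach in R, using the built-in mtcars data set as a stand-in for any data (my own toy illustration, not an assignment from the course):


# Start with the single strongest predictor, then add one predictor at a time
# and watch the Adjusted R-Squared after each addition.
data(mtcars)

m1 <- lm(mpg ~ wt, data = mtcars)              # response ~ one predictor
m2 <- lm(mpg ~ wt + hp, data = mtcars)         # add a second predictor
m3 <- lm(mpg ~ wt + hp + qsec, data = mtcars)  # add a third

# Keep adding predictors only while this number keeps improving.
summary(m1)$adj.r.squared
summary(m2)$adj.r.squared
summary(m3)$adj.r.squared


In practice you would also check the p-value of each new predictor before deciding whether to keep it in the model.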

 

As amazing as this method is for building multivariate linear models, there is another method that I have fallen in love with since learning of it. This method is known as LASSO regression. It is the opposite of the standard approach to linear regression model building in that the user can build their models from the TOP down.


Programmers start with ALL of the candidate variables in a data set, and the algorithm applies a penalty that shrinks the coefficients of weak predictors toward zero, setting some of them exactly to zero. Analysts can then “feature select” the variables that survive the penalty and use them to build an effective regression model. This regularization technique is especially useful if there are dozens, if not hundreds, of predictor variables in a data set.
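For anyone curious what this looks like in code, here is a rough sketch of a LASSO fit using the glmnet package, again with mtcars as a placeholder data set (a generic illustration, not the exact workflow from my coursework):


# glmnet expects a numeric matrix of predictors and a response vector.
library(glmnet)

data(mtcars)
x <- as.matrix(mtcars[, -1])   # every candidate predictor
y <- mtcars$mpg                # the response

# alpha = 1 requests the LASSO penalty; cross-validation chooses how strongly
# to shrink the coefficients.
cv_fit <- cv.glmnet(x, y, alpha = 1)

# Coefficients shrunk exactly to zero are the variables LASSO dropped;
# the survivors are the "feature selected" predictors.
coef(cv_fit, s = "lambda.min")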

 

The LASSO regression algorithm is more in-depth, and though I have a basic understanding of it, I plan to make use of my time to run one from scratch. The more variables, the more satisfying, the more glory! 


I wish to use this brief essay to document my initial interests in regression analysis so that I can begin to acquire more advanced skills, and eventually, add value to an organization or business entity.


Brief Reviews On Data Visualization and Multicollinearity: A Humble Take On The Current Landscape

Hello. For this post, I thought it might be prudent to cite two articles and critique them. The first article is concerned with Boxplots, and the second has to do with multicollinearity. 


The goal of this exercise is to read an article regarding any matter of data analytics, digest the article, and offer a reaction to the authors. 


In this exercise, there is a need to think critically about the topic, and about how to utilize the information to become a better data analyst.


I hope this helps! 



 “Why Is a Boxplot Better Than a Histogram?” 


Click here for source.


The fundamental concept behind this article is to plot your data BEFORE you do any sort of analysis of the data. But how should you plot the data? What type of visualization should one use at the initial stage of their analysis? 


The author of this article argues that boxplots should be the preferred standard for visualizing data, and that histograms, although more commonly used, present the data in ways that can be misinterpreted. The author proposes that boxplots can be a remedy for the misinterpretations that histograms may cause.


This is the main goal, or thesis of this article.

 

Specifically, boxplots have several advantages when it comes to visualizing data. Along with being able to plot multiple sets of data at once, boxplots can show outliers and suggest the shape of each distribution, including whether the data is roughly normal or skewed.


Unlike histograms, boxplots are adept at showing the median, the first and third quartiles, and an approximation of the minimum, maximum, and total range. To be clear, histograms hint at this information as well, but there is much more guesswork involved, as the values can only be approximated by visual inspection.


Finally, boxplots are easier to read and interpret compared to histograms.

 

After reading this article, it is clear that the author prefers boxplots over histograms because of their clear advantages. Noting that rationale, I am going to begin using boxplots in my analysis, but before I yield to the author’s suggestion, I would like to make an argument in favor of the ease of use of histograms.


In the early stages of analysis, I feel that one should be concerned with whether both the quality and the quantity of information are appropriate for that specific stage. Let me explain.


Let’s assume that a business manager or supervisor gives their subordinate the task of fitting a regression line through a set of observations. Let’s also assume that this task uses data sets that are primarily numerical, and that there are very few categorical variables of interest. 


We might also assume that this task might be industry-specific; data sets that encompass customer profiles in banking might be different than data sets that encapsulate patient medical history. 


With this in mind, I argue that depending on the STAGE of the analysis (Exploratory Data Analysis, for instance) and the TYPE of data sets that are used, there might be a preference for the ease and simplicity of a histogram.


If I already know, based on prior professional experience, that a regression analysis is the goal, then I might not need to know the total range of a data set (first and third quartile, median, minimum and maximum) until further into my analysis. 


Consequently, it might be preferable to run “proc means” (in SAS) or the “summary()” function (in R) to get a more accurate account of those descriptive statistics. If my goal is to determine the correlation between my response and explanatory variables of interest, then a quick check with a histogram might be the simpler approach, as I don’t yet require the additional information of a boxplot.


Boxplots are not useful visualizations for describing a correlation between two variables, and that is why I would use a histogram IN THE EARLY stages of an analysis. 
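As a small illustration of the early-stage workflow I am describing, here is what those quick checks might look like in R, using the built-in mtcars data as a stand-in (my own toy example, not one from the article):


data(mtcars)

# Descriptive statistics (min, quartiles, median, mean, max) for every column;
# this is the R counterpart to SAS's "proc means".
summary(mtcars)

# A fast, low-effort look at one variable's distribution before modeling.
hist(mtcars$mpg, main = "Quick histogram check", xlab = "mpg")

# The boxplot the author recommends, shown here for comparison.
boxplot(mtcars$mpg, main = "Boxplot of the same variable")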


In all fairness, the author of the article DOES make the point, almost too briefly, that boxplots can also be used alongside a histogram to provide richer detail about the data in question.


This might be a practical approach, and it might require more effort to code into the statistical software, but it could be worth the effort if the output provides a more meaningful account of what is happening in the data. 


For my own approach to analysis, it would make sense for me to explore boxplots more closely to see if I can reach an agreement with the author. At the moment, I am most familiar with regression analysis, so histograms serve my initial goal of determining correlation between my response and explanatory variables. 


Yet, there are other algorithms and methods that are used in data analysis and machine learning, like classification and clustering methods, which require a different mindset; the problems that these additional methods solve are different from regression.

 

 

“Multicollinearity in Regression Analysis: Problems, Detections and Solutions,” by Jim Frost. 


Click here for source. 


What is multicollinearity, and why is it an issue in regression analysis? Statistics professional and thought leader, Jim Frost, says that multicollinearity occurs when explanatory variables in a regression model are highly correlated. This can cause concern as explanatory variables are supposed to be independent of one another. 


In this article, Frost’s main point is to show analysts how to detect multicollinearity between explanatory variables, and finally how to resolve issues of multicollinearity.

 

One can make the claim that one of the most important goals of a regression analysis is to “isolate the relationship between” the response variable and the explanatory variables. And since this goal is so vital to this type of analysis, it is logical to conclude that multicollinearity would present an issue in regression; if explanatory variables are moving together instead of varying independently, then there is an issue that needs to be corrected.


If an analyst decides that a Multiple Linear Regression is the best approach to making a prediction from the data, they need to be confident that every 1-unit increase in an explanatory variable, given that all other variables are held constant, will change the response by that variable’s coefficient, Beta 1.


But this can’t happen if independent variables display multicollinearity with other independent variables. At that point, the regression model is almost completely useless!
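As a concrete illustration of the interpretation at stake (a toy example of my own, not one from Frost’s article):


# Fit a two-predictor model and read off a coefficient.
data(mtcars)
fit <- lm(mpg ~ wt + hp, data = mtcars)
coef(fit)

# The coefficient on wt is the estimated change in mpg for a 1-unit increase
# in wt while hp is held constant -- exactly the interpretation that
# multicollinearity undermines.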

 

Frost suggests that there are two key issues with multicollinearity. The first is that beta coefficients can have a “high sensitivity to change,” meaning the estimated coefficients can swing to extreme values, high or low, and such unstable estimates can lead to an inaccurate prediction model.


Second, since multicollinear variables can produce extreme beta coefficients, the statistical significance (p-values) of the variables in the final model can be miscalculated. For this reason, an analyst can NOT trust those levels of significance.


So then, how does an analyst address issues of multicollinearity? Frost argues that there are methods to fix these issues, but boldly suggests that issues of multicollinearity depend heavily on the “primary goal of the regression analysis.” 


Frost proposes that moderate multicollinearity might not need to be resolved, as it may not meaningfully affect the regression model, although this approach (or neglect) should only be taken on a case-by-case basis. He also posits that variables with a multicollinear relationship might not end up in the final model at all, and therefore do NOT need to be addressed early on.


Yet, for analysts who are conservative and meticulous, there is a diagnostic called the Variance Inflation Factor, or VIF. In the modeling process, VIF can be used to measure how strongly each independent variable is correlated with the other independent variables, and predictors with scores above a certain threshold can be removed from the prediction model.
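Here is a minimal sketch of that VIF check in R, using the car package and the built-in mtcars data (my own example, not Frost’s):


library(car)

data(mtcars)
fit <- lm(mpg ~ wt + hp + disp + qsec, data = mtcars)

# One VIF score per predictor; a common rule of thumb treats values above
# roughly 5 (or 10) as a sign of problematic multicollinearity.
vif(fit)

# R-Squared versus Adjusted R-Squared, for the comparison discussed below.
summary(fit)$r.squared
summary(fit)$adj.r.squared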


In my own analysis, I have been taught to use VIF metrics to determine which explanatory variables to remove during the modeling process. 


Also, I use the Adjusted R-Square score compared to the R-Square score, as this metric provides a more meaningful measure of predictability—penalizing models that rely on a large quantity of variables at the expense of statistical significance.


With the combination of VIF metrics, to remove variables, and relying on an Adjusted R-Square score, I feel confident that I can create a model that is, at the very least, a conservative estimation of future response values. 


Since I am not a highly skilled analyst (yet), I will be meticulous and remove variables that the VIF method highlights. The author Jim Frost is not the only analytics thought leader to exude an “it depends on the data/situation” mentality.


He is more astute at analytics and therefore more skilled at determining which variables to remove and which to leave in the model. I am not at his level of preeminence, so for now, I will remove any variable with the smallest hint of multicollinearity.


Clustering Analysis of Universities: Schools in Alabama, California, and Florida According to SAT Averages and Admission Rates

Thank you for checking out my post on Clustering earlier last week. I wanted to expand on the method with an exercise on clustering universities in the states of Alabama, California, and Florida. I hope you have fun with this one. I know I did!


Let’s begin by loading packages, setting our working directory, and loading our data set. We will be using the “college” data set created by the US Department of Education and cleaned by Fred Nwanganga and Mike Chapple.


As always, if you wish to follow along, you can download the data set on the Google Drive by clicking on the "Data Sets" tab on the homepage. And like all of my other posts, the code that you can replicate is highlighted in BLUE.


library(tidyverse)

## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --

## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.4     v dplyr   1.0.7
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   2.0.1     v forcats 0.5.1

## Warning: package 'stringr' was built under R version 4.1.2

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

setwd("C:/Users/firstnamelastinitial/OneDrive/Documents/R")

college <- read_csv("college.csv", col_types = "nccfffffnnnnnnnnn")


Let’s take the time to do some fundamental exploratory data analysis:


dim(college)

## [1] 1270   17

str(college)

## spec_tbl_df [1,270 x 17] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ id                : num [1:1270] 102669 101648 100830 101879 100858 ...
##  $ name              : chr [1:1270] "Alaska Pacific University" "Marion Military Institute" "Auburn University at Montgomery" "University of North Alabama" ...
##  $ city              : chr [1:1270] "Anchorage" "Marion" "Montgomery" "Florence" ...
##  $ state             : Factor w/ 51 levels "AK","AL","AR",..: 1 2 2 2 2 2 2 2 2 2 ...
##  $ region            : Factor w/ 4 levels "West","South",..: 1 2 2 2 2 2 2 2 2 2 ...
##  $ highest_degree    : Factor w/ 4 levels "Graduate","Associate",..: 1 2 1 1 1 1 1 1 1 1 ...
##  $ control           : Factor w/ 2 levels "Private","Public": 1 2 2 2 2 2 2 1 2 2 ...
##  $ gender            : Factor w/ 3 levels "CoEd","Women",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ admission_rate    : num [1:1270] 0.421 0.614 0.802 0.679 0.835 ...
##  $ sat_avg           : num [1:1270] 1054 1055 1009 1029 1215 ...
##  $ undergrads        : num [1:1270] 275 433 4304 5485 20514 ...
##  $ tuition           : num [1:1270] 19610 8778 9080 7412 10200 ...
##  $ faculty_salary_avg: num [1:1270] 5804 5916 7255 7424 9487 ...
##  $ loan_default_rate : num [1:1270] 0.077 0.136 0.106 0.111 0.045 0.062 0.096 0.007 0.103 0.063 ...
##  $ median_debt       : num [1:1270] 23250 11500 21335 21500 21831 ...
##  $ lon               : num [1:1270] -149.9 -87.3 -86.3 -87.7 -85.5 ...
##  $ lat               : num [1:1270] 61.2 32.6 32.4 34.8 32.6 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   id = col_number(),
##   ..   name = col_character(),
##   ..   city = col_character(),
##   ..   state = col_factor(levels = NULL, ordered = FALSE, include_na = FALSE),
##   ..   region = col_factor(levels = NULL, ordered = FALSE, include_na = FALSE),
##   ..   highest_degree = col_factor(levels = NULL, ordered = FALSE, include_na = FALSE),
##   ..   control = col_factor(levels = NULL, ordered = FALSE, include_na = FALSE),
##   ..   gender = col_factor(levels = NULL, ordered = FALSE, include_na = FALSE),
##   ..   admission_rate = col_number(),
##   ..   sat_avg = col_number(),
##   ..   undergrads = col_number(),
##   ..   tuition = col_number(),
##   ..   faculty_salary_avg = col_number(),
##   ..   loan_default_rate = col_number(),
##   ..   median_debt = col_number(),
##   ..   lon = col_number(),
##   ..   lat = col_number()
##   .. )
##  - attr(*, "problems")=<externalptr>


colnames(college)

##  [1] "id"                 "name"               "city"             
##  [4] "state"              "region"             "highest_degree"   
##  [7] "control"            "gender"             "admission_rate"   
## [10] "sat_avg"            "undergrads"         "tuition"          
## [13] "faculty_salary_avg" "loan_default_rate"  "median_debt"      
## [16] "lon"                "lat"


There are a total of 1,270 rows and 17 columns. We can confirm the data types of each of the 17 variables with the str() function.


This time, we will do our clustering analysis on the state of Alabama. So let’s create a new data set with ONLY schools from this state.


alabama_schools <- college %>%
  filter(state == "AL") %>%
  column_to_rownames(var = "name")


After we pass the state code “AL” to the filter() function, we use the column_to_rownames() function, which lets us see the name of each school as the row label for our observations. Let’s view the new data set:


View(alabama_schools)


We now have 24 observations, or schools, and 16 variables in our new data set with only schools from Alabama. Now, let’s use the select() function to specify which features we are interested in. 


For this case, we are interested in admission_rate and sat_avg. Sorry for making that decision for you!! 


alabama_schools %>%
  select(admission_rate, sat_avg) %>%
  summary()

##  admission_rate      sat_avg   
##  Min.   :0.4414   Min.   : 811 
##  1st Qu.:0.5309   1st Qu.: 969 
##  Median :0.5927   Median :1035 
##  Mean   :0.6523   Mean   :1033 
##  3rd Qu.:0.8064   3rd Qu.:1109 
##  Max.   :1.0000   Max.   :1219


Before moving on, I would like to observe that we are using the pipe (%>%) operator which allows us to compose cleaner code. Let’s move on:


As expected, the two variables have vastly different ranges. As a reminder, we MUST normalize our variables BEFORE building our model.


alabama_schools_scaled <- alabama_schools %>%
  select(admission_rate, sat_avg) %>%
  scale()


We create a new object from our alabama_schools data set that uses the scale() function to “normalize” our data to z-scores. This puts both variables on a comparable scale for the clustering model.
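For reference, scale() simply converts each value to a z-score: it subtracts the column mean and divides by the column standard deviation. Here is a quick sanity check on one column, using the objects we already created:


# Manual z-score for sat_avg, computed from the unscaled data.
manual_z <- (alabama_schools$sat_avg - mean(alabama_schools$sat_avg)) /
  sd(alabama_schools$sat_avg)

# These two should match.
head(manual_z)
head(alabama_schools_scaled[, "sat_avg"])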


Let’s see the new values for each variable by piping the summary() function into our new data object.


alabama_schools_scaled %>%
  summary()

##  admission_rate       sat_avg       
##  Min.   :-1.3758   Min.   :-1.92418 
##  1st Qu.:-0.7922   1st Qu.:-0.55343 
##  Median :-0.3884   Median : 0.01916 
##  Mean   : 0.0000   Mean   : 0.00000 
##  3rd Qu.: 1.0051   3rd Qu.: 0.66332 
##  Max.   : 2.2685   Max.   : 1.61547


Our variables have been standardized to z-scores and our data is officially normalized. Remember, by normalizing our data we don’t need to worry about extreme differences in scale between our variables. Let’s load the stats package (which ships with base R) so we can run the algorithm.


Let’s set the seed so we can have reproducible results.


library(stats)

set.seed(1234)


Notice that we have set our seed so that the algorithm’s random starting points are reproducible. Simply remember the number you use in the argument so that you can reproduce the same results every time you run this algorithm.


Reminder:


The kmeans() function takes three arguments here. Our scaled data set, which recall is alabama_schools_scaled, is the first argument. centers determines how many cluster centers we will have; this is essentially our k-value, and we set it to 3 because we want three clusters. nstart is set to 25, which is the number of random starting configurations the algorithm tries before keeping the best one. Let’s run it!


k_3 <- kmeans(alabama_schools_scaled, centers = 3, nstart = 25)


Great! Let’s start to explore this cluster. Let’s see how many observations are in each of the three clusters.


k_3$size

## [1]  8  6 10


This output tells us that one cluster has 8 observations, another has 6 observations, and the third cluster has the remaining 10 observations.


We can also get a sense of the values of the three cluster centers by using the centers attribute.


k_3$centers

##   admission_rate    sat_avg
## 1     -0.6824578  0.5743977
## 2     -0.8282133 -1.2431427
## 3      1.0428942  0.2863675


Let’s break this down per cluster:


The first cluster center has a value of roughly -.68 for admission_rate and .57 for sat_avg.


The second cluster center has a value of roughly -.82 for admission_rate and -1.24 for sat_avg.


The third cluster center has a value of roughly 1.04 for admission_rate and approximately .28 for sat_avg.


THESE VALUES ARE NORMALIZED VALUES OF THE ORIGINAL DATASET!!!


With this in mind, we have lost some ability to interpret this part of the data. This is fine as our objective is to group these schools as accurately as possible.
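As a small aside (optional, and not needed for the rest of the post), you can convert the normalized centers back into the original units by reversing the scaling, since scale() stores the means and standard deviations it used as attributes:


# Undo the z-score transformation on the cluster centers.
centers_original <- t(
  t(k_3$centers) * attr(alabama_schools_scaled, "scaled:scale") +
    attr(alabama_schools_scaled, "scaled:center")
)

centers_original   # admission rates and SAT averages on their original scales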


What good is clustering if we can’t visualize our model? Let’s load another package called factoextra.


library(factoextra)

## Warning: package 'factoextra' was built under R version 4.1.2

## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa


Next, we will pass three arguments into a very powerful function called fviz_cluster(). The first argument is the object where our model is stored, k_3. The second is the data set, which is the scaled data set, alabama_schools_scaled. The last is repel = TRUE, which keeps the school labels from overlapping on the plot.


fviz_cluster(k_3, data = alabama_schools_scaled, repel = TRUE)

## Warning: ggrepel: 1 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps


Alabama Schools Cluster


There is a lot to unpack here, but for brevity, I will describe the clusters as simply as possible.


Cluster 1 is labeled in red, and as you can see, it contains the schools with the highest SAT averages and the lowest admission rates. I would argue that the 8 schools in this cluster are the most elite schools in Alabama, as measured by the two metrics we have used.


Cluster 2 is labeled in green. Universities in this cluster have the lowest SAT averages and also have low admission rates.


Cluster 3 is labeled in blue. Schools in this cluster have high admission rates and above-average SAT scores; this cluster is similar to cluster 1 with regard to SAT scores.


I suppose all states’ school clusters will have similar arrangements in two-dimensional space if they are measured by the same features, sat_avg and admission_rate. 


Let’s see how three clusters look for the states of California and Florida, just to test my hypothesis.


california_schools <- college %>%
  filter(state == "CA") %>%
  column_to_rownames(var = "name")



california_schools %>%
  select(admission_rate, sat_avg) %>%
  summary()

##  admission_rate      sat_avg     
##  Min.   :0.0509   Min.   : 871.0 
##  1st Qu.:0.4051   1st Qu.: 993.5 
##  Median :0.5800   Median :1086.0 
##  Mean   :0.5477   Mean   :1113.7 
##  3rd Qu.:0.7425   3rd Qu.:1222.0 
##  Max.   :0.8750   Max.   :1545.0

california_schools_scaled <- california_schools %>%
  select(admission_rate, sat_avg) %>%
  scale()



set.seed(1234)


k_3 <- kmeans(california_schools_scaled, centers = 3, nstart = 25)


fviz_cluster(k_3, data = california_schools_scaled, repel = TRUE)

## Warning: ggrepel: 58 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps


California Schools Cluster

Now for Florida:


florida_schools <- college %>%
  filter(state == "FL") %>%
  column_to_rownames(var = "name")


florida_schools %>%
  select(admission_rate, sat_avg) %>%
  summary()

##  admission_rate      sat_avg   
##  Min.   :0.2286   Min.   : 803 
##  1st Qu.:0.4481   1st Qu.: 948 
##  Median :0.5188   Median :1065 
##  Mean   :0.5436   Mean   :1057 
##  3rd Qu.:0.6157   3rd Qu.:1153 
##  Max.   :1.0000   Max.   :1330

florida_schools_scaled <- florida_schools %>%
  select(admission_rate, sat_avg) %>%
  scale()



set.seed(1234)


k_3 <- kmeans(florida_schools_scaled, centers = 3, nstart = 25)


fviz_cluster(k_3, data = florida_schools_scaled, repel = TRUE)

## Warning: ggrepel: 19 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps


Florida Schools Cluster


One observation I would like to make is that the clusters are labeled differently from graphic to graphic. For example, Alabama’s elite schools are labeled red, California’s elite schools are labeled green, and Florida’s top schools are labeled blue.


Yet, regardless of the color labels, we can see that elite-level schools are positioned in the upper-left portion of each plot.



THANK YOU!


Thank you so much for reading this different take on Clustering. This machine learning method is used widely in a Marketing context, namely when managers are looking to categorize, or "segment," groups of customers.


I am proud of all of you! I know that the steps that you are taking towards a safer, healthier life, are tedious, and at times cumbersome. 


But you know what? 


Your dedication to ridding yourselves of a toxic profession will pay off. You will live happier lives, filled with new found love from family and friends, and even acquaintances who are drawn to your hard work and professional ethic. 


Thank you for allowing me to be part of your journey to becoming data analysts and data entrepreneurs.

