Hello ladies, thank you again so much for visiting my blog. I hope that you are having fun learning the basics of Machine Learning and Regression. I am pleased, yet not at all surprised, that you have made it this far in my blog. You are all so brave and gritty!
Let’s begin today’s post with a concept that can help you in your Regression modeling. I call it PROTOTYPING.
Recently, I came across a data set with nearly 7 million
rows. What in the world?!?! I didn’t even know that a data set could be so
large. Now, this may actually seem like a small data set compared to some of
the sets used by industry professionals.
But what tools do they use to get the job done? Perhaps AWS, or Microsoft Azure, with their seemingly unlimited cloud computing resources? Insert the name of any enterprise-level cloud computing solution right here and make it snappy!
My point: What if we didn’t have access to all the
solutions that the “pros” use? What if we had different goals in mind, and
different use cases for the data set in question?
This post is geared toward those who don't have access to subscription-based cloud computing, and who are not employed by a top-tier organization with access to superior hardware. I am mainly targeting students and digital entrepreneurs who are learning the basics of analytics and looking for tools to help build their project portfolios. This one is for you!
*** STEP 1: Load library, set working directory, and load data set.
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.4     v dplyr   1.0.7
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   2.0.1     v forcats 0.5.1
## Warning: package 'stringr' was built under R version 4.1.2
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
setwd("C:/Users/firstnamelastinitial/Documents/R")
Now, I am going to load, quite painfully, the data set that I obtained from the University of Chicago website for Pre-Doctoral Business PhD candidates. This enormous monstrosity is the set that the admissions committee asks Pre-Doctoral candidates to analyze as one of two application tasks.
Here is the link to that site. Download their data
set to follow along… if you are adventurous:
wage <- read.csv("cps_wages_LFP.csv", header = TRUE)
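A quick aside: base R's read.csv() is notoriously slow on files this size, which explains the pain. If your machine is struggling, two commonly used and much faster readers are readr's read_csv() (it ships with the tidyverse we just loaded) and data.table's fread(). Here is a minimal sketch, assuming the same file name; fread() requires the data.table package to be installed:
wage <- readr::read_csv("cps_wages_LFP.csv") # tidyverse reader; returns a tibble
# wage <- data.table::fread("cps_wages_LFP.csv") # often faster still; returns a data.table
Either result works with the row indexing we use below.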
*** STEP 2: Examine the data to get a sense of how much you will need to partition it.
glimpse(wage)
## Rows: 6,883,923
## Columns: 23
## $ year       <int> 1977, 1977, 1977, 1977, 1977, 1977, 1977, 1977, 1977, 1977,~
## $ statefip   <chr> "alabama", "alabama", "alabama", "alabama", "alabama", "ala~
## $ month      <chr> "march", "march", "march", "march", "march", "march", "marc~
## $ wtsupp     <dbl> 1443.85, 1592.53, 1229.96, 1472.76, 1503.37, 1328.25, 1847.~
## $ age        <chr> "15", "4", "43", "34", "15", "19", "18", "7", "51", "26", "~
## $ sex        <chr> "male", "female", "male", "female", "male", "male", "female~
## $ race       <chr> "white", "white", "white", "white", "white", "white", "whit~
## $ hispan     <chr> "not hispanic", "not hispanic", "not hispanic", "not hispan~
## $ educ       <chr> "grade 8", "", "grade 4", "grade 8", "grade 8", "12th grade~
## $ empstat    <chr> "nilf, school", "", "at work", "at work", "nilf, school", "~
## $ occ        <int> NA, NA, 473, 283, NA, 753, 623, NA, 703, NA, 663, NA, NA, N~
## $ wkswork1   <int> NA, NA, 52, 52, NA, 39, 13, NA, 36, NA, 52, NA, NA, NA, 52,~
## $ wkswork2   <chr> "", "", "50-52 weeks", "50-52 weeks", "", "27-39 weeks", "1~
## $ uhrsworkly <int> NA, NA, 58, 57, NA, 50, 55, NA, 20, NA, 40, NA, NA, NA, 40,~
## $ inctot     <int> 0, NA, 11200, 0, 3043, 4020, 600, NA, 1625, 0, 8200, NA, NA~
## $ incwage    <int> 0, NA, 10400, 0, 0, 3300, 600, NA, 1170, 0, 8200, NA, NA, 0~
## $ age_group  <chr> "age < 25", "age < 25", "25 <= age < 45", "25 <= age < 45",~
## $ white      <int> 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,~
## $ skilled    <int> 0, NA, 0, 0, 0, 0, 0, NA, 0, 0, 0, NA, NA, 0, 1, 0, NA, NA,~
## $ hours      <int> NA, NA, 3016, 2964, NA, 1950, 715, NA, 720, NA, 2080, NA, N~
## $ wage       <dbl> NA, NA, 3.448276, 0.000000, NA, NA, NA, NA, NA, NA, 3.94230~
## $ lfp        <chr> "Not in labor force", "", "In labor force", "In labor force~
## $ empstatid  <chr> "Not In Labor Force", "", "Employed", "Employed", "Not In L~
As we can see, the set has 6,883,923 rows by 23 columns. That is crazy! I have very limited resources at my disposal, mainly limited computing hardware and limited time to process the data.
Luckily, there is a solution!!! Instead of trying to explore this data set and build sophisticated models on the whole thing, maybe we can shrink it to a more manageable chunk, or “sample.”
This makes perfect sense to me. I can still use a random sample of this data set to create my predictive models. I can STILL create a prototype based on a mere leaf of this data tree. So, I will set the seed so that my random sample is reproducible.
*** STEP 3: Set seed, and load “stats” library.
set.seed(1234) # fix the random number generator so the sample is reproducible
library(stats) # stats is attached by default in R, so this line is optional
*** STEP 4: Create Sample of the data set.
Now I will create an object that stores the row indices returned by the sample() function. Then, I will use this object to index rows into a new data frame called wage_mini. Notice that in the sample_set object, I am retaining 0.5% of the entire data set. Even this smaller number will yield plenty of observations for us to prototype predictive models. The code appears as such:
sample_set <- sample(nrow(wage), nrow(wage) * 0.005, replace = FALSE) # draw 0.5% of the row indices, without replacement
wage_mini <- wage[sample_set, ] # keep only the sampled rows
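Side note: since dplyr is already attached with the tidyverse, the same two lines can be collapsed into a single call to slice_sample(). This is just a sketch of the equivalent shortcut:
wage_mini <- slice_sample(wage, prop = 0.005) # randomly keep 0.5% of the rows, without replacement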
*** STEP 5: Examine new data sample to see if the
size is acceptable for prototyping predictive models.
Fantastic! Let’s have a look at the new data set we
have created.
glimpse(wage_mini)
## Rows: 34,419
## Columns: 23
## $ year       <int> 2004, 2010, 1987, 1987, 1984, 2005, 2011, 2015, 2010, 2014,~
## $ statefip   <chr> "hawaii", "washington", "arkansas", "colorado", "virginia",~
## $ month      <chr> "march", "march", "march", "march", "march", "march", "marc~
## $ wtsupp     <dbl> 351.50, 1618.92, 1134.60, 1866.00, 1600.22, 1368.44, 1484.1~
## $ age        <chr> "10", "28", "2", "46", "29", "34", "41", "71", "9", "34", "~
## $ sex        <chr> "female", "male", "male", "female", "male", "male", "female~
## $ race       <chr> "asian only", "white", "black/negro", "white", "white", "wh~
## $ hispan     <chr> "not hispanic", "not hispanic", "not hispanic", "not hispan~
## $ educ       <chr> "", "high school diploma or equivalent", "", "4 years of co~
## $ empstat    <chr> "", "at work", "", "at work", "at work", "nilf, unable to w~
## $ occ        <int> NA, 6350, NA, 156, 19, NA, 350, NA, NA, NA, 230, 552, 1000,~
## $ wkswork1   <int> NA, 52, NA, 40, 52, NA, 52, NA, NA, NA, 26, 52, 52, 52, NA,~
## $ wkswork2   <chr> "", "50-52 weeks", "", "40-47 weeks", "50-52 weeks", "", "5~
## $ uhrsworkly <int> NA, 42, NA, 16, 65, NA, 24, NA, NA, NA, 20, 40, 40, 38, NA,~
## $ inctot     <int> NA, 97200, NA, 8038, 54200, 13032, 30002, 11132, NA, 16981,~
## $ incwage    <int> NA, 97000, NA, 7800, 55000, 0, 30000, 0, NA, 0, 600, 13242,~
## $ age_group  <chr> "age < 25", "25 <= age < 45", "age < 25", "45 <= age < 65",~
## $ white      <int> 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1,~
## $ skilled    <int> NA, 0, NA, 1, 0, 1, 1, 0, NA, 0, 1, 0, 1, 0, 0, NA, 1, NA, ~
## $ hours      <int> NA, 2184, NA, 640, 3380, NA, 1248, NA, NA, NA, 520, 2080, 2~
## $ wage       <dbl> NA, 44.413918, NA, NA, 16.272190, NA, 24.038462, NA, NA, NA~
## $ lfp        <chr> "", "In labor force", "", "In labor force", "In labor force~
## $ empstatid  <chr> "", "Employed", "", "Employed", "Employed", "Not In Labor F~
Notice that the new data frame has only 34,419 rows, significantly smaller than the nearly 7 million we started with. If 34,419 rows are too few for you to prototype a predictive model, repeat the above process, but instead of using “0.005” in the sample_set object, use “0.01” to retain 1% of the data.
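For example, the 1% version of the sampling step would look like this, and should yield roughly twice as many observations (around 68,800 rows):
sample_set <- sample(nrow(wage), nrow(wage) * 0.01, replace = FALSE) # 1% instead of 0.5%
wage_mini <- wage[sample_set, ] # keep only the sampled rows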
Notice that there are still roughly 2.12 gigabytes of memory in use in your environment; that is because the full wage data frame is still loaded alongside the sample.
The next steps would be to save the wage_mini data frame to a csv file; clear out your environment and console; restart RStudio; and finally, load the wage_mini data frame into your refreshed workspace.
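If you would rather not restart right away, a minimal sketch for freeing that memory in place is to remove the full data frame and trigger garbage collection:
rm(wage) # remove the full 6.9-million-row data frame from the environment
gc() # prompt R to release the freed memory back to the operating system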
*** STEP 6: Create csv file with new data frame, clear workspace, and restart RStudio to begin exploratory analysis.
Let me save this new data frame into a csv file. The name
should be sensible and easy to remember:
write.csv(wage_mini, file = "wage_mini.csv")
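One caveat: by default, write.csv() also writes a column of row numbers into the file. If you would rather not carry that extra column into your exploratory analysis, pass row.names = FALSE:
write.csv(wage_mini, file = "wage_mini.csv", row.names = FALSE) # omit the row-number column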
Find the new csv file in your directory, load it into a clean R script, and begin your exploratory analysis.
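Reloading the sample is a one-liner, assuming wage_mini.csv sits in your working directory:
wage_mini <- read.csv("wage_mini.csv", header = TRUE) # reload the 0.5% sample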
Thank you so much for reading this far into my post. I hope it was helpful to you. For those of you who use R and RStudio on a regular basis, please consider donating to their organization so that they can continue to update this amazing tool.
I am not an employee of their organization. I just believe that if a product is adding meaning to my life (which it is!), then I should support the cause! So, I do. The link to donate is here if you feel compelled to do so.
I hope this blog is helping all of you ladies with your transition into a safer, healthier life. I would like to add a laptop roundup/review so that you can get a better sense of which computers to purchase for these projects. Many new laptops have been released as of summer 2022, and new chips, like Intel's 12th Generation i5 and i7, look to be impressive. So stay tuned for that.
Thank you ladies very much for your time and dedication to this blog. Be well!