Wednesday, August 3, 2022

How Porn Stars Can Prototype on Large Sets: Randomly Splitting Your Data Into A Smaller Set to Make Machine Learning Easier

Hello ladies, thank you again so much for visiting my blog. I hope that you are having fun learning the basics of Machine Learning and Regression. I am pleased, yet not at all surprised, that you have made it this far in my blog. You are all so brave and gritty!


Let’s begin today’s post with a concept that can help you in your Regression modeling. I call it PROTOTYPING.

 

Recently, I came across a data set with nearly 7 million rows. What in the world?!?! I didn’t even know that a data set could be so large. Now, this may actually seem like a small data set compared to some of the sets used by industry professionals.

 

But what tools do they use to get the job done? Perhaps AWS, or Microsoft Azure with their seemingly unlimited cloud computing solutions? Insert the name of any enterprise-level cloud computing solution right here and make it snappy!

 

My point: What if we didn’t have access to all the solutions that the “pros” use? What if we had different goals in mind, and different use cases for the data set in question?

 

This post is geared toward those who don’t have access to subscription-based cloud computing, and who are not employed by a top-tier organization with superior hardware.

 

I am mainly targeting students and digital entrepreneurs who are learning the basics of analytics and looking for tools to help build their project portfolios. This one is for you!

 

*** STEP 1: Load library, set working directory, and load data set.

 

library(tidyverse)

## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --

## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.4     v dplyr   1.0.7
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   2.0.1     v forcats 0.5.1

## Warning: package 'stringr' was built under R version 4.1.2

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

setwd("C:/Users/firstnamelastinitial/Documents/R")

 

Now, I am going to load, quite painfully, the data set that I obtained from the University of Chicago website for Pre-Doctoral Business PhD candidates. This enormous monstrosity is the set that the admissions committee asks Pre-Doctoral candidates to analyze as one of two application tasks.

 

Here is the link to that site. Download their data set to follow along… if you are adventurous:

 

wage <- read.csv("cps_wages_LFP.csv", header = TRUE)
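A quick aside: if read.csv() feels painfully slow on a file this size, readr’s read_csv() function, which is already attached as part of the tidyverse, is usually considerably faster. This is just an optional alternative sketch, assuming the same file name; note that it returns a tibble rather than a base data frame:

# optional, faster alternative using readr (already attached via the tidyverse)
wage <- read_csv("cps_wages_LFP.csv")  # returns a tibble instead of a base data frame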

 

*** STEP 2: Examine the data to get a sense of how much you will need to shrink it.

 

glimpse(wage)

## Rows: 6,883,923
## Columns: 23
## $ year       <int> 1977, 1977, 1977, 1977, 1977, 1977, 1977, 1977, 1977, 1977,~
## $ statefip   <chr> "alabama", "alabama", "alabama", "alabama", "alabama", "ala~
## $ month      <chr> "march", "march", "march", "march", "march", "march", "marc~
## $ wtsupp     <dbl> 1443.85, 1592.53, 1229.96, 1472.76, 1503.37, 1328.25, 1847.~
## $ age        <chr> "15", "4", "43", "34", "15", "19", "18", "7", "51", "26", "~
## $ sex        <chr> "male", "female", "male", "female", "male", "male", "female~
## $ race       <chr> "white", "white", "white", "white", "white", "white", "whit~
## $ hispan     <chr> "not hispanic", "not hispanic", "not hispanic", "not hispan~
## $ educ       <chr> "grade 8", "", "grade 4", "grade 8", "grade 8", "12th grade~
## $ empstat    <chr> "nilf, school", "", "at work", "at work", "nilf, school", "~
## $ occ        <int> NA, NA, 473, 283, NA, 753, 623, NA, 703, NA, 663, NA, NA, N~
## $ wkswork1   <int> NA, NA, 52, 52, NA, 39, 13, NA, 36, NA, 52, NA, NA, NA, 52,~
## $ wkswork2   <chr> "", "", "50-52 weeks", "50-52 weeks", "", "27-39 weeks", "1~
## $ uhrsworkly <int> NA, NA, 58, 57, NA, 50, 55, NA, 20, NA, 40, NA, NA, NA, 40,~
## $ inctot     <int> 0, NA, 11200, 0, 3043, 4020, 600, NA, 1625, 0, 8200, NA, NA~
## $ incwage    <int> 0, NA, 10400, 0, 0, 3300, 600, NA, 1170, 0, 8200, NA, NA, 0~
## $ age_group  <chr> "age < 25", "age < 25", "25 <= age < 45", "25 <= age < 45",~
## $ white      <int> 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,~
## $ skilled    <int> 0, NA, 0, 0, 0, 0, 0, NA, 0, 0, 0, NA, NA, 0, 1, 0, NA, NA,~
## $ hours      <int> NA, NA, 3016, 2964, NA, 1950, 715, NA, 720, NA, 2080, NA, N~
## $ wage       <dbl> NA, NA, 3.448276, 0.000000, NA, NA, NA, NA, NA, NA, 3.94230~
## $ lfp        <chr> "Not in labor force", "", "In labor force", "In labor force~
## $ empstatid  <chr> "Not In Labor Force", "", "Employed", "Employed", "Not In L~

 

As we can see, the set has 6,883,923 rows by 23 columns. That, is, crazy! I have very limited resources at my disposal: mainly, limited computing hardware and limited time to process the data.

 

Luckily, there is a solution!!! Instead of trying to explore this data set and build sophisticated models with it as-is, maybe we can shrink the data set down to a more manageable chunk, or “sample.”

 

This makes perfect sense to me. I can still use a random sample of this data set to create my predictive models. I can STILL create a prototype based on a mere leaf of this data tree. So, I will set the seed so that my random sample is reproducible.

 

*** STEP 3: Set seed, and load “stats” library.

 

set.seed(1234)  # fixes the random number generator so the sample is reproducible

library(stats)  # note: stats is attached by default in R, so this line is optional

 

*** STEP 4: Create a sample of the data set.

 

Now I will create an object that stores the row indices returned by the sample() function. Then, I will use this object to index rows into a new data frame called wage_mini. Notice that in the sample_set object, I am retaining 0.5% of the entire data set. Even this smaller number will yield plenty of observations for us to prototype predictive models. The code will appear as such:

 

sample_set <- sample(nrow(wage), nrow(wage) * 0.005, replace = FALSE)  # 0.5% of row indices, drawn at random
wage_mini <- wage[sample_set, ]                                        # keep only the sampled rows
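As a side note, since the tidyverse is already attached, dplyr offers a one-line equivalent. This is just an alternative sketch, not a required step:

# equivalent one-liner with dplyr: 0.5% of the rows, sampled without replacement
wage_mini <- slice_sample(wage, prop = 0.005)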

 

*** STEP 5: Examine new data sample to see if the size is acceptable for prototyping predictive models.

 

Fantastic! Let’s have a look at the new data set we have created.

 

glimpse(wage_mini)

## Rows: 34,419
## Columns: 23
## $ year       <int> 2004, 2010, 1987, 1987, 1984, 2005, 2011, 2015, 2010, 2014,~
## $ statefip   <chr> "hawaii", "washington", "arkansas", "colorado", "virginia",~
## $ month      <chr> "march", "march", "march", "march", "march", "march", "marc~
## $ wtsupp     <dbl> 351.50, 1618.92, 1134.60, 1866.00, 1600.22, 1368.44, 1484.1~
## $ age        <chr> "10", "28", "2", "46", "29", "34", "41", "71", "9", "34", "~
## $ sex        <chr> "female", "male", "male", "female", "male", "male", "female~
## $ race       <chr> "asian only", "white", "black/negro", "white", "white", "wh~
## $ hispan     <chr> "not hispanic", "not hispanic", "not hispanic", "not hispan~
## $ educ       <chr> "", "high school diploma or equivalent", "", "4 years of co~
## $ empstat    <chr> "", "at work", "", "at work", "at work", "nilf, unable to w~
## $ occ        <int> NA, 6350, NA, 156, 19, NA, 350, NA, NA, NA, 230, 552, 1000,~
## $ wkswork1   <int> NA, 52, NA, 40, 52, NA, 52, NA, NA, NA, 26, 52, 52, 52, NA,~
## $ wkswork2   <chr> "", "50-52 weeks", "", "40-47 weeks", "50-52 weeks", "", "5~
## $ uhrsworkly <int> NA, 42, NA, 16, 65, NA, 24, NA, NA, NA, 20, 40, 40, 38, NA,~
## $ inctot     <int> NA, 97200, NA, 8038, 54200, 13032, 30002, 11132, NA, 16981,~
## $ incwage    <int> NA, 97000, NA, 7800, 55000, 0, 30000, 0, NA, 0, 600, 13242,~
## $ age_group  <chr> "age < 25", "25 <= age < 45", "age < 25", "45 <= age < 65",~
## $ white      <int> 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1,~
## $ skilled    <int> NA, 0, NA, 1, 0, 1, 1, 0, NA, 0, 1, 0, 1, 0, 0, NA, 1, NA, ~
## $ hours      <int> NA, 2184, NA, 640, 3380, NA, 1248, NA, NA, NA, 520, 2080, 2~
## $ wage       <dbl> NA, 44.413918, NA, NA, 16.272190, NA, 24.038462, NA, NA, NA~
## $ lfp        <chr> "", "In labor force", "", "In labor force", "In labor force~
## $ empstatid  <chr> "", "Employed", "", "Employed", "Employed", "Not In Labor F~

 

Notice that the new data frame has only 34,419 rows, significantly smaller than the nearly 7 million rows we started with. If 34,419 rows are too few for you to prototype a predictive model, repeat the above process, but instead of using “0.005” in the sample_set object, use “0.01.”
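If you find yourself shrinking data sets like this often, you could wrap the idea in a small helper function. The name prototype_sample below is my own invention, purely a convenience sketch:

# hypothetical convenience helper: keep a random fraction of a data frame's rows
prototype_sample <- function(df, prop = 0.005) {
  df[sample(nrow(df), floor(nrow(df) * prop)), ]  # rows drawn without replacement
}

wage_mini <- prototype_sample(wage, prop = 0.01)  # e.g., keep 1% instead of 0.5%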

 

Notice that there are still roughly 2.12 gigabytes of memory being used in your environment, because the original wage data frame is still loaded.
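If you want to verify this on your own machine, base R’s object.size() function (from the utils package, loaded by default) reports how much memory a single object occupies. The exact figures will vary from machine to machine:

format(object.size(wage), units = "GB")       # the full data set
format(object.size(wage_mini), units = "MB")  # our small prototype set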


The next steps would be to save the wage_mini data frame into a CSV file; clear out your environment and console; restart RStudio; and finally, load the wage_mini data frame into your refreshed workspace.
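In code, the clean-up portion of that might look like the sketch below; the restart itself is done from the RStudio menu via Session > Restart R:

rm(wage, sample_set)  # drop the large objects we no longer need
gc()                  # prompt R to release the freed memory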

 

*** STEP 6: Create a CSV file with the new data frame, clear the workspace, and reload RStudio to begin exploratory analysis.

 

Let me save this new data frame into a CSV file. The name should be sensible and easy to remember:

 

write.csv(wage_mini,file="wage_mini.csv")
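One small caveat worth knowing: by default, write.csv() also writes R’s row names as an extra, unnamed first column, which will reappear as a column called “X” when you read the file back in. If you would rather not carry the old row indices along, you can pass row.names = FALSE:

# same save, but without the old row indices tagging along as a column
write.csv(wage_mini, file = "wage_mini.csv", row.names = FALSE)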

 

Find the new CSV file in your directory, load it into a clean R script, and begin your exploratory analysis.
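In your refreshed workspace, loading the prototype set back in should take only a few seconds, something like this:

library(tidyverse)  # reattach the packages you need

wage_mini <- read.csv("wage_mini.csv", header = TRUE)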

 

Thank you so much for reading this far into my post. I hope it was helpful to you. For those of you that use R and R Studio on a regular basis, please consider donating to their organization so that they can continue to update this amazing tool.

 

I am not an employee of their organization. I just believe that if a product is adding meaning to my life, which it is, then I should support the cause! So, I do. The link to donate is here if you feel compelled to do so.


I hope this blog is helping all of you ladies with your transition into a safer, healthier life. I would like to add a laptop roundup/review so that you can get a better sense of which computers you should purchase for these projects. There are many new laptops that have been released as of Summer 2022, and new chips, like Intel’s 12th Generation i5 and i7, look to be impressive. So stay tuned for that.


Thank you ladies very much for your time and dedication to this blog. Be well! 

 
