INTRODUCTION:
This is a brief tutorial on Regression analysis, which is
a supervised machine learning method. I have adapted this post from a book that
I will cite at the conclusion, and I made this post accessible for all those in
the Sex Work Industry.
You don’t need to be a mathematician to understand how
Regression works, and I will not dive into the mathematics either. My goal is
to show you how to interpret the output that RStudio will give you
throughout the programming process. Don’t be intimidated, there is nothing to
fear.
This post is about how to make a numeric prediction
based on recorded data. Regression is a method or process that predicts a
response (one that is numeric by nature) by determining the strength of the
relationship between other numeric variables of interest.
Regression is the “bread and butter” of big data
analytics. Analysts are expected to know how to “run” a
Regression. Knowing Regression is NOT an option, it is mandatory.
Let’s begin.
STEP 1: Load necessary packages, set your
working directory, and create an object for your data set.
We will load the tidyverse package for now. If you have
not installed the tidyverse package, make sure to run this command:
install.packages("tidyverse"), and then you can load
tidyverse with the library() function.
It is best to set a working directory with the setwd()
function. Essentially, you will point R to the folder where you keep your data sets.
This directory can be set to any folder where you store your CSV files. Setting
the working directory ensures that RStudio can find your files without you typing
out full paths every time.
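If you are ever unsure which folder R is currently pointed at, the getwd() function will print it for you. Here is a quick sketch (the path below is only an example; yours will differ):
getwd()   # prints the current working directory
setwd("C:/Users/firstnamelastinitial/Documents/R")   # points R at your folder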
Finally, let’s load the necessary data set into the RStudio
Environment. Please open the RStudio application now.
I will be using a data set called “bikes” and that is the
name I will use to store the data set as well. At the end of this post, I will
tell you how you can find this data set so you can follow along in your spare
time.
To follow along with the data set, please go to the home
page, and click on “Data Sets” near the top of my page. Click the link on that page to be taken to a Google
Drive. Then click and download the bikes.csv file to your computer.
The code that you should type is highlighted in blue, and it is bolded. The rest of the output has a yellow background; it is what you will see in the lower left window once you type the code. Remember that the code should be typed in the upper left window of RStudio. Once you are finished typing the code, highlight it and press CTRL and ENTER.
One last hint: your working directory will likely include your name, or the name that you gave your computer when you first set it up. It will look different for everyone, but it will lead you to the folder that you created for your RStudio files. Please store your CSV files in this folder so that RStudio can find them efficiently.
Here is a link to the blog post that I wrote that
has more detailed steps on how to create your folder.
This is what all of Step 1 should look like with the code and the output of the code.
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.4     v dplyr   1.0.7
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   2.0.1     v forcats 0.5.1
## Warning: package 'stringr' was built under R version 4.1.2
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
setwd("C:/Users/firstnamelastinitial/Documents/R")
bikes <- read.csv("bikes.csv")
STEP 2: Discover Basic Elements of the Data
View(bikes)
glimpse(bikes)
## Rows: 731
## Columns: 10
## $ date        <chr> "2011-01-01", "2011-01-02", "2011-01-03", "2011-01-04", "2~
## $ season      <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
## $ holiday     <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0~
## $ weekday     <int> 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4~
## $ weather     <int> 2, 2, 1, 1, 1, 1, 2, 2, 1, 1, 2, 1, 1, 1, 2, 1, 2, 2, 2, 2~
## $ temperature <dbl> 46.71653, 48.35024, 34.21239, 34.52000, 36.80056, 34.88784~
## $ realfeel    <dbl> 46.39865, 45.22419, 25.70131, 28.40009, 30.43728, 30.90523~
## $ humidity    <dbl> 0.805833, 0.696087, 0.437273, 0.590435, 0.436957, 0.518261~
## $ windspeed   <dbl> 6.679665, 10.347140, 10.337565, 6.673420, 7.780994, 3.7287~
## $ rentals     <int> 985, 801, 1349, 1562, 1600, 1606, 1510, 959, 822, 1321, 12~
colnames(bikes)
## [1] "date"        "season"      "holiday"     "weekday"     "weather"
## [6] "temperature" "realfeel"    "humidity"    "windspeed"   "rentals"
head(bikes)
##         date season holiday weekday weather temperature realfeel humidity
## 1 2011-01-01      1       0       6       2    46.71653 46.39865 0.805833
## 2 2011-01-02      1       0       0       2    48.35024 45.22419 0.696087
## 3 2011-01-03      1       0       1       1    34.21239 25.70131 0.437273
## 4 2011-01-04      1       0       2       1    34.52000 28.40009 0.590435
## 5 2011-01-05      1       0       3       1    36.80056 30.43728 0.436957
## 6 2011-01-06      1       0       4       1    34.88784 30.90523 0.518261
##   windspeed rentals
## 1  6.679665     985
## 2 10.347140     801
## 3 10.337565    1349
## 4  6.673420    1562
## 5  7.780994    1600
## 6  3.728766    1606
tail(bikes)
##           date season holiday weekday weather temperature realfeel humidity
## 726 2012-12-26      1       0       3       3    38.18597 29.37556 0.823333
## 727 2012-12-27      1       0       4       2    39.10253 30.12507 0.652917
## 728 2012-12-28      1       0       5       2    39.03197 33.49946 0.590000
## 729 2012-12-29      1       0       6       2    39.03197 31.99712 0.752917
## 730 2012-12-30      1       0       0       1    39.24347 30.72596 0.483333
## 731 2012-12-31      1       0       1       2    35.85947 29.75026 0.577500
##     windspeed rentals
## 726 13.178398     441
## 727 14.576687    2114
## 728  6.472546    3095
## 729  5.178295    1341
## 730 14.602540    1796
## 731  6.446527    2729
Let me explain this input and output.
The View() function opens a separate tab, showing
us a spreadsheet-style table of the data set. Keep in mind that R is CASE SENSITIVE, and
the V in View() is capitalized.
The glimpse() function tells us that there are 731 rows,
or observations, in the data set. Similarly, there are 10 columns, and they
can be examined further with the colnames() function, which tells us the names
of the columns. Column names will be the variables that we examine and use to
predict our outcome. More on this later.
The head() function gives us the top 6 rows of the data
set, while the tail() function gives us the bottom 6 rows. When we have a data set with hundreds of thousands... no... MILLIONS of observations, these two functions are ideal for our exploratory analysis of the data.
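One small tip: both functions also accept an optional n argument if you want more (or fewer) rows than the default 6. For example:
head(bikes, n = 10)   # the first 10 rows
tail(bikes, n = 3)    # the last 3 rows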
STEP 3: Further Discovery
As this is a tutorial on Regression Analysis, we KNOW that
we will be eventually running a Regression algorithm to make predictions based
on variables of interest. But what if we didn’t know what business problem we
were trying to solve? What if we didn’t know that this data set would be ideal
to run a Regression? Let’s assume now that we don’t know that this is a
Regression problem.
There are some questions that we might ask to determine
whether a Regression analysis is appropriate for this scenario:
A. Assuming that we wanted to predict the number of bike
“rentals,” is there a correlation between this variable and any of the
remaining variables in our data set?
B. Let’s say that we determine that there is a
relationship between “rentals” and the variables of interest. How strong can we
say the relationship is?
C. Let’s say that there are varying levels of strength
between the “rentals” variable and the other variables of interest. Can we
determine that the relationship is “linear”? If so, we can be confident in
assuming that this type of business problem, given the data set, can be
interpreted with a Regression Analysis.
D. Further, once we determine that the relationship is
linear, we need to be sure that we can adequately quantify the impact of any
variable on the variable we are trying to predict, in this case, “rentals.”
E. Only once these conditions have been met can we run a
Regression algorithm to see how well we can predict the response variable,
“rentals.” More on response and predictor variables later.
STEP 4: Correlation Plot with corrplot()
Now we have a bigger picture of the tools we need to continue
with our Regression analysis. Again, I have adapted the tutorial to be completely accessible. I
am not expecting you to be a hardcore mathematical genius with statistical
skills to match Alan Turing. That would be unreasonable, right? Let's continue.
So we will use the corrplot() function
to determine the strength of the relationships between variables. Just as a
reminder, “Correlation” is a statistical method that is used to quantify the
relationship between two variables. Keep in mind that the two
variables have to be numeric. They MUST be numbers.
There is a simple way to determine the strength of
correlation, and that is by using the cor() function. This function takes
two arguments: the two variables whose correlation will be measured. Using the “bikes”
data set, such an input will look like this.
cor(bikes$humidity, bikes$rentals)
## [1] -0.1006586
Indeed. WHAT DOES THIS NUMBER MEAN? First, we can note that the
correlation is negative, depicted by the negative sign. Next, we can determine that it is weak. So, how do we know that the
number is weak? Use this guideline:
The range of this “correlation coefficient” is from -1 to 1.
A 0 would represent no correlation.
From .1 to .3 = Weak Positive Correlation
From .4 to .6 = Moderate Positive Correlation
From .7 and higher = Strong Positive Correlation
From -.1 to -.3 = Weak Negative Correlation
From -.4 to -.6 = Moderate Negative Correlation
From -.7 and lower = Strong Negative Correlation
You might see a problem with the above input. It
works great if you are working with a data set with few variables. But what if you
wanted to examine the correlation coefficients of many variables at once? The
following approach computes and then graphs the correlations between all the numeric
variables. Let’s check this out:
First, create a new object--otherwise known as a variable--called "bikenumeric." We will use the pipe operator,
%>%, to pipe the “bikes” object that we created earlier into the select() function,
and store the result in our new object. The select() function removes the “date” column,
because it is not numeric data.
bikenumeric <- bikes %>%
select(-date)
Next, we will create another object--this time called "bike_correlations"-- and store our
“bikenumeric” object in a cor() function. That will look like this:
bike_correlations <- cor(bikenumeric)
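As a quick aside, bike_correlations is just a matrix, so if you ever want the correlations with “rentals” as plain numbers before plotting, you can pull out that single row (a small sketch using the object we just created):
round(bike_correlations["rentals", ], 2)   # rentals vs. every other numeric variable, rounded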
Now, we can finally use the corrplot() function. Be
sure to load the "corrplot" library first (if you have not installed it yet,
run install.packages("corrplot") before this step):
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.1.2
## corrplot 0.92 loaded
corrplot(bike_correlations)
And AHA! We get a graph that tells us the strength of each
variable in relation to each other. Let’s do one thing before we interpret the
output:
We will add a line of code that scores the correlation
with the use of decimals to make our interpretation more intuitive.
corrplot.mixed(bike_correlations)
Now we can see the strength of the variables with
regard to one another via decimal scores. Check this out! We can determine
that all decimals that are BLUE are positive, as noted by the scale on the
right side of the plot. Likewise, all decimals that are RED are negative.
We can also determine that the strongest relationship
is between “temperature” and “realfeel.” If you trace a line starting with
“temperature” downward until you intersect with “realfeel,” you will come
across a number in blue, .99. This number and color indicate a Strong Positive
Correlation between the two variables.
However, we are more interested in the correlations between “rentals” and any other variable that has a registered correlation. We might determine that our variables of interest are “realfeel,” “temperature,” “weather,” and “windspeed.” These are the 4 variables that we will use for determining our regression model if we decide to create a model with more than one predictor.
Let's stick to using one predictor for this post, however.
So, now that we have a better idea of the variables that
we can use in our Regression Analysis, let’s build our first regression model.
STEP 5: Simple Linear Regression
The goal of Regression is to observe how one variable impacts
another. Here is some useful terminology:
The Response variable is the variable that we are trying
to predict. It is the DEPENDENT variable that will be impacted by the Predictor
variable, the INDEPENDENT variable.
The Response variable will live on the y-axis. And the
Predictor variable will live on the x-axis.
Without getting too much into the mathematics of regression, we will be creating a model that draws a straight line as close to as many of the data points as possible.
If you wish to discover more about the mathematics behind Regression analysis, search the
following key terms on Google (and see the short sketch just after this list):
"OLS (Ordinary Least Squares) method"
"Errors and Residuals"
So, let’s create our Regression model, which draws the line
that best fits the data points. Then, in the next step,
we will interpret the output that RStudio gives us.
First, we create a new object called “lm_mod,” and then
use the lm() function to calculate the regression.
Just as a reminder, the lm() function uses two
arguments: the name of the data set, and a formula that relates our response variable to our
predictor variable. We want to predict “rentals” as a function of
“temperature.” The variable on the left side of the tilde (the squiggly line, ~) is what
we are trying to predict. Here is what the entire function looks like:
lm_mod <- lm(data = bikes,
rentals ~ temperature)
There we go! Our Regression has been created.
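If you would like a quick peek at the fitted line before we visualize it in the next step, the coef() function prints just the two numbers that define it:
coef(lm_mod)   # the intercept and the temperature coefficient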
STEP 6: Visualization and Interpretation
I will plot what the Regression model looks like between our Response Variable, "rentals," and a single Predictor Variable, "temperature," and then
I will conclude the tutorial with a tedious explanation of the output so that
we can interpret our model. Here is what the regression looks like as a graph, coded with the ggplot() function:
ggplot(data = bikes, aes(x = temperature, y = rentals)) +
  geom_point(color = 'red') +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Regression line of Rentals and Temperature",
       y = "Rentals",
       x = "Temperature")
## `geom_smooth()` using formula 'y ~ x'
The blue line that is drawn through our data points is our
Regression model. Now let’s interpret the model’s meaning by getting the output
from the summary() function:
summary(lm_mod)
##
## Call:
## lm(formula = rentals ~ temperature, data = bikes)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -4615.3 -1134.9  -104.4  1044.3  3737.8
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept)  -166.877    221.816  -0.752    0.452
## temperature    78.495      3.607  21.759   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1509 on 729 degrees of freedom
## Multiple R-squared:  0.3937, Adjusted R-squared:  0.3929
## F-statistic: 473.5 on 1 and 729 DF,  p-value: < 2.2e-16
Whooooaaaaa! That looks crazy, doesn’t it?
Working through the sections of this output will take time, but it is necessary in
order to understand our model. We don’t need to know the mathematical wizardry that
went into these numbers. As I mentioned before, we are only interested in what
these numbers mean.
RESIDUALS:
Quite simply, this section tells us the five-number summary
of the residuals: the observed value minus the predicted value. Remember
that the Regression line has data points below and above it. Each residual is the
vertical distance between a data point (an observed value) and the line (its
predicted value); points above the line have positive residuals, and points below
have negative residuals. We can determine that:
The Minimum residual value is -4615.3. The Maximum
residual value is 3737.8. The Median value is -104.4. The 1st Quartile is
-1134.9. The 3rd Quartile is 1044.3.
How can we phrase this using the two variables in question, rentals and temperature? We can say that on its single worst day in the data, our model OVERPREDICTED the number of rentals by about 4,615 bikes (as noted by the Minimum Residual Value, -4615.3).
Conversely, on another day, our model
UNDERPREDICTED the number of rentals by about 3,738 bikes (as noted by the Maximum Residual Value, 3737.8). These are worst-case days, not per-degree effects.
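If you want to check these numbers yourself, the residuals() function returns the observed-minus-predicted value for every day in the data (a small sketch):
summary(residuals(lm_mod))     # the same five numbers from the output, plus the mean
which.min(residuals(lm_mod))   # the row (day) the model overpredicted the most
which.max(residuals(lm_mod))   # the row (day) the model underpredicted the most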
COEFFICIENTS:
This chart tells us some key elements of the model.
First, under the “Estimate” column, we see numerical values for the Intercept,
where the regression line crosses the y-axis, and for temperature. These
coefficients give us the equation for our Simple Linear Regression Model.
The Intercept has a value of -166.877. Temperature has a
value of 78.495.
The equation for our model will then appear as such:
Predicted rentals = -166.9 + 78.5*(temperature)
The * symbol denotes multiplication: the “temperature” value for any day of interest will be substituted in and multiplied by 78.5. So, if we wanted to predict how many bike rentals we might have on a day that is 82 degrees, we would:
Multiply 82 by 78.5 and subtract 166.9 from this value.
So, (82 * 78.5) - 166.9 = 6,270.1
We would have a predicted value of roughly 6,270 bikes for that day. Recall that our y-intercept is negative (-166.9). So, we subtract 166.9 from the temperature value multiplied by its associated coefficient to get the predicted rentals for that day.
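Rather than doing the arithmetic by hand, we can also ask R for the same prediction with the predict() function (a sketch; it assumes the lm_mod object from Step 5 is still in your Environment):
predict(lm_mod, newdata = data.frame(temperature = 82))   # roughly 6,270 rentals
The tiny difference from our hand calculation comes from rounding the coefficients to one decimal place.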
The column labeled “Pr(>|t|)” is also important to us.
The closer this number is to zero, the stronger the evidence that the
predictor matters to the model. If you look at the intersection of temperature,
our predictor variable, and the “Pr(>|t|)” column, you will see three stars (***),
which correspond to the significance codes listed below the table. It is good for our model to have
three of them, but having one star, which corresponds to a p-value below .05, is
acceptable in most cases.
Just as a rule of thumb, any variable that has a p-value
of less than .05 is considered to be statistically significant, and therefore,
would be a useful feature for a regression model.
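If you ever want to pull these p-values out of the output programmatically, the coefficient table is stored inside the summary object (a small sketch):
summary(lm_mod)$coefficients   # the Estimate, Std. Error, t value, and Pr(>|t|) columns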
MULTIPLE and ADJUSTED R-SQUARED:
We have the following values for our Multiple
and Adjusted R-Squared, respectively: 0.3937 and 0.3929.
The closer this number is to 1, the better our model
explains the data. In other words, we want this number to be as close to 1 as
possible, as such a score would describe a model that is efficient at explaining the variation in the data set.
Adjusted R-Squared is a more conservative measure, but it is close
to Multiple R-Squared. Prefer the Adjusted R-Squared measurement in your
analysis, as it penalizes the model for predictors that don’t help.
But we can interpret the Multiple R-Squared score in this case, because we are only using one predictor variable; the output won't penalize us for this. This is how we would interpret the score:
"The Regression Model, with Temperature as our INDEPENDENT variable, explains roughly 39.37% of the variation in our DEPENDENT variable, Rentals."
Your interpretation doesn't need to be "word for word" like mine, but it does need to state how much of the variation your predictor "explains" or "accounts for," whatever percentage your R-Squared score says. Remember that to turn these values into percentages, we need to multiply them by 100, or simply move the decimal point two places to the right.
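Both scores can also be pulled directly out of the summary object, which is handy if you want to report them without copying from the console (a small sketch):
summary(lm_mod)$r.squared       # Multiple R-squared, 0.3937
summary(lm_mod)$adj.r.squared   # Adjusted R-squared, 0.3929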
F-STATISTIC and P-VALUE:
The F-Statistic helps determine whether there is a
relationship between the response and predictor variables: the larger it is,
the stronger the evidence. Our score is 473.5 on 1 and 729 degrees of freedom,
and this is considered good, as it is far above 1.
We have discussed the p-value briefly, “Pr(>|t|).” The
closer this number is to zero, the stronger the evidence of a real relationship. In our case, the
number (< 2.2e-16) is so close to zero that we can be confident that our model fits the
data well.
THANK YOU:
Thank you so much for checking out just one of the machine
learning methods that we are going to learn on this website. It is my goal to
create easily digestible content for Porn Stars and anyone else in the Sex Industry, so that you are able to learn
necessary skills in this new age of Machine Learning and Data Analytics. These skills are valuable not just to you, but to anyone looking to learn the tools required by the modern workforce.
CITATIONS:
“Practical Machine Learning in R,” by Fred Nwanganga and Mike Chapple (Wiley, 2020)