INTRODUCTION:
This is a brief tutorial on Regression analysis, which is
a supervised machine learning method. I have adapted this post from a book that
I will cite at the conclusion, and I made this post accessible for all those in
the Sex Work Industry.
You don’t need to be a mathematician to understand how
Regression works, and I will not dive into the mathematics either. My goal is
to show you how to interpret the output that RStudio will give you
throughout the programming process. Don’t be intimidated, there is nothing to
fear.
This post is about how to make a numeric prediction
based on recorded data. Regression is a method or process that predicts a
response (one that is numeric by nature) by determining the strength of the
relationship between other numeric variables of interest.
Regression is the “bread and butter” of big data
analytics. Analysts are expected to know how to “run” a
Regression. Knowing Regression is NOT an option, it is mandatory.
Let’s begin.
STEP 1: Load necessary packages, set your
working directory, and create an object for your data set.
We will load the tidyverse package for now. If you have
not installed the tidyverse package, make sure to run this command:
install.packages("tidyverse"), and then you can load
tidyverse with the library() function.
It is best to set a working directory with the setwd()
function. Essentially, you will point R to the folder where you keep your data sets.
This directory can be set to any folder where you store your CSV files. Setting
the working directory ensures that RStudio can find your files without you typing
out full paths every time.
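If you are ever unsure which folder R is currently pointed at, the getwd() function will print it for you. Here is a quick sketch (the path below is only an example; yours will differ):
getwd()   # prints the current working directory
setwd("C:/Users/firstnamelastinitial/Documents/R")   # points R at your folder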
Finally, let’s load the necessary data set into the RStudio
Environment. Please open the RStudio application now.
I will be using a data set called “bikes” and that is the
name I will use to store the data set as well. At the end of this post, I will
tell you how you can find this data set so you can follow along in your spare
time.
To follow along with the data set, please go to the home
page, and click on “Data Sets” near the top of my page. Click the link on that page to be taken to a Google
Drive. Then click and download the bikes.csv file to your computer.
The code that you should type is highlighted in blue, and it is bolded. The rest of the output has a yellow background; it is what you will see in the lower left window once you type the code. Remember that the code should be typed in the upper left window of RStudio. Once you are finished typing the code, highlight it and press CTRL and ENTER.
One last hint: your working directory will likely include your name, or the name that you gave your computer when you first set it up. It will look different for everyone, but it will lead you to the folder that you created for your RStudio files. Please store your CSV files in this folder so that RStudio can find them efficiently.
Here is a link to the blog post that I wrote that
has more detailed steps on how to create your folder.
This is what all of Step 1 should look like with the code and the output of the code.
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.4     v dplyr   1.0.7
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   2.0.1     v forcats 0.5.1
## Warning: package 'stringr' was built under R version 4.1.2
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
setwd("C:/Users/firstnamelastinitial/Documents/R")
bikes <- read.csv("bikes.csv")
STEP 2: Discover Basic Elements of the Data
View(bikes)
glimpse(bikes)
## Rows: 731
## Columns: 10
## $ date        <chr> "2011-01-01", "2011-01-02", "2011-01-03", "2011-01-04", "2~
## $ season      <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
## $ holiday     <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0~
## $ weekday     <int> 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4~
## $ weather     <int> 2, 2, 1, 1, 1, 1, 2, 2, 1, 1, 2, 1, 1, 1, 2, 1, 2, 2, 2, 2~
## $ temperature <dbl> 46.71653, 48.35024, 34.21239, 34.52000, 36.80056, 34.88784~
## $ realfeel    <dbl> 46.39865, 45.22419, 25.70131, 28.40009, 30.43728, 30.90523~
## $ humidity    <dbl> 0.805833, 0.696087, 0.437273, 0.590435, 0.436957, 0.518261~
## $ windspeed   <dbl> 6.679665, 10.347140, 10.337565, 6.673420, 7.780994, 3.7287~
## $ rentals     <int> 985, 801, 1349, 1562, 1600, 1606, 1510, 959, 822, 1321, 12~
colnames(bikes)
## [1] "date"        "season"      "holiday"     "weekday"     "weather"
## [6] "temperature" "realfeel"    "humidity"    "windspeed"   "rentals"
head(bikes)
##         date season holiday weekday weather temperature realfeel humidity
## 1 2011-01-01      1       0       6       2    46.71653 46.39865 0.805833
## 2 2011-01-02      1       0       0       2    48.35024 45.22419 0.696087
## 3 2011-01-03      1       0       1       1    34.21239 25.70131 0.437273
## 4 2011-01-04      1       0       2       1    34.52000 28.40009 0.590435
## 5 2011-01-05      1       0       3       1    36.80056 30.43728 0.436957
## 6 2011-01-06      1       0       4       1    34.88784 30.90523 0.518261
##   windspeed rentals
## 1  6.679665     985
## 2 10.347140     801
## 3 10.337565    1349
## 4  6.673420    1562
## 5  7.780994    1600
## 6  3.728766    1606
tail(bikes)
##           date season holiday weekday weather temperature realfeel humidity
## 726 2012-12-26      1       0       3       3    38.18597 29.37556 0.823333
## 727 2012-12-27      1       0       4       2    39.10253 30.12507 0.652917
## 728 2012-12-28      1       0       5       2    39.03197 33.49946 0.590000
## 729 2012-12-29      1       0       6       2    39.03197 31.99712 0.752917
## 730 2012-12-30      1       0       0       1    39.24347 30.72596 0.483333
## 731 2012-12-31      1       0       1       2    35.85947 29.75026 0.577500
##     windspeed rentals
## 726 13.178398     441
## 727 14.576687    2114
## 728  6.472546    3095
## 729  5.178295    1341
## 730 14.602540    1796
## 731  6.446527    2729
Let me explain this input and output.
The View() function opens a separate tab, showing
us a spreadsheet-style table of the data set. Keep in mind that R is CASE SENSITIVE, and
the V in View() is capitalized.
The glimpse() function tells us that there are 731 rows,
or observations, in the data set. Similarly, there are 10 columns, and they
can be examined further with the colnames() function, which tells us the names
of the columns. Column names will be the variables that we examine and use to
predict our outcome. More on this later.
The head() function gives us the top 6 rows of the data
set, while the tail() function gives us the bottom 6 rows. When we have a data set with hundreds of thousands... no... MILLIONS of observations, these two functions are ideal for our exploratory analysis of the data.
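One small tip: both functions also accept an optional n argument if you want more (or fewer) rows than the default 6. For example:
head(bikes, n = 10)   # the first 10 rows
tail(bikes, n = 3)    # the last 3 rows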
STEP 3: Further Discovery
As this is a tutorial on Regression Analysis, we KNOW that
we will be eventually running a Regression algorithm to make predictions based
on variables of interest. But what if we didn’t know what business problem we
were trying to solve? What if we didn’t know that this data set would be ideal
to run a Regression? Let’s assume now that we don’t know that this is a
Regression problem.
There are some questions that we might ask to determine
whether a Regression analysis is appropriate for this scenario:
A. Assuming that we wanted to predict the number of bike
“rentals,” is there a correlation between this variable and any of the
remaining variables in our data set?
B. Let’s say that we determine that there is a
relationship between “rentals” and the variables of interest. How strong can we
say the relationship is?
C. Let’s say that there are varying levels of strength
between the “rentals” variable and the other variables of interest. Can we
determine that the relationship is “linear”? If so, we can be confident in
assuming that this type of business problem, given the data set, can be
interpreted with a Regression Analysis.
D. Further, once we determine that the relationship is
linear, we need to be sure that we can adequately quantify the impact of any
variable on the variable we are trying to predict, in this case, “rentals.”
E. Only once these conditions have been met can we run a
Regression algorithm to see how well we can predict the response variable,
“rentals.” More on response and predictor variables later.
STEP 4: Correlation Plot with corrplot()
Now we have a bigger picture of the tools we need to continue
with our Regression analysis. Again, I have adapted the tutorial to be completely accessible. I
am not expecting you to be a hardcore mathematical genius with statistical
skills to match Alan Turing. That would be unreasonable, right? Let's continue.
So we will use the corrplot() function
to determine the strength of the relationships between variables. Just as a
reminder, “Correlation” is a statistical method that is used to quantify the
relationship between two variables. Keep in mind that the two
variables have to be numeric. They MUST be numbers.
There is a simple way to determine the strength of
correlation, and that is by using the cor() function. This function takes
two arguments: the two variables whose correlation will be measured. Using the “bikes”
data set, such an input will look like this.
cor(bikes$humidity, bikes$rentals)
## [1] -0.1006586
Indeed. WHAT DOES THIS NUMBER MEAN? First, we can note that the
correlation is negative, depicted by the negative sign. Next, we can determine that it is weak. So, how do we know that the
number is weak? Use this guideline:
The range of this “correlation coefficient” is from -1 to 1.
A 0 would represent no correlation.
From .1 to .3 = Weak Positive Correlation
From .4 to .6 = Moderate Positive Correlation
From .7 and higher = Strong Positive Correlation
From -.1 to -.3 = Weak Negative Correlation
From -.4 to -.6 = Moderate Negative Correlation
From -.7 and lower = Strong Negative Correlation
You might see a problem with the above input. It
works great if you are working with a data set with few variables. But what if you
wanted to examine the correlation coefficients of many variables at once? The
following approach computes and then graphs the correlations between all the numeric
variables. Let’s check this out:
First, create a new object--otherwise known as a variable--called "bikenumeric." We will use the pipe operator,
%>%, to pipe the “bikes” object that we created earlier into the select() function,
and store the result in our new object. The select() function removes the “date” column,
because it is not numeric data.
bikenumeric <- bikes %>%
select(-date)
Next, we will create another object--this time called "bike_correlations"-- and store our
“bikenumeric” object in a cor() function. That will look like this:
bike_correlations <- cor(bikenumeric)
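As a quick aside, bike_correlations is just a matrix, so if you ever want the correlations with “rentals” as plain numbers before plotting, you can pull out that single row (a small sketch using the object we just created):
round(bike_correlations["rentals", ], 2)   # rentals vs. every other numeric variable, rounded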
Now, we can finally use the corrplot() function. Be
sure to load the "corrplot" library first (if you have not installed it yet,
run install.packages("corrplot") before this step):
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.1.2
## corrplot 0.92 loaded
corrplot(bike_correlations)
And AHA! We get a graph that tells us the strength of each
variable in relation to each other. Let’s do one thing before we interpret the
output:
We will add a line of code that scores the correlation
with the use of decimals to make our interpretation more intuitive.
corrplot.mixed(bike_correlations)
Now we can see the strength of the variables with
regard to one another via decimal scores. Check this out! We can determine
that all decimals that are BLUE are positive, as noted by the scale on the
right side of the plot. Likewise, all decimals that are RED are negative.
We can also determine that the strongest relationship
is between “temperature” and “realfeel.” If you trace a line starting with
“temperature” downward until you intersect with “realfeel,” you will come
across a number in blue, .99. This number and color indicate a Strong Positive
Correlation between the two variables.
However, we are more interested in the correlations between “rentals” and any other variable that has a registered correlation. We might determine that our variables of interest are “realfeel,” “temperature,” “weather,” and “windspeed.” These are the 4 variables that we will use for determining our regression model if we decide to create a model with more than one predictor.
Let's stick to using one predictor for this post, however.
So, now that we have a better idea of the variables that
we can use in our Regression Analysis, let’s build our first regression model.
STEP 5: Simple Linear Regression
The goal of Regression is to observe how one variable impacts
another. Here is some useful terminology:
The Response variable is the variable that we are trying
to predict. It is the DEPENDENT variable that will be impacted by the Predictor
variable, the INDEPENDENT variable.
The Response variable will live on the y-axis. And the
Predictor variable will live on the x-axis.
Without getting too much into the mathematics of regression, we will be creating a model that draws a straight line as close to as many of the data points as possible.
If you wish to discover more about the mathematics behind Regression analysis, search the
following key terms on Google (and see the short sketch just after this list):
"OLS (Ordinary Least Squares) method"
"Errors and Residuals"
So, let’s create our Regression model, which draws the line
that best fits the data points. Then, in the next step,
we will interpret the output that RStudio gives us.
First, we create a new object called “lm_mod,” and then
use the lm() function to calculate the regression.
Just as a reminder, the lm() function uses two
arguments: the name of the data set, and a formula that relates our response variable to our
predictor variable. We want to predict “rentals” as a function of
“temperature.” The variable on the left side of the tilde (the squiggly line, ~) is what
we are trying to predict. Here is what the entire function looks like:
lm_mod <- lm(data = bikes,
rentals ~ temperature)
There we go! Our Regression has been created.
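If you would like a quick peek at the fitted line before we visualize it in the next step, the coef() function prints just the two numbers that define it:
coef(lm_mod)   # the intercept and the temperature coefficient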
STEP 6: Visualization and Interpretation
I will plot what the Regression model looks like between our Response Variable, "rentals," and a single Predictor Variable, "temperature," and then
I will conclude the tutorial with a tedious explanation of the output so that
we can interpret our model. Here is what the regression looks like as a graph, coded with the ggplot() function:
ggplot(data = bikes, aes(x = temperature, y = rentals)) +
  geom_point(color = 'red') +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Regression line of Rentals and Temperature",
       y = "Rentals",
       x = "Temperature")
## `geom_smooth()` using formula 'y ~ x'
The blue line that is drawn through our data points is our
Regression model. Now let’s interpret the model’s meaning by getting the output
from the summary() function:
summary(lm_mod)
##
## Call:
## lm(formula = rentals ~ temperature, data = bikes)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -4615.3 -1134.9  -104.4  1044.3  3737.8
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept)  -166.877    221.816  -0.752    0.452
## temperature    78.495      3.607  21.759   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1509 on 729 degrees of freedom
## Multiple R-squared:  0.3937, Adjusted R-squared:  0.3929
## F-statistic: 473.5 on 1 and 729 DF,  p-value: < 2.2e-16
Whooooaaaaa! That looks crazy, doesn’t it?
Working through the sections of this output will take time, but it is necessary in
order to understand our model. We don’t need to know the mathematical wizardry that
went into these numbers. As I mentioned before, we are only interested in what
these numbers mean.
RESIDUALS:
Quite simply, this section tells us the five-number summary
of the residuals: the observed value minus the predicted value. Remember
that the Regression line has data points below and above it. Each residual is the
vertical distance between a data point (an observed value) and the line (its
predicted value); points above the line have positive residuals, and points below
have negative residuals. We can determine that:
The Minimum residual value is -4615.3. The Maximum
residual value is 3737.8. The Median value is -104.4. The 1st Quartile is
-1134.9. The 3rd Quartile is 1044.3.
How can we phrase this using the two variables in question, rentals and temperature? We can say that on its single worst day in the data, our model OVERPREDICTED the number of rentals by about 4,615 bikes (as noted by the Minimum Residual Value, -4615.3).
Conversely, on another day, our model
UNDERPREDICTED the number of rentals by about 3,738 bikes (as noted by the Maximum Residual Value, 3737.8). These are worst-case days, not per-degree effects.
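If you want to check these numbers yourself, the residuals() function returns the observed-minus-predicted value for every day in the data (a small sketch):
summary(residuals(lm_mod))     # the same five numbers from the output, plus the mean
which.min(residuals(lm_mod))   # the row (day) the model overpredicted the most
which.max(residuals(lm_mod))   # the row (day) the model underpredicted the most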
COEFFICIENTS:
This chart tells us some key elements of the model.
First, under the “Estimate” column, we see numerical values for the Intercept,
where the regression line crosses the y-axis, and for temperature. These
coefficients give us the equation for our Simple Linear Regression Model.
The Intercept has a value of -166.877. Temperature has a
value of 78.495.
The equation for our model will then appear as such:
Predicted rentals = -166.9 + 78.5*(temperature)
The * symbol denotes multiplication: the “temperature” value for any day of interest will be substituted in and multiplied by 78.5. So, if we wanted to predict how many bike rentals we might have on a day that is 82 degrees, we would:
Multiply 82 by 78.5 and subtract 166.9 from this value.
So, (82 * 78.5) - 166.9 = 6,270.1
We would have a predicted value of roughly 6,270 bikes for that day. Recall that our y-intercept is negative (-166.9). So, we subtract 166.9 from the temperature value multiplied by its associated coefficient to get the predicted rentals for that day.
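Rather than doing the arithmetic by hand, we can also ask R for the same prediction with the predict() function (a sketch; it assumes the lm_mod object from Step 5 is still in your Environment):
predict(lm_mod, newdata = data.frame(temperature = 82))   # roughly 6,270 rentals
The tiny difference from our hand calculation comes from rounding the coefficients to one decimal place.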
The column labeled “Pr(>|t|)” is also important to us.
The closer this number is to zero, the stronger the evidence that the
predictor matters to the model. If you look at the intersection of temperature,
our predictor variable, and the “Pr(>|t|)” column, you will see three stars (***),
which correspond to the significance codes listed below the table. It is good for our model to have
three of them, but having one star, which corresponds to a p-value below .05, is
acceptable in most cases.
Just as a rule of thumb, any variable that has a p-value
of less than .05 is considered to be statistically significant, and therefore,
would be a useful feature for a regression model.
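If you ever want to pull these p-values out of the output programmatically, the coefficient table is stored inside the summary object (a small sketch):
summary(lm_mod)$coefficients   # the Estimate, Std. Error, t value, and Pr(>|t|) columns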
MULTIPLE and ADJUSTED R-SQUARED:
We have the following values for our Multiple
and Adjusted R-Squared, respectively: 0.3937 and 0.3929.
The closer this number is to 1, the better our model
explains the data. In other words, we want this number to be as close to 1 as
possible, as such a score would describe a model that is efficient at explaining the variation in the data set.
Adjusted R-Squared is a more conservative measure, but it is close
to Multiple R-Squared. Prefer the Adjusted R-Squared measurement in your
analysis, as it penalizes the model for predictors that don’t help.
But we can interpret the Multiple R-Squared score in this case, because we are only using one predictor variable; the output won't penalize us for this. This is how we would interpret the score:
"The Regression Model, with Temperature as our INDEPENDENT variable, explains roughly 39.37% of the variation in our DEPENDENT variable, Rentals."
Your interpretation doesn't need to be "word for word" like mine, but it does need to state how much of the variation your predictor "explains" or "accounts for," whatever percentage your R-Squared score says. Remember that to turn these values into percentages, we need to multiply them by 100, or simply move the decimal point two places to the right.
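Both scores can also be pulled directly out of the summary object, which is handy if you want to report them without copying from the console (a small sketch):
summary(lm_mod)$r.squared       # Multiple R-squared, 0.3937
summary(lm_mod)$adj.r.squared   # Adjusted R-squared, 0.3929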
F-STATISTIC and P-VALUE:
The F-Statistic helps determine whether there is a
relationship between the response and predictor variables: the larger it is,
the stronger the evidence. Our score is 473.5 on 1 and 729 degrees of freedom,
and this is considered good, as it is far above 1.
We have discussed the p-value briefly, “Pr(>|t|).” The
closer this number is to zero, the stronger the evidence of a real relationship. In our case, the
number (< 2.2e-16) is so close to zero that we can be confident that our model fits the
data well.
THANK YOU:
Thank you so much for checking out just one of the machine
learning methods that we are going to learn on this website. It is my goal to
create easily digestible content for Porn Stars and anyone else in the Sex Industry, so that you are able to learn
necessary skills in this new age of Machine Learning and Data Analytics. These skills are valuable not just to you, but to anyone looking to learn the tools required by the modern workforce.
CITATIONS:
“Practical Machine Learning in R,” by Fred Nwanganga and Mike Chapple (Wiley, 2020)