In this post, I would like to go deeper into evaluating the output of a linear regression model. In hopes of eliminating any confusion you might have, I will be discussing:
RESIDUALS
COEFFICIENTS
DIAGNOSTIC readings such as: Residual Standard Error, Multiple and Adjusted R-squared, the F-statistic, and the p-value.
For that purpose, let’s examine the output of a linear
model for the bikes data set that I posted on last week.
> summary(bikes_mod1)

Call:
lm(formula = rentals ~ temperature, data = bikes)

Residuals:
    Min      1Q  Median      3Q     Max 
-4615.3 -1134.9  -104.4  1044.3  3737.8 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -166.877    221.816  -0.752    0.452    
temperature   78.495      3.607  21.759   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1509 on 729 degrees of freedom
Multiple R-squared:  0.3937,	Adjusted R-squared:  0.3929 
F-statistic: 473.5 on 1 and 729 DF,  p-value: < 2.2e-16
After passing our model (bikes_mod1)
into the summary() function, R will spit out key metrics that we need to
interpret. Don’t worry. You don’t need to be a math wizard to have a basic
understanding of what these measurements mean.
The goal is to be able to communicate
these numbers to management—hopefully you do this ethically—so they can be
persuaded to take action on decisions that will positively impact the
organization.
I hope this review will be helpful to you, whether in your studies or as a data entrepreneur. As you know, supervised machine learning methods, like linear regression, are exciting: being able to make predictions about the future based on past observations is a tremendous skill to have, so you can imagine how enthusiastic I am to learn AND teach these industry-standard techniques.
Let’s go over the keywords highlighted in red in greater detail.
Call:
lm(formula = rentals ~ temperature, data = bikes)
Under “Call” we can see the linear model that we have built with specific variables in our data set (bikes), namely rentals and temperature.
As you can see in the formula, we state
our dependent variable first, "rentals," which is impacted by an independent
variable, "temperature."
For ease of recollection, I will refer to
the "rentals" variable as the response variable, as it is the variable we
have predicted with our model. Similarly, I will refer to the "temperature" variable as the predictor variable.
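To make this concrete, here is a minimal sketch of how a model like this is fit. The original `bikes` data set is not reproduced in this post, so the data frame below is simulated stand-in data (an assumption for illustration only); the coefficients it produces will not match the output above exactly.

```r
# Simulate a stand-in for the bikes data frame (731 observations, like the
# real data set), with rentals driven by temperature plus random noise.
set.seed(42)
bikes <- data.frame(temperature = runif(731, min = 0, max = 35))
bikes$rentals <- -167 + 78.5 * bikes$temperature + rnorm(731, sd = 1500)

# Response (rentals) on the left of ~, predictor (temperature) on the right
bikes_mod1 <- lm(rentals ~ temperature, data = bikes)
summary(bikes_mod1)  # prints the Call, Residuals, Coefficients, and diagnostics
```

The formula syntax `response ~ predictor` is what produces the "Call" you see echoed back in the output.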
Let’s continue…
Coefficients:
(Intercept) temperature
-166.9 78.5
The Coefficients section assigns a number to each term in the model, and these numbers appear in our model’s final equation. As a reminder, the stock equation for univariate regression is:

y = b0 + b1(x)

where b0 is the intercept and b1 is the slope on the predictor. Substituting our coefficients into this template, our final model is:
rentals = -166.9 +
78.5(temperature)
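To see what this equation produces, we can plug in a hypothetical temperature. The value of 20 degrees below is chosen only for illustration:

```r
# Fitted equation from the output: rentals = -166.9 + 78.5 * temperature
intercept <- -166.9
slope     <- 78.5

temp <- 20                                # hypothetical temperature, in degrees
predicted_rentals <- intercept + slope * temp
predicted_rentals                         # -166.9 + 78.5 * 20 = 1403.1
```

So at 20 degrees, the model predicts roughly 1,403 rentals. Note the intercept is negative: the model would predict negative rentals near 0 degrees, a reminder that the equation is only meaningful within the range of the observed data.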
What does this equation mean?
We will
discover the answer to this question as we continue learning about our
residuals:
Residuals:
    Min      1Q  Median      3Q     Max 
-4615.3 -1134.9  -104.4  1044.3  3737.8
Residuals, in mathematical speak, are the observed values minus the predicted values. They are the “error” in our prediction. In plain language, we can say that “for at least ONE unit of temperature, we OVERPREDICTED the number of bike rentals by 4,615 bikes (our Min value). Conversely, for at least ONE unit of temperature, we UNDERPREDICTED rentals by 3,737 bikes (our Max value).”
Note that I use the term “units” as a general way to refer to the predictor variable, since this variable can be in ANY unit. For this problem, temperature is most likely noted in “degrees.”
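The residual arithmetic itself is simple subtraction. A toy example with made-up numbers (not the bikes data):

```r
# Residual = observed value - predicted value
observed  <- c(1200.0, 1550.0,  900.0)   # hypothetical actual rentals
predicted <- c(1403.1, 1403.1, 1010.0)   # hypothetical model predictions
residuals <- observed - predicted
residuals            # -203.1  146.9 -110.0  (negative = overprediction)
summary(residuals)   # Min / quartiles / Max, as reported under Residuals
```

The Residuals block in the model output is exactly this kind of five-number summary, computed over all 731 observations.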
So, what about the coefficients chart?
Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -166.877    221.816  -0.752    0.452    
temperature   78.495      3.607  21.759   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
We have touched on estimates already, but as a reminder, they are the values that appear in the final equation: the y-intercept is -166.877, and the coefficient on the predictor variable, temperature, is 78.495. These values are given to us under the “Estimate” column in the “Coefficients” section of the output.
As emerging business analysts, we are interested in the p-value column, Pr(>|t|), as it tells us whether a variable is statistically significant. The smaller this number, the stronger the evidence that the predictor has a genuine relationship with the response.
This number is represented in a decimal
and will reside in the range between 0 and 1.
To help us decide HOW significant a p-value is, asterisks (*) appear next to it. It is standard practice to favor variables whose p-value is LESS than an alpha of .05. (Alpha is a threshold we choose in advance that sets the level of error we are willing to accept; 5%, or .05, is the most common choice.)
Put differently, any variable with a
significance level of less than alpha (of .05) is desirable in model building,
as it will have a high degree of predictability, or relevance.
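In R, the coefficient table can be pulled out programmatically with `summary()$coefficients`. The sketch below uses simulated data, since the bikes data set is not reproduced here:

```r
# Simulate a simple data set with a real underlying relationship
set.seed(1)
x <- runif(100)
y <- 3 * x + rnorm(100)
mod <- lm(y ~ x)

# The coefficient matrix: Estimate, Std. Error, t value, Pr(>|t|)
coef_table <- summary(mod)$coefficients
p_values   <- coef_table[, "Pr(>|t|)"]

alpha <- 0.05
p_values < alpha    # TRUE for terms significant at our chosen alpha
```

This is handy when screening many predictors: rather than eyeballing asterisks, you compare the Pr(>|t|) column against your alpha directly.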
We now need to understand further
diagnostic measurements.
Residual standard error: 1509 on 729 degrees of freedom
Multiple R-squared:  0.3937,	Adjusted R-squared:  0.3929
F-statistic: 473.5 on 1 and 729 DF,  p-value: < 2.2e-16
The Residual Standard Error should be small: the smaller this number, the better our model fits the data. In context, it can be phrased as, “The actual number of bike rentals deviates from our predictions by roughly 1,509 rentals, on average.” Depending on the needs of the business and how success is measured within the organization, this number may or may not be acceptable.
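Under the hood, the residual standard error is the square root of the residual sum of squares divided by the degrees of freedom (n minus the number of estimated coefficients). A sketch on simulated data:

```r
# Simulate data and fit a univariate model
set.seed(7)
x <- runif(50)
y <- 2 * x + rnorm(50)
mod <- lm(y ~ x)

rss <- sum(resid(mod)^2)       # residual sum of squares
df  <- length(y) - 2           # n minus two coefficients (intercept + slope)
rse <- sqrt(rss / df)

all.equal(rse, summary(mod)$sigma)  # TRUE: matches R's reported RSE
```

The `729 degrees of freedom` in our output is exactly this n - 2: the bikes data has 731 observations and the model estimates two coefficients.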
Multiple R-squared and Adjusted R-squared are similar: both report what percentage of the variability in the data is accounted for by our model. This output can be read as, “Roughly 39% of the variability in the data is explained by our model,” or equivalently, “Temperature explains roughly 39% of the variability in rentals.”
Which score you use depends on the model. Adjusted R-squared matters when there are many independent variables: it penalizes the model for each additional predictor (so it will be lower than the Multiple R-squared), which makes it the more conservative metric when building multiple regression models.
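The penalty is visible in the adjusted R-squared formula itself. Using the figures from the output above (R-squared = 0.3937, n = 731 observations, p = 1 predictor):

```r
r2 <- 0.3937   # Multiple R-squared from the summary output
n  <- 731      # observations (729 residual df + 2 estimated coefficients)
p  <- 1        # number of predictors

# Adjusted R-squared = 1 - (1 - R^2) * (n - 1) / (n - p - 1)
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
round(adj_r2, 4)   # 0.3929, matching the Adjusted R-squared in the output
```

With only one predictor and 731 observations the penalty is tiny (0.3937 vs. 0.3929), but every extra predictor increases p and pulls the adjusted score further down.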
We WANT both R-squared scores to be close to 1, but a score suspiciously close to 1 may signal that our model has overfitted the data. An overfitted model has essentially memorized the data it was trained on, so we should not expect it to perform well on new data.
Having a model that performs well with data
it has seen is useless if it performs terribly with data it hasn’t seen. Keep
that in mind!
Also, we need our model to perform as well
in theory as it should in the real world. Creating a model that has a supremely
high level of predictability is also useless if you can’t reproduce it and
implement it in a real-world problem. Rant over!
The F-statistic is the last
measurement I will discuss in detail, as we have already discussed the p-value
metric.
The F-statistic describes the relationship between the response variable and the predictor. The larger this number is above 1, the stronger the relationship.
It is vital that the F-statistic is NOT taken out of context of its corresponding p-value: an F-statistic well above 1, together with a p-value less than alpha (typically .05), is evidence of a genuine relationship between our response and predictor.
The F-statistic and the p-value should always be analyzed TOGETHER
when determining the strength of the relationship between response and
predictor.
For our case, the F-Statistic is much
greater than 1 (473.5), and the corresponding p-value is substantially less than .05,
so the strength of the relationship between the rentals and temperature
variable is high.
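For a model with a single predictor, the F-statistic can be recovered from R-squared and the degrees of freedom, which makes for a nice sanity check on the output above:

```r
r2   <- 0.3937   # Multiple R-squared from the summary output
df_m <- 1        # model degrees of freedom (one predictor)
df_r <- 729      # residual degrees of freedom

# F = (R^2 / df_model) / ((1 - R^2) / df_residual)
f_stat <- (r2 / df_m) / ((1 - r2) / df_r)
f_stat   # ~473.4; summary() reports 473.5 because it uses the unrounded R-squared
```

This also shows why the F-statistic and p-value travel together: the F value is compared against an F distribution with (1, 729) degrees of freedom to produce the p-value of < 2.2e-16.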
I used two sources to demonstrate my understanding of how to evaluate a regression model. I urge you to support Professor Nwanganga and Professor Chapple by purchasing their book, “Practical Machine Learning in R.”
Professor Chapple
also creates multiple learning resources in the area of machine learning and
analytics on LinkedIn Learning. Though LinkedIn Learning is a paid service, these resources are well worth your time!
If you prefer a more hands-on learning experience, I recommend pursuing higher education that teaches machine learning and all-things data.
At this point, many business schools offer master’s degrees in business analytics. If you are considering a master’s degree, be sure to pick a curriculum that teaches R programming, Python programming, project management, SQL, and data visualization.
The field of analytics is broad, so
you can apply your business analytics degree in any field where there is a
requirement of acquiring data, cleaning it, visualizing it, modeling it, and
communicating with it.
I can't sing your praises highly enough. The fact that you are on this page, looking at data analytics resources, signals to me that you are genuinely trying to make healthy changes in your life.
Loved ones, friends, professionals, and others around you will see your transformation and invest in you and your growth, and this will be valuable in your transition into an analytics role.
Thank you so much for your support and for taking the time to read this post. I hope it was helpful to you. Enjoy your journey into the wonderful world of data analytics and machine learning.
SOURCES:
Book: Fred Nwanganga and Mike Chapple, “Practical Machine Learning in R,” University of Notre Dame, Mendoza College of Business, 2020.
Lecture: Class materials presented by DePaul University, Kellstadt Graduate School of Business, Chicago, IL, 2022.