In this post, I would like to go deeper into evaluating the output of a linear regression model. In hopes of eliminating any confusion you might have, I will be discussing:
RESIDUALS
COEFFICIENTS
DIAGNOSTIC readings such as: Residual Standard Error, Multiple and Adjusted R-squared, the F-statistic, and the p-value.
For that purpose, let’s examine the output of a linear
model for the bikes data set that I posted on last week.
> summary(bikes_mod1)

Call:
lm(formula = rentals ~ temperature, data = bikes)

Residuals:
    Min      1Q  Median      3Q     Max 
-4615.3 -1134.9  -104.4  1044.3  3737.8 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -166.877    221.816  -0.752    0.452    
temperature   78.495      3.607  21.759   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1509 on 729 degrees of freedom
Multiple R-squared:  0.3937,	Adjusted R-squared:  0.3929 
F-statistic: 473.5 on 1 and 729 DF,  p-value: < 2.2e-16
After passing our model (bikes_mod1)
into the summary() function, R will spit out key metrics that we need to
interpret. Don’t worry. You don’t need to be a math wizard to have a basic
understanding of what these measurements mean.
The goal is to be able to communicate
these numbers to management—hopefully you do this ethically—so they can be
persuaded to take action on decisions that will positively impact the
organization.
I hope this review will be helpful to you, whether in your studies or as a data entrepreneur. As you know, supervised machine learning methods, like linear regression, are exciting: being able to make predictions about the future based on past observations is a tremendous skill to have, so you can imagine how enthusiastic I am to learn AND teach these industry-standard techniques.
Let’s go over the keywords highlighted in red in greater detail.
Call:
lm(formula = rentals ~ temperature, data = bikes)
Under “Call” we can see the linear model that we have built with specific variables in our data set (bikes), namely rentals and temperature.
As you can see in the formula, we state
our dependent variable first, "rentals," which is impacted by an independent
variable, "temperature."
For ease of recollection, I will refer to
the "rentals" variable as the response variable, as it is the variable we
have predicted with our model. Similarly, I will refer to the "temperature" variable as the predictor variable.
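To make this concrete, here is a minimal sketch of how a model like this is fit. The original `bikes` data set is not reproduced in this post, so the data frame below is simulated stand-in data (an assumption for illustration only); the coefficients it produces will not match the output above exactly.

```r
# Simulate a stand-in for the bikes data frame (731 observations, like the
# real data set), with rentals driven by temperature plus random noise.
set.seed(42)
bikes <- data.frame(temperature = runif(731, min = 0, max = 35))
bikes$rentals <- -167 + 78.5 * bikes$temperature + rnorm(731, sd = 1500)

# Response (rentals) on the left of ~, predictor (temperature) on the right
bikes_mod1 <- lm(rentals ~ temperature, data = bikes)
summary(bikes_mod1)  # prints the Call, Residuals, Coefficients, and diagnostics
```

The formula syntax `response ~ predictor` is what produces the "Call" you see echoed back in the output.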
Let’s continue…
Coefficients:
(Intercept) temperature
-166.9 78.5
The Coefficients section assigns a number to each term in the model, and these numbers appear in our model’s final equation. As a reminder, the stock equation for univariate regression is:

y = b0 + b1(x)

where b0 is the intercept and b1 is the slope on the predictor. Substituting our coefficients into this template, our final model is:
rentals = -166.9 +
78.5(temperature)
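To see what this equation produces, we can plug in a hypothetical temperature. The value of 20 degrees below is chosen only for illustration:

```r
# Fitted equation from the output: rentals = -166.9 + 78.5 * temperature
intercept <- -166.9
slope     <- 78.5

temp <- 20                                # hypothetical temperature, in degrees
predicted_rentals <- intercept + slope * temp
predicted_rentals                         # -166.9 + 78.5 * 20 = 1403.1
```

So at 20 degrees, the model predicts roughly 1,403 rentals. Note the intercept is negative: the model would predict negative rentals near 0 degrees, a reminder that the equation is only meaningful within the range of the observed data.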
What does this equation mean?
We will
discover the answer to this question as we continue learning about our
residuals:
Residuals:
    Min      1Q  Median      3Q     Max 
-4615.3 -1134.9  -104.4  1044.3  3737.8
Residuals, in mathematical speak, are the observed values minus the predicted values. They are the “error” in our prediction. In plain language, we can say that “for at least ONE unit of temperature, we OVERPREDICTED the number of bike rentals by 4,615 bikes (our Min value). Conversely, for at least ONE unit of temperature, we UNDERPREDICTED rentals by 3,737 bikes (our Max value).”
Note that I use the term “units” as a general way to refer to the predictor variable, since this variable can be in ANY unit. For this problem, temperature is most likely noted in “degrees.”
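The residual arithmetic itself is simple subtraction. A toy example with made-up numbers (not the bikes data):

```r
# Residual = observed value - predicted value
observed  <- c(1200.0, 1550.0,  900.0)   # hypothetical actual rentals
predicted <- c(1403.1, 1403.1, 1010.0)   # hypothetical model predictions
residuals <- observed - predicted
residuals            # -203.1  146.9 -110.0  (negative = overprediction)
summary(residuals)   # Min / quartiles / Max, as reported under Residuals
```

The Residuals block in the model output is exactly this kind of five-number summary, computed over all 731 observations.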
So, what about the coefficients chart?
Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -166.877    221.816  -0.752    0.452    
temperature   78.495      3.607  21.759   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
We have touched on estimates already, but as a reminder, they are the values that appear in the final equation: the y-intercept is -166.877, and the coefficient on the predictor variable, temperature, is 78.495. These values are given to us under the “Estimate” column in the “Coefficients” section of the output.
As emerging business analysts, we are interested in the p-value column, Pr(>|t|), as it tells us whether a variable is statistically significant. The smaller this number, the stronger the evidence that the predictor has a genuine relationship with the response.
This number is represented in a decimal
and will reside in the range between 0 and 1.
To help us decide HOW significant a p-value is, asterisks (*) appear next to it. It is standard practice to favor variables whose p-value is LESS than an alpha of .05. (Alpha is a threshold we choose in advance that sets the level of error we are willing to accept; 5%, or .05, is the most common choice.)
Put differently, any variable with a
significance level of less than alpha (of .05) is desirable in model building,
as it will have a high degree of predictability, or relevance.
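In R, the coefficient table can be pulled out programmatically with `summary()$coefficients`. The sketch below uses simulated data, since the bikes data set is not reproduced here:

```r
# Simulate a simple data set with a real underlying relationship
set.seed(1)
x <- runif(100)
y <- 3 * x + rnorm(100)
mod <- lm(y ~ x)

# The coefficient matrix: Estimate, Std. Error, t value, Pr(>|t|)
coef_table <- summary(mod)$coefficients
p_values   <- coef_table[, "Pr(>|t|)"]

alpha <- 0.05
p_values < alpha    # TRUE for terms significant at our chosen alpha
```

This is handy when screening many predictors: rather than eyeballing asterisks, you compare the Pr(>|t|) column against your alpha directly.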
We now need to understand further
diagnostic measurements.
Residual standard error: 1509 on 729 degrees of freedom
Multiple R-squared:  0.3937,	Adjusted R-squared:  0.3929
F-statistic: 473.5 on 1 and 729 DF,  p-value: < 2.2e-16
The Residual Standard Error should be small: the smaller this number, the better our model fits the data. In context, it can be phrased as, “The actual number of bike rentals deviates from our predictions by roughly 1,509 rentals, on average.” Depending on the needs of the business and how success is measured within the organization, this number may or may not be acceptable.
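Under the hood, the residual standard error is the square root of the residual sum of squares divided by the degrees of freedom (n minus the number of estimated coefficients). A sketch on simulated data:

```r
# Simulate data and fit a univariate model
set.seed(7)
x <- runif(50)
y <- 2 * x + rnorm(50)
mod <- lm(y ~ x)

rss <- sum(resid(mod)^2)       # residual sum of squares
df  <- length(y) - 2           # n minus two coefficients (intercept + slope)
rse <- sqrt(rss / df)

all.equal(rse, summary(mod)$sigma)  # TRUE: matches R's reported RSE
```

The `729 degrees of freedom` in our output is exactly this n - 2: the bikes data has 731 observations and the model estimates two coefficients.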
Multiple R-squared and Adjusted R-squared are similar: both report what percentage of the variability in the data is accounted for by our model. This output can be read as, “Roughly 39% of the variability in the data is explained by our model,” or equivalently, “Temperature explains roughly 39% of the variability in rentals.”
Which score you use depends on the model. Adjusted R-squared matters when there are many independent variables: it penalizes the model for each additional predictor (so it will be lower than the Multiple R-squared), which makes it the more conservative metric when building multiple regression models.
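The penalty is visible in the adjusted R-squared formula itself. Using the figures from the output above (R-squared = 0.3937, n = 731 observations, p = 1 predictor):

```r
r2 <- 0.3937   # Multiple R-squared from the summary output
n  <- 731      # observations (729 residual df + 2 estimated coefficients)
p  <- 1        # number of predictors

# Adjusted R-squared = 1 - (1 - R^2) * (n - 1) / (n - p - 1)
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
round(adj_r2, 4)   # 0.3929, matching the Adjusted R-squared in the output
```

With only one predictor and 731 observations the penalty is tiny (0.3937 vs. 0.3929), but every extra predictor increases p and pulls the adjusted score further down.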
We WANT both R-squared scores to be close to 1, but a score suspiciously close to 1 may signal that our model has overfitted the data. An overfitted model has essentially memorized the data it was trained on, so we should not expect it to perform well on new data.
Having a model that performs well with data
it has seen is useless if it performs terribly with data it hasn’t seen. Keep
that in mind!
Also, we need our model to perform as well
in theory as it should in the real world. Creating a model that has a supremely
high level of predictability is also useless if you can’t reproduce it and
implement it in a real-world problem. Rant over!
The F-statistic is the last
measurement I will discuss in detail, as we have already discussed the p-value
metric.
The F-statistic describes the relationship between the response variable and the predictor. The larger this number is above 1, the stronger the relationship.
It is vital that the F-statistic is NOT taken out of context of its corresponding p-value: an F-statistic well above 1, together with a p-value less than alpha (typically .05), is evidence of a genuine relationship between our response and predictor.
The F-statistic and the p-value should always be analyzed TOGETHER
when determining the strength of the relationship between response and
predictor.
For our case, the F-Statistic is much
greater than 1 (473.5), and the corresponding p-value is substantially less than .05,
so the strength of the relationship between the rentals and temperature
variable is high.
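For a model with a single predictor, the F-statistic can be recovered from R-squared and the degrees of freedom, which makes for a nice sanity check on the output above:

```r
r2   <- 0.3937   # Multiple R-squared from the summary output
df_m <- 1        # model degrees of freedom (one predictor)
df_r <- 729      # residual degrees of freedom

# F = (R^2 / df_model) / ((1 - R^2) / df_residual)
f_stat <- (r2 / df_m) / ((1 - r2) / df_r)
f_stat   # ~473.4; summary() reports 473.5 because it uses the unrounded R-squared
```

This also shows why the F-statistic and p-value travel together: the F value is compared against an F distribution with (1, 729) degrees of freedom to produce the p-value of < 2.2e-16.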
I used two sources to demonstrate my understanding of how to evaluate a regression model. I urge you to support Professor Nwanganga and Professor Chapple by purchasing their book, “Practical Machine Learning in R.”
Professor Chapple
also creates multiple learning resources in the area of machine learning and
analytics on LinkedIn Learning. Though LinkedIn Learning is a paid service, these resources are well worth your time!
If you prefer a more hands-on learning experience, I recommend pursuing higher education that teaches machine learning and all-things data.
At this point, many business schools offer master’s degrees in business analytics. If you are considering a master’s degree, be sure to pick a curriculum that teaches R programming, Python programming, project management, SQL, and data visualization.
The field of analytics is broad, so
you can apply your business analytics degree in any field where there is a
requirement of acquiring data, cleaning it, visualizing it, modeling it, and
communicating with it.
I can't sing your praises highly enough. The fact that you are on this page, looking at data analytics resources, signals to me that you are genuinely trying to make healthy changes in your life.
Loved ones, friends, professionals, and others around you will see your transformation and invest in you and your growth, and this will be valuable in your transition into an analytics role.
Thank you so much for your support and for taking the time to read this post. I hope it was helpful to you. Enjoy your journey into the wonderful world of data analytics and machine learning.
SOURCES:
Book: Fred Nwanganga and Mike Chapple, “Practical Machine Learning in R,” University of Notre Dame, Mendoza College of Business, 2020.
Lecture: Class materials presented by DePaul University, Kellstadt Graduate School of Business, Chicago, IL, 2022.