Wednesday, August 10, 2022

Regression Analysis for School Project: A Template for How Porn Stars Can Do Machine Learning Projects Independently And Efficiently

WELCOME


Hello ladies! Thank you so much for your continued support. I love writing for all of you. 


Last week, I wrote a blog post about attending college, or at least considering the prospect of school. I will link that here.  Related to that, this post details a Final Exam Project that you will likely encounter if you decide to pursue an Business Analytics master's degree. 


Originally, it was completed using Microsoft Excel, but I adapted it for you using R script. The project itself is very basic. As objective as I tried to be, there is a level of subjectivity to the analysis. I don't go into too many details with the decisions that I took to arrive at the final model, and that is a flaw of this post. I hope that you can learn through example, in this case, as opposed to explanation. 


The point of this final was to use Regression to create a prediction equation for a data set (or business problem) of interest. The scenario is fictitious, of course. But I certainly had fun with it. If you care to try it for yourself, I will post the link to the Google Drive. You can find the link under the "Data Sets" tab on the home page. 


Without further delay, I hope you enjoy this sample project.


(As always, the code that you will run in R Studio is highlighted in BLUE).


CLASS FINAL PROJECT: Stair Mart Analysis


INTRODUCTION

 

I am a business analytics consultant, and I have been asked by the General Manager of the Stair Mart Stores chain to conduct a research project to determine which variables in a supplied data set should be used for predicting annual sales for the coming year.

 

As this is the first project in my professional career as a private consultant, I feel that Regression Analysis is a great place to start. By using a specific method called OLS, or the Ordinary Least Squares Method, I can determine which variables in the supplied data set will be correlated with each other, as well as the response variable, annual net sales (ann_net_sales).

 

The OLS method is widely used, and it is considered an industry “best practice,” because it measures the distance between “predicted values and actual values” in analysis. This distance between these two points is known as the “error” term or the “residual.” The smaller this term is, the more predictive a model can be.

 

VARIABLES IN THE DATA SET

 

The data set that was given to me by the General Manager of Stair Mart Stores has six variables, which I will now describe:

 

1.         “ann_net_sales”: This variable will be my response variable, or the variable of interest that I wish to make predictions using the OLS method. This variable describes annual sales representing each observation, or store location.


2.         “sq_ft”: This variable will be one of the five predictor variables that I will use to determine in my regression model. It describes the square footage, or the size of each store.


3.         “inv”: Another independent variable that I will use to determine a prediction equation for annual net sales. This variable describes inventory of goods in each store location.


4.         “adv”: This variable represents the number of advertising dollars spent at each store location. It will also be an explanatory variable, or independent variable, in my analysis.


5.         “dist_size”: This will be the fourth independent variable that I use to determine my regression equation, and it describes the quantity of sales in a district for every 1,000 families.


6.         “comp”: This is the final explanatory variable that I will use for my regression analysis, and it describes how many competing stores (or businesses that offer similar products) are in the vicinity of that specific All Greens store location.

 

There are some important considerations in the project. Firstly, the sample size of the data set contains a total of 27 All Greens store locations. So, this will be an obstacle when determining which variables are more significant than others. 


In conducting a project on a smaller sample size, I will need to account for more error in the regression equation. Second, I do not have access to competitor data, which would be vital for running an ANOVA test. I will proceed with caution.

 

GOAL AND METHOD

 

As I previously mentioned, my goal in this project is to determine a prediction equation using the OLS method. Since all of the variables are of numerical nature, I will operate under the assumption that a regression line can be used in this instance. However, there are some steps that I need to take in order to ensure that my assumptions hold merit:

 

1.         I will use scatter plots to graph each independent variable with the response variable, ann_net_sales, to determine if there is a “linear” relationship between them.


2.         I will build out a regression with as many of the significant variables that explain as much variation in our data as possible, using p-values and the “adjusted R-Squared” score.


3.         I will eliminate the variables with lessor importance so that I can build a regression equation with two or three explanatory variables.


4.         I will use “residual” analysis to ascertain whether my assumptions regarding my analysis can be utilized with confidence.

 

Finally, I will be using R studio to conduct my analysis.

 

SCATTER PLOT ANALYSIS

 

I will begin the analysis with scatter plots of all the explanatory variables (independent variables) compared to my response variable (dependent variable). For reference, my response variable is “ann_net_sales.”

 

setwd("C:/Users/firstnamelastinitial/Documents/R")
library(tidyverse)

## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --

## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.4     v dplyr   1.0.7
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   2.0.1     v forcats 0.5.1

## Warning: package 'stringr' was built under R version 4.1.2

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

 

I will bring the data set that the General Manager supplied to me in the computing environment. I will simply name the data set “stair.”

 

stair <- read.csv("stair_mart.csv", header = TRUE)

 

I will now learn about the data set.

 

glimpse(stair)

## Rows: 27
## Columns: 6
## $ ann_net_sales <int> 231000, 156000, 10000, 519000, 437000, 487000, 299000, 1~
## $ sq_ft         <int> 3000, 2200, 500, 5500, 4400, 4800, 3100, 2500, 1200, 600~
## $ inv           <int> 294000, 232000, 149000, 600000, 567000, 571000, 512000, ~
## $ adv           <int> 8200, 6900, 3000, 12000, 10600, 11800, 8100, 7700, 3300,~
## $ dist_size     <dbl> 8.2, 4.1, 4.3, 16.1, 14.1, 12.7, 10.1, 8.4, 2.1, 4.7, 12~
## $ comp          <int> 11, 12, 15, 1, 5, 4, 10, 12, 15, 8, 1, 7, 3, 14, 11, 10,~

 

This data set is small, and it only contains 27 observations (rows) and 6 variables (columns). The variables have been mentioned in an earlier section of this report, and they are numeric in nature.

 

To continue with my scatter plot analysis, I need to create x and y-axis objects to code into separate scatter plots. Here are the objects:

 

ann_net_sales <- stair$ann_net_sales
sq_ft
<- stair$sq_ft
inv
<- stair$inv
adv
<- stair$adv
dist_size
<- stair$dist_size
comp
<- stair$comp

 

Now I can analyze each object separately against my response variable. The process should look like this:

 

Here is the relationship of the square footage and sales:

 

plot(sq_ft, ann_net_sales,
    
main="Effect of Square Feet on Annual Net Sales",
    
xlab="Size", ylab="Annual Net Sales")

abline(lm(ann_net_sales~sq_ft))


Scatter Plot 1

Here is the relationship of advertising costs and sales:

 

plot(adv, ann_net_sales,
    
main="Effect of Advertising on Annual Net Sales",
    
xlab="Advertising Costs", ylab="Annual Net Sales")

abline(lm(ann_net_sales~adv))


Scatter Plot 2


Here is the relationship of the inventory and sales:

 

plot(inv, ann_net_sales,
     main="Effect of Inventory on Annual Net Sales",
     xlab="Inventory", ylab="Annual Net Sales")

abline(lm(ann_net_sales~inv))


Scatter Plot 3

Here is the relationship of the district size and sales:

 

plot(dist_size, ann_net_sales,
     main="Effect of District Size on Annual Net Sales",
     xlab="DistricT Size", ylab="Annual Net Sales")

abline(lm(ann_net_sales~dist_size))


Scatter Plot 4

finally, here is the relationship between competitor locations and sales:

 

plot(comp, ann_net_sales,
     main="Effect of Competition on Annual Net Sales",
     xlab="Competitor Locations", ylab="Annual Net Sales")

abline(lm(ann_net_sales~comp))

Scatter Plot 5

To summarize the scatter plot analysis, all of the graphs contain explanatory variables that seem to display a LINEAR relationship with the response variable. All of the relationships are POSITIVE, except the relationship between competitor locations and annual sales.

 

What does this mean?

 

Relationships that have a POSITIVE LINEAR trend:

 

As there is an increase along the x-axis, the response variable along the y-axis also increases. It is sensible to assume that the larger the store (sq_ft), the more sales are expected to be, for instance.

 

Similarly, a larger inventory (inv) can lead to more sales as customers can expect to make purchases based on more goods that the store provides.

 

One might reasonably expect that a larger advertising budget (adv) can enable stores and businesses to have more visibility, which can increase the sales of the store.

 

The size of the district (dist_size) can also impact sales. The larger the district, the larger the potential for sales.

 

Relationships that have a NEGATIVE LINEAR trend:

 

As noted on the final scatter plot, having more competitors (comp) in the vicinity of your business can impact sales negatively. Meaning, as the number of competitors increases, the amount of sales decreases. This is understandable as consumers have more options to purchase similar product offerings.

 

Finally, I can conclude that regression analysis is appropriate for this data set, as each of the scatter plot reveal seemingly linear relationships between explanatory and response variables.

 

SUMMARY STATISTICS

 

For further analysis, I can derive descriptive statistics concerning the data set:

 

summary(stair$ann_net_sales)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##     500   98500  341000  286574  450500  570000

summary(stair$sq_ft)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##     500    1400    3500    3326    4750    8600

summary(stair$comp)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   0.000   4.000   8.000   7.741  12.000  15.000

summary(stair$dist_size)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   1.600   4.500  11.300   9.693  14.050  16.300

summary(stair$adv)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    2500    4800    8100    8100   10950   17400

summary(stair$inv)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##  102000  204000  382000  387482  551000  788000

 

I can also create a correlation plot to determine which of the explanatory variables has the largest correlation with the response variable.

 

library(corrplot)

## Warning: package 'corrplot' was built under R version 4.1.2

## corrplot 0.92 loaded

stairnumeric <- stair
stair_correlations <- cor(stairnumeric)
corrplot(stair_correlations, type = "upper")


Top Half Of Correlation Plot



For further specificity:

 

corrplot.mixed(stair_correlations)


Correlation Plot With Decimals



All of the explanatory variables show a higher linear correlation with the response variable (ann_net_sales), so a regression analysis is appropriate for this project.

 

INFERENTIAL STATISTICS

 

As I mentioned in my introduction, I will be using OLS, or Ordinary Least Squares method, to calculate the smallest distance between the actual values and predicted values, as well as the Adjusted R-Squared Score, and significance levels depicted by p-values.

 

Before running the regression, I would like to note that typically, the more variables you add to a regression equation, the higher you can expect the R-Squared value to be. 


However, in my analysis, I will seek to eliminate explanatory variables with lessor importance so as to include the most significant variables for the General Manager to focus on in her sales strategy.

 

With this in mind, there is a cost that I must incur. At the expense of selecting the most significant explanatory variables, I will have a model with a smaller R-Squared score. This is a cost that is acceptable to me, as I want my recommendation to be practical and easily implemented by the General Manager.

 

It will be easier for the General Manager to implement a practical solution to her business if she has less variables (that have the highest significance) to worry about in her upcoming sales campaign. 


In other words, more variables can be more costly to focus on, so I will recommend to her the most important ones so that she can prioritize different features in the data.

 

Here is the output of the Regression with all five explanatory variables impacting our response variable, Annual Net Sales:

 

model_all <- lm(ann_net_sales~sq_ft + adv + comp + dist_size + inv)
summary(model_all)

##
## Call:
## lm(formula = ann_net_sales ~ sq_ft + adv + comp + dist_size +
##     inv)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -26338  -9699  -4496   4040  41139
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)   
## (Intercept) -1.886e+04  3.015e+04  -0.626 0.538372   
## sq_ft        1.620e+01  3.544e+00   4.571 0.000166 ***
## adv          1.153e+01  2.532e+00   4.552 0.000174 ***
## comp        -5.311e+03  1.705e+03  -3.114 0.005249 **
## dist_size    1.358e+04  1.770e+03   7.671 1.61e-07 ***
## inv          1.746e-01  5.761e-02   3.032 0.006347 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17650 on 21 degrees of freedom
## Multiple R-squared:  0.9932, Adjusted R-squared:  0.9916
## F-statistic: 611.6 on 5 and 21 DF,  p-value: < 2.2e-16

 

Knowing that more explanatory variables can be beneficial in improving an R-Square score, it is no surprise that having all of the explanatory variables in our regression equation has yielded such a high score (99.2%). In fact, at this point, I can interpret this score as such:

 

“Our multiple linear regression model can explain roughly 99.2 percent of the variability in our data.”

 

In other words, this model is strong! Further, all the p-values of each explanatory variable are statistically significant given that they are all smaller than our alpha justified by a 5% significance level. The F statistic tells us that the model is statistically significant, as well.

 

If the General Manager had unlimited resources to improve her Annual Net Sales, she might conclude that these five variables would be sufficient in a prediction equation. But there is never a case in which resources are unlimited.

 

It is true that this model has a high predictive power! Generally speaking, the higher the R Square Score, the better. Yet, there are two variables that I wish to eliminate from this model as they are variables that the General Manager cannot control in her upcoming campaign. Specifically, I wish to eliminate competition size (comp) and the size of the district (dist_size). 


The General Manager CAN'T control how many competitors are in the vicinity of Stair Mart locations. Additionally, she CAN'T control the size of the district.

 

It is important to note that our model’s predictive power will suffer in the newest iteration of our prediction equation.

 

The remaining variables will remain in the model, so our new model will look like this:

 

model_final <- lm(ann_net_sales~adv + inv + sq_ft)
summary(model_final)

##
## Call:
## lm(formula = ann_net_sales ~ adv + inv + sq_ft)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -87920 -20388   6427  23885 140000
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)   
## (Intercept) -9.855e+04  2.158e+04  -4.566 0.000137 ***
## adv          1.788e+01  5.725e+00   3.123 0.004776 **
## inv          3.300e-01  1.396e-01   2.364 0.026904 * 
## sq_ft        3.380e+01  8.461e+00   3.995 0.000569 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 46470 on 23 degrees of freedom
## Multiple R-squared:  0.9482, Adjusted R-squared:  0.9415
## F-statistic: 140.4 on 3 and 23 DF,  p-value: 6.308e-15

 

As expected, our Adjusted R Square score is roughly 5 percentage points lower than the previous model (94.2%). The F-statistic and the p-values associated with this model are all acceptable. As an observation, the output is telling me that inventory (inv) is of lessor significance compared to the other variables in this model.


I can interpret this new model as:


"Advertising, inventory, and square footage explains roughly 94.15% of the uncertainty in predicted annual net sales, given an significance level of 5%."


With this in mind, I am going to KEEP this model as my final recommendation for the General Manager of Stair Mart so that she can control for all of the variables that are significant in the data set. 


Further, knowing that "inventory" is a less significant variable, the General Manager can fine-tune her modifications to the create the optimal prediction equation.

 

Here is what the model looks like in numerical terms:

 

“Predicted Value of Annual Sales = -98553+ 33.80 (sq_ft) + 0.33 (inv) + 17.88 (adv) + E

 

RESIDUAL OUTPUT

 

Now I will confirm whether my assumptions have merit with a residual plot.

 

plot(model_final)


Residual Plot 1


 


Normality Plot



Residual Analysis Plot



Residual Analysis Plot 2


In our residual vs. fitted plot, there is no observable pattern. Therefore, I can conclude that a linear regression model is appropriate for this project. In fact, in ALL of the Residual Plots, there are no patterns displayed by the data points. This is a good sign, and legitimizes the use of a regression model.

 

CONCLUSION

 

This analysis was conducted for the General Manager of the Stair Mart Franchise stores in the mid-west. The purpose was to determine whether a regression analysis would be appropriate for determining Annual Net Sales for the upcoming year based on this past year’s data. 


Since the variables were mainly numerical, and the scatter plots and residuals confirmed the appropriateness of a regression style analysis, the General Manager now has a prediction equation that she can use to plan and strategize sales campaigns for the following year.

 

Given that these variables are in her control, she can influence specific factors—size of the stores (sq_ft), inventory volume (inv), and advertisement spending (adv)—to improve sales growth for Stair Mart. Here is the prediction equation that the General Manager should use in prioritizing variables of interest:

 

“Predicted Value of Annual Sales = -98553+ 33.80 (sq_ft) + 0.33 (inv) + 17.88 (adv) + E

 

It is vital to note that there are some flaws within this analysis. 


Firstly, it would have been more advantageous to work with more observations in the data set, specifically, more store locations. Having a larger sample size would also allow an analyst to avoid heteroscedasticity in the model production process.

 

Second, having similar corresponding competitor data would have been useful for fine-tuning the final model, and it would have given greater insight on how to manipulate the explanatory variables in the final model.

 

Finally, additional variables and data sets like customer segments (age, SES, ethnicity), climate, supply chain factors, would prove to be useful in creating a richer analysis. Although, this type of data can be very costly to obtain.



THANK YOU!


Ladies, I hope this project will serve as an appropriate learning tool for your development and transition from the Sex Work Industry. Why can't porn stars learn data analytics and machine learning, too? 


Why is it that this type of information is only available to those who are privileged? This is just not right! 


I want this blog to serve as a safe haven for all of you ladies to learn, discover and explore the wonderful world of analytics and machine learning. Are you having as much fun as I am? I hope so....


Thank you ladies! 


Share:
Location: Chicago, IL, USA