WELCOME
Hello ladies! Thank you so much for your continued support. I love writing for all of you.
Last week, I wrote a blog post about attending college, or at least considering the prospect of school. I will link that here. Related to that, this post details a Final Exam Project that you will likely encounter if you decide to pursue a Business Analytics master's degree.
Originally, it was completed using Microsoft Excel, but I adapted it for you using R script. The project itself is very basic. As objective as I tried to be, there is a level of subjectivity to the analysis. I don't go into too much detail about the decisions I made to arrive at the final model, and that is a flaw of this post. I hope that you can learn through example, in this case, as opposed to explanation.
The point of this final was to use Regression to create a prediction equation for a data set (or business problem) of interest. The scenario is fictitious, of course. But I certainly had fun with it. If you care to try it for yourself, I will post the link to the Google Drive. You can find the link under the "Data Sets" tab on the home page.
Without further delay, I hope you enjoy this sample project.
(As always, the code that you will run in R Studio is highlighted in BLUE).
CLASS FINAL PROJECT: Stair Mart Analysis
INTRODUCTION
I am a business analytics consultant, and I have been
asked by the General Manager of the Stair Mart Stores chain to conduct a
research project to determine which variables in a supplied data set should be
used for predicting annual sales for the coming year.
As this is the first project of my professional career as a private consultant, I feel that regression analysis is a great place to start. By using a specific method called OLS, or Ordinary Least Squares, I can determine which variables in the supplied data set are correlated with each other and with the response variable, annual net sales (ann_net_sales).
The OLS method is widely used, and it is considered an industry “best practice,” because it minimizes the distance between predicted values and actual values. The distance between these two points is known as the “error” term, or the “residual.” The smaller these terms are, the more predictive a model can be.
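To make that concrete, here is a tiny sketch with made-up numbers (not the Stair Mart data) showing what a residual is, and the quantity that OLS makes as small as possible:

```r
# Made-up actual and predicted values for three hypothetical stores
actual    <- c(100, 150, 210)
predicted <- c(110, 140, 205)

# The residual is the gap between each actual and predicted value
residuals <- actual - predicted   # -10, 10, 5

# OLS picks the line that minimizes this sum of squared residuals
sum_sq <- sum(residuals^2)        # 225
```

The fitted line with the smallest possible `sum_sq` is the OLS regression line.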
VARIABLES IN THE DATA SET
The data set that was given to me by the General Manager
of Stair Mart Stores has six variables, which I will now describe:
1. “ann_net_sales”: This is my response variable, the variable of interest that I wish to predict using the OLS method. It describes the annual net sales of each observation, or store location.
2. “sq_ft”: This is one of the five predictor variables that I will consider for my regression model. It describes the square footage, or the size, of each store.
3. “inv”: Another independent variable that I will use to determine a prediction equation for annual net sales. This variable describes the inventory of goods in each store location.
4. “adv”: This variable represents the number of advertising dollars spent at each store location. It will also be an explanatory variable, or independent variable, in my analysis.
5. “dist_size”: This is the fourth independent variable that I will use to determine my regression equation, and it describes the quantity of sales in a district for every 1,000 families.
6. “comp”: This is the final explanatory variable that I will use for my regression analysis, and it describes how many competing stores (or businesses that offer similar products) are in the vicinity of that specific Stair Mart store location.
There are some important considerations in this project. First, the data set contains a total of 27 Stair Mart store locations. This small sample size will be an obstacle when determining which variables are more significant than others. In conducting a project on a smaller sample, I will need to account for more error in the regression equation. Second, I do not have access to competitor data, which would be vital for running an ANOVA test. I will proceed with caution.
GOAL AND METHOD
As I previously mentioned, my goal in this project is to determine a prediction equation using the OLS method. Since all of the variables are numeric, I will operate under the assumption that a regression line can be used in this instance. However, there are some steps that I need to take to ensure that my assumptions hold merit:
1. I will use scatter plots to graph each independent variable against the response variable, ann_net_sales, to determine if there is a “linear” relationship between them.
2. I will build a regression with as many significant variables as possible, so that the model explains as much variation in our data as possible, using p-values and the “adjusted R-squared” score.
3. I will eliminate the variables of lesser importance so that I can build a regression equation with two or three explanatory variables.
4. I will use “residual” analysis to ascertain whether my assumptions can be relied on with confidence.
Finally, I will be using RStudio to conduct my analysis.
SCATTER PLOT ANALYSIS
I will begin the analysis with scatter plots of all the
explanatory variables (independent variables) compared to my response variable
(dependent variable). For reference, my response variable is “ann_net_sales.”
setwd("C:/Users/firstnamelastinitial/Documents/R")
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.4     v dplyr   1.0.7
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   2.0.1     v forcats 0.5.1
## Warning: package 'stringr' was built under R version 4.1.2
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
I will bring the data set that the General Manager supplied to me into the computing environment. I will simply name the data set “stair.”
stair <- read.csv("stair_mart.csv", header = TRUE)
I will now learn about the data set.
glimpse(stair)
## Rows: 27
## Columns: 6
## $ ann_net_sales <int> 231000, 156000, 10000, 519000, 437000, 487000, 299000, 1~
## $ sq_ft         <int> 3000, 2200, 500, 5500, 4400, 4800, 3100, 2500, 1200, 600~
## $ inv           <int> 294000, 232000, 149000, 600000, 567000, 571000, 512000, ~
## $ adv           <int> 8200, 6900, 3000, 12000, 10600, 11800, 8100, 7700, 3300,~
## $ dist_size     <dbl> 8.2, 4.1, 4.3, 16.1, 14.1, 12.7, 10.1, 8.4, 2.1, 4.7, 12~
## $ comp          <int> 11, 12, 15, 1, 5, 4, 10, 12, 15, 8, 1, 7, 3, 14, 11, 10,~
This data set is small: it contains only 27 observations (rows) and 6 variables (columns). The variables were described in an earlier section of this report, and they are all numeric in nature.
To continue with my scatter plot analysis, I need to
create x and y-axis objects to code into separate scatter plots. Here are the
objects:
ann_net_sales <- stair$ann_net_sales
sq_ft <- stair$sq_ft
inv <- stair$inv
adv <- stair$adv
dist_size <- stair$dist_size
comp <- stair$comp
Now I can analyze each object separately against my
response variable. The process should look like this:
Here is the relationship of the square footage and sales:
plot(sq_ft, ann_net_sales,
     main = "Effect of Square Feet on Annual Net Sales",
     xlab = "Size", ylab = "Annual Net Sales")
abline(lm(ann_net_sales ~ sq_ft))
Here is the relationship of advertising costs and sales:
plot(adv, ann_net_sales,
     main = "Effect of Advertising on Annual Net Sales",
     xlab = "Advertising Costs", ylab = "Annual Net Sales")
abline(lm(ann_net_sales ~ adv))
Here is the relationship of the inventory and sales:
plot(inv, ann_net_sales,
     main = "Effect of Inventory on Annual Net Sales",
     xlab = "Inventory", ylab = "Annual Net Sales")
abline(lm(ann_net_sales ~ inv))
Here is the relationship of the district size and sales:
plot(dist_size, ann_net_sales,
     main = "Effect of District Size on Annual Net Sales",
     xlab = "District Size", ylab = "Annual Net Sales")
abline(lm(ann_net_sales ~ dist_size))
Finally, here is the relationship between competitor locations and sales:
plot(comp, ann_net_sales,
     main = "Effect of Competition on Annual Net Sales",
     xlab = "Competitor Locations", ylab = "Annual Net Sales")
abline(lm(ann_net_sales ~ comp))
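As a side note, the five nearly identical plot calls above can also be produced with a loop over the predictor names. This is just an optional shortcut, assuming the "stair" data frame is already loaded as shown earlier:

```r
# Optional shortcut: draw the same five scatter plots in a loop
# instead of repeating the plot() call for each predictor.
predictors <- c("sq_ft", "inv", "adv", "dist_size", "comp")

for (v in predictors) {
  plot(stair[[v]], stair$ann_net_sales,
       main = paste("Effect of", v, "on Annual Net Sales"),
       xlab = v, ylab = "Annual Net Sales")
  abline(lm(stair$ann_net_sales ~ stair[[v]]))
}
```

The axis labels are plainer this way (they reuse the column names), but the plots show the same relationships.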
To summarize the scatter plot analysis, all of the graphs contain explanatory variables that seem to display a LINEAR relationship with the response variable. All of the relationships are POSITIVE, except the relationship between competitor locations and annual sales.
What does this mean?
Relationships that have a POSITIVE LINEAR trend:
As values increase along the x-axis, the response variable along the y-axis also increases. It is sensible to assume, for instance, that the larger the store (sq_ft), the higher its expected sales.
Similarly, a larger inventory (inv) can lead to more sales
as customers can expect to make purchases based on more goods that the store
provides.
One might reasonably expect that a larger advertising
budget (adv) can enable stores and businesses to have more visibility, which
can increase the sales of the store.
The size of the district (dist_size) can also impact
sales. The larger the district, the larger the potential for sales.
Relationships that have a NEGATIVE LINEAR trend:
As noted on the final scatter plot, having more
competitors (comp) in the vicinity of your business can impact sales
negatively. Meaning, as the number of competitors increases, the amount of
sales decreases. This is understandable as consumers have more options to
purchase similar product offerings.
Finally, I can conclude that regression analysis is appropriate for this data set, as each of the scatter plots reveals a seemingly linear relationship between the explanatory and response variables.
SUMMARY STATISTICS
For further analysis, I can derive descriptive statistics
concerning the data set:
summary(stair$ann_net_sales)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##     500   98500  341000  286574  450500  570000
summary(stair$sq_ft)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##     500    1400    3500    3326    4750    8600
summary(stair$comp)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   0.000   4.000   8.000   7.741  12.000  15.000
summary(stair$dist_size)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   1.600   4.500  11.300   9.693  14.050  16.300
summary(stair$adv)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    2500    4800    8100    8100   10950   17400
summary(stair$inv)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##  102000  204000  382000  387482  551000  788000
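As a quick aside, those six calls can be condensed. Base R's summary() also accepts a whole data frame and prints the same descriptive statistics for every column at once (this assumes the "stair" data frame loaded earlier):

```r
# One call instead of six: Min, quartiles, Mean, and Max for
# every column in the data frame.
summary(stair)

# Or, to pull a single statistic across all columns at once:
sapply(stair, mean)
```

Either way, the numbers match the per-column summaries above.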
I can also create a correlation plot to determine which
of the explanatory variables has the largest correlation with the response
variable.
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.1.2
## corrplot 0.92 loaded
stairnumeric <- stair
stair_correlations <- cor(stairnumeric)
corrplot(stair_correlations, type = "upper")
For further specificity:
corrplot.mixed(stair_correlations)
All of the explanatory variables show a strong linear correlation with the response variable (ann_net_sales), so a regression analysis is appropriate for this project.
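If you prefer the numbers behind the plot, the same correlations can be extracted directly from the matrix we already computed (this assumes the "stair_correlations" object created above):

```r
# Each predictor's correlation with the response, sorted from
# strongest positive to strongest negative.
cors <- stair_correlations[, "ann_net_sales"]
sort(cors, decreasing = TRUE)
```

The competitor variable (comp) should appear at the bottom of the sorted list, since it is the only negative correlation.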
INFERENTIAL STATISTICS
As I mentioned in my introduction, I will be using OLS, or
Ordinary Least Squares method, to calculate the smallest distance between the
actual values and predicted values, as well as the Adjusted R-Squared Score,
and significance levels depicted by p-values.
Before running the regression, I would like to note that typically, the more variables you add to a regression equation, the higher you can expect the R-Squared value to be.
However, in my analysis, I will seek to eliminate explanatory variables of lesser importance so as to include the most significant variables for the General Manager to focus on in her sales strategy.
With this in mind, there is a cost that I must incur. In exchange for selecting only the most significant explanatory variables, I will have a model with a smaller R-squared score. This cost is acceptable to me, as I want my recommendation to be practical and easily implemented by the General Manager.
It will be easier for the General Manager to implement a practical solution for her business if she has fewer variables (the ones with the highest significance) to worry about in her upcoming sales campaign. In other words, more variables can be more costly to focus on, so I will recommend the most important ones so that she can prioritize features in the data.
Here is the output of the Regression with all five
explanatory variables impacting our response variable, Annual Net Sales:
model_all <- lm(ann_net_sales ~ sq_ft + adv + comp + dist_size + inv)
summary(model_all)
##
## Call:
## lm(formula = ann_net_sales ~ sq_ft + adv + comp + dist_size + inv)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -26338  -9699  -4496   4040  41139
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.886e+04  3.015e+04  -0.626 0.538372
## sq_ft        1.620e+01  3.544e+00   4.571 0.000166 ***
## adv          1.153e+01  2.532e+00   4.552 0.000174 ***
## comp        -5.311e+03  1.705e+03  -3.114 0.005249 **
## dist_size    1.358e+04  1.770e+03   7.671 1.61e-07 ***
## inv          1.746e-01  5.761e-02   3.032 0.006347 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17650 on 21 degrees of freedom
## Multiple R-squared:  0.9932, Adjusted R-squared:  0.9916
## F-statistic: 611.6 on 5 and 21 DF,  p-value: < 2.2e-16
Knowing that more explanatory variables can improve an R-squared score, it is no surprise that having all of the explanatory variables in our regression equation has yielded such a high score (an adjusted R-squared of roughly 99.2%). In fact, at this point, I can interpret this score as such:
“Our multiple linear regression model can explain roughly 99.2 percent of the variability in our data.”
In other words, this model is strong! Further, the p-values of all the explanatory variables are statistically significant, given that they are all smaller than our alpha of 0.05 (a 5% significance level). The F-statistic tells us that the model is statistically significant, as well.
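The figures quoted above can also be pulled out of the summary object programmatically, which is handy when comparing several candidate models (this assumes the "model_all" object fitted above):

```r
# Extract the headline numbers from the regression summary
s <- summary(model_all)

s$adj.r.squared            # adjusted R-squared (~0.9916)
s$r.squared                # multiple R-squared (~0.9932)
coef(s)[, "Pr(>|t|)"]      # the p-value column of the coefficient table
```

This avoids re-reading the printed output by eye each time the model changes.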
If the General Manager had unlimited resources to improve
her Annual Net Sales, she might conclude that these five variables would be
sufficient in a prediction equation. But there is never a case in which
resources are unlimited.
It is true that this model has a high predictive power!
Generally speaking, the higher the R Square Score, the better. Yet, there are
two variables that I wish to eliminate from this model as they are variables
that the General Manager cannot control in her upcoming campaign. Specifically,
I wish to eliminate competition size (comp) and the size of the district
(dist_size).
The General Manager CAN'T control how many competitors are in the vicinity of Stair Mart locations. Additionally, she CAN'T control the size of the district.
It is important to note that our model’s predictive power
will suffer in the newest iteration of our prediction equation.
The remaining variables will remain in the model, so our
new model will look like this:
model_final <- lm(ann_net_sales ~ adv + inv + sq_ft)
summary(model_final)
##
## Call:
## lm(formula = ann_net_sales ~ adv + inv + sq_ft)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -87920 -20388   6427  23885 140000
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9.855e+04  2.158e+04  -4.566 0.000137 ***
## adv          1.788e+01  5.725e+00   3.123 0.004776 **
## inv          3.300e-01  1.396e-01   2.364 0.026904 *
## sq_ft        3.380e+01  8.461e+00   3.995 0.000569 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 46470 on 23 degrees of freedom
## Multiple R-squared:  0.9482, Adjusted R-squared:  0.9415
## F-statistic: 140.4 on 3 and 23 DF,  p-value: 6.308e-15
As expected, our adjusted R-squared score is roughly 5 percentage points lower than in the previous model (94.15% versus 99.16%). The F-statistic and the p-values associated with this model are all acceptable. As an observation, the output is telling me that inventory (inv) is of lesser significance compared to the other variables in this model.
I can interpret this new model as:
"Advertising, inventory, and square footage together explain roughly 94.15% of the variability in annual net sales."
With this in mind, I am going to KEEP this model as my final recommendation for the General Manager of Stair Mart so that she can control for all of the variables that are significant in the data set.
Further, knowing that "inventory" is a less significant variable, the General Manager can fine-tune her modifications to create the optimal prediction equation.
Here is what the model looks like in numerical terms:
“Predicted Value of Annual Sales = -98553 + 33.80(sq_ft) + 0.33(inv) + 17.88(adv) + E”
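To show the equation in action, here is a hand-check for a hypothetical store; the store values below (3,000 square feet, $300,000 of inventory, $8,000 of advertising) are made up for illustration:

```r
# Coefficients from the final model above
b0 <- -98553; b_sqft <- 33.80; b_inv <- 0.33; b_adv <- 17.88

# Plug in the (made-up) values for one hypothetical store
pred <- b0 + b_sqft * 3000 + b_inv * 300000 + b_adv * 8000
pred   # predicted annual net sales: 244887
```

In practice, predict(model_final, newdata = data.frame(sq_ft = 3000, inv = 300000, adv = 8000)) does the same arithmetic using the model's full-precision coefficients.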
RESIDUAL OUTPUT
Now I will confirm whether my assumptions have merit with
a residual plot.
plot(model_final)
In our residuals vs. fitted plot, there is no observable pattern. Therefore, I can conclude that a linear regression model is appropriate for this project. In fact, NONE of the residual plots display a pattern in the data points. This is a good sign, and it legitimizes the use of a regression model.
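One small tip: plot() on an lm object pages through its four diagnostic plots one at a time, but they can be shown together in a single 2 x 2 panel (this assumes the "model_final" object fitted above):

```r
# Show all four diagnostic plots (residuals vs. fitted, normal Q-Q,
# scale-location, residuals vs. leverage) in one 2 x 2 grid.
par(mfrow = c(2, 2))
plot(model_final)
par(mfrow = c(1, 1))   # reset the plotting grid afterwards
```

This makes it easier to scan all of the diagnostics at once.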
CONCLUSION
This analysis was conducted for the General Manager of the Stair Mart franchise stores in the Midwest. The purpose was to determine whether a regression analysis would be appropriate for predicting Annual Net Sales for the upcoming year based on this past year’s data.
Since the variables
were mainly numerical, and the scatter plots and residuals confirmed the
appropriateness of a regression style analysis, the General Manager now has a
prediction equation that she can use to plan and strategize sales campaigns for
the following year.
Given that these variables are in her control, she can
influence specific factors—size of the stores (sq_ft), inventory volume (inv),
and advertisement spending (adv)—to improve sales growth for Stair Mart. Here
is the prediction equation that the General Manager should use in prioritizing variables
of interest:
“Predicted Value of Annual Sales = -98553 + 33.80(sq_ft) + 0.33(inv) + 17.88(adv) + E”
It is vital to note that there are some flaws in this analysis. First, it would have been advantageous to work with more observations in the data set, specifically more store locations. A larger sample size would also make it easier to detect and address issues such as heteroscedasticity during the model production process.
Second, having similar corresponding competitor data would
have been useful for fine-tuning the final model, and it would have given
greater insight on how to manipulate the explanatory variables in the final
model.
Finally, additional variables and data sets, such as customer segments (age, SES, ethnicity), climate, and supply chain factors, would prove useful in creating a richer analysis. Although, this type of data can be very costly to obtain.
THANK YOU!
Ladies, I hope this project will serve as an appropriate learning tool for your development and transition from the Sex Work Industry. Why can't porn stars learn data analytics and machine learning, too?
Why is it that this type of information is only available to those who are privileged? This is just not right!
I want this blog to serve as a safe haven for all of you ladies to learn, discover and explore the wonderful world of analytics and machine learning. Are you having as much fun as I am? I hope so....
Thank you ladies!