As you all know by now, I enrolled in DePaul University’s MS in Business Analytics program this year. I am loving the program so much. I am learning new ways of solving business problems, and I am meeting such nice folks in the small teams we are assigned for homework and other projects. My professors are amazing as well!
At this point, I have been learning R programming for the past seven months, since before I matriculated into the MS program. In that time, I have studied mostly linear regression. It is simple, powerful, and relatively easy to interpret. No math wizardry needed for this!
For simple linear regression, there are four values of interest:
1. There is a y-variable, known as the “response” variable (or dependent variable). This is the value we as analysts are interested in predicting.
2. There is the y-intercept, the value the model predicts for y when x equals zero.
3. There is an x-variable, known as the “predictor” variable (or independent variable). This variable influences the response variable. In simple linear regression there is only one predictor variable, but in multiple linear regression there can be two or more.
4. There is a very subtle e-term: the error term. Each error is the difference between an actual y value and the y value the model predicts. The goal of regression is to find the best values of the intercept and the coefficient on x such that the sum of the squared errors is minimized (see the equation just after this list).
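Putting those four pieces together, the whole model fits on one line. In this sketch, b0 is the y-intercept and b1 is the coefficient on the predictor:

```
y = b0 + b1*x + e
```

Ordinary least squares then picks the b0 and b1 that make the sum of the squared e values as small as possible.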
To regression!
Running the lm() function brings me such joy; it’s a strange sort of joy. I mean, what’s not to love? You pass a formula relating the two variables into the function, along with the name of the data set, and BOOM, RStudio will spit out an estimated equation with very specific numbers attached to the four points listed above.
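Here is a minimal sketch using R’s built-in mtcars data (my own toy example, not from any of the books). mpg plays the response and wt, the car’s weight, plays the predictor:

```r
# Simple linear regression: predict fuel economy (mpg) from weight (wt)
model <- lm(mpg ~ wt, data = mtcars)

# The summary prints the y-intercept, the wt coefficient, and R-squared
summary(model)
```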
If you don’t like your R-squared score (that pesky thing just never seems to get to 1), then you can try a different predictor variable to see if your model improves.
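For instance, still sketching with mtcars, swapping the predictor changes the score:

```r
# Compare R-squared across two candidate predictors
summary(lm(mpg ~ wt, data = mtcars))$r.squared  # roughly 0.75
summary(lm(mpg ~ hp, data = mtcars))$r.squared  # roughly 0.60
```

Here weight explains more of the variation in fuel economy than horsepower does.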
This strategy seems to work well: add and subtract predictors until you find a decent model worth keeping. But what if you had dozens, if not hundreds, of variables in your data set?
Further, what if you could add more than one predictor variable to your model, so that it explains more of the variation you are trying to account for? On top of that, what if you had a way of determining which variables are statistically insignificant relative to all the other variables in your data?
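A taste of that, again on mtcars (my example): lm() happily takes more than one predictor, and the summary’s p-values hint at which ones are pulling their weight.

```r
# Multiple linear regression: two predictors at once
model_multi <- lm(mpg ~ wt + hp, data = mtcars)

# Predictors with large p-values are candidates for removal
summary(model_multi)
```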
This is where Regularization and Feature Selection come into play.
To make use of Regularization (shrinking the coefficients so that less helpful variables carry less weight) and Feature Selection (dropping the least significant variables entirely), you can use a specific method known as LASSO regression.
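Here is a minimal sketch of a LASSO fit using the glmnet package, on mtcars again. This is my own toy version, so Burger’s code and data will look different; in glmnet, alpha = 1 selects the LASSO penalty.

```r
library(glmnet)

# glmnet wants a numeric predictor matrix and a response vector
x <- model.matrix(mpg ~ ., data = mtcars)[, -1]  # drop the intercept column
y <- mtcars$mpg

# alpha = 1 means LASSO (alpha = 0 would be ridge regression)
lasso_fit <- glmnet(x, y, alpha = 1)

# Coefficients shrink toward zero as the penalty (lambda) grows
plot(lasso_fit, xvar = "lambda")
```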
In short, the LASSO “penalizes more variables in the regression,” as one of my professors at DePaul University’s Kellstadt Graduate School of Business put it. The penalty shrinks coefficients toward zero, knocking the weakest ones out entirely, and this reduces the chances of our model “overfitting” the data. Avoiding that is ideal, because an overfit model predicts nearly perfectly on data it has seen and poorly on data it hasn’t seen.
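How strong should the penalty be? Cross-validation is the usual answer, and glmnet bundles it in. Another sketch, with the same mtcars setup as above:

```r
library(glmnet)

x <- model.matrix(mpg ~ ., data = mtcars)[, -1]
y <- mtcars$mpg

# Cross-validation tries a grid of lambda values and scores each one
cv_fit <- cv.glmnet(x, y, alpha = 1)

# At the chosen lambda, dropped variables print as "." (exactly zero)
coef(cv_fit, s = "lambda.min")
```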
So, by eliminating statistically insignificant variables and keeping the regression equation to 2 or 3 strong predictors, you can reduce your chances of overfitting AND increase the predictive power of your model. This is all thanks to our good friend LASSO. Thank you, my good friend LASSO!
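One way to see that predictive power for yourself is to hold out data the model never trains on. A quick sketch, where the split and the predictors are my own choices:

```r
set.seed(123)  # make the random split reproducible

# Roughly a 70/30 train/test split (mtcars has 32 rows)
train_rows <- sample(nrow(mtcars), 22)
train <- mtcars[train_rows, ]
test  <- mtcars[-train_rows, ]

fit <- lm(mpg ~ wt + hp, data = train)
preds <- predict(fit, newdata = test)

# Root mean squared error on unseen data; lower is better
sqrt(mean((test$mpg - preds)^2))
```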
I am going to post a Word document with all the code to run a very basic LASSO regression. This code was sourced from a book called “Introduction to Machine Learning with R: Rigorous Mathematical Modeling” by Scott V. Burger.
However, I will not be using this algorithm with the data set he uses. Instead, I will be using the bikes.csv data set that is provided in another book, “Practical Machine Learning in R,” by Notre Dame professors Fred Nwanganga and Mike Chapple.
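To give a flavor of where that is headed, here is how the same LASSO workflow might point at bikes.csv. Fair warning: the rentals column name is my assumption from memory of the book, so check the data dictionary before running this.

```r
library(glmnet)

bikes <- read.csv("bikes.csv")

# Assumes a numeric "rentals" column is the response; adjust to the real schema
x <- model.matrix(rentals ~ ., data = bikes)[, -1]
y <- bikes$rentals

cv_fit <- cv.glmnet(x, y, alpha = 1)
coef(cv_fit, s = "lambda.min")
```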
The data sets these authors use to demonstrate their algorithms were meticulously sourced, and as such, I strongly encourage you to purchase their works. I promise you will learn so much; these books are just so rewarding to learn from.
Thank you so much for reading, and I hope you enjoy
the LASSO process.
Sources for your interest:
Lecture and class materials presented at DePaul University, Kellstadt Graduate School of Business, in a class entitled “Fundamentals of Business Analytics,” Chicago, Illinois, 2022.
Nwanganga, Fred, and Mike Chapple (professors at the Notre Dame Mendoza College of Business). “Practical Machine Learning in R,” 2020.
Burger, Scott V. “Introduction to Machine Learning with R: Rigorous Mathematical Modeling,” 2018.