As you all know by now, I enrolled in DePaul University’s MS in Business Analytics program this year. I am loving the program so much. I am learning new ways of solving business problems, and I am meeting such nice folks in the small teams we are assigned for homework and other projects. My professors are amazing as well!
At this point, I have been learning R programming for the past seven months, since before I matriculated into the MS program. In that time, I have studied mostly linear regression. It is simple, powerful, and relatively easy to interpret. No math wizardry needed for this!
For simple linear regression, there are four values of interest:
1. There is a y-variable, known as the “response” variable (or dependent variable). This is the value we as analysts are interested in predicting.
2. There is the y-intercept, the value the model predicts for y when x equals zero.
3. There is an x-variable, known as the “predictor” variable (or independent variable). This variable influences the response variable. In simple linear regression there is only one predictor variable, but in multiple linear regression there can be two or more.
4. There is a very subtle e-term: the error term. Each error is the difference between an actual y value and the y value the model predicts. The goal of regression is to find the best values of the intercept and the coefficient on x such that the sum of the squared errors is minimized (see the equation just after this list).
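Putting those four pieces together, the whole model fits on one line. In this sketch, b0 is the y-intercept and b1 is the coefficient on the predictor:

```
y = b0 + b1*x + e
```

Ordinary least squares then picks the b0 and b1 that make the sum of the squared e values as small as possible.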
To regression!
Running the lm() function brings me such joy; it’s a strange sort of joy. I mean, what’s not to love? You pass a formula relating the two variables into the function, along with the name of the data set, and BOOM, RStudio will spit out an estimated equation with very specific numbers attached to the four points listed above.
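Here is a minimal sketch using R’s built-in mtcars data (my own toy example, not from any of the books). mpg plays the response and wt, the car’s weight, plays the predictor:

```r
# Simple linear regression: predict fuel economy (mpg) from weight (wt)
model <- lm(mpg ~ wt, data = mtcars)

# The summary prints the y-intercept, the wt coefficient, and R-squared
summary(model)
```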
If you don’t like your R-squared score (that pesky thing just never seems to get to 1), then you can try a different predictor variable to see if your model improves.
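For instance, still sketching with mtcars, swapping the predictor changes the score:

```r
# Compare R-squared across two candidate predictors
summary(lm(mpg ~ wt, data = mtcars))$r.squared  # roughly 0.75
summary(lm(mpg ~ hp, data = mtcars))$r.squared  # roughly 0.60
```

Here weight explains more of the variation in fuel economy than horsepower does.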
This strategy seems to work well: add and subtract predictors until you find a decent model worth keeping. But what if you had dozens, if not hundreds, of variables in your data set?
Further, what if you could add more than one predictor variable to your model, so that it explains more of the variation you are trying to account for? On top of that, what if you had a way of determining which variables are statistically insignificant relative to all the other variables in your data?
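A taste of that, again on mtcars (my example): lm() happily takes more than one predictor, and the summary’s p-values hint at which ones are pulling their weight.

```r
# Multiple linear regression: two predictors at once
model_multi <- lm(mpg ~ wt + hp, data = mtcars)

# Predictors with large p-values are candidates for removal
summary(model_multi)
```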
This is where Regularization and Feature Selection come into play.
To make use of Regularization (shrinking the coefficients so that less helpful variables carry less weight) and Feature Selection (dropping the least significant variables entirely), you can use a specific method known as LASSO regression.
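Here is a minimal sketch of a LASSO fit using the glmnet package, on mtcars again. This is my own toy version, so Burger’s code and data will look different; in glmnet, alpha = 1 selects the LASSO penalty.

```r
library(glmnet)

# glmnet wants a numeric predictor matrix and a response vector
x <- model.matrix(mpg ~ ., data = mtcars)[, -1]  # drop the intercept column
y <- mtcars$mpg

# alpha = 1 means LASSO (alpha = 0 would be ridge regression)
lasso_fit <- glmnet(x, y, alpha = 1)

# Coefficients shrink toward zero as the penalty (lambda) grows
plot(lasso_fit, xvar = "lambda")
```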
In short, the LASSO “penalizes more variables in the regression,” as one of my professors at DePaul University’s Kellstadt Graduate School of Business put it. The penalty shrinks coefficients toward zero, knocking the weakest ones out entirely, and this reduces the chances of our model “overfitting” the data. Avoiding that is ideal, because an overfit model predicts nearly perfectly on data it has seen and poorly on data it hasn’t seen.
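How strong should the penalty be? Cross-validation is the usual answer, and glmnet bundles it in. Another sketch, with the same mtcars setup as above:

```r
library(glmnet)

x <- model.matrix(mpg ~ ., data = mtcars)[, -1]
y <- mtcars$mpg

# Cross-validation tries a grid of lambda values and scores each one
cv_fit <- cv.glmnet(x, y, alpha = 1)

# At the chosen lambda, dropped variables print as "." (exactly zero)
coef(cv_fit, s = "lambda.min")
```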
So, by eliminating statistically insignificant variables and keeping the regression equation to 2 or 3 strong predictors, you can reduce your chances of overfitting AND increase the predictive power of your model. This is all thanks to our good friend LASSO. Thank you, my good friend LASSO!
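One way to see that predictive power for yourself is to hold out data the model never trains on. A quick sketch, where the split and the predictors are my own choices:

```r
set.seed(123)  # make the random split reproducible

# Roughly a 70/30 train/test split (mtcars has 32 rows)
train_rows <- sample(nrow(mtcars), 22)
train <- mtcars[train_rows, ]
test  <- mtcars[-train_rows, ]

fit <- lm(mpg ~ wt + hp, data = train)
preds <- predict(fit, newdata = test)

# Root mean squared error on unseen data; lower is better
sqrt(mean((test$mpg - preds)^2))
```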
I am going to post a Word document with all the code to run a very basic LASSO regression. This code was sourced from a book called “Introduction to Machine Learning with R: Rigorous Mathematical Modeling” by Scott V. Burger.
However, I will not be using this algorithm with the data set he uses. Instead, I will be using the bikes.csv data set that is provided in another book, “Practical Machine Learning in R,” by Notre Dame professors Fred Nwanganga and Mike Chapple.
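To give a flavor of where that is headed, here is how the same LASSO workflow might point at bikes.csv. Fair warning: the rentals column name is my assumption from memory of the book, so check the data dictionary before running this.

```r
library(glmnet)

bikes <- read.csv("bikes.csv")

# Assumes a numeric "rentals" column is the response; adjust to the real schema
x <- model.matrix(rentals ~ ., data = bikes)[, -1]
y <- bikes$rentals

cv_fit <- cv.glmnet(x, y, alpha = 1)
coef(cv_fit, s = "lambda.min")
```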
The data sets these authors use to demonstrate their algorithms were meticulously sourced, and as such, I strongly encourage you to purchase their works. I promise you will learn so much; these books are just so rewarding to learn from.
Thank you so much for reading, and I hope you enjoy
the LASSO process.
Sources for your interest:
Lecture and class materials presented at DePaul University, Kellstadt Graduate School of Business, in a class entitled “Fundamentals of Business Analytics,” Chicago, Illinois, 2022.
Nwanganga, Fred, and Mike Chapple (professors at the Notre Dame Mendoza College of Business). “Practical Machine Learning in R,” 2020.
Burger, Scott V. “Introduction to Machine Learning with R: Rigorous Mathematical Modeling,” 2018.