Hello. For this post, I thought it might be prudent to cite two articles and critique them. The first article is concerned with boxplots, and the second has to do with multicollinearity.
The goal of this exercise is to read an article regarding any matter of data analytics, digest the article, and offer a reaction to the authors.
This exercise requires thinking critically about each topic and considering how to use the information to become a better data analyst.
I hope this helps!
“Why Is a Boxplot Better Than a Histogram?”
Click here for source.
The fundamental concept behind this article is to plot your data BEFORE you do any sort of analysis. But how should you plot the data? What type of visualization should one use at the initial stage of an analysis?
The author of this article argues that boxplots should be the preferred standard for visualizing data, and that histograms, although most commonly used, present relationships in the data that can be misinterpreted. The author proposes that boxplots can be a remedy for the misinterpretations that histograms may invite. This is the main goal, or thesis, of the article.
Specifically, boxplots have several advantages when it comes to visualizing data. Along with being able to plot multiple sets of data at once, boxplots can show outliers and suggest the shape of each distribution, including whether the data is normal or skewed.
Unlike histograms, boxplots are adept at showing the median, the first and third quartiles, and the minimum and maximum of the total range. To be clear, histograms do convey this information as well, but there is much more guesswork involved, as the values can only be approximated upon visual inspection.
Finally, boxplots are easier to read and interpret compared to histograms.
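To make this concrete for myself, here is a quick sketch in R (my own toy example, not one from the article) that plots the same skewed variable as a histogram and as a boxplot, and then shows the side-by-side group comparison that boxplots make easy:

```r
set.seed(7)
x <- rexp(200, rate = 0.5)               # right-skewed data with a long tail

par(mfrow = c(1, 2))
hist(x, main = "Histogram", xlab = "x")  # shape is visible, but quartiles are not
boxplot(x, main = "Boxplot")             # median, quartiles, and outliers shown directly

par(mfrow = c(1, 1))
boxplot(mpg ~ cyl, data = mtcars,        # one boxplot per group, on one axis
        xlab = "cylinders", ylab = "mpg")
```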
After reading this article, it is clear that the author prefers boxplots over histograms because of their clear advantages. Noting the rationale for boxplots, I am going to begin using boxplots in my analysis, but before I yield to the author’s suggestion, I would like to make an argument that favors the ease of use of histograms.
In the early stages of analysis, I feel that one should be concerned with both the quality and the quantity of information that is appropriate for that specific stage. Let me explain.
Let’s assume that a business manager or supervisor gives their subordinate the task of fitting a regression line through a set of observations. Let’s also assume that this task uses data sets that are primarily numerical, and that there are very few categorical variables of interest.
We might also assume that this task might be industry-specific; data sets that encompass customer profiles in banking might be different than data sets that encapsulate patient medical history.
With this in mind, I argue that depending on the STAGE (Exploratory Data Analysis, for instance) of the analysis and the TYPE of data sets that are used, there might be a preference for the ease and simplicity of a histogram.
If I already know, based on prior professional experience, that a regression analysis is the goal, then I might not need to know the total range of a data set (first and third quartile, median, minimum and maximum) until further into my analysis.
Consequently, it might be preferable to run a “proc means” (in SAS) or the “summary()” function (in R) to get a more accurate account of those descriptive statistics. If my goal is to determine the correlation between my response and explanatory variables of interest, then a quick check with a histogram might be the simpler approach, as I don’t yet require the additional information of a boxplot.
Boxplots are not useful visualizations for describing a correlation between two variables, and that is why I would use a histogram IN THE EARLY stages of an analysis.
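To show what I mean by a quick early-stage check, here is a minimal R sketch, using the built-in mtcars data as a stand-in for a real project data set:

```r
# The R analogue of SAS "proc means": min, quartiles, median, mean, max.
summary(mtcars$mpg)

# A quick visual check of the response variable's shape.
hist(mtcars$mpg, main = "Quick look at the response", xlab = "mpg")

# A numeric correlation check between the response and one explanatory variable.
cor(mtcars$mpg, mtcars$wt)
```

None of this replaces a boxplot; it is simply the minimum I need before deciding where to dig deeper.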
In all fairness, the author of the article DOES make the point, almost too briefly, that boxplots can also be used alongside a histogram to provide richer details about the data in question.
This might be a practical approach; it may take more effort to code in the statistical software, but it could be worth it if the output provides a more meaningful account of what is happening in the data.
For my own approach to analysis, it would make sense for me to explore boxplots more closely to see if I can reach an agreement with the author. At the moment, I am most familiar with regression analysis, so histograms serve my initial goal of determining correlation between my response and explanatory variables.
Yet, there are other algorithms and methods used in data analysis and machine learning, like classification and clustering methods, which require a different mindset; the problems that these methods solve are different from the ones regression solves.
“Multicollinearity in Regression Analysis: Problems, Detections and Solutions,” by Jim Frost.
Click here for source.
What is multicollinearity, and why is it an issue in regression analysis? Statistics professional and thought leader Jim Frost says that multicollinearity occurs when explanatory variables in a regression model are highly correlated. This is a cause for concern, as explanatory variables are supposed to be independent of one another.
In this article, Frost’s main point is to show analysts how to detect multicollinearity between explanatory variables, and finally how to resolve issues of multicollinearity.
One can make the claim that one of the most important goals of a regression analysis is to “isolate the relationship between” the response variable and the explanatory variables. And since this goal is so vital to this type of analysis, it is logical to conclude that multicollinearity would present an issue in regression; if one explanatory variable is acting much like another explanatory variable, then there is an issue that needs to be corrected.
If an analyst decides that a Multiple Linear Regression is the best approach to making a prediction based on the data, they need to be confident that every 1-unit increase in an explanatory variable, given that all other variables are held constant, will result in a change of Beta 1 (that variable’s coefficient) in the response variable. But this can’t happen if independent variables display multicollinearity with other independent variables. At that point, the regression model is almost completely useless!
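Here is a minimal sketch of the interpretation I mean, using R’s built-in mtcars data (my own example, not one from the article):

```r
# Fit a multiple linear regression of fuel economy on weight and horsepower.
fit <- lm(mpg ~ wt + hp, data = mtcars)
coef(fit)
# The wt coefficient reads: a 1-unit increase in wt is associated with a change
# of that many mpg, holding hp constant. That "holding constant" interpretation
# only stands up if wt and hp are not highly collinear.
```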
Frost suggests that there are two key issues with multicollinearity. The first issue is that beta coefficients can have a “high sensitivity to change”: small changes in the model or the data can swing the estimated coefficients to extreme values, high or low, and such unstable estimates can lead to an inaccurate prediction model.
Second, since multicollinear variables can produce extreme beta coefficients, there could be a miscalculation in the statistical significance (p-values) of the variables in the final model. And for this reason, an analyst can NOT trust these levels of significance.
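A small simulation makes both points visible. This is my own toy sketch, not code from Frost’s article; the idea is simply that adding a near-duplicate predictor inflates the standard errors and wrecks the p-values:

```r
set.seed(42)
n  <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.05)   # x2 is almost a copy of x1 (severe multicollinearity)
y  <- 3 * x1 + rnorm(n)

# x1 alone: a tight standard error and a clear p-value.
summary(lm(y ~ x1))$coefficients

# x1 and x2 together: the betas become unstable and the standard errors balloon,
# so the individual p-values can no longer be trusted.
summary(lm(y ~ x1 + x2))$coefficients
```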
So then, how does an analyst address issues of multicollinearity? Frost argues that there are methods to fix these issues, but boldly suggests that issues of multicollinearity depend heavily on the “primary goal of the regression analysis.”
Frost proposes that moderate multicollinearity might not need to be resolved, as a moderate case might not impact the regression model much. This approach (or neglect) should, however, only be taken on a case-by-case basis. He also posits that variables with a multicollinear relationship might not end up in the final model at all, and therefore do NOT need to be addressed early on.
Yet, for analysts who are conservative and meticulous, there is a diagnostic called Variance Inflation Factors, or VIF. In the modeling process, VIF can be used to measure the strength of correlation between independent variables, and variables with scores above a certain threshold (commonly 5 or 10) can be removed from the prediction model.
In my own analysis, I have been taught to use VIF metrics to determine which explanatory variables to remove during the modeling process.
Also, I use the Adjusted R-Square score rather than the plain R-Square score, as this metric provides a more meaningful measure of predictive quality, penalizing models that pile on variables without adding real explanatory power.
By combining VIF metrics to remove variables with the Adjusted R-Square score to compare models, I feel confident that I can create a model that is, at the very least, a conservative estimate of future response values.
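Here is a rough sketch of that workflow in R, on simulated data with placeholder variable names (the car package’s vif() function would do the same job, but the formula is simple enough to compute by hand):

```r
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)    # deliberately collinear with x1
x3 <- rnorm(n)
df <- data.frame(x1, x2, x3, y = 2 * x1 - x3 + rnorm(n))

# VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j
# on the remaining predictors.
vif_by_hand <- function(data, predictors) {
  sapply(predictors, function(p) {
    r2 <- summary(lm(reformulate(setdiff(predictors, p), p), data = data))$r.squared
    1 / (1 - r2)
  })
}

vif_by_hand(df, c("x1", "x2", "x3"))     # x1 and x2 score far above 10

# Drop the high-VIF duplicate and compare Adjusted R-Square.
full    <- lm(y ~ x1 + x2 + x3, data = df)
reduced <- lm(y ~ x1 + x3, data = df)
summary(full)$adj.r.squared
summary(reduced)$adj.r.squared           # the simpler model holds up
```

If the Adjusted R-Square barely moves after dropping the high-VIF variable, that is exactly the conservative outcome I am after.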
Since I am not a highly skilled analyst (yet), I will be meticulous and remove variables that the VIF method highlights. The author Jim Frost is not the only analytics thought leader to exude an “it depends on the data/situation” mentality.
He is more astute at analytics and is therefore more skilled at determining which variables to remove and which to leave in the model. I am not at his level of preeminence, so for now, I will remove any variable with the smallest hint of multicollinearity.