Thursday, July 28, 2022

Data Visualization Basics: 3 Plots and Graphs That Will Make Your Regression Analysis Project Easier

on July 28, 2022 in Blog, Data Visualization, Graphs, Plots

Thank you so much for your continued support and visiting my website. I hope that my content can be useful to all Porn Stars—or anyone and everyone in the Sex Industry—and it is certainly my honor creating a space for you all to access such a relevant and worthy field.

At this point, I expect that you have read my previous post, “6 Basic Terms That You Should Know For Data Science and Analytics.” I will link that here. This post goes deeper into data science and analytics by introducing some basic, yet necessary, graphs and charts that will be vital for your development as data analysts and data entrepreneurs.

I will be discussing the following content:

HISTOGRAMS, SKEWNESS and NORMAL DISTRIBUTION

BOX PLOTS

SCATTERPLOTS and CORRELATION COEFFICIENTS

Before I discuss the content for this blog post, I would like to put the above terms in context with Regression Analysis.

MACHINE LEARNING AND REGRESSION ANALYSIS

Regression Analysis is a form of “supervised machine learning,” in that a machine “learns” to make predictions from examples that humans supply, where each example pairs the inputs with the known, correct answer. It is very different from “unsupervised machine learning,” where no correct answers are supplied.

In “unsupervised machine learning” algorithms—we will learn about one of them called Clustering, later—machines are able to detect patterns in the data on their own. The only caveat is that machines need a tremendous amount of data.

Back to Regression Analysis.

Today’s blog post leans on data visualization as a tool for building a prediction equation, more formally known as a Regression model. There are certain assumptions and checks that a data analyst or machine learning expert needs to work through so that their Regression model (an equation with one or many variables) is appropriate for making predictions, which is the real magic of machine learning.

Charts and plots are an industry standard for checking these assumptions (more on this in a later post). For example, we will eventually be checking to see if there are ANY correlations (or relationships) between the response variable, which is the variable we want to predict, and the explanatory variables, which are the independent variables that may impact the response variable.

In other words, though this post will be basic, there will be relevant information for you so that you can run a Regression algorithm AND create your own projects that you can show to potential employers. Regression is a common, yet POWERFUL, machine learning application that is used in pretty much any industry where you need to make a prediction about ANYTHING!

So, graphs and plots are important tools! Let’s begin with Histograms.

HISTOGRAMS

A HISTOGRAM is a graph that depicts the distribution of numerical data. A histogram commonly takes one of three shapes: a symmetrical distribution, a right-skewed distribution, or a left-skewed distribution. Here is a symmetrical distribution. The best-known symmetrical shape is the normal distribution:

Normal Distribution of Histogram

We can see from the graph that it is perfectly symmetrical. This means that the mode, mean, and median are ALL equal and sit at the center of the distribution, where most of the data is observed. This shape is also known as a bell curve.
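If you want to see how a histogram is put together, here is a minimal Python sketch that sorts numbers into equal-width bins and prints one text bar per bin. The `bin_counts` helper and the test scores are made up for illustration; in practice a plotting library would handle this step for you.

```python
from collections import Counter


def bin_counts(values, bin_width):
    """Count how many values fall into each equal-width bin."""
    # Map each value to the lower edge of its bin, e.g. 74 -> 70 when bin_width is 10.
    counts = Counter((v // bin_width) * bin_width for v in values)
    first, last = min(counts), max(counts)
    # Include empty bins between the first and last so gaps stay visible.
    return [(start, counts[start]) for start in range(first, last + bin_width, bin_width)]


# Hypothetical test scores, invented for this example.
scores = [52, 61, 63, 65, 70, 71, 72, 74, 75, 78, 81, 83, 86, 94]
width = 10

for start, count in bin_counts(scores, width):
    print(f"{start}-{start + width - 1}: " + "#" * count)
```

The bars rise toward the middle bin and fall away evenly on both sides, which is the symmetrical shape described above.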

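To put this in context with Regression Analysis, one of our first tasks would be to check whether the “response variable,” or the variable we wish to predict, is roughly normally distributed. Ideally, we want our histogram of the response variable to look like this! (Strictly speaking, the normality assumption in Regression applies to the model’s errors rather than to the response itself, but a histogram of the response is a common first look.) More on this in another blog post.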

So, if a histogram doesn’t have a normal shape, then what else can it look like?

Left and Right Skewness of Histogram

In the above graphic, we have both a Left and Right-Skewed distribution.

Left-Skewed distributions are known to have a “negative skewness,” and Right-Skewed distributions are known to have a “positive skewness.”

As you can see, the Left-Skewed graph has most of its data positioned on the right side of the graph. Think of a Left-Skewed distribution as a skier gliding down a mountainside (or slope) to the left. The opposite can be said of Right-Skewed distributions.

If you recall, Normal or symmetric distributions have the mode, mean, and median ALL equaling each other. For skewed distributions with a single peak, there is a useful rule of thumb relating these three descriptive statistics:

For Left-Skewed Distributions: The mode (or the peak of the data) is larger than the median, and the median is larger than the mean.

For Right-Skewed Distributions: The mode is smaller than the median, and the median is smaller than the mean.
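Here is a small Python sketch of that rule of thumb, using the standard `statistics` module. The three data sets and the `skew_direction` helper are invented for this example, and comparing the mean to the median is only a quick check that assumes the data has a single peak.

```python
import statistics


def skew_direction(data):
    """Guess the skew by comparing the mean to the median (rule of thumb only)."""
    mean = statistics.mean(data)
    median = statistics.median(data)
    if mean > median:
        return "right-skewed"
    if mean < median:
        return "left-skewed"
    return "symmetric"


# Hypothetical data sets, invented for this example.
symmetric = [1, 2, 2, 3, 3, 3, 4, 4, 5]
right_skewed = [1, 2, 2, 2, 3, 3, 4, 5, 9, 14]
left_skewed = [2, 7, 11, 12, 13, 13, 14, 14, 14, 15]

for name, data in [("symmetric", symmetric),
                   ("right", right_skewed),
                   ("left", left_skewed)]:
    print(name,
          "mode:", statistics.mode(data),
          "median:", statistics.median(data),
          "mean:", statistics.mean(data),
          "->", skew_direction(data))
```

In the right-skewed list the mode (2) is below the median (3), which is below the mean (4.5). The left-skewed list shows the reverse ordering.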

But why is noting this useful?

Let’s say that you were analyzing employment report data for Harvard Business School MBA graduates. We might observe that the Median salary for post-graduates is $165,000, and that the Mean salary is $158,937. (These are hypothetical numbers).

Since the Median is larger than the Mean, we can reasonably infer that the data is Left-Skewed, assuming it is unimodal. This might be an important deduction for deciding whether or not to pursue a Harvard MBA: by definition, half of the graduates make at least the Median of $165,000, and in a Left-Skewed distribution the Mode (the most common salary) typically sits even higher than that. From this, we can expect that many students make more than $165,000.

Knowing this, we can make an inference about the salary that we might earn after graduating from Harvard’s MBA program, provided that we share the factors common to most of the incoming cohort, like number of work years prior to entry, GMAT scores, Undergraduate GPA, and industry-relevant experience.

Just knowing the shape of the distribution can help us be CONFIDENT about these assumptions given other factors of note. We can never be certain though, and this is a downside to statistics and any form of science, in general.

Let’s discuss Boxplots.

BOXPLOTS

The Boxplot is also known as a “box-and-whisker” plot, and it is a graphical representation of the Five-Number Summary (using quartiles).

Box Plot

There is a lot to unpack here, so please be patient with me. Let’s break this plot down beginning with the middle.

The yellow line in this plot is known as the median. Sometimes, depending on which software you are using, there will also be a small diamond-shaped object near this line, and that shape represents the mean of the data.

The Red Box is enclosed by Q1 and Q3. The distance between them, if you remember from the previous post, is known as the Interquartile Range (IQR).

The two purple lines are the whiskers of the boxplot, and each one extends outward from the box to a short end line. The end line on the left (sometimes at the bottom) is labeled the “Minimum” value, and the end line on the right (sometimes at the top) is labeled the “Maximum” value. When outliers are drawn as separate dots, as they are here, these are the smallest and largest values that are not outliers.

The Green dots beyond either end of the whiskers are called Outliers. These data points are special cases whose values are not typical of the rest of the data set.
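To make the pieces of the boxplot concrete, here is a rough Python sketch that computes the quartiles, the IQR, the whisker ends, and the outliers for a small list of numbers. Two things here are my assumptions rather than anything shown in the plot above: the salaries are invented, and outliers are flagged with the common 1.5 × IQR convention, which many plotting tools use by default but which is not the only option.

```python
import statistics


def box_plot_summary(values):
    """Return the numbers a boxplot is drawn from (1.5 * IQR outlier convention)."""
    data = sorted(values)
    # quantiles(..., n=4) returns the three cut points Q1, median, Q3.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    # Whiskers stop at the most extreme values that are still inside the fences.
    inside = [v for v in data if low_fence <= v <= high_fence]
    outliers = [v for v in data if v < low_fence or v > high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "outliers": outliers,
    }


# Hypothetical salaries in thousands of dollars, invented for this example.
salaries = [30, 42, 45, 48, 50, 52, 55, 58, 61, 64, 250]
summary = box_plot_summary(salaries)

for key, value in summary.items():
    print(key, value)
```

Only the 250 lands outside the fences, so it would be drawn as a dot, and the right whisker would stop at 64.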

Let’s think about the average salary in the US, roughly $56,310 in 2020, according to the Bureau of Labor Statistics. An outlier would be someone like a corporate executive who made, perhaps, a whopping $2,500,000 that year.

This person is not typical! There are far more people who made close to the average salary than there are who brought in a seven-figure salary.

This example represents a Right-Skewed distribution: the mode, median, and mean all sit near the peak of the data, while the outlier sits far out in the right tail. Far fewer US workers earn salaries out in the tail than near the peak. Professional athletes are another example of outliers in salary data, since they make far more money than the average American.

One last observation! The above boxplot shows a roughly symmetric distribution, and you can tell because the whiskers are equal in length and the median sits in the middle of the box (just eyeballing it). Symmetry alone does not prove the data is Normal, but it is consistent with it. If the Right whisker is longer than the left, we have a Right-Skewed distribution. If the Left whisker is longer than the right, we have a Left-Skewed distribution.

Let’s move on to Scatter Plots and Correlation Coefficients.

SCATTER PLOTS

A Scatter Plot depicts the relationship between two variables, with one variable on each axis. Scatter Plots are useful in the beginning of a Regression Analysis when you need to determine whether the response variable (y-variable) has a “linear” correlation with any of the explanatory variables (x-variables). Regression analysis relies on these correlations, and they are necessary for the final Regression line, and the prediction equation behind it, to hold merit.

Scatter Plots can have Positive, Negative, or No Correlation. Here is what that looks like:

Scatter Plots Directionality

A Positive Linear Relationship shows an increase in the y-variable (vertical axis) for every unit increase in the x-variable (horizontal axis). It has an upward slope, pointing up and to the right.

A Negative Linear Relationship shows a decrease in the y-variable (vertical axis) for every unit increase in the x-variable (horizontal axis). It has a downward slope, pointing down and to the right.

A graph with No Correlation cannot be classified as either a positive or a negative linear relationship. There is no pattern, and the data points appear to be scattered randomly.

For a Regression Analysis, you MUST have either a Positive or a Negative correlation between the response variable and each explanatory variable.

Furthermore, correlations can be Weak, Moderate, Strong, or Perfect. Here is what that looks like:

Scatter Plot Strength

The Stronger the correlation, the MORE closely the data points fit along the Regression line (depicted by the red line). A plot with a Strong Positive Linear Correlation will look like the plot on the upper left. It will have a positive Correlation Coefficient value closer to 1 (more on this soon).

Similarly, there can exist a plot with a Strong Negative Linear Correlation, and that plot is represented by the one in the upper right. It will have a Correlation Coefficient closer to negative 1.

The Weaker the correlation, the LESS closely the data points fit along the Regression line. You can observe a Weak Positive Linear Correlation on the lower left, and a Weak Negative Linear Correlation on the lower right. The Weaker the correlation, the closer the Correlation Coefficient is to zero.
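For the curious, here is a minimal Python sketch of how a Regression line like the red one can be computed. I am assuming the usual least-squares method for a single x-variable; the `fit_line` helper and the tiny data set are made up for illustration.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for one x-variable and one y-variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = how x and y vary together, divided by how much x varies on its own.
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    # The fitted line always passes through the point (mean_x, mean_y).
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data points, invented for this example.
xs = [1, 2, 3, 4, 5]
ys = [2, 1, 4, 3, 5]

slope, intercept = fit_line(xs, ys)
print("slope:", slope, "intercept:", intercept)

# Residuals: how far each point sits above or below the line.
for x, y in zip(xs, ys):
    predicted = slope * x + intercept
    print(f"x={x} actual={y} predicted={predicted:.1f} residual={y - predicted:+.1f}")
```

A positive slope means the line points up and to the right. The smaller the residuals are overall, the more tightly the points hug the line.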

Now that I have confused you with correlation coefficients, I will discuss them here in greater detail.

CORRELATION COEFFICIENTS

Correlation Coefficients are commonly denoted as “r” scores, and these scores range from negative one (-1) to positive one (+1).

These scores are measured between two variables, a dependent variable (y-variable, or response variable) and an independent variable (x-variable, or explanatory variable).

The score represents both the Strength and the Direction of the correlation.
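Here is a short Python sketch of how an “r” score can be calculated by hand. I am assuming the standard Pearson correlation coefficient, which is what “r” usually refers to; the `pearson_r` helper and the numbers are invented for this example.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How x and y move together around their means.
    co_variation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # How much each variable moves on its own.
    variation_x = sum((x - mean_x) ** 2 for x in xs)
    variation_y = sum((y - mean_y) ** 2 for y in ys)
    return co_variation / math.sqrt(variation_x * variation_y)


# Hypothetical data, invented for this example.
xs = [1, 2, 3, 4, 5]
perfect_up = [2, 4, 6, 8, 10]
perfect_down = [10, 8, 6, 4, 2]
noisy_up = [2, 1, 4, 3, 5]

print("perfect positive:", pearson_r(xs, perfect_up))
print("perfect negative:", pearson_r(xs, perfect_down))
print("noisy positive:", round(pearson_r(xs, noisy_up), 2))
```

The sign of the result gives the Direction, and the distance from zero gives the Strength.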

There is no unified, or exact range, but here are some sensible metrics for you to begin describing your correlation coefficient. Keep in mind that Positive “r” scores represent a Positive Linear Relationship (recall when the graph is going up and to the right). Conversely, Negative “r” scores represent a Negative Linear Relationship (when the graph is going down and to the right):

1 would be Perfect Positive Linear Correlation

0.8 would be Strong Positive Linear Correlation

0.6 would be Moderate Positive Linear Correlation

0.3 would be Weak Positive Linear Correlation

0 would be No Linear Correlation

-0.3 would be Weak Negative Linear Correlation

-0.6 would be Moderate Negative Linear Correlation

-0.8 would be Strong Negative Linear Correlation

-1 would be Perfect Negative Linear Correlation

Keep in mind that when you compare your x and y-variables with scatterplots, you want to see “r” scores of Moderate or higher (0.6 or higher, or -0.6 or lower) for Regression Analysis to be a good fit. As a rule of thumb, this suggests that there IS a linear relationship between the variables, and that Regression can be used to make predictions.
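The labels above can be turned into a small Python helper. This is only a sketch: the list gives single reference values rather than ranges, so the cut-offs below (0.8 and up is Strong, 0.6 and up is Moderate, anything else above zero is Weak) are my own simplification, and the function names are made up.

```python
def describe_r(r):
    """Turn an r score into a label like 'Strong Positive Linear Correlation'."""
    if not -1 <= r <= 1:
        raise ValueError("r must be between -1 and 1")
    if r == 0:
        return "No Linear Correlation"
    direction = "Positive" if r > 0 else "Negative"
    size = abs(r)
    # Assumed cut-offs; there is no single agreed-upon scale.
    if size == 1:
        strength = "Perfect"
    elif size >= 0.8:
        strength = "Strong"
    elif size >= 0.6:
        strength = "Moderate"
    else:
        strength = "Weak"
    return f"{strength} {direction} Linear Correlation"


def moderate_or_higher(r):
    """Rule of thumb from this post: |r| of 0.6 or more."""
    return abs(r) >= 0.6


for r in [1, 0.8, -0.65, 0.3, 0, -1]:
    print(r, "->", describe_r(r), "| moderate or higher:", moderate_or_higher(r))
```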

THANK YOU, AGAIN, and ALWAYS!

It is my pleasure and honor to continue to post about analytics for you in an accessible way. This is all very boring at the moment, and that’s ok, because the fun part is coming soon! Don’t get discouraged if the material seems ambiguous and complicated. Part of the beauty of the world wide web is that information is abundant and permanent. You can read these posts as many times as you wish for the content to stick. Enjoy them, and have fun with them!

Location: Chicago, IL, USA