Thank you so much for your continued support and for
visiting my website. I hope that my content can be useful to all Porn Stars, or anyone
and everyone in the Sex Industry, and it is certainly my honor to create a space
where you can all access such a relevant and worthy field.
At this point, I expect that you have read my previous post,
“6 Basic Terms That You Should Know For Data Science and Analytics.” I will link that here. This post goes deeper into data science and analytics:
I will introduce some basic, yet necessary, graphs and charts that
will be vital for your development as data analysts and data entrepreneurs.
I will be discussing the following content:
HISTOGRAMS, SKEWNESS and NORMAL DISTRIBUTION
BOX PLOTS
SCATTERPLOTS and CORRELATION COEFFICIENTS
Before I discuss the content for this blog post, I
would like to put the above terms in context with Regression Analysis.
MACHINE LEARNING AND REGRESSION ANALYSIS
Regression Analysis is a form of “supervised machine
learning,” in that a machine “learns” to make predictions from example data
in which humans have already supplied the correct answers (the known values of
the variable being predicted). It is very different from “unsupervised machine
learning,” where no such answers are provided, because humans are a necessary
component for the machine to “grow smarter.”
In “unsupervised machine learning” algorithms—we will learn
about one of them called Clustering, later—machines are able to detect patterns
in the data on their own. The only caveat is that machines need a tremendous
amount of data.
Back to Regression Analysis.
Today’s blog post leans on data visualization as a tool
for creating a prediction equation, more formally known as a Regression
model. There are certain assumptions and checks that a data analyst or
machine learning expert needs to work through so that their Regression model (an equation
with one or many variables) is appropriate for making predictions, which is the real magic
of machine learning.
Charts and plots are an industry standard for checking
these assumptions (more on this in a later post). For example, we will eventually
be checking to see if there are ANY correlations (or relationships) between the
response variable, the variable we want to predict, and
the explanatory variables, the independent variables that may help explain the response variable.
In other words, though this post will be basic, there
will be relevant information for you so that you can run a Regression algorithm
AND create your own projects that you can show to potential employers. Regression
is a common, yet POWERFUL, machine learning application that is used in
pretty much any industry where you need to make a prediction about ANYTHING!
So, graphs and plots are important tools! Let’s begin
with Histograms.
HISTOGRAMS
A HISTOGRAM is a graph that depicts the distribution
of numerical data. A histogram commonly takes one of three shapes: a symmetrical
distribution, a right-skewed distribution, or a left-skewed distribution. Here
is a symmetrical distribution; the best-known example is the normal distribution:
We can see from the graph that it is perfectly symmetrical, with a single peak.
This means that the mode, mean, and median are ALL equal, and are positioned in
the center of the distribution, where most of the data is observed. This shape
is also known as a bell curve.
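If you would like to try this yourself, here is a minimal Python sketch (Python is simply my choice for illustration). The sample values and the helper name text_histogram are made up. It draws a rough text histogram and checks that the mean, median, and mode match for a symmetric, single-peaked sample.

```python
from collections import Counter
from statistics import mean, median, mode

# Hypothetical, perfectly symmetric sample (made-up values for illustration).
data = [1, 2, 2, 3, 3, 3, 4, 4, 5]

def text_histogram(values):
    """Return one line per distinct value, with one '#' per observation."""
    counts = Counter(values)
    return [f"{value:>2} | {'#' * counts[value]}" for value in sorted(counts)]

for line in text_histogram(data):
    print(line)

# For a symmetric, single-peaked sample the three measures of center coincide.
print("mean:", mean(data), "median:", median(data), "mode:", mode(data))
```

Real data will rarely be this tidy, so expect the three values to be close rather than identical.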
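To put this in context with Regression Analysis, our
first task would be to determine if the distribution of the “response variable,”
or the variable that we wish to predict, is roughly normally distributed. Ideally,
we want our histogram of the response variable to look like this! (Strictly speaking,
the formal normality assumption in Regression is about the model's errors, or residuals,
rather than the response variable itself, but a histogram of the response is a sensible
first look.) More on this in another blog post.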
So, if a histogram doesn’t have a normal shape, then
what else can it look like?
In the above graphic, we have both a Left and Right-Skewed
distribution.
Left-Skewed distributions are known to have a “negative
skewness,” and Right-Skewed distributions are known to have a “positive skewness.”
As you can see, the Left-Skewed graph has most of its data
positioned on the right side of the graph. Think of a Left-Skewed distribution
as a skier gliding down a mountainside (or slope) to the left. The opposite can
be said of Right-Skewed distributions.
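The words “negative” and “positive” come from a number called the skewness coefficient. Below is a rough sketch of one common, moment-based way to compute it. The function name skewness and both samples are hypothetical, and statistical software often applies a small-sample correction that this sketch leaves out.

```python
from statistics import fmean

def skewness(values):
    """Moment-based skewness: mean cubed deviation divided by (population std dev) ** 3."""
    n = len(values)
    center = fmean(values)
    m2 = sum((v - center) ** 2 for v in values) / n  # population variance
    m3 = sum((v - center) ** 3 for v in values) / n  # mean cubed deviation
    return m3 / m2 ** 1.5

# Made-up samples for illustration.
left_skewed = [1, 5, 8, 9, 9, 10, 10, 10]   # long tail on the left
right_skewed = [1, 1, 1, 2, 2, 3, 6, 10]    # long tail on the right

print("left-skewed sample: ", skewness(left_skewed))   # negative number
print("right-skewed sample:", skewness(right_skewed))  # positive number
```

The sign tells you which side the long tail is on; the size tells you how lopsided the data is.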
If you can recall, Normal or symmetric distributions
have the mode, mean, and median, ALL equaling each other. Now, there is a useful
relationship between these three descriptive statistics in that:
For Left-Skewed Distributions: The mode (or the peak
of the data) is typically larger than the median, and the median is typically larger than the mean.
For Right-Skewed Distributions: The mode is typically smaller
than the median, and the median is typically smaller than the mean.
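Here is a small sketch of that rule of thumb in Python. The helper name skew_hint and the two samples are made up, and the rule is only a hint: it works best for single-peaked data and can fail on unusual data sets.

```python
from statistics import mean, median, mode

def skew_hint(values):
    """Guess the direction of skew by comparing the mean with the median."""
    avg, mid = mean(values), median(values)
    if avg < mid:
        return "left-skewed"
    if avg > mid:
        return "right-skewed"
    return "roughly symmetric"

# Made-up samples for illustration.
left_sample = [1, 5, 8, 9, 9, 10, 10, 10]
right_sample = [1, 1, 1, 2, 2, 3, 6, 10]

for sample in (left_sample, right_sample):
    print("mode:", mode(sample), "median:", median(sample),
          "mean:", mean(sample), "->", skew_hint(sample))
```

In the first sample the mode is the largest of the three values and the mean is the smallest; in the second sample the order is reversed.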
But why is noting this useful?
Let’s say that you were analyzing employment report data
for Harvard Business School MBA graduates. We might observe that the Median
salary for post-graduates is $165,000, and that the Mean salary is $158,937. (These
are hypothetical numbers).
Since the Median is larger than the Mean, we can
conclude that the data is likely Left-Skewed, assuming it is unimodal. This might be an
important deduction for deciding whether or not to pursue a Harvard MBA, as this information suggests that salaries cluster near
the Mode post-graduate salary, which in a Left-Skewed distribution is typically greater than the Median of $165,000. And by the definition of the median, half of the graduates make at least $165,000.
Knowing this, we can make an inference, as to the salary
that we might make after graduating from Harvard’s MBA program, given that we have
other factors that are common to most of the incoming cohort—like number of work
years prior to entry, GMAT testing scores, Undergraduate GPA score, and industry
relevant experience.
Just knowing the shape of the distribution can help us
be CONFIDENT about these assumptions given other factors of note. We can never be certain though, and this is a downside to statistics and any form of science, in general.
Let’s discuss Boxplots.
BOXPLOTS
The Boxplot is also known as a “box-and-whisker” plot, and it
is a graphical representation of the Five-Number Summary (using quartiles).
There is a lot to unpack here, so please be patient
with me. Let’s break this plot down beginning with the middle.
The yellow line in this plot is known as the median. Sometimes, depending on
which software you are using, there will
also be a small diamond-shaped object near this line, and that shape represents
the mean of the data.
The Red Box spans from Q1 to Q3, and its width, if you
remember from the previous post, is known as the Interquartile Range (IQR).
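To make the Five-Number Summary and the IQR concrete, here is a minimal sketch using Python's built-in statistics module. The data values and the function name five_number_summary are made up. Quartiles can be calculated in several slightly different ways, so the "inclusive" method used here is just one convention, and your software may give slightly different numbers.

```python
from statistics import quantiles

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, q2, q3, max(values)

data = [2, 4, 4, 5, 6, 7, 8, 9, 10]  # made-up values for illustration

low, q1, med, q3, high = five_number_summary(data)
iqr = q3 - q1  # the width of the box

print("min:", low, "Q1:", q1, "median:", med, "Q3:", q3, "max:", high)
print("IQR:", iqr)
```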
The two purple lines are drawn outward from the box, and each one ends
in a short perpendicular line. The end point on the left (sometimes at the
bottom) is called the “Minimum” value, and the end point on the right
(sometimes at the top) is called the “Maximum” value. When outliers are drawn
separately, as they are here, these are the smallest and largest values that
are not outliers. These are the whiskers of
the boxplot.
The Green dots beyond either end of the whiskers are
called Outliers. These data points are special cases whose values sit far away
from the bulk of the data points in the data set.
Let’s think about
the average salary in the US, roughly $56,310 in 2020, according to the Bureau of
Labor Statistics. An outlier would be someone like a corporate executive who made,
perhaps, a whopping $2,500,000 that year.
This person is not typical! There are far more people who made close to the average salary than there are who brought in a seven-figure salary.
This example represents a Right-Skewed distribution in
which the mode and median sit near the peak of the data, while the mean is
pulled toward the tail. The outlier
would be far out, residing in the tail of the data. There are fewer US citizens making
salaries out in the tails than near the peak. Professional athletes are another
example of outliers in salary data, since they can make far more money than the average American
citizen.
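How does software decide which points to draw as outliers? I have not covered this, but a widely used convention (my assumption here, not a universal rule) is to flag anything more than 1.5 times the IQR below Q1 or above Q3. The sketch below applies that convention to made-up salaries; the function name find_outliers is hypothetical.

```python
from statistics import mean, median, quantiles

def find_outliers(values, k=1.5):
    """Return values outside [Q1 - k * IQR, Q3 + k * IQR]."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Made-up annual salaries; the last one mimics the executive in the example.
salaries = [38_000, 42_000, 47_000, 52_000, 56_000,
            61_000, 68_000, 75_000, 2_500_000]

print("outliers:", find_outliers(salaries))
# One extreme value drags the mean far above the median (right skew).
print("mean:", round(mean(salaries)), "median:", median(salaries))
```

Notice how a single seven-figure salary pulls the mean far above the median, which is exactly what a Right-Skewed distribution looks like in numbers.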
One last observation! The above boxplot suggests a symmetric
distribution (which is consistent with a Normal Distribution), and you can tell
because the whiskers are about equal in length (just
eyeballing it). If the Right whisker is longer than the left, we have a Right-Skewed
distribution. If the Left whisker is longer than the right, we have a Left-skewed
distribution.
Let’s move on to Scatter Plots and Correlation Coefficients.
SCATTER PLOTS
A Scatter Plot depicts how one variable changes in relation to
another, with one point for each observation. Scatter Plots are useful in the beginning of a Regression
Analysis when you need to determine whether the response variable (y-variable) has a “linear”
correlation with any of the explanatory variables (x-variables). Regression analysis relies
on these correlations, and they are necessary for the prediction equation to
hold merit for the final Regression line.
Scatter Plots can have Positive, Negative, or No Correlation.
Here is what that looks like:
In a Positive Linear Relationship, the value on the
y-axis (vertical axis) increases as the value on the x-axis (horizontal axis) increases.
It has an upward slope, pointing up and to the right.
In a Negative Linear Relationship, the value on the
y-axis (vertical axis) decreases as the value on the x-axis (horizontal axis) increases.
It has a downward slope, pointing down and to the right.
A graph with No Correlation cannot be classified as either
a positive or a negative linear relationship. There is no pattern, and the data
points appear to be scattered randomly.
For a Regression Analysis, you MUST have either a Positive
or a Negative correlation between the response variable and the explanatory
variable.
Furthermore, correlations can be Weak, Moderate, Strong,
or Perfect. Here is what that looks like:
The Stronger the correlation, the MORE data points fit
along—or are at least, close to—the Regression line (depicted by the red line).
A plot with a Strong Positive Linear Correlation will look like the plot on the
upper left. It will have a positive Correlation Coefficient value closer to 1 (more
on this soon).
Similarly, there can exist a plot with a Strong
Negative Linear Correlation, and that plot is represented by the one in the upper
right. It will have a Correlation Coefficient closer to negative 1.
The Weaker the correlation, the FEWER data points fit
on or along the Regression line. You can observe a Weak Positive Linear
Correlation, on the lower left, and a Weak Negative Linear Correlation, on the lower
right. The Weaker the correlation, the closer the Correlation Coefficient is to
zero.
Now that I have confused you with correlation coefficients, I will discuss them here in greater detail.
CORRELATION COEFFICIENTS
Correlation Coefficients are commonly denoted as “r” scores,
and these scores range from negative one (-1) to positive one (+1).
These scores are measured between two variables, a
dependent variable (y-variable, or response variable) and an independent
variable (x-variable, or explanatory variable).
The score represents both the Strength and the Direction
of the correlation.
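For the curious, here is a minimal sketch of how an “r” score can be calculated by hand in Python, using the standard Pearson formula (I am assuming that this is the “r” your software reports, which is usually the case). The function name pearson_r and all of the data values are made up.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of the same length (at least 2 points)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when one variable never changes")
    return sxy / sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]                       # made-up explanatory variable
print(pearson_r(x, [2, 4, 6, 8, 10]))     # points on a rising line
print(pearson_r(x, [10, 8, 6, 4, 2]))     # points on a falling line
print(pearson_r(x, [2, 1, 4, 3, 5]))      # rising, but with some scatter
```

The sign of the result gives the Direction, and how far it sits from zero gives the Strength.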
There is no universally agreed-upon set of cutoffs, but here are some
sensible benchmarks for you to begin describing your correlation coefficient. Keep
in mind that Positive “r” scores represent a Positive Linear Relationship
(recall when the graph is going up and to the right). Conversely, Negative “r” scores
represent a Negative Linear Relationship (when the graph is going down and to
the right):
1 would be Perfect Positive Linear Correlation
0.8 would be Strong Positive Linear Correlation
0.6 would be Moderate Positive Linear Correlation
0.3 would be Weak Positive Linear Correlation
0 would be No Linear Correlation
-0.3 would be Weak Negative Linear Correlation
-0.6 would be Moderate Negative Linear Correlation
-0.8 would be Strong Negative Linear Correlation
-1 would be Perfect Negative Linear Correlation
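The benchmarks above can be turned into a small labeling helper. This is only a sketch: the list gives single reference points, so treating each one as the lower edge of a range is my assumption, as is the function name describe_r.

```python
def describe_r(r):
    """Turn an r score into a plain-English label using the benchmarks above."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must be between -1 and 1")
    size = abs(r)
    if size == 1.0:
        strength = "Perfect"
    elif size >= 0.8:
        strength = "Strong"
    elif size >= 0.6:
        strength = "Moderate"
    elif size >= 0.3:
        strength = "Weak"
    else:
        # Assumption: anything below the Weak benchmark is lumped together.
        return "Little or No Linear Correlation"
    direction = "Positive" if r > 0 else "Negative"
    return f"{strength} {direction} Linear Correlation"

for score in (1.0, 0.85, -0.6, 0.35, 0.1):
    print(score, "->", describe_r(score))
```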
Keep in mind that when you are comparing your x and y-variables
with scatterplots, you generally want to see “r” scores of Moderate or higher (0.6 or
higher, or -0.6 or lower), in either direction, for an appropriate use of
Regression Analysis. This is a rule of thumb, but it suggests that there IS a linear relationship between the variables,
and that Regression can reasonably be used to make predictions.
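As a preview of where this is heading, here is a rough sketch that checks the “r” score first and only then fits a straight prediction line using ordinary least squares, the standard method for simple linear regression. The function name fit_line, the constant MIN_ABS_R, and the data are all made up for illustration.

```python
from math import sqrt

MIN_ABS_R = 0.6  # the "Moderate or higher" rule of thumb from this post

def fit_line(xs, ys):
    """Return (slope, intercept, r) for the line y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / sqrt(sxx * syy)
    return slope, intercept, r

xs = [1, 2, 3, 4, 5]  # made-up explanatory variable
ys = [2, 1, 4, 3, 5]  # made-up response variable

slope, intercept, r = fit_line(xs, ys)
if abs(r) >= MIN_ABS_R:
    prediction = intercept + slope * 6  # predict y for a new x of 6
    print(f"r = {r:.2f}; predicted y at x = 6 is {prediction:.2f}")
else:
    print(f"r = {r:.2f} is too weak to trust a straight-line prediction")
```

Checking “r” before fitting is the same habit as looking at the scatterplot first: it keeps you from drawing a line through data that has no linear pattern.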
THANK YOU, AGAIN, and ALWAYS!
It is my pleasure and honor to continue to post about analytics
for you in an accessible way. This is all very boring at the moment, and that’s
ok, because the fun part is coming soon! Don’t get discouraged if the material
seems ambiguous and complicated. Part of the beauty of the world wide web is
that information is abundant and permanent. You can read these posts as many
times as you wish for the content to stick. Enjoy them, and have fun with them!