Thank you so much for continuing to use my website as a resource for your professional and entrepreneurial development. I am so happy that I can write about what I love—data analytics and machine learning—for those who can stand to benefit from it.
In a world that can be so harsh and cruel, I hope that
the teaching of this subject will help level the playing field between those
who are in power and those who seek to rise, whether within or outside a toxic environment.
You ladies are incredible!
Ok, so for today’s post, I would like to discuss why the
subject of statistics is so important, and how it is used in the new world we
live in, where humans and machines are coexisting.
WHY?
Why do we use statistics in data analytics and machine
learning? What is the primary goal of this field of science and mathematics?
Well, many who have contributed to this field argue, quite
fervently, that the goal is to make an inference about a population based on data
collected from a sample of that population. It is difficult to obtain complete
data on any given population; arguably, it is impossible to do so.
For this reason, mathematicians and scientists
created a form of scientific inquiry that allows for the collection
and computation of data on a smaller segment of the population, the sample, to
make predictions about the larger population and even to verify assumptions about it.
Statisticians are careful to qualify their
predictions by stating a MARGIN OF ERROR alongside their
analysis. They understand that there is NO CERTAINTY in making predictions, and this
margin of error allows them to make predictions about a population with a
stated degree of confidence.
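To make the margin of error concrete, here is a minimal sketch in Python (used here purely for illustration). The survey numbers are made up, the function name margin_of_error is my own, and the multiplier of 1.96 assumes a normal approximation at roughly 95% confidence. With a sample this small, a statistician would likely use a slightly larger multiplier from the t-distribution.

```python
import math
import statistics

def margin_of_error(sample, z=1.96):
    """Approximate margin of error for a sample mean.

    z=1.96 corresponds to roughly 95% confidence under a normal
    approximation (an assumption that works best for larger samples).
    """
    # Standard error: sample standard deviation divided by sqrt(n).
    standard_error = statistics.stdev(sample) / math.sqrt(len(sample))
    return z * standard_error

# Hypothetical data: hours of sleep reported by 10 survey respondents.
hours = [6.5, 7.0, 8.0, 5.5, 7.5, 6.0, 7.0, 8.5, 6.5, 7.5]

mean = statistics.mean(hours)
moe = margin_of_error(hours)

print(f"Estimate: {mean:.2f} hours, give or take {moe:.2f}")
print(f"Approximate 95% interval: {mean - moe:.2f} to {mean + moe:.2f}")
```

The point is not the exact numbers. The point is that the prediction comes with a range around it, rather than as a single "certain" value.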
Scientists, and now the new-age data scientists, have long been
attempting to infer certain attributes about entire populations. The process
of data collection can be cumbersome for two reasons: data is difficult to
collect, and it often arrives in a form that is messy and almost unusable.
What have these scientists been trying to predict by
using statistics?
1. They have been trying to predict future outcomes
and returns for financial investments.
2. Data scientists and healthcare professionals have
been using analytics and machine learning to predict various measures within the
healthcare value chain. Such measures include health insurance claims, the number of
emergency room visits, and cancer rates. They also study how the myriad variables
within the data can be used to improve patient health outcomes.
3. Marketing professionals use analytics tools, such as those from Google and
HubSpot, to measure and validate user behaviors and preferences on websites.
4. Software engineers have been able to collect data
for the purpose of predicting the reliability of software in production. Further,
they can use this data to determine whether to continue building a product or
customer experience.
5. Economists also use statistics to predict both
economic recovery and recessions given a multitude of factors. In the past,
such factors have included housing costs, the unemployment rate, levels of
education, and consumer activity.
As you can see, there are many industries that use
statistics and machine learning. But what is “machine learning”? And how is it
related to statistics?
Think of machine learning as the technology component,
and statistics as the mathematical component. Machine learning
typically relies on sophisticated software and programming languages to produce
predictions. Statistics is the mathematical language in which machine
learning algorithms are written. So, the two are intertwined.
I will continue to bore you at this point with more statistical
definitions that are useful to know. They will help you interpret certain
values within your projects and analysis, and they are helpful when
conversing with your supervisors at work, or even with laypersons who are not well-versed
in the subject.
POPULATION and PARAMETER
As I mentioned before, the goal of statistics is to
make inferences about a population of interest. But how is a population defined? Simply put,
a population is a large (very large) collection of individuals, objects, or
entities. American citizens, millennial students, or even dogs can be
considered populations.
A “population parameter” is a number, such as a mean or
a percentage, that describes the population.
SAMPLE and STATISTIC
In many, if not all, cases, it is impossible to
collect information about an entire population. So, information is collected
from a sample instead. A sample is a smaller group drawn from the population
and meant to represent it.
A “sample statistic” is a number, such as a mean or a
percentage, that describes the sample.
DON’T GET IT TWISTED!
When I think of the difference between parameter and
statistic, I recall that parameter begins with the letter “p,” just like
population. Statistic, on the other hand, begins with the letter “s,” just like
sample. So, think of it like this:
A Parameter refers to a Population, P for P.
A Statistic refers to a Sample, S for S.
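Here is a small Python sketch of the P-for-P and S-for-S idea (Python is used only for illustration). Everything in it is hypothetical: the "population" is 100,000 randomly generated ages, and the sample size of 500 is an arbitrary choice. In real life you rarely get to see the whole population, which is exactly why the sample statistic matters.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: ages of 100,000 people (made-up numbers).
population = [random.randint(18, 90) for _ in range(100_000)]

# Parameter: describes the whole Population (P for P).
population_mean = statistics.mean(population)

# Statistic: describes a Sample drawn from it (S for S).
sample = random.sample(population, 500)
sample_mean = statistics.mean(sample)

print(f"Population parameter (mean age): {population_mean:.2f}")
print(f"Sample statistic (mean age):     {sample_mean:.2f}")
```

The two numbers should land close to each other but will not match exactly. That gap is the uncertainty a margin of error is meant to describe.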
OTHER GOALS FOR STATISTICS
As I mentioned previously, the primary goal of
statistics is to make an inference about a population of interest. Other goals
are to test hypotheses, and to draw conclusions concerning the correlation
between observed factors of interest.
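As a rough illustration of measuring the correlation between two observed factors, here is a minimal Python sketch of the Pearson correlation coefficient. The function name pearson_r and the study-hours data are hypothetical, chosen only to show the calculation.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How the two variables move together, relative to their means.
    co_movement = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # How much each variable varies on its own.
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return co_movement / math.sqrt(spread_x * spread_y)

# Hypothetical data: hours studied and exam scores for six students.
study_hours = [1, 2, 3, 4, 5, 6]
exam_scores = [52, 55, 61, 70, 72, 80]

r = pearson_r(study_hours, exam_scores)
print(f"Correlation between study hours and exam scores: {r:.3f}")
```

A value near 1 means the two factors rise together, a value near -1 means one falls as the other rises, and a value near 0 means there is little linear relationship. Keep in mind that correlation alone does not tell you that one factor causes the other.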
The bulk of the content for this website will be about
how you can run a full linear regression model, as it is one of the most common methods
for making predictions. EVERYONE uses this method! So, I will focus more on that.
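To give a first taste of what a linear regression does, here is a bare-bones Python sketch that fits a straight line through a handful of points using ordinary least squares. This is not the full model that later posts will cover. The function name fit_line and the advertising data are hypothetical, and a real analysis would use a statistics package rather than hand-written arithmetic.

```python
def fit_line(xs, ys):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: co-movement of x and y divided by the spread of x.
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    # Intercept: forces the line through the point of means.
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical data: ad spend (in $1,000s) and sales (in $1,000s).
ad_spend = [1, 2, 3, 4, 5]
sales = [2.9, 5.2, 6.8, 9.1, 11.0]

intercept, slope = fit_line(ad_spend, sales)

# Use the fitted line to predict sales at a new level of ad spend.
predicted = intercept + slope * 6
print(f"Fitted line: sales = {intercept:.2f} + {slope:.2f} * ad_spend")
print(f"Predicted sales at ad_spend = 6: {predicted:.2f}")
```

The slope tells you how much the outcome is expected to change for each one-unit increase in the input, which is the number most people care about when they read a regression output.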
THANK YOU!
So, this will be the last boring post… I hope! I want
to get into R programming and Regression analysis right away, so that you can get
your feet wet, and begin your journey officially as data analysts or entrepreneurs.
However, there will be many supplemental posts that offer explanations of how
to interpret specific outputs of the regression, and yes, those posts will be
boring. Yet, they are completely necessary!