INTRODUCTION
Hello again! Thank you so much for visiting my blog as
always. I hope that I can make it a one-stop solution to all of your professional
learning needs as emerging data analysts.
This post is going to be about how to handle missing
data in your data set, using a very small data set. It will be brief, but dense with information. Be patient with this one.
With this in mind...
Data sets are not always complete when you first analyze
them, and in the real world, there are likely to be missing values in your set. Missing
values make it difficult for you to perform any sort of analysis.
You could always delete the row that the missing value
is associated with, but that is not ideal for smaller data sets. Data is precious,
and in smaller data sets, you need to keep as much of it as possible.
So, what are you to do? There is a lot of missing data in your set, and you
have just been told that deleting data is bad.
There is an R package called “mice,” and it
is built for handling missing data.
Basically, it is a tool for substituting statistically estimated
values (imputations) for the missing values in your data set. It is not a perfect solution,
by any means, as there are many assumptions that this tool makes in order to
handle missing data.
However, it is an acceptable solution for smaller data sets when deleting rows,
or observations, is NOT AN OPTION!
Let’s begin by installing and loading the “mice”
package into R Studio.
INSTALL AND LOAD MICE
Here is what that looks like (type the following code):
install.packages("mice")
library(mice)
As a reminder, once you have installed mice into your
environment, you never need to run the install.packages() function for it
again. You only need to load it using the library() function.
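If you would rather not keep track of whether mice is already installed, one common pattern (entirely optional, just a sketch) is to install it only when it is missing:
# Install mice only if it is not already available, then load it.
if (!requireNamespace("mice", quietly = TRUE)) {
  install.packages("mice")
}
library(mice)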
CREATE YOUR DATA SET
We are going to load a pre-built data set, called “nhanes”
(it comes bundled with the mice package), into an object, called “data,” that we will use for analysis. Here is what that
looks like:
data <- nhanes
If you know that you are working with a smaller data set,
you can see it using the View() function. Keep in mind that “View” has a capital
“V.”
View(data)
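For larger data sets, where View() can be unwieldy, a couple of console-friendly checks (just a quick sketch) are:
head(data)   # first six rows of the data set
dim(data)    # number of rows and columns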
As you can see in the viewer, there are many rows that have NA
listed inside the cell. In R, NA stands for “not available,” which means that the value
is “missing.” There are many reasons why data can be missing, such as
recording errors by the researcher compiling the data.
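If you want a quick count of the missing values in each column before going any further, one small optional sketch is:
# Count the NA values in every column of the data set.
colSums(is.na(data))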
For larger data sets, we can pass our “data” object
through the summary() function to see just how many missing values there are. Here
is what the input for that looks like:
summary(data)
Here is what R Studio computes:
      age            bmi             hyp             chl
 Min.   :1.00   Min.   :20.40   Min.   :1.000   Min.   :113.0
 1st Qu.:1.00   1st Qu.:22.65   1st Qu.:1.000   1st Qu.:185.0
 Median :2.00   Median :26.75   Median :1.000   Median :187.0
 Mean   :1.76   Mean   :26.56   Mean   :1.235   Mean   :191.4
 3rd Qu.:2.00   3rd Qu.:28.93   3rd Qu.:1.000   3rd Qu.:212.0
 Max.   :3.00   Max.   :35.30   Max.   :2.000   Max.   :284.0
                NA's   :9       NA's   :8       NA's   :10
If you run the summary() function with “data” as the input, R Studio will show,
in the bottom row of the output, that “bmi” has 9 NA’s, “hyp” has 8 NA’s, and “chl” has 10
NA’s.
This represents A LOT of missing values as our data
set is already very small.
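If you would like a more compact picture of exactly where the gaps are, the mice package also includes the md.pattern() function, which tabulates the missing-data patterns (an optional check):
# Each row is a pattern of missingness: 1 = observed, 0 = missing.
md.pattern(data)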
Before we run an IMPUTATION, we should inspect our
data to be certain that we are working with numerical values.
Upon inspection, we can determine that “hyp” is NOT
numerical. Yes, it is represented by 1’s and 2’s, but this likely means that it
is a dummy variable, or a categorical (non-numerical) variable that is recorded
as a numerical variable.
Since this is medical data, we can assume that “hyp” means
hypertension, and furthermore, we can assume that 1 and 2 mean “yes” or “no”
for having hypertension.
There are other categorical variables that might also
have a 1 or 2 denoted in their columns. How about Male or Female? Black or White?
Sale or No Sale?
This is known as a binary variable in which there are
only two options. Can you think of any other binary variables?
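If you are not sure whether a column like “hyp” really is binary, a quick way to check (just a sketch) is to tabulate its values, NAs included:
# Count how many 1s, 2s, and NAs the hyp column contains.
table(data$hyp, useNA = "ifany")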
CONVERT ANY CATEGORICAL VARIABLES IF NEEDED
With this in mind, let’s convert “hyp” into a factor
so that we can run our imputation algorithm. In general, you will need to
convert ALL categorical variables into factors using the as.factor() function.
Here is what that looks like:
data$hyp <- as.factor(data$hyp)
Notice that we are using the “data” object that we
created earlier and referencing the “hyp” column.
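As a quick optional check that the conversion worked, you can ask R what type “hyp” is now:
class(data$hyp)   # should now return "factor"
str(data)         # hyp should appear as a factor with 2 levels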
We are ready to
run our MICE algorithm.
MICE IMPUTATION METHOD
We can check the different MICE imputation methods by passing
“mice” to the methods() function.
methods(mice)
Here is the R Studio output:
 [1] mice.impute.2l.bin              mice.impute.2l.lmer
 [3] mice.impute.2l.norm             mice.impute.2l.pan
 [5] mice.impute.2lonly.mean         mice.impute.2lonly.norm
 [7] mice.impute.2lonly.pmm          mice.impute.cart
 [9] mice.impute.jomoImpute          mice.impute.lasso.logreg
[11] mice.impute.lasso.norm          mice.impute.lasso.select.logreg
[13] mice.impute.lasso.select.norm   mice.impute.lda
[15] mice.impute.logreg              mice.impute.logreg.boot
[17] mice.impute.mean                mice.impute.midastouch
[19] mice.impute.mnar.logreg         mice.impute.mnar.norm
[21] mice.impute.norm                mice.impute.norm.boot
[23] mice.impute.norm.nob            mice.impute.norm.predict
[25] mice.impute.panImpute           mice.impute.passive
[27] mice.impute.pmm                 mice.impute.polr
[29] mice.impute.polyreg             mice.impute.quadratic
[31] mice.impute.rf                  mice.impute.ri
[33] mice.impute.sample              mice.mids
[35] mice.theme
We are going to use mice.impute.pmm (the “pmm” method) for “bmi” and “chl” and mice.impute.logreg (the “logreg” method) for “hyp.”
Without getting into the weeds of the mathematics,
we choose the “logreg” method for “hyp” simply because the variable is
categorical. Logistic regression is used to build a prediction equation for
categorical response variables.
The “pmm” method means “predictive mean matching,” which
is appropriate for numerical variables.
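If you are curious which methods mice would pick on its own, one common trick (optional, and only a sketch) is a “dry run” with zero iterations, which builds the imputation setup without imputing anything:
# Dry run: no iterations, just the default settings mice would use.
init <- mice(data, maxit = 0)
init$method            # the default method chosen for each column
init$predictorMatrix   # which variables are used to predict which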
Now we can create the algorithm:
# The method vector follows the column order of the data: age, bmi, hyp, chl.
my_imp = mice(data, m = 5, method = c("", "pmm", "logreg", "pmm"), maxit = 20)
Notice a few things:
We are using an “=” sign to assign the result to the “my_imp” object.
In the mice() function, we pass in our “data” object,
and in the “method” argument there is a blank set of quotation marks as the first entry. This means that we DON’T
want to run any method for “age,” as it has no missing values.
The “m = 5” argument tells mice to create 5 imputed data sets.
Also, the algorithm will run for 20 iterations due to the “maxit”
argument. 20 is a reasonable baseline number for a MICE imputation (see the quick convergence check below).
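Because the algorithm is iterative, it is worth glancing at the trace plots to see whether the imputations have settled down. Here is a minimal check:
# Trace plots of the mean and standard deviation of the imputed values
# across iterations; flat, well-mixed lines suggest convergence.
plot(my_imp)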
After we run that line of code, we need to determine which
of the 5 new data sets we will use as our final data set to do our analysis.
DETERMINE WHICH SET TO USE
So then, we need to check the “bmi” column’s original mean
value. To keep that reference point separate from the “data” object we have been
working on, let’s load a fresh copy of the original data into another object.
Let’s name it “data_mice”:
data_mice <- nhanes
summary(data_mice$bmi)
Once we run a summary() function with our new object, “data_mice,”
we can check the mean value for “bmi.” Here is the output:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
  20.40   22.65   26.75   26.56   28.93   35.30       9
The mean value is 26.56. So we need to find a data set
in our “my_imp” object whose imputed “bmi” values have a mean close to that value,
with as little deviation from it as possible.
Run this code to view the data sets:
my_imp$imp$bmi
Here is the output for reference:
      1    2    3    4    5
1  26.3 22.7 33.2 29.6 29.6
3  30.1 30.1 33.2 29.6 27.2
4  27.4 20.4 27.4 26.3 25.5
6  21.7 20.4 25.5 20.4 22.5
10 27.5 33.2 27.5 26.3 27.2
11 29.6 27.2 30.1 35.3 22.0
12 26.3 22.7 22.5 27.4 20.4
16 35.3 28.7 30.1 27.4 28.7
21 26.3 29.6 27.2 27.5 27.2
The numbers that run horizontally across the top of the
output (1 through 5) label the five imputed data sets we need to examine. After inspection, let’s choose the
5th column, as its values are close to the original mean value for “bmi”
with little deviation (we want to avoid values that are much larger or much smaller than 26).
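If eyeballing the table feels imprecise, you can also compare the mean of each column of imputed values against the original mean directly (a quick optional check):
colMeans(my_imp$imp$bmi)        # mean of the imputed bmi values in each of the 5 sets
mean(nhanes$bmi, na.rm = TRUE)  # mean of the observed bmi values (26.56)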
Finally, we can create our final data set that we can
run any sort of analysis on.
CREATE FINAL DATA SET
We will create an object that will hold our final data set;
it will be called “final_clean_data,” or anything that you wish. We will use
the complete() function with two arguments: our imputation object, “my_imp,”
and the number of the imputed data set we chose (5), the one whose “bmi” values most
closely resemble the original mean. Run this code and use the View() function to see the final data set:
final_clean_data = complete(my_imp, 5)
View(final_clean_data)
With the View() function, you will be able to see the
final data set. There are no longer any missing values. Feel free to run your Regression,
Clustering and Classification algorithms for further analysis.
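As a tiny illustration, here is what one such follow-up might look like; the model formula below is only an example I am making up for this data, not part of the tutorial:
# Example only: predict cholesterol from age, bmi, and hypertension status.
model <- lm(chl ~ age + bmi + hyp, data = final_clean_data)
summary(model)
If you also want to account for the uncertainty in the imputations themselves, mice offers the with() and pool() functions to fit a model on all 5 imputed data sets and combine the results, but that is a topic for another post.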
THANK YOU
This blog post was a bit intense, but being able to
handle missing data is essential. As previously mentioned, data will rarely
come in a form that is complete, so being able to prepare it for analysis is
very important.
Thank you for using my blog as a way to better your lives.
You are all doing so well and learning a skill that will be useful to you as
you transition from the Sex Work industry. You can choose whatever career you
wish. Whether you want to work in Corporate America, be an entrepreneur or go
back to school, it is up to you.
Enjoy the journey. There is no need to rush the learning of data analytics and machine learning. Yet, some level of urgency is required so that you can have more healthy lives, circumstances, and relationships. I admire all of you!