Monday, August 8, 2022

MICE Imputation: How Porn Stars Can Use MICE Imputation to Handle Missing Values

INTRODUCTION

 

Hello again! Thank you so much for visiting my blog as always. I hope that I can make it a one-stop solution to all of your professional learning needs as emerging data analysts.

 

This post is going to be about how to handle missing data in your data set, using a very small data set. It will be brief, but dense with information. Be patient with this one.


With this in mind...

 

Data sets are not always complete when you first analyze them; in the real world, there are likely to be missing values in your set. Missing values make it difficult to perform any sort of analysis.

 

You could always delete the row that the missing value is associated with, but that is not ideal for smaller data sets. Data is precious, and in smaller data sets, you need to keep as much of it as possible.



So, what are you to do? There is a lot of missing data in your set, and you have just been told that deleting data is bad.

 

There is an R package called “mice,” which you can use in R Studio, and it is perfect for handling missing data.

 

Basically, it is a tool for substituting statistical values for the missing values in your data set. It is not a perfect solution, by any means, as there are many assumptions that this tool makes in order to handle missing data.



However, it is an acceptable solution for smaller data sets when deleting rows, or observations, is NOT AN OPTION!

 

Let’s begin by installing and loading the “mice” package into R Studio.

 


INSTALL AND LOAD MICE

 

 

Here is what that looks like (type the following code):

 

install.packages("mice")

 

library(mice)

 

As a reminder, once you have installed mice into your environment, you never need to run the install.packages() function for it again. You only need to load it using the library() function.
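If you like, you can automate that reminder. Here is a small optional sketch that installs mice only when it is not already present (requireNamespace() is a base R function):

```r
# Install mice only if it is not already installed, then load it
if (!requireNamespace("mice", quietly = TRUE)) {
  install.packages("mice")
}
library(mice)
```

This way the same script runs cleanly whether or not mice is already in your environment.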

 

 

CREATE YOUR DATA SET

 

We are going to load a pre-built data set, called “nhanes” (it comes bundled with the mice package), into an object called “data” that we will use for analysis. Here is what that looks like:

 

data <- nhanes

 

If you know that you are working with a smaller data set, you can see it using the View() function. Keep in mind that “View” has a capital “V.”

 

View(data)

 

As you can see, there are many rows that have NA listed inside the cell. In R, NA stands for “not available,” which means the data is “missing.” There are many reasons why data can be missing, such as recording errors by the researcher compiling the data.
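A quick way to count those missing values per column is base R’s is.na() combined with colSums(). Here is a small sketch:

```r
# Count how many NA values each column of the data set contains
colSums(is.na(data))
#  age  bmi  hyp  chl
#    0    9    8   10
```

Notice that “age” is complete, which will matter later when we choose imputation methods.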

 

We can also pass our “data” object through the summary() function to see just how many missing values there are, which is handy for larger data sets. Here is what the input for that looks like:

 

summary(data)

 

Here is what R Studio computes:

 

      age             bmi             hyp             chl       
 Min.   :1.00   Min.   :20.40   Min.   :1.000   Min.   :113.0  
 1st Qu.:1.00   1st Qu.:22.65   1st Qu.:1.000   1st Qu.:185.0  
 Median :2.00   Median :26.75   Median :1.000   Median :187.0  
 Mean   :1.76   Mean   :26.56   Mean   :1.235   Mean   :191.4  
 3rd Qu.:2.00   3rd Qu.:28.93   3rd Qu.:1.000   3rd Qu.:212.0  
 Max.   :3.00   Max.   :35.30   Max.   :2.000   Max.   :284.0  
                NA's   :9       NA's   :8       NA's   :10     

 

If you run the summary() function with “data” as the input, the bottom row of the output shows “bmi” having 9 NA’s, “hyp” having 8 NA’s, and “chl” having 10 NA’s.

 

This represents A LOT of missing values as our data set is already very small.

 

Before we run an IMPUTATION, we should inspect our data to be certain which variables are truly numerical.

 

Upon inspection, we can determine that “hyp” is NOT numerical. Yes, it is represented by 1’s and 2’s, but this likely means that it is a dummy variable: a categorical (non-numerical) variable that is recorded as a number.

 

Since this is medical data, we can assume that “hyp” means hypertension, and furthermore, we can assume that 1 and 2 mean “yes” or “no” for having hypertension.

 

There are other categorical variables that might also be denoted with a 1 or 2 in their columns. How about Male or Female? Black or White? Sale or No Sale?

 

This is known as a binary variable in which there are only two options. Can you think of any other binary variables?

 

 

CONVERT ANY CATEGORICAL VARIABLES IF NEEDED

 

With this in mind, let’s convert “hyp” into a factor so that we can run our imputation algorithm. In the future, we will need to convert ALL categorical variables into factors using the as.factor() function. Here is what that looks like:

 

data$hyp <- as.factor(data$hyp)

 

Notice that we are using the “data” object that we created earlier and referencing the “hyp” column.
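A quick sanity check at this point is base R’s str() function, which shows the type of each column. Here is a small sketch:

```r
# Confirm that hyp is now a factor while the other columns stay numeric
str(data)
```

In the output, “hyp” should now read as a Factor with 2 levels, while “age,” “bmi,” and “chl” remain numeric.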

 

 We are ready to run our MICE algorithm.

 

MICE IMPUTATION METHOD

 

We can check the different MICE imputation methods by passing mice through the methods() function.

 

methods(mice)

 

Here is the R Studio output:

 

 [1] mice.impute.2l.bin              mice.impute.2l.lmer            
 [3] mice.impute.2l.norm             mice.impute.2l.pan             
 [5] mice.impute.2lonly.mean         mice.impute.2lonly.norm        
 [7] mice.impute.2lonly.pmm          mice.impute.cart               
 [9] mice.impute.jomoImpute          mice.impute.lasso.logreg       
[11] mice.impute.lasso.norm          mice.impute.lasso.select.logreg
[13] mice.impute.lasso.select.norm   mice.impute.lda                
[15] mice.impute.logreg              mice.impute.logreg.boot        
[17] mice.impute.mean                mice.impute.midastouch         
[19] mice.impute.mnar.logreg         mice.impute.mnar.norm          
[21] mice.impute.norm                mice.impute.norm.boot          
[23] mice.impute.norm.nob            mice.impute.norm.predict       
[25] mice.impute.panImpute           mice.impute.passive            
[27] mice.impute.pmm                 mice.impute.polr               
[29] mice.impute.polyreg             mice.impute.quadratic          
[31] mice.impute.rf                  mice.impute.ri                 
[33] mice.impute.sample              mice.mids                      
[35] mice.theme

 

We are going to use mice.impute.pmm (“pmm”) for “bmi” and “chl,” and mice.impute.logreg (“logreg”) for “hyp.”


Without getting into the weeds of the mathematics, we choose the “logreg” method for “hyp” simply because the variable is categorical. Logistic Regression is used to build a prediction equation for categorical response variables.

 

The “pmm” method means “predictive mean matching,” which is appropriate for numerical variables.

 

Now we can create the algorithm:

 

my_imp = mice(data, m = 5, method = c("", "pmm", "logreg", "pmm"), maxit = 20)

 

Notice a few things:


First, we are using an “=” sign for assignment; in R, it works the same as the “<-” operator here.

 

In the mice() function, we pass in our “data” object, and in the “method” argument the first entry is a blank set of quotation marks. This means that we DON’T want to run any method for “age,” as it has no missing values. The methods are listed in the same order as the columns: age, bmi, hyp, chl.

 

Also, the algorithm will iterate up to 20 times due to the “maxit” argument, and the “m = 5” argument creates 5 completed data sets. A maxit of 20 is a reasonable baseline for MICE imputation.
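Once the algorithm finishes, it is good practice to check that those iterations actually converged. Calling plot() on the object that mice() returns draws trace plots of the imputed values across iterations; here is a quick sketch:

```r
# Trace plots of the mean and standard deviation of imputed values
# per iteration; healthy chains overlap and show no clear trend
plot(my_imp)
```

If the lines for the different imputations wander apart or trend steadily up or down, consider raising maxit.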

 

 

After we run that line of code, we need to determine which of the 5 new data sets we will use as our final data set to do our analysis.

 

DETERMINE WHICH SET TO USE

 

So then, we need to check the “bmi” column’s mean value. To do that, let’s create another object holding a fresh copy of the original data (our first object, “data,” was already modified when we converted “hyp” to a factor).

 

Let’s name it “data_mice”:

 

data_mice <- nhanes

 

summary(data_mice$bmi)

 

Once we run the summary() function on our new object, “data_mice,” we can check the mean value for “bmi.” Here is the output:

 

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  20.40   22.65   26.75   26.56   28.93   35.30       9 

 

The mean value is 26.56. So we need to find a data set in our “my_imp” object whose imputed values have a mean close to that value, with as little deviation from it as possible.

 

Run this code to view the data sets:

 

my_imp$imp$bmi

 

Here is the output for reference:

      1    2    3    4    5
1  26.3 22.7 33.2 29.6 29.6
3  30.1 30.1 33.2 29.6 27.2
4  27.4 20.4 27.4 26.3 25.5
6  21.7 20.4 25.5 20.4 22.5
10 27.5 33.2 27.5 26.3 27.2
11 29.6 27.2 30.1 35.3 22.0
12 26.3 22.7 22.5 27.4 20.4
16 35.3 28.7 30.1 27.4 28.7
21 26.3 29.6 27.2 27.5 27.2

 

The numbers that run horizontally across the top of the output label the 5 imputed data sets we need to examine. After inspection, let’s choose the 5th column, as its values average close to the original mean for “bmi” with little deviation (we don’t want values that are much larger or much smaller than 26).
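Rather than eyeballing the columns, you can compute each column’s mean directly with base R’s colMeans(). Here is a small sketch:

```r
# Mean of the imputed bmi values in each of the 5 candidate data sets;
# compare each against the original bmi mean of 26.56
colMeans(my_imp$imp$bmi)
```

Whichever column’s mean lands closest to 26.56 is a sensible pick by the criterion used in this post.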

 

Again, sorry for the messy output. R Studio is much cleaner to read than this blogger template. 


Finally, we can create our final data set that we can run any sort of analysis on.

 

CREATE FINAL DATA SET

 

We will create an object that will hold our data set; it will be called “final_clean_data,” or anything that you wish. We will use the complete() function with two arguments: the object holding our imputations, and the number of the imputed data set we chose (the one that most closely resembles the original “bmi” mean value). Run this code and use the View() function to see the final data set:

 

 

final_clean_data = complete(my_imp, 5)

 

View(final_clean_data)

 

With the View() function, you will be able to see the final data set. There are no longer any missing values. Feel free to run your Regression, Clustering and Classification algorithms for further analysis.
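As a quick check, you can confirm that no NA values remain. And for completeness: the statistically conventional MICE workflow does not pick one data set at all, but fits a model to all 5 imputed sets and pools the results with the with() and pool() functions from mice. Here is a hedged sketch (the model formula is just an illustration, not part of this tutorial):

```r
# Verify the completed data set has no missing values left
sum(is.na(final_clean_data))  # 0 means fully imputed

# Conventional alternative: fit the same model on every imputed
# data set, then pool the estimates across all 5 of them
fit <- with(my_imp, lm(chl ~ age + bmi))
summary(pool(fit))
```

Picking a single completed set, as we did above, is simpler and fine for exploratory work; pooling is the safer route for formal inference.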

 

THANK YOU

 

This blog post was a bit intense, but being able to handle missing data is essential. As previously mentioned, data will rarely arrive in a complete form, so being able to prepare it for analysis is very important.

 

Thank you for using my blog as a way to better your lives. You are all doing so well and learning a skill that will be useful to you as you transition from the Sex Work industry. You can choose whatever career you wish. Whether you want to work in Corporate America, be an entrepreneur or go back to school, it is up to you.

 

Enjoy the journey. There is no need to rush the learning of data analytics and machine learning. Yet, some level of urgency is required so that you can have more healthy lives, circumstances, and relationships. I admire all of you! 

Location: Chicago, IL, USA