Wednesday, August 10, 2022

Regression Analysis Using Small Data Sets (No CSV File)

on August 10, 2022 in Blog, Machine Learning, Simple Linear Regression, Supervised Machine Learning

WELCOME

Hello Ladies! I am thrilled that you are coming back for more! Here is a homework assignment that I did for class. I hope that this sample work can show you by example, and not by explanation. If you get frustrated, read through it again, so that you can assimilate the concepts more efficiently. Enjoy!

PROBLEM 1:

In this problem, we are given three data sets. The following are the sets that we are given, along with a description of the data sets:

Percentage in Poverty: This dataset gives us proportional data about the poverty rates of the 50 states. Each observation in the set represents the poverty rate. For example: the first observation is Alabama, and it has a 20.1 percent poverty rate.

Birth Rate 15-16: Similarly, this data set gives us the proportion of 15-16 year olds in each state that have children. More specifically, the first observation will give us 31.5 percent per every 1000 citizens in Alabama give birth to children.

Birth Rate 17-18: This data set is nearly identical to the previous data set. The only difference is the age group. Each observation gives us a proportion for every 1000 citizens, in each of the 50 states.

This problem will test our assumptions whether there is a correlation between poverty and differing age groups. Is there a relationship? Does poverty (our explanatory variable) impact the Birth Rate per 1000 citizens?

Can we use a regression line to make predictions for Birth Rates within the various age groups? Do higher amounts of poverty within a given state also explain the differences between the various age groups? Is the difference significant?

These are questions that I will attempt to answer in this analysis.

Let begin by loading tidyverse (because I love this package) and setting our working directory.

library(tidyverse)

## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --

## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble 3.1.4     v dplyr   1.0.7
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   2.0.1     v forcats 0.5.1

## Warning: package 'stringr' was built under R version 4.1.2

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()

setwd("C:/Users/firstnamelastinitial/Documents/R")

Next, I will store three vectors in each appropriate variable.

I will call the Percent in Poverty variable: pec_pov

I will call the Birth Rate 15-16 variable: br1

I will call the Birth Rate 17-18 variable: br2

pec_pov <- c(20.1, 7.1, 16.1, 14.9, 16.7, 8.8, 9.7, 10.3, 22, 16.2, 12.1, 10.3, 14.5, 12.4, 9.6, 12.2, 10.8, 14.7, 19.7, 11.2, 10.1, 11, 12.2, 9.2, 23.5, 9.4, 15.3, 9.6, 11.1, 5.3, 7.8, 25.3, 16.5, 12.6, 12, 11.5, 17.1, 11.2, 12.2, 10.6, 19.9, 14.5, 15.5, 17.4, 8.4, 10.3, 10.2, 12.5, 16.7, 8.5, 12.2)

br1 <- c(31.5, 18.9, 35, 31.6, 22.6, 26.2, 14.1, 24.7, 44.8, 23.2, 31.4, 17.7, 18.4, 23.4, 22.6, 16.4, 21.4, 26.5, 31.7, 11.9, 20, 12.5, 18, 14.2, 37.6, 22.2, 17.8, 18.3, 28, 8.1, 14.7, 37.8, 15.7, 28.6, 11.7, 20.1, 30.1, 18.2, 17.2, 19.6, 29.2, 17.3, 28.2, 38.2, 17.8, 10.4, 19, 16.8, 21.5, 15.9, 17.7)

br2 <- c(88.7, 73.7, 102.5, 101.7, 69.1, 79.1, 45.1, 77.8, 101.5, 78.4, 92.8, 66.4, 69.1, 70.5, 78.5, 55.4, 74.2, 84.8, 96.1, 45.2, 59.6, 39.6, 60.8, 47.3, 103.3, 76.6, 63.3, 64.2, 96.7, 39, 46.1, 99.5, 50.1, 89.3, 48.7, 69.4, 97.6, 64.8, 53.7, 59, 87.2, 67.8, 94.2, 104.3, 62.4, 44.4, 66, 57.6, 80.7, 57.1, 72.1)

I shall make two scatter plots, comparing pec_pov and br1, and pec_pov and br2. I don’t have enough information at this point to determine if regression analysis is feasible or even appropriate, so I will see if there is a pattern in the data. I will also add titles to the plots, as well as a regression line to see if regression is appropriate.

plot(pec_pov, br1,
     main = "Does Poverty Explain Birth rates for 15-16 Year Olds in America",
     xlab = "Poverty Percentage in Each State",
     ylab = "Birth Rate for 15-16 Year Old per 1,000")
abline(lm(br1~pec_pov))

Scatter Plot 1

This first plot appears to be linear. I need to do further investigating to be certain of this. I will plot the second scatter plot, with the variables pec_pov and br2. As I did previously, I will label each axis and generate a regression line.

plot(pec_pov, br2,
     main = "Does Poverty Explain Birth rates for 17-18 Year Olds in America",
     xlab = "Poverty Percentage in Each State",
     ylab = "Birth Rate for 17-18 Year Olds per 1,000")
abline(lm(br2~pec_pov))

Scatter Plot 2

This plot also appears to follow a linear pattern. Yet, I am still not sure if regression is appropriate for this data set. I need to investigate further.

POVERTY VS BIRTH RATES 15-16 ANALYSIS AND DISCUSSION

As part of my analysis, I will program a model for both graphs to see if I can calculate an appropriate R value. This will give me more confidence that regression can in fact be used or not.

The model for pec_pov and br1 will look like such:

mod1 <- lm(br1~pec_pov)

Now I can call up the summary of the model:

summary(mod1)

##
## Call:
## lm(formula = br1 ~ pec_pov)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -11.2275 -3.6554 -0.0407   2.4972 10.5152
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   4.2673     2.5297   1.687    0.098 .
## pec_pov       1.3733     0.1835   7.483 1.19e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.551 on 49 degrees of freedom
## Multiple R-squared: 0.5333, Adjusted R-squared: 0.5238
## F-statistic:    56 on 1 and 49 DF, p-value: 1.188e-09

I notice in the Residuals section of the summary that our Adjusted R-Squared Score is .5238. So, in order to obtain the R value, I need to perform the square root of this value.

sqrt(.5238)

## [1] 0.7237403

Our R value is .7237, which tells us that our R value has a strong positive correlation. This means that there is a high correlation between the explanatory variable (pec_pov) and the response variable br1.

We can be confident in assuming that as poverty increases, the birth rates for ages 15-16 will increase as well. Likewise, we can also be enthusiastic that regression analysis is appropriate for this type of data.

Although, it is worth mentioning that Poverty only explains roughly 52.4% of birth rates from ages 15-16. Conversely, the use of other variables might make our model more predictive, like income, ethnicity, socioeconomic status, etc. We can be confident that 47.6% of the model is NOT being explained by the two variables that we have used in our model, called mod1.

Our equation for this model looks like this:

Birth Rate for 15-16 ^ = 4.2673 + 1.3733*(Percentage in Poverty)

So, if I wanted to predict what the birth rate would be if the poverty rate was 20, then I would use the following equation:

y <- 4.2673 + 1.3733*(20)
print(y)

## [1] 31.7333

This means that if the poverty rate was 20% then the birth rate of 15-16 year olds would be 31.73% per 1000 citizens in this age group.

In summation, the model that we have created for these variables is strong, BUT it can be stronger, if we added more variables to the regression equation (as in multivariate regression).

Maybe there is a variable that explains birth rates much better than poverty. Does the LOCATION of the state, or even CLIMATE have any predictive power in a regression model? We need more data and more features to ensure that this model has a better predictive influence.

I will create a residual plot now to see if any of my analysis held merit.

plot(mod1)

Residual Plot 1

Normality Plot

Residual Plot 2

Residual Plot 3

The residual plot supports that regression analysis was appropriate, as the data points in the plots do NOT conform to a specific or obvious pattern. But, I would like to add that other variables should be used to explore a better model that can explain more of the response variable. However, this is a good start.

I will now conduct the same analysis for the Birth Rate 17-18 and Percentage in Poverty and conclude this report with Managerial Implications for Laypersons involved in policy making.

POVERTY VS BIRTH RATES 17-18 ANALYSIS AND DISCUSSION

Now, I will run a regression model with Birth Rates for 17-18 year olds per 1000 citizens.

mod2 <- lm(br2~pec_pov)
summary(mod2)

##
## Call:
## lm(formula = br2 ~ pec_pov)
##
## Residuals:
##     Min      1Q Median      3Q     Max
## -31.668 -10.541 -1.611 11.381 30.496
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.2124     6.6414   5.151 4.59e-06 ***
## pec_pov       2.8822     0.4818   5.982 2.50e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.57 on 49 degrees of freedom
## Multiple R-squared: 0.4221, Adjusted R-squared: 0.4103
## F-statistic: 35.78 on 1 and 49 DF, p-value: 2.495e-07

Immediately, I notice that our Adjusted R^2 Score is less than the previous age group. So, I am assuming that our R value will be less as well. Let me confirm this:

sqrt(0.4103)

## [1] 0.6405466

My assumption was correct! The R value is much lower, at roughly (rounded up) to .641. Despite this lower number, our R value still represents a positive moderate correlation. Meaning that the percentage of poverty in the state DOES impact Birth rates per 1000 citizens of 17-18 year olds.

We can further deduce that we ABSOLUTELY REQUIRE more variables to improve the predictive power of our model. Unless we find an explanatory variable that has a higher correlation than poverty, we will need multiple variables to explain the Birth Rates in this age group.

Our equation for these two variables looks like this:

Birth Rates 17-18^ = 34.2124 + 2.8822*(Percentage in Poverty)

I am now going to create a residual plot to see if my regression analysis holds some merit.

plot(mod2)

Residual Plot

Normality Plot 2

Residual Plot

Residual Plot

I am on the fence here! I can see that the plot titled, “Residuals vs. Fitted,” has a subtle pattern. IT IS NOT OBVIOUS, but I can see that there are more observations to the left of the value 80, and less observations passed the value of 80. At this point, I can conclude that I would need to go beyond a regression model or make use of other variables.

I have also neglected to mention that I have not removed any outliers or influence points from the vectors that I created in the beginning of this analysis. The main reason for this is because our sample size is only 50 (n=50).

Since our sample set is relatively smaller, I didn’t want to take away any information that could be valuable to me in this analysis. Yet, one could also argue that the removal of three, perhaps 4, outliers and influence points, might enable our model’s predictive power to be higher.

If our R^2 score was higher in this case, would that necessarily translate well in a real world setting, in which human lives are so vital in the model’s predictive outcome? I lack the experience and knowledge to answer this question at this time.

Now for the Managerial Implications.

MANAGERIAL IMPLICATIONS FOR POLICY MAKERS (HYPOTHETICAL LECTURE)

To all of the policy makers who have gathered today at this TedTalk, I beg you all to hear me out on some key points that I have discovered when it comes to birth rates in the US.

Policy makers who create birth control programs will need to be receptive to that fact that each state has different poverty levels. Additionally, we can be confident that the Percentage in Poverty does explain birth rates in both ages groups, from 15-16 and 17-18 years of age.

HOWEVER, and this is important, Poverty is NOT the only variable that can explain Birth Rates amongst these age groups. We need to collect more data, and consider more, and perhaps, use multiple variables to create regression models that you can build your programs around.

Policy makers can be confident in using birth control programs to target those living in states, where income levels are 20% or less for the 15-16 year age group. Although, there are no guarantees and certainties in statistics, we can be confident of this fact.

hist(br1,
main = "Distribution of Birth Rates per 1000",
xlab = "Proportion of Birthday Rates per 1000 Citizens")

Histogram

The histogram below shows the distribution of Birth Rates for 17-18 Years Old. You might note that these proportions are higher for this age group than for 15-16 year olds. I am going to make a hypothesis that the main reason for the disparity between groups is because 17-18 year olds are of legal age to drive, and have driver’s licenses. You might consider having flyers or promotional birth control documents and solutions at DMVs, and even hotels–these kids need some place private to “get it on.”

hist(br2)

Histogram 2

Thank you very much for attending my Ted Talk today. I hope my advice is enough to persuade you to take action!

THANK YOU!

Thank you so very much for coming to my blog to learn data analytics and machine learning. I know that you are making a tremendous effort to learn this material, and that some of it may be foreign to you.

This fine! Take your time with it. Go at a pace that is comfortable to you, and enjoy the ride!

Share:

Location: Chicago, IL, USA