WELCOME
Hello Ladies! I am thrilled that you are coming back for more! Here is a homework assignment that I did for class. I hope that this sample work can show you by example, and not by explanation. If you get frustrated, read through it again, so that you can assimilate the concepts more efficiently. Enjoy!
PROBLEM 1:
In this problem, we are given three data sets. The
following are the sets that we are given, along with a description of the data
sets:
Percentage in Poverty: This dataset gives us
proportional data about the poverty rates of the 50 states. Each observation in
the set represents the poverty rate. For example: the first observation is
Alabama, and it has a 20.1 percent poverty rate.
Birth Rate 15-16: Similarly, this data set gives us
the proportion of 15-16 year olds in each state that have children. More
specifically, the first observation will give us 31.5 percent per every 1000
citizens in Alabama give birth to children.
Birth Rate 17-18: This data set is nearly identical
to the previous data set. The only difference is the age group. Each
observation gives us a proportion for every 1000 citizens, in each of the 50
states.
This problem will test our assumptions whether there is a correlation between poverty and differing age groups. Is there a relationship? Does poverty (our explanatory variable) impact the Birth Rate per 1000 citizens?
Can we use a regression line to make predictions for Birth Rates
within the various age groups? Do higher amounts of poverty within a given
state also explain the differences between the various age groups? Is the
difference significant?
These are questions that I will attempt to answer in this
analysis.
Let begin by loading tidyverse (because I love this
package) and setting our working directory.
library(tidyverse)
## -- Attaching packages
--------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5
v purrr 0.3.4
## v tibble
3.1.4 v dplyr 1.0.7
## v tidyr
1.1.3 v stringr 1.4.0
## v readr
2.0.1 v forcats 0.5.1
## Warning: package 'stringr' was built under R
version 4.1.2
## -- Conflicts
------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()
masks stats::lag()
setwd("C:/Users/firstnamelastinitial/Documents/R")
Next, I will store three vectors in each appropriate variable.
I will call the Percent in Poverty variable: pec_pov
I will call the Birth Rate 15-16 variable: br1
I will call the Birth Rate 17-18 variable: br2
pec_pov <- c(20.1, 7.1, 16.1, 14.9, 16.7, 8.8, 9.7, 10.3, 22, 16.2, 12.1, 10.3, 14.5, 12.4, 9.6, 12.2, 10.8, 14.7, 19.7, 11.2, 10.1, 11, 12.2, 9.2, 23.5, 9.4, 15.3, 9.6, 11.1, 5.3, 7.8, 25.3, 16.5, 12.6, 12, 11.5, 17.1, 11.2, 12.2, 10.6, 19.9, 14.5, 15.5, 17.4, 8.4, 10.3, 10.2, 12.5, 16.7, 8.5, 12.2)
br1 <- c(31.5, 18.9, 35, 31.6, 22.6, 26.2, 14.1, 24.7, 44.8, 23.2, 31.4, 17.7, 18.4, 23.4, 22.6, 16.4, 21.4, 26.5, 31.7, 11.9, 20, 12.5, 18, 14.2, 37.6, 22.2, 17.8, 18.3, 28, 8.1, 14.7, 37.8, 15.7, 28.6, 11.7, 20.1, 30.1, 18.2, 17.2, 19.6, 29.2, 17.3, 28.2, 38.2, 17.8, 10.4, 19, 16.8, 21.5, 15.9, 17.7)
br2 <- c(88.7, 73.7, 102.5, 101.7, 69.1, 79.1, 45.1, 77.8, 101.5, 78.4, 92.8, 66.4, 69.1, 70.5, 78.5, 55.4, 74.2, 84.8, 96.1, 45.2, 59.6, 39.6, 60.8, 47.3, 103.3, 76.6, 63.3, 64.2, 96.7, 39, 46.1, 99.5, 50.1, 89.3, 48.7, 69.4, 97.6, 64.8, 53.7, 59, 87.2, 67.8, 94.2, 104.3, 62.4, 44.4, 66, 57.6, 80.7, 57.1, 72.1)
I shall make two scatter plots, comparing pec_pov
and br1, and pec_pov and br2. I don’t have enough
information at this point to determine if regression analysis is feasible or
even appropriate, so I will see if there is a pattern in the data. I will also
add titles to the plots, as well as a regression line to see if regression is
appropriate.
plot(pec_pov, br1,
main = "Does
Poverty Explain Birth rates for 15-16 Year Olds in America",
xlab = "Poverty
Percentage in Each State",
ylab = "Birth Rate
for 15-16 Year Old per 1,000")
abline(lm(br1~pec_pov))
This first plot appears to be linear. I need to do further investigating to be certain of this. I will plot the second scatter plot, with the variables pec_pov and br2. As I did previously, I will label each axis and generate a regression line.
plot(pec_pov, br2,
main = "Does
Poverty Explain Birth rates for 17-18 Year Olds in America",
xlab = "Poverty
Percentage in Each State",
ylab = "Birth Rate
for 17-18 Year Olds per 1,000")
abline(lm(br2~pec_pov))
This plot also appears to follow a linear pattern. Yet, I am still not sure if regression is appropriate for this data set. I need to investigate further.
POVERTY VS BIRTH RATES 15-16 ANALYSIS AND DISCUSSION
As part of my analysis, I will program a model for both
graphs to see if I can calculate an appropriate R value. This will give me more
confidence that regression can in fact be used or not.
The model for pec_pov and br1 will look like such:
mod1 <- lm(br1~pec_pov)
Now I can call up the summary of the model:
summary(mod1)
##
## Call:
## lm(formula = br1 ~ pec_pov)
##
## Residuals:
## Min 1Q
Median 3Q Max
## -11.2275
-3.6554 -0.0407 2.4972
10.5152
##
## Coefficients:
##
Estimate Std. Error t value Pr(>|t|)
## (Intercept)
4.2673 2.5297 1.687
0.098 .
## pec_pov
1.3733 0.1835 7.483 1.19e-09 ***
## ---
## Signif. codes:
0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.551 on 49 degrees of
freedom
## Multiple R-squared:
0.5333, Adjusted R-squared:
0.5238
## F-statistic:
56 on 1 and 49 DF, p-value:
1.188e-09
I notice in the Residuals section of the summary that
our Adjusted R-Squared Score is .5238. So, in order to obtain the R value, I
need to perform the square root of this value.
sqrt(.5238)
## [1] 0.7237403
Our R value is .7237, which tells us that our R value has a strong positive correlation. This means that there is a high correlation between the explanatory variable (pec_pov) and the response variable br1.
We
can be confident in assuming that as poverty increases, the birth rates for
ages 15-16 will increase as well. Likewise, we can also be enthusiastic that
regression analysis is appropriate for this type of data.
Although, it is worth mentioning that Poverty only
explains roughly 52.4% of birth rates from ages 15-16. Conversely, the use of
other variables might make our model more predictive, like income, ethnicity,
socioeconomic status, etc. We can be confident that 47.6% of the model is NOT
being explained by the two variables that we have used in our model, called
mod1.
Our equation for this model looks like this:
Birth Rate for 15-16 ^ = 4.2673 + 1.3733*(Percentage
in Poverty)
So, if I wanted to predict what the birth rate would be if
the poverty rate was 20, then I would use the following equation:
y <- 4.2673 + 1.3733*(20)
print(y)
## [1] 31.7333
This means that if the poverty rate was 20% then the
birth rate of 15-16 year olds would be 31.73% per 1000 citizens in this age
group.
In summation, the model that we have created for these variables is strong, BUT it can be stronger, if we added more variables to the regression equation (as in multivariate regression).
Maybe there is a variable
that explains birth rates much better than poverty. Does the LOCATION of the
state, or even CLIMATE have any predictive power in a regression model? We need
more data and more features to ensure that this model has a better predictive
influence.
I will create a residual plot now to see if any of my
analysis held merit.
plot(mod1)
The residual plot supports that regression analysis was appropriate, as the data points in the plots do NOT conform to a specific or obvious pattern. But, I would like to add that other variables should be used to explore a better model that can explain more of the response variable. However, this is a good start.
I will now conduct the same analysis for the Birth Rate
17-18 and Percentage in Poverty and conclude this report with
Managerial Implications for Laypersons involved in policy making.
POVERTY VS BIRTH RATES 17-18 ANALYSIS AND DISCUSSION
Now, I will run a regression model with Birth Rates for
17-18 year olds per 1000 citizens.
mod2 <- lm(br2~pec_pov)
summary(mod2)
##
## Call:
## lm(formula = br2 ~ pec_pov)
##
## Residuals:
## Min 1Q
Median 3Q Max
## -31.668 -10.541
-1.611 11.381 30.496
##
## Coefficients:
##
Estimate Std. Error t value Pr(>|t|)
## (Intercept)
34.2124 6.6414 5.151 4.59e-06 ***
## pec_pov
2.8822 0.4818 5.982 2.50e-07 ***
## ---
## Signif. codes:
0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.57 on 49 degrees of
freedom
## Multiple R-squared:
0.4221, Adjusted R-squared:
0.4103
## F-statistic: 35.78 on 1 and 49 DF, p-value: 2.495e-07
Immediately, I notice that our Adjusted R^2 Score is
less than the previous age group. So, I am assuming that our R value will be
less as well. Let me confirm this:
sqrt(0.4103)
## [1] 0.6405466
My assumption was correct! The R value is much lower, at
roughly (rounded up) to .641. Despite this lower number, our R value still
represents a positive moderate correlation. Meaning that the percentage of
poverty in the state DOES impact Birth rates per 1000 citizens of 17-18 year
olds.
We can further deduce that we ABSOLUTELY REQUIRE more
variables to improve the predictive power of our model. Unless we find an
explanatory variable that has a higher correlation than poverty, we will need
multiple variables to explain the Birth Rates in this age group.
Our equation for these two variables looks like this:
Birth Rates 17-18^ = 34.2124 + 2.8822*(Percentage in
Poverty)
I am now going to create a residual plot to see if my
regression analysis holds some merit.
plot(mod2)
I am on the fence here! I can see that the plot titled, “Residuals vs. Fitted,” has a subtle pattern. IT IS NOT OBVIOUS, but I can see that there are more observations to the left of the value 80, and less observations passed the value of 80. At this point, I can conclude that I would need to go beyond a regression model or make use of other variables.
I have also neglected to mention that I have not removed any outliers or influence points from the vectors that I created in the beginning of this analysis. The main reason for this is because our sample size is only 50 (n=50).
Since our sample set is relatively smaller, I didn’t want to
take away any information that could be valuable to me in this analysis. Yet,
one could also argue that the removal of three, perhaps 4, outliers and
influence points, might enable our model’s predictive power to be higher.
If our R^2 score was higher in this case, would that
necessarily translate well in a real world setting, in which human lives are so
vital in the model’s predictive outcome? I lack the experience and knowledge to
answer this question at this time.
Now for the Managerial Implications.
MANAGERIAL IMPLICATIONS FOR POLICY MAKERS (HYPOTHETICAL LECTURE)
To all of the policy makers who have gathered today at
this TedTalk, I beg you all to hear me out on some key points that I have
discovered when it comes to birth rates in the US.
Policy makers who create birth control programs will need
to be receptive to that fact that each state has different poverty levels.
Additionally, we can be confident that the Percentage in Poverty does explain
birth rates in both ages groups, from 15-16 and 17-18 years of age.
HOWEVER, and this is important, Poverty is NOT the only
variable that can explain Birth Rates amongst these age groups. We need to
collect more data, and consider more, and perhaps, use multiple variables to
create regression models that you can build your programs around.
Policy makers can be confident in using birth control
programs to target those living in states, where income levels are 20% or less
for the 15-16 year age group. Although, there are no guarantees and certainties
in statistics, we can be confident of this fact.
hist(br1,
main = "Distribution
of Birth Rates per 1000",
xlab = "Proportion
of Birthday Rates per 1000 Citizens")
The histogram below shows the distribution of Birth
Rates for 17-18 Years Old. You might note that these proportions are higher for
this age group than for 15-16 year olds. I am going to make a hypothesis that
the main reason for the disparity between groups is because 17-18 year olds are
of legal age to drive, and have driver’s licenses. You might consider having
flyers or promotional birth control documents and solutions at DMVs, and even
hotels–these kids need some place private to “get it on.”
hist(br2)
Thank you very much for attending my Ted Talk today. I hope my advice is enough to persuade you to take action!
THANK YOU!
Thank you so very much for coming to my blog to learn data analytics and machine learning. I know that you are making a tremendous effort to learn this material, and that some of it may be foreign to you.
This fine! Take your time with it. Go at a pace that is comfortable to you, and enjoy the ride!