This post covers the basics of ggplot2, a package that specializes in data visualization. I am drawing on the book “ggplot2: Elegant Graphics for Data Analysis,” by Hadley Wickham, as well as material from my class lectures.
I will use one of my favorite data sets, bikes, and later the mpg data set, but I will not go into great detail in this post. The goal is to show you how to plot two numerical variables on the x- and y-axes and add appropriate labels to make the plot easier to interpret.
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.4     v dplyr   1.0.7
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   2.0.1     v forcats 0.5.1
## Warning: package 'stringr' was built under R version 4.1.2
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
setwd("C:/Users/firstnamelastinitial/OneDrive/Documents/Fund of Bus Anal")
bikes <- read.csv("bikes.csv", header = TRUE)
str(bikes)
## 'data.frame':    731 obs. of  10 variables:
##  $ date       : chr  "2011-01-01" "2011-01-02" "2011-01-03" "2011-01-04" ...
##  $ season     : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ holiday    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ weekday    : int  6 0 1 2 3 4 5 6 0 1 ...
##  $ weather    : int  2 2 1 1 1 1 2 2 1 1 ...
##  $ temperature: num  46.7 48.4 34.2 34.5 36.8 ...
##  $ realfeel   : num  46.4 45.2 25.7 28.4 30.4 ...
##  $ humidity   : num  0.806 0.696 0.437 0.59 0.437 ...
##  $ windspeed  : num  6.68 10.35 10.34 6.67 7.78 ...
##  $ rentals    : int  985 801 1349 1562 1600 1606 1510 959 822 1321 ...
So, I have loaded tidyverse, which contains ggplot2, and my data set. Now I would like to use ggplot to visualize variables within the bikes data frame.
But first, allow me to summarize the basic format, or grammar, of the ggplot() function.
1. You need a data frame.
2. The second argument is an aesthetic mapping, created with aes(), which names the variables you wish to see in graphical form.
3. After you have supplied these two arguments to ggplot(), you need to add a geom function, which follows a + sign.
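The three pieces in the list above can be spelled out with named arguments. This is only a minimal sketch: it assumes the mpg data set that ships with ggplot2 as a stand-in for bikes, and the object name p is just for illustration.

```r
library(ggplot2)

# 1. A data frame: mpg ships with ggplot2 (a stand-in for bikes here).
# 2. An aesthetic mapping: aes() says which columns go on the x and y axes.
p <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  # 3. A geom function, added with +, decides how the data is drawn.
  geom_point()

# Printing the object draws the plot.
print(p)
```

Naming data and mapping is optional; the shorter positional form used in the rest of this post does the same thing.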
Let me demonstrate what this may look like.
ggplot(bikes, aes(x = temperature, y = rentals)) +
  geom_point()
Here is what that plot looks like. Throughout this post, I will reuse this plot and make it more presentable with various functions and arguments.
I hope each iteration of this plot gives you a basic understanding of how ggplot is used and how effective it can be for presenting your findings to managers at your workplace.
Let’s change the color of the plot to blue.
ggplot(bikes, aes(x = temperature, y = rentals)) +
  geom_point(colour = "blue")
I would like to note that color and shape work well for categorical (non-numerical) variables, while size might be a better choice for continuous (numerical) variables.
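To make that distinction concrete, here is a small sketch of setting a fixed color versus mapping color and size to variables. It assumes the built-in mpg data set rather than bikes, and the choice of drv and cty as the mapped columns is mine, purely for illustration.

```r
library(ggplot2)

# Setting: one fixed colour for every point goes outside aes().
p_fixed <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(colour = "blue")

# Mapping: colour follows a categorical column (drv) and size follows a
# numeric column (cty), so both go inside aes().
p_mapped <- ggplot(mpg, aes(x = displ, y = hwy, colour = drv, size = cty)) +
  geom_point()

print(p_fixed)
print(p_mapped)
```

When a variable is mapped inside aes(), ggplot adds a legend for it automatically; a fixed color set outside aes() gets no legend.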
Now, it might be difficult to determine what type of relationship exists between the variables in our visualization. To me, it looks like there is a moderate, positive, roughly linear relationship between temperature and rentals, though I am not certain about either the strength or the linearity.
So, let’s check my hunch by superimposing a regression line over the data. We can do this with the geom_smooth() function by setting its method argument to “lm”.
ggplot(bikes, aes(x = temperature, y = rentals)) +
  geom_point() +
  geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'
My assumption that these variables have a linear relationship appears to have only some merit. At this point, though, I don’t know how strong the correlation is. And now that I think about it, a linear model might NOT fit this data appropriately, as the line does not fit the extremes of our data very well. That is the subject of another post.
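As a quick, hedged preview of how one might check both points, the sketch below computes a Pearson correlation with cor() and overlays a flexible loess curve next to the straight line. It assumes the built-in mpg data set as a stand-in for bikes, and the blue and red colors are arbitrary choices of mine.

```r
library(ggplot2)

# Pearson correlation: values near 1 or -1 suggest a strong linear association.
r <- cor(mpg$displ, mpg$hwy)
print(r)

# Overlay a straight-line fit (blue) and a flexible loess curve (red).
# Where the two diverge, a linear model may be a poor description of the data.
p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, colour = "blue") +
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE, colour = "red")

print(p)
```

Passing formula = y ~ x explicitly also silences the message that geom_smooth() otherwise prints.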
Let’s continue.
To keep this post short, I want to touch quickly on how you can add labels to your graph. Please examine the following code, and note its output:
ggplot(bikes, aes(x = temperature, y = rentals)) +
  geom_point(colour = "red") +
  geom_smooth(method = "lm") +
  labs(title = "Relationship Between Rentals and Temperature") +
  ylab("Rentals") +
  xlab("Temperature")
## `geom_smooth()` using formula 'y ~ x'
In our final input, we specified in the geom_point() function that the color of our observations will be red. We kept our regression line, specified in the geom_smooth() function, to give management an idea of whether a linear model might or might not be appropriate for our data set.
Lastly, we gave our graph the title “Relationship Between Rentals and Temperature.” We labeled the y-axis “Rentals,” which is our dependent variable, and the x-axis “Temperature,” which is our independent variable.
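The title and both axis labels can also be set in a single labs() call instead of separate xlab() and ylab() calls. The sketch below assumes the built-in mpg data set as a stand-in for bikes, and the label text is my own wording for illustration.

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(colour = "red") +
  geom_smooth(method = "lm", formula = y ~ x) +
  # labs() accepts the title and any aesthetic name (x, y, colour, ...).
  labs(
    title = "Relationship Between Highway Mileage and Engine Displacement",
    x = "Engine displacement (litres)",
    y = "Highway miles per gallon"
  )

print(p)
```

Either style gives the same plot; keeping all labels in one call just makes them easier to find and edit later.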
Let me compare both graphs so we can see how far we have come in our interpretation of the two variables in question.
ggplot(bikes, aes(x = temperature, y = rentals)) +
  geom_point()
ggplot(bikes, aes(x = temperature, y = rentals)) +
  geom_point(colour = "red") +
  geom_smooth(method = "lm") +
  labs(title = "Relationship Between Rentals and Temperature") +
  ylab("Rentals") +
  xlab("Temperature")
## `geom_smooth()` using formula 'y ~ x'
It goes without saying that there are other ways to graph variables in ggplot. Here are some other plots you can experiment with. To demonstrate these variations, I will use the mpg data set.
ggplot(mpg, aes(drv, hwy))+
geom_jitter()
ggplot(mpg, aes(drv, hwy))+
geom_boxplot()
ggplot(mpg, aes(drv, hwy))+
geom_violin()
You can find more information about the advantages and disadvantages of these plots in Hadley Wickham’s book, “ggplot2: Elegant Graphics for Data Analysis,” so I will not go into them in this post. I just wish to show you the potential of this powerful and useful package. Here is another plot that you can make:
ggplot(mpg, aes(displ, colour = drv)) +
  geom_freqpoly(binwidth = 0.5)
You can explore this further in your own time.
To begin, pick two variables from your data set; try numerical ones at first. Change the graph’s default color, add some sort of line to fit the data, and lastly, change the labels of the graph.
THANK YOU!
I hope that this post gives you a quick solution for visualizing two variables on a plot, and also inspires you to go out and buy Hadley Wickham’s book, “ggplot2: Elegant Graphics for Data Analysis.”
Thank you so much for looking at this post. I hope it was helpful to you in your journey of becoming a data analyst, machine learning leader, or data entrepreneur.