Wednesday, August 10, 2022

The Basics of Data Visualization: A Guide to ggplot2

I would like this post to be a demonstration of data visualization with the ggplot2 package. As you will discover, the package is adept at creating attractive plots.


There won't be too much explanation on how to interpret these plots. I list a worthy resource for that at the conclusion of this post. Please look into it if this post inspires you.

This post covers the basics of ggplot2, a package that specializes in the visualization of data. I am drawing on the book “ggplot2: Elegant Graphics for Data Analysis,” by Hadley Wickham, as well as material from my class lectures.


I will use one of my favorite data sets, bikes, and later the mpg data set, but I will not go into great detail about either. I want this post to show you how to plot two numerical variables on the x- and y-axes, and how to add appropriate labels that make the plot more interpretable.


As always, if you wish to follow along, please type the text that is highlighted in BLUE into your RStudio environment.


library(tidyverse)

## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --

## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.4     v dplyr   1.0.7
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   2.0.1     v forcats 0.5.1

## Warning: package 'stringr' was built under R version 4.1.2

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

setwd("C:/Users/firstnamelastinitial/OneDrive/Documents/Fund of Bus Anal")
bikes <- read.csv("bikes.csv", header = TRUE)
str(bikes)

## 'data.frame':    731 obs. of  10 variables:
##  $ date       : chr  "2011-01-01" "2011-01-02" "2011-01-03" "2011-01-04" ...
##  $ season     : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ holiday    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ weekday    : int  6 0 1 2 3 4 5 6 0 1 ...
##  $ weather    : int  2 2 1 1 1 1 2 2 1 1 ...
##  $ temperature: num  46.7 48.4 34.2 34.5 36.8 ...
##  $ realfeel   : num  46.4 45.2 25.7 28.4 30.4 ...
##  $ humidity   : num  0.806 0.696 0.437 0.59 0.437 ...
##  $ windspeed  : num  6.68 10.35 10.34 6.67 7.78 ...
##  $ rentals    : int  985 801 1349 1562 1600 1606 1510 959 822 1321 ...


So, I have loaded tidyverse, which contains ggplot2, and my data set. Now I would like to use ggplot to visualize variables within the bikes data frame. 


But first, allow me to summarize the basic format, or grammar, of the ggplot() function.


1. You need a data frame.

2. The second argument of this function is an aesthetic mapping, or aes, which names the variables you wish to see in graphical form.

3. After you have supplied these two arguments to the ggplot() function, you need to add a geom function, which follows a + sign.
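The three steps above can be sketched with the argument names spelled out. This is only an illustration: it uses the mpg data set that ships with ggplot2 (so no CSV file is needed), and the object name p is my own choice.

```r
library(ggplot2)  # mpg is bundled with ggplot2

# 1. a data frame, 2. an aesthetic mapping, 3. a geom layer added with +
p <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point()

p  # printing the object draws the plot
```

Because data and mapping are the first two arguments of ggplot(), the names can be dropped, which is the shorter style used in the rest of this post.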


Let me demonstrate what this may look like.


ggplot(bikes, aes(x = temperature, y = rentals))+
  geom_point()


Scatter Plot

Here is what that plot looks like. Throughout this post, I will use the same plot, but I will make it more presentable with additional functions and arguments.


I hope that each iteration of this plot will give you a basic understanding of how ggplot2 is used, and of how effective it can be for presenting your findings to the managers at your place of work.


Let’s change the color of the points to blue.


ggplot(bikes, aes(x = temperature, y = rentals))+
  geom_point(colour = "blue")


Scatter Plot in Color



I would like to note that color and shape work well for categorical (non-numerical) variables, while size is often a better choice for continuous (numerical) variables.
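Here is a minimal sketch of that idea, again using the bundled mpg data set as a stand-in. Note the difference from the blue plot above: there, colour was set to a fixed value inside geom_point(), while here colour and size are mapped to variables inside aes(). The object names p_colour and p_size are my own.

```r
library(ggplot2)

# colour mapped to a categorical variable (drv: type of drive train)
p_colour <- ggplot(mpg, aes(x = displ, y = hwy, colour = drv)) +
  geom_point()

# size mapped to a numerical variable (cty: city miles per gallon)
p_size <- ggplot(mpg, aes(x = displ, y = hwy, size = cty)) +
  geom_point(alpha = 0.5)  # some transparency so overlapping points stay visible

p_colour
p_size
```

In the bikes data frame, season and weather are stored as integers, so I would expect to wrap them in factor() before mapping them to colour or shape.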


Now, it might be difficult to determine what type of relationship exists between the variables in our visualization. To me, it looks like there is a moderate (?) positive LINEAR (?) relationship between temperature and rentals. 


So, let’s check my impression by superimposing a regression line over the data. We can do this with the geom_smooth() function by setting its method argument to “lm”.


ggplot(bikes, aes(x = temperature, y = rentals))+
  geom_point()+
  geom_smooth(method = "lm")

## `geom_smooth()` using formula 'y ~ x'



Scatter Plot with Regression Line


My assumption that the variables in question have a linear relationship appears to have only some merit. At this point, I don’t know how strong the correlation is. And now that I think about it, a linear model might NOT fit this data appropriately, as the line does not fit the extremes of our data very well. That is the subject of another post. Let’s continue.
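For readers who want a quick check in the meantime, the sketch below shows one way to look at both questions. It is an assumption on my part rather than something worked through in this post: it uses the bundled mpg data set as a stand-in, cor() from base R to measure the strength of the linear relationship, and a loess curve as a flexible alternative to the straight line.

```r
library(ggplot2)

# Pearson correlation: the sign gives the direction,
# the magnitude gives the strength of the linear relationship
r <- cor(mpg$displ, mpg$hwy)
print(r)

# overlay a straight line (blue) and a loess curve (red) to compare their fit
p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, colour = "blue") +
  geom_smooth(method = "loess", se = FALSE, colour = "red")

p
```

If the two lines diverge noticeably at the edges of the data, that is a hint that a straight line may not be the best description of the relationship.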


To keep this post short, I want to touch quickly on how you can add labels to your graph. Please examine the following code, and note its output:


ggplot(bikes, aes(x = temperature, y = rentals))+
  geom_point(colour = "red")+
  geom_smooth(method = "lm")+
  labs(title = "Relationship Between Rentals and Temperature") +
  ylab("Rentals") +
  xlab("Temperature")

## `geom_smooth()` using formula 'y ~ x'



Scatter Plot with Regression Line 2


In our final input, we specified in the geom_point() function that our observations will be red. We kept the regression line, specified in the geom_smooth() function, to give management an idea of whether a linear model is appropriate for our data set.


Lastly, we created a title for our graph, “Relationship Between Rentals and Temperature.” We labeled the y-axis “Rentals,” which is our dependent variable, and the x-axis “Temperature,” which is our independent variable.
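As a side note, labs() also accepts x and y arguments, so the title and both axis labels can go in a single call instead of separate xlab() and ylab() calls. Here is a small sketch using the bundled mpg data set; the label text is my own.

```r
library(ggplot2)

# labs() takes the title and both axis labels in one call
p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(colour = "red") +
  geom_smooth(method = "lm") +
  labs(
    title = "Relationship Between Highway Mileage and Engine Size",
    x = "Engine displacement (litres)",
    y = "Highway miles per gallon"
  )

p
```

Which style you use is a matter of taste; both produce the same labels.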


Let me compare the first and final graphs so we can see how far we have come in our interpretation of the two variables in question.


ggplot(bikes, aes(x = temperature, y = rentals))+
  geom_point()



Scatter Plot


ggplot(bikes, aes(x = temperature, y = rentals))+
  geom_point(colour = "red")+
  geom_smooth(method = "lm")+
  labs(title = "Relationship Between Rentals and Temperature") +
  ylab("Rentals") +
  xlab("Temperature")

## `geom_smooth()` using formula 'y ~ x'


Scatter Plot

Of course, there are other ways to graph variables in ggplot2. Here are some other plots you can experiment with. To demonstrate these variations, I will use the mpg data set, which ships with ggplot2.


ggplot(mpg, aes(drv, hwy))+
  geom_jitter()



Jitter Plot

ggplot(mpg, aes(drv, hwy))+
  geom_boxplot()


Box Plot

ggplot(mpg, aes(drv, hwy))+
  geom_violin()


Violin Plot

You can find more information about the advantages and disadvantages of these plots in the book by Hadley Wickham, “ggplot2: Elegant Graphics for Data Analysis.” I will not go into them in this post; I just wish to show you the potential of this powerful and useful package. Here is another plot that you can make:

ggplot(mpg, aes(displ, colour = drv))+
  geom_freqpoly(binwidth = 0.5)


Frequency Plot

You can explore this further in your own time.


To begin, pick two variables from your data set; try numerical variables at first. Change the graph’s default color, add a line to fit the data, and lastly, change the labels of the graph.


THANK YOU!


I hope that this post gives you a quick solution for visualizing two variables on a plot, and also inspires you to go out and buy Hadley Wickham’s book, “ggplot2: Elegant Graphics for Data Analysis.”


Thank you so much for looking at this post. I hope it was helpful to you in your journey of becoming a data analyst, machine learning leader, or data entrepreneur.


Location: Chicago, IL, USA