ggplot2
Overview
The package ggplot2, which forms part of the tidyverse, offers a powerful graphics language for creating elegant and complex plots. In this lesson we will cover how various types of plots can be created using the ggplot2 package.
Since the ggplot2 package forms part of the tidyverse, you do not have to install it separately if you have previously installed the tidyverse package. If you would like to install the package on its own, the function install.packages() can be used:
install.packages("ggplot2")
Once the package is installed, it can be loaded using the function library():
library(ggplot2)
The design of ggplot2 is based on the idea that any complex plot can be divided into layers. For example, a scatterplot and a smoothed regression line can be combined to summarise the relationship between two continuous features.
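As a rough sketch of the layering idea, the example below stacks a scatterplot layer and a smoothed regression line on the same canvas. The built-in mtcars data set and the linear smoother (method = "lm") are illustrative assumptions, not part of the lesson's own data.

```r
library(ggplot2)

# Layer 1: scatterplot points; layer 2: fitted regression line with its confidence band
p_layers <- ggplot(mtcars, aes(x = disp, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)

p_layers
```

Each call joined with + contributes one layer, so the resulting object holds two layers.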
Primary components of a plot
Layers and aesthetics
ggplot2 graphic objects consist of two primary components: layers and aesthetics.
- Layers are the components of a graph
  - For example, the layer geom_point() adds a layer of scatterplot points
  - Additional layers can be added to a ggplot2 graphic object by using the + operator, e.g. ggplot() + geom_point()
- Aesthetics determine how layers appear, e.g. we can use aesthetics to specify the colour of the points in a geom_point() layer
  - Aesthetics are set using arguments inside a layer function, e.g. geom_point(color = "red")
  - Aesthetics include location, colours and sizes
Aesthetics: Setting versus mapping
Each layer has several arguments that can be used to control the appearance of a layer. When deciding how a layer should appear, we first have to decide if the appearance should be based on a variable or not. An aesthetic, e.g. colour, can be:
- set to a constant value
ggplot(mtcars,
aes(x = disp, y = mpg)) +
geom_point(colour = "purple") +
labs(x = "Displacement",
y = "Miles/gallon",
title = "Setting aesthetic")
- or mapped to a variable, e.g. transmission
ggplot(mtcars, aes(x = disp, y = mpg, colour = as.factor(am))) +
geom_point() +
labs(x = "Displacement",
y = "Miles/gallon",
title = "Mapping aesthetic",
colour = "Transmission") +
scale_color_brewer(palette = "Dark2")
- Setting aesthetics: Arguments like colour, size, line type, shape, fill and alpha can be passed directly to a layer. These aesthetics are not influenced by the data. For example, we can specify that all points in a scatterplot should be purple.
- Mapping aesthetics: Mapping aesthetics depend on the data. For example, if we want the points in a scatterplot to have a different colour based on the values of a variable, a mapping aesthetic is required. Mapping aesthetics are specified inside the aes() function.
Creating a basic plot
Loading data
To illustrate how we can create visualisations using ggplot2, we will be using data from the Gapminder project. We will be specifically focusing on the life expectancy of different countries.
The Gapminder data can be accessed by installing the package gapminder
install.packages("gapminder")
After installing the gapminder package, load it using the function library()
library(gapminder)
Once the package is loaded, the data set will be stored in the object gapminder. The function str() can be used to view the structure of the data object
str(gapminder)
The gapminder object is a tibble, a special kind of data frame, with 1704 rows and 6 columns.
Painting
The first step in creating a ggplot2 graphics object is to define a ggplot object using the function ggplot(). The function ggplot() simply creates a blank canvas to which graphical elements can be added. Run the code below to view the blank canvas
ggplot()
Adding a layer
We can add layers to our blank canvas using the + operator. For instance, we can add a geom_point layer to our initial blank slate created with the function ggplot().
ggplot() +
geom_point()
Since we have not specified the data that we want to plot or how we want our data to be plotted, our canvas will remain blank.
Adding data
To specify what data to use in our plot, the argument data should be set. Passing a data set to the argument data is, however, not enough: we also need to inform ggplot2 how we want our data to map to the plot.
Inside the function aes(), we specify the variables that should be used for the x and y axes.
Try plotting the life expectancy (lifeExp) of South Africa over time (year) by modifying the code below:
ggplot() +
geom_point(data = gapminder[gapminder$country == "South Africa",1:6],
aes(x = ___, y = ___))
- Memo
ggplot() + geom_point(data = gapminder[gapminder$country == "South Africa",1:6], aes(x = year, y = lifeExp))
The layer geom_point mapped each data instance to a point on the graph.
Adding a layer
If we wanted to add a line graph to our existing ggplot2 graphics object, we can simply add a new layer. In this case we use the geom_line layer, a layer that connects data points with a line. Add the layer geom_line in the code below
ggplot() +
geom_point(data = gapminder[gapminder$country == "South Africa",1:6],
aes(x = year, y = lifeExp)) +
___(data = gapminder[gapminder$country == "South Africa",1:6],
aes(x = year, y = lifeExp))
- Memo
ggplot() + geom_point(data = gapminder[gapminder$country == "South Africa",1:6], aes(x = year, y = lifeExp)) + geom_line(data = gapminder[gapminder$country == "South Africa",1:6], aes(x = year, y = lifeExp))
New layers will always be drawn over previous layers.
Avoid work; inherit
It seems rather tedious to specify the data and the aesthetics for each layer.
To avoid repetitive code, the data and any mapping aesthetic arguments specified in the ggplot() call will be inherited by subsequent layers. The code below produces the same plot as the code in the previous block, with less typing
ggplot(data = gapminder[gapminder$country == "South Africa",1:6],
aes(x = year, y = lifeExp)) +
geom_point() +
geom_line()
This does not imply that we cannot override the inherited arguments in subsequent layers.
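The sketch below illustrates overriding inherited settings. It uses the built-in mtcars data set (an assumption made to keep the example self-contained): the first layer inherits the data and mapping from ggplot(), while the second layer supplies its own data argument and therefore only draws a subset.

```r
library(ggplot2)

# Data and mapping given in ggplot() are inherited by every layer by default
p_override <- ggplot(data = mtcars, aes(x = disp, y = mpg)) +
  geom_point() +                               # inherits all rows of mtcars
  geom_point(data = mtcars[mtcars$am == 1, ],  # overrides the inherited data
             colour = "red", size = 3)

p_override

# Number of rows each layer actually uses
nrow(layer_data(p_override, 1))
nrow(layer_data(p_override, 2))
```

The x and y mappings are still inherited by the second layer; only the data argument was replaced.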
Adding flavour
Colour
If we want to change the colour of the line graph we can simply set the colour aesthetic equal to the value “blue”.
Try changing the colour of the line to blue; without changing the colour of the points
ggplot(data = gapminder[gapminder$country == "South Africa",1:6],
aes(x = year, y = lifeExp)) +
geom_point() +
geom_line(___)
- Memo
ggplot(data = gapminder[gapminder$country == "South Africa",1:6], aes(x = year, y = lifeExp)) + geom_point() + geom_line(colour = "blue")
What happens if we use a mapping aesthetic as opposed to a setting aesthetic to set the colour of the line graph?
ggplot(data = gapminder[gapminder$country == "South Africa",1:6],
aes(x = year, y = lifeExp)) +
geom_point() +
geom_line(aes(colour = "blue"))
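When we set the colour using a mapping aesthetic, our line plot uses the colour red even though we specified blue. Using aes(colour = "blue") treats the string "blue" as data: the constant vector c("blue") is mapped to the colour aesthetic of the line plot, and the default colour scale assigns its first colour (a shade of red) to that single value.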
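To make the difference concrete, the sketch below compares the colour that ggplot2 actually computes in each case, using layer_data() to inspect the built layer. The toy data frame is hypothetical, and scale_colour_identity() is shown only as one possible way to have mapped values used as literal colours.

```r
library(ggplot2)

df <- data.frame(x = 1:5, y = c(2, 4, 3, 5, 6))  # hypothetical toy data

p_set    <- ggplot(df, aes(x, y)) + geom_line(colour = "blue")       # setting
p_mapped <- ggplot(df, aes(x, y)) + geom_line(aes(colour = "blue"))  # mapping
# The identity scale uses the mapped values themselves as colours
p_identity <- p_mapped + scale_colour_identity()

unique(layer_data(p_set)$colour)       # colour used when setting
unique(layer_data(p_mapped)$colour)    # colour chosen by the default scale
unique(layer_data(p_identity)$colour)  # colour used with the identity scale
```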
Mapping aesthetics are used when we want to change how a layer is displayed based on the underlying data. For instance, we can map the year column of the data set to the colour mapping aesthetic of the geom_line layer. In this case, ggplot2 will assign colours based on the values in the year column. ggplot2 will automatically assign a unique level of the aesthetic to each value, a process known as scaling
ggplot(data = gapminder[gapminder$country == "South Africa",1:6],
aes(x = year, y = lifeExp)) +
geom_point() +
geom_line(aes(colour = year))
Scales
In the previous graph, the colours used for the variable year were automatically selected. If we want to change how the variable year maps to a set of colours, we need to add a scale layer to a ggplot2 graphics object.
A scale layer uses the following syntax: scale_[aesthetic]_[option], where:
- [aesthetic] should be replaced with the name of the mapping aesthetic you would like to change, e.g. colour, shape, linetype, alpha, size, fill, x, y
- [option] should be used to specify how you would like to change the aesthetic, for example manual, continuous or discrete (depending on the nature of the variable)
Examples
- scale_linetype_manual(): Manually specify the linetype of each different value
- scale_alpha_continuous(): Varies transparency over a continuous range
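As one more illustration of the naming pattern, the sketch below uses scale_colour_manual(), i.e. aesthetic = colour and option = manual. The mtcars data set and the chosen colours are assumptions made for the example; in mtcars, am is coded 0 for automatic and 1 for manual transmission.

```r
library(ggplot2)

# scale_[aesthetic]_[option] with aesthetic = colour and option = manual
p_manual <- ggplot(mtcars, aes(x = disp, y = mpg, colour = factor(am))) +
  geom_point() +
  scale_colour_manual(name = "Transmission",
                      values = c("0" = "darkorange", "1" = "steelblue"),
                      labels = c("Automatic", "Manual"))

p_manual
```

Naming the entries of values ties each colour to a specific level of the mapped variable rather than relying on their order.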
To change the default colours used to map the variable year to colour, the scale layer scale_color_continuous can be added
ggplot(data = gapminder[gapminder$country == "South Africa",1:6],
aes(x = year, y = lifeExp)) +
geom_point() +
geom_line(aes(colour = year)) +
scale_color_continuous(type = "gradient",
low = "red",
high = "blue")
When we try to add a scale that is not compatible with the mapped variable, the code will return an error
ggplot(data = gapminder[gapminder$country == "South Africa",1:6],
aes(x = year, y = lifeExp)) +
geom_point() +
geom_line(aes(colour = year)) +
scale_color_discrete()
In the above example, ggplot2 returns an error since we are trying to map a continuous variable to a discrete set of colours.
When we map a discrete variable, e.g. country, to the aesthetic colour, we can map the variable to a discrete set of colours by adding the layer scale_color_discrete
ggplot(data = gapminder[gapminder$country == "South Africa",1:6],
aes(x = year, y = lifeExp)) +
geom_point() +
geom_line(aes(colour = country)) +
scale_color_discrete(type = c("blue"))
Since x is a mapping aesthetic, it also has a scale. For example, we can add the scale layer scale_x_continuous to set the major tick marks or breaks to align with the years for which data were collected
ggplot(data = gapminder[gapminder$country == "South Africa",1:6],
aes(x = year, y = lifeExp)) +
geom_point() +
geom_line() +
scale_x_continuous(breaks = seq(1952, 2007, by = 5))
Scales can also be used to limit the data displayed. In the example below, any rows with years before 1995 are removed
ggplot(data = gapminder[gapminder$country == "South Africa",1:6],
aes(x = year, y = lifeExp)) +
geom_point() +
geom_line() +
scale_x_continuous(limits = c(1995, NA))
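The sketch below shows what setting limits on a scale does to the underlying data, again using the built-in mtcars data set as an assumption. Values outside the limits are turned into missing values before drawing, whereas coord_cartesian() (shown here only for contrast) zooms the view without discarding data.

```r
library(ggplot2)

p_base <- ggplot(mtcars, aes(x = disp, y = mpg)) + geom_point()

p_limits <- p_base + scale_x_continuous(limits = c(200, NA))  # out-of-range x become NA
p_zoom   <- p_base + coord_cartesian(xlim = c(200, 500))      # data kept, view zoomed

sum(is.na(layer_data(p_limits)$x))  # rows that will be dropped when drawing
sum(is.na(layer_data(p_zoom)$x))    # no rows are dropped
```

The distinction matters mostly for layers that summarise data, such as smoothers, because dropped rows no longer contribute to the computed summary.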
Labels
An x-label will be automatically created using the name of the variable mapped to the x aesthetic. To add a manual label, the layer xlab can be added. The xlab layer can also be used to remove the label, e.g. xlab(NULL)
ggplot(data = gapminder[gapminder$country == "South Africa",1:6],
aes(x = year, y = lifeExp)) +
geom_point() +
geom_line() +
xlab("Year")
To add a custom y-label and title to a plot, add the layers ylab and ggtitle respectively
ggplot(data = gapminder[gapminder$country == "South Africa",1:6],
aes(x = year, y = lifeExp)) +
geom_point() +
geom_line() +
xlab("Year") +
ylab("Life expectancy") +
ggtitle("Life expectancy in SA")
The ggtitle, xlab and ylab layers are helper layers: shortcuts to quickly change a ggplot2 graphics object. In general, any labels, including labels of legends, can be set using the labs layer
ggplot(data = gapminder[gapminder$country == "South Africa",1:6],
aes(x = year, y = lifeExp)) +
geom_point() +
geom_line() +
labs(title = "Life expectancy in SA",
x = "Year",
y = "Life expectancy")
Themes
The theme layer allows you to exercise fine control over non-data elements of a plot. The ggplot2 package includes several built-in themes, e.g. theme_grey(), theme_bw(), theme_linedraw(), theme_light(), theme_dark(), theme_minimal() or theme_classic().
Adding a built-in theme is straightforward: simply add the theme as a layer. If you want all your plots to use the same theme, use the function theme_set() at the start of an R script, e.g. theme_set(theme_classic())
ggplot(data = gapminder[gapminder$country == "South Africa",1:6],
aes(x = year, y = lifeExp)) +
geom_point() +
geom_line() +
labs(title = "Life expectancy in SA",
x = "Year",
y = "Life expectancy") +
theme_bw()
We can also create our own theme using the theme layer:
- The theme layer has several arguments (also called elements) that specify the non-data elements that can be controlled. For example, the plot.title element controls the appearance of the plot title
- Each element is associated with an element function, which describes the visual properties of the element. For example, element_text() sets text properties such as the font size
- There are over 30 different elements, which can be viewed by opening the help file of the function theme, e.g. ?theme
- If you are interested in how the different elements work, refer to the book 'ggplot2: Elegant Graphics for Data Analysis' by Wickham H
You can also add a built-in theme layer followed by the theme layer to override some of the built-in theme settings. I usually add the theme layer last to ensure that none of the other layers override any of the theme settings
ggplot(data = gapminder[gapminder$country == "South Africa",1:6],
aes(x = year, y = lifeExp)) +
geom_point() +
geom_line() +
labs(title = "Life expectancy in SA",
x = "Year",
y = "Life expectancy") +
theme(plot.title = element_text(size = 8))
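A possible sketch of combining a built-in theme with a few overrides is given below. The mtcars data set and the specific element settings are illustrative assumptions; the element functions used (element_text(), element_line() and element_blank()) are all part of ggplot2.

```r
library(ggplot2)

p_theme <- ggplot(mtcars, aes(x = disp, y = mpg)) +
  geom_point() +
  labs(title = "Custom theme elements") +
  theme_bw() +                                                # built-in theme first
  theme(plot.title = element_text(size = 14, face = "bold"),  # text element
        axis.line = element_line(colour = "grey40"),          # line element
        panel.grid.minor = element_blank())                   # remove an element

p_theme
```

Because theme() comes after theme_bw(), its settings take precedence; reversing the order would let the built-in theme reset them.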
Saving plots as objects
Since a ggplot2 plot is an object, we can assign a name to the ggplot object
my_plot <- ggplot(data = gapminder[gapminder$country == "South Africa",1:6],
aes(x = year, y = lifeExp)) +
geom_point()
When you assign a name to a ggplot object, the plot will not automatically show in the Plots tab of RStudio or as an output. To show the graph, simply add a single line of code with the name of the object, i.e. the same as printing any other object.
Assigning a name to an object means that we can easily reuse it later. For example, if I want some of my plots to look the same, I usually store a custom theme in an object. The name of that object can then simply be added as a layer to any ggplot2 graphics object
my_theme <- theme(axis.text = element_text(size = 10)) # Custom theme
ggplot(data = gapminder[gapminder$country == "South Africa",1:6],
aes(x = year, y = lifeExp)) +
geom_point() +
my_theme
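Stored plots can be extended in the same way as stored themes. The sketch below, which assumes the built-in mtcars data set, stores a base plot and then derives a second plot from it without modifying the original.

```r
library(ggplot2)

base_plot <- ggplot(mtcars, aes(x = disp, y = mpg)) + geom_point()  # stored, not drawn
line_plot <- base_plot + geom_line()  # a new object; base_plot itself is unchanged

print(base_plot)  # draws the points only
print(line_plot)  # draws the points and the line
```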
Advanced plots
Grammar of graphics
The layered grammar of graphics consists of eight components, namely: Data, Mapping, Statistics, Scales, Geometries, Facets, Coordinates and Themes. We have discussed some of these components before. In this lesson we will introduce some new layers and discuss some fundamental assumptions made by ggplot2.
Grouping data points
Points are connected in a specific order. When creating a line plot, ggplot2 has to decide how the data points or instances should be connected. By default, geom_line() connects all the data points in order of the variable on the x axis.
ggplot(data.frame(x = c(1,1,2,2,3,3), y = c(1,2,2,1,1,2)),
aes(x = x, y = y)) +
geom_line()
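The ordering can be checked directly. The sketch below uses a small hypothetical data frame with deliberately unsorted x values and compares geom_line() with geom_path(), a related ggplot2 layer that connects points in the order in which they appear in the data.

```r
library(ggplot2)

df <- data.frame(x = c(3, 1, 2), y = c(1, 2, 3))  # hypothetical data, x unsorted

p_line <- ggplot(df, aes(x, y)) + geom_line()  # connects points in order of x
p_path <- ggplot(df, aes(x, y)) + geom_path()  # connects points in row order

layer_data(p_line)$x  # x values in the order they are connected by geom_line()
layer_data(p_path)$x  # x values in the order they are connected by geom_path()
```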
When we try to create a line plot of the life expectancy over time for all countries in the gapminder data set, ggplot2 connects all the instances. From the output, the life expectancy per country is not clear. We require a way to tell ggplot2 that only specific data points should be connected
ggplot(data = gapminder,
aes(x = year, y = lifeExp)) +
geom_line()
One way to create a separate line for each country is to map the variable country to a mapping aesthetic. For example, the variable country can be mapped to the colour aesthetic. Here, 142 countries need to be included in the legend, which in turn uses up all the plotting space
ggplot(data = gapminder,
aes(x = year, y = lifeExp,
colour = country)) +
geom_line()
The theme layer can be used to remove the legend of the plot. Once the legend is removed, we end up with a line for each country in the gapminder data set
ggplot(data = gapminder,
aes(x = year, y = lifeExp,
colour = country)) +
geom_line() +
theme(legend.position = "none")
Many geom layers, like geom_line(), use a single geometric object to display multiple rows of data. For these geoms, you can set the group aesthetic to draw multiple objects without adding a legend or distinguishing feature to the plot
ggplot(data = gapminder,
aes(x = year, y = lifeExp,
group = country)) +
geom_line()
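The effect of the group aesthetic can also be seen on a smaller data set. The sketch below assumes the built-in Orange data set (repeated circumference measurements for several trees) and counts how many separate lines ggplot2 builds with and without grouping.

```r
library(ggplot2)

# Orange has the columns Tree, age and circumference
p_single  <- ggplot(Orange, aes(x = age, y = circumference)) +
  geom_line()
p_grouped <- ggplot(Orange, aes(x = age, y = circumference, group = Tree)) +
  geom_line()

length(unique(layer_data(p_single)$group))   # one line through all points
length(unique(layer_data(p_grouped)$group))  # one line per tree
```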
Facets
To highlight the regional difference between life expectancy trends, the variable continent can be mapped to the colour aesthetic
ggplot(data = gapminder,
aes(x = year, y = lifeExp,
group = country,
colour = continent)) +
geom_line()
A facet can be used to split a plot into subplots, where each subplot displays a subset of the data. To facet a plot, add the layer facet_wrap()
ggplot(data = gapminder,
aes(x = year, y = lifeExp,
group = country,
colour = continent)) +
geom_line() +
facet_wrap(c("continent"))
Several text labels in our previous plot overlapped. To avoid the overlap, the legend can be removed and the text size reduced
ggplot(data = gapminder, aes(x = year, y = lifeExp, group = country,
colour = continent)) +
geom_line() +
facet_wrap(c("continent")) +
theme(legend.position = "none",
text = element_text(size = 8))
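The faceting variable can be given in more than one way. The sketch below assumes the mpg data set that ships with ggplot2 and uses vars(); the nrow argument is an optional layout choice made for the example.

```r
library(ggplot2)

# One panel per vehicle class
p_facet <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_wrap(vars(class), nrow = 2)  # facet_wrap(~ class) or facet_wrap("class") also work

p_facet

length(unique(layer_data(p_facet)$PANEL))  # number of panels that contain data
```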
Modifying position
Suppose we are interested in understanding the distribution of life expectancy over years. We could create a scatterplot, but due to multiple points overlapping, the scatterplot is difficult to interpret. All layers include a position adjustment argument that can be used to resolve overlapping data. The default position can be changed using the position argument
ggplot(data = gapminder,
aes(x = year, y = lifeExp)) +
geom_point()
To avoid overlapping data points, a small amount of random noise can be added to the position of each data point. To do so, the argument position can be set to jitter, which by default adds noise to both the x- and y-coordinates
ggplot(data = gapminder,
aes(x = year, y = lifeExp)) +
geom_point(position = "jitter",
aes(colour = as.character(year))) +
scale_x_continuous(breaks =
seq(1952, 2007, by = 5)) +
labs(colour = "year")
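When more control over the noise is needed, position_jitter() can be passed to the position argument instead of the string. The sketch below, which assumes the built-in mtcars data set, jitters the points horizontally only and fixes the random seed so the plot is reproducible; the width of 0.2 is an arbitrary choice.

```r
library(ggplot2)

p_jitter <- ggplot(mtcars, aes(x = cyl, y = mpg)) +
  geom_point(position = position_jitter(width = 0.2, height = 0, seed = 1))

p_jitter

d <- layer_data(p_jitter)
range(d$x - mtcars$cyl)  # horizontal shifts stay within +/- 0.2
all(d$y == mtcars$mpg)   # vertical positions are unchanged
```

Keeping height = 0 is useful when the y values should be read off exactly, as with life expectancy.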
Statistical transformations
Imagine we wanted to develop a function that produces a bar chart from data. When we design the function we will have to decide:
- if the user simply passes the raw data and we calculate the count per category, or
- if the user passes the count per category
When a geom_bar() layer is added to a ggplot graphic object, ggplot2 computes the count per category from the raw data
student_data <- data.frame(student_number = 1:6,
degree = c(rep("Engineering", 2),
rep("Computer Science", 3),
"Accounting"))
ggplot(data = student_data,
aes(x = degree)) +
geom_bar()
The algorithm that ggplot2 uses to calculate statistics from raw data is called a stat, short for statistical transformation. The figure below illustrates how the process works for the geom_bar layer:
- To determine which stat is applied by a geom layer, the help file of the geom can be viewed, e.g. ?geom_bar
- The default stat used for the geom_bar layer is count, which means that geom_bar() uses stat_count()
- You can generally use geoms and stats interchangeably, e.g. a bar plot can be created with geom_bar() or stat_count()
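The interchangeability can be checked by comparing the data each layer computes. The sketch below rebuilds a small hypothetical data frame of degrees and inspects the count column produced by geom_bar() and by stat_count().

```r
library(ggplot2)

# Hypothetical raw data: one row per student
degrees <- data.frame(degree = c("Engineering", "Engineering",
                                 "Computer Science", "Computer Science",
                                 "Computer Science", "Accounting"))

p_geom <- ggplot(degrees, aes(x = degree)) + geom_bar()    # geom with default stat "count"
p_stat <- ggplot(degrees, aes(x = degree)) + stat_count()  # stat with default geom "bar"

layer_data(p_geom)$count  # counts per category, categories in alphabetical order
layer_data(p_stat)$count  # the same counts
```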
The default value of the stat argument of a geom layer can be changed. For example, the stat can be changed from count to identity. When using the stat identity, the height of each bar is set to the raw value of the variable mapped to the y aesthetic
data_count <- data.frame(degree = c("Engineering", "Accounting", "Computer Science"),
count = c(2, 1, 3)) # counts per degree, matching student_data above
ggplot(data = data_count,
aes(x = degree, y = count)) +
geom_bar(stat = "identity")
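ggplot2 also provides geom_col() as a shorthand for bars whose heights are taken directly from the data. The sketch below uses a hypothetical data frame of pre-computed counts and compares it with geom_bar(stat = "identity").

```r
library(ggplot2)

# Hypothetical pre-computed counts per degree
counts <- data.frame(degree = c("Accounting", "Computer Science", "Engineering"),
                     n = c(1, 3, 2))

p_identity <- ggplot(counts, aes(x = degree, y = n)) + geom_bar(stat = "identity")
p_col      <- ggplot(counts, aes(x = degree, y = n)) + geom_col()

layer_data(p_identity)$y  # bar heights taken directly from n
layer_data(p_col)$y       # identical heights
```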