ggplot2

Overview

The package ggplot2, which forms part of the tidyverse, offers a powerful graphics language for creating elegant and complex plots. In this lesson we will cover how various types of plots can be created using the ggplot2 package.

Since the ggplot2 package forms part of the tidyverse, you do not have to install the package if you have installed the tidyverse package previously. However, if you would like to install the package, the function install.packages() can be used

install.packages("ggplot2")

Once the package is installed, the package can be loaded using the function library()

library(ggplot2) 

The design of ggplot2 is based on the idea that any complex plot can be divided into layers. For example, a scatterplot and smoothed regression line can be combined to summarise the relationship between two continuous features

Combine.png
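
As a minimal sketch of this layering idea, the code below builds a scatterplot and a smoothed regression line as two separate layers. The built-in mtcars data, the variables disp and mpg and the linear smoother (method = "lm") are assumptions made for illustration; the figure above may have been produced differently.

```r
library(ggplot2)

# Layer 1: scatterplot of the raw data
# Layer 2: fitted linear regression line with a confidence band
p_layers <- ggplot(mtcars, aes(x = disp, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)

p_layers
```

Removing either layer from the code leaves the other layer unchanged, which is what makes the layered design easy to extend.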


Primary components of a plot

Layers and aesthetics

ggplot2 graphic objects consist of two primary components: layers and aesthetics.

  1. Layers are the components of a graph

    • For example, the layer geom_point() adds a layer of scatterplot points
    • Additional layers can be added to a ggplot2 graphic object using the + operator e.g. ggplot() + geom_point()
  2. Aesthetics determine how layers appear e.g. we can use aesthetics to specify the colour of the points in a geom_point() layer

    • Aesthetics are set using arguments inside a layer function e.g. geom_point(color = "red")
    • Aesthetics include location, colour and size

Aesthetics: Setting versus mapping

Each layer has several arguments that can be used to control the appearance of a layer. When deciding how a layer should appear, we first have to decide if the appearance should be based on a variable or not. An aesthetic, e.g. colour, can be:

  1. set to a constant value
ggplot(mtcars, 
       aes(x = disp, y = mpg)) +
  geom_point(colour = "purple") + 
  labs(x = "Displacement", 
       y = "Miles/gallon",
       title = "Setting aesthetic")
  2. or mapped to a variable e.g. transmission type (am)
ggplot(mtcars, aes(x = disp, y = mpg, colour = as.factor(am))) +
  geom_point() +
  labs(x = "Displacement", 
       y = "Miles/gallon",
       title = "Mapping aesthetic",
       colour = "Transmission") +
  scale_color_brewer(palette = "Dark2")
  1. Setting aesthetics: Arguments like colour, size, line type, shape, fill and alpha can be passed directly to a layer. These aesthetics are not influenced by the data. For example, we can specify that all points in a scatterplot should be purple

  2. Mapping aesthetics: Mapping aesthetics depend on the data. For example, if we want the points in a scatterplot to have a different colour based on the values of a variable, a mapping aesthetic is required. Mapping aesthetics are specified inside the function aes()

Aesthetic example.png

Creating a basic plot

Loading data

To illustrate how we can create visualisations using ggplot2 we will be using data from the Gapminder project. We will be specifically focusing on the life expectancy of different countries.

The Gapminder data can be accessed by installing the package gapminder

install.packages("gapminder")

After installing the gapminder package, load the package using the function library()

library(gapminder) 

Once the package is loaded, the data set is available as the object gapminder. The function str() can be used to view the structure of the data object

str(gapminder) 

The gapminder object is a tibble, a special kind of data frame, with 1704 rows and 6 columns.
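
Besides str(), a few base R functions give a quick overview of the data. The sketch below is only one way to inspect it; the object name south_africa is an illustrative choice.

```r
library(gapminder)

dim(gapminder)      # number of rows and columns: 1704 6
names(gapminder)    # "country" "continent" "year" "lifeExp" "pop" "gdpPercap"
head(gapminder, 3)  # first three rows

# Rows for a single country, the subset used in the plots of this lesson
south_africa <- gapminder[gapminder$country == "South Africa", ]
nrow(south_africa)  # 12, one row per survey year
```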


Painting

The first step in creating a ggplot2 graphics object is to define a ggplot object using the function ggplot(). On its own, the function ggplot() simply creates a blank canvas to which graphical elements can be added. Run the code below to view the blank canvas

ggplot()

Adding a layer

We can add layers to our blank canvas using the + operator. For instance we can add a geom_point layer to our initial blank slate created with the function ggplot().

ggplot() + 
  geom_point()

Since we have not specified the data that we want to plot or how we want our data to be plotted, our canvas will remain blank.

Adding data

To specify what data to use in our plot, the argument data should be set. Passing a data set to the argument data is, however, not enough; we also need to tell ggplot2 how the data should map to the plot.

Inside the function aes(), we specify the variables that should be used for the x and y axes.

Try plotting the life expectancy (lifeExp) of South Africa over time (year) by modifying the code below:

ggplot() + 
  geom_point(data = gapminder[gapminder$country == "South Africa",1:6], 
             aes(x = ___, y = ___))
  • Memo
ggplot() + 
  geom_point(data = gapminder[gapminder$country == "South Africa",1:6], 
             aes(x = year, y = lifeExp)) 

The layer geom_point maps each data instance to a point on the graph.

Adding a layer

If we want to add a line graph to our existing ggplot2 graphics object, we can simply add a new layer. In this case we use the geom_line layer, a layer that connects data points with a line. Add the layer geom_line in the code below

ggplot() + 
  geom_point(data = gapminder[gapminder$country == "South Africa",1:6], 
             aes(x = year, y = lifeExp)) +
  ___(data = gapminder[gapminder$country == "South Africa",1:6], 
             aes(x = year, y = lifeExp))
  • Memo
ggplot() + 
  geom_point(data = gapminder[gapminder$country == "South Africa",1:6], 
             aes(x = year, y = lifeExp)) +
  geom_line(data = gapminder[gapminder$country == "South Africa",1:6], 
            aes(x = year, y = lifeExp))

New layers will always be drawn over previous layers.

Avoid work; inherit

It seems rather tedious to specify the data and the aesthetics for each layer.

To avoid repetitive code, the data and any mapping aesthetics specified in the function ggplot() are inherited by subsequent layers. The code below produces the same plot as the code in the previous block, with less typing

ggplot(data = gapminder[gapminder$country == "South Africa",1:6], 
       aes(x = year, y = lifeExp)) + 
  geom_point() + 
  geom_line() 

This does not imply that we cannot override the inherited arguments in subsequent layers.
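
A hedged sketch of such an override follows. The geom_line layer inherits the data and mapping from ggplot(), while the geom_point layer receives its own data argument so that only the most recent years are drawn as points. The object name sa, the cut-off year 1990 and the colour red are illustrative choices and not part of the lesson's data preparation.

```r
library(ggplot2)
library(gapminder)

sa <- gapminder[gapminder$country == "South Africa", ]

p_override <- ggplot(data = sa, aes(x = year, y = lifeExp)) +
  geom_line() +                              # inherits data and mapping
  geom_point(data = sa[sa$year >= 1990, ],   # overrides the data only
             colour = "red")                 # mapping is still inherited

p_override
```

The mapping can be overridden in the same way by passing aes() to a layer; only the aesthetics named in that call are replaced.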

Adding flavour

Colour

If we want to change the colour of the line graph we can simply set the colour aesthetic equal to the value "blue".

Try changing the colour of the line to blue without changing the colour of the points

ggplot(data = gapminder[gapminder$country == "South Africa",1:6], 
       aes(x = year, y = lifeExp)) + 
  geom_point() + 
  geom_line(___) 
  • Memo
ggplot(data = gapminder[gapminder$country == "South Africa",1:6], 
       aes(x = year, y = lifeExp)) + 
  geom_point() + 
  geom_line(colour = "blue") 

What happens if we use a mapping aesthetic as opposed to a setting aesthetic to set the colour of the line graph?

ggplot(data = gapminder[gapminder$country == "South Africa",1:6], 
       aes(x = year, y = lifeExp)) + 
  geom_point() + 
  geom_line(aes(colour = "blue"))

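When we set the colour using a mapping aesthetic, our line plot uses the colour red even though we specified blue. Using aes(colour = "blue") does not set the colour; it maps the constant vector c("blue") to the colour aesthetic. ggplot2 treats "blue" as a data value with a single level and assigns it the first colour of the default discrete colour scale, which is a shade of red. A legend for this one-level variable is also added.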
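
The difference can be checked by inspecting the data that ggplot2 computes for each layer. The sketch below uses layer_data() for this and, as one possible remedy that the lesson itself does not cover, scale_colour_identity(), which asks ggplot2 to use mapped values directly as colours. The object names are illustrative.

```r
library(ggplot2)
library(gapminder)

sa <- gapminder[gapminder$country == "South Africa", ]

# Setting: the constant "blue" is used directly as the colour
p_set <- ggplot(sa, aes(x = year, y = lifeExp)) +
  geom_line(colour = "blue")

# Mapping: "blue" is treated as data and passed through the colour scale
p_map <- ggplot(sa, aes(x = year, y = lifeExp)) +
  geom_line(aes(colour = "blue"))

unique(layer_data(p_set)$colour)  # the colour that was set
unique(layer_data(p_map)$colour)  # a colour chosen by the default scale

# An identity scale uses the mapped values themselves as colours
p_identity <- p_map + scale_colour_identity()
unique(layer_data(p_identity)$colour)
```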

Mapping aesthetics are used when we want to change how a layer is displayed based on the underlying data. For instance, we can map the year column of the data set to the colour mapping aesthetic of the geom_line layer. In this case, ggplot2 will assign colours based on the values in the year column. ggplot2 will automatically assign a level of the aesthetic to each value, a process known as scaling

ggplot(data = gapminder[gapminder$country == "South Africa",1:6], 
       aes(x = year, y = lifeExp)) + 
  geom_point() + 
  geom_line(aes(colour = year))

Scales

In the previous graph, the colours used for the variable year were automatically selected. If we want to change how the variable year maps to a set of colours we need to add a scale layer to a ggplot2 graphics object

A scale layer uses the following syntax: scale_[aesthetic]_[option] where:

  • [aesthetic] should be replaced with the name of the mapping aesthetic you would like to change e.g. colour, shape, linetype, alpha, size, fill, x, y
  • [option] should be used to specify how you would like to change the aesthetic. For example, manual, continuous or discrete (depending on the nature of the variable)

Examples

  • scale_linetype_manual(): Manually specify the linetype of each different value
  • scale_alpha_continuous(): Varies transparency over a continuous range
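
The naming pattern can be sketched with two manual scales. The three countries, the line types and the colours below are arbitrary choices made only for illustration.

```r
library(ggplot2)
library(gapminder)

# Three countries chosen only for illustration
three <- gapminder[gapminder$country %in% c("Egypt", "Kenya", "South Africa"), ]

p_scales <- ggplot(three, aes(x = year, y = lifeExp,
                              linetype = country, colour = country)) +
  geom_line() +
  # scale_[linetype]_[manual]: one line type per country
  scale_linetype_manual(values = c("Egypt" = "solid",
                                   "Kenya" = "dashed",
                                   "South Africa" = "dotted")) +
  # scale_[colour]_[manual]: one colour per country
  scale_colour_manual(values = c("Egypt" = "darkgreen",
                                 "Kenya" = "orange",
                                 "South Africa" = "purple"))

p_scales
```

Naming the elements of values ties each value to a specific level, so the result does not depend on the order of the levels.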

To change the default colours used to map the variable year to colour, the scale layer scale_color_continuous can be added

ggplot(data = gapminder[gapminder$country == "South Africa",1:6], 
       aes(x = year, y = lifeExp)) + 
  geom_point() + 
  geom_line(aes(colour = year)) +
  scale_color_continuous(type = "gradient",
                         low = "red", 
                         high = "blue")

When we try to add a scale that is not compatible with the mapped variable, the code will return an error

ggplot(data = gapminder[gapminder$country == "South Africa",1:6], 
       aes(x = year, y = lifeExp)) + 
  geom_point() + 
  geom_line(aes(colour = year)) +
  scale_color_discrete()

In the above example, ggplot2 returns an error since we are trying to map a continuous variable to a discrete set of colours.
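
To see the error without interrupting a script, it can be caught with tryCatch(). The sketch below assumes that ggplot_build(), which runs the computations performed when a plot is printed, raises the same error; it then shows one possible fix, converting year to a factor. Points are used instead of a line so that each year can receive its own colour.

```r
library(ggplot2)
library(gapminder)

sa <- gapminder[gapminder$country == "South Africa", ]

p_bad <- ggplot(sa, aes(x = year, y = lifeExp)) +
  geom_point(aes(colour = year)) +
  scale_color_discrete()

# Catch the error raised while the plot is built
msg <- tryCatch(
  {
    ggplot_build(p_bad)
    "no error"
  },
  error = function(e) conditionMessage(e)
)
msg

# One possible fix: convert year to a discrete variable first
p_ok <- ggplot(sa, aes(x = year, y = lifeExp)) +
  geom_point(aes(colour = factor(year))) +
  scale_color_discrete()
```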

When we map a discrete variable to the aesthetic colour e.g. country, we can map the variable to a discrete set of colours by adding the layer scale_color_discrete

ggplot(data = gapminder[gapminder$country == "South Africa",1:6], 
       aes(x = year, y = lifeExp)) + 
  geom_point() + 
  geom_line(aes(colour = country)) +
  scale_color_discrete(type = c("blue"))

Since x is a mapping aesthetic, it also has a scale. For example, we can add the scale layer scale_x_continuous to set the major tick marks or breaks to align with the years that we collected data for

ggplot(data = gapminder[gapminder$country == "South Africa",1:6], 
       aes(x = year, y = lifeExp)) + 
  geom_point() + 
  geom_line() +
  scale_x_continuous(breaks = seq(1952, 2007, by = 5))

Scales can also be used to limit the data displayed. In the example below, any rows with years before 1995 are removed

ggplot(data = gapminder[gapminder$country == "South Africa",1:6], 
       aes(x = year, y = lifeExp)) + 
  geom_point() + 
  geom_line() +
  scale_x_continuous(limits = c(1995, NA))
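
Because scale limits remove data, the line segment that enters the plot from an earlier year disappears as well. If the intention is only to zoom in, coord_cartesian() is an alternative that keeps all observations. The sketch below contrasts the two; coord_cartesian() is not otherwise covered in this lesson, and the object names are illustrative.

```r
library(ggplot2)
library(gapminder)

sa <- gapminder[gapminder$country == "South Africa", ]
base_plot <- ggplot(sa, aes(x = year, y = lifeExp)) +
  geom_point() +
  geom_line()

# Scale limits: observations outside the limits are turned into NA
p_limits <- base_plot + scale_x_continuous(limits = c(1995, NA))

# Coordinate limits: all observations are kept, only the view is zoomed
p_zoom <- base_plot + coord_cartesian(xlim = c(1995, 2007))

sum(!is.na(layer_data(p_limits)$x))  # observations left after scale limits
sum(!is.na(layer_data(p_zoom)$x))    # observations kept when zooming
```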

Labels

An x-label will be automatically created using the name of the variable mapped to the x aesthetic. To add a manual label, the layer xlab can be added. The xlab layer can also be used to remove the label e.g. xlab(NULL)

ggplot(data = gapminder[gapminder$country == "South Africa",1:6], 
       aes(x = year, y = lifeExp)) + 
  geom_point() + 
  geom_line() +
  xlab("Year") 

To add a custom y-label and title to a plot add the layers ylab and ggtitle respectively

ggplot(data = gapminder[gapminder$country == "South Africa",1:6], 
       aes(x = year, y = lifeExp)) + 
  geom_point() + 
  geom_line() +
  xlab("Year") + 
  ylab("Life expectancy") +
  ggtitle("Life expectancy in SA")

The layers ggtitle, xlab and ylab are helper layers; shortcuts to quickly change a ggplot2 graphics object. In general, any labels, including the labels of legends, can be set using the labs layer

ggplot(data = gapminder[gapminder$country == "South Africa",1:6], 
       aes(x = year, y = lifeExp)) + 
  geom_point() + 
  geom_line() +
  labs(title = "Life expectancy in SA", 
       x = "Year", 
       y = "Life expectancy")

Themes

The theme layer allows you to exercise fine control over non-data elements of a plot. The ggplot2 package includes several built-in themes e.g. theme_grey(), theme_bw(), theme_linedraw(), theme_light(), theme_dark(), theme_minimal() or theme_classic()

Theme.png

Adding a built-in theme is straightforward: simply add the theme as a layer. If you want all your plots to use the same theme, use the function theme_set() at the start of an R script e.g. theme_set(theme_classic())

ggplot(data = gapminder[gapminder$country == "South Africa",1:6], 
       aes(x = year, y = lifeExp)) + 
  geom_point() + 
  geom_line() +
  labs(title = "Life expectancy in SA", 
       x = "Year", 
       y = "Life expectancy") +
  theme_bw()

We can also create our own theme using the theme layer:

  • The theme layer has several arguments (also called elements) that specify the non-data elements that can be controlled. For example, the plot.title element controls the appearance of the plot titles
  • Each element is associated with an element function, which describes the visual properties of the element. For example, element_text() sets text properties such as the font size
  • There are over 30 different elements, which can be viewed by opening the help file of the function theme i.e. ?theme
  • If you are interested in how the different elements work, refer to the book 'ggplot2: Elegant Graphics for Data Analysis' by Wickham H

You can also add a built-in theme layer followed by the theme layer to override some of the built-in theme settings. I usually add the theme layer last to ensure that none of the other layers override any of the theme settings

ggplot(data = gapminder[gapminder$country == "South Africa",1:6], 
       aes(x = year, y = lifeExp)) + 
  geom_point() + 
  geom_line() +
  labs(title = "Life expectancy in SA", 
       x = "Year", 
       y = "Life expectancy") +
  theme(plot.title = element_text(size = 8))
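
The combination of a built-in theme followed by a theme layer can be sketched as follows. The particular overrides (a small bold title and no minor grid lines) are arbitrary and only meant to show the ordering.

```r
library(ggplot2)
library(gapminder)

sa <- gapminder[gapminder$country == "South Africa", ]

p_theme <- ggplot(sa, aes(x = year, y = lifeExp)) +
  geom_point() +
  geom_line() +
  labs(title = "Life expectancy in SA",
       x = "Year",
       y = "Life expectancy") +
  theme_bw() +                                     # built-in theme first
  theme(plot.title = element_text(size = 8,        # then override selected
                                  face = "bold"),  # elements of that theme
        panel.grid.minor = element_blank())

p_theme
```

If theme_bw() were added after theme(), it would replace the custom settings, because a built-in theme specifies every element.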

Saving plots as objects

Since a ggplot2 plot is an object we can assign a name to the ggplot object

my_plot <- ggplot(data = gapminder[gapminder$country == "South Africa",1:6],
                  aes(x = year, y = lifeExp)) + 
  geom_point() 

When you assign a name to a ggplot object, the plot will not automatically show in the Plots tab of RStudio or as an output. To show the graph, simply add a single line of code with the name of the object i.e. the same as printing any other object.

Once a name is assigned to a ggplot object, we can easily reuse the object later. For example, if I want some of my plots to look the same, I usually store a custom theme in an object. The name of the object can then simply be added as a layer to any ggplot2 graphics object

my_theme <- theme(axis.text = element_text(size = 10)) # Custom theme 

ggplot(data = gapminder[gapminder$country == "South Africa",1:6],
       aes(x = year, y = lifeExp)) + 
  geom_point() +
  my_theme
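
A stored plot can be extended in the same way as a stored theme. The sketch below, in which the names sa and my_line_plot are illustrative, shows that adding a layer to a stored plot creates a new object and leaves the original unchanged.

```r
library(ggplot2)
library(gapminder)

sa <- gapminder[gapminder$country == "South Africa", ]

my_plot  <- ggplot(sa, aes(x = year, y = lifeExp)) + geom_point()
my_theme <- theme(axis.text = element_text(size = 10))

# Printing the object draws the plot
print(my_plot)

# Adding layers creates a new object; my_plot itself is unchanged
my_line_plot <- my_plot + geom_line() + my_theme

length(my_plot$layers)       # 1 layer: points
length(my_line_plot$layers)  # 2 layers: points and line
```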

Advanced plots

Grammar of graphics

The layered grammar of graphics consists of eight components, namely: Data, Mapping, Statistics, Scales, Geometries, Facets, Coordinates and Themes. We have discussed some of these components before. In this lesson we will introduce some new layers and discuss some fundamental assumptions made by ggplot2

grammer.png

Grouping data points

Points are connected in a specific order. When creating a line plot, ggplot2 has to decide how the data points or instances should be connected. By default, geom_line() treats all instances as a single group and connects them in order of the variable mapped to the x aesthetic.

ggplot(data.frame(x = c(1,1,2,2,3,3), y = c(1,2,2,1,1,2)), 
       aes(x = x, y = y)) + 
  geom_line()
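
The ordering becomes visible when the rows are not sorted by x. As a small sketch with made-up data, the code below compares geom_line() with geom_path(), a related layer that connects points in the order in which they appear in the data.

```r
library(ggplot2)

# Rows are deliberately not sorted by x
unsorted <- data.frame(x = c(3, 1, 2), y = c(1, 2, 3))

p_line <- ggplot(unsorted, aes(x = x, y = y)) + geom_line()  # ordered by x
p_path <- ggplot(unsorted, aes(x = x, y = y)) + geom_path()  # ordered by row

layer_data(p_line)$x  # 1 2 3
layer_data(p_path)$x  # 3 1 2
```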

When we try to create a line plot of the life expectancy over time for all countries in the gapminder data set, ggplot2 connects all the instances. From the output, the life expectancy per country is not clear. We require a way to tell ggplot2 that only specific data points should be connected

ggplot(data = gapminder, 
       aes(x = year, y = lifeExp)) + 
  geom_line()

One way to create a separate line for each country is to map the variable country to a mapping aesthetic. For example, the variable country can be mapped to the colour aesthetic. Here, we need to include 142 countries in the legend which in turn uses up all the plotting space

ggplot(data = gapminder, 
       aes(x = year, y = lifeExp, 
           colour = country)) + 
  geom_line()

The theme layer can be used to remove the legend of the plot. Once the legend is removed, we end up with a line for each country in the gapminder data set

ggplot(data = gapminder, 
       aes(x = year, y = lifeExp, 
           colour = country)) + 
  geom_line() +
  theme(legend.position = "none")

Many geom layers, like geom_line(), use a single geometric object to display multiple rows of data. For these geoms, you can set the group aesthetic to draw multiple objects without adding a legend or distinguishing feature to the plot

ggplot(data = gapminder, 
       aes(x = year, y = lifeExp, 
           group = country)) + 
  geom_line() 

Facets

To highlight the regional difference between life expectancy trends, the variable continent can be mapped to the colour aesthetic

ggplot(data = gapminder, 
       aes(x = year, y = lifeExp, 
           group = country,  
           colour = continent)) +
  geom_line() 

A facet can be used to split a plot into subplots where each subplot displays a subset of the data. To facet a plot add the layer facet_wrap()

ggplot(data = gapminder, 
       aes(x = year, y = lifeExp, 
           group = country, 
           colour = continent)) + 
  geom_line() + 
  facet_wrap(c("continent"))

Our previous plot had overlapping text. To remove the overlap, the legend can be removed and the text size reduced

ggplot(data = gapminder, aes(x = year, y = lifeExp,  group = country, 
                             colour = continent)) + 
  geom_line() + 
  facet_wrap(c("continent")) + 
  theme(legend.position = "none", 
        text = element_text(size = 8))
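
The layout of the subplots can also be controlled. As a hedged sketch, the code below specifies the facetting variable with vars(), an alternative to the character vector used above, and places all panels in a single row with nrow. The reduced set of axis breaks is an arbitrary choice to keep the narrow panels readable.

```r
library(ggplot2)
library(gapminder)

p_facets <- ggplot(gapminder, aes(x = year, y = lifeExp,
                                  group = country,
                                  colour = continent)) +
  geom_line() +
  facet_wrap(vars(continent), nrow = 1) +            # one row of panels
  scale_x_continuous(breaks = c(1960, 1980, 2000)) + # fewer tick labels
  theme(legend.position = "none")

p_facets
```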

Modifying position

Suppose we are interested in understanding the distribution of life expectancy over years. We could create a scatterplot, but due to multiple points overlapping, the scatterplot is difficult to interpret. All layers include a position adjustment argument that can be used to resolve overlapping data. The default position can be changed using the position argument

ggplot(data = gapminder, 
       aes(x = year, y = lifeExp)) +
  geom_point()

To avoid overlapping data points a small amount of random noise can be added to the coordinates of each data point. To add this noise, the argument position can be set to jitter, which by default jitters both the x- and y-coordinates

ggplot(data = gapminder, 
       aes(x = year, y = lifeExp)) + 
  geom_point(position = "jitter", 
             aes(colour = as.character(year))) + 
  scale_x_continuous(breaks = 
                       seq(1952, 2007, by = 5)) + 
  labs(colour = "year")
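
The amount and direction of the noise can be controlled with the function position_jitter(). The sketch below jitters the points horizontally only, so that the plotted life expectancy values stay exact; the width of 1.5 years, the seed and the transparency are illustrative choices.

```r
library(ggplot2)
library(gapminder)

p_jitter <- ggplot(gapminder, aes(x = year, y = lifeExp)) +
  geom_point(position = position_jitter(width = 1.5,  # noise in x only
                                        height = 0,   # y values unchanged
                                        seed = 42),   # reproducible noise
             alpha = 0.5) +
  scale_x_continuous(breaks = seq(1952, 2007, by = 5))

p_jitter
```

Setting a seed means the same jittered plot is drawn every time the code is run.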

Statistical transformations

Imagine we wanted to develop a function that produces a bar chart from data. When we design the function we will have to decide:

  1. if the user simply passes the raw data and we calculate the count per category or
  2. if the user passes the count per category

Decision.png

When a geom_bar() layer is added to a ggplot graphic object, ggplot2 computes the count per category from the raw data

student_data <- data.frame(student_number = 1:6, 
                           degree = c(rep("Engineering", 2), 
                                      rep("Computer Science", 3),
                                      "Accounting")) 

ggplot(data = student_data, 
       aes(x = degree)) + 
  geom_bar()

The algorithm that ggplot2 uses to calculate statistics from raw data is called a stat, short for statistical transformation. The figure below illustrates how the process works for the geom_bar layer:

geom_bar.png

  • To determine which stat is applied by a geom layer the help file of the geom can be viewed e.g. ?geom_bar
  • The default stat used for the geom_bar layer is count, which means that geom_bar() uses stat_count()
  • You can generally use geoms and stats interchangeably e.g. a bar plot can be created with geom_bar() or stat_count()
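
The equivalence of the two layers can be checked with layer_data(), which returns the values computed by the stat. The sketch below recreates the student data from the example above; the plot object names are illustrative.

```r
library(ggplot2)

student_data <- data.frame(student_number = 1:6,
                           degree = c(rep("Engineering", 2),
                                      rep("Computer Science", 3),
                                      "Accounting"))

p_geom <- ggplot(student_data, aes(x = degree)) + geom_bar()
p_stat <- ggplot(student_data, aes(x = degree)) + stat_count(geom = "bar")

# The computed variable count holds the bar heights,
# ordered as Accounting, Computer Science, Engineering
layer_data(p_geom)$count  # 1 3 2
layer_data(p_stat)$count  # 1 3 2
```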

The default argument passed to the stat argument of a geom layer can be changed. For example, the stat can be changed from count to identity. When using the stat identity, the height of each bar is set to the raw value of the variable mapped to the y aesthetic

data_count <- data.frame(degree = c("Engineering", "Accounting", "Computer Science"), 
                         count = c(2, 1, 3)) 

ggplot(data = data_count , 
       aes(x = degree, y = count)) + 
  geom_bar(stat = "identity")
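
ggplot2 also provides geom_col(), a shortcut for a bar layer that uses the stat identity. As a sketch, the code below derives the counts from the raw student data with the base R function table() and plots them with geom_col(). The column name Freq is generated by as.data.frame(); the object names are illustrative.

```r
library(ggplot2)

student_data <- data.frame(student_number = 1:6,
                           degree = c(rep("Engineering", 2),
                                      rep("Computer Science", 3),
                                      "Accounting"))

# Pre-compute the count per degree; the result has columns degree and Freq
counts <- as.data.frame(table(degree = student_data$degree))
counts

# geom_col() plots the supplied values as bar heights
p_col <- ggplot(counts, aes(x = degree, y = Freq)) +
  geom_col()

p_col
```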
