The pipe operator
The pipe operator
You will frequently need to perform a series of steps to answer a data related question.
Such a series of steps can often be represented as a nested function i.e.
- we start by first calculating
h(data)
, - then proceed to calculate
g(h(data)),
- and finally calculate
f(g(h(data)))
.
When a nested function consists of multiple functions and arguments it can become difficult to both write and read the function.
Alternatively, a series of steps can be viewed as a flow chart of functions
- the function
f( )
process some data and sends the output to the functiong( )
, - the function
g( )
then process the output received from the functionf( )
and send the output to the functionh( )
and - the function
h( )
then process the output received from the functiong( )
and produce the final answer
In this section we will discuss how we can write code in a flow chart style using the pipe operator %>%
.
To illustrate how flow chart style code can be written, we will use the data set mtcars
. To load the mtcars
data set the function data()
can be used
data(mtcars) # loads the data frame into the global environment
The mtcars
data set was extracted from the 1975 Motor Trend magazine, and comprises of the fuel consumption mpg
and ten other accepts (features) of 32 automobiles (instances). We can view the first three rows of the data by using the function head()
head(mtcars, n = 3) # displays the first three rows
Using the data set mtcars
we can calculate various statistics. For example, suppose that we want to determine the mean miles per gallon mpg
of cars per their number of cylinders cyl
for cars with more than one carburettors carb
.
Our first option to calculate the statistic illustrated above is to break the calculation up into a series of steps, where we save the result of each step using an object.
# 1. Remove cars with less than 2 carburettors
a <- filter(mtcars, carb > 1)
# 2. Group data
b <- group_by(a, cyl)
# 3. calculate the mpg mean per group
c <- summarise(b, mean(mpg))
# 4. sort the results by cyl
arrange(c, cyl)
Breaking the calculation down into a series of steps makes it easy to read the code, however as a result unnecessary copies of the same object is created.
A second option to calculate the statistic of interest, is to use nested functions
arrange(summarise(group_by(filter(mtcars, carb > 1), cyl), mean(mpg)), cyl)
Using nested functions avoid the use of explicitly creating objects, but can the readability of the code be improved?
We can express the functions as a flow chart using the pipe operator >%>
.
mtcars %>% filter(carb > 1) %>% group_by(cyl) %>% summarise(mean(mpg)) %>% arrange(cyl)
The pipe operator >%>
avoids creating unnecessary objects and makes it easier to read code. The pipe operator >%>
can be interpreted as a flow chart where we send input
to a function that produces an output, the output is then be sent to the next function
To insert the pipe operator >%>
the shortcut Ctrl + Shift + M
can be used. Pipes are clearer to read when each function is on a separate line
mtcars %>%
filter(carb > 1) %>%
group_by(cyl) %>%
summarise(mean(mpg)) %>%
arrange(cyl)
The thing to the left of a pipe operator %>%
is passed to the first argument of the function on the __right __of the pipe operator %>%
, while additional function arguments can be specified as usual
If you ever need to pass an object to an argument other then the first argument, set the argument equal to .
. For example, y %>% f(x,.)
is equivalent to f(x,y)
and z %>% g(x, y, arg = .)
is equivalent to g(x,y,arg = z)
So far we have only illustrated how to print the results from pipe operations. We can also assign the output of a chain of pipe operations using the assignment operator <-
result <-
mtcars %>%
filter(carb > 1) %>%
group_by(cyl) %>%
summarise(mean(mpg)) %>%
arrange(cyl)
result
Assignment can also be performed at the end of a chain using the assignment operator ->
mtcars %>%
filter(carb > 1) %>%
group_by(cyl) %>%
summarise(mean(mpg)) %>%
arrange(cyl) -> result
result
Lastly, note that the the pipe operator can only be used if the package tidyverse
is loaded.
dplyr
Given that our data is in a data frame we might want to manipulate the data frame in various ways, for example:
- add or remove row(s) or column(s),
- rearrange the row(s) or column(s) or
- change the names of column(s).
All the above examples can be implemented using base R, but the base functions is not self-describing. Using base R code to modify data frames often lead to nested functions that are difficult to read.
The package dplyr
was created for the sole purpose of simplifying the process of manipulating, sorting, summarising and joining data frames. The functions included in the dplyr
package often leads to (i) more efficient code, (ii) code that is easier to read and (iii) code that is easier to write
The dplyr
package forms part of the tidyverse
packages. This means that if you have installed the tidyverse
package previously you do not have to reinstall the dplyr
package. Once installed the dplyr
package can be loaded using the function library()
library(dplyr)
The dplyr
package includes various function to help you manipulate data frames. Some examples include:
Function | Purpose |
---|---|
filter() | select row(s) to keep using conditions |
distinct() | select unique rows(s) |
slice_ | select, remove or duplicate row(s) |
arrange() | order the rows of a data frame by the values of selected column(s) |
select() | select column(s) from a data frame to keep |
pull() | select a single column of a data frame as a vector |
rename() | assign new names to one or more columns |
relocate() | change the order of columns |
mutate() | creates new columns(s) |
Tibbles
In this lesson we will work with tibbles instead of traditional R data frames. Tibbles are data frames, but are tweaked to make our life a bit easier:
- Tibbles are lazy i.e. the names and data types of columns are not automatically changed
- Tibbles are strict i.e. partial matching is not performed
- Tibbles are self-descriptive
- Tibbles print better than data frames
- Tibbles do not use row names
- The column names of __tibbles __can be more descriptive e.g. white spaces can be used in column names
The function class(
) can be used to verify that an object is a tibble
library(gapminder)
class(gapminder)
When subsetting is performed on a data frame the result returned is not always a data frame. When we are unsure of the data type of an object, it becomes very difficult to write code in a flow chart style. How do we now that the next operation in our flow diagram will accept the data type pass to it?
- we will either have to keep track of the different data types,
- write explicit tests or conversions to ensure the code execute without an error or
- we can use tibbles .
The main advantage of tibbles, in my opinion, is that when we apply a dplyr function to a tibble the results are always returned as a tibble:
Row operations
Filter
Objective: Select rows to keep using filter()
Description: Filtering is a common task used to identify or select row(s) of a data set where a particular variable matches a specific value/condition
Function
filter(.data, ...)
or data %>% filter(...)
Argument | Description |
---|---|
.data | data frame or tibble |
… | One or more expressions that returns logical vectors. When multiple expressions separated by commas are provided, the expressions are combined with the and & operator |
Example 1
Select rows from the gapminder data set where the country
is equal to South Africa
gapminder %>% filter(___)
- Memo
gapminder %>% filter(country == "South Africa")
Example 2
Select rows from the gapminder data set where the country is equal to South Africa or Lesotho
gapminder %>% filter(country %in% ___)
- Memo
gapminder %>% filter(country %in% c("South Africa","Lesotho"))
Example 3
Select rows where the country is equal to South Africa and the year is greater than 2000
gapminder %>% filter(country == "South Africa", ___)
- Memo
gapminder %>% filter(country == "South Africa", year > 2000)
or alternatively
gapminder %>% filter(country == "South Africa" ___ year > 2000)
- Memo
gapminder %>% filter(country == "South Africa" & year > 2000)
Multiple logic rules can be applied in the filter()
function
Rows can be filtered using the range of comparison operators, logical operators and functions to evaluate missing i.e. NA
values
Operator | Description |
---|---|
< | Less than |
> | Greater than |
== | Equal to |
<= | Less than or equal to |
>= | Greater than or equal to |
!= | Not equal to |
%in% | Group membership |
is.na | is NA |
!is.na | is not NA |
& | And |
| | Or |
! | Not |
Distinct
Objective: Select unique row(s)
Description: Select unique row(s) based on a list of specified columns
Function
distinct(.data, ...)
or data %>% distinct(...)
Argument | Description |
---|---|
.data | data frame or tibble |
… | Column names to use to determine uniqueness of a row. When multiple rows for a given combination of columns exists, only the first rows are preserved. When column names are not specified all columns are used to determine the uniqueness of a row |
.keep_all | The default behaviour of the function is to drop all unspecified columns. If you want to get distinct rows by certain columns without dropping the other columns set the optional argument .keep_all to TRUE |
Example 1
Find all the unique years in the data set gapminder
gapminder %>% distinct(___)
- Memo
gapminder %>% distinct(year)
Note that a tibble is returned although the result can be easily represented by an atomic vector
Example 2
Find all the unique years , but show all column
gapminder %>% distinct(___)
- Memo
gapminder %>% distinct(year, .keep_all = TRUE)
Only 12 rows are returned since the function distinct()
only preserves the first distinct row.
Slice
Objective: Select, remove or duplicate rows
Description: The slice_
functions include various functions to select, remove or duplicate rows. To illustrate how the slice_
functions work we will look at the function slice_sample() which can be used to randomly select rows
Function
slice_sample(.data, ...)
or data %>% slice_sample(...)
Argument | Description |
---|---|
.data | data frame or tibble |
n, prop | Provide either (i) n the number of rows to sample at random or (ii) prop to sample a proportion of rows from the data set |
Example 1
Sample four random rows from the data set
set.seed(123) # makes random numbers repeatable
gapminder %>% ___ # select four random rows
- Memo
set.seed(123) # makes random numbers repeatable gapminder %>% slice_sample(n = 4) # selects four random rows
Arrange
Objective: Order the rows of a data frame by the values of selected column(s)
Description: Often used to view the rows of an observation in a specific order based on a the values in a particular column
Function
arrange(.data, ...)
or data %>% arrange(...)
Argument | Description |
---|---|
.data | data frame or tibble |
… | Names of columns to sort rows on separated by commas where the column names listed first takes precedence |
Values are sorted in ascending order by default. To sort in descending order using column x
pass desc(x)
to the …
argument of the function arrange()
Example 1
Sort the data frame gapminder by continent
in ascending order and country
in descending order
gapminder %>% arrange(___)
- Memo
gapminder %>% arrange(continent, desc(country))
Column operations
Select
Objective: Select column(s) from a data frame to keep
Description: Columns to keep can be selected based on the (i) column names, (ii) using a column range or (ii) by using helper functions
Function
select(.data, ...)
or select %>% arrange(...)
Argument | Description |
---|---|
.data | data frame or tibble |
… | One or more unquoted column names, a range of column names to keep specified as x:y where x:y selects column x, column y and all the columns between the columns x and y, and/or using helper functions |
Example 1
Select the columns country, year and population
gapminder %>%
select(___)
- Memo
gapminder %>% select(country, year, pop)
alternatively we can specify the columns to remove using -
signs
gapminder %>%
select(___)
- Memo
gapminder %>% select(-continent, -lifeExp, -gdpPercap)
Example 2
We can also use the :
symbol to select a range of columns
gapminder %>%
select(country:year, pop)
where the value country:year
will select the country
column, the year
column and all columns between the country
and year
columns
Example 3
Various helper functions such as starts_with()
, ends_with()
and contains()
can also be used to select columns based on conditions
Try selecting all columns that starts with the character c
or contains the phrase gdp
gapminder %>%
select(___)
- Memo
gapminder %>% select(starts_with("c"), contains("gdp"))
The help file of the function select()
can be viewed to see thevarious helper functions available i.e. ?select()
Example 4
To select columns
based on column types, the helper function where()
can be used. Try selecting all the numeric columns
gapminder %>%
select(where(___))
- Memo
gapminder %>% select(where(is.numeric))
The function passed to the argument
of where()
should return a single TRUE
or FALSE
. If you are creative you can for instance remove all columns with more than 50% missing values
gapminder %>%
select(where(function(x) sum(is.na(x)) / length(x) < 0.5))
Pull
Objective: Select a single column of a data frame as a vector
Description: Often in specific steps of an analysis a vector is required rather than a data frame. The function pull(
) allows you to select a single column or value from a data frame as a vector
Function
pull(.data, ...)
or select %>% pull(...)
Argument | Description |
---|---|
.data | data frame or tibble |
… | A column specified as (i) the column name or (ii) a positive integer, giving the position counting from left or (iii) a negative integer, giving the positing counting from right |
Example 1
gapminder %>%
pull(year)
You can verify that the above returns a vector by running
gapminder %>%
pull(year) %>%
class()
Rename
Objective: Assign new names to one or more columns
Description: Replace names of columns with descriptive names. Descriptive names improve readability and reduces the amount of formatting required e.g. when creating a plot
Function
rename(.data, ...)
or rename
%>% pull(…)`
Argument | Description |
---|---|
.data | data frame or tibble |
… | Use new_name = old_name to rename a column. To rename multiple columns separate the assignments using commas i.e. new_name_1 = old_name_1 , new_name_2 = old_name_2 |
Example 1
Rename the columns lifeExp
to Life Expectancy
and pop
to Population
gapminder %>%
rename(___ = ___, ___ = ___)
- Memo
gapminder %>% rename("Life Expectancy" = lifeExp, Population = pop)
In a tibble, unlike in a data frame, we can use spaces in column names given that the name is surrounded by double quotations “
or backticks `
Relocate
Objective: Change the order of columns
Description: Change the position of the columns of a data frame
Function
relocate(.data, ...)
or relocate
%>% pull(…)`
Argument | Description |
---|---|
.data | data frame or tibble |
… | columns names separated by commas that will be moved to the front of the object passed to the .data argument |
The default arguments .before
or .after
of the function relocate()
can be used to move a column(s) before or after a specific column. Only one of the default arguments can be used at a time
Example 1
Move the column pop
before the column lifeExp
gapminder %>%
relocate(___, ___)
- Memo
gapminder %>% relocate(pop, .before = lifeExp)
Mutate
Objective: Create new column(s)
Description: Used to add new a column to a data set while preserving the old columns of the data set. The new variable can be derived from the values of another column(s) in the data set
Function
mutate(.data, ...)
or mutate
%>% pull(…)`
Argument | Description |
---|---|
.data | data frame or tibble |
… | New column(s) specified as name = value and separated with a comma where the value can be: (i) a vector of length 1 which will be recycled, (ii) a vector of the same length as the number of rows, (iii) NULL to remove a column or (iv) a data frame or tibble to create multiple columns |
Example 1
Create a new column pop_in_million
which divided the values in the pop
column by 1 000 000
gapminder %>%
mutate(pop_in_million = ___)
- Memo
gapminder %>% mutate(pop_in_million = pop/1000000)
Additional columns can be added in a single mutate()
call, by separating expressions with commas
Example 2
The ifelse()
function can be used to assign values to a column based on a condition. Use the function mutate to change the country Swaziland
to Estwatini
gapminder %>%
mutate(country == ifelse(___))
- Memo
gapminder %>% mutate(country = ifelse(country == "Swaziland", "Eswatini", as.character(country)))
The function mutate()
will replace the values in an existing column, when an existing column name is provided
Example 3
The function case_when()
can be used to perform multiple ifelse()
operations]
gapminder %>%
mutate(gdp = case_when(gdpPercap < 700 ~ "Low",
gdpPercap < 800 ~ "Moderate",
TRUE ~ "High"))
Summarise data
Summarise
Objective: Calculate the summary statistic of specific columns
Description: The summarise()
function takes column(s) and computes something (any calculation that can aggregate multiple values into a single value) using the values of every row
Function
summarise(.data, ...)
or summarise
%>% pull(…)`
Argument | Description |
---|---|
.data | data frame or tibble |
… | Summary statistic to calculate specified as name = function() where the name is optional and the function() can be any summary function like min() , mean() and max() . Multiple statistics can be calculated by separating summary statistics using commas or by using helper functions |
Example 1
For South Africa, calculate the number of observations, max population, mean life expectancy and the range of life expectancy over the years
gapminder %>%
filter(country == "South Africa") %>%
summarise(count = n(), # dplyr count function
max_pop = ___,
___,
range_life_exp = ___ - ___)
- Memo
gapminder %>% filter(country == "South Africa") %>% summarise(count = n(), # dplyr count function max_pop = max(pop), mean(lifeExp), # no name specified range_life_exp = max(lifeExp) - min(lifeExp))
The new columns are calculated using all the rows of the gapminder data set filtered for the country South Africa
Across
Objective: Avoid repetitive code due to applying the same function to multiple columns
Description: The across()
function is a helper function that can be used to apply one or more functions to one or more columns
Function
across(.cols, .fns)
Argument | Description |
---|---|
.cols | A column name or a vector of column names e.g. c(col1, col2) |
.fns | A reference to a function e.g. mean or a list of references to functions e.g. list(mean = mean, max = max, min = min) |
Note that a reference to a function must be provided i.e. a function without parenthesis
Example 1
Calculate the mean, maximum and minimum value of the column lifeExp
and the column pop
gapminder %>%
filter(country == "South Africa") %>%
summarise(across(___,
list(___)))
- Memo
gapminder %>% filter(country == "South Africa") %>% summarise(across(c(lifeExp, pop), list(mean = mean, max = max, min = min)))
Names are automatically created based on the column and list names
The function across(
)` can be used in additional ways to avoid repetition
across(everything())
will summarise / mutate all columns. For example, to calculate the mean and standard deviation of all the columns for the data set data:
gapminder %>%
summarise(across(everything(), list(mean = mean, sd = sd)))
across(where())
will summarise / mutate all columns that satisfy some logical condition. For example, to calculate the mean and standard deviation for all numeric columns of the data set data
gapminder %>%
summarise(across(where(is.numeric), list(mean = mean, sd = sd)))
Other helper functions previously discussed can also be used inside the function across() e.g. starts_with()
, ends_with()
and contains()
Group statistics
- To gain better insights, we can calculate statistics by a group
- For example, suppose we have a data set composed of a key and data column. Then we can compute the mean of the values in the data column for each group or unique value present in the key column
- The operations of calculating group statistics can be thought of as a split, apply and combine operation
In the dplyr
package, we will use the group_by()
function to implement the split operation, while the summarise()
function can be used to perform the apply and combine operations
Group by
Objective: Group rows
Description: Takes an existing data object and transform it into a grouped data object. Once a data set is grouped, operations can be performed "by group"
Function
group_by(.data, ...)
or data %>% group_by(...)
Argument | Description |
---|---|
.data | data frame or tibble |
… | A column name to group the data frame by or multiple column names separated by commas |
- Also read up on the
ungroup()
function
Example 1
Calculate the mean lifeExp
per country
gapminder %>%
___ %>%
summarise(lifeExp_mean = mean(lifeExp))
- Memo
gapminder %>% group_by(country) %>% summarise(lifeExp_mean = mean(lifeExp))
If we did not use the group_by()
function in the example above, the operation would have computed the mean life expectancy using all the rows of the gapminder data set
Window
We have previously seen how we can compute some statistics of South Africa using the summarise()
function i.e. calculating the minimum, mean and maximum life expectancy
gapminder %>%
filter(country == "South Africa") %>%
select(lifeExp) %>%
summarise(min_lifeExp = min(lifeExp),
mean_lifeExp = mean(lifeExp),
max_lifeExp = max(lifeExp))
However, we do not have any way to directly calculate (i) how South Africa compares against other countries and (ii) whether the life expectancy of South Africa has improved or not
Window functions allow us to compare rows to each other. We will start by looking at the two offset functions known as lag()
and lead()
. The function lag()
retrieves the previous element of a vector, while the function lead()
retrieves the next element of a vector
x <- 1:5
lag(x)
In the example above, we use the function lag()
to retrieve the previous element of the vector x
. Since the first element of the vector x
does not have a previous element the function lag()
returns NA
Accessing the previous and next __element __of a vector can help us to calculate trends or create new variables
Example 1
Calculate the change in life expectancy per country
per year
gapminder %>%
filter(year > 2000) %>%
arrange(country, year) %>%
mutate(change_lifeExp = lifeExp - lag(lifeExp)) %>%
select(country, year, lifeExp, change_lifeExp)
At first our code seems to work, but what happens when the country
changes? To prevent the problem we can use group_by()
Example 2
Calculate the change in life expectancy per country
gapminder %>%
filter(year > 2000) %>%
arrange(country, year) %>%
group_by(country) %>%
mutate(change_lifeExp = lifeExp - lag(lifeExp)) %>%
select(country, year, lifeExp, change_lifeExp)
Ranking
Window functions allow us to compare rows to each other. Ranking functions takes a vector to order, and returns various types of ranks. The function row_number()
assigns ranks to values in a vector based on the minimum value, where ties are resolved by assigning the lowest rank to the value that appears first
x <- c(10, 2, 10, 6, 4)
row_number(x)
The function min_rank()
works exactly the same as the function row_number()
except that ties are resolved by assigning the same rank to equal values
x <- c(10, 2, 10, 6, 4)
min_rank(x)
Example 1
Rank the countries
by lifeExp
in the year
2007
gapminder %>%
select(country, year, lifeExp) %>%
filter(year == 2007) %>%
mutate(rank = row_number(lifeExp)) %>%
arrange(rank)