R Data Structures
R Data Structures
When we perform data analysis in general we will typically use data in the form of an analytic base table. In an analytic base table, the rows of the table represent different observations and the different variables reported for each observation defines the columns
Up to now, we have discussed how we can store one value in an object. For example:
length <- 5
width <- 10
area <- length * width
area
but, storing one value in an object will only get us that far. We require methods to import data from different sources in R, to store the data in R and to manipulate the data stored in R.
Tutorial 1 will focus on the different data types available in R.
Working with data in R involves selecting a data structure to hold your data and (ii) entering or importing the data into the data structure identified. R has a wide variety of objects available for holding data including atomic vectors, matrices, arrays, lists and data frames. When an object is created with a single value, R will create an atomic vector since R does not include a scalar data type.
Data frame
The analytic base table equivalent structure in R is known as a data frame
For now, we will avoid the technical details of a data frame and rather focus on the high-level concepts. R includes several built-in data sets, some of the data sets are stored as data frames. To load an example data set the function data() can be used.
data(mtcars) # loads the data frame into the global environment
We can view the first 6 observations of the data frame mtcars using the function head()
or the last 6 observations of the data frame mtcars using the function tail()
. Try modifying the code below to only display the first four observations of the dataframe mtcars
head(x = mtcars, n = 6) # displays the first six observations
- Memo
head(x = mtcars, n = 4)
The function str()
can be used to view the structure of an object
str(mtcars)
Once we have a data frame it becomes easy to calculate summary statistics, create plots or even analytical models.
Summary statistics: The function summary()
can be used to display the summary statistics of a data frame
summary(mtcars)
Plots: Given a data frame various plots can be created, for example a histogram of the variable miles per gallon (mpg)
hist(mtcars$mpg)
Analytical model: Create a linear regression model using the transmission (variable am) as input and miles per gallon (mpg) as output
lm(mpg~am,data=mtcars)
Atomic vectors
Introduction to vectors
The simplest and most common data structure in R is called a vector. There are two types of vectors in R: atomic vectors and lists. The main difference between atomic vectors and lists is that atomic vectors can only store data of the same type, while lists can be used to store data of different types. For example, an atomic vector can be used to store the grades achieved by students in this course
Various functions can be used to create vectors in R. The most straightforward way to create a vector in R is by using the function c()
which stands for combine or concatenate. For example, a vector with four elements can be created and assigned to the object first_vector
first_vector <- c(1, 3, 7, -0.5) # create a vector with four elements
Try creating a vector that contains the values 1,2,3,4:
- Memo
c(1,2,3,4)
The function c()
can also be used to combine vectors to form a new vector
c(c(1, 2, 3), c(4, 5, 6)) # combine the vector c(1,2,3) and c(4,5,6)
Given a vector, it is possible to create a new vector where the new vector contains repetitions of the elements of the original vector. To repeat specific elements of a vector the function rep()
which stands for repetition can be used
rep(c(1, 2), times = 3) # repeat the vector c(1,2) three times
The above example, repeats the vector c(1,2)
three times. Elements of a vector can also be repeated using the each
argument of the function rep()
Modify the code below to produce the vector c(1,2,1,2,1,2) using the each
argument
rep(c(1, 2))
- Memo
rep(c(1, 2), each = 3)
Values can also be passed to both the each
and times
arguments of the rep()
function. When values are passed to both the each
and times
arguments, the each
operation is performed first.
Create the vector c(1,1,1,2,2,2,1,1,1,2,2,2)
by filling in the blanks:
rep(c(1,2), times = ____, each = ____)
- Memo
rep(c(1,2), times = 2, each = 3)
The function head()
and tail()
can be used to obtain a preview of the values stored in a vector
R by default includes the character vectors letters
and LETTERS
. The vector letters
contain the 26 lower-case letters of the Roman alphabet, while the vector LETTERS
contain the 26 upper-case letters of the Roman alphabet
letters
The functions head()
and tail()
can be used to preview the values stored in a vector
head(letters)
tail(letters)
Display the last four elements of the vector letters
by filling in the blanks:
tail(letters, n = ___)
- Memo
tail(letters, n = 4)
Type of atomic vectors
There are six types of atomic vectors in R: logical, integer, double, character, complex and raw. Integer and double vectors are collectively known as numeric vectors. We will only focus on logical, integer, double and character vectors in this course also referred to as the primary type of atomic vectors
Each atomic vector in R uses a special syntax to define the elements of the vector
Logical vectors: can only contain the values (i) TRUE
or T
and (ii) FALSE
or F
logical_vector <- c(T, F, TRUE, FALSE) # Note TRUE can be abbreviated as T
Character vectors contain elements of type string. Strings are values surrounded single quotation marks ‘’
or double quotation marks ““
character_vector <- c("Andrew", 'Mike', 'John', 'Sara')
Double vectors can be specified in decimal, scientific or hexadecimal form. Double vectors can contain three special values: Inf
(infinity), -Inf
(negative infinity) and NaN
(not a number)
double_vector <- c(1.2, 1.2e3, 0xcafe, Inf, NaN)
Integer vectors are defined similarly to double vectors, but the elements must be followed by L
and cannot contain fractions
int_vector <- c(1L, 1.2e3L, 0xcafeL)
Double vectors and integer vectors are both numeric vectors
Properties of vectors
Each vector has three properties: (1) a type (2) a length and (3) attributes:
Type: The type of a vector, how the object is internally stored, can be checked using the typeof()
function. The typeof()
function determines the R internal type or storage mode of any R object
typeof(letters) # letters is a builtin character vector
Length: The number of elements stored in a vector can be determined with the function length()
length(letters)
Attributes: An attribute is a piece of information that can be attached to an atomic vector or any R object. You can think of attributes as metadata
- a convenient place to store information associated with an object. By default an atomic vector does not have any attributes assigned to it. To display the attributes of an object the attribute()
function can be used.
my_vector <- 1:10
attributes(my_vector)
The object my_vector
does not have any attributes assigned to it. However, this does not mean that attributes cannot be assigned to an object. The most common attributes to give an atomic vector are names, dimensions and classes. We will only discuss names at this point
By default an atomic vector will not have a names attribute assigned to it. To check whether the names attribute is assigned a vector the function names()
can be used
weekly_rainfall = c(10, 12, 0, 4, 0)
names(weekly_rainfall)
Names can be assigned to a vector either when a vector is created or using the function names()
weekly_rainfall = c("Mo" = 10, "Tu" = 12, "We" = 0, "Th" = 4, "Fr" = 0)
names(weekly_rainfall) <- c("Mo", "Tu", "We", "Th", "Fr")
attributes(weekly_rainfall)
Names will not affect the actual values of the vector, nor will the names be affected when the elements of the vector are manipulated
When you attempt to create a vector with different types, R will “convert” the elements to a compatible type of vector
Recall that an atomic vectors can only contain elements of the same type. If you try to create a vector with different elements, R will automatically coarse the values to a compatible type in the order: logical » integer » double » character
Practice questions
- R stores the vector
c(TRUE, 1L)
as a ______ vector.
- a) logical
- b) integer
- c) double
- d) character
- Memo
The correct answer is b) integer
- R stores the vector
c(TRUE, 1))
as a _____ vector
- a) logical
- b) integer
- c) double
- d) character
- Memo
The correct answer is c) double
- R stores the vector
typeof(c('a', 1)))
as a _____ vector
- a) logical
- b) integer
- c) double
- d) character
- Memo
The correct answer is d) character
Logical vectors
Logical vectors can contain the values TRUE
, FALSE
and NA
(for “not” available). Logical vectors are typically a product of performing a logical test, for example:
c(1, 2, 3) == 1
The example above evaluates whether each element in the vector c(1, 2, 3)
is equal to 1 using the comparison operator ==
. Recall that the =
operator is reserved for assignment, instead ==
is used to determine equality
R includes all the standard comparison operators : >
, >=
, <
, <=
, !=
(not equal) and ==
(equal)
c(1, 2, 3) == 1
To test if two objects are exactly equal the function identical()
can be used
v1 <- c(4, 4, 9, 12)
v2 <- c(4, 4, 9, 13)
identical(v1, v2) #will give a false output
v1 <- c(4, 4, 9, 12)
v2 <- c(4, 4, 9, 12)
identical(v1, v2) #will give a true output
Sometimes you wish to test for “nearly equal”. The function all.equal()
test for equality with a tolerance difference of 1.5e-8
v1 <- c(4.00000005, 4.00000008)
v2 <- c(4.00000002, 4.00000006)
all.equal(v1, v2)
If the difference is greater than the tolerance level, the mean relative difference is returned
v1 <- c(4.0005, 4.0008)
v2 <- c(4.0002, 4.0006)
all.equal(v1, v2)
To evaluate more than one logical expressions the AND &
operator or the OR |
operator can be used. For the AND &
operator both conditions must be TRUE
to be TRUE
(3 > 5) & (4 == 4)
For the OR |
operator at least one condition must be TRUE
to be TRUE
(3 > 5) | (4 == 4)
Lastly a condition can be switched using the NOT !
operator
!(3 > 5)
Consider the following code
result <- ((111 >= 111) | !(TRUE)) & ((4 + 1) == 5) # result = TRUE
The function %in%
avoids the use of using the OR |
operator excessively
c(1, 2, 3, 4) %in% c(1, 2)
The function which()
can be used to return the indices of elements that evaluate to TRUE
which(c(1, 2, 3, 4) %in% c(1, 2))
Math operations can also be performed with logical vectors, since TRUE = 1
and FALSE = 0
(c(1,0,1) == 1) + 1
Typically use cases includes determining the proportion of elements that are TRUE of a logical vector
mean(c(1, 2, 3) == 1)
Character vectors
Character vectors stores data as strings (“text”) and is typically used to store information such as names, addresses and IDs as:
first_names <- c("Andrew", "Beth", "Carly", "Dan")
String operators can be performed on strings to determine useful properties, such as the length of each string:
nchar(first_names)
Note that R uses a global string pool. This means that a unique string is only stored in memory once, reducing the amount of memory required to store duplicate strings
Numeric vectors
Numeric Vectors, rather then using the function c()
, can also be created with the function seq()
which stands for sequence. The function seq()
creates a vector which starts at the value passed to the from argument, in increments of 1, up to the value passed to the to argument. For example, a numeric vector can be created starting from 3 and ending at 10:
seq(from = 3, to = 10)
The seq()
function can also be written shorthand using the :
operator
3:10
The seq()
function will always try to create an integer vector first, if not possible a double vector will be created
To create a vector using seq()
with increments other then 1, a value can be passed to the optional by argument of the function. For example, a vector can be created which starts at 10 up to 0 in increments of -2
seq(from = 10, to = 0.2, by = -2)
In some cases you want to generate a numeric vector with a specific number of elements between two numbers. To generate a vector with a specific number of elements a value can be passed to the length.out argument of the seq()
function. For example, a vector of length 10 can be generated as follows:
seq(from = 3, to = 10, length.out = 10)
When performing arithmetic operations on numeric vectors, R perform element-wise operations by default
c(1, 2, 3) + c(4, 5, 6) # c(1 + 4, 2 + 5, 3 + 6)
c(1, 2, 3, 4)^2 # to the power of 2
When vectors of different lengths are used, R will recycle the shorter vector by repeating the vector to match the longer vector
c(1, 1, 1, 1) * c(1, 2) # c(1*1, 1*2, 1*1, 1*2)
When the longer vector is not a multiple of the shorter vectors, R will still perform recycling, however a warning will be shown
c(1, 1, 1, 1) * c(1, 2, 3) # c(1*1, 1*2, 1*3, 1*1)
Some functions perform operations on an entire vector as oppose to working element-wise
sum(c(1,2,3,4))
max(c(1,2,3,4))
Some other useful functions includes: min()
, median()
, sd()
and var()
Sometimes operations will produce Inf
(positive infinity), -Inf
(negative infinity) or NaN
(Not a Number) as a result from a calculation
c(-2, -1, 0, 1, 2)/0
To determine whether a function is Inf
or –Inf
the function is.infinite()
can be used
is.infinite(c(-2, -1, 0, 1, 2)/0)
To determine whether a function is NaN
the function is.nan()
can be used
is.na(c(-2, -1, 0, 1, 2)/0)
We can combine functions to perform common operations on numeric vectors. For instance some models expect all values to be within the range 0 to 1. To convert values to the range 0 to 1, normalisation can be used
x <- c(0, 2, 55, 23, 20, 48, 76)
(x - min(x)) / (max(x) - min(x))
Subsetting vectors
Elements of a vector can be selected, subset, in a several ways. To illustrate subsetting consider the vector
first_names <- c("Andrew", "Beth", "Carly", "Dan")
Option 1 Passing a single index or vector of entries to keep using [ ]
first_names <- c("Andrew", "Beth", "Carly", "Dan")
first_names[c(1,4)]
Option 2 Passing a single index or vector of entries to drop using [-]
first_names <- c("Andrew", "Beth", "Carly", "Dan")
first_names[-c(1,4)] # or first_names[c(-1, -4)]
Option 3 Passing a logical vector of entries to keep (TRUE) and entries to drop (FALSE) using []
first_names <- c("Andrew", "Beth", "Carly", "Dan")
first_names[nchar(first_names) > 4]
Note that if the logical vector passed is of a different length than the vector to be subset, recycling will be applied
first_names <- c("Andrew", "Beth", "Carly", "Dan")
first_names[c(TRUE, FALSE)]
Option 4 Names can be assigned to a vector when a vector is created or using the function names()
. Once names are defined, names can be used to subset a vector
weekly_rainfall = c("Mo" = 10, "Tu" = 12, "We" = 0, "Th" = 4, "Fr" = 0) # option 1
names(weekly_rainfall) <- c("Mo", "Tu", "We", "Th", "Fr") # option 2
weekly_rainfall
weekly_rainfall[c("Mo", "Tu")]
Missing values
Most data sets will contain missing values. R use the encoding NA
(not available), without quotes, to represent missing values
vector_with_missing <- c(1, 2, 3, NA, 4, 5, 6, NA)
When you try to apply operations to vectors with NA
values, most functions will return an error or simply NA
mean(vector_with_missing)
In some functions, the optional argument na.rm = TRUE
can be used to ignore the NA
values in calculations
mean(vector_with_missing, na.rm = TRUE)
Most operations that involve missing values will simply return a missing value. After all, R has “no idea” what the missing value represents
NA > 3
Similarly testing whether two missing values are equal with return NA
NA == NA # R has no idea if the two missing values are the same value
When the actual value represented by NA
is not important, R can return a non-NA output
NA^0 # returns 1
NA | TRUE # return TRUE
NA & FALSE # return FALSE
To test whether a specific element of a vector is missing the is.na()
function can be used
vector_with_missing <- c(1, 2, 3, NA, 4, 5, 6, NA)
is.na(vector_with_missing)
The is.na()
function can also be used to create a subset of a vector that excludes missing values
vector_with_missing[!is.na(vector_with_missing)]
Matrices and Arrays
A matrix extends the idea of vectors into two dimensions: rows and columns. A simple way of thinking of a matrix is the simple reordering of the values of a vector into two dimensions where all rows of the matrix are the same length
Elements of a vector can also be arranged in more than two dimensions, known as an array. For example, a colour image is typically represented as a three-dimensional array. As with atomic vectors, all the elements of an array and matrix must be of the same type.
Creating matrices
A matrix can be directly constructed using the matrix()
function. The byrow
argument of the matrix()
function determines whether the data fill is by row or by column
matrix(1:9, nrow = 3) # create a matrix with 3 rows
matrix(1:9, nrow = 3, ncol = 3, byrow = TRUE) # create a matrix with 3 rows
R will try to construct a matrix even if it means that elements of the vector used to construct the matrix is repeated or dropped. R will not necessarily show a warning message if elements are repeated or dropped
matrix(1:5, nrow = 2)
matrix(1:6, nrow = 2, ncol = 2)
A matrix can also be created by binding vectors together with the function rbind()
which stands for row bind or cbind()
which stands for column bind
rbind(c(1, 2, 3), c(4, 5, 6)) # Bind rows to form a matrix
cbind(c(1, 2), c(3, 4), c(5, 6)) # Bind columns to form a matrix
Vectors used to construct a matrix must be of the same length. If vectors of different lengths are provided, R will recycle the elements of the shorter vector(s)
cbind(c(1, 2), c(3))
As with atomic vectors, all the elements of a matrix must be of the same type. If a matrix is constructed with elements of different types, R will coarse the elements to the same type
cbind(c("1", "2"), c(3, 4))
The functions cbind()
and rbind()
can also be used to extend existing matrices. Again, keep in mind the length and type of the “added” vectors
square_matrix <- matrix(c(1, 1, 1, 1), nrow = 2)
cbind(square_matrix, c(2, 2))
square_matrix <- matrix(c(1, 1, 1, 1), nrow = 2)
cbind(square_matrix, c("2")) # repeat + coarse
The diagonal of a matrix can be obtained by using the diag()
function
diag(matrix(c(1, 0, 0, 1), nrow = 2))
The diag() function can also be used to construct a diagonal matrix, such as the identify matrix
diag(x = 1, nrow = 3) # x is used to specify the value that is used to fill the diagonal
Properties of matrices
Recall that an atomic vector has three properties: (i) a type, (ii) a length and (iii) attributes. A matrix has the same three properties as an atomic vector but includes some unique attributes
Type: To verify how a matrix is internally stored, the typeof()
function can be used
typeof(matrix(c(1, 1, 1, 1), nrow = 2))
Internally R simply stores the matrix defined in the example above as “double”
Length The number of elements stored in a matrix can be determined with the function length()
length(matrix(c(1, 1, 1, 1), nrow = 2))
Attributes: By default, an atomic vector has no attributes assigned to it
my_vector <- 1:20
attributes(my_vector)
To transform an atomic vector into a matrix or array, the dimension dim() attribute of the vector can be set
dim(my_vector) <- c(4,5) # rearrange the vector into 4 rows and 5 columns
my_vector
To verify that the “transformed” vector is indeed a matrix the attribute class can be checked
my_vector <- 1:20
dim(my_vector) <- c(4, 5) # assign values to the dim() attribute of my_vector
class(my_vector)
The class
attribute helps us understand the type of R object, while typeof specifies how the object is stored internally
typeof(my_vector)
Like vectors, names can be assigned to each element of a matrix. However, it is much more common to assign names to the rows and columns of a matrix. The function rownames() and colnames() can be used to set the names of the rows and columns of a matrix
new_hope <- c(461, 314)
empire_strikes <- c(291, 248)
return_jedi <- c(301, 166)
box_office <- rbind(new_hope, empire_strikes, return_jedi)
titles <- c("A New Hope", "The Empire Strikes Back", "Return of the Jedi")
rownames(box_office) <- titles
region <- c("US", "non-US")
colnames(box_office) <- region
box_office
Assume we have three numeric vectors of length two, where each numeric vector represents a Star Wars movie and the elements the US box office revenue and the Non-US box office revenue. We can combine the three numeric vectors using rbind()
into a matrix box_office
new_hope <- c(461, 314)
empire_strikes <- c(291, 248)
return_jedi <- c(301, 166)
box_office <- rbind(new_hope, empire_strikes, return_jedi)
Using rbind()
will automatically assign the names of the vectors to the rownames attribute of the matrix box_office
new_hope <- c(461, 314)
empire_strikes <- c(291, 248)
return_jedi <- c(301, 166)
box_office <- rbind(new_hope, empire_strikes, return_jedi)
rownames(box_office)
To view the attributes assigned to an object the function attributes()
can be used
my_matrix <- matrix(c(1, 1))
attributes(my_matrix)
my_matrix <- matrix(c(1, 1))
rownames(my_matrix) <- c("row 1", "row 2" # adds the attribute dimnames
attributes(my_matrix)
Matrix calculations
Math operations are performed on matrices entry-wise given that the matrices are of the same dimensions
(matrix(c(1, 1, 1, 1), nrow = 2) + matrix(c(1, 1, 1, 1), nrow = 2)) * 2
However, using matrices of different dimensions will result in an error
matrix(c(1, 1, 1, 1), nrow = 2) + matrix(c(1, 1, 1, 1, 1, 1), nrow = 2)
Actual matrix multiplication (not-entry wise) is performed using the %*% operator
matrix(c(1, 1, 1, 1), nrow = 2) %*% matrix(c(2, 2), nrow = 2)
When matrix multiplication is performed with matrices of incompatible dimensions an error will be generate
matrix(c(1, 1, 1, 1), nrow = 2) %*% matrix(c(2, 2), ncol = 2)
The transpose of a matrix can be computed using the function t()
for transpose. Recall that the transpose of a matrix is
test_matrix = matrix(1:6, nrow = 2)
t(test_matrix)
To invert a matrix, use the function solve()
my_matrix <- matrix(c(22, 49, 28, 64), nrow = 2)
my_matrix_inv <- solve(my_matrix)
my_matrix_inv
Note however if we try to check whether the invert hold, the off-diagonals of the result are not exactly zero
my_matrix %*% my_matrix_inv
Subsetting matrices
A matrix can be subset in a similar way as a vector. Instead of vectors, the index [rows, columns] is used:
char_matrix <- matrix(letters, nrow = 2, ncol = 2)
char_matrix
char_matrix <- matrix(letters, nrow = 2, ncol = 2)
char_matrix[2, 2] # subset row 2 column 2
char_matrix <- matrix(letters, nrow = 2, ncol = 2)
char_matrix[, 2] # keep all rows, subset column 2
If the columns or rows of a matrix has names assigned to it, the rownames or colnames can be used to subset a matrix
test_matrix <- matrix(1:6, nrow = 3)
row.names(test_matrix) <- c("a", "b", "c")
test_matrix[c("a","c"),]
In the previous example, we saw that R returns a vector if a matrix ends up having just one row or column after subsetting
char_matrix <- matrix(letters, nrow = 2, ncol = 2)
char_matrix[, 2]
To prevent the behaviour the optional argument drop should be set to FALSE
char_matrix <- matrix(letters, nrow = 2, ncol = 2)
char_matrix[, 2, drop = FALSE]
Lists
A list in R can be used to store objects of multiple types. Storing objects of multiple types makes lists extremely versatile. A good analogy is to think of a list as your to-do list: items in your to-do list will likely differ in length, characteristics, and the type of activity that has to be done. For example, we can use a list to store a single value, a numeric vector and a matrix
The results of models are often returned as a list; therefore it is critical to understand how to work with lists
Creating lists
To create a list the function list()
can be used
list(5, c(1:4), matrix(c(1:4), nrow = 2))
- the value
5
is stored in the first index of the list - the numeric vector
c(1:4)
is stored in the second index of the list - the matrix
matrix(c(1:4), nrow = 2)
is stored in the third index of the list
list can contain different data objects of different data types
my_list <- list(TRUE, c(1:4), matrix(c("a", "b", "c", "d"), nrow = 2))
str(my_list)
In the above example, we create a list containing (i) an atomic vector of type logical with a single element, (ii) an atomic vector of type integer with four elements and (iii) a matrix of type character with four elements
Lists are very versatile data structures capable of storing any data object. For example, a list can be used to store a list
child_list <- list(TRUE, c(1:4), matrix(c("a", "b", "c", "d"), nrow = 2))
parent_list <- list(child_list, c(1:4))
str(parent_list)
In the above example, we create a list named parent_list
which stores a list in the first element and an atomic vector in the second element
Extending lists
If we try to add an list to an existing list using the function list(), R will add the existing list as an element to the current list
l1 <- list(1:3, "a", c(TRUE, FALSE, TRUE))
l2 <- list(l1, c(2.5, 4.2))
str(l2)
To extend a list with a different list, simply use the function append()
l1 <- list(1:3, "a", c(TRUE, FALSE, TRUE))
l2 <- append(l1, c(2.5, 4.2))
str(l2)
Properties of lists
Recall that an atomic vector has three properties: (i) a type, (ii) a length and (iii) attributes. A list has the same three properties as an atomic vector but includes some unique attributes
Type: The data type of a list is a list. Recall that a list is a type of vector that R internally store as the data type list
my_list = list(first_thing = 55, second_thing = c(60, 42))
typeof(my_list)
Length: The length of a list is simply the number of elements in the list
my_list = list(first_thing = 55, second_thing = c(60, 42))
length(my_list)
Attributes: Like atomic vectors, names can also be assigned to the elements of a list. Names can be assigned to the elements of a list when the list is created or by using the function names()
my_list = list(first_thing = 55, second_thing = c(60,42), third_thing = c("a", "b"))
names(my_list)
my_list = list(first_thing = 55, second_thing = c(60,42), third_thing = c("a", "b"))
names(my_list)
Subsetting lists
There are three subsetting operators [[
, [
and $
that can be used to subset a list. When thinking of how to subset a list it is often useful to think of a list as a train where each carriage of the train is an element of the list. Since the elements of a list can be named, the carriages of the train can be assigned names
“If list x is a train (list) carrying objects, then x[[2]]
is the object in car 2; x[c(1:2)]
is a train (list) of cars 1-2” - @RLangTip
In other words, single brackets []
is used to select one or more elements from a list as a list, while double brackets [[]]
are used to select the elements of a list
To obtain the actual elements stored in a list use double brackets [[ \| ]]
my_list <- list(5, c(1:4), matrix(c(1:4), nrow = 2))
my_list[[2]]
When single brackets [ ]
is used to access list elements a list is returned instead
my_list <- list(5, c(1:4), matrix(c(1:4), nrow = 2))
my_list[1]
Single brackets are useful to obtain multiple elements stored in the list, since double brackets cannot be used to select multiple elements from the list
Given a list with names, the names can be used to select elements of a list either by (i) using the name and double brackets [[ ]]
or using a $
followed by the name
my_list = list(first_thing = 55, second_thing = c(60,42), third_thing = c("a", "b"))
my_list[["first_thing"]] # or try my_list$first_thing
Subsetting can be used to extend a list. For example, we can assign an object to an element of a list that does not exist
l1 <- list(c(1,1))
l1[3] <- TRUE
str(l1)
or by adding a new named element
l1 <- list(c(1,1))
l1$"New element" <- TRUE
str(l1)
Dataframes
A data frame stores data in a list of equal length vectors. Each element of the list can be thought of as a column and the length of each element of the list is the number of rows. Since a data frame consist of a list, a data frame can store different types of data i.e. numeric, logical, character … in each column
Creating a data frame
In most cases, we will create a data frame by importing a data set from an external source. However, data frames can also be created explicitly using the function data.frame()
. Run the code below to create a data frame with three rows and four columns
df <- data.frame(col1 = 1:3,
col2 = c("this", "is", "text"),
col3 = c(TRUE, FALSE, TRUE),
col4 = c(2.5, 4.2, pi))
str(df)
- If you do not provide names for the columns of a data frame, R will assign custom column names but it is not recommended.
- In addition, avoid using duplicate column names
The elements of a data frame should be of equal length. Try running the code below; you should get an error that the number of rows differs
df <- data.frame(col1 = c(1, 2, 3), col2 = c(1, 2))
R will only perform recycling when an atomic vector of length 1 is provided but is best avoided. Note that R will automatically recycle the value provided for column 2
data.frame(col1 = c(1,2,3), col2 = c(1))
Apart from atomic vectors, matrices and lists can be used to construct a data frame, but when lists are used the elements must be of the equal length.
data.frame(matrix(c(1, 2, 3, 4), nrow = 2, dimnames = list(NULL, c("a", "b"))))
data.frame(list("col1" = c(1, 2, 3), "col2" = c(1, 2, 3)))
Extending data frames
Columns: Columns can be added to a data frame using the function cbind()
. Note that when using cbind()
one of the objects being combined must be a data frame otherwise a matrix is created
df <- data.frame(col1 = c(1, 2), col2 = c(3, 4))
cbind(df, col3 = c(5, 6))
Rows: Rows can be added to a data frame using the function rbind()
. However, when adding rows to a data frame the data type of columns can change. R will coarse all values to a compatible data type
df <- data.frame(col1 = c(1, 2), col2 = c(3, 4))
df <- rbind(df, c("1", "2"))
str(df)
Properties of data frames
An atomic vector has three properties: (i) a type, (ii) a length and (iii) attributes. A data frame has the same three properties as an atomic vector but includes some unique attributes
Length: The length of a data frame is the number of columns of the data frame
df <- data.frame(col1 = c(1, 2), col2 = c(3, 4))
length(df)
Data Type: The data type of a data frame is a list. R stores a data frame as a list with some special conditions
df <- data.frame(col1 = c(1, 2), col2 = c(3, 4))
typeof(df)
Attributes: Data frames can have additional attributes such as row names and column names
df <- data.frame(col1 = c(1, 2), col2 = c(3, 4))
attributes(df)
Row names can be added or changed using the function rownames()
. Column names can be changed using the function colnames()
or the function names()
Subsetting data frames
If you subset a data frame using a single index [columns], a data frame behave like a list and return the selected columns with all rows as a data frame
df <- data.frame(col1 = c(1, 1), col2 = c(2, 2), col3 = c(3, 3))
df[1]
df <- data.frame(col1 = c(1, 1), col2 = c(2, 2), col3 = c(3, 3))
df[c("col1", "col3")]
Since the subsetting of a data frame behave like lists, double brackets or a $ followed by the name of a column can be used to select the elements of the data frame. As with list, the result is returned as the most simplified data type i.e. an atomic vector and not a data frame
df <- data.frame(col1 = c(1, 1), col2 = c(2, 2), col3 = c(3, 3))
df[["col1"]]
df <- data.frame(col1 = c(1, 1), col2 = c(2, 2), col3 = c(3, 3))
df$col1
If you subset a data frame using two vectors i.e. [rows, columns], a data frame behaves like a matrix and return the selected rows and columns as the most simplified data structure by default
df <- data.frame(col1 = c(1, 1), col2 = c(2, 2), col3 = c(3, 3))
df[1, c(1, 2)] # return row 1 and column 1 and 2
df <- data.frame(col1 = c(1, 1), col2 = c(2, 2), col3 = c(3, 3))
df[, 1] # returns a vector
df <- data.frame(col1 = c(1, 1), col2 = c(2, 2), col3 = c(3, 3))
df[, 1, drop = FALSE] # returns a data frame
The rows of a data frame can also be selected using logical vectors. For example, using the built-in data frame cars
cars[cars$speed == 24,]
cars[cars$speed == 24, "dist", drop = FALSE]