# R Data Structures

## R Data Structures

When we perform data analysis, we typically use data in the form of an **analytic base table**. In an **analytic base table**, the **rows** represent different **observations**, and the **variables** reported for each observation define the **columns**.

Up to now, we have discussed how we can store one value in an **object**. For example:

```
length <- 5
width <- 10
area <- length * width
area
```

However, storing a single value in an **object** will only get us so far. We need ways to **import** data from different sources into R, to **store** the data in R and to **manipulate** the data stored in R.

Tutorial 1 will focus on the different **data types** available in R.

Working with data in R involves (i) selecting a **data structure** to hold your data and (ii) entering or importing the data into that **data structure**. R has a wide variety of **objects** available for holding data, including **atomic vectors**, **matrices**, **arrays**, **lists** and **data frames**. When an **object** is created with a single value, R creates an **atomic vector** of length one, since R does not include a **scalar** data type.
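
The absence of a scalar type can be checked directly. The short sketch below uses a hypothetical object name, `single_value`, and the base R function `is.vector()`, which is not otherwise covered in this section:

```r
single_value <- 5         # looks like a scalar ...
is.vector(single_value)   # ... but R reports that it is a vector
length(single_value)      # a vector with exactly one element
```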

## Data frame

The R equivalent of the **analytic base table** is the **data frame**.

For now, we will avoid the technical details of a **data frame** and rather focus on the high-level concepts. R includes several built-in data sets, some of which are stored as **data frames**. To load an example data set, the **function** `data()` can be used.

```
data(mtcars) # loads the data frame into the global environment
```

We can view the first 6 observations of the **data frame** `mtcars` using the **function** `head()`, or the last 6 observations using the **function** `tail()`. Try modifying the code below to display only the first four observations of the **data frame** `mtcars`.

```
head(x = mtcars, n = 6) # displays the first six observations
```

- Memo
`head(x = mtcars, n = 4)`

The **function** `str()` can be used to view the **structure** of an **object**.

```
str(mtcars)
```

Once we have a **data frame** it becomes easy to calculate summary statistics, create plots or even analytical models.

**Summary statistics**: The function `summary()` can be used to display the summary statistics of a **data frame**.

```
summary(mtcars)
```

**Plots**: Given a **data frame** various plots can be created, for example a histogram of the variable miles per gallon (mpg)

```
hist(mtcars$mpg)
```

**Analytical model**: Create a linear regression model using the transmission (variable am) as input and miles per gallon (mpg) as output

```
lm(mpg ~ am, data = mtcars)
```

## Atomic vectors

### Introduction to vectors

The simplest and most common data structure in R is called a **vector**. There are two types of **vectors** in R: **atomic vectors** and **lists**. The main difference between **atomic vectors** and **lists** is that **atomic vectors** can only store data of the same type, while **lists** can be used to store data of different types. For example, an atomic vector can be used to store the grades achieved by students in this course

Various **functions** can be used to create **vectors** in R. The most straightforward way to create a **vector** is with the **function** `c()`, which stands for **combine** or **concatenate**. For example, a **vector** with four elements can be created and assigned to the object `first_vector`:

```
first_vector <- c(1, 3, 7, -0.5) # create a vector with four elements
```

Try creating a **vector** that contains the values 1,2,3,4:

- Memo
`c(1,2,3,4)`

The **function** `c()` can also be used to combine **vectors** to form a new **vector**:

```
c(c(1, 2, 3), c(4, 5, 6)) # combine the vector c(1,2,3) and c(4,5,6)
```

Given a **vector**, it is possible to create a new **vector** that contains repetitions of the elements of the original **vector**. To repeat the **elements** of a **vector**, the function `rep()`, which stands for **repetition**, can be used:

```
rep(c(1, 2), times = 3) # repeat the vector c(1,2) three times
```

The above example repeats the **vector** `c(1,2)` three times. Elements of a vector can also be repeated individually using the `each` **argument** of the **function** `rep()`.

Modify the code below to produce the **vector** `c(1,1,1,2,2,2)` using the `each` **argument**:

```
rep(c(1, 2))
```

- Memo
`rep(c(1, 2), each = 3)`

Values can also be passed to both the `each` and `times` **arguments** of the `rep()` **function**. When values are passed to both arguments, the `each` operation is performed first.

Create the vector `c(1,1,1,2,2,2,1,1,1,2,2,2)` by filling in the blanks:

```
rep(c(1,2), times = ____, each = ____)
```

- Memo
`rep(c(1,2), times = 2, each = 3)`

By default, R includes the **character vectors** `letters` and `LETTERS`. The **vector** `letters` contains the 26 lower-case letters of the Roman alphabet, while the **vector** `LETTERS` contains the 26 upper-case letters:

```
letters
```

The **functions** `head()` and `tail()` can be used to preview the values stored in a **vector**:

```
head(letters)
```

```
tail(letters)
```

Display the last four elements of the **vector** `letters` by filling in the blanks:

```
tail(letters, n = ___)
```

- Memo
`tail(letters, n = 4)`

### Types of atomic vectors

There are six types of **atomic vectors** in R: **logical**, **integer**, **double**, **character**, **complex** and **raw**. **Integer** and **double** **vectors** are collectively known as **numeric vectors**. In this course we will only focus on **logical**, **integer**, **double** and **character** vectors, also referred to as the **primary** types of **atomic vectors**.

Each **atomic vector** in R uses a special syntax to define the **elements** of the **vector**

**Logical vectors** contain the values (i) `TRUE` or `T` and (ii) `FALSE` or `F`:

```
logical_vector <- c(T, F, TRUE, FALSE) # Note TRUE can be abbreviated as T
```

**Character vectors** contain **elements** of type string. **Strings** are values surrounded by single quotation marks `''` or double quotation marks `""`:

```
character_vector <- c("Andrew", 'Mike', 'John', 'Sara')
```

**Double vectors** can be specified in decimal, scientific or hexadecimal form. **Double vectors** can contain three special values: `Inf` (infinity), `-Inf` (negative infinity) and `NaN` (not a number):

```
double_vector <- c(1.2, 1.2e3, 0xcafe, Inf, NaN)
```

**Integer vectors** are defined similarly to **double vectors**, but each element must be followed by `L` and cannot contain a fractional part:

```
int_vector <- c(1L, 1.2e3L, 0xcafeL)
```

**Double vectors** and **integer vectors** are both **numeric vectors**

### Properties of vectors

Each **vector** has three properties: (1) a **type** (2) a **length** and (3) **attributes**:

**Type**: The **type** of a **vector**, that is, how the **object** is stored internally, can be checked using the `typeof()` **function**. The `typeof()` **function** determines the internal type or storage mode of any R object:

```
typeof(letters) # letters is a builtin character vector
```
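
`typeof()` also makes the difference between **integer** and **double** vectors visible. The sketch below uses two hypothetical objects, `int_value` and `dbl_value`, together with the base R function `is.numeric()`, which treats both types as numeric:

```r
int_value <- 3L          # the L suffix requests an integer
dbl_value <- 3           # without the suffix R stores a double
typeof(int_value)
typeof(dbl_value)
is.numeric(int_value)    # both types count as numeric
is.numeric(dbl_value)
int_value == dbl_value   # the stored values still compare as equal
```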

**Length**: The number of **elements** stored in a **vector** can be determined with the **function** `length()`

```
length(letters)
```

**Attributes**: An **attribute** is a piece of information that can be attached to an **atomic vector** or any R **object**. You can think of **attributes** as metadata: a convenient place to store information associated with an **object**. By default an **atomic vector** does not have any **attributes** assigned to it. To display the **attributes** of an **object**, the `attributes()` **function** can be used.

```
my_vector <- 1:10
attributes(my_vector)
```

The **object** `my_vector` does not have any **attributes** assigned to it. However, this does not mean that **attributes** cannot be assigned to an **object**. The most common attributes to give an **atomic vector** are **names**, **dimensions** and **classes**. We will only discuss **names** at this point.

By default an **atomic vector** will not have a **names** **attribute** assigned to it. To check whether the **names attribute** is assigned to a **vector**, the **function** `names()` can be used:

```
weekly_rainfall = c(10, 12, 0, 4, 0)
names(weekly_rainfall)
```

**Names** can be assigned to a **vector** either when a **vector** is created or using the **function** `names()`

```
weekly_rainfall = c("Mo" = 10, "Tu" = 12, "We" = 0, "Th" = 4, "Fr" = 0) # option 1: names given at creation
names(weekly_rainfall) <- c("Mo", "Tu", "We", "Th", "Fr") # option 2: names set with names()
attributes(weekly_rainfall)
```

**Names** will not affect the actual values of the **vector**, nor will the **names** be affected when the **elements** of the **vector** are manipulated
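
A small sketch of this behaviour, reusing the rainfall values from above. The object name `doubled` is hypothetical, and `unname()` is a base R function not otherwise covered here:

```r
weekly_rainfall <- c(Mo = 10, Tu = 12, We = 0, Th = 4, Fr = 0)
doubled <- weekly_rainfall * 2   # arithmetic acts on the values only
names(doubled)                   # the names are carried over unchanged
sum(weekly_rainfall)             # summaries ignore the names
unname(weekly_rainfall)          # drops the names, keeps the values
```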

Recall that an **atomic vector** can only contain **elements** of the same type. If you try to create a **vector** with elements of different types, R will automatically **coerce** the values to a compatible type in the order: `logical » integer » double » character`

#### Practice questions

- R stores the vector `c(TRUE, 1L)` as a ______ vector.

- a) logical
- b) integer
- c) double
- d) character

- Memo
The correct answer is b) integer

- R stores the vector `c(TRUE, 1)` as a _____ vector.

- a) logical
- b) integer
- c) double
- d) character

- Memo
The correct answer is c) double

- R stores the vector `c('a', 1)` as a _____ vector.

- a) logical
- b) integer
- c) double
- d) character

- Memo
The correct answer is d) character
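
Answers like these can be checked by wrapping the vector in `typeof()`. The following sketch applies the same check to a few other combinations:

```r
typeof(c(FALSE, 2L))   # logical combined with integer
typeof(c(2L, 2.5))     # integer combined with double
typeof(c(2.5, "a"))    # double combined with character
c(TRUE, 2.5, "a")      # every element is converted to a string
```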

### Logical vectors

**Logical vectors** can contain the values `TRUE`, `FALSE` and `NA` (for “not available”). Logical vectors are typically the product of a logical test, for example:

```
c(1, 2, 3) == 1
```

The example above evaluates whether each element in the vector `c(1, 2, 3)` is equal to 1 using the comparison operator `==`. Recall that the `=` operator is reserved for assignment; `==` is used to test for equality.

R includes all the standard comparison operators: `>`, `>=`, `<`, `<=`, `!=` (not equal) and `==` (equal).

```
c(1, 2, 3) == 1
```

To test whether two objects are exactly equal, the function `identical()` can be used:

```
v1 <- c(4, 4, 9, 12)
v2 <- c(4, 4, 9, 13)
identical(v1, v2) # returns FALSE
```

```
v1 <- c(4, 4, 9, 12)
v2 <- c(4, 4, 9, 12)
identical(v1, v2) # returns TRUE
```

Sometimes you wish to test for “nearly equal”. The function `all.equal()` tests for equality within a tolerance of about 1.5e-8:

```
v1 <- c(4.00000005, 4.00000008)
v2 <- c(4.00000002, 4.00000006)
all.equal(v1, v2)
```

If the difference is greater than the tolerance, a message reporting the mean relative difference is returned instead of `TRUE`:

```
v1 <- c(4.0005, 4.0008)
v2 <- c(4.0002, 4.0006)
all.equal(v1, v2)
```
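
Because `all.equal()` returns a message rather than `FALSE` when the values differ, it is commonly wrapped in the base R function `isTRUE()` when a single logical value is needed. A minimal sketch, with `sum_of_tenths` as a hypothetical object name:

```r
sum_of_tenths <- 0.1 + 0.2
sum_of_tenths == 0.3                    # FALSE: exact comparison sees the rounding error
isTRUE(all.equal(sum_of_tenths, 0.3))   # TRUE: equal within the tolerance
isTRUE(all.equal(4.0005, 4.0002))       # FALSE, instead of a message
```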

To evaluate more than one logical expression, the AND operator `&` or the OR operator `|` can be used. For the AND operator `&`, both conditions must be `TRUE` for the result to be `TRUE`:

```
(3 > 5) & (4 == 4)
```

For the OR operator `|`, at least one condition must be `TRUE` for the result to be `TRUE`:

```
(3 > 5) | (4 == 4)
```

Lastly, a condition can be negated using the NOT operator `!`:

```
!(3 > 5)
```

Consider the following code

```
result <- ((111 >= 111) | !(TRUE)) & ((4 + 1) == 5) # result = TRUE
```

The operator `%in%` avoids excessive use of the OR operator `|` by testing whether each element of one vector occurs in another vector:

```
c(1, 2, 3, 4) %in% c(1, 2)
```
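
To make the connection with the OR operator explicit, the sketch below writes the same test both ways. The object names `x`, `with_or` and `with_in` are hypothetical:

```r
x <- c(1, 2, 3, 4)
with_or <- (x == 1) | (x == 2)   # one comparison per candidate value
with_in <- x %in% c(1, 2)        # the same test written once
identical(with_or, with_in)      # both give the same logical vector
```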

The function `which()` can be used to return the indices of elements that evaluate to `TRUE`:

```
which(c(1, 2, 3, 4) %in% c(1, 2))
```

Math operations can also be performed with logical vectors, since `TRUE` is treated as 1 and `FALSE` as 0:

```
(c(1,0,1) == 1) + 1
```

A typical use case is determining the proportion of elements of a **logical vector** that are `TRUE`:

```
mean(c(1, 2, 3) == 1)
```
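
In the same way, `sum()` counts the number of `TRUE` values. A small sketch with made-up marks, where `scores` and `passed` are hypothetical names:

```r
scores <- c(45, 72, 58, 90, 33)   # hypothetical marks
passed <- scores >= 50            # logical vector, one value per mark
sum(passed)                       # number of TRUE values
mean(passed)                      # proportion of TRUE values
```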

### Character vectors

**Character vectors** store data as strings (“text”) and are typically used to store information such as names, addresses and IDs:

```
first_names <- c("Andrew", "Beth", "Carly", "Dan")
```

String functions can be applied to character vectors to determine useful properties, such as the length of each string:

```
nchar(first_names)
```

Note that R uses a global string pool. This means that a unique string is only stored in memory once, reducing the amount of memory required to store duplicate strings

### Numeric vectors

**Numeric vectors** can be created not only with the function `c()` but also with the function `seq()`, which stands for sequence. By default, the function `seq()` creates a vector that starts at the value passed to the `from` argument and increases in increments of 1 up to the value passed to the `to` argument. For example, a numeric vector can be created starting from 3 and ending at 10:

```
seq(from = 3, to = 10)
```

The same sequence can be written in shorthand using the `:` operator:

```
3:10
```

Where possible, the `seq()` function returns an integer vector, for example when only whole-number `from` and `to` values are given; otherwise a double vector is created.

To create a vector using `seq()` with increments other than 1, a value can be passed to the optional `by` argument of the function. For example, a vector can be created that starts at 10 and goes down to 0 in increments of -2:

```
seq(from = 10, to = 0, by = -2)
```

In some cases you want to generate a numeric vector with a specific number of equally spaced elements between two numbers. To do so, a value can be passed to the `length.out` argument of the `seq()` function. For example, a vector of length 10 can be generated as follows:

```
seq(from = 3, to = 10, length.out = 10)
```

When performing arithmetic operations on numeric vectors, R performs element-wise operations by default:

```
c(1, 2, 3) + c(4, 5, 6) # c(1 + 4, 2 + 5, 3 + 6)
c(1, 2, 3, 4)^2 # to the power of 2
```

When vectors of different lengths are used, R will recycle the shorter vector by repeating the vector to match the longer vector

```
c(1, 1, 1, 1) * c(1, 2) # c(1*1, 1*2, 1*1, 1*2)
```

When the length of the longer vector is not a multiple of the length of the shorter vector, R will still perform **recycling**, but a warning will be shown:

```
c(1, 1, 1, 1) * c(1, 2, 3) # c(1*1, 1*2, 1*3, 1*1)
```
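
The most common case of recycling is an operation between a vector and a single value. A minimal sketch, with `prices`, `doubled` and `above_15` as hypothetical names:

```r
prices <- c(10, 20, 30)     # hypothetical values
doubled <- prices * 2       # 2 is recycled, as in prices * c(2, 2, 2)
above_15 <- prices > 15     # comparisons recycle in the same way
doubled
above_15
```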

Some functions operate on an entire vector as opposed to working **element-wise**:

```
sum(c(1,2,3,4))
```

```
max(c(1,2,3,4))
```

Other useful functions include `min()`, `median()`, `sd()` and `var()`.

Sometimes a calculation will produce `Inf` (positive infinity), `-Inf` (negative infinity) or `NaN` (not a number) as a result:

```
c(-2, -1, 0, 1, 2)/0
```

To determine whether a value is `Inf` or `-Inf`, the function `is.infinite()` can be used:

```
is.infinite(c(-2, -1, 0, 1, 2)/0)
```

To determine whether a value is `NaN`, the function `is.nan()` can be used:

```
is.nan(c(-2, -1, 0, 1, 2)/0)
```

We can combine functions to perform common operations on numeric vectors. For instance, some models expect all values to be within the range 0 to 1. To convert values to this range, min-max normalisation can be used: $X_{normalised} = \frac{X - X_{min}}{X_{max} - X_{min}}$

```
x <- c(0, 2, 55, 23, 20, 48, 76)
(x - min(x)) / (max(x) - min(x))
```
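
If the same calculation is needed more than once, it could be wrapped in a function. The sketch below defines a hypothetical helper, `normalise()`, and assumes the input contains no missing values and at least two distinct values (otherwise the denominator is zero):

```r
# Hypothetical helper: rescale a numeric vector to the range 0 to 1.
normalise <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

x <- c(0, 2, 55, 23, 20, 48, 76)
x_normalised <- normalise(x)
range(x_normalised)   # smallest and largest value after rescaling
```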

### Subsetting vectors

Elements of a vector can be selected (subset) in several ways. To illustrate subsetting, consider the vector:

```
first_names <- c("Andrew", "Beth", "Carly", "Dan")
```

**Option 1** Passing a single index or vector of entries to keep using `[ ]`

```
first_names <- c("Andrew", "Beth", "Carly", "Dan")
first_names[c(1,4)]
```

**Option 2** Passing a single index or vector of entries to drop using `[-]`

```
first_names <- c("Andrew", "Beth", "Carly", "Dan")
first_names[-c(1,4)] # or first_names[c(-1, -4)]
```

**Option 3** Passing a logical vector of entries to keep (TRUE) and entries to drop (FALSE) using `[]`

```
first_names <- c("Andrew", "Beth", "Carly", "Dan")
first_names[nchar(first_names) > 4]
```

Note that if the logical vector passed is of a different length than the vector to be subset, recycling will be applied

```
first_names <- c("Andrew", "Beth", "Carly", "Dan")
first_names[c(TRUE, FALSE)]
```

**Option 4** Names can be assigned to a vector when the vector is created or using the function `names()`. Once names are defined, they can be used to subset a vector:

```
weekly_rainfall = c("Mo" = 10, "Tu" = 12, "We" = 0, "Th" = 4, "Fr" = 0) # option 1
names(weekly_rainfall) <- c("Mo", "Tu", "We", "Th", "Fr") # option 2
weekly_rainfall
weekly_rainfall[c("Mo", "Tu")]
```

### Missing values

Most data sets will contain missing values. R uses the encoding `NA` (not available), without quotes, to represent missing values:

```
vector_with_missing <- c(1, 2, 3, NA, 4, 5, 6, NA)
```

When you try to apply operations to vectors with `NA` values, most functions will return an error or simply `NA`:

```
mean(vector_with_missing)
```

In some functions, the optional argument `na.rm = TRUE` can be used to ignore the `NA` values in calculations:

```
mean(vector_with_missing, na.rm = TRUE)
```

Most operations that involve missing values will simply return a missing value. After all, R has “no idea” what the missing value represents

```
NA > 3
```

Similarly, testing whether two missing values are equal will return `NA`:

```
NA == NA # R has no idea if the two missing values are the same value
```

When the result does not depend on the actual value represented by `NA`, R can return a non-NA output:

```
NA^0 # returns 1
NA | TRUE # return TRUE
NA & FALSE # return FALSE
```

To test whether a specific element of a vector is missing, the `is.na()` function can be used:

```
vector_with_missing <- c(1, 2, 3, NA, 4, 5, 6, NA)
is.na(vector_with_missing)
```

The `is.na()` function can also be used to create a subset of a vector that excludes missing values:

```
vector_with_missing[!is.na(vector_with_missing)]
```
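
Since `is.na()` returns a logical vector, it can also be combined with `sum()` and `mean()` to count missing values, or used on the left-hand side of an assignment to replace them. The sketch below fills the gaps with the mean of the observed values, which is only one of many possible choices; `filled` is a hypothetical name:

```r
vector_with_missing <- c(1, 2, 3, NA, 4, 5, 6, NA)
sum(is.na(vector_with_missing))    # number of missing values
mean(is.na(vector_with_missing))   # proportion of missing values

# Replace each missing value with the mean of the observed values
filled <- vector_with_missing
filled[is.na(filled)] <- mean(filled, na.rm = TRUE)
filled
```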

## Matrices and Arrays

A **matrix** extends the idea of vectors into two dimensions: rows and columns. A simple way of thinking of a matrix is as a rearrangement of the values of a vector into two dimensions, where all rows of the matrix have the same length.

Elements of a vector can also be arranged in more than two dimensions, known as an **array**. For example, a colour image is typically represented as a three-dimensional array. As with atomic vectors, all the elements of an array and matrix must be of the same type.

### Creating matrices

A matrix can be constructed directly using the `matrix()` function. The `byrow` argument of the `matrix()` function determines whether the data are filled in by row or by column (the default):

```
matrix(1:9, nrow = 3) # create a matrix with 3 rows
```

```
matrix(1:9, nrow = 3, ncol = 3, byrow = TRUE) # create a matrix with 3 rows, filled by row
```

R will try to construct a matrix even if it means that elements of the vector used to construct the matrix are repeated or dropped. R will not necessarily show a warning message when this happens:

```
matrix(1:5, nrow = 2)
```

```
matrix(1:6, nrow = 2, ncol = 2)
```

A matrix can also be created by binding vectors together with the function `rbind()`, which stands for row bind, or `cbind()`, which stands for column bind:

```
rbind(c(1, 2, 3), c(4, 5, 6)) # Bind rows to form a matrix
```

```
cbind(c(1, 2), c(3, 4), c(5, 6)) # Bind columns to form a matrix
```

Vectors used to construct a matrix should be of the same length. If vectors of different lengths are provided, R will recycle the elements of the shorter vector(s):

```
cbind(c(1, 2), c(3))
```

As with atomic vectors, all the elements of a matrix must be of the same type. If a matrix is constructed with elements of different types, R will coerce the elements to the same type:

```
cbind(c("1", "2"), c(3, 4))
```

The functions `cbind()` and `rbind()` can also be used to extend existing matrices. Again, keep in mind the length and type of the “added” vectors:

```
square_matrix <- matrix(c(1, 1, 1, 1), nrow = 2)
cbind(square_matrix, c(2, 2))
```

```
square_matrix <- matrix(c(1, 1, 1, 1), nrow = 2)
cbind(square_matrix, c("2")) # recycle + coerce
```

The diagonal of a matrix can be obtained by using the `diag()` function:

```
diag(matrix(c(1, 0, 0, 1), nrow = 2))
```

The `diag()` function can also be used to construct a diagonal matrix, such as the identity matrix:

```
diag(x = 1, nrow = 3) # x is used to specify the value that is used to fill the diagonal
```

### Properties of matrices

Recall that an atomic vector has three properties: (i) a **type**, (ii) a **length** and (iii) **attributes**. A matrix has the same three properties as an atomic vector but includes some unique attributes

**Type**: To verify how a matrix is internally stored, the `typeof()` function can be used:

```
typeof(matrix(c(1, 1, 1, 1), nrow = 2))
```

Internally R simply stores the matrix defined in the example above as “double”

**Length** The number of elements stored in a matrix can be determined with the function `length()`

```
length(matrix(c(1, 1, 1, 1), nrow = 2))
```

**Attributes**: By default, an atomic vector has no attributes assigned to it

```
my_vector <- 1:20
attributes(my_vector)
```

To transform an atomic vector into a matrix or array, the dimension attribute of the vector can be set with `dim()`:

```
dim(my_vector) <- c(4,5) # rearrange the vector into 4 rows and 5 columns
my_vector
```

To verify that the “transformed” vector is indeed a matrix, its class can be checked with the function `class()`:

```
my_vector <- 1:20
dim(my_vector) <- c(4, 5) # assign values to the dim() attribute of my_vector
class(my_vector)
```

The class helps us understand what kind of R object we are working with, while `typeof()` specifies how the object is stored internally:

```
typeof(my_vector)
```
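
The same checks can be run on a matrix created with `matrix()`. The sketch below uses a hypothetical object `m` and the base R function `is.matrix()`, which is not otherwise covered in this section:

```r
m <- matrix(1:20, nrow = 4)
typeof(m)      # how the values are stored internally
is.matrix(m)   # whether the object is a matrix
dim(m)         # the dim attribute: rows first, then columns
```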

Like vectors, names can be assigned to each element of a matrix. However, it is much more common to assign names to the rows and columns of a matrix. The functions `rownames()` and `colnames()` can be used to set the names of the rows and columns of a matrix:

```
new_hope <- c(461, 314)
empire_strikes <- c(291, 248)
return_jedi <- c(301, 166)
box_office <- rbind(new_hope, empire_strikes, return_jedi)
titles <- c("A New Hope", "The Empire Strikes Back", "Return of the Jedi")
rownames(box_office) <- titles
region <- c("US", "non-US")
colnames(box_office) <- region
box_office
```

In the example above, each of the three numeric vectors of length two represents a Star Wars movie, and its elements hold the US and the non-US box office revenue. The three numeric vectors are combined using `rbind()` into the matrix `box_office`:

```
new_hope <- c(461, 314)
empire_strikes <- c(291, 248)
return_jedi <- c(301, 166)
box_office <- rbind(new_hope, empire_strikes, return_jedi)
```

Using `rbind()` will automatically assign the names of the vectors to the row names of the matrix `box_office`:

```
new_hope <- c(461, 314)
empire_strikes <- c(291, 248)
return_jedi <- c(301, 166)
box_office <- rbind(new_hope, empire_strikes, return_jedi)
rownames(box_office)
```

To view the attributes assigned to an object, the function `attributes()` can be used:

```
my_matrix <- matrix(c(1, 1))
attributes(my_matrix)
```

```
my_matrix <- matrix(c(1, 1))
rownames(my_matrix) <- c("row 1", "row 2") # adds the attribute dimnames
attributes(my_matrix)
```

### Matrix calculations

Math operations are performed on matrices entry-wise, provided that the matrices have the same dimensions:

```
(matrix(c(1, 1, 1, 1), nrow = 2) + matrix(c(1, 1, 1, 1), nrow = 2)) * 2
```

However, using matrices of different dimensions will result in an error

```
matrix(c(1, 1, 1, 1), nrow = 2) + matrix(c(1, 1, 1, 1, 1, 1), nrow = 2)
```

Actual matrix multiplication (not entry-wise) is performed using the `%*%` operator:

```
matrix(c(1, 1, 1, 1), nrow = 2) %*% matrix(c(2, 2), nrow = 2)
```

When matrix multiplication is performed with matrices of incompatible dimensions, an error will be generated:

```
matrix(c(1, 1, 1, 1), nrow = 2) %*% matrix(c(2, 2), ncol = 2)
```

The transpose of a matrix can be computed using the function `t()`. Recall that the transpose of a matrix is obtained by interchanging its rows and columns:

```
test_matrix = matrix(1:6, nrow = 2)
t(test_matrix)
```

To invert a matrix, use the function `solve()`

```
my_matrix <- matrix(c(22, 49, 28, 64), nrow = 2)
my_matrix_inv <- solve(my_matrix)
my_matrix_inv
```

Note, however, that if we check the inverse by multiplying it with the original matrix, the off-diagonal entries of the result are not exactly zero because of floating-point rounding:

```
my_matrix %*% my_matrix_inv
```
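
One way to confirm that the product is the identity matrix up to rounding error is to compare it with `diag(2)` using `all.equal()`, or to round the result before printing. A minimal sketch, where `product` is a hypothetical name:

```r
my_matrix <- matrix(c(22, 49, 28, 64), nrow = 2)
product <- my_matrix %*% solve(my_matrix)
round(product, digits = 10)           # rounding hides the tiny errors
isTRUE(all.equal(product, diag(2)))   # equal to the identity within tolerance
```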

### Subsetting matrices

A matrix can be subset in a similar way to a vector. Instead of a single index, the index pair `[rows, columns]` is used:

```
char_matrix <- matrix(letters, nrow = 2, ncol = 2)
char_matrix
```

```
char_matrix <- matrix(letters, nrow = 2, ncol = 2)
char_matrix[2, 2] # subset row 2 column 2
```

```
char_matrix <- matrix(letters, nrow = 2, ncol = 2)
char_matrix[, 2] # keep all rows, subset column 2
```

If the columns or rows of a matrix have names assigned to them, the row names or column names can be used to subset the matrix:

```
test_matrix <- matrix(1:6, nrow = 3)
row.names(test_matrix) <- c("a", "b", "c")
test_matrix[c("a","c"),]
```

In an earlier example, we saw that R returns a vector if a matrix ends up having just one row or column after subsetting:

```
char_matrix <- matrix(letters, nrow = 2, ncol = 2)
char_matrix[, 2]
```

To prevent this behaviour, the optional argument `drop` should be set to `FALSE`:

```
char_matrix <- matrix(letters, nrow = 2, ncol = 2)
char_matrix[, 2, drop = FALSE]
```

## Lists

A **list** in R can be used to store objects of multiple types. Storing objects of multiple types makes lists extremely versatile. A good analogy is to think of a list as your to-do list: items in your to-do list will likely differ in length, characteristics, and the type of activity that has to be done. For example, we can use a list to store a single value, a numeric vector and a matrix

The results of models are often returned as a list; therefore it is critical to understand how to work with lists
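
For instance, the linear regression model created earlier with `lm()` is stored as a list. The sketch below assumes the built-in `mtcars` data set and uses a hypothetical object name, `fit`; the `$` operator used to select an element by name is explained under subsetting lists:

```r
fit <- lm(mpg ~ am, data = mtcars)   # mtcars ships with R
typeof(fit)                          # the model object is stored as a list
names(fit)                           # names of the elements of the list
fit$coefficients                     # one element, selected by name
```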

### Creating lists

To create a list, the function `list()` can be used:

```
list(5, c(1:4), matrix(c(1:4), nrow = 2))
```

- the value `5` is stored in the first index of the list
- the numeric vector `c(1:4)` is stored in the second index of the list
- the matrix `matrix(c(1:4), nrow = 2)` is stored in the third index of the list

A list can contain different data objects of different data types:

```
my_list <- list(TRUE, c(1:4), matrix(c("a", "b", "c", "d"), nrow = 2))
str(my_list)
```

In the above example, we create a list containing (i) an atomic vector of type logical with a single element, (ii) an atomic vector of type integer with four elements and (iii) a matrix of type character with four elements

Lists are very versatile data structures capable of storing any data object. For example, a list can be used to store a list

```
child_list <- list(TRUE, c(1:4), matrix(c("a", "b", "c", "d"), nrow = 2))
parent_list <- list(child_list, c(1:4))
str(parent_list)
```

In the above example, we create a list named `parent_list`, which stores a list in the first element and an atomic vector in the second element.

### Extending lists

If we try to add a list to an existing list using the function `list()`, R will store the existing list as a single element of the new list:

```
l1 <- list(1:3, "a", c(TRUE, FALSE, TRUE))
l2 <- list(l1, c(2.5, 4.2))
str(l2)
```

To extend a list with the elements of a different list, use the function `append()`:

```
l1 <- list(1:3, "a", c(TRUE, FALSE, TRUE))
l2 <- append(l1, list(c(2.5, 4.2))) # wrap the vector in list() so it is added as one element
str(l2)
```

### Properties of lists

Recall that an atomic vector has three properties: (i) a **type**, (ii) a **length** and (iii) **attributes**. A list has the same three properties as an atomic vector but includes some unique attributes

**Type**: The type of a list is `list`. Recall that a list is a type of vector, which R stores internally with the type list:

```
my_list = list(first_thing = 55, second_thing = c(60, 42))
typeof(my_list)
```

**Length**: The length of a list is simply the number of elements in the list

```
my_list = list(first_thing = 55, second_thing = c(60, 42))
length(my_list)
```

**Attributes**: Like atomic vectors, names can also be assigned to the elements of a list. Names can be assigned to the elements of a list when the list is created or by using the function names()

```
my_list = list(first_thing = 55, second_thing = c(60,42), third_thing = c("a", "b"))
names(my_list)
```

```
my_list = list(55, c(60,42), c("a", "b"))
names(my_list) <- c("first_thing", "second_thing", "third_thing") # assign names after creation
names(my_list)
```

### Subsetting lists

There are three subsetting operators, `[[`, `[` and `$`, that can be used to subset a list. When thinking of how to subset a list, it is often useful to think of a list as a train where each carriage of the train is an element of the list. Since the elements of a list can be named, the carriages of the train can be assigned names.

“If list x is a train (list) carrying objects, then `x[[2]]` is the object in car 2; `x[c(1:2)]` is a train (list) of cars 1-2” - @RLangTip

In other words, single brackets `[ ]` are used to select one or more elements from a list as a list, while double brackets `[[ ]]` are used to select the contents of a single element.

To obtain the actual object stored in an element of a list, use double brackets `[[ ]]`:

```
my_list <- list(5, c(1:4), matrix(c(1:4), nrow = 2))
my_list[[2]]
```

When single brackets `[ ]` are used to access list elements, a list is returned instead:

```
my_list <- list(5, c(1:4), matrix(c(1:4), nrow = 2))
my_list[1]
```

Single brackets are useful to obtain multiple elements stored in the list, since double brackets cannot be used to select multiple elements from the list
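
The difference between the two operators shows up clearly when the type of the result is inspected. A short sketch using the list from the examples above:

```r
my_list <- list(5, c(1:4), matrix(c(1:4), nrow = 2))
typeof(my_list[2])         # "list": a shorter train
typeof(my_list[[2]])       # "integer": the contents of car 2
length(my_list[c(1, 3)])   # single brackets can select several elements at once
```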

Given a list with names, the names can be used to select elements of the list either by (i) using the name and double brackets `[[ ]]` or (ii) using a `$` followed by the name:

```
my_list = list(first_thing = 55, second_thing = c(60,42), third_thing = c("a", "b"))
my_list[["first_thing"]] # or try my_list$first_thing
```

Subsetting can be used to extend a list. For example, we can assign an object to an element of a list that does not exist

```
l1 <- list(c(1,1))
l1[3] <- TRUE
str(l1)
```

A list can also be extended by adding a new named element:

```
l1 <- list(c(1,1))
l1$"New element" <- TRUE
str(l1)
```

## Data frames

A **data frame** stores data in a **list** of **equal length vectors**. Each **element** of the **list** can be thought of as a column, and the **length** of each element of the **list** is the number of rows. Since a data frame consists of a **list**, a data frame can store a different type of data (**numeric**, **logical**, **character**, …) in each column.

### Creating a data frame

In most cases, we will create a data frame by importing a data set from an external source. However, data frames can also be created explicitly using the function `data.frame()`. Run the code below to create a data frame with three rows and four columns:

```
df <- data.frame(col1 = 1:3,
                 col2 = c("this", "is", "text"),
                 col3 = c(TRUE, FALSE, TRUE),
                 col4 = c(2.5, 4.2, pi))
str(df)
```
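
Because a data frame is a list of columns, list tools can be applied to it. The sketch below rebuilds the same data frame and inspects it column by column with the base R functions `sapply()`, `nrow()` and `ncol()`, none of which is otherwise covered in this section:

```r
df <- data.frame(col1 = 1:3,
                 col2 = c("this", "is", "text"),
                 col3 = c(TRUE, FALSE, TRUE),
                 col4 = c(2.5, 4.2, pi))
is.list(df)          # a data frame is stored as a list
sapply(df, typeof)   # storage type of each column
nrow(df)             # number of rows
ncol(df)             # number of columns
```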

- If you do not provide names for the columns of a data frame, R will generate column names automatically, but relying on this is not recommended.
- In addition, avoid using duplicate column names

The elements of a data frame should be of equal length. Try running the code below; you should get an error that the number of rows differs

```
df <- data.frame(col1 = c(1, 2, 3), col2 = c(1, 2))
```

R will only recycle an atomic vector when the number of rows is a whole multiple of its length, most commonly a vector of length 1, but relying on recycling is best avoided. In the example below, R automatically **recycles** the value provided for `col2`:

```
data.frame(col1 = c(1,2,3), col2 = c(1))
```

Apart from atomic vectors, matrices and lists can be used to construct a data frame, but when a list is used its elements must be of equal length.

```
data.frame(matrix(c(1, 2, 3, 4), nrow = 2, dimnames = list(NULL, c("a", "b"))))
```

```
data.frame(list("col1" = c(1, 2, 3), "col2" = c(1, 2, 3)))
```

### Extending data frames

**Columns**: Columns can be added to a data frame using the function `cbind()`. Note that when using `cbind()`, at least one of the objects being combined must be a data frame; otherwise a matrix is created:

```
df <- data.frame(col1 = c(1, 2), col2 = c(3, 4))
cbind(df, col3 = c(5, 6))
```

**Rows**: Rows can be added to a data frame using the function `rbind()`. However, when adding rows to a data frame, the data type of the columns can change: R will **coerce** the values in each column to a compatible data type:

```
df <- data.frame(col1 = c(1, 2), col2 = c(3, 4))
df <- rbind(df, c("1", "2"))
str(df)
```
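
One way to keep the column types intact is to pass the new row as a data frame with matching column names and types instead of as a character vector. A minimal sketch, where `new_row` is a hypothetical name:

```r
df <- data.frame(col1 = c(1, 2), col2 = c(3, 4))
new_row <- data.frame(col1 = 5, col2 = 6)   # same column names and types
df <- rbind(df, new_row)
str(df)                                     # both columns are still numeric
```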

### Properties of data frames

An atomic vector has three properties: (i) a **type**, (ii) a **length** and (iii) **attributes**. A data frame has the same three properties as an atomic vector but includes some unique **attributes**

**Length:** The length of a data frame is the number of columns of the data frame

```
df <- data.frame(col1 = c(1, 2), col2 = c(3, 4))
length(df)
```

**Data Type:** The data type of a data frame is a list. R stores a data frame as a list with some special conditions

```
df <- data.frame(col1 = c(1, 2), col2 = c(3, 4))
typeof(df)
```

**Attributes**: Data frames can have additional attributes such as row names and column names

```
df <- data.frame(col1 = c(1, 2), col2 = c(3, 4))
attributes(df)
```

Row names can be added or changed using the function `rownames()`. Column names can be changed using the function `colnames()` or the function `names()`.

### Subsetting data frames

If you subset a data frame using a single index `[columns]`, the data frame behaves like a list and returns the selected columns with all rows as a data frame:

```
df <- data.frame(col1 = c(1, 1), col2 = c(2, 2), col3 = c(3, 3))
df[1]
```

```
df <- data.frame(col1 = c(1, 1), col2 = c(2, 2), col3 = c(3, 3))
df[c("col1", "col3")]
```

Since a data frame behaves like a list when subset, double brackets or a `$` followed by the name of a column can be used to select a single column. As with lists, the result is returned as the underlying object, i.e. an atomic vector and not a data frame:

```
df <- data.frame(col1 = c(1, 1), col2 = c(2, 2), col3 = c(3, 3))
df[["col1"]]
```

```
df <- data.frame(col1 = c(1, 1), col2 = c(2, 2), col3 = c(3, 3))
df$col1
```

If you subset a data frame using two indices, i.e. `[rows, columns]`, the data frame behaves like a matrix and by default returns the selected rows and columns as the simplest possible data structure:

```
df <- data.frame(col1 = c(1, 1), col2 = c(2, 2), col3 = c(3, 3))
df[1, c(1, 2)] # return row 1 and column 1 and 2
```

```
df <- data.frame(col1 = c(1, 1), col2 = c(2, 2), col3 = c(3, 3))
df[, 1] # returns a vector
```

```
df <- data.frame(col1 = c(1, 1), col2 = c(2, 2), col3 = c(3, 3))
df[, 1, drop = FALSE] # returns a data frame
```

The rows of a data frame can also be selected using logical vectors. For example, using the built-in data frame `cars`:

```
cars[cars$speed == 24,]
```

```
cars[cars$speed == 24, "dist", drop = FALSE]
```
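
Logical conditions can be combined with `&` and `|` before they are used to select rows. The sketch below keeps the rows of `cars` where both conditions hold; the thresholds and the object name `fast_short` are arbitrary choices for illustration:

```r
fast_short <- cars[cars$speed > 20 & cars$dist < 60, ]   # both conditions must be TRUE
fast_short
nrow(fast_short)   # number of rows that satisfy both conditions
```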