Data Analytics 344
# Store your student number in the variable below
student_number <- 12345678
# DO NOT EDIT
knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
Instructions
This tutorial will be auto-graded using test cases. To ensure your submission can be graded, please follow these guidelines:
- Adding your student number: In the first code block, store your student number in the variable student_number. For instance, if your student number is 1234, the code block should contain the code student_number <- 1234.
- Package usage: Only use the packages specified in the setup code block. If the packages defined in the setup block are not installed, install them using the R Console.
- Hands-Off code blocks: Do not modify code that follows the comment # DO NOT EDIT. These blocks are for loading packages or illustrating programming concepts and should remain unchanged.
- Completing code blocks: Your main task is to complete the incomplete code provided. There's no need to add or modify any of the code blocks or questions; simply fill in the necessary answers.
- Error-Free Submission: Before submitting your final answer, make sure your R Markdown document runs without errors. Only variables declared in this document will be accessible for grading.
- Case Sensitivity: Remember that variables in R are case-sensitive. Use the exact variable names specified in the questions (e.g., Q1, not q1) and do not overwrite your answers in subsequent code blocks.
- Submission: Your final submission should consist of a single .Rmd file. Name your file as ???????.Rmd where the question marks should be replaced with your student number.
Please adhere to these guidelines, as submissions that do not follow them cannot be graded. Following the above principles will also help you to learn how to produce reproducible code.
Introduction
In this tutorial, we will consider the k-nearest neighbour model. The k-nearest neighbour algorithm does not explicitly build a model from the training data; instead, it stores the data and uses it to predict the target for a new instance. As its name implies, the k-nearest neighbour algorithm searches for the k instances in the training data that are "closest" to the new instance that must be classified.
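As a quick illustration of the idea (a minimal sketch only, not one of the graded answers; the toy training points and the new point c(5, 6) are made up for demonstration), the 1-nearest neighbour prediction for a new point is simply the label of the training instance at the smallest Euclidean distance:
# Toy illustration of 1-nearest neighbour with Euclidean distance
train <- tibble(d1 = c(1, 9), d2 = c(1, 9), t = c("class 0", "class 1"))
new_point <- c(5, 6)  # hypothetical new instance
distances <- sqrt((new_point[1] - train$d1)^2 + (new_point[2] - train$d2)^2)
train$t[which.min(distances)]  # label of the closest training instance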
Part 1
The data set below contains six instances, each described by two descriptive features. Each instance is also labelled as either class 0 or class 1. Take a moment and consider how the decision boundaries will look when the 1-nearest neighbour algorithm is used with Euclidean distance. In the next questions, we will draw the decision boundary.
d1 <- c(4, 2, 4, 6, 4, 6)
d2 <- c(6, 4, 2, 4, 4, 2)
t <- c("class 0", "class 0", "class 0", "class 0", "class 1", "class 1")
data <- tibble(d1, d2, t)
ggplot(data = data, aes(x = d1, y = d2)) +
geom_point(aes(color = t)) +
scale_x_continuous(breaks = seq(0, 10, by = 1), limits = c(0, 10)) +
scale_y_continuous(breaks = seq(0, 10, by = 1), limits = c(0, 10)) +
coord_fixed(ratio = 1) +
theme_minimal()
Question 1 [2]
Create a function called knn. The function knn should have four parameters: dataTrain, x1, x2 and k, where dataTrain is a tibble, x1 and x2 represent the values of the new instance that must be classified, and k represents the number of neighbours that will be accounted for.
When calling knn the function should:
- calculate the Euclidean distance between each instance in dataTrain and (x1, x2),
- select the k instances in dataTrain with the smallest distance, and
- return the most common target as a character vector, e.g. "class 1" (when there is a tie, return "class 1").
knn <- function(dataTrain, x1, x2, k) {
  # Euclidean distance from (x1, x2) to each instance in dataTrain
  # (the first two columns of dataTrain hold the descriptive features)
  dataTrain <- dataTrain %>%
    mutate(distance = sqrt((x1 - .[[1]])^2 + (x2 - .[[2]])^2))
  # keep the k nearest instances
  nn <- dataTrain %>%
    arrange(distance) %>%
    head(k)
  # most common target among the neighbours; reversing the table makes
  # which.max break ties in favour of "class 1"
  count <- table(nn[[3]])
  mode_val <- names(which.max(rev(count)))
  return(mode_val)
}
knn(dataTrain = data, x1 = 8, x2 = 1, k = 1)
knn(data, 8, 8, 1)
knn(data, 5, 3, 1)
Question 2 [2]
To draw decision boundaries using the k-nearest neighbours (kNN) algorithm, you create a grid of points, apply the knn function to predict a label for each point in the grid, and then plot the points coloured by their predicted labels.
Create a new tibble called grid with the columns d1 and d2. The tibble grid should contain 10100 instances, generated in the range 0 to 10 with a step size of 0.1.
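As a reminder of how rep() can be used to build such a grid (the short sequences below are toy values for illustration, not the graded answer), the times argument repeats the whole vector while each repeats every element:
# rep with times repeats the whole vector; rep with each repeats each element
rep(seq(0, 1, by = 0.5), times = 2)  # 0.0 0.5 1.0 0.0 0.5 1.0
rep(seq(0, 1, by = 0.5), each = 2)   # 0.0 0.0 0.5 0.5 1.0 1.0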
grid <- tibble(
  d1 = rep(seq(0, 10, by = 0.1), times = 100),
  d2 = rep(seq(0, 10, by = 0.1), each = 100)
)
grid
Question 3 [1]
Add the column t to the data frame grid. The column t should be populated with the prediction generated by the knn function with a k value of 1.
grid <- grid %>%
  rowwise() %>%                        # knn() is not vectorised, so predict row by row
  mutate(t = knn(data, d1, d2, 1)) %>%
  ungroup()
grid
ggplot(data = grid, aes(x = d1, y = d2)) +
geom_point(aes(color = t)) +
scale_x_continuous(breaks = seq(0, 10, by = 1), limits = c(0, 10)) +
scale_y_continuous(breaks = seq(0, 10, by = 1), limits = c(0, 10)) +
coord_fixed(ratio = 1) +
theme_minimal()
Part 2
The data set below contains 14 instances, each described by two descriptive features. Each instance is also labelled as either class 0 or class 1. Take a moment and consider which value of k will minimise the training error.
d1 <- c(1,2,3,4,5,7,8,2,3,5,6,7,8,9)
d2 <- c(5,6,7,8,9,2,3,7,8,1,2,3,4,5)
t <- c(rep("class 0", 7),rep("class 1", 7))
data <- tibble(d1, d2, t)
ggplot(data = data, aes(x = d1, y = d2)) +
geom_point(aes(color = t)) +
scale_x_continuous(breaks = seq(0, 10, by = 1), limits = c(0, 10)) +
scale_y_continuous(breaks = seq(0, 10, by = 1), limits = c(0, 10)) +
coord_fixed(ratio = 1) +
theme_minimal()
Question 4 [2]
Create a function called error with two parameters, actual and predict (both vectors). The function should compare the two vectors and return the number of misclassified instances.
error <- function(actual, predict) {
  # count the positions where the predicted labels disagree with the actual labels
  misclassified <- sum(actual != predict)
  return(misclassified)
}
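A quick sanity check (the vectors below are made up, not part of the graded answer):
error(c("class 0", "class 1", "class 1"), c("class 0", "class 0", "class 1"))  # 1 misclassified instance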
Question 5 [6]
Find the value for k that minimises the leave-one-out cross-validation error. Store your answer in the variable q5. In leave-one-out cross-validation, each instance is held out in turn, the remaining instances are used as the training data, and the held-out instance is predicted; the error is the total number of misclassified instances across all hold-outs.
k_values <- 1:10
cv_errors <- numeric(length(k_values))
# leave-one-out cross-validation error for each candidate value of k
for (i in seq_along(k_values)) {
  k <- k_values[i]
  predictions <- character(nrow(data))
  for (j in seq_len(nrow(data))) {
    # hold out instance j and predict it from the remaining instances
    subset <- data[-j, ]
    predictions[j] <- knn(subset, data$d1[j], data$d2[j], k)
  }
  cv_errors[i] <- error(data$t, predictions)
}
# k value that minimises the LOOCV error
best_k <- k_values[which.min(cv_errors)]
q5 <- as.numeric(best_k)
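As an optional check (purely for inspection, not required by the question), the LOOCV error for each candidate k can be viewed alongside the chosen value:
# LOOCV error for each candidate k, and the selected answer
tibble(k = k_values, loocv_error = cv_errors)
q5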