Data Analytics 344
# Store your student number in the variable below
student_number <- 12345678
# DO NOT EDIT
knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
set.seed(student_number)
Instructions
This tutorial will be auto-graded using test cases. To ensure your submission can be graded, please follow these guidelines:
- Adding your student number: In the setup code block, store your student number in the variable student_number. For instance, if your student number is 1234, the code block should contain the line student_number <- 1234.
- Package usage: Only use the packages specified in the setup code block. If the packages listed in the setup block are not installed, install them from the R Console.
- Hands-off code blocks: Do not modify code that follows the comment # DO NOT EDIT. These blocks load packages or illustrate programming concepts and should remain unchanged.
- Completing code blocks: Your main task is to complete the incomplete code provided. Do not add or modify any of the code blocks or questions; simply fill in the necessary answers.
- Error-Free Submission: Before submitting your final answer, make sure your R Markdown document runs without errors.
- Case Sensitivity: Remember that variables in R are case-sensitive. Use the exact variable names specified in the questions (e.g., Q1, not q1) and do not overwrite your answers in subsequent code blocks.
- Submission: Your final submission should consist of a single .Rmd file. Name your file as ???????.Rmd where the question marks should be replaced with your student number.
Please adhere to these guidelines, as submissions that do not follow them cannot be graded. Following the above principles will also help you to learn how to produce reproducible code.
Introduction
This tutorial will require you to build a linear regression model to capture the relationship between property size and price. We will implement a gradient descent algorithm to guide the search for optimal parameters in the equation $y = w_0 + w_1 x$, where $y$ is the target feature (price), $x$ is the descriptive feature (size), and $w_0$ and $w_1$ are the model parameters.
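As a preview of the convention used in the questions below (and in the Kelleher et al. textbook referenced later), each gradient descent step nudges every weight by the learning rate $\alpha$ times that weight's error signal $\delta_{w_j}$:

$$w_j \leftarrow w_j + \alpha\,\delta_{w_j}$$

The error signal itself is derived in Question 6.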
The code block below creates a dataframe property which contains size and price pairs. The values have been scaled to the range [0, 1].
property <- data.frame(
size=c(0.00, 0.10, 0.24, 0.26, 0.33, 0.40, 0.54, 0.76, 0.84, 1.00),
price=c(0.00, 0.20, 0.267, 0.23, 0.22, 0.30, 0.53, 0.93, 0.83, 1.00))
w0 <- 0.98
w1 <- -0.84
learning_rate <- 0.05
Question 1: Plotting the data (1 mark)
Use ggplot to draw a scatterplot of property size against price. Assign your plot to the variable 'q1'.
q1 <- ggplot(data = property, mapping = aes(x = size, y = price)) + geom_point()
q1
Question 2: Defining the linear regression function (1 mark)
Define a function called 'q2' that receives 3 arguments as input: w0, w1 and x, where x is a vector of the descriptive feature. The function must return a vector of the predictions.
q2 <- function(w0, w1, x){
return(as.numeric(w0 + w1*x))
}
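As a quick optional check of q2, the values below follow directly from the formula with the starting weights (e.g. 0.98 - 0.84 * 0.5 = 0.56):

```r
# Hand-verifiable example using the starting weights w0 = 0.98, w1 = -0.84
q2(0.98, -0.84, c(0, 0.5, 1))
#> [1] 0.98 0.56 0.14
```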
Question 3: Plotting the regression line (2 marks)
Create a scatter plot of the property data as you did in Question 1 and add the regression line produced by q2 to the plot. This will allow you to visually compare the predictions to the dataset.
q3 <- q1 + geom_line(aes(x = size, y = q2(w0, w1, size)))
q3
Question 4: Predicting the target variable (1 mark)
Use the function q2 to add a column of predictions to the property dataframe. Name the new column 'pred'. Save your new dataframe as 'q4'.
q4 <- cbind(property, pred=q2(w0, w1, property$size))
q4
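Since the tidyverse is already loaded, an equivalent (optional) formulation uses dplyr::mutate; the name q4_alt below is just for illustration:

```r
# Same result as cbind(), written in dplyr style (q4_alt is a hypothetical name)
q4_alt <- property %>% mutate(pred = q2(w0, w1, size))
```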
Question 5: Calculate the model error (1 mark)
Now that the model has generated predictions, which have been added to the property dataframe, we need a way to evaluate the performance of the model. For this model we will use the L2 loss function, $SSE/2 = \frac{1}{2}\sum_i (t_i - p_i)^2$, where $t_i$ is a target value and $p_i$ the corresponding prediction.
Create a function called 'q5' that takes 2 arguments as input, target and prediction. Each argument will be a single column of a dataframe; these inputs will not be vectors.
Your function q5 must return the loss value according to the L2 loss function. Do not round your answer.
q5 <- function(target, prediction){
  # L2 loss: half the sum of squared errors
  L2 <- 0.5*(sum((target - prediction)^2))
  return(L2)
}
test <- q5(q4["price"], q4["pred"])
test
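A quick hand-checkable example (optional): with targets (1, 0) and predictions (0.5, 0.5), the loss should be 0.5 × (0.25 + 0.25) = 0.25. The function body happens to work on plain vectors as well, which makes this easy to verify:

```r
# 0.5 * ((1 - 0.5)^2 + (0 - 0.5)^2) = 0.25
q5(c(1, 0), c(0.5, 0.5))
#> [1] 0.25
```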
Question 6: Calculate the error signal for w0 (2 marks)
To determine the change that must take place for each parameter w, the gradient descent algorithm must find the partial derivative of the loss function with respect to each parameter w. These derivatives are used to calculate the error signal, which guides the search for improved parameter values. Refer to the lecture slides and the Kelleher et al. textbook for the error signal function.
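For reference, here is a sketch of where the error signal comes from under the L2 loss defined in Question 5; it matches the convention used in the code below and in the adjust function of Question 8:

$$L = \frac{1}{2}\sum_i \big(t_i - (w_0 + w_1 x_i)\big)^2, \qquad \frac{\partial L}{\partial w_0} = -\sum_i (t_i - \hat{y}_i), \qquad \frac{\partial L}{\partial w_1} = -\sum_i (t_i - \hat{y}_i)\,x_i$$

The error signal is the negative of the gradient, $\delta_{w} = \sum_i (t_i - \hat{y}_i)\,x_i$ (with $x_i = 1$ for $w_0$), so moving each weight by $\alpha\,\delta_w$ descends the loss surface.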
Calculate the error signal for the parameter w0 and store this value in 'q6'. Do not round your answer off.
Calculate the new value for w0 after this first adjustment by the algorithm.
x1 <- 1 # the descriptive "feature" paired with w0 is the constant 1 (intercept term)
q6 <- sum((q4$price-q4$pred)*x1)
q6
w0_new <- w0 + learning_rate*q6
Question 7: Calculate the error signal for w1 (2 marks)
Determine the error signal for the parameter w1, using the derivative of the loss function with respect to w1. Store your answer as 'q7'. Do not round your answer off.
q7 <- sum((q4$price-q4$pred)*q4$size)
q7
w1_new <- w1 + learning_rate*q7
Question 8: Finding the new parameter values (2 marks)
Create a function called 'adjust' that takes as input learning_rate, current_weight and error_signal where: learning_rate is the value of the learning rate variable current_weight is the current value of a given parameter error_signal is the error signal of a given parameter
The function 'adjust' must return the new weight for a given parameter.
Use this function to find the new weight for the w0 parameter, store your answer as q8a. Use this function to find the new weight for the w1 parameter, store your answer as q8b.
Use the learning rate value that was previously defined.
adjust <- function (learning_rate, current_weight, error_signal){
return(current_weight + learning_rate*error_signal)
}
q8a <- adjust(learning_rate, w0, q6)
q8b <- adjust(learning_rate, w1, q7)
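As an optional sanity check (not part of the graded answer), a single gradient descent step should reduce the loss; the names loss_before and loss_after below are just for illustration:

```r
# Loss with the starting weights versus after one update step
# (q5's body works on plain vectors as well as single-column dataframes)
loss_before <- q5(q4$price, q4$pred)
loss_after  <- q5(q4$price, q2(q8a, q8b, q4$size))
c(before = loss_before, after = loss_after)
```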
Question 9: Using the algorithm components together (2 marks)
Questions 6 & 7 calculate the new parameter values after 1 iteration of the gradient descent algorithm. However, in most cases many iterations of the algorithm need to run for the loss function to converge to a minimum in the error space.
You must create a function q9 that contains a for loop to run the update code 250 times. Use your code chunks from Questions 4, 6, 7 and 8, and your function from Q5, inside the for loop in q9.

The function q9 must receive 5 arguments: 'w0', 'w1', 'learning_rate', 'property' & 'iterations'. Lastly, the function must print w0 and w1 and return the model loss value using the function q5 created in Q5. Ensure that the print and return commands are outside the for loop.
Run the function for 250 iterations and save the loss value as q9b.
w0 <- 0.98
w1 <- -0.84
q9 <- function(w0, w1, learning_rate, property, iterations) {
  for (j in 1:iterations) {
    # Q4: predictions with the current weights
    preds <- data.frame(property, pred = q2(w0, w1, property$size))
    # Q6 & Q7: error signals for w0 (x = 1) and w1 (x = size)
    w0_signal <- sum(preds$price - preds$pred)
    w1_signal <- sum((preds$price - preds$pred) * preds$size)
    # Q8: adjust both weights using the learning rate
    w0 <- adjust(learning_rate, w0, w0_signal)
    w1 <- adjust(learning_rate, w1, w1_signal)
  }
  print(w0)
  print(w1)
  # Q5: loss for the predictions made at the start of the final iteration
  return(q5(preds["price"], preds["pred"]))
}
q9b <- q9(w0, w1, learning_rate, property, 250)
q9b
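As an optional cross-check (not graded), the weights found by gradient descent should be close to the ordinary least-squares fit, since lm() minimises the same sum of squared errors:

```r
# Closed-form least-squares fit, for comparison with the printed w0 and w1
coef(lm(price ~ size, data = property))
```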
Question 10: Checking the final weights (0 marks)
To check that the w0 and w1 weights printed in q9 capture the relationship between price and size, use your plot from q3 and the final weights from q9 to plot the final regression line. Store this new, fitted plot as q10.
w0_test <- 0.1129392
w1_test <- 0.9794952
q10 <- q1 + geom_line(aes(x = size, y = q2(w0_test, w1_test, size)))
q10
Question 11: Comparing learning rates (0 marks)
The code chunk below uses your function 'q9' and compares the loss values when 3 learning rates are used: small, medium and large. If your answers to the previous questions are correct, the plot will demonstrate the importance of learning rate selection.
- The small learning rate means that minimal adjustments to the weights are made and the algorithm takes more time to find the minimum value.
- The medium learning rate balances convergence speed and minimises the loss value.
- The large learning rate means that the algorithm misses the global minimum since it jumps from one side of the minimum to the other. Indeed, the large learning rate below causes the loss never to converge; in fact, the loss increases (the optional sketch after the code chunk below visualises this).
small_lr_loss <- q9(w0, w1, 0.0000005, property, 10)
medium_lr_loss <- q9(w0, w1, 0.05, property, 10)
large_lr_loss <- q9(w0, w1, 0.2, property, 10)
q11 <- data.frame(
learning_rate_values=c(0.0000005, 0.05, 0.2),
loss_values=c(small_lr_loss, medium_lr_loss, large_lr_loss))
ggplot(q11,
aes(x =learning_rate_values,
y = loss_values)) +
geom_point()
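To see the divergence directly, here is a minimal sketch (optional, not part of the graded answer) that records the loss at every iteration. The function name q9_trace and the reshaping below are just one way to do this, assuming q2, q5 and adjust are defined as above:

```r
# Variant of q9 that returns the loss at each iteration instead of printing
# the weights (q9_trace is a hypothetical name). q5's body works on plain
# vectors, which keeps the sketch short.
q9_trace <- function(w0, w1, learning_rate, property, iterations) {
  losses <- numeric(iterations)
  for (j in 1:iterations) {
    pred <- q2(w0, w1, property$size)
    losses[j] <- q5(property$price, pred)
    w0 <- adjust(learning_rate, w0, sum(property$price - pred))
    w1 <- adjust(learning_rate, w1, sum((property$price - pred) * property$size))
  }
  losses
}

data.frame(iteration = 1:10,
           small  = q9_trace(0.98, -0.84, 0.0000005, property, 10),
           medium = q9_trace(0.98, -0.84, 0.05, property, 10),
           large  = q9_trace(0.98, -0.84, 0.2, property, 10)) %>%
  pivot_longer(-iteration, names_to = "learning_rate", values_to = "loss") %>%
  ggplot(aes(x = iteration, y = loss, colour = learning_rate)) +
  geom_line()
```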