Ch. 12: Tasks & Quizzes

Tip: Your turn!

In this task, you will seek answers to the research question: To what extent can the results of the non-verbal IQ Blocks test be used to predict Vocab scores among L1 and L2 speakers of English based on the data from Dąbrowska (2019)? As we will be answering this question within a linear regression framework, what we are really asking is: To what extent are these variables linearly associated with each other?

Q12.1 Fit a linear model to predict Vocab scores among L1 participants based on their Blocks test scores. What is the adjusted R2 value of this model?








Show sample code to answer Q12.1.
taskmodel1 <- lm(formula = Vocab ~ Blocks,
                 data = L1.data)

summary(taskmodel1) # Adjusted R-squared:  0.07232 
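If you prefer to extract this statistic directly rather than reading it off the printed summary, the summary object stores it as a list element (a small optional sketch):

```r
# The adjusted R-squared is stored in the summary of the lm object:
summary(taskmodel1)$adj.r.squared
```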

Q12.2 According to your model, which of these values is the predicted Vocab score of an L1 speaker with a Blocks score of 20?







Show code to answer Q12.2
# a) Taking a numerical approach, you can add up the model coefficients printed in the model summary. As always, start with the intercept coefficient and add to it the coefficient corresponding to an increase of one point on the Blocks test, multiplied by the number of points that this participant achieved (20):
summary(taskmodel1)

54.5614 + 1.0527*20 # = 75.6154

# Alternatively, we can take the coefficient estimate values directly from the model object to get an even more precise prediction like this:
taskmodel1$coefficients[1] + (taskmodel1$coefficients[2] * 20)
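A third option, not required here but worth knowing, is base R's predict() function, which applies the model coefficients to new data for you:

```r
# predict() returns the model's fitted Vocab score for any new Blocks value:
predict(taskmodel1, newdata = data.frame(Blocks = 20))
```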

# b) Taking a graphical approach, we need to plot the model's predicted Vocab scores against all Blocks test scores in the dataset in order to find out what the model predicts for a Blocks test score of 20. The ggplot code below adds an arrow and a dotted line to help you read the predicted Vocab score from the plot:
ggplot(L1.data, 
       aes(x = Blocks, 
           y = predict(taskmodel1))) + 
  geom_point() +
  annotate("segment",
           x = 20, 
           xend = 20,
           y = 50, 
           yend = 75,
           colour = "blue",
           arrow = arrow(length = unit(0.25, "cm"))) +
  annotate("segment",
           x = 0, 
           xend = 19.5, 
           y = 75.6,
           yend = 75.6,
           colour = "blue",
           linetype = "dotted") +
  labs(y='Predicted Vocab scores', 
       x='Blocks test scores') +
  theme_minimal()

Q12.3 Now fit a new linear model to predict Vocab scores among L2 participants based on their Blocks test scores. Comparing the adjusted R2 value of this new L2 model to your previous L1 model, which of these statements is/are true?






Show sample code to answer Q12.3.
taskmodel2 <- lm(formula = Vocab ~ Blocks,
                 data = L2.data)

summary(taskmodel2) # Adjusted R-squared:  0.007177 
summary(taskmodel1) # Adjusted R-squared:  0.07232 

Q12.4 In your L2 model, the p-value for the Blocks predictor should be 0.228620. True or false: This p-value means that there is a 23% chance of obtaining a Blocks coefficient estimate of 0.72 or higher in a random sample of 67 L2 speakers of English when there is actually no association between the results of the Blocks test and those of the Vocab test?
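If you want to locate this p-value programmatically rather than reading it off the printout, the coefficient table of the model summary can be indexed by row and column name (a sketch that assumes the taskmodel2 object fitted for Q12.3):

```r
# The "Pr(>|t|)" column of the coefficient table holds the p-values:
summary(taskmodel2)$coefficients["Blocks", "Pr(>|t|)"]
```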



Tip: Your turn!

Q12.5 Draw a boxplot showing the distribution of Vocab scores among male and female participants in Dabrowska.data. Based on your boxplot, do you expect Gender to be a statistically significant predictor of Vocab scores?






Show sample code to answer Q12.5.
Dabrowska.data |> 
  ggplot(mapping = aes(x = Gender,
                       y = Vocab)) +
  geom_boxplot() +
  labs(x = "Gender", 
       y = "Vocab scores") +
  theme_minimal()
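To back up your visual impression with numbers, you could also compare the two groups' summary statistics (a sketch using {dplyr} functions from the tidyverse):

```r
library(dplyr)

# Mean and median Vocab scores per Gender group:
Dabrowska.data |> 
  group_by(Gender) |> 
  summarise(mean_Vocab = mean(Vocab),
            median_Vocab = median(Vocab))
```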

Q12.6 Fit a model with Vocab as the outcome variable and Gender as the predictor to test your intuition based on your boxplot. Is Gender a statistically significant predictor of Vocab score in this simple linear regression model?





Show code to answer Q12.6
taskmodel3 <- lm(formula = Vocab ~ Gender,
                 data = Dabrowska.data)

summary(taskmodel3) # p-value associated with coefficient estimate for GenderM = 0.957

Q12.7 The adjusted R2 coefficient of the model is -0.006433. True or false: This means that male participants, on average, score 0.006433 fewer points than female participants on the Vocab test?



Tip: Your turn!

In this task, you will explore whether L2 participants’ native language is a useful predictor when trying to model their English Vocab scores.

Q12.8 Fit a linear model to model L2 participants’ Vocab score based on their native language (NativeLg). According to the model’s adjusted R2 coefficient, how much of the variance in Vocab scores among L2 participants does this model account for?






Show sample code to answer Q12.8.
taskmodel3 <- lm(formula = Vocab ~ NativeLg,
                 data = L2.data)

summary(taskmodel3) # Adjusted R-squared:  0.1331 

Q12.9 In this model, there is a greater difference between the multiple R2 and the adjusted R2 coefficients than in all previous models in this chapter. Why might that be?






Q12.10 One way to reduce the number of levels in the NativeLg variable is to model Vocab scores based on L2 participants’ native language family, instead. Fit a model that attempts to predict Vocab scores among L2 participants based on the NativeLgFamily variable (the creation of this variable was a Your turn! task in Using case_when()). Based on your comparison of the two models, which of the following statements is/are true?





Show code to answer Q12.10
taskmodel4 <- lm(formula = Vocab ~ NativeLgFamily,
                 data = L2.data)

summary(taskmodel4)

Q12.11 According to your NativeLgFamily model, what is the predicted Vocab score of an L2 participant with a Baltic L1?






Show sample code to answer Q12.11
# The Baltic language family is the first level of this categorical predictor (because we haven't changed the order of the levels, which is alphabetical by default):
levels(L2.data$NativeLgFamily)

# The Intercept coefficient corresponds to an L2 participant with a Baltic native language:
summary(taskmodel4)

Q12.12 According to your NativeLgFamily model, what is the predicted Vocab score of an L2 participant with a Slavic L1?






Show sample code to answer Q12.12
# a) Numeric approach
summary(taskmodel4)

73.778 + (-23.749) # = 50.029

# b) Graphical approach
library(visreg)

visreg(taskmodel4, gg = TRUE) +
  labs(title = "Participants' L1 language family",
       x = NULL,
       y = "Vocab scores") +
  theme_bw()

# The predicted value for the Slavic L1 speakers is visualised by the last blue line.
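As a cross-check on the numeric approach, predict() will combine the intercept and the Slavic coefficient for you (an optional sketch reusing taskmodel4):

```r
# predict() returns the fitted Vocab score for an L2 participant with a Slavic L1:
predict(taskmodel4, newdata = data.frame(NativeLgFamily = "Slavic"))
```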

Q12.13 According to the NativeLgFamily model, Hellenic L1 speakers are predicted to perform considerably better than the reference level of Baltic L1 speakers. However, the p-value associated with this coefficient estimate is very large (0.4591). Why is that?





Show sample code to answer Q12.13
# The plot of predicted values (see code below) shows that the coefficient estimate for Hellenic speakers is based on only one data point (as is also the case for Germanic speakers).
library(visreg)
visreg(taskmodel4, gg = TRUE) +
  labs(title = "Participants' L1 language family",
       x = NULL,
       y = "Vocab scores") +
  theme_bw()

# We can check the distribution of native language family using the `table()`, `summary()` or `count()` functions:
table(L2.data$NativeLgFamily)

summary(L2.data$NativeLgFamily)

L2.data |> 
  count(NativeLgFamily)

Q12.14 Apply the emmeans function from the {emmeans} package to the NativeLgFamily model to compute the predicted mean Vocab scores of each L1 language family group. Slavic L1 speakers are predicted to attain a mean score of 50.0. This predicted score is associated with a 95% confidence interval that spans from 43.2 to 56.8. What does this mean?



Show sample code to answer Q12.14
#install.packages("emmeans")
library(emmeans)

emmeans(taskmodel4, ~ NativeLgFamily)

Check your progress 🌟

Well done! You have successfully completed this chapter introducing linear regression modelling.