This script is part of the Online Appendix to my PhD thesis.

Please cite as: Le Foll, Elen. 2022. Textbook English: A Corpus-Based Analysis of the Language of EFL textbooks used in Secondary Schools in France, Germany and Spain. PhD thesis. Osnabrück University.

For more information, see: https://elenlefoll.github.io/TextbookEnglish/

Please note that the plot dimensions in this notebook have been optimised for the print version of the thesis.

Set-up

Built with R 4.0.3

knitr::opts_chunk$set(echo = TRUE, paged.print=TRUE, fig.width = 10, warning=FALSE)

#renv::restore() # Restore the project's dependencies from the lockfile to ensure that same package versions are used as in the original thesis.

library(car) # For data wrangling
library(caret) # For its confusion matrix function
#library(clipr) # For quick exports to other programme
library(cowplot)
library(DescTools) # For 95% CI
library(dplyr)
library(emmeans)
#library(FactoMineR) # For Shiny app
library(factoextra) # For circular graphs of variables
library(forcats) # For the fct_relevel function
library(here) # For dynamic file paths
library(ggplot2) 
library(ggthemes) # For theme of factoextra plots
library(lme4) # For linear regression modelling
library(patchwork) # To create figures with more than one plot
#library(pca3d) # For 3-D plots (not rendered in html knit)
library(PCAtools) # For nice biplots of PCA results
library(purrr) # For data wrangling
library(psych) # For various useful stats function
library(sjPlot) # For model plots and tables
library(suffrager) # For pretty feminist colour palettes :)
library(visreg) # For plots of interaction effects

source(here("R_rainclouds.R")) # For geom_flat_violin rainplots

PCA and 3-D plots

Import full data (see 7_Ref_data_prep.Rmd for data preparation steps)

data <- readRDS(here("FullMDA", "dataforPCA.rds")) 

data %>% 
  filter(Series=="NGL") %>% 
  group_by(Series, Level) %>% 
  summarise(wordcount = sum(Words))

## `summarise()` has grouped output by 'Series'. You can override using the
## `.groups` argument.

## # A tibble: 5 × 3
## # Groups:   Series [1]
##   Series Level wordcount
##   <fct>  <fct>     <int>
## 1 NGL    A         18388
## 2 NGL    B         37471
## 3 NGL    C         30574
## 4 NGL    D         36395
## 5 NGL    E         61486

This chunk can be used to perform the MDA on various subsets of the data.

# Full dataset
data <- readRDS(here("FullMDA", "dataforPCA.rds")) 

# Subset of the data that excludes levels A and B textbooks
data <- readRDS(here("FullMDA", "dataforPCA.rds")) %>%
  filter(Level !="A" & Level != "B") %>%
  droplevels()
summary(data$Level)

# Subset of the data to only include one Country subcorpus of the TEC
data <- readRDS(here("FullMDA", "dataforPCA.rds")) %>%
  #filter(Country != "France" & Country != "Germany") %>% # Spain only
  #filter(Country != "France" & Country != "Spain") %>% # Germany only
  filter(Country != "Spain" & Country != "Germany") %>% # France only
  droplevels()
summary(data$Country)

# Perform PCA on random subset of the data to test the stability of the solution. Re-running this line will generate a new subset of 2/3 of the texts randomly sampled.
set.seed(13) # Seed used for the biplot printed in Chapter 8
data <- readRDS(here("FullMDA", "dataforPCA.rds")) %>%
  slice_sample(n = 4980*0.6, replace = FALSE)
nrow(data)
data$Filename[1:4]

This chunk is used to create the 3-D plots. These cannot be rendered in the html knit.

colnames(data)
pca <- prcomp(data[,9:ncol(data)], scale.=FALSE) # All quantitative variables
register <- factor(data[,"Register"]) 
corpus <- factor(data[,"Corpus"])
subcorpus <- factor(data[,"Subcorpus"])

summary(register)
summary(corpus)
summary(subcorpus)
summary(pca)

# 3-D plot

colours <- suf_palette(name = "london", n = 6, type = "continuous") 
colours2 <- suf_palette(name = "classic", n = 5, type = "continuous") 
colours <- c(colours, colours2[c(2:4)]) # Nine colours range
col6 <- colours[c(6,5,4,7,9,2)] # Good order for PCA
scales::show_col(col6)

col6 <- c("#F9B921", "#A18A33", "#722672", "#BD241E", "#267226", "#15274D")
names(col6) <- c("Textbook Conversation", "Textbook Fiction", "Textbook Informative", "Spoken BNC2014 Ref.", "Youth Fiction Ref.", "Info Teens Ref.")
shapes6 <- c(rep("cube", 3),rep("sphere", 3))
names(shapes6) <- c("Textbook Conversation", "Textbook Fiction", "Textbook Informative", "Spoken BNC2014 Ref.", "Youth Fiction Ref.", "Info Teens Ref.")

library(pca3d)

pca3d(pca, group = subcorpus, 
      components = 1:3,
      #components = 4:6,
      show.plane=FALSE,
      col = col6,
      shape = shapes6,
      radius = 0.7,
      legend = "right")

snapshotPCA3d(here("plots", "PCA_TxB_3Ref_3Dsnapshot.png"))

## Looking at all three Textbook English registers in one colour

col4 <- colours[c(1,3,7,9)]
col4 <- c("#EA7E1E", "#15274D", "#BD241E", "#267226")

names(col4) <- c("Textbook.English", "Informative.Teens", "Spoken.BNC2014", "Youth.Fiction")
shapes4 <- c("cube", rep("sphere", 3))
names(shapes4) <- c("Textbook.English", "Informative.Teens", "Spoken.BNC2014", "Youth.Fiction")

pca3d(pca, group = corpus, 
      show.plane=FALSE,
      components = 1:3,
      col = col4,
      shape = shapes4,
      radius = 0.7,
      legend = "right")

PCA biplots

data2 <- data %>% 
  mutate(Source = recode_factor(Corpus, Textbook.English = "Textbook English (TEC)", Informative.Teens = "Reference corpora", Spoken.BNC2014 = "Reference corpora", Youth.Fiction = "Reference corpora")) %>% 
  mutate(Corpus = fct_relevel(Subcorpus, "Info Teens Ref.", after = 9)) %>%
  relocate(Source, .after = "Corpus") %>% 
  droplevels(.)

colnames(data2)

##  [1] "Filename"  "Register"  "Level"     "Series"    "Country"   "Corpus"   
##  [7] "Source"    "Subcorpus" "Words"     "ACT"       "AMP"       "ASPECT"   
## [13] "AWL"       "BEMA"      "CAUSE"     "CC"        "COMM"      "CONC"     
## [19] "COND"      "CONT"      "CUZ"       "DEMO"      "DMA"       "DOAUX"    
## [25] "DT"        "DWNT"      "ELAB"      "EMPH"      "EX"        "EXIST"    
## [31] "FPP1P"     "FPP1S"     "FPUH"      "GTO"       "HDG"       "HGOT"     
## [37] "IN"        "JJAT"      "JJPR"      "LD"        "MDCA"      "MDCO"     
## [43] "MDMM"      "MDNE"      "MDWO"      "MDWS"      "MENTAL"    "NCOMP"    
## [49] "NN"        "OCCUR"     "PASS"      "PEAS"      "PIT"       "PLACE"    
## [55] "POLITE"    "POS"       "PROG"      "QUAN"      "QUPR"      "QUTAG"    
## [61] "RB"        "RP"        "SPLIT"     "SPP2"      "STPR"      "THATD"    
## [67] "THRC"      "THSC"      "TTR"       "VBD"       "VBG"       "VBN"      
## [73] "VIMP"      "WHQU"      "WHSC"      "XX0"       "YNQU"      "TPP3"     
## [79] "FQTI"

data2meta <- data2[,1:9]
rownames(data2meta) <- data2meta$Filename
data2meta <- data2meta %>% select(-Filename)
head(data2meta)

##                                            Register Level    Series Country
## Achievers_B1_plus_Informative_0007.txt  Informative     D Achievers   Spain
## POC_5e_Spoken_0003.txt                 Conversation     B       POC  France
## Access_4_Narrative_0013.txt                 Fiction     D    Access Germany
## NGL_1_Spoken_0002.txt                  Conversation     A       NGL Germany
## Access_1_Narrative_0005.txt                 Fiction     A    Access Germany
## NGL_2_Narrative_0007.txt                    Fiction     B       NGL Germany
##                                                       Corpus
## Achievers_B1_plus_Informative_0007.txt  Textbook Informative
## POC_5e_Spoken_0003.txt                 Textbook Conversation
## Access_4_Narrative_0013.txt                 Textbook Fiction
## NGL_1_Spoken_0002.txt                  Textbook Conversation
## Access_1_Narrative_0005.txt                 Textbook Fiction
## NGL_2_Narrative_0007.txt                    Textbook Fiction
##                                                        Source
## Achievers_B1_plus_Informative_0007.txt Textbook English (TEC)
## POC_5e_Spoken_0003.txt                 Textbook English (TEC)
## Access_4_Narrative_0013.txt            Textbook English (TEC)
## NGL_1_Spoken_0002.txt                  Textbook English (TEC)
## Access_1_Narrative_0005.txt            Textbook English (TEC)
## NGL_2_Narrative_0007.txt               Textbook English (TEC)
##                                                    Subcorpus Words
## Achievers_B1_plus_Informative_0007.txt  Textbook Informative   690
## POC_5e_Spoken_0003.txt                 Textbook Conversation   694
## Access_4_Narrative_0013.txt                 Textbook Fiction   547
## NGL_1_Spoken_0002.txt                  Textbook Conversation   927
## Access_1_Narrative_0005.txt                 Textbook Fiction   840
## NGL_2_Narrative_0007.txt                    Textbook Fiction  1127

rownames(data2) <- data2$Filename
data2num <- as.data.frame(base::t(data2[,10:ncol(data2)]))
data2num[1:5,1:5] # Check data frame format is correct

##        Achievers_B1_plus_Informative_0007.txt POC_5e_Spoken_0003.txt
## ACT                                 1.3305834             -0.8133966
## AMP                                -0.4504541              0.1376718
## ASPECT                              1.0228557             -0.4515438
## AWL                                 0.7288132             -0.7268089
## BEMA                               -0.4801820              1.1272237
##        Access_4_Narrative_0013.txt NGL_1_Spoken_0002.txt
## ACT                     -0.4707243             -1.014328
## AMP                      1.4534008             -0.561931
## ASPECT                   1.1039502             -0.749580
## AWL                     -0.5477092             -0.716272
## BEMA                     0.2456594              1.700641
##        Access_1_Narrative_0005.txt
## ACT                     0.03320131
## AMP                    -0.52977306
## ASPECT                 -0.55344739
## AWL                    -0.56332789
## BEMA                   -0.39233698

p <- PCAtools::pca(data2num, 
         metadata = data2meta,
         scale = FALSE)

p$variance[1:6]

##       PC1       PC2       PC3       PC4       PC5       PC6 
## 30.334148  7.512979  5.886828  3.415906  2.703581  2.415911

sum(p$variance[1:4])

## [1] 47.14986

# For five TEC registers
# colkey = c(`Spoken BNC2014 Ref.`="#BD241E", `Info Teens Ref.`="#15274D", `Youth Fiction Ref.`="#267226", `Textbook Fiction`="#A18A33", `Textbook Conversation`="#F9B921", `Textbook Informative` = "#722672", `Textbook Instructional` = "grey", `Textbook Personal` = "black")

# For three TEC registers
summary(data2$Corpus)

## Textbook Conversation      Textbook Fiction  Textbook Informative 
##                   565                   285                   352 
##   Spoken BNC2014 Ref.    Youth Fiction Ref.       Info Teens Ref. 
##                  1250                  1191                  1337

colkey = c(`Spoken BNC2014 Ref.`="#BD241E", `Info Teens Ref.`="#15274D", `Youth Fiction Ref.`="#267226", `Textbook Fiction`="#A18A33", `Textbook Conversation`="#F9B921", `Textbook Informative` = "#722672")

#summary(data2$Source)
#shapekey = c(`Textbook English (TEC)`=6, `Reference corpora`=1)

summary(data2$Level)

##    A    B    C    D    E Ref. 
##  152  231  306  304  209 3778

shapekey = c(A=1, B=2, C=6, D=0, E=5, `Ref.`=4)

## Very slow, open in zoomed out window!
# Add legend manually? Yes (take it from the biplot code below), let's not waste too much time, here. Or use Evert's mvar.pairs plot (though that annoyingly requires manual axis annotation so not really better!)

#png(here("plots", "PCA_3Ref_pairsplot.png"), width = 25, height= 40, units = "cm", res = 300)
PCAtools::pairsplot(p,
                 triangle = FALSE,
                 components = 1:4,
                 ncol = 2,
                 nrow = 3,
                 pointSize = 1,
                 shape = "Level",
                 shapekey = shapekey,
                 lab = NULL, # Otherwise will try to label each data point!
                 colby = "Corpus",
                 legendPosition = "none",
                 margingaps = unit(c(0.2, 0.2, 0.8, 0.2), "cm"),
                 colkey = colkey)

## Scale for 'colour' is already present. Adding another scale for 'colour',
## which will replace the existing scale.
## Scale for 'colour' is already present. Adding another scale for 'colour',
## which will replace the existing scale.
## Scale for 'colour' is already present. Adding another scale for 'colour',
## which will replace the existing scale.
## Scale for 'colour' is already present. Adding another scale for 'colour',
## which will replace the existing scale.
## Scale for 'colour' is already present. Adding another scale for 'colour',
## which will replace the existing scale.
## Scale for 'colour' is already present. Adding another scale for 'colour',
## which will replace the existing scale.

dev.off()

## null device 
##           1

# Biplots to examine components more carefully

# These settings (with legendPosition = "top") were used to generate the legend for the scatterplot matrix above:
#png(here("plots", "PCA_3Ref_Biplot_PC1_PC2test.png"), width = 40, height= 25, units = "cm", res = 300) 
PCAtools::biplot(p,
                 x = "PC1",
                 y = "PC2",
                 lab = NULL, # Otherwise will try to label each data point!
                 colby = "Corpus",
                 pointSize = 1.3,
                 colkey = colkey,
                 shape = "Level",
                 shapekey = shapekey,
                 xlim = c(min(p$rotated[, "PC1"]), max(p$rotated[, "PC1"])),
                 ylim = c(min(p$rotated[, "PC2"]), max(p$rotated[, "PC2"])),
                 showLoadings = FALSE,
                 ellipse = TRUE,
                 axisLabSize = 18,
                 legendPosition = 'right',
                 legendTitleSize = 18,
                 legendLabSize = 14, 
                 legendIconSize = 5) +
  theme(plot.margin = unit(c(0,0,0,0.2), "cm"))

## Scale for 'colour' is already present. Adding another scale for 'colour',
## which will replace the existing scale.

# These settings were used to save this biplot:
#ggsave(here("Plots", "PCA_Ref3TxB_BiplotPC1_PC2.svg"), width = 12, height = 8)

# Biplots to examine components more carefully
PCAtools::biplot(p,
                 x = "PC3",
                 y = "PC4",
                 lab = NULL, # Otherwise will try to label each data point!
                 colby = "Corpus",
                 pointSize = 1.2,
                 colkey = colkey,
                 shape = "Level",
                 shapekey = shapekey,
                 xlim = c(min(p$rotated[, "PC3"]), max(p$rotated[, "PC3"])),
                 ylim = c(min(p$rotated[, "PC4"]), max(p$rotated[, "PC4"])),
                 showLoadings = FALSE,
                 ellipse = TRUE,
                 axisLabSize = 18,
                 legendPosition = 'right',
                 legendTitleSize = 18,
                 legendLabSize = 14, 
                 legendIconSize = 5) +
  theme(plot.margin = unit(c(0,0,0,0.2), "cm"))

## Scale for 'colour' is already present. Adding another scale for 'colour',
## which will replace the existing scale.

#ggsave(here("Plots", "PCA_Ref3TxB_BiplotPC3_PC4.svg"), width = 12, height = 8)

# Biplots to examine components more carefully
PCAtools::biplot(p,
                 x = "PC5",
                 y = "PC6",
                 lab = NULL, # Otherwise will try to label each data point!
                 colby = "Corpus",
                 pointSize = 1.2,
                 colkey = colkey,
                 shape = "Level",
                 shapekey = shapekey,
                 xlim = c(min(p$rotated[, "PC5"]), max(p$rotated[, "PC5"])),
                 ylim = c(min(p$rotated[, "PC6"]), max(p$rotated[, "PC6"])),
                 showLoadings = FALSE,
                 ellipse = TRUE,
                 axisLabSize = 18,
                 legendPosition = 'right',
                 legendTitleSize = 18,
                 legendLabSize = 14, 
                 legendIconSize = 5) +
  theme(plot.margin = unit(c(0,0,0,0.2), "cm"))

#ggsave(here("Plots", "PCA_Ref3TxB_BiplotPC5_PC6.svg"), width = 12, height = 8)

# Biplot with ellipses for Level rather than Register
colkey = c(A="#F9B921", B="#A18A33", C="#BD241E", D="#722672", E="#15274D", `Ref. data`= "darkgrey")
shapekey = c(`Spoken BNC2014 Ref.`=16, `Info Teens Ref.`=17, `Youth Fiction Ref.`=15, `Textbook Fiction`=0, `Textbook Conversation`=1, `Textbook Informative`=2)

PCAtools::biplot(p,
                 x = "PC3",
                 y = "PC4",
                 lab = NULL, # Otherwise will try to label each data point!
                 colby = "Level",
                 pointSize = 1.3,
                 colkey = colkey,
                 shape = "Corpus",
                 shapekey = shapekey,
                 xlim = c(min(p$rotated[, "PC3"]), max(p$rotated[, "PC3"])),
                 ylim = c(min(p$rotated[, "PC4"]), max(p$rotated[, "PC4"])),
                 showLoadings = FALSE,
                 ellipse = TRUE,
                 axisLabSize = 18,
                 legendPosition = 'right',
                 legendTitleSize = 18,
                 legendLabSize = 14, 
                 legendIconSize = 5) +
  theme(plot.margin = unit(c(0,0,0,0.2), "cm"))

## Scale for 'colour' is already present. Adding another scale for 'colour',
## which will replace the existing scale.

#ggsave(here("Plots", "PCA_Ref3TxB_BiplotPC3_PC4_levels.svg"), width = 12, height = 8)

# Biplots to examine components more carefully
PCAtools::biplot(p,
                 x = "PC5",
                 y = "PC6",
                 lab = NULL, # Otherwise will try to label each data point!
                 colby = "Level",
                 pointSize = 1.3,
                 colkey = colkey,
                 shape = "Corpus",
                 shapekey = shapekey,
                 xlim = c(min(p$rotated[, "PC5"]), max(p$rotated[, "PC5"])),
                 ylim = c(min(p$rotated[, "PC6"]), max(p$rotated[, "PC6"])),
                 showLoadings = FALSE,
                 ellipse = TRUE,
                 axisLabSize = 18,
                 legendPosition = 'right',
                 legendTitleSize = 18,
                 legendLabSize = 14, 
                 legendIconSize = 5) +
  theme(plot.margin = unit(c(0,0,0,0.2), "cm"))

#ggsave(here("Plots", "PCA_Ref3TxB_BiplotPC5_PC6_levels.svg"), width = 12, height = 8)

Feature contributions (loadings) on each component

#data <- readRDS(here("FullMDA", "dataforPCA.rds")) 
colnames(data)

##  [1] "Filename"  "Register"  "Level"     "Series"    "Country"   "Corpus"   
##  [7] "Subcorpus" "Words"     "ACT"       "AMP"       "ASPECT"    "AWL"      
## [13] "BEMA"      "CAUSE"     "CC"        "COMM"      "CONC"      "COND"     
## [19] "CONT"      "CUZ"       "DEMO"      "DMA"       "DOAUX"     "DT"       
## [25] "DWNT"      "ELAB"      "EMPH"      "EX"        "EXIST"     "FPP1P"    
## [31] "FPP1S"     "FPUH"      "GTO"       "HDG"       "HGOT"      "IN"       
## [37] "JJAT"      "JJPR"      "LD"        "MDCA"      "MDCO"      "MDMM"     
## [43] "MDNE"      "MDWO"      "MDWS"      "MENTAL"    "NCOMP"     "NN"       
## [49] "OCCUR"     "PASS"      "PEAS"      "PIT"       "PLACE"     "POLITE"   
## [55] "POS"       "PROG"      "QUAN"      "QUPR"      "QUTAG"     "RB"       
## [61] "RP"        "SPLIT"     "SPP2"      "STPR"      "THATD"     "THRC"     
## [67] "THSC"      "TTR"       "VBD"       "VBG"       "VBN"       "VIMP"     
## [73] "WHQU"      "WHSC"      "XX0"       "YNQU"      "TPP3"      "FQTI"

pca <- prcomp(data[,9:ncol(data)], scale.=FALSE) # All quantitative variables

# The rotated data that represents the observations / samples is stored in rotated, while the variable loadings are stored in loadings
loadings <- as.data.frame(pca$rotation[,1:4])


# Function that applies a cut-off point for "relevant" loadings (not used in thesis)
smallToZero <- function(x) {if_else(x < 0.05 & x > -0.05, 0, x)}
loadings %>% 
  filter_all(any_vars(. > abs(0.05))) %>% 
  mutate_all(smallToZero) %>% 
  round(3)

##           PC1    PC2    PC3    PC4
## ACT    -0.102  0.000  0.000  0.121
## AMP     0.000 -0.054 -0.052  0.094
## ASPECT -0.081  0.104  0.000  0.000
## BEMA    0.075 -0.241  0.063  0.000
## CAUSE  -0.084 -0.117  0.056  0.145
## COMM    0.000  0.193  0.000  0.196
## COND    0.083  0.000 -0.167  0.234
## CONT    0.223 -0.050  0.000  0.000
## CUZ     0.104 -0.145 -0.200 -0.176
## DEMO    0.146 -0.119 -0.077  0.000
## DMA     0.202 -0.088  0.000 -0.164
## DOAUX   0.185  0.000  0.057  0.000
## DT      0.078  0.159 -0.240  0.000
## DWNT    0.000  0.104 -0.120  0.093
## ELAB   -0.073 -0.165  0.000  0.068
## EMPH    0.168 -0.065 -0.094  0.000
## EX      0.055  0.000  0.000  0.000
## FPP1P   0.077  0.000  0.090  0.188
## FPP1S   0.174  0.057  0.062  0.082
## FPUH    0.180 -0.097  0.000 -0.189
## GTO     0.142  0.000  0.000  0.000
## HDG     0.100 -0.069 -0.138 -0.121
## HGOT    0.160  0.000  0.000 -0.108
## JJAT   -0.050 -0.126 -0.250  0.066
## JJPR    0.000 -0.172  0.000  0.213
## LD     -0.174 -0.163  0.127  0.000
## MDCA    0.000 -0.210  0.108  0.225
## MDCO    0.000  0.187 -0.149  0.097
## MDMM    0.000 -0.098 -0.142  0.173
## MDNE    0.063  0.000 -0.059  0.223
## MDWO    0.071  0.108 -0.178  0.000
## MDWS    0.056  0.000  0.000  0.246
## MENTAL  0.109  0.000 -0.055  0.157
## NN     -0.213 -0.104  0.098 -0.070
## PEAS   -0.058  0.119 -0.186  0.124
## PIT     0.152 -0.102 -0.151 -0.069
## PLACE   0.000  0.090  0.092  0.072
## POLITE  0.086  0.000  0.196  0.105
## POS     0.000  0.094  0.000 -0.054
## PROG    0.090  0.078  0.000  0.152
## QUAN    0.171  0.000 -0.157  0.000
## QUPR    0.080  0.109 -0.124  0.205
## QUTAG   0.147  0.000 -0.070 -0.149
## RB      0.144  0.150 -0.182  0.073
## RP      0.000  0.216 -0.089  0.150
## SPLIT   0.000 -0.107 -0.213  0.085
## SPP2    0.167 -0.066  0.098  0.163
## STPR    0.097  0.000  0.000  0.000
## THATD   0.159  0.000 -0.140  0.000
## THSC    0.000 -0.076 -0.269  0.065
## TTR    -0.186  0.072  0.000  0.157
## VBD    -0.096  0.349 -0.052 -0.196
## VBG    -0.144  0.000 -0.135  0.123
## VIMP    0.000 -0.071  0.214  0.205
## WHQU    0.134  0.000  0.202  0.065
## WHSC   -0.091 -0.103 -0.199  0.054
## XX0     0.181  0.000 -0.058  0.056
## YNQU    0.177  0.000  0.140  0.000
## TPP3   -0.054  0.303  0.000 -0.151
## FQTI   -0.069  0.000  0.000  0.139

# Tabular overiew of loadings with no threshold
loadings %>% 
  round(2)

##          PC1   PC2   PC3   PC4
## ACT    -0.10  0.01 -0.01  0.12
## AMP     0.00 -0.05 -0.05  0.09
## ASPECT -0.08  0.10 -0.01  0.00
## AWL    -0.21 -0.13 -0.06 -0.01
## BEMA    0.08 -0.24  0.06 -0.02
## CAUSE  -0.08 -0.12  0.06  0.14
## CC     -0.14 -0.13 -0.09 -0.09
## COMM   -0.03  0.19  0.03  0.20
## CONC   -0.03 -0.03 -0.18 -0.06
## COND    0.08 -0.02 -0.17  0.23
## CONT    0.22 -0.05  0.03  0.00
## CUZ     0.10 -0.14 -0.20 -0.18
## DEMO    0.15 -0.12 -0.08 -0.04
## DMA     0.20 -0.09 -0.04 -0.16
## DOAUX   0.18 -0.02  0.06  0.00
## DT      0.08  0.16 -0.24 -0.03
## DWNT   -0.04  0.10 -0.12  0.09
## ELAB   -0.07 -0.17 -0.04  0.07
## EMPH    0.17 -0.07 -0.09 -0.03
## EX      0.06 -0.02 -0.04  0.00
## EXIST  -0.13 -0.09 -0.05  0.00
## FPP1P   0.08 -0.02  0.09  0.19
## FPP1S   0.17  0.06  0.06  0.08
## FPUH    0.18 -0.10 -0.05 -0.19
## GTO     0.14  0.00 -0.04  0.00
## HDG     0.10 -0.07 -0.14 -0.12
## HGOT    0.16 -0.05 -0.01 -0.11
## IN     -0.21 -0.01 -0.08  0.01
## JJAT   -0.05 -0.13 -0.25  0.07
## JJPR   -0.03 -0.17 -0.03  0.21
## LD     -0.17 -0.16  0.13 -0.04
## MDCA    0.05 -0.21  0.11  0.22
## MDCO    0.00  0.19 -0.15  0.10
## MDMM   -0.02 -0.10 -0.14  0.17
## MDNE    0.06 -0.02 -0.06  0.22
## MDWO    0.07  0.11 -0.18  0.04
## MDWS    0.06 -0.02 -0.01  0.25
## MENTAL  0.11 -0.02 -0.05  0.16
## NCOMP   0.00 -0.27 -0.05  0.03
## NN     -0.21 -0.10  0.10 -0.07
## OCCUR  -0.13 -0.02 -0.05 -0.06
## PASS   -0.14 -0.13 -0.09 -0.10
## PEAS   -0.06  0.12 -0.19  0.12
## PIT     0.15 -0.10 -0.15 -0.07
## PLACE   0.02  0.09  0.09  0.07
## POLITE  0.09  0.00  0.20  0.11
## POS     0.02  0.09  0.04 -0.05
## PROG    0.09  0.08 -0.04  0.15
## QUAN    0.17 -0.01 -0.16  0.01
## QUPR    0.08  0.11 -0.12  0.21
## QUTAG   0.15 -0.04 -0.07 -0.15
## RB      0.14  0.15 -0.18  0.07
## RP     -0.01  0.22 -0.09  0.15
## SPLIT  -0.03 -0.11 -0.21  0.08
## SPP2    0.17 -0.07  0.10  0.16
## STPR    0.10  0.01  0.01 -0.04
## THATD   0.16 -0.01 -0.14 -0.02
## THRC   -0.05 -0.17 -0.15 -0.02
## THSC   -0.02 -0.08 -0.27  0.07
## TTR    -0.19  0.07 -0.02  0.16
## VBD    -0.10  0.35 -0.05 -0.20
## VBG    -0.14 -0.02 -0.14  0.12
## VBN    -0.14 -0.10 -0.08 -0.07
## VIMP    0.01 -0.07  0.21  0.21
## WHQU    0.13 -0.02  0.20  0.07
## WHSC   -0.09 -0.10 -0.20  0.05
## XX0     0.18  0.01 -0.06  0.06
## YNQU    0.18 -0.03  0.14 -0.02
## TPP3   -0.05  0.30 -0.04 -0.15
## FQTI   -0.07  0.03  0.01  0.14

#write_last_clip()

Graphs of variables

factoextra::fviz_pca_var(pca,
             axes = c(1,2),
             select.var = list(contrib = 30),
             col.var = "contrib", # Colour by contributions to the PC
             gradient.cols = c("#F9B921", "#DB241E", "#722672"),
             title = "",
             repel = TRUE, # Try to avoid too much text overlapping
             ggtheme = ggthemes::theme_few())

#ggsave(here("Plots", "fviz_pca_var_PC1_PC2_Ref3Reg.svg"), width = 11, height = 9)

factoextra::fviz_pca_var(pca,
             axes = c(3,2),
             select.var = list(contrib = 30),
             col.var = "contrib", # Colour by contributions to the PC
             gradient.cols = c("#F9B921", "#DB241E", "#722672"),
             title = "",
             repel = TRUE, # Try to avoid too much text overlapping
             ggtheme = ggthemes::theme_few())

#ggsave(here("Plots", "fviz_pca_var_PC3_PC2_Ref3Reg.svg"), width = 9, height = 8)

factoextra::fviz_pca_var(pca,
             axes = c(3,4),
             select.var = list(contrib = 30),
             col.var = "contrib", # Colour by contributions to the PC
             gradient.cols = c("#F9B921", "#DB241E", "#722672"),
             title = "",
             repel = TRUE, # Try to avoid too much text overlapping
             ggtheme = ggthemes::theme_few())

#ggsave(here("Plots", "fviz_pca_var_PC3_PC4_Ref3Reg.svg"), width = 9, height = 8)

factoextra::fviz_pca_var(pca,
             axes = c(5,6),
             select.var = list(contrib = 25),
             col.var = "contrib", # Colour by contributions to the PC
             gradient.cols = c("#F9B921", "#DB241E", "#722672"),
             title = "",
             repel = TRUE, # Try to avoid too much text overlapping
             ggtheme = ggthemes::theme_few())

#ggsave(here("Plots", "fviz_pca_var_PC5_PC6_Ref3Reg.svg"), width = 9, height = 8)

Exploring feature contributions

library(gridExtra)

## 
## Attaching package: 'gridExtra'

## The following object is masked from 'package:dplyr':
## 
##     combine

ncounts <- readRDS(here("FullMDA", "counts3Reg.rds"))

ncounts %>%
  group_by(Register, Level) %>% 
  summarise(median(Words), mad(Words))

## `summarise()` has grouped output by 'Register'. You can override using the
## `.groups` argument.

## # A tibble: 18 × 4
## # Groups:   Register [3]
##    Register     Level `median(Words)` `mad(Words)`
##    <fct>        <fct>           <dbl>        <dbl>
##  1 Conversation A                826          156.
##  2 Conversation B                801          172.
##  3 Conversation C                796          169.
##  4 Conversation D                794          146.
##  5 Conversation E                914.         305.
##  6 Conversation Ref.            8115         6028.
##  7 Fiction      A                902.         284.
##  8 Fiction      B                887          258.
##  9 Fiction      C                811          218.
## 10 Fiction      D                784.         218.
## 11 Fiction      E                855          194.
## 12 Fiction      Ref.            5950          181.
## 13 Informative  A                833          224.
## 14 Informative  B                890          162.
## 15 Informative  C                857          162.
## 16 Informative  D                839          190.
## 17 Informative  E                849          169.
## 18 Informative  Ref.             767          190.

ncounts %>% 
  filter(Register=="Informative") %>% 
  #filter(Level %in% c("C", "D", "E")) %>% 
  select(Level, VBD, PEAS) %>% 
  group_by(Level) %>% 
  summarise_if(is.numeric, median) %>% 
  mutate_if(is.numeric, round, 2)

## # A tibble: 6 × 3
##   Level   VBD  PEAS
##   <fct> <dbl> <dbl>
## 1 A      22.7  0   
## 2 B      31.9  1.35
## 3 C      31.6  4.26
## 4 D      39.2  7   
## 5 E      33.3  6.35
## 6 Ref.   30    4.08

cols = c("#F9B921", "#A18A33", "#BD241E", "#722672", "#15274D", "darkgrey")

boxfeature <- ncounts %>% 
  filter(Register=="Informative") %>% 
  #filter(Level %in% c("C", "D", "E")) %>% 
  select(Level, FPP1S, SPP2, CONT, EXIST, AWL, XX0, PASS, VBN) %>% 
  ggplot(aes(x = Level, y = CONT, colour = Level, fill = Level)) +
  geom_jitter(size=0.7, alpha=.7) +
  geom_boxplot(outlier.shape = NA, fatten = 2, fill = "white", alpha = 0.3) +
  scale_colour_manual(values = cols) +
  theme_minimal() +
  theme(legend.position = "none") +
  xlab("")

SPP2 <- boxfeature + aes(y = SPP2) + ylim(c(0,80))# These y-axis limits remove individual outliers that overextend the scales and make the real differences invisible
EXIST <- boxfeature + aes(y = EXIST) + ylim(c(0,25)) 
FFP1 <- boxfeature + aes(y = FPP1S) + ylim(c(0,60))  #
AWL <- boxfeature + aes(y = AWL)
XX0 <- boxfeature + aes(y = XX0) + ylim(c(0,25))
PASS <- boxfeature + aes(y = PASS)
VBN <- boxfeature + aes(y = VBN) + ylim(c(0,40))

boxplots <- grid.arrange( PASS, VBN, AWL, EXIST, FFP1, SPP2, boxfeature, XX0, ncol=2, nrow=4)

#ggsave(here("Plots", "BoxplotsInformativeFeatures.svg"), plot = boxplots, dpi = 300, width = 9, height = 11)

Descriptive PCA results

## PCA
#data <- readRDS(here("FullMDA", "dataforPCA.rds")) 
#colnames(data)
pca <- prcomp(data[,9:ncol(data)], scale.=FALSE) # All quantitative variables

## Access to the PCA results
#colnames(data)
res.ind <- cbind(data[,1:8], as.data.frame(pca$x)[,1:4])
head(res.ind)

##                                 Filename     Register Level    Series Country
## 1 Achievers_B1_plus_Informative_0007.txt  Informative     D Achievers   Spain
## 2                 POC_5e_Spoken_0003.txt Conversation     B       POC  France
## 3            Access_4_Narrative_0013.txt      Fiction     D    Access Germany
## 4                  NGL_1_Spoken_0002.txt Conversation     A       NGL Germany
## 5            Access_1_Narrative_0005.txt      Fiction     A    Access Germany
## 6               NGL_2_Narrative_0007.txt      Fiction     B       NGL Germany
##             Corpus             Subcorpus Words        PC1        PC2       PC3
## 1 Textbook.English  Textbook Informative   690 -2.9689011 -1.0213555 0.6131214
## 2 Textbook.English Textbook Conversation   694  2.9897829 -0.8224829 2.0366782
## 3 Textbook.English      Textbook Fiction   547 -0.9601937  1.7910562 0.9396939
## 4 Textbook.English Textbook Conversation   927  2.4059242 -2.2722873 4.3576009
## 5 Textbook.English      Textbook Fiction   840 -0.3931322  0.1616840 3.5972351
## 6 Textbook.English      Textbook Fiction  1127  0.0644907  1.7802591 1.4620307
##          PC4
## 1  2.1252104
## 2  0.9981444
## 3 -0.3421301
## 4 -0.5251778
## 5 -0.5603783
## 6 -0.1834330

## Examine individual text values
res.ind %>% 
  filter(PC1 < 0.5 & PC1 > -1.5 & PC2 < 1.5 & PC2 > -1.5 & PC3 > 0.5) %>% 
  filter(Subcorpus == "Textbook Informative") %>% 
  select(Filename, Level)

##                                           Filename Level
## 1        Solutions_Elementary_Informative_0009.txt     A
## 2  Solutions_Pre-Intermediate_Informative_0003.txt     B
## 3  Solutions_Pre-Intermediate_Informative_0001.txt     B
## 4                       EIM_1_Informative_0005.txt     B
## 5                       NGL_4_Informative_0007.txt     D
## 6                Achievers_B2_Informative_0001.txt     E
## 7                       EIM_1_Informative_0003.txt     B
## 8                       EIM_1_Informative_0001.txt     B
## 9                Achievers_A2_Informative_0006.txt     B
## 10               Achievers_A1_Informative_0001.txt     A
## 11 Solutions_Pre-Intermediate_Informative_0006.txt     B
## 12                      NGL_2_Informative_0001.txt     B
## 13                GreenLine_3_Informative_0002.txt     C
## 14                   Access_5_Informative_0008.txt     E
## 15                GreenLine_1_Informative_0002.txt     A
## 16                GreenLine_4_Informative_0002.txt     D
## 17       Solutions_Elementary_Informative_0004.txt     A
## 18               Achievers_A1_Informative_0003.txt     A
## 19               Achievers_A2_Informative_0004.txt     B
## 20          Achievers_B1_plus_Informative_0002.txt     D
## 21                      EIM_1_Informative_0004.txt     B
## 22                     POC_4e_Informative_0002.txt     C
## 23                GreenLine_3_Informative_0003.txt     C
## 24               Achievers_A1_Informative_0004.txt     A
## 25                GreenLine_3_Informative_0005.txt     C
## 26                     POC_3e_Informative_0001.txt     D
## 27                      JTT_6_Informative_0002.txt     A
## 28                     POC_4e_Informative_0001.txt     C
## 29                EIM_Starter_Informative_0005.txt     A
## 30               Achievers_B1_Informative_0004.txt     C
## 31                GreenLine_2_Informative_0002.txt     B
## 32               Achievers_A2_Informative_0008.txt     B
## 33                      NGL_2_Informative_0002.txt     B
## 34               Achievers_A1_Informative_0002.txt     A
## 35                     POC_5e_Informative_0003.txt     B
## 36       Solutions_Elementary_Informative_0008.txt     A
## 37                     JTT_4_Informative_00051.txt     C
## 38                GreenLine_4_Informative_0003.txt     D
## 39               Achievers_A2_Informative_0009.txt     B
## 40                GreenLine_3_Informative_0007.txt     C
## 41               Achievers_B1_Informative_0009.txt     C
## 42 Solutions_Pre-Intermediate_Informative_0005.txt     B
## 43                EIM_Starter_Informative_0001.txt     A
## 44          Achievers_B1_plus_Informative_0003.txt     D
## 45               Achievers_A2_Informative_0005.txt     B
## 46                   Access_2_Informative_0002.txt     B
## 47                GreenLine_3_Informative_0004.txt     C
## 48                      NGL_4_Informative_0001.txt     D
## 49               Achievers_A2_Informative_0001.txt     B
## 50          Achievers_B1_plus_Informative_0001.txt     D
## 51                GreenLine_4_Informative_0001.txt     D
## 52                   Access_3_Informative_0001.txt     C

res.ind %>% 
  filter(PC4 < -1.5) %>% 
  filter(Register == "Conversation") %>% 
  select(Filename, Level)

##                       Filename Level
## 1     HT_6_ELF_Spoken_0002.txt     A
## 2    JTT_6_ELF_Spoken_0001.txt     A
## 3  GreenLine_1_Spoken_0002.txt     A
## 4  GreenLine_1_Spoken_0001.txt     A
## 5                     SWMV.txt  Ref.
## 6                     STBF.txt  Ref.
## 7                     SU8C.txt  Ref.
## 8                     SC6N.txt  Ref.
## 9                     SN27.txt  Ref.
## 10                    SBQZ.txt  Ref.
## 11                    SEKZ.txt  Ref.
## 12                    SYHW.txt  Ref.
## 13                    S57J.txt  Ref.
## 14                    SD62.txt  Ref.
## 15                    S9HC.txt  Ref.
## 16                    SXXN.txt  Ref.
## 17                    SNZS.txt  Ref.
## 18                    S2NQ.txt  Ref.
## 19                    SXWG.txt  Ref.
## 20                    SZB2.txt  Ref.
## 21                    SXPD.txt  Ref.
## 22                    SVX4.txt  Ref.
## 23                    SFB9.txt  Ref.
## 24                    SDHB.txt  Ref.
## 25                    SKPX.txt  Ref.
## 26                    SCSF.txt  Ref.
## 27                    SZKX.txt  Ref.
## 28                    SKMV.txt  Ref.
## 29                    S8NX.txt  Ref.
## 30                    SAQD.txt  Ref.
## 31                    S37F.txt  Ref.
## 32                    S8SH.txt  Ref.
## 33                    SKDX.txt  Ref.

## Summary statistics
res.ind %>% 
  group_by(Register, Corpus) %>% 
  summarise_if(is.numeric, c(mean = mean, sd = sd)) %>% 
  as.data.frame()

##       Register            Corpus Words_mean   PC1_mean   PC2_mean   PC3_mean
## 1 Conversation  Textbook.English   853.4531  1.5649317 -0.5732427  1.7069082
## 2 Conversation    Spoken.BNC2014 10637.7440  3.7102869 -0.5155994 -0.6389811
## 3      Fiction  Textbook.English   847.4105 -0.4336570  1.2517651  0.9570921
## 4      Fiction     Youth.Fiction  5944.7817 -0.4868147  1.7493984 -0.2056360
## 5  Informative  Textbook.English   840.2131 -2.0502507 -0.5236450  0.3127942
## 6  Informative Informative.Teens   805.4069 -3.0642993 -0.9630382 -0.2271048
##      PC4_mean  Words_sd    PC1_sd    PC2_sd    PC3_sd    PC4_sd
## 1  0.61137062  307.1369 1.2475649 0.7361774 1.4291394 0.8593066
## 2 -0.52542301 8974.1416 0.4851624 0.4693906 0.7551311 0.5179927
## 3  0.01757880  208.4950 1.0472422 0.8382890 1.2243113 0.6429595
## 4  0.40802195  198.6376 0.8955793 0.5209732 0.8836525 0.5221250
## 5  0.04558038  176.3652 1.0387846 0.9593349 1.1042635 0.9772219
## 6 -0.14633809  193.3123 1.0920901 1.1873367 0.8797793 1.1912055

res.ind %>% 
  group_by(Subcorpus, Level) %>% 
  summarise_if(is.numeric, c(mean = mean, sd = sd)) %>% 
  as.data.frame()

##                Subcorpus Level Words_mean   PC1_mean   PC2_mean   PC3_mean
## 1  Textbook Conversation     A   813.0227  1.9092854 -1.0167467  3.4878120
## 2  Textbook Conversation     B   819.4831  2.1123314 -0.6365859  2.3331894
## 3  Textbook Conversation     C   823.6899  1.5949529 -0.3813790  1.3871505
## 4  Textbook Conversation     D   797.8473  1.0912302 -0.4303822  0.7856455
## 5  Textbook Conversation     E  1132.7857  1.0280068 -0.6093334  0.8581428
## 6       Textbook Fiction     A   886.0000  0.1283309  0.2241170  2.7351812
## 7       Textbook Fiction     B   864.1636 -0.2089426  1.3426772  1.7261761
## 8       Textbook Fiction     C   854.2069 -0.2716990  1.6316923  0.7552243
## 9       Textbook Fiction     D   801.7794 -0.8380549  1.3757628  0.2587375
## 10      Textbook Fiction     E   853.2647 -0.6466772  1.2742289  0.2642335
## 11  Textbook Informative     A   851.7143 -1.4581371 -0.9605771  1.8578094
## 12  Textbook Informative     B   844.0345 -1.6345253 -0.4995651  1.2300162
## 13  Textbook Informative     C   838.9000 -1.8729877 -0.5397556  0.2542641
## 14  Textbook Informative     D   847.6476 -2.2357094 -0.3151493 -0.1537030
## 15  Textbook Informative     E   823.2254 -2.5737965 -0.6589216 -0.2817041
## 16   Spoken BNC2014 Ref.  Ref. 10637.7440  3.7102869 -0.5155994 -0.6389811
## 17    Youth Fiction Ref.  Ref.  5944.7817 -0.4868147  1.7493984 -0.2056360
## 18       Info Teens Ref.  Ref.   805.4069 -3.0642993 -0.9630382 -0.2271048
##      PC4_mean  Words_sd    PC1_sd    PC2_sd    PC3_sd    PC4_sd
## 1  -0.1695147  154.5695 0.8944244 0.6008896 1.0268633 0.7300862
## 2   0.3890092  218.8782 0.9114469 0.6934638 1.0426835 0.7859229
## 3   0.7486503  279.1360 1.2022430 0.7255549 1.0449879 0.7947628
## 4   0.9147864  187.9778 1.4471805 0.7204971 1.2556890 0.7921655
## 5   1.0902120  569.8536 1.2984042 0.7754213 0.8799244 0.6196130
## 6  -0.2672265  242.7776 0.9731705 0.6896123 0.8950683 0.8144431
## 7  -0.2856516  222.4384 0.7906850 0.6167031 0.7259353 0.5282789
## 8   0.1153949  198.2061 0.8922167 0.7809139 0.9633007 0.6251870
## 9   0.1457670  196.3337 1.1520193 0.7901601 0.8252677 0.5882694
## 10  0.2019984  195.9728 1.0953892 0.7518765 0.9155087 0.5689872
## 11 -0.7049678  199.7658 0.8248181 0.8644780 0.6872746 0.8517101
## 12 -0.2740043  182.8300 0.9204919 1.0340683 0.7954736 1.0779407
## 13  0.1372063  160.2093 1.0859612 0.9827889 0.9336982 1.0584275
## 14  0.1251155  179.2517 0.9452589 0.9171954 0.9518653 0.8314073
## 15  0.3688725  180.3941 0.9889681 0.9038587 0.7881581 0.8215341
## 16 -0.5254230 8974.1416 0.4851624 0.4693906 0.7551311 0.5179927
## 17  0.4080219  198.6376 0.8955793 0.5209732 0.8836525 0.5221250
## 18 -0.1463381  193.3123 1.0920901 1.1873367 0.8797793 1.1912055

library(gtsummary)
res.ind <- res.ind %>% 
  mutate(Subsubcorpus = paste(Corpus, Register, Level, sep = "_")) %>% 
  mutate(Subsubcorpus = as.factor(Subsubcorpus))
  
res.ind %>% 
  select(PC1, PC2, PC3, PC4, Subsubcorpus) %>% 
  tbl_summary(by = Subsubcorpus,
              digits = list(all_continuous() ~ c(2, 2)),
              statistic = all_continuous() ~  "{mean} ({sd})")

Characteristic	Informative.Teens_Informative_Ref., N = 1,337¹	Spoken.BNC2014_Conversation_Ref., N = 1,250¹	Textbook.English_Conversation_A, N = 88¹	Textbook.English_Conversation_B, N = 118¹	Textbook.English_Conversation_C, N = 158¹	Textbook.English_Conversation_D, N = 131¹	Textbook.English_Conversation_E, N = 70¹	Textbook.English_Fiction_A, N = 36¹	Textbook.English_Fiction_B, N = 55¹	Textbook.English_Fiction_C, N = 58¹	Textbook.English_Fiction_D, N = 68¹	Textbook.English_Fiction_E, N = 68¹	Textbook.English_Informative_A, N = 28¹	Textbook.English_Informative_B, N = 58¹	Textbook.English_Informative_C, N = 90¹	Textbook.English_Informative_D, N = 105¹	Textbook.English_Informative_E, N = 71¹	Youth.Fiction_Fiction_Ref., N = 1,191¹
PC1	-3.06 (1.09)	3.71 (0.49)	1.91 (0.89)	2.11 (0.91)	1.59 (1.20)	1.09 (1.45)	1.03 (1.30)	0.13 (0.97)	-0.21 (0.79)	-0.27 (0.89)	-0.84 (1.15)	-0.65 (1.10)	-1.46 (0.82)	-1.63 (0.92)	-1.87 (1.09)	-2.24 (0.95)	-2.57 (0.99)	-0.49 (0.90)
PC2	-0.96 (1.19)	-0.52 (0.47)	-1.02 (0.60)	-0.64 (0.69)	-0.38 (0.73)	-0.43 (0.72)	-0.61 (0.78)	0.22 (0.69)	1.34 (0.62)	1.63 (0.78)	1.38 (0.79)	1.27 (0.75)	-0.96 (0.86)	-0.50 (1.03)	-0.54 (0.98)	-0.32 (0.92)	-0.66 (0.90)	1.75 (0.52)
PC3	-0.23 (0.88)	-0.64 (0.76)	3.49 (1.03)	2.33 (1.04)	1.39 (1.04)	0.79 (1.26)	0.86 (0.88)	2.74 (0.90)	1.73 (0.73)	0.76 (0.96)	0.26 (0.83)	0.26 (0.92)	1.86 (0.69)	1.23 (0.80)	0.25 (0.93)	-0.15 (0.95)	-0.28 (0.79)	-0.21 (0.88)
PC4	-0.15 (1.19)	-0.53 (0.52)	-0.17 (0.73)	0.39 (0.79)	0.75 (0.79)	0.91 (0.79)	1.09 (0.62)	-0.27 (0.81)	-0.29 (0.53)	0.12 (0.63)	0.15 (0.59)	0.20 (0.57)	-0.70 (0.85)	-0.27 (1.08)	0.14 (1.06)	0.13 (0.83)	0.37 (0.82)	0.41 (0.52)
¹ Mean (SD)

res.ind %>% 
  select(Register, Level, PC4) %>% 
  group_by(Register, Level) %>% 
  summarise_if(is.numeric, c(Median = median, MAD = mad))

## # A tibble: 18 × 4
## # Groups:   Register [3]
##    Register     Level  Median   MAD
##    <fct>        <fct>   <dbl> <dbl>
##  1 Conversation A     -0.0696 0.684
##  2 Conversation B      0.422  0.801
##  3 Conversation C      0.735  0.797
##  4 Conversation D      0.857  0.695
##  5 Conversation E      1.09   0.609
##  6 Conversation Ref.  -0.541  0.522
##  7 Fiction      A     -0.382  0.749
##  8 Fiction      B     -0.388  0.529
##  9 Fiction      C      0.125  0.670
## 10 Fiction      D      0.0135 0.539
## 11 Fiction      E      0.117  0.672
## 12 Fiction      Ref.   0.404  0.499
## 13 Informative  A     -0.706  0.823
## 14 Informative  B     -0.374  1.10 
## 15 Informative  C      0.0700 1.09 
## 16 Informative  D      0.0409 0.787
## 17 Informative  E      0.369  0.565
## 18 Informative  Ref.  -0.0753 1.21

# Search for example texts to illustrate results
res.ind %>% 
  filter(PC3 > 2.5 & PC2 < -2) %>% 
  #filter(Register=="Conversation") %>% 
  #filter(Level == "B") %>% 
  #filter(PC1 > 4.7) %>% 
  select(Filename, PC1, PC2, PC3) %>% 
  mutate(across(where(is.numeric), round, 2))

##                    Filename   PC1   PC2  PC3
## 1     NGL_1_Spoken_0002.txt  2.41 -2.27 4.36
## 2  HT_5_ELF_Spoken_0003.txt  1.94 -2.30 3.49
## 3 HT_6_Informative_0001.txt -2.19 -2.36 3.11

Raincloud plots

res.ind$Subcorpus <- fct_relevel(res.ind$Subcorpus, "Spoken BNC2014 Ref.", "Textbook Conversation", "Youth Fiction Ref.", "Textbook Fiction", "Info Teens Ref.", "Textbook Informative")

colours <- suf_palette(name = "london", n = 6, type = "continuous") 
colours2 <- suf_palette(name = "classic", n = 5, type = "continuous") 
colours <- c(colours, colours2[c(2:4)]) # Nine colours range
palette <- colours[c(1,5,6,2,3,8,7,4,9)] # Good order for PCA

colours <- palette[c(1,8,9,2,7,3)]

ggplot(res.ind, aes(x=Subcorpus,y=PC1, fill = Subcorpus, colour = Subcorpus))+ # Or leave out "colour = Register" to keep the dots in black
  geom_flat_violin(position = position_nudge(x = .25, y = 0),adjust = 2, trim = FALSE)+
  geom_point(position = position_jitter(width = .15), size = .25)+
# note that here we need to set the x-variable to a numeric variable and bump it to get the boxplots to line up with the rainclouds. 
  geom_boxplot(aes(x = as.numeric(Subcorpus)+0.25, y = PC1), outlier.shape = NA, alpha = 0.3, width = .15, colour = "BLACK") +
  ylab('PC1')+ 
  theme_cowplot()+
  theme(axis.title.x=element_blank())+
  guides(fill = "none", colour = "none") +
  scale_colour_manual(values = colours)+
  scale_fill_manual(values = colours) +
  annotate(geom = "text", x = 1.5, y = -7, label = "Conversation", size = 5) +
  annotate(geom = "segment", x = 0.7, xend = 2.5, y = -6.5, yend = -6.5) +
  annotate(geom = "text", x = 3.5, y = -7, label = "Fiction", size = 5) +
  annotate(geom = "segment", x = 2.7, xend = 4.5, y = -6.5, yend = -6.5) +
  annotate(geom = "text", x = 5.7, y = -7, label = "Informative", size = 5) +
  annotate(geom = "segment", x = 4.7, xend = 6.5, y = -6.5, yend = -6.5) +
  scale_x_discrete(labels=rep(c("Reference", "Textbook"), 3))+
  scale_y_continuous(sec.axis = dup_axis(name=NULL), breaks = seq(from = -6, to = 5, by = 1))

#ggsave(here("Plots", "PC11_3RegComparison.svg"), width = 13, height = 8)
#ggsave(here("Plots", "PC1_3RegComparison.png"), width = 20, height = 15, units = "cm", dpi = 300)

Mixed-effects models predicting PC scores

Adding the ‘source’ variable for the random effects

library(stringr)

# Add Source variable for random effect variable in mixed effect models
res.ind <- res.ind %>% 
  mutate(Source = case_when(
  Corpus=="Youth.Fiction" ~ paste("Book", str_extract(Filename, "[0-9]{1,3}"), sep = ""),
  Corpus=="Spoken.BNC2014" ~ "Spoken.BNC2014",
  Corpus=="Textbook.English" ~ as.character(Series),
  Corpus=="Informative.Teens" ~ str_extract(Filename, "BBC|Science_Tech"),
  TRUE ~ "NA")) %>% 
  mutate(Source = case_when(
  Corpus=="Informative.Teens" & is.na(Source) ~ str_remove(Filename, "_.*"),
  TRUE ~ as.character(Source))) %>% 
  mutate(Source = as.factor(Source)) %>% 
  mutate(Corpus = case_when(
    Corpus=="Textbook.English" ~ "Textbook",
    Corpus=="Informative.Teens" ~ "Reference",
    Corpus=="Spoken.BNC2014" ~ "Reference",
    Corpus=="Youth.Fiction" ~ "Reference"
  )) %>% 
  mutate(Corpus = as.factor(Corpus))

summary(res.ind$Source)

## Spoken.BNC2014         Access      Solutions            NGL      GreenLine 
##           1250            219            207            188            127 
##      Achievers             HT      TeenVogue       WhyFiles        History 
##            107            104            100            100             99 
##           Teen   Encyclopedia    Factmonster           Dogo      Ducksters 
##             98             97             97             96             96 
##            JTT            EIM        Science          Quatr            BBC 
##             96             95             95             93             92 
##          World       Revision   Science_Tech            POC   TweenTribute 
##             83             82             81             59             28 
##          Book1         Book10        Book100        Book101        Book102 
##              4              4              4              4              4 
##        Book103        Book104        Book105        Book106        Book107 
##              4              4              4              4              4 
##        Book108        Book109         Book11        Book110        Book111 
##              4              4              4              4              4 
##        Book112        Book113        Book114        Book115        Book116 
##              4              4              4              4              4 
##        Book117        Book118        Book119         Book12        Book120 
##              4              4              4              4              4 
##        Book121        Book122        Book123        Book124        Book125 
##              4              4              4              4              4 
##        Book126        Book127        Book128        Book129         Book13 
##              4              4              4              4              4 
##        Book130        Book131        Book132        Book133        Book134 
##              4              4              4              4              4 
##        Book135        Book136        Book137        Book138        Book139 
##              4              4              4              4              4 
##         Book14        Book140        Book141        Book142        Book143 
##              4              4              4              4              4 
##        Book144        Book145        Book146        Book147        Book148 
##              4              4              4              4              4 
##        Book149         Book15        Book150        Book151        Book152 
##              4              4              4              4              4 
##        Book153        Book154        Book155        Book156        Book157 
##              4              4              4              4              4 
##        Book158        Book159         Book16        Book160        Book161 
##              4              4              4              4              4 
##        Book162        Book163        Book164        Book165        (Other) 
##              4              4              4              4            895

# Change the reference level to a theoretically more meaningful level and one that is better populated (see, e.g., https://stats.stackexchange.com/questions/430770/in-a-multilevel-linear-regression-how-does-the-reference-level-affect-other-lev)
summary(res.ind$Corpus)

## Reference  Textbook 
##      3778      1202

summary(res.ind$Subcorpus)

##   Spoken BNC2014 Ref. Textbook Conversation    Youth Fiction Ref. 
##                  1250                   565                  1191 
##      Textbook Fiction       Info Teens Ref.  Textbook Informative 
##                   285                  1337                   352

summary(res.ind$Level)

##    A    B    C    D    E Ref. 
##  152  231  306  304  209 3778

res.ind$Level <- relevel(res.ind$Level, "Ref.")
res.ind$Subcorpus <- factor(res.ind$Subcorpus, levels = c("Spoken BNC2014 Ref.", "Textbook Conversation", "Youth Fiction Ref.", "Textbook Fiction", "Info Teens Ref.", "Textbook Informative"))
res.ind$Corpus <- relevel(res.ind$Corpus, "Reference")

PC1

# Mixed-effects models
# PC1

# no random effects
summary(lm(PC1 ~ Level:Register, data = res.ind))

## 
## Call:
## lm(formula = PC1 ~ Level:Register, data = res.ind)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.0504 -0.5218  0.0214  0.5277  4.3593 
## 
## Coefficients: (1 not defined because of singularities)
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     -2.5738     0.1097 -23.465  < 2e-16 ***
## LevelRef.:RegisterConversation   6.2841     0.1128  55.729  < 2e-16 ***
## LevelA:RegisterConversation      4.4831     0.1474  30.406  < 2e-16 ***
## LevelB:RegisterConversation      4.6861     0.1388  33.757  < 2e-16 ***
## LevelC:RegisterConversation      4.1687     0.1321  31.569  < 2e-16 ***
## LevelD:RegisterConversation      3.6650     0.1362  26.908  < 2e-16 ***
## LevelE:RegisterConversation      3.6018     0.1557  23.137  < 2e-16 ***
## LevelRef.:RegisterFiction        2.0870     0.1129  18.483  < 2e-16 ***
## LevelA:RegisterFiction           2.7021     0.1891  14.289  < 2e-16 ***
## LevelB:RegisterFiction           2.3649     0.1660  14.244  < 2e-16 ***
## LevelC:RegisterFiction           2.3021     0.1636  14.073  < 2e-16 ***
## LevelD:RegisterFiction           1.7357     0.1568  11.068  < 2e-16 ***
## LevelE:RegisterFiction           1.9271     0.1568  12.288  < 2e-16 ***
## LevelRef.:RegisterInformative   -0.4905     0.1126  -4.358 1.34e-05 ***
## LevelA:RegisterInformative       1.1157     0.2063   5.409 6.63e-08 ***
## LevelB:RegisterInformative       0.9393     0.1636   5.742 9.92e-09 ***
## LevelC:RegisterInformative       0.7008     0.1467   4.777 1.83e-06 ***
## LevelD:RegisterInformative       0.3381     0.1420   2.381   0.0173 *  
## LevelE:RegisterInformative           NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9243 on 4962 degrees of freedom
## Multiple R-squared:  0.8866, Adjusted R-squared:  0.8862 
## F-statistic:  2282 on 17 and 4962 DF,  p-value: < 2.2e-16

# with random effects (slopes + intercepts)
md_source <- lmer(PC1 ~ 1 + (Register|Source), res.ind, REML = FALSE) 
md_corpus <- update(md_source, .~. + Level)
md_register <- update(md_source, . ~ . + Register)
md_both <- update(md_corpus, .~. + Register)
md_interaction <- update(md_both, . ~ . + Level:Register)

anova(md_source, md_corpus, md_register, md_both, md_interaction)

## Data: res.ind
## Models:
## md_source: PC1 ~ 1 + (Register | Source)
## md_register: PC1 ~ (Register | Source) + Register
## md_corpus: PC1 ~ (Register | Source) + Level
## md_both: PC1 ~ (Register | Source) + Level + Register
## md_interaction: PC1 ~ (Register | Source) + Level + Register + Level:Register
##                npar   AIC   BIC  logLik deviance   Chisq Df Pr(>Chisq)    
## md_source         8 12421 12473 -6202.5    12405                          
## md_register      10 12357 12422 -6168.4    12337  68.106  2  1.625e-15 ***
## md_corpus        13 12181 12266 -6077.5    12155 181.996  3  < 2.2e-16 ***
## md_both          15 12117 12215 -6043.5    12087  67.843  2  1.854e-15 ***
## md_interaction   25 12098 12261 -6024.0    12048  39.104 10  2.435e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

tab_model(md_interaction, wrap.labels = 200)

	PC1
Predictors	Estimates	CI	p
(Intercept)	3.71	2.46 – 4.96	<0.001
Level [A]	-1.73	-3.06 – -0.40	0.011
Level [B]	-1.65	-2.97 – -0.32	0.015
Level [C]	-2.08	-3.40 – -0.76	0.002
Level [D]	-2.47	-3.80 – -1.15	<0.001
Level [E]	-2.73	-4.06 – -1.40	<0.001
Register [Fiction]	-4.20	-5.45 – -2.95	<0.001
Register [Informative]	-6.75	-8.03 – -5.47	<0.001
Level [A] * Register [Fiction]	2.18	0.86 – 3.49	0.001
Level [B] * Register [Fiction]	1.71	0.41 – 3.01	0.010
Level [C] * Register [Fiction]	2.12	0.83 – 3.42	0.001
Level [D] * Register [Fiction]	1.93	0.64 – 3.23	0.003
Level [E] * Register [Fiction]	2.41	1.10 – 3.71	<0.001
Level [A] * Register [Informative]	3.37	2.01 – 4.73	<0.001
Level [B] * Register [Informative]	3.17	1.83 – 4.51	<0.001
Level [C] * Register [Informative]	3.24	1.90 – 4.57	<0.001
Level [D] * Register [Informative]	3.20	1.87 – 4.53	<0.001
Level [E] * Register [Informative]	3.14	1.80 – 4.48	<0.001
Random Effects
σ²	0.59
τ₀₀ _Source	0.41
τ₁₁ _{Source.RegisterFiction}	0.12
τ₁₁ _{Source.RegisterInformative}	0.20
ρ₀₁	-0.05
	-0.48
ICC	0.41
N _Source	325
Observations	4980
Marginal R² / Conditional R²	0.870 / 0.923

# Random effects
ranef <- as.data.frame(ranef(md_interaction))

ranef %>% 
  filter(grp == "TeenVogue")

##   grpvar                term       grp    condval    condsd
## 1 Source         (Intercept) TeenVogue  0.9017680 0.4340609
## 2 Source     RegisterFiction TeenVogue -0.1999336 0.3289553
## 3 Source RegisterInformative TeenVogue  0.2052211 0.4298679

Testing assumptions

This chunk was used to check the assumptions of all the other models below.

# cf. https://stackoverflow.com/questions/63751541/plot-does-not-show-all-diagnostic-plots-for-lme-lmer

# check distribution of residuals
plot(md_interaction)

# scale-location plot
plot(md_interaction,
     sqrt(abs(resid(.)))~fitted(.),
     type=c("p","smooth"), col.line=1)

# Q-Q plot
lattice::qqmath(md_interaction)

PC2 to PC6

# PC2
# Note that the model with random intercepts and slopes failed to converge for PC2 which is why only slopes are modelled here
md_source <- lmer(PC2 ~ 1 + (1|Source), res.ind, REML = FALSE) 
md_corpus <- update(md_source, .~. + Level)
md_register <- update(md_source, . ~ . + Register)
md_both <- update(md_corpus, .~. + Register)
md_interaction <- update(md_both, . ~ . + Level:Register)

anova(md_source, md_corpus, md_register, md_both, md_interaction)

## Data: res.ind
## Models:
## md_source: PC2 ~ 1 + (1 | Source)
## md_register: PC2 ~ (1 | Source) + Register
## md_corpus: PC2 ~ (1 | Source) + Level
## md_both: PC2 ~ (1 | Source) + Level + Register
## md_interaction: PC2 ~ (1 | Source) + Level + Register + Level:Register
##                npar   AIC   BIC  logLik deviance    Chisq Df Pr(>Chisq)    
## md_source         3 12281 12301 -6137.6    12275                           
## md_register       5 10761 10794 -5375.7    10751 1523.882  2  < 2.2e-16 ***
## md_corpus         8 12138 12190 -6061.1    12122    0.000  3          1    
## md_both          10 10616 10681 -5298.1    10596 1525.963  2  < 2.2e-16 ***
## md_interaction   20 10550 10680 -5254.9    10510   86.436 10  2.718e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

tab_model(md_interaction, wrap.labels = 200)

	PC2
Predictors	Estimates	CI	p
(Intercept)	-0.52	-1.27 – 0.24	0.183
Level [A]	-0.54	-1.35 – 0.28	0.195
Level [B]	-0.15	-0.96 – 0.66	0.721
Level [C]	0.08	-0.72 – 0.89	0.842
Level [D]	0.04	-0.76 – 0.85	0.915
Level [E]	0.07	-0.75 – 0.89	0.866
Register [Fiction]	2.27	1.51 – 3.03	<0.001
Register [Informative]	-0.46	-1.24 – 0.32	0.250
Level [A] * Register [Fiction]	-1.01	-1.82 – -0.21	0.014
Level [B] * Register [Fiction]	-0.25	-1.04 – 0.54	0.532
Level [C] * Register [Fiction]	-0.21	-1.00 – 0.58	0.602
Level [D] * Register [Fiction]	-0.41	-1.20 – 0.38	0.307
Level [E] * Register [Fiction]	-0.47	-1.26 – 0.32	0.246
Level [A] * Register [Informative]	0.49	-0.34 – 1.33	0.246
Level [B] * Register [Informative]	0.59	-0.22 – 1.40	0.154
Level [C] * Register [Informative]	0.26	-0.54 – 1.06	0.524
Level [D] * Register [Informative]	0.55	-0.26 – 1.35	0.183
Level [E] * Register [Informative]	0.33	-0.49 – 1.14	0.431
Random Effects
σ²	0.45
τ₀₀ _Source	0.15
ICC	0.25
N _Source	325
Observations	4980
Marginal R² / Conditional R²	0.671 / 0.753

# PC3
#md_source <- lmer(PC3 ~ 1 + (Register|Source), res.ind, REML = FALSE) # Fails to converge
md_source <- lmer(PC3 ~ 1 + (1|Source), res.ind, REML = FALSE)
md_corpus <- update(md_source, .~. + Level)
md_register <- update(md_source, . ~ . + Register)
md_both <- update(md_corpus, .~. + Register)
md_interaction <- update(md_both, . ~ . + Level:Register)

anova(md_source, md_corpus, md_register, md_both, md_interaction)

## Data: res.ind
## Models:
## md_source: PC3 ~ 1 + (1 | Source)
## md_register: PC3 ~ (1 | Source) + Register
## md_corpus: PC3 ~ (1 | Source) + Level
## md_both: PC3 ~ (1 | Source) + Level + Register
## md_interaction: PC3 ~ (1 | Source) + Level + Register + Level:Register
##                npar   AIC   BIC  logLik deviance    Chisq Df Pr(>Chisq)    
## md_source         3 13523 13542 -6758.3    13517                           
## md_register       5 12988 13020 -6489.0    12978  538.750  2  < 2.2e-16 ***
## md_corpus         8 11928 11981 -5956.2    11912 1065.455  3  < 2.2e-16 ***
## md_both          10 11466 11531 -5722.8    11446  466.870  2  < 2.2e-16 ***
## md_interaction   20 11461 11592 -5710.7    11421   24.264 10   0.006929 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

tab_model(md_interaction, wrap.labels = 200)

	PC3
Predictors	Estimates	CI	p
(Intercept)	-0.64	-1.99 – 0.71	0.354
Level [A]	4.25	2.82 – 5.69	<0.001
Level [B]	3.09	1.66 – 4.52	<0.001
Level [C]	2.12	0.69 – 3.55	0.004
Level [D]	1.64	0.21 – 3.07	0.024
Level [E]	1.29	-0.15 – 2.73	0.078
Register [Fiction]	0.43	-0.92 – 1.79	0.533
Register [Informative]	0.43	-0.96 – 1.83	0.544
Level [A] * Register [Fiction]	-1.50	-2.89 – -0.12	0.033
Level [B] * Register [Fiction]	-1.39	-2.76 – -0.01	0.048
Level [C] * Register [Fiction]	-1.34	-2.71 – 0.03	0.056
Level [D] * Register [Fiction]	-1.35	-2.72 – 0.03	0.055
Level [E] * Register [Fiction]	-1.03	-2.41 – 0.35	0.142
Level [A] * Register [Informative]	-1.92	-3.35 – -0.49	0.008
Level [B] * Register [Informative]	-1.45	-2.86 – -0.03	0.045
Level [C] * Register [Informative]	-1.36	-2.77 – 0.05	0.058
Level [D] * Register [Informative]	-1.43	-2.84 – -0.02	0.047
Level [E] * Register [Informative]	-1.53	-2.95 – -0.11	0.034
Random Effects
σ²	0.52
τ₀₀ _Source	0.48
ICC	0.48
N _Source	325
Observations	4980
Marginal R² / Conditional R²	0.425 / 0.700

# PC4
#md_source <- lmer(PC4 ~ 1 + (Register|Source), res.ind, REML = FALSE) # Singular fit
md_source <- lmer(PC4 ~ 1 + (1|Source), res.ind, REML = FALSE)
md_corpus <- update(md_source, .~. + Level)
md_register <- update(md_source, . ~ . + Register)
md_both <- update(md_corpus, .~. + Register)
md_interaction <- update(md_both, . ~ . + Level:Register)

anova(md_source, md_corpus, md_register, md_both, md_interaction)

## Data: res.ind
## Models:
## md_source: PC4 ~ 1 + (1 | Source)
## md_register: PC4 ~ (1 | Source) + Register
## md_corpus: PC4 ~ (1 | Source) + Level
## md_both: PC4 ~ (1 | Source) + Level + Register
## md_interaction: PC4 ~ (1 | Source) + Level + Register + Level:Register
##                npar   AIC   BIC  logLik deviance    Chisq Df Pr(>Chisq)    
## md_source         3 11019 11039 -5506.5    11013                           
## md_register       5 10825 10857 -5407.4    10815 198.2593  2    < 2e-16 ***
## md_corpus         8 10827 10879 -5405.3    10811   4.0956  3     0.2513    
## md_both          10 10563 10628 -5271.6    10543 267.5805  2    < 2e-16 ***
## md_interaction   20 10527 10657 -5243.3    10487  56.4370 10    1.7e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

tab_model(md_interaction, wrap.labels = 200)

	PC4
Predictors	Estimates	CI	p
(Intercept)	-0.53	-1.30 – 0.25	0.186
Level [A]	0.36	-0.47 – 1.20	0.392
Level [B]	0.91	0.08 – 1.74	0.033
Level [C]	1.28	0.45 – 2.11	0.002
Level [D]	1.45	0.62 – 2.28	0.001
Level [E]	1.64	0.80 – 2.47	<0.001
Register [Fiction]	0.93	0.15 – 1.71	0.019
Register [Informative]	0.38	-0.43 – 1.18	0.358
Level [A] * Register [Fiction]	-1.11	-1.94 – -0.29	0.008
Level [B] * Register [Fiction]	-1.67	-2.48 – -0.85	<0.001
Level [C] * Register [Fiction]	-1.59	-2.40 – -0.78	<0.001
Level [D] * Register [Fiction]	-1.75	-2.55 – -0.94	<0.001
Level [E] * Register [Fiction]	-1.86	-2.67 – -1.05	<0.001
Level [A] * Register [Informative]	-0.92	-1.78 – -0.07	0.034
Level [B] * Register [Informative]	-1.02	-1.85 – -0.19	0.016
Level [C] * Register [Informative]	-1.00	-1.83 – -0.18	0.017
Level [D] * Register [Informative]	-1.20	-2.03 – -0.38	0.004
Level [E] * Register [Informative]	-1.16	-1.99 – -0.32	0.007
Random Effects
σ²	0.45
τ₀₀ _Source	0.16
ICC	0.26
N _Source	325
Observations	4980
Marginal R² / Conditional R²	0.234 / 0.434

PC5 and PC6 were not included in the PhD thesis in the end.

# PC5
#md_source <- lmer(PC5 ~ 1 + (Register|Source), res.ind, REML = FALSE) # Fails to converge
md_source <- lmer(PC5 ~ 1 + (1|Source), res.ind, REML = FALSE) 
md_corpus <- update(md_source, .~. + Level)
md_register <- update(md_source, . ~ . + Register)
md_both <- update(md_corpus, .~. + Register)
md_interaction <- update(md_both, . ~ . + Level:Register)

anova(md_source, md_corpus, md_register, md_both, md_interaction)

tab_model(md_interaction, wrap.labels = 200)

# PC6
#md_source <- lmer(PC6 ~ 1 + (Register|Source), res.ind, REML = FALSE) # Fails to converge
md_source <- lmer(PC6 ~ 1 + (1|Source), res.ind, REML = FALSE) 
md_corpus <- update(md_source, .~. + Level)
md_register <- update(md_source, . ~ . + Register)
md_both <- update(md_corpus, .~. + Register)
md_interaction <- update(md_both, . ~ . + Level:Register)

anova(md_source, md_corpus, md_register, md_both, md_interaction)

tab_model(md_interaction, wrap.labels = 200)

This chunk was used to generate plots for all of the models computed in the chunk above.

# Tweak plot aesthetics with: https://cran.r-project.org/web/packages/sjPlot/vignettes/custplot.html
# Colour customisation trick from: https://stackoverflow.com/questions/55598920/different-line-colors-in-forest-plot-output-from-sjplot-r-package

plot_model(md_interaction, 
           #type = "re", # Option to visualise random effects 
           show.intercept = TRUE,
           show.values=TRUE, 
           show.p=TRUE,
           value.offset = .4,
           value.size = 3.5,
           colors = palette[c(1:3,7:9)],
           group.terms = c(1,5,5,5,5,5,6,4,2,2,2,2,2,3,3,3,3,3), 
           title="Fixed effects",
           wrap.labels = 40,
           axis.title = "PC6 estimated coefficients") +
  theme_sjplot2()

#ggsave(here("Plots", "TxBRef3Reg_PCA6_lmer_fixed.svg"), height = 6, width = 7)

#svg(here("Plots", "TxBReg3Reg_predicted_PC6_scores_interactions.svg"), height = 8, width = 9)
visreg(md_interaction, xvar = "Level", by="Register", 
       #type = "contrast",
       type = "conditional",
       line=list(col="darkred"), 
       points=list(cex=0.3),
       xlab = "Ref. corpora and textbook level (A to E)", ylab = "PC6",
       layout=c(3,1)
)

#dev.off()

# Random effects
ranef <- as.data.frame(ranef(md_interaction))

ranef %>% 
  filter(grp %in% c("TeenVogue", "BBC", "Dogo", "Ducksters", "Encyclopedia", "Factmonster", "History", "Quatr", "Revision", "Science", "Science_Tech", "Teen", "TweenTribute", "WhyFiles", "World")) %>% 
  ggplot(aes(x = grp, y = condval)) +
  geom_point() +
  coord_flip()

ranef %>% 
  filter(grp %in% levels(data$Series)) %>% 
  ggplot(aes(x = grp, y = condval)) +
  geom_point() +
  coord_flip()

# Modeling subcorpora
md_subcorpus <- lmer(PC1 ~ Subcorpus + (1|Source), res.ind, REML = FALSE)
tab_model(md_subcorpus)

	PC1
Predictors	Estimates	CI	p
(Intercept)	3.71	2.35 – 5.07	<0.001
Subcorpus [Textbook Conversation]	-2.14	-3.57 – -0.71	0.003
Subcorpus [Youth Fiction Ref.]	-4.20	-5.56 – -2.84	<0.001
Subcorpus [Textbook Fiction]	-4.27	-5.71 – -2.84	<0.001
Subcorpus [Info Teens Ref.]	-6.75	-8.15 – -5.35	<0.001
Subcorpus [Textbook Informative]	-5.81	-7.24 – -4.38	<0.001
Random Effects
σ²	0.63
τ₀₀ _Source	0.48
ICC	0.43
N _Source	325
Observations	4980
Marginal R² / Conditional R²	0.856 / 0.918

plot_model(md_subcorpus, 
           #type = "re", # Option to visualise random effects 
           show.intercept = TRUE,
           show.values=TRUE, 
           show.p=TRUE,
           value.offset = .4,
           value.size = 3.5,
           colors = palette[c(1:3,7:9)],
           group.terms = c(1,5,2,6,4,3), 
           wrap.labels = 40,
           axis.title = "PC6 estimated coefficients",
           title="Fixed effects") +
  theme_sjplot2()

subcorpus_results <- emmeans(md_subcorpus, "Subcorpus")

## Note: D.f. calculations have been disabled because the number of observations exceeds 3000.
## To enable adjustments, add the argument 'pbkrtest.limit = 4980' (or larger)
## [or, globally, 'set emm_options(pbkrtest.limit = 4980)' or larger];
## but be warned that this may result in large computation time and memory use.

## Note: D.f. calculations have been disabled because the number of observations exceeds 3000.
## To enable adjustments, add the argument 'lmerTest.limit = 4980' (or larger)
## [or, globally, 'set emm_options(lmerTest.limit = 4980)' or larger];
## but be warned that this may result in large computation time and memory use.

summary(subcorpus_results)

##  Subcorpus             emmean     SE  df asymp.LCL asymp.UCL
##  Spoken BNC2014 Ref.    3.710 0.6920 Inf     2.354     5.067
##  Textbook Conversation  1.573 0.2334 Inf     1.116     2.031
##  Youth Fiction Ref.    -0.491 0.0461 Inf    -0.581    -0.400
##  Textbook Fiction      -0.564 0.2361 Inf    -1.026    -0.101
##  Info Teens Ref.       -3.039 0.1800 Inf    -3.392    -2.686
##  Textbook Informative  -2.099 0.2347 Inf    -2.559    -1.639
## 
## Degrees-of-freedom method: asymptotic 
## Confidence level used: 0.95

#write_last_clip()

Packages used in this script

#packages.bib <- sapply(1:length(loadedNamespaces()), function(i) toBibtex(citation(loadedNamespaces()[i])))

knitr::write_bib(c(.packages(), "knitr"), "packages.bib")

sessionInfo()

## R version 4.0.3 (2020-10-10)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur 10.16
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices datasets  utils     methods   base     
## 
## other attached packages:
##  [1] stringr_1.4.0     gtsummary_1.5.0   gridExtra_2.3     visreg_2.7.0     
##  [5] suffrager_0.1.0   sjPlot_2.8.9      psych_2.0.12      purrr_0.3.4      
##  [9] PCAtools_2.2.0    ggrepel_0.9.1     patchwork_1.1.1   lme4_1.1-27.1    
## [13] Matrix_1.3-2      ggthemes_4.2.4    here_1.0.1        forcats_0.5.1    
## [17] factoextra_1.0.7  emmeans_1.5.4     dplyr_1.0.8       DescTools_0.99.40
## [21] cowplot_1.1.1     caret_6.0-86      ggplot2_3.3.5     lattice_0.20-41  
## [25] car_3.0-10        carData_3.0-4    
## 
## loaded via a namespace (and not attached):
##   [1] backports_1.4.1           readxl_1.3.1             
##   [3] plyr_1.8.6                splines_4.0.3            
##   [5] BiocParallel_1.24.1       digest_0.6.29            
##   [7] foreach_1.5.1             htmltools_0.5.2          
##   [9] fansi_1.0.3               checkmate_2.0.0          
##  [11] magrittr_2.0.3            openxlsx_4.2.3           
##  [13] recipes_0.1.15            modelr_0.1.8             
##  [15] gower_0.2.2               matrixStats_0.58.0       
##  [17] colorspace_2.0-3          haven_2.3.1              
##  [19] xfun_0.29                 crayon_1.5.1             
##  [21] jsonlite_1.8.0            Exact_2.1                
##  [23] survival_3.2-7            iterators_1.0.13         
##  [25] glue_1.6.2                gtable_0.3.0             
##  [27] ipred_0.9-11              sjstats_0.18.1           
##  [29] DelayedArray_0.16.2       sjmisc_2.8.6             
##  [31] BiocSingular_1.6.0        BiocGenerics_0.36.0      
##  [33] abind_1.4-5               scales_1.2.0             
##  [35] mvtnorm_1.1-1             DBI_1.1.1                
##  [37] rstatix_0.7.0             ggeffects_1.0.1          
##  [39] Rcpp_1.0.8.3              xtable_1.8-4             
##  [41] performance_0.7.2         tmvnsim_1.0-2            
##  [43] dqrng_0.2.1               foreign_0.8-81           
##  [45] rsvd_1.0.3                stats4_4.0.3             
##  [47] lava_1.6.9                prodlim_2019.11.13       
##  [49] datawizard_0.2.1          ellipsis_0.3.2           
##  [51] farver_2.1.0              pkgconfig_2.0.3          
##  [53] nnet_7.3-15               sass_0.4.1               
##  [55] utf8_1.2.2                labeling_0.4.2           
##  [57] tidyselect_1.1.2          rlang_1.0.2              
##  [59] reshape2_1.4.4            effectsize_0.4.4-2       
##  [61] munsell_0.5.0             cellranger_1.1.0         
##  [63] tools_4.0.3               cli_3.2.0                
##  [65] generics_0.1.2            sjlabelled_1.1.7         
##  [67] broom_0.7.9               evaluate_0.14            
##  [69] fastmap_1.1.0             yaml_2.2.1               
##  [71] ModelMetrics_1.2.2.2      knitr_1.37               
##  [73] zip_2.1.1                 rootSolve_1.8.2.1        
##  [75] nlme_3.1-152              sparseMatrixStats_1.2.1  
##  [77] compiler_4.0.3            rstudioapi_0.13          
##  [79] curl_4.3.2                ggsignif_0.6.1           
##  [81] e1071_1.7-4               gt_0.3.1                 
##  [83] broom.helpers_1.4.0       tibble_3.1.6             
##  [85] bslib_0.3.1               stringi_1.7.6            
##  [87] highr_0.9                 parameters_0.13.0.1      
##  [89] commonmark_1.8.0          nloptr_1.2.2.3           
##  [91] vctrs_0.4.0               pillar_1.7.0             
##  [93] lifecycle_1.0.1           jquerylib_0.1.4          
##  [95] estimability_1.3          data.table_1.14.2        
##  [97] irlba_2.3.3               insight_0.14.5           
##  [99] lmom_2.8                  R6_2.5.1                 
## [101] renv_0.14.0               rio_0.5.26               
## [103] IRanges_2.24.1            gld_2.6.2                
## [105] codetools_0.2-18          boot_1.3-27              
## [107] MASS_7.3-53.1             assertthat_0.2.1         
## [109] rprojroot_2.0.3           withr_2.5.0              
## [111] mnormt_2.0.2              S4Vectors_0.28.1         
## [113] bayestestR_0.9.0          expm_0.999-6             
## [115] parallel_4.0.3            hms_1.1.1                
## [117] grid_4.0.3                rpart_4.1-15             
## [119] beachmat_2.6.4            timeDate_3043.102        
## [121] tidyr_1.2.0               coda_0.19-4              
## [123] class_7.3-18              minqa_1.2.4              
## [125] rmarkdown_2.11            DelayedMatrixStats_1.12.3
## [127] MatrixGenerics_1.2.1      ggpubr_0.4.0             
## [129] pROC_1.17.0.1             lubridate_1.7.10

Alburez-Gutierrez, Diego. 2020. Suffrager: A Feminist Colour Palette for r.

Arnold, Jeffrey B. 2021. Ggthemes: Extra Themes, Scales and Geoms for Ggplot2. https://github.com/jrnold/ggthemes.

Auguie, Baptiste. 2017. gridExtra: Miscellaneous Functions for "Grid" Graphics. https://CRAN.R-project.org/package=gridExtra.

Bates, Douglas, Martin Mächler, Ben Bolker, and Steve Walker. 2015. “Fitting Linear Mixed-Effects Models Using lme4.” Journal of Statistical Software 67 (1): 1–48. https://doi.org/10.18637/jss.v067.i01.

Bates, Douglas, and Martin Maechler. 2021. Matrix: Sparse and Dense Matrix Classes and Methods. http://Matrix.R-forge.R-project.org/.

Bates, Douglas, Martin Maechler, Ben Bolker, and Steven Walker. 2021. Lme4: Linear Mixed-Effects Models Using Eigen and S4. https://github.com/lme4/lme4/.

Blighe, Kevin, and Aaron Lun. 2020. PCAtools: Everything Principal Components Analysis. https://github.com/kevinblighe/PCAtools.

Breheny, Patrick, and Woodrow Burchett. 2017. “Visualization of Regression Models Using Visreg.” The R Journal 9 (2): 56–71.

———. 2020. Visreg: Visualization of Regression Models. http://pbreheny.github.io/visreg.

Fox, John, and Sanford Weisberg. 2019. An R Companion to Applied Regression. Third. Thousand Oaks CA: Sage. https://socialsciences.mcmaster.ca/jfox/Books/Companion/.

Fox, John, Sanford Weisberg, and Brad Price. 2020a. Car: Companion to Applied Regression. https://CRAN.R-project.org/package=car.

———. 2020b. carData: Companion to Applied Regression Data Sets. https://CRAN.R-project.org/package=carData.

Henry, Lionel, and Hadley Wickham. 2020. Purrr: Functional Programming Tools. https://CRAN.R-project.org/package=purrr.

Kassambara, Alboukadel, and Fabian Mundt. 2020. Factoextra: Extract and Visualize the Results of Multivariate Data Analyses. http://www.sthda.com/english/rpkgs/factoextra.

Kuhn, Max. 2020. Caret: Classification and Regression Training. https://github.com/topepo/caret/.

Lenth, Russell V. 2021. Emmeans: Estimated Marginal Means, Aka Least-Squares Means. https://github.com/rvlenth/emmeans.

Lüdecke, Daniel. 2021. sjPlot: Data Visualization for Statistics in Social Science. https://strengejacke.github.io/sjPlot/.

Müller, Kirill. 2020. Here: A Simpler Way to Find Your Files. https://CRAN.R-project.org/package=here.

Pedersen, Thomas Lin. 2020. Patchwork: The Composer of Plots. https://CRAN.R-project.org/package=patchwork.

R Core Team. 2020. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.

Revelle, William. 2020. Psych: Procedures for Psychological, Psychometric, and Personality Research. https://personality-project.org/r/psych/ https://personality-project.org/r/psych-manual.pdf.

Sarkar, Deepayan. 2008. Lattice: Multivariate Data Visualization with r. New York: Springer. http://lmdvr.r-forge.r-project.org.

———. 2020. Lattice: Trellis Graphics for r. http://lattice.r-forge.r-project.org/.

Signorell, Andri. 2021. DescTools: Tools for Descriptive Statistics. https://CRAN.R-project.org/package=DescTools.

Sjoberg, Daniel D., Michael Curry, Joseph Larmarange, Jessica Lavery, Karissa Whiting, and Emily C. Zabor. 2021. Gtsummary: Presentation-Ready Data Summary and Analytic Result Tables. https://CRAN.R-project.org/package=gtsummary.

Sjoberg, Daniel D., Karissa Whiting, Michael Curry, Jessica A. Lavery, and Joseph Larmarange. 2021. “Reproducible Summary Tables with the Gtsummary Package.” The R Journal 13: 570–80. https://doi.org/10.32614/RJ-2021-053.

Slowikowski, Kamil. 2021. Ggrepel: Automatically Position Non-Overlapping Text Labels with Ggplot2. https://github.com/slowkow/ggrepel.

Wickham, Hadley. 2016. Ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. https://ggplot2.tidyverse.org.

———. 2019. Stringr: Simple, Consistent Wrappers for Common String Operations. https://CRAN.R-project.org/package=stringr.

———. 2021. Forcats: Tools for Working with Categorical Variables (Factors). https://CRAN.R-project.org/package=forcats.

Wickham, Hadley, Winston Chang, Lionel Henry, Thomas Lin Pedersen, Kohske Takahashi, Claus Wilke, Kara Woo, Hiroaki Yutani, and Dewey Dunnington. 2021. Ggplot2: Create Elegant Data Visualisations Using the Grammar of Graphics. https://CRAN.R-project.org/package=ggplot2.

Wickham, Hadley, Romain François, Lionel Henry, and Kirill Müller. 2022. Dplyr: A Grammar of Data Manipulation. https://CRAN.R-project.org/package=dplyr.

Wilke, Claus O. 2020. Cowplot: Streamlined Plot Theme and Plot Annotations for Ggplot2. https://wilkelab.org/cowplot/.

Xie, Yihui. 2014. “Knitr: A Comprehensive Tool for Reproducible Research in R.” In Implementing Reproducible Computational Research, edited by Victoria Stodden, Friedrich Leisch, and Roger D. Peng. Chapman; Hall/CRC. http://www.crcpress.com/product/isbn/9781466561595.

———. 2015. Dynamic Documents with R and Knitr. 2nd ed. Boca Raton, Florida: Chapman; Hall/CRC. https://yihui.org/knitr/.

———. 2021. Knitr: A General-Purpose Package for Dynamic Report Generation in r. https://yihui.org/knitr/.

New full MD models of Textbook English

Elen Le Foll

09/11/2021