Appendix F — Data Analysis for the Model of Intra-Textbook Variation

Last modifications on this page

January 24, 2025

This script documents the analysis of the pre-processed data from the Textbook English Corpus (TEC) to arrive at the multi-dimensional model of intra-textbook linguistic variation (Chapter 6). It generates all of the statistics and plots included in the book, as well as many others that were used in the analysis, but not included in the book for reasons of space.

F.1 Packages required

The following packages must be installed and loaded to carry out the following analyses.

#renv::restore() # Restore the project's dependencies from the lockfile to ensure that same package versions are used as in the original study

library(caret) # For its confusion matrix function
library(cowplot)
library(DescTools) # For 95% CI
library(emmeans)
library(factoextra) # For circular graphs of variables
library(forcats) # For data manipulation
library(ggthemes) # For theme of factoextra plots
library(here) # For dynamic file paths
library(knitr) # Loaded to display the tables using the kable() function
library(lme4) # For linear regression modelling
library(patchwork) # To create figures with more than one plot
library(PCAtools) # For nice biplots of PCA results
library(psych) # For various useful stats function
library(sjPlot) # For model plots and tables
library(tidyverse) # For data wrangling
library(visreg) # For plots of interaction effects

# From https://github.com/RainCloudPlots/RainCloudPlots:
source(here("R_rainclouds.R")) # For geom_flat_violin rainplots

F.2 Preparing the data for PCA

F.2.1 TEC data import

Code

TxBcounts <- readRDS(here("data", "processed", "TxBcounts3.rds"))
# colnames(TxBcounts)
# nrow(TxBcounts)

TxBzlogcounts <- readRDS(here("data", "processed", "TxBzlogcounts.rds")) 
# nrow(TxBzlogcounts)
# colnames(TxBzlogcounts)

TxBdata <- cbind(TxBcounts[,1:6], as.data.frame(TxBzlogcounts))
# str(TxBdata)

First, the TEC data as processed in Appendix D is imported. It comprises 1,961 texts/files, each with logged standardised normalised frequencies for 66 linguistic features.

F.3 Checking the factorability of data

kmo <- KMO(TxBdata[,7:ncol(TxBdata)])

The overall MSA value of the dataset is 0.86. The features have the following individual MSA values (ordered from lowest to largest):

kmo$MSAi[order(kmo$MSAi)] |>  round(2)

  MDWO   MDWS   MDNE   MDCA    VBD   VPRT    POS    ACT   FREQ  TPP3S     LD 
  0.34   0.46   0.52   0.53   0.59   0.60   0.64   0.65   0.65   0.66   0.68 
 CAUSE   COND   MDCO   VIMP  NCOMP     DT  TPP3P   STPR     RP   SPP2 MENTAL 
  0.69   0.75   0.77   0.78   0.79   0.80   0.80   0.81   0.81   0.83   0.84 
 DOAUX   WHSC    VBG  EXIST  THATD   COMM  FPP1S     IN     NN   WHQU   JJAT 
  0.84   0.85   0.86   0.86   0.86   0.87   0.87   0.88   0.88   0.89   0.89 
  DEMO   THRC ASPECT     CC     EX  OCCUR   PEAS    TTR   YNQU    AWL   QUAN 
  0.89   0.89   0.90   0.90   0.90   0.90   0.91   0.91   0.91   0.92   0.92 
 FPP1P   PROG    XX0   CONT   TIME   BEMA  SPLIT   PASS   JJPR    AMP   QUPR 
  0.92   0.92   0.92   0.92   0.93   0.93   0.93   0.94   0.94   0.95   0.95 
  THSC     RB   FPUH    CUZ    VBN    PIT    DMA POLITE   EMPH    HDG  PLACE 
  0.95   0.95   0.95   0.95   0.95   0.96   0.96   0.96   0.96   0.96   0.97

F.3.1 Removal of feature with MSAs of < 0.5

We first remove the first feature with an individual MSA < 0.5, then check the MSA values again and continue removing features one by one if necessary.

TxBdata <- TxBdata |> 
  select(-c(MDWO))

kmo2 <- KMO(TxBdata[,7:ncol(TxBdata)])

The overall MSA value of the dataset is now 0.87. None of the remaining features have individual MSA values below 0.5:

kmo2$MSAi[order(kmo2$MSAi)] |>  round(2)

  MDWS   MDNE   MDCA    VBD    POS   VPRT   FREQ    ACT  TPP3S     LD  CAUSE 
  0.55   0.58   0.61   0.63   0.64   0.65   0.65   0.66   0.66   0.69   0.70 
  MDCO   COND     DT  TPP3P   VIMP  NCOMP     RP   STPR   SPP2  DOAUX MENTAL 
  0.77   0.80   0.80   0.80   0.81   0.81   0.81   0.82   0.83   0.84   0.85 
   VBG   WHSC  EXIST  THATD  FPP1S   COMM     IN     NN   DEMO   WHQU   THRC 
  0.85   0.85   0.86   0.87   0.87   0.87   0.88   0.89   0.89   0.89   0.89 
  JJAT ASPECT   PEAS     EX  OCCUR     CC    TTR   YNQU    AWL   QUAN  FPP1P 
  0.89   0.89   0.90   0.90   0.91   0.91   0.91   0.91   0.92   0.92   0.92 
  TIME    XX0   CONT   PROG   BEMA  SPLIT   PASS   JJPR   THSC    AMP     RB 
  0.92   0.92   0.93   0.93   0.93   0.93   0.94   0.94   0.95   0.95   0.95 
  QUPR   FPUH    PIT    VBN    DMA    CUZ POLITE   EMPH    HDG  PLACE 
  0.95   0.95   0.96   0.96   0.96   0.96   0.96   0.96   0.96   0.97

F.3.2 Choosing the number of principal components to retain

On the basis of this scree plot, six principal components were initially retained.

Code

# Plot screen plot
#png(here("plots", "screeplot-TEC-only.png"), width = 20, height= 12, units = "cm", res = 300)
scree(TxBdata[,7:ncol(TxBdata)], factors = FALSE, pc = TRUE) # Retain six components

Code

#dev.off()

# Perform PCA
pca1 <- psych::principal(TxBdata[,7:ncol(TxBdata)], 
                         nfactors = 6,
                         rotate = "none")
#pca1$loadings

F.3.3 Excluding features with low final communalites

We first check whether some feature have extremely low communalities (see https://rdrr.io/cran/FactorAssumptions/man/communalities_optimal_solution.html).

  STPR   MDNE    HDG  CAUSE   FREQ   THRC    POS   PROG    ACT   DEMO   MDWS 
  0.09   0.17   0.17   0.19   0.22   0.22   0.24   0.26   0.26   0.27   0.28 
   CUZ   COND   QUPR  EXIST   MDCO  NCOMP  OCCUR   TIME ASPECT  TPP3P    AMP 
  0.28   0.28   0.29   0.29   0.31   0.31   0.32   0.32   0.33   0.33   0.34 
    RP  THATD   THSC     EX  FPP1P  PLACE    PIT    VBG   PEAS   MDCA  DOAUX 
  0.34   0.36   0.39   0.43   0.43   0.43   0.45   0.46   0.47   0.48   0.48 
   VBN   JJPR   JJAT   WHSC  SPLIT   EMPH   QUAN MENTAL  TPP3S   PASS   YNQU 
  0.48   0.49   0.49   0.50   0.51   0.53   0.55   0.56   0.56   0.57   0.58 
POLITE     RB     CC    XX0     DT   COMM   WHQU    TTR  FPP1S     IN     LD 
  0.58   0.58   0.58   0.59   0.62   0.62   0.64   0.65   0.67   0.68   0.68 
  FPUH   VPRT   SPP2   BEMA    DMA    VBD    AWL   CONT     NN   VIMP 
  0.70   0.71   0.72   0.74   0.74   0.80   0.85   0.85   0.87   0.90

As we chose to exclude features with communalities of < 0.2, we remove STPR, HDG, MDNE and CAUSE from the dataset to be analysed.

TxBdataforPCA <- TxBdata |> 
  select(-c(STPR, MDNE, HDG, CAUSE))

The overall MSA value of the dataset is now 0.88. None of the remaining features have individual MSA values below 0.5:

kmo3$MSAi[order(kmo3$MSAi)] |>round(2)

  MDWS   MDCA    POS   FREQ    VBD  TPP3S   VPRT    ACT     LD   COND     DT 
  0.54   0.64   0.64   0.65   0.65   0.66   0.66   0.67   0.69   0.78   0.79 
  MDCO  TPP3P     RP  NCOMP   VIMP   SPP2  DOAUX MENTAL   WHSC    VBG  THATD 
  0.79   0.81   0.82   0.82   0.82   0.82   0.84   0.85   0.86   0.86   0.86 
 EXIST  FPP1S   COMM     NN     IN   WHQU   DEMO ASPECT   JJAT   THRC     EX 
  0.86   0.87   0.88   0.89   0.89   0.89   0.89   0.89   0.89   0.90   0.90 
 OCCUR   PEAS     CC   YNQU   QUAN    AWL   TIME    XX0  FPP1P    TTR   CONT 
  0.90   0.91   0.91   0.91   0.91   0.92   0.92   0.92   0.92   0.92   0.93 
  PROG   BEMA  SPLIT   PASS   JJPR   THSC     RB   QUPR    AMP   FPUH    PIT 
  0.93   0.93   0.93   0.94   0.94   0.95   0.95   0.95   0.95   0.95   0.95 
   VBN    DMA POLITE    CUZ   EMPH  PLACE 
  0.95   0.96   0.96   0.96   0.96   0.97

The final number of linguistic features entered in the intra-textbook model of linguistic variation is 61.

F.4 Testing the effect of rotating the components

This chunk was used when considering whether or not to rotate the components (see methods section). Ultimately, the components were not rotated.

# Comparing a rotated vs. a non-rotated solution

#TxBdata <- readRDS(here("data", "processed", "TxBdataforPCA.rds"))

# No rotation
pca2 <- psych::principal(TxBdata[,7:ncol(TxBdata)], 
                         nfactors = 6, 
                         rotate = "none")

pca2$loadings

biplot.psych(pca2, 
             vars = TRUE, 
             choose=c(1,2),
             )

# Promax rotation
pca2.rotated <- psych::principal(TxBdata[,7:ncol(TxBdata)], 
                         nfactors = 6, 
                         rotate = "promax")

# This summary shows the component correlations which is particularly interesting
pca2.rotated

pca2.rotated$loadings

biplot.psych(pca2.rotated, vars = TRUE, choose=c(1,2))

F.5 Principal Component Analysis (PCA)

F.5.1 Using the full dataset

Except outliers removed as part of the data preparation (see Appendix D).

# Perform PCA on full data
TxBdata <- readRDS(here("data", "processed", "TxBdataforPCA.rds"))

F.5.2 Using random subsets of the data

Alternatively, it is possible to conduct the PCA on random subsets of the data to test the stability of the solution. Re-running this line will generate a new subset of the TEC texts containing 2/3 of the texts randomly sampled.

TxBdata <- readRDS(here("data", "processed", "TxBdataforPCA.rds")) |>
  slice_sample(n = round(1961*0.6), replace = FALSE)

nrow(TxBdata)
TxBdata$Filename[1:10]
nrow(TxBdata) / (ncol(TxBdata)-6) # Check that there is enough data to conduct a PCA. This ratio should be at least 5 (see Friginal & Hardy 2014: 303–304).

F.5.3 Using specific subsets of the data

The following chunk can be used to perform the PCA on a country subset of the data to test the stability of the solution. See (Le Foll) for a detailed analysis of the subcorpus of textbooks used in Germany.

TxBdata <- readRDS(here("data", "processed", "TxBdataforPCA.rds")) |>
  #filter(Country == "France")
  #filter(Country == "Germany")
  filter(Country == "Spain")

nrow(TxBdata)
TxBdata$Filename[1:10] # Check data
nrow(TxBdata) / (ncol(TxBdata)-6) # Check that there is enough data to conduct a PCA. This should be > 5 (see Friginal & Hardy 2014: 303–304).

F.5.4 Performing the PCA

We perform the PCA using the prcomp function and print a summary of the results.

Code

pca <- prcomp(TxBdata[,7:ncol(TxBdata)], scale.=FALSE, rank. = 6) # All quantitative variables for all TxB files except outliers
register  <- factor(TxBdata[,"Register"]) # Register
level <- factor(TxBdata[,"Level"]) # Textbook proficiency level

# summary(register)
# summary(level)
summary(pca)

Importance of first k=6 (out of 61) components:
                          PC1    PC2     PC3     PC4     PC5     PC6
Standard deviation     2.1693 1.7776 1.08902 1.00207 0.84288 0.76792
Proportion of Variance 0.2108 0.1416 0.05313 0.04499 0.03183 0.02642
Cumulative Proportion  0.2108 0.3524 0.40553 0.45051 0.48234 0.50876

F.6 Plotting PCA results

F.6.1 3D plots

The following chunk can be used to create projections of TEC texts on three dimensions of the model. These plots cannot be rendered in two dimensions and are therefore not generated in the present document. For more information on the pca3d library, see: https://cran.r-project.org/web/packages/pca3d/vignettes/pca3d.pdf.

library(pca3d) # For 3-D plots

col <- palette[c(1:3,8,7)] # without poetry
names(col) <- c("Conversation", "Fiction", "Informative", "Instructional", "Personal")
scales::show_col(col) # Check colours

pca3d(pca, 
      group = register,
      components = 1:3,
      #components = 4:6,
      show.ellipses=FALSE, 
      ellipse.ci=0.75,
      show.plane=FALSE,
      col = col,
      shape = "sphere",
      radius = 1,
      legend = "right")

snapshotPCA3d(here("plots", "PCA_TxB_3Dsnapshot.png"))

names(col) <- c("C", "B", "E", "A", "D") # To colour the dots according to the profiency level of the textbooks
pca3d(pca, 
      components = 4:6,
      group = level,
      show.ellipses=FALSE, 
      ellipse.ci=0.75,
      show.plane=FALSE,
      col = col,
      shape = "sphere",
      radius = 0.8,
      legend = "right")

F.7 Two-dimensional plots (biplots)

These plots were generated using the PCAtools package, which requires the data to be formatted in a rather unconventional way so it needs to wrangled first.

F.7.1 Data wrangling for PCAtools

Code

#TxBdata <- readRDS(here("data", "processed", "TxBdataforPCA.rds"))

TxBdata2meta <- TxBdata[,1:6]
rownames(TxBdata2meta) <- TxBdata2meta$Filename
TxBdata2meta <- TxBdata2meta |> select(-Filename)
#head(TxBdata2meta)

TxBdata2 = TxBdata
rownames(TxBdata2) <- TxBdata2$Filename
TxBdata2num <- as.data.frame(base::t(TxBdata2[,7:ncol(TxBdata2)]))
#TxBdata2num[1:12,1:3] # Check sanity of data

p <- PCAtools::pca(TxBdata2num, 
         metadata = TxBdata2meta,
         scale = FALSE)

F.7.2 Pairs plot

We first produce a scatterplot matrix of all the combinations of the first six dimensions of the model of intra-textbook variation. Note that the number before the comma on each axis label shows which principal component is plotted on that axis; this is followed by the percentage of the total variance explained by that particular component. The colours correspond to the text registers.

Code

## Colour and shape scheme for all biplots
colkey = c(Conversation="#BD241E", Fiction="#A18A33", Informative="#15274D", Instructional="#F9B921", Personal="#722672")
shapekey = c(A=1, B=2, C=6, D=0, E=5)

## Very slow, open in zoomed out window!
# Add legend manually? Yes (take it from the biplot code below), sadly really the simplest solution, here. Or use Evert's mvar.pairs plot function (though that also requires manual axis annotation).

# png(here("plots", "PCA_TxB_pairsplot.png"), width = 12, height= 19, units = "cm", res = 300)
PCAtools::pairsplot(p,
                 triangle = FALSE,
                 components = 1:6,
                 ncol = 3,
                 nrow = 5,
                 pointSize = 0.8,
                 lab = NULL, # Otherwise will try to label each data point!
                 colby = "Register",
                 colkey = colkey,
                 shape = "Level",
                 shapekey = shapekey,
                 margingaps = unit(c(0.2, 0.2, 0.2, 0.2), "cm"),
                 legendPosition = "none")

F.7.3 Bi-plots

Then, biplots of the most important dimensions are generated to examine components more carefully.

The following biplot displays the distribution of the texts of the TEC along the first and second dimensions of the intra-textbook model.

Code

colkey = c(Conversation="#BD241E", Fiction="#A18A33", Informative="#15274D", Instructional="#F9B921", Personal="#722672")
shapekey = c(A=1, B=2, C=6, D=0, E=5)

#png(here("plots", "PCA_TxB_Biplot_PC1_PC2.png"), width = 40, height= 25, units = "cm", res = 300)
PCAtools::biplot(p,
                 x = "PC1",
                 y = "PC2",
                 lab = NULL, # Otherwise will try to label each data point!
                 #xlim = c(min(p$rotated$PC1)-0.5, max(p$rotated$PC1)+0.5),
                 #ylim = c(min(p$rotated$PC2)-0.5, max(p$rotated$PC2)+0.5),
                 colby = "Register",
                 pointSize = 2,
                 colkey = colkey,
                 shape = "Level",
                 shapekey = shapekey,
                 showLoadings = FALSE,
                 ellipse = TRUE,
                 axisLabSize = 22,
                 legendPosition = 'right',
                 legendTitleSize = 22,
                 legendLabSize = 18, 
                 legendIconSize = 7) +
  theme(plot.margin = unit(c(0,0,0,0.2), "cm"))

Code

#dev.off()
#ggsave(here("plots", "PCA_TxB_BiplotPC1_PC2.svg"), width = 12, height = 10)

The following code chunk can be used for the interactive examination of individual data points (i.e., texts).

library(plotly)

plot <- PCAtools::biplot(p,
                 x = "PC1",
                 y = "PC2",
                 lab = NULL, # Otherwise will try to label each data point!
                 colby = "Register",
                 pointSize = 2,
                 colkey = colkey,
                 shape = "Level",
                 shapekey = shapekey,
                 showLoadings = FALSE,
                 ellipse = TRUE,
                 axisLabSize = 22,
                 legendPosition = 'right',
                 legendTitleSize = 22,
                 legendLabSize = 18, 
                 legendIconSize = 7) +
  theme(plot.margin = unit(c(0,0,0,0.2), "cm"))

ggplotly(plot) # tooltip

###

For the fourth to sixth dimensions of the model, it is worth comparing the effet of text register with that of textbook proficiency level. Hence, the following two chunks generate biplots with colour and ellipses representing a) text register (as above) and b) textbook level (A = first year of EFL tuition at secondary school, E = fifth year).

Code

pRegisters <- PCAtools::biplot(p,
                 x = "PC3",
                 y = "PC4",
                 lab = NULL, # Otherwise will try to label each data point!
                 colby = "Register",
                 pointSize = 2,
                 colkey = colkey,
                 shape = "Level",
                 shapekey = shapekey,
                 showLoadings = FALSE,
                 ellipse = TRUE,
                 legendPosition = 'right',
                 legendTitleSize = 22,
                 legendLabSize = 18, 
                 legendIconSize = 7) +
  theme(plot.margin = unit(c(0,0,0,0.2), "cm"))

#ggsave(here("plots", "PCA_TxB_BiplotPC3_PC4.svg"), width = 12, height = 10)

pRegisters2 <- PCAtools::biplot(p,
                 x = "PC5",
                 y = "PC6",
                 lab = NULL, # Otherwise will try to label each data point!
                 colby = "Register",
                 pointSize = 2,
                 colkey = colkey,
                 shape = "Level",
                 shapekey = shapekey,
                 showLoadings = FALSE,
                 ellipse = TRUE,
                 legendPosition = 'right',
                 legendTitleSize = 22,
                 legendLabSize = 18, 
                 legendIconSize = 7) +
  theme(plot.margin = unit(c(0,0,0,0.2), "cm"))

#ggsave(here("plots", "PCA_TxB_BiplotPC5_PC6.svg"), width = 12, height = 10)

Code

# Inverted keys for the biplots with ellipses for Level rather than Register
colkeyLevels = c(A="#F9B921", B="#A18A33", C="#BD241E", D="#722672", E="#15274D")
shapekeyLevels = c(Conversation=1, Fiction=2, Informative=6, Instructional=0, Personal=5)

pLevels <- PCAtools::biplot(p,
                 x = "PC3",
                 y = "PC4",
                 lab = NULL, # Otherwise will try to label each data point!
                 #xlim = c(min(p$rotated$PC1)-0.5, max(p$rotated$PC1)+0.5),
                 #ylim = c(min(p$rotated$PC2)-0.5, max(p$rotated$PC2)+0.5),
                 colby = "Level",
                 pointSize = 2,
                 colkey = colkeyLevels,
                 shape = "Register",
                 shapekey = shapekeyLevels,
                 showLoadings = FALSE,
                 ellipse = TRUE,
                 legendPosition = 'right',
                 legendTitleSize = 22,
                 legendLabSize = 18, 
                 legendIconSize = 7) +
  theme(plot.margin = unit(c(0,0,0,0.2), "cm"))
#ggsave(here("plots", "PCA_TxB_BiplotPC3_PC4_Level.svg"), width = 12, height = 10)

pLevels2 <- PCAtools::biplot(p,
                 x = "PC5",
                 y = "PC6",
                 lab = NULL, # Otherwise will try to label each data point!
                 #xlim = c(min(p$rotated$PC1)-0.5, max(p$rotated$PC1)+0.5),
                 #ylim = c(min(p$rotated$PC2)-0.5, max(p$rotated$PC2)+0.5),
                 colby = "Level",
                 pointSize = 2,
                 colkey = colkeyLevels,
                 shape = "Register",
                 shapekey = shapekeyLevels,
                 showLoadings = FALSE,
                 ellipse = TRUE,
                 legendPosition = 'right',
                 legendTitleSize = 22,
                 legendLabSize = 18, 
                 legendIconSize = 7) +
  theme(plot.margin = unit(c(0,0,0,0.2), "cm"))
#ggsave(here("plots", "PCA_TxB_BiplotPC5_PC6_Level.svg"), width = 12, height = 10)

The following biplots display the distribution of the texts of the TEC along the third and fourth dimensions of the intra-textbook model.

The following biplots display the distribution of the texts of the TEC along the fifth and sixth dimensions of the intra-textbook model.

F.8 Feature contributions (loadings) on each component

Code

#TxBdata <- readRDS(here("data", "processed", "TxBdataforPCA.rds"))

pca <- prcomp(TxBdata[,7:ncol(TxBdata)], scale.=FALSE) # All quantitative variables for all TEC files

# The rotated data that represents the observations / samples is stored in rotated, while the variable loadings are stored in loadings
loadings <- as.data.frame(pca$rotation[,1:4])
loadings |> 
  round(2) |> 
  kable()

	PC1	PC2	PC3	PC4
ACT	0.08	-0.11	0.04	-0.10
AMP	-0.12	-0.10	-0.11	0.01
ASPECT	0.10	-0.05	0.14	-0.01
AWL	0.22	-0.16	-0.12	-0.13
BEMA	-0.22	0.01	-0.21	0.02
CC	0.05	-0.21	-0.19	0.00
COMM	0.20	0.09	0.14	-0.04
COND	-0.01	-0.02	0.11	-0.24
CONT	-0.25	0.11	-0.03	-0.06
CUZ	-0.09	-0.13	-0.06	-0.02
DEMO	-0.12	0.08	0.03	-0.09
DMA	-0.20	0.14	-0.02	0.00
DOAUX	-0.01	0.20	0.05	-0.15
DT	0.12	0.00	0.31	-0.02
EMPH	-0.19	-0.02	0.06	-0.14
EX	-0.10	-0.05	-0.11	0.05
EXIST	-0.02	-0.15	-0.09	-0.09
FPP1P	-0.17	0.01	-0.07	0.00
FPP1S	-0.23	0.07	0.08	-0.01
FPUH	-0.16	0.15	-0.09	0.07
FREQ	-0.03	-0.05	0.01	-0.10
IN	0.17	-0.18	0.02	-0.08
JJAT	-0.06	-0.18	0.04	-0.21
JJPR	-0.17	-0.06	-0.11	-0.11
LD	0.16	-0.03	-0.26	-0.01
MDCA	-0.04	0.10	-0.18	-0.09
MDCO	-0.05	-0.10	0.22	0.01
MDWS	-0.07	-0.01	0.05	-0.16
MENTAL	0.14	0.13	0.12	-0.25
NCOMP	0.04	-0.05	-0.24	-0.15
NN	0.20	-0.09	-0.29	0.11
OCCUR	0.02	-0.18	0.03	0.02
PASS	-0.01	-0.22	-0.06	-0.05
PEAS	-0.06	-0.17	0.13	-0.13
PIT	-0.19	-0.04	-0.06	-0.06
PLACE	-0.16	-0.01	-0.07	0.09
POLITE	-0.14	0.13	-0.07	0.02
POS	-0.01	0.03	-0.04	0.16
PROG	-0.11	-0.02	0.11	0.00
QUAN	-0.15	-0.03	0.12	-0.19
QUPR	-0.10	-0.05	0.16	-0.11
RB	-0.19	-0.08	0.20	0.00
RP	0.00	-0.09	0.14	0.02
SPLIT	-0.11	-0.18	0.02	-0.16
SPP2	0.10	0.22	-0.01	-0.25
THATD	-0.05	0.04	0.16	-0.24
THRC	0.02	-0.11	-0.02	-0.18
THSC	-0.06	-0.17	0.07	-0.14
TIME	-0.12	-0.08	-0.01	0.06
TPP3P	-0.01	-0.16	-0.09	-0.02
TPP3S	-0.06	-0.11	0.13	0.30
TTR	-0.04	-0.26	-0.05	-0.01
VBD	-0.08	-0.20	0.23	0.30
VBG	0.04	-0.18	0.00	-0.22
VBN	0.03	-0.18	-0.07	-0.04
VIMP	0.25	0.15	0.04	-0.08
VPRT	-0.15	0.05	-0.32	-0.22
WHQU	0.11	0.23	0.00	-0.09
WHSC	0.11	-0.11	0.03	-0.15
XX0	-0.22	0.03	0.06	-0.06
YNQU	-0.03	0.23	0.00	-0.08

We can go back to the normalised frequencies of the individual features to compare them across different registers and levels, e.g.:

TxBcounts |> 
  group_by(Register, Level) |> 
  summarise(median(NCOMP), MAD(NCOMP)) |> 
  select(1:4) |> 
  kable(digits=2)

Register	Level	median(NCOMP)	MAD(NCOMP)
Conversation	A	5.69	2.79
Conversation	B	5.48	2.66
Conversation	C	5.32	2.58
Conversation	D	6.18	2.91
Conversation	E	6.21	2.62
Fiction	A	4.14	2.34
Fiction	B	3.96	2.17
Fiction	C	4.05	1.86
Fiction	D	5.05	2.34
Fiction	E	5.05	2.16
Informative	A	8.07	2.48
Informative	B	7.62	2.40
Informative	C	7.49	3.16
Informative	D	7.56	2.46
Informative	E	8.77	2.45
Instructional	A	6.84	2.54
Instructional	B	6.80	2.65
Instructional	C	6.14	2.35
Instructional	D	6.22	2.29
Instructional	E	6.75	2.69
Personal	A	6.72	1.42
Personal	B	4.92	2.33
Personal	C	5.75	1.45
Personal	D	6.46	3.19
Personal	E	8.22	3.09

Graphs of features display the features with the strongest contributions to any two dimensions of the model of intra-textbook variation. They are created using the factoextra::fviz_pca_var function.

Code

factoextra::fviz_pca_var(pca,
             axes = c(1,2),
             select.var = list(cos2 = 0.1),
             col.var = "contrib", # Colour by contributions to the PC
             gradient.cols = c("#F9B921", "#DB241E", "#722672"),
             title = "",
             repel = TRUE, # Try to avoid too much text overlapping
             ggtheme = ggthemes::theme_few())

Code

#ggsave(here("plots", "fviz_pca_var_PC1_PC2.svg"), width = 11, height = 9)

factoextra::fviz_pca_var(pca,
             axes = c(3,2),
             select.var = list(contrib = 30),
             col.var = "contrib", # Colour by contributions to the PC
             gradient.cols = c("#F9B921", "#DB241E", "#722672"),
             title = "",
             repel = TRUE, # Try to avoid too much text overlapping
             ggtheme = ggthemes::theme_few())

Code

#ggsave(here("plots", "fviz_pca_var_PC3_PC2.svg"), width = 9, height = 8)

factoextra::fviz_pca_var(pca,
             axes = c(3,4),
             select.var = list(contrib = 30),
             col.var = "contrib", # Colour by contributions to the PC
             gradient.cols = c("#F9B921", "#DB241E", "#722672"),
             title = "",
             repel = TRUE, # Try to avoid too much text overlapping
             ggtheme = ggthemes::theme_few())

Code

#ggsave(here("plots", "fviz_pca_var_PC3_PC4.svg"), width = 9, height = 8)

F.9 Exploring the dimensions of the model

We begin with some descriptive statistics of the dimension scores.

Code

# http://www.sthda.com/english/articles/31-principal-component-methods-in-r-practical-guide/118-principal-component-analysis-in-r-prcomp-vs-princomp/#pca-results-for-variables

#TxBdata <- readRDS(here("data", "processed", "TxBdataforPCA.rds"))

pca <- prcomp(TxBdata[,7:ncol(TxBdata)], scale.=FALSE) # All quantitative variables for all TxB files
register  <- factor(TxBdata[,"Register"]) # Register
level <- factor(TxBdata[,"Level"]) # Textbook proficiency level

# summary(register)
# summary(level)
# summary(pca)

## Access to the PCA results for individual PC
#pca$rotation[,1]

res.ind <- cbind(TxBdata[,1:5], as.data.frame(pca$x)[,1:6])

res.ind |> 
  group_by(Register) |> 
  summarise_if(is.numeric, mean) |> 
  kable(digits = 2)

Register	PC1	PC2	PC3	PC4	PC5	PC6
Conversation	-2.29	0.93	-0.14	-0.27	-0.02	0.06
Fiction	-0.85	-0.81	1.02	1.09	0.11	-0.10
Informative	0.06	-2.45	-0.83	0.01	-0.08	0.12
Instructional	2.68	0.93	0.15	-0.24	0.01	-0.07
Personal	-1.92	-0.29	-0.05	-0.02	0.07	-0.09

Code

res.ind |> 
  group_by(Register, Level) |> 
  summarise_if(is.numeric, mean) |> 
  kable(digits = 2)

Register	Level	PC1	PC2	PC3	PC4	PC5	PC6
Conversation	A	-2.39	2.39	-1.23	0.71	-0.45	-0.01
Conversation	B	-2.54	1.72	-0.25	0.04	-0.14	0.13
Conversation	C	-2.25	0.70	0.18	-0.41	0.09	-0.02
Conversation	D	-2.10	-0.08	0.28	-0.73	0.17	0.09
Conversation	E	-2.13	-0.14	0.07	-0.98	0.16	0.17
Fiction	A	-0.95	0.85	-0.54	1.48	-0.31	-0.46
Fiction	B	-0.89	-0.14	0.95	1.78	-0.06	-0.03
Fiction	C	-0.98	-0.81	1.62	1.23	0.26	-0.16
Fiction	D	-0.71	-1.57	1.27	0.72	0.21	-0.01
Fiction	E	-0.80	-1.45	1.16	0.56	0.25	-0.01
Informative	A	-0.09	-1.11	-1.94	0.87	-0.88	-0.15
Informative	B	0.15	-1.67	-1.19	0.46	-0.38	0.13
Informative	C	-0.02	-2.37	-0.68	-0.03	-0.06	-0.01
Informative	D	0.06	-2.89	-0.45	-0.19	0.06	0.10
Informative	E	0.15	-3.13	-0.79	-0.38	0.30	0.43
Instructional	A	2.89	1.55	-0.20	0.46	-0.34	-0.24
Instructional	B	2.68	1.27	0.09	0.00	-0.12	-0.12
Instructional	C	2.59	0.99	0.28	-0.32	-0.07	0.01
Instructional	D	2.63	0.70	0.28	-0.49	0.12	0.07
Instructional	E	2.64	0.09	0.20	-0.80	0.49	-0.16
Personal	A	-1.84	0.53	-1.11	1.21	-0.31	0.12
Personal	B	-1.85	0.40	-0.58	0.59	0.21	-0.07
Personal	C	-2.05	-0.46	0.52	-0.17	0.06	-0.03
Personal	D	-1.89	-1.05	0.45	-0.63	0.39	-0.06
Personal	E	-1.96	-0.92	0.21	-1.10	-0.19	-0.45

The following chunk can be used to search for example texts that are located in specific areas of the biplots. For example, we can search for texts that have high scores on Dim3 and low ones on Dim2 to proceed with a qualitative comparison and analysis of these texts.

res.ind |> 
  filter(PC3 > 2.5 & PC2 < -2) |> 
  select(Filename, PC2, PC3) |> 
  kable(digits = 2)

Filename	PC2	PC3
Achievers_B1_plus_Narrative_0005.txt	-3.88	2.60
Solutions_Intermediate_Plus_Spoken_0018.txt	-2.08	2.56
JTT_3_Narrative_0005.txt	-2.85	2.76
Achievers_B2_Narrative_00031.txt	-2.61	2.59
Access_4_Narrative_0006.txt	-2.19	3.18

F.10 Computing mixed-effects models of the dimension scores

F.10.1 Dimension 1: ‘Overt instructions and explanations’

Having compared various models, the following model is chosen as the best-fitting one.

# Models with Textbook series as random intercepts
md1 <- lmer(PC1 ~ Register*Level + (1|Series), data = res.ind, REML = FALSE)
md1Register <- lmer(PC1 ~ Register + (1|Series), data = res.ind, REML = FALSE)
md1Level <- lmer(PC1 ~ Level + (1|Series), data = res.ind, REML = FALSE)

anova(md1, md1Register, md1Level)

Data: res.ind
Models:
md1Register: PC1 ~ Register + (1 | Series)
md1Level: PC1 ~ Level + (1 | Series)
md1: PC1 ~ Register * Level + (1 | Series)
            npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)    
md1Register    7 4080.4 4119.4 -2033.2   4066.4                         
md1Level       7 8533.0 8572.0 -4259.5   8519.0    0.0  0               
md1           27 4068.3 4219.0 -2007.2   4014.3 4504.6 20  < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

tab_model(md1, wrap.labels = 300) # Marginal R2 = 0.890

	PC 1
Predictors	Estimates	CI	p
(Intercept)	-2.37	-2.59 – -2.15	<0.001
Register [Fiction]	1.61	1.36 – 1.87	<0.001
Register [Informative]	2.23	1.96 – 2.50	<0.001
Register [Instructional]	5.29	5.10 – 5.47	<0.001
Register [Personal]	0.48	0.08 – 0.88	0.019
Level [B]	-0.12	-0.30 – 0.05	0.167
Level [C]	0.12	-0.05 – 0.29	0.159
Level [D]	0.23	0.06 – 0.41	0.010
Level [E]	0.27	0.07 – 0.48	0.010
Register [Fiction] × Level [B]	0.18	-0.15 – 0.51	0.284
Register [Informative] × Level [B]	0.36	0.02 – 0.70	0.038
Register [Instructional] × Level [B]	-0.10	-0.35 – 0.15	0.434
Register [Personal] × Level [B]	0.11	-0.39 – 0.61	0.671
Register [Fiction] × Level [C]	-0.25	-0.58 – 0.07	0.130
Register [Informative] × Level [C]	-0.00	-0.32 – 0.32	0.993
Register [Instructional] × Level [C]	-0.39	-0.62 – -0.15	0.001
Register [Personal] × Level [C]	-0.22	-0.72 – 0.28	0.381
Register [Fiction] × Level [D]	-0.05	-0.38 – 0.27	0.739
Register [Informative] × Level [D]	-0.01	-0.33 – 0.31	0.946
Register [Instructional] × Level [D]	-0.47	-0.72 – -0.23	<0.001
Register [Personal] × Level [D]	-0.07	-0.60 – 0.46	0.800
Register [Fiction] × Level [E]	-0.24	-0.58 – 0.10	0.173
Register [Informative] × Level [E]	0.06	-0.29 – 0.40	0.747
Register [Instructional] × Level [E]	-0.50	-0.77 – -0.22	<0.001
Register [Personal] × Level [E]	-0.18	-0.74 – 0.38	0.527
Random Effects
σ²	0.45
τ₀₀_Series	0.07
ICC	0.14
N _Series	9
Observations	1961
Marginal R² / Conditional R²	0.890 / 0.906

Its estimated coefficients are visualised in the plot below.

Code

# Plot of fixed effects:
plot_model(md1Register, 
           type = "est",
           show.intercept = TRUE,
           show.values=TRUE, 
           show.p=TRUE,
           value.offset = .4,
           value.size = 3.5,
           colors = palette[c(1:3,8,7)],
           group.terms = c(1:5), 
           title = "",
           wrap.labels = 40,
           axis.title = "PC1 estimated coefficients") +
  theme_sjplot2()

Code

#ggsave(here("plots", "TxB_PCA1_lmer_fixedeffects_Register.svg"), height = 3, width = 8)

The emmeans and pairs functions are used to compare the estimated Dim1 scores for each register and to compare these to one another.

Code

Register_results <- emmeans(md1Register, "Register")
summary(Register_results)

 Register       emmean    SE   df lower.CL upper.CL
 Conversation  -2.2793 0.102 11.6   -2.502   -2.056
 Fiction       -0.7267 0.106 13.9   -0.955   -0.498
 Informative    0.0603 0.104 12.7   -0.165    0.286
 Instructional  2.7141 0.101 11.3    2.492    2.937
 Personal      -1.8734 0.122 25.5   -2.125   -1.622

Degrees-of-freedom method: kenward-roger 
Confidence level used: 0.95

Code

comparisons <- pairs(Register_results, adjust = "tukey")
comparisons

 contrast                     estimate     SE   df  t.ratio p.value
 Conversation - Fiction         -1.553 0.0508 1963  -30.535  <.0001
 Conversation - Informative     -2.340 0.0465 1961  -50.341  <.0001
 Conversation - Instructional   -4.993 0.0399 1961 -125.141  <.0001
 Conversation - Personal        -0.406 0.0791 1958   -5.134  <.0001
 Fiction - Informative          -0.787 0.0557 1962  -14.135  <.0001
 Fiction - Instructional        -3.441 0.0497 1962  -69.168  <.0001
 Fiction - Personal              1.147 0.0840 1958   13.645  <.0001
 Informative - Instructional    -2.654 0.0447 1957  -59.399  <.0001
 Informative - Personal          1.934 0.0816 1957   23.692  <.0001
 Instructional - Personal        4.587 0.0780 1957   58.820  <.0001

Degrees-of-freedom method: kenward-roger 
P value adjustment: tukey method for comparing a family of 5 estimates

Code

#write_last_clip()
confint(comparisons)

 contrast                     estimate     SE   df lower.CL upper.CL
 Conversation - Fiction         -1.553 0.0508 1963   -1.691   -1.414
 Conversation - Informative     -2.340 0.0465 1961   -2.466   -2.213
 Conversation - Instructional   -4.993 0.0399 1961   -5.102   -4.884
 Conversation - Personal        -0.406 0.0791 1958   -0.622   -0.190
 Fiction - Informative          -0.787 0.0557 1962   -0.939   -0.635
 Fiction - Instructional        -3.441 0.0497 1962   -3.577   -3.305
 Fiction - Personal              1.147 0.0840 1958    0.917    1.376
 Informative - Instructional    -2.654 0.0447 1957   -2.776   -2.532
 Informative - Personal          1.934 0.0816 1957    1.711    2.156
 Instructional - Personal        4.587 0.0780 1957    4.374    4.800

Degrees-of-freedom method: kenward-roger 
Confidence level used: 0.95 
Conf-level adjustment: tukey method for comparing a family of 5 estimates

Code

#write_last_clip()

We can also visualise the estimated coefficients for the textbook series, which is modelled here as a random effect.

Code

plot_model(md1, 
           type = "re", # Option to visualise random effects
           show.values=TRUE, 
           show.p=TRUE,
           value.offset = .4,
           value.size = 3.5,
           colors = "bw",
           wrap.labels = 40,
           axis.title = "PC1 estimated coefficients") +
  theme_sjplot2()

Code

#ggsave(here("plots", "TxB_PCA1_lmer_randomeffects.svg"), height = 3, width = 8)

F.10.2 Dimension 2: ‘Involved vs. Informational Production’

Code

md2 <- lmer(PC2 ~ Register*Level + (1|Series), data = res.ind, REML = FALSE)
md2Register <- lmer(PC2 ~ Register + (1|Series), data = res.ind, REML = FALSE)
md2Level <- lmer(PC2 ~ Level + (1|Series), data = res.ind, REML = FALSE)
anova(md2, md2Register, md2Level)

Data: res.ind
Models:
md2Register: PC2 ~ Register + (1 | Series)
md2Level: PC2 ~ Level + (1 | Series)
md2: PC2 ~ Register * Level + (1 | Series)
            npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)    
md2Register    7 6155.2 6194.3 -3070.6   6141.2                         
md2Level       7 7290.1 7329.2 -3638.1   7276.1    0.0  0               
md2           27 5200.9 5351.6 -2573.4   5146.9 2129.2 20  < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Code

tab_model(md2) # Marginal R2 = 0.723

	PC 2
Predictors	Estimates	CI	p
(Intercept)	2.46	2.20 – 2.73	<0.001
Register [Fiction]	-1.80	-2.14 – -1.46	<0.001
Register [Informative]	-3.44	-3.79 – -3.08	<0.001
Register [Instructional]	-0.87	-1.12 – -0.62	<0.001
Register [Personal]	-1.89	-2.42 – -1.35	<0.001
Level [B]	-0.72	-0.96 – -0.49	<0.001
Level [C]	-1.71	-1.94 – -1.49	<0.001
Level [D]	-2.37	-2.61 – -2.14	<0.001
Level [E]	-2.65	-2.93 – -2.37	<0.001
Register [Fiction] × Level [B]	-0.28	-0.72 – 0.17	0.223
Register [Informative] × Level [B]	0.13	-0.32 – 0.58	0.565
Register [Instructional] × Level [B]	0.44	0.11 – 0.77	0.009
Register [Personal] × Level [B]	0.53	-0.14 – 1.20	0.120
Register [Fiction] × Level [C]	0.08	-0.35 – 0.52	0.705
Register [Informative] × Level [C]	0.42	-0.01 – 0.84	0.053
Register [Instructional] × Level [C]	1.11	0.80 – 1.43	<0.001
Register [Personal] × Level [C]	0.59	-0.07 – 1.26	0.081
Register [Fiction] × Level [D]	-0.01	-0.44 – 0.42	0.972
Register [Informative] × Level [D]	0.49	0.07 – 0.91	0.024
Register [Instructional] × Level [D]	1.49	1.16 – 1.81	<0.001
Register [Personal] × Level [D]	0.59	-0.12 – 1.30	0.104
Register [Fiction] × Level [E]	0.42	-0.04 – 0.87	0.072
Register [Informative] × Level [E]	0.39	-0.07 – 0.86	0.099
Register [Instructional] × Level [E]	1.05	0.68 – 1.42	<0.001
Register [Personal] × Level [E]	1.01	0.27 – 1.76	0.008
Random Effects
σ²	0.80
τ₀₀_Series	0.09
ICC	0.10
N _Series	9
Observations	1961
Marginal R² / Conditional R²	0.723 / 0.752

Code

# tab_model(md2Register) # Marginal R2 = 0.558
# tab_model(md2Level) # Marginal R2 = 0.228

Estimated coefficients of fixed effects on Dim2 scores:

Code

plot_model(md2, 
           type = "est",
           show.intercept = TRUE,
           show.values=TRUE, 
           show.p=TRUE,
           value.offset = .4,
           value.size = 3.5,
           colors = palette[c(1:3,8,7)],
           group.terms = c(1:5,1,1,1,1,2:5,2:5,2:5,2:5), 
           title = "",
           wrap.labels = 40,
           axis.title = "PC2 estimated coefficients") +
  theme_sjplot2()

Code

#ggsave(here("plots", "TxB_PCA2_lmer_fixedeffects.svg"), height = 8, width = 8)

Estimated coefficients of random effects on Dim2 scores:

Code

## Random intercepts
plot_model(md2, 
           type = "re", # Option to visualise random effects
           show.values=TRUE, 
           show.p=TRUE,
           value.offset = .4,
           value.size = 3.5,
           colors = "bw",
           wrap.labels = 40,
           axis.title = "PC2 estimated coefficients") +
  theme_sjplot2()

Code

#ggsave(here("plots", "TxB_PCA2_lmer_randomeffects.svg"), height = 3, width = 8)

# Textbook Country as a random effect variable
md2country <- lmer(PC2 ~ Register*Level + (1|Country), data = res.ind, REML = FALSE)

plot_model(md2country, 
           type = "re", # Option to visualise random effects
           show.values=TRUE, 
           show.p=TRUE,
           value.offset = .4,
           value.size = 3.5,
           colors = "bw",
           wrap.labels = 40,
           axis.title = "PC2 estimated coefficients") +
  theme_sjplot2()

Code

#ggsave(here("plots", "TxB_PCA2_lmer_randomeffects_country.svg"), height = 3, width = 8)

The visreg function is used to visualise the distributions of the modelled Dim2 scores.

Code

# svg(here("plots", "TxB_predicted_PC2_scores_interactions.svg"), height = 5, width = 8)
visreg(md2, xvar = "Level", by="Register", type = "conditional",
       line=list(col="darkred"), 
       xlab = "Textbook Level", ylab = "PC2"
       #,gg = TRUE
       ,layout=c(5,1)
)

Code

#dev.off()

We also examine potential interactions between textbook series, proficiency level, and Dim2 scores.

Code

visreg(md2, xvar = "Series", by="Level", type = "conditional", re.form=~(1|Series), 
       line=list(col="darkred"), xlab = "Textbook Series", ylab = "PC2",
       layout=c(1,5))

Finally, we examine potential interactions between country of use of the textbook, register, and Dim2 scores.

Code

# svg(here("plots", "TxB_PCA2_lmer_randomeffects_country_register.svg"), height = 5, width = 8)
visreg::visreg(md2country, "Country", by="Register", re.form=~(1|Country),
               ylab="PC2", line=list(col="darkred"))

Code

# dev.off()

F.10.3 Dimension 3: ‘Narrative vs. factual discourse’

Code

md3 <- lmer(PC3 ~ Register*Level + (1|Series), data = res.ind, REML = FALSE)
md3Register <- lmer(PC3 ~ Register + (1|Series), data = res.ind, REML = FALSE)
md3Level <- lmer(PC3 ~ Level + (1|Series), data = res.ind, REML = FALSE)

anova(md3, md3Register, md3Level)

Data: res.ind
Models:
md3Register: PC3 ~ Register + (1 | Series)
md3Level: PC3 ~ Level + (1 | Series)
md3: PC3 ~ Register * Level + (1 | Series)
            npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)    
md3Register    7 5139.9 5179.0 -2563.0   5125.9                         
md3Level       7 5528.8 5567.9 -2757.4   5514.8   0.00  0               
md3           27 4582.6 4733.3 -2264.3   4528.6 986.21 20  < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Code

tab_model(md3) # Marginal R2 = 0.436

	PC 3
Predictors	Estimates	CI	p
(Intercept)	-1.33	-1.62 – -1.04	<0.001
Register [Fiction]	0.79	0.50 – 1.08	<0.001
Register [Informative]	-0.79	-1.10 – -0.49	<0.001
Register [Instructional]	1.03	0.82 – 1.24	<0.001
Register [Personal]	0.19	-0.27 – 0.65	0.411
Level [B]	0.97	0.77 – 1.17	<0.001
Level [C]	1.39	1.20 – 1.58	<0.001
Level [D]	1.50	1.30 – 1.70	<0.001
Level [E]	1.60	1.36 – 1.83	<0.001
Register [Fiction] × Level [B]	0.55	0.17 – 0.93	0.004
Register [Informative] × Level [B]	-0.13	-0.52 – 0.25	0.504
Register [Instructional] × Level [B]	-0.67	-0.95 – -0.39	<0.001
Register [Personal] × Level [B]	-0.34	-0.91 – 0.23	0.242
Register [Fiction] × Level [C]	0.78	0.41 – 1.15	<0.001
Register [Informative] × Level [C]	-0.17	-0.53 – 0.19	0.362
Register [Instructional] × Level [C]	-0.94	-1.21 – -0.68	<0.001
Register [Personal] × Level [C]	0.17	-0.40 – 0.74	0.568
Register [Fiction] × Level [D]	0.34	-0.03 – 0.70	0.073
Register [Informative] × Level [D]	0.01	-0.35 – 0.37	0.953
Register [Instructional] × Level [D]	-1.01	-1.29 – -0.73	<0.001
Register [Personal] × Level [D]	0.02	-0.58 – 0.63	0.939
Register [Fiction] × Level [E]	0.23	-0.15 – 0.62	0.238
Register [Informative] × Level [E]	-0.18	-0.58 – 0.21	0.364
Register [Instructional] × Level [E]	-1.01	-1.32 – -0.70	<0.001
Register [Personal] × Level [E]	-0.21	-0.84 – 0.43	0.520
Random Effects
σ²	0.58
τ₀₀_Series	0.14
ICC	0.19
N _Series	9
Observations	1961
Marginal R² / Conditional R²	0.436 / 0.543

Code

# tab_model(md3Register) # Marginal R2 = 0.272
# tab_model(md3Level) # Marginal R2 = 0.119

# Plot of fixed effects:
plot_model(md3, 
           type = "est",
           show.intercept = TRUE,
           show.values=TRUE, 
           show.p=TRUE,
           value.offset = .4,
           value.size = 3.5,
           colors = palette[c(1:3,8,7)],
           group.terms = c(1:5,1,1,1,1,2:5,2:5,2:5,2:5), 
           title = "",
           wrap.labels = 40,
           axis.title = "PC3 estimated coefficients") +
  theme_sjplot2()

Code

#ggsave(here("plots", "TxB_PCA3_lmer_fixedeffects.svg"), height = 8, width = 8)

# Plot of random effects:
plot_model(md3, 
           type = "re", # Option to visualise random effects
           show.values=TRUE, 
           show.p=TRUE,
           value.offset = .4,
           value.size = 3.5,
           color = "bw",
           wrap.labels = 40,
           axis.title = "PC3 estimated coefficients") +
  theme_sjplot2()

#ggsave(here("plots", "TxB_PCA3_lmer_randomeffects.svg"), height = 3, width = 8)

Code

# svg(here("plots", "TxB_predicted_PC3_scores_interactions.svg"), height = 5, width = 8)
visreg(md3, xvar = "Level", by="Register", type = "conditional",
       line=list(col="darkred"), 
       xlab = "Textbook Level", ylab = "PC3"
       #,gg = TRUE
       ,layout=c(5,1)
)

Code

# dev.off()

# Textbook Series-Register interactions
visreg::visreg(md3, "Series", by="Register", re.form=~(1|Series),
               ylab="PC3", line=list(col="darkred"))

Code

visreg(md3, xvar = "Level", by="Series", type = "conditional", re.form=~(1|Series), 
       line=list(col="darkred"), xlab = "Textbook Series", ylab = "PC3")

F.10.4 Dimension 4: ‘Informational compression vs. elaboration’

Code

md4 <- lmer(PC4 ~ Register*Level + (1|Series), data = res.ind, REML = FALSE)
md4Register <- lmer(PC4 ~ Register + (1|Series), data = res.ind, REML = FALSE)
md4Level <- lmer(PC4 ~ Level + (1|Series), data = res.ind, REML = FALSE)

anova(md4, md4Register, md4Level)

Data: res.ind
Models:
md4Register: PC4 ~ Register + (1 | Series)
md4Level: PC4 ~ Level + (1 | Series)
md4: PC4 ~ Register * Level + (1 | Series)
            npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)    
md4Register    7 5034.0 5073.0 -2510.0   5020.0                         
md4Level       7 5043.6 5082.7 -2514.8   5029.6   0.00  0               
md4           27 4372.1 4522.8 -2159.1   4318.1 711.52 20  < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Code

tab_model(md4) # Marginal R2 = 0.426

	PC 4
Predictors	Estimates	CI	p
(Intercept)	0.76	0.56 – 0.96	<0.001
Register [Fiction]	0.72	0.44 – 1.00	<0.001
Register [Informative]	0.20	-0.09 – 0.49	0.176
Register [Instructional]	-0.23	-0.43 – -0.03	0.023
Register [Personal]	0.49	0.06 – 0.93	0.026
Level [B]	-0.67	-0.87 – -0.48	<0.001
Level [C]	-1.14	-1.32 – -0.96	<0.001
Level [D]	-1.42	-1.61 – -1.23	<0.001
Level [E]	-1.76	-1.99 – -1.54	<0.001
Register [Fiction] × Level [B]	0.97	0.61 – 1.33	<0.001
Register [Informative] × Level [B]	0.22	-0.14 – 0.59	0.230
Register [Instructional] × Level [B]	0.21	-0.06 – 0.47	0.130
Register [Personal] × Level [B]	-0.02	-0.57 – 0.52	0.930
Register [Fiction] × Level [C]	0.85	0.50 – 1.20	<0.001
Register [Informative] × Level [C]	0.25	-0.09 – 0.59	0.156
Register [Instructional] × Level [C]	0.37	0.11 – 0.62	0.005
Register [Personal] × Level [C]	-0.26	-0.80 – 0.28	0.343
Register [Fiction] × Level [D]	0.65	0.30 – 1.00	<0.001
Register [Informative] × Level [D]	0.34	-0.00 – 0.68	0.053
Register [Instructional] × Level [D]	0.47	0.20 – 0.73	0.001
Register [Personal] × Level [D]	-0.43	-1.01 – 0.14	0.140
Register [Fiction] × Level [E]	0.82	0.45 – 1.19	<0.001
Register [Informative] × Level [E]	0.42	0.05 – 0.80	0.027
Register [Instructional] × Level [E]	0.44	0.14 – 0.74	0.004
Register [Personal] × Level [E]	-0.55	-1.15 – 0.05	0.072
Random Effects
σ²	0.52
τ₀₀_Series	0.05
ICC	0.08
N _Series	9
Observations	1961
Marginal R² / Conditional R²	0.426 / 0.472

Code

# tab_model(md4Register) # Marginal R2 = 0.203
# tab_model(md4Level) # Marginal R2 = 0.187

# Plot of fixed effects:
plot_model(md4, 
           type = "est",
           show.intercept = TRUE,
           show.values=TRUE, 
           show.p=TRUE,
           value.offset = .4,
           value.size = 3.5,
           colors = palette[c(1:3,8,7)],
           group.terms = c(1:5,1,1,1,1,2:5,2:5,2:5,2:5), 
           title = "",
           wrap.labels = 40,
           axis.title = "PC4 estimated coefficients") +
  theme_sjplot2()

Code

#ggsave(here("plots", "TxB_PCA4_lmer_fixedeffects.svg"), height = 8, width = 8)

Code

# Plot of random effects:
plot_model(md4, 
           type = "re", # Option to visualise random effects
           show.values=TRUE, 
           show.p=TRUE,
           value.offset = .4,
           value.size = 3.5,
           color = "bw",
           wrap.labels = 40,
           axis.title = "PC4 estimated coefficients") +
  theme_sjplot2()

Code

#ggsave(here("plots", "TxB_PCA4_lmer_randomeffects.svg"), height = 3, width = 8)

Code

# svg(here("plots", "TxB_predicted_PC4_scores_interactions.svg"), height = 5, width = 8)
visreg(md4, xvar = "Level", by="Register", type = "conditional",
       line=list(col="darkred"), 
       xlab = "Textbook Level", ylab = "PC4"
       #,gg = TRUE
       ,layout=c(5,1)
)

Code

# dev.off()

F.10.5 Testing model assumptions

This chunk can be used to check the assumptions of all of the models computed above. In the following example, we examine the final model selected to predict Dim2 scores.

model2test <- md2

# check distribution of residuals
plot(model2test)

# scale-location plot
plot(model2test,
     sqrt(abs(resid(.)))~fitted(.),
     type=c("p","smooth"), col.line=1)

# Q-Q plot
lattice::qqmath(model2test)

F.11 Packages used in this script

F.11.1 Package names and versions

R version 4.4.1 (2024-06-14)
Platform: aarch64-apple-darwin20
Running under: macOS Sonoma 14.5

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Europe/Madrid
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
 [1] visreg_2.7.0      lubridate_1.9.3   stringr_1.5.1     dplyr_1.1.4      
 [5] purrr_1.0.2       readr_2.1.5       tidyr_1.3.1       tibble_3.2.1     
 [9] tidyverse_2.0.0   sjPlot_2.8.16     psych_2.4.6.26    PCAtools_2.14.0  
[13] ggrepel_0.9.5     patchwork_1.2.0   lme4_1.1-35.5     Matrix_1.7-0     
[17] knitr_1.48        here_1.0.1        ggthemes_5.1.0    forcats_1.0.0    
[21] factoextra_1.0.7  emmeans_1.10.3    DescTools_0.99.54 cowplot_1.1.3    
[25] caret_6.0-94      lattice_0.22-6    ggplot2_3.5.1    

loaded via a namespace (and not attached):
  [1] rstudioapi_0.16.0         jsonlite_1.8.8           
  [3] datawizard_0.12.1         magrittr_2.0.3           
  [5] estimability_1.5.1        nloptr_2.1.1             
  [7] rmarkdown_2.27            zlibbioc_1.50.0          
  [9] vctrs_0.6.5               minqa_1.2.7              
 [11] DelayedMatrixStats_1.26.0 htmltools_0.5.8.1        
 [13] S4Arrays_1.4.1            cellranger_1.1.0         
 [15] sjmisc_2.8.10             SparseArray_1.4.8        
 [17] pROC_1.18.5               parallelly_1.37.1        
 [19] htmlwidgets_1.6.4         plyr_1.8.9               
 [21] rootSolve_1.8.2.4         lifecycle_1.0.4          
 [23] iterators_1.0.14          pkgconfig_2.0.3          
 [25] sjlabelled_1.2.0          rsvd_1.0.5               
 [27] R6_2.5.1                  fastmap_1.2.0            
 [29] MatrixGenerics_1.16.0     future_1.33.2            
 [31] digest_0.6.36             Exact_3.2                
 [33] colorspace_2.1-0          S4Vectors_0.42.1         
 [35] rprojroot_2.0.4           dqrng_0.4.1              
 [37] irlba_2.3.5.1             beachmat_2.20.0          
 [39] fansi_1.0.6               timechange_0.3.0         
 [41] httr_1.4.7                abind_1.4-5              
 [43] compiler_4.4.1            proxy_0.4-27             
 [45] withr_3.0.0               BiocParallel_1.38.0      
 [47] performance_0.12.2        MASS_7.3-60.2            
 [49] lava_1.8.0                sjstats_0.19.0           
 [51] DelayedArray_0.30.1       gld_2.6.6                
 [53] ModelMetrics_1.2.2.2      tools_4.4.1              
 [55] future.apply_1.11.2       nnet_7.3-19              
 [57] glue_1.7.0                nlme_3.1-164             
 [59] grid_4.4.1                reshape2_1.4.4           
 [61] generics_0.1.3            recipes_1.1.0            
 [63] gtable_0.3.5              tzdb_0.4.0               
 [65] class_7.3-22              hms_1.1.3                
 [67] data.table_1.15.4         lmom_3.0                 
 [69] BiocSingular_1.20.0       ScaledMatrix_1.12.0      
 [71] utf8_1.2.4                XVector_0.44.0           
 [73] BiocGenerics_0.50.0       foreach_1.5.2            
 [75] pillar_1.9.0              splines_4.4.1            
 [77] renv_1.0.3                survival_3.6-4           
 [79] tidyselect_1.2.1          IRanges_2.38.1           
 [81] stats4_4.4.1              xfun_0.46                
 [83] expm_0.999-9              hardhat_1.4.0            
 [85] timeDate_4032.109         matrixStats_1.3.0        
 [87] stringi_1.8.4             yaml_2.3.9               
 [89] boot_1.3-30               evaluate_0.24.0          
 [91] codetools_0.2-20          BiocManager_1.30.23      
 [93] cli_3.6.3                 rpart_4.1.23             
 [95] xtable_1.8-4              munsell_0.5.1            
 [97] Rcpp_1.0.13               readxl_1.4.3             
 [99] globals_0.16.3            ggeffects_1.7.0          
[101] coda_0.19-4.1             parallel_4.4.1           
[103] gower_1.0.1               sparseMatrixStats_1.16.0 
[105] listenv_0.9.1             mvtnorm_1.2-5            
[107] ipred_0.9-15              scales_1.3.0             
[109] prodlim_2024.06.25        e1071_1.7-14             
[111] insight_0.20.2            crayon_1.5.3             
[113] rlang_1.1.4               mnormt_2.1.1

F.11.2 Package references

[1] J. B. Arnold. ggthemes: Extra Themes, Scales and Geoms for ggplot2. R package version 5.1.0. 2024. https://jrnold.github.io/ggthemes/.

[2] D. Bates, M. Mächler, B. Bolker, et al. “Fitting Linear Mixed-Effects Models Using lme4”. In: Journal of Statistical Software 67.1 (2015), pp. 1-48. DOI: 10.18637/jss.v067.i01.

[3] D. Bates, M. Maechler, B. Bolker, et al. lme4: Linear Mixed-Effects Models using Eigen and S4. R package version 1.1-35.5. 2024. https://github.com/lme4/lme4/.

[4] D. Bates, M. Maechler, and M. Jagan. Matrix: Sparse and Dense Matrix Classes and Methods. R package version 1.7-0. 2024. https://Matrix.R-forge.R-project.org.

[5] K. Blighe and A. Lun. PCAtools: Everything Principal Components Analysis. R package version 2.14.0. 2023. DOI: 10.18129/B9.bioc.PCAtools. https://bioconductor.org/packages/PCAtools.

[6] P. Breheny and W. Burchett. visreg: Visualization of Regression Models. R package version 2.7.0. 2020. http://pbreheny.github.io/visreg.

[7] P. Breheny and W. Burchett. “Visualization of Regression Models Using visreg”. In: The R Journal 9.2 (2017), pp. 56-71.

[8] G. Grolemund and H. Wickham. “Dates and Times Made Easy with lubridate”. In: Journal of Statistical Software 40.3 (2011), pp. 1-25. https://www.jstatsoft.org/v40/i03/.

[9] A. Kassambara and F. Mundt. factoextra: Extract and Visualize the Results of Multivariate Data Analyses. R package version 1.0.7. 2020. http://www.sthda.com/english/rpkgs/factoextra.

[10] M. Kuhn. caret: Classification and Regression Training. R package version 6.0-94. 2023. https://github.com/topepo/caret/.

[11] Kuhn and Max. “Building Predictive Models in R Using the caret Package”. In: Journal of Statistical Software 28.5 (2008), p. 1–26. DOI: 10.18637/jss.v028.i05. https://www.jstatsoft.org/index.php/jss/article/view/v028i05.

[12] R. V. Lenth. emmeans: Estimated Marginal Means, aka Least-Squares Means. R package version 1.10.3. 2024. https://rvlenth.github.io/emmeans/.

[13] D. Lüdecke. sjPlot: Data Visualization for Statistics in Social Science. R package version 2.8.16. 2024. https://strengejacke.github.io/sjPlot/.

[14] K. Müller. here: A Simpler Way to Find Your Files. R package version 1.0.1. 2020. https://here.r-lib.org/.

[15] K. Müller and H. Wickham. tibble: Simple Data Frames. R package version 3.2.1. 2023. https://tibble.tidyverse.org/.

[16] T. L. Pedersen. patchwork: The Composer of Plots. R package version 1.2.0. 2024. https://patchwork.data-imaginist.com.

[17] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing. Vienna, Austria, 2024. https://www.R-project.org/.

[18] W. Revelle. psych: Procedures for Psychological, Psychometric, and Personality Research. R package version 2.4.6.26. 2024. https://personality-project.org/r/psych/.

[19] D. Sarkar. Lattice: Multivariate Data Visualization with R. New York: Springer, 2008. ISBN: 978-0-387-75968-5. http://lmdvr.r-forge.r-project.org.

[20] D. Sarkar. lattice: Trellis Graphics for R. R package version 0.22-6. 2024. https://lattice.r-forge.r-project.org/.

[21] A. Signorell. DescTools: Tools for Descriptive Statistics. R package version 0.99.54. 2024. https://andrisignorell.github.io/DescTools/.

[22] K. Slowikowski. ggrepel: Automatically Position Non-Overlapping Text Labels with ggplot2. R package version 0.9.5. 2024. https://ggrepel.slowkow.com/.

[23] V. Spinu, G. Grolemund, and H. Wickham. lubridate: Make Dealing with Dates a Little Easier. R package version 1.9.3. 2023. https://lubridate.tidyverse.org.

[24] H. Wickham. forcats: Tools for Working with Categorical Variables (Factors). R package version 1.0.0. 2023. https://forcats.tidyverse.org/.

[25] H. Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York, 2016. ISBN: 978-3-319-24277-4. https://ggplot2.tidyverse.org.

[26] H. Wickham. stringr: Simple, Consistent Wrappers for Common String Operations. R package version 1.5.1. 2023. https://stringr.tidyverse.org.

[27] H. Wickham. tidyverse: Easily Install and Load the Tidyverse. R package version 2.0.0. 2023. https://tidyverse.tidyverse.org.

[28] H. Wickham, M. Averick, J. Bryan, et al. “Welcome to the tidyverse”. In: Journal of Open Source Software 4.43 (2019), p. 1686. DOI: 10.21105/joss.01686.

[29] H. Wickham, W. Chang, L. Henry, et al. ggplot2: Create Elegant Data Visualisations Using the Grammar of Graphics. R package version 3.5.1. 2024. https://ggplot2.tidyverse.org.

[30] H. Wickham, R. François, L. Henry, et al. dplyr: A Grammar of Data Manipulation. R package version 1.1.4. 2023. https://dplyr.tidyverse.org.

[31] H. Wickham and L. Henry. purrr: Functional Programming Tools. R package version 1.0.2. 2023. https://purrr.tidyverse.org/.

[32] H. Wickham, J. Hester, and J. Bryan. readr: Read Rectangular Text Data. R package version 2.1.5. 2024. https://readr.tidyverse.org.

[33] H. Wickham, D. Vaughan, and M. Girlich. tidyr: Tidy Messy Data. R package version 1.3.1. 2024. https://tidyr.tidyverse.org.

[34] C. O. Wilke. cowplot: Streamlined Plot Theme and Plot Annotations for ggplot2. R package version 1.1.3. 2024. https://wilkelab.org/cowplot/.

[35] Y. Xie. Dynamic Documents with R and knitr. 2nd. ISBN 978-1498716963. Boca Raton, Florida: Chapman and Hall/CRC, 2015. https://yihui.org/knitr/.

[36] Y. Xie. “knitr: A Comprehensive Tool for Reproducible Research in R”. In: Implementing Reproducible Computational Research. Ed. by V. Stodden, F. Leisch and R. D. Peng. ISBN 978-1466561595. Chapman and Hall/CRC, 2014.

[37] Y. Xie. knitr: A General-Purpose Package for Dynamic Report Generation in R. R package version 1.48. 2024. https://yihui.org/knitr/.