Appendix E — Data Preparation for the Model of Intra-Textbook Variation

Last modified: July 21, 2024

This script documents the steps taken to pre-process the Textbook English Corpus (TEC) data that were entered into the multi-dimensional model of intra-textbook linguistic variation (Chapter 6).

E.1 Packages required

The following packages must be installed and loaded to process the data.

#renv::restore() # Restore the project's dependencies from the lockfile to ensure that the same package versions are used as in the original study

library(caret) # For its confusion matrix function
library(DT) # To display interactive HTML tables
library(here) # For dynamic file paths
library(knitr) # Loaded to display the tables using the kable() function
library(patchwork) # Needed to put together Fig. 1
library(PerformanceAnalytics) # For the correlation plot
library(psych) # For various useful stats functions
library(tidyverse) # For data wrangling

E.2 Data import from MFTE output

The raw data used in this script is a tab-separated file that corresponds to the tabular output of mixed normalised frequencies as generated by the MFTE Perl v. 3.1 (Le Foll 2021a).

Code
# Read in Textbook Corpus data
TxBcounts <- read.delim(here("data", "MFTE", "TxB900MDA_3.1_normed_complex_counts.tsv"), header = TRUE, stringsAsFactors = TRUE)

TxBcounts <- TxBcounts |> 
  filter(Filename!=".DS_Store") |>  
  droplevels()

#str(TxBcounts) # Check sanity of data
#nrow(TxBcounts) # Should be 2014 files

datatable(TxBcounts,
  filter = "top") |> 
  formatRound(3:ncol(TxBcounts), digits = 2)

Metadata was added on the basis of the filenames.
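
Each TEC filename appears to encode the textbook series, volume, register and a running text number (e.g., Access_1_Spoken_0011.txt), so both the register and the series can be recovered with simple regular expressions. The following minimal sketch (not part of the original pipeline, using one filename from the corpus) illustrates the principle applied in the chunks below:

fn <- "Access_1_Spoken_0011.txt"
stringr::str_extract(fn, "Spoken|Narrative|Other|Personal|Informative|Instructional|Poetry") # "Spoken"
stringr::str_extract(fn, "Access|Achievers|EIM|GreenLine|HT|NB|NM|POC|JTT|NGL|Solutions")    # "Access"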

# Adding a textbook proficiency level
TxBLevels <- read.delim(here("data", "metadata", "TxB900MDA_ProficiencyLevels.csv"), sep = ",")
TxBcounts <- full_join(TxBcounts, TxBLevels, by = "Filename") |>  
  mutate(Level = as.factor(Level)) |>  
  mutate(Filename = as.factor(Filename))

# Check distribution and that there are no NAs
summary(TxBcounts$Level) |> 
  kable(col.names = c("Textbook Level", "# of texts"))

Textbook Level    # of texts
A                        292
B                        407
C                        506
D                        478
E                        331

# Check matching on random sample
# TxBcounts |>
#   select(Filename, Level) |>  
#   sample_n(20) 

# Adding a register variable from the file names
TxBcounts$Register <- as.factor(stringr::str_extract(TxBcounts$Filename, "Spoken|Narrative|Other|Personal|Informative|Instructional|Poetry")) # Add a variable for Textbook Register
summary(TxBcounts$Register) |> 
  kable(col.names = c("Textbook Register", "# of texts"))

Textbook Register    # of texts
Informative                 364
Instructional               647
Narrative                   285
Personal                     88
Poetry                       37
Spoken                      593

TxBcounts$Register <- car::recode(TxBcounts$Register, "'Narrative' = 'Fiction'; 'Spoken' = 'Conversation'")
#colnames(TxBcounts) # Check all the variables make sense

# Adding a textbook series variable from the file names
TxBcounts$Filename <- stringr::str_replace(TxBcounts$Filename, "English_In_Mind|English_in_Mind", "EIM") 
TxBcounts$Filename <- stringr::str_replace(TxBcounts$Filename, "New_GreenLine", "NGL") # Otherwise the regex for GreenLine will override New_GreenLine
TxBcounts$Filename <- stringr::str_replace(TxBcounts$Filename, "Piece_of_cake", "POC") # Shorten label for ease of plotting
TxBcounts$Series <- as.factor(stringr::str_extract(TxBcounts$Filename, "Access|Achievers|EIM|GreenLine|HT|NB|NM|POC|JTT|NGL|Solutions")) # Extract textbook series from (amended) filenames
summary(TxBcounts$Series)  |> 
  kable(col.names = c("Textbook Name", "# of texts"))

Textbook Name    # of texts
Access                  315
Achievers               240
EIM                     180
GreenLine               209
HT                      115
JTT                     129
NB                       44
NGL                     298
NM                       59
POC                      98
Solutions               327

# Assign the French textbooks for the first year of lycée to their corresponding publisher series from collège
TxBcounts$Series <- car::recode(TxBcounts$Series, "c('NB', 'JTT') = 'JTT'; c('NM', 'HT') = 'HT'") # Recode final volumes of French series (see Section 4.3.1.1 on textbook selection for details)
summary(TxBcounts$Series) |> 
  kable(col.names = c("Textbook Series", "# of texts"))

Textbook Series    # of texts
Access                    315
Achievers                 240
EIM                       180
GreenLine                 209
HT                        174
JTT                       173
NGL                       298
POC                        98
Solutions                 327

# Adding a textbook country of use variable from the series variable
TxBcounts$Country <- TxBcounts$Series
TxBcounts$Country <- car::recode(TxBcounts$Series, "c('Access', 'GreenLine', 'NGL') = 'Germany'; c('Achievers', 'EIM', 'Solutions') = 'Spain'; c('HT', 'NB', 'NM', 'POC', 'JTT') = 'France'")
summary(TxBcounts$Country) |> 
  kable(col.names = c("Country of Use", "# of texts"))

Country of Use    # of texts
France                   445
Germany                  822
Spain                    747

# Re-order variables
#colnames(TxBcounts)
TxBcounts <- select(TxBcounts, order(names(TxBcounts))) %>%
  select(Filename, Country, Series, Level, Register, Words, everything())
#colnames(TxBcounts)

E.2.1 Corpus size

This table provides some summary statistics about the number of words included in the TEC texts originally tagged for this study.

TxBcounts  |>  
  group_by(Register) |>  
  summarise(totaltexts = n(), totalwords = sum(Words), mean = as.integer(mean(Words)), sd = as.integer(sd(Words)), TTRmean = mean(TTR)) |>  
  kable(digits = 2, format.args = list(big.mark = ","))

Register        totaltexts   totalwords   mean    sd   TTRmean
Conversation           593      505,147    851   301      0.44
Fiction                285      241,512    847   208      0.47
Informative            364      304,695    837   177      0.51
Instructional          647      585,049    904    94      0.42
Personal                88       69,570    790   177      0.48
Poetry                  37       26,445    714   192      0.44

#TxBcounts <- saveRDS(TxBcounts, here("data", "processed", "TxBcounts.rds"))

E.3 Data preparation for PCA

Poetry texts were removed for this analysis as there were too few compared to the other register categories.

summary(TxBcounts$Register) |>  
  kable(col.names = c("Register", "# texts"))

Register        # texts
Conversation        593
Fiction             285
Informative         364
Instructional       647
Personal             88
Poetry               37

This led to the following distribution of texts across the five textbook English registers examined in the model of intra-textbook linguistic variation:

TxBcounts <- TxBcounts |>  
  filter(Register!="Poetry") |>  
  droplevels()

summary(TxBcounts$Register) |>  
  kable(col.names = c("Register", "# texts"))

Register        # texts
Conversation        593
Fiction             285
Informative         364
Instructional       647
Personal             88

E.3.1 Feature distributions

The distribution of each linguistic feature was examined by means of visualisation. As shown below, before transformation, many of the features displayed highly skewed distributions.

Code
TxBcounts |> 
  select(-Words) |>  
  keep(is.numeric) |>  
  tidyr::gather() |>  # This function from tidyr converts a selection of variables into two variables: a key and a value. The key contains the names of the original variable and the value the data. This means we can then use the facet_wrap function from ggplot2
  ggplot(aes(value)) +
    theme_bw() +
    facet_wrap(~ key, scales = "free", ncol = 4) +
    scale_x_continuous(expand=c(0,0)) +
    geom_histogram(bins = 30, colour= "darkred", fill = "darkred", alpha = 0.5)

Code
#ggsave(here("plots", "TEC-HistogramPlotsAllVariablesTEC-only.svg"), width = 20, height = 45)

E.3.2 Feature removal

A number of features were removed from the dataset because they are not linguistically interpretable. In the case of the TEC, this included the variable CD because numbers written as digits were removed from the textbooks before these were tagged with the MFTE. In addition, the variables LIKE and SO were removed because these are “bin” features included in the output of the MFTE to ensure that the counts for these polysemous words do not inflate other categories due to mistags (Le Foll 2021b).

Whenever linguistically meaningful, very low-frequency features were merged. Finally, features absent from more than two thirds of the texts were also excluded. For the analysis of intra-textbook register variation, the following linguistic features were excluded due to low dispersion:

# Removal of meaningless features:
TxBcounts <- TxBcounts |>  
  select(-c(CD, LIKE, SO))
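
# Toy illustration (hypothetical values, not part of the original pipeline) of the
# dispersion criterion described above: a feature with zero occurrences in more
# than 66.6% of texts (i.e. absent from more than two thirds of texts) is excluded.
# x <- c(0, 0, 0, 0, 0, 0, 0, 0, 2, 1)  # normalised frequencies of one feature across ten texts
# mean(x == 0) * 100                    # 80 > 66.6 -> feature would be excluded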

# Function to compute percentage of texts with occurrences meeting a condition
compute_percentage <- function(data, condition, threshold) {
  numeric_data <- Filter(is.numeric, data) # Consider only the numeric (feature) columns
  # For each feature, percentage of texts meeting the condition
  percentage <- round(colSums(condition[, sapply(numeric_data, is.numeric)])/nrow(data) * 100, 2)
  percentage <- as.data.frame(percentage)
  colnames(percentage) <- "Percentage"
  percentage <- percentage |>  
    filter(!is.na(Percentage)) |> 
    rownames_to_column() |> 
    arrange(Percentage)
  # If a threshold is supplied, keep only the features whose percentage exceeds it
  if (!missing(threshold)) {
    percentage <- percentage |>  
      filter(Percentage > threshold)
  }
  return(percentage)
}

# Calculate percentage of texts with 0 occurrences of each feature
zero_features <- compute_percentage(TxBcounts, TxBcounts == 0, 66.6)
# zero_features |> 
#   kable(col.names = c("Feature", "% texts with zero occurrences"))

# Combine low frequency features into meaningful groups whenever this makes linguistic sense
TxBcounts <- TxBcounts |>  
  mutate(JJPR = ABLE + JJPR, ABLE = NULL) |>  
  mutate(PASS = PGET + PASS, PGET = NULL)

# Re-calculate percentage of texts with 0 occurrences of each feature
zero_features2 <- compute_percentage(TxBcounts, TxBcounts == 0, 66.6)
zero_features2 |> 
  kable(col.names = c("Feature", "% texts with zero occurrences"))

Feature   % texts with zero occurrences
GTO                               67.07
ELAB                              69.30
MDMM                              70.81
HGOT                              73.75
CONC                              80.48
DWNT                              81.44
QUTAG                             85.99
URL                               96.51
EMO                               97.82
PRP                               98.33
HST                               99.44

# Drop variables with low document frequency
TxBcounts <- select(TxBcounts, -one_of(zero_features2$rowname))
#ncol(TxBcounts)-8 # Number of linguistic features remaining

# List of features
#colnames(TxBcounts)

These feature removal operations resulted in a feature set of 64 linguistic variables.

E.3.3 Identifying potential outlier texts

All normalised frequencies were standardised as z-scores in order to identify any potential outlier texts.

TxBzcounts <- TxBcounts |> 
  select(-Words) |>  
  keep(is.numeric) |>  
  scale()

boxplot(TxBzcounts, las = 3, main = "z-scores") # Slow to open!
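
# Optional sanity check (not part of the original script): after scale(), every
# standardised feature should have a mean of (approximately) 0 and a standard deviation of 1.
# summary(round(colMeans(TxBzcounts), 10))
# summary(apply(TxBzcounts, 2, sd))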

# If necessary, remove any outliers at this stage.
TxBdata <- cbind(TxBcounts[,1:6], as.data.frame(TxBzcounts))

outliers <- TxBdata |>  
  select(-c(Words, LD, TTR)) |>  
  filter(if_any(where(is.numeric), ~ .x > 8)) |>  
  select(Filename)

The following outlier texts were identified and excluded from all subsequent analyses.

Code
outliers
                                            Filename
1                             POC_4e_Spoken_0007.txt
2             Solutions_Elementary_Personal_0001.txt
3                       NGL_5_Instructional_0018.txt
4                           Access_1_Spoken_0011.txt
5                              EIM_1_Spoken_0012.txt
6                              NGL_4_Spoken_0011.txt
7      Solutions_Intermediate_Plus_Personal_0001.txt
8           Solutions_Elementary_ELF_Spoken_0021.txt
9                          NB_2_Informative_0009.txt
10       Solutions_Intermediate_Plus_Spoken_0022.txt
11     Solutions_Intermediate_Instructional_0025.txt
12 Solutions_Pre-Intermediate_Instructional_0024.txt
13                            POC_4e_Spoken_0010.txt
14            Solutions_Intermediate_Spoken_0019.txt
15                          Access_1_Spoken_0019.txt
16    Solutions_Pre-Intermediate_ELF_Spoken_0005.txt
Code
TxBcounts <- TxBcounts |>  
  filter(!Filename %in% outliers$Filename)

#saveRDS(TxBcounts, here("data", "processed", "TxBcounts3.rds")) # Last saved 6 March 2024

TxBzcounts <- TxBcounts |> 
  select(-Words) |>  
  keep(is.numeric) |>  
  scale()
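
Before visualising the standardised features, a quick count (a sketch, not part of the original script) confirms how many texts remain after these exclusions:

nrow(TxBcounts) # 2,014 texts - 37 poetry texts - 16 outliers = 1,961 texts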

This resulted in 1,961 TEC texts being included in the model of intra-textbook linguistic variation with the following standardised feature distributions.

Code
TxBzcounts |> 
  as.data.frame() |>  
  gather() |>  # This function from tidyr converts a selection of variables into two variables: a key and a value. The key contains the names of the original variable and the value the data. This means we can then use the facet_wrap function from ggplot2
  ggplot(aes(value)) +
    theme_bw() +
    facet_wrap(~ key, scales = "free", ncol = 4) +
    scale_x_continuous(expand=c(0,0)) +
    geom_histogram(bins = 30, colour= "darkred", fill = "darkred", alpha = 0.5)

Code
#ggsave(here("plots", "TEC-zscores-HistogramsAllVariablesTEC-only.svg"), width = 20, height = 45)

E.3.4 Signed log transformation

A signed logarithmic transformation was applied to (further) deskew the feature distributions (Diwersy, Evert & Neumann 2014; Neumann & Evert 2021).

The signed log transformation function was inspired by the SignedLog function proposed in https://cran.r-project.org/web/packages/DataVisualizations/DataVisualizations.pdf

# All features are signed log-transformed (note that this is also what Neumann & Evert 2021 propose)
signed.log <- function(x) {
  sign(x) * log(abs(x) + 1)
  }

TxBzlogcounts <- signed.log(TxBzcounts) # Standardise first, then signed log transform

#saveRDS(TxBzlogcounts, here("data", "processed", "TxBzlogcounts.rds")) # Last saved 6 March 2024
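
As a minimal illustration (not part of the original script), the signed log transformation compresses large absolute z-scores while preserving their sign:

signed.log(c(-8, -1, 0, 1, 8))
# [1] -2.1972246 -0.6931472  0.0000000  0.6931472  2.1972246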

The new feature distributions are visualised below.

Code
TxBzlogcounts |> 
  as.data.frame() |>  
  gather() |>  # This function from tidyr converts a selection of variables into two variables: a key and a value. The key contains the names of the original variable and the value the data. This means we can then use the facet_wrap function from ggplot2
  ggplot(aes(value, after_stat(density))) +
  theme_bw() +
  facet_wrap(~ key, scales = "free", ncol = 4) +
  scale_x_continuous(expand=c(0,0)) +
  scale_y_continuous(limits = c(0,NA)) +
  geom_histogram(bins = 30, colour= "black", fill = "grey") +
  geom_density(colour = "darkred", weight = 2, fill="darkred", alpha = .4)

Code
#ggsave(here("plots", "DensityPlotsAllVariablesSignedLog-TEC-only.svg"), width = 15, height = 49)

The following correlation plots serve to illustrate the effect of the variable transformations performed in the above chunks.

Example feature distributions before transformations:

Code
# This is a slightly amended version of the PerformanceAnalytics::chart.Correlation() function. It simply removes the significance stars that are meaningless with this many data points (see commented out lines below)

chart.Correlation.nostars <- function (R, histogram = TRUE, method = c("pearson", "kendall", "spearman"), ...) {
  x = checkData(R, method = "matrix")
  if (missing(method)) 
    method = method[1]
  panel.cor <- function(x, y, digits = 2, prefix = "", use = "pairwise.complete.obs", method = "pearson", cex.cor, ...) {
    usr <- par("usr")
    on.exit(par(usr))
    par(usr = c(0, 1, 0, 1))
    r <- cor(x, y, use = use, method = method)
    txt <- format(c(r, 0.123456789), digits = digits)[1]
    txt <- paste(prefix, txt, sep = "")
    if (missing(cex.cor)) 
      cex <- 0.8/strwidth(txt)
    test <- cor.test(as.numeric(x), as.numeric(y), method = method)
    # Signif <- symnum(test$p.value, corr = FALSE, na = FALSE, 
    #                  cutpoints = c(0, 0.001, 0.01, 0.05, 0.1, 1), symbols = c("***", 
    #                                                                           "**", "*", ".", " "))
    text(0.5, 0.5, txt, cex = cex * (abs(r) + 0.3)/1.3)
    # text(0.8, 0.8, Signif, cex = cex, col = 2)
  }
  f <- function(t) {
    dnorm(t, mean = mean(x), sd = sd.xts(x))
  }
  dotargs <- list(...)
  dotargs$method <- NULL
  rm(method)
  hist.panel = function(x, ... = NULL) {
    par(new = TRUE)
    hist(x, col = "light gray", probability = TRUE, 
         axes = FALSE, main = "", breaks = "FD")
    lines(density(x, na.rm = TRUE), col = "red", lwd = 1)
    rug(x)
  }
  if (histogram) 
    pairs(x, gap = 0, lower.panel = panel.smooth, upper.panel = panel.cor, 
          diag.panel = hist.panel)
  else pairs(x, gap = 0, lower.panel = panel.smooth, upper.panel = panel.cor)
}

# Example plot without any variable transformation
example1 <- TxBcounts |> 
  select(NN,PROG,SPLIT,ACT,FPP1S)

#png(here("plots", "CorrChart-TEC-examples-normedcounts.png"), width = 20, height = 20, units = "cm", res = 300)
chart.Correlation.nostars(example1, histogram=TRUE, pch=19)

Code
#dev.off()

Example feature distributions after transformations:

Code
# Example plot with transformed variables
example2 <- TxBzlogcounts |> 
  as.data.frame() |>  
  select(NN,PROG,SPLIT,ACT,FPP1S)

#png(here("plots", "CorrChart-TEC-examples-zsignedlogcounts.png"), width = 20, height = 20, units = "cm", res = 300)
chart.Correlation.nostars(example2, histogram=TRUE, pch=19)

Code
#dev.off()

E.3.5 Feature correlations

The correlations of the transformed feature frequencies can be visualised in the form of a heatmap. Negative correlations are rendered in blue, whereas positive ones are in red.

Code
# Simple heatmap in base R (inspired by Stephanie Evert's SIGIL code)
cor.colours <- c(
  hsv(h=2/3, v=1, s=(10:1)/10), # blue = negative correlation 
  rgb(1,1,1), # white = no correlation 
  hsv(h=0, v=1, s=(1:10/10))) # red = positive correlation

#png(here("plots", "heatmapzlogcounts-TEC-only.png"), width = 30, height= 30, units = "cm", res = 300)
heatmap(cor(TxBzlogcounts), 
        symm=TRUE, 
        zlim=c(-1,1), 
        col=cor.colours, 
        margins=c(0,0))

Code
#dev.off()

# Calculate the sum of all the words in the tagged texts of the TEC
totalwords <- TxBcounts |>  
  select(Words) |> 
  sum() |> 
  format(big.mark=",")

E.4 Composition of TEC texts/files

These figures and tables provide summary statistics on the texts/files of the TEC that were entered into the multi-dimensional model of intra-textbook linguistic variation. In total, these TEC texts amounted to 1,693,650 words.

Code
metadata <- TxBcounts |>  
  select(Filename, Country, Series, Level, Register, Words) |>  
  mutate(Volume = paste(Series, Level)) |>  
  mutate(Volume = fct_rev(Volume)) |>  
  mutate(Volume = fct_reorder(Volume, as.numeric(Level))) |>  
  group_by(Volume) |>  
  mutate(wordcount = sum(Words)) |>  
  ungroup() |>  
  distinct(Volume, .keep_all = TRUE)

# Plot for book
metadata2 <- TxBcounts |>  
  select(Country, Series, Level, Register, Words) |>  
  mutate(Volume = paste(Series, Level)) |>  
  mutate(Volume = fct_rev(Volume)) |>  
  #mutate(Volume = fct_reorder(Volume, as.numeric(Level))) |>  
  group_by(Volume, Register) |>  
  mutate(wordcount = sum(Words)) |>  
  ungroup() |>  
  distinct(Volume, Register, .keep_all = TRUE)

# This is the palette based on the suffrager package (defined manually so that the package does not need to be installed)
palette <- c("#BD241E", "#A18A33", "#15274D", "#D54E1E", "#EA7E1E", "#4C4C4C", "#722672", "#F9B921", "#267226")

PlotSp <- metadata2 |>  
  filter(Country=="Spain") |>  
  #arrange(Volume) |>  
  ggplot(aes(x = Volume, y = wordcount, fill = fct_rev(Register))) + 
    geom_bar(stat = "identity", position = "stack") +
    coord_flip(expand = FALSE) + # Removes those annoying ticks before each bar label
    theme_minimal() + theme(legend.position = "none") +
    labs(x = "Spain", y = "Cumulative word count") +
    scale_fill_manual(values = palette[c(5,4,3,2,1)], 
                      guide = guide_legend(reverse = TRUE))

PlotGer <- metadata2 |>  
  filter(Country=="Germany") |>  
  #arrange(Volume) |>  
  ggplot(aes(x = Volume, y = wordcount, fill = fct_rev(Register))) + 
    geom_bar(stat = "identity", position = "stack") +
    coord_flip(expand = FALSE) +
    labs(x = "Germany", y = "") +
    scale_fill_manual(values = palette[c(5,4,3,2,1)], guide = guide_legend(reverse = TRUE)) +
    theme_minimal() + theme(legend.position = "none")

PlotFr <- metadata2 |>  
  filter(Country=="France") |>  
  #arrange(Volume) |>  
  ggplot(aes(x = Volume, y = wordcount, fill = fct_rev(Register))) + 
    geom_bar(stat = "identity", position = "stack") +
    coord_flip(expand = FALSE) +
    labs(x = "France", y  = "", fill = "Register subcorpus") +
    scale_fill_manual(values = palette[c(5,4,3,2,1)], guide = guide_legend(reverse = TRUE, legend.hjust = 0)) +
    theme_minimal() + theme(legend.position = "top", legend.justification = "left")

library(patchwork)

PlotFr /
PlotGer /
PlotSp

Code
#ggsave(here("plots", "TEC-T_wordcounts_book.svg"), width = 8, height = 12)

The following table provides information about the proportion of instructional language featured in each textbook series.

Code
metadataInstr <- TxBcounts |>  
  select(Country, Series, Level, Register, Words) |>  
  filter(Register=="Instructional") |>  
  mutate(Volume = paste(Series, Register)) |>  
  mutate(Volume = fct_rev(Volume)) |>  
  mutate(Volume = fct_reorder(Volume, as.numeric(Level))) |>  
  group_by(Volume, Register) |>  
  mutate(InstrWordcount = sum(Words)) |>  
  ungroup() |>  
  distinct(Volume, .keep_all = TRUE) |>  
  select(Series, InstrWordcount)

metaWordcount <- TxBcounts |>  
  select(Country, Series, Level, Register, Words) |>  
  group_by(Series) |>  
  mutate(TECwordcount = sum(Words)) |>  
  ungroup() |>  
  distinct(Series, .keep_all = TRUE) |>  
  select(Series, TECwordcount)

wordcount <- merge(metaWordcount, metadataInstr, by = "Series")

wordcount |>  
  mutate(InstrucPercent = InstrWordcount/TECwordcount*100) |>  
  arrange(InstrucPercent) |>  
  mutate(InstrucPercent = round(InstrucPercent, 2)) |>  
  kable(col.names = c("Textbook Series", "Total words", "Instructional words", "% of textbook content"), 
        digits = 2, 
        format.args = list(big.mark = ","))

Textbook Series   Total words   Instructional words   % of textbook content
Access                259,679                60,938                    23.47
NGL                   278,316                79,312                    28.50
GreenLine             172,267                54,263                    31.50
Solutions             270,278                87,829                    32.50
JTT                   137,557                48,375                    35.17
HT                    142,676                51,550                    36.13
POC                    76,714                30,548                    39.82
EIM                   147,185                59,928                    40.72
Achievers             208,978               109,886                    52.58

E.5 Packages used in this script

E.5.1 Package names and versions

R version 4.4.1 (2024-06-14)
Platform: aarch64-apple-darwin20
Running under: macOS Sonoma 14.5

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Europe/Madrid
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
 [1] lubridate_1.9.3            forcats_1.0.0             
 [3] stringr_1.5.1              dplyr_1.1.4               
 [5] purrr_1.0.2                readr_2.1.5               
 [7] tidyr_1.3.1                tibble_3.2.1              
 [9] tidyverse_2.0.0            psych_2.4.6.26            
[11] PerformanceAnalytics_2.0.4 xts_0.14.0                
[13] zoo_1.8-12                 patchwork_1.2.0           
[15] knitr_1.48                 here_1.0.1                
[17] DT_0.33                    caret_6.0-94              
[19] lattice_0.22-6             ggplot2_3.5.1             

loaded via a namespace (and not attached):
 [1] tidyselect_1.2.1     timeDate_4032.109    fastmap_1.2.0       
 [4] pROC_1.18.5          digest_0.6.36        rpart_4.1.23        
 [7] timechange_0.3.0     lifecycle_1.0.4      survival_3.6-4      
[10] magrittr_2.0.3       compiler_4.4.1       rlang_1.1.4         
[13] tools_4.4.1          utf8_1.2.4           yaml_2.3.9          
[16] data.table_1.15.4    htmlwidgets_1.6.4    mnormt_2.1.1        
[19] plyr_1.8.9           withr_3.0.0          nnet_7.3-19         
[22] grid_4.4.1           stats4_4.4.1         fansi_1.0.6         
[25] colorspace_2.1-0     future_1.33.2        globals_0.16.3      
[28] scales_1.3.0         iterators_1.0.14     MASS_7.3-60.2       
[31] cli_3.6.3            rmarkdown_2.27       generics_0.1.3      
[34] rstudioapi_0.16.0    future.apply_1.11.2  tzdb_0.4.0          
[37] reshape2_1.4.4       splines_4.4.1        parallel_4.4.1      
[40] BiocManager_1.30.23  vctrs_0.6.5          hardhat_1.4.0       
[43] Matrix_1.7-0         jsonlite_1.8.8       hms_1.1.3           
[46] listenv_0.9.1        foreach_1.5.2        gower_1.0.1         
[49] recipes_1.1.0        glue_1.7.0           parallelly_1.37.1   
[52] codetools_0.2-20     stringi_1.8.4        gtable_0.3.5        
[55] quadprog_1.5-8       munsell_0.5.1        pillar_1.9.0        
[58] htmltools_0.5.8.1    ipred_0.9-15         lava_1.8.0          
[61] R6_2.5.1             rprojroot_2.0.4      evaluate_0.24.0     
[64] renv_1.0.3           class_7.3-22         Rcpp_1.0.13         
[67] nlme_3.1-164         prodlim_2024.06.25   xfun_0.46           
[70] pkgconfig_2.0.3      ModelMetrics_1.2.2.2

E.5.2 Package references

[1] G. Grolemund and H. Wickham. “Dates and Times Made Easy with lubridate”. In: Journal of Statistical Software 40.3 (2011), pp. 1-25. https://www.jstatsoft.org/v40/i03/.

[2] M. Kuhn. caret: Classification and Regression Training. R package version 6.0-94. 2023. https://github.com/topepo/caret/.

[3] M. Kuhn. “Building Predictive Models in R Using the caret Package”. In: Journal of Statistical Software 28.5 (2008), pp. 1-26. DOI: 10.18637/jss.v028.i05. https://www.jstatsoft.org/index.php/jss/article/view/v028i05.

[4] K. Müller. here: A Simpler Way to Find Your Files. R package version 1.0.1. 2020. https://here.r-lib.org/.

[5] K. Müller and H. Wickham. tibble: Simple Data Frames. R package version 3.2.1. 2023. https://tibble.tidyverse.org/.

[6] T. L. Pedersen. patchwork: The Composer of Plots. R package version 1.2.0. 2024. https://patchwork.data-imaginist.com.

[7] B. G. Peterson and P. Carl. PerformanceAnalytics: Econometric Tools for Performance and Risk Analysis. R package version 2.0.4. 2020. https://github.com/braverock/PerformanceAnalytics.

[8] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing. Vienna, Austria, 2024. https://www.R-project.org/.

[9] W. Revelle. psych: Procedures for Psychological, Psychometric, and Personality Research. R package version 2.4.6.26. 2024. https://personality-project.org/r/psych/.

[10] J. A. Ryan and J. M. Ulrich. xts: eXtensible Time Series. R package version 0.14.0. 2024. https://joshuaulrich.github.io/xts/.

[11] D. Sarkar. Lattice: Multivariate Data Visualization with R. New York: Springer, 2008. ISBN: 978-0-387-75968-5. http://lmdvr.r-forge.r-project.org.

[12] D. Sarkar. lattice: Trellis Graphics for R. R package version 0.22-6. 2024. https://lattice.r-forge.r-project.org/.

[13] V. Spinu, G. Grolemund, and H. Wickham. lubridate: Make Dealing with Dates a Little Easier. R package version 1.9.3. 2023. https://lubridate.tidyverse.org.

[14] H. Wickham. forcats: Tools for Working with Categorical Variables (Factors). R package version 1.0.0. 2023. https://forcats.tidyverse.org/.

[15] H. Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York, 2016. ISBN: 978-3-319-24277-4. https://ggplot2.tidyverse.org.

[16] H. Wickham. stringr: Simple, Consistent Wrappers for Common String Operations. R package version 1.5.1. 2023. https://stringr.tidyverse.org.

[17] H. Wickham. tidyverse: Easily Install and Load the Tidyverse. R package version 2.0.0. 2023. https://tidyverse.tidyverse.org.

[18] H. Wickham, M. Averick, J. Bryan, et al. “Welcome to the tidyverse”. In: Journal of Open Source Software 4.43 (2019), p. 1686. DOI: 10.21105/joss.01686.

[19] H. Wickham, W. Chang, L. Henry, et al. ggplot2: Create Elegant Data Visualisations Using the Grammar of Graphics. R package version 3.5.1. 2024. https://ggplot2.tidyverse.org.

[20] H. Wickham, R. François, L. Henry, et al. dplyr: A Grammar of Data Manipulation. R package version 1.1.4. 2023. https://dplyr.tidyverse.org.

[21] H. Wickham and L. Henry. purrr: Functional Programming Tools. R package version 1.0.2. 2023. https://purrr.tidyverse.org/.

[22] H. Wickham, J. Hester, and J. Bryan. readr: Read Rectangular Text Data. R package version 2.1.5. 2024. https://readr.tidyverse.org.

[23] H. Wickham, D. Vaughan, and M. Girlich. tidyr: Tidy Messy Data. R package version 1.3.1. 2024. https://tidyr.tidyverse.org.

[24] Y. Xie. Dynamic Documents with R and knitr. 2nd. ISBN 978-1498716963. Boca Raton, Florida: Chapman and Hall/CRC, 2015. https://yihui.org/knitr/.

[25] Y. Xie. “knitr: A Comprehensive Tool for Reproducible Research in R”. In: Implementing Reproducible Computational Research. Ed. by V. Stodden, F. Leisch and R. D. Peng. ISBN 978-1466561595. Chapman and Hall/CRC, 2014.

[26] Y. Xie. knitr: A General-Purpose Package for Dynamic Report Generation in R. R package version 1.48. 2024. https://yihui.org/knitr/.

[27] Y. Xie, J. Cheng, and X. Tan. DT: A Wrapper of the JavaScript Library DataTables. R package version 0.33. 2024. https://github.com/rstudio/DT.

[28] A. Zeileis and G. Grothendieck. “zoo: S3 Infrastructure for Regular and Irregular Time Series”. In: Journal of Statistical Software 14.6 (2005), pp. 1-27. DOI: 10.18637/jss.v014.i06.

[29] A. Zeileis, G. Grothendieck, and J. A. Ryan. zoo: S3 Infrastructure for Regular and Irregular Time Series (Z’s Ordered Observations). R package version 1.8-12. 2023. https://zoo.R-Forge.R-project.org/.