Appendix G — Data Preparation for the Model of Textbook English vs. ‘real-world’ English

Last modified: July 21, 2024

This script documents the steps taken to pre-process the data extracted from the Textbook English Corpus (TEC) and the three reference corpora that were ultimately entered in the comparative multi-dimensional model of Textbook English as compared to English outside the EFL classroom (Chapter 7).

G.1 Packages required

The following packages must be installed and loaded to process the data.

#renv::restore() # Restore the project's dependencies from the lockfile to ensure that the same package versions are used as in the original thesis.

library(broom.mixed) # For checking singularity issues 
library(car) # For recoding data
library(corrplot) # For the feature correlation matrix
library(cowplot) # For nice plots
library(DT) # To display interactive HTML tables
library(emmeans) # Comparing group means of predicted values
library(GGally) # For ggpairs
library(gridExtra) # For making large faceted plots
library(here) # For ease of sharing
library(knitr) # Loaded to display the tables using the kable() function
library(lme4) # For mixed effects modelling
library(psych) # For various useful stats function, including KMO()
library(scales) # For working with colours
library(sjPlot) # For nice tabular display of regression models
library(tidyverse) # For data wrangling and plotting
library(visreg) # For nice visualisations of model results
select <- dplyr::select
filter <- dplyr::filter

G.2 Data import from MFTE outputs

The raw data used in this script comes from the matrices of mixed normalised frequencies as output by the MFTE Perl v. 3.1 (Le Foll 2021a).

G.2.1 Spoken BNC2014

Code
SpokenBNC2014 <- read.delim(here("data", "MFTE", "SpokenBNC2014_3.1_normed_complex_counts.tsv"), header = TRUE, stringsAsFactors = TRUE)

SpokenBNC2014$Series <- "Spoken BNC2014"
SpokenBNC2014$Level <- "Ref."
SpokenBNC2014$Country <- "Spoken BNC2014"
SpokenBNC2014$Register <- "Spoken BNC2014"

These normalised frequencies were computed on the basis of my own “John and Jill in Ivybridge” version of the Spoken BNC2014 with added full stops at speaker turns (see Appendix B for details). This corpus comprises 1,251 texts, all of which were used in the following analyses.

G.2.2 Youth Fiction corpus

Code
YouthFiction <- read.delim(here("data", "MFTE", "YF_sampled_500_3.1_normed_complex_counts.tsv"), header = TRUE, stringsAsFactors = TRUE)

YouthFiction$Series <- "Youth Fiction"
YouthFiction$Level <- "Ref."
YouthFiction$Country <- "Youth Fiction"
YouthFiction$Register <- "Youth Fiction"

These normalised frequencies were computed on the basis of random samples of approximately 5,000 words drawn from the books of the Youth Fiction corpus (for details of the works included in this corpus, see Appendix B). The sampling procedure is described in Section 4.3.2.4 of the book. This dataset consists of 1,191 files.
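Purely as an illustration (the actual procedure is the one described in Section 4.3.2.4), a contiguous sample of roughly 5,000 words could be drawn from a plain-text novel along the following lines; the function and file names are hypothetical.

Code
# Illustrative sketch only (not the procedure from Section 4.3.2.4):
# draw one contiguous sample of approximately 5,000 words from a plain-text file.
draw_sample <- function(file, n = 5000) {
  words <- scan(file, what = character(), quote = "", quiet = TRUE) # whitespace-delimited tokens
  if (length(words) <= n) return(paste(words, collapse = " "))
  start <- sample(seq_len(length(words) - n + 1), 1)                # random starting position
  paste(words[start:(start + n - 1)], collapse = " ")
}
#writeLines(draw_sample("YF_novel_001.txt"), "YF_novel_001_sample.txt")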

G.2.3 Informative Texts for Teens (InfoTeens) corpus

Code
InfoTeen <- read.delim(here("data", "MFTE", "InfoTeen_3.1_normed_complex_counts.tsv"), header = TRUE, stringsAsFactors = TRUE)

# Removes three outlier files which should not have been included in the corpus as they contain exam papers only
InfoTeen <- InfoTeen |> 
  filter(Filename!=".DS_Store" & Filename!="Revision_World_GCSE_10529068_wjec-level-law-past-papers.txt" & Filename!="Revision_World_GCSE_10528474_wjec-level-history-past-papers.txt" & Filename!="Revision_World_GCSE_10528472_edexcel-level-history-past-papers.txt")

InfoTeen$Series <- "Info Teens"
InfoTeen$Level <- "Ref."
InfoTeen$Country <- "Info Teens"
InfoTeen$Register <- "Info Teens"

Details of the composition of the Info Teens corpus can be found in Section 4.3.2.5 of the book. The version used in the present study comprises 1,411 texts.

G.3 Merging TEC and reference corpora data
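The TEC feature matrix itself is imported in a script not reproduced in this appendix; the four datasets are then merged into the single data frame ncounts used below. A minimal sketch of such a merge, assuming a hypothetical object TEC holding the textbook feature matrix with the same columns as the reference corpora:

Code
# Sketch: combine the textbook data with the three reference corpora.
# `TEC` is a hypothetical object name standing in for the textbook feature matrix.
ncounts <- bind_rows(TEC, SpokenBNC2014, YouthFiction, InfoTeen) |> 
  mutate(across(where(is.character), as.factor))
# The Subcorpus factor used below combines corpus and register information;
# its exact construction is not shown in this extract.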

G.3.1 Corpus size

These tables provide some summary statistics about the texts/files whose normalised feature frequencies were entered in the model of Textbook English vs. real-world English described in Chapter 7.

Code
summary(ncounts$Subcorpus) |> 
  kable(col.names = c("(Sub)corpus", "# texts"),
        format.args = list(big.mark = ","))
(Sub)corpus # texts
Textbook Conversation 593
Textbook Fiction 285
Info Teens Ref. 1,411
Textbook Informative 364
Spoken BNC2014 Ref. 1,251
Youth Fiction Ref. 1,191
Code
ncounts  |>  
  group_by(Register) |>  
  summarise(totaltexts = n(), 
            totalwords = sum(Words), 
            mean = as.integer(mean(Words)), 
            sd = as.integer(sd(Words)), 
            TTRmean = mean(TTR)) |>  
  kable(digits = 2, 
        format.args = list(big.mark = ","),
        col.names = c("Register", "# texts/files", "# words", "mean # words per text", "SD", "mean TTR"))
Register # texts/files # words mean # words per text SD mean TTR
Conversation 1,844 13,804,196 7,486 8,690 0.40
Fiction 1,476 7,321,747 4,960 2,022 0.49
Informative 1,775 1,436,732 809 188 0.51

G.4 Data preparation for PCA

G.4.1 Feature distributions

The distribution of each linguistic feature was examined by means of visualisation. As shown below, before transformation, many of the features displayed highly skewed distributions.

Code
#ncounts <- readRDS(here("data", "processed", "counts3Reg.rds"))

ncounts |>
  select(-Words) |> 
  keep(is.numeric) |> 
  gather() |> # This function from tidyr converts a selection of variables into two variables: a key and a value. The key contains the names of the original variable and the value the data. This means we can then use the facet_wrap function from ggplot2
  ggplot(aes(value, after_stat(density))) +
    theme_bw() +
    facet_wrap(~ key, scales = "free", ncol = 4) +
    scale_x_continuous(expand=c(0,0)) +
    scale_y_continuous(limits = c(0,NA)) +
    geom_histogram(bins = 30, colour= "black", fill = "grey") +
    geom_density(colour = "darkred", weight = 2, fill="darkred", alpha = .4)

Code
#ggsave(here("plots", "DensityPlotsAllVariables.svg"), width = 15, height = 49)

G.4.2 Feature removal

A number of features were removed from the dataset because they are not linguistically interpretable. In the case of the TEC, this included the variable CD because numbers written as digits were removed from the textbooks before these were tagged with the MFTE. In addition, the variables LIKE and SO were removed because these are “bin” features included in the output of the MFTE to ensure that the counts for these polysemous words do not inflate other categories due to mistags (Le Foll 2021b).

Whenever linguistically meaningful, very low-frequency features and features with low individual MSA values or low communalities (see the chunks below) were merged into broader categories. Finally, features absent from more than two thirds of texts were also excluded. For the comparative analysis of the TEC and the reference corpora, the following linguistic features were excluded from the analysis due to low dispersion:

Code
# Removal of meaningless feature: CD because numbers as digits were mostly removed from the textbooks, LIKE and SO because they are dustbin categories
ncounts <- ncounts |> 
  select(-c(CD, LIKE, SO))

# Combine problematic features into meaningful groups whenever this makes linguistic sense
ncounts <- ncounts |> 
  mutate(JJPR = JJPR + ABLE, ABLE = NULL) |> 
  mutate(PASS = PGET + PASS, PGET = NULL) |> 
  mutate(TPP3 = TPP3S + TPP3P, TPP3P = NULL, TPP3S = NULL) |> # Merged due to TPP3P having an individual MSA < 0.5
  mutate(FQTI = FREQ + TIME, FREQ = NULL, TIME = NULL) # Merged due to TIME communality < 0.2 (see below)

# Function to compute percentage of texts with occurrences meeting a condition
compute_percentage <- function(data, condition, threshold) {
  numeric_data <- Filter(is.numeric, data)
  percentage <- round(colSums(condition[, sapply(numeric_data, is.numeric)])/nrow(data) * 100, 2)
  percentage <- as.data.frame(percentage)
  colnames(percentage) <- "Percentage"
  percentage <- percentage |> 
    filter(!is.na(Percentage)) |>
    rownames_to_column() |>
    arrange(Percentage)
  if (!missing(threshold)) {
    percentage <- percentage |> 
      filter(Percentage > threshold)
  }
  return(percentage)
}

# Calculate percentage of texts with 0 occurrences of each feature
zero_features <- compute_percentage(ncounts, ncounts == 0, 66.6)
zero_features |> 
  kable(col.names = c("Feature", "% texts with zero occurrences"))
Feature % texts with zero occurrences
PRP 85.34
URL 93.03
EMO 98.98
HST 99.55
Code
# Drop variables with low document frequency
ncounts2 <- select(ncounts, -one_of(zero_features$rowname))

These feature removal operations resulted in a feature set of 71 linguistic variables.
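As a quick sanity check, the size of this feature set can be confirmed by counting the remaining numeric columns (excluding the Words column, which records text length rather than a linguistic feature):

Code
# Count the remaining linguistic feature columns
ncounts2 |> 
  select(-Words) |> 
  keep(is.numeric) |> 
  ncol()
# Expected: 71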

G.4.3 Identifying outlier texts

All normalised frequencies were standardised (z-transformed) to identify any potential outlier texts.

# First scale the normalised counts (z-standardisation) to be able to compare the various features
zcounts <- ncounts2 |>
  select(-Words) |> 
  keep(is.numeric) |> 
  scale()

# If necessary, remove any outliers at this stage.
data <- cbind(ncounts2[,1:8], as.data.frame(zcounts))
outliers <- data |> 
 filter(if_any(where(is.numeric) & !Words,  .fns = function(x){x > 8}))  |>
  select(Filename, Corpus, Register, Words) 

The following outlier texts were identified according to the above condition (a z-score above 8 on at least one feature) and excluded from subsequent analyses.

Code
# These are potential outlier texts :
outliers |> 
  kable(col.names = c("Filename", "Corpus", "Register", "# words"))
Filename Corpus Register # words
POC_4e_Spoken_0007.txt Textbook.English Conversation 750
Solutions_Elementary_ELF_Spoken_0013.txt Textbook.English Conversation 931
EIM_Starter_Informative_0004.txt Textbook.English Informative 534
GreenLine_1_Spoken_0003.txt Textbook.English Conversation 970
Access_1_Spoken_0011.txt Textbook.English Conversation 784
Achievers_B1_Informative_0003.txt Textbook.English Informative 926
EIM_Starter_Spoken_0002.txt Textbook.English Conversation 824
GreenLine_1_Spoken_0008.txt Textbook.English Conversation 876
JTT_3_Informative_0003.txt Textbook.English Informative 699
GreenLine_1_Spoken_0010.txt Textbook.English Conversation 701
EIM_1_Spoken_0012.txt Textbook.English Conversation 640
NGL_1_Spoken_0013.txt Textbook.English Conversation 940
NGL_3_Spoken_0018.txt Textbook.English Conversation 751
Solutions_Intermediate_Spoken_0029.txt Textbook.English Conversation 672
NGL_1_Spoken_0012.txt Textbook.English Conversation 910
GreenLine_1_Spoken_0006.txt Textbook.English Conversation 622
GreenLine_2_Spoken_0004.txt Textbook.English Conversation 1102
Access_2_Spoken_0023.txt Textbook.English Conversation 875
HT_4_Informative_0006.txt Textbook.English Informative 513
Solutions_Intermediate_Informative_0017.txt Textbook.English Informative 816
EIM_1_Spoken_0013.txt Textbook.English Conversation 967
Solutions_Elementary_ELF_Spoken_0021.txt Textbook.English Conversation 846
Solutions_Intermediate_Plus_Spoken_0022.txt Textbook.English Conversation 596
Access_2_Spoken_0028.txt Textbook.English Conversation 813
NGL_1_Spoken_0005.txt Textbook.English Conversation 1020
Solutions_Elementary_ELF_Spoken_0016.txt Textbook.English Conversation 871
Solutions_Pre-Intermediate_ELF_Spoken_0007.txt Textbook.English Conversation 630
Solutions_Intermediate_Informative_0013.txt Textbook.English Informative 770
GreenLine_2_Spoken_0003.txt Textbook.English Conversation 850
HT_4_Spoken_0010.txt Textbook.English Conversation 727
Solutions_Elementary_Informative_0003.txt Textbook.English Informative 1051
Access_2_Informative_0001.txt Textbook.English Informative 655
Solutions_Elementary_Informative_0010.txt Textbook.English Informative 708
GreenLine_1_Informative_0001.txt Textbook.English Informative 731
Access_2_Spoken_0002.txt Textbook.English Conversation 572
Solutions_Intermediate_Spoken_0019.txt Textbook.English Conversation 1024
Access_3_Informative_0003.txt Textbook.English Informative 1000
Access_1_Spoken_0019.txt Textbook.English Conversation 701
Access_2_Spoken_0013.txt Textbook.English Conversation 981
Solutions_Intermediate_Plus_Informative_0014.txt Textbook.English Informative 537
Revision_World_GCSE_10525362_literary-terms.txt Informative.Teens Informative 790
Revision_World_GCSE_10528697_p6-physics-radioactive-materials.txt Informative.Teens Informative 1015
Science_Tech_Kinds_NZ_10382383_math.txt Informative.Teens Informative 522
Science_for_students_10064820_scientists-say-metabolism.txt Informative.Teens Informative 895
Science_Tech_Kinds_NZ_10382388_recycling.txt Informative.Teens Informative 666
History_Kids_BBC_10404337_go_furthers.txt Informative.Teens Informative 620
Science_Tech_Kinds_NZ_10382391_sports.txt Informative.Teens Informative 657
Teen_Kids_News_10402607_so-you-want-to-be-an-archivist.txt Informative.Teens Informative 763
Science_Tech_Kinds_NZ_10382234_biology.txt Informative.Teens Informative 843
Science_Tech_Kinds_NZ_10382372_astronomy.txt Informative.Teens Informative 900
Dogo_News_file10060404_banana-plant-extract-may-be-the-key-to-slower-melting-ice-cream.txt Informative.Teens Informative 611
Science_Tech_Kinds_NZ_10382667_countries.txt Informative.Teens Informative 717
Quatr_us_file10390777_quick-summary-geological-erashtm.txt Informative.Teens Informative 643
Science_Tech_Kinds_NZ_10382873_physics.txt Informative.Teens Informative 722
Science_Tech_Kinds_NZ_10382382_light.txt Informative.Teens Informative 639
Factmonster_10053687_august-13.txt Informative.Teens Informative 523
Revision_World_GCSE_10526703_limited-companies.txt Informative.Teens Informative 714
Revision_World_GCSE_10529637_transition-metals.txt Informative.Teens Informative 787
Quatr_us_10390856_early-african-historyhtm.txt Informative.Teens Informative 1136
History_Kids_BBC_10401873_ff6_sicilylandingss.txt Informative.Teens Informative 813
Quatr_us_10394250_harappan.txt Informative.Teens Informative 651
Ducksters_10398301_iraqphp.txt Informative.Teens Informative 657
History_Kids_BBC_10403171_death_sakkara_gallery_04s.txt Informative.Teens Informative 844
Revision_World_GCSE_10528246_agricultural-change.txt Informative.Teens Informative 789
Revision_World_GCSE_10528086_uk-government-judiciary.txt Informative.Teens Informative 1019
Revision_World_GCSE_10529794_definitions.txt Informative.Teens Informative 904
Encyclopedia_Kinds_au_10085347_Nobel_Prize_in_Chemistry.txt Informative.Teens Informative 598
Science_for_students_10064875_questions-big-melt-earths-ice-sheets-are-under-attack.txt Informative.Teens Informative 685
Teen_Kids_News_10403301_golden-globe-winners-2019-the-complete-list.txt Informative.Teens Informative 800
Science_Tech_Kinds_NZ_10382201_projects.txt Informative.Teens Informative 947
Revision_World_GCSE_10529753_probability.txt Informative.Teens Informative 816
Encyclopedia_Kinds_au_10085531_Complex_analysis.txt Informative.Teens Informative 735
History_Kids_BBC_10401890_ff7_ddays.txt Informative.Teens Informative 759
History_Kids_BBC_10403434s.txt Informative.Teens Informative 732
History_Kids_BBC_10401872_ff6_italys.txt Informative.Teens Informative 786
Science_Tech_Kinds_NZ_10382371_amazing.txt Informative.Teens Informative 629
Quatr_us_10391129_athabascan.txt Informative.Teens Informative 637
Encyclopedia_Kinds_au_10085355_20th_century.txt Informative.Teens Informative 864
Dogo_News_10060755_luxury-space-hotel-promises-guests-a-truly-out-of-this-world-vacation.txt Informative.Teens Informative 722
Revision_World_GCSE_10528072_nationalism-practice.txt Informative.Teens Informative 776
Quatr_us_10390861_quatr-us-privacy-policyhtm.txt Informative.Teens Informative 960
History_Kids_BBC_10401909_ff7_bulges.txt Informative.Teens Informative 732
History_kids_10381259_timeline-of-mesopotamia.txt Informative.Teens Informative 768
Revision_World_GCSE_10528123_gender-written-textual-analysis-framework.txt Informative.Teens Informative 905
Science_Tech_Kinds_NZ_10386406_floods.txt Informative.Teens Informative 580
Revision_World_GCSE_10529693_advantages.txt Informative.Teens Informative 782
Science_Tech_Kinds_NZ_10382378_geography.txt Informative.Teens Informative 761
Science_Tech_Kinds_NZ_10382374_earth.txt Informative.Teens Informative 726
Science_for_students_10066286_watering-plants-wastewater-can-spread-germs.txt Informative.Teens Informative 836
Science_Tech_Kinds_NZ_10382393_water.txt Informative.Teens Informative 856
World_Dteen_10406069_website_policies.txt Informative.Teens Informative 995
Science_Tech_Kinds_NZ_10382384_metals.txt Informative.Teens Informative 669
Dogo_News_10062028_puppy-bowl-14-promises-viewers-a-paw-some-time-on-super-bowl-sunday.txt Informative.Teens Informative 581
History_Kids_BBC_10404730_go_furthers.txt Informative.Teens Informative 611
Science_Tech_Kinds_NZ_10382385_nature.txt Informative.Teens Informative 722
Science_for_students_10065015_scientists-say-dna-sequencing.txt Informative.Teens Informative 953
Quatr_us_file10390817_conifers-pine-trees-gymnospermshtm.txt Informative.Teens Informative 533
TweenTribute_10051509_it-true-elephants-cant-jump.txt Informative.Teens Informative 790
Revision_World_GCSE_10528494_application-software.txt Informative.Teens Informative 855
Revision_World_GCSE_10529581_different-types-questions-examinations.txt Informative.Teens Informative 742
Dogo_News_10061669_the-chinese-city-of-chengdu-may-soon-be-home-to-multiple-moons.txt Informative.Teens Informative 614
Ducksters_10398306_geography_of_ancient_chinaphp.txt Informative.Teens Informative 638
Science_for_students_10065144_scientists-say-multiverse.txt Informative.Teens Informative 712
Science_Tech_Kinds_NZ_10382211_images.txt Informative.Teens Informative 793
Factmonster_10053754_may-18.txt Informative.Teens Informative 497
World_Dteen_10406047_AboutWORLDteen.txt Informative.Teens Informative 1053
Ducksters_10398078_first_new_dealphp.txt Informative.Teens Informative 649
Revision_World_GCSE_10526926_economies-scale.txt Informative.Teens Informative 621
Factmonster_10053201_september-03.txt Informative.Teens Informative 445
Science_Tech_Kinds_NZ_10387183_calciumcarbonates.txt Informative.Teens Informative 804
Science_Tech_Kinds_NZ_10382380_health.txt Informative.Teens Informative 694
Revision_World_GCSE_10529587_sources-finance.txt Informative.Teens Informative 665
Quatr_us_10393444_fishing.txt Informative.Teens Informative 656
Ducksters_10398315_glossary_and_termsphp.txt Informative.Teens Informative 684
S5AA.txt Spoken.BNC2014 Conversation 1869

We check that the outlier texts are not particularly long or short by looking at the distribution of text/file lengths among the outliers.

Code
summary(outliers$Words)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  445.0   655.5   751.0   773.6   860.0  1869.0 
Code
hist(outliers$Words, breaks = 30)

We also check the distribution of outlier texts across the four corpora. The majority come from the Info Teens corpus, though quite a few are also from the TEC.

Code
summary(outliers$Corpus) |> 
  kable(col.names = c("(Sub)corpus", "# outlier texts"))
(Sub)corpus # outlier texts
Textbook.English 40
Informative.Teens 74
Spoken.BNC2014 1
Youth.Fiction 0
Code
# Report on the manual check of a sample of these outliers:

# Encyclopedia_Kinds_au_10085347_Nobel_Prize_in_Chemistry.txt is essentially a list of Nobel prize winners but with some additional information. In other words, not a bad representative of the type of texts of the Info Teen corpus.
# Solutions_Elementary_ELF_Spoken_0013 --> Has a lot of "going to" constructions because they are learnt in this chapter but is otherwise a well-formed text.
# Teen_Kids_News_10403972_a-brief-history-of-white-house-weddings --> No issues
# Teen_Kids_News_10403301_golden-globe-winners-2019-the-complete-list --> Similar to the Nobel prize laureates text.
# Revision_World_GCSE_10528123_gender-written-textual-analysis-framework --> Text includes bullet points tokenised as the letter "o" but otherwise a fairly typical informative text.

# Removing the outliers at the request of the reviewers (but comparisons of models including the outliers showed that the results are very similar):
ncounts3 <- ncounts2 |> 
  filter(!Filename %in% outliers$Filename)

#saveRDS(ncounts3, here("data", "processed", "ncounts3_3Reg.rds")) # Last saved 6 March 2024

This resulted in 4,980 texts/files being included in the comparative model of Textbook English vs. ‘real-world’ English. These standardised feature frequencies were distributed as follows:

Code
zcounts3 <- ncounts3 |>
  select(-Words) |> 
  keep(is.numeric) |> 
  scale()

boxplot(zcounts3, las = 3, main = "z-scores") # Slow

G.4.4 Signed log transformation

A signed logarithmic transformation was applied to (further) deskew the feature distributions (see Diwersy, Evert & Neumann 2014; Neumann & Evert 2021).

The signed log transformation function was inspired by the SignedLog function proposed in https://cran.r-project.org/web/packages/DataVisualizations/DataVisualizations.pdf.

signed.log <- function(x) {sign(x)*log(abs(x)+1)}

# Standardise first, then sign log transform
zlogcounts <- signed.log(zcounts3) 

The new feature distributions are visualised below.

Code
zlogcounts |>
  as.data.frame() |> 
  gather() |> # This function from tidyr converts a selection of variables into two variables: a key and a value. The key contains the names of the original variable and the value the data. This means we can then use the facet_wrap function from ggplot2
  ggplot(aes(value, after_stat(density))) +
  theme_bw() +
  facet_wrap(~ key, scales = "free", ncol = 4) +
  scale_x_continuous(expand=c(0,0)) +
  scale_y_continuous(limits = c(0,NA)) +
  geom_histogram(bins = 30, colour= "black", fill = "grey") +
  geom_density(colour = "darkred", weight = 2, fill="darkred", alpha = .4)

Code
#ggsave(here("plots", "DensityPlotsAllVariablesSignedLog.svg"), width = 15, height = 49)

G.4.5 Merging of data for MDA

Code
zlogcounts <- readRDS(here("data", "processed", "zlogcounts_3Reg.rds")) 
#nrow(zlogcounts)
#colnames(zlogcounts)

ncounts3 <- readRDS(here("data", "processed", "ncounts3_3Reg.rds"))
#nrow(ncounts3)
#colnames(ncounts3)

data <- cbind(ncounts3[,1:8], as.data.frame(zlogcounts))
#saveRDS(data, here("data", "processed", "datazlogcounts_3Reg.rds")) # Last saved 16 March 2024

The final dataset comprises 4,980 texts/files, divided as follows:

(Sub)corpus # texts/files
Textbook Conversation 565
Textbook Fiction 285
Info Teens Ref. 1337
Textbook Informative 352
Spoken BNC2014 Ref. 1250
Youth Fiction Ref. 1191
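This breakdown can be reproduced from the merged data frame; a sketch, assuming the Subcorpus factor is among the metadata columns carried over from ncounts3:

Code
# Sketch: tabulate the final dataset by subcorpus
summary(data$Subcorpus) |> 
  kable(col.names = c("(Sub)corpus", "# texts/files"))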

G.5 Testing factorability of data

G.5.1 Visualisation of feature correlations

We begin by visualising the correlations of the transformed feature frequencies using the heatmap() function from the stats package. Negative correlations are rendered in blue; positive ones in red.

Code
# Simple heatmap in base R (inspired by Stephanie Evert's SIGIL code)
cor.colours <- c(
  hsv(h=2/3, v=1, s=(10:1)/10), # blue = negative correlation 
  rgb(1,1,1), # white = no correlation 
  hsv(h=0, v=1, s=(1:10/10))) # red = positive correlation

#png(here("plots", "heatmapzlogcounts.png"), width = 30, height= 30, units = "cm", res = 300)
heatmap(cor(zlogcounts), 
        symm=TRUE, 
        zlim=c(-1,1), 
        col=cor.colours, 
        margins=c(7,7))

Code
#dev.off()

G.5.2 Collinearity

Because verb-based features are normalised to the number of finite verb phrases, the present tense (VPRT) and past tense (VBD) variables are very highly (negatively) correlated:

cor(data$VPRT, data$VBD) |> round(2)
[1] -0.97

We therefore remove the least marked of the pair of collinear variables: VPRT.

data <- data |> 
  select(-c(VPRT))

G.5.3 MSA

kmo <- KMO(data[,9:ncol(data)]) # The first eight columns contain metadata.
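The overall measure of sampling adequacy is stored in the MSA element of the object returned by psych::KMO() and can be printed directly:

Code
kmo$MSA |> round(2)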

The overall MSA value of the dataset is 0.95. The features have the following individual MSA values (ordered from lowest to highest):

Code
kmo$MSAi[order(kmo$MSAi)] |>  round(2)
   AMP   COMM    POS   TPP3   JJPR  PLACE  SPLIT     DT   JJAT   VIMP   MDCO 
  0.67   0.69   0.70   0.74   0.76   0.82   0.83   0.83   0.84   0.84   0.85 
    RP     EX   THSC     LD  NCOMP   BEMA   MDWS   FQTI  FPP1P   MDCA    ACT 
  0.85   0.85   0.86   0.87   0.88   0.88   0.89   0.89   0.89   0.89   0.89 
MENTAL    VBD  FPP1S   MDMM   PEAS   CONC   MDWO   THRC     NN   COND   PROG 
  0.91   0.91   0.91   0.91   0.91   0.93   0.93   0.94   0.94   0.95   0.95 
    CC   SPP2     RB   DWNT   MDNE   WHSC   CONT   QUPR    XX0  CAUSE   WHQU 
  0.95   0.95   0.95   0.95   0.95   0.96   0.96   0.96   0.96   0.96   0.96 
   VBG    AWL POLITE   PASS    PIT  DOAUX   ELAB ASPECT    DMA   DEMO    HDG 
  0.96   0.96   0.96   0.96   0.97   0.97   0.97   0.97   0.97   0.97   0.97 
    IN   FPUH  OCCUR    CUZ   EMPH   YNQU   QUAN    TTR  QUTAG  THATD    VBN 
  0.97   0.97   0.97   0.97   0.98   0.98   0.98   0.98   0.98   0.98   0.98 
 EXIST   STPR    GTO   HGOT 
  0.98   0.99   0.99   0.99 

We aim to remove features with an individual MSA < 0.5. All features have individual MSA values > 0.5 (but only because TPP3P was merged into a broader category in an earlier chunk).
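This check can also be carried out programmatically; the following sketch returns an empty vector because no feature falls below the threshold:

Code
# Features with an individual MSA below 0.5 (none in the present dataset)
names(kmo$MSAi)[kmo$MSAi < 0.5]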

G.5.4 Scree plot

Six components were originally retained on the basis of the following screeplot, though only the first four were found to be interpretable and were therefore included in the model.

Code
# png(here("plots", "screeplot-TEC-Ref_3Reg.png"), width = 20, height= 12, units = "cm", res = 300)
scree(data[,9:ncol(data)], factors = FALSE, pc = TRUE) # 

Code
# dev.off()

# Perform PCA
pca1 <- psych::principal(data[9:ncol(data)], 
                         nfactors = 6)

G.5.5 Communalities

If features with final communalities < 0.2 were removed, TIME would have to be excluded. TIME was therefore merged with FREQ in an earlier chunk so that all features now have final communalities > 0.2 (note that this is a very generous threshold!).

Code
pca1$communality |> sort() |> round(2)
  DWNT   STPR   CONC   FQTI    POS ASPECT   MDNE  FPP1P   PROG   MDCO   MDMM 
  0.22   0.23   0.23   0.23   0.24   0.25   0.27   0.28   0.29   0.32   0.32 
  MDWO  SPLIT   MDWS   PEAS   QUPR    AMP  PLACE    HDG   COMM  CAUSE     EX 
  0.32   0.33   0.34   0.35   0.35   0.35   0.37   0.38   0.38   0.38   0.38 
  THSC  OCCUR   WHSC   THRC   JJAT   COND MENTAL    ACT   VIMP   ELAB  EXIST 
  0.40   0.40   0.42   0.43   0.44   0.44   0.45   0.45   0.46   0.46   0.46 
  JJPR  NCOMP     RP    GTO   DEMO   MDCA POLITE    CUZ     CC   WHQU   TPP3 
  0.46   0.48   0.49   0.50   0.50   0.52   0.52   0.53   0.57   0.58   0.58 
   VBG  THATD    PIT   BEMA  FPP1S     DT   HGOT     RB    VBN  QUTAG   EMPH 
  0.60   0.60   0.61   0.61   0.61   0.61   0.62   0.62   0.64   0.64   0.64 
  PASS    XX0   QUAN   SPP2  DOAUX    TTR   YNQU    VBD     LD   FPUH     IN 
  0.65   0.65   0.67   0.68   0.69   0.71   0.74   0.78   0.81   0.83   0.86 
  CONT    DMA    AWL     NN 
  0.89   0.89   0.91   0.93 
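A corresponding programmatic check can be run on the PCA object itself; the sketch below returns an empty vector now that TIME has been merged with FREQ:

Code
# Features with a final communality below 0.2 (none after merging TIME and FREQ)
names(pca1$communality)[pca1$communality < 0.2]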
Code
#saveRDS(data, here("data", "processed", "dataforPCA.rds")) # Last saved on 6 March 2024

The final dataset entered in the analysis described in Chapter 7 therefore comprises 4,980 texts/files, each with logged standardised normalised frequencies for 70 linguistic features.
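As a final check, the dimensions of the dataset saved above can be verified (a sketch; the first eight columns contain metadata, so 8 + 70 = 78 columns are expected):

Code
# Final dataset: 4,980 rows; 8 metadata columns + 70 feature columns
dim(data)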

G.6 Packages used in this script

G.6.1 Package names and versions

R version 4.4.1 (2024-06-14)
Platform: aarch64-apple-darwin20
Running under: macOS Sonoma 14.5

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Europe/Madrid
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
 [1] DT_0.33             visreg_2.7.0        lubridate_1.9.3    
 [4] forcats_1.0.0       stringr_1.5.1       dplyr_1.1.4        
 [7] purrr_1.0.2         readr_2.1.5         tidyr_1.3.1        
[10] tibble_3.2.1        tidyverse_2.0.0     sjPlot_2.8.16      
[13] scales_1.3.0        psych_2.4.6.26      lme4_1.1-35.5      
[16] Matrix_1.7-0        knitr_1.48          here_1.0.1         
[19] gridExtra_2.3       GGally_2.2.1        ggplot2_3.5.1      
[22] emmeans_1.10.3      cowplot_1.1.3       corrplot_0.92      
[25] car_3.1-2           carData_3.0-5       broom.mixed_0.2.9.5

loaded via a namespace (and not attached):
 [1] tidyselect_1.2.1    sjlabelled_1.2.0    fastmap_1.2.0      
 [4] sjstats_0.19.0      digest_0.6.36       timechange_0.3.0   
 [7] estimability_1.5.1  lifecycle_1.0.4     magrittr_2.0.3     
[10] compiler_4.4.1      rlang_1.1.4         tools_4.4.1        
[13] utf8_1.2.4          yaml_2.3.9          htmlwidgets_1.6.4  
[16] mnormt_2.1.1        plyr_1.8.9          RColorBrewer_1.1-3 
[19] abind_1.4-5         withr_3.0.0         grid_4.4.1         
[22] datawizard_0.12.1   fansi_1.0.6         xtable_1.8-4       
[25] colorspace_2.1-0    future_1.33.2       globals_0.16.3     
[28] MASS_7.3-60.2       insight_0.20.2      cli_3.6.3          
[31] mvtnorm_1.2-5       rmarkdown_2.27      generics_0.1.3     
[34] rstudioapi_0.16.0   performance_0.12.2  tzdb_0.4.0         
[37] minqa_1.2.7         splines_4.4.1       parallel_4.4.1     
[40] BiocManager_1.30.23 vctrs_0.6.5         boot_1.3-30        
[43] jsonlite_1.8.8      hms_1.1.3           listenv_0.9.1      
[46] glue_1.7.0          parallelly_1.37.1   nloptr_2.1.1       
[49] ggstats_0.6.0       codetools_0.2-20    stringi_1.8.4      
[52] gtable_0.3.5        ggeffects_1.7.0     munsell_0.5.1      
[55] furrr_0.3.1         pillar_1.9.0        htmltools_0.5.8.1  
[58] R6_2.5.1            rprojroot_2.0.4     evaluate_0.24.0    
[61] lattice_0.22-6      backports_1.5.0     broom_1.0.6        
[64] renv_1.0.3          Rcpp_1.0.13         coda_0.19-4.1      
[67] nlme_3.1-164        xfun_0.46           sjmisc_2.8.10      
[70] pkgconfig_2.0.3    

G.6.2 Package references

[1] B. Auguie. gridExtra: Miscellaneous Functions for “Grid” Graphics. R package version 2.3. 2017.

[2] D. Bates, M. Mächler, B. Bolker, et al. “Fitting Linear Mixed-Effects Models Using lme4”. In: Journal of Statistical Software 67.1 (2015), pp. 1-48. DOI: 10.18637/jss.v067.i01.

[3] D. Bates, M. Maechler, B. Bolker, et al. lme4: Linear Mixed-Effects Models using Eigen and S4. R package version 1.1-35.5. 2024. https://github.com/lme4/lme4/.

[4] D. Bates, M. Maechler, and M. Jagan. Matrix: Sparse and Dense Matrix Classes and Methods. R package version 1.7-0. 2024. https://Matrix.R-forge.R-project.org.

[5] B. Bolker and D. Robinson. broom.mixed: Tidying Methods for Mixed Models. R package version 0.2.9.5. 2024. https://github.com/bbolker/broom.mixed.

[6] P. Breheny and W. Burchett. visreg: Visualization of Regression Models. R package version 2.7.0. 2020. http://pbreheny.github.io/visreg.

[7] P. Breheny and W. Burchett. “Visualization of Regression Models Using visreg”. In: The R Journal 9.2 (2017), pp. 56-71.

[8] J. Fox and S. Weisberg. An R Companion to Applied Regression. Third. Thousand Oaks CA: Sage, 2019. https://socialsciences.mcmaster.ca/jfox/Books/Companion/.

[9] J. Fox, S. Weisberg, and B. Price. car: Companion to Applied Regression. R package version 3.1-2. 2023. https://r-forge.r-project.org/projects/car/.

[10] J. Fox, S. Weisberg, and B. Price. carData: Companion to Applied Regression Data Sets. R package version 3.0-5. 2022. https://r-forge.r-project.org/projects/car/.

[11] G. Grolemund and H. Wickham. “Dates and Times Made Easy with lubridate”. In: Journal of Statistical Software 40.3 (2011), pp. 1-25. https://www.jstatsoft.org/v40/i03/.

[12] R. V. Lenth. emmeans: Estimated Marginal Means, aka Least-Squares Means. R package version 1.10.3. 2024. https://rvlenth.github.io/emmeans/.

[13] D. Lüdecke. sjPlot: Data Visualization for Statistics in Social Science. R package version 2.8.16. 2024. https://strengejacke.github.io/sjPlot/.

[14] K. Müller. here: A Simpler Way to Find Your Files. R package version 1.0.1. 2020. https://here.r-lib.org/.

[15] K. Müller and H. Wickham. tibble: Simple Data Frames. R package version 3.2.1. 2023. https://tibble.tidyverse.org/.

[16] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing. Vienna, Austria, 2024. https://www.R-project.org/.

[17] W. Revelle. psych: Procedures for Psychological, Psychometric, and Personality Research. R package version 2.4.6.26. 2024. https://personality-project.org/r/psych/.

[18] B. Schloerke, D. Cook, J. Larmarange, et al. GGally: Extension to ggplot2. R package version 2.2.1. 2024. https://ggobi.github.io/ggally/.

[19] V. Spinu, G. Grolemund, and H. Wickham. lubridate: Make Dealing with Dates a Little Easier. R package version 1.9.3. 2023. https://lubridate.tidyverse.org.

[20] T. Wei and V. Simko. corrplot: Visualization of a Correlation Matrix. R package version 0.92. 2021. https://github.com/taiyun/corrplot.

[21] T. Wei and V. Simko. R package ‘corrplot’: Visualization of a Correlation Matrix. (Version 0.92). 2021. https://github.com/taiyun/corrplot.

[22] H. Wickham. forcats: Tools for Working with Categorical Variables (Factors). R package version 1.0.0. 2023. https://forcats.tidyverse.org/.

[23] H. Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York, 2016. ISBN: 978-3-319-24277-4. https://ggplot2.tidyverse.org.

[24] H. Wickham. stringr: Simple, Consistent Wrappers for Common String Operations. R package version 1.5.1. 2023. https://stringr.tidyverse.org.

[25] H. Wickham. tidyverse: Easily Install and Load the Tidyverse. R package version 2.0.0. 2023. https://tidyverse.tidyverse.org.

[26] H. Wickham, M. Averick, J. Bryan, et al. “Welcome to the tidyverse”. In: Journal of Open Source Software 4.43 (2019), p. 1686. DOI: 10.21105/joss.01686.

[27] H. Wickham, W. Chang, L. Henry, et al. ggplot2: Create Elegant Data Visualisations Using the Grammar of Graphics. R package version 3.5.1. 2024. https://ggplot2.tidyverse.org.

[28] H. Wickham, R. François, L. Henry, et al. dplyr: A Grammar of Data Manipulation. R package version 1.1.4. 2023. https://dplyr.tidyverse.org.

[29] H. Wickham and L. Henry. purrr: Functional Programming Tools. R package version 1.0.2. 2023. https://purrr.tidyverse.org/.

[30] H. Wickham, J. Hester, and J. Bryan. readr: Read Rectangular Text Data. R package version 2.1.5. 2024. https://readr.tidyverse.org.

[31] H. Wickham, T. L. Pedersen, and D. Seidel. scales: Scale Functions for Visualization. R package version 1.3.0. 2023. https://scales.r-lib.org.

[32] H. Wickham, D. Vaughan, and M. Girlich. tidyr: Tidy Messy Data. R package version 1.3.1. 2024. https://tidyr.tidyverse.org.

[33] C. O. Wilke. cowplot: Streamlined Plot Theme and Plot Annotations for ggplot2. R package version 1.1.3. 2024. https://wilkelab.org/cowplot/.

[34] Y. Xie. Dynamic Documents with R and knitr. 2nd. ISBN 978-1498716963. Boca Raton, Florida: Chapman and Hall/CRC, 2015. https://yihui.org/knitr/.

[35] Y. Xie. “knitr: A Comprehensive Tool for Reproducible Research in R”. In: Implementing Reproducible Computational Research. Ed. by V. Stodden, F. Leisch and R. D. Peng. ISBN 978-1466561595. Chapman and Hall/CRC, 2014.

[36] Y. Xie. knitr: A General-Purpose Package for Dynamic Report Generation in R. R package version 1.48. 2024. https://yihui.org/knitr/.

[37] Y. Xie, J. Cheng, and X. Tan. DT: A Wrapper of the JavaScript Library DataTables. R package version 0.33. 2024. https://github.com/rstudio/DT.