This script is part of the Online Appendix to my PhD thesis.

Please cite as: Le Foll, Elen. 2022. Textbook English: A Corpus-Based Analysis of the Language of EFL textbooks used in Secondary Schools in France, Germany and Spain. PhD thesis. Osnabrück University.

For more information, see: https://elenlefoll.github.io/TextbookEnglish/

Please note that the plot dimensions in this notebook have been optimised for the print version of the thesis.

Set-up

Built with R 4.0.3

knitr::opts_chunk$set(cache = TRUE, echo = TRUE, tidy = TRUE, message = FALSE, paged.print = TRUE, fig.width = 10, warning = FALSE)

library(caret) # For its confusion matrix function
library(DescTools) # For common stats functions
library(dplyr)
library(forcats)
library(here) # For dynamic file paths
library(ggplot2) 
library(PerformanceAnalytics)
library(purrr) # For data wrangling
library(psych) # For various useful stats functions
library(suffrager) # For pretty feminist colour palettes :)
library(tidyr)
library(tibble)

MDA of Textbook English vs. Reference corpora

Importing reference corpora data

Importing Spoken BNC2014 counts

These counts were computed on the basis of the “John and Jill in Ivybridge” version of the Spoken BNC2014 with added full stops at speaker turns.

SpokenBNC2014 <- read.delim(here("MFTE", "Outputs", "SpokenBNC2014_3.1_normed_complex_counts.tsv"), 
    header = TRUE, stringsAsFactors = TRUE)
str(SpokenBNC2014)  # Check sanity of data

## 'data.frame':    1251 obs. of  82 variables:
##  $ Filename: Factor w/ 1251 levels "S23A.txt","S24A.txt",..: 586 1079 223 624 394 874 137 1144 1091 419 ...
##  $ Words   : int  9192 18158 6900 5944 5866 44400 2514 3238 8614 21724 ...
##  $ AWL     : num  3.92 3.92 3.85 3.81 3.87 ...
##  $ TTR     : num  0.407 0.41 0.35 0.385 0.37 ...
##  $ LD      : num  0.462 0.507 0.489 0.493 0.467 ...
##  $ DT      : num  46.4 39.5 35.9 40.9 35.7 ...
##  $ JJAT    : num  17.2 17.3 13.3 18 12.6 ...
##  $ POS     : num  0.758 2.367 2.575 3.03 2.119 ...
##  $ NCOMP   : num  8.33 7.01 8.4 9.55 9.18 ...
##  $ QUAN    : num  12.31 10.27 9.62 9.24 11.16 ...
##  $ ACT     : num  22.2 17.7 19.9 11.5 25.7 ...
##  $ ASPECT  : num  2.4 0.87 1.11 1.02 1.98 ...
##  $ CAUSE   : num  0.712 1.099 0.443 0.511 1.273 ...
##  $ COMM    : num  8.9 7.1 8.08 7.42 7.5 ...
##  $ CUZ     : num  4.9 4.31 4.09 2.43 5.8 ...
##  $ CC      : num  42.8 21.3 26.5 14.3 37.9 ...
##  $ CONC    : num  0.356 1.054 0.996 1.023 0.99 ...
##  $ COND    : num  1.69 3.02 2.54 2.17 5.8 ...
##  $ EX      : num  2.14 4.21 3.1 2.43 3.54 ...
##  $ EXIST   : num  2.85 1.65 1.33 1.53 2.26 ...
##  $ ELAB    : num  0.445 0.321 0 0 0.141 ...
##  $ FREQ    : num  2.404 2.382 2.655 0.767 2.546 ...
##  $ JJPR    : num  14.78 18.64 11.62 16.5 9.05 ...
##  $ MENTAL  : num  19.8 20.3 18.8 21.5 19.8 ...
##  $ OCCUR   : num  1.692 0.87 0.664 0.511 0.283 ...
##  $ DOAUX   : num  6.14 6.69 9.07 9.59 7.36 ...
##  $ QUTAG   : num  1.692 0.596 3.54 2.43 0.141 ...
##  $ QUPR    : num  3.29 2.57 6.19 2.3 5.52 ...
##  $ SPLIT   : num  2.58 2.98 3.1 2.43 5.23 ...
##  $ STPR    : num  0.267 1.374 0.885 0.384 0.707 ...
##  $ WHQU    : num  1.16 3.71 5.31 4.48 2.69 ...
##  $ THSC    : num  4.9 4.4 2.77 3.84 3.68 ...
##  $ WHSC    : num  9.44 8.84 5.97 5.88 13.72 ...
##  $ CONT    : num  28 42.4 42.1 37.6 34.8 ...
##  $ VBD     : num  43.19 16.77 21.24 32.99 9.34 ...
##  $ VPRT    : num  47.5 67.5 64.4 53.2 62.7 ...
##  $ PLACE   : num  3.21 3.44 4.42 3.71 2.69 ...
##  $ PROG    : num  3.74 4.99 6.97 5.12 3.82 ...
##  $ HGOT    : num  1.16 1.28 2.88 1.41 3.25 ...
##  $ BEMA    : num  21.7 27.4 21.2 27.5 16.7 ...
##  $ MDCA    : num  1.96 2.84 4.54 3.71 7.36 ...
##  $ MDCO    : num  1.51 1.19 1.44 1.53 6.08 ...
##  $ TIME    : num  2.85 2.24 4.2 2.56 2.12 ...
##  $ THATD   : num  4.72 5.27 5.2 5.63 5.52 ...
##  $ THRC    : num  0.712 2.748 1.77 2.43 2.122 ...
##  $ VIMP    : num  0.178 1.878 1.327 1.407 1.414 ...
##  $ MDMM    : num  0.356 0.641 0.664 0.895 0.99 ...
##  $ ABLE    : num  0 0.183 0.111 0.128 0.141 ...
##  $ MDNE    : num  1.34 2.15 2.32 2.43 4.67 ...
##  $ MDWS    : num  0.712 2.657 2.655 1.151 3.678 ...
##  $ MDWO    : num  3.29 4.35 1.44 2.69 3.82 ...
##  $ XX0     : num  11.2 16.9 17.3 22.9 15.1 ...
##  $ PASS    : num  3.38 2.57 1.66 1.79 1.27 ...
##  $ PGET    : num  0.98 0.183 0.332 0.256 0.566 ...
##  $ VBG     : num  7.3 4.54 4.2 3.45 5.09 ...
##  $ VBN     : num  1.96 2.15 1.33 1.28 0.99 ...
##  $ PEAS    : num  5.79 3.99 3.98 3.45 1.84 ...
##  $ GTO     : num  0.267 2.061 0.774 0.767 1.556 ...
##  $ FPP1S   : num  40.5 24.9 26 30.9 18.2 ...
##  $ FPP1P   : num  9.53 8.57 3.76 7.03 13.72 ...
##  $ TPP3S   : num  9.884 7.879 11.504 5.499 0.424 ...
##  $ TPP3P   : num  9.62 8.25 10.73 4.86 21.22 ...
##  $ SPP2    : num  13.6 14.8 19.4 18.3 15.1 ...
##  $ PIT     : num  18.5 23.1 28.3 29.2 18.2 ...
##  $ PRP     : num  0.089 0.0458 0 0.1279 0 ...
##  $ RP      : num  3.21 2.75 3.76 3.32 6.08 ...
##  $ AMP     : num  0.653 0.314 0.145 0.337 0.239 ...
##  $ CD      : num  0.783 0.881 1 1.043 1.705 ...
##  $ DEMO    : num  0.957 1.74 1.029 1.632 1.279 ...
##  $ DMA     : num  3.62 4.37 3.41 4.46 1.89 ...
##  $ DWNT    : num  0.0326 0.0771 0.029 0.0168 0 0.0338 0.0398 0.0926 0.0232 0.0368 ...
##  $ EMPH    : num  1.49 1.31 1.19 1.46 1.01 ...
##  $ FPUH    : num  3.79 3.08 2.52 3.01 2.78 ...
##  $ HDG     : num  0.174 0.688 0.159 0.303 0.699 ...
##  $ IN      : num  7.72 7.01 5.48 5.08 6.75 ...
##  $ LIKE    : num  0.152 0.6 0.623 0.69 0.494 ...
##  $ NN      : num  11.5 11.6 10.7 11.1 12.1 ...
##  $ POLITE  : num  0 0.2313 0.0435 0.0505 0.0341 ...
##  $ RB      : num  3.1 3.42 3.14 3.26 3.8 ...
##  $ SO      : num  0.664 0.832 0.71 0.202 0.733 ...
##  $ URL     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ YNQU    : num  0.283 0.363 1.044 0.74 0.511 ...

nrow(SpokenBNC2014)  # Should be 1251 files

## [1] 1251
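The row-count check above relies on comparing the printed value with the number given in the comment. As a minimal sketch, the same check could be made to fail loudly. The helper `check_n_files()` and the toy data frame are hypothetical and not part of the original script.

```r
# Hypothetical helper (not in the original script): stop with an
# informative message if a table does not have the expected number of rows.
check_n_files <- function(df, expected) {
  if (nrow(df) != expected) {
    stop(paste0("Expected ", expected, " files but found ", nrow(df)))
  }
  invisible(df)
}

# Toy stand-in for an imported counts table
toy <- data.frame(Filename = c("a.txt", "b.txt", "c.txt"),
                  Words = c(100L, 200L, 300L))

check_n_files(toy, 3)  # Passes silently because the toy table has three rows
```

With the real data, the call would presumably be `check_n_files(SpokenBNC2014, 1251)`.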

SpokenBNC2014$Series <- "Spoken BNC2014"
SpokenBNC2014$Level <- "Ref."
SpokenBNC2014$Country <- "Spoken BNC2014"
SpokenBNC2014$Register <- "Spoken BNC2014"

Importing Youth Fiction counts

These counts were computed on random samples of approximately 5,000 words drawn from the books of the Youth Fiction corpus.

YouthFiction <- read.delim(here("MFTE", "Outputs", "YF_sampled_500_3.1_normed_complex_counts.tsv"), 
    header = TRUE, stringsAsFactors = TRUE)
str(YouthFiction)  # Check sanity of data

## 'data.frame':    1191 obs. of  83 variables:
##  $ Filename: Factor w/ 1191 levels "1_BaumWizardOz_1.txt",..: 946 264 698 551 543 856 976 927 542 1150 ...
##  $ Words   : int  5955 5877 5974 5795 6185 5924 6097 5886 6110 5801 ...
##  $ AWL     : num  4.13 4.15 4.07 4.26 4.14 ...
##  $ TTR     : num  0.47 0.477 0.502 0.547 0.527 ...
##  $ LD      : num  0.505 0.485 0.496 0.496 0.499 ...
##  $ DT      : num  33.5 38.7 48.2 45.9 35.6 ...
##  $ JJAT    : num  12.6 14.2 23 20.8 19.1 ...
##  $ POS     : num  2.615 2.233 0.862 1.134 2.072 ...
##  $ NCOMP   : num  1.48 2.91 4.56 6.08 7.74 ...
##  $ QUAN    : num  6.19 6.41 9.36 6.6 8.72 ...
##  $ ACT     : num  19.3 20.9 24.8 25.2 23.8 ...
##  $ ASPECT  : num  2.11 2.1 5.68 2.47 1.52 ...
##  $ CAUSE   : num  1.057 0.84 2.436 1.389 0.967 ...
##  $ COMM    : num  22.46 16.11 13.53 9.88 16.71 ...
##  $ CUZ     : num  0.793 1.961 1.488 0.926 1.934 ...
##  $ CC      : num  28 35.7 26.8 30.1 32.5 ...
##  $ CONC    : num  0.793 1.12 1.083 1.08 2.486 ...
##  $ COND    : num  4.1 2.38 2.84 3.09 4.14 ...
##  $ EX      : num  2.64 2.8 2.57 2.93 1.93 ...
##  $ EXIST   : num  1.32 1.96 3.65 3.7 2.9 ...
##  $ ELAB    : num  0.132 0 0.135 0 0.276 ...
##  $ FREQ    : num  2.77 2.94 2.17 4.94 2.49 ...
##  $ JJPR    : num  12.4 15.4 16 17.6 12.2 ...
##  $ MENTAL  : num  15.7 18.5 17.9 15.6 21.5 ...
##  $ OCCUR   : num  3.04 2.38 2.03 5.25 1.52 ...
##  $ DOAUX   : num  3.57 4.62 2.57 3.55 5.25 ...
##  $ QUTAG   : num  0.132 0 0 0 1.934 ...
##  $ QUPR    : num  3.57 2.1 2.71 3.55 4.56 ...
##  $ SPLIT   : num  3.04 2.38 2.98 4.32 4.01 ...
##  $ STPR    : num  0.925 0.42 0.947 0.154 0.552 ...
##  $ WHQU    : num  3.3 3.64 1.89 2.01 1.1 ...
##  $ THSC    : num  2.38 3.08 4.74 4.94 4.7 ...
##  $ WHSC    : num  7.79 11.06 7.04 4.94 8.98 ...
##  $ CONT    : num  12.8 15.4 14.9 10.8 21.7 ...
##  $ VBD     : num  49.3 51.8 68.1 76.5 53.6 ...
##  $ VPRT    : num  34.1 30.7 16.2 11.9 30.7 ...
##  $ PLACE   : num  1.98 4.76 4.74 3.09 2.35 ...
##  $ PROG    : num  3.96 3.36 6.5 7.1 4.83 ...
##  $ HGOT    : num  0.528 0.14 0.947 0 0.691 ...
##  $ BEMA    : num  15.3 16.5 16.2 15.4 14.6 ...
##  $ MDCA    : num  1.717 2.521 0.947 1.08 3.315 ...
##  $ MDCO    : num  1.59 2.38 3.92 2.93 2.35 ...
##  $ TIME    : num  5.15 4.76 5.41 3.7 5.94 ...
##  $ THATD   : num  1.98 1.26 3.52 2.16 3.59 ...
##  $ THRC    : num  0.528 1.681 1.488 1.235 0.414 ...
##  $ VIMP    : num  3.567 2.241 3.383 1.698 0.829 ...
##  $ MDMM    : num  1.453 1.401 0.812 1.543 0.552 ...
##  $ ABLE    : num  0.396 0.28 0.271 0.154 0.138 ...
##  $ MDNE    : num  1.98 2.24 2.84 1.54 2.07 ...
##  $ MDWS    : num  3.699 3.642 1.624 0.309 2.072 ...
##  $ MDWO    : num  2.64 3.08 2.17 2.47 4.56 ...
##  $ XX0     : num  10.3 8.68 8.39 11.11 12.29 ...
##  $ PASS    : num  4.89 3.36 4.19 4.01 2.21 ...
##  $ PGET    : num  0.132 0.14 0 0 0 ...
##  $ VBG     : num  7.53 6.02 11.37 14.66 10.22 ...
##  $ VBN     : num  2.25 4.76 2.84 5.86 3.31 ...
##  $ PEAS    : num  6.87 8.68 6.9 8.95 8.29 ...
##  $ GTO     : num  0 0.7 0.541 0.154 1.243 ...
##  $ FPP1S   : num  17.31 19.33 46.41 8.02 34.94 ...
##  $ FPP1P   : num  7.53 5.74 9.88 1.39 4.42 ...
##  $ TPP3S   : num  23.78 41.74 9.88 64.81 31.77 ...
##  $ TPP3P   : num  7.66 10.08 6.5 11.57 3.59 ...
##  $ SPP2    : num  15.32 9.1 7.31 5.4 12.15 ...
##  $ PIT     : num  11.9 9.1 11.9 18.2 14.9 ...
##  $ PRP     : num  0.132 0 0 0 0 ...
##  $ RP      : num  3.83 4.9 6.09 5.56 7.87 ...
##  $ AMP     : num  0.487 0.0851 0.2176 0.1208 0.2749 ...
##  $ CD      : num  0.722 0.391 0.904 0.328 0.582 ...
##  $ DEMO    : num  0.823 1.038 0.753 0.725 0.582 ...
##  $ DMA     : num  0.369 0.391 0.285 0.259 0.598 ...
##  $ DWNT    : num  0.0672 0.0681 0.1339 0.1726 0.097 ...
##  $ EMO     : num  0 0.017 0 0 0 0 0.0164 0 0 0 ...
##  $ EMPH    : num  0.571 0.204 0.619 0.311 0.776 ...
##  $ FPUH    : num  0.2519 0.034 0.1339 0.1553 0.0647 ...
##  $ HDG     : num  0.1511 0.0681 0.1172 0.2071 0.3072 ...
##  $ IN      : num  9.5 10.06 9.78 11.32 9.12 ...
##  $ LIKE    : num  0.2015 0.0851 0.3348 0.3279 0.2264 ...
##  $ NN      : num  19.3 17.5 13.6 16.7 14.8 ...
##  $ POLITE  : num  0.117 0.102 0.067 0.069 0.097 ...
##  $ RB      : num  2.3 1.91 3.16 2.26 3.46 ...
##  $ SO      : num  0.218 0.17 0.117 0.19 0.388 ...
##  $ URL     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ YNQU    : num  0.403 0.204 0.167 0.069 0.372 ...

nrow(YouthFiction)  # Should be 1191 files

## [1] 1191

YouthFiction$Series <- "Youth Fiction"
YouthFiction$Level <- "Ref."
YouthFiction$Country <- "Youth Fiction"
YouthFiction$Register <- "Youth Fiction"
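The same four metadata columns are set by hand for each reference corpus in this notebook. A small helper could do this in one call, as sketched below. The function name `add_ref_metadata()` and the toy data are assumptions made for illustration only.

```r
# Hypothetical helper (not in the original script): label every row of a
# reference corpus with the corpus name and the level "Ref."
add_ref_metadata <- function(df, label) {
  df$Series <- label
  df$Level <- "Ref."
  df$Country <- label
  df$Register <- label
  df
}

# Toy stand-in for an imported counts table
toy <- data.frame(Filename = c("a.txt", "b.txt"), Words = c(5955L, 5877L))
toy <- add_ref_metadata(toy, "Youth Fiction")
str(toy)
```

Applied to the real tables, this would amount to, for example, `YouthFiction <- add_ref_metadata(YouthFiction, "Youth Fiction")`.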

Importing ITTC counts

InfoTeen <- read.delim(here("MFTE", "Outputs", "InfoTeen_3.1_normed_complex_counts.tsv"), 
    header = TRUE, stringsAsFactors = TRUE)
str(InfoTeen)  # Check sanity of data

## 'data.frame':    1414 obs. of  84 variables:
##  $ Filename: Factor w/ 1414 levels "Dogo_News_10059528_science.txt",..: 657 831 825 1285 1086 1240 85 567 644 1147 ...
##  $ Words   : int  656 904 833 971 711 1104 697 1098 762 705 ...
##  $ AWL     : num  4.77 4.73 5.08 4.99 4.92 ...
##  $ TTR     : num  0.482 0.485 0.502 0.547 0.55 ...
##  $ LD      : num  0.59 0.559 0.593 0.623 0.578 ...
##  $ DT      : num  21.1 33.7 31.8 26.5 32.2 ...
##  $ JJAT    : num  23.2 22.4 20.4 24.9 17.8 ...
##  $ POS     : num  2.062 2.845 0.408 1.246 1.485 ...
##  $ NCOMP   : num  5.15 4.88 6.53 11.21 8.42 ...
##  $ QUAN    : num  3.61 5.69 1.63 4.36 3.47 ...
##  $ ACT     : num  21.9 24.7 28.4 29.5 34 ...
##  $ ASPECT  : num  6.25 5.19 0 1.64 1.89 ...
##  $ CAUSE   : num  1.56 3.9 1.49 1.64 5.66 ...
##  $ COMM    : num  7.81 5.19 14.93 3.28 3.77 ...
##  $ CUZ     : num  0 0 1.49 8.2 0 ...
##  $ CC      : num  23.4 36.4 37.3 44.3 30.2 ...
##  $ CONC    : num  0 1.3 1.49 3.28 5.66 ...
##  $ COND    : num  0 1.3 1.49 3.28 0 ...
##  $ EX      : num  1.56 0 0 0 1.89 ...
##  $ EXIST   : num  4.69 14.29 16.42 4.92 5.66 ...
##  $ ELAB    : num  0 0 4.48 1.64 0 ...
##  $ FREQ    : num  1.56 1.3 2.99 3.28 0 ...
##  $ JJPR    : num  12.5 19.5 23.9 24.6 13.2 ...
##  $ MENTAL  : num  3.12 9.09 11.94 9.84 15.09 ...
##  $ OCCUR   : num  4.69 2.6 0 4.92 1.89 ...
##  $ DOAUX   : num  7.81 1.3 0 0 1.89 ...
##  $ QUTAG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ QUPR    : num  0 0 0 0 3.77 ...
##  $ SPLIT   : num  4.69 1.3 2.99 8.2 3.77 ...
##  $ STPR    : num  1.56 1.3 0 0 0 ...
##  $ WHQU    : num  7.81 0 0 0 0 ...
##  $ THSC    : num  6.25 10.39 7.46 1.64 7.55 ...
##  $ WHSC    : num  12.5 19.48 8.96 4.92 18.87 ...
##  $ CONT    : num  6.25 6.49 2.99 0 3.77 ...
##  $ VBD     : num  51.56 6.49 29.85 8.2 26.42 ...
##  $ VPRT    : num  42.2 58.4 59.7 78.7 56.6 ...
##  $ PLACE   : num  3.12 2.6 13.43 3.28 3.77 ...
##  $ PROG    : num  4.69 1.3 1.49 3.28 0 ...
##  $ HGOT    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BEMA    : num  23.4 15.6 16.4 26.2 17 ...
##  $ MDCA    : num  3.12 0 1.49 4.92 13.21 ...
##  $ MDCO    : num  0 7.79 0 1.64 0 ...
##  $ TIME    : num  7.81 2.6 16.42 4.92 7.55 ...
##  $ THATD   : num  1.56 1.3 0 1.64 3.77 ...
##  $ THRC    : num  6.25 14.29 5.97 6.56 3.77 ...
##  $ VIMP    : num  0 9.09 4.48 3.28 0 ...
##  $ MDMM    : num  3.12 6.49 2.99 0 0 ...
##  $ ABLE    : num  0 1.3 1.49 0 0 ...
##  $ MDNE    : num  0 2.6 0 0 0 ...
##  $ MDWS    : num  0 1.3 0 1.64 1.89 ...
##  $ MDWO    : num  0 7.79 1.49 1.64 1.89 ...
##  $ XX0     : num  7.81 5.19 0 3.28 3.77 ...
##  $ PASS    : num  10.94 5.19 13.43 4.92 7.55 ...
##  $ PGET    : num  0 0 0 0 0 ...
##  $ VBG     : num  10.9 19.5 14.9 18 35.8 ...
##  $ VBN     : num  1.56 10.39 28.36 13.11 13.21 ...
##  $ PEAS    : num  6.25 6.49 4.48 13.11 15.09 ...
##  $ GTO     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FPP1S   : num  0 0 0 0 0 ...
##  $ FPP1P   : num  0 5.19 0 0 7.55 ...
##  $ TPP3S   : num  7.81 0 4.48 0 0 ...
##  $ TPP3P   : num  4.69 2.6 8.96 4.92 35.85 ...
##  $ SPP2    : num  17.2 0 0 0 0 ...
##  $ PIT     : num  6.25 12.99 20.9 8.2 0 ...
##  $ PRP     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ RP      : num  3.12 2.6 5.97 1.64 9.43 ...
##  $ AMP     : num  0.152 0 0 0.103 0.422 ...
##  $ CD      : num  2.29 2.77 2.4 3.6 2.25 ...
##  $ DEMO    : num  0.457 0.996 0.48 0.927 0.281 ...
##  $ DMA     : num  0 0.111 0 0 0 ...
##  $ DWNT    : num  0 0 0 0.103 0 ...
##  $ EMO     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ EMPH    : num  0.61 0.332 0.24 0 0.563 ...
##  $ FPUH    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ HDG     : num  0.457 0.111 0.84 0.412 0 ...
##  $ HST     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ IN      : num  11.4 13.3 11.8 12.9 14.6 ...
##  $ LIKE    : num  0.152 0.111 0 0.206 0 ...
##  $ NN      : num  29.6 27.2 29.4 33.1 28.4 ...
##  $ POLITE  : num  0 0 0 0 0 ...
##  $ RB      : num  0.915 1.327 1.321 1.133 1.969 ...
##  $ SO      : num  0.152 0 0 0 0 ...
##  $ URL     : num  0 0 0 0.309 0 ...
##  $ YNQU    : num  0.152 0 0 0 0 ...

nrow(InfoTeen)  # Should be 1414 files

## [1] 1414

# Remove three outlier files which should not have been included in the corpus (and any stray .DS_Store entry)
InfoTeen <- InfoTeen %>% filter(Filename != ".DS_Store" & Filename != "Revision_World_GCSE_10529068_wjec-level-law-past-papers.txt" & 
    Filename != "Revision_World_GCSE_10528474_wjec-level-history-past-papers.txt" & 
    Filename != "Revision_World_GCSE_10528472_edexcel-level-history-past-papers.txt")

InfoTeen$Series <- "Info Teens"
InfoTeen$Level <- "Ref."
InfoTeen$Country <- "Info Teens"
InfoTeen$Register <- "Info Teens"
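The exclusion of unwanted files above chains several `!=` comparisons. An equivalent and arguably more readable pattern keeps the filenames in a vector and filters with `%in%`. The sketch below uses toy filenames rather than the real ones.

```r
library(dplyr)

# Hypothetical exclusion list (toy names, not the real corpus files)
to_exclude <- c(".DS_Store", "past-papers-1.txt", "past-papers-2.txt")

# Toy stand-in for the imported counts table
toy <- data.frame(
  Filename = c("article-1.txt", ".DS_Store", "past-papers-1.txt", "article-2.txt"),
  Words = c(656L, 0L, 120L, 904L)
)

# Keep only the rows whose filename is not in the exclusion list
toy_clean <- toy %>% filter(!Filename %in% to_exclude)
nrow(toy_clean)  # 2 rows remain
```

`%in%` also works when `Filename` is a factor, as it is after `read.delim(..., stringsAsFactors = TRUE)`.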

Merging TEC and reference corpora counts

For reasons of space, the results based on the five-register dataset built below were not included in the thesis.

TxBcounts <- readRDS(here("FullMDA", "TxBcounts.rds"))

TxBcounts %>% 
  filter(Series=="NGL") %>% 
  group_by(Series, Level) %>% 
  summarise(wordcount = sum(Words))

counts <- bind_rows(TxBcounts, InfoTeen, SpokenBNC2014, YouthFiction, .id = "Corpus") %>% 
  filter(Register != "Poetry")
head(counts); tail(counts)
nrow(counts)

# Convert all character vectors to factors
counts[sapply(counts, is.character)] <- lapply(counts[sapply(counts, is.character)], as.factor)

# Change all NAs to 0
counts[is.na(counts)] <- 0

levels(counts$Corpus)
levels(counts$Corpus) <- list(Textbook.English="1", Informative.Teens="2", Spoken.BNC2014="3", Youth.Fiction="4")
summary(counts$Corpus)
summary(counts$Series)

# Re-order registers
levels(counts$Register)
counts$Register <- factor(counts$Register, levels = c("Conversation", "Fiction", "Informative", "Instructional", "Personal", "Info Teens", "Spoken BNC2014", "Youth Fiction"))

# Wrangle metadata variables
counts$Subcorpus <- counts$Register
levels(counts$Subcorpus) <- c("Textbook Conversation", "Textbook Fiction", "Textbook Informative", "Textbook Instructional", "Textbook Personal", "Info Teens Ref.", "Spoken BNC2014 Ref.", "Youth Fiction Ref.")
summary(counts$Subcorpus)

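# Merge each reference corpus into its matching textbook register (Poetry was filtered out above, so only eight levels remain)
levels(counts$Register) <- c("Conversation", "Fiction", "Informative", "Instructional", "Personal", "Informative", "Conversation", "Fiction")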
summary(counts$Register)

# Re-order variables
colnames(counts)
counts <- counts %>% 
  select(order(names(.))) %>% # Order alphabetically first
  select(Filename, Register, Level, Series, Country, Corpus, Subcorpus, Words, everything()) # Then place the metadata variables at the front of the table

#saveRDS(counts, here("FullMDA", "counts.rds")) # Last saved 9 Feb 2022
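The `levels(...) <- c(...)` assignments above relabel factor levels by position, so the result depends on the order and number of the existing levels. As a hedged alternative, levels can be merged by name with `forcats::fct_collapse()` (forcats is already loaded in the set-up). The toy factor below is an assumption for illustration.

```r
library(forcats)

# Toy register factor containing textbook and reference labels
reg <- factor(c("Conversation", "Info Teens", "Spoken BNC2014",
                "Youth Fiction", "Fiction", "Informative"))

# Merge each reference corpus into its matching register, by name
reg_merged <- fct_collapse(reg,
  Conversation = c("Conversation", "Spoken BNC2014"),
  Fiction = c("Fiction", "Youth Fiction"),
  Informative = c("Informative", "Info Teens")
)

# Cross-tabulate old against new labels to verify the mapping
table(reg, reg_merged)
```

Because the mapping is spelled out by name, adding or removing a level elsewhere in the pipeline cannot silently shift the labels.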

This is the three-register dataset presented in the second half of Chapter 7.

TxBcounts <- readRDS(here("FullMDA", "TxBcounts.rds"))

All3Reg <- c("Conversation", "Fiction", "Informative")

TxBcounts3Reg <- TxBcounts %>% 
  filter(Register %in% All3Reg) %>% 
  droplevels(.)

counts <- bind_rows(TxBcounts3Reg, InfoTeen, SpokenBNC2014, YouthFiction, .id = "Corpus")
head(counts); tail(counts)

##   Corpus                               Filename Country    Series Level
## 1      1                 POC_4e_Spoken_0007.txt  France       POC     C
## 2      1 Achievers_B1_plus_Informative_0007.txt   Spain Achievers     D
## 3      1                 POC_5e_Spoken_0003.txt  France       POC     B
## 4      1            Access_4_Narrative_0013.txt Germany    Access     D
## 5      1                  NGL_1_Spoken_0002.txt Germany       NGL     A
## 6      1            Access_1_Narrative_0005.txt Germany    Access     A
##       Register Words ABLE     ACT    AMP ASPECT    AWL    BEMA  CAUSE      CC
## 1 Conversation   750    0 23.9437 0.2667 2.8169 3.8987 19.7183 1.4085 45.0704
## 2  Informative   690    0 43.7500 0.1449 6.2500 4.6986 14.5833 2.0833 68.7500
## 3 Conversation   694    0 14.1176 0.2882 1.1765 3.8098 31.7647 1.1765 20.0000
## 4      Fiction   547    0 18.9189 0.9141 6.7568 3.9506 20.2703 0.0000 28.3784
## 5 Conversation   927    0 10.4348 0.1079 0.0000 3.8188 46.9565 3.4783 20.8696
## 6      Fiction   840    0 23.5772 0.1190 0.8130 3.9393 15.4472 2.4390 26.8293
##       CD    COMM   CONC   COND    CONT    CUZ   DEMO    DMA   DOAUX      DT
## 1 2.4000  9.8592 0.0000 1.4085 19.7183 1.4085 0.2667 2.0000  1.4085 39.3701
## 2 1.3043 16.6667 4.1667 0.0000  4.1667 0.0000 0.1449 0.0000  0.0000 28.2178
## 3 1.0086  3.5294 0.0000 4.7059 29.4118 0.0000 1.1527 2.1614  9.4118 40.3846
## 4 0.1828  2.7027 0.0000 0.0000  8.1081 1.3514 0.0000 0.0000  4.0541 54.8387
## 5 0.2157  1.7391 0.0000 0.0000 47.8261 0.0000 1.1866 2.3732  4.3478 18.8406
## 6 0.8333 22.7642 0.0000 0.0000 12.1951 0.8130 0.7143 1.4286 10.5691 27.0833
##     DWNT   ELAB EMO   EMPH     EX   EXIST   FPP1P   FPP1S   FPUH   FREQ    GTO
## 1 0.1333 0.0000   0 0.6667 0.0000  8.4507 18.3099 60.5634 0.8000 2.8169 2.8169
## 2 0.0000 0.0000   0 0.2899 0.0000 12.5000 20.8333  0.0000 0.0000 2.0833 0.0000
## 3 0.0000 2.3529   0 0.4323 3.5294  3.5294 17.6471 15.2941 1.0086 1.1765 4.7059
## 4 0.0000 0.0000   0 0.0000 8.1081  4.0541 20.2703  9.4595 0.0000 1.3514 0.0000
## 5 0.0000 0.0000   0 0.3236 2.6087  0.8696 12.1739 31.3043 3.2362 4.3478 0.0000
## 6 0.0000 0.0000   0 0.3571 4.0650  1.6260 11.3821 23.5772 0.9524 4.0650 0.0000
##      HDG   HGOT HST      IN    JJAT    JJPR       LD   LIKE   MDCA   MDCO MDMM
## 1 0.6667 0.0000   0 10.1333 18.8976 12.6761 0.516000 0.4000 1.4085 1.4085    0
## 2 0.1449 0.0000   0 10.8696 25.7426 12.5000 0.589855 0.0000 4.1667 0.0000    0
## 3 0.1441 0.0000   0  7.7810 13.4615 17.6471 0.504323 0.1441 2.3529 0.0000    0
## 4 0.0000 0.0000   0 11.5174 10.7527 22.9730 0.466179 0.0000 4.0541 1.3514    0
## 5 0.3236 7.8261   0  5.1780  7.2464 27.8261 0.535059 0.1079 6.9565 0.0000    0
## 6 0.0000 0.0000   0  7.6190  7.2917 12.1951 0.557143 0.0000 7.3171 0.8130    0
##     MDNE    MDWO    MDWS  MENTAL   NCOMP      NN  OCCUR   PASS   PEAS PGET
## 1 4.2254 22.5352  2.8169 36.6197  4.7244 16.9333 7.0423 2.8169 0.0000    0
## 2 6.2500  0.0000 10.4167 20.8333 12.8713 29.2754 0.0000 4.1667 4.1667    0
## 3 7.0588  0.0000  1.1765 20.0000  4.8077 14.9856 0.0000 1.1765 0.0000    0
## 4 0.0000  0.0000  1.3514 20.2703  1.0753 17.0018 5.4054 2.7027 2.7027    0
## 5 0.0000  0.0000  0.0000 11.3043 10.6280 22.3301 0.0000 0.0000 0.0000    0
## 6 0.0000  0.0000  0.0000 10.5691  8.3333 22.8571 0.0000 0.8130 0.0000    0
##       PIT  PLACE POLITE    POS    PROG PRP    QUAN   QUPR QUTAG     RB     RP
## 1  7.0423 0.0000 0.4000 0.0000  2.8169   0 11.0236 5.6338     0 1.4667 4.2254
## 2 12.5000 6.2500 0.0000 0.9901  6.2500   0  2.9703 4.1667     0 0.7246 6.2500
## 3 23.5294 5.8824 0.1441 1.9231 12.9412   0  4.8077 2.3529     0 3.7464 0.0000
## 4  8.1081 6.7568 0.1828 2.1505  4.0541   0  7.5269 1.3514     0 3.1079 5.4054
## 5 15.6522 6.9565 0.9709 3.3816  0.0000   0  0.9662 2.6087     0 1.2945 0.0000
## 6  6.5041 2.4390 0.2381 2.0833  0.8130   0  3.1250 0.0000     0 1.6667 0.8130
##       SO  SPLIT    SPP2   STPR  THATD THRC   THSC   TIME   TPP3P   TPP3S    TTR
## 1 0.2667 1.4085 23.9437 1.4085 0.0000    0 0.0000 5.6338  0.0000  1.4085 0.5050
## 2 0.2899 4.1667 33.3333 0.0000 0.0000    0 2.0833 2.0833 12.5000  0.0000 0.5900
## 3 0.0000 2.3529 24.7059 0.0000 0.0000    0 1.1765 4.7059  2.3529  5.8824 0.4275
## 4 0.9141 0.0000  0.0000 0.0000 1.3514    0 0.0000 4.0541  8.1081 28.3784 0.4525
## 5 0.0000 1.7391 25.2174 0.0000 0.0000    0 0.8696 5.2174  1.7391  7.8261 0.3550
## 6 0.1190 0.8130  9.7561 0.0000 0.0000    0 0.0000 3.2520  8.1301  8.1301 0.5100
##      URL     VBD     VBG    VBN    VIMP    VPRT   WHQU    WHSC     XX0   YNQU
## 1 0.0000 38.0282  7.0423 0.0000  2.8169 26.7606 8.4507  5.6338  7.0423 0.6667
## 2 0.0000 18.7500 14.5833 4.1667  6.2500 54.1667 0.0000 16.6667  4.1667 0.0000
## 3 0.0000  4.7059  0.0000 0.0000  7.0588 77.6471 8.2353  5.8824 12.9412 2.1614
## 4 0.0000 64.8649  8.1081 0.0000  0.0000 28.3784 0.0000  9.4595  5.4054 0.0000
## 5 0.1079  3.4783  1.7391 0.0000 10.4348 79.1304 7.8261  2.6087 17.3913 0.8630
## 6 0.0000 51.2195  0.0000 0.8130  3.2520 37.3984 4.8780  5.6911  8.1301 0.3571

##      Corpus                              Filename       Country        Series
## 5090      4       130_PRATCHETT1989DW07MIDS_3.txt Youth Fiction Youth Fiction
## 5091      4       163_PRATCHETT1998DW23ULUM_4.txt Youth Fiction Youth Fiction
## 5092      4         106_GOLDING1980RITESAGE_2.txt Youth Fiction Youth Fiction
## 5093      4            68_A-Wrinkle-In-Time_1.txt Youth Fiction Youth Fiction
## 5094      4          81_thetrumpetoftheswan_4.txt Youth Fiction Youth Fiction
## 5095      4 207_DiaryOfAWimpyKid1JeffKinney_1.txt Youth Fiction Youth Fiction
##      Level      Register Words   ABLE     ACT    AMP ASPECT    AWL    BEMA
## 5090  Ref. Youth Fiction  5840 0.2642 18.6262 0.2740 1.7173 4.2649 15.3236
## 5091  Ref. Youth Fiction  6141 0.0000 16.3855 0.1628 2.2892 4.0474 12.4096
## 5092  Ref. Youth Fiction  5686 0.1681 20.3361 0.1231 1.1765 4.2921 17.4790
## 5093  Ref. Youth Fiction  5980 1.0485 21.1009 0.3846 1.4417 4.1137 18.3486
## 5094  Ref. Youth Fiction  5772 0.5789 23.2996 0.1213 1.4472 4.0665 13.4588
## 5095  Ref. Youth Fiction  6024 0.4688 37.5000 0.0996 4.3750 4.0144 13.4375
##       CAUSE      CC     CD    COMM   CONC   COND    CONT    CUZ   DEMO    DMA
## 5090 0.9247 26.4201 0.5308 15.3236 0.9247 1.0568 15.1915 1.1889 0.9075 0.5993
## 5091 1.2048 17.3494 0.5862 22.0482 0.6024 2.7711 28.7952 0.9639 0.9770 1.0096
## 5092 2.0168 30.9244 0.5276 14.2857 2.0168 2.1849  0.8403 0.8403 0.6683 0.4749
## 5093 2.3591 25.4260 0.7692 14.5478 1.5727 1.7038 10.6160 1.0485 0.6355 0.6355
## 5094 0.8683 29.8119 0.5891 10.1302 0.2894 1.1577  8.8278 0.8683 0.7796 0.1213
## 5095 1.0938 25.3125 1.5438 11.5625 0.1562 1.4062 13.7500 2.6562 0.6474 0.3154
##       DOAUX      DT   DWNT   ELAB EMO   EMPH     EX  EXIST   FPP1P   FPP1S
## 5090 3.1704 44.0196 0.0856 0.3963   0 0.5137 3.1704 3.1704  4.6235  7.2655
## 5091 5.0602 31.7360 0.0977 0.0000   0 0.6188 3.7349 1.8072  6.6265 15.4217
## 5092 3.0252 41.1150 0.0703 0.3361   0 0.3693 3.0252 3.1933  8.4034 37.8151
## 5093 5.6356 27.6768 0.1003 0.0000   0 0.7358 1.9659 2.4902 14.2857 11.4024
## 5094 4.1968 44.0287 0.0693 0.1447   0 0.3119 0.8683 2.0260  1.0130 11.5774
## 5095 3.7500 41.1438 0.0332 0.1562   0 0.7968 1.4062 0.7812  6.4062 45.9375
##        FPUH   FREQ    GTO    HDG   HGOT HST      IN    JJAT    JJPR       LD
## 5090 0.2397 4.0951 0.6605 0.1370 0.3963  NA  9.5205 20.0980 10.5680 0.508904
## 5091 0.4560 2.6506 0.7229 0.1628 1.8072  NA  7.7349 13.8336  8.7952 0.544862
## 5092 0.0528 2.3529 0.0000 0.1231 0.1681  NA 12.4868 22.2997 17.3109 0.465881
## 5093 0.2007 3.6697 0.5242 0.1171 0.2621  NA  8.5953 16.0606 15.2031 0.503010
## 5094 0.1559 3.0391 1.0130 0.0866 0.2894  NA 10.4816 13.9331 12.0116 0.511088
## 5095 0.1162 2.5000 1.4062 0.2158 0.4688  NA  7.8187 11.6759 10.3125 0.541169
##        LIKE   MDCA   MDCO   MDMM   MDNE   MDWO   MDWS  MENTAL  NCOMP      NN
## 5090 0.3253 0.9247 2.7741 0.7926 1.7173 3.3025 1.5852 13.7384 2.4510 17.4658
## 5091 0.3582 5.6627 0.8434 0.1205 1.2048 3.0120 3.9759 14.8193 3.8879 18.0101
## 5092 0.1055 2.6891 1.0084 2.1849 4.2017 4.0336 2.6891 15.7983 3.3101 20.1899
## 5093 0.1505 2.3591 2.7523 0.3932 4.0629 3.4076 2.6212 18.4797 5.8586 16.5552
## 5094 0.1213 1.8813 2.0260 0.8683 1.1577 2.6049 3.0391 16.3531 5.0955 21.7602
## 5095 0.2822 1.2500 1.8750 0.1562 3.9062 2.3438 1.2500 24.5312 5.7983 20.8997
##       OCCUR   PASS    PEAS   PGET     PIT  PLACE POLITE    POS   PROG    PRP
## 5090 0.9247 3.4346  6.2087 0.3963 12.4174 4.3593 0.0856 1.8627 5.5482 0.0000
## 5091 1.2048 2.0482  2.7711 0.2410  9.0361 3.4940 0.0977 3.4358 3.6145 0.0000
## 5092 1.8487 6.3866 10.2521 0.0000  9.7479 2.8571 0.1231 1.5679 2.3529 0.1681
## 5093 1.4417 5.6356  6.2910 0.1311  9.5675 3.8008 0.1171 2.1212 3.4076 0.2621
## 5094 2.1708 3.6179  5.6440 0.2894 11.1433 5.0651 0.0347 1.3535 2.3155 0.0000
## 5095 1.0938 2.0312  2.0312 0.9375 11.0938 3.2812 0.1494 1.9063 5.6250 0.0000
##        QUAN   QUPR  QUTAG     RB      RP     SO  SPLIT    SPP2   STPR   THATD
## 5090 6.9608 3.8309 1.1889 2.5171  4.8877 0.2568 2.7741  9.3791 0.7926  1.8494
## 5091 6.6908 3.7349 1.2048 2.4426  6.6265 0.2931 2.6506 16.8675 0.9639  3.3735
## 5092 5.6620 1.8487 0.0000 2.0929  2.6891 0.2638 3.0252  9.0756 0.8403  2.8571
## 5093 4.9495 4.0629 0.3932 2.8595  2.4902 0.1171 3.5387 17.5623 0.3932  1.5727
## 5094 3.9809 3.3285 0.0000 1.7845  5.4993 0.1386 1.7366  8.3936 1.0130  1.5919
## 5095 4.2097 4.5312 0.0000 2.0750 11.2500 0.4648 3.5937  5.6250 0.9375 11.5625
##        THRC   THSC   TIME   TPP3P   TPP3S    TTR URL     VBD     VBG    VBN
## 5090 0.7926 4.3593 3.6988 11.0964 32.4967 0.5275   0 59.4452  6.7371 3.9630
## 5091 1.3253 2.2892 1.8072 10.6024 22.2892 0.4125   0 44.0964  6.5060 1.3253
## 5092 1.3445 7.3950 4.8739  5.5462 30.0840 0.5300   0 49.0756  6.5546 5.5462
## 5093 2.0970 7.6016 4.8493  7.7326 31.9790 0.4875   0 50.3277  6.4220 2.4902
## 5094 1.1577 2.1708 3.0391  6.0781 39.6527 0.5125   0 63.5311  9.6961 2.4602
## 5095 1.8750 5.6250 4.8438  4.6875 20.0000 0.4800   0 57.3438 22.6562 1.4062
##        VIMP    VPRT   WHQU    WHSC     XX0   YNQU
## 5090 2.1136 27.3448 3.0383  7.0013 10.7001 0.3253
## 5091 3.8554 37.2289 3.3735  4.3373 11.3253 0.3745
## 5092 2.5210 31.5966 0.6723 10.4202 10.2521 0.1759
## 5093 2.0970 31.9790 4.7182  8.9122 14.6789 0.4515
## 5094 2.7496 22.1418 1.8813  8.6831  8.2489 0.0520
## 5095 1.4062 30.4688 0.7812  9.2188  8.5938 0.0830

str(counts)

## 'data.frame':    5095 obs. of  89 variables:
##  $ Corpus  : chr  "1" "1" "1" "1" ...
##  $ Filename: chr  "POC_4e_Spoken_0007.txt" "Achievers_B1_plus_Informative_0007.txt" "POC_5e_Spoken_0003.txt" "Access_4_Narrative_0013.txt" ...
##  $ Country : chr  "France" "Spain" "France" "Germany" ...
##  $ Series  : chr  "POC" "Achievers" "POC" "Access" ...
##  $ Level   : chr  "C" "D" "B" "D" ...
##  $ Register: chr  "Conversation" "Informative" "Conversation" "Fiction" ...
##  $ Words   : int  750 690 694 547 927 840 1127 1090 635 976 ...
##  $ ABLE    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ ACT     : num  23.9 43.8 14.1 18.9 10.4 ...
##  $ AMP     : num  0.267 0.145 0.288 0.914 0.108 ...
##  $ ASPECT  : num  2.82 6.25 1.18 6.76 0 ...
##  $ AWL     : num  3.9 4.7 3.81 3.95 3.82 ...
##  $ BEMA    : num  19.7 14.6 31.8 20.3 47 ...
##  $ CAUSE   : num  1.41 2.08 1.18 0 3.48 ...
##  $ CC      : num  45.1 68.8 20 28.4 20.9 ...
##  $ CD      : num  2.4 1.304 1.009 0.183 0.216 ...
##  $ COMM    : num  9.86 16.67 3.53 2.7 1.74 ...
##  $ CONC    : num  0 4.17 0 0 0 ...
##  $ COND    : num  1.41 0 4.71 0 0 ...
##  $ CONT    : num  19.72 4.17 29.41 8.11 47.83 ...
##  $ CUZ     : num  1.41 0 0 1.35 0 ...
##  $ DEMO    : num  0.267 0.145 1.153 0 1.187 ...
##  $ DMA     : num  2 0 2.16 0 2.37 ...
##  $ DOAUX   : num  1.41 0 9.41 4.05 4.35 ...
##  $ DT      : num  39.4 28.2 40.4 54.8 18.8 ...
##  $ DWNT    : num  0.133 0 0 0 0 ...
##  $ ELAB    : num  0 0 2.35 0 0 ...
##  $ EMO     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ EMPH    : num  0.667 0.29 0.432 0 0.324 ...
##  $ EX      : num  0 0 3.53 8.11 2.61 ...
##  $ EXIST   : num  8.45 12.5 3.53 4.05 0.87 ...
##  $ FPP1P   : num  18.3 20.8 17.6 20.3 12.2 ...
##  $ FPP1S   : num  60.56 0 15.29 9.46 31.3 ...
##  $ FPUH    : num  0.8 0 1.01 0 3.24 ...
##  $ FREQ    : num  2.82 2.08 1.18 1.35 4.35 ...
##  $ GTO     : num  2.82 0 4.71 0 0 ...
##  $ HDG     : num  0.667 0.145 0.144 0 0.324 ...
##  $ HGOT    : num  0 0 0 0 7.83 ...
##  $ HST     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ IN      : num  10.13 10.87 7.78 11.52 5.18 ...
##  $ JJAT    : num  18.9 25.74 13.46 10.75 7.25 ...
##  $ JJPR    : num  12.7 12.5 17.6 23 27.8 ...
##  $ LD      : num  0.516 0.59 0.504 0.466 0.535 ...
##  $ LIKE    : num  0.4 0 0.144 0 0.108 ...
##  $ MDCA    : num  1.41 4.17 2.35 4.05 6.96 ...
##  $ MDCO    : num  1.41 0 0 1.35 0 ...
##  $ MDMM    : num  0 0 0 0 0 ...
##  $ MDNE    : num  4.23 6.25 7.06 0 0 ...
##  $ MDWO    : num  22.5 0 0 0 0 ...
##  $ MDWS    : num  2.82 10.42 1.18 1.35 0 ...
##  $ MENTAL  : num  36.6 20.8 20 20.3 11.3 ...
##  $ NCOMP   : num  4.72 12.87 4.81 1.08 10.63 ...
##  $ NN      : num  16.9 29.3 15 17 22.3 ...
##  $ OCCUR   : num  7.04 0 0 5.41 0 ...
##  $ PASS    : num  2.82 4.17 1.18 2.7 0 ...
##  $ PEAS    : num  0 4.17 0 2.7 0 ...
##  $ PGET    : num  0 0 0 0 0 ...
##  $ PIT     : num  7.04 12.5 23.53 8.11 15.65 ...
##  $ PLACE   : num  0 6.25 5.88 6.76 6.96 ...
##  $ POLITE  : num  0.4 0 0.144 0.183 0.971 ...
##  $ POS     : num  0 0.99 1.92 2.15 3.38 ...
##  $ PROG    : num  2.82 6.25 12.94 4.05 0 ...
##  $ PRP     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ QUAN    : num  11.024 2.97 4.808 7.527 0.966 ...
##  $ QUPR    : num  5.63 4.17 2.35 1.35 2.61 ...
##  $ QUTAG   : num  0 0 0 0 0 ...
##  $ RB      : num  1.467 0.725 3.746 3.108 1.294 ...
##  $ RP      : num  4.23 6.25 0 5.41 0 ...
##  $ SO      : num  0.267 0.29 0 0.914 0 ...
##  $ SPLIT   : num  1.41 4.17 2.35 0 1.74 ...
##  $ SPP2    : num  23.9 33.3 24.7 0 25.2 ...
##  $ STPR    : num  1.41 0 0 0 0 ...
##  $ THATD   : num  0 0 0 1.35 0 ...
##  $ THRC    : num  0 0 0 0 0 ...
##  $ THSC    : num  0 2.08 1.18 0 0.87 ...
##  $ TIME    : num  5.63 2.08 4.71 4.05 5.22 ...
##  $ TPP3P   : num  0 12.5 2.35 8.11 1.74 ...
##  $ TPP3S   : num  1.41 0 5.88 28.38 7.83 ...
##  $ TTR     : num  0.505 0.59 0.427 0.453 0.355 ...
##  $ URL     : num  0 0 0 0 0.108 ...
##  $ VBD     : num  38.03 18.75 4.71 64.86 3.48 ...
##  $ VBG     : num  7.04 14.58 0 8.11 1.74 ...
##  $ VBN     : num  0 4.17 0 0 0 ...
##  $ VIMP    : num  2.82 6.25 7.06 0 10.43 ...
##  $ VPRT    : num  26.8 54.2 77.6 28.4 79.1 ...
##  $ WHQU    : num  8.45 0 8.24 0 7.83 ...
##  $ WHSC    : num  5.63 16.67 5.88 9.46 2.61 ...
##  $ XX0     : num  7.04 4.17 12.94 5.41 17.39 ...
##  $ YNQU    : num  0.667 0 2.161 0 0.863 ...

# Convert all character vectors to factors
counts[sapply(counts, is.character)] <- lapply(counts[sapply(counts, is.character)], as.factor)

# Change all NAs to 0
counts[is.na(counts)] <- 0

levels(counts$Corpus)

## [1] "1" "2" "3" "4"

levels(counts$Corpus) <- list(Textbook.English="1", Informative.Teens="2", Spoken.BNC2014="3", Youth.Fiction="4")
summary(counts$Corpus)

##  Textbook.English Informative.Teens    Spoken.BNC2014     Youth.Fiction 
##              1242              1411              1251              1191
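
The recoding above relies on the list form of `levels<-`, in which each element reads `new_label = "old_level"`. A minimal sketch on a toy factor (the vector `corpus_code` is hypothetical and not part of the corpus data) may make the mechanics clearer:

```r
# Toy factor with numeric-looking codes (hypothetical data)
corpus_code <- factor(c("1", "3", "2", "1", "4"))

# Each list element has the form new_label = "old_level"
levels(corpus_code) <- list(
  Textbook.English  = "1",
  Informative.Teens = "2",
  Spoken.BNC2014    = "3",
  Youth.Fiction     = "4"
)

# Counts per renamed level
table(corpus_code)
```

Because the mapping is explicit, this form does not depend on the order in which R happened to sort the original levels.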

summary(counts$Series)

##         Access      Achievers            EIM      GreenLine             HT 
##            227            108             99            134            106 
##     Info Teens            JTT            NGL            POC      Solutions 
##           1411             97            192             60            219 
## Spoken BNC2014  Youth Fiction 
##           1251           1191

# Wrangle metadata variables
counts$Subcorpus <- counts$Register
levels(counts$Subcorpus)

## [1] "Conversation"   "Fiction"        "Info Teens"     "Informative"   
## [5] "Spoken BNC2014" "Youth Fiction"

levels(counts$Subcorpus) <- c("Textbook Conversation", "Textbook Fiction", "Info Teens Ref.", "Textbook Informative", "Spoken BNC2014 Ref.", "Youth Fiction Ref.")
summary(counts$Subcorpus)

## Textbook Conversation      Textbook Fiction       Info Teens Ref. 
##                   593                   285                  1411 
##  Textbook Informative   Spoken BNC2014 Ref.    Youth Fiction Ref. 
##                   364                  1251                  1191

# Collapse the six register labels into three registers (textbook and reference texts share a label)
levels(counts$Register)

## [1] "Conversation"   "Fiction"        "Info Teens"     "Informative"   
## [5] "Spoken BNC2014" "Youth Fiction"

levels(counts$Register) <- c("Conversation", "Fiction", "Informative", "Informative", "Conversation", "Fiction")
summary(counts$Register)

## Conversation      Fiction  Informative 
##         1844         1476         1775
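
Assigning a character vector with repeated labels to `levels()` merges the corresponding levels, which is how six labels become three registers above. The following sketch uses a small hypothetical factor and sets the level order explicitly, since positional assignment silently depends on that order:

```r
# Hypothetical register labels; the level order is fixed explicitly so that
# the positional assignment below cannot be affected by locale-specific sorting
old_levels <- c("Conversation", "Fiction", "Info Teens", "Informative",
                "Spoken BNC2014", "Youth Fiction")
register <- factor(c("Conversation", "Fiction", "Info Teens", "Informative",
                     "Spoken BNC2014", "Youth Fiction", "Info Teens"),
                   levels = old_levels)

# Repeated labels on the right-hand side merge the matching levels
levels(register) <- c("Conversation", "Fiction", "Informative",
                      "Informative", "Conversation", "Fiction")

table(register)
```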

# Re-order variables
colnames(counts)

##  [1] "Corpus"    "Filename"  "Country"   "Series"    "Level"     "Register" 
##  [7] "Words"     "ABLE"      "ACT"       "AMP"       "ASPECT"    "AWL"      
## [13] "BEMA"      "CAUSE"     "CC"        "CD"        "COMM"      "CONC"     
## [19] "COND"      "CONT"      "CUZ"       "DEMO"      "DMA"       "DOAUX"    
## [25] "DT"        "DWNT"      "ELAB"      "EMO"       "EMPH"      "EX"       
## [31] "EXIST"     "FPP1P"     "FPP1S"     "FPUH"      "FREQ"      "GTO"      
## [37] "HDG"       "HGOT"      "HST"       "IN"        "JJAT"      "JJPR"     
## [43] "LD"        "LIKE"      "MDCA"      "MDCO"      "MDMM"      "MDNE"     
## [49] "MDWO"      "MDWS"      "MENTAL"    "NCOMP"     "NN"        "OCCUR"    
## [55] "PASS"      "PEAS"      "PGET"      "PIT"       "PLACE"     "POLITE"   
## [61] "POS"       "PROG"      "PRP"       "QUAN"      "QUPR"      "QUTAG"    
## [67] "RB"        "RP"        "SO"        "SPLIT"     "SPP2"      "STPR"     
## [73] "THATD"     "THRC"      "THSC"      "TIME"      "TPP3P"     "TPP3S"    
## [79] "TTR"       "URL"       "VBD"       "VBG"       "VBN"       "VIMP"     
## [85] "VPRT"      "WHQU"      "WHSC"      "XX0"       "YNQU"      "Subcorpus"

counts <- counts %>% 
  select(order(names(.))) %>% # Order alphabetically first
  select(Filename, Register, Level, Series, Country, Corpus, Subcorpus, Words, everything()) # Then place the metadata variables at the front of the table

#saveRDS(counts, here("FullMDA", "counts3Reg.rds")) # Last saved 9 Feb 2022

Data preparation

Plotting the distributions of all the features

ncounts <- readRDS(here("FullMDA", "counts3Reg.rds"))
colnames(ncounts)

##  [1] "Filename"  "Register"  "Level"     "Series"    "Country"   "Corpus"   
##  [7] "Subcorpus" "Words"     "ABLE"      "ACT"       "AMP"       "ASPECT"   
## [13] "AWL"       "BEMA"      "CAUSE"     "CC"        "CD"        "COMM"     
## [19] "CONC"      "COND"      "CONT"      "CUZ"       "DEMO"      "DMA"      
## [25] "DOAUX"     "DT"        "DWNT"      "ELAB"      "EMO"       "EMPH"     
## [31] "EX"        "EXIST"     "FPP1P"     "FPP1S"     "FPUH"      "FREQ"     
## [37] "GTO"       "HDG"       "HGOT"      "HST"       "IN"        "JJAT"     
## [43] "JJPR"      "LD"        "LIKE"      "MDCA"      "MDCO"      "MDMM"     
## [49] "MDNE"      "MDWO"      "MDWS"      "MENTAL"    "NCOMP"     "NN"       
## [55] "OCCUR"     "PASS"      "PEAS"      "PGET"      "PIT"       "PLACE"    
## [61] "POLITE"    "POS"       "PROG"      "PRP"       "QUAN"      "QUPR"     
## [67] "QUTAG"     "RB"        "RP"        "SO"        "SPLIT"     "SPP2"     
## [73] "STPR"      "THATD"     "THRC"      "THSC"      "TIME"      "TPP3P"    
## [79] "TPP3S"     "TTR"       "URL"       "VBD"       "VBG"       "VBN"      
## [85] "VIMP"      "VPRT"      "WHQU"      "WHSC"      "XX0"       "YNQU"

# Compare relative frequencies of individual features, e.g., BE as a main verb per FVP (finite verb phrase)
ncounts %>% 
  group_by(Register, Corpus) %>% 
  summarise(median(BEMA), MAD(BEMA))

## # A tibble: 6 × 4
## # Groups:   Register [3]
##   Register     Corpus            `median(BEMA)` `MAD(BEMA)`
##   <fct>        <fct>                      <dbl>       <dbl>
## 1 Conversation Textbook.English            23.9        6.23
## 2 Conversation Spoken.BNC2014              20.6        2.90
## 3 Fiction      Textbook.English            15.8        5.13
## 4 Fiction      Youth.Fiction               14.2        2.53
## 5 Informative  Textbook.English            18.8        6.65
## 6 Informative  Informative.Teens           16.9        7.07
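
The median and the median absolute deviation (MAD) are used here as robust counterparts of the mean and standard deviation. As a rough illustration of what the second column measures, the sketch below computes the MAD by hand on a toy vector and compares it with base R's `stats::mad()`. It is an assumption that `DescTools::MAD()` uses the same default scaling constant (1.4826) as `stats::mad()`.

```r
# Toy vector with one extreme value (hypothetical data)
x <- c(1, 2, 3, 4, 100)

med <- median(x)                 # robust centre
abs_dev <- abs(x - med)          # absolute deviations from the median
mad_raw <- median(abs_dev)       # unscaled MAD
mad_scaled <- 1.4826 * mad_raw   # scaled to be comparable with an SD under normality

mad_scaled
stats::mad(x)                    # base R equivalent with its default constant
sd(x)                            # for contrast: strongly inflated by the extreme value
```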

# Inspired by: https://drsimonj.svbtle.com/quick-plot-of-all-variables

ncounts %>%
  select(-Words) %>% 
  keep(is.numeric) %>% 
  gather() %>% # gather() from tidyr reshapes the selected variables into two columns: key (the names of the original variables) and value (the data), so that facet_wrap() from ggplot2 can draw one panel per variable
  ggplot(aes(value)) +
    theme_bw() +
    facet_wrap(~ key, scales = "free", ncol = 4) +
    scale_x_continuous(expand=c(0,0)) +
    scale_y_continuous(limits = c(0,NA)) +
    geom_histogram(aes(y = ..density..), bins = 30, colour= "black", fill = "grey") +
    geom_density(colour = "darkred", weight = 2, fill="darkred", alpha = .4)

#ggsave(here("Plots", "DensityPlotsAllVariables.svg"), width = 15, height = 49)

ncounts %>%
  select(-Words) %>% 
  keep(is.numeric) %>% 
  gather() %>% # gather() from tidyr reshapes the selected variables into two columns: key (the names of the original variables) and value (the data), so that facet_wrap() from ggplot2 can draw one panel per variable
  ggplot(aes(value)) +
    theme_bw() +
    facet_wrap(~ key, scales = "free", ncol = 4) +
    scale_x_continuous(expand=c(0,0)) +
    geom_histogram(bins = 30, colour= "darkred", fill = "darkred", alpha = 0.5)

#ggsave(here("Plots", "HistogramPlotsAllVariables.svg"), width = 20, height = 45)
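
Both plots depend on first reshaping the wide feature table into a long key/value table. The base R sketch below shows the shape of that intermediate table on toy data. The column names `feat_a` and `feat_b` are hypothetical, and `stack()` is used only as a dependency-free stand-in for `gather()`, not as the method used in this notebook.

```r
# Toy wide table: one column per feature (hypothetical names and values)
wide <- data.frame(feat_a = c(1, 2), feat_b = c(3, 4))

# stack() returns the data in a 'values' column and the source column names in 'ind'
stacked <- stack(wide)

# Rename to the key/value layout that gather() produces by default
long <- data.frame(key = as.character(stacked$ind), value = stacked$values)
long
```

In current versions of tidyr, `gather()` is documented as superseded by `pivot_longer()`, so new code would normally use the latter.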

Feature removal due to low text frequency

# For MDA with five TEC registers
# ncounts <- readRDS(here("FullMDA", "counts.rds"))

# For MDA with three TEC registers
ncounts <- readRDS(here("FullMDA", "counts3Reg.rds"))

colnames(ncounts)

##  [1] "Filename"  "Register"  "Level"     "Series"    "Country"   "Corpus"   
##  [7] "Subcorpus" "Words"     "ABLE"      "ACT"       "AMP"       "ASPECT"   
## [13] "AWL"       "BEMA"      "CAUSE"     "CC"        "CD"        "COMM"     
## [19] "CONC"      "COND"      "CONT"      "CUZ"       "DEMO"      "DMA"      
## [25] "DOAUX"     "DT"        "DWNT"      "ELAB"      "EMO"       "EMPH"     
## [31] "EX"        "EXIST"     "FPP1P"     "FPP1S"     "FPUH"      "FREQ"     
## [37] "GTO"       "HDG"       "HGOT"      "HST"       "IN"        "JJAT"     
## [43] "JJPR"      "LD"        "LIKE"      "MDCA"      "MDCO"      "MDMM"     
## [49] "MDNE"      "MDWO"      "MDWS"      "MENTAL"    "NCOMP"     "NN"       
## [55] "OCCUR"     "PASS"      "PEAS"      "PGET"      "PIT"       "PLACE"    
## [61] "POLITE"    "POS"       "PROG"      "PRP"       "QUAN"      "QUPR"     
## [67] "QUTAG"     "RB"        "RP"        "SO"        "SPLIT"     "SPP2"     
## [73] "STPR"      "THATD"     "THRC"      "THSC"      "TIME"      "TPP3P"    
## [79] "TPP3S"     "TTR"       "URL"       "VBD"       "VBG"       "VBN"      
## [85] "VIMP"      "VPRT"      "WHQU"      "WHSC"      "XX0"       "YNQU"

# Removal of meaningless features: CD because numbers as digits were mostly
# removed from the textbooks; LIKE and SO because they are dustbin categories
ncounts <- ncounts %>% select(-c(CD, LIKE, SO))

# Combine low frequency features into meaningful groups whenever this makes
# linguistic sense
ncounts <- ncounts %>% 
  mutate(JJPR = JJPR + ABLE, ABLE = NULL) %>% 
  mutate(PASS = PGET + PASS, PGET = NULL) %>% 
  mutate(TPP3 = TPP3S + TPP3P, TPP3P = NULL, TPP3S = NULL) %>% 
  mutate(FQTI = FREQ + TIME, FREQ = NULL, TIME = NULL)

zero_features <- as.data.frame(round(colSums(ncounts == 0)/nrow(ncounts) * 100, 2))  # Percentage of texts with 0 occurrences of each feature
colnames(zero_features) <- "Percentage_with_zero"
zero_features %>% filter(!is.na(zero_features)) %>% rownames_to_column() %>% arrange(Percentage_with_zero) %>% 
    filter(Percentage_with_zero > 66.6)

##   rowname Percentage_with_zero
## 1     PRP                85.34
## 2     URL                93.03
## 3     EMO                98.98
## 4     HST                99.55
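
The percentage of texts in which a feature never occurs comes from comparing the whole table with zero and averaging column-wise. A small sketch on toy data (all names and values are hypothetical):

```r
# Toy table: rows are texts, columns are features (hypothetical values)
toy <- data.frame(
  feat_a = c(0, 0, 0, 2),
  feat_b = c(1, 0, 3, 4),
  feat_c = c(5, 6, 7, 8)
)

# toy == 0 is a logical matrix; colSums() counts the TRUEs per feature
pct_zero <- round(colSums(toy == 0) / nrow(toy) * 100, 2)
pct_zero

# Features absent from more than two thirds of the texts
pct_zero[pct_zero > 66.6]
```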

zero_features <- as.data.frame(round(colSums(ncounts > 0)/nrow(ncounts) * 100, 2))  # Percentage of texts with > 0 occurrences of each feature (i.e., document frequency)
colnames(zero_features) <- "Percentage_above_zero"
zero_features %>% rownames_to_column() %>% filter(!is.na(zero_features)) %>% arrange(desc(Percentage_above_zero))

##    rowname Percentage_above_zero
## 1    Words                100.00
## 2      AWL                100.00
## 3       CC                100.00
## 4       DT                100.00
## 5       IN                100.00
## 6     JJAT                100.00
## 7       LD                100.00
## 8       NN                100.00
## 9      TTR                100.00
## 10     ACT                 99.98
## 11      RB                 99.98
## 12    BEMA                 99.94
## 13    JJPR                 99.92
## 14   NCOMP                 99.82
## 15  MENTAL                 99.69
## 16    QUAN                 99.69
## 17    VPRT                 99.57
## 18    TPP3                 99.29
## 19    WHSC                 99.02
## 20    FQTI                 98.96
## 21    COMM                 98.82
## 22    DEMO                 98.57
## 23     PIT                 98.41
## 24     VBD                 98.10
## 25     VBG                 96.80
## 26     XX0                 96.68
## 27    EMPH                 96.15
## 28    PASS                 95.31
## 29   EXIST                 94.27
## 30     POS                 93.58
## 31     VBN                 93.15
## 32   SPLIT                 92.37
## 33      RP                 92.15
## 34    THSC                 91.13
## 35   PLACE                 90.42
## 36   OCCUR                 90.19
## 37    PEAS                 89.52
## 38     AMP                 89.34
## 39   DOAUX                 88.89
## 40    CONT                 88.81
## 41  ASPECT                 88.22
## 42   CAUSE                 87.54
## 43    SPP2                 87.05
## 44    PROG                 86.73
## 45    MDCA                 86.30
## 46    VIMP                 85.12
## 47   FPP1P                 84.38
## 48      EX                 83.57
## 49    QUPR                 83.42
## 50    MDNE                 81.06
## 51    THRC                 80.51
## 52   THATD                 80.00
## 53    COND                 78.68
## 54     CUZ                 78.31
## 55   FPP1S                 78.17
## 56     DMA                 78.14
## 57    WHQU                 77.96
## 58    MDWS                 77.08
## 59    MDWO                 75.82
## 60     HDG                 75.43
## 61    MDCO                 73.70
## 62    YNQU                 72.03
## 63    FPUH                 68.85
## 64    CONC                 68.44
## 65    MDMM                 66.01
## 66    STPR                 64.83
## 67    DWNT                 60.37
## 68  POLITE                 59.84
## 69     GTO                 56.09
## 70    HGOT                 51.60
## 71    ELAB                 49.50
## 72   QUTAG                 45.71
## 73     PRP                 14.66
## 74     URL                  6.97
## 75     EMO                  1.02
## 76     HST                  0.45

docfreq.too.low <- zero_features %>% filter(!is.na(zero_features)) %>% subset(Percentage_above_zero < 
    33.3) %>% rownames_to_column() %>% select(rowname)  # Select all variables with a document frequency below 33.3%, i.e., features that occur in fewer than a third of the texts
docfreq.too.low

##   rowname
## 1     EMO
## 2     HST
## 3     PRP
## 4     URL
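
The same filtering logic can be sketched in base R: compute each column's document frequency, collect the names that fall below the threshold, and drop them. The data and the column names `feat_a` and `feat_b` are hypothetical; only the 33.3% threshold is taken from the code above.

```r
# Toy table with one metadata-like column and two features (hypothetical values)
toy <- data.frame(
  Words  = c(100, 120, 90, 110),
  feat_a = c(0, 0, 0, 2),   # occurs in 1 of 4 texts (25%)
  feat_b = c(1, 0, 3, 4)    # occurs in 3 of 4 texts (75%)
)

# Document frequency: percentage of texts with at least one occurrence
doc_freq <- colSums(toy > 0) / nrow(toy) * 100

# Names of the features that occur in fewer than a third of the texts
too_rare <- names(doc_freq)[doc_freq < 33.3]

# Drop them, keeping everything else
kept <- toy[, !(names(toy) %in% too_rare), drop = FALSE]
names(kept)
```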

ncounts <- select(ncounts, -one_of(docfreq.too.low$rowname))  # Drop these variables
colnames(ncounts)

##  [1] "Filename"  "Register"  "Level"     "Series"    "Country"   "Corpus"   
##  [7] "Subcorpus" "Words"     "ACT"       "AMP"       "ASPECT"    "AWL"      
## [13] "BEMA"      "CAUSE"     "CC"        "COMM"      "CONC"      "COND"     
## [19] "CONT"      "CUZ"       "DEMO"      "DMA"       "DOAUX"     "DT"       
## [25] "DWNT"      "ELAB"      "EMPH"      "EX"        "EXIST"     "FPP1P"    
## [31] "FPP1S"     "FPUH"      "GTO"       "HDG"       "HGOT"      "IN"       
## [37] "JJAT"      "JJPR"      "LD"        "MDCA"      "MDCO"      "MDMM"     
## [43] "MDNE"      "MDWO"      "MDWS"      "MENTAL"    "NCOMP"     "NN"       
## [49] "OCCUR"     "PASS"      "PEAS"      "PIT"       "PLACE"     "POLITE"   
## [55] "POS"       "PROG"      "QUAN"      "QUPR"      "QUTAG"     "RB"       
## [61] "RP"        "SPLIT"     "SPP2"      "STPR"      "THATD"     "THRC"     
## [67] "THSC"      "TTR"       "VBD"       "VBG"       "VBN"       "VIMP"     
## [73] "VPRT"      "WHQU"      "WHSC"      "XX0"       "YNQU"      "TPP3"     
## [79] "FQTI"

ncol(ncounts) - 8  # Number of linguistic features remaining

## [1] 71

# With five TEC registers
# saveRDS(ncounts, here("FullMDA", "ncounts2.rds")) # Last saved 18 November 2021

# With three TEC registers
# saveRDS(ncounts, here("FullMDA", "ncounts2_3Reg.rds")) # Last saved 9 Feb 2022

Standardising normalised counts and identifying potential outliers

“As an alternative to removing very sparse features, we apply a signed logarithmic transformation to deskew the feature distributions.” (Neumann & Evert)
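
As a rough illustration of what such a transformation does, the sketch below uses one common formulation, sign(x) * log(|x| + 1), applied to a toy vector of standardised scores. The function name `signed_log` is hypothetical, and it is an assumption that this matches the exact definition used by Neumann & Evert or elsewhere in this appendix.

```r
# One common formulation of a signed logarithmic transformation (assumed, see above)
signed_log <- function(x) sign(x) * log1p(abs(x))

# Toy z-scores, including two extreme values (hypothetical data)
z <- c(-9, -1, 0, 1, 9)
signed_log(z)

# The transformation keeps sign and rank order but compresses the extremes
range(z)
range(signed_log(z))
```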

# First scale the normalised counts (z-standardisation) to be able to compare the
# various features
zcounts <- ncounts %>% select(-Words) %>% keep(is.numeric) %>% scale()

boxplot(zcounts, las = 3, main = "z-scores")  # Slow
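
For reference, `scale()` with its default arguments subtracts each column's mean and divides by its standard deviation. The sketch below checks this against a manual computation on toy data (names and values are hypothetical):

```r
# Toy feature table (hypothetical values)
toy <- data.frame(
  feat_a = c(1, 2, 3, 4, 5),
  feat_b = c(10, 20, 30, 40, 100)
)

# z-standardisation with scale(): returns a matrix
z_scale <- scale(toy)

# The same computation written out column by column
z_manual <- sapply(toy, function(x) (x - mean(x)) / sd(x))

# Both approaches agree up to floating-point tolerance
all.equal(as.vector(z_scale), as.vector(z_manual))

# After scaling, every column has mean 0 and standard deviation 1
round(colMeans(z_scale), 10)
apply(z_scale, 2, sd)
```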

# If necessary, remove any outliers at this stage.
colnames(ncounts)

##  [1] "Filename"  "Register"  "Level"     "Series"    "Country"   "Corpus"   
##  [7] "Subcorpus" "Words"     "ACT"       "AMP"       "ASPECT"    "AWL"      
## [13] "BEMA"      "CAUSE"     "CC"        "COMM"      "CONC"      "COND"     
## [19] "CONT"      "CUZ"       "DEMO"      "DMA"       "DOAUX"     "DT"       
## [25] "DWNT"      "ELAB"      "EMPH"      "EX"        "EXIST"     "FPP1P"    
## [31] "FPP1S"     "FPUH"      "GTO"       "HDG"       "HGOT"      "IN"       
## [37] "JJAT"      "JJPR"      "LD"        "MDCA"      "MDCO"      "MDMM"     
## [43] "MDNE"      "MDWO"      "MDWS"      "MENTAL"    "NCOMP"     "NN"       
## [49] "OCCUR"     "PASS"      "PEAS"      "PIT"       "PLACE"     "POLITE"   
## [55] "POS"       "PROG"      "QUAN"      "QUPR"      "QUTAG"     "RB"       
## [61] "RP"        "SPLIT"     "SPP2"      "STPR"      "THATD"     "THRC"     
## [67] "THSC"      "TTR"       "VBD"       "VBG"       "VBN"       "VIMP"     
## [73] "VPRT"      "WHQU"      "WHSC"      "XX0"       "YNQU"      "TPP3"     
## [79] "FQTI"

data <- cbind(ncounts[, 1:8], as.data.frame(zcounts))
str(data)

## 'data.frame':    5095 obs. of  79 variables:
##  $ Filename : Factor w/ 5095 levels "1_BaumWizardOz_1.txt",..: 2789 1473 2801 1365 2533 1197 2556 4205 2590 2502 ...
##  $ Register : Factor w/ 3 levels "Conversation",..: 1 3 1 2 1 2 2 1 3 2 ...
##  $ Level    : Factor w/ 6 levels "A","B","C","D",..: 3 4 2 4 1 1 2 3 3 5 ...
##  $ Series   : Factor w/ 12 levels "Access","Achievers",..: 9 2 9 1 8 1 8 10 8 7 ...
##  $ Country  : Factor w/ 6 levels "France","Germany",..: 1 4 1 2 2 2 2 4 2 1 ...
##  $ Corpus   : Factor w/ 4 levels "Textbook.English",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Subcorpus: Factor w/ 6 levels "Textbook Conversation",..: 1 4 1 2 1 2 2 1 4 2 ...
##  $ Words    : int  750 690 694 547 927 840 1127 1090 635 976 ...
##  $ ACT      : num  0.0641 2.6962 -1.2416 -0.6036 -1.7311 ...
##  $ AMP      : num  0.0391 -0.5615 0.1452 3.2319 -0.744 ...
##  $ ASPECT   : num  0.182 1.743 -0.564 1.974 -1.099 ...
##  $ AWL      : num  -0.863 1.038 -1.074 -0.739 -1.053 ...
##  $ BEMA     : num  0.19 -0.606 2.059 0.276 4.415 ...
##  $ CAUSE    : num  -0.2546 0.0289 -0.352 -0.8463 0.615 ...
##  $ CC       : num  0.785 2.317 -0.838 -0.296 -0.782 ...
##  $ COMM     : num  -0.119 1.101 -1.254 -1.402 -1.575 ...
##  $ CONC     : num  -0.777 2.831 -0.777 -0.777 -0.777 ...
##  $ COND     : num  -0.412 -1.166 1.355 -1.166 -1.166 ...
##  $ CONT     : num  0.059 -0.999 0.719 -0.731 1.972 ...
##  $ CUZ      : num  -0.316 -1.034 -1.034 -0.345 -1.034 ...
##  $ DEMO     : num  -1.412 -1.668 0.451 -1.973 0.522 ...
##  $ DMA      : num  0.558 -0.797 0.667 -0.797 0.811 ...
##  $ DOAUX    : num  -1.023 -1.413 1.188 -0.293 -0.211 ...
##  $ DT       : num  0.755 -0.625 0.881 2.67 -1.786 ...
##  $ DWNT     : num  1.001 -0.767 -0.767 -0.767 -0.767 ...
##  $ ELAB     : num  -0.415 -0.415 0.896 -0.415 -0.415 ...
##  $ EMPH     : num  -0.145 -0.859 -0.589 -1.408 -0.795 ...
##  $ EX       : num  -1.222 -1.222 0.647 3.071 0.159 ...
##  $ EXIST    : num  1.4708 2.6525 0.0346 0.1877 -0.7416 ...
##  $ FPP1P    : num  2.24 2.69 2.13 2.59 1.16 ...
##  $ FPP1S    : num  2.612 -1.115 -0.174 -0.533 0.812 ...
##  $ FPUH     : num  -0.00108 -0.66909 0.1731 -0.66909 2.03315 ...
##  $ GTO      : num  1.895 -0.672 3.617 -0.672 -0.672 ...
##  $ HDG      : num  2.168 -0.233 -0.237 -0.9 0.589 ...
##  $ HGOT     : num  -0.66 -0.66 -0.66 -0.66 6.05 ...
##  $ IN       : num  0.262 0.533 -0.602 0.771 -1.559 ...
##  $ JJAT     : num  0.349 1.726 -0.744 -1.289 -1.994 ...
##  $ JJPR     : num  -0.478 -0.507 0.341 1.218 2.017 ...
##  $ LD       : num  -0.3301 1.2881 -0.586 -1.4218 0.0875 ...
##  $ MDCA     : num  -0.572 0.335 -0.262 0.298 1.252 ...
##  $ MDCO     : num  -0.0713 -0.9951 -0.9951 -0.1087 -0.9951 ...
##  $ MDMM     : num  -0.633 -0.633 -0.633 -0.633 -0.633 ...
##  $ MDNE     : num  1.12 2.2 2.63 -1.14 -1.14 ...
##  $ MDWO     : num  9.53 -1.07 -1.07 -1.07 -1.07 ...
##  $ MDWS     : num  0.0968 2.8613 -0.4999 -0.4362 -0.9278 ...
##  $ MENTAL   : num  3.152 0.613 0.479 0.523 -0.919 ...
##  $ NCOMP    : num  -0.601 2.051 -0.574 -1.788 1.321 ...
##  $ NN       : num  -0.377 1.251 -0.634 -0.368 0.334 ...
##  $ OCCUR    : num  1.344 -0.923 -0.923 0.817 -0.923 ...
##  $ PASS     : num  -0.499 -0.298 -0.744 -0.516 -0.919 ...
##  $ PEAS     : num  -1.321 -0.131 -1.321 -0.549 -1.321 ...
##  $ PIT      : num  -0.8628 -0.0712 1.5284 -0.7082 0.386 ...
##  $ PLACE    : num  -1.275 0.86 0.734 1.033 1.101 ...
##  $ POLITE   : num  1.356 -0.508 0.164 0.344 4.017 ...
##  $ POS      : num  -1.5355 -0.8173 -0.1405 0.0244 0.9175 ...
##  $ PROG     : num  -0.3931 0.8798 3.3607 0.0656 -1.4375 ...
##  $ QUAN     : num  1.024 -1.086 -0.605 0.108 -1.611 ...
##  $ QUPR     : num  1.318 0.607 -0.271 -0.756 -0.147 ...
##  $ QUTAG    : num  -0.554 -0.554 -0.554 -0.554 -0.554 ...
##  $ RB       : num  -0.924 -1.798 1.761 1.009 -1.127 ...
##  $ RP       : num  -0.0544 0.6924 -1.6129 0.3809 -1.6129 ...
##  $ SPLIT    : num  -0.806 0.45 -0.376 -1.447 -0.655 ...
##  $ SPP2     : num  1.06 1.95 1.14 -1.21 1.18 ...
##  $ STPR     : num  0.743 -0.87 -0.87 -0.87 -0.87 ...
##  $ THATD    : num  -1.247 -1.247 -1.247 -0.612 -1.247 ...
##  $ THRC     : num  -0.829 -0.829 -0.829 -0.829 -0.829 ...
##  $ THSC     : num  -1.371 -0.674 -0.977 -1.371 -1.08 ...
##  $ TTR      : num  0.564 1.78 -0.545 -0.188 -1.583 ...
##  $ VBD      : num  -0.00474 -0.76963 -1.32684 1.06003 -1.37555 ...
##  $ VBG      : num  -0.32 0.724 -1.295 -0.172 -1.055 ...
##  $ VBN      : num  -0.8756 -0.0798 -0.8756 -0.8756 -0.8756 ...
##  $ VIMP     : num  -0.039 0.798 0.996 -0.726 1.819 ...
##  $ VPRT     : num  -0.875 0.347 1.394 -0.803 1.46 ...
##  $ WHQU     : num  1.52 -1.02 1.46 -1.02 1.34 ...
##  $ WHSC     : num  -0.671 1.874 -0.614 0.212 -1.369 ...
##  $ XX0      : num  -0.613 -1.175 0.54 -0.933 1.409 ...
##  $ YNQU     : num  0.878 -0.897 4.856 -0.897 1.4 ...
##  $ TPP3     : num  -1.377 -0.791 -1.016 0.477 -0.946 ...
##  $ FQTI     : num  0.284 -0.873 -0.409 -0.538 0.586 ...

nrow(data)

## [1] 5095
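
The next step flags texts in which at least one standardised feature exceeds 8. A base R sketch of the same row-wise test on toy data follows. All names and values are hypothetical, and the two-sided variant at the end is an illustrative alternative, not something this notebook applies.

```r
# Toy table of z-scores plus two metadata columns (hypothetical values)
toy <- data.frame(
  Filename = c("a.txt", "b.txt", "c.txt"),
  Words    = c(500L, 9000L, 700L),
  feat_a   = c(0.2, -0.5, 9.5),
  feat_b   = c(-8.4, 1.1, 0.3)
)

# Numeric columns other than Words are treated as features
feature_cols <- setdiff(names(toy)[sapply(toy, is.numeric)], "Words")

# One-sided test: TRUE if any feature in the row has a z-score above 8
is_outlier <- rowSums(toy[, feature_cols, drop = FALSE] > 8) > 0
toy[is_outlier, c("Filename", "Words")]

# Two-sided variant: also catches extreme negative z-scores
is_outlier_abs <- rowSums(abs(toy[, feature_cols, drop = FALSE]) > 8) > 0
toy[is_outlier_abs, c("Filename", "Words")]
```

Note that the one-sided test ignores `a.txt`, whose extreme score is negative, and that `Words` is excluded so that long texts are not flagged merely for their length.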

# Flag texts with at least one feature whose z-score exceeds 8 (only positive deviations are tested)
outliers <- data %>% 
  filter(if_any(where(is.numeric) & !Words, .fns = function(x) {x > 8})) %>% 
  select(Filename, Corpus, Series, Register, Level, Words)
outliers

##                                                                                         Filename
## 1                                                                         POC_4e_Spoken_0007.txt
## 2                                                       Solutions_Elementary_ELF_Spoken_0013.txt
## 3                                                               EIM_Starter_Informative_0004.txt
## 4                                                                    GreenLine_1_Spoken_0003.txt
## 5                                                                       Access_1_Spoken_0011.txt
## 6                                                              Achievers_B1_Informative_0003.txt
## 7                                                                    EIM_Starter_Spoken_0002.txt
## 8                                                                    GreenLine_1_Spoken_0008.txt
## 9                                                                     JTT_3_Informative_0003.txt
## 10                                                                   GreenLine_1_Spoken_0010.txt
## 11                                                                         EIM_1_Spoken_0012.txt
## 12                                                                         NGL_1_Spoken_0013.txt
## 13                                                                         NGL_3_Spoken_0018.txt
## 14                                                        Solutions_Intermediate_Spoken_0029.txt
## 15                                                                         NGL_1_Spoken_0012.txt
## 16                                                                   GreenLine_1_Spoken_0006.txt
## 17                                                                   GreenLine_2_Spoken_0004.txt
## 18                                                                      Access_2_Spoken_0023.txt
## 19                                                                     HT_4_Informative_0006.txt
## 20                                                   Solutions_Intermediate_Informative_0017.txt
## 21                                                                         EIM_1_Spoken_0013.txt
## 22                                                      Solutions_Elementary_ELF_Spoken_0021.txt
## 23                                                   Solutions_Intermediate_Plus_Spoken_0022.txt
## 24                                                                      Access_2_Spoken_0028.txt
## 25                                                                         NGL_1_Spoken_0005.txt
## 26                                                      Solutions_Elementary_ELF_Spoken_0016.txt
## 27                                                Solutions_Pre-Intermediate_ELF_Spoken_0007.txt
## 28                                                   Solutions_Intermediate_Informative_0013.txt
## 29                                                                   GreenLine_2_Spoken_0003.txt
## 30                                                                          HT_4_Spoken_0010.txt
## 31                                                     Solutions_Elementary_Informative_0003.txt
## 32                                                                 Access_2_Informative_0001.txt
## 33                                                     Solutions_Elementary_Informative_0010.txt
## 34                                                              GreenLine_1_Informative_0001.txt
## 35                                                                      Access_2_Spoken_0002.txt
## 36                                                        Solutions_Intermediate_Spoken_0019.txt
## 37                                                                 Access_3_Informative_0003.txt
## 38                                                                      Access_1_Spoken_0019.txt
## 39                                                                      Access_2_Spoken_0013.txt
## 40                                              Solutions_Intermediate_Plus_Informative_0014.txt
## 41                                               Revision_World_GCSE_10525362_literary-terms.txt
## 42                             Revision_World_GCSE_10528697_p6-physics-radioactive-materials.txt
## 43                                                       Science_Tech_Kinds_NZ_10382383_math.txt
## 44                                   Science_for_students_10064820_scientists-say-metabolism.txt
## 45                                                  Science_Tech_Kinds_NZ_10382388_recycling.txt
## 46                                                     History_Kids_BBC_10404337_go_furthers.txt
## 47                                                     Science_Tech_Kinds_NZ_10382391_sports.txt
## 48                                    Teen_Kids_News_10402607_so-you-want-to-be-an-archivist.txt
## 49                                                    Science_Tech_Kinds_NZ_10382234_biology.txt
## 50                                                  Science_Tech_Kinds_NZ_10382372_astronomy.txt
## 51    Dogo_News_file10060404_banana-plant-extract-may-be-the-key-to-slower-melting-ice-cream.txt
## 52                                                  Science_Tech_Kinds_NZ_10382667_countries.txt
## 53                                    Quatr_us_file10390777_quick-summary-geological-erashtm.txt
## 54                                                    Science_Tech_Kinds_NZ_10382873_physics.txt
## 55                                                      Science_Tech_Kinds_NZ_10382382_light.txt
## 56                                                            Factmonster_10053687_august-13.txt
## 57                                            Revision_World_GCSE_10526703_limited-companies.txt
## 58                                            Revision_World_GCSE_10529637_transition-metals.txt
## 59                                                Quatr_us_10390856_early-african-historyhtm.txt
## 60                                             History_Kids_BBC_10401873_ff6_sicilylandingss.txt
## 61                                                                Quatr_us_10394250_harappan.txt
## 62                                                                Ducksters_10398301_iraqphp.txt
## 63                                       History_Kids_BBC_10403171_death_sakkara_gallery_04s.txt
## 64                                          Revision_World_GCSE_10528246_agricultural-change.txt
## 65                                      Revision_World_GCSE_10528086_uk-government-judiciary.txt
## 66                                                  Revision_World_GCSE_10529794_definitions.txt
## 67                                   Encyclopedia_Kinds_au_10085347_Nobel_Prize_in_Chemistry.txt
## 68       Science_for_students_10064875_questions-big-melt-earths-ice-sheets-are-under-attack.txt
## 69                       Teen_Kids_News_10403301_golden-globe-winners-2019-the-complete-list.txt
## 70                                                   Science_Tech_Kinds_NZ_10382201_projects.txt
## 71                                                  Revision_World_GCSE_10529753_probability.txt
## 72                                           Encyclopedia_Kinds_au_10085531_Complex_analysis.txt
## 73                                                       History_Kids_BBC_10401890_ff7_ddays.txt
## 74                                                                History_Kids_BBC_10403434s.txt
## 75                                                      History_Kids_BBC_10401872_ff6_italys.txt
## 76                                                    Science_Tech_Kinds_NZ_10382371_amazing.txt
## 77                                                              Quatr_us_10391129_athabascan.txt
## 78                                               Encyclopedia_Kinds_au_10085355_20th_century.txt
## 79  Dogo_News_10060755_luxury-space-hotel-promises-guests-a-truly-out-of-this-world-vacation.txt
## 80                                         Revision_World_GCSE_10528072_nationalism-practice.txt
## 81                                              Quatr_us_10390861_quatr-us-privacy-policyhtm.txt
## 82                                                      History_Kids_BBC_10401909_ff7_bulges.txt
## 83                                             History_kids_10381259_timeline-of-mesopotamia.txt
## 84                    Revision_World_GCSE_10528123_gender-written-textual-analysis-framework.txt
## 85                                                     Science_Tech_Kinds_NZ_10386406_floods.txt
## 86                                                   Revision_World_GCSE_10529693_advantages.txt
## 87                                                  Science_Tech_Kinds_NZ_10382378_geography.txt
## 88                                                      Science_Tech_Kinds_NZ_10382374_earth.txt
## 89                 Science_for_students_10066286_watering-plants-wastewater-can-spread-germs.txt
## 90                                                      Science_Tech_Kinds_NZ_10382393_water.txt
## 91                                                     World_Dteen_10406069_website_policies.txt
## 92                                                     Science_Tech_Kinds_NZ_10382384_metals.txt
## 93    Dogo_News_10062028_puppy-bowl-14-promises-viewers-a-paw-some-time-on-super-bowl-sunday.txt
## 94                                                     History_Kids_BBC_10404730_go_furthers.txt
## 95                                                     Science_Tech_Kinds_NZ_10382385_nature.txt
## 96                               Science_for_students_10065015_scientists-say-dna-sequencing.txt
## 97                                  Quatr_us_file10390817_conifers-pine-trees-gymnospermshtm.txt
## 98                                         TweenTribute_10051509_it-true-elephants-cant-jump.txt
## 99                                         Revision_World_GCSE_10528494_application-software.txt
## 100                      Revision_World_GCSE_10529581_different-types-questions-examinations.txt
## 101        Dogo_News_10061669_the-chinese-city-of-chengdu-may-soon-be-home-to-multiple-moons.txt
## 102                                         Ducksters_10398306_geography_of_ancient_chinaphp.txt
## 103                                  Science_for_students_10065144_scientists-say-multiverse.txt
## 104                                                    Science_Tech_Kinds_NZ_10382211_images.txt
## 105                                                              Factmonster_10053754_may-18.txt
## 106                                                      World_Dteen_10406047_AboutWORLDteen.txt
## 107                                                     Ducksters_10398078_first_new_dealphp.txt
## 108                                             Revision_World_GCSE_10526926_economies-scale.txt
## 109                                                        Factmonster_10053201_september-03.txt
## 110                                         Science_Tech_Kinds_NZ_10387183_calciumcarbonates.txt
## 111                                                    Science_Tech_Kinds_NZ_10382380_health.txt
## 112                                             Revision_World_GCSE_10529587_sources-finance.txt
## 113                                                                Quatr_us_10393444_fishing.txt
## 114                                                 Ducksters_10398315_glossary_and_termsphp.txt
## 115                                                                                     S5AA.txt
##                Corpus         Series     Register Level Words
## 1    Textbook.English            POC Conversation     C   750
## 2    Textbook.English      Solutions Conversation     A   931
## 3    Textbook.English            EIM  Informative     A   534
## 4    Textbook.English      GreenLine Conversation     A   970
## 5    Textbook.English         Access Conversation     A   784
## 6    Textbook.English      Achievers  Informative     C   926
## 7    Textbook.English            EIM Conversation     A   824
## 8    Textbook.English      GreenLine Conversation     A   876
## 9    Textbook.English            JTT  Informative     D   699
## 10   Textbook.English      GreenLine Conversation     A   701
## 11   Textbook.English            EIM Conversation     B   640
## 12   Textbook.English            NGL Conversation     A   940
## 13   Textbook.English            NGL Conversation     C   751
## 14   Textbook.English      Solutions Conversation     C   672
## 15   Textbook.English            NGL Conversation     A   910
## 16   Textbook.English      GreenLine Conversation     A   622
## 17   Textbook.English      GreenLine Conversation     B  1102
## 18   Textbook.English         Access Conversation     B   875
## 19   Textbook.English             HT  Informative     C   513
## 20   Textbook.English      Solutions  Informative     C   816
## 21   Textbook.English            EIM Conversation     B   967
## 22   Textbook.English      Solutions Conversation     A   846
## 23   Textbook.English      Solutions Conversation     D   596
## 24   Textbook.English         Access Conversation     B   813
## 25   Textbook.English            NGL Conversation     A  1020
## 26   Textbook.English      Solutions Conversation     A   871
## 27   Textbook.English      Solutions Conversation     B   630
## 28   Textbook.English      Solutions  Informative     C   770
## 29   Textbook.English      GreenLine Conversation     B   850
## 30   Textbook.English             HT Conversation     C   727
## 31   Textbook.English      Solutions  Informative     A  1051
## 32   Textbook.English         Access  Informative     B   655
## 33   Textbook.English      Solutions  Informative     A   708
## 34   Textbook.English      GreenLine  Informative     A   731
## 35   Textbook.English         Access Conversation     B   572
## 36   Textbook.English      Solutions Conversation     C  1024
## 37   Textbook.English         Access  Informative     C  1000
## 38   Textbook.English         Access Conversation     A   701
## 39   Textbook.English         Access Conversation     B   981
## 40   Textbook.English      Solutions  Informative     D   537
## 41  Informative.Teens     Info Teens  Informative  Ref.   790
## 42  Informative.Teens     Info Teens  Informative  Ref.  1015
## 43  Informative.Teens     Info Teens  Informative  Ref.   522
## 44  Informative.Teens     Info Teens  Informative  Ref.   895
## 45  Informative.Teens     Info Teens  Informative  Ref.   666
## 46  Informative.Teens     Info Teens  Informative  Ref.   620
## 47  Informative.Teens     Info Teens  Informative  Ref.   657
## 48  Informative.Teens     Info Teens  Informative  Ref.   763
## 49  Informative.Teens     Info Teens  Informative  Ref.   843
## 50  Informative.Teens     Info Teens  Informative  Ref.   900
## 51  Informative.Teens     Info Teens  Informative  Ref.   611
## 52  Informative.Teens     Info Teens  Informative  Ref.   717
## 53  Informative.Teens     Info Teens  Informative  Ref.   643
## 54  Informative.Teens     Info Teens  Informative  Ref.   722
## 55  Informative.Teens     Info Teens  Informative  Ref.   639
## 56  Informative.Teens     Info Teens  Informative  Ref.   523
## 57  Informative.Teens     Info Teens  Informative  Ref.   714
## 58  Informative.Teens     Info Teens  Informative  Ref.   787
## 59  Informative.Teens     Info Teens  Informative  Ref.  1136
## 60  Informative.Teens     Info Teens  Informative  Ref.   813
## 61  Informative.Teens     Info Teens  Informative  Ref.   651
## 62  Informative.Teens     Info Teens  Informative  Ref.   657
## 63  Informative.Teens     Info Teens  Informative  Ref.   844
## 64  Informative.Teens     Info Teens  Informative  Ref.   789
## 65  Informative.Teens     Info Teens  Informative  Ref.  1019
## 66  Informative.Teens     Info Teens  Informative  Ref.   904
## 67  Informative.Teens     Info Teens  Informative  Ref.   598
## 68  Informative.Teens     Info Teens  Informative  Ref.   685
## 69  Informative.Teens     Info Teens  Informative  Ref.   800
## 70  Informative.Teens     Info Teens  Informative  Ref.   947
## 71  Informative.Teens     Info Teens  Informative  Ref.   816
## 72  Informative.Teens     Info Teens  Informative  Ref.   735
## 73  Informative.Teens     Info Teens  Informative  Ref.   759
## 74  Informative.Teens     Info Teens  Informative  Ref.   732
## 75  Informative.Teens     Info Teens  Informative  Ref.   786
## 76  Informative.Teens     Info Teens  Informative  Ref.   629
## 77  Informative.Teens     Info Teens  Informative  Ref.   637
## 78  Informative.Teens     Info Teens  Informative  Ref.   864
## 79  Informative.Teens     Info Teens  Informative  Ref.   722
## 80  Informative.Teens     Info Teens  Informative  Ref.   776
## 81  Informative.Teens     Info Teens  Informative  Ref.   960
## 82  Informative.Teens     Info Teens  Informative  Ref.   732
## 83  Informative.Teens     Info Teens  Informative  Ref.   768
## 84  Informative.Teens     Info Teens  Informative  Ref.   905
## 85  Informative.Teens     Info Teens  Informative  Ref.   580
## 86  Informative.Teens     Info Teens  Informative  Ref.   782
## 87  Informative.Teens     Info Teens  Informative  Ref.   761
## 88  Informative.Teens     Info Teens  Informative  Ref.   726
## 89  Informative.Teens     Info Teens  Informative  Ref.   836
## 90  Informative.Teens     Info Teens  Informative  Ref.   856
## 91  Informative.Teens     Info Teens  Informative  Ref.   995
## 92  Informative.Teens     Info Teens  Informative  Ref.   669
## 93  Informative.Teens     Info Teens  Informative  Ref.   581
## 94  Informative.Teens     Info Teens  Informative  Ref.   611
## 95  Informative.Teens     Info Teens  Informative  Ref.   722
## 96  Informative.Teens     Info Teens  Informative  Ref.   953
## 97  Informative.Teens     Info Teens  Informative  Ref.   533
## 98  Informative.Teens     Info Teens  Informative  Ref.   790
## 99  Informative.Teens     Info Teens  Informative  Ref.   855
## 100 Informative.Teens     Info Teens  Informative  Ref.   742
## 101 Informative.Teens     Info Teens  Informative  Ref.   614
## 102 Informative.Teens     Info Teens  Informative  Ref.   638
## 103 Informative.Teens     Info Teens  Informative  Ref.   712
## 104 Informative.Teens     Info Teens  Informative  Ref.   793
## 105 Informative.Teens     Info Teens  Informative  Ref.   497
## 106 Informative.Teens     Info Teens  Informative  Ref.  1053
## 107 Informative.Teens     Info Teens  Informative  Ref.   649
## 108 Informative.Teens     Info Teens  Informative  Ref.   621
## 109 Informative.Teens     Info Teens  Informative  Ref.   445
## 110 Informative.Teens     Info Teens  Informative  Ref.   804
## 111 Informative.Teens     Info Teens  Informative  Ref.   694
## 112 Informative.Teens     Info Teens  Informative  Ref.   665
## 113 Informative.Teens     Info Teens  Informative  Ref.   656
## 114 Informative.Teens     Info Teens  Informative  Ref.   684
## 115    Spoken.BNC2014 Spoken BNC2014 Conversation  Ref.  1869

outliers %>% select(Filename)

##                                                                                         Filename
## 1                                                                         POC_4e_Spoken_0007.txt
## 2                                                       Solutions_Elementary_ELF_Spoken_0013.txt
## 3                                                               EIM_Starter_Informative_0004.txt
## 4                                                                    GreenLine_1_Spoken_0003.txt
## 5                                                                       Access_1_Spoken_0011.txt
## 6                                                              Achievers_B1_Informative_0003.txt
## 7                                                                    EIM_Starter_Spoken_0002.txt
## 8                                                                    GreenLine_1_Spoken_0008.txt
## 9                                                                     JTT_3_Informative_0003.txt
## 10                                                                   GreenLine_1_Spoken_0010.txt
## 11                                                                         EIM_1_Spoken_0012.txt
## 12                                                                         NGL_1_Spoken_0013.txt
## 13                                                                         NGL_3_Spoken_0018.txt
## 14                                                        Solutions_Intermediate_Spoken_0029.txt
## 15                                                                         NGL_1_Spoken_0012.txt
## 16                                                                   GreenLine_1_Spoken_0006.txt
## 17                                                                   GreenLine_2_Spoken_0004.txt
## 18                                                                      Access_2_Spoken_0023.txt
## 19                                                                     HT_4_Informative_0006.txt
## 20                                                   Solutions_Intermediate_Informative_0017.txt
## 21                                                                         EIM_1_Spoken_0013.txt
## 22                                                      Solutions_Elementary_ELF_Spoken_0021.txt
## 23                                                   Solutions_Intermediate_Plus_Spoken_0022.txt
## 24                                                                      Access_2_Spoken_0028.txt
## 25                                                                         NGL_1_Spoken_0005.txt
## 26                                                      Solutions_Elementary_ELF_Spoken_0016.txt
## 27                                                Solutions_Pre-Intermediate_ELF_Spoken_0007.txt
## 28                                                   Solutions_Intermediate_Informative_0013.txt
## 29                                                                   GreenLine_2_Spoken_0003.txt
## 30                                                                          HT_4_Spoken_0010.txt
## 31                                                     Solutions_Elementary_Informative_0003.txt
## 32                                                                 Access_2_Informative_0001.txt
## 33                                                     Solutions_Elementary_Informative_0010.txt
## 34                                                              GreenLine_1_Informative_0001.txt
## 35                                                                      Access_2_Spoken_0002.txt
## 36                                                        Solutions_Intermediate_Spoken_0019.txt
## 37                                                                 Access_3_Informative_0003.txt
## 38                                                                      Access_1_Spoken_0019.txt
## 39                                                                      Access_2_Spoken_0013.txt
## 40                                              Solutions_Intermediate_Plus_Informative_0014.txt
## 41                                               Revision_World_GCSE_10525362_literary-terms.txt
## 42                             Revision_World_GCSE_10528697_p6-physics-radioactive-materials.txt
## 43                                                       Science_Tech_Kinds_NZ_10382383_math.txt
## 44                                   Science_for_students_10064820_scientists-say-metabolism.txt
## 45                                                  Science_Tech_Kinds_NZ_10382388_recycling.txt
## 46                                                     History_Kids_BBC_10404337_go_furthers.txt
## 47                                                     Science_Tech_Kinds_NZ_10382391_sports.txt
## 48                                    Teen_Kids_News_10402607_so-you-want-to-be-an-archivist.txt
## 49                                                    Science_Tech_Kinds_NZ_10382234_biology.txt
## 50                                                  Science_Tech_Kinds_NZ_10382372_astronomy.txt
## 51    Dogo_News_file10060404_banana-plant-extract-may-be-the-key-to-slower-melting-ice-cream.txt
## 52                                                  Science_Tech_Kinds_NZ_10382667_countries.txt
## 53                                    Quatr_us_file10390777_quick-summary-geological-erashtm.txt
## 54                                                    Science_Tech_Kinds_NZ_10382873_physics.txt
## 55                                                      Science_Tech_Kinds_NZ_10382382_light.txt
## 56                                                            Factmonster_10053687_august-13.txt
## 57                                            Revision_World_GCSE_10526703_limited-companies.txt
## 58                                            Revision_World_GCSE_10529637_transition-metals.txt
## 59                                                Quatr_us_10390856_early-african-historyhtm.txt
## 60                                             History_Kids_BBC_10401873_ff6_sicilylandingss.txt
## 61                                                                Quatr_us_10394250_harappan.txt
## 62                                                                Ducksters_10398301_iraqphp.txt
## 63                                       History_Kids_BBC_10403171_death_sakkara_gallery_04s.txt
## 64                                          Revision_World_GCSE_10528246_agricultural-change.txt
## 65                                      Revision_World_GCSE_10528086_uk-government-judiciary.txt
## 66                                                  Revision_World_GCSE_10529794_definitions.txt
## 67                                   Encyclopedia_Kinds_au_10085347_Nobel_Prize_in_Chemistry.txt
## 68       Science_for_students_10064875_questions-big-melt-earths-ice-sheets-are-under-attack.txt
## 69                       Teen_Kids_News_10403301_golden-globe-winners-2019-the-complete-list.txt
## 70                                                   Science_Tech_Kinds_NZ_10382201_projects.txt
## 71                                                  Revision_World_GCSE_10529753_probability.txt
## 72                                           Encyclopedia_Kinds_au_10085531_Complex_analysis.txt
## 73                                                       History_Kids_BBC_10401890_ff7_ddays.txt
## 74                                                                History_Kids_BBC_10403434s.txt
## 75                                                      History_Kids_BBC_10401872_ff6_italys.txt
## 76                                                    Science_Tech_Kinds_NZ_10382371_amazing.txt
## 77                                                              Quatr_us_10391129_athabascan.txt
## 78                                               Encyclopedia_Kinds_au_10085355_20th_century.txt
## 79  Dogo_News_10060755_luxury-space-hotel-promises-guests-a-truly-out-of-this-world-vacation.txt
## 80                                         Revision_World_GCSE_10528072_nationalism-practice.txt
## 81                                              Quatr_us_10390861_quatr-us-privacy-policyhtm.txt
## 82                                                      History_Kids_BBC_10401909_ff7_bulges.txt
## 83                                             History_kids_10381259_timeline-of-mesopotamia.txt
## 84                    Revision_World_GCSE_10528123_gender-written-textual-analysis-framework.txt
## 85                                                     Science_Tech_Kinds_NZ_10386406_floods.txt
## 86                                                   Revision_World_GCSE_10529693_advantages.txt
## 87                                                  Science_Tech_Kinds_NZ_10382378_geography.txt
## 88                                                      Science_Tech_Kinds_NZ_10382374_earth.txt
## 89                 Science_for_students_10066286_watering-plants-wastewater-can-spread-germs.txt
## 90                                                      Science_Tech_Kinds_NZ_10382393_water.txt
## 91                                                     World_Dteen_10406069_website_policies.txt
## 92                                                     Science_Tech_Kinds_NZ_10382384_metals.txt
## 93    Dogo_News_10062028_puppy-bowl-14-promises-viewers-a-paw-some-time-on-super-bowl-sunday.txt
## 94                                                     History_Kids_BBC_10404730_go_furthers.txt
## 95                                                     Science_Tech_Kinds_NZ_10382385_nature.txt
## 96                               Science_for_students_10065015_scientists-say-dna-sequencing.txt
## 97                                  Quatr_us_file10390817_conifers-pine-trees-gymnospermshtm.txt
## 98                                         TweenTribute_10051509_it-true-elephants-cant-jump.txt
## 99                                         Revision_World_GCSE_10528494_application-software.txt
## 100                      Revision_World_GCSE_10529581_different-types-questions-examinations.txt
## 101        Dogo_News_10061669_the-chinese-city-of-chengdu-may-soon-be-home-to-multiple-moons.txt
## 102                                         Ducksters_10398306_geography_of_ancient_chinaphp.txt
## 103                                  Science_for_students_10065144_scientists-say-multiverse.txt
## 104                                                    Science_Tech_Kinds_NZ_10382211_images.txt
## 105                                                              Factmonster_10053754_may-18.txt
## 106                                                      World_Dteen_10406047_AboutWORLDteen.txt
## 107                                                     Ducksters_10398078_first_new_dealphp.txt
## 108                                             Revision_World_GCSE_10526926_economies-scale.txt
## 109                                                        Factmonster_10053201_september-03.txt
## 110                                         Science_Tech_Kinds_NZ_10387183_calciumcarbonates.txt
## 111                                                    Science_Tech_Kinds_NZ_10382380_health.txt
## 112                                             Revision_World_GCSE_10529587_sources-finance.txt
## 113                                                                Quatr_us_10393444_fishing.txt
## 114                                                 Ducksters_10398315_glossary_and_termsphp.txt
## 115                                                                                     S5AA.txt

# Checking that outlier texts are not particularly long or short texts
summary(outliers$Words)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   445.0   655.5   751.0   773.6   860.0  1869.0

histogram(outliers$Words, breaks = 30)

summary(data$Words)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     427     788    1179    4428    5999  148358

# Distribution of outlier texts
summary(outliers$Corpus)

##  Textbook.English Informative.Teens    Spoken.BNC2014     Youth.Fiction 
##                40                74                 1                 0

# Manually checking a sample of these outliers:

# Encyclopedia_Kinds_au_10085347_Nobel_Prize_in_Chemistry.txt is essentially a
# list of Nobel prize winners but with some additional information. Hence a good
# representative of the type of texts of the ITTC.
# Solutions_Elementary_ELF_Spoken_0013 --> Has a lot of 'going to' constructions
# because they are learnt in this chapter but is otherwise a well-formed text.
# Teen_Kids_News_10403972_a-brief-history-of-white-house-weddings --> No issues
# Teen_Kids_News_10403301_golden-globe-winners-2019-the-complete-list --> Similar
# to the Nobel prize laureates text.
# Revision_World_GCSE_10528123_gender-written-textual-analysis-framework --> Text
# includes bullet points tokenised as the letter 'o' but otherwise a fairly
# typical informative text.

# Removing the outliers
ncounts <- ncounts %>% filter(!Filename %in% outliers$Filename)

nrow(ncounts)

## [1] 4980

# saveRDS(ncounts, here('FullMDA', 'ncounts3_3Reg.rds')) # Last saved 9 Feb 2022

zcounts <- ncounts %>% select(-Words) %>% keep(is.numeric) %>% scale()

nrow(zcounts)

## [1] 4980

boxplot(zcounts, las = 3, main = "z-scores")  # Slow to open!
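
As a quick illustration of what the standardisation step does, the following minimal sketch applies scale() to a small matrix of made-up normalised counts (the values and the object names `m` and `z_toy` are purely hypothetical). After scaling, each column should have a mean of 0 and a standard deviation of 1.

```r
# Hypothetical normalised counts for two features across four texts
m <- cbind(NN = c(200, 250, 300, 350), VBD = c(10, 40, 20, 30))

# scale() centres each column on its mean and divides it by its standard deviation
z_toy <- scale(m)

round(colMeans(z_toy), 10)  # column means after scaling
apply(z_toy, 2, sd)         # column standard deviations after scaling
```

Because scale() returns a matrix with the rows in their original order, the z-scores can later be recombined with the metadata columns by position.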

Transforming the features to (partially) deskew these distributions

signed.log <- function(x) {sign(x)*log(abs(x)+1)}

zlogcounts <- signed.log(zcounts) # Standardise first, then signed log transform

boxplot(zlogcounts, las=3, main="log-transformed z-scores")

# With three TEC registers
#saveRDS(zlogcounts, here("FullMDA", "zlogcounts_3Reg.rds")) # Last saved 9 Feb 2022

# With five TEC registers
#saveRDS(zlogcounts, here("FullMDA", "zlogcounts.rds")) # Last saved 18 November
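
To make the effect of the signed log transformation more concrete, here is a small sketch applied to a handful of made-up z-scores. The function definition is repeated from above so that the snippet runs on its own; the inverse function `signed.exp` is a hypothetical helper added only to show that the transformation is reversible.

```r
# Same definition as in the chunk above
signed.log <- function(x) {sign(x) * log(abs(x) + 1)}

# Hypothetical z-scores, including two extreme values
z <- c(-10, -1, 0, 1, 10)
z_log <- signed.log(z)
round(z_log, 2)  # extreme values are pulled in, sign and order are preserved

# Hypothetical inverse: recovers the original z-scores
signed.exp <- function(y) {sign(y) * (exp(abs(y)) - 1)}
all.equal(signed.exp(z_log), z)
```

The transformation is symmetric around zero and strictly increasing, so it compresses long tails on both sides without changing the ranking of the texts on any feature.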

zlogcounts %>%
  as.data.frame() %>% 
  # gather() from tidyr converts a selection of variables into two variables:
  # a key (the names of the original variables) and a value (the data).
  # This makes it possible to use facet_wrap() from ggplot2.
  gather() %>%
  ggplot(aes(value)) +
  theme_bw() +
  facet_wrap(~ key, scales = "free", ncol = 4) +
  scale_x_continuous(expand=c(0,0)) +
  scale_y_continuous(limits = c(0,NA)) +
  geom_histogram(aes(y = ..density..), bins = 30, colour= "black", fill = "grey") +
  geom_density(colour = "darkred", weight = 2, fill="darkred", alpha = .4)

#ggsave(here("Plots", "DensityPlotsAllVariablesSignedLog.svg"), width = 15, height = 49)

Merging of data for MDA

# With five TEC registers
# zlogcounts <- readRDS(here('FullMDA', 'zlogcounts.rds'))
# nrow(zlogcounts)
# colnames(zlogcounts)
# ncounts <- readRDS(here('FullMDA', 'ncounts2.rds'))
# nrow(ncounts)
# colnames(ncounts)
# data <- cbind(ncounts[,1:7], as.data.frame(zlogcounts))
# str(data)
# saveRDS(data, here('FullMDA', 'datazlogcounts.rds')) # Last saved 18 November

# With three TEC registers
zlogcounts <- readRDS(here("FullMDA", "zlogcounts_3Reg.rds"))
nrow(zlogcounts)

## [1] 4980

colnames(zlogcounts)

##  [1] "ACT"    "AMP"    "ASPECT" "AWL"    "BEMA"   "CAUSE"  "CC"     "COMM"  
##  [9] "CONC"   "COND"   "CONT"   "CUZ"    "DEMO"   "DMA"    "DOAUX"  "DT"    
## [17] "DWNT"   "ELAB"   "EMPH"   "EX"     "EXIST"  "FPP1P"  "FPP1S"  "FPUH"  
## [25] "GTO"    "HDG"    "HGOT"   "IN"     "JJAT"   "JJPR"   "LD"     "MDCA"  
## [33] "MDCO"   "MDMM"   "MDNE"   "MDWO"   "MDWS"   "MENTAL" "NCOMP"  "NN"    
## [41] "OCCUR"  "PASS"   "PEAS"   "PIT"    "PLACE"  "POLITE" "POS"    "PROG"  
## [49] "QUAN"   "QUPR"   "QUTAG"  "RB"     "RP"     "SPLIT"  "SPP2"   "STPR"  
## [57] "THATD"  "THRC"   "THSC"   "TTR"    "VBD"    "VBG"    "VBN"    "VIMP"  
## [65] "VPRT"   "WHQU"   "WHSC"   "XX0"    "YNQU"   "TPP3"   "FQTI"

ncounts <- readRDS(here("FullMDA", "ncounts3_3Reg.rds"))
nrow(ncounts)

## [1] 4980

colnames(ncounts)

##  [1] "Filename"  "Register"  "Level"     "Series"    "Country"   "Corpus"   
##  [7] "Subcorpus" "Words"     "ACT"       "AMP"       "ASPECT"    "AWL"      
## [13] "BEMA"      "CAUSE"     "CC"        "COMM"      "CONC"      "COND"     
## [19] "CONT"      "CUZ"       "DEMO"      "DMA"       "DOAUX"     "DT"       
## [25] "DWNT"      "ELAB"      "EMPH"      "EX"        "EXIST"     "FPP1P"    
## [31] "FPP1S"     "FPUH"      "GTO"       "HDG"       "HGOT"      "IN"       
## [37] "JJAT"      "JJPR"      "LD"        "MDCA"      "MDCO"      "MDMM"     
## [43] "MDNE"      "MDWO"      "MDWS"      "MENTAL"    "NCOMP"     "NN"       
## [49] "OCCUR"     "PASS"      "PEAS"      "PIT"       "PLACE"     "POLITE"   
## [55] "POS"       "PROG"      "QUAN"      "QUPR"      "QUTAG"     "RB"       
## [61] "RP"        "SPLIT"     "SPP2"      "STPR"      "THATD"     "THRC"     
## [67] "THSC"      "TTR"       "VBD"       "VBG"       "VBN"       "VIMP"     
## [73] "VPRT"      "WHQU"      "WHSC"      "XX0"       "YNQU"      "TPP3"     
## [79] "FQTI"

data <- cbind(ncounts[, 1:8], as.data.frame(zlogcounts))
colnames(data)

##  [1] "Filename"  "Register"  "Level"     "Series"    "Country"   "Corpus"   
##  [7] "Subcorpus" "Words"     "ACT"       "AMP"       "ASPECT"    "AWL"      
## [13] "BEMA"      "CAUSE"     "CC"        "COMM"      "CONC"      "COND"     
## [19] "CONT"      "CUZ"       "DEMO"      "DMA"       "DOAUX"     "DT"       
## [25] "DWNT"      "ELAB"      "EMPH"      "EX"        "EXIST"     "FPP1P"    
## [31] "FPP1S"     "FPUH"      "GTO"       "HDG"       "HGOT"      "IN"       
## [37] "JJAT"      "JJPR"      "LD"        "MDCA"      "MDCO"      "MDMM"     
## [43] "MDNE"      "MDWO"      "MDWS"      "MENTAL"    "NCOMP"     "NN"       
## [49] "OCCUR"     "PASS"      "PEAS"      "PIT"       "PLACE"     "POLITE"   
## [55] "POS"       "PROG"      "QUAN"      "QUPR"      "QUTAG"     "RB"       
## [61] "RP"        "SPLIT"     "SPP2"      "STPR"      "THATD"     "THRC"     
## [67] "THSC"      "TTR"       "VBD"       "VBG"       "VBN"       "VIMP"     
## [73] "VPRT"      "WHQU"      "WHSC"      "XX0"       "YNQU"      "TPP3"     
## [79] "FQTI"

# saveRDS(data, here('FullMDA', 'datazlogcounts_3Reg.rds')) # Last saved 9 Feb 2022
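
Since cbind() combines the metadata and the transformed counts purely by row position, it can be worth checking that the two objects line up before merging. The following is only a sketch on toy data: `meta` and `feats` are hypothetical stand-ins for `ncounts[, 1:8]` and `zlogcounts`, and the values are invented.

```r
# Toy stand-ins for the metadata columns and the transformed feature matrix
meta <- data.frame(Filename = c("a.txt", "b.txt", "c.txt"),
                   Words = c(700, 950, 1200))
feats <- matrix(c(0.2, -1.1, 0.9, 1.5, 0.0, -0.4), nrow = 3,
                dimnames = list(NULL, c("NN", "VBD")))

# cbind() matches rows by position only, so the row counts must agree
stopifnot(nrow(meta) == nrow(feats))
# No feature column should share a name with a metadata column
stopifnot(!any(colnames(feats) %in% colnames(meta)))

merged <- cbind(meta, as.data.frame(feats))
str(merged)
```

In the notebook itself, both objects are derived from the same filtered `ncounts` table without any reordering, which is why the positional merge is expected to be safe.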

Performing the PCA of Textbook English vs. Reference corpora

Quick import

# With five TEC registers
# data <- readRDS(here('FullMDA', 'datazlogcounts.rds'))

# With three TEC registers
data <- readRDS(here("FullMDA", "datazlogcounts_3Reg.rds"))

summary(data$Corpus)

##  Textbook.English Informative.Teens    Spoken.BNC2014     Youth.Fiction 
##              1202              1337              1250              1191

summary(data$Subcorpus)

## Textbook Conversation      Textbook Fiction       Info Teens Ref. 
##                   565                   285                  1337 
##  Textbook Informative   Spoken BNC2014 Ref.    Youth Fiction Ref. 
##                   352                  1250                  1191

# This rearranges the levels in the desired order for the plot legends:
data <- data %>% mutate(Subcorpus = fct_relevel(Subcorpus, "Info Teens Ref.", after = 9))

Testing factorability of data

Correlation matrix: only for data exploration, not used to exclude variables

# From:
# https://towardsdatascience.com/how-to-create-a-correlation-matrix-with-too-many-variables-309cc0c0a57
# Function adapted to my needs

colnames(data)

corr <- cor(data[9:ncol(data)])
# prepare to drop duplicates and correlations of 1
corr[lower.tri(corr, diag = TRUE)] <- NA
# drop perfect correlations
corr[corr == 1] <- NA
# turn into a 3-column table
corr <- as.data.frame(as.table(corr))
# remove the NA values from above
corr <- na.omit(corr)

# Uninteresting variable correlations?
lowcor <- subset(corr, abs(Freq) < 0.3)
# lowcor %>% filter(Var2=='CC'|Var1=='CC') %>% round(Freq, 2)
# keep only correlations with |r| > 0.3
corr <- subset(corr, abs(Freq) > 0.3)
# sort by highest correlation
corr <- corr[order(-abs(corr$Freq)), ]
# see which variables might be eliminated: those that do not correlate with
# any other variable at |r| > 0.3 (i.e. with a count of 0 below)
eliminate <- as.data.frame((summary(corr$Var1) + summary(corr$Var2)))
(LowcCommunality <- eliminate %>% filter(`(summary(corr$Var1) + summary(corr$Var2))` == 
    0))

# Potentially problematic collinear variables that may need to be removed:
highcor <- subset(corr, abs(Freq) > 0.95)
highcor

# variables which are retained
corr$Var1 <- droplevels(corr$Var1)
corr$Var2 <- droplevels(corr$Var2)
features <- unique(c(levels(corr$Var1), levels(corr$Var2)))
features  # 68 variables
# turn corr back into matrix in order to plot with corrplot
mtx_corr <- reshape2::acast(corr, Var1 ~ Var2, value.var = "Freq")

# plot correlations in a manageable way
library(corrplot)
plot.margin = unit(c(0, 0, 0, 0), "mm")
corrplot(mtx_corr, is.corr = FALSE, tl.col = "black", na.label = " ", tl.cex = 0.5)
# save as SVG with Rstudio e.g. 1000 x 1000
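
The logic of the chunk above (keeping each pair of variables once, reshaping the matrix into a three-column table and sorting by absolute correlation) may be easier to follow on a toy example. This sketch uses simulated data; the variables `A`, `B` and `C` and all object names are hypothetical.

```r
set.seed(42)  # arbitrary seed, for reproducibility only
x <- rnorm(200)
toy <- data.frame(A = x,
                  B = x + rnorm(200, sd = 0.1),  # almost collinear with A
                  C = rnorm(200))                # unrelated noise

toy_corr <- cor(toy)
# Blank out the lower triangle and the diagonal so that each pair appears once
toy_corr[lower.tri(toy_corr, diag = TRUE)] <- NA
# Turn the matrix into a table with the columns Var1, Var2 and Freq
toy_pairs <- na.omit(as.data.frame(as.table(toy_corr)))
# Sort by absolute correlation, strongest first
toy_pairs <- toy_pairs[order(-abs(toy_pairs$Freq)), ]
toy_pairs

# Pairs above the collinearity threshold used in the chunk above
subset(toy_pairs, abs(Freq) > 0.95)
```

Three variables yield three unique pairs, and only the A and B pair should exceed the 0.95 threshold.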

Visualisation of feature correlations

# Simple heatmap in base R (inspired by Stephanie Evert's SIGIL code)
cor.colours <- c(
  hsv(h=2/3, v=1, s=(10:1)/10), # blue = negative correlation 
  rgb(1,1,1), # white = no correlation 
  hsv(h=0, v=1, s=(1:10/10))) # red = positive correlation

#png(here("Plots", "heatmapzlogcounts.png"), width = 30, height= 30, units = "cm", res = 300)
heatmap(cor(zlogcounts), 
        symm=TRUE, 
        zlim=c(-1,1), 
        col=cor.colours, 
        margins=c(7,7))

dev.off()

## null device 
##           1

MSA, communalities and scree plot

# Eliminate highly collinear variable
cor(data$VPRT, data$VBD)

## [1] -0.9731048

data <- data %>% select(-c(VPRT))

colnames(data)

##  [1] "Filename"  "Register"  "Level"     "Series"    "Country"   "Corpus"   
##  [7] "Subcorpus" "Words"     "ACT"       "AMP"       "ASPECT"    "AWL"      
## [13] "BEMA"      "CAUSE"     "CC"        "COMM"      "CONC"      "COND"     
## [19] "CONT"      "CUZ"       "DEMO"      "DMA"       "DOAUX"     "DT"       
## [25] "DWNT"      "ELAB"      "EMPH"      "EX"        "EXIST"     "FPP1P"    
## [31] "FPP1S"     "FPUH"      "GTO"       "HDG"       "HGOT"      "IN"       
## [37] "JJAT"      "JJPR"      "LD"        "MDCA"      "MDCO"      "MDMM"     
## [43] "MDNE"      "MDWO"      "MDWS"      "MENTAL"    "NCOMP"     "NN"       
## [49] "OCCUR"     "PASS"      "PEAS"      "PIT"       "PLACE"     "POLITE"   
## [55] "POS"       "PROG"      "QUAN"      "QUPR"      "QUTAG"     "RB"       
## [61] "RP"        "SPLIT"     "SPP2"      "STPR"      "THATD"     "THRC"     
## [67] "THSC"      "TTR"       "VBD"       "VBG"       "VBN"       "VIMP"     
## [73] "WHQU"      "WHSC"      "XX0"       "YNQU"      "TPP3"      "FQTI"

kmo <- KMO(data[, 9:ncol(data)])
kmo  # Overall MSA = 0.95

## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = data[, 9:ncol(data)])
## Overall MSA =  0.95
## MSA for each item = 
##    ACT    AMP ASPECT    AWL   BEMA  CAUSE     CC   COMM   CONC   COND   CONT 
##   0.89   0.67   0.97   0.96   0.88   0.96   0.95   0.69   0.93   0.95   0.96 
##    CUZ   DEMO    DMA  DOAUX     DT   DWNT   ELAB   EMPH     EX  EXIST  FPP1P 
##   0.97   0.97   0.97   0.97   0.83   0.95   0.97   0.98   0.85   0.98   0.89 
##  FPP1S   FPUH    GTO    HDG   HGOT     IN   JJAT   JJPR     LD   MDCA   MDCO 
##   0.91   0.97   0.99   0.97   0.99   0.97   0.84   0.76   0.87   0.89   0.85 
##   MDMM   MDNE   MDWO   MDWS MENTAL  NCOMP     NN  OCCUR   PASS   PEAS    PIT 
##   0.91   0.95   0.93   0.89   0.91   0.88   0.94   0.97   0.96   0.91   0.97 
##  PLACE POLITE    POS   PROG   QUAN   QUPR  QUTAG     RB     RP  SPLIT   SPP2 
##   0.82   0.96   0.70   0.95   0.98   0.96   0.98   0.95   0.85   0.83   0.95 
##   STPR  THATD   THRC   THSC    TTR    VBD    VBG    VBN   VIMP   WHQU   WHSC 
##   0.99   0.98   0.94   0.86   0.98   0.91   0.96   0.98   0.84   0.96   0.96 
##    XX0   YNQU   TPP3   FQTI 
##   0.96   0.98   0.74   0.89

kmo$MSAi[order(kmo$MSAi)]  # All features have individual MSAs of > 0.5 (but only because TPP3P was merged with TPP3S earlier on)

##       AMP      COMM       POS      TPP3      JJPR     PLACE     SPLIT        DT 
## 0.6659581 0.6860298 0.6995184 0.7373646 0.7594598 0.8209795 0.8305808 0.8325009 
##      JJAT      VIMP      MDCO        RP        EX      THSC        LD     NCOMP 
## 0.8384328 0.8435546 0.8511922 0.8513463 0.8547890 0.8590161 0.8692458 0.8757097 
##      BEMA      MDWS      FQTI     FPP1P      MDCA       ACT    MENTAL       VBD 
## 0.8761697 0.8865788 0.8888590 0.8928853 0.8930041 0.8932178 0.9067961 0.9103405 
##     FPP1S      MDMM      PEAS      CONC      MDWO      THRC        NN      COND 
## 0.9105424 0.9108706 0.9125553 0.9301851 0.9348437 0.9350359 0.9413801 0.9458270 
##      PROG        CC      SPP2        RB      DWNT      MDNE      WHSC      CONT 
## 0.9461472 0.9476549 0.9483572 0.9495729 0.9496995 0.9514806 0.9565117 0.9569848 
##      QUPR       XX0     CAUSE      WHQU       VBG       AWL    POLITE      PASS 
## 0.9576645 0.9578687 0.9579412 0.9599786 0.9617553 0.9623653 0.9643209 0.9645284 
##       PIT     DOAUX      ELAB    ASPECT       DMA      DEMO       HDG        IN 
## 0.9657592 0.9661947 0.9669056 0.9679603 0.9691391 0.9695140 0.9704817 0.9710499 
##      FPUH     OCCUR       CUZ      EMPH      YNQU      QUAN       TTR     QUTAG 
## 0.9712987 0.9716773 0.9728492 0.9758064 0.9758425 0.9765995 0.9776647 0.9782315 
##     THATD       VBN     EXIST      STPR       GTO      HGOT 
## 0.9790366 0.9801000 0.9804490 0.9854688 0.9881734 0.9889865
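
For readers unfamiliar with the measure, the following is a rough base R sketch of the standard Kaiser-Meyer-Olkin formula: the sum of squared correlations divided by the sum of squared correlations plus squared partial correlations. It uses simulated data with hypothetical names and should correspond to what `KMO()` reports, although this has not been verified against the `psych` implementation here.

```r
set.seed(1)
n <- 300
f <- rnorm(n)  # one shared latent factor
toy <- data.frame(x1 = f + rnorm(n), x2 = f + rnorm(n),
                  x3 = f + rnorm(n), x4 = f + rnorm(n))

R <- cor(toy)
# Scaled inverse of the correlation matrix: its off-diagonal elements are
# the partial correlations (up to sign, which is irrelevant once squared)
Q <- cov2cor(solve(R))
diag(R) <- 0
diag(Q) <- 0

# Overall MSA and per-variable MSA
msa_overall <- sum(R^2) / (sum(R^2) + sum(Q^2))
msa_items   <- colSums(R^2) / (colSums(R^2) + colSums(Q^2))

round(msa_overall, 2)
round(sort(msa_items), 2)
```

Values approach 1 when the partial correlations are small relative to the raw correlations, i.e. when the variables share enough common variance for a factor or component solution to be meaningful.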

# png(here('Plots', 'screeplot-TEC-Ref_3Reg.png'), width = 20, height= 12, units
# = 'cm', res = 300)
scree(data[, 9:ncol(data)], factors = FALSE, pc = TRUE)  # 6 components

dev.off()  # Only needed if the png() call above is uncommented

## null device 
##           1
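
A scree plot of principal components displays the eigenvalues of the correlation matrix in decreasing order. The sketch below reproduces that computation in base R on simulated data with two underlying factors; all object names are hypothetical, and the eigenvalue-greater-than-one rule shown is just one common heuristic, not necessarily the criterion applied above.

```r
set.seed(7)
n <- 250
f1 <- rnorm(n)
f2 <- rnorm(n)
# Six toy variables: three driven by f1, three driven by f2
toy <- data.frame(
  a = f1 + rnorm(n, sd = 0.5), b = f1 + rnorm(n, sd = 0.5), c = f1 + rnorm(n, sd = 0.5),
  d = f2 + rnorm(n, sd = 0.5), e = f2 + rnorm(n, sd = 0.5), g = f2 + rnorm(n, sd = 0.5))

# Eigenvalues of the correlation matrix, returned in decreasing order
ev <- eigen(cor(toy), symmetric = TRUE, only.values = TRUE)$values
round(ev, 2)

# The eigenvalues sum to the number of variables
sum(ev)

# Number of components with an eigenvalue above 1
sum(ev > 1)

# Uncomment to draw the scree plot itself:
# plot(ev, type = "b", xlab = "Component", ylab = "Eigenvalue"); abline(h = 1, lty = 2)
```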

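# Perform PCA (psych::principal applies a varimax rotation by default, hence
# the 'RC' labels of the rotated components in the output)
pca1 <- psych::principal(data[, 9:ncol(data)], nfactors = 6)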
pca1$loadings

## 
## Loadings:
##        RC1    RC3    RC2    RC4    RC5    RC6   
## ACT    -0.502  0.141         0.196 -0.309  0.215
## AMP                                 0.574  0.124
## ASPECT -0.395         0.211        -0.203       
## AWL    -0.767  0.501 -0.201 -0.160              
## BEMA    0.356        -0.469         0.502       
## CAUSE  -0.455  0.154 -0.301  0.236              
## CC     -0.488  0.461 -0.182 -0.248         0.149
## COMM   -0.242 -0.157  0.384  0.158        -0.349
## CONC           0.439        -0.100        -0.129
## COND    0.363  0.195  0.156  0.494              
## CONT    0.864 -0.276 -0.120  0.221              
## CUZ     0.620  0.341 -0.149                     
## DEMO    0.670        -0.170  0.128              
## DMA     0.916 -0.113 -0.182                     
## DOAUX   0.736 -0.294         0.187        -0.148
## DT      0.399         0.518                0.423
## DWNT   -0.182  0.129  0.369         0.141  0.105
## ELAB   -0.338  0.385 -0.320         0.264  0.148
## EMPH    0.782                0.117              
## EX      0.246                       0.217  0.513
## EXIST  -0.530  0.377 -0.142 -0.112              
## FPP1P   0.223 -0.284         0.354  0.108       
## FPP1S   0.627 -0.363         0.221        -0.153
## FPUH    0.877        -0.208                     
## GTO     0.641                0.218 -0.145       
## HDG     0.575  0.161                       0.127
## HGOT    0.743 -0.151 -0.126        -0.116       
## IN     -0.809  0.405        -0.175              
## JJAT           0.553                0.280  0.212
## JJPR   -0.141  0.198 -0.205  0.178  0.571       
## LD     -0.718  0.207 -0.452 -0.130 -0.110 -0.149
## MDCA    0.114        -0.495  0.462  0.154  0.124
## MDCO                  0.534  0.111              
## MDMM           0.432         0.341              
## MDNE    0.238                0.449              
## MDWO    0.369  0.116  0.393  0.122              
## MDWS    0.161                0.545 -0.145       
## MENTAL  0.449                0.261  0.247 -0.329
## NCOMP          0.347 -0.509  0.220 -0.185  0.102
## NN     -0.851  0.225 -0.298 -0.250              
## OCCUR  -0.512  0.298        -0.224              
## PASS   -0.522  0.521 -0.209 -0.238              
## PEAS   -0.215  0.288  0.448                     
## PIT     0.726  0.114                       0.223
## PLACE         -0.351  0.120  0.103         0.467
## POLITE  0.265 -0.557 -0.179  0.241  0.172 -0.136
## POS           -0.110  0.103        -0.235 -0.389
## PROG    0.331 -0.116  0.205  0.327 -0.117       
## QUAN    0.772         0.134  0.153  0.112  0.151
## QUPR    0.307         0.382  0.278  0.167       
## QUTAG   0.776                      -0.168       
## RB      0.609         0.454  0.189              
## RP     -0.101         0.478  0.247 -0.339  0.272
## SPLIT          0.541         0.128  0.114       
## SPP2    0.599 -0.334 -0.182  0.412              
## STPR    0.425 -0.162                       0.107
## THATD   0.733                0.159        -0.159
## THRC           0.554 -0.262        -0.110  0.177
## THSC           0.577  0.124         0.147 -0.119
## TTR    -0.793  0.204  0.183                     
## VBD    -0.358         0.639 -0.440 -0.196       
## VBG    -0.589  0.465         0.108 -0.136       
## VBN    -0.565  0.489 -0.188 -0.184 -0.113       
## VIMP   -0.165 -0.394 -0.337  0.364  0.128  0.101
## WHQU    0.454 -0.518 -0.201  0.206        -0.116
## WHSC   -0.273  0.575                            
## XX0     0.742 -0.135         0.240        -0.118
## YNQU    0.692 -0.434 -0.201  0.161              
## TPP3   -0.211 -0.109  0.581 -0.341 -0.184 -0.193
## FQTI   -0.354         0.124         0.298       
## 
##                   RC1   RC3   RC2   RC4   RC5   RC6
## SS loadings    17.839 6.068 4.731 3.283 2.099 1.788
## Proportion Var  0.255 0.087 0.068 0.047 0.030 0.026
## Cumulative Var  0.255 0.342 0.409 0.456 0.486 0.512
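
The summary rows at the bottom of the loadings table follow directly from the loadings: the SS loadings of a component are the column sums of the squared loadings, and the proportion of variance is that sum divided by the number of variables. The sketch below illustrates this in base R on simulated data, using `eigen()` and `stats::varimax()` as stand-ins; it is not the `psych::principal()` routine itself, and all object names are hypothetical.

```r
set.seed(1)
n <- 200
f1 <- rnorm(n)
f2 <- rnorm(n)
toy <- cbind(
  a = f1 + rnorm(n, sd = 0.5), b = f1 + rnorm(n, sd = 0.5), c = f1 + rnorm(n, sd = 0.5),
  d = f2 + rnorm(n, sd = 0.5), e = f2 + rnorm(n, sd = 0.5), g = f2 + rnorm(n, sd = 0.5))
p <- ncol(toy)  # number of variables
k <- 2          # number of components retained

# Unrotated loadings: eigenvectors scaled by the square root of the eigenvalues
e <- eigen(cor(toy), symmetric = TRUE)
unrotated <- e$vectors[, 1:k] %*% diag(sqrt(e$values[1:k]))
rownames(unrotated) <- colnames(toy)

# Orthogonal varimax rotation of the retained components
rotated <- unclass(varimax(unrotated)$loadings)

ss_loadings <- colSums(rotated^2)
prop_var <- ss_loadings / p
round(rbind(ss_loadings, prop_var, cum_var = cumsum(prop_var)), 3)

# Same arithmetic for the first component reported above (70 variables entered)
round(17.839 / 70, 3)
```

An orthogonal rotation redistributes variance between the retained components but leaves their total unchanged, which is why the cumulative variance of a rotated solution equals that of the unrotated one.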

pca1$communality %>% sort(.)  # With a communality threshold of 0.2, TIME would have had to be removed; TIME and FREQ were therefore merged further up the line

##      DWNT      STPR      CONC      FQTI       POS    ASPECT      MDNE     FPP1P 
## 0.2167287 0.2259971 0.2284371 0.2299426 0.2373224 0.2464027 0.2706613 0.2805719 
##      PROG      MDCO      MDMM      MDWO     SPLIT      MDWS      PEAS      QUPR 
## 0.2893579 0.3154268 0.3170923 0.3233782 0.3278744 0.3446263 0.3481427 0.3489995 
##       AMP     PLACE       HDG      COMM     CAUSE        EX      THSC     OCCUR 
## 0.3544622 0.3710837 0.3774159 0.3793614 0.3800517 0.3805229 0.4009387 0.4017528 
##      WHSC      THRC      JJAT      COND    MENTAL       ACT      VIMP      ELAB 
## 0.4169886 0.4284237 0.4365607 0.4388680 0.4451165 0.4513069 0.4555282 0.4588368 
##     EXIST      JJPR     NCOMP        RP       GTO      DEMO      MDCA    POLITE 
## 0.4607173 0.4640015 0.4764765 0.4938285 0.4952373 0.5048824 0.5166493 0.5182614 
##       CUZ        CC      WHQU      TPP3       VBG     THATD       PIT      BEMA 
## 0.5295899 0.5704833 0.5799242 0.5812901 0.5989025 0.6044126 0.6059166 0.6078998 
##     FPP1S        DT      HGOT        RB       VBN     QUTAG      EMPH      PASS 
## 0.6080200 0.6130366 0.6153162 0.6237505 0.6403441 0.6404917 0.6434773 0.6450342 
##       XX0      QUAN      SPP2     DOAUX       TTR      YNQU       VBD        LD 
## 0.6504597 0.6732223 0.6819055 0.6940095 0.7076993 0.7413764 0.7799251 0.8145733 
##      FPUH        IN      CONT       DMA       AWL        NN 
## 0.8255802 0.8567971 0.8869047 0.8909298 0.9087159 0.9292510
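
A communality is the sum of a variable's squared loadings across all retained components. As a small worked example, the sketch below recomputes the communality of AWL from the four loadings printed in the table above; the two blank cells are loadings whose absolute value lies below the print cut-off of 0.1, so the result is only a close approximation. The object names are hypothetical.

```r
# Printed loadings for AWL (RC5 and RC6 are hidden because |loading| < 0.1)
awl_loadings <- c(RC1 = -0.767, RC3 = 0.501, RC2 = -0.201, RC4 = -0.160)

# Approximate communality from the visible loadings
awl_h2_approx <- sum(awl_loadings^2)
awl_h2_approx

# Value reported by pca1$communality; the small remainder is attributable to
# the two hidden loadings and to rounding
awl_h2_reported <- 0.9087159
awl_h2_reported - awl_h2_approx

# Checking a 0.2 threshold on the three lowest reported communalities
h2_lowest <- c(DWNT = 0.2167287, STPR = 0.2259971, CONC = 0.2284371)
low <- names(h2_lowest)[h2_lowest < 0.2]
low  # empty: no feature falls below the threshold
```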

# Final number of features (columns 1 to 8, Filename to Words, are metadata)
ncol(data) - 8

## [1] 70

# Final number of texts
nrow(data)

## [1] 4980

# saveRDS(data, here('FullMDA', 'dataforPCA.rds')) # Last saved on 9 Feb 2022

Packages used in this script

# packages.bib <- sapply(1:length(loadedNamespaces()), function(i)
# toBibtex(citation(loadedNamespaces()[i])))

knitr::write_bib(c(.packages(), "knitr"), "packages.bib")

sessionInfo()

## R version 4.0.3 (2020-10-10)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur 10.16
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] tibble_3.1.6               tidyr_1.1.4               
##  [3] suffrager_0.1.0            psych_2.0.12              
##  [5] purrr_0.3.4                PerformanceAnalytics_2.0.4
##  [7] xts_0.12.1                 zoo_1.8-9                 
##  [9] here_1.0.1                 forcats_0.5.1             
## [11] dplyr_1.0.7                DescTools_0.99.40         
## [13] caret_6.0-86               ggplot2_3.3.5             
## [15] lattice_0.20-41           
## 
## loaded via a namespace (and not attached):
##  [1] nlme_3.1-152         lubridate_1.7.10     rprojroot_2.0.2     
##  [4] tools_4.0.3          bslib_0.3.1          utf8_1.2.2          
##  [7] R6_2.5.1             rpart_4.1-15         DBI_1.1.1           
## [10] colorspace_2.0-2     nnet_7.3-15          withr_2.4.3         
## [13] tidyselect_1.1.1     Exact_2.1            mnormt_2.0.2        
## [16] compiler_4.0.3       cli_3.1.0            formatR_1.8         
## [19] expm_0.999-6         labeling_0.4.2       sass_0.4.0          
## [22] scales_1.1.1         mvtnorm_1.1-1        quadprog_1.5-8      
## [25] stringr_1.4.0        digest_0.6.29        rmarkdown_2.11      
## [28] pkgconfig_2.0.3      htmltools_0.5.2      fastmap_1.1.0       
## [31] highr_0.9            rlang_0.4.12         rstudioapi_0.13     
## [34] jquerylib_0.1.4      generics_0.1.1       farver_2.1.0        
## [37] jsonlite_1.7.2       ModelMetrics_1.2.2.2 magrittr_2.0.1      
## [40] Matrix_1.3-2         Rcpp_1.0.7           munsell_0.5.0       
## [43] fansi_0.5.0          lifecycle_1.0.1      stringi_1.7.6       
## [46] pROC_1.17.0.1        yaml_2.2.1           MASS_7.3-53.1       
## [49] rootSolve_1.8.2.1    plyr_1.8.6           recipes_0.1.15      
## [52] grid_4.0.3           parallel_4.0.3       crayon_1.4.2        
## [55] lmom_2.8             splines_4.0.3        tmvnsim_1.0-2       
## [58] knitr_1.37           pillar_1.6.4         boot_1.3-27         
## [61] gld_2.6.2            reshape2_1.4.4       codetools_0.2-18    
## [64] stats4_4.0.3         glue_1.6.0           evaluate_0.14       
## [67] data.table_1.14.2    vctrs_0.3.8          foreach_1.5.1       
## [70] gtable_0.3.0         assertthat_0.2.1     xfun_0.29           
## [73] gower_0.2.2          prodlim_2019.11.13   e1071_1.7-4         
## [76] class_7.3-18         survival_3.2-7       timeDate_3043.102   
## [79] iterators_1.0.13     lava_1.6.9           ellipsis_0.3.2      
## [82] ipred_0.9-11

Alburez-Gutierrez, Diego. 2020. Suffrager: A Feminist Colour Palette for R.

Henry, Lionel, and Hadley Wickham. 2020. Purrr: Functional Programming Tools. https://CRAN.R-project.org/package=purrr.

Kuhn, Max. 2020. Caret: Classification and Regression Training. https://github.com/topepo/caret/.

Müller, Kirill. 2020. Here: A Simpler Way to Find Your Files. https://CRAN.R-project.org/package=here.

Müller, Kirill, and Hadley Wickham. 2021. Tibble: Simple Data Frames. https://CRAN.R-project.org/package=tibble.

Peterson, Brian G., and Peter Carl. 2020. PerformanceAnalytics: Econometric Tools for Performance and Risk Analysis. https://github.com/braverock/PerformanceAnalytics.

R Core Team. 2020. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.

Revelle, William. 2020. Psych: Procedures for Psychological, Psychometric, and Personality Research. https://personality-project.org/r/psych/ https://personality-project.org/r/psych-manual.pdf.

Ryan, Jeffrey A., and Joshua M. Ulrich. 2020. Xts: eXtensible Time Series. https://github.com/joshuaulrich/xts.

Sarkar, Deepayan. 2008. Lattice: Multivariate Data Visualization with R. New York: Springer. http://lmdvr.r-forge.r-project.org.

———. 2020. Lattice: Trellis Graphics for R. http://lattice.r-forge.r-project.org/.

Signorell, Andri. 2021. DescTools: Tools for Descriptive Statistics. https://CRAN.R-project.org/package=DescTools.

Wickham, Hadley. 2016. Ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. https://ggplot2.tidyverse.org.

———. 2021a. Forcats: Tools for Working with Categorical Variables (Factors). https://CRAN.R-project.org/package=forcats.

———. 2021b. Tidyr: Tidy Messy Data. https://CRAN.R-project.org/package=tidyr.

Wickham, Hadley, Winston Chang, Lionel Henry, Thomas Lin Pedersen, Kohske Takahashi, Claus Wilke, Kara Woo, Hiroaki Yutani, and Dewey Dunnington. 2021. Ggplot2: Create Elegant Data Visualisations Using the Grammar of Graphics. https://CRAN.R-project.org/package=ggplot2.

Wickham, Hadley, Romain François, Lionel Henry, and Kirill Müller. 2021. Dplyr: A Grammar of Data Manipulation. https://CRAN.R-project.org/package=dplyr.

Xie, Yihui. 2014. “Knitr: A Comprehensive Tool for Reproducible Research in R.” In Implementing Reproducible Computational Research, edited by Victoria Stodden, Friedrich Leisch, and Roger D. Peng. Chapman and Hall/CRC. http://www.crcpress.com/product/isbn/9781466561595.

———. 2015. Dynamic Documents with R and Knitr. 2nd ed. Boca Raton, Florida: Chapman and Hall/CRC. https://yihui.org/knitr/.

———. 2021. Knitr: A General-Purpose Package for Dynamic Report Generation in R. https://yihui.org/knitr/.

Zeileis, Achim, and Gabor Grothendieck. 2005. “Zoo: S3 Infrastructure for Regular and Irregular Time Series.” Journal of Statistical Software 14 (6): 1–27. https://doi.org/10.18637/jss.v014.i06.

Zeileis, Achim, Gabor Grothendieck, and Jeffrey A. Ryan. 2021. Zoo: S3 Infrastructure for Regular and Irregular Time Series (z’s Ordered Observations). https://zoo.R-Forge.R-project.org/.

Reference corpus data preparation for full MDA

Elen Le Foll

09/11/2021