This script is part of the Online Appendix to my PhD thesis.
Please cite as: Le Foll, Elen. 2022. Textbook English: A Corpus-Based Analysis of the Language of EFL textbooks used in Secondary Schools in France, Germany and Spain. PhD thesis. Osnabrück University.
For more information, see: https://elenlefoll.github.io/TextbookEnglish/
Please note that the plot dimensions in this notebook have been optimised for the print version of the thesis.
Built with R 4.0.3
knitr::opts_chunk$set(echo = TRUE, cache = TRUE, tidy = TRUE, message = FALSE, warning = FALSE, paged.print = TRUE, fig.width = 10)
library(caret) # For its confusion matrix function
library(DescTools) # For common stats functions
library(dplyr)
library(forcats)
library(here) # For dynamic file paths
library(ggplot2)
library(PerformanceAnalytics)
library(purrr) # For data wrangling
library(psych) # For various useful stats functions
library(suffrager) # For pretty feminist colour palettes :)
library(tidyr)
library(tibble)
These counts were computed on the basis of the “John and Jill in Ivybridge” version of the Spoken BNC2014 with added full stops at speaker turns.
SpokenBNC2014 <- read.delim(here("MFTE", "Outputs", "SpokenBNC2014_3.1_normed_complex_counts.tsv"),
header = TRUE, stringsAsFactors = TRUE)
str(SpokenBNC2014) # Check sanity of data
## 'data.frame': 1251 obs. of 82 variables:
## $ Filename: Factor w/ 1251 levels "S23A.txt","S24A.txt",..: 586 1079 223 624 394 874 137 1144 1091 419 ...
## $ Words : int 9192 18158 6900 5944 5866 44400 2514 3238 8614 21724 ...
## $ AWL : num 3.92 3.92 3.85 3.81 3.87 ...
## $ TTR : num 0.407 0.41 0.35 0.385 0.37 ...
## $ LD : num 0.462 0.507 0.489 0.493 0.467 ...
## $ DT : num 46.4 39.5 35.9 40.9 35.7 ...
## $ JJAT : num 17.2 17.3 13.3 18 12.6 ...
## $ POS : num 0.758 2.367 2.575 3.03 2.119 ...
## $ NCOMP : num 8.33 7.01 8.4 9.55 9.18 ...
## $ QUAN : num 12.31 10.27 9.62 9.24 11.16 ...
## $ ACT : num 22.2 17.7 19.9 11.5 25.7 ...
## $ ASPECT : num 2.4 0.87 1.11 1.02 1.98 ...
## $ CAUSE : num 0.712 1.099 0.443 0.511 1.273 ...
## $ COMM : num 8.9 7.1 8.08 7.42 7.5 ...
## $ CUZ : num 4.9 4.31 4.09 2.43 5.8 ...
## $ CC : num 42.8 21.3 26.5 14.3 37.9 ...
## $ CONC : num 0.356 1.054 0.996 1.023 0.99 ...
## $ COND : num 1.69 3.02 2.54 2.17 5.8 ...
## $ EX : num 2.14 4.21 3.1 2.43 3.54 ...
## $ EXIST : num 2.85 1.65 1.33 1.53 2.26 ...
## $ ELAB : num 0.445 0.321 0 0 0.141 ...
## $ FREQ : num 2.404 2.382 2.655 0.767 2.546 ...
## $ JJPR : num 14.78 18.64 11.62 16.5 9.05 ...
## $ MENTAL : num 19.8 20.3 18.8 21.5 19.8 ...
## $ OCCUR : num 1.692 0.87 0.664 0.511 0.283 ...
## $ DOAUX : num 6.14 6.69 9.07 9.59 7.36 ...
## $ QUTAG : num 1.692 0.596 3.54 2.43 0.141 ...
## $ QUPR : num 3.29 2.57 6.19 2.3 5.52 ...
## $ SPLIT : num 2.58 2.98 3.1 2.43 5.23 ...
## $ STPR : num 0.267 1.374 0.885 0.384 0.707 ...
## $ WHQU : num 1.16 3.71 5.31 4.48 2.69 ...
## $ THSC : num 4.9 4.4 2.77 3.84 3.68 ...
## $ WHSC : num 9.44 8.84 5.97 5.88 13.72 ...
## $ CONT : num 28 42.4 42.1 37.6 34.8 ...
## $ VBD : num 43.19 16.77 21.24 32.99 9.34 ...
## $ VPRT : num 47.5 67.5 64.4 53.2 62.7 ...
## $ PLACE : num 3.21 3.44 4.42 3.71 2.69 ...
## $ PROG : num 3.74 4.99 6.97 5.12 3.82 ...
## $ HGOT : num 1.16 1.28 2.88 1.41 3.25 ...
## $ BEMA : num 21.7 27.4 21.2 27.5 16.7 ...
## $ MDCA : num 1.96 2.84 4.54 3.71 7.36 ...
## $ MDCO : num 1.51 1.19 1.44 1.53 6.08 ...
## $ TIME : num 2.85 2.24 4.2 2.56 2.12 ...
## $ THATD : num 4.72 5.27 5.2 5.63 5.52 ...
## $ THRC : num 0.712 2.748 1.77 2.43 2.122 ...
## $ VIMP : num 0.178 1.878 1.327 1.407 1.414 ...
## $ MDMM : num 0.356 0.641 0.664 0.895 0.99 ...
## $ ABLE : num 0 0.183 0.111 0.128 0.141 ...
## $ MDNE : num 1.34 2.15 2.32 2.43 4.67 ...
## $ MDWS : num 0.712 2.657 2.655 1.151 3.678 ...
## $ MDWO : num 3.29 4.35 1.44 2.69 3.82 ...
## $ XX0 : num 11.2 16.9 17.3 22.9 15.1 ...
## $ PASS : num 3.38 2.57 1.66 1.79 1.27 ...
## $ PGET : num 0.98 0.183 0.332 0.256 0.566 ...
## $ VBG : num 7.3 4.54 4.2 3.45 5.09 ...
## $ VBN : num 1.96 2.15 1.33 1.28 0.99 ...
## $ PEAS : num 5.79 3.99 3.98 3.45 1.84 ...
## $ GTO : num 0.267 2.061 0.774 0.767 1.556 ...
## $ FPP1S : num 40.5 24.9 26 30.9 18.2 ...
## $ FPP1P : num 9.53 8.57 3.76 7.03 13.72 ...
## $ TPP3S : num 9.884 7.879 11.504 5.499 0.424 ...
## $ TPP3P : num 9.62 8.25 10.73 4.86 21.22 ...
## $ SPP2 : num 13.6 14.8 19.4 18.3 15.1 ...
## $ PIT : num 18.5 23.1 28.3 29.2 18.2 ...
## $ PRP : num 0.089 0.0458 0 0.1279 0 ...
## $ RP : num 3.21 2.75 3.76 3.32 6.08 ...
## $ AMP : num 0.653 0.314 0.145 0.337 0.239 ...
## $ CD : num 0.783 0.881 1 1.043 1.705 ...
## $ DEMO : num 0.957 1.74 1.029 1.632 1.279 ...
## $ DMA : num 3.62 4.37 3.41 4.46 1.89 ...
## $ DWNT : num 0.0326 0.0771 0.029 0.0168 0 0.0338 0.0398 0.0926 0.0232 0.0368 ...
## $ EMPH : num 1.49 1.31 1.19 1.46 1.01 ...
## $ FPUH : num 3.79 3.08 2.52 3.01 2.78 ...
## $ HDG : num 0.174 0.688 0.159 0.303 0.699 ...
## $ IN : num 7.72 7.01 5.48 5.08 6.75 ...
## $ LIKE : num 0.152 0.6 0.623 0.69 0.494 ...
## $ NN : num 11.5 11.6 10.7 11.1 12.1 ...
## $ POLITE : num 0 0.2313 0.0435 0.0505 0.0341 ...
## $ RB : num 3.1 3.42 3.14 3.26 3.8 ...
## $ SO : num 0.664 0.832 0.71 0.202 0.733 ...
## $ URL : num 0 0 0 0 0 0 0 0 0 0 ...
## $ YNQU : num 0.283 0.363 1.044 0.74 0.511 ...
nrow(SpokenBNC2014) # Should be 1251 files
## [1] 1251
SpokenBNC2014$Series <- "Spoken BNC2014"
SpokenBNC2014$Level <- "Ref."
SpokenBNC2014$Country <- "Spoken BNC2014"
SpokenBNC2014$Register <- "Spoken BNC2014"
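The sanity checks applied to the imported table (`str()` and comparing `nrow()` against the expected number of files) could also be made to fail loudly. The following is only a minimal sketch on a toy table: the helper name `check_counts` and the toy values are assumptions for illustration and are not part of the original script.

```r
# Hypothetical helper (not part of the original script): stops with an error
# if the table does not have the expected shape.
check_counts <- function(df, expected_rows) {
  stopifnot(
    nrow(df) == expected_rows,        # one row per corpus file
    !anyDuplicated(df$Filename),      # no file imported twice
    # every column other than Filename should be numeric
    all(vapply(df[setdiff(names(df), "Filename")], is.numeric, logical(1)))
  )
  invisible(df)
}

# Toy table standing in for an imported counts table (illustrative values only)
toy <- data.frame(
  Filename = c("file_a.txt", "file_b.txt"),
  Words = c(9192L, 18158L),
  AWL = c(3.92, 3.85),
  stringsAsFactors = TRUE
)

check_counts(toy, expected_rows = 2)
```

Because the helper returns its input invisibly, it could be dropped into a pipe directly after `read.delim()`.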
These counts were computed on random samples of approximately 5,000 words drawn from the books of the Youth Fiction corpus.
YouthFiction <- read.delim(here("MFTE", "Outputs", "YF_sampled_500_3.1_normed_complex_counts.tsv"),
header = TRUE, stringsAsFactors = TRUE)
str(YouthFiction) # Check sanity of data
## 'data.frame': 1191 obs. of 83 variables:
## $ Filename: Factor w/ 1191 levels "1_BaumWizardOz_1.txt",..: 946 264 698 551 543 856 976 927 542 1150 ...
## $ Words : int 5955 5877 5974 5795 6185 5924 6097 5886 6110 5801 ...
## $ AWL : num 4.13 4.15 4.07 4.26 4.14 ...
## $ TTR : num 0.47 0.477 0.502 0.547 0.527 ...
## $ LD : num 0.505 0.485 0.496 0.496 0.499 ...
## $ DT : num 33.5 38.7 48.2 45.9 35.6 ...
## $ JJAT : num 12.6 14.2 23 20.8 19.1 ...
## $ POS : num 2.615 2.233 0.862 1.134 2.072 ...
## $ NCOMP : num 1.48 2.91 4.56 6.08 7.74 ...
## $ QUAN : num 6.19 6.41 9.36 6.6 8.72 ...
## $ ACT : num 19.3 20.9 24.8 25.2 23.8 ...
## $ ASPECT : num 2.11 2.1 5.68 2.47 1.52 ...
## $ CAUSE : num 1.057 0.84 2.436 1.389 0.967 ...
## $ COMM : num 22.46 16.11 13.53 9.88 16.71 ...
## $ CUZ : num 0.793 1.961 1.488 0.926 1.934 ...
## $ CC : num 28 35.7 26.8 30.1 32.5 ...
## $ CONC : num 0.793 1.12 1.083 1.08 2.486 ...
## $ COND : num 4.1 2.38 2.84 3.09 4.14 ...
## $ EX : num 2.64 2.8 2.57 2.93 1.93 ...
## $ EXIST : num 1.32 1.96 3.65 3.7 2.9 ...
## $ ELAB : num 0.132 0 0.135 0 0.276 ...
## $ FREQ : num 2.77 2.94 2.17 4.94 2.49 ...
## $ JJPR : num 12.4 15.4 16 17.6 12.2 ...
## $ MENTAL : num 15.7 18.5 17.9 15.6 21.5 ...
## $ OCCUR : num 3.04 2.38 2.03 5.25 1.52 ...
## $ DOAUX : num 3.57 4.62 2.57 3.55 5.25 ...
## $ QUTAG : num 0.132 0 0 0 1.934 ...
## $ QUPR : num 3.57 2.1 2.71 3.55 4.56 ...
## $ SPLIT : num 3.04 2.38 2.98 4.32 4.01 ...
## $ STPR : num 0.925 0.42 0.947 0.154 0.552 ...
## $ WHQU : num 3.3 3.64 1.89 2.01 1.1 ...
## $ THSC : num 2.38 3.08 4.74 4.94 4.7 ...
## $ WHSC : num 7.79 11.06 7.04 4.94 8.98 ...
## $ CONT : num 12.8 15.4 14.9 10.8 21.7 ...
## $ VBD : num 49.3 51.8 68.1 76.5 53.6 ...
## $ VPRT : num 34.1 30.7 16.2 11.9 30.7 ...
## $ PLACE : num 1.98 4.76 4.74 3.09 2.35 ...
## $ PROG : num 3.96 3.36 6.5 7.1 4.83 ...
## $ HGOT : num 0.528 0.14 0.947 0 0.691 ...
## $ BEMA : num 15.3 16.5 16.2 15.4 14.6 ...
## $ MDCA : num 1.717 2.521 0.947 1.08 3.315 ...
## $ MDCO : num 1.59 2.38 3.92 2.93 2.35 ...
## $ TIME : num 5.15 4.76 5.41 3.7 5.94 ...
## $ THATD : num 1.98 1.26 3.52 2.16 3.59 ...
## $ THRC : num 0.528 1.681 1.488 1.235 0.414 ...
## $ VIMP : num 3.567 2.241 3.383 1.698 0.829 ...
## $ MDMM : num 1.453 1.401 0.812 1.543 0.552 ...
## $ ABLE : num 0.396 0.28 0.271 0.154 0.138 ...
## $ MDNE : num 1.98 2.24 2.84 1.54 2.07 ...
## $ MDWS : num 3.699 3.642 1.624 0.309 2.072 ...
## $ MDWO : num 2.64 3.08 2.17 2.47 4.56 ...
## $ XX0 : num 10.3 8.68 8.39 11.11 12.29 ...
## $ PASS : num 4.89 3.36 4.19 4.01 2.21 ...
## $ PGET : num 0.132 0.14 0 0 0 ...
## $ VBG : num 7.53 6.02 11.37 14.66 10.22 ...
## $ VBN : num 2.25 4.76 2.84 5.86 3.31 ...
## $ PEAS : num 6.87 8.68 6.9 8.95 8.29 ...
## $ GTO : num 0 0.7 0.541 0.154 1.243 ...
## $ FPP1S : num 17.31 19.33 46.41 8.02 34.94 ...
## $ FPP1P : num 7.53 5.74 9.88 1.39 4.42 ...
## $ TPP3S : num 23.78 41.74 9.88 64.81 31.77 ...
## $ TPP3P : num 7.66 10.08 6.5 11.57 3.59 ...
## $ SPP2 : num 15.32 9.1 7.31 5.4 12.15 ...
## $ PIT : num 11.9 9.1 11.9 18.2 14.9 ...
## $ PRP : num 0.132 0 0 0 0 ...
## $ RP : num 3.83 4.9 6.09 5.56 7.87 ...
## $ AMP : num 0.487 0.0851 0.2176 0.1208 0.2749 ...
## $ CD : num 0.722 0.391 0.904 0.328 0.582 ...
## $ DEMO : num 0.823 1.038 0.753 0.725 0.582 ...
## $ DMA : num 0.369 0.391 0.285 0.259 0.598 ...
## $ DWNT : num 0.0672 0.0681 0.1339 0.1726 0.097 ...
## $ EMO : num 0 0.017 0 0 0 0 0.0164 0 0 0 ...
## $ EMPH : num 0.571 0.204 0.619 0.311 0.776 ...
## $ FPUH : num 0.2519 0.034 0.1339 0.1553 0.0647 ...
## $ HDG : num 0.1511 0.0681 0.1172 0.2071 0.3072 ...
## $ IN : num 9.5 10.06 9.78 11.32 9.12 ...
## $ LIKE : num 0.2015 0.0851 0.3348 0.3279 0.2264 ...
## $ NN : num 19.3 17.5 13.6 16.7 14.8 ...
## $ POLITE : num 0.117 0.102 0.067 0.069 0.097 ...
## $ RB : num 2.3 1.91 3.16 2.26 3.46 ...
## $ SO : num 0.218 0.17 0.117 0.19 0.388 ...
## $ URL : num 0 0 0 0 0 0 0 0 0 0 ...
## $ YNQU : num 0.403 0.204 0.167 0.069 0.372 ...
nrow(YouthFiction) # Should be 1191 files
## [1] 1191
YouthFiction$Series <- "Youth Fiction"
YouthFiction$Level <- "Ref."
YouthFiction$Country <- "Youth Fiction"
YouthFiction$Register <- "Youth Fiction"
InfoTeen <- read.delim(here("MFTE", "Outputs", "InfoTeen_3.1_normed_complex_counts.tsv"),
header = TRUE, stringsAsFactors = TRUE)
str(InfoTeen) # Check sanity of data
## 'data.frame': 1414 obs. of 84 variables:
## $ Filename: Factor w/ 1414 levels "Dogo_News_10059528_science.txt",..: 657 831 825 1285 1086 1240 85 567 644 1147 ...
## $ Words : int 656 904 833 971 711 1104 697 1098 762 705 ...
## $ AWL : num 4.77 4.73 5.08 4.99 4.92 ...
## $ TTR : num 0.482 0.485 0.502 0.547 0.55 ...
## $ LD : num 0.59 0.559 0.593 0.623 0.578 ...
## $ DT : num 21.1 33.7 31.8 26.5 32.2 ...
## $ JJAT : num 23.2 22.4 20.4 24.9 17.8 ...
## $ POS : num 2.062 2.845 0.408 1.246 1.485 ...
## $ NCOMP : num 5.15 4.88 6.53 11.21 8.42 ...
## $ QUAN : num 3.61 5.69 1.63 4.36 3.47 ...
## $ ACT : num 21.9 24.7 28.4 29.5 34 ...
## $ ASPECT : num 6.25 5.19 0 1.64 1.89 ...
## $ CAUSE : num 1.56 3.9 1.49 1.64 5.66 ...
## $ COMM : num 7.81 5.19 14.93 3.28 3.77 ...
## $ CUZ : num 0 0 1.49 8.2 0 ...
## $ CC : num 23.4 36.4 37.3 44.3 30.2 ...
## $ CONC : num 0 1.3 1.49 3.28 5.66 ...
## $ COND : num 0 1.3 1.49 3.28 0 ...
## $ EX : num 1.56 0 0 0 1.89 ...
## $ EXIST : num 4.69 14.29 16.42 4.92 5.66 ...
## $ ELAB : num 0 0 4.48 1.64 0 ...
## $ FREQ : num 1.56 1.3 2.99 3.28 0 ...
## $ JJPR : num 12.5 19.5 23.9 24.6 13.2 ...
## $ MENTAL : num 3.12 9.09 11.94 9.84 15.09 ...
## $ OCCUR : num 4.69 2.6 0 4.92 1.89 ...
## $ DOAUX : num 7.81 1.3 0 0 1.89 ...
## $ QUTAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ QUPR : num 0 0 0 0 3.77 ...
## $ SPLIT : num 4.69 1.3 2.99 8.2 3.77 ...
## $ STPR : num 1.56 1.3 0 0 0 ...
## $ WHQU : num 7.81 0 0 0 0 ...
## $ THSC : num 6.25 10.39 7.46 1.64 7.55 ...
## $ WHSC : num 12.5 19.48 8.96 4.92 18.87 ...
## $ CONT : num 6.25 6.49 2.99 0 3.77 ...
## $ VBD : num 51.56 6.49 29.85 8.2 26.42 ...
## $ VPRT : num 42.2 58.4 59.7 78.7 56.6 ...
## $ PLACE : num 3.12 2.6 13.43 3.28 3.77 ...
## $ PROG : num 4.69 1.3 1.49 3.28 0 ...
## $ HGOT : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BEMA : num 23.4 15.6 16.4 26.2 17 ...
## $ MDCA : num 3.12 0 1.49 4.92 13.21 ...
## $ MDCO : num 0 7.79 0 1.64 0 ...
## $ TIME : num 7.81 2.6 16.42 4.92 7.55 ...
## $ THATD : num 1.56 1.3 0 1.64 3.77 ...
## $ THRC : num 6.25 14.29 5.97 6.56 3.77 ...
## $ VIMP : num 0 9.09 4.48 3.28 0 ...
## $ MDMM : num 3.12 6.49 2.99 0 0 ...
## $ ABLE : num 0 1.3 1.49 0 0 ...
## $ MDNE : num 0 2.6 0 0 0 ...
## $ MDWS : num 0 1.3 0 1.64 1.89 ...
## $ MDWO : num 0 7.79 1.49 1.64 1.89 ...
## $ XX0 : num 7.81 5.19 0 3.28 3.77 ...
## $ PASS : num 10.94 5.19 13.43 4.92 7.55 ...
## $ PGET : num 0 0 0 0 0 ...
## $ VBG : num 10.9 19.5 14.9 18 35.8 ...
## $ VBN : num 1.56 10.39 28.36 13.11 13.21 ...
## $ PEAS : num 6.25 6.49 4.48 13.11 15.09 ...
## $ GTO : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FPP1S : num 0 0 0 0 0 ...
## $ FPP1P : num 0 5.19 0 0 7.55 ...
## $ TPP3S : num 7.81 0 4.48 0 0 ...
## $ TPP3P : num 4.69 2.6 8.96 4.92 35.85 ...
## $ SPP2 : num 17.2 0 0 0 0 ...
## $ PIT : num 6.25 12.99 20.9 8.2 0 ...
## $ PRP : num 0 0 0 0 0 0 0 0 0 0 ...
## $ RP : num 3.12 2.6 5.97 1.64 9.43 ...
## $ AMP : num 0.152 0 0 0.103 0.422 ...
## $ CD : num 2.29 2.77 2.4 3.6 2.25 ...
## $ DEMO : num 0.457 0.996 0.48 0.927 0.281 ...
## $ DMA : num 0 0.111 0 0 0 ...
## $ DWNT : num 0 0 0 0.103 0 ...
## $ EMO : num 0 0 0 0 0 0 0 0 0 0 ...
## $ EMPH : num 0.61 0.332 0.24 0 0.563 ...
## $ FPUH : num 0 0 0 0 0 0 0 0 0 0 ...
## $ HDG : num 0.457 0.111 0.84 0.412 0 ...
## $ HST : num 0 0 0 0 0 0 0 0 0 0 ...
## $ IN : num 11.4 13.3 11.8 12.9 14.6 ...
## $ LIKE : num 0.152 0.111 0 0.206 0 ...
## $ NN : num 29.6 27.2 29.4 33.1 28.4 ...
## $ POLITE : num 0 0 0 0 0 ...
## $ RB : num 0.915 1.327 1.321 1.133 1.969 ...
## $ SO : num 0.152 0 0 0 0 ...
## $ URL : num 0 0 0 0.309 0 ...
## $ YNQU : num 0.152 0 0 0 0 ...
nrow(InfoTeen) # Should be 1414 files
## [1] 1414
# Remove the .DS_Store system file (if present) and three outlier files that should not have been included in the corpus
InfoTeen <- InfoTeen %>%
  filter(Filename != ".DS_Store" &
    Filename != "Revision_World_GCSE_10529068_wjec-level-law-past-papers.txt" &
    Filename != "Revision_World_GCSE_10528474_wjec-level-history-past-papers.txt" &
    Filename != "Revision_World_GCSE_10528472_edexcel-level-history-past-papers.txt")
InfoTeen$Series <- "Info Teens"
InfoTeen$Level <- "Ref."
InfoTeen$Country <- "Info Teens"
InfoTeen$Register <- "Info Teens"
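When several files need to be excluded, the chained `!=` conditions in the `filter()` call above can alternatively be written with a vector of file names and `%in%`. This is only a sketch on a toy table: the file names and the object names `toy`, `exclude` and `toy_clean` are hypothetical.

```r
library(dplyr)

# Toy table standing in for the Info Teens counts (hypothetical file names)
toy <- data.frame(
  Filename = c("keep_1.txt", "outlier_a.txt", "keep_2.txt", ".DS_Store"),
  Words = c(650, 40, 900, 0)
)

# Hypothetical exclusion list; in the real script, these would be the file
# names listed in the filter() call above
exclude <- c(".DS_Store", "outlier_a.txt")

# Keep only the rows whose Filename is not in the exclusion list
toy_clean <- toy %>%
  filter(!Filename %in% exclude)

toy_clean
```

Keeping the exclusion list in a single vector makes it easier to document why each file was removed and to reuse the list elsewhere.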
For reasons of space, the results of the five-register dataset were not included in the thesis.
TxBcounts <- readRDS(here("FullMDA", "TxBcounts.rds"))
TxBcounts %>%
filter(Series=="NGL") %>%
group_by(Series, Level) %>%
summarise(wordcount = sum(Words))
counts <- bind_rows(TxBcounts, InfoTeen, SpokenBNC2014, YouthFiction, .id = "Corpus") %>%
filter(Register != "Poetry")
head(counts); tail(counts)
nrow(counts)
# Convert all character vectors to factors
counts[sapply(counts, is.character)] <- lapply(counts[sapply(counts, is.character)], as.factor)
# Change all NAs to 0
counts[is.na(counts)] <- 0
levels(counts$Corpus)
levels(counts$Corpus) <- list(Textbook.English="1", Informative.Teens="2", Spoken.BNC2014="3", Youth.Fiction="4")
summary(counts$Corpus)
summary(counts$Series)
# Re-order registers
levels(counts$Register)
counts$Register <- factor(counts$Register, levels = c("Conversation", "Fiction", "Informative", "Instructional", "Personal", "Info Teens", "Spoken BNC2014", "Youth Fiction"))
# Wrangle metadata variables
counts$Subcorpus <- counts$Register
levels(counts$Subcorpus) <- c("Textbook Conversation", "Textbook Fiction", "Textbook Informative", "Textbook Instructional", "Textbook Personal", "Info Teens Ref.", "Spoken BNC2014 Ref.", "Youth Fiction Ref.")
summary(counts$Subcorpus)
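# Merge the reference corpora into the matching registers: one new label per existing level, in the same order (Poetry was filtered out above, so there are eight levels)
levels(counts$Register) <- c("Conversation", "Fiction", "Informative", "Instructional", "Personal", "Informative", "Conversation", "Fiction")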
summary(counts$Register)
# Re-order variables
colnames(counts)
counts <- counts %>%
select(order(names(.))) %>% # Order alphabetically first
select(Filename, Register, Level, Series, Country, Corpus, Subcorpus, Words, everything()) # Then place the metadata variables at the front of the table
#saveRDS(counts, here("FullMDA", "counts.rds")) # Last saved 9 Feb 2022
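The register wrangling above relies on the fact that assigning a character vector to `levels()` relabels the existing levels by position, and that repeated labels merge the corresponding levels. The toy factor below is an assumption for illustration (it is not the thesis data) and only sketches this base R behaviour.

```r
# Toy factor with textbook registers and reference corpora (illustrative only)
reg <- factor(
  c("Conversation", "Info Teens", "Spoken BNC2014", "Fiction", "Youth Fiction"),
  levels = c("Conversation", "Fiction", "Info Teens", "Spoken BNC2014", "Youth Fiction")
)

# One new label per existing level, in the same order as levels(reg).
# Repeating a label merges the corresponding levels.
levels(reg) <- c("Conversation", "Fiction", "Informative", "Conversation", "Fiction")

levels(reg)
as.character(reg)
```

Since the mapping is purely positional, the replacement vector needs exactly one entry per existing level; checking `levels()` immediately before the assignment, as the script does, guards against mislabelled registers.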
This is the dataset that is presented in the second half of Chapter 7.
TxBcounts <- readRDS(here("FullMDA", "TxBcounts.rds"))
All3Reg <- c("Conversation", "Fiction", "Informative")
TxBcounts3Reg <- TxBcounts %>%
filter(Register %in% All3Reg) %>%
droplevels(.)
counts <- bind_rows(TxBcounts3Reg, InfoTeen, SpokenBNC2014, YouthFiction, .id = "Corpus")
head(counts); tail(counts)
## Corpus Filename Country Series Level
## 1 1 POC_4e_Spoken_0007.txt France POC C
## 2 1 Achievers_B1_plus_Informative_0007.txt Spain Achievers D
## 3 1 POC_5e_Spoken_0003.txt France POC B
## 4 1 Access_4_Narrative_0013.txt Germany Access D
## 5 1 NGL_1_Spoken_0002.txt Germany NGL A
## 6 1 Access_1_Narrative_0005.txt Germany Access A
## Register Words ABLE ACT AMP ASPECT AWL BEMA CAUSE CC
## 1 Conversation 750 0 23.9437 0.2667 2.8169 3.8987 19.7183 1.4085 45.0704
## 2 Informative 690 0 43.7500 0.1449 6.2500 4.6986 14.5833 2.0833 68.7500
## 3 Conversation 694 0 14.1176 0.2882 1.1765 3.8098 31.7647 1.1765 20.0000
## 4 Fiction 547 0 18.9189 0.9141 6.7568 3.9506 20.2703 0.0000 28.3784
## 5 Conversation 927 0 10.4348 0.1079 0.0000 3.8188 46.9565 3.4783 20.8696
## 6 Fiction 840 0 23.5772 0.1190 0.8130 3.9393 15.4472 2.4390 26.8293
## CD COMM CONC COND CONT CUZ DEMO DMA DOAUX DT
## 1 2.4000 9.8592 0.0000 1.4085 19.7183 1.4085 0.2667 2.0000 1.4085 39.3701
## 2 1.3043 16.6667 4.1667 0.0000 4.1667 0.0000 0.1449 0.0000 0.0000 28.2178
## 3 1.0086 3.5294 0.0000 4.7059 29.4118 0.0000 1.1527 2.1614 9.4118 40.3846
## 4 0.1828 2.7027 0.0000 0.0000 8.1081 1.3514 0.0000 0.0000 4.0541 54.8387
## 5 0.2157 1.7391 0.0000 0.0000 47.8261 0.0000 1.1866 2.3732 4.3478 18.8406
## 6 0.8333 22.7642 0.0000 0.0000 12.1951 0.8130 0.7143 1.4286 10.5691 27.0833
## DWNT ELAB EMO EMPH EX EXIST FPP1P FPP1S FPUH FREQ GTO
## 1 0.1333 0.0000 0 0.6667 0.0000 8.4507 18.3099 60.5634 0.8000 2.8169 2.8169
## 2 0.0000 0.0000 0 0.2899 0.0000 12.5000 20.8333 0.0000 0.0000 2.0833 0.0000
## 3 0.0000 2.3529 0 0.4323 3.5294 3.5294 17.6471 15.2941 1.0086 1.1765 4.7059
## 4 0.0000 0.0000 0 0.0000 8.1081 4.0541 20.2703 9.4595 0.0000 1.3514 0.0000
## 5 0.0000 0.0000 0 0.3236 2.6087 0.8696 12.1739 31.3043 3.2362 4.3478 0.0000
## 6 0.0000 0.0000 0 0.3571 4.0650 1.6260 11.3821 23.5772 0.9524 4.0650 0.0000
## HDG HGOT HST IN JJAT JJPR LD LIKE MDCA MDCO MDMM
## 1 0.6667 0.0000 0 10.1333 18.8976 12.6761 0.516000 0.4000 1.4085 1.4085 0
## 2 0.1449 0.0000 0 10.8696 25.7426 12.5000 0.589855 0.0000 4.1667 0.0000 0
## 3 0.1441 0.0000 0 7.7810 13.4615 17.6471 0.504323 0.1441 2.3529 0.0000 0
## 4 0.0000 0.0000 0 11.5174 10.7527 22.9730 0.466179 0.0000 4.0541 1.3514 0
## 5 0.3236 7.8261 0 5.1780 7.2464 27.8261 0.535059 0.1079 6.9565 0.0000 0
## 6 0.0000 0.0000 0 7.6190 7.2917 12.1951 0.557143 0.0000 7.3171 0.8130 0
## MDNE MDWO MDWS MENTAL NCOMP NN OCCUR PASS PEAS PGET
## 1 4.2254 22.5352 2.8169 36.6197 4.7244 16.9333 7.0423 2.8169 0.0000 0
## 2 6.2500 0.0000 10.4167 20.8333 12.8713 29.2754 0.0000 4.1667 4.1667 0
## 3 7.0588 0.0000 1.1765 20.0000 4.8077 14.9856 0.0000 1.1765 0.0000 0
## 4 0.0000 0.0000 1.3514 20.2703 1.0753 17.0018 5.4054 2.7027 2.7027 0
## 5 0.0000 0.0000 0.0000 11.3043 10.6280 22.3301 0.0000 0.0000 0.0000 0
## 6 0.0000 0.0000 0.0000 10.5691 8.3333 22.8571 0.0000 0.8130 0.0000 0
## PIT PLACE POLITE POS PROG PRP QUAN QUPR QUTAG RB RP
## 1 7.0423 0.0000 0.4000 0.0000 2.8169 0 11.0236 5.6338 0 1.4667 4.2254
## 2 12.5000 6.2500 0.0000 0.9901 6.2500 0 2.9703 4.1667 0 0.7246 6.2500
## 3 23.5294 5.8824 0.1441 1.9231 12.9412 0 4.8077 2.3529 0 3.7464 0.0000
## 4 8.1081 6.7568 0.1828 2.1505 4.0541 0 7.5269 1.3514 0 3.1079 5.4054
## 5 15.6522 6.9565 0.9709 3.3816 0.0000 0 0.9662 2.6087 0 1.2945 0.0000
## 6 6.5041 2.4390 0.2381 2.0833 0.8130 0 3.1250 0.0000 0 1.6667 0.8130
## SO SPLIT SPP2 STPR THATD THRC THSC TIME TPP3P TPP3S TTR
## 1 0.2667 1.4085 23.9437 1.4085 0.0000 0 0.0000 5.6338 0.0000 1.4085 0.5050
## 2 0.2899 4.1667 33.3333 0.0000 0.0000 0 2.0833 2.0833 12.5000 0.0000 0.5900
## 3 0.0000 2.3529 24.7059 0.0000 0.0000 0 1.1765 4.7059 2.3529 5.8824 0.4275
## 4 0.9141 0.0000 0.0000 0.0000 1.3514 0 0.0000 4.0541 8.1081 28.3784 0.4525
## 5 0.0000 1.7391 25.2174 0.0000 0.0000 0 0.8696 5.2174 1.7391 7.8261 0.3550
## 6 0.1190 0.8130 9.7561 0.0000 0.0000 0 0.0000 3.2520 8.1301 8.1301 0.5100
## URL VBD VBG VBN VIMP VPRT WHQU WHSC XX0 YNQU
## 1 0.0000 38.0282 7.0423 0.0000 2.8169 26.7606 8.4507 5.6338 7.0423 0.6667
## 2 0.0000 18.7500 14.5833 4.1667 6.2500 54.1667 0.0000 16.6667 4.1667 0.0000
## 3 0.0000 4.7059 0.0000 0.0000 7.0588 77.6471 8.2353 5.8824 12.9412 2.1614
## 4 0.0000 64.8649 8.1081 0.0000 0.0000 28.3784 0.0000 9.4595 5.4054 0.0000
## 5 0.1079 3.4783 1.7391 0.0000 10.4348 79.1304 7.8261 2.6087 17.3913 0.8630
## 6 0.0000 51.2195 0.0000 0.8130 3.2520 37.3984 4.8780 5.6911 8.1301 0.3571
## Corpus Filename Country Series
## 5090 4 130_PRATCHETT1989DW07MIDS_3.txt Youth Fiction Youth Fiction
## 5091 4 163_PRATCHETT1998DW23ULUM_4.txt Youth Fiction Youth Fiction
## 5092 4 106_GOLDING1980RITESAGE_2.txt Youth Fiction Youth Fiction
## 5093 4 68_A-Wrinkle-In-Time_1.txt Youth Fiction Youth Fiction
## 5094 4 81_thetrumpetoftheswan_4.txt Youth Fiction Youth Fiction
## 5095 4 207_DiaryOfAWimpyKid1JeffKinney_1.txt Youth Fiction Youth Fiction
## Level Register Words ABLE ACT AMP ASPECT AWL BEMA
## 5090 Ref. Youth Fiction 5840 0.2642 18.6262 0.2740 1.7173 4.2649 15.3236
## 5091 Ref. Youth Fiction 6141 0.0000 16.3855 0.1628 2.2892 4.0474 12.4096
## 5092 Ref. Youth Fiction 5686 0.1681 20.3361 0.1231 1.1765 4.2921 17.4790
## 5093 Ref. Youth Fiction 5980 1.0485 21.1009 0.3846 1.4417 4.1137 18.3486
## 5094 Ref. Youth Fiction 5772 0.5789 23.2996 0.1213 1.4472 4.0665 13.4588
## 5095 Ref. Youth Fiction 6024 0.4688 37.5000 0.0996 4.3750 4.0144 13.4375
## CAUSE CC CD COMM CONC COND CONT CUZ DEMO DMA
## 5090 0.9247 26.4201 0.5308 15.3236 0.9247 1.0568 15.1915 1.1889 0.9075 0.5993
## 5091 1.2048 17.3494 0.5862 22.0482 0.6024 2.7711 28.7952 0.9639 0.9770 1.0096
## 5092 2.0168 30.9244 0.5276 14.2857 2.0168 2.1849 0.8403 0.8403 0.6683 0.4749
## 5093 2.3591 25.4260 0.7692 14.5478 1.5727 1.7038 10.6160 1.0485 0.6355 0.6355
## 5094 0.8683 29.8119 0.5891 10.1302 0.2894 1.1577 8.8278 0.8683 0.7796 0.1213
## 5095 1.0938 25.3125 1.5438 11.5625 0.1562 1.4062 13.7500 2.6562 0.6474 0.3154
## DOAUX DT DWNT ELAB EMO EMPH EX EXIST FPP1P FPP1S
## 5090 3.1704 44.0196 0.0856 0.3963 0 0.5137 3.1704 3.1704 4.6235 7.2655
## 5091 5.0602 31.7360 0.0977 0.0000 0 0.6188 3.7349 1.8072 6.6265 15.4217
## 5092 3.0252 41.1150 0.0703 0.3361 0 0.3693 3.0252 3.1933 8.4034 37.8151
## 5093 5.6356 27.6768 0.1003 0.0000 0 0.7358 1.9659 2.4902 14.2857 11.4024
## 5094 4.1968 44.0287 0.0693 0.1447 0 0.3119 0.8683 2.0260 1.0130 11.5774
## 5095 3.7500 41.1438 0.0332 0.1562 0 0.7968 1.4062 0.7812 6.4062 45.9375
## FPUH FREQ GTO HDG HGOT HST IN JJAT JJPR LD
## 5090 0.2397 4.0951 0.6605 0.1370 0.3963 NA 9.5205 20.0980 10.5680 0.508904
## 5091 0.4560 2.6506 0.7229 0.1628 1.8072 NA 7.7349 13.8336 8.7952 0.544862
## 5092 0.0528 2.3529 0.0000 0.1231 0.1681 NA 12.4868 22.2997 17.3109 0.465881
## 5093 0.2007 3.6697 0.5242 0.1171 0.2621 NA 8.5953 16.0606 15.2031 0.503010
## 5094 0.1559 3.0391 1.0130 0.0866 0.2894 NA 10.4816 13.9331 12.0116 0.511088
## 5095 0.1162 2.5000 1.4062 0.2158 0.4688 NA 7.8187 11.6759 10.3125 0.541169
## LIKE MDCA MDCO MDMM MDNE MDWO MDWS MENTAL NCOMP NN
## 5090 0.3253 0.9247 2.7741 0.7926 1.7173 3.3025 1.5852 13.7384 2.4510 17.4658
## 5091 0.3582 5.6627 0.8434 0.1205 1.2048 3.0120 3.9759 14.8193 3.8879 18.0101
## 5092 0.1055 2.6891 1.0084 2.1849 4.2017 4.0336 2.6891 15.7983 3.3101 20.1899
## 5093 0.1505 2.3591 2.7523 0.3932 4.0629 3.4076 2.6212 18.4797 5.8586 16.5552
## 5094 0.1213 1.8813 2.0260 0.8683 1.1577 2.6049 3.0391 16.3531 5.0955 21.7602
## 5095 0.2822 1.2500 1.8750 0.1562 3.9062 2.3438 1.2500 24.5312 5.7983 20.8997
## OCCUR PASS PEAS PGET PIT PLACE POLITE POS PROG PRP
## 5090 0.9247 3.4346 6.2087 0.3963 12.4174 4.3593 0.0856 1.8627 5.5482 0.0000
## 5091 1.2048 2.0482 2.7711 0.2410 9.0361 3.4940 0.0977 3.4358 3.6145 0.0000
## 5092 1.8487 6.3866 10.2521 0.0000 9.7479 2.8571 0.1231 1.5679 2.3529 0.1681
## 5093 1.4417 5.6356 6.2910 0.1311 9.5675 3.8008 0.1171 2.1212 3.4076 0.2621
## 5094 2.1708 3.6179 5.6440 0.2894 11.1433 5.0651 0.0347 1.3535 2.3155 0.0000
## 5095 1.0938 2.0312 2.0312 0.9375 11.0938 3.2812 0.1494 1.9063 5.6250 0.0000
## QUAN QUPR QUTAG RB RP SO SPLIT SPP2 STPR THATD
## 5090 6.9608 3.8309 1.1889 2.5171 4.8877 0.2568 2.7741 9.3791 0.7926 1.8494
## 5091 6.6908 3.7349 1.2048 2.4426 6.6265 0.2931 2.6506 16.8675 0.9639 3.3735
## 5092 5.6620 1.8487 0.0000 2.0929 2.6891 0.2638 3.0252 9.0756 0.8403 2.8571
## 5093 4.9495 4.0629 0.3932 2.8595 2.4902 0.1171 3.5387 17.5623 0.3932 1.5727
## 5094 3.9809 3.3285 0.0000 1.7845 5.4993 0.1386 1.7366 8.3936 1.0130 1.5919
## 5095 4.2097 4.5312 0.0000 2.0750 11.2500 0.4648 3.5937 5.6250 0.9375 11.5625
## THRC THSC TIME TPP3P TPP3S TTR URL VBD VBG VBN
## 5090 0.7926 4.3593 3.6988 11.0964 32.4967 0.5275 0 59.4452 6.7371 3.9630
## 5091 1.3253 2.2892 1.8072 10.6024 22.2892 0.4125 0 44.0964 6.5060 1.3253
## 5092 1.3445 7.3950 4.8739 5.5462 30.0840 0.5300 0 49.0756 6.5546 5.5462
## 5093 2.0970 7.6016 4.8493 7.7326 31.9790 0.4875 0 50.3277 6.4220 2.4902
## 5094 1.1577 2.1708 3.0391 6.0781 39.6527 0.5125 0 63.5311 9.6961 2.4602
## 5095 1.8750 5.6250 4.8438 4.6875 20.0000 0.4800 0 57.3438 22.6562 1.4062
## VIMP VPRT WHQU WHSC XX0 YNQU
## 5090 2.1136 27.3448 3.0383 7.0013 10.7001 0.3253
## 5091 3.8554 37.2289 3.3735 4.3373 11.3253 0.3745
## 5092 2.5210 31.5966 0.6723 10.4202 10.2521 0.1759
## 5093 2.0970 31.9790 4.7182 8.9122 14.6789 0.4515
## 5094 2.7496 22.1418 1.8813 8.6831 8.2489 0.0520
## 5095 1.4062 30.4688 0.7812 9.2188 8.5938 0.0830
str(counts)
## 'data.frame': 5095 obs. of 89 variables:
## $ Corpus : chr "1" "1" "1" "1" ...
## $ Filename: chr "POC_4e_Spoken_0007.txt" "Achievers_B1_plus_Informative_0007.txt" "POC_5e_Spoken_0003.txt" "Access_4_Narrative_0013.txt" ...
## $ Country : chr "France" "Spain" "France" "Germany" ...
## $ Series : chr "POC" "Achievers" "POC" "Access" ...
## $ Level : chr "C" "D" "B" "D" ...
## $ Register: chr "Conversation" "Informative" "Conversation" "Fiction" ...
## $ Words : int 750 690 694 547 927 840 1127 1090 635 976 ...
## $ ABLE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ ACT : num 23.9 43.8 14.1 18.9 10.4 ...
## $ AMP : num 0.267 0.145 0.288 0.914 0.108 ...
## $ ASPECT : num 2.82 6.25 1.18 6.76 0 ...
## $ AWL : num 3.9 4.7 3.81 3.95 3.82 ...
## $ BEMA : num 19.7 14.6 31.8 20.3 47 ...
## $ CAUSE : num 1.41 2.08 1.18 0 3.48 ...
## $ CC : num 45.1 68.8 20 28.4 20.9 ...
## $ CD : num 2.4 1.304 1.009 0.183 0.216 ...
## $ COMM : num 9.86 16.67 3.53 2.7 1.74 ...
## $ CONC : num 0 4.17 0 0 0 ...
## $ COND : num 1.41 0 4.71 0 0 ...
## $ CONT : num 19.72 4.17 29.41 8.11 47.83 ...
## $ CUZ : num 1.41 0 0 1.35 0 ...
## $ DEMO : num 0.267 0.145 1.153 0 1.187 ...
## $ DMA : num 2 0 2.16 0 2.37 ...
## $ DOAUX : num 1.41 0 9.41 4.05 4.35 ...
## $ DT : num 39.4 28.2 40.4 54.8 18.8 ...
## $ DWNT : num 0.133 0 0 0 0 ...
## $ ELAB : num 0 0 2.35 0 0 ...
## $ EMO : num 0 0 0 0 0 0 0 0 0 0 ...
## $ EMPH : num 0.667 0.29 0.432 0 0.324 ...
## $ EX : num 0 0 3.53 8.11 2.61 ...
## $ EXIST : num 8.45 12.5 3.53 4.05 0.87 ...
## $ FPP1P : num 18.3 20.8 17.6 20.3 12.2 ...
## $ FPP1S : num 60.56 0 15.29 9.46 31.3 ...
## $ FPUH : num 0.8 0 1.01 0 3.24 ...
## $ FREQ : num 2.82 2.08 1.18 1.35 4.35 ...
## $ GTO : num 2.82 0 4.71 0 0 ...
## $ HDG : num 0.667 0.145 0.144 0 0.324 ...
## $ HGOT : num 0 0 0 0 7.83 ...
## $ HST : num 0 0 0 0 0 0 0 0 0 0 ...
## $ IN : num 10.13 10.87 7.78 11.52 5.18 ...
## $ JJAT : num 18.9 25.74 13.46 10.75 7.25 ...
## $ JJPR : num 12.7 12.5 17.6 23 27.8 ...
## $ LD : num 0.516 0.59 0.504 0.466 0.535 ...
## $ LIKE : num 0.4 0 0.144 0 0.108 ...
## $ MDCA : num 1.41 4.17 2.35 4.05 6.96 ...
## $ MDCO : num 1.41 0 0 1.35 0 ...
## $ MDMM : num 0 0 0 0 0 ...
## $ MDNE : num 4.23 6.25 7.06 0 0 ...
## $ MDWO : num 22.5 0 0 0 0 ...
## $ MDWS : num 2.82 10.42 1.18 1.35 0 ...
## $ MENTAL : num 36.6 20.8 20 20.3 11.3 ...
## $ NCOMP : num 4.72 12.87 4.81 1.08 10.63 ...
## $ NN : num 16.9 29.3 15 17 22.3 ...
## $ OCCUR : num 7.04 0 0 5.41 0 ...
## $ PASS : num 2.82 4.17 1.18 2.7 0 ...
## $ PEAS : num 0 4.17 0 2.7 0 ...
## $ PGET : num 0 0 0 0 0 ...
## $ PIT : num 7.04 12.5 23.53 8.11 15.65 ...
## $ PLACE : num 0 6.25 5.88 6.76 6.96 ...
## $ POLITE : num 0.4 0 0.144 0.183 0.971 ...
## $ POS : num 0 0.99 1.92 2.15 3.38 ...
## $ PROG : num 2.82 6.25 12.94 4.05 0 ...
## $ PRP : num 0 0 0 0 0 0 0 0 0 0 ...
## $ QUAN : num 11.024 2.97 4.808 7.527 0.966 ...
## $ QUPR : num 5.63 4.17 2.35 1.35 2.61 ...
## $ QUTAG : num 0 0 0 0 0 ...
## $ RB : num 1.467 0.725 3.746 3.108 1.294 ...
## $ RP : num 4.23 6.25 0 5.41 0 ...
## $ SO : num 0.267 0.29 0 0.914 0 ...
## $ SPLIT : num 1.41 4.17 2.35 0 1.74 ...
## $ SPP2 : num 23.9 33.3 24.7 0 25.2 ...
## $ STPR : num 1.41 0 0 0 0 ...
## $ THATD : num 0 0 0 1.35 0 ...
## $ THRC : num 0 0 0 0 0 ...
## $ THSC : num 0 2.08 1.18 0 0.87 ...
## $ TIME : num 5.63 2.08 4.71 4.05 5.22 ...
## $ TPP3P : num 0 12.5 2.35 8.11 1.74 ...
## $ TPP3S : num 1.41 0 5.88 28.38 7.83 ...
## $ TTR : num 0.505 0.59 0.427 0.453 0.355 ...
## $ URL : num 0 0 0 0 0.108 ...
## $ VBD : num 38.03 18.75 4.71 64.86 3.48 ...
## $ VBG : num 7.04 14.58 0 8.11 1.74 ...
## $ VBN : num 0 4.17 0 0 0 ...
## $ VIMP : num 2.82 6.25 7.06 0 10.43 ...
## $ VPRT : num 26.8 54.2 77.6 28.4 79.1 ...
## $ WHQU : num 8.45 0 8.24 0 7.83 ...
## $ WHSC : num 5.63 16.67 5.88 9.46 2.61 ...
## $ XX0 : num 7.04 4.17 12.94 5.41 17.39 ...
## $ YNQU : num 0.667 0 2.161 0 0.863 ...
# Convert all character vectors to factors
counts[sapply(counts, is.character)] <- lapply(counts[sapply(counts, is.character)], as.factor)
# Change all NAs to 0
counts[is.na(counts)] <- 0
levels(counts$Corpus)
## [1] "1" "2" "3" "4"
levels(counts$Corpus) <- list(Textbook.English="1", Informative.Teens="2", Spoken.BNC2014="3", Youth.Fiction="4")
summary(counts$Corpus)
## Textbook.English Informative.Teens Spoken.BNC2014 Youth.Fiction
## 1242 1411 1251 1191
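The steps above (combining the tables with `bind_rows(..., .id = "Corpus")`, replacing the resulting `NA`s with 0, and relabelling the numeric corpus IDs) can be sketched on two toy tables. The object names and values below are assumptions for illustration only.

```r
library(dplyr)

# Two toy tables; the second one lacks the EMO column (illustrative values)
a <- data.frame(Filename = c("a1.txt", "a2.txt"), BEMA = c(20, 25), EMO = c(0.1, 0))
b <- data.frame(Filename = "b1.txt", BEMA = 15)

# .id stores the position of each input table ("1", "2") in a new column;
# columns missing from one table are filled with NA
toy <- bind_rows(a, b, .id = "Corpus")

# A feature that is absent from a table was never counted there, hence 0
toy[is.na(toy)] <- 0

# Relabel the IDs: in a named list, names are the new levels, values the old ones
toy$Corpus <- as.factor(toy$Corpus)
levels(toy$Corpus) <- list(Corpus.A = "1", Corpus.B = "2")

toy
```

The IDs follow the order in which the tables are passed to `bind_rows()`, so the named list has to be kept in sync with that order.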
summary(counts$Series)
## Access Achievers EIM GreenLine HT
## 227 108 99 134 106
## Info Teens JTT NGL POC Solutions
## 1411 97 192 60 219
## Spoken BNC2014 Youth Fiction
## 1251 1191
# Wrangle metadata variables
counts$Subcorpus <- counts$Register
levels(counts$Subcorpus)
## [1] "Conversation" "Fiction" "Info Teens" "Informative"
## [5] "Spoken BNC2014" "Youth Fiction"
levels(counts$Subcorpus) <- c("Textbook Conversation", "Textbook Fiction", "Info Teens Ref.", "Textbook Informative", "Spoken BNC2014 Ref.", "Youth Fiction Ref.")
summary(counts$Subcorpus)
## Textbook Conversation Textbook Fiction Info Teens Ref.
## 593 285 1411
## Textbook Informative Spoken BNC2014 Ref. Youth Fiction Ref.
## 364 1251 1191
# Merge the reference corpora into the three register categories
levels(counts$Register)
## [1] "Conversation" "Fiction" "Info Teens" "Informative"
## [5] "Spoken BNC2014" "Youth Fiction"
levels(counts$Register) <- c("Conversation", "Fiction", "Informative", "Informative", "Conversation", "Fiction")
summary(counts$Register)
## Conversation Fiction Informative
## 1844 1476 1775
# Re-order variables
colnames(counts)
## [1] "Corpus" "Filename" "Country" "Series" "Level" "Register"
## [7] "Words" "ABLE" "ACT" "AMP" "ASPECT" "AWL"
## [13] "BEMA" "CAUSE" "CC" "CD" "COMM" "CONC"
## [19] "COND" "CONT" "CUZ" "DEMO" "DMA" "DOAUX"
## [25] "DT" "DWNT" "ELAB" "EMO" "EMPH" "EX"
## [31] "EXIST" "FPP1P" "FPP1S" "FPUH" "FREQ" "GTO"
## [37] "HDG" "HGOT" "HST" "IN" "JJAT" "JJPR"
## [43] "LD" "LIKE" "MDCA" "MDCO" "MDMM" "MDNE"
## [49] "MDWO" "MDWS" "MENTAL" "NCOMP" "NN" "OCCUR"
## [55] "PASS" "PEAS" "PGET" "PIT" "PLACE" "POLITE"
## [61] "POS" "PROG" "PRP" "QUAN" "QUPR" "QUTAG"
## [67] "RB" "RP" "SO" "SPLIT" "SPP2" "STPR"
## [73] "THATD" "THRC" "THSC" "TIME" "TPP3P" "TPP3S"
## [79] "TTR" "URL" "VBD" "VBG" "VBN" "VIMP"
## [85] "VPRT" "WHQU" "WHSC" "XX0" "YNQU" "Subcorpus"
counts <- counts %>%
select(order(names(.))) %>% # Order alphabetically first
select(Filename, Register, Level, Series, Country, Corpus, Subcorpus, Words, everything()) # Then place the metadata variables at the front of the table
#saveRDS(counts, here("FullMDA", "counts3Reg.rds")) # Last saved 9 Feb 2022
ncounts <- readRDS(here("FullMDA", "counts3Reg.rds"))
colnames(ncounts)
## [1] "Filename" "Register" "Level" "Series" "Country" "Corpus"
## [7] "Subcorpus" "Words" "ABLE" "ACT" "AMP" "ASPECT"
## [13] "AWL" "BEMA" "CAUSE" "CC" "CD" "COMM"
## [19] "CONC" "COND" "CONT" "CUZ" "DEMO" "DMA"
## [25] "DOAUX" "DT" "DWNT" "ELAB" "EMO" "EMPH"
## [31] "EX" "EXIST" "FPP1P" "FPP1S" "FPUH" "FREQ"
## [37] "GTO" "HDG" "HGOT" "HST" "IN" "JJAT"
## [43] "JJPR" "LD" "LIKE" "MDCA" "MDCO" "MDMM"
## [49] "MDNE" "MDWO" "MDWS" "MENTAL" "NCOMP" "NN"
## [55] "OCCUR" "PASS" "PEAS" "PGET" "PIT" "PLACE"
## [61] "POLITE" "POS" "PROG" "PRP" "QUAN" "QUPR"
## [67] "QUTAG" "RB" "RP" "SO" "SPLIT" "SPP2"
## [73] "STPR" "THATD" "THRC" "THSC" "TIME" "TPP3P"
## [79] "TPP3S" "TTR" "URL" "VBD" "VBG" "VBN"
## [85] "VIMP" "VPRT" "WHQU" "WHSC" "XX0" "YNQU"
# Compare relative frequencies of individual features, e.g., BE as a main verb per FVP (finite verb phrase)
ncounts %>%
group_by(Register, Corpus) %>%
summarise(median(BEMA), MAD(BEMA))
## # A tibble: 6 × 4
## # Groups: Register [3]
## Register Corpus `median(BEMA)` `MAD(BEMA)`
## <fct> <fct> <dbl> <dbl>
## 1 Conversation Textbook.English 23.9 6.23
## 2 Conversation Spoken.BNC2014 20.6 2.90
## 3 Fiction Textbook.English 15.8 5.13
## 4 Fiction Youth.Fiction 14.2 2.53
## 5 Informative Textbook.English 18.8 6.65
## 6 Informative Informative.Teens 16.9 7.07
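The median and the median absolute deviation (MAD) are robust counterparts of the mean and the standard deviation. As a rough sketch of what the MAD measures, the example below computes it by hand on a toy vector (hypothetical values) and compares it with base R's `mad()`. It assumes the default scaling constant of 1.4826, which, to my understanding, `DescTools::MAD()` also uses by default:

```r
# Toy vector with one extreme value (hypothetical data)
x <- c(12, 15, 14, 13, 90)

# Unscaled MAD: the median of the absolute deviations from the median
raw_mad <- median(abs(x - median(x)))

# Scaled MAD: comparable to the standard deviation for normally distributed data
scaled_mad <- 1.4826 * raw_mad

scaled_mad
mad(x)  # Base R equivalent with its default constant
sd(x)   # Far more strongly affected by the extreme value
```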
# Inspired by: https://drsimonj.svbtle.com/quick-plot-of-all-variables
ncounts %>%
select(-Words) %>%
keep(is.numeric) %>%
gather() %>% # This function from tidyr converts a selection of variables into two variables: a key and a value. The key contains the names of the original variable and the value the data. This means we can then use the facet_wrap function from ggplot2
ggplot(aes(value)) +
theme_bw() +
facet_wrap(~ key, scales = "free", ncol = 4) +
scale_x_continuous(expand=c(0,0)) +
scale_y_continuous(limits = c(0,NA)) +
geom_histogram(aes(y = ..density..), bins = 30, colour= "black", fill = "grey") +
geom_density(colour = "darkred", weight = 2, fill="darkred", alpha = .4)
#ggsave(here("Plots", "DensityPlotsAllVariables.svg"), width = 15, height = 49)
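The wide-to-long reshaping that precedes `facet_wrap()` can be illustrated without any packages. This is a minimal sketch using base R's `stack()` on a toy data frame (hypothetical values); it is an analogue of the `gather()` step, not the function used in the notebook:

```r
# Toy wide table: one column per feature (hypothetical values)
wide <- data.frame(AWL = c(3.9, 4.1), TTR = c(0.41, 0.35))

# stack() returns one row per cell: the value and the column it came from
long <- stack(wide)
names(long) <- c("value", "key")

long
```

Each original column name ends up in `key`, which is what allows one facet per feature.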
ncounts %>%
select(-Words) %>%
keep(is.numeric) %>%
gather() %>% # This function from tidyr converts a selection of variables into two variables: a key and a value. The key contains the names of the original variable and the value the data. This means we can then use the facet_wrap function from ggplot2
ggplot(aes(value)) +
theme_bw() +
facet_wrap(~ key, scales = "free", ncol = 4) +
scale_x_continuous(expand=c(0,0)) +
geom_histogram(bins = 30, colour= "darkred", fill = "darkred", alpha = 0.5)
#ggsave(here("Plots", "HistogramPlotsAllVariables.svg"), width = 20, height = 45)
# For MDA with five TEC registers
# ncounts <- readRDS(here("FullMDA", "counts.rds"))
# For MDA with three TEC registers
ncounts <- readRDS(here("FullMDA", "counts3Reg.rds"))
colnames(ncounts)
## [1] "Filename" "Register" "Level" "Series" "Country" "Corpus"
## [7] "Subcorpus" "Words" "ABLE" "ACT" "AMP" "ASPECT"
## [13] "AWL" "BEMA" "CAUSE" "CC" "CD" "COMM"
## [19] "CONC" "COND" "CONT" "CUZ" "DEMO" "DMA"
## [25] "DOAUX" "DT" "DWNT" "ELAB" "EMO" "EMPH"
## [31] "EX" "EXIST" "FPP1P" "FPP1S" "FPUH" "FREQ"
## [37] "GTO" "HDG" "HGOT" "HST" "IN" "JJAT"
## [43] "JJPR" "LD" "LIKE" "MDCA" "MDCO" "MDMM"
## [49] "MDNE" "MDWO" "MDWS" "MENTAL" "NCOMP" "NN"
## [55] "OCCUR" "PASS" "PEAS" "PGET" "PIT" "PLACE"
## [61] "POLITE" "POS" "PROG" "PRP" "QUAN" "QUPR"
## [67] "QUTAG" "RB" "RP" "SO" "SPLIT" "SPP2"
## [73] "STPR" "THATD" "THRC" "THSC" "TIME" "TPP3P"
## [79] "TPP3S" "TTR" "URL" "VBD" "VBG" "VBN"
## [85] "VIMP" "VPRT" "WHQU" "WHSC" "XX0" "YNQU"
# Removal of meaningless features: CD because numbers as digits were mostly
# removed from the textbooks; LIKE and SO because they are dustbin categories
ncounts <- ncounts %>% select(-c(CD, LIKE, SO))
# Combine low frequency features into meaningful groups whenever this makes
# linguistic sense
ncounts <- ncounts %>% mutate(JJPR = JJPR + ABLE, ABLE = NULL) %>% mutate(PASS = PGET +
PASS, PGET = NULL) %>% mutate(TPP3 = TPP3S + TPP3P, TPP3P = NULL, TPP3S = NULL) %>%
mutate(FQTI = FREQ + TIME, FREQ = NULL, TIME = NULL)
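For readers less familiar with the `mutate(..., X = NULL)` idiom, the following base R sketch performs the same kind of merge on a toy data frame (hypothetical values): the counts of two related features are summed and the now redundant source columns are dropped.

```r
# Toy normalised counts (hypothetical values)
toy <- data.frame(JJPR = c(10, 12), ABLE = c(1, 0),
                  TPP3S = c(5, 7), TPP3P = c(2, 3))

# Sum related features into a single column
toy$JJPR <- toy$JJPR + toy$ABLE
toy$TPP3 <- toy$TPP3S + toy$TPP3P

# Drop the columns that have been merged into another feature
toy <- toy[setdiff(names(toy), c("ABLE", "TPP3S", "TPP3P"))]

toy
```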
zero_features <- as.data.frame(round(colSums(ncounts == 0)/nrow(ncounts) * 100, 2)) # Percentage of texts with 0 occurrences of each feature
colnames(zero_features) <- "Percentage_with_zero"
zero_features %>% filter(!is.na(zero_features)) %>% rownames_to_column() %>% arrange(Percentage_with_zero) %>%
filter(Percentage_with_zero > 66.6)
## rowname Percentage_with_zero
## 1 PRP 85.34
## 2 URL 93.03
## 3 EMO 98.98
## 4 HST 99.55
zero_features <- as.data.frame(round(colSums(ncounts > 0)/nrow(ncounts) * 100, 2)) # Percentage of texts with at least one occurrence of each feature (document frequency)
colnames(zero_features) <- "Percentage_above_zero"
zero_features %>% rownames_to_column() %>% filter(!is.na(zero_features)) %>% arrange(desc(Percentage_above_zero))
## rowname Percentage_above_zero
## 1 Words 100.00
## 2 AWL 100.00
## 3 CC 100.00
## 4 DT 100.00
## 5 IN 100.00
## 6 JJAT 100.00
## 7 LD 100.00
## 8 NN 100.00
## 9 TTR 100.00
## 10 ACT 99.98
## 11 RB 99.98
## 12 BEMA 99.94
## 13 JJPR 99.92
## 14 NCOMP 99.82
## 15 MENTAL 99.69
## 16 QUAN 99.69
## 17 VPRT 99.57
## 18 TPP3 99.29
## 19 WHSC 99.02
## 20 FQTI 98.96
## 21 COMM 98.82
## 22 DEMO 98.57
## 23 PIT 98.41
## 24 VBD 98.10
## 25 VBG 96.80
## 26 XX0 96.68
## 27 EMPH 96.15
## 28 PASS 95.31
## 29 EXIST 94.27
## 30 POS 93.58
## 31 VBN 93.15
## 32 SPLIT 92.37
## 33 RP 92.15
## 34 THSC 91.13
## 35 PLACE 90.42
## 36 OCCUR 90.19
## 37 PEAS 89.52
## 38 AMP 89.34
## 39 DOAUX 88.89
## 40 CONT 88.81
## 41 ASPECT 88.22
## 42 CAUSE 87.54
## 43 SPP2 87.05
## 44 PROG 86.73
## 45 MDCA 86.30
## 46 VIMP 85.12
## 47 FPP1P 84.38
## 48 EX 83.57
## 49 QUPR 83.42
## 50 MDNE 81.06
## 51 THRC 80.51
## 52 THATD 80.00
## 53 COND 78.68
## 54 CUZ 78.31
## 55 FPP1S 78.17
## 56 DMA 78.14
## 57 WHQU 77.96
## 58 MDWS 77.08
## 59 MDWO 75.82
## 60 HDG 75.43
## 61 MDCO 73.70
## 62 YNQU 72.03
## 63 FPUH 68.85
## 64 CONC 68.44
## 65 MDMM 66.01
## 66 STPR 64.83
## 67 DWNT 60.37
## 68 POLITE 59.84
## 69 GTO 56.09
## 70 HGOT 51.60
## 71 ELAB 49.50
## 72 QUTAG 45.71
## 73 PRP 14.66
## 74 URL 6.97
## 75 EMO 1.02
## 76 HST 0.45
docfreq.too.low <- zero_features %>% filter(!is.na(zero_features)) %>% subset(Percentage_above_zero <
33.3) %>% rownames_to_column() %>% select(rowname) # Select all variables with a document frequency below 33.3%, i.e. features that occur in fewer than a third of the texts
docfreq.too.low
## rowname
## 1 EMO
## 2 HST
## 3 PRP
## 4 URL
ncounts <- select(ncounts, -one_of(docfreq.too.low$rowname)) # Drop these variables
colnames(ncounts)
## [1] "Filename" "Register" "Level" "Series" "Country" "Corpus"
## [7] "Subcorpus" "Words" "ACT" "AMP" "ASPECT" "AWL"
## [13] "BEMA" "CAUSE" "CC" "COMM" "CONC" "COND"
## [19] "CONT" "CUZ" "DEMO" "DMA" "DOAUX" "DT"
## [25] "DWNT" "ELAB" "EMPH" "EX" "EXIST" "FPP1P"
## [31] "FPP1S" "FPUH" "GTO" "HDG" "HGOT" "IN"
## [37] "JJAT" "JJPR" "LD" "MDCA" "MDCO" "MDMM"
## [43] "MDNE" "MDWO" "MDWS" "MENTAL" "NCOMP" "NN"
## [49] "OCCUR" "PASS" "PEAS" "PIT" "PLACE" "POLITE"
## [55] "POS" "PROG" "QUAN" "QUPR" "QUTAG" "RB"
## [61] "RP" "SPLIT" "SPP2" "STPR" "THATD" "THRC"
## [67] "THSC" "TTR" "VBD" "VBG" "VBN" "VIMP"
## [73] "VPRT" "WHQU" "WHSC" "XX0" "YNQU" "TPP3"
## [79] "FQTI"
ncol(ncounts) - 8 # Number of linguistic features remaining
## [1] 71
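The document-frequency filter applied above can be sketched on a toy table as follows. The data and object names are hypothetical; restricting the computation to numeric columns is a design choice of this sketch that avoids comparing factor columns with zero.

```r
# Toy table: one metadata column and three features (hypothetical values)
toy <- data.frame(
  Filename = factor(c("a.txt", "b.txt", "c.txt", "d.txt")),
  NN = c(250, 300, 275, 310),
  QUTAG = c(0, 1.5, 0, 3),
  URL = c(0, 0, 0, 2)
)

# Percentage of texts in which each feature occurs at least once
features <- toy[sapply(toy, is.numeric)]
doc_freq <- round(colMeans(features > 0) * 100, 2)

# Features found in fewer than a third of the texts are considered too sparse
too_sparse <- names(doc_freq)[doc_freq < 33.3]
reduced <- toy[, !(names(toy) %in% too_sparse)]

doc_freq
colnames(reduced)
```

Here `colMeans()` on a logical matrix gives the proportion of `TRUE` values per column, which is equivalent to `colSums(...) / nrow(...)`.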
# With five TEC registers
# saveRDS(ncounts, here('FullMDA', 'ncounts2.rds')) # Last saved 18 November 2021
# With three TEC registers
# saveRDS(ncounts, here('FullMDA', 'ncounts2_3Reg.rds')) # Last saved 9 Feb 2022
“As an alternative to removing very sparse features, we apply a signed logarithmic transformation to deskew the feature distributions.” (Neumann & Evert)
# First scale the normalised counts (z-standardisation) to be able to compare the
# various features
zcounts <- ncounts %>% select(-Words) %>% keep(is.numeric) %>% scale()
boxplot(zcounts, las = 3, main = "z-scores") # Slow
# If necessary, remove any outliers at this stage.
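As a quick check of what `scale()` does with its default arguments, the sketch below standardises a toy table (hypothetical values) and compares one column with a manual computation: each value has its column mean subtracted and is then divided by the column's standard deviation.

```r
# Toy normalised counts (hypothetical values)
toy <- data.frame(NN = c(200, 250, 300), VBD = c(10, 40, 10))

# Column-wise z-standardisation
z <- scale(toy)

# Manual equivalent for a single feature
manual_NN <- (toy$NN - mean(toy$NN)) / sd(toy$NN)

z
manual_NN
```

After standardisation every feature has a mean of 0 and a standard deviation of 1, which is what makes the boxplots of different features comparable.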
colnames(ncounts)
## [1] "Filename" "Register" "Level" "Series" "Country" "Corpus"
## [7] "Subcorpus" "Words" "ACT" "AMP" "ASPECT" "AWL"
## [13] "BEMA" "CAUSE" "CC" "COMM" "CONC" "COND"
## [19] "CONT" "CUZ" "DEMO" "DMA" "DOAUX" "DT"
## [25] "DWNT" "ELAB" "EMPH" "EX" "EXIST" "FPP1P"
## [31] "FPP1S" "FPUH" "GTO" "HDG" "HGOT" "IN"
## [37] "JJAT" "JJPR" "LD" "MDCA" "MDCO" "MDMM"
## [43] "MDNE" "MDWO" "MDWS" "MENTAL" "NCOMP" "NN"
## [49] "OCCUR" "PASS" "PEAS" "PIT" "PLACE" "POLITE"
## [55] "POS" "PROG" "QUAN" "QUPR" "QUTAG" "RB"
## [61] "RP" "SPLIT" "SPP2" "STPR" "THATD" "THRC"
## [67] "THSC" "TTR" "VBD" "VBG" "VBN" "VIMP"
## [73] "VPRT" "WHQU" "WHSC" "XX0" "YNQU" "TPP3"
## [79] "FQTI"
data <- cbind(ncounts[, 1:8], as.data.frame(zcounts))
str(data)
## 'data.frame': 5095 obs. of 79 variables:
## $ Filename : Factor w/ 5095 levels "1_BaumWizardOz_1.txt",..: 2789 1473 2801 1365 2533 1197 2556 4205 2590 2502 ...
## $ Register : Factor w/ 3 levels "Conversation",..: 1 3 1 2 1 2 2 1 3 2 ...
## $ Level : Factor w/ 6 levels "A","B","C","D",..: 3 4 2 4 1 1 2 3 3 5 ...
## $ Series : Factor w/ 12 levels "Access","Achievers",..: 9 2 9 1 8 1 8 10 8 7 ...
## $ Country : Factor w/ 6 levels "France","Germany",..: 1 4 1 2 2 2 2 4 2 1 ...
## $ Corpus : Factor w/ 4 levels "Textbook.English",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Subcorpus: Factor w/ 6 levels "Textbook Conversation",..: 1 4 1 2 1 2 2 1 4 2 ...
## $ Words : int 750 690 694 547 927 840 1127 1090 635 976 ...
## $ ACT : num 0.0641 2.6962 -1.2416 -0.6036 -1.7311 ...
## $ AMP : num 0.0391 -0.5615 0.1452 3.2319 -0.744 ...
## $ ASPECT : num 0.182 1.743 -0.564 1.974 -1.099 ...
## $ AWL : num -0.863 1.038 -1.074 -0.739 -1.053 ...
## $ BEMA : num 0.19 -0.606 2.059 0.276 4.415 ...
## $ CAUSE : num -0.2546 0.0289 -0.352 -0.8463 0.615 ...
## $ CC : num 0.785 2.317 -0.838 -0.296 -0.782 ...
## $ COMM : num -0.119 1.101 -1.254 -1.402 -1.575 ...
## $ CONC : num -0.777 2.831 -0.777 -0.777 -0.777 ...
## $ COND : num -0.412 -1.166 1.355 -1.166 -1.166 ...
## $ CONT : num 0.059 -0.999 0.719 -0.731 1.972 ...
## $ CUZ : num -0.316 -1.034 -1.034 -0.345 -1.034 ...
## $ DEMO : num -1.412 -1.668 0.451 -1.973 0.522 ...
## $ DMA : num 0.558 -0.797 0.667 -0.797 0.811 ...
## $ DOAUX : num -1.023 -1.413 1.188 -0.293 -0.211 ...
## $ DT : num 0.755 -0.625 0.881 2.67 -1.786 ...
## $ DWNT : num 1.001 -0.767 -0.767 -0.767 -0.767 ...
## $ ELAB : num -0.415 -0.415 0.896 -0.415 -0.415 ...
## $ EMPH : num -0.145 -0.859 -0.589 -1.408 -0.795 ...
## $ EX : num -1.222 -1.222 0.647 3.071 0.159 ...
## $ EXIST : num 1.4708 2.6525 0.0346 0.1877 -0.7416 ...
## $ FPP1P : num 2.24 2.69 2.13 2.59 1.16 ...
## $ FPP1S : num 2.612 -1.115 -0.174 -0.533 0.812 ...
## $ FPUH : num -0.00108 -0.66909 0.1731 -0.66909 2.03315 ...
## $ GTO : num 1.895 -0.672 3.617 -0.672 -0.672 ...
## $ HDG : num 2.168 -0.233 -0.237 -0.9 0.589 ...
## $ HGOT : num -0.66 -0.66 -0.66 -0.66 6.05 ...
## $ IN : num 0.262 0.533 -0.602 0.771 -1.559 ...
## $ JJAT : num 0.349 1.726 -0.744 -1.289 -1.994 ...
## $ JJPR : num -0.478 -0.507 0.341 1.218 2.017 ...
## $ LD : num -0.3301 1.2881 -0.586 -1.4218 0.0875 ...
## $ MDCA : num -0.572 0.335 -0.262 0.298 1.252 ...
## $ MDCO : num -0.0713 -0.9951 -0.9951 -0.1087 -0.9951 ...
## $ MDMM : num -0.633 -0.633 -0.633 -0.633 -0.633 ...
## $ MDNE : num 1.12 2.2 2.63 -1.14 -1.14 ...
## $ MDWO : num 9.53 -1.07 -1.07 -1.07 -1.07 ...
## $ MDWS : num 0.0968 2.8613 -0.4999 -0.4362 -0.9278 ...
## $ MENTAL : num 3.152 0.613 0.479 0.523 -0.919 ...
## $ NCOMP : num -0.601 2.051 -0.574 -1.788 1.321 ...
## $ NN : num -0.377 1.251 -0.634 -0.368 0.334 ...
## $ OCCUR : num 1.344 -0.923 -0.923 0.817 -0.923 ...
## $ PASS : num -0.499 -0.298 -0.744 -0.516 -0.919 ...
## $ PEAS : num -1.321 -0.131 -1.321 -0.549 -1.321 ...
## $ PIT : num -0.8628 -0.0712 1.5284 -0.7082 0.386 ...
## $ PLACE : num -1.275 0.86 0.734 1.033 1.101 ...
## $ POLITE : num 1.356 -0.508 0.164 0.344 4.017 ...
## $ POS : num -1.5355 -0.8173 -0.1405 0.0244 0.9175 ...
## $ PROG : num -0.3931 0.8798 3.3607 0.0656 -1.4375 ...
## $ QUAN : num 1.024 -1.086 -0.605 0.108 -1.611 ...
## $ QUPR : num 1.318 0.607 -0.271 -0.756 -0.147 ...
## $ QUTAG : num -0.554 -0.554 -0.554 -0.554 -0.554 ...
## $ RB : num -0.924 -1.798 1.761 1.009 -1.127 ...
## $ RP : num -0.0544 0.6924 -1.6129 0.3809 -1.6129 ...
## $ SPLIT : num -0.806 0.45 -0.376 -1.447 -0.655 ...
## $ SPP2 : num 1.06 1.95 1.14 -1.21 1.18 ...
## $ STPR : num 0.743 -0.87 -0.87 -0.87 -0.87 ...
## $ THATD : num -1.247 -1.247 -1.247 -0.612 -1.247 ...
## $ THRC : num -0.829 -0.829 -0.829 -0.829 -0.829 ...
## $ THSC : num -1.371 -0.674 -0.977 -1.371 -1.08 ...
## $ TTR : num 0.564 1.78 -0.545 -0.188 -1.583 ...
## $ VBD : num -0.00474 -0.76963 -1.32684 1.06003 -1.37555 ...
## $ VBG : num -0.32 0.724 -1.295 -0.172 -1.055 ...
## $ VBN : num -0.8756 -0.0798 -0.8756 -0.8756 -0.8756 ...
## $ VIMP : num -0.039 0.798 0.996 -0.726 1.819 ...
## $ VPRT : num -0.875 0.347 1.394 -0.803 1.46 ...
## $ WHQU : num 1.52 -1.02 1.46 -1.02 1.34 ...
## $ WHSC : num -0.671 1.874 -0.614 0.212 -1.369 ...
## $ XX0 : num -0.613 -1.175 0.54 -0.933 1.409 ...
## $ YNQU : num 0.878 -0.897 4.856 -0.897 1.4 ...
## $ TPP3 : num -1.377 -0.791 -1.016 0.477 -0.946 ...
## $ FQTI : num 0.284 -0.873 -0.409 -0.538 0.586 ...
nrow(data)
## [1] 5095
# Flag as potential outliers all texts with a z-score above 8 on at least one
# feature
outliers <- data %>% filter(if_any(where(is.numeric) & !Words, .fns = function(x) {
x > 8
})) %>% select(Filename, Corpus, Series, Register, Level, Words)
outliers
## Filename
## 1 POC_4e_Spoken_0007.txt
## 2 Solutions_Elementary_ELF_Spoken_0013.txt
## 3 EIM_Starter_Informative_0004.txt
## 4 GreenLine_1_Spoken_0003.txt
## 5 Access_1_Spoken_0011.txt
## 6 Achievers_B1_Informative_0003.txt
## 7 EIM_Starter_Spoken_0002.txt
## 8 GreenLine_1_Spoken_0008.txt
## 9 JTT_3_Informative_0003.txt
## 10 GreenLine_1_Spoken_0010.txt
## 11 EIM_1_Spoken_0012.txt
## 12 NGL_1_Spoken_0013.txt
## 13 NGL_3_Spoken_0018.txt
## 14 Solutions_Intermediate_Spoken_0029.txt
## 15 NGL_1_Spoken_0012.txt
## 16 GreenLine_1_Spoken_0006.txt
## 17 GreenLine_2_Spoken_0004.txt
## 18 Access_2_Spoken_0023.txt
## 19 HT_4_Informative_0006.txt
## 20 Solutions_Intermediate_Informative_0017.txt
## 21 EIM_1_Spoken_0013.txt
## 22 Solutions_Elementary_ELF_Spoken_0021.txt
## 23 Solutions_Intermediate_Plus_Spoken_0022.txt
## 24 Access_2_Spoken_0028.txt
## 25 NGL_1_Spoken_0005.txt
## 26 Solutions_Elementary_ELF_Spoken_0016.txt
## 27 Solutions_Pre-Intermediate_ELF_Spoken_0007.txt
## 28 Solutions_Intermediate_Informative_0013.txt
## 29 GreenLine_2_Spoken_0003.txt
## 30 HT_4_Spoken_0010.txt
## 31 Solutions_Elementary_Informative_0003.txt
## 32 Access_2_Informative_0001.txt
## 33 Solutions_Elementary_Informative_0010.txt
## 34 GreenLine_1_Informative_0001.txt
## 35 Access_2_Spoken_0002.txt
## 36 Solutions_Intermediate_Spoken_0019.txt
## 37 Access_3_Informative_0003.txt
## 38 Access_1_Spoken_0019.txt
## 39 Access_2_Spoken_0013.txt
## 40 Solutions_Intermediate_Plus_Informative_0014.txt
## 41 Revision_World_GCSE_10525362_literary-terms.txt
## 42 Revision_World_GCSE_10528697_p6-physics-radioactive-materials.txt
## 43 Science_Tech_Kinds_NZ_10382383_math.txt
## 44 Science_for_students_10064820_scientists-say-metabolism.txt
## 45 Science_Tech_Kinds_NZ_10382388_recycling.txt
## 46 History_Kids_BBC_10404337_go_furthers.txt
## 47 Science_Tech_Kinds_NZ_10382391_sports.txt
## 48 Teen_Kids_News_10402607_so-you-want-to-be-an-archivist.txt
## 49 Science_Tech_Kinds_NZ_10382234_biology.txt
## 50 Science_Tech_Kinds_NZ_10382372_astronomy.txt
## 51 Dogo_News_file10060404_banana-plant-extract-may-be-the-key-to-slower-melting-ice-cream.txt
## 52 Science_Tech_Kinds_NZ_10382667_countries.txt
## 53 Quatr_us_file10390777_quick-summary-geological-erashtm.txt
## 54 Science_Tech_Kinds_NZ_10382873_physics.txt
## 55 Science_Tech_Kinds_NZ_10382382_light.txt
## 56 Factmonster_10053687_august-13.txt
## 57 Revision_World_GCSE_10526703_limited-companies.txt
## 58 Revision_World_GCSE_10529637_transition-metals.txt
## 59 Quatr_us_10390856_early-african-historyhtm.txt
## 60 History_Kids_BBC_10401873_ff6_sicilylandingss.txt
## 61 Quatr_us_10394250_harappan.txt
## 62 Ducksters_10398301_iraqphp.txt
## 63 History_Kids_BBC_10403171_death_sakkara_gallery_04s.txt
## 64 Revision_World_GCSE_10528246_agricultural-change.txt
## 65 Revision_World_GCSE_10528086_uk-government-judiciary.txt
## 66 Revision_World_GCSE_10529794_definitions.txt
## 67 Encyclopedia_Kinds_au_10085347_Nobel_Prize_in_Chemistry.txt
## 68 Science_for_students_10064875_questions-big-melt-earths-ice-sheets-are-under-attack.txt
## 69 Teen_Kids_News_10403301_golden-globe-winners-2019-the-complete-list.txt
## 70 Science_Tech_Kinds_NZ_10382201_projects.txt
## 71 Revision_World_GCSE_10529753_probability.txt
## 72 Encyclopedia_Kinds_au_10085531_Complex_analysis.txt
## 73 History_Kids_BBC_10401890_ff7_ddays.txt
## 74 History_Kids_BBC_10403434s.txt
## 75 History_Kids_BBC_10401872_ff6_italys.txt
## 76 Science_Tech_Kinds_NZ_10382371_amazing.txt
## 77 Quatr_us_10391129_athabascan.txt
## 78 Encyclopedia_Kinds_au_10085355_20th_century.txt
## 79 Dogo_News_10060755_luxury-space-hotel-promises-guests-a-truly-out-of-this-world-vacation.txt
## 80 Revision_World_GCSE_10528072_nationalism-practice.txt
## 81 Quatr_us_10390861_quatr-us-privacy-policyhtm.txt
## 82 History_Kids_BBC_10401909_ff7_bulges.txt
## 83 History_kids_10381259_timeline-of-mesopotamia.txt
## 84 Revision_World_GCSE_10528123_gender-written-textual-analysis-framework.txt
## 85 Science_Tech_Kinds_NZ_10386406_floods.txt
## 86 Revision_World_GCSE_10529693_advantages.txt
## 87 Science_Tech_Kinds_NZ_10382378_geography.txt
## 88 Science_Tech_Kinds_NZ_10382374_earth.txt
## 89 Science_for_students_10066286_watering-plants-wastewater-can-spread-germs.txt
## 90 Science_Tech_Kinds_NZ_10382393_water.txt
## 91 World_Dteen_10406069_website_policies.txt
## 92 Science_Tech_Kinds_NZ_10382384_metals.txt
## 93 Dogo_News_10062028_puppy-bowl-14-promises-viewers-a-paw-some-time-on-super-bowl-sunday.txt
## 94 History_Kids_BBC_10404730_go_furthers.txt
## 95 Science_Tech_Kinds_NZ_10382385_nature.txt
## 96 Science_for_students_10065015_scientists-say-dna-sequencing.txt
## 97 Quatr_us_file10390817_conifers-pine-trees-gymnospermshtm.txt
## 98 TweenTribute_10051509_it-true-elephants-cant-jump.txt
## 99 Revision_World_GCSE_10528494_application-software.txt
## 100 Revision_World_GCSE_10529581_different-types-questions-examinations.txt
## 101 Dogo_News_10061669_the-chinese-city-of-chengdu-may-soon-be-home-to-multiple-moons.txt
## 102 Ducksters_10398306_geography_of_ancient_chinaphp.txt
## 103 Science_for_students_10065144_scientists-say-multiverse.txt
## 104 Science_Tech_Kinds_NZ_10382211_images.txt
## 105 Factmonster_10053754_may-18.txt
## 106 World_Dteen_10406047_AboutWORLDteen.txt
## 107 Ducksters_10398078_first_new_dealphp.txt
## 108 Revision_World_GCSE_10526926_economies-scale.txt
## 109 Factmonster_10053201_september-03.txt
## 110 Science_Tech_Kinds_NZ_10387183_calciumcarbonates.txt
## 111 Science_Tech_Kinds_NZ_10382380_health.txt
## 112 Revision_World_GCSE_10529587_sources-finance.txt
## 113 Quatr_us_10393444_fishing.txt
## 114 Ducksters_10398315_glossary_and_termsphp.txt
## 115 S5AA.txt
## Corpus Series Register Level Words
## 1 Textbook.English POC Conversation C 750
## 2 Textbook.English Solutions Conversation A 931
## 3 Textbook.English EIM Informative A 534
## 4 Textbook.English GreenLine Conversation A 970
## 5 Textbook.English Access Conversation A 784
## 6 Textbook.English Achievers Informative C 926
## 7 Textbook.English EIM Conversation A 824
## 8 Textbook.English GreenLine Conversation A 876
## 9 Textbook.English JTT Informative D 699
## 10 Textbook.English GreenLine Conversation A 701
## 11 Textbook.English EIM Conversation B 640
## 12 Textbook.English NGL Conversation A 940
## 13 Textbook.English NGL Conversation C 751
## 14 Textbook.English Solutions Conversation C 672
## 15 Textbook.English NGL Conversation A 910
## 16 Textbook.English GreenLine Conversation A 622
## 17 Textbook.English GreenLine Conversation B 1102
## 18 Textbook.English Access Conversation B 875
## 19 Textbook.English HT Informative C 513
## 20 Textbook.English Solutions Informative C 816
## 21 Textbook.English EIM Conversation B 967
## 22 Textbook.English Solutions Conversation A 846
## 23 Textbook.English Solutions Conversation D 596
## 24 Textbook.English Access Conversation B 813
## 25 Textbook.English NGL Conversation A 1020
## 26 Textbook.English Solutions Conversation A 871
## 27 Textbook.English Solutions Conversation B 630
## 28 Textbook.English Solutions Informative C 770
## 29 Textbook.English GreenLine Conversation B 850
## 30 Textbook.English HT Conversation C 727
## 31 Textbook.English Solutions Informative A 1051
## 32 Textbook.English Access Informative B 655
## 33 Textbook.English Solutions Informative A 708
## 34 Textbook.English GreenLine Informative A 731
## 35 Textbook.English Access Conversation B 572
## 36 Textbook.English Solutions Conversation C 1024
## 37 Textbook.English Access Informative C 1000
## 38 Textbook.English Access Conversation A 701
## 39 Textbook.English Access Conversation B 981
## 40 Textbook.English Solutions Informative D 537
## 41 Informative.Teens Info Teens Informative Ref. 790
## 42 Informative.Teens Info Teens Informative Ref. 1015
## 43 Informative.Teens Info Teens Informative Ref. 522
## 44 Informative.Teens Info Teens Informative Ref. 895
## 45 Informative.Teens Info Teens Informative Ref. 666
## 46 Informative.Teens Info Teens Informative Ref. 620
## 47 Informative.Teens Info Teens Informative Ref. 657
## 48 Informative.Teens Info Teens Informative Ref. 763
## 49 Informative.Teens Info Teens Informative Ref. 843
## 50 Informative.Teens Info Teens Informative Ref. 900
## 51 Informative.Teens Info Teens Informative Ref. 611
## 52 Informative.Teens Info Teens Informative Ref. 717
## 53 Informative.Teens Info Teens Informative Ref. 643
## 54 Informative.Teens Info Teens Informative Ref. 722
## 55 Informative.Teens Info Teens Informative Ref. 639
## 56 Informative.Teens Info Teens Informative Ref. 523
## 57 Informative.Teens Info Teens Informative Ref. 714
## 58 Informative.Teens Info Teens Informative Ref. 787
## 59 Informative.Teens Info Teens Informative Ref. 1136
## 60 Informative.Teens Info Teens Informative Ref. 813
## 61 Informative.Teens Info Teens Informative Ref. 651
## 62 Informative.Teens Info Teens Informative Ref. 657
## 63 Informative.Teens Info Teens Informative Ref. 844
## 64 Informative.Teens Info Teens Informative Ref. 789
## 65 Informative.Teens Info Teens Informative Ref. 1019
## 66 Informative.Teens Info Teens Informative Ref. 904
## 67 Informative.Teens Info Teens Informative Ref. 598
## 68 Informative.Teens Info Teens Informative Ref. 685
## 69 Informative.Teens Info Teens Informative Ref. 800
## 70 Informative.Teens Info Teens Informative Ref. 947
## 71 Informative.Teens Info Teens Informative Ref. 816
## 72 Informative.Teens Info Teens Informative Ref. 735
## 73 Informative.Teens Info Teens Informative Ref. 759
## 74 Informative.Teens Info Teens Informative Ref. 732
## 75 Informative.Teens Info Teens Informative Ref. 786
## 76 Informative.Teens Info Teens Informative Ref. 629
## 77 Informative.Teens Info Teens Informative Ref. 637
## 78 Informative.Teens Info Teens Informative Ref. 864
## 79 Informative.Teens Info Teens Informative Ref. 722
## 80 Informative.Teens Info Teens Informative Ref. 776
## 81 Informative.Teens Info Teens Informative Ref. 960
## 82 Informative.Teens Info Teens Informative Ref. 732
## 83 Informative.Teens Info Teens Informative Ref. 768
## 84 Informative.Teens Info Teens Informative Ref. 905
## 85 Informative.Teens Info Teens Informative Ref. 580
## 86 Informative.Teens Info Teens Informative Ref. 782
## 87 Informative.Teens Info Teens Informative Ref. 761
## 88 Informative.Teens Info Teens Informative Ref. 726
## 89 Informative.Teens Info Teens Informative Ref. 836
## 90 Informative.Teens Info Teens Informative Ref. 856
## 91 Informative.Teens Info Teens Informative Ref. 995
## 92 Informative.Teens Info Teens Informative Ref. 669
## 93 Informative.Teens Info Teens Informative Ref. 581
## 94 Informative.Teens Info Teens Informative Ref. 611
## 95 Informative.Teens Info Teens Informative Ref. 722
## 96 Informative.Teens Info Teens Informative Ref. 953
## 97 Informative.Teens Info Teens Informative Ref. 533
## 98 Informative.Teens Info Teens Informative Ref. 790
## 99 Informative.Teens Info Teens Informative Ref. 855
## 100 Informative.Teens Info Teens Informative Ref. 742
## 101 Informative.Teens Info Teens Informative Ref. 614
## 102 Informative.Teens Info Teens Informative Ref. 638
## 103 Informative.Teens Info Teens Informative Ref. 712
## 104 Informative.Teens Info Teens Informative Ref. 793
## 105 Informative.Teens Info Teens Informative Ref. 497
## 106 Informative.Teens Info Teens Informative Ref. 1053
## 107 Informative.Teens Info Teens Informative Ref. 649
## 108 Informative.Teens Info Teens Informative Ref. 621
## 109 Informative.Teens Info Teens Informative Ref. 445
## 110 Informative.Teens Info Teens Informative Ref. 804
## 111 Informative.Teens Info Teens Informative Ref. 694
## 112 Informative.Teens Info Teens Informative Ref. 665
## 113 Informative.Teens Info Teens Informative Ref. 656
## 114 Informative.Teens Info Teens Informative Ref. 684
## 115 Spoken.BNC2014 Spoken BNC2014 Conversation Ref. 1869
# Checking that outlier texts are not particularly long or short texts
summary(outliers$Words)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 445.0 655.5 751.0 773.6 860.0 1869.0
histogram(outliers$Words, breaks = 30)
summary(data$Words)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 427 788 1179 4428 5999 148358
# Distribution of outlier texts
summary(outliers$Corpus)
## Textbook.English Informative.Teens Spoken.BNC2014 Youth.Fiction
## 40 74 1 0
# Manually checking a sample of these outliers:
# Encyclopedia_Kinds_au_10085347_Nobel_Prize_in_Chemistry.txt is essentially a
# list of Nobel prize winners but with some additional information. Hence a good
# representative of the type of texts of the ITTC.
# Solutions_Elementary_ELF_Spoken_0013 --> Has a lot of 'going to' constructions
# because they are learnt in this chapter but is otherwise a well-formed text.
# Teen_Kids_News_10403972_a-brief-history-of-white-house-weddings --> No issues
# Teen_Kids_News_10403301_golden-globe-winners-2019-the-complete-list --> Similar
# to the Nobel prize laureates text.
# Revision_World_GCSE_10528123_gender-written-textual-analysis-framework --> Text
# includes bullet points tokenised as the letter 'o' but otherwise a fairly
# typical informative text.
# Removing the outliers
ncounts <- ncounts %>% filter(!Filename %in% outliers$Filename)
nrow(ncounts)
## [1] 4980
# saveRDS(ncounts, here('FullMDA', 'ncounts3_3Reg.rds')) # Last saved 9 Feb 2022
zcounts <- ncounts %>% select(-Words) %>% keep(is.numeric) %>% scale()
nrow(zcounts)
## [1] 4980
boxplot(zcounts, las = 3, main = "z-scores") # Slow to open!
signed.log <- function(x) {sign(x)*log(abs(x)+1)}
zlogcounts <- signed.log(zcounts) # Standardise first, then signed log transform
boxplot(zlogcounts, las=3, main="log-transformed z-scores")
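To illustrate what the signed log transformation does to standardised counts, here is a minimal sketch on made-up z-scores (the vector `z` and the helper `signed.exp` are hypothetical and not part of the analysis). It shows that the transformation keeps the sign and the ordering of the values while compressing extreme ones, and that it can be inverted.

```r
# Signed log transformation, as defined in the notebook
signed.log <- function(x) {
  sign(x) * log(abs(x) + 1)
}

# Hypothetical z-scores, including one extreme positive value
z <- c(-3, -0.5, 0, 0.5, 10)
z_log <- signed.log(z)
round(z_log, 3)

# Hypothetical inverse helper: recovers the original values
signed.exp <- function(y) {
  sign(y) * (exp(abs(y)) - 1)
}
all.equal(signed.exp(z_log), z)
```

Zero stays at zero, so the centring achieved by scale() is only disturbed by the compression of the tails.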
# With three TEC registers
#saveRDS(zlogcounts, here("FullMDA", "zlogcounts_3Reg.rds")) # Last saved 9 Feb 2022
# With five TEC registers
#saveRDS(zlogcounts, here("FullMDA", "zlogcounts.rds")) # Last saved 18 November
zlogcounts %>%
as.data.frame() %>%
# gather() (from tidyr) reshapes the variables into two columns: a key holding the original variable names and a value holding the data. This allows facet_wrap() from ggplot2 to draw one panel per variable.
gather() %>%
ggplot(aes(value)) +
theme_bw() +
facet_wrap(~ key, scales = "free", ncol = 4) +
scale_x_continuous(expand=c(0,0)) +
scale_y_continuous(limits = c(0,NA)) +
geom_histogram(aes(y = ..density..), bins = 30, colour= "black", fill = "grey") +
geom_density(colour = "darkred", weight = 2, fill="darkred", alpha = .4)
#ggsave(here("Plots", "DensityPlotsAllVariablesSignedLog.svg"), width = 15, height = 49)
# With five TEC registers
# zlogcounts <- readRDS(here('FullMDA', 'zlogcounts.rds'))
# nrow(zlogcounts)
# colnames(zlogcounts)
# ncounts <- readRDS(here('FullMDA', 'ncounts2.rds'))
# nrow(ncounts)
# colnames(ncounts)
# data <- cbind(ncounts[,1:7], as.data.frame(zlogcounts))
# str(data)
# saveRDS(data, here('FullMDA', 'datazlogcounts.rds')) # Last saved 18 November
# With three TEC registers
zlogcounts <- readRDS(here("FullMDA", "zlogcounts_3Reg.rds"))
nrow(zlogcounts)
## [1] 4980
colnames(zlogcounts)
## [1] "ACT" "AMP" "ASPECT" "AWL" "BEMA" "CAUSE" "CC" "COMM"
## [9] "CONC" "COND" "CONT" "CUZ" "DEMO" "DMA" "DOAUX" "DT"
## [17] "DWNT" "ELAB" "EMPH" "EX" "EXIST" "FPP1P" "FPP1S" "FPUH"
## [25] "GTO" "HDG" "HGOT" "IN" "JJAT" "JJPR" "LD" "MDCA"
## [33] "MDCO" "MDMM" "MDNE" "MDWO" "MDWS" "MENTAL" "NCOMP" "NN"
## [41] "OCCUR" "PASS" "PEAS" "PIT" "PLACE" "POLITE" "POS" "PROG"
## [49] "QUAN" "QUPR" "QUTAG" "RB" "RP" "SPLIT" "SPP2" "STPR"
## [57] "THATD" "THRC" "THSC" "TTR" "VBD" "VBG" "VBN" "VIMP"
## [65] "VPRT" "WHQU" "WHSC" "XX0" "YNQU" "TPP3" "FQTI"
ncounts <- readRDS(here("FullMDA", "ncounts3_3Reg.rds"))
nrow(ncounts)
## [1] 4980
colnames(ncounts)
## [1] "Filename" "Register" "Level" "Series" "Country" "Corpus"
## [7] "Subcorpus" "Words" "ACT" "AMP" "ASPECT" "AWL"
## [13] "BEMA" "CAUSE" "CC" "COMM" "CONC" "COND"
## [19] "CONT" "CUZ" "DEMO" "DMA" "DOAUX" "DT"
## [25] "DWNT" "ELAB" "EMPH" "EX" "EXIST" "FPP1P"
## [31] "FPP1S" "FPUH" "GTO" "HDG" "HGOT" "IN"
## [37] "JJAT" "JJPR" "LD" "MDCA" "MDCO" "MDMM"
## [43] "MDNE" "MDWO" "MDWS" "MENTAL" "NCOMP" "NN"
## [49] "OCCUR" "PASS" "PEAS" "PIT" "PLACE" "POLITE"
## [55] "POS" "PROG" "QUAN" "QUPR" "QUTAG" "RB"
## [61] "RP" "SPLIT" "SPP2" "STPR" "THATD" "THRC"
## [67] "THSC" "TTR" "VBD" "VBG" "VBN" "VIMP"
## [73] "VPRT" "WHQU" "WHSC" "XX0" "YNQU" "TPP3"
## [79] "FQTI"
data <- cbind(ncounts[, 1:8], as.data.frame(zlogcounts))
colnames(data)
## [1] "Filename" "Register" "Level" "Series" "Country" "Corpus"
## [7] "Subcorpus" "Words" "ACT" "AMP" "ASPECT" "AWL"
## [13] "BEMA" "CAUSE" "CC" "COMM" "CONC" "COND"
## [19] "CONT" "CUZ" "DEMO" "DMA" "DOAUX" "DT"
## [25] "DWNT" "ELAB" "EMPH" "EX" "EXIST" "FPP1P"
## [31] "FPP1S" "FPUH" "GTO" "HDG" "HGOT" "IN"
## [37] "JJAT" "JJPR" "LD" "MDCA" "MDCO" "MDMM"
## [43] "MDNE" "MDWO" "MDWS" "MENTAL" "NCOMP" "NN"
## [49] "OCCUR" "PASS" "PEAS" "PIT" "PLACE" "POLITE"
## [55] "POS" "PROG" "QUAN" "QUPR" "QUTAG" "RB"
## [61] "RP" "SPLIT" "SPP2" "STPR" "THATD" "THRC"
## [67] "THSC" "TTR" "VBD" "VBG" "VBN" "VIMP"
## [73] "VPRT" "WHQU" "WHSC" "XX0" "YNQU" "TPP3"
## [79] "FQTI"
# saveRDS(data, here('FullMDA', 'datazlogcounts_3Reg.rds')) # Last saved 9 Feb 2022
# With five TEC registers
# data <- readRDS(here('FullMDA', 'datazlogcounts.rds'))
# With three TEC registers
data <- readRDS(here("FullMDA", "datazlogcounts_3Reg.rds"))
summary(data$Corpus)
## Textbook.English Informative.Teens Spoken.BNC2014 Youth.Fiction
## 1202 1337 1250 1191
summary(data$Subcorpus)
## Textbook Conversation Textbook Fiction Info Teens Ref.
## 565 285 1337
## Textbook Informative Spoken BNC2014 Ref. Youth Fiction Ref.
## 352 1250 1191
# This rearranges the levels in the desired order for the plot legends:
data <- data %>% mutate(Subcorpus = fct_relevel(Subcorpus, "Info Teens Ref.", after = Inf)) # after = Inf moves the level to the end
# Function adapted to my needs from:
# https://towardsdatascience.com/how-to-create-a-correlation-matrix-with-too-many-variables-309cc0c0a57
colnames(data)
corr <- cor(data[9:ncol(data)])
# prepare to drop duplicates and correlations of 1
corr[lower.tri(corr, diag = TRUE)] <- NA
# drop perfect correlations
corr[corr == 1] <- NA
# turn into a 3-column table
corr <- as.data.frame(as.table(corr))
# remove the NA values from above
corr <- na.omit(corr)
# Uninteresting variable correlations?
lowcor <- subset(corr, abs(Freq) < 0.3)
# lowcor %>% filter(Var2=='CC'|Var1=='CC') %>% round(Freq, 2)
# select noteworthy correlations (|r| > 0.3)
corr <- subset(corr, abs(Freq) > 0.3)
# sort by highest correlation
corr <- corr[order(-abs(corr$Freq)), ]
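The steps up to this point (blanking the lower triangle, converting to a three-column table, dropping the NAs and sorting) can be tried out on a small scale. The following is a minimal sketch using the built-in mtcars data rather than the corpus data; the object names `toy_corr` and `toy_pairs` are made up for the illustration.

```r
# Toy illustration on built-in data (not the thesis data)
toy_corr <- cor(mtcars[, c("mpg", "disp", "hp", "wt")])

# Blank the lower triangle and the diagonal so that each pair appears only once
toy_corr[lower.tri(toy_corr, diag = TRUE)] <- NA

# Turn into a 3-column table (Var1, Var2, Freq) and drop the blanked cells
toy_pairs <- na.omit(as.data.frame(as.table(toy_corr)))

# Four variables yield choose(4, 2) unique pairs
nrow(toy_pairs)

# Sort by absolute correlation, strongest first
toy_pairs[order(-abs(toy_pairs$Freq)), ]
```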
# see which variables might be eliminated: the ones that have no correlation of |r| > 0.3 with any other variable
eliminate <- as.data.frame((summary(corr$Var1) + summary(corr$Var2)))
(LowcCommunality <- eliminate %>% filter(`(summary(corr$Var1) + summary(corr$Var2))` ==
0))
# Potentially problematic collinear variables that may need to be removed:
highcor <- subset(corr, abs(Freq) > 0.95)
highcor
# variables which are retained
corr$Var1 <- droplevels(corr$Var1)
corr$Var2 <- droplevels(corr$Var2)
features <- unique(c(levels(corr$Var1), levels(corr$Var2)))
features # 68 variables
# turn corr back into matrix in order to plot with corrplot
mtx_corr <- reshape2::acast(corr, Var1 ~ Var2, value.var = "Freq")
# plot correlations in a manageable way
library(corrplot)
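plot.margin = unit(c(0, 0, 0, 0), "mm") # Note: this only creates an object called plot.margin; it does not alter the margins used by corrplot()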
corrplot(mtx_corr, is.corr = FALSE, tl.col = "black", na.label = " ", tl.cex = 0.5)
# save as SVG with Rstudio e.g. 1000 x 1000
# Simple heatmap in base R (inspired by Stephanie Evert's SIGIL code)
cor.colours <- c(
hsv(h=2/3, v=1, s=(10:1)/10), # blue = negative correlation
rgb(1,1,1), # white = no correlation
hsv(h=0, v=1, s=(1:10/10))) # red = positive correlation
#png(here("Plots", "heatmapzlogcounts.png"), width = 30, height= 30, units = "cm", res = 300)
heatmap(cor(zlogcounts),
symm=TRUE,
zlim=c(-1,1),
col=cor.colours,
margins=c(7,7))
dev.off()
## null device
## 1
# Eliminate one of the two highly collinear variables (VPRT and VBD)
cor(data$VPRT, data$VBD)
## [1] -0.9731048
data <- data %>% select(-c(VPRT))
colnames(data)
## [1] "Filename" "Register" "Level" "Series" "Country" "Corpus"
## [7] "Subcorpus" "Words" "ACT" "AMP" "ASPECT" "AWL"
## [13] "BEMA" "CAUSE" "CC" "COMM" "CONC" "COND"
## [19] "CONT" "CUZ" "DEMO" "DMA" "DOAUX" "DT"
## [25] "DWNT" "ELAB" "EMPH" "EX" "EXIST" "FPP1P"
## [31] "FPP1S" "FPUH" "GTO" "HDG" "HGOT" "IN"
## [37] "JJAT" "JJPR" "LD" "MDCA" "MDCO" "MDMM"
## [43] "MDNE" "MDWO" "MDWS" "MENTAL" "NCOMP" "NN"
## [49] "OCCUR" "PASS" "PEAS" "PIT" "PLACE" "POLITE"
## [55] "POS" "PROG" "QUAN" "QUPR" "QUTAG" "RB"
## [61] "RP" "SPLIT" "SPP2" "STPR" "THATD" "THRC"
## [67] "THSC" "TTR" "VBD" "VBG" "VBN" "VIMP"
## [73] "WHQU" "WHSC" "XX0" "YNQU" "TPP3" "FQTI"
kmo <- KMO(data[, 9:ncol(data)])
kmo # Overall MSA = 0.95
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = data[, 9:ncol(data)])
## Overall MSA = 0.95
## MSA for each item =
## ACT AMP ASPECT AWL BEMA CAUSE CC COMM CONC COND CONT
## 0.89 0.67 0.97 0.96 0.88 0.96 0.95 0.69 0.93 0.95 0.96
## CUZ DEMO DMA DOAUX DT DWNT ELAB EMPH EX EXIST FPP1P
## 0.97 0.97 0.97 0.97 0.83 0.95 0.97 0.98 0.85 0.98 0.89
## FPP1S FPUH GTO HDG HGOT IN JJAT JJPR LD MDCA MDCO
## 0.91 0.97 0.99 0.97 0.99 0.97 0.84 0.76 0.87 0.89 0.85
## MDMM MDNE MDWO MDWS MENTAL NCOMP NN OCCUR PASS PEAS PIT
## 0.91 0.95 0.93 0.89 0.91 0.88 0.94 0.97 0.96 0.91 0.97
## PLACE POLITE POS PROG QUAN QUPR QUTAG RB RP SPLIT SPP2
## 0.82 0.96 0.70 0.95 0.98 0.96 0.98 0.95 0.85 0.83 0.95
## STPR THATD THRC THSC TTR VBD VBG VBN VIMP WHQU WHSC
## 0.99 0.98 0.94 0.86 0.98 0.91 0.96 0.98 0.84 0.96 0.96
## XX0 YNQU TPP3 FQTI
## 0.96 0.98 0.74 0.89
kmo$MSAi[order(kmo$MSAi)] # All features have individual MSAs of > 0.5 (but only because TPP3P was merged with TPP3S earlier on)
## AMP COMM POS TPP3 JJPR PLACE SPLIT DT
## 0.6659581 0.6860298 0.6995184 0.7373646 0.7594598 0.8209795 0.8305808 0.8325009
## JJAT VIMP MDCO RP EX THSC LD NCOMP
## 0.8384328 0.8435546 0.8511922 0.8513463 0.8547890 0.8590161 0.8692458 0.8757097
## BEMA MDWS FQTI FPP1P MDCA ACT MENTAL VBD
## 0.8761697 0.8865788 0.8888590 0.8928853 0.8930041 0.8932178 0.9067961 0.9103405
## FPP1S MDMM PEAS CONC MDWO THRC NN COND
## 0.9105424 0.9108706 0.9125553 0.9301851 0.9348437 0.9350359 0.9413801 0.9458270
## PROG CC SPP2 RB DWNT MDNE WHSC CONT
## 0.9461472 0.9476549 0.9483572 0.9495729 0.9496995 0.9514806 0.9565117 0.9569848
## QUPR XX0 CAUSE WHQU VBG AWL POLITE PASS
## 0.9576645 0.9578687 0.9579412 0.9599786 0.9617553 0.9623653 0.9643209 0.9645284
## PIT DOAUX ELAB ASPECT DMA DEMO HDG IN
## 0.9657592 0.9661947 0.9669056 0.9679603 0.9691391 0.9695140 0.9704817 0.9710499
## FPUH OCCUR CUZ EMPH YNQU QUAN TTR QUTAG
## 0.9712987 0.9716773 0.9728492 0.9758064 0.9758425 0.9765995 0.9776647 0.9782315
## THATD VBN EXIST STPR GTO HGOT
## 0.9790366 0.9801000 0.9804490 0.9854688 0.9881734 0.9889865
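For readers who wish to see what lies behind these MSA values, the following is a minimal base-R sketch of the Kaiser-Meyer-Olkin computation as I understand it: squared correlations are compared with squared partial correlations, the latter obtained from the scaled inverse of the correlation matrix. The function `kmo_sketch` is a hypothetical helper applied to the built-in mtcars data; the analysis itself relies on psych::KMO(), whose results this sketch is only assumed to approximate.

```r
# Minimal sketch of the KMO / MSA computation (toy data, hypothetical helper)
kmo_sketch <- function(x) {
  r <- cor(x)
  # Scaled inverse of the correlation matrix: its off-diagonal elements are
  # the partial correlations with reversed sign (the sign vanishes on squaring)
  q <- cov2cor(solve(r))
  diag(r) <- 0
  diag(q) <- 0
  list(
    overall = sum(r^2) / (sum(r^2) + sum(q^2)),
    per_item = colSums(r^2) / (colSums(r^2) + colSums(q^2))
  )
}

toy_kmo <- kmo_sketch(mtcars[, c("mpg", "disp", "hp", "wt", "qsec")])
toy_kmo$overall
sort(toy_kmo$per_item)
```

Values close to 1 indicate that the partial correlations are small relative to the raw correlations, which is why high MSAs are taken as a sign that the data are suitable for factoring.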
# png(here('Plots', 'screeplot-TEC-Ref_3Reg.png'), width = 20, height= 12, units = 'cm', res = 300)
scree(data[, 9:ncol(data)], factors = FALSE, pc = TRUE) # 6 components
dev.off()
## null device
## 1
# Perform PCA
pca1 <- psych::principal(data[9:ncol(data)], nfactors = 6)
pca1$loadings
##
## Loadings:
## RC1 RC3 RC2 RC4 RC5 RC6
## ACT -0.502 0.141 0.196 -0.309 0.215
## AMP 0.574 0.124
## ASPECT -0.395 0.211 -0.203
## AWL -0.767 0.501 -0.201 -0.160
## BEMA 0.356 -0.469 0.502
## CAUSE -0.455 0.154 -0.301 0.236
## CC -0.488 0.461 -0.182 -0.248 0.149
## COMM -0.242 -0.157 0.384 0.158 -0.349
## CONC 0.439 -0.100 -0.129
## COND 0.363 0.195 0.156 0.494
## CONT 0.864 -0.276 -0.120 0.221
## CUZ 0.620 0.341 -0.149
## DEMO 0.670 -0.170 0.128
## DMA 0.916 -0.113 -0.182
## DOAUX 0.736 -0.294 0.187 -0.148
## DT 0.399 0.518 0.423
## DWNT -0.182 0.129 0.369 0.141 0.105
## ELAB -0.338 0.385 -0.320 0.264 0.148
## EMPH 0.782 0.117
## EX 0.246 0.217 0.513
## EXIST -0.530 0.377 -0.142 -0.112
## FPP1P 0.223 -0.284 0.354 0.108
## FPP1S 0.627 -0.363 0.221 -0.153
## FPUH 0.877 -0.208
## GTO 0.641 0.218 -0.145
## HDG 0.575 0.161 0.127
## HGOT 0.743 -0.151 -0.126 -0.116
## IN -0.809 0.405 -0.175
## JJAT 0.553 0.280 0.212
## JJPR -0.141 0.198 -0.205 0.178 0.571
## LD -0.718 0.207 -0.452 -0.130 -0.110 -0.149
## MDCA 0.114 -0.495 0.462 0.154 0.124
## MDCO 0.534 0.111
## MDMM 0.432 0.341
## MDNE 0.238 0.449
## MDWO 0.369 0.116 0.393 0.122
## MDWS 0.161 0.545 -0.145
## MENTAL 0.449 0.261 0.247 -0.329
## NCOMP 0.347 -0.509 0.220 -0.185 0.102
## NN -0.851 0.225 -0.298 -0.250
## OCCUR -0.512 0.298 -0.224
## PASS -0.522 0.521 -0.209 -0.238
## PEAS -0.215 0.288 0.448
## PIT 0.726 0.114 0.223
## PLACE -0.351 0.120 0.103 0.467
## POLITE 0.265 -0.557 -0.179 0.241 0.172 -0.136
## POS -0.110 0.103 -0.235 -0.389
## PROG 0.331 -0.116 0.205 0.327 -0.117
## QUAN 0.772 0.134 0.153 0.112 0.151
## QUPR 0.307 0.382 0.278 0.167
## QUTAG 0.776 -0.168
## RB 0.609 0.454 0.189
## RP -0.101 0.478 0.247 -0.339 0.272
## SPLIT 0.541 0.128 0.114
## SPP2 0.599 -0.334 -0.182 0.412
## STPR 0.425 -0.162 0.107
## THATD 0.733 0.159 -0.159
## THRC 0.554 -0.262 -0.110 0.177
## THSC 0.577 0.124 0.147 -0.119
## TTR -0.793 0.204 0.183
## VBD -0.358 0.639 -0.440 -0.196
## VBG -0.589 0.465 0.108 -0.136
## VBN -0.565 0.489 -0.188 -0.184 -0.113
## VIMP -0.165 -0.394 -0.337 0.364 0.128 0.101
## WHQU 0.454 -0.518 -0.201 0.206 -0.116
## WHSC -0.273 0.575
## XX0 0.742 -0.135 0.240 -0.118
## YNQU 0.692 -0.434 -0.201 0.161
## TPP3 -0.211 -0.109 0.581 -0.341 -0.184 -0.193
## FQTI -0.354 0.124 0.298
##
## RC1 RC3 RC2 RC4 RC5 RC6
## SS loadings 17.839 6.068 4.731 3.283 2.099 1.788
## Proportion Var 0.255 0.087 0.068 0.047 0.030 0.026
## Cumulative Var 0.255 0.342 0.409 0.456 0.486 0.512
pca1$communality %>% sort(.) # If features with communalities of < 0.2 are removed, we remove TIME (therefore merged TIME and FREQ further up the line)
## DWNT STPR CONC FQTI POS ASPECT MDNE FPP1P
## 0.2167287 0.2259971 0.2284371 0.2299426 0.2373224 0.2464027 0.2706613 0.2805719
## PROG MDCO MDMM MDWO SPLIT MDWS PEAS QUPR
## 0.2893579 0.3154268 0.3170923 0.3233782 0.3278744 0.3446263 0.3481427 0.3489995
## AMP PLACE HDG COMM CAUSE EX THSC OCCUR
## 0.3544622 0.3710837 0.3774159 0.3793614 0.3800517 0.3805229 0.4009387 0.4017528
## WHSC THRC JJAT COND MENTAL ACT VIMP ELAB
## 0.4169886 0.4284237 0.4365607 0.4388680 0.4451165 0.4513069 0.4555282 0.4588368
## EXIST JJPR NCOMP RP GTO DEMO MDCA POLITE
## 0.4607173 0.4640015 0.4764765 0.4938285 0.4952373 0.5048824 0.5166493 0.5182614
## CUZ CC WHQU TPP3 VBG THATD PIT BEMA
## 0.5295899 0.5704833 0.5799242 0.5812901 0.5989025 0.6044126 0.6059166 0.6078998
## FPP1S DT HGOT RB VBN QUTAG EMPH PASS
## 0.6080200 0.6130366 0.6153162 0.6237505 0.6403441 0.6404917 0.6434773 0.6450342
## XX0 QUAN SPP2 DOAUX TTR YNQU VBD LD
## 0.6504597 0.6732223 0.6819055 0.6940095 0.7076993 0.7413764 0.7799251 0.8145733
## FPUH IN CONT DMA AWL NN
## 0.8255802 0.8567971 0.8869047 0.8909298 0.9087159 0.9292510
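A feature's communality is the sum of its squared loadings across the retained components, and rotation does not change it. The sketch below illustrates this with an unrotated PCA computed by hand on the built-in mtcars data; the number of components `k` and all object names are made up for the illustration. In the output above, the same logic explains why the SS loadings (which sum to about 35.8) divided by the 70 features entered into the PCA give the cumulative variance of 0.512.

```r
# Toy PCA on a correlation matrix (not the thesis data)
toy_r <- cor(mtcars[, c("mpg", "disp", "hp", "wt", "qsec")])
e <- eigen(toy_r)

k <- 2 # hypothetical number of retained components

# Unrotated loadings: eigenvectors scaled by the square root of the eigenvalues
toy_loadings <- e$vectors[, 1:k] %*% diag(sqrt(e$values[1:k]))
rownames(toy_loadings) <- rownames(toy_r)

# Communality of each variable: row sums of the squared loadings
toy_communality <- rowSums(toy_loadings^2)
sort(toy_communality)

# The communalities add up to the variance captured by the k components
all.equal(sum(toy_communality), sum(e$values[1:k]))
```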
# Final number of features (all columns minus the 8 metadata columns, Filename to Words)
ncol(data) - 8
## [1] 70
# Final number of texts
nrow(data)
## [1] 4980
# saveRDS(data, here('FullMDA', 'dataforPCA.rds')) # Last saved on 9 Feb 2022
# packages.bib <- sapply(1:length(loadedNamespaces()), function(i)
# toBibtex(citation(loadedNamespaces()[i])))
knitr::write_bib(c(.packages(), "knitr"), "packages.bib")
sessionInfo()
## R version 4.0.3 (2020-10-10)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur 10.16
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] tibble_3.1.6 tidyr_1.1.4
## [3] suffrager_0.1.0 psych_2.0.12
## [5] purrr_0.3.4 PerformanceAnalytics_2.0.4
## [7] xts_0.12.1 zoo_1.8-9
## [9] here_1.0.1 forcats_0.5.1
## [11] dplyr_1.0.7 DescTools_0.99.40
## [13] caret_6.0-86 ggplot2_3.3.5
## [15] lattice_0.20-41
##
## loaded via a namespace (and not attached):
## [1] nlme_3.1-152 lubridate_1.7.10 rprojroot_2.0.2
## [4] tools_4.0.3 bslib_0.3.1 utf8_1.2.2
## [7] R6_2.5.1 rpart_4.1-15 DBI_1.1.1
## [10] colorspace_2.0-2 nnet_7.3-15 withr_2.4.3
## [13] tidyselect_1.1.1 Exact_2.1 mnormt_2.0.2
## [16] compiler_4.0.3 cli_3.1.0 formatR_1.8
## [19] expm_0.999-6 labeling_0.4.2 sass_0.4.0
## [22] scales_1.1.1 mvtnorm_1.1-1 quadprog_1.5-8
## [25] stringr_1.4.0 digest_0.6.29 rmarkdown_2.11
## [28] pkgconfig_2.0.3 htmltools_0.5.2 fastmap_1.1.0
## [31] highr_0.9 rlang_0.4.12 rstudioapi_0.13
## [34] jquerylib_0.1.4 generics_0.1.1 farver_2.1.0
## [37] jsonlite_1.7.2 ModelMetrics_1.2.2.2 magrittr_2.0.1
## [40] Matrix_1.3-2 Rcpp_1.0.7 munsell_0.5.0
## [43] fansi_0.5.0 lifecycle_1.0.1 stringi_1.7.6
## [46] pROC_1.17.0.1 yaml_2.2.1 MASS_7.3-53.1
## [49] rootSolve_1.8.2.1 plyr_1.8.6 recipes_0.1.15
## [52] grid_4.0.3 parallel_4.0.3 crayon_1.4.2
## [55] lmom_2.8 splines_4.0.3 tmvnsim_1.0-2
## [58] knitr_1.37 pillar_1.6.4 boot_1.3-27
## [61] gld_2.6.2 reshape2_1.4.4 codetools_0.2-18
## [64] stats4_4.0.3 glue_1.6.0 evaluate_0.14
## [67] data.table_1.14.2 vctrs_0.3.8 foreach_1.5.1
## [70] gtable_0.3.0 assertthat_0.2.1 xfun_0.29
## [73] gower_0.2.2 prodlim_2019.11.13 e1071_1.7-4
## [76] class_7.3-18 survival_3.2-7 timeDate_3043.102
## [79] iterators_1.0.13 lava_1.6.9 ellipsis_0.3.2
## [82] ipred_0.9-11