This script is part of the Online Appendix to my PhD thesis.
Please cite as: Le Foll, Elen. 2022. Textbook English: A Corpus-Based Analysis of the Language of EFL textbooks used in Secondary Schools in France, Germany and Spain. PhD thesis. Osnabrück University.
For more information, see: https://elenlefoll.github.io/TextbookEnglish/
Please note that the plot dimensions in this notebook have been optimised for the print version of the thesis.
Built with R 4.0.3
knitr::opts_chunk$set(echo = TRUE, tidy = TRUE, message=FALSE, paged.print=TRUE, fig.width = 10, warning=FALSE)
library(here)
library(ggsignif)
library(ggrepel)
library(paletteer)
require(gtools)
library(RColorBrewer)
library(tidyverse)
library(wordcloud)
## Subcorpus raw_freq_MAKE wpm_freq_MAKE total_tokens
## 2 TxB_informative 713 2353.79 302916
## 3 TxB_instructional 2310 3901.63 592061
## 4 TxB_narrative 395 1555.80 253888
## 5 TxB_other 29 2014.73 14394
## 6 TxB_personal 115 1713.91 67098
## 7 TxB_poetry 82 3130.49 26194
## 8 TxB_spoken 699 1368.03 510954
## 9 TxB_words_phrases_sentences 1776 1942.93 914083
## raw_freq_not_MAKE raw_freq_verbs xlabels
## 2 302203 44160 Informative
## 3 589751 89342 Instructional
## 4 253493 42302 Narrative
## 5 14365 2068 Other
## 6 66983 11175 Personal
## 7 26112 4512 Poetry
## 8 510255 80106 Conversation
## 9 912307 138695 Exercises
Across the entire Textbook English Corpus (TEC), MAKE is the 13th most frequent verb lemma, preceded by lemmas such as DO, HAVE, CAN, GO, SAY, LOOK, USE, THINK, WRITE and READ.
Moreover, it is the 47th most frequent lemma across all parts-of-speech. Its relative frequency of occurrence across all textbook registers is 2247.66 pmw. That said, its distribution across the different textbook registers is strikingly uneven (see Fig. @ref(fig:registers)).
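The relative frequencies in the table above are normalised per million words. A minimal sketch of this normalisation from the raw counts and token totals printed in the table, recomputing two of the rows as a check (the data frame name is an assumption):

# Hedged sketch: relative frequency per million words from the counts printed above
register_freqs <- data.frame(
  Subcorpus = c("TxB_informative", "TxB_spoken"),   # two rows copied from the table above
  raw_freq_MAKE = c(713, 699),
  total_tokens = c(302916, 510954)
)
register_freqs$wpm_freq_MAKE <- register_freqs$raw_freq_MAKE / register_freqs$total_tokens * 10^6
round(register_freqs$wpm_freq_MAKE, 2)  # 2353.79 and 1368.03, matching the table above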
Frequency of MAKE across all Textbook registers
## null device
## 1
The following sections will home in on the two textbook registers with the lowest relative frequencies of MAKE, namely the narrative texts and the spoken language within the textbooks. The representations of MAKE in these two textbook registers are compared to those in the Youth Fiction Corpus and the Spoken BNC 2014 (Love et al. 2017) respectively.
Fig. @ref(fig:relfreqspokennar) reveals that the verb MAKE is more frequent in fiction than in conversation. In both textbook registers, however, the relative frequency of MAKE is lower than in the corresponding reference corpora. X2 tests (with Yates’ continuity correction) applied to the observed and expected raw frequencies show that this high-frequency verb is featured significantly less frequently in the textbooks than in the reference corpora (Conversation: X2 = 18.774, df = 1, p < 0.001; Fiction: X2 = 25.18, df = 1, p < 0.001).
Relative Frequencies of the verb lemma MAKE in Textbook and Reference subcorpora
## null device
## 1
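For transparency, the Conversation comparison reported above can be reproduced from the raw counts printed in the contingency table further down. A minimal sketch (the matrix values are copied from the freq_spoken output below):

# Hedged sketch: 2x2 test of MAKE vs. all other tokens in the two conversation (sub)corpora
freq_spoken_sketch <- matrix(c(18253, 11268256,
                               699, 510255),
                             nrow = 2, byrow = TRUE,
                             dimnames = list(c("Ref_BNC_Spoken2014", "TxB_spoken"),
                                             c("MAKE", "not_MAKE")))
chisq.test(freq_spoken_sketch)       # Yates' continuity correction is applied by default to 2x2 tables
vcd::assocstats(freq_spoken_sketch)  # effect size measures (Phi, Cramer's V)
fisher.test(freq_spoken_sketch)      # exact alternative, as reported below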
## Register Data raw_Freq pmw_Freq total_tokens
## Ref_Youth_Fiction_sampled Fiction Reference 20657 2007.87 10288031
## TxB_narrative Fiction Textbook 395 1555.80 253888
## Ref_BNC_Spoken2014 Conversation Reference 18253 1617.24 11286509
## TxB_spoken Conversation Textbook 699 1368.03 510954
## raw_freq_not_MAKE total_verbs total_verbs_not_MAKE
## Ref_Youth_Fiction_sampled 10267374 1792111 1771454
## TxB_narrative 253493 42302 41907
## Ref_BNC_Spoken2014 11268256 2487515 2469262
## TxB_spoken 510255 80106 79407
## total_BE total_verbs_not_BE total_verbs_not_BE_MAKE
## Ref_Youth_Fiction_sampled 361207 1430904 1410247
## TxB_narrative 10278 32024 31629
## Ref_BNC_Spoken2014 748800 1738715 1720462
## TxB_spoken 25706 54400 53701
## MAKE_10000verbs MAKE_10000verbs_BE
## Ref_Youth_Fiction_sampled 115.26630 144.3633
## TxB_narrative 93.37620 123.3450
## Ref_BNC_Spoken2014 73.37845 104.9798
## TxB_spoken 87.25938 128.4926
## raw_Freq total_verbs_not_MAKE
## Ref_Youth_Fiction_sampled 20657 1771454
## TxB_narrative 395 41907
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: freq_nar
## X-squared = 17.263, df = 1, p-value = 3.254e-05
## X^2 df P(> X^2)
## Likelihood Ratio 18.612 1 1.6019e-05
## Pearson 17.456 1 2.9406e-05
##
## Phi-Coefficient : 0.003
## Contingency Coeff.: 0.003
## Cramer's V : 0.003
## raw_Freq raw_freq_not_MAKE
## Ref_BNC_Spoken2014 18253 11268256
## TxB_spoken 699 510255
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: freq_spoken
## X-squared = 18.774, df = 1, p-value = 1.472e-05
## X^2 df P(> X^2)
## Likelihood Ratio 19.894 1 8.1848e-06
## Pearson 18.929 1 1.3570e-05
##
## Phi-Coefficient : 0.001
## Contingency Coeff.: 0.001
## Cramer's V : 0.001
##
## Fisher's Exact Test for Count Data
##
## data: freq_spoken
## p-value = 9.4e-06
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 1.096321 1.277161
## sample estimates:
## odds ratio
## 1.182465
High-frequency verbs such as MAKE are not only highly polysemous, they also enter into many different lexico-grammatical constructions. This section explores the quantitative differences in the proportional use of these different meanings and constructions in two major Textbook English registers as compared to similar registers in naturally-occurring ENL speech and writing.
# Semantics of MAKE in Textbook English #
semantics <- read.csv(here("Make_semantics.csv"), sep = ",", header = TRUE, stringsAsFactors = TRUE)
semantics[9, 5] = 8 # Correction to dataset (one concordance line annotated as a PV was in fact MAKE as a noun and has therefore been deleted from the PV counts)
head(semantics)
## Category RefYouthFiction_raw TextbookNarrative_raw SpokenBNC2014_raw
## 1 Ai: produce 87 94 181
## 2 Aii: delexical 106 81 145
## 3 B: earn 5 14 24
## 4 C: achieve 13 26 20
## 5 D: be/become 4 2 6
## 6 E: cause 118 125 215
## TextbookConversation_raw
## 1 194
## 2 168
## 3 24
## 4 23
## 5 5
## 6 172
# Chi-squared tests #
# The chisq.test() function wants just a matrix of counts. So we need to select
# the columns that contain the relevant counts.
cols = c("SpokenBNC2014_raw", "TextbookConversation_raw")
semantics_test = chisq.test(semantics[, cols])
print(semantics_test)
##
## Pearson's Chi-squared test
##
## data: semantics[, cols]
## X-squared = 24.81, df = 8, p-value = 0.001674
# The test asks simply whether the distributions over categories are the same for
# the two corpora. And a low enough p-value is reason to doubt this hypothesis
# (loosely speaking).
# To test just one of the categories with the Chi-squared test we need to
# collapse the others. For example if we want to test the delexical category we
# need a category 'other' for all the rest. We can achieve this by binding the
# relevant row to the column sums of the other rows. Then we can test this 2x2
# matrix of counts. Since we have a 2x2 matrix we can use Fisher's exact test
# instead. We can do this for every category in a loop, and store the resulting
# p-values from the tests, along with a star symbol for the low p-values.
results = data.frame(Category = semantics$Category, p = 1)
for (i in 1:nrow(results)) {
category = semantics$Category[i]
row = semantics[semantics$Category == category, cols]
other_rows = semantics[semantics$Category != category, cols]
collapsed = rbind(row, colSums(other_rows))
category_test = fisher.test(collapsed)
results$p[i] = category_test$p.value
results$OddsRatio[i] = category_test$estimate
results$conf.int.lower[i] = category_test$conf.int[1]
results$conf.int.upper[i] = category_test$conf.int[2]
}
results
## Category p OddsRatio conf.int.lower conf.int.upper
## 1 Ai: produce 0.429831680 0.9065618 0.7089985 1.1588489
## 2 Aii: delexical 0.138265294 0.8240593 0.6340915 1.0700880
## 3 B: earn 1.000000000 0.9984627 0.5368422 1.8569966
## 4 C: achieve 0.646116946 0.8643411 0.4455196 1.6637706
## 5 D: be/become 1.000000000 1.1998363 0.3034438 4.9952869
## 6 E: cause 0.011400491 1.3640483 1.0688700 1.7424388
## 7 F: ensure 0.898208363 1.0661577 0.6246510 1.8237921
## 8 I: idiomatic 0.003604951 0.4810266 0.2781997 0.8120519
## 9 P: phrasal verbs 0.002600575 3.3327110 1.4518204 8.5812583
# The 'BH' or 'BY' methods are probably the most appropriate here if p-value
# correction is desired (controversial topic).
results$p = p.adjust(results$p, method = "none")
results$Star <- stars.pval(results$p)
# Check the outcome.
results
## Category p OddsRatio conf.int.lower conf.int.upper Star
## 1 Ai: produce 0.429831680 0.9065618 0.7089985 1.1588489
## 2 Aii: delexical 0.138265294 0.8240593 0.6340915 1.0700880
## 3 B: earn 1.000000000 0.9984627 0.5368422 1.8569966
## 4 C: achieve 0.646116946 0.8643411 0.4455196 1.6637706
## 5 D: be/become 1.000000000 1.1998363 0.3034438 4.9952869
## 6 E: cause 0.011400491 1.3640483 1.0688700 1.7424388 *
## 7 F: ensure 0.898208363 1.0661577 0.6246510 1.8237921
## 8 I: idiomatic 0.003604951 0.4810266 0.2781997 0.8120519 **
## 9 P: phrasal verbs 0.002600575 3.3327110 1.4518204 8.5812583 **
### Compute percentages for distributions of meanings ###
semantics = semantics %>% mutate(RefYouthFiction = (RefYouthFiction_raw/(sum(RefYouthFiction_raw))) *
100) %>% mutate(TextbookNarrative = (TextbookNarrative_raw/(sum(TextbookNarrative_raw))) *
100) %>% mutate(SpokenBNC2014 = (SpokenBNC2014_raw/(sum(SpokenBNC2014_raw))) *
100) %>% mutate(TextbookConversation = (TextbookConversation_raw/(sum(TextbookConversation_raw)) *
100))
Figure @ref(fig:semantics-spoken) visualises the different proportions of meanings of MAKE in Textbook Conversation and in a sample of the Spoken BNC 2014. Statistical differences between the proportions of each category were tested using Fisher’s Exact Test for Count Data; the resulting p-values are plotted in Figures @ref(fig:semantics-spoken) and @ref(fig:semantics-nar) using the following convention: 0 - 0.001 = ***, 0.001 - 0.01 = **, 0.01 - 0.05 = *, 0.05 - 0.1 = ˘.
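For reference, this star convention corresponds to the output of gtools::stars.pval(), which uses "." for the 0.05 - 0.1 band; the dot is replaced by the higher symbol "˘" for the plots. A minimal illustration (the p-values below are made up purely for demonstration purposes):

# gtools was loaded above via require(gtools)
p_demo <- c(0.0005, 0.005, 0.03, 0.07, 0.5)  # hypothetical p-values
s <- stars.pval(p_demo)                      # "***" "**"  "*"   "."   " "
replace(s, s == ".", "˘")                    # same replacement as applied for the figures below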
## Start with TxB spoken vs. BNCspoken ###
spoken_sem = semantics %>% select(Category, SpokenBNC2014, TextbookConversation)
# Lollipop graph:
# https://www.r-graph-gallery.com/303-lollipop-plot-with-2-values/
beige <- col2rgb("#FC8D59", alpha = TRUE)
bordeau <- col2rgb("#990000", alpha = TRUE)
# svg(filename=here('MAKE_semantics_Spoken.svg', width=12, height=6,
# pointsize=12)
fig = ggplot(spoken_sem) + geom_segment(aes(x = fct_rev(Category), xend = Category,
y = SpokenBNC2014, yend = TextbookConversation), color = "black") + geom_point(aes(x = Category,
y = SpokenBNC2014), color = rgb(252, 141, 89, max = 255, alpha = 220), size = 4) +
geom_point(aes(x = Category, y = TextbookConversation), color = rgb(153, 0, 0,
max = 255, alpha = 180), size = 4) + coord_flip() + theme_light() + theme(panel.border = element_blank(),
axis.text = element_text(size = 15, colour = "black"), axis.title.x = element_text(size = 15,
face = "bold")) + xlab("") + ylab("% of occurrences of MAKE") + theme(plot.margin = margin(20,
10, 10, 10))
# + labs(title = 'Distribution of MAKE meanings in Textbook Conversation and the
# Spoken BNC 2014')
### Adding legend directly onto plot ###
### https://stackoverflow.com/questions/40011005/adding-a-legend-to-a-dumbbell-chart-in-r
### #
fig1 = fig + geom_text(data = data.frame(), aes(x = "Ai: produce", y = 29, label = "Textbook Conversation"),
color = rgb(153, 0, 0, max = 255, alpha = 255), hjust = 1, size = 5, fontface = "bold",
nudge_y = 11, nudge_x = 0.2) + geom_text(data = data.frame(), aes(x = 9, y = 26,
label = "Spoken BNC 2014 sample"), color = rgb(252, 141, 89, max = 255, alpha = 255),
hjust = 0, size = 5, fontface = "bold", nudge_y = -11, nudge_x = 0.2)
# With the addition of an = assignment the plot above is now stored and can be
# added to. We have some relevant data in our 'results' dataframe. The Star
# column says what to display. And the Category column says for what category.
# But ggplot also needs to know how far along the y axis to put each star. For
# this, we need a y column. To place the stars half way along the segment, we
# need the mean of the two point values. We can get this from the plot data
# frame created above by calculating the row means for the two relevant columns.
# This is slightly complicated by the fact that the column names no longer have
# '_raw' appended.
columns = c("SpokenBNC2014", "TextbookConversation")
results$y = rowMeans(spoken_sem[, columns])
# Now we can add a text annotation layer to the plot. We map the Star column to
# the text itself, we map Category to the x axis as in the original plot, and we
# map the halfway points calculated above to the y axis. We specify that the
# data for this layer are coming from another data frame. And we adjust the size
# of the text.
fig2 = fig1 + geom_text(aes(x = Category, y = y, label = Star), data = results, size = 8)
print(fig2)
Distribution of MAKE meanings in Textbook Conversation and the Spoken BNC 2014
dev.off()
## null device
## 1
# ggsave(here('semantics_spoken.png'), width = 24, height = 12, units = 'cm', dpi
# = 300)
Causative uses of MAKE represent the most frequent semantic category in the Spoken BNC 2014 sample. The proportion of causative uses in Textbook Conversation, however, is significantly lower. The other two most frequent categories, the produce sense and delexical uses, are similarly distributed across the two conversation corpora. The significantly higher proportion of idioms featuring MAKE in Textbook Conversation is primarily due to the very high frequency of the idiom to make friends (see Table @ref(tab:idioms_spoken)).
idioms_spoken <- read.csv(here("MAKE_idioms_freq_spoken.csv"), sep = ",", header = TRUE)
stargazer::stargazer(idioms_spoken, type = "html", title = "Idioms with MAKE in Textbook Conversation and a sample of the Spoken BNC 2014")
##
## <table style="text-align:center"><caption><strong>Idioms with MAKE in Textbook Conversation and a sample of the Spoken BNC 2014</strong></caption>
## <tr><td colspan="8" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left">Statistic</td><td>N</td><td>Mean</td><td>St. Dev.</td><td>Min</td><td>Pctl(25)</td><td>Pctl(75)</td><td>Max</td></tr>
## <tr><td colspan="8" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left">Raw.Freq..in.Textbook.Conversation</td><td>6</td><td>7.333</td><td>8.687</td><td>3</td><td>3.2</td><td>4.8</td><td>25</td></tr>
## <tr><td style="text-align:left">Raw.Freq..in.Spoken.BNC2014.Sample</td><td>6</td><td>2.500</td><td>1.871</td><td>0</td><td>1.2</td><td>3.8</td><td>5</td></tr>
## <tr><td colspan="8" style="border-bottom: 1px solid black"></td></tr></table>
Idiomatic phrase | Raw Freq. in Textbook Conversation | Raw Freq. in Spoken BNC2014 Sample |
---|---|---|
MAKE friends | 25 | 5 |
MAKE the best/most of sth. | 4 | 2 |
MAKE faces | 4 | 4 |
MAKE fun of sth./sb. | 3 | 1 |
Practice makes perfect | 3 | 0 |
MAKE up .* mind | 5 | 3 |
Both Textbook Conversation and Textbook Fiction feature significantly fewer phrasal verbs with MAKE than their corresponding ENL reference corpora (cf. Fig. @ref(fig:semantics-spoken) and @ref(fig:semantics-nar)). This pedagogically relevant finding will be explored further in Section *. Furthermore, Textbook Fiction also features considerably fewer delexical uses of MAKE. This finding will be elaborated on in Section *.
#### Textbook Narrative vs. Youth Fiction ###
columns = c("RefYouthFiction_raw", "TextbookNarrative_raw")
semantics_nar_test = chisq.test(semantics[, columns])
print(semantics_nar_test)
##
## Pearson's Chi-squared test
##
## data: semantics[, columns]
## X-squared = 20.14, df = 8, p-value = 0.009818
results = data.frame(Category = semantics$Category, p = 1)
for (i in 1:nrow(results)) {
category = semantics$Category[i]
row = semantics[semantics$Category == category, columns]
other_rows = semantics[semantics$Category != category, columns]
collapsed = rbind(row, colSums(other_rows))
category_test = fisher.test(collapsed)
results$p[i] = category_test$p.value
results$OddsRatio[i] = category_test$estimate
results$conf.int.lower[i] = category_test$conf.int[1]
results$conf.int.upper[i] = category_test$conf.int[2]
}
results
## Category p OddsRatio conf.int.lower conf.int.upper
## 1 Ai: produce 0.61116111 0.9044069 0.63930842 1.2784425
## 2 Aii: delexical 0.04412899 1.4223754 1.00890050 2.0104157
## 3 B: earn 0.06040835 0.3492676 0.09748602 1.0389145
## 4 C: achieve 0.04742970 0.4832880 0.22436660 0.9926373
## 5 D: be/become 0.68629807 2.0085743 0.28598225 22.3251938
## 6 E: cause 0.64316182 0.9199817 0.67166232 1.2596311
## 7 F: ensure 0.44968236 0.6969169 0.29926374 1.5726868
## 8 I: idiomatic 0.86818253 0.8953140 0.43878014 1.8145862
## 9 P: phrasal verbs 0.01640735 2.3266949 1.15064154 4.9560309
results$p = p.adjust(results$p, method = "none")
# require(gtools)
results$Star <- stars.pval(results$p)
results = results %>% mutate(Star = replace(Star, Star == ".", "˘")) # Change dot to a higher symbol or: º˘
# Check the outcome.
results
## Category p OddsRatio conf.int.lower conf.int.upper Star
## 1 Ai: produce 0.61116111 0.9044069 0.63930842 1.2784425
## 2 Aii: delexical 0.04412899 1.4223754 1.00890050 2.0104157 *
## 3 B: earn 0.06040835 0.3492676 0.09748602 1.0389145 ˘
## 4 C: achieve 0.04742970 0.4832880 0.22436660 0.9926373 *
## 5 D: be/become 0.68629807 2.0085743 0.28598225 22.3251938
## 6 E: cause 0.64316182 0.9199817 0.67166232 1.2596311
## 7 F: ensure 0.44968236 0.6969169 0.29926374 1.5726868
## 8 I: idiomatic 0.86818253 0.8953140 0.43878014 1.8145862
## 9 P: phrasal verbs 0.01640735 2.3266949 1.15064154 4.9560309 *
## Textbook narrative vs. Youth Fiction ###
nar_sem = semantics %>% select(Category, RefYouthFiction, TextbookNarrative)
# svg(filename=here('MAKE_semantics_Fiction.svg'), width=12, height=6,
# pointsize=12)
fig = ggplot(nar_sem) + geom_segment(aes(x = fct_rev(Category), xend = Category,
y = RefYouthFiction, yend = TextbookNarrative), color = "darkgrey") + geom_point(aes(x = Category,
y = RefYouthFiction), color = rgb(252, 141, 89, max = 255, alpha = 220), size = 4) +
geom_point(aes(x = Category, y = TextbookNarrative), color = rgb(153, 0, 0, max = 255,
alpha = 180), size = 4) + coord_flip() + theme_light() + theme(panel.border = element_blank(),
axis.text = element_text(size = 14, colour = "black"), axis.title.x = element_text(size = 14,
face = "bold")) + xlab("") + ylab("% of occurrences of MAKE") + theme(plot.margin = margin(10,
10, 10, 10))
# + labs(title = 'Distribution of MAKE meanings in Textbook Fiction and Reference
# Youth Fiction')
fig1 = fig + geom_text(data = data.frame(), aes(x = "Ai: produce", y = 26, label = "Textbook Fiction"),
color = rgb(153, 0, 0, max = 255, alpha = 255), hjust = 1, size = 5, fontface = "bold",
nudge_y = 6, nudge_x = 0.2) + geom_text(data = data.frame(), aes(x = 9, y = 20,
label = "Youth Fiction Corpus sample"), color = rgb(252, 141, 89, max = 255,
alpha = 220), hjust = 0, size = 5, fontface = "bold", nudge_y = -10, nudge_x = 0.2)
cols = c("RefYouthFiction", "TextbookNarrative")
results$y = rowMeans(nar_sem[, cols])
fig2 = fig1 + geom_text(aes(x = Category, y = y, label = Star), data = results, size = 8)
print(fig2)
Distribution of MAKE meanings in Textbook Fiction and Reference Youth Fiction
dev.off()
## null device
## 1
# ggsave(here('semantics_nar.png', width = 24, height = 12, units = 'cm', dpi =
# 300)
Quantitatively, the semantic category “C: achieve” is proportionally more frequent in Textbook Fiction than in the Youth Fiction Corpus sample. This is primarily because the category is largely dominated by the phrase to make it, which is featured 17 times in Textbook Fiction, as opposed to three times in the Youth Fiction sample. By contrast, a qualitative analysis of the corresponding concordance lines reveals that, in the Youth Fiction sample, the category covers a much broader range of phrases involving movement and/or a sense of achievement. These typical, yet syntactically relatively complex, phrases of narrative writing, usually realised with one or more prepositions (see the examples below), are conspicuously absent from Textbook Fiction.
(###) Then he made his way down the stairs and into the locker room. <Youth Fiction, file 183: Pratchett, Night Watch, 2002>
(###) […] Jack came out of the barn and made straight for me. <Youth Fiction, file 191: Delaney, The Spook’s Apprentice, 2004>.
(###) She and the bird started to make off towards my ship. <Youth Fiction, file 140: Adams, Mostly Harmless, 1992>
##
## Pearson's Chi-squared test
##
## data: spoken_sem[, cols]
## X-squared = 31.847, df = 10, p-value = 0.0004247
## Semantics p OddsRatio conf.int.lower
## 1 Sports, Entertainment & Travel 0.0001106101 0.2895076 0.13684543
## 2 Food 0.0392139156 1.5859533 1.01216142
## 3 Drink 0.0118899021 3.6564387 1.24501982
## 4 Clothing & Accessories 0.0151945924 0.2966682 0.08373583
## 5 Arts & Crafts 0.2216724760 1.7825954 0.73661879
## 6 World, Life & Body 0.3967783920 0.6736406 0.25017318
## 7 Industry & Technology 0.4591638554 1.5611497 0.52296327
## 8 General & abstract terms 0.6031684435 0.7056359 0.20223584
## 9 Materials & Equipment 0.7366388507 1.1443201 0.55721579
## 10 Language & Communication 1.0000000000 0.8907527 0.21104313
## 11 Buildings & House 1.0000000000 1.0749109 0.31465306
## conf.int.upper
## 1 0.5765733
## 2 2.4938716
## 3 13.0493546
## 4 0.8615393
## 5 4.5261472
## 6 1.7217635
## 7 4.9478222
## 8 2.2731796
## 9 2.3573155
## 10 3.5727125
## 11 3.6722592
## null device
## 1
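The chi-squared test and the per-category results printed above compare the USAS semantic fields of the collocates of MAKE in its produce/create sense (Ai) in Textbook Conversation and the Spoken BNC 2014 sample. The code that generated this output is not echoed here; a minimal sketch of the presumed procedure, mirroring the collapse-and-test loop used for the meaning categories above (the data frame spoken_sem and its raw-count column names are assumptions):

# Hedged sketch: overall and per-category tests on raw collocate counts per USAS field
cols <- c("SpokenBNC2014_raw", "TextbookConversation_raw")  # hypothetical column names
chisq.test(spoken_sem[, cols])  # overall comparison of the two distributions
results <- data.frame(Semantics = spoken_sem$Semantics, p = 1)
for (i in 1:nrow(results)) {
  category <- spoken_sem$Semantics[i]
  row <- spoken_sem[spoken_sem$Semantics == category, cols]
  other_rows <- spoken_sem[spoken_sem$Semantics != category, cols]
  collapsed <- rbind(row, colSums(other_rows))  # 2x2 table: this semantic field vs. all others
  category_test <- fisher.test(collapsed)
  results$p[i] <- category_test$p.value
  results$OddsRatio[i] <- category_test$estimate
  results$conf.int.lower[i] <- category_test$conf.int[1]
  results$conf.int.upper[i] <- category_test$conf.int[2]
}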
#### Chatterplot: inspired by https://towardsdatascience.com/rip-wordclouds-long-live-chatterplots-e76a76896098
produce <- read.csv(file = here("Produce.csv"), sep = "\t", stringsAsFactors = TRUE)
summary(produce$USAS)
## Arts & Crafts Buildings & House
## 36 31
## Clothing & Accessories Drink
## 32 36
## Food General & abstract terms
## 160 29
## Industry & Technology Language & Communication
## 20 26
## Materials & Equipment Sports, Entertainment & Travel
## 64 74
## World, Life & Body
## 44
str(produce)
## 'data.frame': 552 obs. of 10 variables:
## $ Subcorpus : Factor w/ 4 levels "BNCspoken","spoken",..: 1 4 1 1 1 1 1 1 1 2 ...
## $ File_level : Factor w/ 230 levels "A","B","C","D",..: 141 48 77 172 158 228 125 150 134 2 ...
## $ File_series : Factor w/ 10 levels "","Access","Achievers",..: 1 1 1 1 1 1 1 1 1 7 ...
## $ VDE : Factor w/ 1 level "produce": 1 1 1 1 1 1 1 1 1 1 ...
## $ Collocate : Factor w/ 341 levels "[food]","album",..: 6 15 25 48 49 50 50 73 74 91 ...
## $ Collocate_lemma: Factor w/ 314 levels "[food]","album",..: 6 14 24 45 46 46 46 70 71 87 ...
## $ USAS : Factor w/ 11 levels "Arts & Crafts",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Concordance1 : logi NA NA NA NA NA NA ...
## $ Concordance2 : Factor w/ 6 levels "made","Made",..: 6 3 1 1 1 6 1 3 6 3 ...
## $ Concordance3 : logi NA NA NA NA NA NA ...
spoken_produce <- produce %>% filter(Subcorpus=="BNCspoken" | Subcorpus =="spoken")
spoken_produce$Subcorpus <- droplevels(spoken_produce$Subcorpus)
table_spoken_produce <- table(spoken_produce$Collocate_lemma, spoken_produce$Subcorpus)
head(table_spoken_produce)
##
## BNCspoken spoken
## [food] 6 0
## album 1 0
## alcohol 0 0
## anything 1 1
## army 1 0
## art 1 0
comp <- round(prop.table(table(spoken_produce$Collocate_lemma, spoken_produce$Subcorpus), 2), 4)*100
comp <- as.data.frame(unclass(comp))
comp$Collocate_lemma <- row.names(comp)
head(comp)
## BNCspoken spoken Collocate_lemma
## [food] 3.33 0.00 [food]
## album 0.56 0.00 album
## alcohol 0.00 0.00 alcohol
## anything 0.56 0.52 anything
## army 0.56 0.00 army
## art 0.56 0.00 art
comp <- inner_join(comp, spoken_produce[,6:7], by = "Collocate_lemma")
comp <- distinct(comp)
head(comp)
## BNCspoken spoken Collocate_lemma USAS
## 1 3.33 0.00 [food] Food
## 2 0.56 0.00 album Sports, Entertainment & Travel
## 3 0.56 0.52 anything Materials & Equipment
## 4 0.56 0.00 army Industry & Technology
## 5 0.56 0.00 art Arts & Crafts
## 6 0.56 0.00 ash Materials & Equipment
suffrager <- c(palettes_d$suffrager$classic[1], palettes_d$suffrager$CarolMan[3:4], palettes_d$suffrager$london, palettes_d$suffrager$oxon[c(1,5)])
levels(as.factor(comp$spoken))
## [1] "0" "0.52" "1.04" "1.55" "2.07" "2.59" "4.15" "4.66" "5.18"
## [10] "10.36"
## With minimal threshold of just min 2 occurrences
comp %>% filter(spoken > 0.57 | BNCspoken > 0.57) %>%
ggplot(aes(spoken, BNCspoken, label = Collocate_lemma)) +
# ggrepel geom, make arrows transparent, color by rank, size by n
geom_point(aes(colour=USAS)) +
geom_text_repel(min.segment.length = 0.3, segment.alpha = 0.2, force = 0.4,
aes(colour=USAS), show_guide = F) +
scale_colour_manual(values = suffrager,
name = "Semantic field") +
# set color gradient & customize legend
geom_abline(color = "gray40", lty = 2) +
# set word size range & turn off legend
labs(y = "% of MAKE 'produce' collocations in Spoken BNC2014 sample", x = "% of 'produce' MAKEs in Textbook Conversation") +
scale_y_log10(breaks=c(1,2,3,4)) +
scale_x_log10(breaks=c(1,2,4,6,8,10,12)) +
# minimal theme & customizations
theme_bw() +
theme(legend.position=c(0.86,0.35),
#legend.justification = c("right","top"),
panel.grid.major = element_line(colour = "whitesmoke"),
panel.grid.minor=element_blank(),
legend.background = element_rect(colour = 'darkgrey', fill = 'white', linetype='solid'))
#ggsave(here("Produce_MAKEs_Spoken_chatterplot.svg"), dpi = 300, width =22, height = 15, units = "cm")
## With higher threshold of min. 3 occurrences
comp %>% filter(spoken > 1.04 | BNCspoken > 1.11) %>%
ggplot(aes(spoken, BNCspoken, label = Collocate_lemma)) +
# ggrepel geom, make arrows transparent, color by rank, size by n
geom_point(aes(colour=USAS)) +
geom_text_repel(min.segment.length = 0.3, segment.alpha = 0.2, force = 0.4,
aes(colour=USAS), show_guide = F) +
scale_colour_manual(values = suffrager,
name = "Semantic field") +
# set color gradient & customize legend
geom_abline(color = "gray40", lty = 2) +
# set word size range & turn off legend
labs(y = "% of MAKE 'produce' collocations in Spoken BNC2014 sample", x = "% of 'produce' MAKEs in Textbook Conversation") +
scale_y_log10(breaks=c(1,2,3,4)) +
scale_x_log10(breaks=c(1,2,4,6,8,10,12)) +
# minimal theme & customizations
theme_bw() +
theme(legend.position=c(0.86,0.30),
#legend.justification = c("right","top"),
panel.grid.major = element_line(colour = "whitesmoke"),
panel.grid.minor=element_blank(),
legend.background = element_rect(colour = 'darkgrey', fill = 'white', linetype='solid'))
#ggsave(here("Produce_MAKEs_Spoken_chatterplot_min3.svg"), dpi = 300, width =23, height = 15, units = "cm")
### Wordclouds for collocates ###
produce <- read.csv(file = here("Produce.csv"), sep = "\t", stringsAsFactors = TRUE)
summary(produce$USAS)
## Arts & Crafts Buildings & House
## 36 31
## Clothing & Accessories Drink
## 32 36
## Food General & abstract terms
## 160 29
## Industry & Technology Language & Communication
## 20 26
## Materials & Equipment Sports, Entertainment & Travel
## 64 74
## World, Life & Body
## 44
### Now I would like one colour for each USAS category in the df 'produce' ##
### Copying this
### https://stackoverflow.com/questions/18902485/colored-categories-in-r-wordclouds
# The problem is that the wordcloud() function first summarises the words, so that
# each word only appears once and is matched to a frequency. The result is that a
# colours vector taken from the original data set is too long.
# One way to solve this is to summarise the data frame ourselves first. With dplyr
# we can use group_by() to group the data frame by combinations of factor
# variables. Collocate_lemma and USAS are the relevant ones. We can then use
# summarize() to count the frequencies ourselves. The n() function counts
# frequencies (where n stands for 'number' of entries).
produce_freq <- produce %>% group_by(Collocate_lemma, USAS) %>% summarize(Frequency = n())
# Now we can apply the colour scheme:
colours1 <- RColorBrewer::brewer.pal(n = 8, name = "Set1")
colours2 <- RColorBrewer::brewer.pal(n = 10, name = "RdGy")
colours <- c(colours1, colours2[c(1, 9, 10)]) # We need a total of 12 colours for 12 categories
### Now a wordcloud for each subcorpus ###
produceBNC <- produce[produce$Subcorpus == "BNCspoken", ]
produceTxBspoken <- produce[produce$Subcorpus == "spoken", ]
produce_freq_BNC <- produceBNC %>% group_by(Collocate_lemma, USAS) %>% summarize(Frequency = n())
produce_freq_TxBspoken <- produceTxBspoken %>% group_by(Collocate_lemma, USAS) %>%
summarize(Frequency = n())
anchor = 10
# png(filename = here('produce_BNC_categories.png'), width = 1300, height = 1100,
# res = 300)
# svg(filename=here('produce_BNC_collocates.svg'), width=9, height=6,
# pointsize=11)
par(mar = rep(0, 4))
produce_freq_BNC$colors <- colours[match(produce_freq_BNC$USAS, levels(produce_freq_BNC$USAS))]
wordcloud(produce_freq_BNC$Collocate_lemma, produce_freq_BNC$Frequency, min.freq = 2,
rot.per = 0, colors = produce_freq_BNC$colors, ordered.colors = TRUE, random.order = FALSE,
scale = c(10 * max(table(produce_freq_BNC$Collocate_lemma))/anchor, 0.5))
Main collocates of MAKE in the produce/create sense (Ai) in the Spoken BNC2014 sample
dev.off()
## null device
## 1
# png(filename = here('produce_TxBspoken_categories.png'), width = 1300, height =
# 1100, res = 300) svg(filename=here('produce_TxBSpoken_collocates.svg'),
# width=9, height=6, pointsize=11)
par(mar = rep(0, 4))
produce_freq_TxBspoken$colors <- colours[match(produce_freq_TxBspoken$USAS, levels(produce_freq_TxBspoken$USAS))]
wordcloud(produce_freq_TxBspoken$Collocate_lemma, produce_freq_TxBspoken$Frequency,
min.freq = 2, rot.per = 0, colors = produce_freq_TxBspoken$colors, ordered.colors = TRUE,
random.order = FALSE, scale = c(40 * max(table(produce_freq_BNC$Collocate_lemma))/anchor,
0.5))
Main collocates of MAKE in the produce/create sense (Ai) in Textbook Conversation
dev.off()
## null device
## 1
produce_sem <- read.csv(file = here("Produce_semantics.csv"), sep = "\t", stringsAsFactors = TRUE) # This data set was produced in the MAKE_produce_wordcloud_lollipop.R script
### Now onto a comparison of semantic categories between TxB narrative and Youth Fiction ###
###
# Reorder data by size of difference in frequency #
nar_sem = produce_sem %>%
rowwise() %>% mutate(dif = sqrt((Youth_FictionPer - TxBnarPer)^2)) %>%
arrange(desc(dif)) %>%
mutate(Semantics=factor(Semantics, Semantics))
nar_sem$Semantics <- factor(nar_sem$Semantics, levels = nar_sem$Semantics[order(nar_sem$dif)])
### Chi-squared tests ##
cols = c("Youth_Fiction", "TxBnar")
semantics_test = chisq.test(nar_sem[,cols])
# Comparing the two distributions #
print(semantics_test)
##
## Pearson's Chi-squared test
##
## data: nar_sem[, cols]
## X-squared = 22.119, df = 10, p-value = 0.01451
# X-squared = 22.119, df = 10, p-value = 0.01451
# Comparing proportions for each semantic category #
results = data.frame(Semantics=nar_sem$Semantics, p=1)
for(i in 1:nrow(results)){
category = nar_sem$Semantics[i]
row = nar_sem[nar_sem$Semantics==category,cols]
other_rows = nar_sem[nar_sem$Semantics!=category,cols]
collapsed = rbind(row, colSums(other_rows))
category_test = fisher.test(collapsed)
results$p[i] = category_test$p.value
results$OddsRatio[i] = category_test$estimate
results$conf.int.lower[i] = category_test$conf.int[1]
results$conf.int.upper[i] = category_test$conf.int[2]
}
results$p = p.adjust(results$p, method='none')
## We can add the stars for low p-values using ifelse(). ##
#criterion = 0.05
#results$Star = ifelse(results$p < criterion, '**', '')
## But this package makes it much quicker! ##
results$Star <- stars.pval(results$p)
results = results %>%
mutate(Star = replace(Star, Star==".", "˘")) # Change dot to a higher symbol such as º or ˘
# Check the outcome.
print(results)
## Semantics p OddsRatio conf.int.lower
## 1 Buildings & House 0.003963951 5.7818229 1.531358331
## 2 Food 0.066470807 0.4286075 0.160844525
## 3 Sports, Entertainment & Travel 0.100465585 0.4251720 0.127245834
## 4 World, Life & Body 0.245055350 1.8853900 0.680214692
## 5 General & abstract terms 0.166963624 0.4068191 0.089478772
## 6 Language & Communication 0.286462965 0.5142079 0.131945680
## 7 Clothing & Accessories 0.198985318 2.6442894 0.579540410
## 8 Arts & Crafts 0.524096612 1.6639939 0.378966782
## 9 Materials & Equipment 0.673762817 1.2008138 0.471543246
## 10 Drink 0.789240631 1.2584397 0.379082631
## 11 Industry & Technology 1.000000000 0.5370938 0.008975184
## conf.int.upper Star
## 1 32.571583 **
## 2 1.061591 ˘
## 3 1.250977
## 4 5.559175
## 5 1.481647
## 6 1.738404
## 7 16.377216
## 8 8.315215
## 9 3.082114
## 10 4.282852
## 11 10.492001
## Lollipop chart ##
#svg(filename=here("MAKE_produce_Fiction.svg"), width=14, height=7, pointsize=12)
fig = ggplot(nar_sem) +
geom_segment(aes(x=Semantics, xend=Semantics, y=Youth_FictionPer, yend=TxBnarPer), color="black") +
geom_point(aes(x=Semantics, y=Youth_FictionPer), color=rgb(153,0,0, max = 255, alpha = 255), size=5) +
geom_point(aes(x=Semantics, y=TxBnarPer), color=rgb(252, 141, 89, max = 255, alpha = 220), size=5) +
coord_flip() +
theme_minimal() +
theme(
panel.border = element_blank(),
axis.text=element_text(size=16, colour = "black"),
axis.title.x=element_text(size=16, face = "bold")
) +
scale_x_discrete(limits=c(levels(nar_sem$Semantics),"")) +
xlab("") +
ylab("% of MAKE collocations in the (Ai) produce sense") + # Adds an empty factor to give space for legend as recommended in: https://stackoverflow.com/questions/16788180/r-ggplot-extend-the-range-of-a-category-x-axis
theme(plot.margin=margin(10,10,10,10))
# + labs(title = "Differences in semantic fields of collocates of MAKE in the produce sense (Ai).")
fig1 = fig +
geom_text(data=data.frame(), aes(x="Buildings & House", y= 5, label="Textbook Fiction"),color=rgb(252, 141, 89, max = 255, alpha = 255), hjust=1, size=6, fontface = "bold", nudge_x = 0.7) +
geom_text(data=data.frame(), aes(x="Buildings & House", y= 14, label="Youth Fiction Sample"), color=rgb(153,0,0, max = 255, alpha = 255), hjust=0, size=6, fontface = "bold", nudge_x = 0.7)
## This is to help us place the stars in the middle of the bars ##
cols = c("Youth_FictionPer", "TxBnarPer")
results$y = rowMeans(nar_sem[,cols])
fig2 = fig1 +
geom_text(aes(x=Semantics, y=y, label=Star), data=results, size=9)
print(fig2)
Differences in semantic fields of collocates of MAKE in the produce sense (Ai).
dev.off()
## null device
## 1
#ggsave(here("produce_USAS_Fiction.png"), width = 24, height = 12, units = "cm", dpi = 300)
produce <- read.csv(file = here("Produce.csv"), sep = "\t", stringsAsFactors = TRUE)
fiction_produce <- produce %>% filter(Subcorpus=="Youth_Fiction" | Subcorpus =="TxBnar")
fiction_produce$Subcorpus <- droplevels(fiction_produce$Subcorpus)
table(fiction_produce$Collocate_lemma, fiction_produce$Subcorpus)
##
## TxBnar Youth_Fiction
## [food] 3 0
## album 0 0
## alcohol 1 0
## anything 0 0
## army 0 0
## art 0 0
## ash 0 0
## bacon roll 0 0
## bag 0 0
## bagel 0 0
## ball 0 1
## band 0 0
## banner 0 0
## basket 0 1
## beast 0 1
## belt 0 0
## bhaji 0 0
## bike 0 0
## biscuit 0 0
## blanket 0 0
## block 0 0
## blocks of ice 0 1
## boat 1 0
## bobble 0 0
## body 0 0
## bonfire 0 0
## book 0 1
## bookcase 0 0
## booze 0 0
## box 0 1
## bracelet 0 1
## bread 0 0
## breadcrumb 0 0
## breakfast 2 1
## brownies 0 0
## building 0 1
## bulgar 0 0
## burger 0 0
## burrow 0 0
## cake 1 0
## cake and biscuit 0 0
## calendar 0 0
## canal 0 0
## canapé 0 0
## candle 0 0
## card 0 0
## carpet 0 1
## CD 0 0
## ceiling 0 0
## chair 0 1
## character 0 0
## cheese 0 0
## children 0 0
## chili 1 0
## chips 0 0
## chocolate 0 0
## chowder 0 0
## chutney 0 0
## city 0 0
## claw 0 0
## clay 1 0
## cliff 0 0
## clip 0 1
## clock 0 0
## clothes 0 2
## clothing 0 0
## cocoa 1 0
## coffee 0 2
## coin 1 0
## collage 0 0
## colour 0 0
## content 0 0
## copy 1 0
## costume 0 0
## course 0 0
## covering 0 1
## crack 0 1
## cross 1 0
## crumble 0 0
## cup 0 1
## cup of cocoa 0 1
## cup of tea 1 0
## curry 1 0
## cut 1 0
## cyberpet 0 0
## database 0 0
## decoration 0 0
## den 0 0
## depression 1 0
## dinner 1 0
## documentary 1 0
## dog 0 0
## door 0 1
## drawing 0 0
## dress 0 1
## drink 0 0
## drive 0 0
## egg 0 1
## elderflower 0 0
## engine 0 0
## Eton Mess 0 0
## everyone 0 1
## factory 0 0
## fairytale 0 0
## fake 0 0
## farm 0 0
## filling 0 0
## film 7 0
## fingerprint 0 0
## fire 2 0
## folder 0 0
## food 1 1
## footprint 0 0
## foundation 1 0
## fragrance 0 1
## frame 0 0
## fruit 0 0
## fudge 0 0
## fuel 2 0
## furniture 1 0
## future 1 0
## gadget 0 0
## game 0 0
## Gamemaker 0 0
## generator 0 0
## ginger honey lemon 0 0
## gingerbread men 0 0
## glass 0 0
## globe 0 0
## god 0 1
## gold 0 1
## gravy 0 0
## group 1 0
## gruel 0 1
## guacamole 0 0
## gum 0 1
## Guys 0 0
## gypsy tart 0 0
## hair 1 0
## hallmark 0 0
## hat 0 0
## helmet 0 1
## history 0 0
## hole 4 1
## home 1 0
## hot chocolate 0 0
## house 0 0
## hula hoop 1 0
## ice cream 0 0
## icing 0 0
## ID 1 0
## instrument 0 0
## invention 1 0
## iPod 0 0
## island 0 1
## jacket 0 0
## jam 0 0
## joint 0 0
## jug of love 0 0
## junk [food] 0 1
## kimbap 0 0
## Kleenex 0 0
## lair 0 1
## lake 1 0
## lamp 0 1
## lasagne 0 0
## law 0 0
## leg 0 1
## Lego 0 0
## life 0 1
## light 0 1
## list 1 1
## lunch 3 0
## magazine 0 1
## mash 0 0
## mask 1 0
## matchwood 0 1
## matzah 0 0
## meal 0 0
## meat pattie 0 1
## merchandise 0 0
## mess 0 0
## metal 1 0
## milk 0 0
## milkshake 0 1
## mince meat 0 0
## mince pie 0 0
## mixtape 0 0
## moon 0 0
## mousse 0 0
## movement 0 1
## movie 0 0
## muffin 0 0
## music 2 0
## mutant 0 1
## necklace 1 0
## nest 0 0
## object 1 1
## omelet 0 1
## opening 1 0
## pancake 1 1
## pancakes and cupcake 0 0
## panorama 0 0
## parchment 0 1
## pasta 0 0
## pasty 0 0
## path 0 2
## pattern 0 0
## paving 0 1
## pen 0 1
## penny 0 0
## people 0 1
## phone 0 0
## picture 0 0
## pie 0 0
## pitta 0 0
## pizza 0 0
## plate 0 0
## play 0 1
## poetry 0 0
## poster 4 0
## pots of tea 0 1
## present 0 0
## printer 0 0
## product 0 0
## programme 0 0
## project 0 0
## protein 0 1
## puppet 0 0
## rat 1 0
## record 0 0
## recording 1 0
## report 2 0
## rhythm 0 0
## ring 0 2
## road 0 0
## roof 0 1
## room 0 3
## rope 1 0
## rule 3 1
## salad 0 0
## sandwich 3 1
## sauce 0 0
## schedule 0 0
## scone 0 0
## seat 0 0
## settlement 0 1
## shelter 0 1
## shit 0 0
## shoe 2 0
## shot 0 0
## sign 1 0
## snow 0 1
## snowflake 0 1
## snowman 0 0
## sock 0 0
## soil 0 0
## something 1 0
## song 0 1
## soup 1 0
## sour cream 0 0
## spaghetti 0 0
## specification 0 0
## spiral 0 1
## splot 0 1
## stain 0 0
## stall 0 0
## stamp 0 0
## stand 0 1
## steak and pancake 1 0
## steeple 0 1
## stereo 0 0
## story 1 0
## supper 0 0
## T-shirt 0 1
## tagine 0 0
## tank 0 0
## tea 4 1
## teeth 0 0
## term 0 1
## thing 0 1
## thread 0 0
## throw 1 0
## tile 0 0
## tissue 0 0
## toad-in-the-hole 0 0
## toast 1 0
## toastie 0 0
## tortilla 0 0
## track 0 1
## trap 0 1
## tree 1 0
## triangle 0 0
## trifle 0 0
## trough 0 0
## trousers 0 0
## TV series 1 0
## tyre 0 0
## uni 0 0
## video 1 0
## wall 0 1
## weapon 0 1
## web 2 0
## wedge 0 0
## well 0 0
## will 0 1
## window 0 0
## wine 0 1
## wonder 0 0
## wood 1 0
## word 0 0
## world 1 1
## wrap 0 0
## Yorkshire pudding 0 0
comp <- round(prop.table(table(fiction_produce$Collocate_lemma, fiction_produce$Subcorpus), 2), 4)*100
comp <- as.data.frame(unclass(comp))
comp$Collocate_lemma <- row.names(comp)
head(comp)
## TxBnar Youth_Fiction Collocate_lemma
## [food] 3.23 0 [food]
## album 0.00 0 album
## alcohol 1.08 0 alcohol
## anything 0.00 0 anything
## army 0.00 0 army
## art 0.00 0 art
comp <- inner_join(comp, fiction_produce[,6:7], by = "Collocate_lemma")
comp <- distinct(comp)
head(comp)
## TxBnar Youth_Fiction Collocate_lemma USAS
## 1 3.23 0.00 [food] Food
## 2 1.08 0.00 alcohol Drink
## 3 0.00 1.16 ball Materials & Equipment
## 4 0.00 1.16 basket Arts & Crafts
## 5 0.00 1.16 beast Arts & Crafts
## 6 0.00 1.16 blocks of ice Materials & Equipment
suffrager <- c(palettes_d$suffrager$CarolMan[2:4], palettes_d$suffrager$classic[1:2], palettes_d$suffrager$london, palettes_d$suffrager$oxon[c(1,5)])
levels(as.factor(comp$Youth_Fiction))
## [1] "0" "1.16" "2.33" "3.49"
## With minimal threshold of just min 2 occurrences
comp %>% filter(TxBnar > 1.09 | Youth_Fiction > 1.17) %>%
ggplot(aes(TxBnar, Youth_Fiction, label = Collocate_lemma)) +
# ggrepel geom, make arrows transparent, color by rank, size by n
geom_text_repel(segment.alpha = 0,
aes(colour=USAS)) +
scale_colour_manual(values = suffrager,
name = "Semantic field") +
# set color gradient & customize legend
geom_abline(color = "gray40", lty = 2) +
# set word size range & turn off legend
labs(y = "% of MAKE 'produce' collocations in Spoken BNC2014 sample", x = "% of MAKE 'produce' collocations in Textbook Conversation") +
scale_y_log10(breaks=c(1,1.5,2,2.5,3)) +
scale_x_log10(breaks=c(1,2,4,6,8)) +
# minimal theme & customizations
theme_bw() +
theme(legend.position=c(0.86,0.35),
#legend.justification = c("right","top"),
panel.grid.major = element_line(colour = "whitesmoke"),
panel.grid.minor=element_blank(),
legend.background = element_rect(colour = 'darkgrey', fill = 'white', linetype='solid'))
#ggsave(here("Produce_MAKEs_Fiction_chatterplot.svg"), dpi = 300, width =22, height = 15, units = "cm")
### Now a wordcloud for each subcorpus ###
produceTxBnar <- produce[produce$Subcorpus == "TxBnar", ]
produceYF <- produce[produce$Subcorpus == "Youth_Fiction", ]
produce_freq_YF <- produceYF %>% group_by(Collocate_lemma, USAS) %>% summarize(Frequency = n())
produce_freq_TxBnar <- produceTxBnar %>% group_by(Collocate_lemma, USAS) %>% summarize(Frequency = n())
anchor = 10
# png(filename = here('produce_YF_categories.png'), width = 1300, height = 1100,
# res = 300) svg(filename=here('produce_YF_collocates.svg'), width=9, height=6,
# pointsize=11)
produce_freq_YF$colors <- colours[match(produce_freq_YF$USAS, levels(produce_freq_YF$USAS))]
wordcloud(produce_freq_YF$Collocate_lemma, produce_freq_YF$Frequency, min.freq = 2,
rot.per = 0, colors = produce_freq_YF$colors, ordered.colors = TRUE, random.order = FALSE,
scale = c(10 * max(table(produce_freq_YF$Collocate_lemma))/8, 0.5))
Main collocates of MAKE in the produce/create sense (Ai) in the Youth Fiction sample
dev.off()
## null device
## 1
# png(filename = here('produce_TxBnar_categories.png'), width = 1300, height =
# 1100, res = 300) svg(filename=here('produce_TxBNar_collocates.svg'), width=9,
# height=6, pointsize=11)
produce_freq_TxBnar$colors <- colours[match(produce_freq_TxBnar$USAS, levels(produce_freq_TxBnar$USAS))]
wordcloud(produce_freq_TxBnar$Collocate_lemma, produce_freq_TxBnar$Frequency, min.freq = 2,
rot.per = 0, colors = produce_freq_TxBnar$colors, ordered.colors = TRUE, random.order = FALSE,
scale = c(20 * max(table(produce_freq_TxBnar$Collocate_lemma))/anchor, 0.5))
Main collocates of MAKE in the produce/create sense (Ai) in Textbook Fiction
dev.off()
## null device
## 1
### Wordclouds for collocates ###
cols <- RColorBrewer::brewer.pal(n = 9, name = "OrRd")
delexical <- read.csv(here("Delexical_MAKE2.csv"), sep = ",", stringsAsFactors = TRUE)
str(delexical)
## 'data.frame': 499 obs. of 7 variables:
## $ Register : Factor w/ 4 levels "BNC_Spoken","narrative",..: 4 1 1 4 3 2 4 1 2 3 ...
## $ Level : Factor w/ 6 levels "","A","B","C",..: 1 1 1 1 1 5 1 1 2 1 ...
## $ Series : Factor w/ 8 levels "","Access","Achievers",..: 1 1 1 1 1 7 1 1 3 1 ...
## $ Meaning : Factor w/ 2 levels "","delexical": 1 1 1 1 1 2 1 1 2 1 ...
## $ Collocate : Factor w/ 145 levels "a case","abortions",..: 1 2 3 4 5 6 6 7 7 7 ...
## $ Collocate_lemma: Factor w/ 124 levels "abortion","accent",..: 25 1 2 3 4 5 5 6 6 6 ...
## $ speech : int 1 0 0 1 0 1 1 1 1 1 ...
table(delexical$Collocate_lemma, delexical$Register)
##
## BNC_Spoken narrative TxB_spoken Youth_Fiction
## abortion 1 0 0 0
## accent 1 0 0 0
## accusation 0 0 0 1
## agreement 0 0 1 0
## amend 0 1 0 1
## announcement 1 1 1 0
## appeal 0 0 0 1
## appointment 1 0 4 0
## argument 1 0 0 0
## arrangement 2 1 0 1
## arrest 0 0 0 1
## assumption 1 0 0 0
## attempt 0 0 2 0
## available 0 0 1 0
## aware 1 0 0 0
## balls up 1 0 0 0
## bang 1 0 0 0
## bargain 0 0 0 1
## bed 0 0 5 1
## bet 0 1 0 0
## bid 0 0 1 1
## booking 1 0 1 0
## breakthrough 0 1 0 1
## call 0 1 7 4
## case 1 0 1 1
## change 2 2 8 0
## cheque 0 0 1 0
## choice 3 7 3 1
## clear 0 1 0 0
## comment 1 1 4 1
## commitment 0 0 1 0
## comparison 1 0 0 0
## complaint 1 0 0 1
## confession 0 0 0 1
## connect 1 0 0 0
## contact 3 2 1 0
## contrast 0 1 0 0
## contribution 0 0 2 1
## conversation 1 2 0 1
## correction 0 1 0 0
## creation 1 0 0 0
## curtsey 0 0 0 1
## cut 0 0 1 0
## dash 0 1 0 0
## deal 1 0 1 1
## decision 14 2 7 4
## difference 17 5 14 6
## discovery 0 1 0 3
## distinction 0 0 1 0
## economy 1 0 0 0
## effort 8 3 5 2
## entrance 0 0 0 1
## exception 0 1 0 0
## excuse 0 0 1 2
## feature 1 0 0 0
## fuss 1 0 1 2
## gain 1 0 0 0
## generalisation 1 0 0 0
## gesture 0 0 0 4
## illusion 1 0 0 0
## impression 1 0 2 1
## improvement 0 0 1 0
## issue 1 0 0 0
## joke 1 0 2 1
## journey 1 1 2 1
## killing 1 0 0 0
## leap 1 1 0 0
## list 1 0 3 0
## loss 1 0 0 0
## misery 0 0 1 0
## mistake 8 13 16 4
## move 3 0 1 2
## movement 1 0 0 1
## noise 8 8 13 12
## note 2 0 1 1
## observation 0 0 1 0
## offer 1 1 0 0
## pact 0 2 0 0
## pause 0 0 1 0
## payment 2 0 1 0
## peace 0 0 0 1
## phenomenon 1 0 0 0
## plan 1 1 4 1
## point 1 0 0 0
## prediction 0 0 1 0
## preparation 0 1 0 0
## progress 0 2 5 4
## promise 0 0 4 1
## proposal 0 0 1 0
## protest 0 0 1 0
## racket 0 0 0 2
## ready 0 0 0 1
## request 0 1 0 0
## reservation 0 0 0 1
## row 1 0 0 0
## sacrifice 1 0 0 0
## save 0 0 1 0
## saving 1 0 0 0
## secret 0 0 1 0
## sense 27 2 9 5
## shamble 0 0 0 1
## sighting 0 0 3 0
## sign 0 0 0 1
## signal 1 0 0 0
## silence 0 0 0 1
## small talk 0 0 0 1
## sound 1 6 8 6
## speech 0 2 1 1
## start 0 0 0 2
## statement 0 1 0 0
## stay 1 0 0 0
## stereotype 1 0 0 0
## stop 0 0 1 0
## suggestion 0 1 2 0
## sweep 0 0 0 1
## swipe 0 0 0 1
## tone 0 0 1 0
## transition 0 0 1 0
## trip 0 0 0 1
## trouble 0 1 1 2
## use 2 0 0 1
## visit 0 0 0 1
## walk 0 1 0 0
## wish 0 0 2 1
# https://stackoverflow.com/questions/37275220/wordclouds-with-absolute-word-sizes
# Trick to make the wordclouds comparable with a constant font size #
anchor <- 20 # Should be maximum frequency in any of the wordclouds to be compared
# Youth Fiction #
delexicalYF <- delexical[delexical$Register == "Youth_Fiction", ]
# png(filename = here('Delexical_YF_minFreq2.png'), width = 1200, height = 1000,
# res = 300) # Too low a quality? svg(filename=here('Delexical_YF_minFreq2.svg'),
# width=4, height=3, pointsize=11)
wordcloud::wordcloud(delexicalYF$Collocate_lemma, min.freq = 2, rot.per = 0, colors = cols[9],
scale = c(8 * max(table(delexicalYF$Collocate_lemma))/anchor, 0.5))
dev.off()
## null device
## 1
# Textbook narrative #
delexicalTxBnar <- delexical[delexical$Register == "narrative", ]
# png(filename = here('Delexical_TxBnar_minFreq2.png'), width = 1300, height =
# 1100, res = 300) svg(filename=here('Delexical_TxBnar_minFreq2.svg'), width=5,
# height=3, pointsize=11)
wordcloud(delexicalTxBnar$Collocate_lemma, min.freq = 2, rot.per = 0, colors = cols[6],
scale = c(6 * max(table(delexicalTxBnar$Collocate_lemma))/anchor, 0.5))
dev.off()
## null device
## 1
# BNC spoken #
delexicalBNC <- delexical[delexical$Register == "BNC_Spoken", ]
# png(filename = here('Delexical_BNCs_minFreq2.png'), width = 2000, height =
# 1800, res = 300) # Too low a quality?
# svg(filename=here('Delexical_BNCs_minFreq2.svg'), width=6, height=6,
# pointsize=11)
wordcloud(delexicalBNC$Collocate_lemma, min.freq = 2, rot.per = 0, colors = cols[9],
scale = c(8 * max(table(delexicalBNC$Collocate_lemma))/anchor, 0.5))
dev.off() # To save plot
## null device
## 1
# Textbook spoken #
delexicalTxBspoken <- delexical[delexical$Register == "TxB_spoken", ]
# png(filename = here('Delexical_TxBspoken_minFreq2.png'), width = 2000, height =
# 1800, res = 300) svg(filename=here('Delexical_TxBspoken_minFreq2.svg'),
# width=5, height=10, pointsize=11)
wordcloud(delexicalTxBspoken$Collocate_lemma, min.freq = 2, rot.per = 0, colors = cols[6],
scale = c(4 * max(table(delexicalTxBspoken$Collocate_lemma))/anchor, 0.5))
dev.off() # To save plot
## null device
## 1
##### For range of colours according to frequencies ##
wordcloud(delexicalTxBnar$Collocate_lemma, min.freq = 2, rot.per = 0, random.order = FALSE,
random.color = FALSE, colors = brewer.pal(4, "RdGy"))
##### https://stackoverflow.com/questions/18902485/colored-categories-in-r-wordclouds
colours <- c("red", "blue", "orange", "green")
delexical$color <- colours[match(delexical$Register, levels(delexical$Register))]
wordcloud(delexicalTxBnar$Collocate_lemma, min.freq = 2, rot.per = 0, random.order = FALSE,
random.color = FALSE, colors = as.character(delexical$color))
#### Chatterplot: inspired by
#### https://towardsdatascience.com/rip-wordclouds-long-live-chatterplots-e76a76896098
#### Used in MAKING tea and mistakes book chapter
delexical <- read.csv(here("Delexical_MAKE2.csv"), sep = ",", stringsAsFactors = TRUE)
spoken_delex <- delexical %>% filter(Register == "BNC_Spoken" | Register == "TxB_spoken")
spoken_delex$Register <- droplevels(spoken_delex$Register)
table(spoken_delex$Collocate_lemma, spoken_delex$Register)
##
## BNC_Spoken TxB_spoken
## abortion 1 0
## accent 1 0
## accusation 0 0
## agreement 0 1
## amend 0 0
## announcement 1 1
## appeal 0 0
## appointment 1 4
## argument 1 0
## arrangement 2 0
## arrest 0 0
## assumption 1 0
## attempt 0 2
## available 0 1
## aware 1 0
## balls up 1 0
## bang 1 0
## bargain 0 0
## bed 0 5
## bet 0 0
## bid 0 1
## booking 1 1
## breakthrough 0 0
## call 0 7
## case 1 1
## change 2 8
## cheque 0 1
## choice 3 3
## clear 0 0
## comment 1 4
## commitment 0 1
## comparison 1 0
## complaint 1 0
## confession 0 0
## connect 1 0
## contact 3 1
## contrast 0 0
## contribution 0 2
## conversation 1 0
## correction 0 0
## creation 1 0
## curtsey 0 0
## cut 0 1
## dash 0 0
## deal 1 1
## decision 14 7
## difference 17 14
## discovery 0 0
## distinction 0 1
## economy 1 0
## effort 8 5
## entrance 0 0
## exception 0 0
## excuse 0 1
## feature 1 0
## fuss 1 1
## gain 1 0
## generalisation 1 0
## gesture 0 0
## illusion 1 0
## impression 1 2
## improvement 0 1
## issue 1 0
## joke 1 2
## journey 1 2
## killing 1 0
## leap 1 0
## list 1 3
## loss 1 0
## misery 0 1
## mistake 8 16
## move 3 1
## movement 1 0
## noise 8 13
## note 2 1
## observation 0 1
## offer 1 0
## pact 0 0
## pause 0 1
## payment 2 1
## peace 0 0
## phenomenon 1 0
## plan 1 4
## point 1 0
## prediction 0 1
## preparation 0 0
## progress 0 5
## promise 0 4
## proposal 0 1
## protest 0 1
## racket 0 0
## ready 0 0
## request 0 0
## reservation 0 0
## row 1 0
## sacrifice 1 0
## save 0 1
## saving 1 0
## secret 0 1
## sense 27 9
## shamble 0 0
## sighting 0 3
## sign 0 0
## signal 1 0
## silence 0 0
## small talk 0 0
## sound 1 8
## speech 0 1
## start 0 0
## statement 0 0
## stay 1 0
## stereotype 1 0
## stop 0 1
## suggestion 0 2
## sweep 0 0
## swipe 0 0
## tone 0 1
## transition 0 1
## trip 0 0
## trouble 0 1
## use 2 0
## visit 0 0
## walk 0 0
## wish 0 2
comp <- round(prop.table(table(spoken_delex$Collocate_lemma, spoken_delex$Register),
2), 4) * 100
comp <- as.data.frame(unclass(comp))
comp$Collocate <- row.names(comp)
head(comp)
## BNC_Spoken TxB_spoken Collocate
## abortion 0.69 0.0 abortion
## accent 0.69 0.0 accent
## accusation 0.00 0.0 accusation
## agreement 0.00 0.6 agreement
## amend 0.00 0.0 amend
## announcement 0.69 0.6 announcement
levels(as.factor(comp$TxB_spoken))
## [1] "0" "0.6" "1.2" "1.8" "2.4" "2.99" "4.19" "4.79" "5.39" "7.78"
## [11] "8.38" "9.58"
comp %>% filter(TxB_spoken > 0.7 | BNC_Spoken > 0.7) %>% ggplot(aes(TxB_spoken, BNC_Spoken,
label = Collocate)) + # ggrepel geom, make arrows transparent
geom_point(color = "darkred") + geom_text_repel(min.segment.length = 0.3, segment.alpha = 0.4,
force = 0.4) + geom_abline(color = "gray40", lty = 2) + labs(y = "% of delexical MAKEs in Spoken BNC2014 sample",
x = "% of delexical MAKEs in the TEC-Conv") + scale_y_log10(breaks = c(1, 2,
4, 6, 8, 10, 12, 14, 16, 18)) + scale_x_log10(breaks = c(1, 2, 4, 6, 8, 10, 12,
14, 16)) + # minimal theme & customizations
theme_bw() + theme(panel.grid.major = element_line(colour = "whitesmoke"), panel.grid.minor = element_blank())
# ggsave(here('Delex_MAKE_spoken_chatterplot.svg'), dpi = 300, width =17, height
# = 18, units = 'cm')
# Good dimensions to save as SVG = 727 x 587
dev.off()
## null device
## 1
#### Chatterplot: inspired by
#### https://towardsdatascience.com/rip-wordclouds-long-live-chatterplots-e76a76896098
#### Used in MAKING tea and mistakes book chapter
delexical <- read.csv(here("Delexical_MAKE2.csv"), sep = ",", stringsAsFactors = TRUE)
fiction_delex <- delexical %>% filter(Register == "Youth_Fiction" | Register == "narrative")
fiction_delex$Register <- droplevels(fiction_delex$Register)
table(fiction_delex$Collocate_lemma, fiction_delex$Register)
##
## narrative Youth_Fiction
## abortion 0 0
## accent 0 0
## accusation 0 1
## agreement 0 0
## amend 1 1
## announcement 1 0
## appeal 0 1
## appointment 0 0
## argument 0 0
## arrangement 1 1
## arrest 0 1
## assumption 0 0
## attempt 0 0
## available 0 0
## aware 0 0
## balls up 0 0
## bang 0 0
## bargain 0 1
## bed 0 1
## bet 1 0
## bid 0 1
## booking 0 0
## breakthrough 1 1
## call 1 4
## case 0 1
## change 2 0
## cheque 0 0
## choice 7 1
## clear 1 0
## comment 1 1
## commitment 0 0
## comparison 0 0
## complaint 0 1
## confession 0 1
## connect 0 0
## contact 2 0
## contrast 1 0
## contribution 0 1
## conversation 2 1
## correction 1 0
## creation 0 0
## curtsey 0 1
## cut 0 0
## dash 1 0
## deal 0 1
## decision 2 4
## difference 5 6
## discovery 1 3
## distinction 0 0
## economy 0 0
## effort 3 2
## entrance 0 1
## exception 1 0
## excuse 0 2
## feature 0 0
## fuss 0 2
## gain 0 0
## generalisation 0 0
## gesture 0 4
## illusion 0 0
## impression 0 1
## improvement 0 0
## issue 0 0
## joke 0 1
## journey 1 1
## killing 0 0
## leap 1 0
## list 0 0
## loss 0 0
## misery 0 0
## mistake 13 4
## move 0 2
## movement 0 1
## noise 8 12
## note 0 1
## observation 0 0
## offer 1 0
## pact 2 0
## pause 0 0
## payment 0 0
## peace 0 1
## phenomenon 0 0
## plan 1 1
## point 0 0
## prediction 0 0
## preparation 1 0
## progress 2 4
## promise 0 1
## proposal 0 0
## protest 0 0
## racket 0 2
## ready 0 1
## request 1 0
## reservation 0 1
## row 0 0
## sacrifice 0 0
## save 0 0
## saving 0 0
## secret 0 0
## sense 2 5
## shamble 0 1
## sighting 0 0
## sign 0 1
## signal 0 0
## silence 0 1
## small talk 0 1
## sound 6 6
## speech 2 1
## start 0 2
## statement 1 0
## stay 0 0
## stereotype 0 0
## stop 0 0
## suggestion 1 0
## sweep 0 1
## swipe 0 1
## tone 0 0
## transition 0 0
## trip 0 1
## trouble 1 2
## use 0 1
## visit 0 1
## walk 1 0
## wish 0 1
comp <- round(prop.table(table(fiction_delex$Collocate_lemma, fiction_delex$Register),
2), 4) * 100
comp <- as.data.frame(unclass(comp))
comp$Collocate_lemma <- row.names(comp)
head(comp)
## narrative Youth_Fiction Collocate_lemma
## abortion 0.00 0.00 abortion
## accent 0.00 0.00 accent
## accusation 0.00 0.94 accusation
## agreement 0.00 0.00 agreement
## amend 1.23 0.94 amend
## announcement 1.23 0.00 announcement
levels(as.factor(comp$narrative))
## [1] "0" "1.23" "2.47" "3.7" "6.17" "7.41" "8.64" "9.88" "16.05"
levels(as.factor(comp$Youth_Fiction))
## [1] "0" "0.94" "1.89" "2.83" "3.77" "4.72" "5.66" "11.32"
comp %>% filter(narrative > 2.48 | Youth_Fiction > 1.9) %>% ggplot(aes(narrative,
Youth_Fiction, label = Collocate_lemma)) + # ggrepel geom, make arrows transparent
geom_point(color = "darkred") + geom_text_repel(min.segment.length = 0.1, segment.alpha = 0.4,
force = 0.6) + geom_abline(color = "gray40", lty = 2) + labs(y = "% of delexical MAKEs in Youth Fiction sample",
x = "% of delexical MAKEs in Textbook Fiction") + scale_y_log10(breaks = c(1,
2, 4, 6, 8, 10, 12), limits = c(0.9, 12)) + scale_x_log10(breaks = c(1, 2, 4,
6, 8, 10, 12, 14, 16)) + # minimal theme & customizations
theme_bw() + theme(panel.grid.major = element_line(colour = "whitesmoke"), panel.grid.minor = element_blank())
# ggsave(here('delex_MAKE_Fiction_chatterplot.svg'), dpi = 300, width =15, height
# = 15, units = 'cm')
#### Percentage of collocates that are speech actions #####
total_delexical <- delexical %>% group_by(Register) %>% summarise(total = n())
speech <- delexical %>% group_by(Register) %>% filter(speech == 1) %>% summarise(speech = n())
speech2 <- left_join(speech, total_delexical, by = "Register") %>% mutate(speech_per = speech/total *
100) %>% mutate(not_speech = total - speech) %>% add_column(Variety = c("Reference",
"Textbook", "Textbook", "Reference")) %>% rename(Corpus = Register) %>% add_column(Register = c("Conversation",
"Fiction", "Conversation", "Fiction"))
speech2
## # A tibble: 4 × 7
## Corpus speech total speech_per not_speech Variety Register
## <fct> <int> <int> <dbl> <int> <chr> <chr>
## 1 BNC_Spoken 33 145 22.8 112 Reference Conversation
## 2 narrative 13 81 16.0 68 Textbook Fiction
## 3 TxB_spoken 24 167 14.4 143 Textbook Conversation
## 4 Youth_Fiction 24 106 22.6 82 Reference Fiction
## Statistical testing ##
speech3 <- as.data.frame(speech2)
speech3
## Corpus speech total speech_per not_speech Variety Register
## 1 BNC_Spoken 33 145 22.75862 112 Reference Conversation
## 2 narrative 13 81 16.04938 68 Textbook Fiction
## 3 TxB_spoken 24 167 14.37126 143 Textbook Conversation
## 4 Youth_Fiction 24 106 22.64151 82 Reference Fiction
speech_spoken <- speech3[c(1, 3), c(2, 5)] # Spoken register
chisq.test(t(speech_spoken)) # p-value = 0.07751
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: t(speech_spoken)
## X-squared = 3.1164, df = 1, p-value = 0.07751
vcd::assocstats(t(speech_spoken)) # Cramer's V = 0.108
## X^2 df P(> X^2)
## Likelihood Ratio 3.6524 1 0.055988
## Pearson 3.6566 1 0.055848
##
## Phi-Coefficient : 0.108
## Contingency Coeff.: 0.108
## Cramer's V : 0.108
speech_narrative <- speech3[c(2, 4), c(2, 5)] # Narrative register
chisq.test(t(speech_narrative)) # p-value = 0.3493
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: t(speech_narrative)
## X-squared = 0.87612, df = 1, p-value = 0.3493
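For symmetry with the Conversation comparison above, the effect size of the Fiction contingency table can be computed in the same way. This is a minimal sketch; its output is not shown here.
# Sketch: effect size (Cramer's V) for the narrative vs. Youth Fiction comparison
vcd::assocstats(t(speech_narrative))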
## Second attempt at better graph ##
cols <- RColorBrewer::brewer.pal(n = 7, name = "OrRd")
# svg(filename=here('delexical_speech.svg'), width=5, height=4)
ggplot(speech3, aes(Register, speech_per)) + geom_bar(aes(fill = Variety), stat = "identity",
    position = "dodge", width = 0.5) + ggsignif::geom_signif(y_position = 23.5,
    xmin = c(0.8, 1.8), xmax = c(1.2, 2.2), annotation = c("ns", "ns"), tip_length = 0,
    textsize = 3) + scale_fill_manual(name = "Variety", labels = c("ENL Reference",
    "Textbook English"), values = cols[c(3, 7)]) + ylab("% of delexical MAKE") +
    scale_y_continuous(expand = c(0, 0), limits = c(0, 25)) + theme_bw() +
    # theme() placed after theme_bw() so the blank legend title is not overridden
    theme(legend.title = element_blank())
dev.off()
## null device
## 1
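An alternative to the chi-squared tests above is a two-sample test of equal proportions, which additionally returns a confidence interval for the difference between the two proportions. The call below is a sketch based on the speech3 data frame; its output is not reproduced here.
# Sketch: proportion test for the Conversation data (rows 1 and 3 of speech3)
prop.test(x = speech3[c(1, 3), "speech"], n = speech3[c(1, 3), "total"])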
## Quantitative analysis ##
speech <- delexical[delexical$speech == 1, ]
head(speech) # Select only speech collocates
## Register Level Series Meaning Collocate Collocate_lemma
## 1 Youth_Fiction a case case
## 4 Youth_Fiction accusation accusation
## 6 narrative D Join the Team delexical amends amend
## 7 Youth_Fiction amends amend
## 8 BNC_Spoken announcement announcement
## 9 narrative A Achievers delexical announcement announcement
## speech
## 1 1
## 4 1
## 6 1
## 7 1
## 8 1
## 9 1
speech$Collocate_lemma <- factor(speech$Collocate_lemma) # Drop unused levels
speech <- t(table(speech$Register, speech$Collocate_lemma)) # Compute raw frequency list
str(speech)
## 'table' int [1:31, 1:4] 0 0 1 0 1 2 1 1 0 1 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:31] "accusation" "amend" "announcement" "appeal" ...
## ..$ : chr [1:4] "BNC_Spoken" "narrative" "TxB_spoken" "Youth_Fiction"
head(speech)
##
## BNC_Spoken narrative TxB_spoken Youth_Fiction
## accusation 0 0 0 1
## amend 0 1 0 1
## announcement 1 1 1 0
## appeal 0 0 0 1
## argument 1 0 0 0
## arrangement 2 1 0 1
speech2 <- as.data.frame.matrix(speech)
speech2 %>% subset(narrative == 0 & TxB_spoken == 0) %>% select(c(BNC_Spoken, Youth_Fiction))
## BNC_Spoken Youth_Fiction
## accusation 0 1
## appeal 0 1
## argument 1 0
## assumption 1 0
## aware 1 0
## comparison 1 0
## complaint 1 1
## confession 0 1
## point 1 0
## row 1 0
## signal 1 0
## silence 0 1
## small talk 0 1
## speech 0 1
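The reverse check, i.e. speech collocates attested in the textbook subcorpora but absent from both reference corpora, can be run on the same frequency table. This is a sketch only; its output is not shown here.
# Sketch: speech collocates of delexical MAKE found only in the textbook subcorpora
speech2 %>% subset(BNC_Spoken == 0 & Youth_Fiction == 0) %>% select(c(narrative, TxB_spoken))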
PV <- read.csv(here("MAKE_PhrasalVerbs_Conc.csv"))
t(table(PV$Corpus, PV$PV))
##
## BNCSpoken TxB_narrative TxB_spoken Youth_Fiction
## it up to sb. 0 0 0 1
## of (=capable) 0 0 1 2
## of sth. (=interpret) 0 0 0 2
## out (=discern) 1 4 0 10
## out (=pet) 1 0 0 0
## out (=present as) 0 0 0 1
## out to be (=pretend) 1 0 0 0
## up (=compensate) 1 1 0 1
## up (=comprise) 1 4 3 1
## up (=face paint) 1 0 0 2
## up (=invent) 17 4 2 6
## up (=prepare) 1 0 1 2
## up (=reconcile) 2 0 0 1
## with (=make do with) 0 0 1 0
summary(as.factor(PV$Corpus))
## BNCSpoken TxB_narrative TxB_spoken Youth_Fiction
## 26 13 8 29
table(PV$Corpus, PV$Level)
##
## B C D E
## BNCSpoken 26 0 0 0 0
## TxB_narrative 0 0 1 7 5
## TxB_spoken 0 2 2 3 1
## Youth_Fiction 29 0 0 0 0
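Since the four (sub)corpora differ considerably in size, the raw phrasal-verb counts above are easier to compare as within-corpus proportions. The following line is a quick sketch; its output is not reproduced here.
# Sketch: distribution of MAKE phrasal-verb uses within each corpus (row proportions)
round(prop.table(table(PV$Corpus, PV$PV), 1), 3)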
TxBSpokenCaus <- read.csv(here("Textbook_MAKE_Causative_Constructions.csv"), stringsAsFactors = TRUE)
str(TxBSpokenCaus)
## 'data.frame': 172 obs. of 9 variables:
## $ Register : Factor w/ 1 level "spoken": 1 1 1 1 1 1 1 1 1 1 ...
## $ Level : Factor w/ 5 levels "A","B","C","D",..: 5 5 3 5 5 4 3 5 3 4 ...
## $ Series : Factor w/ 9 levels "Access","Achievers",..: 4 6 4 6 5 9 1 5 9 9 ...
## $ Conc1 : logi NA NA NA NA NA NA ...
## $ Conc2 : Factor w/ 5 levels "made","make",..: 2 1 2 2 1 1 2 4 4 1 ...
## $ Conc3 : logi NA NA NA NA NA NA ...
## $ Cause_Cx : Factor w/ 3 levels "adjectiveCx",..: 1 1 1 3 1 2 3 1 3 1 ...
## $ Collocate : Factor w/ 98 levels "ache","affordable",..: 2 77 81 13 54 33 16 36 52 30 ...
## $ Semantic_prosody: logi NA NA NA NA NA NA ...
TxBSpokenCaus$Register = "Textbook Conversation"
BNC2014Caus <- read.csv(here("SpokenBNC2014_MAKE_Causative_Constructions.csv"), stringsAsFactors = TRUE)
str(BNC2014Caus)
## 'data.frame': 208 obs. of 7 variables:
## $ Filename : Factor w/ 183 levels "filename#0","filename#1004",..: 8 114 142 142 96 173 103 85 36 53 ...
## $ Conc1 : logi NA NA NA NA NA NA ...
## $ Conc2 : Factor w/ 4 levels "made","make",..: 2 2 2 2 1 1 2 4 1 2 ...
## $ Conc3 : logi NA NA NA NA NA NA ...
## $ Cause_Cx : Factor w/ 4 levels "adjectiveCx",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ Collocate : Factor w/ 122 levels "a priority","a real man",..: 25 100 64 64 79 97 28 28 35 61 ...
## $ Semantic_prosody: Factor w/ 3 levels "","negative",..: 2 3 2 2 2 2 1 1 1 1 ...
BNC2014Caus <- BNC2014Caus %>% select(-c(Filename))
BNC2014Caus$Register = "Spoken BNC2014"
BNC2014Caus$Level = "Spoken BNC2014"
BNC2014Caus$Series = "Spoken BNC2014"
caus <- rbind(TxBSpokenCaus, BNC2014Caus) %>% select(-Semantic_prosody)
caus$Register <- as.factor(caus$Register)
str(caus)
## 'data.frame': 380 obs. of 8 variables:
## $ Register : Factor w/ 2 levels "Spoken BNC2014",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ Level : Factor w/ 6 levels "A","B","C","D",..: 5 5 3 5 5 4 3 5 3 4 ...
## $ Series : Factor w/ 10 levels "Access","Achievers",..: 4 6 4 6 5 9 1 5 9 9 ...
## $ Conc1 : logi NA NA NA NA NA NA ...
## $ Conc2 : Factor w/ 5 levels "made","make",..: 2 1 2 2 1 1 2 4 4 1 ...
## $ Conc3 : logi NA NA NA NA NA NA ...
## $ Cause_Cx : Factor w/ 4 levels "adjectiveCx",..: 1 1 1 3 1 2 3 1 3 1 ...
## $ Collocate: Factor w/ 184 levels "ache","affordable",..: 2 77 81 13 54 33 16 36 52 30 ...
table(caus$Cause_Cx, caus$Register)
##
## Spoken BNC2014 Textbook Conversation
## adjectiveCx 75 71
## nounCx 20 14
## verbalCxAA 112 87
## verbalCxAP 1 0
round(prop.table(table(caus$Cause_Cx, caus$Register), 2), 4) * 100
##
## Spoken BNC2014 Textbook Conversation
## adjectiveCx 36.06 41.28
## nounCx 9.62 8.14
## verbalCxAA 53.85 50.58
## verbalCxAP 0.48 0.00
vcd::assoc(caus$Cause_Cx ~ caus$Register, shade = TRUE, varnames = FALSE)
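The association plot can be complemented by a test of independence on the same contingency table. This is a sketch based on the caus data frame built above; note that the single verbalCxAP token may trigger a small-expected-counts warning.
# Sketch: test of independence between construction type and corpus
chisq.test(table(caus$Cause_Cx, caus$Register))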
# Chatterplot: inspired by
# https://towardsdatascience.com/rip-wordclouds-long-live-chatterplots-e76a76896098
comp <- round(prop.table(table(caus$Collocate, caus$Register), 2), 4) * 100
comp <- as.data.frame(unclass(comp))
comp$Collocate <- row.names(comp)
comp <- inner_join(comp, caus[, 7:8], by = "Collocate")
comp <- distinct(comp)
str(comp)
## 'data.frame': 184 obs. of 4 variables:
## $ Spoken BNC2014 : num 0.48 0 0 0 0 0 0.48 0 0 1.44 ...
## $ Textbook Conversation: num 0.58 0.58 0.58 1.16 0.58 1.16 0.58 0.58 0.58 2.33 ...
## $ Collocate : chr "ache" "affordable" "anxious" "appear" ...
## $ Cause_Cx : Factor w/ 4 levels "adjectiveCx",..: 3 1 1 3 2 1 1 1 1 1 ...
comp %>% filter(`Textbook Conversation` > 0.95 | `Spoken BNC2014` > 0.95) %>% ggplot(aes(`Textbook Conversation`,
    `Spoken BNC2014`, colour = Cause_Cx, label = Collocate)) + # ggrepel geom, make arrows transparent, colour by construction type
geom_point(aes(colour = Cause_Cx)) + geom_text_repel(min.segment.length = 0.5, segment.alpha = 0.4,
force = 1, max.overlaps = 20, aes(colour = Cause_Cx), show.legend = F) + scale_colour_manual(breaks = c("adjectiveCx",
"nounCx", "verbalCxAA"), values = palettes_d$suffrager$oxon[c(2, 1, 3)], labels = c("[X MAKE Y AdjP]",
"[X MAKE (Y) NP]", "[X MAKE Y Vinf]"), name = "Construction type") +
    # add dashed identity line as a reference diagonal
geom_abline(color = "gray40", lty = 2) +
    # axis labels & log10 scales
labs(y = "% of causative MAKEs in the Spoken BNC2014 sample", x = "% of causative MAKEs in Textbook Conversation") +
scale_y_log10(breaks = c(1, 2, 4, 6, 8, 10)) + scale_x_log10(breaks = c(1, 2,
4, 6, 8, 10)) +
# minimal theme & customizations
theme_bw() + theme(legend.position = c(0.97, 0.4), legend.justification = c("right",
"top"), panel.grid.major = element_line(colour = "whitesmoke"), panel.grid.minor = element_blank(),
legend.background = element_rect(colour = "darkgrey", fill = "white", linetype = "solid"))
# ggsave(here('Causative_Spoken_chatterplot.svg'), dpi=300, width=20, height=20,
# units='cm')
dev.off()
## null device
## 1
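Beyond the chatterplot, the most frequent individual causative collocates per corpus and construction type can also be listed directly. The line below is a minimal sketch; its output is not reproduced here.
# Sketch: ten most frequent causative MAKE collocates by corpus and construction type
caus %>% count(Register, Cause_Cx, Collocate, sort = TRUE) %>% head(10)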
# packages.bib <- sapply(1:length(loadedNamespaces()), function(i)
# toBibtex(citation(loadedNamespaces()[i])))
knitr::write_bib(c(.packages(), "knitr"), "packages.bib")
sessionInfo()
## R version 4.0.3 (2020-10-10)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur 10.16
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] wordcloud_2.6 forcats_0.5.1 stringr_1.4.0 dplyr_1.0.7
## [5] purrr_0.3.4 readr_2.0.2 tidyr_1.1.4 tibble_3.1.6
## [9] tidyverse_1.3.0 RColorBrewer_1.1-2 gtools_3.8.2 paletteer_1.3.0
## [13] ggrepel_0.9.1 ggplot2_3.3.5 ggsignif_0.6.1 here_1.0.1
##
## loaded via a namespace (and not attached):
## [1] httr_1.4.2 sass_0.4.0 jsonlite_1.7.2 modelr_0.1.8
## [5] bslib_0.3.1 stargazer_5.2.2 assertthat_0.2.1 highr_0.9
## [9] cellranger_1.1.0 slam_0.1-48 yaml_2.2.1 pillar_1.6.4
## [13] backports_1.4.1 lattice_0.20-41 glue_1.6.0 digest_0.6.29
## [17] rvest_1.0.0 colorspace_2.0-2 htmltools_0.5.2 tm_0.7-8
## [21] pkgconfig_2.0.3 broom_0.7.9 haven_2.3.1 scales_1.1.1
## [25] tzdb_0.1.2 generics_0.1.1 farver_2.1.0 ellipsis_0.3.2
## [29] withr_2.4.3 NLP_0.2-1 cli_3.1.0 magrittr_2.0.1
## [33] crayon_1.4.2 readxl_1.3.1 evaluate_0.14 fs_1.5.2
## [37] fansi_0.5.0 MASS_7.3-53.1 xml2_1.3.3 tools_4.0.3
## [41] hms_1.0.0 formatR_1.8 lifecycle_1.0.1 munsell_0.5.0
## [45] reprex_1.0.0 compiler_4.0.3 jquerylib_0.1.4 vcd_1.4-8
## [49] rlang_0.4.12 grid_4.0.3 rstudioapi_0.13 labeling_0.4.2
## [53] rmarkdown_2.11 gtable_0.3.0 DBI_1.1.1 rematch2_2.1.2
## [57] R6_2.5.1 zoo_1.8-9 lubridate_1.7.10 knitr_1.37
## [61] fastmap_1.1.0 utf8_1.2.2 rprojroot_2.0.2 stringi_1.7.6
## [65] parallel_4.0.3 Rcpp_1.0.7 vctrs_0.3.8 dbplyr_2.1.0
## [69] tidyselect_1.1.1 xfun_0.29 lmtest_0.9-38