This script is part of the Online Appendix to my PhD thesis.
Please cite as: Le Foll, Elen. 2022. Textbook English: A Corpus-Based Analysis of the Language of EFL textbooks used in Secondary Schools in France, Germany and Spain. PhD thesis. Osnabrück University.
For more information, see: https://elenlefoll.github.io/TextbookEnglish/
Please note that the plot dimensions in this notebook have been optimised for the print version of the thesis.
Built with R 4.0.3
knitr::opts_chunk$set(echo = TRUE, tidy = TRUE, message=FALSE, paged.print=TRUE, fig.width = 10, warning=FALSE)
library(here)
library(ggsignif)
library(ggrepel)
library(paletteer)
require(gtools)
library(RColorBrewer)
library(tidyverse)
library(wordcloud)
## Subcorpus raw_freq_MAKE wpm_freq_MAKE total_tokens
## 2 TxB_informative 713 2353.79 302916
## 3 TxB_instructional 2310 3901.63 592061
## 4 TxB_narrative 395 1555.80 253888
## 5 TxB_other 29 2014.73 14394
## 6 TxB_personal 115 1713.91 67098
## 7 TxB_poetry 82 3130.49 26194
## 8 TxB_spoken 699 1368.03 510954
## 9 TxB_words_phrases_sentences 1776 1942.93 914083
## raw_freq_not_MAKE raw_freq_verbs xlabels
## 2 302203 44160 Informative
## 3 589751 89342 Instructional
## 4 253493 42302 Narrative
## 5 14365 2068 Other
## 6 66983 11175 Personal
## 7 26112 4512 Poetry
## 8 510255 80106 Conversation
## 9 912307 138695 Exercises
Across the entire Textbook English Corpus (TEC), MAKE is the 13th most frequent verb lemma, preceded by lemmas such as DO, HAVE, CAN, GO, SAY, LOOK, USE, THINK, WRITE and READ.
Moreover, it is the 47th most frequent lemma across all parts-of-speech. Its relative frequency of occurrence across all textbook registers is 2247.66 pmw. That said, its distribution across the different textbook registers is strikingly uneven (see Fig. @ref(fig:registers)).
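The relative frequencies in the table above are normalised per million words. A minimal sketch of this normalisation from the raw counts and token totals printed in the table, recomputing two of the rows as a check (the data frame name is an assumption):

# Hedged sketch: relative frequency per million words from the counts printed above
register_freqs <- data.frame(
  Subcorpus = c("TxB_informative", "TxB_spoken"),   # two rows copied from the table above
  raw_freq_MAKE = c(713, 699),
  total_tokens = c(302916, 510954)
)
register_freqs$wpm_freq_MAKE <- register_freqs$raw_freq_MAKE / register_freqs$total_tokens * 10^6
round(register_freqs$wpm_freq_MAKE, 2)  # 2353.79 and 1368.03, matching the table above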
Frequency of MAKE across all Textbook registers
## null device
## 1
The following sections will home in on the two textbook registers with the lowest relative frequencies of MAKE, namely the narrative texts and the spoken language within the textbooks. The representations of MAKE in these two textbook registers are compared to those in the Youth Fiction Corpus and the Spoken BNC 2014 (Love et al. 2017) respectively.
Fig. @ref(fig:relfreqspokennar) reveals that the verb MAKE is more frequent in fiction than in conversation. In both textbook registers, however, the relative frequency of MAKE is lower than in the corresponding reference corpora. X2 tests (with Yates’ continuity correction) applied to the observed and expected raw frequencies show that this high-frequency verb is featured significantly less frequently in the textbooks than in the reference corpora (Conversation: X2 = 18.774, df = 1, p < 0.001; Fiction: X2 = 25.18, df = 1, p < 0.001).
Relative Frequencies of the verb lemma MAKE in Textbook and Reference subcorpora
## null device
## 1
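For transparency, the Conversation comparison reported above can be reproduced from the raw counts printed in the contingency table further down. A minimal sketch (the matrix values are copied from the freq_spoken output below):

# Hedged sketch: 2x2 test of MAKE vs. all other tokens in the two conversation (sub)corpora
freq_spoken_sketch <- matrix(c(18253, 11268256,
                               699, 510255),
                             nrow = 2, byrow = TRUE,
                             dimnames = list(c("Ref_BNC_Spoken2014", "TxB_spoken"),
                                             c("MAKE", "not_MAKE")))
chisq.test(freq_spoken_sketch)       # Yates' continuity correction is applied by default to 2x2 tables
vcd::assocstats(freq_spoken_sketch)  # effect size measures (Phi, Cramer's V)
fisher.test(freq_spoken_sketch)      # exact alternative, as reported below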
## Register Data raw_Freq pmw_Freq total_tokens
## Ref_Youth_Fiction_sampled Fiction Reference 20657 2007.87 10288031
## TxB_narrative Fiction Textbook 395 1555.80 253888
## Ref_BNC_Spoken2014 Conversation Reference 18253 1617.24 11286509
## TxB_spoken Conversation Textbook 699 1368.03 510954
## raw_freq_not_MAKE total_verbs total_verbs_not_MAKE
## Ref_Youth_Fiction_sampled 10267374 1792111 1771454
## TxB_narrative 253493 42302 41907
## Ref_BNC_Spoken2014 11268256 2487515 2469262
## TxB_spoken 510255 80106 79407
## total_BE total_verbs_not_BE total_verbs_not_BE_MAKE
## Ref_Youth_Fiction_sampled 361207 1430904 1410247
## TxB_narrative 10278 32024 31629
## Ref_BNC_Spoken2014 748800 1738715 1720462
## TxB_spoken 25706 54400 53701
## MAKE_10000verbs MAKE_10000verbs_BE
## Ref_Youth_Fiction_sampled 115.26630 144.3633
## TxB_narrative 93.37620 123.3450
## Ref_BNC_Spoken2014 73.37845 104.9798
## TxB_spoken 87.25938 128.4926
## raw_Freq total_verbs_not_MAKE
## Ref_Youth_Fiction_sampled 20657 1771454
## TxB_narrative 395 41907
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: freq_nar
## X-squared = 17.263, df = 1, p-value = 3.254e-05
## X^2 df P(> X^2)
## Likelihood Ratio 18.612 1 1.6019e-05
## Pearson 17.456 1 2.9406e-05
##
## Phi-Coefficient : 0.003
## Contingency Coeff.: 0.003
## Cramer's V : 0.003
## raw_Freq raw_freq_not_MAKE
## Ref_BNC_Spoken2014 18253 11268256
## TxB_spoken 699 510255
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: freq_spoken
## X-squared = 18.774, df = 1, p-value = 1.472e-05
## X^2 df P(> X^2)
## Likelihood Ratio 19.894 1 8.1848e-06
## Pearson 18.929 1 1.3570e-05
##
## Phi-Coefficient : 0.001
## Contingency Coeff.: 0.001
## Cramer's V : 0.001
##
## Fisher's Exact Test for Count Data
##
## data: freq_spoken
## p-value = 9.4e-06
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 1.096321 1.277161
## sample estimates:
## odds ratio
## 1.182465
High-frequency verbs such as MAKE are not only highly polysemous, they also enter into many different lexico-grammatical constructions. This section explores the quantitative differences in the proportional use of these different meanings and constructions in two major Textbook English registers as compared to similar registers in naturally-occurring ENL speech and writing.
# Semantics of MAKE in Textbook English #
semantics <- read.csv(here("Make_semantics.csv"), sep = ",", header = TRUE, stringsAsFactors = TRUE)
semantics[9, 5] = 8 # Correction to dataset (one concordance line annotated as a PV was in fact MAKE as a noun and has therefore been deleted from the PV counts)
head(semantics)
## Category RefYouthFiction_raw TextbookNarrative_raw SpokenBNC2014_raw
## 1 Ai: produce 87 94 181
## 2 Aii: delexical 106 81 145
## 3 B: earn 5 14 24
## 4 C: achieve 13 26 20
## 5 D: be/become 4 2 6
## 6 E: cause 118 125 215
## TextbookConversation_raw
## 1 194
## 2 168
## 3 24
## 4 23
## 5 5
## 6 172
# Chi-squared tests #
# The chisq.test() function wants just a matrix of counts. So we need to select
# the columns that contain the relevant counts.
cols = c("SpokenBNC2014_raw", "TextbookConversation_raw")
semantics_test = chisq.test(semantics[, cols])
print(semantics_test)
##
## Pearson's Chi-squared test
##
## data: semantics[, cols]
## X-squared = 24.81, df = 8, p-value = 0.001674
# The test asks simply whether the distributions over categories are the same for
# the two corpora. And a low enough p-value is reason to doubt this hypothesis
# (loosely speaking).
# To test just one of the categories with the Chi-squared test we need to
# collapse the others. For example if we want to test the delexical category we
# need a category 'other' for all the rest. We can achieve this by binding the
# relevant row to the column sums of the other rows. Then we can test this 2x2
# matrix of counts. Since we have a 2x2 matrix we can use Fisher's exact test
# instead. We can do this for every category in a loop, and store the resulting
# p-values from the tests, along with a star symbol for the low p-values.
results = data.frame(Category = semantics$Category, p = 1)
for (i in 1:nrow(results)) {
category = semantics$Category[i]
row = semantics[semantics$Category == category, cols]
other_rows = semantics[semantics$Category != category, cols]
collapsed = rbind(row, colSums(other_rows))
category_test = fisher.test(collapsed)
results$p[i] = category_test$p.value
results$OddsRatio[i] = category_test$estimate
results$conf.int.lower[i] = category_test$conf.int[1]
results$conf.int.upper[i] = category_test$conf.int[2]
}
results
## Category p OddsRatio conf.int.lower conf.int.upper
## 1 Ai: produce 0.429831680 0.9065618 0.7089985 1.1588489
## 2 Aii: delexical 0.138265294 0.8240593 0.6340915 1.0700880
## 3 B: earn 1.000000000 0.9984627 0.5368422 1.8569966
## 4 C: achieve 0.646116946 0.8643411 0.4455196 1.6637706
## 5 D: be/become 1.000000000 1.1998363 0.3034438 4.9952869
## 6 E: cause 0.011400491 1.3640483 1.0688700 1.7424388
## 7 F: ensure 0.898208363 1.0661577 0.6246510 1.8237921
## 8 I: idiomatic 0.003604951 0.4810266 0.2781997 0.8120519
## 9 P: phrasal verbs 0.002600575 3.3327110 1.4518204 8.5812583
# The 'BH' or 'BY' methods are probably the most appropriate here if p-value
# correction is desired (controversial topic).
results$p = p.adjust(results$p, method = "none")
results$Star <- stars.pval(results$p)
# Check the outcome.
results
## Category p OddsRatio conf.int.lower conf.int.upper Star
## 1 Ai: produce 0.429831680 0.9065618 0.7089985 1.1588489
## 2 Aii: delexical 0.138265294 0.8240593 0.6340915 1.0700880
## 3 B: earn 1.000000000 0.9984627 0.5368422 1.8569966
## 4 C: achieve 0.646116946 0.8643411 0.4455196 1.6637706
## 5 D: be/become 1.000000000 1.1998363 0.3034438 4.9952869
## 6 E: cause 0.011400491 1.3640483 1.0688700 1.7424388 *
## 7 F: ensure 0.898208363 1.0661577 0.6246510 1.8237921
## 8 I: idiomatic 0.003604951 0.4810266 0.2781997 0.8120519 **
## 9 P: phrasal verbs 0.002600575 3.3327110 1.4518204 8.5812583 **
### Compute percentages for distributions of meanings ###
semantics = semantics %>% mutate(RefYouthFiction = (RefYouthFiction_raw/(sum(RefYouthFiction_raw))) *
100) %>% mutate(TextbookNarrative = (TextbookNarrative_raw/(sum(TextbookNarrative_raw))) *
100) %>% mutate(SpokenBNC2014 = (SpokenBNC2014_raw/(sum(SpokenBNC2014_raw))) *
100) %>% mutate(TextbookConversation = (TextbookConversation_raw/(sum(TextbookConversation_raw)) *
100))
Figure @ref(fig:semantics-spoken) visualises the different proportions of meanings of MAKE in Textbook Conversation and in a sample of the Spoken BNC 2014. Statistical differences between the proportions of each category were tested using Fisher’s Exact Test for Count Data; the resulting p-values are plotted in Figures @ref(fig:semantics-spoken) and @ref(fig:semantics-nar) using the following convention: 0 - 0.001 = ***, 0.001 - 0.01 = **, 0.01 - 0.05 = *, 0.05 - 0.1 = ˘.
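For reference, this star convention corresponds to the output of gtools::stars.pval(), which uses "." for the 0.05 - 0.1 band; the dot is replaced by the higher symbol "˘" for the plots. A minimal illustration (the p-values below are made up purely for demonstration purposes):

# gtools was loaded above via require(gtools)
p_demo <- c(0.0005, 0.005, 0.03, 0.07, 0.5)  # hypothetical p-values
s <- stars.pval(p_demo)                      # "***" "**"  "*"   "."   " "
replace(s, s == ".", "˘")                    # same replacement as applied for the figures below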
## Start with TxB spoken vs. BNCspoken ###
spoken_sem = semantics %>% select(Category, SpokenBNC2014, TextbookConversation)
# Lollipop graph:
# https://www.r-graph-gallery.com/303-lollipop-plot-with-2-values/
beige <- col2rgb("#FC8D59", alpha = TRUE)
bordeau <- col2rgb("#990000", alpha = TRUE)
# svg(filename=here('MAKE_semantics_Spoken.svg', width=12, height=6,
# pointsize=12)
fig = ggplot(spoken_sem) + geom_segment(aes(x = fct_rev(Category), xend = Category,
y = SpokenBNC2014, yend = TextbookConversation), color = "black") + geom_point(aes(x = Category,
y = SpokenBNC2014), color = rgb(252, 141, 89, max = 255, alpha = 220), size = 4) +
geom_point(aes(x = Category, y = TextbookConversation), color = rgb(153, 0, 0,
max = 255, alpha = 180), size = 4) + coord_flip() + theme_light() + theme(panel.border = element_blank(),
axis.text = element_text(size = 15, colour = "black"), axis.title.x = element_text(size = 15,
face = "bold")) + xlab("") + ylab("% of occurrences of MAKE") + theme(plot.margin = margin(20,
10, 10, 10))
# + labs(title = 'Distribution of MAKE meanings in Textbook Conversation and the
# Spoken BNC 2014')
### Adding legend directly onto plot ###
### https://stackoverflow.com/questions/40011005/adding-a-legend-to-a-dumbbell-chart-in-r
### #
fig1 = fig + geom_text(data = data.frame(), aes(x = "Ai: produce", y = 29, label = "Textbook Conversation"),
color = rgb(153, 0, 0, max = 255, alpha = 255), hjust = 1, size = 5, fontface = "bold",
nudge_y = 11, nudge_x = 0.2) + geom_text(data = data.frame(), aes(x = 9, y = 26,
label = "Spoken BNC 2014 sample"), color = rgb(252, 141, 89, max = 255, alpha = 255),
hjust = 0, size = 5, fontface = "bold", nudge_y = -11, nudge_x = 0.2)
# With the addition of an = assignment the plot above is now stored and can be
# added to. We have some relevant data in our 'results' dataframe. The Star
# column says what to display. And the Category column says for what category.
# But ggplot also needs to know how far along the y axis to put each star. For
# this, we need a y column. To place the stars half way along the segment, we
# need the mean of the two point values. We can get this from the plot data
# frame created above by calculating the row means for the two relevant columns.
# This is slightly complicated by the fact that the column names no longer have
# '_raw' appended.
columns = c("SpokenBNC2014", "TextbookConversation")
results$y = rowMeans(spoken_sem[, columns])
# Now we can add a text annotation layer to the plot. We map the Star column to
# the text itself, we map Category to the x axis as in the original plot, and we
# map the halfway points calculated above to the y axis. We specify that the
# data for this layer are coming from another data frame. And we adjust the size
# of the text.
fig2 = fig1 + geom_text(aes(x = Category, y = y, label = Star), data = results, size = 8)
print(fig2)
Distribution of MAKE meanings in Textbook Conversation and the Spoken BNC 2014
dev.off()
## null device
## 1
# ggsave(here('semantics_spoken.png'), width = 24, height = 12, units = 'cm', dpi
# = 300)
Causative uses of MAKE represent the most frequent semantic category in the Spoken BNC 2014 sample. The proportion of causative uses in Textbook Conversation, however, is significantly lower. The other two most frequent categories, the produce sense and delexical uses, are similarly distributed across the two conversation corpora. The significantly higher proportion of idioms featuring MAKE in Textbook Conversation is primarily due to the very high frequency of the idiom to make friends (see Table @ref(tab:idioms_spoken)).
idioms_spoken <- read.csv(here("MAKE_idioms_freq_spoken.csv"), sep = ",", header = TRUE)
stargazer::stargazer(idioms_spoken, type = "html", title = "Idioms with MAKE in Textbook Conversation and a sample of the Spoken BNC 2014")
##
## <table style="text-align:center"><caption><strong>Idioms with MAKE in Textbook Conversation and a sample of the Spoken BNC 2014</strong></caption>
## <tr><td colspan="8" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left">Statistic</td><td>N</td><td>Mean</td><td>St. Dev.</td><td>Min</td><td>Pctl(25)</td><td>Pctl(75)</td><td>Max</td></tr>
## <tr><td colspan="8" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left">Raw.Freq..in.Textbook.Conversation</td><td>6</td><td>7.333</td><td>8.687</td><td>3</td><td>3.2</td><td>4.8</td><td>25</td></tr>
## <tr><td style="text-align:left">Raw.Freq..in.Spoken.BNC2014.Sample</td><td>6</td><td>2.500</td><td>1.871</td><td>0</td><td>1.2</td><td>3.8</td><td>5</td></tr>
## <tr><td colspan="8" style="border-bottom: 1px solid black"></td></tr></table>
Idiomatic phrase | Raw Freq. in Textbook Conversation | Raw Freq. in Spoken BNC2014 Sample |
---|---|---|
MAKE friends | 25 | 5 |
MAKE the best/most of sth. | 4 | 2 |
MAKE faces | 4 | 4 |
MAKE fun of sth./sb. | 3 | 1 |
Practice makes perfect | 3 | 0 |
MAKE up .* mind | 5 | 3 |
Both Textbook Conversation and Textbook Fiction feature significantly fewer phrasal verbs with MAKE than their corresponding ENL reference corpora (cf. Fig. @ref(fig:semantics-spoken) and @ref(fig:semantics-nar)). This pedagogically relevant finding will be explored further in Section *. Furthermore, Textbook Fiction also features considerably fewer delexical uses of MAKE. This finding will be elaborated on in Section *.
#### Textbook Narrative vs. Youth Fiction ###
columns = c("RefYouthFiction_raw", "TextbookNarrative_raw")
semantics_nar_test = chisq.test(semantics[, columns])
print(semantics_nar_test)
##
## Pearson's Chi-squared test
##
## data: semantics[, columns]
## X-squared = 20.14, df = 8, p-value = 0.009818
results = data.frame(Category = semantics$Category, p = 1)
for (i in 1:nrow(results)) {
category = semantics$Category[i]
row = semantics[semantics$Category == category, columns]
other_rows = semantics[semantics$Category != category, columns]
collapsed = rbind(row, colSums(other_rows))
category_test = fisher.test(collapsed)
results$p[i] = category_test$p.value
results$OddsRatio[i] = category_test$estimate
results$conf.int.lower[i] = category_test$conf.int[1]
results$conf.int.upper[i] = category_test$conf.int[2]
}
results
## Category p OddsRatio conf.int.lower conf.int.upper
## 1 Ai: produce 0.61116111 0.9044069 0.63930842 1.2784425
## 2 Aii: delexical 0.04412899 1.4223754 1.00890050 2.0104157
## 3 B: earn 0.06040835 0.3492676 0.09748602 1.0389145
## 4 C: achieve 0.04742970 0.4832880 0.22436660 0.9926373
## 5 D: be/become 0.68629807 2.0085743 0.28598225 22.3251938
## 6 E: cause 0.64316182 0.9199817 0.67166232 1.2596311
## 7 F: ensure 0.44968236 0.6969169 0.29926374 1.5726868
## 8 I: idiomatic 0.86818253 0.8953140 0.43878014 1.8145862
## 9 P: phrasal verbs 0.01640735 2.3266949 1.15064154 4.9560309
results$p = p.adjust(results$p, method = "none")
# require(gtools)
results$Star <- stars.pval(results$p)
results = results %>% mutate(Star = replace(Star, Star == ".", "˘")) # Change dot to a higher symbol or: º˘
# Check the outcome.
results
## Category p OddsRatio conf.int.lower conf.int.upper Star
## 1 Ai: produce 0.61116111 0.9044069 0.63930842 1.2784425
## 2 Aii: delexical 0.04412899 1.4223754 1.00890050 2.0104157 *
## 3 B: earn 0.06040835 0.3492676 0.09748602 1.0389145 ˘
## 4 C: achieve 0.04742970 0.4832880 0.22436660 0.9926373 *
## 5 D: be/become 0.68629807 2.0085743 0.28598225 22.3251938
## 6 E: cause 0.64316182 0.9199817 0.67166232 1.2596311
## 7 F: ensure 0.44968236 0.6969169 0.29926374 1.5726868
## 8 I: idiomatic 0.86818253 0.8953140 0.43878014 1.8145862
## 9 P: phrasal verbs 0.01640735 2.3266949 1.15064154 4.9560309 *
## Textbook narrative vs. Youth Fiction ###
nar_sem = semantics %>% select(Category, RefYouthFiction, TextbookNarrative)
# svg(filename=here('MAKE_semantics_Fiction.svg'), width=12, height=6,
# pointsize=12)
fig = ggplot(nar_sem) + geom_segment(aes(x = fct_rev(Category), xend = Category,
y = RefYouthFiction, yend = TextbookNarrative), color = "darkgrey") + geom_point(aes(x = Category,
y = RefYouthFiction), color = rgb(252, 141, 89, max = 255, alpha = 220), size = 4) +
geom_point(aes(x = Category, y = TextbookNarrative), color = rgb(153, 0, 0, max = 255,
alpha = 180), size = 4) + coord_flip() + theme_light() + theme(panel.border = element_blank(),
axis.text = element_text(size = 14, colour = "black"), axis.title.x = element_text(size = 14,
face = "bold")) + xlab("") + ylab("% of occurrences of MAKE") + theme(plot.margin = margin(10,
10, 10, 10))
# + labs(title = 'Distribution of MAKE meanings in Textbook Fiction and Reference
# Youth Fiction')
fig1 = fig + geom_text(data = data.frame(), aes(x = "Ai: produce", y = 26, label = "Textbook Fiction"),
color = rgb(153, 0, 0, max = 255, alpha = 255), hjust = 1, size = 5, fontface = "bold",
nudge_y = 6, nudge_x = 0.2) + geom_text(data = data.frame(), aes(x = 9, y = 20,
label = "Youth Fiction Corpus sample"), color = rgb(252, 141, 89, max = 255,
alpha = 220), hjust = 0, size = 5, fontface = "bold", nudge_y = -10, nudge_x = 0.2)
cols = c("RefYouthFiction", "TextbookNarrative")
results$y = rowMeans(nar_sem[, cols])
fig2 = fig1 + geom_text(aes(x = Category, y = y, label = Star), data = results, size = 8)
print(fig2)
Distribution of MAKE meanings in Textbook Fiction and Reference Youth Fiction
dev.off()
## null device
## 1
# ggsave(here('semantics_nar.png', width = 24, height = 12, units = 'cm', dpi =
# 300)
Quantitatively, the semantic category “C: achieve” is proportionally more frequent in Textbook Fiction than in the Youth Fiction Corpus sample. This is primarily because the category is largely dominated by the phrase to make it, which is featured 17 times in Textbook Fiction, as opposed to three times in the Youth Fiction sample. By contrast, a qualitative analysis of the corresponding concordance lines reveals that, in the Youth Fiction sample, the category covers a much broader range of phrases involving movement and/or a sense of achievement. These typical, yet syntactically relatively complex, phrases of narrative writing, usually realised with one or more prepositions (see the examples below), are conspicuously absent from Textbook Fiction.
(###) Then he made his way down the stairs and into the locker room. <Youth Fiction, file 183: Pratchett, Night Watch, 2002>
(###) […] Jack came out of the barn and made straight for me. <Youth Fiction, file 191: Delaney, The Spook’s Apprentice, 2004>.
(###) She and the bird started to make off towards my ship. <Youth Fiction, file 140: Adams, Mostly Harmless, 1992>
##
## Pearson's Chi-squared test
##
## data: spoken_sem[, cols]
## X-squared = 31.847, df = 10, p-value = 0.0004247
## Semantics p OddsRatio conf.int.lower
## 1 Sports, Entertainment & Travel 0.0001106101 0.2895076 0.13684543
## 2 Food 0.0392139156 1.5859533 1.01216142
## 3 Drink 0.0118899021 3.6564387 1.24501982
## 4 Clothing & Accessories 0.0151945924 0.2966682 0.08373583
## 5 Arts & Crafts 0.2216724760 1.7825954 0.73661879
## 6 World, Life & Body 0.3967783920 0.6736406 0.25017318
## 7 Industry & Technology 0.4591638554 1.5611497 0.52296327
## 8 General & abstract terms 0.6031684435 0.7056359 0.20223584
## 9 Materials & Equipment 0.7366388507 1.1443201 0.55721579
## 10 Language & Communication 1.0000000000 0.8907527 0.21104313
## 11 Buildings & House 1.0000000000 1.0749109 0.31465306
## conf.int.upper
## 1 0.5765733
## 2 2.4938716
## 3 13.0493546
## 4 0.8615393
## 5 4.5261472
## 6 1.7217635
## 7 4.9478222
## 8 2.2731796
## 9 2.3573155
## 10 3.5727125
## 11 3.6722592
## null device
## 1
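The chi-squared test and the per-category results printed above compare the USAS semantic fields of the collocates of MAKE in its produce/create sense (Ai) in Textbook Conversation and the Spoken BNC 2014 sample. The code that generated this output is not echoed here; a minimal sketch of the presumed procedure, mirroring the collapse-and-test loop used for the meaning categories above (the data frame spoken_sem and its raw-count column names are assumptions):

# Hedged sketch: overall and per-category tests on raw collocate counts per USAS field
cols <- c("SpokenBNC2014_raw", "TextbookConversation_raw")  # hypothetical column names
chisq.test(spoken_sem[, cols])  # overall comparison of the two distributions
results <- data.frame(Semantics = spoken_sem$Semantics, p = 1)
for (i in 1:nrow(results)) {
  category <- spoken_sem$Semantics[i]
  row <- spoken_sem[spoken_sem$Semantics == category, cols]
  other_rows <- spoken_sem[spoken_sem$Semantics != category, cols]
  collapsed <- rbind(row, colSums(other_rows))  # 2x2 table: this semantic field vs. all others
  category_test <- fisher.test(collapsed)
  results$p[i] <- category_test$p.value
  results$OddsRatio[i] <- category_test$estimate
  results$conf.int.lower[i] <- category_test$conf.int[1]
  results$conf.int.upper[i] <- category_test$conf.int[2]
}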
#### Chatterplot: inspired by https://towardsdatascience.com/rip-wordclouds-long-live-chatterplots-e76a76896098
produce <- read.csv(file = here("Produce.csv"), sep = "\t", stringsAsFactors = TRUE)
summary(produce$USAS)
## Arts & Crafts Buildings & House
## 36 31
## Clothing & Accessories Drink
## 32 36
## Food General & abstract terms
## 160 29
## Industry & Technology Language & Communication
## 20 26
## Materials & Equipment Sports, Entertainment & Travel
## 64 74
## World, Life & Body
## 44
str(produce)
## 'data.frame': 552 obs. of 10 variables:
## $ Subcorpus : Factor w/ 4 levels "BNCspoken","spoken",..: 1 4 1 1 1 1 1 1 1 2 ...
## $ File_level : Factor w/ 230 levels "A","B","C","D",..: 141 48 77 172 158 228 125 150 134 2 ...
## $ File_series : Factor w/ 10 levels "","Access","Achievers",..: 1 1 1 1 1 1 1 1 1 7 ...
## $ VDE : Factor w/ 1 level "produce": 1 1 1 1 1 1 1 1 1 1 ...
## $ Collocate : Factor w/ 341 levels "[food]","album",..: 6 15 25 48 49 50 50 73 74 91 ...
## $ Collocate_lemma: Factor w/ 314 levels "[food]","album",..: 6 14 24 45 46 46 46 70 71 87 ...
## $ USAS : Factor w/ 11 levels "Arts & Crafts",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Concordance1 : logi NA NA NA NA NA NA ...
## $ Concordance2 : Factor w/ 6 levels "made","Made",..: 6 3 1 1 1 6 1 3 6 3 ...
## $ Concordance3 : logi NA NA NA NA NA NA ...
spoken_produce <- produce %>% filter(Subcorpus=="BNCspoken" | Subcorpus =="spoken")
spoken_produce$Subcorpus <- droplevels(spoken_produce$Subcorpus)
table_spoken_produce <- table(spoken_produce$Collocate_lemma, spoken_produce$Subcorpus)
head(table_spoken_produce)
##
## BNCspoken spoken
## [food] 6 0
## album 1 0
## alcohol 0 0
## anything 1 1
## army 1 0
## art 1 0
comp <- round(prop.table(table(spoken_produce$Collocate_lemma, spoken_produce$Subcorpus), 2), 4)*100
comp <- as.data.frame(unclass(comp))
comp$Collocate_lemma <- row.names(comp)
head(comp)
## BNCspoken spoken Collocate_lemma
## [food] 3.33 0.00 [food]
## album 0.56 0.00 album
## alcohol 0.00 0.00 alcohol
## anything 0.56 0.52 anything
## army 0.56 0.00 army
## art 0.56 0.00 art
comp <- inner_join(comp, spoken_produce[,6:7], by = "Collocate_lemma")
comp <- distinct(comp)
head(comp)
## BNCspoken spoken Collocate_lemma USAS
## 1 3.33 0.00 [food] Food
## 2 0.56 0.00 album Sports, Entertainment & Travel
## 3 0.56 0.52 anything Materials & Equipment
## 4 0.56 0.00 army Industry & Technology
## 5 0.56 0.00 art Arts & Crafts
## 6 0.56 0.00 ash Materials & Equipment
suffrager <- c(palettes_d$suffrager$classic[1], palettes_d$suffrager$CarolMan[3:4], palettes_d$suffrager$london, palettes_d$suffrager$oxon[c(1,5)])
levels(as.factor(comp$spoken))
## [1] "0" "0.52" "1.04" "1.55" "2.07" "2.59" "4.15" "4.66" "5.18"
## [10] "10.36"
## With minimal threshold of just min 2 occurrences
comp %>% filter(spoken > 0.57 | BNCspoken > 0.57) %>%
ggplot(aes(spoken, BNCspoken, label = Collocate_lemma)) +
# ggrepel geom, make arrows transparent, color by rank, size by n
geom_point(aes(colour=USAS)) +
geom_text_repel(min.segment.length = 0.3, segment.alpha = 0.2, force = 0.4,
aes(colour=USAS), show_guide = F) +
scale_colour_manual(values = suffrager,
name = "Semantic field") +
# set color gradient & customize legend
geom_abline(color = "gray40", lty = 2) +
# set word size range & turn off legend
labs(y = "% of MAKE 'produce' collocations in Spoken BNC2014 sample", x = "% of 'produce' MAKEs in Textbook Conversation") +
scale_y_log10(breaks=c(1,2,3,4)) +
scale_x_log10(breaks=c(1,2,4,6,8,10,12)) +
# minimal theme & customizations
theme_bw() +
theme(legend.position=c(0.86,0.35),
#legend.justification = c("right","top"),
panel.grid.major = element_line(colour = "whitesmoke"),
panel.grid.minor=element_blank(),
legend.background = element_rect(colour = 'darkgrey', fill = 'white', linetype='solid'))
#ggsave(here("Produce_MAKEs_Spoken_chatterplot.svg"), dpi = 300, width =22, height = 15, units = "cm")
## With higher threshold of min. 3 occurrences
comp %>% filter(spoken > 1.04 | BNCspoken > 1.11) %>%
ggplot(aes(spoken, BNCspoken, label = Collocate_lemma)) +
# ggrepel geom, make arrows transparent, color by rank, size by n
geom_point(aes(colour=USAS)) +
geom_text_repel(min.segment.length = 0.3, segment.alpha = 0.2, force = 0.4,
aes(colour=USAS), show_guide = F) +
scale_colour_manual(values = suffrager,
name = "Semantic field") +
# set color gradient & customize legend
geom_abline(color = "gray40", lty = 2) +
# set word size range & turn off legend
labs(y = "% of MAKE 'produce' collocations in Spoken BNC2014 sample", x = "% of 'produce' MAKEs in Textbook Conversation") +
scale_y_log10(breaks=c(1,2,3,4)) +
scale_x_log10(breaks=c(1,2,4,6,8,10,12)) +
# minimal theme & customizations
theme_bw() +
theme(legend.position=c(0.86,0.30),
#legend.justification = c("right","top"),
panel.grid.major = element_line(colour = "whitesmoke"),
panel.grid.minor=element_blank(),
legend.background = element_rect(colour = 'darkgrey', fill = 'white', linetype='solid'))
#ggsave(here("Produce_MAKEs_Spoken_chatterplot_min3.svg"), dpi = 300, width =23, height = 15, units = "cm")
### Wordclouds for collocates ###
produce <- read.csv(file = here("Produce.csv"), sep = "\t", stringsAsFactors = TRUE)
summary(produce$USAS)
## Arts & Crafts Buildings & House
## 36 31
## Clothing & Accessories Drink
## 32 36
## Food General & abstract terms
## 160 29
## Industry & Technology Language & Communication
## 20 26
## Materials & Equipment Sports, Entertainment & Travel
## 64 74
## World, Life & Body
## 44
### Now I would like one colour for each USAS category in the df 'produce' ##
### Copying this
### https://stackoverflow.com/questions/18902485/colored-categories-in-r-wordclouds
# The problem is that the wordcloud() function first summarises the words, so that
# each word only appears once and is matched to a frequency. The result is that a
# colours vector taken from the original data set is too long.
# One way to solve this is to summarise the data frame ourselves first. With dplyr
# we can use group_by() to group the data frame by combinations of factor
# variables. Collocate_lemma and USAS are the relevant ones. We can then use
# summarize() to count the frequencies ourselves. The n() function counts
# frequencies (where n stands for 'number' of entries).
produce_freq <- produce %>% group_by(Collocate_lemma, USAS) %>% summarize(Frequency = n())
# Now we can apply the colour scheme:
colours1 <- RColorBrewer::brewer.pal(n = 8, name = "Set1")
colours2 <- RColorBrewer::brewer.pal(n = 10, name = "RdGy")
colours <- c(colours1, colours2[c(1, 9, 10)]) # We need a total of 12 colours for 12 categories
### Now a wordcloud for each subcorpus ###
produceBNC <- produce[produce$Subcorpus == "BNCspoken", ]
produceTxBspoken <- produce[produce$Subcorpus == "spoken", ]
produce_freq_BNC <- produceBNC %>% group_by(Collocate_lemma, USAS) %>% summarize(Frequency = n())
produce_freq_TxBspoken <- produceTxBspoken %>% group_by(Collocate_lemma, USAS) %>%
summarize(Frequency = n())
anchor = 10
# png(filename = here('produce_BNC_categories.png'), width = 1300, height = 1100,
# res = 300)
# svg(filename=here('produce_BNC_collocates.svg'), width=9, height=6,
# pointsize=11)
par(mar = rep(0, 4))
produce_freq_BNC$colors <- colours[match(produce_freq_BNC$USAS, levels(produce_freq_BNC$USAS))]
wordcloud(produce_freq_BNC$Collocate_lemma, produce_freq_BNC$Frequency, min.freq = 2,
rot.per = 0, colors = produce_freq_BNC$colors, ordered.colors = TRUE, random.order = FALSE,
scale = c(10 * max(table(produce_freq_BNC$Collocate_lemma))/anchor, 0.5))
Main collocates of MAKE in the produce/create sense (Ai) in the Spoken BNC2014 sample
dev.off()
## null device
## 1
# png(filename = here('produce_TxBspoken_categories.png'), width = 1300, height =
# 1100, res = 300) svg(filename=here('produce_TxBSpoken_collocates.svg'),
# width=9, height=6, pointsize=11)
par(mar = rep(0, 4))
produce_freq_TxBspoken$colors <- colours[match(produce_freq_TxBspoken$USAS, levels(produce_freq_TxBspoken$USAS))]
wordcloud(produce_freq_TxBspoken$Collocate_lemma, produce_freq_TxBspoken$Frequency,
min.freq = 2, rot.per = 0, colors = produce_freq_TxBspoken$colors, ordered.colors = TRUE,
random.order = FALSE, scale = c(40 * max(table(produce_freq_BNC$Collocate_lemma))/anchor,
0.5))
Main collocates of MAKE in the produce/create sense (Ai) in Textbook Conversation
dev.off()
## null device
## 1
produce_sem <- read.csv(file = here("Produce_semantics.csv"), sep = "\t", stringsAsFactors = TRUE) # This data set was produced in the MAKE_produce_wordcloud_lollipop.R script
### Now onto a comparison of semantic categories between TxB narrative and Youth Fiction ###
###
# Reorder data by size of difference in frequency #
nar_sem = produce_sem %>%
rowwise() %>% mutate(dif = sqrt((Youth_FictionPer - TxBnarPer)^2)) %>%
arrange(desc(dif)) %>%
mutate(Semantics=factor(Semantics, Semantics))
nar_sem$Semantics <- factor(nar_sem$Semantics, levels = nar_sem$Semantics[order(nar_sem$dif)])
### Chi-squared tests ##
cols = c("Youth_Fiction", "TxBnar")
semantics_test = chisq.test(nar_sem[,cols])
# Comparing the two distributions #
print(semantics_test)
##
## Pearson's Chi-squared test
##
## data: nar_sem[, cols]
## X-squared = 22.119, df = 10, p-value = 0.01451
# X-squared = 22.119, df = 10, p-value = 0.01451
# Comparing proportions for each semantic category #
results = data.frame(Semantics=nar_sem$Semantics, p=1)
for(i in 1:nrow(results)){
category = nar_sem$Semantics[i]
row = nar_sem[nar_sem$Semantics==category,cols]
other_rows = nar_sem[nar_sem$Semantics!=category,cols]
collapsed = rbind(row, colSums(other_rows))
category_test = fisher.test(collapsed)
results$p[i] = category_test$p.value
results$OddsRatio[i] = category_test$estimate
results$conf.int.lower[i] = category_test$conf.int[1]
results$conf.int.upper[i] = category_test$conf.int[2]
}
results$p = p.adjust(results$p, method='none')
## We can add the stars for low p-values using ifelse(). ##
#criterion = 0.05
#results$Star = ifelse(results$p < criterion, '**', '')
## But this package makes it much quicker! ##
results$Star <- stars.pval(results$p)
results = results %>%
mutate(Star = replace(Star, Star==".", "˘")) # Change dot to a higher symbol such as º or ˘
# Check the outcome.
print(results)
## Semantics p OddsRatio conf.int.lower
## 1 Buildings & House 0.003963951 5.7818229 1.531358331
## 2 Food 0.066470807 0.4286075 0.160844525
## 3 Sports, Entertainment & Travel 0.100465585 0.4251720 0.127245834
## 4 World, Life & Body 0.245055350 1.8853900 0.680214692
## 5 General & abstract terms 0.166963624 0.4068191 0.089478772
## 6 Language & Communication 0.286462965 0.5142079 0.131945680
## 7 Clothing & Accessories 0.198985318 2.6442894 0.579540410
## 8 Arts & Crafts 0.524096612 1.6639939 0.378966782
## 9 Materials & Equipment 0.673762817 1.2008138 0.471543246
## 10 Drink 0.789240631 1.2584397 0.379082631
## 11 Industry & Technology 1.000000000 0.5370938 0.008975184
## conf.int.upper Star
## 1 32.571583 **
## 2 1.061591 ˘
## 3 1.250977
## 4 5.559175
## 5 1.481647
## 6 1.738404
## 7 16.377216
## 8 8.315215
## 9 3.082114
## 10 4.282852
## 11 10.492001
## Lollipop chart ##
#svg(filename=here("MAKE_produce_Fiction.svg"), width=14, height=7, pointsize=12)
fig = ggplot(nar_sem) +
geom_segment(aes(x=Semantics, xend=Semantics, y=Youth_FictionPer, yend=TxBnarPer), color="black") +
geom_point(aes(x=Semantics, y=Youth_FictionPer), color=rgb(153,0,0, max = 255, alpha = 255), size=5) +
geom_point(aes(x=Semantics, y=TxBnarPer), color=rgb(252, 141, 89, max = 255, alpha = 220), size=5) +
coord_flip() +
theme_minimal() +
theme(
panel.border = element_blank(),
axis.text=element_text(size=16, colour = "black"),
axis.title.x=element_text(size=16, face = "bold")
) +
scale_x_discrete(limits=c(levels(nar_sem$Semantics),"")) +
xlab("") +
ylab("% of MAKE collocations in the (Ai) produce sense") + # Adds an empty factor to give space for legend as recommended in: https://stackoverflow.com/questions/16788180/r-ggplot-extend-the-range-of-a-category-x-axis
theme(plot.margin=margin(10,10,10,10))
# + labs(title = "Differences in semantic fields of collocates of MAKE in the produce sense (Ai).")
fig1 = fig +
geom_text(data=data.frame(), aes(x="Buildings & House", y= 5, label="Textbook Fiction"),color=rgb(252, 141, 89, max = 255, alpha = 255), hjust=1, size=6, fontface = "bold", nudge_x = 0.7) +
geom_text(data=data.frame(), aes(x="Buildings & House", y= 14, label="Youth Fiction Sample"), color=rgb(153,0,0, max = 255, alpha = 255), hjust=0, size=6, fontface = "bold", nudge_x = 0.7)
## This is to help us place the stars in the middle of the bars ##
cols = c("Youth_FictionPer", "TxBnarPer")
results$y = rowMeans(nar_sem[,cols])
fig2 = fig1 +
geom_text(aes(x=Semantics, y=y, label=Star), data=results, size=9)
print(fig2)
Differences in semantic fields of collocates of MAKE in the produce sense (Ai).
dev.off()
## null device
## 1
#ggsave(here("produce_USAS_Fiction.png"), width = 24, height = 12, units = "cm", dpi = 300)
produce <- read.csv(file = here("Produce.csv"), sep = "\t", stringsAsFactors = TRUE)
fiction_produce <- produce %>% filter(Subcorpus=="Youth_Fiction" | Subcorpus =="TxBnar")
fiction_produce$Subcorpus <- droplevels(fiction_produce$Subcorpus)
table(fiction_produce$Collocate_lemma, fiction_produce$Subcorpus)
##
## TxBnar Youth_Fiction
## [food] 3 0
## album 0 0
## alcohol 1 0
## anything 0 0
## army 0 0
## art 0 0
## ash 0 0
## bacon roll 0 0
## bag 0 0
## bagel 0 0
## ball 0 1
## band 0 0
## banner 0 0
## basket 0 1
## beast 0 1
## belt 0 0
## bhaji 0 0
## bike 0 0
## biscuit 0 0
## blanket 0 0
## block 0 0
## blocks of ice 0 1
## boat 1 0
## bobble 0 0
## body 0 0
## bonfire 0 0
## book 0 1
## bookcase 0 0
## booze 0 0
## box 0 1
## bracelet 0 1
## bread 0 0
## breadcrumb 0 0
## breakfast 2 1
## brownies 0 0
## building 0 1
## bulgar 0 0
## burger 0 0
## burrow 0 0
## cake 1 0
## cake and biscuit 0 0
## calendar 0 0
## canal 0 0
## canapé 0 0
## candle 0 0
## card 0 0
## carpet 0 1
## CD 0 0
## ceiling 0 0
## chair 0 1
## character 0 0
## cheese 0 0
## children 0 0
## chili 1 0
## chips 0 0
## chocolate 0 0
## chowder 0 0
## chutney 0 0
## city 0 0
## claw 0 0
## clay 1 0
## cliff 0 0
## clip 0 1
## clock 0 0
## clothes 0 2
## clothing 0 0
## cocoa 1 0
## coffee 0 2
## coin 1 0
## collage 0 0
## colour 0 0
## content 0 0
## copy 1 0
## costume 0 0
## course 0 0
## covering 0 1
## crack 0 1
## cross 1 0
## crumble 0 0
## cup 0 1
## cup of cocoa 0 1
## cup of tea 1 0
## curry 1 0
## cut 1 0
## cyberpet 0 0
## database 0 0
## decoration 0 0
## den 0 0
## depression 1 0
## dinner 1 0
## documentary 1 0
## dog 0 0
## door 0 1
## drawing 0 0
## dress 0 1
## drink 0 0
## drive 0 0
## egg 0 1
## elderflower 0 0
## engine 0 0
## Eton Mess 0 0
## everyone 0 1
## factory 0 0
## fairytale 0 0
## fake 0 0
## farm 0 0
## filling 0 0
## film 7 0
## fingerprint 0 0
## fire 2 0
## folder 0 0
## food 1 1
## footprint 0 0
## foundation 1 0
## fragrance 0 1
## frame 0 0
## fruit 0 0
## fudge 0 0
## fuel 2 0
## furniture 1 0
## future 1 0
## gadget 0 0
## game 0 0
## Gamemaker 0 0
## generator 0 0
## ginger honey lemon 0 0
## gingerbread men 0 0
## glass 0 0
## globe 0 0
## god 0 1
## gold 0 1
## gravy 0 0
## group 1 0
## gruel 0 1
## guacamole 0 0
## gum 0 1
## Guys 0 0
## gypsy tart 0 0
## hair 1 0
## hallmark 0 0
## hat 0 0
## helmet 0 1
## history 0 0
## hole 4 1
## home 1 0
## hot chocolate 0 0
## house 0 0
## hula hoop 1 0
## ice cream 0 0
## icing 0 0
## ID 1 0
## instrument 0 0
## invention 1 0
## iPod 0 0
## island 0 1
## jacket 0 0
## jam 0 0
## joint 0 0
## jug of love 0 0
## junk [food] 0 1
## kimbap 0 0
## Kleenex 0 0
## lair 0 1
## lake 1 0
## lamp 0 1
## lasagne 0 0
## law 0 0
## leg 0 1
## Lego 0 0
## life 0 1
## light 0 1
## list 1 1
## lunch 3 0
## magazine 0 1
## mash 0 0
## mask 1 0
## matchwood 0 1
## matzah 0 0
## meal 0 0
## meat pattie 0 1
## merchandise 0 0
## mess 0 0
## metal 1 0
## milk 0 0
## milkshake 0 1
## mince meat 0 0
## mince pie 0 0
## mixtape 0 0
## moon 0 0
## mousse 0 0
## movement 0 1
## movie 0 0
## muffin 0 0
## music 2 0
## mutant 0 1
## necklace 1 0
## nest 0 0
## object 1 1
## omelet 0 1
## opening 1 0
## pancake 1 1
## pancakes and cupcake 0 0
## panorama 0 0
## parchment 0 1
## pasta 0 0
## pasty 0 0
## path 0 2
## pattern 0 0
## paving 0 1
## pen 0 1
## penny 0 0
## people 0 1
## phone 0 0
## picture 0 0
## pie 0 0
## pitta 0 0
## pizza 0 0
## plate 0 0
## play 0 1
## poetry 0 0
## poster 4 0
## pots of tea 0 1
## present 0 0
## printer 0 0
## product 0 0
## programme 0 0
## project 0 0
## protein 0 1
## puppet 0 0
## rat 1 0
## record 0 0
## recording 1 0
## report 2 0
## rhythm 0 0
## ring 0 2
## road 0 0
## roof 0 1
## room 0 3
## rope 1 0
## rule 3 1
## salad 0 0
## sandwich 3 1
## sauce 0 0
## schedule 0 0
## scone 0 0
## seat 0 0
## settlement 0 1
## shelter 0 1
## shit 0 0
## shoe 2 0
## shot 0 0
## sign 1 0
## snow 0 1
## snowflake 0 1
## snowman 0 0
## sock 0 0
## soil 0 0
## something 1 0
## song 0 1
## soup 1 0
## sour cream 0 0
## spaghetti 0 0
## specification 0 0
## spiral 0 1
## splot 0 1
## stain 0 0
## stall 0 0
## stamp 0 0
## stand 0 1
## steak and pancake 1 0
## steeple 0 1
## stereo 0 0
## story 1 0
## supper 0 0
## T-shirt 0 1
## tagine 0 0
## tank 0 0
## tea 4 1
## teeth 0 0
## term 0 1
## thing 0 1
## thread 0 0
## throw 1 0
## tile 0 0
## tissue 0 0
## toad-in-the-hole 0 0
## toast 1 0
## toastie 0 0
## tortilla 0 0
## track 0 1
## trap 0 1
## tree 1 0
## triangle 0 0
## trifle 0 0
## trough 0 0
## trousers 0 0
## TV series 1 0
## tyre 0 0
## uni 0 0
## video 1 0
## wall 0 1
## weapon 0 1
## web 2 0
## wedge 0 0
## well 0 0
## will 0 1
## window 0 0
## wine 0 1
## wonder 0 0
## wood 1 0
## word 0 0
## world 1 1
## wrap 0 0
## Yorkshire pudding 0 0
comp <- round(prop.table(table(fiction_produce$Collocate_lemma, fiction_produce$Subcorpus), 2), 4)*100
comp <- as.data.frame(unclass(comp))
comp$Collocate_lemma <- row.names(comp)
head(comp)
## TxBnar Youth_Fiction Collocate_lemma
## [food] 3.23 0 [food]
## album 0.00 0 album
## alcohol 1.08 0 alcohol
## anything 0.00 0 anything
## army 0.00 0 army
## art 0.00 0 art
comp <- inner_join(comp, fiction_produce[,6:7], by = "Collocate_lemma")
comp <- distinct(comp)
head(comp)
## TxBnar Youth_Fiction Collocate_lemma USAS
## 1 3.23 0.00 [food] Food
## 2 1.08 0.00 alcohol Drink
## 3 0.00 1.16 ball Materials & Equipment
## 4 0.00 1.16 basket Arts & Crafts
## 5 0.00 1.16 beast Arts & Crafts
## 6 0.00 1.16 blocks of ice Materials & Equipment
suffrager <- c(palettes_d$suffrager$CarolMan[2:4], palettes_d$suffrager$classic[1:2], palettes_d$suffrager$london, palettes_d$suffrager$oxon[c(1,5)])
levels(as.factor(comp$Youth_Fiction))
## [1] "0" "1.16" "2.33" "3.49"
## With minimal threshold of just min 2 occurrences
comp %>% filter(TxBnar > 1.09 | Youth_Fiction > 1.17) %>%
ggplot(aes(TxBnar, Youth_Fiction, label = Collocate_lemma)) +
# ggrepel geom, make arrows transparent, color by rank, size by n
geom_text_repel(segment.alpha = 0,
aes(colour=USAS)) +
scale_colour_manual(values = suffrager,
name = "Semantic field") +
# set color gradient & customize legend
geom_abline(color = "gray40", lty = 2) +
# set word size range & turn off legend
labs(y = "% of MAKE 'produce' collocations in Spoken BNC2014 sample", x = "% of MAKE 'produce' collocations in Textbook Conversation") +
scale_y_log10(breaks=c(1,1.5,2,2.5,3)) +
scale_x_log10(breaks=c(1,2,4,6,8)) +
# minimal theme & customizations
theme_bw() +
theme(legend.position=c(0.86,0.35),
#legend.justification = c("right","top"),
panel.grid.major = element_line(colour = "whitesmoke"),
panel.grid.minor=element_blank(),
legend.background = element_rect(colour = 'darkgrey', fill = 'white', linetype='solid'))
#ggsave(here("Produce_MAKEs_Fiction_chatterplot.svg"), dpi = 300, width =22, height = 15, units = "cm")
### Now a wordcloud for each subcorpus ###
produceTxBnar <- produce[produce$Subcorpus == "TxBnar", ]
produceYF <- produce[produce$Subcorpus == "Youth_Fiction", ]
produce_freq_YF <- produceYF %>% group_by(Collocate_lemma, USAS) %>% summarize(Frequency = n())
produce_freq_TxBnar <- produceTxBnar %>% group_by(Collocate_lemma, USAS) %>% summarize(Frequency = n())
anchor = 10
# png(filename = here('produce_YF_categories.png'), width = 1300, height = 1100,
# res = 300) svg(filename=here('produce_YF_collocates.svg'), width=9, height=6,
# pointsize=11)
produce_freq_YF$colors <- colours[match(produce_freq_YF$USAS, levels(produce_freq_YF$USAS))]
wordcloud(produce_freq_YF$Collocate_lemma, produce_freq_YF$Frequency, min.freq = 2,
rot.per = 0, colors = produce_freq_YF$colors, ordered.colors = TRUE, random.order = FALSE,
scale = c(10 * max(table(produce_freq_YF$Collocate_lemma))/8, 0.5))
Main collocates of MAKE in the produce/create sense (Ai) in the Youth Fiction sample
dev.off()
## null device
## 1
# png(filename = here('produce_TxBnar_categories.png'), width = 1300, height =
# 1100, res = 300) svg(filename=here('produce_TxBNar_collocates.svg'), width=9,
# height=6, pointsize=11)
produce_freq_TxBnar$colors <- colours[match(produce_freq_TxBnar$USAS, levels(produce_freq_TxBnar$USAS))]
wordcloud(produce_freq_TxBnar$Collocate_lemma, produce_freq_TxBnar$Frequency, min.freq = 2,
rot.per = 0, colors = produce_freq_TxBnar$colors, ordered.colors = TRUE, random.order = FALSE,
scale = c(20 * max(table(produce_freq_TxBnar$Collocate_lemma))/anchor, 0.5))
Main collocates of MAKE in the produce/create sense (Ai) in Textbook Fiction
dev.off()
## null device
## 1
### Wordclouds for collocates ###
cols <- RColorBrewer::brewer.pal(n = 9, name = "OrRd")
delexical <- read.csv(here("Delexical_MAKE2.csv"), sep = ",", stringsAsFactors = TRUE)
str(delexical)
## 'data.frame': 499 obs. of 7 variables:
## $ Register : Factor w/ 4 levels "BNC_Spoken","narrative",..: 4 1 1 4 3 2 4 1 2 3 ...
## $ Level : Factor w/ 6 levels "","A","B","C",..: 1 1 1 1 1 5 1 1 2 1 ...
## $ Series : Factor w/ 8 levels "","Access","Achievers",..: 1 1 1 1 1 7 1 1 3 1 ...
## $ Meaning : Factor w/ 2 levels "","delexical": 1 1 1 1 1 2 1 1 2 1 ...
## $ Collocate : Factor w/ 145 levels "a case","abortions",..: 1 2 3 4 5 6 6 7 7 7 ...
## $ Collocate_lemma: Factor w/ 124 levels "abortion","accent",..: 25 1 2 3 4 5 5 6 6 6 ...
## $ speech : int 1 0 0 1 0 1 1 1 1 1 ...
table(delexical$Collocate_lemma, delexical$Register)
##
## BNC_Spoken narrative TxB_spoken Youth_Fiction
## abortion 1 0 0 0
## accent 1 0 0 0
## accusation 0 0 0 1
## agreement 0 0 1 0
## amend 0 1 0 1
## announcement 1 1 1 0
## appeal 0 0 0 1
## appointment 1 0 4 0
## argument 1 0 0 0
## arrangement 2 1 0 1
## arrest 0 0 0 1
## assumption 1 0 0 0
## attempt 0 0 2 0
## available 0 0 1 0
## aware 1 0 0 0
## balls up 1 0 0 0
## bang 1 0 0 0
## bargain 0 0 0 1
## bed 0 0 5 1
## bet 0 1 0 0
## bid 0 0 1 1
## booking 1 0 1 0
## breakthrough 0 1 0 1
## call 0 1 7 4
## case 1 0 1 1
## change 2 2 8 0
## cheque 0 0 1 0
## choice 3 7 3 1
## clear 0 1 0 0
## comment 1 1 4 1
## commitment 0 0 1 0
## comparison 1 0 0 0
## complaint 1 0 0 1
## confession 0 0 0 1
## connect 1 0 0 0
## contact 3 2 1 0
## contrast 0 1 0 0
## contribution 0 0 2 1
## conversation 1 2 0 1
## correction 0 1 0 0
## creation 1 0 0 0
## curtsey 0 0 0 1
## cut 0 0 1 0
## dash 0 1 0 0
## deal 1 0 1 1
## decision 14 2 7 4
## difference 17 5 14 6
## discovery 0 1 0 3
## distinction 0 0 1 0
## economy 1 0 0 0
## effort 8 3 5 2
## entrance 0 0 0 1
## exception 0 1 0 0
## excuse 0 0 1 2
## feature 1 0 0 0
## fuss 1 0 1 2
## gain 1 0 0 0
## generalisation 1 0 0 0
## gesture 0 0 0 4
## illusion 1 0 0 0
## impression 1 0 2 1
## improvement 0 0 1 0
## issue 1 0 0 0
## joke 1 0 2 1
## journey 1 1 2 1
## killing 1 0 0 0
## leap 1 1 0 0
## list 1 0 3 0
## loss 1 0 0 0
## misery 0 0 1 0
## mistake 8 13 16 4
## move 3 0 1 2
## movement 1 0 0 1
## noise 8 8 13 12
## note 2 0 1 1
## observation 0 0 1 0
## offer 1 1 0 0
## pact 0 2 0 0
## pause 0 0 1 0
## payment 2 0 1 0
## peace 0 0 0 1
## phenomenon 1 0 0 0
## plan 1 1 4 1
## point 1 0 0 0
## prediction 0 0 1 0
## preparation 0 1 0 0
## progress 0 2 5 4
## promise 0 0 4 1
## proposal 0 0 1 0
## protest 0 0 1 0
## racket 0 0 0 2
## ready 0 0 0 1
## request 0 1 0 0
## reservation 0 0 0 1
## row 1 0 0 0
## sacrifice 1 0 0 0
## save 0 0 1 0
## saving 1 0 0 0
## secret 0 0 1 0
## sense 27 2 9 5
## shamble 0 0 0 1
## sighting 0 0 3 0
## sign 0 0 0 1
## signal 1 0 0 0
## silence 0 0 0 1
## small talk 0 0 0 1
## sound 1 6 8 6
## speech 0 2 1 1
## start 0 0 0 2
## statement 0 1 0 0
## stay 1 0 0 0
## stereotype 1 0 0 0
## stop 0 0 1 0
## suggestion 0 1 2 0
## sweep 0 0 0 1
## swipe 0 0 0 1
## tone 0 0 1 0
## transition 0 0 1 0
## trip 0 0 0 1
## trouble 0 1 1 2
## use 2 0 0 1
## visit 0 0 0 1
## walk 0 1 0 0
## wish 0 0 2 1
# https://stackoverflow.com/questions/37275220/wordclouds-with-absolute-word-sizes
# Trick to make the wordclouds comparable with a constant font size #
anchor <- 20 # Should be maximum frequency in any of the wordclouds to be compared
# Youth Fiction #
delexicalYF <- delexical[delexical$Register == "Youth_Fiction", ]
# png(filename = here('Delexical_YF_minFreq2.png'), width = 1200, height = 1000,
# res = 300) # Too low a quality? svg(filename=here('Delexical_YF_minFreq2.svg'),
# width=4, height=3, pointsize=11)
wordcloud::wordcloud(delexicalYF$Collocate_lemma, min.freq = 2, rot.per = 0, colors = cols[9],
scale = c(8 * max(table(delexicalYF$Collocate_lemma))/anchor, 0.5))
dev.off()
## null device
## 1
# Textbook narrative #
delexicalTxBnar <- delexical[delexical$Register == "narrative", ]
# png(filename = here('Delexical_TxBnar_minFreq2.png'), width = 1300, height =
# 1100, res = 300) svg(filename=here('Delexical_TxBnar_minFreq2.svg'), width=5,
# height=3, pointsize=11)
wordcloud(delexicalTxBnar$Collocate_lemma, min.freq = 2, rot.per = 0, colors = cols[6],
scale = c(6 * max(table(delexicalTxBnar$Collocate_lemma))/anchor, 0.5))
dev.off()
## null device
## 1
# BNC spoken #
delexicalBNC <- delexical[delexical$Register == "BNC_Spoken", ]
# png(filename = here('Delexical_BNCs_minFreq2.png'), width = 2000, height =
# 1800, res = 300) # Too low a quality?
# svg(filename=here('Delexical_BNCs_minFreq2.svg'), width=6, height=6,
# pointsize=11)
wordcloud(delexicalBNC$Collocate_lemma, min.freq = 2, rot.per = 0, colors = cols[9],
scale = c(8 * max(table(delexicalBNC$Collocate_lemma))/anchor, 0.5))
dev.off() # To save plot
## null device
## 1
# Textbook spoken #
delexicalTxBspoken <- delexical[delexical$Register == "TxB_spoken", ]
# png(filename = here('Delexical_TxBspoken_minFreq2.png'), width = 2000, height =
# 1800, res = 300) svg(filename=here('Delexical_TxBspoken_minFreq2.svg'),
# width=5, height=10, pointsize=11)
wordcloud(delexicalTxBspoken$Collocate_lemma, min.freq = 2, rot.per = 0, colors = cols[6],
scale = c(4 * max(table(delexicalTxBspoken$Collocate_lemma))/anchor, 0.5))
dev.off() # To save plot
## null device
## 1
##### For range of colours according to frequencies ##
wordcloud(delexicalTxBnar$Collocate_lemma, min.freq = 2, rot.per = 0, random.order = FALSE,
random.color = FALSE, colors = brewer.pal(4, "RdGy"))
##### https://stackoverflow.com/questions/18902485/colored-categories-in-r-wordclouds
colours <- c("red", "blue", "orange", "green")
delexical$color <- colours[match(delexical$Register, levels(delexical$Register))]
wordcloud(delexicalTxBnar$Collocate_lemma, min.freq = 2, rot.per = 0, random.order = FALSE,
random.color = FALSE, colors = as.character(delexical$color))
#### Chatterplot: inspired by
#### https://towardsdatascience.com/rip-wordclouds-long-live-chatterplots-e76a76896098
#### Used in MAKING tea and mistakes book chapter
delexical <- read.csv(here("Delexical_MAKE2.csv"), sep = ",", stringsAsFactors = TRUE)
spoken_delex <- delexical %>% filter(Register == "BNC_Spoken" | Register == "TxB_spoken")
spoken_delex$Register <- droplevels(spoken_delex$Register)
table(spoken_delex$Collocate_lemma, spoken_delex$Register)
##
## BNC_Spoken TxB_spoken
## abortion 1 0
## accent 1 0
## accusation 0 0
## agreement 0 1
## amend 0 0
## announcement 1 1
## appeal 0 0
## appointment 1 4
## argument 1 0
## arrangement 2 0
## arrest 0 0
## assumption 1 0
## attempt 0 2
## available 0 1
## aware 1 0
## balls up 1 0
## bang 1 0
## bargain 0 0
## bed 0 5
## bet 0 0
## bid 0 1
## booking 1 1
## breakthrough 0 0
## call 0 7
## case 1 1
## change 2 8
## cheque 0 1
## choice 3 3
## clear 0 0
## comment 1 4
## commitment 0 1
## comparison 1 0
## complaint 1 0
## confession 0 0
## connect 1 0
## contact 3 1
## contrast 0 0
## contribution 0 2
## conversation 1 0
## correction 0 0
## creation 1 0
## curtsey 0 0
## cut 0 1
## dash 0 0
## deal 1 1
## decision 14 7
## difference 17 14
## discovery 0 0
## distinction 0 1
## economy 1 0
## effort 8 5
## entrance 0 0
## exception 0 0
## excuse 0 1
## feature 1 0
## fuss 1 1
## gain 1 0
## generalisation 1 0
## gesture 0 0
## illusion 1 0
## impression 1 2
## improvement 0 1
## issue 1 0
## joke 1 2
## journey 1 2
## killing 1 0
## leap 1 0
## list 1 3
## loss 1 0
## misery 0 1
## mistake 8 16
## move 3 1
## movement 1 0
## noise 8 13
## note 2 1
## observation 0 1
## offer 1 0
## pact 0 0
## pause 0 1
## payment 2 1
## peace 0 0
## phenomenon 1 0
## plan 1 4
## point 1 0
## prediction 0 1
## preparation 0 0
## progress 0 5
## promise 0 4
## proposal 0 1
## protest 0 1
## racket 0 0
## ready 0 0
## request 0 0
## reservation 0 0
## row 1 0
## sacrifice 1 0
## save 0 1
## saving 1 0
## secret 0 1
## sense 27 9
## shamble 0 0
## sighting 0 3
## sign 0 0
## signal 1 0
## silence 0 0
## small talk 0 0
## sound 1 8
## speech 0 1
## start 0 0
## statement 0 0
## stay 1 0
## stereotype 1 0
## stop 0 1
## suggestion 0 2
## sweep 0 0
## swipe 0 0
## tone 0 1
## transition 0 1
## trip 0 0
## trouble 0 1
## use 2 0
## visit 0 0
## walk 0 0
## wish 0 2
comp <- round(prop.table(table(spoken_delex$Collocate_lemma, spoken_delex$Register),
2), 4) * 100
comp <- as.data.frame(unclass(comp))
comp$Collocate <- row.names(comp)
head(comp)
## BNC_Spoken TxB_spoken Collocate
## abortion 0.69 0.0 abortion
## accent 0.69 0.0 accent
## accusation 0.00 0.0 accusation
## agreement 0.00 0.6 agreement
## amend 0.00 0.0 amend
## announcement 0.69 0.6 announcement
levels(as.factor(comp$TxB_spoken))
## [1] "0" "0.6" "1.2" "1.8" "2.4" "2.99" "4.19" "4.79" "5.39" "7.78"
## [11] "8.38" "9.58"
comp %>% filter(TxB_spoken > 0.7 | BNC_Spoken > 0.7) %>% ggplot(aes(TxB_spoken, BNC_Spoken,
label = Collocate)) + # ggrepel geom, make arrows transparent
geom_point(color = "darkred") + geom_text_repel(min.segment.length = 0.3, segment.alpha = 0.4,
force = 0.4) + geom_abline(color = "gray40", lty = 2) + labs(y = "% of delexical MAKEs in Spoken BNC2014 sample",
x = "% of delexical MAKEs in the TEC-Conv") + scale_y_log10(breaks = c(1, 2,
4, 6, 8, 10, 12, 14, 16, 18)) + scale_x_log10(breaks = c(1, 2, 4, 6, 8, 10, 12,
14, 16)) + # minimal theme & customizations
theme_bw() + theme(panel.grid.major = element_line(colour = "whitesmoke"), panel.grid.minor = element_blank())
# ggsave(here('Delex_MAKE_spoken_chatterplot.svg'), dpi = 300, width =17, height
# = 18, units = 'cm')
# Good dimensions to save as SVG = 727 x 587
dev.off()
## null device
## 1
#### Chatterplot: inspired by
#### https://towardsdatascience.com/rip-wordclouds-long-live-chatterplots-e76a76896098
#### Used in MAKING tea and mistakes book chapter
delexical <- read.csv(here("Delexical_MAKE2.csv"), sep = ",", stringsAsFactors = TRUE)
fiction_delex <- delexical %>% filter(Register == "Youth_Fiction" | Register == "narrative")
fiction_delex$Register <- droplevels(fiction_delex$Register)
table(fiction_delex$Collocate_lemma, fiction_delex$Register)
##
## narrative Youth_Fiction
## abortion 0 0
## accent 0 0
## accusation 0 1
## agreement 0 0
## amend 1 1
## announcement 1 0
## appeal 0 1
## appointment 0 0
## argument 0 0
## arrangement 1 1
## arrest 0 1
## assumption 0 0
## attempt 0 0
## available 0 0
## aware 0 0
## balls up 0 0
## bang 0 0
## bargain 0 1
## bed 0 1
## bet 1 0
## bid 0 1
## booking 0 0
## breakthrough 1 1
## call 1 4
## case 0 1
## change 2 0
## cheque 0 0
## choice 7 1
## clear 1 0
## comment 1 1
## commitment 0 0
## comparison 0 0
## complaint 0 1
## confession 0 1
## connect 0 0
## contact 2 0
## contrast 1 0
## contribution 0 1
## conversation 2 1
## correction 1 0
## creation 0 0
## curtsey 0 1
## cut 0 0
## dash 1 0
## deal 0 1
## decision 2 4
## difference 5 6
## discovery 1 3
## distinction 0 0
## economy 0 0
## effort 3 2
## entrance 0 1
## exception 1 0
## excuse 0 2
## feature 0 0
## fuss 0 2
## gain 0 0
## generalisation 0 0
## gesture 0 4
## illusion 0 0
## impression 0 1
## improvement 0 0
## issue 0 0
## joke 0 1
## journey 1 1
## killing 0 0
## leap 1 0
## list 0 0
## loss 0 0
## misery 0 0
## mistake 13 4
## move 0 2
## movement 0 1
## noise 8 12
## note 0 1
## observation 0 0
## offer 1 0
## pact 2 0
## pause 0 0
## payment 0 0
## peace 0 1
## phenomenon 0 0
## plan 1 1
## point 0 0
## prediction 0 0
## preparation 1 0
## progress 2 4
## promise 0 1
## proposal 0 0
## protest 0 0
## racket 0 2
## ready 0 1
## request 1 0
## reservation 0 1
## row 0 0
## sacrifice 0 0
## save 0 0
## saving 0 0
## secret 0 0
## sense 2 5
## shamble 0 1
## sighting 0 0
## sign 0 1
## signal 0 0
## silence 0 1
## small talk 0 1
## sound 6 6
## speech 2 1
## start 0 2
## statement 1 0
## stay 0 0
## stereotype 0 0
## stop 0 0
## suggestion 1 0
## sweep 0 1
## swipe 0 1
## tone 0 0
## transition 0 0
## trip 0 1
## trouble 1 2
## use 0 1
## visit 0 1
## walk 1 0
## wish 0 1
comp <- round(prop.table(table(fiction_delex$Collocate_lemma, fiction_delex$Register),
2), 4) * 100
comp <- as.data.frame(unclass(comp))
comp$Collocate_lemma <- row.names(comp)
head(comp)
## narrative Youth_Fiction Collocate_lemma
## abortion 0.00 0.00 abortion
## accent 0.00 0.00 accent
## accusation 0.00 0.94 accusation
## agreement 0.00 0.00 agreement
## amend 1.23 0.94 amend
## announcement 1.23 0.00 announcement
levels(as.factor(comp$narrative))
## [1] "0" "1.23" "2.47" "3.7" "6.17" "7.41" "8.64" "9.88" "16.05"
levels(as.factor(comp$Youth_Fiction))
## [1] "0" "0.94" "1.89" "2.83" "3.77" "4.72" "5.66" "11.32"
comp %>% filter(narrative > 2.48 | Youth_Fiction > 1.9) %>% ggplot(aes(narrative,
Youth_Fiction, label = Collocate_lemma)) + # ggrepel geom, make arrows transparent
geom_point(color = "darkred") + geom_text_repel(min.segment.length = 0.1, segment.alpha = 0.4,
force = 0.6) + geom_abline(color = "gray40", lty = 2) + labs(y = "% of delexical MAKEs in Youth Fiction sample",
x = "% of delexical MAKEs in Textbook Fiction") + scale_y_log10(breaks = c(1,
2, 4, 6, 8, 10, 12), limits = c(0.9, 12)) + scale_x_log10(breaks = c(1, 2, 4,
6, 8, 10, 12, 14, 16)) + # minimal theme & customizations
theme_bw() + theme(panel.grid.major = element_line(colour = "whitesmoke"), panel.grid.minor = element_blank())
# ggsave(here('delex_MAKE_Fiction_chatterplot.svg'), dpi = 300, width =15, height
# = 15, units = 'cm')
#### Percentage of collocates that are speech actions #####
total_delexical <- delexical %>% group_by(Register) %>% summarise(total = n())
speech <- delexical %>% group_by(Register) %>% filter(speech == 1) %>% summarise(speech = n())
speech2 <- left_join(speech, total_delexical, by = "Register") %>% mutate(speech_per = speech/total *
100) %>% mutate(not_speech = total - speech) %>% add_column(Variety = c("Reference",
"Textbook", "Textbook", "Reference")) %>% rename(Corpus = Register) %>% add_column(Register = c("Conversation",
"Fiction", "Conversation", "Fiction"))
speech2
## # A tibble: 4 × 7
## Corpus speech total speech_per not_speech Variety Register
## <fct> <int> <int> <dbl> <int> <chr> <chr>
## 1 BNC_Spoken 33 145 22.8 112 Reference Conversation
## 2 narrative 13 81 16.0 68 Textbook Fiction
## 3 TxB_spoken 24 167 14.4 143 Textbook Conversation
## 4 Youth_Fiction 24 106 22.6 82 Reference Fiction
## Statistical testing ##
speech3 <- as.data.frame(speech2)
speech3
## Corpus speech total speech_per not_speech Variety Register
## 1 BNC_Spoken 33 145 22.75862 112 Reference Conversation
## 2 narrative 13 81 16.04938 68 Textbook Fiction
## 3 TxB_spoken 24 167 14.37126 143 Textbook Conversation
## 4 Youth_Fiction 24 106 22.64151 82 Reference Fiction
speech_spoken <- speech3[c(1, 3), c(2, 5)] # Spoken register
chisq.test(t(speech_spoken)) # p-value = 0.07751
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: t(speech_spoken)
## X-squared = 3.1164, df = 1, p-value = 0.07751
vcd::assocstats(t(speech_spoken)) # Cramer's V = 0.108
## X^2 df P(> X^2)
## Likelihood Ratio 3.6524 1 0.055988
## Pearson 3.6566 1 0.055848
##
## Phi-Coefficient : 0.108
## Contingency Coeff.: 0.108
## Cramer's V : 0.108
speech_narrative <- speech3[c(2, 4), c(2, 5)] # Narrative register
chisq.test(t(speech_narrative)) # p-value = 0.3493
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: t(speech_narrative)
## X-squared = 0.87612, df = 1, p-value = 0.3493
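For symmetry with the Conversation comparison above, the effect size of the Fiction contingency table can be computed in the same way. This is a minimal sketch; its output is not shown here.
# Sketch: effect size (Cramer's V) for the narrative vs. Youth Fiction comparison
vcd::assocstats(t(speech_narrative))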
## Second attempt at better graph ##
cols <- RColorBrewer::brewer.pal(n = 7, name = "OrRd")
# svg(filename=here('delexical_speech.svg'), width=5, height=4)
ggplot(speech3, aes(Register, speech_per)) + geom_bar(aes(fill = Variety), stat = "identity",
    position = "dodge", width = 0.5) + ggsignif::geom_signif(y_position = 23.5,
    xmin = c(0.8, 1.8), xmax = c(1.2, 2.2), annotation = c("ns", "ns"), tip_length = 0,
    textsize = 3) + scale_fill_manual(name = "Variety", labels = c("ENL Reference",
    "Textbook English"), values = cols[c(3, 7)]) + ylab("% of delexical MAKE") +
    scale_y_continuous(expand = c(0, 0), limits = c(0, 25)) + theme_bw() +
    # theme() placed after theme_bw() so the blank legend title is not overridden
    theme(legend.title = element_blank())
dev.off()
## null device
## 1
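An alternative to the chi-squared tests above is a two-sample test of equal proportions, which additionally returns a confidence interval for the difference between the two proportions. The call below is a sketch based on the speech3 data frame; its output is not reproduced here.
# Sketch: proportion test for the Conversation data (rows 1 and 3 of speech3)
prop.test(x = speech3[c(1, 3), "speech"], n = speech3[c(1, 3), "total"])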
## Quantitative analysis ##
speech <- delexical[delexical$speech == 1, ]
head(speech) # Select only speech collocates
## Register Level Series Meaning Collocate Collocate_lemma
## 1 Youth_Fiction a case case
## 4 Youth_Fiction accusation accusation
## 6 narrative D Join the Team delexical amends amend
## 7 Youth_Fiction amends amend
## 8 BNC_Spoken announcement announcement
## 9 narrative A Achievers delexical announcement announcement
## speech
## 1 1
## 4 1
## 6 1
## 7 1
## 8 1
## 9 1
speech$Collocate_lemma <- factor(speech$Collocate_lemma) # Drop unused levels
speech <- t(table(speech$Register, speech$Collocate_lemma)) # Compute raw frequency list
str(speech)
## 'table' int [1:31, 1:4] 0 0 1 0 1 2 1 1 0 1 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:31] "accusation" "amend" "announcement" "appeal" ...
## ..$ : chr [1:4] "BNC_Spoken" "narrative" "TxB_spoken" "Youth_Fiction"
head(speech)
##
## BNC_Spoken narrative TxB_spoken Youth_Fiction
## accusation 0 0 0 1
## amend 0 1 0 1
## announcement 1 1 1 0
## appeal 0 0 0 1
## argument 1 0 0 0
## arrangement 2 1 0 1
speech2 <- as.data.frame.matrix(speech)
speech2 %>% subset(narrative == 0 & TxB_spoken == 0) %>% select(c(BNC_Spoken, Youth_Fiction))
## BNC_Spoken Youth_Fiction
## accusation 0 1
## appeal 0 1
## argument 1 0
## assumption 1 0
## aware 1 0
## comparison 1 0
## complaint 1 1
## confession 0 1
## point 1 0
## row 1 0
## signal 1 0
## silence 0 1
## small talk 0 1
## speech 0 1
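The reverse check, i.e. speech collocates attested in the textbook subcorpora but absent from both reference corpora, can be run on the same frequency table. This is a sketch only; its output is not shown here.
# Sketch: speech collocates of delexical MAKE found only in the textbook subcorpora
speech2 %>% subset(BNC_Spoken == 0 & Youth_Fiction == 0) %>% select(c(narrative, TxB_spoken))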
PV <- read.csv(here("MAKE_PhrasalVerbs_Conc.csv"))
t(table(PV$Corpus, PV$PV))
##
## BNCSpoken TxB_narrative TxB_spoken Youth_Fiction
## it up to sb. 0 0 0 1
## of (=capable) 0 0 1 2
## of sth. (=interpret) 0 0 0 2
## out (=discern) 1 4 0 10
## out (=pet) 1 0 0 0
## out (=present as) 0 0 0 1
## out to be (=pretend) 1 0 0 0
## up (=compensate) 1 1 0 1
## up (=comprise) 1 4 3 1
## up (=face paint) 1 0 0 2
## up (=invent) 17 4 2 6
## up (=prepare) 1 0 1 2
## up (=reconcile) 2 0 0 1
## with (=make do with) 0 0 1 0
summary(as.factor(PV$Corpus))
## BNCSpoken TxB_narrative TxB_spoken Youth_Fiction
## 26 13 8 29
table(PV$Corpus, PV$Level)
##
## B C D E
## BNCSpoken 26 0 0 0 0
## TxB_narrative 0 0 1 7 5
## TxB_spoken 0 2 2 3 1
## Youth_Fiction 29 0 0 0 0
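Since the four (sub)corpora differ considerably in size, the raw phrasal-verb counts above are easier to compare as within-corpus proportions. The following line is a quick sketch; its output is not reproduced here.
# Sketch: distribution of MAKE phrasal-verb uses within each corpus (row proportions)
round(prop.table(table(PV$Corpus, PV$PV), 1), 3)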
TxBSpokenCaus <- read.csv(here("Textbook_MAKE_Causative_Constructions.csv"), stringsAsFactors = TRUE)
str(TxBSpokenCaus)
## 'data.frame': 172 obs. of 9 variables:
## $ Register : Factor w/ 1 level "spoken": 1 1 1 1 1 1 1 1 1 1 ...
## $ Level : Factor w/ 5 levels "A","B","C","D",..: 5 5 3 5 5 4 3 5 3 4 ...
## $ Series : Factor w/ 9 levels "Access","Achievers",..: 4 6 4 6 5 9 1 5 9 9 ...
## $ Conc1 : logi NA NA NA NA NA NA ...
## $ Conc2 : Factor w/ 5 levels "made","make",..: 2 1 2 2 1 1 2 4 4 1 ...
## $ Conc3 : logi NA NA NA NA NA NA ...
## $ Cause_Cx : Factor w/ 3 levels "adjectiveCx",..: 1 1 1 3 1 2 3 1 3 1 ...
## $ Collocate : Factor w/ 98 levels "ache","affordable",..: 2 77 81 13 54 33 16 36 52 30 ...
## $ Semantic_prosody: logi NA NA NA NA NA NA ...
TxBSpokenCaus$Register = "Textbook Conversation"
BNC2014Caus <- read.csv(here("SpokenBNC2014_MAKE_Causative_Constructions.csv"), stringsAsFactors = TRUE)
str(BNC2014Caus)
## 'data.frame': 208 obs. of 7 variables:
## $ Filename : Factor w/ 183 levels "filename#0","filename#1004",..: 8 114 142 142 96 173 103 85 36 53 ...
## $ Conc1 : logi NA NA NA NA NA NA ...
## $ Conc2 : Factor w/ 4 levels "made","make",..: 2 2 2 2 1 1 2 4 1 2 ...
## $ Conc3 : logi NA NA NA NA NA NA ...
## $ Cause_Cx : Factor w/ 4 levels "adjectiveCx",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ Collocate : Factor w/ 122 levels "a priority","a real man",..: 25 100 64 64 79 97 28 28 35 61 ...
## $ Semantic_prosody: Factor w/ 3 levels "","negative",..: 2 3 2 2 2 2 1 1 1 1 ...
BNC2014Caus <- BNC2014Caus %>% select(-c(Filename))
BNC2014Caus$Register = "Spoken BNC2014"
BNC2014Caus$Level = "Spoken BNC2014"
BNC2014Caus$Series = "Spoken BNC2014"
caus <- rbind(TxBSpokenCaus, BNC2014Caus) %>% select(-Semantic_prosody)
caus$Register <- as.factor(caus$Register)
str(caus)
## 'data.frame': 380 obs. of 8 variables:
## $ Register : Factor w/ 2 levels "Spoken BNC2014",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ Level : Factor w/ 6 levels "A","B","C","D",..: 5 5 3 5 5 4 3 5 3 4 ...
## $ Series : Factor w/ 10 levels "Access","Achievers",..: 4 6 4 6 5 9 1 5 9 9 ...
## $ Conc1 : logi NA NA NA NA NA NA ...
## $ Conc2 : Factor w/ 5 levels "made","make",..: 2 1 2 2 1 1 2 4 4 1 ...
## $ Conc3 : logi NA NA NA NA NA NA ...
## $ Cause_Cx : Factor w/ 4 levels "adjectiveCx",..: 1 1 1 3 1 2 3 1 3 1 ...
## $ Collocate: Factor w/ 184 levels "ache","affordable",..: 2 77 81 13 54 33 16 36 52 30 ...
table(caus$Cause_Cx, caus$Register)
##
## Spoken BNC2014 Textbook Conversation
## adjectiveCx 75 71
## nounCx 20 14
## verbalCxAA 112 87
## verbalCxAP 1 0
round(prop.table(table(caus$Cause_Cx, caus$Register), 2), 4) * 100
##
## Spoken BNC2014 Textbook Conversation
## adjectiveCx 36.06 41.28
## nounCx 9.62 8.14
## verbalCxAA 53.85 50.58
## verbalCxAP 0.48 0.00
vcd::assoc(caus$Cause_Cx ~ caus$Register, shade = TRUE, varnames = FALSE)
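The association plot can be complemented by a test of independence on the same contingency table. This is a sketch based on the caus data frame built above; note that the single verbalCxAP token may trigger a small-expected-counts warning.
# Sketch: test of independence between construction type and corpus
chisq.test(table(caus$Cause_Cx, caus$Register))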
# Chatterplot: inspired by
# https://towardsdatascience.com/rip-wordclouds-long-live-chatterplots-e76a76896098
comp <- round(prop.table(table(caus$Collocate, caus$Register), 2), 4) * 100
comp <- as.data.frame(unclass(comp))
comp$Collocate <- row.names(comp)
comp <- inner_join(comp, caus[, 7:8], by = "Collocate")
comp <- distinct(comp)
str(comp)
## 'data.frame': 184 obs. of 4 variables:
## $ Spoken BNC2014 : num 0.48 0 0 0 0 0 0.48 0 0 1.44 ...
## $ Textbook Conversation: num 0.58 0.58 0.58 1.16 0.58 1.16 0.58 0.58 0.58 2.33 ...
## $ Collocate : chr "ache" "affordable" "anxious" "appear" ...
## $ Cause_Cx : Factor w/ 4 levels "adjectiveCx",..: 3 1 1 3 2 1 1 1 1 1 ...
comp %>% filter(`Textbook Conversation` > 0.95 | `Spoken BNC2014` > 0.95) %>% ggplot(aes(`Textbook Conversation`,
    `Spoken BNC2014`, colour = Cause_Cx, label = Collocate)) + # ggrepel geom, make arrows transparent, colour by construction type
geom_point(aes(colour = Cause_Cx)) + geom_text_repel(min.segment.length = 0.5, segment.alpha = 0.4,
force = 1, max.overlaps = 20, aes(colour = Cause_Cx), show.legend = F) + scale_colour_manual(breaks = c("adjectiveCx",
"nounCx", "verbalCxAA"), values = palettes_d$suffrager$oxon[c(2, 1, 3)], labels = c("[X MAKE Y AdjP]",
"[X MAKE (Y) NP]", "[X MAKE Y Vinf]"), name = "Construction type") +
    # add dashed identity line as a reference diagonal
geom_abline(color = "gray40", lty = 2) +
    # axis labels & log10 scales
labs(y = "% of causative MAKEs in the Spoken BNC2014 sample", x = "% of causative MAKEs in Textbook Conversation") +
scale_y_log10(breaks = c(1, 2, 4, 6, 8, 10)) + scale_x_log10(breaks = c(1, 2,
4, 6, 8, 10)) +
# minimal theme & customizations
theme_bw() + theme(legend.position = c(0.97, 0.4), legend.justification = c("right",
"top"), panel.grid.major = element_line(colour = "whitesmoke"), panel.grid.minor = element_blank(),
legend.background = element_rect(colour = "darkgrey", fill = "white", linetype = "solid"))
# ggsave(here('Causative_Spoken_chatterplot.svg'), dpi=300, width=20, height=20,
# units='cm')
dev.off()
## null device
## 1
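Beyond the chatterplot, the most frequent individual causative collocates per corpus and construction type can also be listed directly. The line below is a minimal sketch; its output is not reproduced here.
# Sketch: ten most frequent causative MAKE collocates by corpus and construction type
caus %>% count(Register, Cause_Cx, Collocate, sort = TRUE) %>% head(10)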
# packages.bib <- sapply(1:length(loadedNamespaces()), function(i)
# toBibtex(citation(loadedNamespaces()[i])))
knitr::write_bib(c(.packages(), "knitr"), "packages.bib")
sessionInfo()
## R version 4.0.3 (2020-10-10)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur 10.16
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] wordcloud_2.6 forcats_0.5.1 stringr_1.4.0 dplyr_1.0.7
## [5] purrr_0.3.4 readr_2.0.2 tidyr_1.1.4 tibble_3.1.6
## [9] tidyverse_1.3.0 RColorBrewer_1.1-2 gtools_3.8.2 paletteer_1.3.0
## [13] ggrepel_0.9.1 ggplot2_3.3.5 ggsignif_0.6.1 here_1.0.1
##
## loaded via a namespace (and not attached):
## [1] httr_1.4.2 sass_0.4.0 jsonlite_1.7.2 modelr_0.1.8
## [5] bslib_0.3.1 stargazer_5.2.2 assertthat_0.2.1 highr_0.9
## [9] cellranger_1.1.0 slam_0.1-48 yaml_2.2.1 pillar_1.6.4
## [13] backports_1.4.1 lattice_0.20-41 glue_1.6.0 digest_0.6.29
## [17] rvest_1.0.0 colorspace_2.0-2 htmltools_0.5.2 tm_0.7-8
## [21] pkgconfig_2.0.3 broom_0.7.9 haven_2.3.1 scales_1.1.1
## [25] tzdb_0.1.2 generics_0.1.1 farver_2.1.0 ellipsis_0.3.2
## [29] withr_2.4.3 NLP_0.2-1 cli_3.1.0 magrittr_2.0.1
## [33] crayon_1.4.2 readxl_1.3.1 evaluate_0.14 fs_1.5.2
## [37] fansi_0.5.0 MASS_7.3-53.1 xml2_1.3.3 tools_4.0.3
## [41] hms_1.0.0 formatR_1.8 lifecycle_1.0.1 munsell_0.5.0
## [45] reprex_1.0.0 compiler_4.0.3 jquerylib_0.1.4 vcd_1.4-8
## [49] rlang_0.4.12 grid_4.0.3 rstudioapi_0.13 labeling_0.4.2
## [53] rmarkdown_2.11 gtable_0.3.0 DBI_1.1.1 rematch2_2.1.2
## [57] R6_2.5.1 zoo_1.8-9 lubridate_1.7.10 knitr_1.37
## [61] fastmap_1.1.0 utf8_1.2.2 rprojroot_2.0.2 stringi_1.7.6
## [65] parallel_4.0.3 Rcpp_1.0.7 vctrs_0.3.8 dbplyr_2.1.0
## [69] tidyselect_1.1.1 xfun_0.29 lmtest_0.9-38