Ch. 15: Tasks & Quizzes
Q15.1 Which of the following analogies have (rightly or wrongly) been used by different scholars to refer to Large Language Models (LLMs)? You may have heard of one or a few of these, but can you guess them all?
The term ‘stochastic parrot’ 🦜 is probably the best known LLM metaphor. It was introduced by computational linguist Emily M. Bender and colleagues in 2021 to characterise LLMs as systems that mimic text without true understanding, highlighting their limitations in processing meaning. This metaphor emphasises that LLMs generate outputs based on statistical patterns in their training data, similar to how parrots mimic sounds without comprehension.
In 2022, the cognitive scientist Iris van Rooij published a blog post in which she succinctly explains why she believes that LLMs cannot legitimately be used for academic writing because they essentially “automate plagiarism”.
The ‘spicy autocomplete’ metaphor 🌶️ is difficult to trace back to one or more specific author(s). It suggests that LLMs are just fancy versions of a smartphone’s predictive text, which predicts the next word based on what came before, except that LLMs add some randomness, i.e. spice, to the output. This framing also implies that LLMs are pattern-matching algorithms without real understanding or problem-solving ability (see Groß 2024).
In 2024, Hicks, Humphries & Slater published a paper in the journal ‘Ethics and Information Technology’ entitled ‘ChatGPT is bullshit’, in which they argue that the output of LLMs is best understood as ‘bullshit’ in the philosophical sense described by Frankfurt (2005) because LLMs are indifferent to the truth of their outputs.
‘Synthetic text extruding machines’ is a term that Emily M. Bender and Alex Hanna like to use, e.g. in their 2025 book entitled ‘The AI Con: How to Fight Big Tech’s Hype and Create the Future We Want’. They describe the process of LLM-generated texts by explaining that, “[l]ike an industrial plastic process, language corpora are forced through complicated machinery to produce a product that looks like communicative language, but without any intent or thinking mind behind it.”
As an alternative to well-established metaphors such as the ones listed above, classical philologist Gyburg Uhlmann proposed ‘kitsch’ as a new metaphor to describe the output of LLMs. She argues that ‘kitsch’ “is particularly suitable for analytically illuminating a previously neglected feature of LLM-based images and texts: their tendency to produce homogeneous and average content, which is […] leading to the equalisation of language, style and argument” (Uhlmann 2025).
Q15.2 Read Jeff Pooley’s short commentary ‘The Matthew Effect in AI Summary’ (archived version). What does the Matthew Effect refer to?
Q15.3 Which term describes the phenomenon whereby the contributions of marginalised female scientists are overlooked or attributed to their male colleagues?
Q15.4 According to Pooley (2025), which of the following biases in academia are likely to be aggravated by the use of Large Language Models (LLMs) to summarise academic literature and write research articles?
Read through Google’s “AI overview” displayed below.
Q15.5 Among a number of sensible-sounding suggestions, we find a recommendation for adding non-toxic glue to pizza sauce. This mention is thought to have come from an old Reddit post (see screenshot above). Which aspect(s) of the AI overview point to this theory?
As part of an exploratory (not pre-registered) analysis, Tamkin & Shen (2026) from Anthropic (see On the value of human learning) decomposed the quiz scores into sub-areas and question types (see their Figure 8, reprinted below). Each question in the quiz belonged to exactly one task (e.g., Task 1 or Task 2) and exactly one question type (e.g., Conceptual, Debugging, or Code Reading). Figure 8 shows that, for both tasks, the control (no AI) group performed better than the AI group.

Q15.6 Looking at the results displayed in Figure 8 above, which question type shows the largest difference in average quiz scores between the treatment and control groups?
Q15.7 Why might the control group have, on average, performed better on debugging questions compared to the AI group?
Q15.8 What do the results displayed in Figure 8 from Tamkin & Shen (2026) (see above) suggest about the impact of AI assistance on code reading skills?
Q15.9 What is Stack Exchange?
Q15.10 Open this archived version of a Q&A about confidence intervals hosted on Cross Validated, Stack Exchange’s statistics forum. In which year did Eliott originally ask their question?
Q15.11 As of 9 February 2026 when the page was archived, how many Cross Validated members had upvoted the top answer?
In Using t-tests to compare two groups, we conducted the following t-test to find out whether the observed difference between the L1 and L2 speakers’ non-verbal IQ ‘Blocks’ test scores was significant or not:
t.test(formula = Blocks ~ Group,
       data = Dabrowska.data)

	Welch Two Sample t-test

data:  Blocks by Group
t = -4.4084, df = 152.46, p-value = 1.956e-05
alternative hypothesis: true difference in means between group L1 and group L2 is not equal to 0
95 percent confidence interval:
 -5.239791 -1.996693
sample estimates:
mean in group L1 mean in group L2 
        13.84444         17.46269 
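To make the quantities in this output concrete, here is a minimal sketch of what t.test() computes, written in Python rather than R purely for illustration: the Welch t statistic, the Welch–Satterthwaite degrees of freedom, the two-sided p-value and the confidence interval for the difference in means. The sample data in x and y are made up for demonstration and have nothing to do with Dabrowska’s scores.

```python
import numpy as np
from scipy import stats

def welch_t_test(x, y, conf=0.95):
    """Welch two-sample t-test, mirroring what R's t.test() reports by default."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    nx, ny = len(x), len(y)
    mean_diff = x.mean() - y.mean()
    vx, vy = x.var(ddof=1), y.var(ddof=1)        # sample variances
    se = np.sqrt(vx / nx + vy / ny)              # SE of the mean difference
    t = mean_diff / se
    # Welch–Satterthwaite approximation for the degrees of freedom
    df = (vx / nx + vy / ny) ** 2 / (
        (vx / nx) ** 2 / (nx - 1) + (vy / ny) ** 2 / (ny - 1)
    )
    p = 2 * stats.t.sf(abs(t), df)               # two-sided p-value
    t_crit = stats.t.ppf((1 + conf) / 2, df)     # critical value for the CI
    ci = (mean_diff - t_crit * se, mean_diff + t_crit * se)
    return t, df, p, ci

# Toy data (hypothetical, not the Dabrowska scores):
x = [12, 15, 11, 14, 13, 16, 12]
y = [17, 19, 16, 18, 20, 15, 18]
t, df, p, ci = welch_t_test(x, y)
print(f"t = {t:.4f}, df = {df:.2f}, p-value = {p:.4g}")
print(f"95 percent confidence interval: {ci[0]:.4f} {ci[1]:.4f}")
```

Note that the confidence interval describes the difference in means (first group minus second group), which is why the R output above reports a negative interval: the L1 mean is smaller than the L2 mean.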
The output of the t.test() command above was fed into an LLM (DeepSeek R1 Distill Llama 70B) with a prompt requesting a short paragraph interpreting these results. The model’s response is printed below:
The Welch two‑sample t‑test shows a significant difference between the groups (t = ‑4.41, df ≈ 152.5, p = 1.96 × 10⁻⁵). Group L1 has a mean of 13.84 blocks, while group L2’s mean is 17.46 blocks, a mean difference of ‑3.62 blocks. The 95 % confidence interval for this difference (‑5.24 to ‑1.99) means we can be 95 % confident that the true average in L2 exceeds that in L1 by roughly 2 to 5 blocks. Because the interval does not include zero, the null hypothesis of equal means is rejected.
Q15.12 Aside from the fact that “blocks” is not the correct measurement unit, which major statistical error does the LLM’s interpretation of the t-test output include?
Check your progress 🌟
Congratulations, you have reached the end of this textbook! Check out the Next-step resources to continue your learning journey!
Artwork by Allison Horst (CC BY 4.0)


