Ch. 15: Tasks & Quizzes
Q15.1 Which of the following analogies have (rightly or wrongly) been used by different scholars to refer to Large Language Models (LLMs)? You may have heard of one or a few of these, but can you guess them all?
The term ‘stochastic parrot’ 🦜 is probably the best known LLM metaphor. It was introduced by computational linguist Emily M. Bender and colleagues in 2021 to characterise LLMs as systems that mimic text without true understanding, highlighting their limitations in processing meaning. This metaphor emphasises that LLMs generate outputs based on statistical patterns in their training data, similar to how parrots mimic sounds without comprehension.
In 2022, the cognitive scientist Iris van Rooij published a blog post in which she succinctly explains why she believes that LLMs cannot legitimately be used for academic writing because they essentially “automate plagiarism”.
The ‘spicy autocomplete’ metaphor 🌶️ is difficult to trace back to one or more specific author(s). It suggests that LLMs are just fancy versions of a smartphone’s predictive text, which predicts the next word based on what came before, except that LLMs add some randomness, i.e. spice, to the output. This framing also implies that LLMs are pattern-matching algorithms without real understanding or problem-solving ability (see Groß 2024).
In 2024, Hicks, Humphries & Slater published a paper in the journal ‘Ethics and Information Technology’ entitled ‘ChatGPT is bullshit’, in which they argue that the output of LLMs is best understood as ‘bullshit’ in the philosophical sense described by Frankfurt (2005) because LLMs are indifferent to the truth of their outputs.
‘Synthetic text extruding machines’ is a term that Emily M. Bender and Alex Hanna like to use, e.g. in their 2025 book entitled ‘The AI Con: How to Fight Big Tech’s Hype and Create the Future We Want’. They describe the process of LLM-generated texts by explaining that, “[l]ike an industrial plastic process, language corpora are forced through complicated machinery to produce a product that looks like communicative language, but without any intent or thinking mind behind it.”
As an alternative to well-established metaphors such as the ones listed above, classical philologist Gyburg Uhlmann proposed ‘kitsch’ as a new metaphor to describe the output of LLMs. She argues that ‘kitsch’ “is particularly suitable for analytically illuminating a previously neglected feature of LLM-based images and texts: their tendency to produce homogeneous and average content, which is […] leading to the equalisation of language, style and argument” (Uhlmann 2025).
Q15.2 Read Jeff Pooley’s short commentary ‘The Matthew Effect in AI Summary’ (archived version). What does the Matthew Effect refer to?
Q15.3 Which term describes the phenomenon whereby the contributions of marginalised female scientists are overlooked or attributed to their male colleagues?
Q15.4 According to Pooley (2025), which of the following biases in academia are likely to be aggravated by the use of Large Language Models (LLMs) to summarise academic literature and write research articles?
Read through Google’s “AI overview” displayed below.
Q15.5 Among a number of sensible-sounding suggestions, we find a recommendation for adding non-toxic glue to pizza sauce. This mention is thought to have come from an old Reddit post (see screenshot above). Which aspect(s) of the AI overview point to this theory?
As part of an exploratory (not pre-registered) analysis, Tamkin & Shen (2026) from Anthropic (see On the value of human learning) decomposed the quiz scores into sub-areas and question types (see their Figure 8, reprinted below). Each question in the quiz belonged to exactly one task (e.g., Task 1 or Task 2) and exactly one question type (e.g., Conceptual, Debugging, or Code Reading). Figure 8 shows that, for both tasks, the control (no AI) group performed better than the AI group.

Q15.6 Looking at the results displayed in Figure 8 above, which question type shows the largest difference in average quiz scores between the treatment and control groups?
Q15.7 Why might the control group have, on average, performed better on debugging questions compared to the AI group?
Q15.8 What do the results displayed in Figure 8 from Tamkin & Shen (2026) (see above) suggest about the impact of AI assistance on code reading skills?
Q15.9 What is Stack Exchange?
Q15.10 Open this archived version of a Q&A about confidence intervals hosted on Cross Validated, Stack Exchange’s statistics forum. In which year did Eliott originally ask their question?
Q15.11 As of 9 February 2026 when the page was archived, how many Cross Validated members had upvoted the top answer?
In Using t-tests to compare two groups, we conducted the following t-test to find out whether the observed difference between the L1 and L2 speakers’ non-verbal IQ ‘Blocks’ test scores was significant or not:
t.test(formula = Blocks ~ Group,
       data = Dabrowska.data)

	Welch Two Sample t-test

data:  Blocks by Group
t = -4.4084, df = 152.46, p-value = 1.956e-05
alternative hypothesis: true difference in means between group L1 and group L2 is not equal to 0
95 percent confidence interval:
 -5.239791 -1.996693
sample estimates:
mean in group L1 mean in group L2 
        13.84444         17.46269 
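To make the quantities in this output concrete, here is a minimal sketch of what t.test() computes, written in Python rather than R purely for illustration: the Welch t statistic, the Welch–Satterthwaite degrees of freedom, the two-sided p-value and the confidence interval for the difference in means. The sample data in x and y are made up for demonstration and have nothing to do with Dabrowska’s scores.

```python
import numpy as np
from scipy import stats

def welch_t_test(x, y, conf=0.95):
    """Welch two-sample t-test, mirroring what R's t.test() reports by default."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    nx, ny = len(x), len(y)
    mean_diff = x.mean() - y.mean()
    vx, vy = x.var(ddof=1), y.var(ddof=1)        # sample variances
    se = np.sqrt(vx / nx + vy / ny)              # SE of the mean difference
    t = mean_diff / se
    # Welch–Satterthwaite approximation for the degrees of freedom
    df = (vx / nx + vy / ny) ** 2 / (
        (vx / nx) ** 2 / (nx - 1) + (vy / ny) ** 2 / (ny - 1)
    )
    p = 2 * stats.t.sf(abs(t), df)               # two-sided p-value
    t_crit = stats.t.ppf((1 + conf) / 2, df)     # critical value for the CI
    ci = (mean_diff - t_crit * se, mean_diff + t_crit * se)
    return t, df, p, ci

# Toy data (hypothetical, not the Dabrowska scores):
x = [12, 15, 11, 14, 13, 16, 12]
y = [17, 19, 16, 18, 20, 15, 18]
t, df, p, ci = welch_t_test(x, y)
print(f"t = {t:.4f}, df = {df:.2f}, p-value = {p:.4g}")
print(f"95 percent confidence interval: {ci[0]:.4f} {ci[1]:.4f}")
```

Note that the confidence interval describes the difference in means (first group minus second group), which is why the R output above reports a negative interval: the L1 mean is smaller than the L2 mean.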
The output of the t.test() command above was fed into an LLM (DeepSeek R1 Distill Llama 70B) with a prompt requesting a short paragraph interpreting these results. The model’s response is printed below:
The Welch two‑sample t‑test shows a significant difference between the groups (t = ‑4.41, df ≈ 152.5, p = 1.96 × 10⁻⁵). Group L1 has a mean of 13.84 blocks, while group L2’s mean is 17.46 blocks, a mean difference of ‑3.62 blocks. The 95 % confidence interval for this difference (‑5.24 to ‑1.99) means we can be 95 % confident that the true average in L2 exceeds that in L1 by roughly 2 to 5 blocks. Because the interval does not include zero, the null hypothesis of equal means is rejected.
Q15.12 Aside from the fact that “blocks” is not the correct measurement unit, which major statistical error does the LLM’s interpretation of the t-test output include?
Check your progress 🌟
Congratulations, you have reached the end of this textbook! Check out the Next-step resources to continue your learning journey!
Artwork by Allison Horst (CC BY 4.0)


