Ch. 2: Tasks & Quizzes – Data Analysis for the Language Sciences

Your turn!

For these quiz questions and many future tasks, we will make use of the IRIS database.

Connect to the IRIS website and navigate to its Search and Download page.
Scroll down to the filter option ‘Data Type’.
Click on ‘Data Type’ and browse through the different data types that are most commonly used in language-related research.

Data Type filter dropdown menu with the following visible categories: Oral production (168), Closed response format (157), Open response (105), Judgements (97), Written production (76), Reaction times (62), Qualitative (46). Screenshot from the IRIS database search page (accessed on 17 April 2024).

Q2.1 For which kinds of studies could these different types of data have been collected? Think about both experimental and observational studies.

Q2.2 Which of these data types is most likely to be measured in milliseconds (ms)?

Your turn!

Q2.3 Which other factors could potentially jeopardise the comparison of test results from two different groups of pupils?

Your turn!

Q2.4 In which format are Microsoft Word files typically saved?

Q2.5 Which of these files are audio files?

Your turn!

Imagine that you want to run an experiment similar to the one carried out in Schimke et al. (2018) (see Figure 2.2 (a)). You can reuse the Playmobil image files created by the researchers as they helpfully uploaded them to the IRIS database.

In which file format do you think the images are archived? To find out, click here to go directly to the list of data and materials associated with the study. There are four entries in the IRIS database that are associated with this study. Select the “Pictorial” entry which contains the images. It allows you to download a ZIP file called Images_online.zip. ZIP is an archive file format that can contain one or more compressed files. Download this ZIP file and decompress (‘unzip’) it. You should find that it contains a folder entitled ‘Images’, which contains 58 pictures of different combinations of Playmobil figures that correspond to the experiment’s stimulus sentences.
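If you prefer to inspect an archive programmatically rather than by unzipping it and browsing the folder, the following is a minimal sketch using Python's built-in zipfile module. The function name count_extensions and the demo file names are made up for illustration; the demo archive is built in memory so that the example runs without downloading anything.

```python
import io
import zipfile
from collections import Counter
from pathlib import PurePosixPath


def count_extensions(archive) -> Counter:
    """Count the file extensions of all files stored in a ZIP archive."""
    with zipfile.ZipFile(archive) as zf:
        return Counter(
            PurePosixPath(info.filename).suffix.lower()
            for info in zf.infolist()
            if not info.is_dir()  # skip folder entries
        )


# Build a tiny in-memory archive with made-up file names for demonstration.
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as zf:
    zf.writestr("Demo/notes.txt", "hello")
    zf.writestr("Demo/table.csv", "a,b\n1,2\n")
    zf.writestr("Demo/more_notes.TXT", "world")

demo_counts = count_extensions(demo)
print(demo_counts)
```

Assuming that you saved Images_online.zip in your current working directory, calling count_extensions("Images_online.zip") should list the extensions of the files that the archive contains.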

Q2.6 In which file format are these Playmobil image files?

Image files typically contain metadata that is embedded in the image files themselves. This metadata may include the dimensions of the image and its colour profile. To view this metadata, right-click on one of the image files that you have extracted from the ZIP file and select the option to get more information about the file, e.g. “Get Info” or “Properties”.
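To illustrate what 'embedded in the image file itself' means, here is a minimal sketch that reads the width and height directly from the first bytes of a file. It assumes, purely for illustration, that the file is in the PNG format, which stores its dimensions in a fixed position near the start of the file; the Playmobil images may well be in a different format, in which case this function will raise an error. The function name png_dimensions and the 640 × 480 demo header are made up.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # the first 8 bytes of every PNG file


def png_dimensions(data: bytes) -> tuple[int, int]:
    """Return (width, height) in pixels from the header of a PNG file."""
    if data[:8] != PNG_SIGNATURE or data[12:16] != b"IHDR":
        raise ValueError("not a PNG file")
    # Width and height are stored as two big-endian 4-byte unsigned integers
    # immediately after the 'IHDR' chunk type.
    width, height = struct.unpack(">II", data[16:24])
    return width, height


# Hand-built header of a made-up 640 x 480 image (header only, not a viewable image).
demo_header = (
    PNG_SIGNATURE
    + struct.pack(">I", 13)  # length of the IHDR data in bytes
    + b"IHDR"
    + struct.pack(">IIBBBBB", 640, 480, 8, 2, 0, 0, 0)
)
print(png_dimensions(demo_header))
```

For a real file, you could pass the result of open(path, "rb").read(24) to the function, where path is a placeholder for the location of the file on your computer. For other image formats, a general-purpose imaging library would be a more suitable choice than this hand-written reader.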

Q2.7 How wide are these Playmobil images in pixels?

Your turn!

In this task, we will practise opening a DSV file in LibreOffice Calc. Our example file is a real dataset from Schimke et al. (2018). We will begin by downloading it from the public repository IRIS.

In addition to the eye-tracking experiments, Schimke et al. (2018) conducted two further experiments in which participants completed a gap-filling task via an online survey platform. In the first of these experiments, the participants were native (L1) speakers of French, German, and Spanish. In the second, they were French- and Spanish-speaking learners (L2) of German.

In both experiments, the L1 and L2 participants were shown ambiguous sentences similar to the ones used in the eye-tracking experiment with the Playmobil images (see Note 2.2). After having read each stimulus, the participants were asked to complete a gap-fill task according to their understanding of the preceding ambiguous sentence. Participants were told “that there were no incorrect responses and that they should answer spontaneously” (Schimke et al. 2018: 755). Below is an example questionnaire item in the three languages examined:

1. Der Briefträger ist dem Straßenfeger begegnet, bevor er schnell ein Sandwich geholt hat. ___________________ hat ein Sandwich geholt.

2. Le facteur a rencontré le balayeur avant qu’il prenne rapidement un sandwich. ___________________ a pris un sandwich.

3a. El cartero se reunió con el barrendero antes de que él recogiera velozmente un emparedado. ___________________ recogió un emparedado.

3b. El cartero se reunió con el barrendero antes de que recogiera velozmente un emparedado. ___________________ recogió un emparedado.

Note that, for Spanish, there were two types of stimuli: one with an overt pronoun (as in 3a. with él) and one without (as in 3b. with a null pronoun), as both variants are possible in Spanish. All three examples translate as:

The postman encountered the street sweeper before he quickly fetched a sandwich. ___________________ fetched a sandwich.

To complete the gap, participants could either select ‘The postman’ or ‘The street sweeper’.

Go back to the study’s page on IRIS and select the second entry entitled ‘Other questionnaire’ which, among other things, contains ‘Written production data’.

Note that this database entry includes both research data and research materials: the file sentences_offline_task.xlsx contains the full list of questionnaire items, including both experimental and filler items, with which we could reconstruct the experiment to replicate it with a new set of participants. For now, however, we are not interested in obtaining materials to replicate the study, but rather in examining the study’s original data.

This IRIS entry also contains three data files. The last file (logoddslearnersfinal.txt) is the DSV file that was used to create Table 2.2 above.

In this task, we are going to look at the questionnaire data corresponding to the gap-filling task experiment conducted with German L2 learners, which is contained in the data file offlinedataLearners.txt:

Download the offlinedataLearners.txt file (which is the second listed) and save it on your computer (see Folders and paths).
Launch LibreOffice (see Open Source if you have not yet installed LibreOffice) and, from the list of options under ‘Create’, click on ‘Calc Spreadsheet’ to open up a blank spreadsheet.
From the ‘File’ drop-down menu, select ‘Open…’ or use the keyboard shortcut Ctrl+O (Cmd+O on macOS). Find offlinedataLearners.txt in the folder where you saved it and click on ‘Open’.
A ‘Text Import’ dialogue box will pop up. This is a DSV file, not a fixed-width file, so ensure that the option ‘Separated by’ is selected. If not already set by default, it is also a good idea to select ‘Unicode (UTF-8)’ for the ‘Character set’.
Experiment with the different ‘Separator Options’ until the preview at the bottom of the dialogue box looks like a table.
Ensure that, apart from the ‘Separator Options’, all other options in the dialogue box are unselected and then click on ‘OK’.
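The trial-and-error approach in the steps above can also be sketched in code. The following minimal example, which uses Python's built-in csv module, tries several candidate separators and keeps only those that split every row into the same number of columns (and into more than one column). The function name table_like_separators, the list of candidates, and the sample text are all made up for illustration; the real file may use a different separator and contains different columns. Note that Python's csv module calls the separator character delimiter and the character that encloses text values quotechar.

```python
import csv
import io

# Candidate separators to try (an assumption; extend the list as needed).
CANDIDATES = {"comma": ",", "semicolon": ";", "tab": "\t", "space": " "}


def table_like_separators(text: str) -> dict[str, int]:
    """Map each separator that yields a consistent table to its column count."""
    results = {}
    for name, sep in CANDIDATES.items():
        rows = [row for row in csv.reader(io.StringIO(text), delimiter=sep) if row]
        widths = {len(row) for row in rows}
        # Table-like: every row has the same number of fields, and more than one.
        if len(widths) == 1 and widths != {1}:
            results[name] = widths.pop()
    return results


# Made-up sample; the real file may use a different separator and different columns.
demo_text = 'id;group;answer\n1;"L2 French";postman\n2;"L2 Spanish";street sweeper\n'

demo_result = table_like_separators(demo_text)
print(demo_result)

# Once the separator is known, the number of observations is the number of rows
# minus the header row (assuming one observation per row).
rows = list(csv.reader(io.StringIO(demo_text), delimiter=";", quotechar='"'))
n_observations = len(rows) - 1
print(n_observations)
```

To apply the same check to the downloaded file, you could read its contents with open("offlinedataLearners.txt", encoding="utf-8").read(), assuming that the file is in your current working directory, and pass the resulting text to table_like_separators().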

Q2.8 What is the separator character in the file offlinedataLearners.txt?

Q2.9 What is the delimiter character in the file offlinedataLearners.txt?

Q2.10 How many observations does the file offlinedataLearners.txt contain?

Q2.11 In this table, what does each observation correspond to?

Your turn!

In this task, you will find out how genetics researchers who use spreadsheets for their analyses regularly have their data so badly damaged that it affects the results of their publications. Though we have no statistics on how spreadsheet errors affect the work of linguists, the situation is (unfortunately) very likely to be just as bad as in genetics.

Ziemann, Eren & El-Osta (2016) reported that a fifth of genetics publications with supplementary .xls or .xlsx files with gene lists contained errors caused by Excel’s auto-formatting behaviour. The results of this study shocked the research community and a report about it went viral. Click on the link below to read the open-access article “Gene name errors: Lessons not learned” by Abeysooriya et al. (2021) to find out whether the situation has improved since 2016 and answer the questions below.

Abeysooriya, Mandhri, Megan Soria, Mary Sravya Kasu & Mark Ziemann. 2021. Gene name errors: Lessons not learned. PLOS Computational Biology. Public Library of Science 17(7). e1008984. https://doi.org/10.1371/journal.pcbi.1008984.
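To get a feel for what such auto-formatting errors look like, consider gene symbols such as SEPT2 or MARCH1, which a spreadsheet programme may silently convert to the dates ‘2-Sep’ and ‘1-Mar’. The following minimal sketch flags values in a gene list that look like such converted dates. It is not the method used in the studies cited above: the function name find_date_like, the pattern, and the demo list are made up for illustration, and the pattern only covers one of several possible date formats.

```python
import re

MONTHS = "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"
# Matches values such as '2-Sep' or '1-Mar' (day, hyphen, month abbreviation).
DATE_LIKE = re.compile(rf"^\d{{1,2}}-({MONTHS})$", re.IGNORECASE)


def find_date_like(values):
    """Return the values that look like dates rather than gene symbols."""
    return [v for v in values if DATE_LIKE.match(v.strip())]


# Made-up gene list in which two symbols have been turned into dates.
demo_genes = ["TP53", "2-Sep", "BRCA1", "1-Mar", "DEC1"]
suspicious = find_date_like(demo_genes)
print(suspicious)
```

A check of this kind can only detect the damage after the fact; it cannot reliably recover the original gene symbols, which is one reason why such errors are so problematic.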

Q2.12 Has the proportion of genetics publications with Excel gene lists affected by auto-formatting errors decreased since 2016?

Q2.13 Does using LibreOffice Calc (see Open Source) also cause these same issues?

Q2.14 Did highly reputable journals publish fewer articles with erroneous Excel gene lists?
