2 Data files and formats

2.1 Data in the language sciences

In this book, we are concerned with empirical research in the language sciences, in other words, with research that is based on the analysis of data. But what are data exactly? Data can be collected via surveys, measurements, or observations. To begin with, however, these collected data are “raw”. Data only become information once we have analysed and interpreted them in a meaningful way. Hence, just like uncooked pasta does not make a flavourful meal, we must learn to “cook” the raw data to obtain meaningful information.

What kind of data are analysed in the language sciences? To get a rough idea of the range of data types analysed in the language sciences, let us take a look at the IRIS database.

IRIS is a collection of instruments, materials, stimuli, data, and data coding and analysis tools used for research into languages, including first, second, and beyond, and signed language learning, multilingualism, language education, language use, and language processing. Materials are freely accessible and searchable, easy to upload (for contributions) and download (for use). (2011)

As such, IRIS supports Open Science and Open Scholarship (see Chapter 1).

Your turn!

In this task and many future tasks, we will make use of the IRIS database.

Connect to the IRIS website and navigate to its Search and Download page.
Scroll down to the filter option ‘Data Type’.
Click on ‘Data Type’ and browse through the different data types that are most commonly used in language-related research.

Data Type filter dropdown menu with the following visible categories: Oral production (168), Closed response format (157), Open response (105), Judgements (97), Written production (76), Reaction times (62), Qualitative (46) — Figure 2.1: Screenshot from the IRIS database search page (accessed on 17 April 2024)

Q2.1 For which kinds of studies could these different types of data have been collected? Think about both experimental and observational studies.

Q2.2 Which of these data types is most likely to be measured in milliseconds (ms)?

2.2 Types of research data

Given the wide range of methods used in language research, it is no surprise that there are so many different types of research data. Although the data types listed on the IRIS search page (see Figure 2.1 for an extract) are very broad and the categories not clearly defined, the list illustrates the breadth of research data types typically analysed in language studies.

The first data type category, “Oral production”, for instance, can equally refer to text transcriptions of language users’ oral production, audio, or video files. It can also refer to either raw data or to (more or less) processed data. For example, a transcript of a conversation could have been automatically annotated for part-of-speech, meaning that every word would be marked for its word class (e.g., This_DT is_VBZ not_RB raw_JJ text_NN data_NN ._PUNC), or it could have been manually anonymised by adding placeholders (e.g., Is <NAME> going out with <NAME>?) indicating that certain words have been redacted for data protection reasons.
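To make the annotation format more concrete, here is a minimal Python sketch that splits the part-of-speech-tagged example above into words and tags. It assumes that every token follows the word_TAG pattern shown in the example; real annotated corpora may use other conventions, and the variable names are purely illustrative.

```python
# The tagged example sentence from the text above.
tagged = "This_DT is_VBZ not_RB raw_JJ text_NN data_NN ._PUNC"

pairs = []
for token in tagged.split():
    # rpartition splits at the LAST underscore, so a word that itself
    # contains an underscore would still be kept in one piece.
    word, _, tag = token.rpartition("_")
    pairs.append((word, tag))

# Recover the unannotated words and pick out the tokens tagged as nouns (NN).
words = [word for word, tag in pairs]
nouns = [word for word, tag in pairs if tag == "NN"]

print(pairs)
print(" ".join(words))
print(nouns)
```

Separating words from tags in this way is what allows processed data to be converted back to something close to the raw text, or to be filtered by word class.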

The second most frequent data type category, “Closed response format”, includes different kinds of questionnaires and tests. Questionnaires may ask study participants to disclose personal information relevant to the research questions using single or multiple-choice questions, such as what language(s) they use at home, how long they have studied a language for, or how old they are. Tests may be designed to assess participants’ language competences (e.g., in the form of a vocabulary or grammar test), as well as other aspects relevant to the research questions being investigated (e.g., short-term memory or baseline reaction times).

In this book we will focus on the research processes that take place after the data have been collected. However, it is vital that we are aware of the conditions and context in which the data we are analysing were collected and pre-processed. It is no exaggeration to say that these steps in the research process can entirely change the results of the data analysis. Suppose we decide to compare the abilities of two groups of French L2 learners. To do this, we administered a language production test to two whole classes of secondary school students learning French as a second language using two different teaching methods. If one group had 15 minutes to complete the test and the other had up to 60 minutes, the results would not be comparable.

Your turn!

Q2.3 Which other reasons could potentially jeopardise the comparison of test results data from two different groups of pupils?

Whilst there are many ways to ensure that as many factors as possible are controlled for, not all can be controlled for. What is crucial is that all aspects of the data collection process are well documented so that all factors, whether controlled or not, can be taken into account when analysing the data.

In research, we usually distinguish between primary data, which are the data that you collected yourself, and secondary data, which are data that were collected by others. Hence if you were to carry out a new study based on data that you found on IRIS, you would be conducting a secondary data analysis. Especially when conducting secondary data analyses, it is crucial that we have enough information about the data itself, i.e. metadata. Metadata is crucial for finding, sharing, evaluating, and reusing datasets. Metadata can be generated automatically and stored within the data file. For example, unless this metadata was explicitly deleted or amended, Microsoft Word files typically contain metadata describing who created the file, when it was first created, and when the file was last modified. For some data and projects, it also makes sense to create separate metadata files that contain additional or more detailed information about the collected data.
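As an illustration of what a separate metadata file might contain, the following Python sketch builds a small metadata record and serialises it as JSON text. Every field name and value is a hypothetical placeholder loosely based on the French test example above; there is no single required structure, and real projects should follow the conventions of their repository or discipline.

```python
import json

# All field names and values below are hypothetical placeholders.
metadata = {
    "dataset": "example_test_results.csv",
    "collected_by": "<NAME>",
    "collection_period": "<START DATE> to <END DATE>",
    "participants": "two classes of secondary school learners of French",
    "time_allowed_minutes": {"group_A": 15, "group_B": 60},
    "notes": "Describe anonymisation and pre-processing steps here.",
}

# Serialise the record as human-readable JSON text.
metadata_text = json.dumps(metadata, indent=2, ensure_ascii=False)
print(metadata_text)

# Parsing the text again returns the same information.
restored = json.loads(metadata_text)
```

Saved next to the data file, a record like this would document factors such as the different time limits, so that anyone conducting a secondary analysis can take them into account.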

2.3 Data formats and file extensions

Different data types come in different data formats. For audio files, you may be familiar with the MP3 format, but this is by no means the only format in which audio files can be saved. Many other audio file formats exist, such as Waveform Audio File Format (WAVE) and Free Lossless Audio Codec (FLAC).

We can usually tell what format a file is in by looking at its file extension. The file extension is the suffix of the file name. It comes at the end of the file name and is preceded by a dot. The file extension of a WAVE file is .wav, whereas that of an MP3 file is .mp3; hence the file recording.wav is a WAVE file, whereas recording.mp3 is an MP3 file.
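The same idea can be expressed in a few lines of Python. This is only a sketch: it reads the extension from the file name and does not inspect the contents of any file, so a wrongly named file would not be detected.

```python
from pathlib import Path

# The two example names from the text, plus one name without an extension.
file_names = ["recording.wav", "recording.mp3", "recording"]

for name in file_names:
    # .suffix is the final extension including the dot, or "" if there is none.
    print(name, "->", repr(Path(name).suffix))

extensions = [Path(name).suffix for name in file_names]
```

The third name shows why hidden extensions are a problem: without the suffix, nothing in the name tells us which format the file is in.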

Your turn!

Q2.4 In which format are Microsoft Word files typically saved?

Q2.5 Which of these files are audio files?

Unfortunately, many modern operating systems have a tendency to hide file extensions by default. This results in the files recording.wav and recording.mp3 both being displayed as recording in File Finder/Explorer windows (compare Figure 2.2 (a) and Figure 2.2 (b)). This is misleading and can lead to all kinds of problems.

File Finder window showing 7 files with file names that do not include a file extension. — (a) Displaying file names without file extensions

File Finder window showing 7 files with file extensions such as .pdf, .html and .txt. — (b) Displaying file names with file extensions

To ensure that you can always see the extensions of the files on your computer in the File Explorer (on Windows) or the File Finder (on macOS), follow these instructions:

On Windows: https://www.howtogeek.com/205086/beginner-how-to-make-windows-show-file-extensions/.
On macOS: https://support.apple.com/en-gb/guide/mac-help/mchlp2304/mac (select the version of your operating system at the top of the page).

2.4 Sharing research data and materials

In line with the principles of Open Science (see Chapter 1), it is important to ensure that both the materials that were used to collect research data (e.g., questionnaire items, audio, image or video stimuli, language aptitude tests, etc.) and the data themselves are made openly available to the research community, whenever legally possible and ethically responsible. Sharing materials ensures that studies can be replicated, for example with new participants or in a different language. Sharing research data also allows independent researchers to reproduce the results of studies, allowing them to verify the reported results and to conduct additional analyses that may confirm, contradict, or extend the conclusions of the original studies.

You may be wondering how linguists and language education researchers can make their research data and materials publicly available. Table 2.1 provides a non-exhaustive list of public repositories where researchers can upload research data and materials (with figures collected in early June 2024¹). Some are specific to the language sciences, while others cater to all research disciplines. If you completed Task 2.1 in Section 2.1, you should already be familiar with at least one of these! 😉 All of the examples, tasks, and exercises in this book are based on research data and materials that researchers have made available in open access on one or more of these repositories.

Table 2.1: Non-exhaustive list of public repositories of research data and materials.

Repository	Discipline	Nb. of entries	Provides DOI	Online since
Dryad	All	60000	Yes	2008
Figshare	All	8000000	Yes	2012
HAL	All	5000000	No	2001
Harvard Dataverse	All	160000	Yes	2006
IRIS	Linguistics	3500	No	2011
Open Science Repository, OSF	All	153663	Yes	2012
Tromsø Repository of Language and Linguistics, TROLLing	Linguistics	4500	Yes	2014
Vivil	Clinical research	7000	Yes	2013
Zenodo	All	3750000	Yes	2013

In the following tasks, we will look at a study by Schimke et al. (2018) (see Figure 2.3 (a)), a publication that was awarded both the Open Data and the Open Materials badges (see Figure 2.3 (b)). This means that the research materials and data associated with this study can be found in an open, online repository:

“This article has been awarded Open Materials and Open Data badges. All materials and data are publicly accessible via the IRIS Repository at https://www.iris-database.org/iris/app/home/detail?id=york:934337. Learn more about the Open Practices badges from the Center for Open Science: https://osf.io/tvyxz/wiki” Schimke et al. (2018).

The authors could have chosen to upload their materials and data to any of the online repositories listed in Table 2.1 but, in this case, they chose IRIS.

Heading of PDF version of the article from the journal Language Learning: A Journal of Research in Language Studies. EMPIRICAL STUDY First Language Influence on Second Language Offline and Online Ambiguous Pronoun Resolution Sarah Schimke, Israel de la Fuente, Barbara Hemforth, and Saveria Colonna from University Münster, University of Lille/CNRS, CNRS/University of Paris Diderot, and University of Paris 8/CNRS — (a) Title page of Schimke et al. (2018)

The Open Data badge is blue and shows a simple bar plot; the Open Materials badge is yellow and shows an open cardboard box. — (b) Open Data and Open Materials badges

Among other results, Schimke et al. (2018) report on two eye-tracking experiments. One of these experiments involved Spanish-speaking participants listening to ambiguous sentences in Spanish whilst looking at images of Playmobil figures (see Figure 2.4 for an example).

Photo of two Playmobil figures. On the left, a street sweeper with a broom and, on the right, a postman with a trolley. — Figure 2.4: Image from Experiment 1 in Schimke et al. (2018)

Note 2.1: How did the experiment work?

In this eye-tracking experiment, participants were instructed to decide whether the sentences they heard matched the Playmobil images or not. Consider the following two sentences from the experiment:

El barrendero se encontró con el cartero antes de que recogiera las cartas.
[The street sweeper met the postman before he fetched the letters.]

El barrendero se encontró con el cartero antes de que recogiera la escoba.
[The street sweeper met the postman before he fetched the broom.]

Up until the point at which either las cartas [the letters] or la escoba [the broom] are heard, it is unclear who is doing the fetching. From a grammatical point of view, it could be either the street sweeper or the postman.

Participants were presented with Figure 2.4 as they were listening to either Sentence 1 or Sentence 2. At the same time, the researchers measured how long it took for the participants to look at the subject governing the verb recogiera. In other words, for Sentence 1, they were interested in how long it took participants to focus on the postman Playmobil figure and, for Sentence 2, on the street sweeper. Such fine measurements are made in milliseconds, i.e. in thousandths of a second, using a special eye-tracking device.

Your turn!

Imagine that you want to run an experiment similar to the one carried out in Schimke et al. (2018). You can reuse the Playmobil image files created by the researchers as they helpfully uploaded them to the IRIS database.

In which file format do you think the images are archived? To find out, go to the list of data and materials associated with the study in the IRIS database (the link is given in the quotation from the article above). There are four entries in the IRIS database that are associated with this study. Select the “Pictorial” entry, which contains the images. It allows you to download a ZIP file called Images_online.zip. ZIP is an archive file format that can contain one or more compressed files. Download this ZIP file.

Once the download has completed, navigate to the folder where the file was saved on your computer and unzip the file, i.e., decompress it and extract its contents:

On Windows, double-click the .zip file, then:
- select ‘Extract All’,
- select a folder,
- and then click ‘Extract’.
On macOS, simply double-click the .zip file to unzip it.
On the Linux command line, use the command unzip followed by the name of the file to unzip it.
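Unzipping can also be scripted, which is handy when many archives need to be extracted in the same way. The sketch below uses Python's built-in zipfile module. So that it runs on any computer, it first creates a tiny stand-in archive in a temporary folder; the names demo.zip and Demo/readme.txt are invented for this illustration and have nothing to do with the contents of Images_online.zip.

```python
import tempfile
import zipfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)

    # Build a small stand-in archive (hypothetical file names).
    archive_path = tmp_dir / "demo.zip"
    with zipfile.ZipFile(archive_path, "w") as archive:
        archive.writestr("Demo/readme.txt", "placeholder content")

    # List the archive's contents, then extract everything into a folder.
    target = tmp_dir / "unzipped"
    with zipfile.ZipFile(archive_path) as archive:
        contents = archive.namelist()
        archive.extractall(target)

    # Collect the extracted files as paths relative to the target folder.
    extracted = sorted(
        p.relative_to(target).as_posix() for p in target.rglob("*") if p.is_file()
    )

print(contents)
print(extracted)
```

To apply the same steps to a downloaded archive, you would replace archive_path with the location of that file and target with the folder in which you want the extracted files to appear.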

You should find that the ZIP file contains a folder entitled ‘Images’, which contains 58 pictures of different combinations of Playmobil figures that correspond to the experiment’s stimulus sentences.

Q2.6 In which file format are these Playmobil image files?

Image files typically contain metadata that is embedded in the image files themselves. This metadata may include the dimensions of the image and its colour profile. To view this metadata, right-click on one of the image files that you have extracted from the ZIP file and select the option to get more information about the file, e.g., “Get Info” or “Properties”.

Q2.7 How wide are these Playmobil images in pixels?

2.5 Working with tabular data

The measurements made by the eye-tracking device in Schimke et al. (2018)’s eye-tracking experiments were stored in the form of tables. Table 2.2 is an extract of a table that contains processed eye-tracking data from Schimke et al. (2018). It forms part of the study’s supplementary materials and can also be downloaded from the IRIS database.

In this table, each row corresponds to the data associated with one participant’s eye movements while listening to a single stimulus sentence and looking at the corresponding Playmobil image (e.g., Figure 2.4). The extract displayed as Table 2.2 only shows the data associated with the first six stimulus sentences (items) that participant “s1”, a Spanish L2 learner, listened to. The columns crit1, crit2 and crit3 contain values derived from the measurements made using the eye-tracking device.² From Table 2.2, we can also see that participant “s1” was 19 years old when they started formally learning Spanish (AoO stands for “age of onset of formal instruction”) and that they were 20 when the experiment was conducted.

Table 2.2: Extract of table containing eye-tracking data from Schimke et al. (2018)'s appendix

language	subject	disambiguation	item	crit1	crit2	crit3	AoO	age
S	s1	1	1	0.3451355	-0.5618789	0.7036070	19	20
S	s1	2	2	-0.2679332	-1.5849625	0.1852149	19	20
S	s1	1	3	-1.1563420	0.9898042	-1.5849625	19	20
S	s1	2	4	-1.5849625	-0.0874628	-1.5849625	19	20
S	s1	1	5	1.5849625	0.1831223	1.5849625	19	20
S	s1	2	6	-0.7824086	-0.8548021	-1.1758498	19	20

When working with data, tables are ubiquitous. Data stored in tables are called tabular data. Hence, learning to work with tabular data is a crucial data literacy skill.

In the language sciences, the results of most studies (whether experimental or corpus studies) are stored in tables. For example, when researchers conduct an online survey, the data collected by the online survey platform (e.g., Qualtrics, SoSci, SurveyMonkey) are automatically stored in the form of one or more table(s). These can then be exported from the survey platform in various tabular file formats (e.g., .csv, .json, .xlsx).

In some cases, data may be collected by analogue means, e.g., by getting participants to answer a paper questionnaire or collecting school children’s work on paper. However, for quantitative analysis, analogue research data must first be digitised. The digitised data are then typically stored as text files in file formats such as .txt or .csv.

2.5.1 Delimiter-separated values (DSV) files

Tables can be stored in many data formats but the simplest and most widely used in linguistic research are text files with delimiter-separated values (DSV). For sharing and archiving research data, DSV files are favoured over formats specific to proprietary software such as .xlsx (Microsoft Excel files) or .numbers (Apple Numbers files). This is because DSV files can be “understood” by many different programs and on all operating systems. The fact that they are simple text files means that we will also be able to reliably read them in the future, even if programs such as Excel or Numbers have evolved or have been discontinued. Reliability and compatibility are fundamental to maintaining the integrity of research data and ensuring that data can be reused, even in the distant future.

In DSV files, each value (e.g., measurement or response) is separated by a specific separator character. In principle, any character can be used to separate values, but the most common separators are the comma (,), tab (\t), colon (:), and semicolon (;). Below is the .csv file corresponding to Table 2.1.

Repository,Discipline,Nb. of entries,Provides DOI,Online since
Dryad,All,60000,Yes,2008
Figshare,All,8000000,Yes,2012
HAL,All,5000000,No,2001
Harvard Dataverse,All,160000,Yes,2006
IRIS,Linguistics,3500,No,2011
"Open Science Repository, OSF",All,153663,Yes,2012
"Tromsø Repository of Language and Linguistics,TROLLing",Linguistics,4500,Yes,2014
Vivil,Clinical research,7000,Yes,2013
Zenodo,All,3750000,Yes,2013

As you can see, the values are separated by commas.³ Additionally, some of the values are enclosed in, or delimited by, double quotation marks ("). This prevents any commas that may occur within an actual field value, e.g., the comma in the field Open Science Repository, OSF, from being interpreted as a separator character.
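The role of the quotation marks becomes clearer when we compare two ways of reading the same line. The following Python sketch uses three lines copied from the .csv file above. It contrasts Python's built-in csv module, which follows the quoting convention, with a naive split at every comma; the variable names are illustrative only.

```python
import csv
import io

# Header and two data lines copied from the .csv file above.
csv_text = '''Repository,Discipline,Nb. of entries,Provides DOI,Online since
IRIS,Linguistics,3500,No,2011
"Open Science Repository, OSF",All,153663,Yes,2012
'''

# csv.reader treats a comma inside double quotation marks as part of the value.
rows = list(csv.reader(io.StringIO(csv_text)))

# Splitting at every comma ignores the quotation marks and breaks the field.
naive = csv_text.splitlines()[2].split(",")

print(rows[2])   # five values, as in the header
print(naive)     # six pieces: the repository name has been cut in two
```

Because the naive split produces six pieces for a table with five columns, every value after the first would end up in the wrong column. This is exactly what the quotation marks are there to prevent.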

Given that DSV files are text files, it is possible to open them in a free plain-text editor (e.g., Notepad++ or BBEdit) or a text-processing program (e.g., Microsoft Word or LibreOffice Writer). However, these programs will typically display DSV files as in Figure 2.5.

A screenshot showing a CSV file opened in Microsoft Word. It looks like a long list of words separated by commas. — Figure 2.5: The `.csv` file corresponding to Table 2.1 opened in Microsoft Word

We can probably agree that what we are seeing in Figure 2.5 is not a very reader-friendly way to display tabular data! This is why DSV files are more often opened in spreadsheet programs (e.g., LibreOffice Calc, Google Sheets, Microsoft Excel, Numbers) than in text-editing programs. Let’s find out how in the next section.

2.6 A word of warning about spreadsheet programs ⚠️

You should be aware that opening DSV files in spreadsheet programs can corrupt the files! Once a file is corrupted, it is often not possible to retrieve the original data so this is very bad news, indeed. Such problems are particularly frequent when opening DSV files with Microsoft Excel and Google Sheets. This is because the default settings in these programs surreptitiously modify files upon opening.

These ‘auto-format’ modifications include replacing certain values by dates (e.g., changing 3-4 to March, 4th) or numbers (e.g., changing 1.23E5 to 123000)⁴, removing leading zeros (e.g., changing 001 to 1), or misinterpreting certain characters (e.g., the value -ism will generate an error because the hyphen is interpreted as a minus sign).
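The following Python sketch imitates, in a very simplified way, what happens when values are silently converted. It does not reproduce the internal behaviour of Excel or Google Sheets; it only shows that reading values as plain text leaves them untouched, whereas converting them to numbers loses information. The two-column table is made up for this illustration, using values from the examples above.

```python
import csv
import io

# A made-up table containing values that are prone to auto-formatting.
raw_text = "id,value\n001,1.23E5\n002,-ism\n"

rows = list(csv.DictReader(io.StringIO(raw_text)))

# Read as text: every value is kept exactly as it was typed.
ids_as_text = [row["id"] for row in rows]

# Converted to numbers: leading zeros disappear and the scientific
# notation is replaced by the expanded number.
ids_as_numbers = [int(row["id"]) for row in rows]
expanded = float(rows[0]["value"])

print(ids_as_text)
print(ids_as_numbers)
print(expanded)
print(rows[1]["value"])
```

Once a file has been saved with the converted values, there is no way of telling from the number 1 alone whether the original entry was 1, 01, or 001.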

Not only can these auto-format modifications lead to inaccurate data analysis but, in the worst of cases, they can even cause data loss. The crux of the problem is that often users do not realise what the program has done in the background. How bad can this be? Find out by completing the task below.

Your turn!

In this task, you will find out how genetics researchers who use spreadsheets for their analyses regularly have their data so badly damaged that it affects the results of their publication. Though we have no statistics on how spreadsheet errors affect the work of linguists, it is (unfortunately) very likely to be just as bad as in genetics.

Ziemann, Eren & El-Osta (2016) reported that a fifth of genetics publications with supplementary .xls or .xlsx files with gene lists contained errors caused by Excel’s auto-formatting behaviour. The results of this study shocked the research community and a report about it went viral. Follow the link below to read the open-access article “Gene name errors: Lessons not learned” by Abeysooriya et al. (2021) to find out whether the situation has improved since 2016 and answer the questions below.

Abeysooriya, Mandhri, Megan Soria, Mary Sravya Kasu & Mark Ziemann. 2021. Gene name errors: Lessons not learned. PLOS Computational Biology. Public Library of Science 17(7). e1008984. https://doi.org/10.1371/journal.pcbi.1008984.

Q2.12 Has the proportion of genetics publications with Excel gene lists affected by auto-formatting errors decreased since 2016?

Q2.13 Does using LibreOffice Calc (see Section 1.2) also cause these same issues?

Q2.14 Did highly reputable journals publish fewer articles with erroneous Excel gene lists?

It is worth noting that, for some Windows users, these auto-formatting issues can corrupt files that they have never actively opened in Excel! 🤯 This happens when Windows applies Excel’s default settings to all CSV files, regardless of what program they are actually opened with. To ensure that this does not happen to you, check that Excel is definitely not your default app to open .csv and .tsv files (see below for instructions).

Opening a .csv or .tsv file in LibreOffice from a File Finder/Explorer window

Remember that, to open a .csv or .tsv file on your computer, you should never ever double-click on it and let the default program open it! As we saw in Section 2.6, this can break or ‘corrupt’ the file. To avoid accidentally double-clicking on a .csv or .tsv file and having the file corrupted, I recommend making either LibreOffice or a plain-text editor (e.g., Notepad++ or BBEdit) your default application for opening such files.

On macOS, you can change the default application used to open files with a given file extension by right-clicking a file with this extension and then selecting ‘Get Info’ (Figure 2.9 (a)). In the example below, Numbers is the default application for all .csv files (see Figure 2.9 (b)). In the dropdown menu ‘Open with:’, you can then select LibreOffice (provided you have installed it beforehand!) and finally click on ‘Change All…’ (Figure 2.9 (c)). You will be asked to confirm your choice.

A screenshot showing the context menu of a macOS file that can be accessed with a right click. The drop-down option 'Get Info' is highlighted. — (a)

A screenshot showing the info dialog of a macOS File Finder window with the 'Open with' section highlighted. It shows that the Numbers app is the default to open such files on this particular computer. — (b)

If your operating system is Windows, you should look in your Windows’ settings for the option ‘Default Apps’ (see Figure 2.10).

Screenshot of the menu showing the location of the default apps setting in the Windows settings. It can be found under the section 'Apps'. — Figure 2.10: Default apps in Windows settings

In the next step, select ‘Choose default apps by file type’. Here, you can search for .csv as a file type, and choose which program you want to set as the default program for opening .csv files. If Excel is currently your default (as in Figure 2.11 (a)), you can click on Excel and choose a different program. LibreOffice is a sensible, open-source alternative (see Figure 2.11 (b)). A plain-text editor such as Notepad would also be fine (also listed on Figure 2.11 (b)).

Screenshot of the default setting for the file type '.csv' — (a) Excel as the default programme for `.csv` files

Screenshot of the selection dialogue for default apps. — (b) Selecting a different default programme for `.csv` files

If it is not possible to adjust the default app settings, either due to insufficient permissions or because you only have temporary access to this PC, do not open .csv or .tsv files with the default program. Instead, right-click on the file name and, using the ‘Open with’ option, select the option to open the file with LibreOffice, if available, or else with a plain-text editor.

Check your progress 🌟


Are you confident that you can…?

Distinguish different types of research data (Section 2.2)
Find and download openly available research data and materials (Section 2.4)
Distinguish different data formats using the file extensions (Section 2.3)
Open delimiter-separated values (DSV) files in LibreOffice Calc (Section 2.5.1) - (Section 2.5.2)
Explain the risks of opening DSV files in Microsoft Excel and Google Sheets (Section 2.6)

In the following chapter, you will learn how to name, save, and back up research data files so as to facilitate sound data analysis.
