4.4.2 Open Science statement
This page displays Section 4.4.2 from the Author Accepted Manuscript (AAM) version of the book: https://osf.io/jpxae/. Please cite the Version of Record: https://benjamins.com/catalog/scl.116.
Among the wealth of Textbook English publications summarised in Chapter 3 (see also Appendix A), very few have included the data and, where relevant, the code necessary to reproduce or replicate the findings that they report (thereby reflecting current sharing practices in linguistics more broadly, see Bochynska et al. 2023).1
Although the terms are sometimes used interchangeably (see Parsons et al. 2022 for a comprehensive glossary of Open Science terminology), ‘reproducibility’ is used here to refer to the ability to obtain the same results using the researchers’ original data and code, whilst ‘replicability’2 entails repeating a study and obtaining compatible results with different data analysed with either the same or different methods (Berez-Kroeker et al. 2018: 4; Porte & McManus 2018: 6–7). Not only does a failure to share data and materials mean that published results are not reproducible, thereby making it difficult to assess their reliability; it also makes it very difficult to replicate the results and thus to gauge the extent to which they are generalisable, e.g., across a different set of EFL textbooks used in a different educational context (see also Le Foll 2024a; McManus).
A major barrier to the reproducibility of (corpus) linguistic research is that it is often not possible, for copyright or (when participants are involved) data protection reasons, to make linguistic data available to the wider public. However, both research practice and the impact of our research can already be greatly improved if we publish our code or, when using GUI software, methods sections detailed enough for an independent researcher to repeat the full procedure exactly. If this is done, it becomes possible to conduct detailed reviews of our methodologies and to replicate the effects reported in the published literature using different data.
Aside from data protection and copyright restrictions, there are, of course, many more reasons why researchers may be reluctant to share their data and code (see, e.g., Al-Hoorie & Marsden; Gomes et al. 2022). It is not within the scope of this monograph to discuss these; however, it is important to acknowledge that, in many ways, such transparency makes us vulnerable. At the end of the day: to err is human. Yet, the risks involved in committing to Open Science practices are particularly tangible for researchers working on individual projects, like me, who have had no formal training in project management, programming, or versioning, and have therefore had to learn “on the job”. Nonetheless, I am convinced that the advantages outweigh the risks. Striving for transparency helps both the researchers themselves and others reviewing the work to spot and address problems. As a result, the research community can build on both the mishaps and successes of previous research, thus improving the efficiency of research processes and ultimately advancing scientific progress.
It is with this in mind that I have decided, whenever possible, to publish the data and code necessary to reproduce the results reported in the present monograph following the FAIR principles (i.e., ensuring that research materials are Findable, Accessible, Interoperable and Reusable, see Wilkinson et al. 2016). For copyright reasons, the corpora themselves cannot be made available. However, the full, unedited tabular outputs of the tool used for automatic corpus annotation (the MFTE Perl; see 5.3.2 and Appendix C) are published in the Online Supplements. Together with the commented data analysis scripts also published in the Online Supplements, as well as in the associated Open Science Framework (OSF) repository, these tables allow for the computational reproduction of all of the results and plots discussed in the following chapters.
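By way of illustration, the following is a hypothetical minimal sketch of how one of these tables could be read into R for reanalysis; the file name and separator are placeholders rather than the actual files published in the Online Supplements:

    # Hypothetical example only: the file name below is a placeholder, not the
    # actual name of the MFTE output tables in the Online Supplements / OSF repository.
    mfte_counts <- read.delim("mfte_output_counts.tsv")   # base R, tab-separated input assumed
    str(mfte_counts)   # inspect the feature columns before any further processing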
In describing the study’s methodology, I strive for maximum transparency by reporting how each sample size was determined and on what grounds variables and data points were excluded, manipulated and/or transformed. Most of these operations were conducted in the open-source programming language and environment R (R Core Team 2022). The annotated data processing and analysis scripts have been rendered to HTML pages (viewable in the Online Supplements), thus allowing researchers to review the procedures followed without necessarily installing all the required packages and running the code themselves. Furthermore, these scripts include additional analyses, tables, and plots that were produced as part of this study but which, for reasons of space, are not reported on in detail here. Whenever data, packages or other open-source scripts from other researchers were used, links to these are also provided in the Online Supplements (in addition to the corresponding references in the bibliography). To reproduce the R analyses, use the renv::restore() command to ensure that you are using the correct package versions (Ushey & Wickham 2023). For full reproducibility, it may be necessary to use rig (an R installation manager) to run the code in R v. 4.3.1.
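For readers who wish to rerun the analyses themselves, the following minimal sketch illustrates this workflow; the rig commands are shown only as one possible way of obtaining R v. 4.3.1 and are not part of the published scripts:

    # Optionally, install and select R 4.3.1 with rig (https://github.com/r-lib/rig),
    # e.g. from a shell:
    #   rig add 4.3.1          # install R 4.3.1
    #   rig default 4.3.1      # make it the default R version
    # Then, from the root of the downloaded project, in an R session:
    if (!requireNamespace("renv", quietly = TRUE)) install.packages("renv")
    renv::restore()   # reinstall the package versions recorded in the project's renv.lock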
1. This is also true of my own earlier work on the language of EFL textbooks (Le Foll 2021; 2022a; 2022b). More recent work conducted as part of this project, however, was published alongside the data and code (Le Foll 2022c; 2023; 2024b).
2. Confusingly, other terms are also frequently used to refer to the same or related concepts, e.g., repeatability, robustness and generalisability (see, e.g., Belz et al. 2021: 2–3; Parsons et al. 2022).