4.4.2 Open Science statement
This page displays Section 4.4.2 from the Author Accepted Manuscript (AAM) version of the book: https://osf.io/jpxae/. Please cite the Version of Record: https://benjamins.com/catalog/scl.116.
Among the wealth of Textbook English publications summarised in Chapter 3 (see also Appendix A), very few have included the data and, where relevant, the code necessary to reproduce or replicate the findings that they report (thereby reflecting current sharing practices in linguistics more broadly, see Bochynska et al. 2023).1
Although the terms are sometimes used interchangeably (see Parsons et al. 2022 for a comprehensive glossary of Open Science terminology), ‘reproducibility’ is used here to refer to the ability to obtain the same results using the researchers’ original data and code, whilst ‘replicability’2 entails repeating a study and obtaining compatible results with different data analysed with either the same or different methods (Berez-Kroeker et al. 2018: 4; Porte & McManus 2018: 6–7). Not only does a failure to share data and materials mean that published results are not reproducible, thereby making it difficult to assess their reliability; it also makes it very difficult to replicate the results and thus to gauge the extent to which they are generalisable, e.g., across a different set of EFL textbooks used in a different educational context (see also Le Foll 2024a; McManus).
A major barrier to the reproducibility of (corpus) linguistic research is that it is often not possible, for copyright or (when participants are involved) data protection reasons, to make linguistic data available to the wider public. However, both research practice and the impact of our research can already be greatly improved if we publish our code or, when using GUI software, methods sections detailed enough for an independent researcher to repeat the full procedure exactly. If this is done, it becomes possible to conduct detailed reviews of our methodologies and to replicate the effects reported in the published literature using different data.
Aside from data protection and copyright restrictions, there are, of course, many more reasons why researchers may be reluctant to share their data and code (see, e.g., Al-Hoorie & Marsden; Gomes et al. 2022). It is not within the scope of this monograph to discuss these; however, it is important to acknowledge that, in many ways, such transparency makes us vulnerable. At the end of the day: to err is human. Yet, the risks involved in committing to Open Science practices are particularly tangible for researchers working on individual projects, like me, who have had no formal training in project management, programming, or versioning, and have therefore had to learn “on the job”. Nonetheless, I am convinced that the advantages outweigh the risks. Striving for transparency helps both the researchers themselves and others reviewing the work to spot and address problems. As a result, the research community can build on both the mishaps and successes of previous research, thus improving the efficiency of research processes and ultimately advancing scientific progress.
It is with this in mind that I have decided, whenever possible, to publish the data and code necessary to reproduce the results reported in the present monograph following the FAIR principles (i.e., ensuring that research materials are Findable, Accessible, Interoperable and Reusable, see Wilkinson et al. 2016). For copyright reasons, the corpora themselves cannot be made available. However, the full, unedited tabular outputs of the tool used for automatic corpus annotation (the MFTE Perl; see 5.3.2 and Appendix C) are published in the Online Supplements. Together with the commented data analysis scripts also published in the Online Supplements, as well as in the associated Open Science Framework (OSF) repository, these tables allow for the computational reproduction of all of the results and plots discussed in the following chapters.
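By way of illustration, the following is a hypothetical minimal sketch of how one of these tables could be read into R for reanalysis; the file name and separator are placeholders rather than the actual files published in the Online Supplements:

    # Hypothetical example only: the file name below is a placeholder, not the
    # actual name of the MFTE output tables in the Online Supplements / OSF repository.
    mfte_counts <- read.delim("mfte_output_counts.tsv")   # base R, tab-separated input assumed
    str(mfte_counts)   # inspect the feature columns before any further processing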
In describing the study’s methodology, I strive for maximum transparency by reporting how each sample size was determined and on what grounds variables and data points were excluded, manipulated and/or transformed. Most of these operations were conducted in the open-source programming language and environment R (R Core Team 2022). The annotated data processing and analysis scripts have been rendered to HTML pages (viewable in the Online Supplements), thus allowing researchers to review the procedures followed without necessarily installing all the required packages and running the code themselves. Furthermore, these scripts include additional analyses, tables, and plots that were produced as part of this study but which, for reasons of space, are not reported on in detail here. Whenever data, packages or other open-source scripts from other researchers were used, links to these are also provided in the Online Supplements (in addition to the corresponding references in the bibliography). To reproduce the R analyses, use the renv::restore() command to ensure that you are using the correct package versions (Ushey & Wickham 2023). For full reproducibility, it may be necessary to use rig (an R installation manager) to run the code in R v. 4.3.1.
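For readers who wish to rerun the analyses themselves, the following minimal sketch illustrates this workflow; the rig commands are shown only as one possible way of obtaining R v. 4.3.1 and are not part of the published scripts:

    # Optionally, install and select R 4.3.1 with rig (https://github.com/r-lib/rig),
    # e.g. from a shell:
    #   rig add 4.3.1          # install R 4.3.1
    #   rig default 4.3.1      # make it the default R version
    # Then, from the root of the downloaded project, in an R session:
    if (!requireNamespace("renv", quietly = TRUE)) install.packages("renv")
    renv::restore()   # reinstall the package versions recorded in the project's renv.lock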
1. This is also true of my own earlier work on the language of EFL textbooks (Le Foll 2021; 2022a; 2022b). More recent work conducted as part of this project, however, was published alongside the data and code (Le Foll 2022c; 2023; 2024b).
2. Confusingly, other terms are also frequently used to refer to the same or related concepts, e.g., repeatability, robustness and generalisability (see, e.g., Belz et al. 2021: 2–3; Parsons et al. 2022).