Data Analysis for the Language Sciences

A very gentle introduction to statistics and data visualisation in R

Author

Elen Le Foll

Published

12 January 2026

Last updated on
17 June 2026

Preface

Note

This open textbook is currently under review. The main part consists of 15 content chapters, each featuring hands-on tasks and interactive quizzes to immediately apply the content of the chapters to real linguistic datasets and research questions.

The second part of the textbook (online only) is a growing collection of case-study chapters written by students of my data analysis seminars at the University of Cologne. The case studies consist of computational reproductions of published linguistics studies based on the authors’ original data, as well as some additional statistical analyses and data visualisations by the students. These chapters are valuable learning resources as the student authors document all their steps in the form of reproducible Quarto documents and explain in great detail how they proceeded.

Both student and colleague feedback on this draft is most welcome! ✉️️

What is this book about?

This textbook is intended as a hands-on introduction to data management, statistics, and data visualisation for students and researchers in the language sciences. It relies exclusively on freely accessible, open-source tools, focusing primarily on the programming language and environment R.

It is often claimed that learning R is “not for everyone”, or that it has “a steep learning curve”. This textbook aims to prove that the opposite is true. There are many reasons why it is worth investing the time and effort to learn how to do research in R, and it is no more difficult than learning any other new skill. In fact, the results of a recent study suggest that language aptitude is a much stronger predictor of programming aptitude than numeracy (i.e., “being good at numbers”) (Prat et al. 2020). So if you have successfully learnt a foreign language in the past, there is no reason why you shouldn’t succeed in learning a programming language!

Learning R is like learning a foreign language. If you enjoy learning languages, then ‘R’ is just another one. […] You have to learn vocabulary, grammar and syntax. Similar to learning a new language, programming languages also have steep learning curves and require quite some commitment. (Dauber 2024)

The rationale for this textbook is based on my personal observations, in both teaching and consulting, that many ‘introductory’ textbooks on statistics and/or R are not suitable for many humanities students and researchers, who typically have little to no prior programming experience and for whom the word “statistics” often evokes little more than unpleasant memories of school mathematics. It is worth stressing that this is not a matter of generation (I have observed this phenomenon across all age groups), intelligence (I have taught people far more intelligent than me), or an innate inability to deal with numbers and/or computers (although these are beliefs that, sadly, some have deeply internalised). Instead, I am convinced that, for many people, it is simply a matter of finding a sturdy first stepping stone and gathering up the courage to step on it to begin this learning journey.

The aim of this textbook is by no means to replace any of the brilliant existing textbooks aimed at imparting statistical literacy for linguistics research, but rather to provide a stepping stone towards these wonderful resources.¹

Who is this book for?

This book is aimed at students and researchers in the language sciences, including (applied) linguistics, (first and second) language teaching, and language education research. All examples are taken from these research areas. Ultimately, however, this textbook may be of use to anyone who feels they could benefit from a maximally accessible stepping stone, whichever discipline they come from.

How to use this book

The online version of this textbook features a dark mode and a reader mode. Both can be accessed via buttons located in the top left corner of the webpage, just below the short title.

This textbook is intended to be read linearly, chapter by chapter. Apart from the first introductory chapters, all other chapters require several hours of commitment. From Chapter 6 onwards, all examples come from the same dataset and build on each other.

The chapters are interspersed with interactive quiz questions. These will often require you to complete short practical tasks. Completing these tasks is essential to fully assimilate the textbook’s contents. You may attempt and repeat them as often as you like. I recommend that you complete these “Your turn!” sections as you go along but, if you find that they are disturbing your reading flow at any stage, you can click on the green “Your turn!” header to temporarily hide them. The online version of the textbook also includes an appendix that repeats all of the quiz questions for those of you who prefer to complete the “Your turn!” tasks at the end of each chapter. Whichever order you opt for, I highly recommend that you take the time to complete these tasks. That’s because the best way to learn a new set of skills is to try things out: so, with this in mind, let’s get cracking!

About the author

I started learning about statistics and R in 2017 when I realised that I would need them to conduct the kind of quantitative analyses that I wanted to do as part of my PhD in applied linguistics/English language teaching (Le Foll 2022). I had no previous experience in either and there were no such courses at my university. Even though I mostly learnt by myself, it would be incorrect to say that I am self-taught: I learnt from some of the resources listed in the appendix of next-step resources, attended bootcamps and summer schools, read countless posts on Stack Overflow and various blogs, and exchanged ideas with like-minded people on social media (#Rstats, #dataviz, #TidyTuesday). This is why it is probably fairer to say that I am community-taught.

I now like to describe myself as an “advanced beginner” in R and statistics. I am not a programmer, nor a statistician, but rather an applied linguist and committed educator. I enjoy teaching data literacy, statistics, and data visualisation to current and future generations of linguists, language education scholars, and teachers. I teach regular methods courses at the University of Cologne that are attended not just by M.A. and M.Ed. students, but also by some doctoral and post-doctoral researcher colleagues. In addition, I teach workshops for both doctoral and post-doctoral researchers at other institutions on a freelance basis.

This textbook was partly designed on the basis of materials that I have developed for these courses and workshops. Publishing these materials is my way to contribute to the wonderful community of people who have helped me on my leaRning journey. 🤗️

Photo of a white woman with glasses in her 30s smiling, wearing a floral top and a conference badge. She is standing in front of an academic poster entitled: Textbook English: A Corpus-Based Analysis of Language Use in EFL Textbooks. — Me back in 2017, proudly presenting at my first international conference.²

Acknowledgements 💜

This textbook has benefited greatly from the generous, critical feedback I have received from both novice and expert users of R throughout this project. Many thanks to my colleagues Nick Bearman, Marie Flesch, Ben Golub, Fritjof Lammers, Akira Murakami, and Sonja Eisenbeiß for their friendly critical peer review and to my (former) students at the University of Cologne, Jan Hollmann, Rose Hörsting, Tiziana Ilie, Vishar Kavehamoli, Marie Klünter, Vijaya Lakshmi, Fiona Maier, Jasmin Meinert, Paula Raabe, Gina Reinhard, Matteo Schmelzer, Poppy Siahaan, Veronika Strobl, Clara Stumm, Katja Wiesner, Ali Yıldız, and Isabel Zimmer, for their highly valuable learner feedback on earlier drafts of various chapters of this textbook.

Special thanks also go out to the researchers whose works are used as case studies in this textbook, in particular Sarah Schimke and Ewa Dąbrowska, and to Allison Horst whose beautiful and witty artworks illustrate many of the chapters of this textbook (e.g. Figure 1).

In addition, I would like to thank everyone who has contributed to my own data analysis learning journey. At the risk of forgetting someone, I would like to extend special thanks to Vaclav Brezina, Guillaume Desagulier, Stephanie Evert, Stefan Gries, Daniël Lakens, Natalia Levshina, Luke Tudge, Bodo Winter, the R-Ladies community, the developers and maintainers of all the R packages that I use, as well as the many generous contributors to online forums such as Stack Overflow and to the #Rstats community on social media.

Get in touch! 📩️

If (parts of) this textbook helped you on your leaRning journey or in your teaching, do drop me a line to let me know!

If you’ve spotted an error or if you have any other suggestion to improve this resource, I would love to hear from you, too. ✉️️

A word about the license 🔓️

The online version of this textbook is published under a CC BY-NC-SA license, which means that these materials can be shared and adapted for free as long as the original source is cited, the use is non-commercial, and any adaptations are shared under the same or a compatible license.

I have chosen this license because I explicitly object to tech companies scraping my work and reselling (mashed-up versions of) it as “AI”. If you work for a for-profit educational institution or are a freelance trainer who would like to use (parts of) this textbook, please send me a brief e-mail to explain your context of (re-)use; I will most likely be very happy to grant you permission to do so.

How to cite this textbook

Please cite the current web version of the textbook as:

Le Foll, Elen. 2026. Data Analysis for the Language Sciences: A very gentle introduction to statistics and data visualisation in R. Open Educational Resource. https://elenlefoll.github.io/RstatsTextbook/ (accessed DATE).

To cite a specific passage, please quote the corresponding chapter or section number(s). Note that the case-study chapters include their own “how to cite” text boxes.

Header text saying 'R learners' above five friendly-looking monsters holding up signs that together read “we believe in you.” — Figure 1: Artwork encouraging beginner `R` learners by @allison_horst CC-BY 4.0.

¹ A list of next-step resources can be found in the Appendix.

² I chose this picture because I vividly remember two professors pointing out that I had written “p = 0.00” on my poster (which I had copied and pasted from the output of the statistics software that I had used) and laughing among themselves, but well within earshot, at how stupid that was. Learning quantitative data analysis skills certainly requires a lot of effort on the part of the learner, but it also requires an academic culture that strives to include rather than exclude. This textbook explicitly aims for an inclusive approach to teaching the basics of data analysis in R and I have included this photo as a reminder to always persevere, whether in the face of seemingly insurmountable error messages or snarky remarks! For those of you who are curious, a p-value can never equal exactly zero. But p-values can be so close to zero that software rounds them to 0.00. In such cases, it is standard practice to report p < 0.001.