Data Analysis for the Language Sciences

A very gentle introduction to statistics and data visualisation in R

Author

Elen Le Foll

Published

January 15, 2025

Preface

Warning

This open textbook is very much work in progress. The plan is to have ca. 13 content chapters in total, plus a number of case-study chapters (co-)written by students.

This draft is primarily intended as preparation/revision materials for the linguistics methods classes that I regularly teach at the University of Cologne, as well as R and statistics workshops for pre- and post-doctoral researchers that I occasionally give at other higher education institutions.

Student and colleague feedback on this draft is very welcome! ✉️

What is this book about?

This textbook is intended as a hands-on introduction to data management, statistics, and data visualisation for students and researchers in the language sciences. It relies exclusively on freely accessible, open-source tools, focusing primarily on the programming language and environment R.

It is often claimed that learning R is “not for everyone”, or that it has “a steep learning curve”. This textbook aims to prove that the opposite is true. There are many reasons why it is worth investing the time and effort to learn how to do research in R, and it is no more difficult than learning any other new skill. In fact, the results of a recent study suggests that language aptitude is a much stronger predictor of programming aptitude than numeracy (i.e., “being good at numbers”) (Prat et al. 2020). So if you have successfully learnt a foreign language in the past, there is no reason why you shouldn’t succeed in learning a programming language!

Learning R is like learning a foreign language. If you enjoy learning languages, then ‘R’ is just another one. […] You have to learn vocabulary, grammar and syntax. Similar to learning a new language, programming languages also have steep learning curves and require quite some commitment. (Dauber 2024)

The rationale for this textbook is based on my personal observations, in both teaching and consulting, that many ‘introductory’ textbooks to statistics and/or R are not suitable for many humanities scholars, who typically have little to no prior programming experience and for whom the word “statistics” often evokes little more than unpleasant memories of school mathematics. It is worth stressing that is not a matter of generation (I have observed this phenomenon across all age groups), intelligence (I have taught people far more intelligent than me), or an innate inability to deal with numbers and/or computers (although these are beliefs that, sadly, some have deeply internalised). Instead, I am convinced that, for many people, it is simply a matter of finding a sturdy, first stepping stone and gathering up the courage to step on it to begin this learning journey.

The aim of this textbook is by no means to replace any of the brilliant, existing textbooks aimed at imparting statistical literacy for linguistics research, but rather to provide a stepping stone to be able to access these wonderful resources.1

Who is this book for?

The target audience for this book are students and researchers in the language sciences, including (applied) linguistics, (first and second) language teaching, and language education research. All examples are taken from these research areas. Ultimately, however, this textbook may be of use to anyone who feels they could benefit from a maximally accessible stepping stone, whichever discipline they are coming from.

This textbook is intended to be read linearly, chapter by chapter. Apart from the first introductory chapter, all other chapters will require several hours of commitment. They include quiz questions and short practical tasks. Completing these tasks is essential to genuinely assimilate the textbook’s contents. That’s because the best way to learn new skills is to try things out. So, with this in mind, let’s get cracking!

Header text saying 'R learners' above five friendly-looking monsters holding up signs that together read “we believe in you.”
Figure 1: Artwork encouraging beginner R learners by @allison_horst.

About the author

I started learning about statistics and R in 2017 when I realised that it would be important for me to conduct the kind of quantitative analyses that I wanted to do as part of my PhD in applied linguistics/English language teaching (Le Foll 2022). I had no previous experience in either and there were no such courses at my university. Even though I mostly learnt by myself, it would be incorrect to say that I am self-taught: I learnt from some of the resources listed in Appendix A, attended bootcamps and summer schools, read countless posts on StackOverflow and various blogs, and exchanged with like-minded people on social media (#Rstats, #dataviz, #TidyTuesday). This why it is fairer to say that I am community-taught.

I now like to describe myself as an “advanced beginner” in R and statistics. I am not a programmer, nor a statistician, but rather an applied linguist and committed educator. I enjoy teaching data literacy, statistics, and data visualisation to current and future generations of linguists, language education scholars, and teachers. I teach regular methods courses at the University of Cologne that are attended not just by M.A. and M.Ed. students, but also by some doctoral and post-doctoral researcher colleagues. In addition, I teach workshops for both doctoral and post-doctoral researchers at other institutions on a freelance basis.

This textbook was partly designed on the basis of materials that I have developed for these courses and workshops. Publishing these materials is my way to contribute to the wonderful community of people who have helped me on my leaRning journey. 🤗

Photo of a white woman with glasses in her 30s smiling, wearing a floral top and a conference badge. She is standing in front of an academic poster entitled: Textbook English: A Corpus-Based Analysis of Language Use in EFL Textbooks.

Me back in 2017, proudly presenting at my first international conference.2

Acknowledgements

This textbook has benefited greatly from the generous, critical feedback I have received from both novice and expert users of R throughout this project. Many thanks to my colleagues from the Digital Research Academy, Nick Bearman, Ben Golub, and Fritjof Lammers for their critical peer review and to my (former) students at the University of Cologne, Jan Hollmann, Rose Hörsting, Vijaya Lakshmi, Paula Raabe, Poppy Siahaan, Veronika Strobl, Clara Stumm, Katja Wiesner, and Isabel Zimmer, for their critical learner feedback.

Special thanks also go out to the researchers whose works are used as case studies in this textbook, Sarah Schimke and Ewa Dąbrowska, and to Allison Horst whose beautiful and witty artworks illustrate many of the chapters of this textbook (e.g., Figure 1).

In addition, I would like to thank everyone who has contributed and continues to contribute to my own data analysis learning journey. At the risk of forgetting someone, I would like to extend special thanks to Vaclav Brezina, Guillaume Desagulier, Stephanie Evert, Stefan Gries, Daniël Lakens, Natalia Levshina, Luke Tudge, the RLadies Stack group, the R package developers and maintainers of all the packages I use, as well as the many generous contributors to Stack Overflow and to #Rstats on social media.

Get in touch! 📩

If (parts of) this textbook helped you on your leaRning journey or for your teaching, do drop me a line to let me know!

If you have any suggestions for improvements, I would also love to hear from you. ✉️

Please cite the current version of the web version of the textbook as:

Le Foll, Elen. 2024. Data Analysis for the Language Sciences: A very gentle introduction to statistics and data visualisation in R. Open Educational Resource. https://elenlefoll.github.io/RstatsTextbook/ (accessed DATE).

To cite a specific passage, please quote the corresponding chapter or section number(s), as the web version of the textbook does not include page numbers.


  1. A (work-in-progress) list of next-step resources can be found in Appendix A.↩︎

  2. I chose this picture because I vividly remember two professors pointing out that I had written “p = 0.00” on my poster (which I had copied-and-pasted from the output of the statistics tool that I had used) and laughing among themselves (but well within earshot) at how stupid that was. Learning these skills certainly requires a lot of effort on the part of the learner, but it also requires an academic culture that strives to include rather than exclude. This textbook explicitly aims for an inclusive approach to teaching the basics of data literacy and I have included this photo as a reminder to always persevere, whether in the face of seemingly insurmountable error message or snarky remarks!↩︎