Data Analysis for the Language Sciences
A very gentle introduction to statistics and data visualisation in R
Preface
This open textbook is very much a work in progress. The plan is to have ca. 13 content chapters in total, plus a number of case-study chapters (co-)written by students.
This draft is primarily intended as preparation/revision materials for the linguistics methods classes that I regularly teach at the University of Cologne, as well as R and statistics workshops for pre- and post-doctoral researchers that I occasionally give at other higher education institutions.
Student and colleague feedback on this draft is very welcome! ✉️
What is this book about?
This textbook is intended as a hands-on introduction to data management, statistics, and data visualisation for students and researchers in the language sciences. It relies exclusively on freely accessible, open-source tools, focusing primarily on the programming language and environment R.
It is often claimed that learning R is “not for everyone”, or that it has “a steep learning curve”. This textbook aims to prove that the opposite is true. There are many reasons why it is worth investing the time and effort to learn how to do research in R, and it is no more difficult than learning any other new skill. In fact, the results of a recent study suggest that language aptitude is a much stronger predictor of programming aptitude than numeracy (i.e., “being good at numbers”) (Prat et al. 2020). So if you have successfully learnt a foreign language in the past, there is no reason why you shouldn’t succeed in learning a programming language!
Learning R is like learning a foreign language. If you enjoy learning languages, then ‘R’ is just another one. […] You have to learn vocabulary, grammar and syntax. Similar to learning a new language, programming languages also have steep learning curves and require quite some commitment. (Dauber 2024)
The rationale for this textbook is based on my personal observations, in both teaching and consulting, that many ‘introductory’ textbooks on statistics and/or R are not suitable for many humanities scholars, who typically have little to no prior programming experience and for whom the word “statistics” often evokes little more than unpleasant memories of school mathematics. It is worth stressing that this is not a matter of generation (I have observed this phenomenon across all age groups), intelligence (I have taught people far more intelligent than me), or an innate inability to deal with numbers and/or computers (although these are beliefs that, sadly, some have deeply internalised). Instead, I am convinced that, for many people, it is simply a matter of finding a sturdy first stepping stone and gathering up the courage to step on it to begin this learning journey.
The aim of this textbook is by no means to replace any of the brilliant, existing textbooks aimed at imparting statistical literacy for linguistics research, but rather to provide a stepping stone to be able to access these wonderful resources.[1]
Who is this book for?
This book is aimed at students and researchers in the language sciences, including (applied) linguistics, (first and second) language teaching, and language education research. All examples are taken from these research areas. Ultimately, however, this textbook may be of use to anyone who feels they could benefit from a maximally accessible stepping stone, whichever discipline they are coming from.
This textbook is intended to be read linearly, chapter by chapter. Apart from the first, introductory chapter, each chapter will require several hours of commitment. The chapters include quiz questions and short practical tasks. Completing these tasks is essential to genuinely assimilate the textbook’s contents, because the best way to learn new skills is to try things out. So, with this in mind, let’s get cracking!
Acknowledgements
This textbook has benefited greatly from the generous, critical feedback I have received from both novice and expert users of R throughout this project. Many thanks to my colleagues from the Digital Research Academy, Nick Bearman, Ben Golub, and Fritjof Lammers, for their critical peer review and to my (former) students at the University of Cologne, Jan Hollmann, Rose Hörsting, Vijaya Lakshmi, Paula Raabe, Poppy Siahaan, Veronika Strobl, Clara Stumm, Katja Wiesner, and Isabel Zimmer, for their critical learner feedback.
Special thanks also go out to the researchers whose works are used as case studies in this textbook, Sarah Schimke and Ewa Dąbrowska, and to Allison Horst whose beautiful and witty artworks illustrate many of the chapters of this textbook (e.g., Figure 1).
In addition, I would like to thank everyone who has contributed and continues to contribute to my own data analysis learning journey. At the risk of forgetting someone, I would like to extend special thanks to Vaclav Brezina, Guillaume Desagulier, Stephanie Evert, Stefan Gries, Daniël Lakens, Natalia Levshina, Luke Tudge, the RLadies Stack group, the R package developers and maintainers of all the packages I use, as well as the many generous contributors to Stack Overflow and to #Rstats on social media.
Get in touch! 📩
If (parts of) this textbook helped you on your leaRning journey or for your teaching, do drop me a line to let me know!
If you have any suggestions for improvements, I would also love to hear from you. ✉️
Please cite the current web version of the textbook as:
Le Foll, Elen. 2024. Data Analysis for the Language Sciences: A very gentle introduction to statistics and data visualisation in R. Open Educational Resource. https://elenlefoll.github.io/RstatsTextbook/ (accessed DATE).
To cite a specific passage, please quote the corresponding chapter or section number(s), as the web version of the textbook does not include page numbers.
[1] A (work-in-progress) list of next-step resources can be found in Appendix A.
[2] I chose this picture because I vividly remember two professors pointing out that I had written “p = 0.00” on my poster (which I had copied and pasted from the output of the statistics tool that I had used) and laughing among themselves (but well within earshot) at how stupid that was. Learning these skills certainly requires a lot of effort on the part of the learner, but it also requires an academic culture that strives to include rather than exclude. This textbook explicitly aims for an inclusive approach to teaching the basics of data literacy and I have included this photo as a reminder to always persevere, whether in the face of seemingly insurmountable error messages or snarky remarks!