10 The Gramma`R` of Graphics

Warning

As with the rest of this textbook (see Preface), this chapter is very much a work in progress. All feedback is very welcome.

One of the advantages of working in R is that it allows us to create highly customised graphs for effective data visualisation. The Grammar of Graphics (Wilkinson 2005) is a theoretical framework that defines a structured approach to building and understanding statistical graphs. We will focus on using the tidyverse package {ggplot2} (Wickham 2016) to create effective data visualisations. The {ggplot2} package is an implementation of the Grammar of Graphics (GG) syntax in R.

Figure 10.1: Hex sticker of the {ggplot2} package

This chapter is divided into two parts: the first explains the syntax of the Grammar of Graphics (Wilkinson 2005) and how the {ggplot2} package works, while the second part focuses on the semantics of statistical graphics and provides an introduction to the many different types of data visualisations that can be created using the package.

Chapter overview

In this chapter, you will learn how to:

Create and interpret bar plots to visualise categorical variables.
Create and interpret histograms, density plots, and violin plots to visualise continuous numeric variables.
Create and interpret boxplots to visualise the distribution of continuous numeric variables across different subsets of the data.
Create and interpret scatter plots to visualise correlations between pairs of numeric variables.
Create and interpret facetted plots to explore the relationship between three or more variables at once.
Create interactive plots for data exploration.

A fuzzy monster in a beret and scarf, critiquing their own column graph on a canvas in front of them while other assistant monsters (also in berets) carry over boxes full of elements that can be used to customize a graph (like themes and geometric shapes). In the background is a wall with framed data visualizations. Stylized text reads: ggplot2: build a data masterpiece. — Figure 10.2: Building a data masterpiece with {ggplot2} (artwork by @allison_horst)

Set-up and data import

Prerequisites

This chapter assumes that you are familiar with the concepts of descriptive statistics explained in Chapter 8, and the use of the data wrangling functions introduced in Chapter 9.

All examples, tasks, and quiz questions are based on data from:

Dąbrowska, Ewa. 2019. Experience, Aptitude, and Individual Differences in Linguistic Attainment: A Comparison of Native and Nonnative Speakers. Language Learning 69(S1). 72-100. https://doi.org/10.1111/lang.12323.

Our starting point for this chapter is the wrangled combined dataset that we created and saved in Chapter 9. Follow the instructions in that chapter to create this R object or download combined_L1_L2_data.rds from the textbook’s GitHub repository.

Before we begin, we must load the combined_L1_L2_data.rds file that we created and saved in Chapter 9. This file contains the data of all the L1 and L2 participants of Dąbrowska (2019). The categorical variables are stored as factors and obvious data entry inconsistencies and typos have been corrected (see Chapter 9).

library(here)
library(tidyverse)

Dabrowska.data <- readRDS(file = here("data", "processed", "combined_L1_L2_data.rds"))

Check that your data is correctly imported by examining the output of View(Dabrowska.data) and str(Dabrowska.data). Once you are satisfied that this is the case, we are ready to get creative! 🎨

10.1 The syntax of graphics

The syntax of the Grammar of Graphics (Wilkinson 2005) is made up of layers (Figure 10.3), which allow us to create highly effective and efficient data visualisations, while giving us lots of flexibility and control.

A layered diagram illustrating the key elements of data visualization: Data, Aesthetics, Geometries, Facets, Statistics, Coordinates, and Theme, each shown as a layer with a distinct colour. — Figure 10.3: The syntax of the Grammar of Graphics as visualised in the QCBS R Workshop Series (CC-BY-NC-SA).

The data layer and the aesthetics layer are compulsory as you cannot build a graph that does not map some data onto some visual aspect (= aesthetic) of a graph. The remaining layers are optional, but some are very important. In the following, we will explain how the geometries, facets, scales, coordinates, and theme layers are used to build and customise graphs using the {ggplot2} library in R.

10.1.1 Aesthetics

As explained in the documentation, the ggplot() function¹ has two compulsory arguments. First, we must select the data that we want to visualise. Second, we must specify which variable(s) from the data should be mapped onto which visual property or aesthetics (short: aes) of the plot.

ggplot {ggplot2} R Documentation

10.1.2 Description

ggplot() initializes a ggplot object. It can be used to declare the input data frame for a graphic and to specify the set of plot aesthetics intended to be common throughout all subsequent layers unless specifically overridden.

10.1.3 Usage
ggplot(data = NULL, mapping = aes(), ...)

For example, to create a bar plot visualising the distribution of participants’ occupational groups in the combined dataset from Dąbrowska (2019) (Dabrowska.data), we need to map the OccupGroup variable from Dabrowska.data onto our plot’s x-axis (x).

ggplot(data = Dabrowska.data,
       mapping = aes(x = OccupGroup))

An empty plot with the x-axis labeled OccupGroup and four group labels: C, I, M, and PS. No data is shown. — Figure 10.4: Attempt to plot the distribution of participants’ occupational groups in Dąbrowska (2019)

As you can see from Figure 10.4, however, running this code returns an empty plot: All we get is a grid background and a nicely labelled x-axis, but no data… Why might that be? 🤔

10.1.4 Geometries

The reason we are not seeing any data is that we have not yet specified with which kind of geometry (short: geom) we would like to plot the data. The {ggplot2} library features more than 30 different geom functions! They all begin with the prefix geom_. To create a bar plot showing participants’ occupational groups, we need to add a geom_bar() layer to our empty ggplot object (see Figure 10.5).

ggplot(data = Dabrowska.data,
       mapping = aes(x = OccupGroup)) +
  geom_bar()

A bar plot showing the number of various occupational groups. The x-axis is labeled as OccupGroup and represents different occupation categories: C, I, M, and PS, with PS having the highest count of them all. The y-axis represents the number of individuals in each group and is labeled as count. — Figure 10.5: Distribution of participants’ occupational groups (C = clerical position, I = inactive (i.e., unemployed, retired, or homemakers), M = manual jobs, PS = professional-level job or studying for a degree)

Note that we use the + operator to add layers to ggplot objects. If we try to add a layer with the pipe operator (|>) instead, we will get an error message.

ggplot(data = Dabrowska.data,
       mapping = aes(x = OccupGroup)) |> 
  geom_bar()

Error in `geom_bar()`: 
! `mapping` must be created by `aes()`. 
ℹ Did you use `%>%` or `|>` instead of `+`? 
Run `rlang::last_trace()` to see where the error occurred.
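Because + adds layers to an existing ggplot object, a plot can also be stored under a name and extended later. The following is a minimal sketch rather than part of the original analysis: the toy data frame and the object names (toy, base_plot, bar_plot) are made up for illustration and merely stand in for Dabrowska.data.

```r
library(ggplot2)

# Hypothetical toy data standing in for Dabrowska.data
toy <- data.frame(
  OccupGroup = factor(c("C", "I", "M", "M", "PS", "PS", "PS"))
)

# Step 1: data and aesthetics only; printing this object gives an empty plot
base_plot <- ggplot(data = toy,
                    mapping = aes(x = OccupGroup))

# Step 2: add a geometry layer to the stored object with +
bar_plot <- base_plot +
  geom_bar()

# Printing the object draws the plot
bar_plot
```

Storing intermediate objects in this way can be handy when several variants of the same plot are needed, as each variant can start from the same base object.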

10.1.5 Statistics and labels

We now have a simple bar plot that represents the distribution of participants’ occupational groups in Dąbrowska (2019). By default, the axis labels are simply the names of the variables that are mapped onto the plot’s aesthetics. That’s why, in Figure 10.5, our x-axis is labelled “OccupGroup”.

What about the y-axis? We did not specify a y-aesthetic within the mapping argument of our ggplot() object, yet the y-axis is labelled “count”. This is because geom_bar() automatically computes a “count” statistic that gets mapped to the y-aesthetic.

If we want to change these axis labels, we can do so by adding a labs() layer to our plot (see Figure 10.6).

ggplot(data = Dabrowska.data,
       mapping = aes(x = OccupGroup)) +
  geom_bar() +
  labs(x = "Occupational group",
       y = "Number of participants")

This is the same bar plot as above. The only difference is that the x-axis is labeled as Occupational group and the y-axis is labeled as Number of participants. — Figure 10.6: Distribution of participants’ occupational groups (C = clerical position, I = inactive (i.e., unemployed, retired, or homemakers), M = manual jobs, PS = professional-level job or studying for a degree)
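To make the "count" statistic more tangible, the sketch below compares two routes to the same bar heights: letting geom_bar() count the rows, and counting them ourselves with count() before drawing the precomputed values with geom_col(). The toy data and the object names (toy, toy_counts, p_bar, p_col) are invented for this illustration.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data standing in for Dabrowska.data
toy <- data.frame(
  OccupGroup = factor(c("C", "I", "M", "M", "PS", "PS", "PS"))
)

# Route 1: geom_bar() counts the rows in each category for us
p_bar <- ggplot(data = toy,
                mapping = aes(x = OccupGroup)) +
  geom_bar()

# Route 2: count the rows first (count() stores the result in a column n),
# then map that column onto the y-axis and draw it with geom_col()
toy_counts <- toy |>
  count(OccupGroup)

p_col <- ggplot(data = toy_counts,
                mapping = aes(x = OccupGroup, y = n)) +
  geom_col()

# layer_data() returns the values that a layer actually plots
layer_data(p_bar)$count
toy_counts$n
```

With the second route, the default y-axis label is "n", the name of the column created by count(), rather than "count".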

Quiz time!

Q10.1 Which of the following labels can be added or modified using the labs() function?

What is alt-text and why is it important?

Alt-text, or alternative text, is a concise description of an image used to make its informational content accessible to people with visual impairments. Using a screenreader programme, blind and visually impaired individuals can have the alt-text associated with an image read out to them.

A good alt-text aims to convey the main message and insights of the graph, allowing someone who cannot see it to understand the information being presented. In the context of online publications, alt-text is also useful in regions with low bandwidth as images may take a very long time to load. By including alt-text, we can therefore make our work more accessible and inclusive, enabling more people to engage with and understand our data and analyses.

10.1.6 Data

Instead of using the data argument of the ggplot() function as we did above, we can pipe the data into the function’s first argument (see Section 7.5.2). Compare these two methods and their outputs.

Using the data argument of ggplot()

ggplot(data = Dabrowska.data,
       mapping = aes(x = OccupGroup)) +
  geom_bar() +
  labs(x = "Occupational group",
       y = "Number of participants")

Piping the data into ggplot()

Dabrowska.data |> 
  ggplot(mapping = aes(x = OccupGroup)) +
  geom_bar() +
  labs(x = "Occupational group",
       y = "Number of participants")

The outputs are exactly the same! Piping the dataset into the ggplot() function, however, allows us to easily wrangle the data that we want to visualise ‘on the fly’, without transforming the data object itself. For example, we can use the tidyverse filter() function (see Section 9.8) to examine the distribution of occupational groups among L2 participants only (see Figure 10.7).

Dabrowska.data |> 
  filter(Group == "L2") |> 
  ggplot(mapping = aes(x = OccupGroup)) +
  geom_bar() +
  labs(x = "Occupational group",
       y = "Number of participants",
       title = "Occupational groups of L2 participants")

Bar plot with the title of Occupational groups of L2 participants showing the distribution of L2 participants' occupational group. The x-axis is labeled as Occupational group with PS having the highest count, and the y-axis is labeled as Number of participants. — Figure 10.7

We can also combine several filter() conditions using the & (AND) and | (OR) operators. For example, we may want to visualise the distribution of the occupational groups of participants who are L2 speakers of English and whose first language is Polish (see Figure 10.8).

Dabrowska.data |> 
  filter(Group == "L2" & NativeLg == "Polish") |> 
  ggplot(mapping = aes(x = OccupGroup)) +
  geom_bar() +
  labs(x = "Occupational group",
       y = "Number of participants",
       title = "Occupational groups of Polish L2 participants")

Bar plot showing the number of Polish L2 participants across occupational groups C, I, M, and PS, with M having the highest count. The x-axis and the y-axis are labeled as the plot above. The title of the plot is Occupational groups of Polish L2 participants. — Figure 10.8
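The example above only uses the & operator. As a rough sketch of how the | operator behaves by comparison, the following code filters an invented toy data frame (the values and the object names are hypothetical and much simpler than the real dataset). It also shows %in%, a base R operator that can replace a chain of | conditions on the same variable.

```r
library(dplyr)

# Hypothetical toy data; the real dataset has many more rows and columns
toy <- data.frame(
  Group    = c("L1", "L2", "L2", "L2", "L2"),
  NativeLg = c("English", "Polish", "Mandarin", "Polish", "Spanish")
)

# AND: both conditions must be true for a row to be kept
polish_L2 <- toy |>
  filter(Group == "L2" & NativeLg == "Polish")

# OR: at least one of the conditions must be true
polish_or_mandarin <- toy |>
  filter(NativeLg == "Polish" | NativeLg == "Mandarin")

# %in% is a compact alternative to several | conditions on one variable
polish_or_mandarin_2 <- toy |>
  filter(NativeLg %in% c("Polish", "Mandarin"))
```

Any of these filtered objects could then be piped into ggplot() in the same way as in the examples above.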

10.1.7 Facets

If we want to compare two subsets of the data, we can add a facet layer to subdivide the plot into several plots each representing a subset of the data. In the following, we use the facet_wrap() function to subdivide our bar plot by the Group variable (~ Group). This allows us to easily compare the distribution of occupations across the L1 and L2 participants (see Figure 10.9).

Dabrowska.data |> 
  ggplot(mapping = aes(x = OccupGroup)) +
  geom_bar() +
  facet_wrap(~ Group) +
  labs(x = "Occupational group",
       y = "Number of participants",
       title = "Occupational groups of participants")

Bar plot titled Occupational groups of participants showing the number of participants in four occupational groups for two categories, L1 and L2. The plot is divided into two panels: in the L1 group, the number of participants is roughly equal across all occupations, ranging from about 20 to 25. In the L2 group, PS has the highest number of participants at over 30, followed by M and C, while I has the fewest participants. — Figure 10.9

To compare the distributions of occupations of the male and female L2 participants, we can combine a filter() operation to select only the L2 participants, with a facet_wrap() layer (see Figure 10.10).

Dabrowska.data |> 
  filter(Group == "L2") |> 
  ggplot(mapping = aes(x = OccupGroup)) +
  geom_bar() +
  facet_wrap(~ Gender) +
  labs(x = "Occupational group",
       y = "Participants",
       title = "Occupational groups of L2 participants")

Bar plot titled Occupational groups of L2 participants showing the number of female and male participants across four occupational groups. The plot is divided into two panels: F and M. In both the highest number belongs to the group PS. — Figure 10.10

To explore potential gender differences in occupational groups across both L1 and L2 groups, we can combine the two variables within the facet_wrap() function (see Figure 10.11).

Dabrowska.data |> 
  ggplot(mapping = aes(x = OccupGroup)) +
  geom_bar() +
  facet_wrap(~ Group + Gender) +
  labs(x = "Occupational group",
       y = "Participants")

Bar plot showing the number of participants across four occupational groups, split by language group (L1 and L2) and gender (F and M). The plot has four panels. L1 Female participant counts are relatively even across groups, with the highest in PS. In L2 Female panel a large number of participants are in the PS group, in L1 Male panel all occupational groups have similar participant counts, and in L2 Male panel PS and M have the highest counts. — Figure 10.11
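facet_wrap() is not the only facetting function in {ggplot2}: facet_grid() arranges the panels as a matrix, with one variable defining the rows and another the columns. The sketch below is only meant as an illustration of this alternative layout; toy and grid_plot are made-up names and the eight rows of data are invented.

```r
library(ggplot2)

# Hypothetical toy data with the same three variables as in the plot above
toy <- data.frame(
  OccupGroup = c("C", "I", "M", "PS", "PS", "M", "C", "PS"),
  Group      = c("L1", "L1", "L1", "L1", "L2", "L2", "L2", "L2"),
  Gender     = c("F", "M", "F", "M", "F", "M", "F", "M")
)

grid_plot <- ggplot(data = toy,
                    mapping = aes(x = OccupGroup)) +
  geom_bar() +
  facet_grid(Group ~ Gender) +  # rows = Group, columns = Gender
  labs(x = "Occupational group",
       y = "Participants")

grid_plot
```

Which of the two layouts is easier to read will depend on which comparison matters most: panels that share a row are easiest to compare along the y-axis.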

10.1.8 Scales

Scale layers allow us to map data values to the visual values of an aesthetic. For example, to make our facetted plot in Figure 10.11 easier to read, we could add some colour using a fill aesthetic to fill each bar with a colour that corresponds to the participants’ gender. To do so, we map each unique value of the variable Gender (“F” and “M”) onto a colour that is then used to fill the corresponding bars of our bar plot. Adding the fill aesthetic automatically generates a legend (see Figure 10.12).

Dabrowska.data |> 
  ggplot(mapping = aes(x = OccupGroup, 
                       fill = Gender)) +
  geom_bar() +
  facet_wrap(~ Group + Gender) +
  labs(x = "Occupational group",
       y = "Participants")

This is the same plot as above. The only difference is that the bars in the two F panels are in the colour light red, and the bars in the two M panels are in the colour blue. There is a legend titled Gender on the right side of the plot that shows which colour shows which category. — Figure 10.12

As we did not specify any fill colours for Figure 10.12, {ggplot2} used default colours taken from the {scales} package of the tidyverse environment. This is because, in the Grammar of Graphics, colour palettes are governed by scales. To specify a different set of colours, we therefore need to specify a scale layer.

One way to do this is to use scale_fill_manual() to manually pick our own colours, either using R colour names (such as purple) or hexadecimal colour codes (such as #34027d). Note that both types of colour specifications must be enclosed in quotation marks.

Dabrowska.data |> 
  ggplot(mapping = aes(x = OccupGroup, 
                       fill = Gender)) +
  geom_bar() +
  facet_wrap(~ Group + Gender) +
  labs(x = "Occupational group",
       y = "Participants") +
  scale_fill_manual(values = c("purple", "#34027d"))

This is the same plot as above. The only difference is that the bars in the two F panels are in the colour light purple, and the bars in the two M panels are in the dark purple. — Figure 10.13: A facetted bar plot with hand-picked colours
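When the colours are supplied as an unnamed vector, as above, they are matched to the levels of the variable in order. A named vector makes the pairing explicit, so that the order in which the colours are listed no longer matters. The following sketch uses an invented toy data frame (toy and named_plot are hypothetical names).

```r
library(ggplot2)

# Hypothetical toy data standing in for Dabrowska.data
toy <- data.frame(
  OccupGroup = c("C", "I", "M", "PS", "PS", "M"),
  Gender     = factor(c("F", "M", "F", "M", "F", "M"))
)

named_plot <- ggplot(data = toy,
                     mapping = aes(x = OccupGroup,
                                   fill = Gender)) +
  geom_bar() +
  # The names must match the levels of the variable mapped onto fill
  scale_fill_manual(values = c("M" = "#34027d", "F" = "purple"))

named_plot
```

This is a little more typing, but it guards against colours silently swapping groups if the factor levels are later reordered.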

Although it makes the plot easier to interpret, the colour aesthetic (here fill) is not strictly necessary to understand the data represented in Figure 10.13. After all, the two gender subgroups are already distinguished by the facet_wrap() layer. That’s not necessarily a bad thing, but you must consider whether such redundant elements facilitate the interpretation of the data visualised or not.

In some cases, colour is used as the only way of identifying subgroups in the data, for example in a stacked bar plot (see Figure 10.14). In such cases, it is important to consider how the plot will be perceived by different people (see note on colour blindness below).

Dabrowska.data |> 
  ggplot(mapping = aes(x = OccupGroup, 
                       fill = Gender)) +
  geom_bar() +
  facet_wrap(~ Group) +
  labs(x = "Occupational group",
       y = "Participants") +
  scale_fill_manual(values = c("purple", "#34027d"))

Bar plot showing the number of participants across four occupational groups, split by two language groups (L1 and L2). Each bar in this plot contains two colours: light purple representing female participants and dark purple representing male ones. There is a legend titled Gender on the right side of the plot. — Figure 10.14: A stacked bar plot with hand-picked colours

A note on colours and colour blindness 🌈

Colour blindness is a condition that results in a decreased ability to see colours and perceive differences in colour. There are different types of colour blindness but, in general, it is best to avoid red-green contrasts. To ensure that your data visualisations are accessible to as many people as possible, you may want to use the {colorBlindness} package (Ou 2021) to simulate the appearance of a set of colours for people with different forms of colour blindness.

#install.packages("colorBlindness")
library(colorBlindness)

colorBlindness::displayAllColors(scales::hue_pal()(6))

Using the {colorBlindness} package, we can immediately see that the default {scales} discrete palette that {ggplot2} used in Figure 10.12 is not accessible to colour blind people (deuteranope and protanope), nor is it distinguishable when printed in grey-scale (desaturate). In contrast, our hand-picked colours from Figure 10.13 fare much better.

colorBlindness::displayAllColors(c("purple", "#34027d"))

But you need not manually pick colours, as many people have developed and shared R packages that feature attractive, ready-to-use colour-blind friendly palettes. The {viridis} package (Garnier et al. 2023), for example, includes eight such palettes (“magma”, “inferno”, “plasma”, “viridis”, “cividis”, “rocket”, “mako”, “turbo”) that also reproduce well in grey-scale. And, as the viridis colour scales are built into {ggplot2} and its dependency {scales}, you don’t even need to install the {viridis} package separately!

colorBlindness::displayAllColors(scales::viridis_pal()(6))

Choosing an appropriate palette is not the only way to make your visualisations accessible to colour-blind readers. Another way is to provide redundant mappings to other aesthetics such as size, line type, shape, or pattern (see Figure 10.15).

Show {ggplot2} code to generate patterned barplot.

#install.packages("ggpattern")
library(ggpattern)

Dabrowska.data |> 
  ggplot(mapping = aes(x = OccupGroup, 
                       fill = Gender,
                       pattern = Gender)) +
  geom_bar_pattern(pattern_density = 0.01,  # adjust for better visibility
                   pattern_fill = "transparent",  # keeps pattern visible
                   pattern_colour = "white") +  # pattern lines color
  facet_wrap(~ Group) +
  labs(x = "Occupational group",
       y = "Participants") +
  scale_fill_manual(values = c("purple", "#34027d")) +
  scale_pattern_manual(values = c("stripe", "crosshatch")) # specify types of patterns

This is the same plot as above. The only difference is that the two different shades of purple representing female and male participants are marked with a distinct pattern as well. — Figure 10.15

“What to do if you get the error message: seq.default(from, to, by) : invalid”

If you get the error message “Error in seq.default(from, to, by) : invalid ‘(to - from)/by’” when trying to run this chunk of code, this is because your Plots pane in RStudio is not large enough to accommodate the plot. Increase its size and it should work (see also https://stackoverflow.com/questions/73960726/error-in-seq-defaultfrom-to-by-invalid-to-from-by-ggpattern).

Finally, it is important to remember that colour blindness is by no means the only type of visual impairment you should consider when creating visualisations. Worldwide, far more people are affected by blindness and low vision. Chapter 13 explains how to add alternative texts (alt-text) to plots and images. Many people with visual impairments rely on screen readers that use these alternative texts to provide audio descriptions of images and plots. These alternative texts can also improve the user experience when there are internet connection issues and images do not load properly or quickly enough.

Some academic publishers still require grey-scale plots, in which case you will want to use the scale layer scale_fill_grey(). Alternatively, the colour palettes of the {viridis} package (see information box on colour blindness) render well in grey, too. For a discrete colour scale (as needed for a categorical variable such as Gender), use the scale_fill_viridis_d() function. With the “option” argument, you can switch between the eight different viridis palettes (“magma”, “inferno”, “plasma”, “viridis”, “cividis”, “rocket”, “mako”, “turbo”).

Dabrowska.data |> 
  ggplot(mapping = aes(x = OccupGroup, fill = Gender)) +
  geom_bar() +
  facet_wrap(~ Group) +
  labs(x = "Occupational group",
       y = "Participants") +
  scale_fill_grey()

Dabrowska.data |> 
  ggplot(mapping = aes(x = OccupGroup, fill = Gender)) +
  geom_bar() +
  facet_wrap(~ Group) +
  labs(x = "Occupational group",
       y = "Participants") +
  scale_fill_viridis_d(option = "viridis")

Dabrowska.data |> 
  ggplot(mapping = aes(x = OccupGroup, fill = Gender)) +
  geom_bar() +
  facet_wrap(~ Group) +
  labs(x = "Occupational group",
       y = "Participants") +
  scale_fill_viridis_d(option = "turbo")
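To see which colours a given viridis option assigns before building a full plot, the palette can be queried directly. This is only a sketch: it assumes that the {scales} package, which is installed alongside {ggplot2}, is available, and the object names viridis_two and turbo_two are made up.

```r
library(scales)

# Two colours are enough for a two-level variable such as Gender
viridis_two <- viridis_pal(option = "viridis")(2)
turbo_two   <- viridis_pal(option = "turbo")(2)

# Print the hexadecimal colour codes
viridis_two
turbo_two

# show_col() draws the colours as labelled swatches in the Plots pane
show_col(c(viridis_two, turbo_two))
```

The resulting hexadecimal codes can also be passed to scale_fill_manual() if only some of the colours of a palette are needed.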

If you like colours, check out the {paletteer} package (Hvitfeldt 2021), which provides a neat interface to access a very large collection of R colour packages, some of which are very fun! The advantage is that you only need to install one package (install.packages("paletteer")) to have a huge range of palettes at your disposal. Below is a small selection of some personal favourites.

Dabrowska.data |> 
  ggplot(mapping = aes(x = OccupGroup, fill = OccupGroup)) +
  geom_bar() +
  labs(x = "Occupational group",
       y = "Participants") +
  paletteer::scale_fill_paletteer_d("beyonce::X11")

Dabrowska.data |> 
  ggplot(mapping = aes(x = OccupGroup, fill = OccupGroup)) +
  geom_bar() +
  labs(x = "Occupational group",
       y = "Participants") +
  paletteer::scale_fill_paletteer_d("lisa::BridgetRiley", direction = -1)

Dabrowska.data |> 
  ggplot(mapping = aes(x = OccupGroup, fill = OccupGroup)) +
  geom_bar() +
  labs(x = "Occupational group",
       y = "Participants") +
  paletteer::scale_fill_paletteer_d("rockthemes::janelle")

Dabrowska.data |> 
  ggplot(mapping = aes(x = OccupGroup, fill = OccupGroup)) +
  geom_bar() +
  labs(x = "Occupational group",
       y = "Participants") +
  paletteer::scale_fill_paletteer_d("lisa::FridaKahlo", direction = -1)

Dabrowska.data |> 
  ggplot(mapping = aes(x = OccupGroup, fill = OccupGroup)) +
  geom_bar() +
  labs(x = "Occupational group",
       y = "Participants") +
  paletteer::scale_fill_paletteer_d("ltc::kiss")

Dabrowska.data |> 
  ggplot(mapping = aes(x = OccupGroup, fill = OccupGroup)) +
  geom_bar() +
  labs(x = "Occupational group",
       y = "Participants") +
  paletteer::scale_fill_paletteer_d("tayloRswift::speakNow")

10.1.9 Themes

The {ggplot2} framework also allows for the addition of an optional theme() layer to further customise the look of plots. The default {ggplot2} theme is theme_grey(). Here are some of the pre-built themes that come with the {ggplot2} library for you to compare. As with colour palettes, you can install additional libraries that will give you access to literally hundreds of ready-made themes for you to explore.

ggplot(data = Dabrowska.data,
       mapping = aes(x = OccupGroup)) +
  geom_bar() +
  labs(x = "Occupational group",
       y = "Number of participants") +
  theme_bw()

ggplot(data = Dabrowska.data,
       mapping = aes(x = OccupGroup)) +
  geom_bar() +
  labs(x = "Occupational group",
       y = "Number of participants") +
  theme_dark()

ggplot(data = Dabrowska.data,
       mapping = aes(x = OccupGroup)) +
  geom_bar() +
  labs(x = "Occupational group",
       y = "Number of participants") +
  theme_light()

ggplot(data = Dabrowska.data,
       mapping = aes(x = OccupGroup)) +
  geom_bar() +
  labs(x = "Occupational group",
       y = "Number of participants") +
  theme_minimal()

ggplot(data = Dabrowska.data,
       mapping = aes(x = OccupGroup)) +
  geom_bar() +
  labs(x = "Occupational group",
       y = "Number of participants") +
  theme_void()

Pretty much all aspects of plot themes can be customised. To demonstrate this, Figure 10.25 displays a bar plot with some highly customised aesthetics. I will let you judge how meaningful these custom choices are and whether they genuinely help the reader to interpret the data… 🤨

See {ggplot2} code used to generate this plot.

ggplot(data = Dabrowska.data,
       mapping = aes(x = OccupGroup, fill = Gender)) +
  geom_bar() +
  labs(x = "Occupational group",
       y = "Number of participants",
       title = "An example of an extravagantly customised ggplot...") +
  theme(
    panel.background = element_rect(fill = "#FFC080", color = NA),
    panel.grid.major = element_line(color = "gold", linewidth = 1.5),
    panel.grid.minor = element_line(color = "grey20", linewidth = 0.5),
    axis.title.x = element_text(face = "bold", size = 12, color = "brown", angle = 10),
    axis.title.y = element_text(size = 25, color = "green", family = "Courier New"),
    axis.text.x = element_text(face = "italic", size = 12, color = "cyan"),
    axis.text.y = element_text(size = 14, color = "grey"),
    plot.title = element_text(face = "bold", size = 10, color = "purple", family = "Comic Sans MS")
  )

Figure 10.25: Illustration of some of the customisation options of `ggplot` objects
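When a pre-built theme such as theme_minimal() is combined with individual theme() tweaks, the order of the layers matters: a complete theme that is added after a tweak replaces it. The sketch below illustrates this with two theme objects (tweaked_theme and overwritten_theme are made-up names). It relies on the {ggplot2} function calc_element() to look up the final value of a theme element.

```r
library(ggplot2)

# A complete theme followed by a tweak: the tweak is kept
tweaked_theme <- theme_minimal(base_size = 14) +
  theme(axis.title = element_text(face = "bold"))

# A tweak followed by a complete theme: the complete theme takes over
overwritten_theme <- theme(axis.title = element_text(face = "bold")) +
  theme_minimal(base_size = 14)

# calc_element() resolves what a theme element finally looks like,
# here the font face of the x-axis title
calc_element("axis.title.x", tweaked_theme)$face
calc_element("axis.title.x", overwritten_theme)$face
```

Only the first lookup should report a bold font face. Either theme object can be added to a plot with +, just like the pre-built themes above.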

10.1.10 Coordinates

By default, the coordinate system that is used in ggplot objects is the Cartesian coordinate system, which has a horizontal axis (x) and a vertical axis (y) that are perpendicular to each other. To change this default Cartesian coordinate system, we need to add a coordinate layer.

For example, if we want to display the full names of the four occupational groups used in Dąbrowska (2019), we can change the labels of the categories using mutate() and fct_recode() before piping the data into ggplot() (see Section 10.1.6) and then flip the x and y axes using the coordinate layer coord_flip(). As shown in Figure 10.26, this makes long labels much easier to read.

Dabrowska.data |> 
  mutate(OccupGroup = fct_recode(OccupGroup,
                                 `Professionally inactive` = "I",
                                 `Clerical profession` = "C",
                                 `Manual profession` = "M",
                                 `Professional-level job/\nstudent` = "PS")) |> 
  mutate(Gender = fct_rev(Gender)) |> 
  ggplot(mapping = aes(x = OccupGroup, fill = Gender)) +
  geom_bar() +
  labs(x = NULL,
       y = "Participants") +
  scale_fill_viridis_d() +
  coord_flip() +
  theme_minimal() +
  theme(axis.text = element_text(size = 14))
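Since version 3.3.0 of {ggplot2}, horizontal bars can also be obtained without a coordinate layer by mapping the categorical variable onto the y-axis instead of the x-axis. The sketch below shows this alternative with invented toy data; toy and horizontal_plot are hypothetical names.

```r
library(ggplot2)

# Hypothetical toy data with long category labels
toy <- data.frame(
  OccupGroup = c("Clerical profession",
                 "Professionally inactive",
                 "Manual profession",
                 "Manual profession",
                 "Professional-level job/\nstudent",
                 "Professional-level job/\nstudent",
                 "Professional-level job/\nstudent")
)

# Mapping the categorical variable onto y produces horizontal bars,
# so the counts end up on the x-axis
horizontal_plot <- ggplot(data = toy,
                          mapping = aes(y = OccupGroup)) +
  geom_bar() +
  labs(x = "Participants",
       y = NULL) +
  theme_minimal()

horizontal_plot
```

Note that, with this approach, the axis labels in labs() refer to the axes as they are displayed, whereas with coord_flip() they refer to the axes before flipping.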

The vast majority of statistical graphs use the Cartesian coordinate system. Pie charts and other circular plots, however, use the polar coordinate system (coord_polar()), whereby quantities are mapped onto angles rather than distances. In general, humans are much better at judging lengths than angles or areas (Cleveland & McGill 1987), which is why circular graphs such as pie charts are typically not recommended for data visualisation (see, e.g., Few). That said, they can be produced using the {ggplot2} library by adding the coordinate layer coord_polar("y") and modifying a few parameters.

Dabrowska.data |> 
  mutate(OccupGroup = fct_recode(OccupGroup,
                                 `Professionally inactive` = "I",
                                 `Clerical profession` = "C",
                                 `Manual profession` = "M",
                                 `Professional-level job/\nstudent` = "PS")) |> 
  ggplot(mapping = aes(x = "", fill = OccupGroup)) +
  geom_bar(width = 1) +
  labs(fill = "Occupational group") +
  scale_fill_viridis_d(direction = -1) +
  coord_polar("y") +
  theme_void()

10.2 The semantics of graphics

So far, we have seen how the syntax of the Grammar of Graphics can be used to build statistical graphs layer by layer. We now turn to the semantics of graphics. As linguists are well placed to know, semantics is the study of meaning. In the Grammar of Graphics, the semantics of graphics is defined as “the meanings of the representative symbols and arrangements we use to display information” (Wilkinson 2005: 20). In what follows, we will see how thinking about the semantics of graphics can help us to think about how the different components of a graph interact to convey insightful visual information from raw data. This will help us to make informed choices when choosing the geometries, scales, facets, and themes of our data visualisations.

But, first, let’s think about why we visualise data. Data visualisation is about more than just communicating the results of our analyses to others at the publication stage. In fact, good data visualisation can help us make informed decisions throughout the research process from the data wrangling stage to the evaluation of complex statistical models. Here are some reasons for visualising data. Can you think of others? 🤔

For yourself

To explore your data
To detect data processing errors and outliers
To check assumptions of statistical tests or models (see Chapter 11 and Chapter 12)
To examine variation across different subsets of the data
To better interpret the results of statistical analyses (see Chapter 12)

For others

To communicate the results of your analyses more effectively
To communicate about your data (in more detail)
To communicate complex information more efficiently
To attract the reader’s attention
To allow the reader to reach their own conclusions

A group of fuzzy round monsters with binoculars, backpacks and guide books looking up at graphs flying around with wings (like birders, but with exploratory data visualizations). Stylized text reads “ggplot2: visual data exploration.” — Figure 10.28: Using the {ggplot2} package for data exploration (artwork by @allison_horst)

Depending on the type of data that we want to visualise and why, we can choose different types of plots. A great resource to choose a graphic that is suitable for your data is the R Graph Gallery.

In the following, we will first look at how we can plot categorical variables and discrete numeric variables, before we move on to visualising continuous numeric variables and combinations of different types of variables (see Section 7.2).

10.2.1 Bar plots

As we saw in Section 10.1, bar plots (also called bar charts) are a great way to visualise categorical variables. We also saw that, when using horizontal writing systems, it is often easier to interpret a bar plot if its coordinates are flipped so that longer labels can be read more readily.

Task 10.1

Study Figure 10.29 and think about which {ggplot2} functions were used to generate it. 🤔

Then, click on the “Show R code” button to compare your intuitions with the actual code. Note that there may well be more than one solution, so do try out your version and see if you spot any differences!

Show R code.

Dabrowska.data |> 
  filter(Group == "L2") |> 
  mutate(NativeLg = fct_rev(fct_infreq(NativeLg))) |> 
  ggplot(aes(x = NativeLg, 
           fill = NativeLgFamily)) +
  geom_bar() +
  coord_flip() +
  scale_fill_viridis_d(option = "F") +
  scale_y_continuous(limits = c(0, 40)) +
  labs(x = NULL, 
       y = NULL, 
       fill = "Language family",
       title = "Native languages of L2 participants") +
  theme_minimal(base_size = 16)

To create Figure 10.29, we first reordered the factor levels of the NativeLg variable using two functions from the {forcats} package (see Section 9.3.2): fct_infreq() first orders the levels according to their frequency (by default, they are sorted alphabetically), and fct_rev() then reverses that order. The latter step is needed because, once the axes have been flipped with coord_flip(), the first level is drawn at the bottom of the plot rather than at the top. You can check the order of a factor’s levels using the function levels(). Note that, if two levels have the same number of occurrences, they are ordered alphabetically (as seen in Figure 10.29).

levels(Dabrowska.data$NativeLg)

 [1] "Cantonese"  "Chinese"    "French"     "German"     "Greek"     
 [6] "Italian"    "Lithuanian" "Mandarin"   "Polish"     "Russian"   
[11] "Spanish"

levels(fct_infreq(Dabrowska.data$NativeLg))

 [1] "Polish"     "Mandarin"   "Lithuanian" "Cantonese"  "Chinese"   
 [6] "Spanish"    "French"     "Russian"    "German"     "Greek"     
[11] "Italian"

levels(fct_rev(fct_infreq(Dabrowska.data$NativeLg)))

 [1] "Italian"    "Greek"      "German"     "Russian"    "French"    
 [6] "Spanish"    "Chinese"    "Cantonese"  "Lithuanian" "Mandarin"  
[11] "Polish"
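The same reordering logic can be tried out on a small toy factor. The following is a minimal sketch that uses an invented vector of language names (not the Dąbrowska data) and assumes that the {forcats} package is installed.

```r
library(forcats)

# Hypothetical toy data: three languages with unequal frequencies
toy_lg <- factor(c("Polish", "German", "Polish", "Italian", "Polish", "German"))

# Default order: alphabetical
default_levels <- levels(toy_lg)

# Most frequent level first
freq_levels <- levels(fct_infreq(toy_lg))

# Reversed order: least frequent level first
rev_levels <- levels(fct_rev(fct_infreq(toy_lg)))

print(default_levels)
print(freq_levels)
print(rev_levels)
```

In a flipped bar plot, the last level of the reversed factor (here the most frequent language) ends up at the top.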

Quiz time!

Q10.2 Compare the two plots below. Which one makes it easier to see which occupational group has the fewest participants and why?

Bar plot

Show R code for the bar plot

Dabrowska.data |> 
  mutate(OccupGroup = fct_recode(OccupGroup,
                                 `Professionally inactive` = "I",
                                 `Clerical profession` = "C",
                                 `Manual profession` = "M",
                                 `Professional-level job/\nstudent` = "PS")) |> 
  filter(Gender == "M") |> 
  ggplot(mapping = aes(x = OccupGroup, 
                       fill = OccupGroup)) +
  geom_bar() +
  labs(x = NULL,
       y = "Participants") +
  scale_fill_viridis_d() +
  coord_flip() +
  theme_minimal() +
  theme(axis.text = element_text(size = 15),
        legend.position = "none")

Pie chart

Show R code for the pie chart

Dabrowska.data |> 
  mutate(OccupGroup = fct_recode(OccupGroup,
                                 `Professionally inactive` = "I",
                                 `Clerical profession` = "C",
                                 `Manual profession` = "M",
                                 `Professional-level job/\nstudent` = "PS")) |> 
  filter(Gender == "M") |> 
  ggplot(mapping = aes(x = "", fill = OccupGroup)) +
    geom_bar(width = 1) +
    scale_fill_viridis_d(direction = -1) +
    coord_polar("y") +
    theme_void(base_size = 20) # This increases the font size.


Task 10.2

Using the {ggplot2} library, create a bar plot that shows the distribution of occupational groups (OccupGroup) among male L1 and L2 participants in Dąbrowska (2019)’s study.

a. Drawing on the information provided by your bar plot, how many male participants reported having manual jobs?

Show sample code to answer Q10.3.

Dabrowska.data |> 
  filter(Gender == "M") |> 
  ggplot(mapping = aes(x = OccupGroup)) +
  geom_bar() +
  labs(x = "Occupational group", 
       y = "Male participants") +
  theme_minimal()

b. Is the legend in the bar plot that you have created necessary?

Show code to answer Q10.4.

Dabrowska.data |> 
  filter(Gender == "M") |> 
  ggplot(mapping = aes(x = OccupGroup, 
                       fill = OccupGroup)) +
  geom_bar() +
  theme_minimal() +
  theme(legend.position = "none")

c. Create a pie chart that shows the distribution of occupational groups among male participants (as shown in Figure 10.27). Which line of code is essential to create a pie chart using {ggplot2}?

Show code to create pie chart below.

Dabrowska.data |> 
  filter(Gender == "M") |> 
  ggplot(mapping = aes(x = "", 
                       fill = OccupGroup)) +
    geom_bar(width = 1) +
    coord_polar("y") +
    theme_void()

d. Transform the pie chart that you just created in c. to make it look like Figure 10.32. To achieve this, wrangle the data before piping it into the data argument of the ggplot() function. Which {tidyverse} function can you use to rename the labels?

Show sample code to answer Q10.6.

Dabrowska.data |> 
  mutate(`OccupGroup` = fct_recode(OccupGroup,
                                 `Professionally inactive` = "I",
                                 `Clerical profession` = "C",
                                 `Manual profession` = "M",
                                 `Professional-level job` = "PS")) |> 
  filter(Gender == "M") |> 
  ggplot(mapping = aes(x = "", 
                       fill = OccupGroup)) +
  geom_bar(width = 1) +
  coord_polar("y") +
  theme_void() +
  labs(fill = "Occupational group") # This last line of code changes the title of the legend, which is the label for the variable associated with the `fill` aesthetic.

10.2.2 Histograms

In Section 8.2.2, we visually examined the distribution of participants’ ages in a bar plot. This was possible because the age variable in Dąbrowska (2019) was recorded as a discrete numeric variable (i.e. either as 18 or 19, but not 18.4 years of age).

ggplot(data = Dabrowska.data,
       mapping = aes(Age)) +
  geom_bar() +
  scale_x_continuous() +
  theme_minimal()

Bar charts are best suited for categorical data and should only be used to visualise discrete numeric variables that have a fairly limited number of possible values. As we can see from the output of the unique() function, this is not the case for the Age variable in Dabrowska.data, as it includes 40 different age values, ranging from 17 to 65.

unique(Dabrowska.data$Age) |> 
  sort()

 [1] 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 37 38 39 40 41 42
[26] 44 46 47 48 51 52 53 55 57 58 59 60 61 62 65

The distribution of participants’ ages is therefore better visualised as a histogram or density plot. To visualise participants’ age as a histogram rather than as a bar plot, all we need to do is change the plot geometry (see Section 10.1.4) from geom_bar() to geom_histogram().

ggplot(data = Dabrowska.data,
       mapping = aes(x = Age)) +
  geom_histogram() +
  labs(x = "Age (in years)",
       y = "Number of participants") +
  theme_minimal()

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

When generating this histogram, a message appears in the R console that informs us that, by default, the geom_histogram() function subdivided the Age values into 30 bins. This means that the age range from 17 to 65 has been subdivided into 30 bins of equal width. Given that the Age values in this dataset span a range of 48 years, each bin is only about 1.6 years wide, which is not a great way to subdivide a variable that was recorded in whole years.

As indicated in the message, to change this behaviour, we can adjust the value of the binwidth argument. This argument determines how many years go in each bin. So if we choose to have two years in each subdivision of the Age variable, we will end up with about 24 bins. Logically, if we decide to group four years in each subdivision of the Age variable, we will end up with only about 12 bins.
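The arithmetic behind these bin counts can be sketched in a few lines of R. The values 17 and 65 are the minimum and maximum ages reported above; the object names are made up for this illustration. Depending on how the bin boundaries are aligned, the plotted histogram may contain one bin more than the figures computed here.

```r
# Age range reported in the text: from 17 to 65 years
age_range <- 65 - 17

# Width of each bin (in years) with the default of 30 bins
default_binwidth <- age_range / 30

# Approximate number of bins for bin widths of 2 and 4 years
bins_2_years <- age_range / 2
bins_4_years <- age_range / 4

print(default_binwidth)
print(bins_2_years)
print(bins_4_years)
```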

Compare the three histograms below. In your opinion, which binwidth provides the most effective way to visualise the distribution of participants’ ages? 🤔

ggplot(data = Dabrowska.data,
       mapping = aes(x = Age)) +
  geom_histogram() +
  labs(x = "Age (in years)",
       y = "Number of participants") +
  theme_minimal()

ggplot(data = Dabrowska.data,
       mapping = aes(x = Age)) +
  geom_histogram(binwidth = 2) +
  labs(x = "Age (in years)",
       y = "Number of participants") +
  theme_minimal()

ggplot(data = Dabrowska.data,
       mapping = aes(x = Age)) +
  geom_histogram(binwidth = 4) +
  labs(x = "Age (in years)",
       y = "Number of participants") +
  theme_minimal()

Task 10.3

a. How many different scores did participants obtain on the English grammar test (as stored in the Grammar variable)?

Show sample code to help you answer Q10.7.

unique(Dabrowska.data$Grammar) |> 
  length()

b. What are the lowest Grammar scores among L1 and L2 participants?

Show sample code to help you answer Q10.8.

Dabrowska.data |> 
  group_by(Group) |> 
  summarise(lowest = min(Grammar))

c. Create a histogram of participants’ Grammar scores. Which geometrical parameters need to be used to obtain exactly the same histogram as below?

Show answer to Q10.9

ggplot(data = Dabrowska.data,
       mapping = aes(x = Grammar)) +
  geom_histogram(binwidth = 6) +
  labs(x = "Scores on English grammar test",
       y = "Number of participants") +
  theme_bw()

d. Without trying out the code for yourself, which script was used to generate Plot 1 below?

Plot 1

Script A

ggplot(data = Dabrowska.data,
       mapping = aes(x = Grammar,
                     fill = Group)) +
  geom_histogram(binwidth = 6,
                 alpha = 0.6) +
  labs(x = "Scores on English grammar test",
       y = "Number of participants") +
  theme_minimal() +
  theme(legend.position = "none")

Script B

ggplot(data = Dabrowska.data,
       mapping = aes(x = Grammar,
                     fill = Group)) +
  geom_bar(alpha = 0.6) +
  labs(x = "Scores on English grammar test",
       y = "Number of participants") +
  theme_minimal() +
  theme(legend.position = "none")

Script C

ggplot(data = Dabrowska.data,
       mapping = aes(x = Grammar,
                     colour = Group)) +
  geom_histogram(binwidth = 6) +
  labs(x = "Scores on English grammar test",
       y = "Number of participants") +
  theme_minimal() +
  theme(legend.position = "none")

Script D

ggplot(data = Dabrowska.data,
       mapping = aes(x = Grammar,
                     fill = Group)) +
  geom_histogram(binwidth = 6,
                 alpha = 0.6) +
  facet_wrap(~ Group) +
  labs(x = "Scores on English grammar test",
       y = "Number of participants") +
  theme_bw() +
  theme(legend.position = "none")

e. Without trying out the code for yourself, which script was used to generate Plot 2 below?

Plot 2

10.2.3 Density plots

An alternative to displaying the data in discrete bins is to apply a density function to smooth over the bins of the histogram. This is what we call a density plot. Figure 10.33 is a density plot of participants’ grammar test scores.

Creating density plots in R using {ggplot2} is very simple. Because, yes, you’ve guessed it: there’s a geom_ function for density plots and it’s called… geom_density()! 😃

ggplot(data = Dabrowska.data,
       mapping = aes(x = Grammar)) +
  geom_density(fill = "purple") +
  labs(x = "Scores on English grammar test") +
  theme_minimal()

Figure 10.33: Distribution of English grammar test results

Density plots are particularly useful to examine distribution shapes visually. Looking at Figure 10.33, we can immediately see that the values of the Grammar variable are not normally distributed (see Section 8.2.2).

Task 10.4

a. Which line of code needs to be added to the code used to generate Figure 10.33 to produce a two-panel density plot like Figure 10.34?

Show sample code to help you answer Q10.12.

ggplot(data = Dabrowska.data,
       mapping = aes(x = Grammar)) +
  geom_density(fill = "purple") +
  facet_wrap(~ Group) +
  labs(x = "Scores on English grammar test") +
  theme_bw()

b. Create four density plots to visualise the distribution of participants’ ART (author recognition test), Blocks (non-verbal IQ test), Colloc (English collocation test), and Vocab (English vocabulary test) scores.

Which variable’s distribution is closest to a normal distribution?

Show sample code to help you answer Q10.13.

Dabrowska.data |> 
  select(Vocab, Colloc, ART, Blocks) |>  
  # gather() from {tidyr} converts a selection of variables into two variables:
  # a key and a value. The key contains the names of the original variables
  # and the value contains the data. This means that we can then use the
  # facet_wrap() function from {ggplot2}.
  tidyr::gather() |>
  ggplot(aes(value)) +
    theme_bw() +
    facet_wrap(~ key, scales = "free", ncol = 2) +
    scale_x_continuous(expand=c(0,0)) +
    geom_density(fill = "purple")
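The {tidyr} documentation describes gather() as superseded and recommends pivot_longer() for new code. The following minimal sketch shows an equivalent reshaping step on an invented toy data frame (the scores are made up and only stand in for the four test variables).

```r
library(tidyr)

# Hypothetical toy data standing in for the four test-score columns
toy_scores <- data.frame(Vocab = c(10, 20, 30),
                         Colloc = c(1, 2, 3),
                         ART = c(5, 6, 7),
                         Blocks = c(11, 12, 13))

# One row per score: 'key' holds the original column name, 'value' the score
toy_long <- pivot_longer(toy_scores,
                         cols = everything(),
                         names_to = "key",
                         values_to = "value")

print(toy_long)
```

The resulting key and value columns can be passed to ggplot() and facet_wrap() in the same way as the output of gather().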

Going further: Setting properties within geoms

You can change attributes of a layer by specifying them as arguments within a specific geom_ function.

The help file of each geom_ function provides a list of the aesthetics arguments that each function has (see below). Aesthetics that must be provided for a function to work are marked in bold, whilst the others are optional. Here, the required arguments are taken directly from the aes values specified for the entire plot and therefore need not be specified again within the geom_density() function. If we do not specify any of the optional aesthetics of the geom_ functions, sensible default values will be used. For instance, the line colour of density plots will be black, unless otherwise specified with the argument “colour”.

?geom_density

[…]

Aesthetics

geom_density() understands the following aesthetics (required aesthetics are in bold):

x

y

alpha

colour

fill

group

linetype

linewidth

weight

Learn more about setting these aesthetics in vignette("ggplot2-specs"). […]
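The difference between setting an aesthetic to a constant within a geom_ function and mapping it to a variable within aes() can be illustrated as follows. This is a minimal sketch based on an invented toy data frame; layer_data() is a {ggplot2} helper that returns the data computed for a layer, including the fill colours that were actually used.

```r
library(ggplot2)

# Hypothetical toy data: two groups of invented scores
toy <- data.frame(Group = rep(c("L1", "L2"), each = 5),
                  Score = c(60, 70, 75, 80, 90, 30, 50, 65, 85, 95))

# Setting: fill is a constant outside aes(), so the density is purple
p_set <- ggplot(toy, aes(x = Score)) +
  geom_density(fill = "purple")

# Mapping: fill is linked to a variable inside aes(),
# so each group gets its own fill colour
p_map <- ggplot(toy, aes(x = Score, fill = Group)) +
  geom_density(alpha = 0.6)

# Extract the fill colours used in each plot
fills_set <- unique(layer_data(p_set)$fill)
fills_map <- unique(layer_data(p_map)$fill)

print(fills_set)
print(fills_map)
```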

Figure 10.35 is an example of a density plot with some highly customised aesthetics. It goes without saying that, just because you can customise many aesthetic aspects of a geom_ layer, it doesn’t necessarily mean that it’s a good idea to do so! 🙃

Dabrowska.data |> 
  ggplot(mapping = aes(x = Blocks)) +
    geom_density(colour = "purple",
                 linewidth = 1.5,
                 linetype = "dotdash",
                 fill = "pink",
                 alpha = 0.8) +
    labs(x = "Non-verbal IQ (Blocks) test results")

It can, however, be very useful to help identify different elements within a complex plot as in Figure 10.36.

mean.blocks <- Dabrowska.data |>
  group_by(Group) |>
  summarise(mean = mean(Blocks))

Dabrowska.data |>
  ggplot(mapping = aes(x = Blocks, 
                       fill = Group,
                       colour = Group)) + 
  geom_density(alpha = 0.6, 
               position = "identity") +
  geom_vline(data = mean.blocks, 
             aes(xintercept = mean, 
                 colour = Group), 
             linetype = "dashed",
             linewidth = 0.8) +
  scale_colour_viridis_d(option = "turbo") +
  scale_fill_viridis_d(option = "turbo") +
  theme_minimal() +
  labs(x = "Non-verbal IQ (Blocks) test results")

Note that, to create Figure 10.36, we first calculated the mean Blocks scores for both L1 and L2 participants and stored these values in a new object called mean.blocks. These values are then called within the geom_vline() function, which draws vertical lines. In other words, in addition to setting aes() mappings within the ggplot() function, we can also add additional data mappings within a specific geom_ function. It’s no exaggeration to say that, with {ggplot2}, pretty much anything is possible!

10.2.4 Boxplots

In Section 8.3.2, we saw that boxplots are a great way to visualise both the central tendency (median) of a numeric variable and the spread around this central tendency (IQR). There is an in-built function to create boxplots in {ggplot2}. No prizes will be awarded for guessing that the necessary geom_ function is called… geom_boxplot()! 😆

Whilst it’s possible to plot just a single boxplot, that rarely makes sense. In fact, the x-axis in Figure 10.37 is entirely nonsensical! The distribution of Grammar scores across the entire dataset is much better visualised as a histogram (see Section 10.2.2) or density plot (see Section 10.2.3) than as a single boxplot.

Dabrowska.data |>
  ggplot(mapping = aes(y = Grammar)) +
  geom_boxplot() +
  theme_minimal() + 
  labs(y = "Grammar scores")

If, however, we want to compare the Grammar scores of two or more different groups of participants, a boxplot makes a lot more sense (see Figure 10.38). To achieve this, we add a second argument within the aes() function, which maps the values of the Group variable (which are either “L1” or “L2”) to the plot’s x-axis.

Dabrowska.data |>
  ggplot(mapping = aes(y = Grammar, 
                       x = Group)) +
  geom_boxplot() +
  theme_minimal() +
  labs(y = "Grammar scores")

The meaning conveyed by Figure 10.38 is clear: there is hardly any difference between the average (median) grammar comprehension test scores of L1 and L2 participants in Dąbrowska (2019)’s dataset. Indeed, we can see that the thicker, middle lines within each boxplot are almost at the same level. However, the two boxplots have very different shapes and overall lengths: the scores of the 50% of L2 participants who scored below the median are much more spread out than those of the L1 participants who scored below the median. This makes intuitive sense: native English speakers living in the UK who volunteer for such a study are likely to all have a fairly high to very high understanding of English grammar. By contrast, the L2 speakers are much more varied: some are highly proficient in English, while others are not. This range of proficiency could be due to all sorts of reasons.

What are some of the possible reasons that you can think of? 🤔 Make a note of them as we will explore these hypotheses further in Section 10.2.5.
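The two quantities that a boxplot displays, the median and the IQR, can also be computed directly. The following minimal sketch uses invented scores (not the Dąbrowska data) that mimic the pattern described above: two groups with the same median but very different spreads. It assumes that the {dplyr} package is installed.

```r
library(dplyr)

# Hypothetical toy data: invented grammar scores for two groups
toy_scores <- data.frame(
  Group = rep(c("L1", "L2"), each = 5),
  Grammar = c(80, 85, 90, 95, 100, 40, 60, 90, 95, 100)
)

# Median (thick middle line) and IQR (height of the box) per group
toy_summary <- toy_scores |>
  group_by(Group) |>
  summarise(median = median(Grammar),
            iqr = IQR(Grammar))

print(toy_summary)
```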

Quiz time!

Q10.3 Create a boxplot to compare how participants in different occupational groups (OccupGroup) performed on the English grammar test. Which part of the code used to produce Figure 10.38 do you need to modify to achieve this?

Show sample code to answer Q10.14.

Dabrowska.data |>
  ggplot(mapping = aes(y = Grammar, 
                       x = OccupGroup)) +
  geom_boxplot() +
  theme_minimal() +
  labs(y = "Grammar scores")

Q10.4 The code below was used to create a boxplot figure, except that the arguments of the aes() function have been deleted. Which data mappings were specified inside the aes() function to produce this figure?

Dabrowska.data |>
  mutate(Group = fct_recode(Group,
                            `L1 participants` = "L1",
                            `L2 participants` = "L2")) |> 
  ggplot(mapping = aes(█ █ █ █ █ █ █ █ █ █ █ █)) +
    geom_boxplot(alpha = 0.8) +
    scale_fill_viridis_d(option = "viridis") +
    scale_y_continuous(breaks = seq(0, 100, 10)) +
    labs(y = "Grammar scores", 
         x = NULL,
         fill = "Occupational group",
         title = "English grammar comprehension test results",
         subtitle = "among L1 and L2 participants with different occupations") +
    theme_bw() +
    theme(text = element_text(size = 12), # We set the base size of all text elements.
          legend.position = "bottom", # We move the legend to the bottom of the plot. 
          legend.box.background = element_rect()) # We add a frame around the legend.

Going further: Dot plots and violin plots

The {ggplot2} library offers many more geom_ functions for you to explore. Here are two more types of graphs that are currently only rarely used in the language sciences, but which can be very effective ways to visualise the distribution of a numeric variable across different levels of a categorical variable.

10.2.4.0.1 Dot plots

In a dot plot, each data point (corresponding, here, to a single participant) is represented by a single dot. The size of each dot corresponds to the chosen bin width. This makes dot plots a combination of a boxplot (see Section 8.3.2) and a histogram (see Section 10.2.2).

ggplot(data = Dabrowska.data,
       mapping = aes(y = Grammar, 
                     x = Group)) +
  geom_dotplot(binaxis = "y", 
               stackdir = "center",
               binwidth = 3) +
  labs(y = "Grammar comprehension scores",
       x = NULL) +
  theme_bw()

10.2.4.0.2 Violin plots

The help file of the geom_violin() function describes violin plots as follows:

A violin plot is a compact display of a continuous distribution. It is a blend of geom_boxplot() and geom_density(): a violin plot is a mirrored density plot displayed in the same way as a boxplot.

ggplot(data = Dabrowska.data,
       mapping = aes(y = Grammar, 
                       x = Group)) +
    geom_violin() +
    labs(y = "Grammar comprehension scores",
         x = NULL) +
    theme_bw()

By themselves, violin plots are rather abstract representations of variable distributions. However, in combination with boxplots, they can be an effective way to visualise and compare data distributions. Note that, here, the order of the layers is important because, if we first draw the boxplots and then the violin plots, the violin plots will mask the boxplots completely.

Dabrowska.data |> 
  ggplot(mapping = aes(y = Grammar, 
                       x = Group)) +
    geom_violin(width = 1, 
                colour = "grey", 
                fill = "grey") +
    geom_boxplot(width = 0.08, 
                 alpha = 0.2, 
                 outliers = FALSE) +
    labs(y = "Grammar comprehension scores",
         x = NULL) +
    theme_bw()

10.2.5 Scatter plots

Scatter plots are ideal to examine the relationship between two numeric variables. They are best suited to continuous numeric variables.

In the following, we will build a scatter plot to explore the following hypothesis:

In the data from Dąbrowska (2019), English grammar comprehension scores are more strongly associated with the level of formal education among L2 speakers than among L1 speakers.

To explore this hypothesis, we map the total number of years that participants spent in formal education (EduTotal) onto the x-axis and their Grammar scores onto the y-axis. In addition, we use the facet_wrap() function to split the data into two panels: one for the L1 participants and the other for the L2 group.

Dabrowska.data |> 
  ggplot(mapping = aes(x = EduTotal, 
                     y = Grammar)) +
  facet_wrap(~ Group) +
  geom_point() +
  labs(x = "Number of years in formal education",
       y = "Grammar comprehension test scores",
       title = "Exploring hypothesis 1") +
  theme_bw()

At first glance, it would seem that our data does not support our initial hypothesis: among the L2 participants, there is no obvious trend suggesting that those who scored lowest on the grammar test were the ones who spent fewer years in formal education.

In Figure 10.39, we add a regression line (in blue) per panel to our facetted scatter plot using the geom_smooth(method = "lm") function. This allows us to visualise the correlation between participants’ grammar scores and the number of years they spent in formal education. Regression lines in scatter plots are interpreted as follows:

If the regression line goes up, there is a positive correlation between the two numeric variables. ↗️
If the line goes down, there is a negative correlation. ↘️
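The steeper the line, the more the variable on the y-axis changes with each additional unit of the variable on the x-axis; how strong the correlation is depends on how closely the points cluster around the line. 💪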
If the line is flat (or nearly flat), there is no (linear) correlation between the two variables. ➡️
Be aware that even very strong correlations do not necessarily imply (direct) causation. ❌

Dabrowska.data |> 
  ggplot(mapping = aes(x = EduTotal, 
                       y = Grammar)) +
  facet_wrap(~ Group) +
  geom_point() +
  geom_smooth(method = "lm",
              se = FALSE) +
  labs(x = "Number of years in formal education",
       y = "Grammar comprehension test scores",
       title = "Exploring hypothesis 1") +
  theme_bw()

The regression lines added by the geom_smooth(method = "lm") function are lines of best fit: the better the fit, the closer the points are to the line. If few points are on or close to the line, it means that the regression line is not a good approximation of the relationship between the two variables. This is clearly the case in Figure 10.39, especially in the L2 panel (more on this in the section on correlations). Our data visualisation therefore does not support our hypothesis that, in the Dąbrowska (2019) data, grammar scores are more strongly associated with the level of formal education among the L2 speakers than among the L1 speakers. If anything, our data shows the opposite pattern! Our line of fit is both closer to the data points and steeper in the L1 panel than in the L2 panel.
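A line of best fit of this kind corresponds to a linear model that can also be fitted with the base R function lm(). The following minimal sketch uses invented toy values to show that the slope of the line and the correlation coefficient are related but distinct quantities: rescaling the y variable changes the slope, but not the correlation.

```r
# Hypothetical toy data: five invented observations
toy <- data.frame(x = c(1, 2, 3, 4, 5),
                  y = c(2, 4, 5, 4, 5))

# Slope of the line of best fit and correlation coefficient
fit <- lm(y ~ x, data = toy)
slope <- unname(coef(fit)["x"])
r <- cor(toy$x, toy$y)

# Multiplying y by 10 makes the line ten times steeper,
# but leaves the correlation coefficient unchanged
toy$y_rescaled <- toy$y * 10
slope_rescaled <- unname(coef(lm(y_rescaled ~ x, data = toy))["x"])
r_rescaled <- cor(toy$x, toy$y_rescaled)

print(c(slope = slope, r = r))
print(c(slope = slope_rescaled, r = r_rescaled))
```

This is why the steepness of two regression lines is only directly comparable when, as in the two panels of a facetted plot with shared axes, both are drawn on the same scales.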

So far, all of our data visualisations have only displayed the characteristics of the collected data. In other words, they display descriptive statistics (see Chapter 8) that do not allow us to make inferences about other participants who were not tested as part of Dąbrowska (2019)’s study. Tests of statistical significance, including of correlations, are introduced in the following chapter.

Quiz time!

In this quiz, you will explore another hypothesis:

L2 speakers’ grammar comprehension scores (Grammar) are positively correlated with their length of residence in the UK (LoR): L2 speakers who have lived in the UK for longer have a better understanding of English grammar than those who arrived more recently.

Using {ggplot2}, create a scatter plot that allows you to explore this hypothesis.

Q10.5 What do you need to do before piping the data into the ggplot() function?