Ch. 9: Tasks & Quizzes – Data Analysis for the Language Sciences

Your turn!

The tidyverse package {forcats} (Wickham 2025) has a lot of very useful functions to manipulate factors. They all start with fct_.

Q9.1 Type ?fct_ in an R script or directly in the Console and then press the Tab key on your keyboard. A list of all loaded functions that start with fct_ should pop up. Which of these is not listed?

Q9.2 In the factor object L1.Gender.fct (which we created above), the first level is “F” because it comes first in the alphabet. Which of these commands will make “M” the first level instead? Check out the help files of the following {forcats} functions to understand what they do and try them out.

Your turn!

In this task, you will do some data wrangling on the L2 dataset from Dąbrowska (2019).

Q9.3 Which of these columns from L2.data represent categorical variables and therefore ought to be converted to factors?

Q9.4 Convert all character vectors of L2.data to factors and save the new table as L2.data.fct. Use the str() function to check that your conversion has worked as planned. How many different factor levels are there in the categorical variable Occupation?

Click here to view R code to help you answer Q9.4.

L2.data.fct <- L2.data |> 
  mutate(across(where(is.character), factor))

str(L2.data.fct)
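
Once the conversion has worked, nlevels() and levels() offer a more direct way of counting and listing the levels of a single factor than reading through the str() output. Here is a minimal sketch on a made-up table (toy and its values are invented for illustration and are not part of the Dąbrowska data):

```r
library(dplyr)

# Invented example table, not the real L2 data
toy <- data.frame(
  Participant = 1:4,
  Occupation = c("Student", "Cleaner", "Student", "Nurse")
)

# Same conversion as above: every character column becomes a factor
toy.fct <- toy |>
  mutate(across(where(is.character), factor))

nlevels(toy.fct$Occupation)  # number of distinct levels: 3
levels(toy.fct$Occupation)   # the level labels, in alphabetical order
```

The same two calls can be applied to any factor column of L2.data.fct.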

Q9.5 Use the summary() and str() functions to sanity-check the L2 dataset now that you have converted all the character vectors to factors. Have you noticed that there are three factor levels in the Gender variable of the L2 dataset, whereas there are only two in the L1 dataset? What is the most likely reason for this?

Your turn!

This task focuses on the OccupGroup variable, which is found in both the L1 and L2 datasets.

OccupGroup is a categorical variable that groups participants’ professional occupations into different categories. In the L2 dataset, there are four occupational categories.

L2.data.fct |> 
  count(OccupGroup)

  OccupGroup  n
1          C 10
2          I  3
3          M 21
4         PS 33

Dąbrowska (2019: 6) explains that these abbreviations correspond to:

C: Clerical positions
I: Occupationally inactive (i.e. unemployed, retired, or homemakers)
M: Manual jobs
PS: Professional-level jobs or studying for a degree

Q9.6 Examine the OccupGroup variable in the L1 dataset (L1.data). What do you notice? Why are L1 participants grouped into five rather than four occupational categories?

Click here for R code to help you answer Q9.6.

summary(L1.data.fct$OccupGroup)
##   C   I   M  PS PS  
##  22  23  20  24   1

L1.data.fct |> 
  count(OccupGroup)
##   OccupGroup  n
## 1          C 22
## 2          I 23
## 3          M 20
## 4         PS 24
## 5        PS   1
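
Because printed tables do not mark whitespace, the two PS rows look almost identical at first glance. Printing the level labels themselves puts each one in quotation marks, which makes any stray spaces visible. A small sketch with an invented vector (occup is a hypothetical stand-in for the real column):

```r
# Invented factor with one label that ends in a space
occup <- factor(c("C", "PS", "PS ", "M", "PS"))

# Quotation marks reveal the trailing space in the last level
levels(occup)

# The number of characters in each label confirms it
nchar(levels(occup))
```

Running levels(L1.data.fct$OccupGroup) should reveal the same kind of difference in the real data.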

Q9.7 Which {stringr} function removes trailing spaces from character strings? Find the appropriate function on the {stringr} cheatsheet.

Show R code to use the function and check that it worked as expected.

L1.data.cleaned <- L1.data.fct |> 
  mutate(OccupGroup = str_trim(OccupGroup))

L1.data.cleaned |> 
  count(OccupGroup)
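
Note that str_trim() returns a character vector, so after trimming, OccupGroup is no longer a factor. If a factor is needed for later steps, the column can be converted back. A minimal sketch, again using an invented vector rather than the real data:

```r
library(stringr)

# Invented factor with a duplicated level caused by a trailing space
occup <- factor(c("C", "PS", "PS ", "M", "PS"))
nlevels(occup)         # 4 levels before cleaning

# Trim the whitespace; the result is a character vector
occup.trimmed <- str_trim(as.character(occup))
class(occup.trimmed)   # "character"

# Convert back to a factor: the two PS levels are now merged
occup.cleaned <- factor(occup.trimmed)
levels(occup.cleaned)  # "C" "M" "PS"
```

Inside mutate(), the equivalent would be to wrap the call as factor(str_trim(OccupGroup)).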

Q9.8 Following the removal of trailing whitespace, what percentage of L1 participants have professional-level jobs or are studying for a degree?

Show R code to help you answer Q9.8.

L1.data.cleaned |> 
  count(OccupGroup) |> 
  mutate(percent = n / sum(n),
         percent = percent*100, 
         percent = round(percent, digits = 2)
         )

Your turn!

For some analyses, it may be useful to group together participants whose native languages come from the same family of languages. For example, French, Spanish and Italian L1 speakers may be considered as one group of participants whose native language is a Romance language.

Use mutate() and case_when() to add a new variable to L2.data that corresponds to the L2 participant’s native language family. Call this new variable NativeLgFamily. Use the following language family categories:

Baltic
Chinese
Germanic
Hellenic
Romance
Slavic

If you’re not sure which language family a language belongs to, look it up on Wikipedia (e.g. the Wikipedia page on the German language informs us in a text box at the top of the article that German is a Germanic language).

Q9.9 Which language family is the second most represented among L2 participants’ native languages in Dąbrowska (2019)?

Q9.10 How many L2 participants are native speakers of a language that belongs to the family of Romance languages?

Q9.11 What percentage of L2 participants have a Slavic native language? Round your answer to the nearest percent.

Q9.12 If you check the output of colnames(L2.data) or View(L2.data), you will see that the new variable that you created is now the last column in the table. Consult the help file of the {dplyr} function relocate() to work out how to place this column immediately after NativeLg.

Click here for solutions to Q9.9–Q9.12.

As is often the case, there are several ways to solve these Your turn! tasks. Here is one solution based on what we have covered so far in this chapter.

Q9.9 Note that the following code will only work if you followed the instructions in the section above to create the NativeLg.cleaned variable as it relies on this variable to create the new NativeLgFamily variable.

L2.data <- L2.data |> 
  mutate(NativeLgFamily = case_when(
    NativeLg.cleaned == "Lithuanian" ~ "Baltic",
    NativeLg.cleaned %in% c("Cantonese", "Mandarin", "Chinese") ~ "Chinese",
    NativeLg.cleaned == "German" ~ "Germanic",
    NativeLg.cleaned == "Greek" ~ "Hellenic",
    NativeLg.cleaned %in% c("French", "Italian", "Spanish") ~ "Romance",
    NativeLg.cleaned %in% c("Polish", "Russian") ~ "Slavic"))

As always, it is important to check that things have gone to plan.

L2.data |> 
  select(NativeLg.cleaned, NativeLgFamily) |> 
  head()

  NativeLg.cleaned NativeLgFamily
1       Lithuanian         Baltic
2           Polish         Slavic
3           Polish         Slavic
4          Italian        Romance
5       Lithuanian         Baltic
6           Polish         Slavic
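
Another useful check concerns values that match none of the conditions: case_when() assigns them NA rather than raising an error. A short sketch on an invented table (toy and the language "Hungarian" are made up for illustration):

```r
library(dplyr)

# Invented table with one language that none of the conditions covers
toy <- data.frame(NativeLg.cleaned = c("Polish", "Italian", "Hungarian"))

toy <- toy |>
  mutate(NativeLgFamily = case_when(
    NativeLg.cleaned %in% c("French", "Italian", "Spanish") ~ "Romance",
    NativeLg.cleaned %in% c("Polish", "Russian") ~ "Slavic"))

# List the rows that were not assigned to any family
toy |>
  filter(is.na(NativeLgFamily))
```

Applying the same filter() call to L2.data should return zero rows if every native language has been assigned to a family.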

Q9.10 We can display the distribution of language families using either the base R table() function or the {tidyverse} count() function.

table(L2.data$NativeLgFamily)


  Baltic  Chinese Germanic Hellenic  Romance   Slavic 
       5       15        1        1        6       39

L2.data |> 
  count(NativeLgFamily)

  NativeLgFamily  n
1         Baltic  5
2        Chinese 15
3       Germanic  1
4       Hellenic  1
5        Romance  6
6         Slavic 39

Q9.11 We can add a column to show the distribution in percentages by adding a new “percent” column to the count() table using mutate():

L2.data |>
  count(NativeLgFamily) |>
  mutate(percent = n / sum(n),
         percent = percent*100,
         percent = round(percent, digits = 0)
         ) |> 
  arrange(desc(n))

1: We start with the dataset that contains the new NativeLgFamily variable.
2: We pipe it into the count() function. As shown above, this function produces a frequency table with counts stored in the variable n.
3: We divide the number of participants in each native language family (n) by the total number of participants (sum(n)). This gives proportions ranging from 0 to 1.
4: We multiply these by 100 to get percentages.
5: We round the percentages to the nearest whole number (digits = 0).
6: We reorder the table so that the most represented group is at the top. To do so, we pipe our table into dplyr::arrange(). By default, arrange() orders values in ascending order (from smallest to largest); hence, we add the desc() function to sort the table in descending order of frequency.

  NativeLgFamily  n percent
1         Slavic 39      58
2        Chinese 15      22
3        Romance  6       9
4         Baltic  5       7
5       Germanic  1       1
6       Hellenic  1       1

🪐 Note that this is a {tidyverse} approach to working out percentages; see Mode for a base R approach.
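
As a quick cross-check, the same percentages can be worked out with base R arithmetic on a named vector. This is only one possible base R sketch and not necessarily the approach taken in the Mode section; the counts are copied from the frequency table above and family.counts is a name chosen for this example:

```r
# Counts copied from the count(NativeLgFamily) output
family.counts <- c(Baltic = 5, Chinese = 15, Germanic = 1,
                   Hellenic = 1, Romance = 6, Slavic = 39)

# Proportion of the total, converted to a rounded percentage
family.percent <- round(family.counts / sum(family.counts) * 100)

# Most represented family first
sort(family.percent, decreasing = TRUE)
```

Because each value is rounded separately, the percentages do not have to add up to exactly 100.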

Q9.12 At the time of writing, the help file of the relocate() function still featured examples using the {magrittr} pipe (%>%) rather than the native R pipe (|>) (see Piped Functions), but the syntax remains the same. The first argument is the data which we are piping into the function, the second argument is the column that we want to move. Then, we need to specify where to with either the .after or the .before argument.

L2.data <- L2.data |> 
  relocate(NativeLgFamily, .after = NativeLg)

Figure 1: Artwork explaining the `dplyr::relocate()` function, CC BY 4.0 @allison_horst. Image description: Cartoon of fuzzy monsters moving columns around in fork lifts, while one supervises. Stylized text reads “dplyr::relocate() - move columns around! Default: move to FRONT, or move to .before or .after a specified column.”

Note that the help file specifies that both “.after” and “.before” begin with a dot. If you leave the dot out, the function will not work as expected! Can you spot what has happened here?

L2.data |> 
  relocate(NativeLgFamily, after = NativeLg) |> 
  str()

'data.frame':   67 obs. of  6 variables:
 $ Participant   : int  220 244 46 221 222 230 247 237 243 213 ...
 $ Gender        : chr  "F" "f" "F" "F" ...
 $ Occupation    : chr  "Student" "student" "Cleaner" "Student" ...
 $ OccupGroup    : chr  "PS" "PS" "M" "PS" ...
 $ NativeLg      : chr  "Lithuanian" "polish" "Polish" "Italian" ...
 $ NativeLgFamily: chr  "Baltic" "Slavic" "Slavic" "Romance" ...

The relocate() function has moved NativeLgFamily to the first column (the function’s default position). It has also moved NativeLg to the second position and renamed that column “after”.

This is a reminder to always check whether your data wrangling operations have gone as planned. Just because you didn’t get an error message doesn’t mean that your code did what you wanted! ⚠️
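
One way to see the difference between the two spellings is to compare the resulting column names on a small table. The following is a minimal sketch on invented data (toy and its values are made up; only the column names echo the real dataset):

```r
library(dplyr)

# Invented four-column table
toy <- data.frame(
  Participant = 1:2,
  NativeLg = c("Polish", "Italian"),
  Age = c(30, 41),
  NativeLgFamily = c("Slavic", "Romance")
)

# With the dot: NativeLgFamily is placed right after NativeLg
with.dot <- toy |>
  relocate(NativeLgFamily, .after = NativeLg)
names(with.dot)
# [1] "Participant"    "NativeLg"       "NativeLgFamily" "Age"

# Without the dot: "after = NativeLg" is read as "move NativeLg and
# rename it after", so both columns go to the front
without.dot <- toy |>
  relocate(NativeLgFamily, after = NativeLg)
names(without.dot)
# [1] "NativeLgFamily" "after"          "Participant"    "Age"
```

Comparing names() before and after a wrangling step is a cheap way of catching this kind of silent mistake.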

Your turn!

Q9.13 The following operations describe the steps performed by the data wrangling code chunk above. In which order are the operations performed?

Q9.14 In the combined dataset, how many participants have a clerical occupation?

Q9.15 Of the participants who have a clerical occupation, how many were over 50 years old at the time of the data collection?

Click here to see the R code to answer Q9.15

There are various ways to find the answer to Q9.15. Sticking to a function that we have looked at so far, you could cross-tabulate Age and OccupGroup using the count() function.

combined.data |> 
  count(OccupGroup, Age)

   OccupGroup Age n
1           C  20 1
2           C  25 6
3           C  27 2
4           C  28 3
5           C  29 4
6           C  30 3
7           C  32 3
8           C  37 1
9           C  38 1
10          C  39 1
11          C  41 1
12          C  51 2
13          C  52 1
14          C  53 1
15          C  57 1
16          C  60 1

And then add up the frequencies listed in the rows that correspond to participants with clerical jobs who are over 50.

2 + 1 + 1 + 1 + 1

[1] 6

But, of course, this method is rather error-prone! Instead, we can use dplyr::filter() (see below) to filter the combined dataset according to our two criteria of interest and then count the number of rows (i.e. participants) remaining in the dataset once the filter has been applied.

combined.data |>
  filter(OccupGroup == "C" & Age > 50) |> 
  nrow()

[1] 6
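
If you want to double-check the result without the full dataset, the count() table for the clerical group can itself be summed in R. In this sketch, the Age and n values are copied from the output above, and clerical is a name chosen for this example:

```r
# Age and frequency columns copied from the count() output for OccupGroup C
clerical <- data.frame(
  Age = c(20, 25, 27, 28, 29, 30, 32, 37, 38, 39, 41, 51, 52, 53, 57, 60),
  n   = c(1, 6, 2, 3, 4, 3, 3, 1, 1, 1, 1, 2, 1, 1, 1, 1)
)

# Add up the frequencies of the rows where Age is above 50
sum(clerical$n[clerical$Age > 50])
# [1] 6
```

This gives the same answer as the filter() approach, without adding up the rows by hand.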

Check your progress 🌟