6 ImpoRting data – Data Analysis for the Language Sciences

6.1 Accessing data from a published study

As we saw in Section 1.1, it is good practice to share both the data and materials associated with research studies so that others can reproduce and replicate the research.

In the following chapters, we will focus on data associated with the following study:

Dąbrowska, Ewa. 2019. Experience, Aptitude, and Individual Differences in Linguistic Attainment: A Comparison of Native and Nonnative Speakers. Language Learning 69(S1). 72–100. https://doi.org/10.1111/lang.12323.

Front page of an EMPIRICAL STUDY journal article published at Language Learning: Experience, Aptitude, and Individual Differences in Linguistic Attainment: A Comparison of Native and Nonnative Speakers by Ewa Dabrowska (University of Birmingham — Figure 6.1: Title page from the journal Language Learning

Follow the DOI¹ link above and read the abstract to find out what the study was about. You do not need to have institutional or paid access to the full paper to read the abstract.

The author, Ewa Dąbrowska, has made the data used in this study available on an open repository (see Section 2.4). To find out on which repository, go back to the study’s DOI link and click on the drop-down menu “Supporting Information”. It links to a PDF file. Click on the link and scroll to the last page which contains the following information about the data associated with this study:

Appendix S4: Datasets

Dąbrowska, E. (2018). L1 data [Data set]. Retrieved from https://www.iris-database.org/iris/app/home/detail?id=york:935513

Dąbrowska, E. (2018). L2 data [Data set]. Retrieved from https://www.iris-database.org/iris/app/home/detail?id=york:935514

Quiz time!

Q6.4 On which repository/repositories can the data be found?

You may have noticed that the datasets were published in 2018, whereas the article (Dąbrowska 2019) was published in the following year. This is very common in academic publications as it can take many months or even years for an article or book to be published, by which time the author(s) may have already made the data available on a repository. This particular article was actually first published on the journal’s website on 22 October 2018 as an “advanced online publication”, but was not officially published until March 2019 as part of Volume 69, Issue S1 of the journal (see https://doi.org/10.1111/lang.12323). This explains the discrepancy between the publication date of the datasets and the publication date of the article recorded in the bibliographic reference.

6.2 Saving and examining the data

Click on the two links listed in Appendix S4 and download the two datasets. Note that the URL may take a few seconds to redirect and load. Save the two datasets in an appropriate place on your computer (see Section 3.3), as we will continue to work with these two files in the following chapters.

What’s a good place to save these files? 🤔

If you haven’t already done so, I suggest that you create a folder in which you save everything that you create whilst learning from this textbook. This folder could be called something along the lines of DataLiteracyTextbook, 2024_data_literacy, or LeFoll_2024_DataLiteracy (see Section 3.2). Then, within this folder, I recommend that you create another folder called Dabrowska2019 (note how I have not included the “ą” character in the folder name as this could cause problems), and within this folder, create another folder called data. This is the folder in which you can save these two files.

Quiz time!

Q6.5 In which data format are these two files saved?

The file L1_data.csv contains data about the study’s L1 participants. It is a delimiter-separated values (DSV) file (see Section 2.5.1). The first five lines of the file are printed below. Note that this is a very wide table as it contains many columns (you can scroll to the right to view all the columns).

Participant,Age,Gender,Occupation,OccupGroup,OtherLgs,Education,EduYrs,ReadEng1,ReadEng2,ReadEng3,ReadEng,Active,ObjCl,ObjRel,Passive,Postmod,Q.has,Q.is,Locative,SubCl,SubRel,GrammarR,Grammar,VocabR,Vocab,CollocR,Colloc,Blocks,ART,LgAnalysis
1,21,M,Student,PS,None,3rd year of BA,17,1,2,2,5,8,8,8,8,8,8,6,8,8,8,78,95,48,73.33333333,30,68.75,16,17,15
2,38,M,Student/Support Worker,PS,None,NVQ IV Music Performance,13,1,2,3,6,8,8,8,8,8,8,7,8,8,8,79,97.5,58,95.55555556,35,84.375,11,31,13
3,55,M,Retired,I,None,No formal (City and Guilds),11,3,3,4,10,8,8,8,8,8,7,8,8,8,8,79,97.5,58,95.55555556,31,71.875,5,38,5
4,26,F,Web designer,PS,None,BA Fine Art,17,3,3,3,9,8,8,8,8,8,8,8,8,8,8,80,100,53,84.44444444,37,90.625,20,26,15

6.3 Using Projects in RStudio

One of the advantages of working with RStudio is that it allows us to harness the potential of RStudio Projects. Projects help us to keep our digital kitchen nice and tidy. In RStudio, each project has its own directory, environment, and history which means that we can work on multiple projects at the same time and RStudio will keep them completely separate. This means that we can easily switch between cooking different dishes, say a gluten-free egg curry and vegan pancakes, without fear of accidentally setting the wrong temperature on the cooker or contaminating either dish.

Regardless of whether or not you’re a keen multi-tasker, RStudio Projects are a great way to help you keep together all the data, scripts, and outputs associated with a single project in an organised manner. In the long run, this will make your life much, much easier. It will also be an absolute lifesaver as soon as you need to share your work with others (e.g., your supervisor, colleagues, reviewers, etc.).

To create a new Project, you have two options. In RStudio, you can select ‘File’, then ‘New Project…’ (see “a” on Figure 6.2). Alternatively, you can click on the Project button in the top-right corner of your RStudio window and then select ‘New Project…’ (see “b” on Figure 6.2).

Screenshot of the *RStudio* interface with two highlighted sections for creating a new project. The first highlight (a) is on the 'File' menu in the top left corner, where the option 'New Project...' is circled. The second highlight (b) is in the top right corner, where the option 'New Project...' in a dropdown menu is circled. Both options provide a way to start a new project in RStudio. — Figure 6.2: Create a new project in RStudio

Both options will open up a window with three options for creating a new project:

New Directory (which allows you to create an entirely new project for which you do not yet have a folder on your computer)
Existing Directory (which allows you to create a project in an existing folder associated with your project)
Version Control (see Bryan 2018).

In Section 6.2, you should have already saved the data that we want to import in a dedicated folder on your computer. Here, a folder is the same as a directory. Hence, you can select the second option: ‘Existing Directory’.

Clicking on this option will open up a new window (Figure 6.2). Click on ‘Browse…’ to navigate to the folder where you intend to save all your work related to Dąbrowska (2019). If you followed my suggestions earlier on, this would be a folder called something along the lines of Dabrowska2019. Once you have selected the correct folder, select the option ‘Open in a new session’ and then click on ‘Create Project’.

New Project Wizard for Create Project from Existing Directory R Project. It shows the working directory: .../Documents/DataLiteracyTextbook/Dabrowska2019. There are three buttons: Browse..., Create Project and Cancel — New project window

Creating an RStudio project generates a new file in your project folder called Dabrowska2019.Rproj. You can see it in the Files pane of RStudio. Note that the extension of this newly created file is .Rproj. Such .Rproj files store information about your project options, which you will not need to edit. More usefully, .Rproj files can be used as shortcuts for opening your projects. To see how this works, shut down RStudio. Then, in your computer file system (e.g., using a File Explorer window on Windows and a Finder window on macOS), navigate to your project folder to locate your .Rproj file (see Figure 6.3 (a)). Double-click on the file. This will automatically launch RStudio with all the correct settings for this particular project. Alternatively, you can use the Project button in the top-right corner of your RStudio window to open up a project from RStudio itself (see Figure 6.3 (b)).

Screenshot of the project folder in the MacOS Finder with the .Rproj-file highlighted. To launch the project in a new session, the corresponding .Rproj-file can be double-clicked. — (a) Launching a Project from the File Finder

Screenshot of the dropdown menu under the 'Project: (None)' in RStudio, displaying options such as 'New Project...', 'Open Project...', 'Open Project in New Session...', and 'Close Project.' The 'Open Project in New Session...' option is circled in red. — (a) Launching a Project from the File Finder

6.4 Working directories

The folder in which the .Rproj file was created corresponds to your project’s working directory. Once you have opened a Project, you can see the path to your project’s working directory at the top of the Console pane in RStudio. The Files pane should also show the content of this directory.

Screenshot of the Files pane of *RStudio* showing the contens of a folder entitled 'Dabrowska2019'. It contains an RProject file called 'Dabrowska2019.Rproj', a folder called 'analysis' and another folder called 'data'. — Contents of the project folder as displayed by RStudio

Click on the “New Folder” icon in your Files pane to create a new subfolder called analysis. Your folder Dabrowska2019 should now contain an .RProj file and two subfolders called analysis and data.

6.5 Importing data from a `.csv` file

We will begin by creating a new R script in which we will write the code necessary to import the data from Dąbrowska (2019)‘s study in R. To do so, from the Files pane in RStudio, click on the analysis folder to open it and then click on the ’New Blank File’ icon in the menu bar of the Files pane and select ‘R Script’. This will open a new, empty R script in your Source pane. It is best to always begin by saving a newly created file. Save this empty script with a computer- and human-friendly file name such as 1_DataImport.R (Section 3.2). It should now appear in your analysis folder in the Files pane.

Given that we want to import two .csv files, we are going to use the function read.csv(). You can find out what this function does by running the command ?read.csv or help(read.csv) in the Console to open up the documentation. This help file contains information about several base R functions used to import data. Scroll down to find the information about the read.csv() function. It reads:

read.csv(file, header = TRUE, sep = ",", quote = "\"",
         dec = ".", fill = TRUE, comment.char = "", ...)

This line from the documentation informs us that this function’s first argument is the path to the file from which we want to import the data. It also informs us that file is the only argument that does not have a default value (as it is not followed by an equal sign and a value). In this function, file is therefore the only argument that is compulsory. Hence, in theory, all we need to write to import the data is:

L1.data <- read.csv(file = "data/L1_data.csv")

In fact, we could shorten things even further as, unless otherwise specified, R will assume that the first value listed after a function corresponds to the function’s first argument which, here, is file. In other words, this command and the one above are equivalent:

L1.data <- read.csv("data/L1_data.csv")

The file path "data/L1_data.csv" informs R that the data is located in a subfolder of the project’s working directory called data and that, within this data subfolder, the file that we want to import is called L1_data.csv. Note that the file extension must be specified (see Section 2.3). Note, also, the file path is separated with a single forward slash /. In R, this should work regardless of the operating system that you are using and, in order to be able to easily share your scripts with others, it is recommended that you use forward slashes even if you are running Windows (Section 3.3).

Although the command above did the job, in practice, it is often safer to spell things out further to remind ourselves of some of the default settings of the function that we are using in case they need to be checked or changed at a later stage. In this example, we will therefore import the data with the following command:

L1.data <- read.csv(file = "data/L1_data.csv",
                    header = TRUE,
                    sep = ",",
                    quote = "\"",
                    dec = ".")

In the command above, header = TRUE, explicitly tells R to import the first row of the .csv table as column headers rather than values. This is not strictly necessary because, as we saw from the function’s help file, TRUE is already set as the default value for this argument, but it is good to remind ourselves of how this particular dataset is organised.

The arguments sep and quote specify the characters that, in this .csv file are used to separate the values on the one hand, and delineate them, on the other (see Section 2.5.1). As we saw above, Dąbrowska (2019)’s .csv files use the comma as the separator and the double quotation mark as the quoting character. Note that the " character needs to be preceded by a backslash (\) (we say it needs to be “escaped”) because otherwise R will interpret it as part of the command syntax, which would lead to an error. Finally, the argument dec = "." explicitly tells R that this .csv file uses the dot as the decimal point. In some countries, e.g., Germany and France, the comma is used to represent decimal places so, if you obtain data from a German or French colleague, this setting may need to be changed to dec = "," for the data to be imported correctly.

“Hell is empty, and all the devils are {here}.” 😈

This section title borrows a quote from The Tempest by William Shakespeare to reflect the fact that file paths are perhaps the most frequent source of frustration among (beginner) coders. Section 6.6 explains how to deal with the most frequent error messages. Ultimately, however, these errors are typically due to poor data management (see Chapter 3). That’s because the devil’s in the detail (remember: no spaces, special characters, differences between different operating systems, etc.). As a result, even advanced users of R and other programming languages frequently find that file path issues continue to plague their work, if they fail to take file management seriously.

To make your projects more robust to such issues, I strongly recommend working with the {here} package in addition to RProjects. You will first need to install the package:

install.packages("here")

When you load the package, it automatically runs the here() function with no argument, which returns the path to your project directory, as determined by the location of the .RProj file associated with your project.

library(here)

You can now use the here() function to build paths relative to this directory with the following syntax:

here("data", "L1_data.csv")

[1] "/Users/lefoll/Documents/UzK/RstatsTextbook/data/L1_data.csv"

And you can embed this path in your import command like this:

library(here)

L1.data <- read.csv(file = here("data", "L1_data.csv"))

Much like wearing a helmet for extra safety (Figure 6.4), {here} makes the paths that you include in your code far more robust. In other words, they are far less likely to fail and break your code when you share your scripts with your colleagues, or run them yourself from different directories or operating systems. For more reasons to use {here}, check out the blog post “Why should I use the here package when I’m already using projects?” by Barrett (2018).

A cartoon of a cracked glass cube looking frustrated with casts on its arm and leg, with bandaids on it, containing “setwd”, looks on at a metal riveted cube labeled “R Proj” holding a skateboard looking sympathetic, and a smaller cube with a helmet on labeled “here” doing a trick on a skateboard. — Figure 6.4: Although a fairly common way of working with data in `R`, using `setwd()` (see Section 6.12.1) is dangerous and will, sooner or later, cause you and/or your colleagues some nasty accidents. A combination of using RProj and {here} as described above is much safer!

6.6 Import errors and issues 🥺

It is crucial that you check whether your data has genuinely been correctly imported. Here’s a list of things to check (in that order):

Were you able to run the import command without producing any errors? If you are getting an error, remember that this is most likely due to a typo (see Section 5.6)!
- If part of the error message reads “No such file or directory”, this means that either the file path or the file name is incorrect. Carefully check the path that you specified in your import command (if you’re struggling to find the correct path, you may want to try out the import method explained in Section 6.12.2). To ensure that you are not misspelling the name of the file, you can press the tab key on your keyboard to get RStudio to auto-complete the file name for you.
- If the error message includes the statement “could not find function”, this means that you have either misspelled the name of the function or this is not a base R function and you have forgotten to load the library to which this function belongs (see Section 6.8).
- As usual whenever you get an error message, also check that you have included all of the necessary brackets and quotation marks (see Section 5.6).
Has the R data object appeared in your Environment pane? Does it have the expected number of rows (observations) and columns (variables)? L1.data contains 90 observations and 31 variables. If you are getting different numbers, this might be because you previously opened the .csv file with Excel or that your computer converted it to Excel format automatically. To remedy this, ensure that you have followed all the steps described in Section 2.5.2.
To view the entire table, use the function View() with the name of your data object as the first and only argument, e.g., View(L1.data)². This will open up a new tab in your Source pane that displays the full table, much like in a spreadsheet programme. You can search and filter the table in this tab, but you cannot edit it in any way (and that’s a good thing because, if we want to edit things, we want to ensure that we keep track of our changes in a script!). Browse through the table and check that everything “looks healthy”. This is much like visually inspecting and smelling ingredients before using them in a recipe. It’s not perfect but if something is really off, you should notice it. Check that each cell appears to have one and only one value.
Finally, use the str() function to view the structure of your data object in a more compact way. Using the command str(L1.data) will display a summary of the data.frame in the Console. The summary begins by informing us that this data object is a data.frame, that contains 90 observations and 31 variables. Then, it lists all of the variables, followed by the type of values stored in this variable (e.g., character strings or integers) and then the first few values for each variable. Especially with very wide tables that contain a lot of variables, it is often easier to check the summary of the imported data with str() than with View(), though I would always recommend taking a few seconds to do both. This is time well spent!

6.7 Importing tabular data in other formats

We have seen how to load data from a .csv file into R by creating an R data frame object that contains the data extracted from a .csv file. But, as we saw in Chapter 2, not all datasets are stored as .csv files. Fear not: there are many import functions in R, with which you can import pretty much all kinds of data formats! This section introduces a few of the most useful ones for research in the language sciences.

We begin with the highly versatile function read.table(). The read.csv() is actually a variant of read.table(). You recall that when we called up the help file for the former using ?read.csv(), we obtained a combined help file for several functions, the first of which was read.table(). By specifying the following arguments as we did earlier, we can actually use the read.table() function to import our .csv file with exactly the same results:

L1.data <- read.table(file = "data/L1_data.csv",
                    header = TRUE,
                    sep = ",",
                    quote = "\"",
                    dec = ".")

6.7.1 Tab-separated file

In Task 2.3 in Section 2.5.2, you downloaded and examined a DSV file with a .txt extension that was separated by tabs: offlinedataLearners.txt from Schimke et al. (2018).

If we change the separator character argument to \t for tab, we can also import this dataset in R using the read.table() function:

OfflineLearnerData <- read.table(file = "data/offlinedataLearners.txt",
                                        header = TRUE,
                                        sep = "\t",
                                        dec = ".")

For the command above to work, you will first need to save the file offlinedataLearners.txt to the folder specified in the path. Otherwise, you will get an error message informing you that there is “No such file or directory” (see Section 6.6).

6.7.2 Semi-colon-separated file

Figure 6.5 displays an extract of the dataset AJT_raw_scores_L2.csv from an experimental study by Busterud et al. (2023). Although this DSV file has a .csv extension, it is actually separated by semicolons. As you can see in Figure 6.5, in the file AJT_raw_scores_L2.csv, the comma is used to show the decimal place.

The image shows a table from the data file AJT_raw_scores_L2.csv by Busterud et al. It includes rows numbered 261 to 274 with columns for ID, L3, Years of L3, Gender, L3 selfasses, L3 grade, L2 selfasses, and L2 grade. All entries in L3, Years of L3, and Gender are 2, 4, and 4, respectively. Values for L3 selfasses range from 1 to 2,5, L3 grade from 2 to 5, L2 selfasses from 2 to 6, and L2 grade from 3 to 6. — Figure 6.5: Extract of data file AJT_raw_scores_L2.csv from

If you look carefully, you will also see that this dataset has some empty cells. This data can be downloaded from https://doi.org/10.18710/JBMAPT. It is delivered with a README text file. It is good practice to include a README file when publishing datasets or code and, as the name suggests, it is always a good idea to actually read README files! 🙃 Among other things, this particular README explains that, in this dataset: “Missing data are represented by empty cells.”

If you call up the help file for the read.table() function again, you will see that there is an argument called na.strings. The default value is NA. When we import this dataset AJT_raw_scores_L2.csv from Busterud et al. (2023), we will therefore need to change this argument to ensure that empty cells are recognised as missing values.

In addition to the file path, the command to import this dataset specifies the separator character as the semicolon (sep = ";"), the character used to represent decimals (dec = ","), and empty cells as missing values (na.strings = ""):

AJT.raw.scores.L2 <- read.table(file = "data/AJT_raw_scores_L2.csv",
                          header = TRUE,
                          sep = ";",
                          dec = ",",
                          na.strings = "")

Once we have run this command, we should check that the data have been correctly imported, for example by using the View() function:

View(AJT.raw.scores.L2)

           ID L3 Years.of.L3 Gender L3.selfasses L3.grade L2.selfasses L2.grade
1     BKRE452  2           4      2          2.5        3            4        3
2     SHEL876  2           4      2          3.0        5            6        5
3     SVIØ510  2           4      1          4.0        4            6        5
4     EHEA194  2           4      1          2.0        2            6        4
5     ERAO442  2           4      2          3.0        3            5        4
6     SEIO103  2           4      1          2.0        3            4        3
7     NMOI241  2           4      1          3.0        4            4        4
8  BBIE911/77  2           4      1           NA        3           NA        4
9     UUNO561  2           4      1          2.0        3            3        4
10    SMAO470  2           4     NA          3.0     <NA>            6       NA
11    SSID616  2           3      1          2.0        2            6        4
12    SHRI714  2           1      2          4.0        3            3        3
13    HALI620  2           1      2          5.0        6            4        4

Here, we can see that the data have been correctly imported as a table. The commas have been correctly converted to decimal points and the empty cells are now labelled NA.

6.8 Using {readr} from the {tidyverse} to import tabular files

The {tidyverse} is a family of packages that we will use a lot in future chapters. This family of package includes the {readr} package which features some very useful functions to import data into R. You can install and load the {readr} package either individually or as part of the {tidyverse} bundle:

# Install the package individually:
install.packages("readr")

# Or install the full tidyverse (this will take a little longer):
install.packages("tidyverse")

# Load the library:
library(readr)

Delimiter-separated values (DSV) files

The {readr} package includes functions to import DSV files that are similar, but not identical to the base R functions explained above. The main difference is that the {readr} functions load data into an R object of type “tibble” rather than “data frame”. In practice, this will not make a difference for our work in future chapters. Hence, the following two commands can equally be used to import L1_data.csv:

# Import .csv file using the base R function read.csv():
L1.data <- read.csv(file = "data/L1_data.csv", 
                    header = TRUE, 
                    quote = "\"")

class(L1.data)

[1] "data.frame"

# Import .csv file using the {readr} function read_csv():
L1.data <- read_csv(file = "data/L1_data.csv", 
                    col_names = TRUE, 
                    quote = "\"")

class(L1.data)

[1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame"

Note that instead of the argument header = TRUE, the {readr} function read_csv() takes the argument col_names = TRUE, which has the same effect.

There are a few more differences between the two functions that are worth noting:

If the column headers in your original data file contain spaces, these will be automatically replaced by dots (.) when you import data using base R functions. By contrast, with the {readr} functions, the spaces will, by default, be retained. As we will see later, this behaviour can represent both an advantage and disadvantage, depending on what you want to do.
The {readr} functions are quicker and are therefore recommended if you are importing large datasets.
In general, the behaviour of {readr} functions is more consistent across different operating systems and locale settings (e.g., the language in which your operating system is set).

Note that, just like read.csv() was a special case of read.table, the {readr} function read_csv() is a special variant of the more general function read_delim() that can be used to import data from all kinds of DSV files. Check the help file to find out all the options using ?read_delim.

The help file informs us that the package includes a function specifically designed to import semi-colon separated file with the the comma as the decimal point: read_csv2(). It further states that “[t]his format is common in some European countries.” If you scroll down the help page, you will see that its usage is summarised in the following way:

read_csv2(
  file,
  col_names = TRUE,
  col_types = NULL,
  col_select = NULL,
  id = NULL,
  locale = default_locale(),
  na = c("", "NA"),
  quoted_na = TRUE,
  quote = "\"",
  comment = "",
  trim_ws = TRUE,
  skip = 0,
  n_max = Inf,
  guess_max = min(1000, n_max),
  progress = show_progress(),
  name_repair = "unique",
  num_threads = readr_threads(),
  show_col_types = should_show_types(),
  skip_empty_rows = TRUE,
  lazy = should_read_lazy()
)

This overview of the read_csv2 function shows all of the arguments of the function and their default values. For instance, with na = c("", "NA"), it tells us that, by default, both empty cells and cells with the value NA will be interpreted by the function as NA values.

Having checked the default values for all of the arguments of the read_csv2 function, we may conclude that we can safely use this {readr} function to import the file AJT_raw_scores_L2.csv from Busterud et al. (2023) without changing any of these default values. Hence, all we need is:

AJT.raw.scores.L2 <- read_csv2(file = "data/AJT_raw_scores_L2.csv")

Note that, whereas when we used the base R function read.table() the header for the third variable in the file was imported as Years.of.L3, using the {readr} function, the variable is entitled Years of L3.

Fixed width files

Fixed width files (with file extensions such as .gz, .bz2 or .xz) are a less common type of data source in the language sciences. In these text-based files, the values are separated not by a specific character such as the comma or the tab, but by a set amount of white/empty space other than a tab. Fixed width files can be loaded using the read_fwf() function from {readr}. Fields can be specified by their widths with fwf_widths() or by their positions with fwf_positions().

6.9 Importing files from spreadsheet software

If your data are currently stored in a spreadsheet software (e.g., LibreOffice Calc, Google Sheets, or Microsoft Excel), you can export them to .csv or .tsv. However, if you do not wish to do this (e.g., because your colleague wishes to maintain the spreadsheet format that includes formatting elements such as bold or coloured cells), there are functions to import these file formats directly into R.

6.9.1 LibreOffice Calc

For LibreOffice Calc (which you should have installed in Section 1.2), you can install the {readODS} package and use its read_ods() function to import .ods files. Details about all the options can be found here https://www.rdocumentation.org/packages/readODS/versions/2.3.0.

# Install from CRAN (safest option):
install.packages("readODS")

# Or install the development version from Github:
remotes::install_github("ropensci/readODS")

# Load the library:
library(readODS)

# Import your .ods data:
MyLibreOfficeData <- read_ods("data/MyLibreOfficeTable.ods",
                            sheet = 1,
                            col_names = TRUE,
                            na = NA)

6.9.2 Microsoft Excel

Various packages can be used to import Microsoft Excel file formats, but the simplest is {readxl}, which is part of the {tidyverse}. It allows users to import data in both .xlsx and the older .xls format. You can find out more about its various options here: https://readxl.tidyverse.org/.

# Install from CRAN (safest option):
install.packages("readxl")

# Or install the development version from Github:
remotes::install_github("tidyverse/readxl")

# Load the library:
library(readxl)

# Import your .ods data:
MyExcelData <- read_excel("data/MyExcelSpreadsheet.xlsx",
                            sheet = 1,
                            col_names = TRUE,
                            na = NA)

6.9.3 Google Sheets

There are also several ways to import data from Google Sheets. The simplest is to export your tabular data as a .csv, .tsv, .xslx, or .ods file by selecting Google Sheet’s menu option ‘File’ > ‘Download’. Then, you can simply import this downloaded file in R using the corresponding function as described above.

However, if you want to directly import your data from Google Sheets and be able to dynamically update the analyses that you conduct in R even as the input data is amended on Google Sheets, you can use the {googlesheets4} package (which is part of the {tidyverse}):

# Install from CRAN (safest option):
install.packages("googlesheets4")

# Or install the development version from Github:
remotes::install_github("tidyverse/googlesheets4")

# Load the library:
library(googlesheets4)

# Import your Google Sheets data using your (spread)sheet's URL:
MySheetsData <- read_sheet("https://docs.google.com/spreadsheets/d/1U6Cf_qEOhiR9AZqTqS3mbMF3zt2db48ZP5v3rkrAEJY/edit#gid=780868077",
                            sheet = 1,
                            col_names = TRUE,
                            na = "NA")

# Or import your Google Sheets data using just the sheet's ID:
MySheetsData <- read_sheet("1U6Cf_qEOhiR9AZqTqS3mbMF3zt2db48ZP5v3rkrAEJY/edit#gid=780868077",
                            sheet = 1,
                            col_names = TRUE,
                            na = "NA")

Sign in with Google. Tidyverse API Packages wants access to your Google Account. Select what Tidyverse API Packages can access • See, edit, create, and delete all your Google Sheets spreadsheets. Because you're using Sign in with Google, Tidyverse API Packages will be able to Associate you with your personal info on Google, See your primary Google Account email address. Make sure you trust Tidyverse API Packages. You may be sharing sensitive info with this site or app. Learn about how Tidyverse API Packages will handle your data by reviewing its privacy policies. You can always see or remove access in your Google Account. Learn about the risks. The two buttons at the bottom read: Cancel and Continue — Figure 6.6: Dialogue box to consent to the {tidyverse} API Packages having access to your Google Drive to import directly from a Google Sheet. Read this carefully before clicking on ‘Continue’.

6.9.4 Importing spreadsheet files with multiple sheets/tabs

Note that, as spreadsheet software typically allow users to have several “sheets” or “tabs” within a file that each contains separate tables, the functions read_excel(), read_ods(), and read_sheet include an argument called sheet which allows you to specify which sheet should be imported. The default value is 1, which simply means that the first one is imported. If your sheets have names, you can also use its name as the argument value, e.g.:

MyExcelData <- read_excel("data/MyExcelSpreadsheet.xlsx",
                            sheet = "raw data",
                            col_names = TRUE,
                            na = NA)

6.10 Importing data files from SPSS, SAS and Stata

If you’ve recently switched from working in SPSS, SAS, or Stata (or are collaborating with someone who uses these programmes), it might be useful to know that you can also import the data files created by programmes directly into R using the {haven} package. Details of all the options can be found here: https://haven.tidyverse.org/.

# Install from CRAN (safest option):
install.packages("haven")

# Or install the development version from Github:
remotes::install_github("tidyverse/haven")

# Load the library:
library(haven)

# Import an SAS file
MySASData <- read_sas("MySASDataFile.sas7bdat")

# Import an SPSS file
MySPSSDataFile <- read_sav("MySPSSDataFile.sav")

# Import a Stata file
MyStataData <- read_dta("MyStataDataFile.dta")

6.11 Importing other file formats

In this textbook, we will only deal with DSV files (Section 2.5.1) but, as you can imagine, there are many more R packages and functions that allow you to import all kinds of other file formats. These include .xml, .json and .html files, various database formats, and other files with complex structures (see, e.g., https://rc2e.com/inputandoutput).

In addition, fellow linguists are constantly developing new packages to work with file formats that are specific to our discipline. In the spirit of Open Science (see Chapter 1), many are making these packages available to the wider research community by releasing them under open licenses. For example, Linguistics M.A. students Katja Wiesner and Nicolas Werner wrote an R package to facilitate the import of .eaf files generated by the annotation software ELAN (Lausberg & Sloetjes 2009) into R as part of a seminar project supervised by Dr. Fahime (Fafa) Same at the University of Cologne (https://github.com/relan-package/rELAN/?tab=readme-ov-file).

6.12 Quick-and-dirty (aka bad!) ways to import data in `R`

Feel free to skip this section if you got on just fine with the importing method introduced above as the following two methods are problematic for a number of reasons. However, they may come in useful in special cases, which is why both are briefly explained below.

6.12.1 Hardcoding file paths in `R` scripts 🥴

Whilst it is certainly not recommended (see, e.g., Bryan 2017), it is nonetheless worth understanding this method of working with file paths in R as you may well come across it in other people’s code.

Instead of creating an RProject to determine a project’s working directory (as we did in Section 6.3), it is possible to begin a script with a line of code that sets the working directory for the script using the function setwd(), e.g.:

setwd("/Users/lefoll/Documents/UzK/RstatsTextbook/Dabrowska2019")

Afterwards, data files can be imported using a relative path from the working directory just like we did earlier.

L1.data <- read.csv(file = "data/L1_data.csv")

If you have to work with an R script that uses this method, you will need to amend the path designated as the working directory by setwd() to the corresponding path on your own computer. This might not sound like much of an issue, but as data scientist, R expert, and statistics professor Jenny Bryan (2017) explains:

The chance of the setwd() command having the desired effect – making the file paths work – for anyone besides its author is 0%. It’s also unlikely to work for the author one or two years or computers from now. The project is not self-contained and portable.

In the language sciences, not everyone is aware of the severity of these issues. Hence, it is not uncommon for researchers to make their scripts even less reproducible by not setting a working directory at all and, instead, relying exclusively on absolute paths (Section 3.3). Hence, every time they want to import data (and, as we will see later on, export objects from R, too), they write out the full file path in the command like this:

L1.data <- read.csv(file = "/Users/lefoll/Documents/UzK/RstatsTextbook/Dabrowska2019/data/L1_data.csv")

Having to work with such a script is particularly laborious because it means that, if you inherit such a script from a colleague, you will have to manually change every single file path in the script to the corresponding file paths on your own computer. And, as Bryan (2017) points out, this will also apply if you change anything in your own computer directory structure! I hope I’ve made clear that the potential for making errors in the process is far too important to even consider going down that route.

However, should you have to use this method at some point for whatever reason, you can make use of Section 3.3 which explained how to copy full file paths from a file Explorer or Finder window. Note that if there are spaces or other special characters other than _ or - anywhere in your file path, your import command will fail (see Section 3.2 on naming conventions for folders and files). The following command, for instance, will fail and return an error (see Figure 6.7) because the folder “Uni Work” contains a space.

L1.data <- read.csv(file = "/Users/lefoll/Documents/Uni Work/RstatsTextbook/Dabrowska2019/data/L1_data.csv")

Screenshot of the Console pane of *RStudio* showing the above command having been run and having returned an error message: Error in file(file, "rt") : cannot open the connection In addition: Warning message: In file(file, "rt") : cannot open file '/Users/lefoll/Documents/Uni Work/RstatsTextbook/Dabrowska2019/data/L1_data.csv': No such file or directory — Figure 6.7: Error message due to an error in the file path

The only way to fix this issue is to remove the space in the name of the folder (in your File Finder or Navigator window) and then amend the file path in your R script accordingly.

6.12.2 Importing data using RStudio’s GUI 🫤

You may have noticed that, if you click on a data file from the Files pane in RStudio (Figure 6.8), RStudio will offer to import the dataset for you. This looks like (and genuinely is) a very convenient way to import data in an R session using RStudio’s GUI (graphical user interface).

Screenshot of the Files pane in *RStudio* displaying the options after clicking on a CSV file. The option 'Import Dataset...' is highlighted. Clicking on this will open the 'Import Text Data' dialog for importing this file. — Figure 6.8: Importing a file from RStudio’s File pane

Clicking on ‘Import Dataset’ opens up RStudio’s ‘Import Text Data’ dialogue box, which is similar to the one that we saw in LibreOffice Calc (Section 2.5.2). It allows you to select the relevant options to correctly import the file and displays a preview to check that the options that you have selected are correct. You can also specify the name of the R object to which you want to assign the imported data. By default, the name of the data file (minus the file extension and any special characters) is suggested.

Screenshot of the 'Import Text Data'-dialouge showing the import settings of a the file 'L1_data.csv'. It shows the 'File URL' of the selected file to be imported, a 'Data Preview', the 'Import Options' and the 'Code Preview'. — Figure 6.9: RStudio’s ‘Import Text Data’ dialogue

As soon as you click on the ‘Import’ button, the data is imported and opened using the View() function for you to check the sanity of the data.

This importing method works a treat, so what’s not to like? Well, the first problem is that you are not in full control. You cannot select which import function is used; RStudio decides for you. You may have noticed that it chooses to use the {readr} import functions, rather than the base R ones. There are lots of good reasons to use the {readr} functions (see Section 6.8), but it may not be what you wanted to do. When we do research, it is important for us to be in control of every step of the analysis process.

Second, your data import settings are not saved in an .R script as the commands were only sent to the Console: they are part of your script. This means that if you import your data in this way, do some analyses, and then close RStudio, you will have no way of knowing with which settings you imported the data to obtain the results of your analysis! This can have serious consequences for the reproducibility of your work.

Whilst there is no way of remedying the first issue, the second can easily be fixed. After you have successfully imported your data from RStudio’s Files pane, you can (and should!) immediately copy the import commands from the Console into your .R script. In this way, the next time you want to re-run your analysis, you can begin by running these import commands directly from your .R script rather than by via RStudio’s Files pane.

If you are running into errors due to incorrect file paths, it can be useful to try to import your data using RStudio’s GUI to see where you are going wrong by comparing your own attempts with the import commands that RStudio generated.

Check your progress 🌟

You have successfully completed 0 out of 7 questions in this chapter.

Are you confident that you can…?

Access, save, and examine data from published studies (Section 6.1) - (Section 6.2)
Set up and use projects in RStudio (Section 6.3)
Import data from .csv files into R (Section 6.5)
Check whether your data is correctly imported (Section 6.6)
Import tabular data in other formats (Section 6.7)
Use the {readr} package to import tabular files (Section 6.8)
Import files with single and multiple tabs from spreadsheet software (Section 6.9) - (Section 6.9.4)
Find out how to import data in other file formats (Section 6.10 - Section 6.11)

Now it’s time to start exploring research data in R! In the next chapter, you will learn how to work with in-built R functions to find out more about the DSV data from Dąbrowska (2019) that you imported in this chapter.

6 Impo`R`ting data

Chapter overview

6.1 Accessing data from a published study

6.2 Saving and examining the data

6.3 Using Projects in RStudio

6.4 Working directories

6.5 Importing data from a `.csv` file

6.6 Import errors and issues 🥺

6.7 Importing tabular data in other formats

6.7.1 Tab-separated file

6.7.2 Semi-colon-separated file

6.8 Using {readr} from the {tidyverse} to import tabular files

Delimiter-separated values (DSV) files

Fixed width files

6.9 Importing files from spreadsheet software

6.9.1 LibreOffice Calc

6.9.2 Microsoft Excel

6.9.3 Google Sheets

6.9.4 Importing spreadsheet files with multiple sheets/tabs

6.10 Importing data files from SPSS, SAS and Stata

6.11 Importing other file formats

6.12 Quick-and-dirty (aka bad!) ways to import data in `R`

6.12.1 Hardcoding file paths in `R` scripts 🥴

6.12.2 Importing data using RStudio’s GUI 🫤

Check your progress 🌟

Chapter overview

6.1 Accessing data from a published study

6.2 Saving and examining the data

6.3 Using Projects in RStudio

6.4 Working directories

6.5 Importing data from a .csv file

6.6 Import errors and issues 🥺

6.7 Importing tabular data in other formats

6.7.1 Tab-separated file

6.7.2 Semi-colon-separated file

6.8 Using {readr} from the {tidyverse} to import tabular files

Delimiter-separated values (DSV) files

Fixed width files

6.9 Importing files from spreadsheet software

6.9.1 LibreOffice Calc

6.9.2 Microsoft Excel

6.9.3 Google Sheets

6.9.4 Importing spreadsheet files with multiple sheets/tabs

6.10 Importing data files from SPSS, SAS and Stata

6.11 Importing other file formats

6.12 Quick-and-dirty (aka bad!) ways to import data in R

6.12.1 Hardcoding file paths in R scripts 🥴

6.12.2 Importing data using RStudio’s GUI 🫤

Check your progress 🌟

6.5 Importing data from a `.csv` file

6.12 Quick-and-dirty (aka bad!) ways to import data in `R`

6.12.1 Hardcoding file paths in `R` scripts 🥴