3 Data management
Chapter overview
Even if you are confident that you have no trouble managing your computer files, it is still worth taking a few minutes to read up on the basics of data management. This is especially true if you consider yourself a “digital native” as modern operating systems have made the way that computers deal with files very opaque.
In this chapter, you will learn about:
- File and folder naming conventions
- Absolute and relative computer paths
- Solutions for backing up your files
3.1 Recipes for successful data management: MasterChef meets Master’s Thesis
Data management is hardly a “hot” topic that people like to dwell on. That’s a shame because good file management is absolutely central to be able to conduct research and poor file management has the potential to seriously ‘spice things up’… but not in a good way! 🌶️ Whether you are working on a short course assignment, your Master’s or PhD thesis, or as part of a large research project team: research-related files must be named appropriately and safely stored in meaningful places.
Imagine trying to make a curry in an utterly disorganised kitchen that contains dozens of different spices, scattered across different cabinets and drawers, with vague or misleading labels. For example, you might have three jars labelled “Chilli” and no way of knowing which is mild “Kashmiri Chilli” as opposed to the extra hot “Thai Bird’s Eye Chilli”. The third might not be chilli at all, but actually a jar of paprika that has been entirely mislabelled. Some of these spices have been gathering dust for decades but the labels have no best-before dates so there is no way of knowing which are still fragrant. Cooking in such a kitchen would turn even the simplest cooking task into a tedious, time-consuming, and error-prone chore: If you’re not extremely careful, you could easily end up serving something that is bland or, in the worst of cases, entirely inedible! Similarly, in research, if your files are poorly named or stored haphazardly, it will make your work far less efficient, considerably more error-prone, and ultimately utterly frustrating.
But the good news is: just as a tidy, well-organized kitchen can greatly enhance your cooking experience, good file management can streamline your research process, help you avoid making mistakes, and reduce stress. In the following sections, we will cook up some good practices for file naming, data management, and project organisation. We will start with basic recipes for naming and managing your files. See the ‘Going further’ boxes for tips on learning the ‘gourmet skills’ needed to handle more complex projects. So, let’s don the chef’s hat and learn how to create a user-friendly computer workspace. And remember, as with cooking, practice makes perfect! 🧑🏽🍳
3.2 Naming conventions
File names are labels. They tell us what is inside a file and helps us identify the correct file quickly and reliably. If you had to run to the printing shop to get your thesis printed in time for a tight deadline, which of these sets of files would you rather have to choose from? Which is more likely to lead you to getting the wrong version printed?
Like the labels on your neatly organised spice jars, file and folder names should be clear, concise, and easily readable. Good file and folder names should be both human-friendly and computer-friendly.
By human-friendly we mean that you and any other human being should easily be able to understand what a folder or file contains. Just like you wouldn’t want a label on a spice jar to be a random string of numbers (e.g., 0171
) or only include the best-before date but nothing else (e.g., 31 Jan 2028
), you also wouldn’t want to guess what a file contains based on an ambiguous or unclear name like Chili
. Labels should be informative but succinct (e.g., Thai Bird's Eye Chilli 31 Jan 2028
not Thai Bird's Eye Chilli bought on December 19, 2023 whilst Christmas shopping with mum, note that the best before date is 31 January 2028
)! Unless you and all your colleagues read Thai, do not be tempted to write the part of the file name in Thai as this could also lead to misunderstandings.
Another reason for not including Thai characters in your file name is that it would not be computer-friendly. In general, computers are not good at dealing with names that contain anything else but Latin alphanumeric characters, e.g., the letters A
to Z
and a
to z
with no accents and the numbers 0
to 9
. Hyphens (-
) and underscores (_
) can also be used, but not spaces. The dot (.
) is reserved for the file extension and should ideally not be used elsewhere in the file name.
Hence, whilst Thai Bird's Eye Chilli 31 Jan 2028
is human-friendly, it is not computer-friendly. To make it a computer-friendly label, we need to remove the apostrophe. Whilst spaces are not strictly forbidden, they can cause all kinds of issues and are therefore also best avoided. Space characters can be replaced by hyphens (-
) and underscores (_
) and the two can be combined in a meaningful way. For example, in the label Thai-Birds-Eye-Chilli_31-Jan-2028
, the _
distinguishes between two different pieces of information, whilst the -
helps humans to parse individual words within a piece of information. Using such patterns consistently not only helps humans to read file names efficiently, it also means that computers can easily ‘parse’, i.e., break down such names into meaningful items. This can be very useful to search for files or automatically extract metadata from file names.
It is fine to use both lower-case and upper-case letters in file and folder names. However, some operating systems will treat upper-case and lower-case letters as the same, whilst others will not. This means that you should avoid having file names that are only distinguishable by case.
Finally, it is worth noting that file names cannot be infinitely long! The maximum length of a file name depends on the operating system and the application that you use1 but, as a rule of thumb, if you can display the entire file name in a reasonably sized Finder window (on macOS) or File Explorer window (on Windows), its length is unquestionably both human- and computer-friendly.
Q3.1 In which case is this file name? my_first_file_name.R
Q3.2 Why is this file name problematic? MyDocument final.1a.docx
Q3.3 Which of these file names are both human-friendly and computer-friendly?
It is also important to ensure that file names are easily sortable. If you have a series of files that document a process, consider beginning each file name with a number that correspond to the order of the process, e.g., 01_DataPreparation.R
, 02_DataAnnotation.R
, 03_AnnotationEvaluation.R
. Left-padding the numbers with one or more 0
will mean that the files are sorted numerically, even when files are listed alphabetically (see Figure 3.3 (b)).
It is often a good idea to include the date in file names. However, many date formats are not easily sortable (see Figure 3.4 (a)). Formatting dates using the ‘YYYY-MM-DD’ format as in Figure 3.4 (b) will allow you to easily sort your files in chronological order.
Even though computers have gotten much better at dealing with folder and file names containing spaces and special characters, using anything other than basic Latin alphanumeric characters, -
and _
in file and folder names will - sooner or later - cause you or your colleagues some serious issues. This is especially true when you start coding. Do not delay getting used to using systematic, human- and computer-friendly folder and file names! In the long run, these simple guidelines will make your digital life much smoother and save you much time and unnecessary stress.
3.3 Folders and paths
Now that you know how to name your files and folders sensibly, we can turn to best practices for organising these files and folders. Returning to our kitchen analogy, imagine that, over many years, you collected hundreds of recipes from friends and family. These recipes are jotted down on individual sheets of paper, all of which have been thoughtlessly tossed into a large kitchen drawer called ‘Documents’, which also happens to contain receipts for kitchen appliances still under warranty, takeaway brochures, and various other bits of paper. In such a kitchen, finding Aunt Sophie’s famous caramelised apple cake could take a while! If, however, you had a dedicated kitchen drawer for recipes which contained neatly labelled folders different types of dishes, you would know to look for this cake recipe in the Desserts folder. Within the Desserts folder, you could have sub-folders for different types of desserts (e.g., cakes, ice creams, trifles). This would make finding Aunt Sophie’s recipe an absolute piece of cake!
Thinking about how to structure folders and sub-folders for your projects is about creating a kind of road map that should be readily interpretable by both humans and computers. This is where the concept of ‘paths’ arises. Paths, in simple terms, describe the location of a file or a folder in a computer’s filesystem. There are different types of paths. An absolute path provides a complete path from the computer’s “root folder”. If our house were our root folder, the absolute path to Aunt Sophie’s recipe would be "/Kitchen/Recipes/Desserts/Cakes/Apple-Cake_Aunt-Sophie"
. Hence, just like your home’s postal address, which ideally specifies your home’s absolute location worldwide, an absolute path provides a complete path from a computer’s root folder to the file or folder in question.
By contrast, a relative path represents the location of a file or folder relative to another folder. Hence, if we already have the Dessert folder open in front of us, the relative path to the apple cake recipe would simply be "Cakes/Apple-Cake_Aunt-Sophie"
. However, if we wanted to access a recipe in the Starters folder from the Cakes folder, we would first have to go “back up the path” from the Cakes folder to the Recipes drawer. This is achieved by adding ../
to the front of the relative path, e.g., "../Starters/Soups/Pea-Mint-Soup_Barbara"
.
For example, "/Users/lefoll/Documents/Teaching/RstatsTextbook/ToDo.txt"
is the absolute path from my computer’s root folder to the file containing my to-do list in relation to this textbook project. By contrast, a “relative path” represents the location of a file or folder relative to another folder. Hence, if I am already in the directory "/Users/lefoll/Documents/Teaching/RstatsTextbook/"
, the relative path to my to-do list is only "ToDo.txt"
.
Note that, here, we use the term “folder” as a metaphor for a computer file directory. Most modern operating systems use folder icons that look like the kind of paper file folders that office workers use to have piled up on their desks as a means of visually representing directories in computer file systems.
To complicate things a little, the way file paths are written varies depending on the computer’s operating system. In Unix-based systems like Linux and macOS, paths are written using forward slashes (e.g., "/Users/elen/Documents/Teaching/RstatsTextbook/ToDo.txt"
), whereas on Windows, paths are written using backslashes (e.g., "C:\Users\elen\Documents\Teaching\RstatsTextbook\ToDo.txt"
).
There are many ways to find out where your files are stored on your computer. Let us begin by opening a Finder window (on macOS) or a File Explorer window (on Windows). Navigate to the folder which contains the file for which you want to find the absolute path. Alternatively you could use your computer’s search function to search for the file. Once you have found it:
on Windows: Right-click on the file (in some older Windows versions, you may also need to press the “shift” key). Among the options presented to you, click on the one to copy the file path (e.g., “Copy as path” or similar in the language of your operating system).
on macOS: Right-click on the file and then press the Option/⌥ key on your keyboard. Pressing down this key will change the options you are given after having right-clicked. One of these options should now be “Copy … as Pathname” (or something equivalent in the language of your operating system). Click on this option.
Then, open any text-editing programme (e.g., LibreOffice Writer, Microsoft Word, TextEdit, or NotePad++) and use the shortcut Ctrl/Cmd + V
to paste your file’s path in the empty document. If you are on Windows, your path should have backslashes, whereas if you are on Linux or macOS, your path should have forward slashes.
Q3.4 What is the absolute path to the highlighted file in Figure 3.5?
Q3.5 From the “UzK” folder, what is the relative path to the highlighted file in Figure 3.5?
Q3.6 From the “Rscripts” folder, what is the relative path to the folder “2023_SoSe_CADS” (see Figure 3.5)?
Hint: From the Rscript
folder, you will need to go “back up the path” twice: once to get to the course folder 2024_SoSe_Stats
and a second time to get to the UzK
folder, before you can move to the 2023_SoSe_CADS
folder. Going back up the path is achieved with ../
.
Read the abstract of the following academic article. What was this experimental study about?
Terai, Masato, Junko Yamashita & Kelly E. Pasich. 2021. Effects of Learning Direction in Retrieval Practice on EFL Vocabulary Learning. Studies in Second Language Acquisition 43(5). 1116–1137. https://doi.org/10.1017/S0272263121000346.
a) According to the study, which is the most effective way of learning vocabulary in a foreign language?
The authors of this article have published the data and materials associated with this study on IRIS. You can find them here: https://iris-database.org/search/?s_publicationAPAInlineReference=Terai%20et%20al.%20(2021)
b) In which format are the video files associated with this publication?
c) In which format is the analysis code which they shared on IRIS?
d) The associated materials also include a section entitled “Scores on measures / tests”. Download the file dataset1_ssla_20210313.csv
from this section. Which character is used as the separator in this delimiter-separated values (DSV) file?
3.4 Backing up data: ‘Fire safety’ measures in the digital kitchen 🧯
A basic principle of sound data management consists in keeping a copy of all your files in more than one place. This ensures that, should something go awry, your research is not lost forever but instead can be recovered and restored promptly. There are many ways things could go wrong: laptops can get stolen or permanently damaged (laptops are not terribly keen on hot chocolate as it turns out… 🙈), computer files can be corrupted and become unusable, you or someone else may accidentally delete files, your computer can become infested with a nasty virus, etc.
An effective way to protect your projects is to abide by the 3-2-1 rule (Schweinberger 2022). It’s simple:
- Ensure that you have at least three copies of your data (e.g., one that you work with on your personal computer and two back-up copies).
- Split the backup copies between two different storage media (e.g., a hard-drive stored in your office and online in a secure cloud service).
- Store one of these copies in a secure place off-site (i.e., not where your computer usually is).
One solution is to store your three copies on:
- your personal laptop or computer,
- a backup hard drive stored in a secure location, and
- a secure online repository such as the data management system provided by your institution, e.g., Sciebo, ownCloud, or GitLab.
Choosing an online repository will protect your data if your computer malfunctions or is damaged or stolen, but remember that it can also potentially make your data accessible to others. This is particularly true of commercial back-up solutions such as Microsoft’s OneDrive, Google’s Drive, Apple’s iCloud, or Dropbox, which although convenient and very user-friendly, should not be used to store sensitive data (e.g., data that may be used to identify individuals, contain financial information, health records, location data, or proprietary research data). Always check if your institution has its own, secure cloud option. If not, keeping a second hard-drive copy in a separate, secure location is likely the safest solution.
Whilst the 3-2-1 rule stipulates that you should keep at least three copies of each file, in an optimal scenario, each file should exist only once at each location (e.g., on your laptop, a separate hard-drive, and the server of an online repository). It is quite easy to (often unknowingly) end up with several duplicates of the same file on any one machine but this can cause issues if, for example, you end up updating the wrong version of the file. Avoiding and eliminating file duplicates is therefore an important step towards proficient data management.
3.5 Conclusion
Sound data management - comprising of both good folder and file naming practices and the smart organisation of these folders and files - is the foundation for efficient research workflow. Understanding and applying these basic principles of file management will ensure that everything in your digital ‘kitchen’ has its place, is well labelled, and easy to find. By ensuring that we keep our kitchens clean, tidy, and safe, we can whip out some truly delicious dishes!
Whilst the above caption is true, if it helps, you might want to imagine that someone very judgemental could actually look in your Documents folder at any given time!
This short online module is ideal to learn more about smarter ways to work with files and data:
The University of Queensland Library. 2023. Work with Data and Files. The University of Queensland. https://uq.pressbooks.pub/digital-essentials-data-and-files/. (14 May, 2024).
To go further, here are some great in-depth resources to learn more about data management in linguistics and education research specifically:
Berez-Kroeker, Andrea L., Bradley McDonnell, Eve Koller & Lauren B. Collister. 2022. The Open Handbook of Linguistic Data Management. MIT Press. https://doi.org/10.7551/mitpress/12200.001.0001.
Lewis, Crystal. Data Management in Large-Scale Education Research. https://datamgmtinedresearch.com/. (14 May, 2024).
Both of these are available as Open Educational Resources.
Check your progress 🌟
You have successfully completed 0 out of 6 questions in this chapter.
Are you confident that you can…?
If that’s the case, you are now all set to install R
and RStudio in Chapter 4 and learn how to get started in R
in Chapter 5!
For example, many Windows applications have a maximum file path length of 260 characters (alvinashcraft et al. 2022).↩︎