Appendix D — Evaluation of the Multi-Feature Tagger of English (MFTE)

Last modified: July 21, 2024

For more information on the tagger itself, as well as the evaluation data and methods, see Le Foll (2021) and https://github.com/elenlefoll/MultiFeatureTaggerEnglish.

Using the MFTE

The Perl version of the Multi-Feature Tagger of English (MFTE Perl) is free to use and was released under an open-source licence. If you are interested in using the MFTE for your own project, I recommend using the latest version of the MFTE Python, which is much easier to use, can tag many more features, and has also undergone a thorough evaluation. Note also that all future development of the tool will focus on the MFTE Python. To find out more, see Le Foll & Shakir (2023) and https://github.com/mshakirDr/MFTE.

D.1 Set-up

The following packages must be installed and loaded to process the evaluation data.

Built with R 4.4.1

#renv::restore() # Restore the project's dependencies from the lockfile to ensure that the same package versions are used as in the original thesis.

library(caret) # For computing confusion matrices
library(harrypotter) # Only for colour scheme
library(here) # For path management
library(knitr) # Loaded to display the tables using the kable() function
library(paletteer) # For nice colours
library(readxl) # For the direct import of Excel files
library(tidyverse) # For everything else!

D.2 Data import from evaluation files

The data is imported directly from the Excel files in which the manual tag checks and corrections were performed. A number of data-wrangling steps are then needed to convert the data to a tidy format.

Code
# Function to import and wrangle the evaluation data from the Excel files in which the manual evaluation was conducted
importEval3 <- function(file, fileID, register, corpus) {
  Tag1 <- file |> 
  add_column(FileID = fileID, Register = register, Corpus = corpus) |>
  select(FileID, Corpus, Register, Output, Tokens, Tag1, Tag1Gold) |> 
  rename(Tag = Tag1, TagGold = Tag1Gold, Token = Tokens) |> 
  mutate(Evaluation = ifelse(is.na(TagGold), TRUE, FALSE)) |> 
  mutate(TagGold = ifelse(is.na(TagGold), as.character(Tag), as.character(TagGold))) |>
  filter(!is.na(Tag)) |> 
  mutate_if(is.character, as.factor)
  
  Tag2 <- file |> 
  add_column(FileID = fileID, Register = register, Corpus = corpus) |>
  select(FileID, Corpus, Register, Output, Tokens, Tag2, Tag2Gold) |> 
  rename(Tag = Tag2, TagGold = Tag2Gold, Token = Tokens) |> 
  mutate(Evaluation = ifelse(is.na(TagGold), TRUE, FALSE)) |> 
  mutate(TagGold = ifelse(is.na(TagGold), as.character(Tag), as.character(TagGold))) |>
  filter(!is.na(Tag)) |> 
  mutate_if(is.character, as.factor)

  Tag3 <- file |> 
  add_column(FileID = fileID, Register = register, Corpus = corpus) |>
  select(FileID, Corpus, Register, Output, Tokens, Tag3, Tag3Gold) |> 
  rename(Tag = Tag3, TagGold = Tag3Gold, Token = Tokens) |> 
  mutate(Evaluation = ifelse(is.na(TagGold), TRUE, FALSE)) |> 
  mutate(TagGold = ifelse(is.na(TagGold), as.character(Tag), as.character(TagGold))) |>
  filter(!is.na(Tag)) |> 
  mutate_if(is.character, as.factor)

  output <- rbind(Tag1, Tag2, Tag3) |> 
  mutate(across(where(is.factor), str_remove_all, pattern = fixed(" "))) |> # Removes all white spaces which are found in the excel files
  filter(!is.na(Output)) |> 
  mutate_if(is.character, as.factor)
}

# Second function to import and wrangle the evaluation data for Excel files with four tag columns as opposed to three
importEval4 <- function(file, fileID, register, corpus) {
  Tag1 <- file |> 
  add_column(FileID = fileID, Register = register, Corpus = corpus) |>
  select(FileID, Corpus, Register, Output, Tokens, Tag1, Tag1Gold) |> 
  rename(Tag = Tag1, TagGold = Tag1Gold, Token = Tokens) |> 
  mutate(Evaluation = ifelse(is.na(TagGold), TRUE, FALSE)) |> 
  mutate(TagGold = ifelse(is.na(TagGold), as.character(Tag), as.character(TagGold))) |>
  filter(!is.na(Tag)) |> 
  mutate_if(is.character, as.factor)
  
  Tag2 <- file |> 
  add_column(FileID = fileID, Register = register, Corpus = corpus) |>
  select(FileID, Corpus, Register, Output, Tokens, Tag2, Tag2Gold) |> 
  rename(Tag = Tag2, TagGold = Tag2Gold, Token = Tokens) |> 
  mutate(Evaluation = ifelse(is.na(TagGold), TRUE, FALSE)) |> 
  mutate(TagGold = ifelse(is.na(TagGold), as.character(Tag), as.character(TagGold))) |>
  filter(!is.na(Tag)) |> 
  mutate_if(is.character, as.factor)

  Tag3 <- file |> 
  add_column(FileID = fileID, Register = register, Corpus = corpus) |>
  select(FileID, Corpus, Register, Output, Tokens, Tag3, Tag3Gold) |> 
  rename(Tag = Tag3, TagGold = Tag3Gold, Token = Tokens) |> 
  mutate(Evaluation = ifelse(is.na(TagGold), TRUE, FALSE)) |> 
  mutate(TagGold = ifelse(is.na(TagGold), as.character(Tag), as.character(TagGold))) |>
  filter(!is.na(Tag)) |> 
  mutate_if(is.character, as.factor)

  Tag4 <- file |> 
  add_column(FileID = fileID, Register = register, Corpus = corpus) |>
  select(FileID, Corpus, Register, Output, Tokens, Tag4, Tag4Gold) |> 
  rename(Tag = Tag4, TagGold = Tag4Gold, Token = Tokens) |> 
  mutate(Evaluation = ifelse(is.na(TagGold), TRUE, FALSE)) |> 
  mutate(TagGold = ifelse(is.na(TagGold), as.character(Tag), as.character(TagGold))) |>
  filter(!is.na(Tag)) |> 
  mutate_if(is.character, as.factor)

  output <- rbind(Tag1, Tag2, Tag3, Tag4) |> 
  mutate(across(where(is.factor), str_remove_all, pattern = fixed(" "))) |> # Removes all white spaces which are found in the excel files
  filter(!is.na(Tag)) |> 
  mutate_if(is.character, as.factor)

}

# Function to decide which of the two above functions should be used
importEval <- function(file, fileID, register, corpus) { 
  if (sum(!is.na(file$Tag4)) > 0) {
    importEval4(file = file, fileID = fileID, register = register, corpus = corpus)
  } else {
    importEval3(file = file, fileID = fileID, register = register, corpus = corpus)
  }
}

Solutions_Intermediate_Spoken_0032 <- importEval(file = read_excel(here("data", "MFTE", "evaluation", "Solutions_Intermediate_Spoken_0032_Evaluation.xlsx")), fileID = "Solutions_Intermediate_Spoken_0032", register = "Conversation", corpus = "TEC-Sp")

HT_5_Poetry_0001 <- importEval(file = read_excel(here("data", "MFTE", "evaluation", "HT_5_Poetry_0001_Evaluation.xlsx")), fileID = "HT_5_Poetry_0001", register = "Poetry", corpus = "TEC-Fr")

Achievers_A1_Informative_0006 <- importEval(file = read_excel(here("data", "MFTE", "evaluation", "Achievers_A1_Informative_0006_Evaluation.xlsx")), fileID = "Achievers_A1_Informative_0006", register = "Informative", corpus = "TEC-Sp")

New_GreenLine_5_Personal_0003 <- importEval(file = read_excel(here("data", "MFTE", "evaluation", "New_GreenLine_5_Personal_0003_Evaluation.xlsx")), fileID = "New_GreenLine_5_Personal_0003", register = "Personal communication", corpus = "TEC-Ger")

Piece_of_cake_3e_Instructional_0006 <- importEval(file = read_excel(here("data", "MFTE", "evaluation", "Piece_of_cake_3e_Instructional_0006_Evaluation.xlsx")), fileID = "Piece_of_cake_3e_Instructional_0006", register = "Instructional", corpus = "TEC-Fr")

Access_4_Narrative_0006 <- importEval(file = read_excel(here("data", "MFTE", "evaluation", "Access_4_Narrative_0006_Evaluation.xlsx")), fileID = "Access_4_Narrative_0006", register = "Fiction", corpus = "TEC-Ger")

BNCBFict_b2 <- importEval(file = read_excel(here("data", "MFTE", "evaluation", "BNCBFict_b2.xlsx")), fileID = "BNCBFict_b2", register = "fiction", corpus = "BNC2014")

BNCBFict_m54 <- importEval(file = read_excel(here("data", "MFTE", "evaluation", "BNCBFict_m54.xlsx")), fileID = "BNCBFict_m54", register = "fiction", corpus = "BNC2014")

BNCBFict_e27 <- importEval(file = read_excel(here("data", "MFTE", "evaluation", "BNCBFict_e27.xlsx")), fileID = "BNCBFict_e27", register = "fiction", corpus = "BNC2014")

BNCBMass16 <- importEval(file = read_excel(here("data", "MFTE", "evaluation", "BNCBMass16.xlsx")), fileID = "BNCBMass16", register = "news", corpus = "BNC2014")

BNCBMass23 <- importEval(file = read_excel(here("data", "MFTE", "evaluation", "BNCBMass23.xlsx")), fileID = "BNCBMass23", register = "news", corpus = "BNC2014")

BNCBReg111 <- importEval(file = read_excel(here("data", "MFTE", "evaluation", "BNCBReg111.xlsx")), fileID = "BNCBReg111", register = "news", corpus = "BNC2014")

BNCBReg750 <- importEval(file = read_excel(here("data", "MFTE", "evaluation", "BNCBReg750.xlsx")), fileID = "BNCBReg750", register = "news", corpus = "BNC2014")

BNCBSer486 <- importEval(file = read_excel(here("data", "MFTE", "evaluation", "BNCBSer486.xlsx")), fileID = "BNCBSer486", register = "news", corpus = "BNC2014")

BNCBSer562 <- importEval(file = read_excel(here("data", "MFTE", "evaluation", "BNCBSer562.xlsx")), fileID = "BNCBSer562", register = "news", corpus = "BNC2014")

BNCBEBl8 <- importEval(file = read_excel(here("data", "MFTE", "evaluation", "BNCBEBl8.xlsx")), fileID = "BNCBEBl8", register = "internet", corpus = "BNC2014")

BNCBEFor32 <- importEval(file = read_excel(here("data", "MFTE", "evaluation", "BNCBEFor32.xlsx")), fileID = "BNCBEFor32", register = "internet", corpus = "BNC2014")

S2DD <- importEval(file = read_excel(here("data", "MFTE", "evaluation", "S2DD.xlsx")), fileID = "S2DD", register = "spoken", corpus = "BNC2014")

S3AV <- importEval(file = read_excel(here("data", "MFTE", "evaluation", "S3AV.xlsx")), fileID = "S3AV", register = "spoken", corpus = "BNC2014")

SEL5 <- importEval(file = read_excel(here("data", "MFTE", "evaluation", "SEL5.xlsx")), fileID = "SEL5", register = "spoken", corpus = "BNC2014")

SVLK <- importEval(file = read_excel(here("data", "MFTE", "evaluation", "SVLK.xlsx")), fileID = "SVLK", register = "spoken", corpus = "BNC2014")

SZXQ <- importEval(file = read_excel(here("data", "MFTE", "evaluation", "SZXQ.xlsx")), fileID = "SZXQ", register = "spoken", corpus = "BNC2014")

TaggerEval <- rbind(Solutions_Intermediate_Spoken_0032, HT_5_Poetry_0001, Achievers_A1_Informative_0006, New_GreenLine_5_Personal_0003, Piece_of_cake_3e_Instructional_0006, Access_4_Narrative_0006, BNCBEBl8, BNCBFict_b2, BNCBFict_m54, BNCBFict_e27, BNCBEFor32, BNCBMass16, BNCBMass23, BNCBReg111, BNCBReg750, BNCBSer486, BNCBSer562, S2DD, S3AV, SEL5, SVLK, SZXQ)

Some tags had to be merged to account for changes made to the MFTE between the evaluation and the tagging of the corpora included in the present study.

Code
TaggerEval <- TaggerEval |> 
  mutate(Tag = ifelse(Tag == "PHC", "CC", as.character(Tag))) |> 
  mutate(TagGold = ifelse(TagGold == "PHC", "CC", as.character(TagGold))) |> 
  mutate(Tag = ifelse(Tag == "QLIKE", "LIKE", as.character(Tag))) |> 
  mutate(TagGold = ifelse(TagGold == "QLIKE", "LIKE", as.character(TagGold))) |> 
  mutate(Tag = ifelse(Tag == "TO", "IN", as.character(Tag))) |> 
  mutate(TagGold = ifelse(TagGold == "TO", "IN", as.character(TagGold))) |> 
  mutate_if(is.character, as.factor) |> 
  mutate(Evaluation = ifelse(as.character(Tag) == as.character(TagGold), TRUE, FALSE))

# head(TaggerEval) # Check sanity of data
# summary(TaggerEval) # Check sanity of data

# saveRDS(TaggerEval, here("data", "processed", "MFTE_Evaluation_Results.rds"))

# write.csv(TaggerEval, here("data", "processed", "MFTE_Evaluation_Results.csv"))

This table provides a summary of the complete evaluation dataset. It comprises 25,233 tags that were checked (and, if needs be, corrected) by at least one human annotator. This number includes tags for punctuation marks, which make up a considerable proportion of the tags.
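
The summary shown below is not produced by an echoed chunk; a minimal sketch of how it can be reproduced from the TaggerEval data frame assembled above:

Code
nrow(TaggerEval) # Total number of manually checked tags
summary(TaggerEval) # Summary of files, corpora, registers, tokens, tags and evaluation outcomes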

          FileID          Corpus               Register        Output     
 BNCBFict_b2 : 2621   TEC-Sp : 1042   fiction      :6500   ._.    : 1156  
 BNCBFict_e27: 2104   TEC-Fr : 2058   news         :6312   the_DT :  820  
 BNCBFict_m54: 1775   TEC-Ger: 1415   spoken       :6047   ,_,    :  720  
 BNCBMass16  : 1619   BNC2014:20718   internet     :1859   a_DT   :  466  
 SEL5        : 1463                   Instructional:1048   of_IN  :  328  
 BNCBEFor32  : 1305                   Poetry       :1010   (Other):21742  
 (Other)     :14346                   (Other)      :2457   NA's   :    1  
     Token            Tag           TagGold      Evaluation     
 .      : 1156   NN     : 4415   NN     : 4328   Mode :logical  
 the    :  820   IN     : 2145   IN     : 2113   FALSE:832      
 ,      :  720   DT     : 1454   DT     : 1457   TRUE :24401    
 to     :  495   .      : 1367   .      : 1367                  
 's     :  493   VPRT   : 1044   VPRT   : 1054                  
 (Other):21547   VBD    :  899   VBD    :  895                  
 NA's   :    2   (Other):13909   (Other):14019                  

D.3 Estimating MFTE accuracy for Textbook English

In total, 4,515 tags from the TEC were manually checked. The following chunk prepares the data for calculating the recall and precision rates of each feature, ignoring all punctuation and symbols.

Code
data <- TaggerEval |> 
  filter(Corpus %in% c("TEC-Fr", "TEC-Ger", "TEC-Sp")) |> 
  filter(TagGold != "UNCLEAR") |> 
  filter(Tag %in% c(str_extract(Tag, "[A-Z0-9]+"))) |> # Remove punctuation tags which are uninteresting here.
  filter(Tag != "SYM" & Tag != "``") |> 
  droplevels() |> 
  mutate(Tag = factor(Tag, levels = union(levels(Tag), levels(TagGold)))) |> # Ensure that the factor levels are the same for the next caret operation
  mutate(TagGold = factor(TagGold, levels = union(levels(Tag), levels(TagGold))))

# Spot gold tag corrections that are not actually errors (should return zero rows if all is well)
# data[data$Tag==data$TagGold & data$Evaluation == FALSE,] |> as.data.frame()
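
The figures reported in the remainder of this section are not generated by the chunk above; a minimal sketch of how they can be obtained with caret (assuming the data frame prepared in the previous chunk):

Code
summary(data$Evaluation) # Breakdown of accurate (TRUE) vs. inaccurate (FALSE) tags
cm <- confusionMatrix(data = data$Tag, reference = data$TagGold) # Confusion matrix of predicted vs. gold tags
cm$overall # Overall accuracy metrics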

The breakdown of inaccurate vs. accurate tags in this TEC evaluation sample is:

   Mode   FALSE    TRUE 
logical     114    3831 

Note that the following accuracy metrics, calculated using caret::confusionMatrix(), are not very representative because they include tags that were not entered in the study, e.g., LS and FW.

      Accuracy          Kappa  AccuracyLower  AccuracyUpper   AccuracyNull 
          0.97           0.97           0.97           0.98           0.20 
AccuracyPValue  McnemarPValue 
          0.00            NaN 

Accuracy metrics per feature are more interesting and relevant.
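
These per-feature figures correspond to the precision, recall and F1 columns of caret's per-class statistics; a minimal sketch (assuming the cm object from the sketch above):

Code
cm$byClass[, c("Precision", "Recall", "F1")] |> 
  kable(digits = 2)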

Precision Recall F1
Class: ABLE 1.00 1.00 1.00
Class: ACT 0.97 0.98 0.98
Class: AMP 1.00 1.00 1.00
Class: ASPECT 1.00 1.00 1.00
Class: BEMA 1.00 1.00 1.00
Class: CAUSE 1.00 1.00 1.00
Class: CC 1.00 0.99 1.00
Class: CD 0.95 0.95 0.95
Class: COMM 1.00 0.98 0.99
Class: COND 1.00 1.00 1.00
Class: CONT 1.00 1.00 1.00
Class: CUZ 1.00 1.00 1.00
Class: DEMO 0.97 0.97 0.97
Class: DMA 1.00 1.00 1.00
Class: DOAUX 0.86 1.00 0.92
Class: DT 1.00 1.00 1.00
Class: DWNT 0.67 1.00 0.80
Class: ELAB 1.00 1.00 1.00
Class: EMPH 0.83 1.00 0.91
Class: EX 1.00 1.00 1.00
Class: EXIST 1.00 1.00 1.00
Class: FPP1P 1.00 1.00 1.00
Class: FPP1S 1.00 1.00 1.00
Class: FPUH 1.00 1.00 1.00
Class: FREQ 1.00 1.00 1.00
Class: FW 0.10 1.00 0.18
Class: GTO 1.00 1.00 1.00
Class: HDG 1.00 1.00 1.00
Class: HGOT 1.00 1.00 1.00
Class: IN 1.00 1.00 1.00
Class: JJ 0.96 0.98 0.97
Class: JPRED 0.97 0.90 0.94
Class: LIKE 0.83 1.00 0.91
Class: MDCA 1.00 1.00 1.00
Class: MDCO 1.00 1.00 1.00
Class: MDMM 1.00 0.67 0.80
Class: MDNE 1.00 0.80 0.89
Class: MDWO 1.00 1.00 1.00
Class: MDWS 1.00 1.00 1.00
Class: MENTAL 0.99 0.99 0.99
Class: NCOMP 0.88 1.00 0.94
Class: NN 0.95 0.99 0.97
Class: NULL 1.00 0.08 0.14
Class: OCCUR 0.94 1.00 0.97
Class: PASS 0.89 0.89 0.89
Class: PEAS 1.00 0.87 0.93
Class: PGET 1.00 1.00 1.00
Class: PIT 1.00 1.00 1.00
Class: PLACE 1.00 0.83 0.91
Class: POLITE 1.00 1.00 1.00
Class: POS 1.00 1.00 1.00
Class: PROG 1.00 0.89 0.94
Class: QUAN 0.96 0.98 0.97
Class: QUPR 1.00 1.00 1.00
Class: RB 1.00 0.99 0.99
Class: RP 1.00 1.00 1.00
Class: SO 1.00 0.64 0.78
Class: SPLIT 1.00 1.00 1.00
Class: SPP2 1.00 1.00 1.00
Class: STPR 0.60 1.00 0.75
Class: THATD 0.86 1.00 0.92
Class: THRC 1.00 0.71 0.83
Class: THSC 0.69 1.00 0.82
Class: TIME 1.00 0.97 0.98
Class: TPP3P 1.00 1.00 1.00
Class: TPP3S 1.00 1.00 1.00
Class: VB 0.94 0.94 0.94
Class: VBD 0.97 0.99 0.98
Class: VBG 0.96 1.00 0.98
Class: VBN 0.85 0.92 0.88
Class: VIMP 0.99 0.88 0.93
Class: VPRT 0.98 0.98 0.98
Class: WHQU 0.97 1.00 0.98
Class: WHSC 1.00 0.97 0.99
Class: XX0 1.00 1.00 1.00
Class: YNQU 1.00 1.00 1.00
Class: OCR NA 0.00 NA

D.4 MFTE accuracy for reference corpora (or comparable corpora)

D.4.1 Conversation

These are extracts from the Spoken BNC2014 (as entered in the study). The evaluation data for this sample excludes 7 tokens deemed unclear by at least one human annotator.

Code
data <- TaggerEval |> 
  filter(Register == "spoken") |> 
  filter(TagGold != "UNCLEAR") |> 
  filter(Tag %in% c(str_extract(Tag, "[A-Z0-9]+"))) |> # Remove all punctuation tags which are uninteresting here.
  droplevels() |> 
  mutate(Tag = factor(Tag, levels = union(levels(Tag), levels(TagGold)))) |> # Ensure that the factor levels are the same for the next caret operation
  mutate(TagGold = factor(TagGold, levels = union(levels(Tag), levels(TagGold))))

# Spot gold tag corrections that are not actually errors (should return zero rows if all is well)
# data[data$Tag==data$TagGold & data$Evaluation == FALSE,] |> as.data.frame()

The breakdown of inaccurate vs. accurate tags in this evaluation sample is:

   Mode   FALSE    TRUE 
logical     224    5388 

Note that the following accuracy metrics, calculated using caret::confusionMatrix(), are not very representative because they include tags that were not entered in the study, e.g., LS and FW.

      Accuracy          Kappa  AccuracyLower  AccuracyUpper   AccuracyNull 
          0.96           0.96           0.95           0.97           0.12 
AccuracyPValue  McnemarPValue 
          0.00            NaN 

D.4.2 Fiction

The evaluation data for this sample excludes 0 tokens deemed unclear by at least one human annotator.

data <- TaggerEval |> 
  filter(Register == "fiction") |> 
  filter(TagGold != "UNCLEAR") |> 
  filter(Tag %in% c(str_extract(Tag, "[A-Z0-9]+"))) |> # Remove all punctuation tags which are uninteresting here.
  filter(Tag != "SYM" & Tag != "``") |> 
  droplevels() |> 
  mutate(Tag = factor(Tag, levels = union(levels(Tag), levels(TagGold)))) |> # Ensure that the factor levels are the same for the next caret operation
  mutate(TagGold = factor(TagGold, levels = union(levels(Tag), levels(TagGold))))

# Spot gold tag corrections that are not actually errors (should return zero rows if all is well)
# data[data$Tag==data$TagGold & data$Evaluation == FALSE,] |> as.data.frame()

The breakdown of inaccurate vs. accurate tags in this evaluation sample is:

   Mode   FALSE    TRUE 
logical     168    5346 

Note that the following accuracy metrics, calculated using caret::confusionMatrix(), are not very representative because they include tags that were not entered in the study, e.g., LS and FW.

      Accuracy          Kappa  AccuracyLower  AccuracyUpper   AccuracyNull 
          0.97           0.97           0.96           0.97           0.19 
AccuracyPValue  McnemarPValue 
          0.00            NaN 

D.4.3 Informative

The evaluation data for this sample excludes 8 tokens deemed unclear by at least one human annotator.

data <- TaggerEval |> 
  filter(Register == "news" | FileID %in% c("BNCBEFor32", "BNCBEBl8")) |> 
  filter(TagGold != "UNCLEAR") |> 
  filter(Tag %in% c(str_extract(Tag, "[A-Z0-9]+"))) |> # Remove all punctuation tags which are uninteresting here.
  filter(Tag != "SYM" & Tag != "``") |> 
  droplevels() |> 
  mutate(Tag = factor(Tag, levels = union(levels(Tag), levels(TagGold)))) |> # Ensure that the factor levels are the same for the next caret operation
  mutate(TagGold = factor(TagGold, levels = union(levels(Tag), levels(TagGold))))

# Spot gold tag corrections that are not actually errors (should return zero rows if all is well)
# data[data$Tag==data$TagGold & data$Evaluation == FALSE,] |> as.data.frame()

The breakdown of inaccurate vs. accurate tags in this evaluation sample is:

   Mode   FALSE    TRUE 
logical     309    7113 

Note that the following accuracy metrics, calculated using caret::confusionMatrix(), are not very representative because they include tags that were not entered in the study, e.g., LS and FW.

      Accuracy          Kappa  AccuracyLower  AccuracyUpper   AccuracyNull 
          0.96           0.95           0.95           0.96           0.24 
AccuracyPValue  McnemarPValue 
          0.00            NaN 

D.5 Estimating the overall MFTE accuracy for corpora used in the study

Code
data <- TaggerEval |> 
  filter(TagGold != "UNCLEAR") |> 
  filter(Tag %in% c(str_extract(Tag, "[A-Z0-9]+"))) |> # Remove all punctuation tags which are uninteresting here.
  filter(Tag != "SYM" & Tag != "``") |> 
  filter(TagGold != "SYM" & TagGold != "``") |> 
  droplevels() |> 
  mutate(Tag = factor(Tag, levels = union(levels(Tag), levels(TagGold)))) |> # Ensure that the factor levels are the same for the next caret operation
  mutate(TagGold = factor(TagGold, levels = union(levels(Tag), levels(TagGold))))

# Compute the confusion matrix of predicted vs. manually corrected (gold) tags
cm <- confusionMatrix(data = data$Tag, reference = data$TagGold)

# Generate a better formatted results table for export: recall, precision and f1
confusion_matrix <- cm$table
total <- sum(confusion_matrix)
number_of_classes <- nrow(confusion_matrix)
correct <- diag(confusion_matrix)
# sum all columns
total_actual_class <- apply(confusion_matrix, 2, sum)
# sum all rows
total_pred_class <- apply(confusion_matrix, 1, sum)
# Precision = TP / all that were predicted as positive
precision <- correct / total_pred_class
# Recall = TP / all that were actually positive
recall <- correct / total_actual_class
# F1
f1 <- (2 * precision * recall) / (precision + recall)
# create data frame to output results
results <- data.frame(precision, recall, f1, total_actual_class)

results |> 
  kable(digits = 2)
precision recall f1 total_actual_class
ACT 0.92 0.99 0.95 177
AMP 1.00 0.94 0.97 16
ASPECT 1.00 1.00 1.00 23
BEMA 0.99 0.99 0.99 111
CAUSE 1.00 1.00 1.00 18
CC 1.00 0.99 0.99 254
CD 0.99 0.98 0.98 134
COMM 1.00 1.00 1.00 88
CONC 0.90 0.82 0.86 11
COND 1.00 1.00 1.00 17
CONT 0.96 1.00 0.98 54
CUZ 1.00 0.90 0.95 10
DEMO 1.00 0.96 0.98 51
DMA 0.50 0.40 0.44 5
DOAUX 0.92 0.92 0.92 25
DT 1.00 1.00 1.00 490
DWNT 1.00 1.00 1.00 5
ELAB 1.00 1.00 1.00 3
EMPH 0.98 0.95 0.96 43
EX 1.00 1.00 1.00 15
EXIST 0.96 1.00 0.98 27
FPP1P 1.00 1.00 1.00 49
FPP1S 1.00 1.00 1.00 59
FPUH 1.00 0.67 0.80 3
FREQ 1.00 1.00 1.00 15
FW 0.29 0.40 0.33 5
GTO 1.00 1.00 1.00 4
HDG 1.00 1.00 1.00 5
IN 0.99 1.00 0.99 836
JJAT 0.94 0.87 0.90 360
JJPR 0.92 0.74 0.82 108
LIKE 1.00 1.00 1.00 9
MDCA 1.00 1.00 1.00 12
MDCO 1.00 1.00 1.00 12
MDMM 1.00 1.00 1.00 1
MDNE 1.00 0.95 0.98 22
MDWO 1.00 1.00 1.00 20
MDWS 1.00 1.00 1.00 31
MENTAL 0.98 1.00 0.99 106
NCOMP 0.92 0.99 0.96 171
NN 0.96 0.98 0.97 1805
OCCUR 1.00 1.00 1.00 11
PASS 0.92 0.92 0.92 79
PEAS 1.00 0.91 0.96 70
PGET 1.00 0.67 0.80 6
PIT 1.00 0.96 0.98 78
PLACE 0.86 1.00 0.93 19
POLITE 1.00 1.00 1.00 7
POS 0.98 0.96 0.97 46
PROG 0.92 0.88 0.90 40
PRP 0.00 0.00 NaN 1
QUAN 0.96 1.00 0.98 80
QUPR 1.00 1.00 1.00 21
RB 0.96 0.95 0.96 137
RP 1.00 0.82 0.90 44
SO 1.00 0.89 0.94 9
SPLIT 1.00 1.00 1.00 40
SPP2 1.00 1.00 1.00 53
STPR 0.50 1.00 0.67 2
THATD 0.85 1.00 0.92 11
THRC 1.00 0.50 0.67 8
THSC 0.85 1.00 0.92 34
TIME 0.95 0.98 0.96 40
TPP3P 1.00 1.00 1.00 61
TPP3S 1.00 1.00 1.00 108
URL 1.00 1.00 1.00 1
USEDTO 0.00 NaN NaN 0
VB 0.90 0.93 0.91 258
VBD 0.96 0.97 0.97 215
VBG 0.91 0.91 0.91 111
VBN 0.42 1.00 0.59 22
VIMP 0.71 0.34 0.47 29
VPRT 0.95 0.95 0.95 351
WHQU 1.00 0.44 0.62 9
WHSC 0.95 1.00 0.97 95
XX0 1.00 0.97 0.99 76
YNQU 0.00 NaN NaN 0
`` NaN 0.00 NaN 1
NULL NaN 0.00 NaN 38
SYM NaN 0.00 NaN 1
Code
resultslong <- results |> 
  drop_na() %>%
  mutate(tag = row.names(.)) |> 
  filter(tag != "NULL" & tag != "SYM" & tag != "OCR" & tag != "FW" & tag != "USEDTO") |> 
  rename(n = total_actual_class) |> 
  pivot_longer(cols = c("precision", "recall", "f1"), names_to = "metric", values_to = "value") |> 
  mutate(metric = factor(metric, levels = c("precision", "recall", "f1")))

# summary(resultslong$n)

ggplot(resultslong, aes(y = reorder(tag, desc(tag)), x = value, group = metric, colour = n)) +
  geom_point(size = 2) +
  ylab("") +
  xlab("") +
  facet_wrap(~ metric) +
  scale_color_paletteer_c("harrypotter::harrypotter", trans = "log", breaks = c(1,10, 100, 1000), labels = c(1,10, 100, 1000), name = "# tokens \nmanually\nevaluated") +
  theme_bw() +
  theme(panel.grid.major.y = element_line(colour = "darkgrey")) +
  theme(legend.position = "right")

Code
#ggsave(here("plots", "TaggerAccuracyPlot.svg"), width = 7, height = 12)

D.6 Exploring tagger errors

To inspect regular/systematic tagger errors, we add an error label consisting of the incorrectly assigned tag followed by an arrow and the correct "gold" tag.

Code
errors <- TaggerEval |> 
  filter(Evaluation=="FALSE") |> 
  filter(TagGold != "UNCLEAR") |> 
  mutate(Error = paste(Tag, TagGold, sep = " -> "))

FreqErrors <- errors |> 
  #filter(Corpus %in% c("TEC-Fr", "TEC-Ger", "TEC-Sp")) |> 
  count(Error) |> 
  arrange(desc(n))

# Number of error types that only occur once
once <- FreqErrors |> 
  filter(n == 1) |> 
  nrow()
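
A minimal sketch of how the error counts reported below can be derived from the FreqErrors table created above:

Code
sum(FreqErrors$n) # Total number of tagging errors
nrow(FreqErrors) # Number of different error types
once # Number of error types that occur only once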

The total number of errors is 817, spread across 198 different error types. Of these types, 94 occur just once. The error types that occur more than ten times are:

Code
FreqErrors |> 
  filter(n > 10) |> 
  kable(digits = 2)
Error n
NCOMP -> NULL 37
NN -> JJAT 35
JJAT -> NN 27
NN -> VB 27
IN -> RP 25
NN -> VPRT 24
VB -> NN 22
THSC -> DEMO 19
VB -> VIMP 19
NN -> OCR 16
VBN -> JJAT 16
ACT -> NULL 15
THATD -> NULL 15
CD -> NN 12
MENTAL -> NULL 12
NN -> VBG 11
NN -> VIMP 11
THSC -> THRC 11
VBG -> PROG 11
VBN -> JJPR 11

The code in the following chunk can be used to take a closer look at specific types of frequent errors.

errors |> 
  filter(Error == "NN -> JJAT") |> 
  select(-Output, -Corpus, -Tag, -TagGold) |> 
  filter(grepl(x = Token, pattern = "[A-Z]+.")) |> 
  kable(digits = 2)
FileID Register Token Evaluation Error
BNCBEFor32 internet Intermediate FALSE NN -> JJAT
BNCBMass16 news FINAL FALSE NN -> JJAT
BNCBMass16 news Big FALSE NN -> JJAT
BNCBReg111 news Scottish FALSE NN -> JJAT
BNCBReg111 news Scottish FALSE NN -> JJAT
BNCBReg111 news Mental FALSE NN -> JJAT
BNCBReg111 news Scottish FALSE NN -> JJAT
BNCBReg111 news Central FALSE NN -> JJAT
BNCBReg750 news English FALSE NN -> JJAT
BNCBReg750 news Natural FALSE NN -> JJAT
BNCBReg750 news European FALSE NN -> JJAT
BNCBReg750 news Christian FALSE NN -> JJAT
BNCBReg750 news Social FALSE NN -> JJAT
BNCBReg750 news Common FALSE NN -> JJAT
BNCBSer486 news Northern FALSE NN -> JJAT
BNCBSer486 news Northern FALSE NN -> JJAT
BNCBSer486 news Northern FALSE NN -> JJAT
BNCBSer562 news United FALSE NN -> JJAT
BNCBSer562 news White FALSE NN -> JJAT
BNCBSer562 news Untold FALSE NN -> JJAT
BNCBSer562 news New FALSE NN -> JJAT
SEL5 spoken Black FALSE NN -> JJAT
errors |> 
  filter(Error %in% c("NN -> VB", "VB -> NN", "NN -> VPRT", "VPRT -> NN")) |> 
  count(Token) |> 
  arrange(desc(n)) |> 
  filter(n > 1) |> 
  kable(digits = 2) 
Token n
mince 5
build 4
win 4
hunt 3
wags 3
throw 2
look 2
swamp 2
stop 2
defeats 2
errors |> 
  filter(Error == "ACT -> NULL") |> 
  count(Token) |> 
  arrange(desc(n)) |> 
  kable(digits = 2) 
Token n
win 3
throw 2
lost 2
left 1
waiting 1
working 1
running 1
done 1
fixed 1
Play 1
reached 1

For more information on the MFTE evaluation, see Le Foll (2021) and https://github.com/elenlefoll/MultiFeatureTaggerEnglish.

D.7 Packages used in this script

D.7.1 Package names and versions

R version 4.4.1 (2024-06-14)
Platform: aarch64-apple-darwin20
Running under: macOS Sonoma 14.5

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Europe/Madrid
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
 [1] knitcitations_1.0.12 lubridate_1.9.3      forcats_1.0.0       
 [4] stringr_1.5.1        dplyr_1.1.4          purrr_1.0.2         
 [7] readr_2.1.5          tidyr_1.3.1          tibble_3.2.1        
[10] tidyverse_2.0.0      readxl_1.4.3         paletteer_1.6.0     
[13] knitr_1.48           here_1.0.1           harrypotter_2.1.1   
[16] caret_6.0-94         lattice_0.22-6       ggplot2_3.5.1       

loaded via a namespace (and not attached):
 [1] tidyselect_1.2.1     timeDate_4032.109    fastmap_1.2.0       
 [4] pROC_1.18.5          digest_0.6.36        rpart_4.1.23        
 [7] timechange_0.3.0     lifecycle_1.0.4      survival_3.6-4      
[10] magrittr_2.0.3       compiler_4.4.1       rlang_1.1.4         
[13] tools_4.4.1          utf8_1.2.4           yaml_2.3.9          
[16] data.table_1.15.4    xml2_1.3.6           plyr_1.8.9          
[19] withr_3.0.0          nnet_7.3-19          grid_4.4.1          
[22] stats4_4.4.1         fansi_1.0.6          colorspace_2.1-0    
[25] future_1.33.2        globals_0.16.3       scales_1.3.0        
[28] iterators_1.0.14     MASS_7.3-60.2        cli_3.6.3           
[31] rmarkdown_2.27       generics_0.1.3       rstudioapi_0.16.0   
[34] future.apply_1.11.2  httr_1.4.7           tzdb_0.4.0          
[37] reshape2_1.4.4       splines_4.4.1        parallel_4.4.1      
[40] BiocManager_1.30.23  cellranger_1.1.0     vctrs_0.6.5         
[43] hardhat_1.4.0        Matrix_1.7-0         jsonlite_1.8.8      
[46] hms_1.1.3            listenv_0.9.1        foreach_1.5.2       
[49] gower_1.0.1          recipes_1.1.0        bibtex_0.5.1        
[52] glue_1.7.0           parallelly_1.37.1    RefManageR_1.4.0    
[55] rematch2_2.1.2       codetools_0.2-20     stringi_1.8.4       
[58] gtable_0.3.5         munsell_0.5.1        pillar_1.9.0        
[61] htmltools_0.5.8.1    ipred_0.9-15         lava_1.8.0          
[64] R6_2.5.1             rprojroot_2.0.4      evaluate_0.24.0     
[67] backports_1.5.0      renv_1.0.3           class_7.3-22        
[70] Rcpp_1.0.13          gridExtra_2.3        nlme_3.1-164        
[73] prodlim_2024.06.25   xfun_0.46            ModelMetrics_1.2.2.2
[76] pkgconfig_2.0.3     

D.7.2 Package references

[1] E. Hvitfeldt. paletteer: Comprehensive Collection of Color Palettes. R package version 1.6.0. 2024. https://github.com/EmilHvitfeldt/paletteer.

[2] G. Grolemund and H. Wickham. “Dates and Times Made Easy with lubridate”. In: Journal of Statistical Software 40.3 (2011), pp. 1-25. https://www.jstatsoft.org/v40/i03/.

[3] A. Jimenez Rico. harrypotter: Palettes Generated from All “Harry Potter” Movies. R package version 2.1.1. 2020. https://github.com/aljrico/harrypotter.

[4] M. Kuhn. caret: Classification and Regression Training. R package version 6.0-94. 2023. https://github.com/topepo/caret/.

[5] M. Kuhn. "Building Predictive Models in R Using the caret Package". In: Journal of Statistical Software 28.5 (2008), pp. 1-26. DOI: 10.18637/jss.v028.i05. https://www.jstatsoft.org/index.php/jss/article/view/v028i05.

[6] K. Müller. here: A Simpler Way to Find Your Files. R package version 1.0.1. 2020. https://here.r-lib.org/.

[7] K. Müller and H. Wickham. tibble: Simple Data Frames. R package version 3.2.1. 2023. https://tibble.tidyverse.org/.

[8] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing. Vienna, Austria, 2024. https://www.R-project.org/.

[9] D. Sarkar. Lattice: Multivariate Data Visualization with R. New York: Springer, 2008. ISBN: 978-0-387-75968-5. http://lmdvr.r-forge.r-project.org.

[10] D. Sarkar. lattice: Trellis Graphics for R. R package version 0.22-6. 2024. https://lattice.r-forge.r-project.org/.

[11] V. Spinu, G. Grolemund, and H. Wickham. lubridate: Make Dealing with Dates a Little Easier. R package version 1.9.3. 2023. https://lubridate.tidyverse.org.

[12] H. Wickham. forcats: Tools for Working with Categorical Variables (Factors). R package version 1.0.0. 2023. https://forcats.tidyverse.org/.

[13] H. Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York, 2016. ISBN: 978-3-319-24277-4. https://ggplot2.tidyverse.org.

[14] H. Wickham. stringr: Simple, Consistent Wrappers for Common String Operations. R package version 1.5.1. 2023. https://stringr.tidyverse.org.

[15] H. Wickham. tidyverse: Easily Install and Load the Tidyverse. R package version 2.0.0. 2023. https://tidyverse.tidyverse.org.

[16] H. Wickham, M. Averick, J. Bryan, et al. “Welcome to the tidyverse”. In: Journal of Open Source Software 4.43 (2019), p. 1686. DOI: 10.21105/joss.01686.

[17] H. Wickham and J. Bryan. readxl: Read Excel Files. R package version 1.4.3. 2023. https://readxl.tidyverse.org.

[18] H. Wickham, W. Chang, L. Henry, et al. ggplot2: Create Elegant Data Visualisations Using the Grammar of Graphics. R package version 3.5.1. 2024. https://ggplot2.tidyverse.org.

[19] H. Wickham, R. François, L. Henry, et al. dplyr: A Grammar of Data Manipulation. R package version 1.1.4. 2023. https://dplyr.tidyverse.org.

[20] H. Wickham and L. Henry. purrr: Functional Programming Tools. R package version 1.0.2. 2023. https://purrr.tidyverse.org/.

[21] H. Wickham, J. Hester, and J. Bryan. readr: Read Rectangular Text Data. R package version 2.1.5. 2024. https://readr.tidyverse.org.

[22] H. Wickham, D. Vaughan, and M. Girlich. tidyr: Tidy Messy Data. R package version 1.3.1. 2024. https://tidyr.tidyverse.org.

[23] Y. Xie. Dynamic Documents with R and knitr. 2nd. ISBN 978-1498716963. Boca Raton, Florida: Chapman and Hall/CRC, 2015. https://yihui.org/knitr/.

[24] Y. Xie. “knitr: A Comprehensive Tool for Reproducible Research in R”. In: Implementing Reproducible Computational Research. Ed. by V. Stodden, F. Leisch and R. D. Peng. ISBN 978-1466561595. Chapman and Hall/CRC, 2014.

[25] Y. Xie. knitr: A General-Purpose Package for Dynamic Report Generation in R. R package version 1.48. 2024. https://yihui.org/knitr/.