This lab journal shows the extraction and cleaning of first and last names of doctoral recipients on a mock data set.
Starting with an empty environment.
rm(list = ls())
fpackage.check: Check if packages are installed (and install if not) in R (source).
fpackage.check <- function(packages) {
  lapply(packages, FUN = function(x) {
    if (!require(x, character.only = TRUE)) {
      install.packages(x, dependencies = TRUE)
      library(x, character.only = TRUE)
    }
  })
}
tidyverse: general package for data manipulation
stringr: to execute various string manipulations
stringi: to remove diacritics
kableExtra: to make tables in Rmarkdown
packages = c("tidyverse", "stringr", "stringi", "kableExtra")
fpackage.check(packages)
We use one mock dataset: phd_metadf
load(file = "./data/phd_metadf.rda")
This lab journal is used to demonstrate how we extract first and last names from PhDs.
For this, we use an example dataframe containing 8 observations (4 ethnic majority names, 2 ethnic minority names, 2 other names, all split equally between male- and female-typed first names). This mock dataframe mirrors the set-up of our web-scraped data, which consists of name information scraped from different sources.
Below you can see the example data.
auteur | narcisname | pdfname | pdfnamep1 | diss_birthplace | uni | phd_year | id
---|---|---|---|---|---|---|---
Verschuuren, J-W.M. | J.W.M. Verschuuren | jan-willem marinus verschuuren | verschuuren, jan-willem marinus | geboren te ’s-gravenhage | UvA | 2013 | a
Janssen-de Jong, Corine | Prof. Dr. C.W.P. Janssen-de Jong | | de jong, c. w. p. | | RUG | 1998 | b
Vries, J. de; Lammers, H.G.W. | de Vries, J. | jan de vries | | geboren te oldenzaal in | EUR | 2004 | c
van Vliet, M.A. | Dr. M.A. (Monique) van Vliet | | vliet, monique van | | TUE | 2006 | d
Aydin, S. | Aydin, S | selim aydin | aydin, selim | geboren te giresun, turkije | RU | 2017 | e
Karimi, Nahid Fouad | N.F. Karimi (MSc) | | | | LU | 2012 | f
Bernard, J.O | Bernard, Jacques Olivier | jacques bernard | bernard, jacques | quimper, frankrijk | UU | 2019 | g
L.C. Schneider (Lena) | Schneider, LC | | schneider, lena carolina | | WUR | 2000 | h
First, we extract last names from the ‘auteur’ object, which generally has the format “last name, first name”.
# remove everything after a comma
phd_metadf$lastname <- gsub(",.*$", "", phd_metadf$auteur)
# remove diacritics
phd_metadf$lastname <- stri_trans_general(phd_metadf$lastname, id = "latin-ascii")
# we assume that V/D, v/d and v.d. indicate the popular Dutch nobiliary particle "van de"
phd_metadf$lastname <- str_replace(phd_metadf$lastname, "(v\\.d\\.)|((V|v)/(D|d))", "van de")
# remove initials
phd_metadf$lastname <- str_remove_all(phd_metadf$lastname, initialpattern) # object initialpattern defined in section 1
phd_metadf$lastname <- trimws(phd_metadf$lastname, which = "both") # removing whitespace
# remove titles
phd_metadf$lastname <- trimws(phd_metadf$lastname, which = "both")
phd_metadf$lastname <- str_remove_all(phd_metadf$lastname, "(^(M|m)r\\.?\\s)|(^(D|d)rs?\\.?\\s)|(^St\\.\\s)|(^(M|m)r?s\\.?\\s)|(^ext\\.\\s)|(^(M|m)d\\.?\\s)")
# remove first names between brackets from the last name
phd_metadf$lastname <- str_remove_all(phd_metadf$lastname, "\\s\\(.*\\)$")
# Next, we extract the nobiliary particles and save these in a separate object. These can be used when matching names to other databases later.
phd_metadf$lastname <- trimws(phd_metadf$lastname, which = "both") # first, we trim whitespaces
# extract nobiliary particles at the beginning of the last name string (vande3)
phd_metadf$np <- str_extract(phd_metadf$lastname, paste(vande3, collapse = "|")) # extract into new variable np
phd_metadf$lastname <- str_remove(phd_metadf$lastname, paste(vande3, collapse = "|")) # remove from lastname variable
# extract nobiliary particles at the end of the auteur string (vande4)
# first detect, or else it overwrites the np with NA if not applicable
phd_metadf$end <- as.numeric(str_detect(phd_metadf$auteur, paste(vande4, collapse = "|")))
phd_metadf$np <- ifelse(phd_metadf$end==1, str_extract(phd_metadf$auteur, paste(vande4, collapse = "|")), phd_metadf$np) #only extract if np is detected at the end
phd_metadf <- subset(phd_metadf, select = -end)
# also extract nobiliary particles at the end of the lastname string (vande4)
phd_metadf$end <- as.numeric(str_detect(phd_metadf$lastname, paste(vande4, collapse = "|")))
phd_metadf$np <- ifelse(phd_metadf$end==1, str_extract(phd_metadf$lastname, paste(vande4, collapse = "|")), phd_metadf$np)
phd_metadf$lastname <- str_remove(phd_metadf$lastname, paste(vande4, collapse = "|"))
phd_metadf <- subset(phd_metadf, select = -end) #remove intermediate object
# trim white spaces in the nobiliary particle
phd_metadf$np <- trimws(phd_metadf$np, which = "both")
# collapsing every run of multiple spaces into a single space
phd_metadf$lastname <- gsub("\\s{2,}", " ", phd_metadf$lastname)
phd_metadf$lastname <- trimws(phd_metadf$lastname, which = "both")
# removing "-" from nobiliary particle El- or Al- and "'" for 't
phd_metadf$np <- str_remove(phd_metadf$np, "-")
phd_metadf$np <- str_remove(phd_metadf$np, "\'")
# remove Jr.
phd_metadf$lastname <- str_remove(phd_metadf$lastname, "\\sJr\\.$")
phd_metadf$lastname <- str_remove(phd_metadf$lastname, "\\sJunior$")
# remove interpunction
phd_metadf$lastname <- str_remove_all(phd_metadf$lastname, "(\\?)|(\\`)|(\\^)|(^\\')|(\\´)|(\\·)")
# names and nobiliary particles to lowercase
phd_metadf$lastname <- tolower(phd_metadf$lastname)
phd_metadf$np <- tolower(phd_metadf$np)
# we only want the first last name if there are multiple
phd_metadf$lastname <- str_extract(phd_metadf$lastname, "[:lower:]+")
# combining nobiliary particle and last name into a separate object
phd_metadf %>%
mutate(lastname_full = ifelse(!is.na(np), paste(np, lastname), lastname)) -> phd_metadf
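As an illustration of the particle handling above, here is a minimal sketch in base R on three ‘auteur’ values from the mock data. The two pattern vectors are hypothetical stand-ins for vande3 and vande4, which are defined elsewhere, and extract_first() is a helper introduced only for this example.

```r
auteur <- c("van Vliet, M.A.", "Vries, J. de", "Aydin, S.")

# Hypothetical stand-ins for vande3 (start of string) and vande4 (end of string)
start_pat <- paste(c("^van de\\s", "^van\\s", "^de\\s"), collapse = "|")
end_pat   <- paste(c("\\svan de$", "\\svan$", "\\sde$"), collapse = "|")

# Helper for this sketch: first match of `pattern` per element, NA if none
extract_first <- function(x, pattern) {
  m <- regexpr(pattern, x)
  out <- rep(NA_character_, length(x))
  out[m != -1] <- regmatches(x, m)
  out
}

lastname <- gsub(",.*$", "", auteur)        # drop everything after the comma
np <- extract_first(lastname, start_pat)    # particle in front of the last name
lastname <- sub(start_pat, "", lastname)    # remove it from the last name
np_end <- extract_first(auteur, end_pat)    # particle at the end of 'auteur'
np <- ifelse(!is.na(np_end), np_end, np)    # only overwrite when one was found

np <- tolower(trimws(np))
lastname <- tolower(trimws(lastname))
lastname_full <- ifelse(!is.na(np), paste(np, lastname), lastname)
print(data.frame(auteur, np, lastname, lastname_full))
```

Checking for a match at the end of the string before overwriting np is what prevents a particle found at the start from being replaced by NA.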
In the following part of the script, first names from personal NARCIS profiles will be cleaned. The method here is roughly the same as the method used to clean the first names from the ‘Author’ section of dissertation NARCIS pages.
The cleaned first names are stored in an intermediate object, “int”, to ensure that the original scraped name is preserved in “narcisname”.
E.g. “A.B.C. (Andrea) Smith”
phd_metadf$int <- str_extract(phd_metadf$narcisname, "(?<=\\().*(?=\\))")
phd_metadf$int <- ifelse(phd_metadf$int == "", NA, phd_metadf$int)
E.g. “Smith, Andrea”
phd_metadf$comma <- str_detect(phd_metadf$narcisname, ",") # only when narcisname contains a comma
phd_metadf <- phd_metadf %>%
mutate(int = ifelse((is.na(int) & comma==TRUE), str_remove(narcisname, ".*,"), int))
phd_metadf$int <- trimws(phd_metadf$int, which = c("both"))
phd_metadf <- subset(phd_metadf, select = -comma)
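The two extraction rules can be tried out on three ‘narcisname’ values from the mock data. This is a minimal sketch in base R rather than stringr; base R needs perl = TRUE for the lookbehind and lookahead.

```r
narcisname <- c("Dr. M.A. (Monique) van Vliet",
                "Bernard, Jacques Olivier",
                "J.W.M. Verschuuren")

# Rule 1: take the text between brackets
m <- regexpr("(?<=\\().*(?=\\))", narcisname, perl = TRUE)
int <- rep(NA_character_, length(narcisname))
int[m != -1] <- regmatches(narcisname, m)

# Rule 2: if nothing was found and there is a comma, keep what follows the comma
has_comma <- grepl(",", narcisname)
int <- ifelse(is.na(int) & has_comma, sub(".*,", "", narcisname), int)
int <- trimws(int)
print(int)
```

The third name has neither brackets nor a comma, so it stays NA at this stage. A value such as “N.F. Karimi (MSc)” would yield “MSc” here, which is why titles are removed in the next step.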
We take a similar approach to the extraction of first names from ‘auteur’: we remove non-name strings from the name object (nobiliary particles, initials, titles, punctuation).
# nobiliary particles at different places in the string
phd_metadf$int <- str_remove(phd_metadf$int, paste0(vande0, collapse="|"))
phd_metadf$int <- str_remove(phd_metadf$int, paste(vande1, collapse = "|"))
phd_metadf$int <- str_remove(phd_metadf$int, paste(vande2, collapse = "|"))
phd_metadf$int <- str_remove(phd_metadf$int, paste(vande3, collapse = "|"))
phd_metadf$int <- str_remove(phd_metadf$int, paste(vande4, collapse = "|"))
# removing initials
phd_metadf$int <- str_remove_all(phd_metadf$int, initialpattern) # object initialpattern defined in section 1
# next, we remove a number of different titles that can be found in the name object
phd_metadf$int <- str_remove_all(phd_metadf$int, "(MSc)|(MA)|(MPhil)|(PhD)|(Mr\\.)|(^Mr)|((D|d)rs?\\.)|(Md\\.)|(Mr?s\\.)")
phd_metadf$int <- gsub("/.*", "", phd_metadf$int) # some names in the form "Floor/Floris". Remove everything after "/"
# finally, we do some general cleaning of the names
phd_metadf$int <- trimws(phd_metadf$int, which = "both")
phd_metadf$int <- tolower(phd_metadf$int) # set names to lowercase
# remove diacritics from the name
phd_metadf$int <- stri_trans_general(phd_metadf$int, id = "latin-ascii")
# keep only the first-mentioned name
phd_metadf$int <- str_extract(phd_metadf$int, "[a-z]+\\-*[a-z]*")
phd_metadf <- phd_metadf %>%
mutate(int = ifelse(str_detect(int, "^\\s*$"), NA, int)) # empty or whitespace-only to NA
phd_metadf$int <- trimws(phd_metadf$int, which = "both")
phd_metadf$int <- ifelse(nchar(phd_metadf$int)==1, NA, phd_metadf$int) # remove erroneous one-character names
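To show what the final cleaning steps do, here is a minimal base R sketch on four invented input strings (they are not part of the mock data). Only a subset of the title pattern is used, and the removal of particles and initials is left out because vande0 to vande4 and initialpattern are defined elsewhere.

```r
# Invented inputs: a double first name, a "Floor/Floris" case,
# a leftover initial and a leftover title
int <- c("jan-willem marinus", "Floor/Floris", "j", "MSc")

int <- gsub("(MSc)|(MA)|(MPhil)|(PhD)", "", int)  # subset of the title pattern
int <- gsub("/.*", "", int)                       # drop everything after "/"
int <- tolower(trimws(int))

# Keep only the first-mentioned name (optionally hyphenated)
m <- regexpr("[a-z]+-*[a-z]*", int)
first <- rep(NA_character_, length(int))
first[m != -1] <- regmatches(int, m)

# One-character results are treated as erroneous
first[!is.na(first) & nchar(first) == 1] <- NA
print(first)
```

Because the extraction returns either a non-empty match or NA, strings that are empty after title removal end up as NA without a separate check.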
We give priority to names from the dissertation pages, because these appear to be the formal first names, while personal NARCIS page names correspond to the given first name. Given that we match individuals based on the name which is formally registered, it seems appropriate to take the ‘auteur’ name as the basis and replace with ‘narcisname’ when the name based on ‘auteur’ is unknown.
phd_metadf <- phd_metadf %>%
mutate(firstname = ifelse(is.na(firstname), int, firstname))
# set empty to NA
phd_metadf <- phd_metadf %>%
mutate(firstname = ifelse(firstname=="", NA, firstname))
phd_metadf$narcis_fn <- phd_metadf$int # longer but more informative name
phd_metadf <- subset(phd_metadf, select = -int)
There are two name objects gathered from the pdfs: pdfname (consisting of names that were extracted from running text in the dissertation, often from dissertation front pages) and pdfnamep1 (names on standardized dissertation front pages that are provided by a subset of universities, which contain a specific header for the PhD name).
In comparison to the names from the 2 NARCIS page sources, the name objects from the dissertation PDF can sometimes be a bit messy. This holds especially for the ‘pdfname’ object: these names are extracted from natural text, so non-name information sometimes enters the name object. Furthermore, some of the older PDFs are scanned versions of a printed dissertation, meaning there may be errors when these images are converted to text.
In the provided dataframe, we only included more straightforward name examples. So, the code below is not directly applicable to our example case, but still shows how we cleaned some of the messier names.
# remove diacritics
phd_metadf$pdfname <- stri_trans_general(phd_metadf$pdfname, 'latin-ascii')
# remove punctuation
phd_metadf$pdfname <- str_remove_all(phd_metadf$pdfname, "[~!@#\\$%\\^\\&\\*\\(\\)\\{\\}\\_\\+:<>\\?\\/;\\'\\[\\]\\=\\,\\•\\.]")
# remove digits
phd_metadf$pdfname <- str_remove_all(phd_metadf$pdfname, "[:digit:]")
# remove copyright indications
phd_metadf$pdfname <- str_remove_all(phd_metadf$pdfname, "copyright")
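# note: "(C)" is read as a regex group, so this removes every capital "C" (the brackets themselves were already stripped with the punctuation above)
phd_metadf$pdfname <- str_remove_all(phd_metadf$pdfname, "(C)")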
phd_metadf$pdfname <- gsub("\\sall rights.*", "", phd_metadf$pdfname)
phd_metadf$pdfname <- trimws(phd_metadf$pdfname, which = "both")
# removing titles from the name
phd_metadf$pdfname <- gsub("master.*", "", phd_metadf$pdfname)
phd_metadf$pdfname <- gsub("\\smsc.*", "", phd_metadf$pdfname)
phd_metadf$pdfname <- gsub("\\smd.*", "", phd_metadf$pdfname)
phd_metadf$pdfname <- gsub("\\smagister.*", "", phd_metadf$pdfname)
phd_metadf$pdfname <- gsub("\\smagistri.*", "", phd_metadf$pdfname)
phd_metadf$pdfname <- gsub("\\sbachelor.*", "",phd_metadf$pdfname)
phd_metadf$pdfname <- gsub("\\slaurea.*", "", phd_metadf$pdfname)
phd_metadf$pdfname <- gsub("\\smestre.*", "", phd_metadf$pdfname)
phd_metadf$pdfname <- gsub("\\sdottore.*", "", phd_metadf$pdfname)
phd_metadf$pdfname <- gsub("\\supm.*", "", phd_metadf$pdfname) # specific UM press thing
phd_metadf$pdfname <- gsub("\\sp\\su\\sm.*", "", phd_metadf$pdfname) # specific UM press thing
phd_metadf$pdfname <- gsub("\\sdipl.*", "", phd_metadf$pdfname)
phd_metadf$pdfname <- gsub("\\sdiplom.*", "", phd_metadf$pdfname)
phd_metadf$pdfname <- gsub("\\sdiploma.*", "", phd_metadf$pdfname)
phd_metadf$pdfname <- gsub("\\slicen.*", "", phd_metadf$pdfname)
phd_metadf$pdfname <- gsub("\\shbo.*", "", phd_metadf$pdfname)
phd_metadf$pdfname <- gsub("\\shts.*", "", phd_metadf$pdfname)
phd_metadf$pdfname <- gsub("\\sdoctorandus.*", "", phd_metadf$pdfname)
phd_metadf$pdfname <- gsub("\\scandidat.*", "", phd_metadf$pdfname)
phd_metadf$pdfname <- gsub("\\ssarjana.*", "", phd_metadf$pdfname)
phd_metadf$pdfname <- gsub("\\sspecialist.*", "", phd_metadf$pdfname)
phd_metadf$pdfname <- gsub("\\suniver.*", "", phd_metadf$pdfname)
phd_metadf$pdfname <- gsub("\\sengineer.*", "", phd_metadf$pdfname)
phd_metadf$pdfname <- gsub("\\singeniero.*", "", phd_metadf$pdfname)
phd_metadf$pdfname <- gsub("\\singiner.*", "", phd_metadf$pdfname)
phd_metadf$pdfname <- gsub("\\singeniera.*", "", phd_metadf$pdfname)
phd_metadf$pdfname <- gsub("\\singenieria.*", "", phd_metadf$pdfname)
phd_metadf$pdfname <- gsub("\\singegnere.*", "", phd_metadf$pdfname)
phd_metadf$pdfname <- gsub("\\singegneria.*", "", phd_metadf$pdfname)
phd_metadf$pdfname <- gsub("\\singenieur.*", "", phd_metadf$pdfname)
phd_metadf$pdfname <- gsub("\\s[[:alpha:]\\-]*ingenieur.*", "", phd_metadf$pdfname) # ingenieur and everything that is attached before
# specific types of engineering titles
engineeringtitles <- c("ele(c|k)trotechnisch.*", "civiel.*", "(land)?(werktuig)?(scheeps)?(mijn)?bouwkundig.*", "(wis)?(natuur)?(schei)?(werktuig)?(materiaal)?(bestuurs)?kundig.*", "informatica.*", "geodetisch.*", "lucht- en ruimtevaart technisch.*", "chemisch.*", "maritiem.*", "bioinformatica.*", "raadgevend.*", "architect.*", "geologisch.*", "geofysisch.*", "meet- en regeltechnisch")
phd_metadf$pdfname <- str_remove(phd_metadf$pdfname, paste0(engineeringtitles, collapse = "|"))
phd_metadf$pdfname <- trimws(phd_metadf$pdfname, which = "both")
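The effect of the title patterns can be seen in a minimal base R sketch. The three input strings are invented for illustration, and only three of the patterns above are applied.

```r
# Invented examples of names followed by degree information
pdfname <- c("jan de vries master of science",
             "selim aydin msc",
             "jacques bernard diplom ingenieur")

# Three of the patterns used above; each removes the title and everything after it
title_starts <- c("master", "\\smsc", "\\sdipl")
for (p in title_starts) {
  pdfname <- gsub(paste0(p, ".*"), "", pdfname)
}
pdfname <- trimws(pdfname)
print(pdfname)
```

Since every pattern cuts from the title to the end of the string, a last name that happens to start with one of these letter sequences would be cut off as well. This seems to be an accepted trade-off for removing the noise.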
Some dissertations have multiple authors. We only want the first author, so we remove everything after “;”.
phd_metadf$pdfnamep1 <- gsub(";.*", "", phd_metadf$pdfnamep1)
pdfnamep1
The object pdfnamep1 contains all names in a standardized format. The format is: “last name, first name(s)”. To extract first names, we look at everything that follows the comma.
phd_metadf$com <- str_detect(phd_metadf$pdfnamep1, ",") # detect comma
# Keep everything after a comma if there is one
phd_metadf <- phd_metadf %>%
mutate(firstnamepdf = ifelse((com==TRUE), str_remove(pdfnamep1, ".*,"), pdfnamep1)) %>%
select(-com)
Removing nobiliary particles, initials and punctuation.
# removing all nobiliary particles
phd_metadf$firstnamepdf <- str_remove(phd_metadf$firstnamepdf, paste0(vande0, collapse="|"))
phd_metadf$firstnamepdf <- str_replace_all(phd_metadf$firstnamepdf, paste(vande1, collapse = "|"), "")
phd_metadf$firstnamepdf <- str_remove_all(phd_metadf$firstnamepdf, paste(vande2, collapse = "|"))
phd_metadf$firstnamepdf <- str_replace_all(phd_metadf$firstnamepdf, paste(vande3, collapse = "|"), "")
phd_metadf$firstnamepdf <- str_remove(phd_metadf$firstnamepdf, paste(vande4, collapse = "|"))
phd_metadf$firstnamepdf <- trimws(phd_metadf$firstnamepdf, which = "both") # trimming whitespace
phd_metadf$firstnamepdf <- str_remove_all(phd_metadf$firstnamepdf, initialpattern) # object initialpattern defined in section 1
# extracting only the first-mentioned first name and removing punctuation marks
phd_metadf$firstnamepdf <- str_extract(phd_metadf$firstnamepdf, "[a-z]+\\-*[a-z]*")
phd_metadf <- phd_metadf %>%
mutate(firstnamepdf = ifelse(str_detect(firstnamepdf, "^\\s*$"), NA, firstnamepdf)) # empty or whitespace-only to NA
phd_metadf$firstnamepdf <- ifelse(nchar(phd_metadf$firstnamepdf)==1, NA, phd_metadf$firstnamepdf) # remove erroneous one-character names
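Applied to four pdfnamep1 values from the mock data, the core of this extraction can be sketched in base R as follows. The removal of nobiliary particles and initials is skipped, because vande0 to vande4 and initialpattern are defined elsewhere. For these four values the result does not depend on it.

```r
pdfnamep1 <- c("verschuuren, jan-willem marinus",
               "de jong, c. w. p.",
               "vliet, monique van",
               "aydin, selim")

# Keep everything after the comma if there is one
has_comma <- grepl(",", pdfnamep1)
firstnamepdf <- ifelse(has_comma, sub(".*,", "", pdfnamep1), pdfnamep1)

# First-mentioned first name only
m <- regexpr("[a-z]+-*[a-z]*", firstnamepdf)
first <- rep(NA_character_, length(firstnamepdf))
first[m != -1] <- regmatches(firstnamepdf, m)

# Drop one-character results such as a leftover initial
first[!is.na(first) & nchar(first) == 1] <- NA
print(first)
```

The second value contains only initials, so no first name is obtained from it and another source has to supply the first name.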
pdfname
Because names from the pdfname object are not as neat as names from pdfnamep1, we give priority to the latter.
The format of the pdfname object is “first name(s) last name”, so we extract the first word from this object to obtain the first name. Some names contain a large number of whitespaces (e.g. “m a t t h e w j o h n s o n” or “m atthew j ohnson”). We currently do not extract names of the former kind.
# A few names have a couple of added spaces, often following the first letter of a name. Remove this space after the first letter
phd_metadf$spaces <- as.numeric(str_detect(phd_metadf$pdfname, "^\\s?[:alpha:]\\s[:alpha:]{2,}")) # detect a single letter, followed by a space, followed by more than 1 letter
phd_metadf$spaces <- ifelse(is.na(phd_metadf$spaces), 0, phd_metadf$spaces) # replace NA with 0
phd_metadf$pdfname <- ifelse(phd_metadf$spaces==1, str_remove(phd_metadf$pdfname, "\\s"), phd_metadf$pdfname) # remove the first space if a name follows the pattern described above
phd_metadf$firstnamepdf2 <- str_extract(phd_metadf$pdfname, "[a-z]+\\-*[a-z]*")
# Now, we fill the firstname object with the first word in 'pdfname', but only if firstname based on 'pdfnamep1' is missing
phd_metadf$firstnamepdf <- ifelse(is.na(phd_metadf$firstnamepdf), phd_metadf$firstnamepdf2, phd_metadf$firstnamepdf)
# Again, we remove spaces and one-character names
phd_metadf$firstnamepdf <- str_remove_all(phd_metadf$firstnamepdf, "\\s") # remove spaces
phd_metadf$firstnamepdf <- ifelse(nchar(phd_metadf$firstnamepdf)==1, NA, phd_metadf$firstnamepdf) # remove erroneous one-character names
# Remove unknown characters
phd_metadf$firstnamepdf <- str_extract(phd_metadf$firstnamepdf, "[a-z]+\\-*[a-z]*")
phd_metadf <- subset(phd_metadf, select = -firstnamepdf2) # remove intermediate object
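The treatment of the two whitespace cases mentioned above can be checked with a minimal base R sketch, using the two example strings from the text plus one regular name from the mock data.

```r
pdfname <- c("m atthew j ohnson", "m a t t h e w j o h n s o n", "selim aydin")

# A single letter, a space, then at least two letters: a detached first letter
split_first <- grepl("^\\s?[[:alpha:]]\\s[[:alpha:]]{2,}", pdfname)

# Remove the first space only for the names that follow this pattern
pdfname <- ifelse(split_first, sub("\\s", "", pdfname), pdfname)

# First word as first name; one-character results are discarded
m <- regexpr("[a-z]+-*[a-z]*", pdfname)
first <- rep(NA_character_, length(pdfname))
first[m != -1] <- regmatches(pdfname, m)
first[!is.na(first) & nchar(first) == 1] <- NA
print(first)
```

The fully spaced-out name is not matched by the detection pattern, so its first word is the single letter “m”, which the one-character rule then sets to NA.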
There is a very small number of individuals for whom the last name is unknown. To fill this, we use the name object from the dissertation data. Given the formatting of names in this object as “first name(s) last name(s)”, we just take the last word of the string, and then add potential nobiliary particles that are mentioned in the middle of the string.
phd_metadf$lastnamepdf <- word(phd_metadf$pdfname, -1)
# extracting nobiliary particles
phd_metadf$np2 <- str_extract(phd_metadf$pdfname, paste0(vande0, collapse="|")) # nobiliary particles that cannot be part of the name
phd_metadf$np2 <- ifelse(is.na(phd_metadf$np2), str_extract(phd_metadf$pdfname, paste0(vande1, collapse="|")), phd_metadf$np2) # nobiliary particles surrounded with whitespaces
# Combining nobiliary particle and last name
phd_metadf %>%
mutate(lastname_full_pdf = ifelse(!is.na(np2), paste(np2, lastnamepdf), lastnamepdf)) -> phd_metadf
phd_metadf$lastnamepdf <- ifelse(nchar(phd_metadf$lastnamepdf)==1, NA, phd_metadf$lastnamepdf) # remove erroneous one-character names
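A minimal base R sketch of this step on three pdfname values from the mock data. The particle vector is a hypothetical stand-in for vande1 (particles surrounded by whitespace), which is defined elsewhere, and sub() takes the place of word(pdfname, -1).

```r
pdfname <- c("jan de vries", "selim aydin", "jacques bernard")

# Last word of the string
lastnamepdf <- sub(".*\\s", "", pdfname)

# Hypothetical stand-in for vande1: particles surrounded by whitespace
particles_mid <- paste(c("\\svan de\\s", "\\svan\\s", "\\sde\\s"), collapse = "|")
m <- regexpr(particles_mid, pdfname)
np2 <- rep(NA_character_, length(pdfname))
np2[m != -1] <- trimws(regmatches(pdfname, m))  # trimmed here; an addition in this sketch

lastname_full_pdf <- ifelse(!is.na(np2), paste(np2, lastnamepdf), lastnamepdf)
print(data.frame(pdfname, np2, lastnamepdf, lastname_full_pdf))
```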
Next, we harmonize the names we now have based on NARCIS pages (personal and dissertation page), with names gathered from dissertation PDF text. We give priority to names from NARCIS pages, because they contain less noise than names from the dissertation text.
phd_metadf$firstname <- ifelse(is.na(phd_metadf$firstname), phd_metadf$firstnamepdf, phd_metadf$firstname)
phd_metadf$lastname <- ifelse(is.na(phd_metadf$lastname), phd_metadf$lastnamepdf, phd_metadf$lastname)
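The resulting order of priority (dissertation page, then personal NARCIS page, then PDF) can be summarised in a minimal base R sketch. The three vectors are invented for illustration, and firstname_auteur is a hypothetical name for the first name based on ‘auteur’.

```r
# Invented values: one vector per source, NA where the source has no first name
firstname_auteur <- c("corine", NA, NA, NA)
narcis_fn        <- c(NA, "monique", NA, NA)
firstnamepdf     <- c(NA, "monique", "selim", NA)

# Fill missing values step by step, in order of priority
firstname <- ifelse(is.na(firstname_auteur), narcis_fn, firstname_auteur)
firstname <- ifelse(is.na(firstname), firstnamepdf, firstname)
print(firstname)
```

With dplyr loaded (it is part of the tidyverse), coalesce(firstname_auteur, narcis_fn, firstnamepdf) expresses the same chain in one call.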
After all of this cleaning, we now have a dataframe that contains a cleaned first name and last name (with nobiliary particles if present). This dataframe is used as input for our gender and ethnicity variables.
id | auteur | narcisname | pdfname | pdfnamep1 | firstname | np | lastname | lastname_full | diss_birthplace | uni | phd_year
---|---|---|---|---|---|---|---|---|---|---|---
a | Verschuuren, J-W.M. | J.W.M. Verschuuren | jan-willem marinus verschuuren | verschuuren, jan-willem marinus | jan-willem | NA | verschuuren | verschuuren | geboren te ’s-gravenhage | UvA | 2013
b | Janssen-de Jong, Corine | Prof. Dr. C.W.P. Janssen-de Jong | | de jong, c. w. p. | corine | NA | janssen | janssen | | RUG | 1998
c | Vries, J. de | de Vries, J. | jan de vries | | jan | de | vries | de vries | geboren te oldenzaal in | EUR | 2004
d | van Vliet, M.A. | Dr. M.A. (Monique) van Vliet | | vliet, monique van | monique | van | vliet | van vliet | | TUE | 2006
e | Aydin, S. | Aydin, S | selim aydin | aydin, selim | selim | NA | aydin | aydin | geboren te giresun, turkije | RU | 2017
f | Karimi, Nahid Fouad | N.F. Karimi (MSc) | | | nahid | NA | karimi | karimi | | LU | 2012
g | Bernard, J.O | Bernard, Jacques Olivier | jacques bernard | bernard, jacques | jacques | NA | bernard | bernard | quimper, frankrijk | UU | 2019
h | L.C. Schneider (Lena) | Schneider, LC | | schneider, lena carolina | lena | NA | schneider | schneider | | WUR | 2000
We save the dataset phd_metadf including the cleaned first and last names, with the duplicate observations removed.
phd_metadf %>%
select(id, firstname, np, lastname, lastname_full, diss_birthplace, uni, phd_year) -> phdnames
save(phdnames, file = "data/processed/phdnames.rda")
Copyright © 2023