This lab journal demonstrates how academic fields are assigned to researchers based on their publications.


1 Custom functions

  • fpackage.check: Check if packages are installed (and install if not) in R (source).
fpackage.check <- function(packages) {
    lapply(packages, FUN = function(x) {
        if (!require(x, character.only = TRUE)) {
            install.packages(x, dependencies = TRUE)
            library(x, character.only = TRUE)
        }
    })
}

2 Packages

  • tidyverse: some tidy data crunching later on

  • data.table: to reduce computing time, especially when using the unique function on big data

  • kableExtra: to make HTML tables

packages <- c("tidyverse", "data.table", "kableExtra")

fpackage.check(packages)

3 Input

We use four datasets in this file:

  • journals_wos: Journals in WoS and associated disciplines (example)
    • Name: journals_wos
  • pubs_metadf: Publication metadata for three PhDs from the example dataset
    • Name: pubs_metadf
  • pubs_field: Matched publication and field dataset for the three PhDs in pubs_metadf
    • Name: pubs_field
  • phdgender: Processed PhD data including gender and ethnicity variables
    • Name: phdgender
# example of WoS journal-field list
load("data/journals_discipline/journals_wos.rda")

# Load publication metadata
load("data/pubs_metadf.rda")

# Publications with fields attached
load("data/journals_discipline/pubs_field.rda")

# Load PhD dataset
load("data/processed/phdgender.rda")
phd_df <- phdgender

4 Publications data

In order to assign a field to individual researchers, we look at their publications, and specifically at the journal outlets in which people publish. For this, we use the publications listed on researchers’ personal NARCIS profiles.

For three PhDs in our example data frame, we have compiled a (fictional) publication track record to show what this publication data looks like.

The id variable holds the personal identifiers, which also appear in the PhD dataset so that PhDs can be matched to their own publications. Furthermore, the data provide information on the type of publication and the date on which it was issued (date.issued). Lastly, the journal name (journame) is used alongside ISSN codes (not pictured in this example dataframe) to match publications to academic disciplines.

knitr::kable(pubs_metadf, format = "markdown")

5 Matching journal names to fields

To match publications with academic disciplines based on journal names, we use a list of Web of Science indexed journals with associated fields and subfields, according to the field classification by the U.S. National Research Council.

Each journal is associated with one or more specific academic disciplines (subfields), which fall under a broader disciplinary categorization (fields). In the example here, the entry for the “Journal of Ecclesiastical History” is associated with the subfields “history” and “religion”, both of which fall under “humanities”. Since we only look at the broader discipline, this does not matter for our field variable, and we can simply assign “humanities” as the field.

knitr::kable(journals_wos[journals_wos$ISSN == "0022-0469", ], format = "markdown")

In some cases, however, a journal is associated with multiple broader fields. We then assign the field which occurs most frequently within the journal.

In the case of the journal “ACS Photonics”, this implies that we assign “engineering” as the broader field.

The case of the “International Labour Review” is trickier: the journal is split 50/50 between “social and behavioral sciences” and “humanities”. In these cases, we randomly pick one of the fields.

knitr::kable(journals_wos[(journals_wos$ISSN == "2330-4022" | journals_wos$ISSN == "0020-7780"), ], format = "markdown")
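
The majority rule with a random tie-break can be sketched as follows. This is a minimal illustration on made-up data, not the code used to build journals_wos: the tibble journals_toy, its ISSNs, its subfields, and its column names are all hypothetical.

```r
library(dplyr)

# Hypothetical toy journal list: one row per journal/subfield pair.
journals_toy <- tibble::tribble(
  ~ISSN,       ~subfield,    ~field,
  "1111-1111", "optics",     "engineering",
  "1111-1111", "materials",  "engineering",
  "1111-1111", "physics",    "physical and mathematical sciences",
  "2222-2222", "economics",  "social and behavioral sciences",
  "2222-2222", "philosophy", "humanities"
)

set.seed(1)  # assumption: a fixed seed, so the random tie-break is reproducible

journal_field <- journals_toy %>%
  count(ISSN, field) %>%               # how often each field occurs per journal
  group_by(ISSN) %>%
  slice_max(n, with_ties = TRUE) %>%   # keep the most frequent field(s)
  slice_sample(n = 1) %>%              # break exact ties at random
  ungroup() %>%
  select(ISSN, field)

journal_field
```

The first toy journal always receives “engineering” (two of its three rows), while the second is an even split, so either of its two fields may be drawn.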

Next, we match the journals in the Web of Science list to the publication outlet information in our own publication dataset. We first match on ISSN, which yields a field for 70% of the unique journals in our data. We then match on journal names, which increases the recall of our field indicator to 75% of the unique journals.
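
A two-step match of this kind might look like the sketch below. It assumes a simple fallback: take the ISSN match where one exists, otherwise the journal-name match. The tibbles pubs_toy and fields_toy, and every column name except journame, are hypothetical and stand in for the real data.

```r
library(dplyr)

# Hypothetical toy publications; two of them lack an ISSN.
pubs_toy <- tibble::tibble(
  pub_id   = 1:3,
  issn     = c("1111-1111", NA, NA),
  journame = c("Journal A", "Journal B", "Journal C")
)

# Hypothetical toy journal-field list, one field per journal.
fields_toy <- tibble::tibble(
  issn     = c("1111-1111", "2222-2222"),
  journame = c("Journal A", "Journal B"),
  field    = c("humanities", "engineering")
)

# Step 1: match on ISSN.
by_issn <- pubs_toy %>%
  left_join(select(fields_toy, issn, field_issn = field), by = "issn")

# Step 2: match on journal name, then prefer the ISSN match where available.
pubs_matched <- by_issn %>%
  left_join(select(fields_toy, journame, field_name = field), by = "journame") %>%
  mutate(field = coalesce(field_issn, field_name)) %>%
  select(pub_id, journame, field)

pubs_matched
```

In this toy data, the first publication is matched by ISSN, the second only by name, and the third remains without a field.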

In this example dataframe, we randomly picked journals from the Web of Science list, meaning all publications with a journal can be matched to a field.

knitr::kable(pubs_field, format = "markdown")

6 Creating a time-invariant field variable for each PhD

We determine a researcher’s field based on their publications during and right after their PhD. Thus, we first exclude publications from more than five years before, or more than three years after, they received their doctorate. Next, we tally up the fields across all these publications, and take the field which occurs most frequently.

# first, we join the PhD data onto the publications data (one row per publication)
phdfield <- left_join(pubs_field, phd_df, by = "id")


# then we clean the date.issued variable
phdfield$pubyear <- as.numeric(substr(phdfield$date.issued, 1, 4))

# selecting publications from during and right after the PhD
# (strict inequalities: keeps pubyear from phd_year - 4 through phd_year + 2)
phdfield %>%
    filter((pubyear < (phd_year + 3)) & (pubyear > (phd_year - 5))) -> phdfield

# field = the field under which most of one's publications fall (ties are kept)
phdfield %>%
    group_by(id) %>%
    add_count(nrc_field) %>%
    slice_max(n) %>%
    ungroup() -> phdfield

# keep a single row per id/field combination
phdfield %>%
    group_by(id, nrc_field) %>%
    slice_head() %>%
    ungroup() -> phdfield

As we see from the results below, one of the three researchers was assigned a single field. The other two (f and g) are split evenly between two fields. In this case, we take one of the fields at random.

knitr::kable(phdfield[, c(1, 5, 16)], format = "markdown")

We select a field at random for those who are split exactly evenly between multiple fields.

phdfield %>%
    group_by(id) %>%
    slice_sample(n = 1) %>%
    ungroup() -> phdfield
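
Because the draw is random, rerunning the journal can assign a different field to researchers with ties. One way to make the draw repeatable is to fix the random seed first. The sketch below illustrates this on made-up data; the tibble ties_toy, the helper pick_one, and the seed value are hypothetical and not part of the original workflow.

```r
library(dplyr)

# Hypothetical toy data: two researchers, each split evenly between two fields.
ties_toy <- tibble::tibble(
  id        = c("f", "f", "g", "g"),
  nrc_field = c("Humanities", "Engineering", "Humanities", "Law")
)

# Hypothetical helper: fix the seed, then draw one row per researcher.
pick_one <- function(df, seed) {
  set.seed(seed)
  df %>%
    group_by(id) %>%
    slice_sample(n = 1) %>%
    ungroup()
}

first_run  <- pick_one(ties_toy, seed = 42)
second_run <- pick_one(ties_toy, seed = 42)

# With the same seed, both runs select the same rows.
identical(first_run, second_run)
```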

Finally, we omit the ‘business’ and ‘law’ fields because they are small and scientists in these fields tend to have atypical career paths. Furthermore, we combine ‘social and behavioral sciences’ and ‘education’ under a single category.

phdfield$field2 <- phdfield$field <- phdfield$nrc_field

phdfield$field2 <- fct_collapse(phdfield$field2,
    `Agricultural Sciences` = "Agricultural Sciences",
    `Biological and Health Sciences` = "Biological and Health Sciences",
    other = c("Business", "Law"),
    Engineering = "Engineering",
    Humanities = "Humanities",
    `Social and Behavioral Sciences` = c("Social and Behavioral Sciences", "Education"),
    `Physical and Mathematical Sciences` = "Physical and Mathematical Sciences")


# setting the 'other' level to NA turns Business and Law into missing values
levels(phdfield$field2)[levels(phdfield$field2) == "other"] <- NA
levels(phdfield$field2)
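
The effect of this recoding can be checked on a small example. The sketch below applies the same two steps (collapse, then set the “other” level to NA) to a hypothetical factor toy; the vector and its values are made up for illustration.

```r
library(forcats)

# Hypothetical toy vector of fields.
toy <- factor(c("Law", "Education", "Humanities", "Business",
                "Social and Behavioral Sciences"))

# Collapse Business and Law into 'other', and merge Education into
# Social and Behavioral Sciences; unlisted levels stay as they are.
collapsed <- fct_collapse(toy,
  other = c("Business", "Law"),
  `Social and Behavioral Sciences` = c("Social and Behavioral Sciences", "Education")
)

# Dropping the 'other' level turns its observations into NA.
levels(collapsed)[levels(collapsed) == "other"] <- NA

table(collapsed, useNA = "ifany")
```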


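# create an explicit 'missing' category
# (fct_explicit_na() is deprecated as of forcats 1.0.0; its replacement is
# fct_na_value_to_level(f, level = "missing"))
phdfield <- phdfield %>%
    mutate(field = fct_explicit_na(field, na_level = "missing"), field2 = fct_explicit_na(field2, na_level = "missing"))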


phdfield$field <- factor(phdfield$field, levels = c("Biological and Health Sciences", "Physical and Mathematical Sciences",
    "Social and Behavioral Sciences", "Engineering", "Agricultural Sciences", "Humanities", "Business",
    "Education", "Law", "missing"))

phdfield$field2 <- factor(phdfield$field2, levels = c("Biological and Health Sciences", "Physical and Mathematical Sciences",
    "Social and Behavioral Sciences", "Engineering", "Agricultural Sciences", "Humanities", "missing"))

7 Output

After this, we are left with a dataset that contains one assigned field for each PhD who has publications in the specified time frame.





Copyright © 2023